I believe do-files should be the core of your statistical analyses and of literally everything you do. Do-files allow you to retrace each and every step, to keep track of changes, and to automate your work. Best of all, you can search (Ctrl+F) and replace (Ctrl+H) within a do-file.
1. Do-file basics – Visual structure & commenting
Find the do-files here:
2. The data-do file
To avoid accidental changes to the original data, I suggest using a data do-file that you run before every analysis. It also keeps your data safe, because you never need to save the dataset locally.
This is how you open a dataset:
The IDENTIFIERVARIABLE is, for example, an id referring to each participant in the study. The keepusing() option again lets you choose specific variables; if it is omitted, all variables from the using dataset are merged. Generating a variable such as merge1 gives you a way to check whether your files were merged.
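A minimal sketch of opening and merging data as described above (the file paths, variable names, and the id variable are placeholders, not part of the original text):

```stata
* open the master dataset (path is a placeholder)
use "C:\project\master.dta", clear

* merge in extra variables by participant id;
* keepusing() restricts which variables are brought in,
* generate(merge1) stores the match result for checking
merge 1:1 id using "C:\project\followup.dta", keepusing(var1 var2) generate(merge1)

* check how the merge went
tab merge1
```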
Now we come to the interesting part: what do you want to have in your initial data do-file? I suggest sections on outcomes, exposures, covariates, and exclusion/inclusion criteria.
Have a clear idea of how you want to name your variables. A consistent data-collection-wave identifier such as w01 or fup0 (and so on) will make it much easier to work with variables later on. Further, adding "bin" for binary variables or "cont" for continuous ones makes it easier to know which variable you are working with. You want the names to be as easy to remember and use as possible.
The following commands are illustrated with an example dataset that ships with Stata (bpwide.dta):
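To load the example dataset and apply a naming convention like the one above, you could do something like this (the renamed names are illustrative, not a fixed standard):

```stata
* load the blood-pressure example data shipped with Stata
sysuse bpwide, clear

* add a wave identifier and a type suffix to the names
rename bp_before bp_w01_cont
rename bp_after  bp_w02_cont

describe
```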
Remember the operators for generating new variables and for if-clauses:
+ addition; - subtraction; * multiplication; / division; ^ power
& and; | or (you can find the | sign next to the Z key, "Shift+\")
== equal; > greater than; < less than; >= greater than or equal; <= less than or equal; ! not; != not equal (instead of ! you can also use ~)
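Putting these operators to work in the bpwide example data (the cut-off of 140 is an illustrative assumption, not a clinical recommendation):

```stata
sysuse bpwide, clear

* binary indicator using >= and an if-clause;
* "& bp_before < ." guards against missing values,
* because in Stata missing counts as larger than any number
gen hyper_bin = 0
replace hyper_bin = 1 if bp_before >= 140 & bp_before < .

* combine conditions with | (or) and == (equal)
gen risk_bin = 0
replace risk_bin = 1 if hyper_bin == 1 | agegrp == 3

tab hyper_bin risk_bin
```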
For more complicated variables and summaries I use the egen command (full description in the Stata manual: egen). Some examples follow, plus the xtile command:
|egen newvar=rowtotal(var1 var2 ...)
||sums up the variables even if some of them are missing (! this does not happen when using gen newvar=var1+var2+..., where any missing value makes the result missing)
|egen newvar=rowmiss(var1 var2 ...)
||counts how many of the listed variables are missing
|xtile newvar=var if ..., nquantiles(X)
||command to make n-tiles, e.g. tertiles (X=3)
Be very careful when choosing your egen function, because they differ in how missing values are handled!
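The egen and xtile examples above, run on the bpwide data (variable names such as bp_sum and bp_tert are made up for illustration):

```stata
sysuse bpwide, clear

* row total of the two measurements;
* rowtotal() ignores missing values, unlike gen bp_sum = bp_before + bp_after
egen bp_sum = rowtotal(bp_before bp_after)

* count missing values across the two measurements
egen bp_miss = rowmiss(bp_before bp_after)

* tertiles of baseline blood pressure (nquantiles(3))
xtile bp_tert = bp_before, nquantiles(3)

tab bp_tert
sum bp_sum bp_miss
```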
Finally, add as much commentary as possible to your commands. You can use your data do-file to detect changes in the data and mistakes in coding (which are awful, because they mess up all your work; therefore always tab and sum your variables and check that they look reasonable). You can also share your do-files with other researchers, provided you have constructed them understandably. This is a rare moment where I take some time to make something look nice, because the prettier the file, the easier it is to handle: change indent levels and add clear structure using signs. As soon as you have one, you can use it as a template for future data do-files.
Making a nice data do-file can take some time (a whole day or even several days), BUT it is a great investment. If you add references and clear comments, you can answer many questions about your methods using only your data do-file. If you then note revisions at the start, you can also track when and what you changed, keeping it tidy and up to date.
3. How to run a Do-file
There are three ways to run a do-file: (A) select the code and click "Execute (do)"; (B) use do "filelocation\name", which shows you the output; (C) use run "filelocation\name", which shows no output. I recommend do when your file is fresh and you want to check everything. I would also recommend saving a log-file when running your data do-file with option (A) or (B), so you can check changes at later stages and have a record of what you have done. When you are familiar with your data, option (C) is a fast alternative. You can then add the run command at the beginning of further do-files, for example the analysis do-files you create.
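The three options in Stata syntax (all paths are placeholders; the log-file name and location are assumptions):

```stata
* (B) run a do-file and show all output (good while the file is new)
do "C:\project\data.do"

* keep a log of what the data do-file did
log using "C:\project\logs\data_run.smcl", replace
do "C:\project\data.do"
log close

* (C) once you trust the file, run it silently at the
* top of an analysis do-file
run "C:\project\data.do"
```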