FAQ on example code sent to the statalist by Maarten Buis
When posting messages to statalist I often add example code. I typically identify the example by typing it between:
*----------- begin example -------------
*------------ end example --------------
These examples often contain some lines that are not central to the point in the post, but do make the example work. For instance, these could be some data preparation commands. In these lines I often use some tricks or shortcuts that I do not explain. In this FAQ I discuss and explain the most common of these tricks.
This is
not in any way a replacement of the
statalist FAQ. This is only meant to help people understand the examples I sent to statalist.
Table of content
How can I make the example work?
Within the email select the example and copy it:
Within Stata type
doedit in the command window:
This will open the do-file editor:
Paste the example in the do-file editor:
Do the do-file by pressing the last button on the right (see
here for instructions for older versions of Stata):
And you will get the output:
Why don't you just give the output?
a) The output usually doesn't format well in the message, making them very hard to read. Moreover, statalist doesn't allow attachments, so no graphs could be sent that way.
b) The most useful part of an example is playing with it. This way, at each step of the example you can look at the variables, change some commands, add some commands, etc.
What does sysuse auto, clear mean?
For an example you need example data. The
auto dataset is such an example dataset, and it is shipped with Stata, so everybody with Stata can use it. The command
sysuse is just a convenient command to access example datasets that are shipped with Stata, i.e. you don't have to remember where Stata stored them. When using your dataset you should use the
use command to load it.
Sometimes people sent the first couple of cases of the relevant variables in their dataset to show the structure of their data. Where possible I will use that data in my example, by using
input. This command requires a list of variable names. The data type of these variables is by default
float, i.e. numbers and accurate up to 8 digits. If the variable is a string (letters) the variable name has to be preceded with
str# where
# the length (number of characters) of the longest string in that variable. Sometimes these datasets contain a very long id variable. Depending on the length (number of digits) of this variable, I will want to
input that variable as either a
long, a
double, or a
str#. For more on this see
this entry on the
ATS website
Why do you recode rep78 1/2 = 3?
When I need a categorical or ordinal variable with more than two categories and I use the
auto dataset, than I use the variable
rep78. However, as can be seen below, the first two categories of this variable are almost empty. This can sometimes cause problems, so I often combine the first three categories .
recode rep78 1/2 = 3 means that Stata will recode all values of
rep78 from 1 trough 2 to the value 3. In this case it means that whenever Stata sees a 1 or a 2 it will change it to a 3. As a result all cars within the first three categories will now have the same code, i.e. the categories are combined.
Why do I add the variable baseline together with the nocons option?
Many Stata estimation commands have an option to display the coefficients in exponentiated form. Depending on the model these can be interpreted as odds ratio, risk ratios, incidence rate ratios, etc. To interpret the size of these ratios it is often useful to know the baseline odds, risk, incidence rate, etc, which happens to be the exponentiated constant. A doubling of these for a unit change in a explanatory variable is a lot more impressive if the baseline is already large: twice a small number is still a small number, but twice a large number is a huge number. Unfortunately, Stata will suppress the display of this baseline value. I trick Stata in displaying these values by first creating a variable
baseline, which is always 1, and than add that variable to the model together with the
nocons option, so this variable plays the role of constant without Stata knowing it. Also see
this Stata Tip.
What does gen rep3 = rep78 == 3 if rep78 < . do?
It creates a dummy variable that is 1 if
rep78 is 3, missing if
rep78 is missing, and zero otherwise. For a detailed explanation see
this official Stata FAQ.
Alternative and shorter methods for creating a series of dummy variables from
rep78 are:
xi i.rep78
tab rep78, gen(rep)
However I don't like the variable names these commands produce.
Why do you sometimes add if rep78 < . to a command?
Within the
auto dataset the variable
rep78 is the only variable with missing values. Missing values in Stata are represented by a
. and are the largest possible values. So any value less than
. are valid observations. For a detailed explanation see
this and
this official Stata FAQ.
What does !missing(var1,var2) mean?
The function
missing(var1,var2) returns a 1 (=true) for every observation where
var1 and/or
var2 are missing and a 0 (=false) otherwise. The
! is a negation, so together they return a "true" for every observation that has observed values on both
var1 and
var2, and a "false" for all other observations. Also see
this official Stata FAQ.
What does gen domestic = !foreign do?
It creates a dummy variable that is 1 if the car is domestic and 0 if it is foreign. This is a convenient way to "flip" a dummy variable. For a detailed explanation see
this official Stata FAQ. Notice I did not add
if foreign < .. This is bad style, but it works since I know that the variable
foreign does not contain any missing values.
What does _I* mean?
After
xi Stata creates new variables with names that start with
_I. If I want to refer to all variables created by
xi (and I didn't had any variables with names starting with
_I before I called
xi) than I can do so by typing
_I*. The
* is a wildcard, so
_I* can be read like all variables whose name start with
_I, see
help varlist.
What does the ? mean in w??
All variables whose name begins with a w and with only one other character, see
help varlist.
Where did the _N or _n come from?
Sometimes it is necessary to know the total number of observations or the current observation number. In Stata these are called respectively
_N and
_n. For more see
this entry on the
ATS website. There is however one exception:
_n within the
display command (often abbreviated as
di) means: "display a new line" instead of "current observation number".
What does `=sqrt(5)' or `=_N+3' do?
Within some Stata commands there are options that require one to give a number, while I actually want to give it an expression. By typing the expression as
`=expression', the expression is evaluated and all Stata sees is the number that is the result of the expression, making both me and Stata happy. Notice the
` and the
', they are necessary.
What does vecdiag(cholesky(diag(vecdiag(V))))' mean?
I typically use this to create a column vector containing standard errors. To be exact, this command creates a column vector containing the square root of the diagonal elements of the matrix V. Stata estimation commands typically store the variance covariance matrix and not the standard errors. The diagonal elements of the variance covariance matrix are the standard errors squared. So the square root of the diagonal elements of this matrix are the standard errors. Nowadays I more often use the set of tricks discussed in this
Stata tip.
What does _dots do?
Within a loop one can use the
_dots command to display dots that will tell you how far the loop has progressed (and how long you'll have to wait till it is finished). For more, see
this Stata tip.
Why do some lines end with /* and the next line begin with */?
If Stata sees a hard return it interprets that as the end of the command. Some commands get so long that they do not fit on one line and you want to break it up. Breaking the command by adding hard returns is a bad idea since Stata will see each hard return as the end of a command. A solution is to comment the hard return out. Comments are texts within a do file that are for human readers only and are ignored by Stata. One way to identify a piece of text as a comment is to put it between
/* and
*/. Stata will ignore anything that is between these two symbols, including hard returns. So if you end a line with
/* and begin the next with
*/, Stata will think it is all on one line, so one command. An alternative is to end a line with
///. You can find more on this in the User's guide chapter 16.1.3 (
[U] 16.1.3).
Why do some lines end with ///?
/// is a way to break up a long command over multiple lines. You can find more on this in the User's guide chapter 16.1.3 (
[U] 16.1.3).
What does #delim ; and #delim cr do?
Sometimes I want to wrap a line inside a string. For instance, I am making a local that contains a very long string. Now I cannot use the technique in the section above. Instead I change the delimiter to
;, so Stata no longer considers an enter as a sign of the end of a command, but continues reading until it sees a
;. When I am done with that string I usually change the delimiter back to return by typing
#delim cr.
What do the lines tempname b and scalar `b'=... do?
These two lines are used to store a number in the scalar
`b'. It is similar to typing
local b = ..., except that a local is accurate to a minimum of 11 decimal digits, while a scalar is accurate to 15 or 16 decimal digits. See the User's guide chapter 18.5 (
[U] 18.5), and
this post by Bill Gould. I use scalars when I think numerical precision might matter.
What does tempfile do?
Sometimes it is necessary or convenient to store some data. However, I don't want to keep it, so I often use
tempfile. If I type
tempfile temp than that reserves the name
`temp' (note the ` and the '), which I can use when I store data. That dataset will remain available for as long the do-file runs, and is immediately removed once the do-file has finished.
What does capture program drop do?
Sometimes it is necessary or convenient to create a program within a do-file. When writing the example I often (always) don't get it right the first time round, so I do a do-file many times. The second time I do a do-file that creates a program Stata will complain, since I will attempt to create a program that already exists. So I first need to drop that program, before I can create it again. Lets say I named the program
prog, than I should type
program drop prog, before I create the program. However, if I sent the example like this to the statalist, than it will complain the first time a statalist member tries to run the example, since that example will try to drop the program
prog which does not yet exists. That is what the command
capture is for: this will ensure that Stata will continue running even if
program drop prog creates an error.
I could remove the line
program drop prog, and than it would run fine the first time. However, someone would get an error message if she started to play with the example (which is the best way to understand it) and do it multiple times.