FAQ on example code sent to the statalist by Maarten Buis

When posting messages to statalist I often add example code. I typically identify the example by typing it between:

*----------- begin example -------------
*------------ end example --------------
These examples often contain some lines that are not central to the point in the post, but do make the example work. For instance, these could be some data preparation commands. In these lines I often use some tricks or shortcuts that I do not explain. In this FAQ I discuss and explain the most common of these tricks.

This is not in any way a replacement for the statalist FAQ. It is only meant to help people understand the examples I send to statalist.

Table of content

How can I make the example work?

Within the email select the example and copy it.

Within Stata type doedit in the command window. This will open the do-file editor.

Paste the example in the do-file editor.

Do the do-file by pressing the last button on the right (see here for instructions for older versions of Stata).

And you will get the output.

Why don't you just give the output?

a) The output usually doesn't format well in the message, making it very hard to read. Moreover, statalist doesn't allow attachments, so no graphs could be sent that way.
b) The most useful part of an example is playing with it. This way, at each step of the example you can look at the variables, change some commands, add some commands, etc.

What does sysuse auto, clear mean?

For an example you need example data. The auto dataset is such an example dataset, and it is shipped with Stata, so everybody with Stata can use it. The command sysuse is just a convenient command to access example datasets that are shipped with Stata, i.e. you don't have to remember where Stata stored them. When using your own dataset you should use the use command to load it.
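
As an illustration, a minimal sketch of loading the example data. The file name in the commented-out use line is hypothetical and only stands in for your own dataset.

```stata
*----------- begin example -------------
sysuse auto, clear      // example data shipped with Stata
describe, short         // shows the number of observations and variables

* for your own data you would use -use- instead
* (the file name below is hypothetical)
* use "mydata.dta", clear
*------------ end example --------------
```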

Why do you put things like long, double, or str8 between variable names when you use input?

Sometimes people send the first couple of cases of the relevant variables in their dataset to show the structure of their data. Where possible I will use that data in my example, by using input. This command requires a list of variable names. The data type of these variables is by default float, i.e. numbers that are accurate to about 7 digits. If the variable is a string (letters) the variable name has to be preceded by str#, where # is the length (number of characters) of the longest string in that variable. Sometimes these datasets contain a very long id variable. Depending on the length (number of digits) of this variable, I will want to input that variable as either a long, a double, or a str#. For more on this see this entry on the ATS website.
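
A small sketch of what this can look like. The variable names and values are made up for illustration; the last line shows how a 9-digit id would be rounded if it were stored as a float.

```stata
*----------- begin example -------------
clear
input long id str8 name double income
123456789 "Anna"    52340.75
123456790 "Bob"     48100.10
123456791 "Charlie" 61999.99
end
list

* stored as a float the id would be rounded, hence the long
di %12.0f float(123456789)
*------------ end example --------------
```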

Why do you recode rep78 1/2 = 3?

When I need a categorical or ordinal variable with more than two categories and I use the auto dataset, then I use the variable rep78. However, as a tabulation of rep78 shows, the first two categories of this variable are almost empty. This can sometimes cause problems, so I often combine the first three categories.

recode rep78 1/2 = 3 means that Stata will recode all values of rep78 from 1 through 2 to the value 3. In this case it means that whenever Stata sees a 1 or a 2 it will change it to a 3. As a result all cars within the first three categories will now have the same code, i.e. the categories are combined.
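
The effect is easiest to see by tabulating rep78 before and after the recode, as in this short sketch:

```stata
*----------- begin example -------------
sysuse auto, clear
tab rep78               // categories 1 and 2 contain few cars
recode rep78 1/2 = 3
tab rep78               // only the categories 3, 4, and 5 remain
*------------ end example --------------
```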

Why do I add the variable baseline together with the nocons option?

This is something I did in older posts, before I had access to Stata 12. In Stata 12 this trick is no longer necessary, as is discussed here.

In older versions of Stata the issue was that Stata used to suppress the display of the constant when requesting the coefficients in exponentiated form. Depending on the model these exponentiated coefficients can be interpreted as odds ratios, risk ratios, incidence rate ratios, etc. To interpret the size of these ratios it is often useful to know the baseline odds, risk, incidence rate, etc., which happens to be the exponentiated constant. A doubling of these for a unit change in an explanatory variable is a lot more impressive if the baseline is already large: twice a small number is still a small number, but twice a large number is a huge number. Unfortunately, Stata used to suppress the display of this baseline value. I tricked Stata into displaying these values by first creating a variable baseline, which is always 1, and then adding that variable to the model together with the nocons option, so this variable plays the role of the constant without Stata knowing it. Also see this Stata Tip.
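
A minimal sketch of the trick, using a logit model on the auto data as an assumed example (the choice of model and variables is only for illustration):

```stata
*----------- begin example -------------
sysuse auto, clear
gen byte baseline = 1
* baseline takes the place of the constant, so the "odds ratio"
* reported for baseline is the baseline odds
logit foreign mpg baseline, or nocons
*------------ end example --------------
```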

What does gen rep3 = rep78 == 3 if rep78 < . do?

It creates an indicator (dummy) variable that is 1 if rep78 is 3, missing if rep78 is missing, and zero otherwise. For a detailed explanation see this official Stata FAQ.

Alternative and shorter methods for creating a series of indicator variables from rep78 are:
xi i.rep78
tab rep78, gen(rep)
However, I don't like the variable names these commands produce.

Since Stata 11 one can add categorical variables directly without xi using factor variables.
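
The following sketch shows the indicator variable next to the factor variable alternative. The regression is only an assumed example of a model in which such a variable might be used.

```stata
*----------- begin example -------------
sysuse auto, clear
gen rep3 = rep78 == 3 if rep78 < .
tab rep3 rep78, missing

* the factor variable alternative needs no new variables
regress price i.rep78
*------------ end example --------------
```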

What does gen rep3Xprice = (rep78 == 3)*price if rep78 < . do?

It creates an interaction between the variable price and an indicator (dummy) variable that is 1 if rep78 is 3, missing if rep78 is missing, and zero otherwise, without first making the indicator variable. For a detailed explanation see this official Stata FAQ. In many cases I would now use Stata's factor variables, but there are still situations where this is a convenient trick.
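
A short sketch that creates the interaction and lists the result, so you can check it against rep78 and price:

```stata
*----------- begin example -------------
sysuse auto, clear
gen rep3Xprice = (rep78 == 3)*price if rep78 < .
list rep78 price rep3Xprice in 1/10
*------------ end example --------------
```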

Why do you sometimes add if rep78 < . to a command?

Within the auto dataset the variable rep78 is the only variable with missing values. Missing values in Stata are represented by a . and are the largest possible values. So any value less than . is a valid observation. For a detailed explanation see this and this official Stata FAQ.
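
Because missing values are the largest possible values, a condition like rep78 > 3 is also true for them. This sketch compares the two counts:

```stata
*----------- begin example -------------
sysuse auto, clear
count if rep78 > 3              // also counts the missing values
count if rep78 > 3 & rep78 < .  // counts only observed values
*------------ end example --------------
```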

What does !missing(var1,var2) mean?

The function missing(var1,var2) returns a 1 (=true) for every observation where var1 and/or var2 are missing and a 0 (=false) otherwise. The ! is a negation, so together they return a "true" for every observation that has observed values on both var1 and var2, and a "false" for all other observations. Also see this official Stata FAQ.
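
As a sketch, with rep78 and price from the auto data standing in for var1 and var2:

```stata
*----------- begin example -------------
sysuse auto, clear
count if !missing(rep78, price)
* the same selection written out by hand
count if rep78 < . & price < .
*------------ end example --------------
```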

What does gen domestic = !foreign do?

It creates a dummy variable that is 1 if the car is domestic and 0 if it is foreign. This is a convenient way to "flip" a dummy variable. For a detailed explanation see this official Stata FAQ. Notice I did not add if foreign < .. This is bad style, but it works since I know that the variable foreign does not contain any missing values.
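
A sketch of both the short version and the more careful version. The name domestic2 is only used here to keep the two apart.

```stata
*----------- begin example -------------
sysuse auto, clear
gen byte domestic = !foreign
tab domestic foreign

* the more careful version keeps missing values missing
gen byte domestic2 = !foreign if foreign < .
*------------ end example --------------
```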

What does gen marst = !never_married + married if !missing(never_married, married) do?

In one of the built-in datasets, nlsw88, marital status is included as two indicator variables, aptly called never_married and married. In order to make use of the factor variable notation I need to combine these into a single categorical variable. I like the order 0=never married, 1=widowed/divorced, 2=married, and this is what the above command gives me.

If someone is never married, she will have a 1 on never_married and a 0 otherwise. The ! reverses that, so !never_married will be 0 when never married and 1 otherwise. married will be 1 when the person is married and 0 otherwise. So when someone is never married she will receive a 0 + 0 = 0 on marst. When someone is divorced/widowed she will receive a 1 + 0 = 1 on marst. When someone is married she will receive a 1 + 1 = 2 on marst.

Finally, the ! does not just reverse 0 into 1 and vice versa, but it turns all non-zeros, including missing values, into 0. So to keep the missing values missing I need to add if !missing(never_married, married).
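
Put together, this can be sketched as follows. The value labels are an addition for readability and are not part of the command discussed above.

```stata
*----------- begin example -------------
sysuse nlsw88, clear
gen byte marst = !never_married + married ///
    if !missing(never_married, married)
label define marst 0 "never married" 1 "widowed/divorced" 2 "married"
label values marst marst
tab marst
*------------ end example --------------
```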

What does _I* mean?

After xi Stata creates new variables with names that start with _I. If I want to refer to all variables created by xi (and I didn't have any variables with names starting with _I before I called xi) then I can do so by typing _I*. The * is a wildcard, so _I* can be read as all variables whose name starts with _I, see help varlist.
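
A short sketch, using rep78 from the auto data as the categorical variable:

```stata
*----------- begin example -------------
sysuse auto, clear
xi i.rep78
describe _I*            // all variables created by xi
sum _I*
*------------ end example --------------
```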

What does the ? mean in w??

All variables whose name begins with a w and with only one other character, see help varlist.
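
To see the difference between ? and *, here is a sketch with three made-up variables w1, w2, and w10:

```stata
*----------- begin example -------------
clear
set obs 3
gen w1  = 1
gen w2  = 2
gen w10 = 3
describe w?     // w1 and w2, but not w10
describe w*     // all three
*------------ end example --------------
```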

What does # or ## mean in a list of variables?

It means that I am using Stata's factor variable notation to create interactions. Typically I would use that in combination with i.varname to indicate that varname is to be treated as a categorical variable and/or c.varname to indicate that varname is to be treated as a continuous variable. As a special case I can also use the factor variable notation to include a quadratic curve by typing c.varname##c.varname. This will include varname and varname squared in the model.
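
The three uses can be sketched with the auto data as follows. The models themselves are only assumed examples.

```stata
*----------- begin example -------------
sysuse auto, clear
* ## adds the main effects and the interaction
regress price i.foreign##c.mpg
* # adds only the interaction, so main effects are added by hand
regress price i.foreign c.mpg i.foreign#c.mpg
* a quadratic curve for weight
regress price c.weight##c.weight
*------------ end example --------------
```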

What does floor(x) or ceil(x) mean?

ceil(x) always rounds the number x up to the next integer, while floor(x) always rounds the number x down to the preceding integer (an x that already is an integer is left unchanged). So ceil(6.1)=7 and ceil(6.9)=7, while floor(6.1)=6 and floor(6.9)=6. This allows for some neat little tricks, as discussed in this Stata tip and this Stata tip.
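
One such trick, sketched here under the assumption that you want 10-year age bands, is to divide, round down, and multiply again:

```stata
*----------- begin example -------------
di ceil(6.1)
di floor(6.9)

* group age into 10-year bands: 30, 40, ...
sysuse nlsw88, clear
gen age10 = 10*floor(age/10)
tab age10
*------------ end example --------------
```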

What does mod(x, y) mean?

mod(x, y) returns the remainder when dividing x by y. So, mod(1,2)=1, mod(2,2)=0, and mod(3,2)=1. This allows for some neat tricks as discussed in this Stata tip.
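
For example, mod() can be used to mark every second observation, as in this sketch:

```stata
*----------- begin example -------------
sysuse auto, clear
gen byte even = mod(_n, 2) == 0     // 1 for every second observation
list make even in 1/6
*------------ end example --------------
```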

Where did the _N or _n come from?

Sometimes it is necessary to know the total number of observations or the current observation number. In Stata these are called respectively _N and _n. For more see this entry on the ATS website. There is however one exception: _n within the display command (often abbreviated as di) means: "display a new line" instead of "current observation number".
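
A short sketch of both meanings. The variable names id, total, and within_id are made up for illustration.

```stata
*----------- begin example -------------
sysuse auto, clear
gen id    = _n      // the observation number
gen total = _N      // the total number of observations

* with by, both are counted within groups
bys foreign : gen within_id = _n

* within display _n means a new line
di "first line" _n "second line"
*------------ end example --------------
```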

What does bys var : gen byte mark = _n == 1 followed by twoway ... if mark ... do?

This typically happens when I want to plot predictions and I know that that prediction is the same for all observations that share the same value of var. If I did not do this I would plot a marker for each observation, even though many would be overlaid on top of one another and you would not see them. This tends to make Stata graphs large (in terms of memory) and slow. With this trick I prevent this by plotting only one marker per unique value of var. Also see this Stata tip.
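
As a sketch, assume a regression of price on rep78, so that the prediction is the same for all cars with the same repair record:

```stata
*----------- begin example -------------
sysuse auto, clear
regress price i.rep78
predict yhat

* mark the first observation within each value of rep78
bys rep78 : gen byte mark = _n == 1

* one marker per category instead of one per car
twoway scatter yhat rep78 if mark
*------------ end example --------------
```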

What does `=sqrt(5)' or `=_N+3' do?

Within some Stata commands there are options that require one to give a number, while I actually want to give it an expression. By typing the expression as `=expression', the expression is evaluated and all Stata sees is the number that is the result of the expression, making both me and Stata happy. Notice the ` and the ', they are necessary.
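
For example, set obs expects a number, so the expression _N+3 has to be evaluated first. A minimal sketch:

```stata
*----------- begin example -------------
sysuse auto, clear
di "the square root of 5 is `=sqrt(5)'"

* add 3 empty observations at the end of the dataset
set obs `=_N+3'
*------------ end example --------------
```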

What does `: something' do?

A very powerful way of manipulating the content of macros is using so-called extended macro functions. For example, if I want to remove duplicate elements in a macro foo I can type:
local foo "a b b a"
local foo_uniq : list uniq foo
di `"`foo_uniq'"'
If all I want to do is display the results, I can use a short-cut:
local foo "a b b a"
di `"`: list uniq foo'"'

What does 1e4 or 2.1e-4 mean?

With very large or very small numbers one can end up typing and (mis)counting lots of zeros. I sometimes avoid this by typing those numbers in exponential format. So 1e4 is 1×10^4 = 10,000 and 2.1e-4 is 2.1×10^-4 = 0.00021.
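
A short sketch. Creating an empty dataset with 10,000 observations is only an assumed use case.

```stata
*----------- begin example -------------
di 1e4
di 2.1e-4

* 10,000 observations without counting zeros
clear
set obs `=1e4'
*------------ end example --------------
```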

What does if !_rc mean?

You can let Stata ignore errors in a command and continue by prefixing it with the capture command. The statement if !_rc immediately following capture means "do whatever follows if there was no problem".
capture will leave behind the scalar _rc, which contains the return code. The return code is the code assigned to error messages, for example r(198) or r(505). The code 0 is reserved for "no problems". In Stata logic the value 0 is "false" and the ! negates a logical statement. So if everything went ok !_rc will evaluate to !0, which is "not false" or "true". All non-zero values are treated as "true", so if there was a problem then !_rc will evaluate to "not true" or "false". Also see this official Stata FAQ.
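
A minimal sketch, using confirm variable as the command that may or may not fail. The variable name nosuchvar is made up and assumed not to exist.

```stata
*----------- begin example -------------
sysuse auto, clear
capture confirm variable rep78
if !_rc {
    di "rep78 exists, so I can use it"
}

capture confirm variable nosuchvar
if _rc {
    di "nosuchvar does not exist, the return code is " _rc
}
*------------ end example --------------
```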

What does vecdiag(cholesky(diag(vecdiag(V))))' mean?

I typically use this to create a column vector containing standard errors. To be exact, this command creates a column vector containing the square roots of the diagonal elements of the matrix V. Stata estimation commands typically store the variance-covariance matrix and not the standard errors. The diagonal elements of the variance-covariance matrix are the standard errors squared. So the square roots of the diagonal elements of this matrix are the standard errors. Nowadays I more often use the set of tricks discussed in this Stata tip.
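
As a sketch, after an assumed regression on the auto data. Reading from the inside out: vecdiag(V) takes the diagonal as a row vector, diag() turns it into a diagonal matrix, cholesky() takes the square roots of that diagonal, the outer vecdiag() extracts them again, and the ' transposes the row into a column. The matrix name SE is made up.

```stata
*----------- begin example -------------
sysuse auto, clear
regress price mpg weight
matrix V = e(V)
matrix SE = vecdiag(cholesky(diag(vecdiag(V))))'
matrix list SE          // compare with the Std. Err. column above
*------------ end example --------------
```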

What does _dots do?

Within a loop one can use the _dots command to display dots that will tell you how far the loop has progressed (and how long you'll have to wait till it is finished). For more, see this Stata tip.
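
A sketch of how this is commonly used. Note that _dots is not officially documented, so the syntax below is an assumption based on common usage: the first call prints a header, and each later call prints one dot, with the second argument 0 signalling that the iteration went well. The sleep command only stands in for the real work.

```stata
*----------- begin example -------------
_dots 0, title(Looping) reps(50)
forvalues i = 1/50 {
    sleep 20            // stands in for the real work
    _dots `i' 0
}
*------------ end example --------------
```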

Why do some lines end with /* and the next line begin with */?

If Stata sees a hard return it interprets that as the end of the command. Some commands get so long that they do not fit on one line and you want to break it up. Breaking the command by adding hard returns is a bad idea since Stata will see each hard return as the end of a command. A solution is to comment the hard return out. Comments are texts within a do file that are for human readers only and are ignored by Stata. One way to identify a piece of text as a comment is to put it between /* and */. Stata will ignore anything that is between these two symbols, including hard returns. So if you end a line with /* and begin the next with */, Stata will think it is all on one line, so one command. An alternative is to end a line with ///. You can find more on this in the User's guide chapter 16.1.3 ([U] 16.1.3).

Why do some lines end with ///?

/// is a way to break up a long command over multiple lines. You can find more on this in the User's guide chapter 16.1.3 ([U] 16.1.3).
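
Both ways of breaking up a command are sketched below with the same assumed regression. Note that these only work inside a do-file, not when typed in the command window.

```stata
*----------- begin example -------------
sysuse auto, clear

* comment out the hard return with /* and */
regress price mpg /*
    */ weight foreign

* the same command using ///
regress price mpg ///
    weight foreign
*------------ end example --------------
```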

What does #delim ; and #delim cr do?

Sometimes I want to wrap a line inside a string. For instance, I am making a local that contains a very long string. Now I cannot use the technique in the section above. Instead I change the delimiter to ;, so Stata no longer considers an enter as a sign of the end of a command, but continues reading until it sees a ;. When I am done with that string I usually change the delimiter back to return by typing #delim cr.
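
A minimal sketch of switching the delimiter and switching it back. For simplicity it wraps an assumed regression command rather than a long string, but the mechanics are the same. Like ///, this only works inside a do-file.

```stata
*----------- begin example -------------
sysuse auto, clear

#delim ;
regress price mpg
    weight foreign ;
#delim cr

* from here on a hard return ends a command again
di "done"
*------------ end example --------------
```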

What do the lines tempname b and scalar `b'=... do?

These two lines are used to store a number in the scalar `b'. It is similar to typing local b = ..., except that a local is accurate to a minimum of 11 decimal digits, while a scalar is accurate to 15 or 16 decimal digits. See the User's guide chapter 18.5 ([U] 18.5), and this post by Bill Gould. I use scalars when I think numerical precision might matter.
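
As a sketch, with 1/3 as an assumed example of a number that cannot be written down exactly:

```stata
*----------- begin example -------------
tempname b
scalar `b' = 1/3
di %20.18f `b'

* the local alternative stores the number as text
local c = 1/3
di "`c'"
*------------ end example --------------
```

tempname reserves a name that is guaranteed not to clash with existing scalars or variables, which is why the scalar is referred to as `b' rather than b.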

What does tempfile do?

Sometimes it is necessary or convenient to store some data. However, I don't want to keep it, so I often use tempfile. If I type tempfile temp then that reserves the name `temp' (note the ` and the '), which I can use when I store data. That dataset will remain available for as long as the do-file runs, and is immediately removed once the do-file has finished.
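
A typical use, sketched here under the assumption that you want to add group means to the original data: store the means in a temporary file and merge them back in. The names means and mprice are made up.

```stata
*----------- begin example -------------
sysuse auto, clear
tempfile means

* store the mean price by foreign in a temporary file
preserve
collapse (mean) mprice = price, by(foreign)
save `means'
restore

* add the means to the original data
merge m:1 foreign using `means'
*------------ end example --------------
```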

What does capture program drop do?

Sometimes it is necessary or convenient to create a program within a do-file. When writing the example I often (always) don't get it right the first time round, so I do a do-file many times. The second time I do a do-file that creates a program Stata will complain, since I will attempt to create a program that already exists. So I first need to drop that program, before I can create it again. Let's say I named the program prog, then I should type program drop prog, before I create the program. However, if I send the example like this to the statalist, then it will complain the first time a statalist member tries to run the example, since that example will try to drop the program prog which does not yet exist. That is what the command capture is for: this will ensure that Stata will continue running even if program drop prog creates an error.

I could remove the line program drop prog, and then it would run fine the first time. However, someone would get an error message if she started to play with the example (which is the best way to understand it) and do it multiple times.
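
A minimal sketch of this pattern. The program prog does nothing useful here; it only displays a line of text.

```stata
*----------- begin example -------------
capture program drop prog
program define prog
    di "Hello from prog"
end

prog
*------------ end example --------------
```

This runs without error the first time and every time after that.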

What does program drop _all do?

This is an alternative solution to the problem discussed above. It will drop all programs in working memory (i.e. it does not touch the hard-disk). By not explicitly naming them I avoid the problem that there are no programs to drop the first time someone runs my example.
