margdistfit: Post-estimation command that compares the observed and theoretical marginal distributions.

Author: Maarten L. Buis

margdistfit is a post-estimation command for checking how well the distributional assumptions of a parametric regression model fit the data. It does so by comparing the marginal distribution implied by the regression model to the observed distribution of the dependent variable. This comparison is done through either a probability-probability plot, a quantile-quantile plot, a hanging rootogram, or a plot of the two cumulative distribution functions.

The key concept in this command is the marginal distribution. The idea behind a parametric regression model is that it assumes a distribution for the dependent variable, and this distribution can be described in terms of a small number of parameters: e.g. the mean and the standard deviation in case of the normal/Gaussian distribution. One or more of these distribution parameters, typically the mean, is allowed to differ from observation to observation depending on the values of the explanatory variables. So, the marginal distribution of the dependent variable implied by the model is a mixture distribution of N distributions, such that each component distribution gets the parameters of one of the observations in the data.
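Written out, the mixture looks as follows. This is a sketch in my own notation, assuming every observation gets equal weight: F(y | theta_i) is the assumed distribution function evaluated at the parameters predicted for observation i, and the second line is the special case of linear regression, where only the mean differs across observations.

```latex
F_{\mathrm{marg}}(y) = \frac{1}{N} \sum_{i=1}^{N} F\left(y \mid \hat{\theta}_i\right)
\qquad\text{and, for linear regression,}\qquad
F_{\mathrm{marg}}(y) = \frac{1}{N} \sum_{i=1}^{N} \Phi\!\left(\frac{y - x_i'\hat{\beta}}{\hat{\sigma}}\right)
```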

To give an indication of how much deviation from the theoretical distribution is still legitimate, the graph will also show the distribution of several (by default 20) variables simulated under the assumption that the regression model is true. By default, the simulations include both the uncertainty about the parameter estimates and the uncertainty that arises because the simulated values are random draws from a distribution. This is achieved by creating the simulated variables in two steps: first the parameters are drawn from their sampling distribution, and then the simulated variable is drawn given those parameters.
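For linear regression, the two steps could be sketched roughly as follows. This is only an illustration of the idea, not the code that margdistfit runs. The simplified model (no factor variables, so that e(V) is positive definite), the names bstar, sig_star, xb_star and ysim, and the way sigma is drawn (a scaled inverse chi-squared draw) are all my assumptions.

```stata
* A minimal sketch of one simulated variable after -regress-
* (illustration only; not margdistfit's internal code)
sysuse nlsw88, clear
gen lnw = ln(wage)
qui reg lnw grade ttl_exp tenure   // simplified model, so e(V) is positive definite
set seed 1234567

* Step 1: draw the parameters from their (approximate) sampling distribution
mata:
b = st_matrix("e(b)")
V = st_matrix("e(V)")
// b* = b + z*L', where L*L' = V and z is a row of standard normal draws
st_matrix("bstar", b + rnormal(1, cols(b), 0, 1) * cholesky(V)')
end
local names : colfullnames e(b)
matrix colnames bstar = `names'
scalar sig_star = sqrt(e(rss) / rchi2(e(df_r)))   // assumed draw for sigma

* Step 2: draw the simulated variable given those parameters
matrix score double xb_star = bstar if e(sample)
gen double ysim = rnormal(xb_star, sig_star)
```

Repeating both steps 20 times would give 20 simulated variables, each based on its own parameter draw, which is the kind of variation the simulated lines in the graphs are meant to show.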

margdistfit may be used after estimating a model with regress or betafit.

This package can be installed by typing in Stata: ssc install margdistfit

Supporting materials

Examples

In the example below I model log hourly wage using linear regression. One can think of this as fitting a normal distribution to the data while allowing the mean to differ from observation to observation depending on the values of the explanatory variables. So the distribution of log wage according to this model is a mixture distribution of N (in this case 2,229) normal distributions, such that each component distribution has a mean equal to the predicted log wage and a standard deviation equal to the root mean squared error (also called the standard error of the estimate).
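To make the mixture concrete, the sketch below evaluates the model-implied marginal distribution function at a single point by averaging the N normal distribution functions, and compares it with the observed proportion. The evaluation point ln(10) and the variable names xb and Fi are arbitrary choices of mine; this is a hand computation for illustration, not a description of how margdistfit works internally.

```stata
* Sketch: model-implied versus observed P(lnw <= ln(10))
sysuse nlsw88, clear
gen lnw = ln(wage)
qui reg lnw i.race south grade c.ttl_exp##c.ttl_exp c.tenure##c.tenure

* each observation contributes a normal distribution with its own mean
predict double xb if e(sample), xb
gen double Fi = normal((ln(10) - xb) / e(rmse)) if e(sample)

* the marginal distribution function is the average over the components
sum Fi, meanonly
di "theoretical: " r(mean)

* observed counterpart in the estimation sample
count if lnw <= ln(10) & e(sample)
di "observed:    " r(N) / e(N)
```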

In the case of linear regression the distributional assumption is not very important, but it can still be useful to spot patterns in the data that deviate from the model. In this case, the dependent variable does not match the theoretical distribution well in the right tail. One might investigate whether or not wage was top-coded, that is, whether all reported wages over a given cut-off value were assigned the cut-off value rather than the actual wage.
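A quick way to start such an investigation is to look for a pile-up of observations at the maximum of wage. The sketch below is one possible first check, not part of margdistfit.

```stata
* Sketch: look for a spike at the maximum of wage, a typical sign of top-coding
sysuse nlsw88, clear
sum wage, detail

* number of observations sitting exactly at the maximum
count if wage == r(max)
```

A large count at the maximum would point to top-coding. A small count would suggest that the deviation in the right tail has another cause.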

. sysuse nlsw88, clear
(NLSW, 1988 extract)

. gen lnw = ln(wage)

. qui reg lnw i.race south grade c.ttl_exp##c.ttl_exp c.tenure##c.tenure

. set seed 1234567

. margdistfit, pp refopts(lcolor(red))

. set seed 1234567

. margdistfit, qq refopts(lcolor(red))

. set seed 1234567

. margdistfit, cumul refopts(lcolor(red))

. set seed 1234567

. margdistfit, hangroot(jitter(5))
(bin=34, start=-.52652043, width=.13520522)

[do-file]

first example graph

second example graph

third example graph

fourth example graph

margdistfit can also be used after betafit, which fits a beta distribution. The beta distribution is a distribution for a bounded variable and is thus often used for modeling proportions. In the example below I model the proportion of its budget that a city spends on its own organization with a beta distribution in which I let the mean and the scale parameter (mu and phi, respectively) depend on covariates. In this case it is the left tail that needs some attention: there are too few cases with very low proportions. Substantively that makes sense: in practice that proportion cannot fall below some minimum larger than 0. For most applications I would not consider this deviation too big of a problem, but if I wanted to use this model to make statements about cities with very low proportions (presumably efficient cities) I would be a bit careful.

. use http://fmwww.bc.edu/repec/bocode/c/citybudget.dta, clear
(Spending on different categories by Dutch cities in 2005)

. qui betafit governing , mu( minorityleft noleft popdens houseval) ///
>     phi(minorityleft noleft popdens houseval)

. set seed 1234567

. margdistfit, pp refopts(lcolor(red))

. set seed 1234567

. margdistfit, qq refopts(lcolor(red))

. set seed 1234567

. margdistfit, cumul refopts(lcolor(red))

. set seed 1234567

. margdistfit, hangroot(jitter(5) start(1e-6) bin(30))
(bin=30, start=1.000e-06, width=.01591239)

. set seed 1234567

. margdistfit, hangroot(susp notheor jitter(2) start(1e-6) bin(30))
(bin=30, start=1.000e-06, width=.01591239)

[do-file]

fifth example graph

sixth example graph

seventh example graph

eighth example graph

ninth example graph