Chun's Stata Notes
In this page, [EMS] refers to Essential Medical Statistics, a very good introductory book on biostatistics.
Getting help
Help pages: Type help
followed by the name of a command or a function will display the corresponding help page.
 Connection between help pages and GUI: On the top right of most help pages, there are links labelled as dialogs. Click one of them to get to the menu counterpart(s) of the command.
 Pointers to other relevant commands: At the end of a help page, there often is a section titled "Also see", which includes pointers to relevant commands. The links in this section often point me to the commands that suit my need.
 Help on topic groups: For example,
help estimation commands
shows all estimation/model fitting commands, and help functions
shows all functions.
 You can type
help
followed by nothing or go to Help > Contents to start a help window and continue from there.
Useful web pages:
Useful web pages accessible from Stata

net from http://www.ats.ucla.edu/stat/stata/ado
: UCLA's Stata programs. Type net cd teach
to enter the directory for teaching demonstrations.

net from http://www.stata.com/
: Stata's web page.
 You can use
net
to check current site, net cd dirname
to change directory to "dirname" (as if in Unix), net describe pkgname
to describe a package (named "pkgname"), net install pkgname
to install a package, and net link linkname
to go to another site.
Miscellaneous
Updates: Type update
for information of your current installation. Type update query
to compare the current installation with the newest available from www.stata.com. You also can set the program to check for updates automatically.
Stata news: Just type news
.
Set memory: Type memory
for current memory setting. When Stata starts, it allocates a fixed amount of memory to use. You can use set memory 50m, perm
to set the amount to 50 Megabytes in the future.
Calculator: Stata can be used as a simple commandline calculator by using the display
(or di
) command. For example, di 15/4
gives 3.75, di log(3)
gives the natural logarithm of 3, di log10(5)
gives the base10 logarithm of 5, di chi2tail(2, 3.87)
gives the pvalue of a test statistic 3.87 compared to the chisquared distribution with 2 degrees of freedon.
Functions for tail probabilities: Stata doesn't have a clear, consistent format.

normal(1.96)
gives the standard normal lefttail probability of Pr(z ≤ 1.96)

ttail(10, 1.8)
gives the righttail probability Pr(t_{10} ≥ 1.8)

Ftail(2, 10, 3.5)
gives the righttail probability of Pr(F_{2,10} ≥ 3.5)

chi2tail(3, 2.5)
gives the righttail probability of Pr(χ^{2}_{3} ≥ 2.5)

Binomial(50, 29, 0.3)
gives the righttail probability of Pr(X ≥ 29  n = 50, p = 0.3).
 For details of more such functions, type
help density functions
.
Exploring data
I find it useful to learn the basic information about a dataset through commands describe
, label list
, notes
, and st
. For more details, try codebook
, labelbook
.
Variable types: (Type help data types
for details.) Almost all statistical software packages need to store variables in different ways depending on the nature of the variables. When a software package reads in a data set in a nondefault format (e.g. ASCII/text files), it has to decide on the types of variables based on the patterns it sees. This may cause a problem if you are not careful.
 Example 1: If you have a value 16o (with zero mistyped as letter 'o') in variable "height", or if "." or "?" is used for missing values, the software may think these are possible values of the variable and thus the variable must not be numeric. As a result, the variable may be stored as categorical or character strings. Later on, when you want to calculate the mean (or any other analysis), the software may give you an error message because it cannot do this on categorical variables or strings.
 Example 2: If you have an "ID" variable and code it using numbers, or if you have a categorical variable with five categories and code it as 15, the software may think the variable is numeric and store it as numbers. Later on, when you do analysis on this variable, the software will treat them as numbers instead of factors and thus give wrong results, without any warning at all.
 When you see strange error messages or strange results, check if the variables are in correct types. Stata command
describe
can give you some basic information on variable types. Some variables may be disguised by labels; use label list
to see all the labels used in the data. You can change variable types using command format
if you want to.
Summary: The commands summarize
and inspect
can give you basic information about variables. To get more detailed information, add the , detail
option after the command. For example, su age language, d
will give detailed summary information on variables age
and language
.
Count: To count the number of records meeting some criteria, use the count
command. For example, count mi(language)
gives the number of records with missing values for variable language, count if age<9
gives the number of records with age<9.
Tables: tabulate
is a more powerful command than table
. To get some additional statistics, you can try tabulate var1 var2, cchi2 chi2 column expected row
.
Graphics
Type help graph intro
for more information on Stata graphics. Graph Examples provides examples to show what types graphs Stata can create.
Bar chart for a categorical variable: You have to create multiple variables, each for a category. For example, use tabulate sex, gen(aa)
will create two variables aa1
and aa2
as indicator variables of the two sexes. To make it cleaner, I can use tabulate sex, gen(sex)
to create sex1
and sex2
. Then I use graph hbar (rawsum) sex1sex2
to draw a horizontal bar graph. Horizontal bars tend to be better than vertical bars.
Pie chart is easier. graph pie, over(sex)
will do it. Pie charts tend to be not as effective as bar charts, unless extensive labelling is done.
Cumulative frequency distribution of a variable abc
can be drawn by first generating a new variable (say, cumabc
) using cumul abc, gen(cumabc)
, and then using line cumabc abc, sort
.
Scatterplot with jittering: Try twoway (scatter fev1 sex, jitter(30))
.
Bar chart for a continuous variable: In the Peru lung study, suppose you want to draw average fev1 for the four sexrespsymptoms groups with error bars. You need first to calculate some values:
collapse (mean) meanfev1=fev1 (sd) sdfev1=fev1 (count) n=fev1, by(sex resp)
gen hifev1 = meanfev1 + invttail(n1,0.025) * (sdfev1 / sqrt(n))
gen lofev1 = meanfev1  invttail(n1,0.025) * (sdfev1 / sqrt(n))
To make the graphs nicer, you may want to define labels:
label define sexlabel 0 female 1 male
label value sex sexlabel
label define symplabel 0 "no symptom" 1 symptom
label value respsymptoms symplabel
Then draw the graph using graph bar meanfev1, over(sex) over(resp)
. To draw a nicer bar with error bars, follow the instruction here.
Common issues
Labels: You can create labels for datasets, variables, or the values of variables so that the output will be easier to read and interpret. To list all existent value labels on the current dataset, use label dir
or label list
; for detailed information, use labelbook
. To create a label for a dataset, use label data
; to create a label for a variable, use label variable
; to created labels for the values of a variable, you need first to define the labels by using label define
and then to assign them to the variable by using label value
. For the last situation, there is an example under the bar chart section in this page. Type help label
for details.
Analysis by subsets: Suppose you have two groups of subjects defined by a variable, say group
, and you want to study a variable, say abc
, across the groups. Stata has inconsistent ways of doing this depending on the command you use. Some commands require a by var:
prefix, some require a by(var)
option after a comma after the command, and some require a over(var)
option.
Creation of indicator variables: xi
will generate indicator (dummy) variables for the variable(s) that marked with i.
. The names for the newly created indicator variables by default start with _I
; using the prefix()
option gets rid of this default. Example commands are:
xi, prefix(): logistic mf i.agegrp
xi, prefix(): logistic mf i.agegrp*area
xi, prefix(): logistic mf i.agegrp*i.area
Model fitting: Stata views the latest model fit (or "estimation" in Stata terminology, generated through an "estimation command") as the current model, and you can issue "postestimation" commands to obtain the results of the fitted model or to use the fitted model for prediction. Type help estimation commands
to see all estimation/model fitting commands.
Reture values: Many commands can return values, which can be used in the next command or saved for future use. The return values can be in either rclass or eclass. You can always type return list
and ereturn list
to see what results can be returned. General commands generate rclass return values and estimation commands (i.e. model fitting commands) generate eclass return values. Type help return
for details. For example, if I want to use the parameter estimates of a logistic regression in my future analysis, I can print them by matrix list e(b)
or extract them right after fitting the regression by matrix define A = e(b)
. This saves the coefficients into a matrix A.
Likelihood ratio test: If you want to compare two models, you have to store the results of a model. For example, the command est store A
will store the results of then currect model as "A". Later on, you can compare the results stored in "A" and those of the current model by lrtest A .
or lrtest . A
or just lrtest A
, where "." stands for the current model.
Note that the likelihood ratio test can be used ONLY to compare two nested models (one full model and one reduced model). When you use lrtest
to compare models that are not nested, Stata will assume the model with fewer parameters as the reduced model and generate results. But the results are not valid.
Continuous outcome variables
Ttest and ANOVA (EMS 79)
Confidence interval: ci
is the command.
Ttest on a variable stratified by another variable: An example is to test if fev1 is different in children less than 9 years old and in those older than 9 years old. If the stratifying variable doesn't have two levels, you need to generate a new variable first and carry out the ttest:
gen agegroup = age < 9
ttest fev1, by(agegroup)
ANOVA: anova
is the general command. For oneway ANOVA models, oneway
and loneway
are more useful. For the data in EMS Table 9.1(a), oneway hemoglbn type, t
will generate the results in EMS Table 9.1(c). (anova hemoglbn type
will generate the same results.)
Multiple comparison tests in ANOVA: In the command oneway hemoglbn type, b sch si
, the options after comma makes the command generate multiple comparison test results using Bonferroni, Scheffé, and Sidak procedures. To carry out Tukey's multiple comparison test, you can use the prcomp
command in the "sg101" package. [Type net search sg101
and click on the link and follow instructions for installation.] Once you have installed the package, type prcomp hemoglbn type, anova tukey
to get Tukey's test results.
Linear regression (EMS 1012)
regress
is one major command. See Regression with Stata for details.
For the data in EMS Table 10.1, regress plasma weight
will generate the results in EMS 10.2 and 10.4.
If you do a simple linear regression and want to draw the fitted line and its confidence interval, try the following. In the Peru lung study, try the following:
regress fev1 age
scatter fev1 age  lfit fev age
scatter fev1 age  lfitci fev age
rvfplot, yline(0)
To carry out a Ftest to see if age and height combined will have significant effect on the outcome after sex is taken into account, use test
command. For example, try this:
regress fev1 age height sex
test age height
The test
command allows you to carry out various tests. Check its help page for details.
To obtain residuals and put them into a variable, say aa
, use command predict aa, resid
. To obtain Cook's distance in a variable, say bb
, use predict bb, cooksd
. To obtain standardized residuals in a variable, say cc
, use predict cc, rstandard
.
qnorm
draws an inverse normal plot. swilk
performs ShapiroWilk test for normality.
Categorical outcome variables
Analyses on contigency tables (EMS 1517)
 Single binomial distribution: point estimate, CI, and ztest

prtesti 1000 123 .13, count
will generate the results in EMS 15.56.

cii 1000 123, wald
will generate point estimate and CI.
 Risk difference: point estimate, CI, and ztest

prtesti 240 20 220 80, count
will generate the results in EMS 16.3 (data in EMS Table 16.2).

csi 20 80 220 140
will genrate point estimate and CI, but with a test on odds ratio.
 Risk ratio: point estimate, CI

csi 39 6 29961 59994
will generate the results in EMS 16.5 (data in EMS Table 16.4), but with a test on odds ratio.
 Odds ratio: point estimate, CI, and test

cci 81 995 57 867, woolf
will generate the results in EMS 16.7 (data in EMS Table 16.5), but with a test based on the square of the zstatistic (which is equivalent to the ztest in EMS).

csi 81 57 995 867, or woolf
will generate the same results.
 Pearson's chisquared test and Fisher's exact test

tabi 20 220 \ 80 140, chi2
will generate the results in EMS 17.2 (data in EMS Table 16.2 or 17.1).

tabi 1 12 \ 3 9, exact
will generate the results in EMS 17.3 (data in EMS Table 17.2).

tabi 20 18 12 \ 32 20 8 \ 18 12 10
will generate the results in EMS 17.4 (data in EMS Table 17.4).
 Tests of trend and of departure from trend:
 We need to input data in EMS Table 17.5 like this:
clear
input menarche triceps freq
1 0 15
0 0 156
1 1 29
0 1 197
1 2 36
0 2 150
end


tabodds menarche triceps [weight=freq]
will generate the results in EMS 17.5.

logistic menarche triceps [weight=freq]
will generate logistic regression results.

ptrendi 15 156 0 \ 29 197 1 \ 36 150 2
will generate the results in EMS 17.5, using a formula for V that is slightly different from that in EMS p. 175 (N − 1 is replaced by N). (Note that ptrendi
can be installed by typing ssc install ptrend
.)
Analyses on stratified contigency tables (EMS 18, 21)
For data in EMS Table 18.2, there are multiple approaches to studying the effect of area type (rural or urban) on the probability of having antibodies to leptospirosis, while taking into account the effect due to gender. We are assuming the effect as measured by odds ratio is the same for males and females; in other words, we assume no interaction between area type and gender. On the basis of this assumption, we are estimating that "common" effect.
 MantelHaenszel method: point estimate, CI, and test
 We need to input data in EMS Table 18.2 like this:
clear
input gender area antibody freq
1 1 1 36
1 1 0 14
1 0 1 50
1 0 0 50
0 1 1 24
0 1 0 126
0 0 1 10
0 0 0 90
end


mhodds antibody area gender [weight=freq], compare(1,0)
will generate the results in EMS 18.4.

cc antibody area [weight=freq], by(gender)
also will generate these results as well as the result of test of heterogeneity (EMS 18.5). Note that test of heterogeneity is equivalent to test of interaction effect between area
and gender
.
 Logistic regression:
logistic antibody area gender [weight=freq]
will generate similar, but different, results using logistic regression.
 Conditional logistic regression: It will generate another different set of results. The
clogit
command treats [weight=]
differently from we did in the above commands. Therefore, we need to input data as 400 rows before carrying out conditional logistic regression.
clear
set obs 400
gen gender = 0
replace gender = 1 in 1/150
gen area = 0
foreach i in 1/50 151/300 {
replace area = 1 in `i'
}
gen antibody = 0
foreach i in 1/36 51/100 151/174 301/310 {
replace antibody = 1 in `i'
}
** conditional logistic regression
clogit antibody area, group(gender) or
 Matched studies:
mcci 184 54 14 63
will generate the McNemar's test result and OR estimate in EMS 21.23, but its CI calculations are based on formulae different from those in EMS. We also can try the MantelHaenszel method or conditional logistic regression, which are equivalent for paired binary data:
clear
set obs 630
gen speciman = int((_n+1)/2)
gen bell = 0
replace bell = 1 if mod(_n,2)==0
gen outcome = 0
replace outcome = 1 in 1/368
replace outcome = 1 if _n>=369 & _n<=476 & mod(_n,2)==0
replace outcome = 1 if _n>=477 & _n<=504 & mod(_n,2)==1
** MantelHaenszel method
mhodds outcome bell speciman
** conditional logistic regression (same results)
clogit outcome bell, group(speciman) or
** another command to do conditional logistic regression
xtlogit outcome bell, fe i(speciman) or
** randomeffect logistic regression (different results)
xtlogit outcome bell, re i(speciman) or
 The following calculates the kappa statistics (EMS 36.3) on the agreement between the two tests:
clear
input bell katokatz freq
1 1 184
1 0 54
0 1 14
0 0 63
end
kap bell katokatz [weight=freq], tab
Logistic regression (EMS 1920)
See Logistic Regression with Stata and here for more information on logistic regression. help logistic estimation commands
will list relevant commands (but without logistic
, wierd).
Examples in EMS 1920 use the Onchocerciasis data, which can be downloaded here.
 Study solely the effect of residence area (EMS 19.2)

logistic mf area
or logit mf area, or
generates output in odds scale

cc mf area, woolf
generates the same results. This is because the input variable area
is binary.

logistic mf area, coef
or logit mf area
generates output in parameter scale (i.e. log(odds) scale)
 Study solely the effect of age group (EMS 19.45) (See the "Common issues" section for explanations of
xi
, est store
, and lrtest
.)
 Age group as a categorical variable (EMS 19.4):
xi, prefix(): logistic mf i.agegrp
generates output in odds scale.
 Wald test for age group effect: After fitting the above regression, type
test agegrp_1 agegrp_2 agegrp_3
.
 Likelihood ratio test for age group effect:
** Model with age group
xi, prefix(): logistic mf i.agegrp
est store A
** Model without age group
logistic mf
lrtest A, stat

 Age group as a continuous variable (EMS 19.5):
logistic mf agegrp
 We can generate a graph to see the fit of this model compared to the data. Use
table agegrp
to see which groups tend to have high influence on results.
xi, prefix(): logistic mf i.agegrp
predict risk1
gen logodds1 = log(risk1/(1risk1))
logit mf agegrp
matrix define A=e(b)
** Draw the groupwise risk and the predicted risk curve
scatter risk1 agegrpfunction 11/(exp(A[1,2]+A[1,1]*x)+1),range(agegrp)
graph rename graph1, replace
** Draw groupwise logodds and fitted regression line as EMS Figure 19.3
scatter logodds1 agegrp  function A[1,2]+A[1,1]*x, range(agegrp)

 Compare the above two models to see if the extra parameterization of treating age group as a categorical variable is significantly better. We cannot directly compare the two models because they appear not to be nested. (Actually, in this case, they are nested, as shown below. In general, always make sure the models are nested.) We can do this:
xi, prefix() i.agegrp
** Model with age group as categories but parameterized to include
** the continuous agegrp variable
logistic mf agegrp agegrp_2 agegrp_3
est store A
** Model with agegrp as a continuous variable
logistic mf agegrp
lrtest A, stat
 Study two variables simultaneously: residence area and age group (See the "Common issues" section for explanations of
xi
, est store
, and lrtest
.)
 No interaction between the two variables (EMS 20.2):
xi, prefix(): logistic mf area i.agegrp
 Interaction is modeled:
xi, prefix(): logistic mf i.agegrp*area
 The following code generate EMS Figure 20.1. Use
tab area agegrp
to see which groups tend to have high influence on results.
xi, prefix(): logistic mf i.agegrp*area
predict risk2
gen logodds2 = log(risk2/(1risk2))
xi, prefix(): logistic mf area i.agegrp
predict risk3
gen logodds3 = log(risk3/(1risk3))
** sort data so that the line plot will not be messy
sort agegrp
** Draw groupwise logodds and fitted logodds as EMS Figure 20.1
** Copy the next two lines into one line before hitting Enter
scatter logodds2 agegrp  line logodds3 agegrp if area==0
 line logodds3 agegrp if area==1
** Graph in probability scale. The difference between two areas
** are no longer the same for different age groups.
scatter risk2 agegrp  line risk3 agegrp if area==0
 line risk3 agegrp if area==1

 To test for interaction, we need to fit two models, one with interaction and one without interaction, and compare their likelihood using a likelihood ratio test. The following can be used to test for interaction while treating age group as a categorical variable:
** Model with full interaction between age group and area
xi, prefix(): logistic mf i.agegrp*area
est store A
** Model without interaction
xi, prefix(): logistic mf area i.agegrp
lrtest A, stat
The following can be used to test for interaction while treating age group as a continous variable:
** Model with interaction between age group and area
gen agebyarea = agegrp*are
logistic mf agegrp area agebyarea
est store A
** Model without interaction
logistic mf area agegrp
lrtest A, stat
When we treated age group as a categorical variable, the interaction model with three extra parameters only increased the 2x(loglikelihood) by 5.27 over the noninteraction model, with corresponding pvalue 0.153. When we treated age group as a continuous variable, the interaction model with only one extra parameter increased the 2x(loglikelihood) by 6.46 over the noninteraction model (a different noninteraction model), with corresponding pvalue 0.011. This is a good example to demonstrate the fact that results should always be interpreted with context. These two "interactions" are different interactions! In practice, the test with fewer degrees of freedom often is more powerful.
The interaction analysis treating age group as a categorical variable is comparing the fitness of the two parallel zigzagged lines with the eight dots (previous code is copied here):
scatter logodds2 agegrp  line logodds3 agegrp if area==0
 line logodds3 agegrp if area==1
The interaction analysis treating age group as a continous variable is comparing the fitness of the blue lines and the red lines (assuming agebyarea
and logodds2
have been defined as above):
logistic mf agegrp area agebyarea
matrix define B=e(b)
logistic mf agegrp area
matrix define C=e(b)
** Copy the next five lines into one line before hitting Enter
scatter logodds2 agegrp
 function B[1,4]+B[1,1]*x, range(agegrp) lc(blue)
 function B[1,4]+B[1,2]+(B[1,1]+B[1,3])*x, range(agegrp) lc(blue)
 function C[1,3]+C[1,1]*x, range(agegrp) lc(red)
 function C[1,3]+C[1,2]+C[1,1]*x, range(agegrp) lc(red)
 If the outcome variable has more than two categories and the categories are naturally ordered, the appropriate analysis is ordinal logistic regression (also called proportional odds model). The Stata command for it is
ologit
.
Count outcome variables
Analyses on rates (EMS 2223)
 Single Poisson distribution: rate estimate, s.e. of the rate estimate, CI.
cii 873 57, poisson
will generate results in EMS 22.56, with numbers in count per childyear, not per 1000 childyears. In the latter scale, the CI will be from 49.45 to 84.59, different from that in EMS 22.6. In fact, the CI calculated by Stata is based on the Poisson likelihood function. The lower bound is λ_{L} such that Pr[Poisson(λ_{L}) ≥ 57] = 0.025, and the upper bound is λ_{U} such that Pr[Poisson(λ_{U}) ≤ 57] = 0.025. If the data are input as below, strate
will generate CI as in EMS 22.6.
 Rate difference:
iri 33 24 355 518
gives results in EMS 23.2. If the data are input as below, stir poor
and ir infection poor childyear
give the same results.
 Rate ratio: We need to input data in EMS Table 23.2 like this:
clear
input infection childyear poor stove outcome
28 251 1 1 1
5 52 0 1 1
5 104 1 2 1
19 466 0 2 1
end
gen unit = childyear/infection
stset unit [freq=infection]
Note that the command stset
is important. It declares data as survivaltime data, with a total of 57 failures (sum of the frequency variable infection
). Stata will treat the first row as 28 copies and will multiply the time variable by 28. This is why we had to define a time variable unit
as the total childyear divided by the number of infections.

 Overall and stratified failure rates and their CIs:
strate
gives overal failure rate and strate stove poor
gives results stratified by stove and poor.

stmh poor, compare(1,0)
gives rate ratio and its CI as in EMS 23.2, but with a different test. Alternatively, we may carry out a Poisson regression: poisson outcome poor [freq=infection], e(unit) irr
.

stmh poor, by(stove) compare(1,0)
gives the results in EMS 23.3: Rate ratio for each stove stratum, MantelHaenszel estimate of the assumed common rate ratio acrosss stove strata, the associated CI, test for rate ratio, and test for heterogeneity. [Note stmh poor stove, compare(1,0)
also gives some of the information.] Again, we may carry out a Poisson regression: poisson outcome poor stove [freq=infection], e(unit) irr
. Note that test for heterogeneity is equivalent to test for interaction effect between poor
and stove
.
Poisson regression (EMS 24)
After opening the caerphilly_ems.dta file, type des
, label list
, and st
to learn the structure of the data.
 Summarize the rates stratified by current smoking status:
stir cursmoke
will generate the results in EMS Table 24.3.
 Study solely the effect of current smoking status:
poisson mi cursmoke, e(folltime)
will generate the results in EMS Table 24.5, while adding the option irr
will generate the results in rate ratio (EMS Table 24.4; same as stmh cursmoke
).
 Study solely the effect of social class:
 Social class as a categorical variable:
xi, prefix(): poisson mi i.socclass, e(folltime) ir
will generate the results in EMS Table 24.6.
 Social class as a continuous variable:
poisson mi socclass, e(folltime) ir
will generate the results in EMS Table 24.7.
 Compare the two models to see if treating social class as a categorical variable significantly improves the fit compared to treating it as a continuous variable. Note that these two models are nested. You can see which groups have high influence on the fit by
strate socclass
.
** Model with social class as a categorical variable
xi, prefix(): poisson mi i.socclass, e(folltime) ir
est store A
predict irate, ir
label var irate "incidence rate"
** Model with social class as a continuous variable
poisson mi socclass, e(folltime) ir
lrtest A, stat
matrix define A=e(b)
** Graphical comparison
scatter ir socclass  function exp(A[1,2]+A[1,1]*x), range(socclass)
 Study current smoking status and social class simultaneously:

stmh cursmoke, by(socclass)
generate the results in EMS Table 24.8.

xi: poisson mi cursmoke i.socclass, e(folltime) ir
generate the results in EMS Table 24.9.
Survival outcome variables
Survival Analysis with Stata is a good resource on survival analysis. Type help st
for more details.
Analysis on hazards (EMS 26)
After opening the pbc_ems.dta file, type des
, label list
, and st
to learn the structure of the data.
The follow up time is recorded in years. We define a new variable to reflect the time in days by gen days = time*365.25
, and redeclare the survivaltime data structure by stset days, failure(d)
.

sts list if cenc0==1
will generate EMS Table 26.2. ltable days d if cenc0==1, graph
will generate EMS Table 26.2 and Figure 26.2. sts graph if cenc0==1, g
will generate a different flavor of EMS Figure 26.2.

sts graph, by(cenc0)
or ltable days d, by(cenc0) notable graph overlay
will generate EMS Figure 26.3.

sts graph, by(cenc0) na
will generate a graph of cumulative hazard. To generate EMS Figure 26.4, we need to add more options: sts graph, by(cenc0) na xscale(log) yscale(log) xlabel(10 100 1000 5000) ylabel(.001 .01 .1 1 5)
.
 MantelCox method:
stmc cenc0
will generate the results in EMS 26.5.
Cox proportional hazards regression (EMS 27)

stcox cenc0
will generate the results in EMS 27.2.
EMS other chapters
Nonparametrics (EMS 30.2)
 One sample: An observation has two parts of information: magnitude and sign. A ttest statistic is calculated based on both. We can replace the magnitude information by the corresponding ranks. Wilcoxon's signed rank test (
signrank
) is calculated based on the rank information and sign. A sign test (signtest
) is calculated based on the sign information alone, effective making the outcome as binomial.
 Two samples: Wilcoxon's rank sum test (
ranksum
)
 More than two samples: KruskalWallis test (
kwallis
)
 Rank correlations:
spearman
can be used to calculate Spearman's rank correlation coefficients. ktau
can be used to calculate Kendall's rank correlation coefficients.
ROC curve (EMS 36.2)
 In the Peru lung study, lower volumn of air a child could breathe out in a second tends to be associated with respiratory symptoms (coded as 1). Thus, the two variables appear in negative association. However, the command
roctab
assumes positive association, and we have to define a new variable before hand: gen negfev1 = fev1
. Then we can use roctab respsymptoms negfev1, sum graph
to output the results and EMS Figure 36.1.
 If we want to compare the ROC curves for boys and girls, we can use
roccomp respsymptoms negfev1, by(sex) sum graph
.
Advanced topics
Matrix
Matrix is very useful in statistics. Stata has many matrix related functions. Type help matrix
to start learning them.
Programming
Programming is useful in many scenarios: (1) When you analyze data using a protocol repeatedly, you may want to put the code into a program so that you can just call the program instead of repeating the many commands. (2) When you carry out a simulation, you will repeat a procedure many times. It will be very efficient to write a program and execute the program. (3) When people implement a statistical method in Stata, they write and distribute programs. Type help program
to start learning Stata programming.
Simulations
Note: Stata is not a good program to carry out sophisticated simulations, for which R is a better choice.
Generation of random variables: uniform()
will generate a random number between 0 and 1 (i.e. from the uniform(0, 1) distribution). To generate a random number from another distribution, apply the corresponding inverse cumulative distribution function. For example, invnormal(uniform())
will generate a random number from N(0, 1). help density functions
will list many such functions.
For example, if you want to generate a variable with 100 random numbers from a normal distribution with mean 3 and SD = 7, use:
clear
set obs 100
gen aa = invnormal(uniform()) * 7 + 3
Generation of normal random variables: Normal variables are used so often that Stata has a designated function for drawing them. drawnorm
can be used to generate normal variables, independent or dependent.
Sampling: sample
can be used to sample from existing values without replacement. bsample
and bootstrap
can be used to sample from existing values with replacement.
Simulation: simulate
is a major command used in simulations. It calls another command repeatedly and stores the results of each call. The command that is called often is a program defined by the user.
The following programs give you a sense of how people write simulation programs in Stata:
 Sampling.do is a program to simulate the repeated sampling procedure similar to that in EMS 4.5. The program treats the data as a population, and draws a random sample from a variable and calculates the sample mean. This procedure is repeated 100 times to demonstrate the variation of the sample mean as an estimate of the population mean. The program outputs summary and histogram of the 100 estimates. It also draw samples in different colors depending on whether the 95% confidence interval calculated based on the sample covers the real average or not.
 You may need to modify the program to run it on your computer. After you have saved the program, go to Stata and press ctrl8 [or, Window > Dofile Editor > New Dofile], then open the program. Modify the four lines near the beginning to try it on any dataset/variable/sample size/number of samples. Make sure the pathname is correct for the data file in your computer. After this, you can run the program by going to File > Do and choose the program.
 You can change the sample size to other numbers and rerun the program to see the effect of sample size on variation of the estimates.
 SimulatedSampling_ttest.do is a program to show the variation of sampling and its impact on test results as a function of sample means and sample SDs.
 SimulatedSampling_regression.do is a program to show the variation of regression line estimate as a function of sample size and standard deviation in the error term.
Stata graphics, under the hood
Stata graphics, under the hood was a talk by Vince Wiggins on the capabilities of Stata graphics. The talk was given from within Stata. You need to install the talk before starting it.
net from http://www.stata.com/users/vwiggins
net describe uk04
net install uk04
After this, you need to check if the PERSONAL ado directory exists. Type sysdir
to see what directory is designated for PERSONAL. Then check if the directory exist; if not, create it yourself. Then use ukgrtalk
to install the files needed for the talk. After this, you can type help ukgrtalk
and follow the instructions.
System settings and limits
Try query
to see the various system parameter settings. To change a setting, use set
command. To obtain the current setting dynamically (e.g. in a program), use cclass return values. For details, type creturn list
, which gives a more comprehensive list than query
does. Click on the links of the output to see corresponding help pages. Type help limits
will list all the limitations your version of Stata imposes.