ChunStataNotes < Main < Vanderbilt Biostatistics Wiki

<font size="3"><span style="font-family: times new roman,times,serif;">
---+++!! Chun's Stata Notes
In this page, [EMS] refers to _[[http://www.blackwellpublishing.com/essentialmedstats][Essential Medical Statistics]]_, a very good introductory book on biostatistics.
%TOC%

---++++ Getting help
*Help pages*:  Type =help= followed by the name of a command or a function will display the corresponding help page.
   * Connection between help pages and GUI:  On the top right of most help pages, there are links labelled as dialogs.  Click one of them to get to the menu counterpart(s) of the command.
   * Pointers to other relevant commands:  At the end of a help page, there often is a section titled "Also see", which includes pointers to relevant commands.  The links in this section often point me to the commands that suit my need.
   * Help on topic groups:  For example, =help estimation commands= shows all estimation/model fitting commands, and =help functions= shows all functions.
   * You can type =help= followed by nothing or go to Help -> Contents to start a help window and continue from there.

*Useful web pages*:
   * [[http://www.stata.com][http://www.stata.com]]
   * [[http://www.ats.ucla.edu/stat/stata/][UCLA Stata Resources]]:  A rich site.  The topics under "Learning Stata" are very helpful.
      * [[http://www.ats.ucla.edu/stat/stata/examples/default.htm][Stata Textbook Examples]]:  If you happen to being reading one of the books listed on this site, you may download Stata code to run the examples in the book.
      * [[http://www.ats.ucla.edu/stat/stata/webbooks/reg/][Regression with Stata]] and [[http://www.ats.ucla.edu/stat/stata/topics/regression.htm][here]]
      * [[http://www.ats.ucla.edu/stat/stata/webbooks/logistic/][Logistic Regression with Stata]] and [[http://www.ats.ucla.edu/stat/stata/topics/logistic_regression.htm][here]] 
      * [[http://www.ats.ucla.edu/STAT/stata/seminars/stata_survival/default.htm][Survival Analysis with Stata]] and [[http://www.ats.ucla.edu/stat/stata/topics/Survey.htm][here]]
      * [[http://www.ats.ucla.edu/stat/stata/topics/graphics.htm][Graphics]]
      * [[http://www.ats.ucla.edu/stat/stata/topics/data_management.htm][Data management]]
   * [[http://statcomp.ats.ucla.edu/stata/][UCLA Stata Portal]]: Many useful links
   * [[http://edirc.repec.org/data/debocus.html][Stata Users Group meetings]]:  Many talks are downloadable.
   * [[http://data.princeton.edu/wws509/stata/default.htm][Stata Logs]] by Dr. Germán Rodríguez
   * Under various topics below, you may find some additional links.

*Useful web pages accessible from Stata*
   * =net from <nop>http://www.ats.ucla.edu/stat/stata/ado= :  UCLA's Stata programs.  Type =net cd teach= to enter the directory for teaching demonstrations.
   * =net from <nop>http://www.stata.com/= :  Stata's web page.
   * You can use =net= to check current site, =net cd dirname= to change directory to "dirname" (as if in Unix), =net describe pkgname= to describe a package (named "pkgname"), =net install pkgname= to install a package, and =net link linkname= to go to another site.

---++++ Miscellaneous
*Updates*:  Type =update= for information of your current installation.  Type =update query= to compare the current installation with the newest available from www.stata.com.  You also can set the program to check for updates automatically.

*Stata news*:  Just type =news=.

*Set memory*:  Type =memory= for current memory setting.  When Stata starts, it allocates a fixed amount of memory to use.  You can use =set memory 50m, perm= to set the amount to 50 Megabytes in the future.

*Calculator*:  Stata can be used as a simple command-line calculator by using the =display= (or =di=) command.  For example, =di 15/4= gives 3.75, =di log(3)= gives the natural logarithm of 3, =di log10(5)= gives the base-10 logarithm of 5, =di chi2tail(2, 3.87)= gives the p-value of a test statistic 3.87 compared to the chi-squared distribution with 2 degrees of freedon.

*Functions for tail probabilities*:  Stata doesn't have a clear, consistent format.
   * =normal(1.96)= gives the standard normal left-tail probability of Pr(_z_ &le; 1.96)
   * =ttail(10, 1.8)= gives the right-tail probability Pr(t<sub>10</sub> &ge; 1.8)
   * =Ftail(2, 10, 3.5)= gives the right-tail probability of Pr(F<sub>2,10</sub> &ge; 3.5)
   * =chi2tail(3, 2.5)= gives the right-tail probability of Pr(&chi;<sup>2</sup><sub>3</sub> &ge; 2.5)
   * =Binomial(50, 29, 0.3)= gives the right-tail probability of Pr(X &ge; 29 | n = 50, p = 0.3).
   * For details of more such functions, type =help density functions=.

---++++ Exploring data
I find it useful to learn the basic information about a dataset through commands =describe=, =label list=, =notes=, and =st=.  For more details, try =codebook=, =labelbook=.

*Variable types*:  (Type =help data types= for details.)  Almost all statistical software packages need to store variables in different ways depending on the nature of the variables.  When a software package reads in a data set in a non-default format (e.g. ASCII/text files), it has to decide on the types of variables based on the patterns it sees.  This may cause a problem if you are not careful.
   * Example 1:  If you have a value 16o (with zero mistyped as letter 'o') in variable "height", or if "." or "?" is used for missing values, the software may think these are possible values of the variable and thus the variable must not be numeric.  As a result, the variable may be stored as categorical or character strings.  Later on, when you want to calculate the mean (or any other analysis), the software may give you an error message because it cannot do this on categorical variables or strings.
   * Example 2:  If you have an "ID" variable and code it using numbers, or if you have a categorical variable with five categories and code it as 1-5, the software may think the variable is numeric and store it as numbers.  Later on, when you do analysis on this variable, the software will treat them as numbers instead of factors and thus give wrong results, without any warning at all.
   * When you see strange error messages or strange results, check if the variables are in correct types.  Stata command =describe= can give you some basic information on variable types.  Some variables may be disguised by labels; use =label list= to see all the labels used in the data.  You can change variable types using command =format= if you want to.

*Summary*:  The commands =summarize= and =inspect= can give you basic information about variables.  To get more detailed information, add the =, detail= option after the command.  For example, =su age language, d= will give detailed summary information on variables =age= and =language=.

*Count*:  To count the number of records meeting some criteria, use the =count= command.  For example, =count mi(language)= gives the number of records with missing values for variable language, =count if age<9= gives the number of records with age<9.

*Tables*:  =tabulate= is a more powerful command than =table=.  To get some additional statistics, you can try =tabulate var1 var2, cchi2 chi2 column expected row=.

---++++ Graphics
Type =help graph intro= for more information on Stata graphics.  [[http://www.ats.ucla.edu/stat/stata/library/GraphExamples/default.htm][Graph Examples]] provides examples to show what types graphs Stata can create.

*Bar chart for a categorical variable*:  You have to create multiple variables, each for a category.  For example, use =tabulate sex, gen(aa)= will create two variables =aa1= and =aa2= as indicator variables of the two sexes.  To make it cleaner, I can use =tabulate sex, gen(sex)= to create =sex1= and =sex2=.  Then I use =graph hbar (rawsum) sex1-sex2= to draw a horizontal bar graph.  Horizontal bars tend to be better than vertical bars.

*Pie chart* is easier.  =graph pie, over(sex)= will do it.  Pie charts tend to be not as effective as bar charts, unless extensive labelling is done.

*Cumulative frequency distribution* of a variable =abc= can be drawn by first generating a new variable (say, =cumabc=) using =cumul abc, gen(cumabc)=, and then using =line cumabc abc, sort=.

*Scatterplot with jittering*:  Try =twoway (scatter fev1 sex, jitter(30))=.

*Bar chart for a continuous variable*:  In the Peru lung study, suppose you want to draw average fev1 for the four sex-respsymptoms groups with error bars.  You need first to calculate some values:
<verbatim>
collapse (mean) meanfev1=fev1 (sd) sdfev1=fev1 (count) n=fev1, by(sex resp)
gen hifev1 = meanfev1 + invttail(n-1,0.025) * (sdfev1 / sqrt(n))
gen lofev1 = meanfev1 - invttail(n-1,0.025) * (sdfev1 / sqrt(n))
</verbatim>
To make the graphs nicer, you may want to define labels:
<verbatim>
label define sexlabel 0 female 1 male
label value sex sexlabel
label define symplabel 0 "no symptom" 1 symptom
label value respsymptoms symplabel
</verbatim>
Then draw the graph using =graph bar meanfev1, over(sex) over(resp)=.  To draw a nicer bar with error bars, follow the instruction [[http://www.ats.ucla.edu/stat/stata/faq/barcap.htm][here]].


---++++ Common issues
*Labels*:  You can create labels for datasets, variables, or the values of variables so that the output will be easier to read and interpret.  To list all existent value labels on the current dataset, use =label dir= or =label list=; for detailed information, use =labelbook=.  To create a label for a dataset, use =label data=; to create a label for a variable, use =label variable=; to created labels for the values of a variable, you need first to define the labels by using =label define= and then to assign them to the variable by using =label value=.  For the last situation, there is an example under the bar chart section in this page.  Type =help label= for details.

*Analysis by subsets*:  Suppose you have two groups of subjects defined by a variable, say =group=, and you want to study a variable, say =abc=, across the groups.  Stata has inconsistent ways of doing this depending on the command you use.  Some commands require a =by var:= prefix, some require a =by(var)= option after a comma after the command, and some require a =over(var)= option.

*Creation of indicator variables*:  =xi= will generate indicator (dummy) variables for the variable(s) that marked with =i.=.  The names for the newly created indicator variables by default start with =_I=; using the =prefix()= option gets rid of this default.  Example commands are:
<verbatim>
xi, prefix(): logistic mf i.agegrp
xi, prefix(): logistic mf i.agegrp*area
xi, prefix(): logistic mf i.agegrp*i.area
</verbatim>

*Model fitting*:  Stata views the latest model fit (or "estimation" in Stata terminology, generated through an "estimation command") as the current model, and you can issue "postestimation" commands to obtain the results of the fitted model or to use the fitted model for prediction.  Type =help estimation commands= to see all estimation/model fitting commands.

*Reture values*:  Many commands can return values, which can be used in the next command or saved for future use.  The return values can be in either r-class or e-class.  You can always type =return list= and =ereturn list= to see what results can be returned.  General commands generate r-class return values and estimation commands (i.e. model fitting commands) generate e-class return values.  Type =help return= for details.  For example, if I want to use the parameter estimates of a logistic regression in my future analysis, I can print them by =matrix list e(b)= or extract them right after fitting the regression by =matrix define A = e(b)=.  This saves the coefficients into a matrix A.

*Likelihood ratio test*:  If you want to compare two models, you have to store the results of a model.  For example, the command =est store A= will store the results of then currect model as "A".  Later on, you can compare the results stored in "A" and those of the current model by =lrtest A .= or =lrtest . A= or just =lrtest A=, where "." stands for the current model.

Note that the likelihood ratio test can be used ONLY to compare two nested models (one full model and one reduced model).  When you use =lrtest= to compare models that are not nested, Stata will assume the model with fewer parameters as the reduced model and generate results.  But the results are not valid.


---++++ Continuous outcome variables
---+++++ T-test and ANOVA (EMS 7-9)
*Confidence interval*:  =ci= is the command.

*T-test on a variable stratified by another variable*:  An example is to test if fev1 is different in children less than 9 years old and in those older than 9 years old.  If the stratifying variable doesn't have two levels, you need to generate a new variable first and carry out the t-test:
<verbatim>
gen agegroup = age < 9
ttest fev1, by(agegroup)
</verbatim>

*ANOVA*:  =anova= is the general command.  For one-way ANOVA models, =oneway= and =loneway= are more useful.  For the [[%ATTACHURL%/EMShaemoglobin.txt][data]] in EMS Table 9.1(a), =oneway hemoglbn type, t= will generate the results in EMS Table 9.1(c).  (=anova hemoglbn type= will generate the same results.)

*Multiple comparison tests in ANOVA*:  In the command =oneway hemoglbn type, b sch si=, the options after comma makes the command generate multiple comparison test results using Bonferroni, Scheffé, and Sidak procedures.  To carry out Tukey's multiple comparison test, you can use the =prcomp= command in the "sg101" package.  [Type =net search sg101= and click on the link and follow instructions for installation.]  Once you have installed the package, type =prcomp hemoglbn type, anova tukey= to get Tukey's test results.

---+++++ Linear regression (EMS 10-12)
=regress= is one major command.  See [[http://www.ats.ucla.edu/stat/stata/webbooks/reg/][Regression with Stata]] for details.

For the [[%ATTACHURL%/EMSplasma.txt][data]] in EMS Table 10.1, =regress plasma weight= will generate the results in EMS 10.2 and 10.4.

If you do a simple linear regression and want to draw the fitted line and its confidence interval, try the following.  In the Peru lung study, try the following:
<verbatim>
regress fev1 age
scatter fev1 age || lfit fev age
scatter fev1 age || lfitci fev age
rvfplot, yline(0)
</verbatim>

To carry out a F-test to see if age and height combined will have significant effect on the outcome after sex is taken into account, use =test= command.  For example, try this:
<verbatim>
regress fev1 age height sex
test age height
</verbatim>
The =test= command allows you to carry out various tests.  Check its help page for details.

To obtain residuals and put them into a variable, say =aa=, use command =predict aa, resid=.  To obtain Cook's distance in a variable, say =bb=, use =predict bb, cooksd=.  To obtain standardized residuals in a variable, say =cc=, use =predict cc, rstandard=.

=qnorm= draws an inverse normal plot.  =swilk= performs Shapiro-Wilk test for normality.


---++++ Categorical outcome variables
---+++++ Analyses on contigency tables (EMS 15-17)
   * Single binomial distribution: point estimate, CI, and z-test
      * =prtesti 1000 123 .13, count= will generate the results in EMS 15.5-6.
      * =cii 1000 123, wald= will generate point estimate and CI.
   * Risk difference: point estimate, CI, and z-test
      * =prtesti 240 20 220 80, count= will generate the results in EMS 16.3 (data in EMS Table 16.2).
      * =csi 20 80 220 140= will genrate point estimate and CI, but with a test on odds ratio.
   * Risk ratio: point estimate, CI
      * =csi 39 6 29961 59994= will generate the results in EMS 16.5 (data in EMS Table 16.4), but with a test on odds ratio.
   * Odds ratio: point estimate, CI, and test
      * =cci 81 995 57 867, woolf= will generate the results in EMS 16.7 (data in EMS Table 16.5), but with a test based on the square of the z-statistic (which is equivalent to the z-test in EMS).
      * =csi 81 57 995 867, or woolf= will generate the same results.
   * Pearson's chi-squared test and Fisher's exact test
      * =tabi 20 220 \ 80 140, chi2= will generate the results in EMS 17.2 (data in EMS Table 16.2 or 17.1).
      * =tabi 1 12 \ 3 9, exact= will generate the results in EMS 17.3 (data in EMS Table 17.2).
      * =tabi 20 18 12 \ 32 20 8 \ 18 12 10= will generate the results in EMS 17.4 (data in EMS Table 17.4).
   * Tests of trend and of departure from trend:
      * We need to input data in EMS Table 17.5 like this:
<verbatim>
clear
input menarche triceps freq
1 0 15
0 0 156
1 1 29
0 1 197
1 2 36
0 2 150
end
</verbatim>
      * =tabodds menarche triceps [weight=freq]= will generate the results in EMS 17.5.
      * =logistic menarche triceps [weight=freq]= will generate logistic regression results.
      * =ptrendi 15 156 0 \ 29 197 1 \ 36 150 2= will generate the results in EMS 17.5, using a formula for V that is slightly different from that in EMS p. 175 (N &minus; 1 is replaced by N).  (Note that =ptrendi= can be installed by typing =ssc install ptrend=.)

---+++++ Analyses on stratified contigency tables (EMS 18, 21)
For data in EMS Table 18.2, there are multiple approaches to studying the effect of area type (rural or urban) on the probability of having antibodies to leptospirosis, while taking into account the effect due to gender.  We are assuming the effect as measured by odds ratio is the same for males and females; in other words, we assume no interaction between area type and gender.  On the basis of this assumption, we are estimating that "common" effect.
   * Mantel-Haenszel method: point estimate, CI, and test
      * We need to input data in EMS Table 18.2 like this:
<verbatim>
clear
input gender area antibody freq
1 1 1 36
1 1 0 14
1 0 1 50
1 0 0 50
0 1 1 24
0 1 0 126
0 0 1 10
0 0 0 90
end
</verbatim>
      * =mhodds antibody area gender [weight=freq], compare(1,0)= will generate the results in EMS 18.4.
      * =cc antibody area [weight=freq], by(gender)= also will generate these results as well as the result of test of heterogeneity (EMS 18.5).  Note that test of heterogeneity is equivalent to test of interaction effect between =area= and =gender=.
   * Logistic regression: =logistic antibody area gender [weight=freq]= will generate similar, but different, results using logistic regression.
   * Conditional logistic regression: It will generate another different set of results.  The =clogit= command treats =[weight=]= differently from we did in the above commands.  Therefore, we need to input data as 400 rows before carrying out conditional logistic regression.
<verbatim>
clear
set obs 400
gen gender = 0
replace gender = 1 in 1/150
gen area = 0
foreach i in 1/50 151/300 {
  replace area = 1 in `i'
}
gen antibody = 0
foreach i in 1/36 51/100 151/174 301/310 {
  replace antibody = 1 in `i'
}
** conditional logistic regression
clogit antibody area, group(gender) or
</verbatim>

   * Matched studies:  =mcci 184 54 14 63= will generate the <nop>McNemar's test result and OR estimate in EMS 21.2-3, but its CI calculations are based on formulae different from those in EMS.  We also can try the Mantel-Haenszel method or conditional logistic regression, which are equivalent for paired binary data:
<verbatim>
clear
set obs 630
gen speciman = int((_n+1)/2)
gen bell = 0
replace bell = 1 if mod(_n,2)==0
gen outcome = 0
replace outcome = 1 in 1/368
replace outcome = 1 if _n>=369 & _n<=476 & mod(_n,2)==0
replace outcome = 1 if _n>=477 & _n<=504 & mod(_n,2)==1

** Mantel-Haenszel method
mhodds outcome bell speciman
** conditional logistic regression (same results)
clogit outcome bell, group(speciman) or
** another command to do conditional logistic regression
xtlogit outcome bell, fe i(speciman) or
** random-effect logistic regression (different results)
xtlogit outcome bell, re i(speciman) or
</verbatim>

   * The following calculates the kappa statistics (EMS 36.3) on the agreement between the two tests:
<verbatim>
clear
input bell katokatz freq
1 1 184
1 0 54
0 1 14
0 0 63
end
kap bell katokatz [weight=freq], tab
</verbatim>

---+++++ Logistic regression (EMS 19-20)
See [[http://www.ats.ucla.edu/stat/stata/webbooks/logistic/][Logistic Regression with Stata]] and [[http://www.ats.ucla.edu/stat/stata/topics/logistic_regression.htm][here]] for more information on logistic regression.  =help logistic estimation commands= will list relevant commands (but without =logistic=, wierd).

Examples in EMS 19-20 use the Onchocerciasis data, which can be downloaded [[http://www.blackwellpublishing.com/essentialmedstats/datasets/oncho_ems.dta][here]].

   * Study solely the effect of residence area (EMS 19.2)
      * =logistic mf area= or =logit mf area, or= generates output in odds scale
      * =cc mf area, woolf= generates the same results.  This is because the input variable =area= is binary.
      * =logistic mf area, coef= or =logit mf area= generates output in parameter scale (i.e. log(odds) scale)

   * Study solely the effect of age group (EMS 19.4-5) (See the "Common issues" section for explanations of =xi=, =est store=, and =lrtest=.)
      * Age group as a categorical variable (EMS 19.4): =xi, prefix(): logistic mf i.agegrp= generates output in odds scale.
         * Wald test for age group effect:  After fitting the above regression, type =test agegrp_1 agegrp_2 agegrp_3=.
         * Likelihood ratio test for age group effect:
<verbatim>
** Model with age group
xi, prefix(): logistic mf i.agegrp
est store A
** Model without age group
logistic mf
lrtest A, stat
</verbatim>
      * Age group as a continuous variable (EMS 19.5): =logistic mf agegrp=
         * We can generate a graph to see the fit of this model compared to the data.  Use =table agegrp= to see which groups tend to have high influence on results.
<verbatim>
xi, prefix(): logistic mf i.agegrp
predict risk1
gen logodds1 = log(risk1/(1-risk1))
logit mf agegrp
matrix define A=e(b)
** Draw the group-wise risk and the predicted risk curve
scatter risk1 agegrp||function 1-1/(exp(A[1,2]+A[1,1]*x)+1),range(agegrp)
graph rename graph1, replace
** Draw group-wise log-odds and fitted regression line as EMS Figure 19.3
scatter logodds1 agegrp || function A[1,2]+A[1,1]*x, range(agegrp)
</verbatim>
      * Compare the above two models to see if the extra parameterization of treating age group as a categorical variable is significantly better.  We cannot directly compare the two models because they appear not to be nested.  (Actually, in this case, they are nested, as shown below.  In general, always make sure the models are nested.)  We can do this:
<verbatim>
xi, prefix() i.agegrp
** Model with age group as categories but parameterized to include
** the continuous agegrp variable
logistic mf agegrp agegrp_2 agegrp_3
est store A
** Model with agegrp as a continuous variable
logistic mf agegrp
lrtest A, stat
</verbatim>

   * Study two variables simultaneously: residence area and age group (See the "Common issues" section for explanations of =xi=, =est store=, and =lrtest=.)
      * No interaction between the two variables (EMS 20.2):  =xi, prefix(): logistic mf area i.agegrp=
      * Interaction is modeled: =xi, prefix(): logistic mf i.agegrp*area=
      * The following code generate EMS Figure 20.1.  Use =tab area agegrp= to see which groups tend to have high influence on results.
<verbatim>
xi, prefix(): logistic mf i.agegrp*area
predict risk2
gen logodds2 = log(risk2/(1-risk2))
xi, prefix(): logistic mf area i.agegrp
predict risk3
gen logodds3 = log(risk3/(1-risk3))
** sort data so that the line plot will not be messy
sort agegrp
** Draw group-wise log-odds and fitted log-odds as EMS Figure 20.1
** Copy the next two lines into one line before hitting Enter
scatter logodds2 agegrp || line logodds3 agegrp if area==0
  || line logodds3 agegrp if area==1
** Graph in probability scale.  The difference between two areas
** are no longer the same for different age groups.
scatter risk2 agegrp || line risk3 agegrp if area==0
  || line risk3 agegrp if area==1
</verbatim>

      * To test for interaction, we need to fit two models, one with interaction and one without interaction, and compare their likelihood using a likelihood ratio test.  The following can be used to test for interaction while treating age group as a categorical variable:
<verbatim>
** Model with full interaction between age group and area
xi, prefix(): logistic mf i.agegrp*area
est store A
** Model without interaction
xi, prefix(): logistic mf area i.agegrp
lrtest A, stat
</verbatim>
The following can be used to test for interaction while treating age group as a continous variable:
<verbatim>
** Model with interaction between age group and area
gen agebyarea = agegrp*are
logistic mf agegrp area agebyarea
est store A
** Model without interaction
logistic mf area agegrp
lrtest A, stat
</verbatim>

When we treated age group as a categorical variable, the interaction model with *three* extra parameters only increased the 2x(log-likelihood) by 5.27 over the non-interaction model, with corresponding p-value 0.153.  When we treated age group as a continuous variable, the interaction model with only *one* extra parameter increased the 2x(log-likelihood) by 6.46 over the non-interaction model (a different non-interaction model), with corresponding p-value 0.011.  This is a good example to demonstrate the fact that results should always be interpreted with context.  These two "interactions" are different interactions!  In practice, the test with fewer degrees of freedom often is more powerful.

The interaction analysis treating age group as a categorical variable is comparing the fitness of the two parallel zig-zagged lines with the eight dots (previous code is copied here):
<verbatim>
scatter logodds2 agegrp || line logodds3 agegrp if area==0
  || line logodds3 agegrp if area==1
</verbatim>

The interaction analysis treating age group as a continous variable is comparing the fitness of the blue lines and the red lines (assuming =agebyarea= and =logodds2= have been defined as above):
<verbatim>
logistic mf agegrp area agebyarea
matrix define B=e(b)
logistic mf agegrp area
matrix define C=e(b)
** Copy the next five lines into one line before hitting Enter
scatter logodds2 agegrp
  || function B[1,4]+B[1,1]*x, range(agegrp) lc(blue)
  || function B[1,4]+B[1,2]+(B[1,1]+B[1,3])*x, range(agegrp) lc(blue)
  || function C[1,3]+C[1,1]*x, range(agegrp) lc(red)
  || function C[1,3]+C[1,2]+C[1,1]*x, range(agegrp) lc(red)
</verbatim>
   * If the outcome variable has more than two categories and the categories are naturally ordered, the appropriate analysis is ordinal logistic regression (also called proportional odds model).  The Stata command for it is =ologit=.

---++++ Count outcome variables
---+++++ Analyses on rates (EMS 22-23)
   * Single Poisson distribution: rate estimate, s.e. of the rate estimate, CI.  =cii 873 57, poisson= will generate results in EMS 22.5-6, with numbers in count per child-year, not per 1000 child-years.  In the latter scale, the CI will be from 49.45 to 84.59, different from that in EMS 22.6.  In fact, the CI calculated by Stata is based on the Poisson likelihood function.  The lower bound is &lambda;<sub>L</sub> such that Pr[Poisson(&lambda;<sub>L</sub>) &ge; 57] = 0.025, and the upper bound is &lambda;<sub>U</sub> such that Pr[Poisson(&lambda;<sub>U</sub>) &le; 57] = 0.025.  If the data are input as below, =strate= will generate CI as in EMS 22.6.
   * Rate difference:  =iri 33 24 355 518= gives results in EMS 23.2.  If the data are input as below, =stir poor= and =ir infection poor childyear= give the same results.
   * Rate ratio:  We need to input data in EMS Table 23.2 like this:
<verbatim>
clear
input infection childyear poor stove outcome
28 251 1 1 1
 5  52 0 1 1
 5 104 1 2 1
19 466 0 2 1
end
gen unit = childyear/infection
stset unit [freq=infection]
</verbatim>
Note that the command =stset= is important.  It declares data as survival-time data, with a total of 57 failures (sum of the frequency variable =infection=).  Stata will treat the first row as 28 copies and will multiply the time variable by 28.  This is why we had to define a time variable =unit= as the total child-year divided by the number of infections.
      * Overall and stratified failure rates and their CIs: =strate= gives overal failure rate and =strate stove poor= gives results stratified by stove and poor.
      * =stmh poor, compare(1,0)= gives rate ratio and its CI as in EMS 23.2, but with a different test.  Alternatively, we may carry out a Poisson regression: =poisson outcome poor [freq=infection], e(unit) irr=.
      * =stmh poor, by(stove) compare(1,0)= gives the results in EMS 23.3:  Rate ratio for each stove stratum, Mantel-Haenszel estimate of the assumed common rate ratio acrosss stove strata, the associated CI, test for rate ratio, and test for heterogeneity.  [Note =stmh poor stove, compare(1,0)= also gives some of the information.]  Again, we may carry out a Poisson regression: =poisson outcome poor stove [freq=infection], e(unit) irr=.  Note that test for heterogeneity is equivalent to test for interaction effect between =poor= and =stove=.

---+++++ Poisson regression (EMS 24)
After opening the caerphilly_ems.dta file, type =des=, =label list=, and =st= to learn the structure of the data.
   * Summarize the rates stratified by current smoking status: =stir cursmoke= will generate the results in EMS Table 24.3.
   * Study solely the effect of current smoking status: =poisson mi cursmoke, e(folltime)= will generate the results in EMS Table 24.5, while adding the option =irr= will generate the results in rate ratio (EMS Table 24.4; same as =stmh cursmoke=).
   * Study solely the effect of social class:
      * Social class as a categorical variable: =xi, prefix(): poisson mi i.socclass, e(folltime) ir= will generate the results in EMS Table 24.6.
      * Social class as a continuous variable: =poisson mi socclass, e(folltime) ir= will generate the results in EMS Table 24.7.
      * Compare the two models to see if treating social class as a categorical variable significantly improves the fit compared to treating it as a continuous variable.  Note that these two models are nested.  You can see which groups have high influence on the fit by =strate socclass=.
<verbatim>
** Model with social class as a categorical variable
xi, prefix(): poisson mi i.socclass, e(folltime) ir
est store A
predict irate, ir
label var irate "incidence rate"
** Model with social class as a continuous variable
poisson mi socclass, e(folltime) ir
lrtest A, stat
matrix define A=e(b)
** Graphical comparison
scatter ir socclass || function exp(A[1,2]+A[1,1]*x), range(socclass)
</verbatim>

   * Study current smoking status and social class simultaneously:
      * =stmh cursmoke, by(socclass)= generate the results in EMS Table 24.8.
      * =xi: poisson mi cursmoke i.socclass, e(folltime) ir= generate the results in EMS Table 24.9.


---++++ Survival outcome variables
[[http://www.ats.ucla.edu/STAT/stata/seminars/stata_survival/default.htm][Survival Analysis with Stata]] is a good resource on survival analysis.  Type =help st= for more details.

---+++++ Analysis on hazards (EMS 26)
After opening the pbc_ems.dta file, type =des=, =label list=, and =st= to learn the structure of the data.

The follow up time is recorded in years.  We define a new variable to reflect the time in days by =gen days = time*365.25=, and re-declare the survival-time data structure by =stset days, failure(d)=.
   * =sts list if cenc0==1= will generate EMS Table 26.2.  =ltable days d if cenc0==1, graph= will generate EMS Table 26.2 and Figure 26.2.  =sts graph if cenc0==1, g= will generate a different flavor of EMS Figure 26.2.
   * =sts graph, by(cenc0)= or =ltable days d, by(cenc0) notable graph overlay= will generate EMS Figure 26.3.
   * =sts graph, by(cenc0) na= will generate a graph of cumulative hazard.  To generate EMS Figure 26.4, we need to add more options: =sts graph, by(cenc0) na xscale(log) yscale(log) xlabel(10 100 1000 5000) ylabel(.001 .01 .1 1 5)=.
   * Mantel-Cox method:  =stmc cenc0= will generate the results in EMS 26.5.

---+++++ Cox proportional hazards regression (EMS 27)
   * =stcox cenc0= will generate the results in EMS 27.2.

---++++ EMS other chapters
---+++++ Nonparametrics (EMS 30.2)
   * One sample:  An observation has two parts of information: magnitude and sign.  A t-test statistic is calculated based on both.  We can replace the magnitude information by the corresponding ranks.  Wilcoxon's signed rank test (=signrank=) is calculated based on the rank information and sign.  A sign test (=signtest=) is calculated based on the sign information alone, effective making the outcome as binomial.
   * Two samples:  Wilcoxon's rank sum test (=ranksum=)
   * More than two samples:  Kruskal-Wallis test (=kwallis=)

   * Rank correlations: =spearman= can be used to calculate Spearman's rank correlation coefficients. =ktau= can be used to calculate Kendall's rank correlation coefficients.

---+++++ ROC curve (EMS 36.2)
   * In the Peru lung study, lower volumn of air a child could breathe out in a second tends to be associated with respiratory symptoms (coded as 1).  Thus, the two variables appear in negative association.  However, the command =roctab= assumes positive association, and we have to define a new variable before hand: =gen negfev1 = -fev1=.  Then we can use =roctab respsymptoms negfev1, sum graph= to output the results and EMS Figure 36.1.
   * If we want to compare the ROC curves for boys and girls, we can use =roccomp respsymptoms negfev1, by(sex) sum graph=.

---++++ Advanced topics
---+++++ Matrix
Matrix is very useful in statistics.  Stata has many matrix related functions.  Type =help matrix= to start learning them.

---+++++ Programming
Programming is useful in many scenarios:  (1) When you analyze data using a protocol repeatedly, you may want to put the code into a program so that you can just call the program instead of repeating the many commands.  (2) When you carry out a simulation, you will repeat a procedure many times.  It will be very efficient to write a program and execute the program.  (3) When people implement a statistical method in Stata, they write and distribute programs.  Type =help program= to start learning Stata programming.

---+++++ Simulations
Note: Stata is not a good program to carry out sophisticated simulations, for which R is a better choice.

*Generation of random variables*:  =uniform()= will generate a random number between 0 and 1 (i.e. from the uniform(0, 1) distribution).  To generate a random number from another distribution, apply the corresponding inverse cumulative distribution function.  For example, =invnormal(uniform())= will generate a random number from N(0, 1).  =help density functions= will list many such functions.

For example, if you want to generate a variable with 100 random numbers from a normal distribution with mean 3 and SD = 7, use:
<verbatim>
clear
set obs 100
gen aa = invnormal(uniform()) * 7 + 3
</verbatim>

*Generation of normal random variables*:  Normal variables are used so often that Stata has a designated function for drawing them.  =drawnorm= can be used to generate normal variables, independent or dependent.

*Sampling*:  =sample= can be used to sample from existing values without replacement.  =bsample= and =bootstrap= can be used to sample from existing values with replacement.

*Simulation*:  =simulate= is a major command used in simulations.  It calls another command repeatedly and stores the results of each call.  The command that is called often is a program defined by the user.

The following programs give you a sense of how people write simulation programs in Stata:
   * *[[%PUBURL%/%WEB%/IntroBiostatCourse2006StataLab/Sampling.do][Sampling.do]]* is a program to simulate the repeated sampling procedure similar to that in EMS 4.5.  The program treats the data as a population, and draws a random sample from a variable and calculates the sample mean.  This procedure is repeated 100 times to demonstrate the variation of the sample mean as an estimate of the population mean.  The program outputs summary and histogram of the 100 estimates.  It also draw samples in different colors depending on whether the 95% confidence interval calculated based on the sample covers the real average or not.
      * You may need to modify the program to run it on your computer.  After you have saved the program, go to Stata and press ctrl-8 [or, Window -> Do-file Editor -> New Do-file], then open the program.  Modify the four lines near the beginning to try it on any dataset/variable/sample size/number of samples.  Make sure the pathname is correct for the data file in your computer.  After this, you can run the program by going to File -> Do and choose the program.  
      * You can change the sample size to other numbers and re-run the program to see the effect of sample size on variation of the estimates.

   * *[[%PUBURL%/%WEB%/IntroBiostatCourse2006StataLab/SimulatedSampling_t-test.do][SimulatedSampling_t-test.do]]* is a program to show the variation of sampling and its impact on test results as a function of sample means and sample SDs.

   * *[[%PUBURL%/%WEB%/IntroBiostatCourse2006StataLab/SimulatedSampling_regression.do][SimulatedSampling_regression.do]]* is a program to show the variation of regression line estimate as a function of sample size and standard deviation in the error term.

---+++++ Stata graphics, under the hood 
[[http://ideas.repec.org/p/boc/usug04/8.html][Stata graphics, under the hood]] was a talk by Vince Wiggins on the capabilities of Stata graphics.  The talk was given from within Stata.  You need to install the talk before starting it.
<verbatim>
net from http://www.stata.com/users/vwiggins
net describe uk04
net install uk04
</verbatim>
After this, you need to check if the PERSONAL ado directory exists.  Type =sysdir= to see what directory is designated for PERSONAL.  Then check if the directory exist; if not, create it yourself.  Then use =ukgrtalk= to install the files needed for the talk.  After this, you can type =help ukgrtalk= and follow the instructions.

---+++++ System settings and limits
Try =query= to see the various system parameter settings.  To change a setting, use =set= command.  To obtain the current setting dynamically (e.g. in a program), use c-class return values.  For details, type =creturn list=, which gives a more comprehensive list than =query= does.  Click on the links of the output to see corresponding help pages.  Type =help limits= will list all the limitations your version of Stata imposes.

</span></font>
This topic: Main > WebHome > Education > IntroBiostatCourse2006 > ChunStataNotes
Topic revision: 23 Apr 2007, ChunLi (raw view)
Copyright © 2013-2022 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Vanderbilt Biostatistics Wiki? Send feedback