Stata Notes for Classes

  • Some Stata code used in class is listed here.
  • [EMS] refers to Essential Medical Statistics.
  • Some of the material was copied or modified from the course materials for Biostatistics in the M.P.H. program at Vanderbilt University and from the course textbook, "Statistical Modeling for Biomedical Researchers", 2nd Ed., in press, by William Dupont. [WD] refers to William Dupont's book.

Some Stata basics you need to know

  • four windows: Results, Command, Review, Variables
  • command line interface
  • pulldown menus
  • log file: keeps track of what you are doing
    • go to menus, File --> Log --> Begin, save as .log
    • use icon in the tool bar
  • create/open a dataset:
    • use input x y --> enter data --> end
    • open Data Editor --> enter data
    • use menus or infile command to import data file
  • explore the dataset: Data Browser and Data Editor
  • basic commands:
    • list, codebook, describe, summarize
    • set memory
  • graphs:
    • use menus
    • use commands: very good summary of Stata commands can be found in [WD]
  • exit Stata: menus --> File --> save or save as
  • getting help: syntax
    • example: graph box fev1, over(respsymptoms)
    • qualifier and options: there must be a comma between the last qualifier (fev1) and the first option (over(respsymptoms))
    • command prefix: precedes the command, separated from the main command by a colon, e.g. by group: egen avg = mean(dbp) (the data must already be sorted by group; bysort group: sorts first)
    • abbreviations: the minimum abbreviation is underlined in Stata reference manuals or Help
  • do file: rerun previous analyses
    • go to menus, File --> Do --> save as .do
    • use icon in the tool bar
    • save review contents as .do
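The basics above can be pulled together in one short do-file. This is only a minimal sketch: the file names (basics.do, basics.log, basics.dta) and the values of x and y are hypothetical, chosen for illustration.

```stata
* basics.do : a minimal sketch of a reproducible session
* (file names and the values of x and y are hypothetical)
clear
capture log close              // close any log left open by an earlier run
log using basics.log, replace text

* enter a small dataset by hand
input x y
1 2.5
2 3.1
3 4.8
end

describe                       // variable names, storage types and labels
list                           // print every observation
summarize x y                  // n, mean, SD, min and max
codebook x                     // detailed description of one variable

save basics.dta, replace       // write the dataset to disk
log close
```

Typing do basics.do in the Command window reruns the whole session and rewrites the log.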

[EMS] Chapter 3 Displaying the data

Frequencies (categorical variables): need to know
  • Table 3.1: Stata data format, ASCII format
  • label data: label data "The method of delivery recorded for 600 births in a hospital"
  • make and delete notes: first note: notes: "Data from EMS Table 3.1"; second note: notes: "edited on Jan. 15, 2007"; delete the second note: notes drop _dta in 2
  • define label: label define deliverylab 1 "Normal" 2 "Forceps" 3 "Caesarean section"
  • put label: label values delivery deliverylab
  • generate table: tabulate delivery
  • Fig. 3.1 Bar chart: input Normal Forceps Caesarean --> 478 65 57 --> end, then graph hbar Normal Forceps Caesarean; or gen y = 1, then graph hbar (count) y, over(delivery)
  • Fig. 3.2 Pie chart: graph pie y, over(delivery)
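When only the counts are available rather than one record per birth, the same table and charts can be produced with frequency weights. The sketch below uses the three counts from the bar chart example; the variable names delivery and freq and the one-row-per-category layout are assumptions, not the format of the course data file.

```stata
* delivery counts from the bar chart example, one row per category
* (the variable names and this layout are assumptions)
clear
input delivery freq
1 478
2 65
3 57
end

label define deliverylab 1 "Normal" 2 "Forceps" 3 "Caesarean section"
label values delivery deliverylab

* fweight tells Stata that each row stands for freq births
tabulate delivery [fweight=freq]

* bars and slices are the sum of freq within each delivery category
graph hbar (sum) freq, over(delivery) ytitle("Number of births")
graph pie freq, over(delivery)
```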
Frequency distributions (numerical variables):
  • Table 3.2: ASCII format
  • infile id hemo using "C:\Teaching\IGP\data\haemoglobin.txt", clear
  • Table 3.2 (b):
    • egen hemocat = cut(hemo), at(8, 9, 10, 11, 12, 13, 14, 15, 16) or, equivalently, egen hemocat = cut(hemo), at(8(1)16)
    • tabulate hemocat
    • stem hemo, lines(1)
  • Fig. 3.3 Histogram (need to know): histogram hemo, width(1) start(8) frequency xtitle("Haemoglobin level (g/100ml)")
Shapes of frequency distributions: R notes for classes
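To see what egen ... cut() does, it can help to run it on a handful of values. The haemoglobin values below are hypothetical and serve only to show how each value is coded with the left end of the interval it falls in.

```stata
* hypothetical haemoglobin values (g/100ml), for illustration only
clear
input hemo
8.8
10.2
11.9
12.0
13.4
15.9
end

* each value is coded with the left end of its interval [a, a+1);
* values below 8 or at 16 and above would be set to missing
egen hemocat = cut(hemo), at(8(1)16)
list hemo hemocat
tabulate hemocat
```

A value of exactly 12.0 lands in the group coded 12, not 11, because the intervals are closed on the left.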

Cumulative frequency distributions, quantiles and percentiles: need to know
  • Fig. 3.8 Boxplot: graph box hemo
  • codebook hemo; summarize hemo (add the detail option for percentiles)
Displaying the association between two variables: need to know
  • Table 3.4 (Stata data format): tabulate village source [fweight=freq]; add the options row and col for row and column percentages
  • Fig. 3.9 - 3.12 Peru lung study data, which can be obtained from the EMS official web site under "perulung_ems".
  • Fig. 3.9 Scatter plots: twoway (scatter fev1 age), ylabel(0(1)3) ytick(0 1 2 3) ymtick(0(0.5)3) ytitle("FEV1 (litres)")
  • Fig. 3.10 Scatter plots: twoway (scatter fev1 respsymptoms)
  • Fig. 3.11 Scatter plots: twoway (scatter fev1 respsymptoms, jitter(10))
  • Fig. 3.12 Box and whiskers plots: graph box fev1, over(respsymptoms)
  • another way: use dotplot fev1, over(respsymptoms) median center
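The weighted two-way table can be tried without the course data. In this sketch the counts are hypothetical, and village and source are assumed to be numeric codes with one row per cell of the table.

```stata
* hypothetical counts: 2 villages by 2 water sources, one row per cell
clear
input village source freq
1 1 20
1 2 30
2 1 40
2 2 10
end

* row adds percentages within each village, col within each source
tabulate village source [fweight=freq], row col
```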
Displaying time trends:

[EMS] Chapter 4 Means, standard deviations and standard errors

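* mean, and deviations from the mean
egen meanvol = mean(volume)
display meanvol
gen dev = volume - meanvol
gen dev2 = dev^2
gen vol2 = volume^2

* sums of x and x^2
egen volsum = total(volume)
egen vol2sum = total(vol2)

* sum of squared deviations, shortcut formula: sum(x^2) - (sum(x))^2/n
display vol2sum - volsum^2/_N

* sum of squared deviations, direct formula
egen dev2sum = total(dev2)

* sample size, sum of squared deviations and standard deviation
di _N
di dev2sum
di sqrt(dev2sum/(_N-1))

* check against the built-in summary
summarize volume

* collapse replaces the data in memory with one row of summary statistics
collapse (mean) mean_vol=volume (sd) sd_volume=volume
list mean_vol sd_volume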
  • Sampling variations and standard errors:
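The standard error of the mean is the standard deviation divided by the square root of the sample size. A minimal sketch using the results that summarize leaves behind in r(); the eight volumes are hypothetical values, not the course data, and the scalar names are assumptions.

```stata
* hypothetical volumes for 8 subjects, for illustration only
clear
input volume
2.6
2.8
2.9
3.0
3.0
3.1
3.2
3.4
end

quietly summarize volume
scalar xbar = r(mean)
scalar s    = r(sd)
scalar sem  = s/sqrt(r(N))     // standard error of the mean
display "mean = " xbar "   SD = " s "   SE = " sem

* cross-check: mean reports the same standard error with a 95% CI
mean volume
```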

[EMS] Chapter 5 The normal distribution

  • Normal distributions and standard normal distributions: R notes for classes
  • Calculating area under the curve of the normal distribution and finding percentage points (z-score) of the normal distribution
help density functions
* AUC of the normal density function
* probability (proportion) below the specified z-score
di normal(1.31)
* AUC in upper tail of distribution
di 1-normal(1.31)
* AUC in lower tail of distribution: the area below z = -1.77
* equals the area above z = 1.77 by symmetry
di 1-normal(1.77)
* AUC between two z values
di normal(0.54) - normal(-1)
* value corresponding to specified tail area
input mu sigma z
171.5 6.5 1.64
end
di mu + z*sigma
drop mu sigma z
* percentage points of the normal distribution (z value for a given cumulative probability)
di invnormal(.95)
di invnormal(.975)
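The z-scores typed in by hand above come from standardising a raw value, z = (x - mu)/sigma. The sketch below does the conversion with scalars, using the mu and sigma from the example above. The cut-off of 180 is an assumed raw value chosen because it gives z of about 1.31, and the scalar names are hypothetical.

```stata
* mu and sigma as in the example above; x0 = 180 is an assumed raw value
scalar mu0    = 171.5
scalar sigma0 = 6.5
scalar x0     = 180

scalar z0 = (x0 - mu0)/sigma0       // standardise: about 1.31
display "z = " z0
display "proportion above x0 = " 1 - normal(z0)
display "proportion below x0 = " normal(z0)

* going the other way: the value exceeded by 5% of the distribution
display "95th percentile = " mu0 + invnormal(.95)*sigma0
```

Scalars avoid creating and dropping one-row variables, but a variable with the same name would take precedence in an expression, hence the unusual names.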

[EMS] Chapter 6 Confidence interval for a mean

  • Section 6.2 Large sample case (normal distribution): Example 6.1
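input mu sd n
24.2 5.9 100
end
* two-sided 5% point of the standard normal distribution
gen z = invnormal(.975)
gen se = sd/sqrt(n)
gen l_ci = mu - z*se
gen u_ci = mu + z*se
list l_ci u_ci
* cii is an immediate command: it takes numbers (n, mean, SD), not variable names,
* and uses the t distribution (in Stata 14 and later: cii means 100 24.2 5.9)
cii 100 24.2 5.9
* keep mu, sd and n; drop only the derived variables
drop z-u_ci
* two-sided 10% and 1% points: invnormal(.95); invnormal(.995)
* small sample case: the first argument of invttail(df, p) is the d.f., which is n - 1 here
drop n
gen n=7
gen t = invttail(n-1, .025)
gen se = sd/sqrt(n)
gen l_ci = mu - t*se
gen u_ci = mu + t*se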
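To compare the normal-based and t-based intervals side by side, the same calculation can be done with scalars. This is a sketch using the mean, SD and n of Example 6.1; the scalar names are assumptions.

```stata
* Example 6.1 values; the scalar names are hypothetical
scalar xbar1 = 24.2
scalar s1    = 5.9
scalar n1    = 100
scalar se1   = s1/sqrt(n1)

scalar zcrit = invnormal(.975)          // two-sided 5% point, normal
scalar tcrit = invttail(n1 - 1, .025)   // two-sided 5% point, t with n-1 d.f.

display "normal-based 95% CI: " xbar1 - zcrit*se1 " to " xbar1 + zcrit*se1
display "t-based 95% CI:      " xbar1 - tcrit*se1 " to " xbar1 + tcrit*se1
```

With n = 100 the two intervals are nearly identical; the t-based one is always slightly wider, and the gap grows as n gets small.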

[EMS] Chapter 7 Comparison of two means: confidence intervals, hypothesis tests and p-values

[EMS] Chapter 9 Analysis of variance

[EMS] Chapter 10 Linear regression and correlation

[EMS] Chapter 11 Multiple regression

[EMS] Chapter 12 Goodness of fit and regression diagnostics

[EMS] Chapter 13 Transformation

[EMS] Chapter 16-17

Topic revision: r34 - 23 Feb 2007, LeenaChoi
