IGP 304: Stata Labs

Some words from the instructor

Knowing the software is vital to learning statistics and to your performance in this course. If you learn statistics without knowing the software, you will be armed with theory, methods, and ideas but unable to put them into practice.

These exercises are designed to prompt you to learn Stata and explore its capabilities. They will be treated the same way as assigned reading, and I will check whether you have tried them by randomly picking some of you to tell us how you would carry out the procedures.

If you don't have Stata installed on your own computer or a lab computer, you can access it at the College of Arts & Science Microcomputer Labs. [The labs are open to anyone with a Vanderbilt ID, but you may not be able to enter the buildings at night. At least you can go there during the day. The version they have is version 8, not 9. Version 8 should do everything we want to do, probably with the same commands but a somewhat different menu setup.]

If you have chosen software packages other than Stata, please let me know.

Assigned on 01/13

  • Do some exploratory analyses on Ryan’s dataset:
    • find a way to let Stata output the number of subjects that are less than or equal to 5 years old
    • find a way to let Stata output the number of subjects that have missing values for the language variable
    • calculate the average age of the subjects
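
One possible approach to these three tasks, sketched on a small made-up dataset. The variable names age and language and all values are assumptions, since Ryan's dataset is not reproduced here; language is assumed to be a string variable.

```stata
* Toy stand-in for Ryan's dataset; names and values are hypothetical
clear
input age str10 language
3  "English"
5  "Spanish"
7  ""
.  "English"
12 ""
4  "English"
end

* Stata treats numeric missing (.) as larger than any number,
* so age <= 5 does not pick up the subject with missing age
count if age <= 5

* missing() works for both numeric and string variables
count if missing(language)

* summarize ignores missing values when computing the mean
summarize age
display "Average age: " r(mean)
```

The point to watch is the opposite condition: age > 5 would silently include subjects with missing age unless you add & !missing(age).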

Assigned on 01/16

  • Read in the Peru lung study data, which can be obtained here under "perulung_ems". A brief description of the study is on page 27 of EMS. [You may download the data in Stata format and open it directly. As a useful alternative exercise, you may download the data in Excel format, open it in Excel, save it as a comma-delimited text file, and then import that file into Stata. Although useful, the second approach loses the information on variable labels.] Do the following on this data set:
    • Describe the distribution of sex
    • Describe the distribution of age
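
A minimal sketch of how a categorical and a continuous variable might be described. The data below are simulated so the block runs on its own; the 0/1 coding of sex, its labels, and the age range are assumptions, not the actual contents of perulung_ems. The random-number function runiform() assumes Stata 10 or later.

```stata
* Simulated stand-in for perulung_ems; the real file is not reproduced here
clear
set seed 2006
set obs 636
generate byte sex = runiform() < 0.5
generate age = 7 + 3*runiform()
label define sexlbl 0 "girl" 1 "boy"
label values sex sexlbl

* Categorical variable: frequency table with percentages
tabulate sex, missing

* Continuous variable: mean, SD, percentiles, skewness, kurtosis
summarize age, detail

* Histogram of age, with counts on the vertical axis
histogram age, frequency title("Age distribution")
```

For the comma-delimited route, insheet (older releases) or import delimited (Stata 13 and later) reads the text file you saved from Excel.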

Assigned on 01/18

  • Find a way to make an informative bar chart and pie chart like Figures 3.1 and 3.2 in EMS. This will involve changing labels, legends, titles, etc.
  • Make sure you can get all the descriptive graphics and summary statistics from Stata for any variable of your choice. If you can add nice labels/legends/titles, that will be great. If you can share your findings with the class, please let me know.
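
As a starting point for labelled bar and pie charts, here is a sketch on made-up frequencies. The categories and counts are hypothetical and are not the data behind Figures 3.1 and 3.2 in EMS; only the labelling options are the point.

```stata
* Hypothetical frequency table: one row per category
clear
input byte category freq
1 40
2 25
3 35
end
label define catlbl 1 "Group A" 2 "Group B" 3 "Group C"
label values category catlbl

* Bar chart: bar height is the sum of freq within each category
graph bar (sum) freq, over(category) ///
    title("Number of subjects by group") ///
    ytitle("Frequency") blabel(bar)

* Pie chart: slices labelled with percentages, legend given a title
graph pie freq, over(category) ///
    plabel(_all percent, format(%4.1f)) ///
    title("Share of subjects by group") ///
    legend(title("Group"))
```

Value labels attached with label values are what the graphs pick up for the category names, so editing catlbl is usually the quickest way to change them.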

Assigned on 01/20

  • Let's imagine the 636 subjects in the Peru lung study are the population, and we can afford to study only a small subset and want to estimate the population average of fev1 based on the subset.
    • David kindly wrote a Stata program Sampling.do to simulate a repeated sampling procedure similar to that in EMS 4.5. The program draws a random sample from the 636 subjects and calculates the sample mean of fev1. This procedure is repeated 100 times to demonstrate the variation of the sample mean as an estimate of the population mean. The program outputs a summary and a histogram of the 100 estimates.
    • You may need to modify the program to run it on your computer. After you have saved the program, go to Stata and press ctrl-8 [or, Window -> Do-file Editor -> New Do-file], then open the program. Modify the two lines containing pathnames so that the pathname is correct for the data file on your computer, and save it. Now you can run the program by going to File -> Do and choosing the program.
    • You can change the sample size to other numbers and re-run the program to see the effect of sample size on variation of the estimates.
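
The repeated-sampling idea can be sketched as follows. This is not David's Sampling.do, only an independent illustration: the "population" of 636 fev1 values is simulated, and the sample size of 20 and the result variable xbar are arbitrary choices. It assumes Stata 10 or later for rnormal().

```stata
* Simulated population of 636 subjects (values are made up)
clear all
set seed 304
set obs 636
generate fev1 = rnormal(1.6, 0.3)
summarize fev1, meanonly
local popmean = r(mean)

* Collect 100 sample means in a temporary results file
tempname sims
tempfile results
postfile `sims' xbar using `results'
forvalues i = 1/100 {
    preserve
    sample 20, count          // draw 20 subjects without replacement
    summarize fev1, meanonly
    post `sims' (r(mean))
    restore                   // bring back the full population
}
postclose `sims'

* Summarize and plot the 100 estimates against the population mean
use `results', clear
summarize xbar
histogram xbar, xline(`popmean') title("100 sample means, n = 20")
```

Changing the 20 in sample 20, count and re-running shows how the spread of the estimates shrinks as the sample size grows.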

Assigned on 01/24

  • We have a new version of Sampling.do, which draws samples in different colors depending on whether the 95% confidence interval calculated from the sample covers the true average. The code is also easier to modify for another dataset, variable, sample size, or number of samples: all you need to do is change the four lines near the beginning to fit your needs. Try it on other variables, including variables that are skewed or not unimodal.
  • Variable type: Almost all statistical software packages need to store variables in different ways depending on the nature of the variables. When they read in a data set in non-default formats (e.g. ASCII/text files), they have to decide on the types of variables based on the patterns they see. This may cause a problem if you are not careful.
    • Example 1: If you have a value 16o (with zero mistyped as oh) in variable "height", or if "." or "?" is used for missing values in the input file, the software may think these are possible values of the variable and thus the variable must not be numeric. As a result, the variable may be stored as categorical or character strings. Later on, when you want to calculate the mean (or any other analysis), the software may give you an error message because it cannot do this on categorical variables or strings.
    • Example 2: If you have an "ID" variable and code it using numbers, or if you have a categorical variable with five categories and code it as 1-5, the software may think the variable is numeric and store it as numbers. Later on, when you do analysis on this variable, the software will treat them as numbers instead of factors and thus give wrong results, without any warning at all.
    • When you see strange error messages or strange results, check whether the variables have the correct types. The Stata command describe gives basic information on variable types. Note that the command format changes only how values are displayed; to change how a variable is stored, use commands such as destring, tostring, or encode.
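
To make Example 1 concrete, here is a small sketch with a mistyped height value. The four rows are invented for illustration, and the new variable names height_num and id_str are arbitrary.

```stata
* height is read as a string because of the mistyped value "16o"
clear
input str4 height id
"160" 1
"16o" 2
"."   3
"172" 4
end
describe

* summarize reports 0 observations for a string variable
summarize height

* Convert to numeric; force turns unreadable entries into missing
destring height, generate(height_num) force

* Inspect which rows could not be converted before trusting the result
list id height height_num if missing(height_num)
summarize height_num

* The reverse direction: store an ID as a string so it is never averaged
tostring id, generate(id_str)
describe
```

Using generate() rather than replace keeps the original strings, so a typo such as 16o can be looked up and corrected instead of being silently dropped.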

Assigned on 01/27

  • Bob has a question on how we compare two groups of changes. Check here for details and the data.
    • Read data into Stata. [This is a good exercise on how to handle data stored in other formats. You need to save the data in a format Stata can read in. Some manual editing may be needed so that the input file can be read in correctly by Stata.]
    • Make graphical displays of the data to help you visualize the changes and comparisons.
    • Think how you can answer Bob's question by analyzing the data. [Don't be afraid. Fresh ideas often are good ones.]
  • In the Peru lung study, suppose you are interested in whether fev1 differs between children less than 9 years old and those older than 9 years old.
    • Do a t-test to see if there is a difference between these two groups.
    • What assumptions did you make in your test? How can you check for the validity of these assumptions? If any of the assumptions looks wrong, what else can you do to remedy this?
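
One way the t-test and its assumption checks might look, sketched on simulated data. The data-generating values are made up, the indicator name older is arbitrary, and putting children aged exactly 9 into the older group is an assumption you should decide on yourself.

```stata
* Simulated stand-in for the Peru lung data (values are made up)
clear
set seed 127
set obs 636
generate age = 7 + 3*runiform()
generate fev1 = 0.2*age + rnormal(0, 0.3)

* Group indicator; missing ages stay missing rather than becoming 1
generate byte older = age >= 9 if !missing(age)

* Two-sample t-test assuming equal variances
ttest fev1, by(older)

* Check the equal-variance assumption, then relax it (Welch)
sdtest fev1, by(older)
ttest fev1, by(older) unequal welch

* Check normality within each group
qnorm fev1 if older == 0, name(q0, replace)
qnorm fev1 if older == 1, name(q1, replace)

* Rank-based alternative if normality looks doubtful
ranksum fev1, by(older)
```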

Assigned on 01/30

  • David and I wrote another Stata simulation program SimulatedSampling_t-test.do. This is to show the variation of sampling and its impact on test results as a function of sample means and sample SDs.

Assigned on 02/01

  • I have tried some exploratory analyses on Bob's data. See here. If you have other ideas on analyzing the data, let me know.
  • For the Peru lung study, use Stata to test if fev1 is different for boys and girls, if fev1 is different for those having respiratory symptoms and those not having the symptoms, and if fev1 is different for combinations of gender and respiratory symptoms.
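
A sketch of the three comparisons on simulated data. The variable names sex and resp follow the commands used elsewhere on this page, but their 0/1 coding and all values here are assumptions; combo is an arbitrary name.

```stata
* Simulated stand-in for the Peru lung data (values are made up)
clear
set seed 201
set obs 636
generate byte sex  = runiform() < 0.5
generate byte resp = runiform() < 0.2
generate fev1 = 1.6 + 0.1*sex - 0.1*resp + rnormal(0, 0.3)

* Two-group comparisons
ttest fev1, by(sex)
ttest fev1, by(resp)

* One four-level variable for the sex-by-symptom combinations
egen combo = group(sex resp), label
tabulate combo, summarize(fev1)

* Overall test across the four groups, with pairwise comparisons
oneway fev1 combo, bonferroni
```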

Assigned on 02/03

  • For the data in EMS Table 9.1, try
anova hemoglbn type
oneway hemoglbn type, t
oneway hemoglbn type, b sch si
The last command gives multiple comparison test results (Bonferroni, Scheffe, Sidak). Unfortunately, there is no Tukey's multiple comparison test in the standard distribution. It is implemented in the prcomp command in the user-contributed "sg101" package. [Type net search sg101, click on the link, and follow the instructions for installation.] Once you have installed the package, type prcomp hemoglbn type, anova tukey to get Tukey's test.
  • For the Peru lung study, compare the results of the following analyses:
anova fev1 sex
anova fev1 sex resp
anova fev1 sex resp sex*resp
regress fev1 sex
regress fev1 sex resp
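
If you are working in a release newer than the ones this page was written for, the following sketch may help. As far as I know, Stata 11 and later write interactions in factor-variable notation (sex#resp) rather than sex*resp, and Stata 12 and later ship pwmean, which offers Tukey-adjusted comparisons without a user-contributed package. The data are simulated and the coding of sex and resp is an assumption.

```stata
* Simulated stand-in for the Peru lung data (values are made up)
clear
set seed 203
set obs 636
generate byte sex  = runiform() < 0.5
generate byte resp = runiform() < 0.2
generate fev1 = 1.6 + 0.1*sex - 0.1*resp + rnormal(0, 0.3)

* Two-way ANOVA with interaction in factor-variable notation
anova fev1 sex resp sex#resp

* Same model through regress; i. marks a variable as categorical,
* and ## expands to main effects plus interaction
regress fev1 i.sex##i.resp

* Tukey-adjusted pairwise comparisons of group means (Stata 12 or later)
egen combo = group(sex resp)
pwmean fev1, over(combo) mcompare(tukey) effects
```

Comparing the anova and regress output for the same model is a good way to see that the two commands fit the same thing and differ mainly in how they report it.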

Assigned on 02/06

  • Try to repeat the analyses in EMS Chapter 10. Generate all the results and graphs in this chapter.
  • In the Peru lung study, suppose you want to do a simple linear regression to study the fev1 as a function of age. Try the following:
regress fev1 age
scatter fev1 age || lfit fev1 age
scatter fev1 age || lfitci fev1 age
Also try to do separate linear regression fitting of fev1 as a function of age on the children with respiratory symptoms and those without. Compare the results on these two subsets. Can you draw graphs to compare them?
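
One possible way to fit and draw the two subset regressions, sketched on simulated data. The 0/1 coding of resp and all values are assumptions; lpattern() is the option name in recent releases.

```stata
* Simulated stand-in for the Peru lung data (values are made up)
clear
set seed 206
set obs 636
generate byte resp = runiform() < 0.2
generate age  = 7 + 3*runiform()
generate fev1 = 0.2*age - 0.1*resp + rnormal(0, 0.3)

* Separate fits for children without and with respiratory symptoms
regress fev1 age if resp == 0
regress fev1 age if resp == 1

* Both groups and both fitted lines on one graph
twoway (scatter fev1 age if resp == 0, msymbol(oh))         ///
       (scatter fev1 age if resp == 1, msymbol(x))          ///
       (lfit fev1 age if resp == 0)                         ///
       (lfit fev1 age if resp == 1, lpattern(dash)),        ///
       legend(order(1 "No symptoms" 2 "Symptoms"            ///
                    3 "Fit, no symptoms" 4 "Fit, symptoms")) ///
       ytitle("fev1") xtitle("Age")
```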

Assigned on 02/08

  • Judith has a survey data set that has been distributed in the class. Check here for details.
    • Read the "MD raw data" into Stata. Again, you need to save the data to a text file first and may need to do some editing.
    • For each question, Judith summarized the distribution of answers in the Excel file (under the "MD summary" tab). What Stata command(s) can you use to generate these summaries? If you want to see the two-way distribution of two questions, how would you do that in Stata?
    • How do you address Judith's study goal?
    • Does the abstract reflect enough information they have gathered?
  • David and I wrote a program SimulatedSampling_regression.do, to show the variation of regression line estimate as a function of sample size and standard deviation in the error term.
  • This page summarizes how to do diagnostics in linear regression. Try the commands in the page.
  • The last link is part of a web book Regression with Stata. It teaches regression while showing how to carry out all the analyses in Stata. Check it out!
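
A few standard regression diagnostics, sketched on simulated data. I cannot confirm that these are the same commands the linked page lists; the cutoffs of 3 and 4/n are common rules of thumb rather than anything prescribed by this course.

```stata
* Simulated data with a known linear relationship
clear
set seed 208
set obs 200
generate x = rnormal()
generate y = 1 + 2*x + rnormal()
regress y x

* Residuals versus fitted values: look for curvature or fanning
rvfplot, yline(0)

* Normality of residuals
predict res, residuals
qnorm res

* Influence: studentized residuals and Cook's distance
predict rstu, rstudent
predict cook, cooksd
list x y rstu cook if abs(rstu) > 3 | cook > 4/e(N)

* Heteroskedasticity test (the estat syntax needs Stata 9 or later)
estat hettest
```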

Assigned on 02/22

  • Note: I haven't assigned anything in this Stata Lab page for two weeks, but I hope you have had enough incentive to explore and learn Stata by yourself. Stata help pages and Internet search are the best tools for learning Stata.
  • In my Stata Notes page, I added a section on various Stata commands for tests on contingency tables that are covered in EMS 15-18. Try out these commands and explore their options that are listed in help pages.
  • Sylvie has a study on inflammatory process and oxidative damage. Check here for details.
    • Read the data into Stata.
    • How can you analyze the data to answer Sylvie's questions?
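
For the contingency-table commands, a short sketch with invented counts. The variable names exposed, diseased, and n are hypothetical, and these may not be the exact commands listed in the Stata Notes page.

```stata
* Immediate form: counts typed directly, rows separated by \
tabi 12 8 \ 5 15, chi2 exact

* The same table stored as data, one row per cell with its count
clear
input byte exposed byte diseased int n
1 1 12
1 0 8
0 1 5
0 0 15
end
tabulate exposed diseased [fweight = n], chi2 exact row

* Odds ratio with confidence interval (case variable first, then exposure)
cc diseased exposed [fweight = n]
```

The immediate commands (tabi, and similarly cci) are handy for checking a table printed in a textbook without building a dataset first.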

Assigned on 03/13

  • Read in the Onchocerciasis study data, which can be obtained here under "oncho_ems". A brief description of the study is in EMS p.191.
    • Explore the data using commands like describe, tabulate, tab1, tab2, etc.
    • Try out the analyses in EMS 19 and 20. Look at my Stata Notes page for help with logistic regression.
  • The web book Logistic Regression with Stata teaches logistic regression while showing how to carry out all the analyses in Stata. Check it out!
  • As with linear regression, logistic regression has assumptions, and some data points may be more influential than others. Thus, diagnostics are important. The above web book has a whole chapter on diagnostics.
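
A minimal sketch of fitting a logistic regression and looking at a few diagnostics. The data are simulated, not the onchocerciasis data, and the variable names are arbitrary; the estat syntax assumes Stata 9 or later.

```stata
* Simulated binary outcome whose log-odds depend linearly on x
clear
set seed 313
set obs 500
generate x = rnormal()
generate byte y = runiform() < invlogit(-1 + 0.8*x)

* Odds ratios; use logit for coefficients on the log-odds scale
logistic y x

* Goodness of fit (Hosmer-Lemeshow with 10 groups)
estat gof, group(10)

* Influence and residual diagnostics
predict p, pr
predict dx2, dx2
predict db, dbeta
scatter dx2 p
scatter db p
```

Points that stand apart in the dx2 or dbeta plots are the ones worth inspecting individually before trusting the fitted odds ratios.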

Topic revision: r33 - 30 Oct 2006, ChunLi
