Statistical Modeling for Biomedical Researchers
William D. Dupont
1: Introduction......................................................................................................................... 1
1.1. Algebraic Notation...................................................................................................... 1
1.2. Descriptive Statistics................................................................................................... 2
1.2.1. Dot Plot................................................................................................................ 2
1.2.2. Sample Mean...................................................................................................... 3
1.2.3. Residual............................................................................................................... 3
1.2.4. Sample Variance................................................................................................ 3
1.2.5. Sample Standard Deviation............................................................................ 4
1.2.6. Percentile and Median..................................................................................... 4
1.2.7. Box Plot............................................................................................................... 4
1.2.8. Histogram............................................................................................................ 5
1.2.9. Scatter Plot......................................................................................................... 5
1.3. The Stata Statistical Software Package.................................................................. 5
1.3.1. Downloading Data from my Web Site........................................................... 5
1.3.2. Creating Dot Plots with Stata......................................................................... 6
1.3.3. Stata Command Syntax................................................................................... 7
1.3.4. Obtaining Interactive Help from Stata......................................................... 8
1.3.5. Stata Log Files................................................................................................... 9
1.3.6. Displaying Other Descriptive Statistics with Stata.................................... 9
1.4. Inferential Statistics................................................................................................. 11
1.4.1. Probability Density Function....................................................................... 11
1.4.2. Mean, Variance and Standard Deviation.................................................. 12
1.4.3. Normal Distribution....................................................................................... 12
1.4.4. Expected Value................................................................................................ 12
1.4.5. Standard Error................................................................................................ 13
1.4.6. Null Hypothesis, Alternative Hypothesis and P value........................... 13
1.4.7. 95% Confidence Interval............................................................................... 14
1.4.8. Statistical Power.............................................................................................. 14
1.4.9. The z and Student's t Distributions............................................................. 15
1.4.10. Paired t Test................................................................................................... 16
1.4.11. Performing Paired t Tests with Stata....................................................... 17
1.4.12. Independent t Test Using a Pooled Standard Error Estimate............ 19
1.4.13. Independent t Test using Separate Standard Error Estimates.......... 20
1.4.14. Independent t-tests using Stata................................................................. 21
1.4.15. The Chi-Squared Distribution.................................................................. 23
1.5. Additional Reading................................................................................................... 23
1.6. Exercises...................................................................................................................... 24
1.7. Figures for Chapter 1............................................................................................... 25
2: Simple Linear Regression............................................................................................ 37
2.1. Sample Covariance................................................................................................... 37
2.2. Sample Correlation Coefficient.............................................................................. 38
2.3. Population Covariance and Correlation Coefficient.......................................... 38
2.4. Conditional Expectation.......................................................................................... 39
2.5. Simple Linear Regression Model........................................................................... 39
2.6. Fitting the Linear Regression Model.................................................................... 40
2.7. Historical Trivia: Origin of the Term Regression................................................ 41
2.8. Determining the Accuracy of Linear Regression Estimates............................ 42
2.9. Ethylene Glycol Poisoning Example..................................................................... 43
2.10. 95% Confidence Interval for y = a + bx Evaluated at x.......................... 44
2.11. 95% Prediction Interval for the Response of a New Patient.......................... 44
2.12. Simple Linear Regression with Stata................................................................. 45
2.13. Lowess Regression................................................................................................... 49
2.14. Plotting a Lowess Regression Curve in Stata................................................... 49
2.15. Residual Analyses................................................................................................... 50
2.16. Studentized Residual Analysis Using Stata..................................................... 52
2.17. Transforming the x and y Variables.................................................................... 53
2.17.1. Stabilizing the Variance.............................................................................. 53
2.17.2. Correcting for Non-linearity....................................................................... 54
2.17.3. Example: Research Funding and Morbidity for 29 Diseases............... 54
2.18. Analyzing Transformed Data with Stata........................................................... 55
2.19. Testing the Equality of Regression Slopes........................................................ 58
2.19.1. Example: The Framingham Heart Study............................................... 59
2.20. Comparing Slope Estimates with Stata.............................................................. 60
2.21. Additional Reading................................................................................................. 63
2.22. Exercises................................................................................................................... 64
2.23. Tables and Figures for Chapter 2........................................................................ 66
3: Multiple Linear Regression........................................................................................ 81
3.1. The Model.................................................................................................................... 81
3.2. Confounding Variables............................................................................................ 81
3.3. Estimating the Parameters for a Multiple Linear Regression Model............ 82
3.4. R² Statistic for Multiple Regression Models........................................................ 82
3.5. Expected Response in the Multiple Regression Model...................................... 83
3.6. The Accuracy of Multiple Regression Parameter Estimates........................... 83
3.7. Leverage...................................................................................................................... 84
3.8. 95% Confidence Interval for .............................................................................. 84
3.9. 95% Prediction Intervals......................................................................................... 85
3.10. Example: The Framingham Heart Study......................................................... 85
3.10.1. Preliminary Univariate Analyses............................................................. 86
3.11. Scatterplot Matrix Graphs.................................................................................... 87
3.11.1. Producing Scatterplot Matrix Graphs with Stata.................................. 87
3.12. Modeling Interaction in Multiple Linear Regression...................................... 88
3.12.1. The Framingham Example........................................................................ 88
3.13. Multiple Regression Modeling of the Framingham Data............................... 89
3.14. Intuitive Understanding of a Multiple Regression Model.............................. 90
3.14.1. The Framingham Example........................................................................ 90
3.15. Calculating 95% Confidence and Prediction Intervals................................... 92
3.16. Multiple Linear Regression with Stata.............................................................. 92
3.17. Automatic Methods of Model Selection.............................................................. 96
3.17.1. Forward Selection using Stata................................................................... 96
3.17.2. Backward Selection....................................................................................... 99
3.17.3. Forward Stepwise Selection........................................................................ 99
3.17.4. Backward Stepwise Selection...................................................................... 99
3.17.5. Pros and Cons of Automated Model Selection...................................... 100
3.18. Collinearity............................................................................................................ 100
3.19. Residual Analyses................................................................................................. 101
3.20. Influence................................................................................................................. 101
3.20.1. D Influence Statistic................................................................................ 102
3.20.2. Cook’s Distance............................................................................................ 102
3.20.3. The Framingham Example...................................................................... 103
3.21. Residual and Influence Analyses Using Stata............................................... 104
3.22. Additional Reading............................................................................................... 107
3.23. Exercises................................................................................................................. 107
3.24. Tables and Figures for Chapter 3...................................................................... 110
4: Simple Logistic Regression....................................................................................... 121
4.1. Example: APACHE Score and Mortality in Patients with Sepsis............... 121
4.2. Sigmoidal Family of Logistic Regression Curves............................................. 121
4.3. The Log Odds of Death Given a Logistic Probability Function..................... 122
4.4. The Binomial Distribution.................................................................................... 123
4.5. Simple Logistic Regression Model....................................................................... 123
4.6. Generalized Linear Model..................................................................................... 124
4.7. Contrast Between Logistic and Linear Regression.......................................... 124
4.8. Maximum Likelihood Estimation........................................................................ 125
4.8.1. Variance of Maximum Likelihood Parameter Estimates..................... 125
4.9. Statistical Tests and Confidence Intervals........................................................ 126
4.9.1. Likelihood Ratio Tests................................................................................. 126
4.9.2. Quadratic Approximations to the Log Likelihood Ratio Function..... 127
4.9.3. Score Tests...................................................................................................... 128
4.9.4. Wald Tests and Confidence Intervals....................................................... 128
4.9.5. Which Test Should You Use?...................................................................... 129
4.10. Sepsis Example...................................................................................................... 130
4.11. Logistic Regression with Stata........................................................................... 130
4.12. Odds Ratios and the Logistic Regression Model............................................. 132
4.13. 95% Confidence Interval for the Odds Ratio Associated with a Unit
Increase in x............................................................................................................ 132
4.13.1. Calculating this Odds Ratio with Stata................................................. 133
4.14. Logistic Regression with Grouped Response Data......................................... 133
4.15. 95% Confidence Interval for the .............................................................. 134
4.16. 95% Confidence Intervals for Proportions....................................................... 135
4.17. Example: The Ibuprofen in Sepsis Trial......................................................... 135
4.18. Logistic Regression with Grouped Data Using Stata.................................... 137
4.19. Simple 2×2 Case-Control Studies...................................................................... 140
4.19.1. Example: The Ille-et-Vilaine Study of Esophageal Cancer
and Alcohol............................................................................................................... 140
4.19.2. Review of Classical Case-Control Theory.............................................. 140
4.19.3. 95% Confidence Interval for the Odds Ratio: Woolf’s Method........... 142
4.19.4. Test of the Null Hypothesis that the Odds Ratio Equals One........... 142
4.19.5. Test of the Null Hypothesis that Two Proportions are Equal........... 143
4.20. Logistic Regression Models for 2×2 Contingency Tables............................. 143
4.20.1. Nuisance Parameters................................................................................. 144
4.20.2. 95% Confidence Interval for the Odds Ratio: Logistic Regression... 144
4.21. Creating a Stata Data File.................................................................................. 144
4.22. Analyzing Case-Control Data with Stata........................................................ 146
4.23. Regressing Disease Against Exposure............................................................. 147
4.24. Additional Reading............................................................................................... 149
4.25. Exercises................................................................................................................. 149
4.26. Tables and Figures for Chapter 4...................................................................... 151
5: Multiple Logistic Regression................................................................................... 161
5.1. Mantel-Haenszel Estimate of an Age-Adjusted Odds Ratio.......................... 161
5.2. Mantel-Haenszel Statistic for Multiple 2×2 Tables.................................... 162
5.3. 95% Confidence Interval for the Age-Adjusted Odds Ratio........................... 163
5.4. Breslow and Day’s Test for Homogeneity........................................................... 163
5.5. Calculating the Mantel-Haenszel Odds Ratio using Stata............................ 165
5.6. Multiple Logistic Regression Model.................................................................... 167
5.7. 95% Confidence Interval for an Adjusted Odds Ratio.................................... 169
5.8. Logistic Regression for Multiple 2×2 Contingency Tables............................. 169
5.9. Analyzing Multiple 2×2 Tables with Stata........................................................ 171
5.10. Handling Categorical Variables in Stata........................................................ 173
5.11. Effect of Dose of Alcohol on Esophageal Cancer Risk................................... 174
5.11.1. Analyzing Model with Stata.................................................................... 175
5.12. Effect of Dose of Tobacco on Esophageal Cancer Risk.................................. 176
5.13. Deriving Odds Ratios from Multiple Parameters........................................... 176
5.14. The Standard Error of a Weighted Sum of Regression Coefficients.......... 177
5.15. Confidence Intervals for Weighted Sums of Coefficients............................. 177
5.16. Hypothesis Tests for Weighted Sums of Coefficients..................................... 178
5.17. The Estimated Variance-Covariance Matrix.................................................. 178
5.18. Multiplicative Models of Two Risk Factors...................................................... 179
5.19. Multiplicative Model of Smoking, Alcohol, and Esophageal Cancer......... 180
5.20. Fitting a Multiplicative Model with Stata....................................................... 181
5.21. Model of Two Risk Factors with Interaction.................................................... 185
5.22. Model of Alcohol, Tobacco, and Esophageal Cancer with Interaction
Terms....................................................................................................................... 186
5.23. Fitting a Model with Interaction using Stata................................................. 187
5.24. Model Fitting: Nested Models and Model Deviance..................................... 190
5.25. Effect Modifiers and Confounding Variables.................................................. 191
5.26. Goodness-of-Fit Tests........................................................................................... 192
5.26.1. The Pearson Goodness-of-Fit Statistic............................................. 192
5.27. Hosmer-Lemeshow Goodness-of-Fit Test......................................................... 193
5.27.1. An Example: The Ille-et-Vilaine Cancer Data Set.............................. 194
5.28. Residual and Influence Analysis....................................................................... 195
5.28.1. Standardized Pearson Residual.............................................................. 196
5.28.2. Influence Statistic....................................................................................... 196
5.28.3. Residual Plots of the Ille-et-Vilaine Data on Esophageal Cancer.... 197
5.29. Using Stata for Goodness-of-Fit Tests and Residual Analyses................... 198
5.30. Frequency Matched Case-Control Studies...................................................... 204
5.31. Conditional Logistic Regression........................................................................ 205
5.32. Analyzing Data with Missing Values............................................................... 205
5.32.1. Cardiac Output in the Ibuprofen in Sepsis Study............................... 206
5.32.2. Modeling Missing Values with Stata...................................................... 207
5.33. Additional Reading............................................................................................... 209
5.34. Exercises................................................................................................................. 210
5.35. Tables and Figures for Chapter 5...................................................................... 212
6: Introduction to Survival Analysis......................................................................... 219
6.1. Survival and Cumulative Mortality Functions................................................ 219
6.2. Right Censored Data.............................................................................................. 220
6.3. Kaplan-Meier Survival Curves............................................................................ 220
6.4. An Example: Genetic Risk of Recurrent Intracerebral Hemorrhage........... 221
6.5. 95% Confidence Intervals for Survival Functions........................................... 222
6.6. Cumulative Mortality Function........................................................................... 223
6.7. Censoring and Bias................................................................................................. 224
6.8. Logrank Test............................................................................................................ 224
6.9. Using Stata to Derive Survival Functions and the Logrank Test................ 226
6.10. Logrank Test for Multiple Patient Groups...................................................... 231
6.11. Hazard Functions................................................................................................. 231
6.12. Proportional Hazards........................................................................................... 232
6.13. Relative Risks and Hazard Ratios..................................................................... 232
6.14. Proportional Hazards Regression Analysis..................................................... 233
6.15. Hazard Regression Analysis of the Intracerebral Hemorrhage Data....... 234
6.16. Proportional Hazards Regression Analysis with Stata................................. 234
6.17. Tied Failure Times............................................................................................... 235
6.18. Additional Reading............................................................................................... 236
6.19. Exercises................................................................................................................. 236
6.20. Tables and Figures for Chapter 6...................................................................... 238
7: Hazard Regression Analysis..................................................................................... 249
7.1. Proportional Hazards Model................................................................................. 249
7.2. Relative Risks and Hazard Ratios....................................................................... 249
7.3. 95% Confidence Intervals and Hypothesis Tests............................................. 250
7.4. Nested Models and Model Deviance.................................................................... 251
7.5. An Example: The Framingham Heart Study.................................................. 251
7.5.1. Univariate Analyses..................................................................................... 251
7.5.2. Multiplicative Model of DBP and Gender on Risk of CHD................... 253
7.5.3. Using Interaction Terms to Model the Effects of Gender and
DBP on CHD............................................................................................................. 253
7.5.4. Adjusting for Confounding Variables....................................................... 254
7.5.5. Interpretation................................................................................................ 255
7.5.6. Alternate Models........................................................................................... 256
7.6. Cox-Snell Generalized Residuals and Proportional Hazards Models.......... 256
7.7. Proportional Hazards Regression Analysis using Stata................................. 257
7.8. Stratified Proportional Hazards Models............................................................. 268
7.9. Survival Analysis with Ragged Study Entry.................................................... 269
7.9.1. Kaplan-Meier Survival Curve and the Logrank Test with
Ragged Entry........................................................................................................... 269
7.9.2. Age, Sex, and CHD in the Framingham Heart Study.......................... 270
7.9.3. Proportional Hazards Regression Analysis with Ragged Entry......... 270
7.9.4. Survival Analysis with Ragged Entry using Stata............................... 271
7.10. Hazard Regression Models with Time Dependent Covariates.................... 273
7.10.1. Cox-Snell Residuals for Models with Time-Dependent
Covariates................................................................................................................. 274
7.10.2. Testing the Proportional Hazards Assumption.................................... 275
7.10.3. Alternative Models..................................................................................... 275
7.11. Modeling Time-Dependent Covariates with Stata......................................... 275
7.12. Additional Reading............................................................................................... 280
7.13. Exercises................................................................................................................. 280
7.14. Tables and Figures for Chapter 7...................................................................... 282
8: Introduction to Poisson Regression: Inferences on Morbidity and
Mortality Rates.................................................................................................................. 293
8.1. Elementary Statistics Involving Rates.............................................................. 293
8.2. Calculating Relative Risks from Incidence Data Using Stata...................... 294
8.3. The Binomial and Poisson Distributions........................................................... 295
8.4. Simple Poisson Regression for 2×2 Tables......................................................... 296
8.5. Poisson Regression and the Generalized Linear Model.................................. 297
8.6. Contrast Between Poisson, Logistic, and Linear Regression......................... 297
8.7. Simple Poisson Regression with Stata............................................................... 298
8.8. Poisson Regression and Survival Analysis........................................................ 299
8.8.1. Recoding Survival Data on Patients as Patient-Year Data................. 299
8.8.2. Converting Survival Records to Person-Years of Follow-Up
using Stata................................................................................................................ 300
8.9. Converting the Framingham Survival Data Set to Person-Time Data...... 303
8.10. Simple Poisson Regression with Multiple Data Records.............................. 308
8.11. Poisson Regression with a Classification Variable........................................ 308
8.12. Applying Simple Poisson Regression to the Framingham Data................. 309
8.13. Additional Reading............................................................................................... 311
8.14. Exercises................................................................................................................. 312
8.15. Tables and Figures for Chapter 8...................................................................... 315
9: Multiple Poisson Regression.................................................................................... 319
9.1. Multiple Poisson Regression Model..................................................................... 319
9.2. An Example: The Framingham Heart Study.................................................. 321
9.2.1. A Multiplicative Model of Gender, Age and Coronary Heart
Disease....................................................................................................................... 322
9.2.2. A Model of Age, Gender and CHD with Interaction Terms................... 324
9.2.3. Adding Confounding Variables to the Model.......................................... 325
9.3. Using Stata to Perform Poisson Regression...................................................... 326
9.4. Residual Analyses for Poisson Regression Models........................................... 334
9.4.1. Deviance Residuals....................................................................................... 334
9.5. Residual Analysis of Poisson Regression Models Using Stata...................... 335
9.6. Additional Reading................................................................................................. 337
9.7. Exercises................................................................................................................... 337
9.8. Tables and Figures for Chapter 9........................................................................ 339
10: Fixed Effects Analysis of Variance...................................................................... 345
10.1. One-Way Analysis of Variance.......................................................................... 345
10.2. Multiple Comparisons.......................................................................................... 347
10.3. Reformulating Analysis of Variance as a Linear Regression Model......... 349
10.4. Non-parametric Methods..................................................................................... 349
10.5. Kruskal-Wallis Test.............................................................................................. 350
10.6. Example: A Polymorphism in the Estrogen Receptor Gene....................... 351
10.7. One-Way Analyses of Variance using Stata................................................... 352
10.8. Two-Way Analysis of Variance, Analysis of Covariance, and Other
Models..................................................................................................................... 358
10.9. Additional Reading............................................................................................... 359
10.10. Exercises............................................................................................................... 360
10.11. Tables and Figures for Chapter 10................................................................. 361
11: Repeated-Measures Analysis of Variance........................................................ 363
11.1. Example: Effect of Race and Dose of Isoproterenol on Blood Flow............ 363
11.2. Exploratory Analysis of Repeated Measures Data Using Stata................. 364
11.3. Response Feature Analysis................................................................................. 368
11.4. Example: The Isoproterenol Data Set............................................................. 369
11.5. Response Feature Analysis using Stata.......................................................... 370
11.6. The Area-Under-the-Curve Response Feature.............................................. 376
11.7. Generalized Estimating Equations................................................................... 377
11.8. Common Correlation Structures....................................................................... 378
11.9. GEE Analysis and the Huber-White Sandwich Estimator.......................... 379
11.10. Example: Analyzing the Isoproterenol Data with GEE............................ 380
11.11. Using Stata to Analyze the Isoproterenol Data Set Using GEE.............. 382
11.12. GEE Analyses with Logistic or Poisson Models.......................................... 386
11.13. Additional Reading............................................................................................ 386
11.14. Exercises............................................................................................................... 387
11.15. Tables and Figures for Chapter 11................................................................. 389
Appendix A: Summary of Stata Commands Used in this Text..................... 396
References............................................................................................................................. 408