Statistical Modeling for Biomedical Researchers

 

William D. Dupont

 

1: Introduction......................................................................................................................... 1

1.1. Algebraic Notation...................................................................................................... 1

1.2. Descriptive Statistics................................................................................................... 2

1.2.1. Dot Plot................................................................................................................ 2

1.2.2. Sample Mean...................................................................................................... 3

1.2.3. Residual............................................................................................................... 3

1.2.4. Sample Variance................................................................................................ 3

1.2.5. Sample Standard Deviation............................................................................ 4

1.2.6. Percentile and Median..................................................................................... 4

1.2.7. Box Plot............................................................................................................... 4

1.2.8. Histogram............................................................................................................ 5

1.2.9. Scatter Plot......................................................................................................... 5

1.3. The Stata Statistical Software Package.................................................................. 5

1.3.1. Downloading Data from my Web Site........................................................... 5

1.3.2. Creating Dot Plots with Stata......................................................................... 6

1.3.3. Stata Command Syntax................................................................................... 7

1.3.4. Obtaining Interactive Help from Stata......................................................... 8

1.3.5. Stata Log Files................................................................................................... 9

1.3.6. Displaying Other Descriptive Statistics with Stata.................................... 9

1.4. Inferential Statistics................................................................................................. 11

1.4.1. Probability Density Function....................................................................... 11

1.4.2. Mean, Variance and Standard Deviation.................................................. 12

1.4.3. Normal Distribution....................................................................................... 12

1.4.4. Expected Value................................................................................................ 12

1.4.5. Standard Error................................................................................................ 13

1.4.6. Null Hypothesis, Alternative Hypothesis and P value........................... 13

1.4.7. 95% Confidence Interval............................................................................... 14

1.4.8. Statistical Power.............................................................................................. 14

1.4.9. The z and Student's t Distributions............................................................. 15

1.4.10. Paired t Test................................................................................................... 16

1.4.11. Performing Paired t Tests with Stata....................................................... 17

1.4.12. Independent t Test Using a Pooled Standard Error Estimate............ 19

1.4.13. Independent t Test using Separate Standard Error Estimates.......... 20

1.4.14. Independent t-tests using Stata................................................................. 21

1.4.15. The Chi-Squared Distribution.................................................................. 23

1.5. Additional Reading................................................................................................... 23

1.6. Exercises...................................................................................................................... 24

1.7. Figures for Chapter 1............................................................................................... 25

 

2: Simple Linear Regression............................................................................................ 37

2.1. Sample Covariance................................................................................................... 37

2.2. Sample Correlation Coefficient.............................................................................. 38

2.3. Population Covariance and Correlation Coefficient.......................................... 38

2.4. Conditional Expectation.......................................................................................... 39

2.5. Simple Linear Regression Model........................................................................... 39

2.6. Fitting the Linear Regression Model.................................................................... 40

2.7. Historical Trivia: Origin of the Term Regression................................................ 41

2.8. Determining the Accuracy of Linear Regression Estimates............................ 42

2.9. Ethylene Glycol Poisoning Example..................................................................... 43

2.10. 95% Confidence Interval for y = a + bx Evaluated at x.......................... 44

2.11. 95% Prediction Interval for the Response of a New Patient.......................... 44

2.12. Simple Linear Regression with Stata................................................................. 45

2.13. Lowess Regression................................................................................................... 49

2.14. Plotting a Lowess Regression Curve in Stata................................................... 49

2.15. Residual Analyses................................................................................................... 50

2.16. Studentized Residual Analysis Using Stata..................................................... 52

2.17. Transforming the x and y Variables.................................................................... 53

2.17.1. Stabilizing the Variance.............................................................................. 53

2.17.2. Correcting for Non-linearity....................................................................... 54

2.17.3. Example: Research Funding and Morbidity for 29 Diseases............... 54

2.18. Analyzing Transformed Data with Stata........................................................... 55

2.19. Testing the Equality of Regression Slopes........................................................ 58

2.19.1. Example:  The Framingham Heart Study............................................... 59

2.20. Comparing Slope Estimates with Stata.............................................................. 60

2.21. Additional Reading................................................................................................. 63

2.22. Exercises................................................................................................................... 64

2.23. Tables and Figures for Chapter 2........................................................................ 66

 

 

3: Multiple Linear Regression........................................................................................ 81

3.1. The Model.................................................................................................................... 81

3.2. Confounding Variables............................................................................................ 81

3.3. Estimating the Parameters for a Multiple Linear Regression Model............ 82

3.4. R2 Statistic for Multiple Regression Models........................................................ 82

3.5. Expected Response in the Multiple Regression Model...................................... 83

3.6. The Accuracy of Multiple Regression Parameter Estimates........................... 83

3.7. Leverage...................................................................................................................... 84

3.8. 95% Confidence Interval for .............................................................................. 84

3.9. 95% Prediction Intervals......................................................................................... 85

3.10. Example:  The Framingham Heart Study......................................................... 85

3.10.1. Preliminary Univariate Analyses............................................................. 86

3.11. Scatterplot Matrix Graphs.................................................................................... 87

3.11.1. Producing Scatterplot Matrix Graphs with Stata.................................. 87

3.12. Modeling Interaction in Multiple Linear Regression...................................... 88

3.12.1. The Framingham Example........................................................................ 88

3.13. Multiple Regression Modeling of the Framingham Data............................... 89

3.14. Intuitive Understanding of a Multiple Regression Model.............................. 90

3.14.1. The Framingham Example........................................................................ 90

3.15. Calculating 95% Confidence and Prediction Intervals................................... 92

3.16. Multiple Linear Regression with Stata.............................................................. 92

3.17. Automatic Methods of Model Selection.............................................................. 96

3.17.1. Forward Selection using Stata................................................................... 96

3.17.2. Backward Selection....................................................................................... 99

3.17.3. Forward Stepwise Selection........................................................................ 99

3.17.4. Backward Stepwise Selection...................................................................... 99

3.17.5. Pros and Cons of Automated Model Selection...................................... 100

3.18. Collinearity............................................................................................................ 100

3.19. Residual Analyses................................................................................................. 101

3.20. Influence................................................................................................................. 101

3.20.1. Δβ̂ Influence Statistic................................................................................ 102

3.20.2. Cook’s Distance............................................................................................ 102

3.20.3. The Framingham Example...................................................................... 103

3.21. Residual and Influence Analyses Using Stata............................................... 104

3.22. Additional Reading............................................................................................... 107

3.23. Exercises................................................................................................................. 107

3.24. Tables and Figures for Chapter 3...................................................................... 110

 

4: Simple Logistic Regression....................................................................................... 121

4.1. Example: APACHE Score and Mortality in Patients with Sepsis............... 121

4.2. Sigmoidal Family of Logistic Regression Curves............................................. 121

4.3. The Log Odds of Death Given a Logistic Probability Function..................... 122

4.4. The Binomial Distribution.................................................................................... 123

4.5. Simple Logistic Regression Model....................................................................... 123

4.6. Generalized Linear Model..................................................................................... 124

4.7. Contrast Between Logistic and Linear Regression.......................................... 124

4.8. Maximum Likelihood Estimation........................................................................ 125

4.8.1. Variance of Maximum Likelihood Parameter Estimates..................... 125

4.9. Statistical Tests and Confidence Intervals........................................................ 126

4.9.1. Likelihood Ratio Tests................................................................................. 126

4.9.2. Quadratic Approximations to the Log Likelihood Ratio Function..... 127

4.9.3. Score Tests...................................................................................................... 128

4.9.4. Wald Tests and Confidence Intervals....................................................... 128

4.9.5. Which Test Should You Use?...................................................................... 129

4.10. Sepsis Example...................................................................................................... 130

4.11. Logistic Regression with Stata........................................................................... 130

4.12. Odds Ratios and the Logistic Regression Model............................................. 132

4.13. 95% Confidence Interval for the Odds Ratio Associated with a Unit Increase in x............ 132

4.13.1. Calculating this Odds Ratio with Stata................................................. 133

4.14. Logistic Regression with Grouped Response Data......................................... 133

4.15. 95% Confidence Interval for the .............................................................. 134

4.16. 95% Confidence Intervals for Proportions....................................................... 135

4.17. Example:  The Ibuprofen in Sepsis Trial......................................................... 135

4.18. Logistic Regression with Grouped Data Using Stata.................................... 137

4.19. Simple 2×2 Case-Control Studies...................................................................... 140

4.19.1. Example:  The Ille-et-Vilaine Study of Esophageal Cancer and Alcohol............ 140

4.19.2. Review of Classical Case-Control Theory.............................................. 140

4.19.3. 95% Confidence Interval for the Odds Ratio: Woolf’s Method........... 142

4.19.4. Test of the Null Hypothesis that the Odds Ratio Equals One........... 142

4.19.5. Test of the Null Hypothesis that Two Proportions are Equal........... 143

4.20. Logistic Regression Models for 2×2 Contingency Tables............................. 143

4.20.1. Nuisance Parameters................................................................................. 144

4.20.2. 95% Confidence Interval for the Odds Ratio: Logistic Regression... 144

4.21. Creating a Stata Data File.................................................................................. 144

4.22. Analyzing Case-Control Data with Stata........................................................ 146

4.23. Regressing Disease Against Exposure............................................................. 147

4.24. Additional Reading............................................................................................... 149

4.25. Exercises................................................................................................................. 149

4.26. Tables and Figures for Chapter 4...................................................................... 151

 

 

5: Multiple Logistic Regression................................................................................... 161

5.1. Mantel-Haenszel Estimate of an Age-Adjusted Odds Ratio.......................... 161

5.2. Mantel-Haenszel Statistic for Multiple 2×2 Tables.................................... 162

5.3. 95% Confidence Interval for the Age-Adjusted Odds Ratio........................... 163

5.4. Breslow and Day’s Test for Homogeneity........................................................... 163

5.5. Calculating the Mantel-Haenszel Odds Ratio using Stata............................ 165

5.6. Multiple Logistic Regression Model.................................................................... 167

5.7. 95% Confidence Interval for an Adjusted Odds Ratio.................................... 169

5.8. Logistic Regression for Multiple 2×2 Contingency Tables............................. 169

5.9. Analyzing Multiple 2×2 Tables with Stata........................................................ 171

5.10. Handling Categorical Variables in Stata........................................................ 173

5.11. Effect of Dose of Alcohol on Esophageal Cancer Risk................................... 174

5.11.1. Analyzing Model  with Stata.................................................................... 175

5.12. Effect of Dose of Tobacco on Esophageal Cancer Risk.................................. 176

5.13. Deriving Odds Ratios from Multiple Parameters........................................... 176

5.14. The Standard Error of a Weighted Sum of Regression Coefficients.......... 177

5.15. Confidence Intervals for Weighted Sums of Coefficients............................. 177

5.16. Hypothesis Tests for Weighted Sums of Coefficients..................................... 178

5.17. The Estimated Variance-Covariance Matrix.................................................. 178

5.18. Multiplicative Models of Two Risk Factors...................................................... 179

5.19. Multiplicative Model of Smoking, Alcohol, and Esophageal Cancer......... 180

5.20. Fitting a Multiplicative Model with Stata....................................................... 181

5.21. Model of Two Risk Factors with Interaction.................................................... 185

5.22. Model of Alcohol, Tobacco, and Esophageal Cancer with Interaction Terms............ 186

5.23. Fitting a Model with Interaction using Stata................................................. 187

5.24. Model Fitting:  Nested Models and Model Deviance..................................... 190

5.25. Effect Modifiers and Confounding Variables.................................................. 191

5.26. Goodness-of-Fit Tests........................................................................................... 192

5.26.1. The Pearson  Goodness-of-Fit Statistic............................................. 192

5.27. Hosmer-Lemeshow Goodness-of-Fit Test......................................................... 193

5.27.1. An Example: The Ille-et-Vilaine Cancer Data Set.............................. 194

5.28. Residual and Influence Analysis....................................................................... 195

5.28.1. Standardized Pearson Residual.............................................................. 196

5.28.2. Influence Statistic....................................................................................... 196

5.28.3. Residual Plots of the Ille-et-Vilaine Data on Esophageal Cancer.... 197

5.29. Using Stata for Goodness-of-Fit Tests and Residual Analyses................... 198

5.30. Frequency Matched Case-Control Studies...................................................... 204

5.31. Conditional Logistic Regression........................................................................ 205

5.32. Analyzing Data with Missing Values............................................................... 205

5.32.1. Cardiac Output in the Ibuprofen in Sepsis Study............................... 206

5.32.2. Modeling Missing Values with Stata...................................................... 207

5.33. Additional Reading............................................................................................... 209

5.34. Exercises................................................................................................................. 210

5.35. Tables and Figures for Chapter 5...................................................................... 212

 

 

6: Introduction to Survival Analysis......................................................................... 219

6.1. Survival and Cumulative Mortality Functions................................................ 219

6.2. Right Censored Data.............................................................................................. 220

6.3. Kaplan-Meier Survival Curves............................................................................ 220

6.4. An Example: Genetic Risk of Recurrent Intracerebral Hemorrhage........... 221

6.5. 95% Confidence Intervals for Survival Functions........................................... 222

6.6. Cumulative Mortality Function........................................................................... 223

6.7. Censoring and Bias................................................................................................. 224

6.8. Logrank Test............................................................................................................ 224

6.9. Using Stata to Derive Survival Functions and the Logrank Test................ 226

6.10. Logrank Test for Multiple Patient Groups...................................................... 231

6.11. Hazard Functions................................................................................................. 231

6.12. Proportional Hazards........................................................................................... 232

6.13. Relative Risks and Hazard Ratios..................................................................... 232

6.14. Proportional Hazards Regression Analysis..................................................... 233

6.15. Hazard Regression Analysis of the Intracerebral Hemorrhage Data....... 234

6.16. Proportional Hazards Regression Analysis with Stata................................. 234

6.17. Tied Failure Times............................................................................................... 235

6.18. Additional Reading............................................................................................... 236

6.19. Exercises................................................................................................................. 236

6.20. Tables and Figures for Chapter 6...................................................................... 238

 

7: Hazard Regression Analysis..................................................................................... 249

7.1. Proportional Hazards Model................................................................................. 249

7.2. Relative Risks and Hazard Ratios....................................................................... 249

7.3. 95% Confidence Intervals and Hypothesis Tests............................................. 250

7.4. Nested Models and Model Deviance.................................................................... 251

7.5. An Example:  The Framingham Heart Study.................................................. 251

7.5.1. Univariate Analyses..................................................................................... 251

7.5.2. Multiplicative Model of DBP and Gender on Risk of CHD................... 253

7.5.3. Using Interaction Terms to Model the Effects of Gender and DBP on CHD............ 253

7.5.4. Adjusting for Confounding Variables....................................................... 254

7.5.5. Interpretation................................................................................................ 255

7.5.6. Alternate Models........................................................................................... 256

7.6. Cox-Snell Generalized Residuals and Proportional Hazards Models.......... 256

7.7. Proportional Hazards Regression Analysis using Stata................................. 257

7.8. Stratified Proportional Hazards Models............................................................. 268

7.9. Survival Analysis with Ragged Study Entry.................................................... 269

7.9.1. Kaplan-Meier Survival Curve and the Logrank Test with Ragged Entry............ 269

7.9.2. Age, Sex, and CHD in the Framingham Heart Study.......................... 270

7.9.3. Proportional Hazards Regression Analysis with Ragged Entry......... 270

7.9.4. Survival Analysis with Ragged Entry using Stata............................... 271

7.10. Hazard Regression Models with Time Dependent Covariates.................... 273

7.10.1. Cox-Snell Residuals for Models with Time-Dependent Covariates............ 274

7.10.2. Testing the Proportional Hazards Assumption.................................... 275

7.10.3. Alternative Models..................................................................................... 275

7.11. Modeling Time-Dependent Covariates with Stata......................................... 275

7.12. Additional Reading............................................................................................... 280

7.13. Exercises................................................................................................................. 280

7.14. Tables and Figures for Chapter 7...................................................................... 282

 

8: Introduction to Poisson Regression: Inferences on Morbidity and Mortality Rates............ 293

8.1. Elementary Statistics Involving Rates.............................................................. 293

8.2. Calculating Relative Risks from Incidence Data Using Stata...................... 294

8.3. The Binomial and Poisson Distributions........................................................... 295

8.4. Simple Poisson Regression for 2×2 Tables......................................................... 296

8.5. Poisson Regression and the Generalized Linear Model.................................. 297

8.6. Contrast Between Poisson, Logistic, and Linear Regression......................... 297

8.7. Simple Poisson Regression with Stata............................................................... 298

8.8. Poisson Regression and Survival Analysis........................................................ 299

8.8.1. Recoding Survival Data on Patients as Patient-Year Data................. 299

 

8.8.2. Converting Survival Records to Person-Years of Follow-Up using Stata............ 300

8.9. Converting the Framingham Survival Data Set to Person-Time Data...... 303

8.10. Simple Poisson Regression with Multiple Data Records.............................. 308

8.11. Poisson Regression with a Classification Variable........................................ 308

8.12. Applying Simple Poisson Regression to the Framingham Data................. 309

8.13. Additional Reading............................................................................................... 311

8.14. Exercises................................................................................................................. 312

8.15. Tables and Figures for Chapter 8...................................................................... 315

 

 

9: Multiple Poisson Regression.................................................................................... 319

9.1. Multiple Poisson Regression Model..................................................................... 319

9.2. An Example:  The Framingham Heart Study.................................................. 321

9.2.1. A Multiplicative Model of Gender, Age and Coronary Heart Disease............ 322

9.2.2. A Model of Age, Gender and CHD with Interaction Terms................... 324

9.2.3. Adding Confounding Variables to the Model.......................................... 325

9.3. Using Stata to Perform Poisson Regression...................................................... 326

9.4. Residual Analyses for Poisson Regression Models........................................... 334

9.4.1. Deviance Residuals....................................................................................... 334

9.5. Residual Analysis of Poisson Regression Models Using Stata...................... 335

9.6. Additional Reading................................................................................................. 337

9.7. Exercises................................................................................................................... 337

9.8. Tables and Figures for Chapter 9........................................................................ 339

 

10: Fixed Effects Analysis of Variance...................................................................... 345

10.1. One-Way Analysis of Variance.......................................................................... 345

10.2. Multiple Comparisons.......................................................................................... 347

10.3. Reformulating Analysis of Variance as a Linear Regression Model......... 349

10.4. Non-parametric Methods..................................................................................... 349

10.5. Kruskal-Wallis Test.............................................................................................. 350

10.6. Example:  A Polymorphism in the Estrogen Receptor Gene....................... 351

10.7. One-Way Analyses of Variance using Stata................................................... 352

10.8. Two-Way Analysis of Variance, Analysis of Covariance, and Other Models............ 358

10.9. Additional Reading............................................................................................... 359

10.10. Exercises............................................................................................................... 360

10.11. Tables and Figures for Chapter 10................................................................. 361

 

 

11: Repeated-Measures Analysis of Variance........................................................ 363

11.1. Example:  Effect of Race and Dose of Isoproterenol on Blood Flow............ 363

11.2. Exploratory Analysis of Repeated Measures Data Using Stata................. 364

11.3. Response Feature Analysis................................................................................. 368

11.4. Example:  The Isoproterenol Data Set............................................................. 369

11.5. Response Feature Analysis using Stata.......................................................... 370

11.6. The Area-Under-the-Curve Response Feature.............................................. 376

11.7. Generalized Estimating Equations................................................................... 377

11.8. Common Correlation Structures....................................................................... 378

11.9. GEE Analysis and the Huber-White Sandwich Estimator.......................... 379

11.10. Example:  Analyzing the Isoproterenol Data with GEE............................ 380

11.11. Using Stata to Analyze the Isoproterenol Data Set Using GEE.............. 382

11.12. GEE Analyses with Logistic or Poisson Models.......................................... 386

11.13. Additional Reading............................................................................................ 386

11.14. Exercises............................................................................................................... 387

11.15. Tables and Figures for Chapter 11................................................................. 389

 

 

Appendix A:  Summary of Stata Commands Used in this Text..................... 396

 

 

References............................................................................................................................. 408