Regression Modeling Strategies
Frank E Harrell Jr
University of Virginia
fharrell@virginia.edu

Audience: Statisticians and persons from other quantitative disciplines who are interested in multivariable regression analysis of univariate responses, in developing, validating, and graphically describing multivariable predictive models, and in covariable adjustment in randomized clinical trials. The course will be of particular interest to:

Applied statisticians who want to learn new methodology for flexibly fitting all types of multivariable regression models while making estimation of optimal covariable transformations an explicit part of the modeling process.
Those who want to learn how to develop models that are likely to predict future observations as accurately as they predicted responses from the data used to fit the models.
Statisticians who want to learn how to graphically present complex regression models to non-statisticians.
Analysts who would like to learn how to incorporate multiple imputation with regression models to handle missing and incomplete data.
Quantitatively-minded epidemiologists and others who need to use binary and time-to-event (survival) models for analyzing and predicting outcomes in observational studies.
Biostatisticians, health services and outcomes researchers, and economists who need to study or predict health outcomes or resource utilization.
Biostatisticians working in clinical trials who would like to learn about the need for adjusting for covariables in perfectly balanced randomized trials and to be introduced to developing analytic plans for such adjustment.

Prerequisites: A good general knowledge of statistical estimation and inference methods and a good command of ordinary linear regression. Those who want to run the laboratory exercises themselves, or who want to apply the methods taught in this course in S-Plus in their everyday work, should have had a previous introduction to S-Plus. Participants are encouraged to read references [2, 3, 5] in advance. Those interested in covariable adjustment in randomized clinical trials may also want to read [4].

Course Format: Generally the first 2/3 of each day of this three-day course will consist of lectures on statistical methodology and graphical methods for interpreting complex models and presenting them to non-statisticians. In the remaining time students will gain hands-on experience in using the freely available S-Plus Design library for developing, checking, validating, testing formal hypotheses, and graphically interpreting multivariable predictive models using real datasets. See
http://hesweb1.med.virginia.edu/biostat/s/Design.html for information related to Design and to datasets for learning and testing modeling methods.

Course Description: The first part of the course presents the following elements of multivariable predictive modeling for a single response variable: using regression splines to relax linearity assumptions, perils of variable selection and overfitting, where to spend degrees of freedom, shrinkage, imputation of missing data, data reduction, and interaction surfaces. Then a default overall modeling strategy will be described. This is followed by methods for graphically understanding models (e.g., using nomograms) and using re-sampling to estimate a model's likely performance on new data. Then the freely available S-Plus Design library will be overviewed. Design facilitates most of the steps of the modeling process. Next, statistical methods related to binary logistic models will be covered. Three of the following case studies will be presented: an exploration of voting tendencies over U.S. counties in the 1992 presidential election, an interactive exploration of the survival status of Titanic passengers, an interactive case study in developing a survival time model for critically ill patients, and a case study in Cox regression. In the hands-on computer lab students will develop, validate, and graphically describe multivariable regression models themselves. This short course will survey the advantages of modeling in randomized trials and will provide some guidance in developing a prospective statistical plan for use in a Phase III clinical trial. The methods covered in this course will apply to almost any regression model, including ordinary least squares, logistic regression models, and survival models.
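
As a rough illustration of using re-sampling to estimate a model's likely performance on new data, the following Python sketch applies a bootstrap optimism correction to the R-squared of an ordinary least squares fit. It is not the S-Plus Design implementation; the function names (fit_ols, r_squared, optimism_corrected_r2), the simulated data, and the choice of 200 bootstrap replicates are all assumptions made for this example.

```python
import numpy as np

def fit_ols(X, y):
    """Least squares coefficients, with an intercept column prepended."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return beta

def r_squared(beta, X, y):
    """R-squared of fixed coefficients beta evaluated on the data (X, y)."""
    Xd = np.column_stack([np.ones(len(y)), X])
    resid = y - Xd @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def optimism_corrected_r2(X, y, n_boot=200, seed=0):
    """Return (apparent R^2, optimism-corrected R^2)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    apparent = r_squared(fit_ols(X, y), X, y)
    optimism = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)           # sample rows with replacement
        beta_b = fit_ols(X[idx], y[idx])           # refit on the bootstrap sample
        train = r_squared(beta_b, X[idx], y[idx])  # performance where it was fit
        test = r_squared(beta_b, X, y)             # performance on the original data
        optimism[b] = train - test
    return apparent, apparent - optimism.mean()

# Simulated example (hypothetical data): 15 candidate predictors, only one matters.
rng = np.random.default_rng(1)
n, p = 60, 15
X = rng.normal(size=(n, p))
y = X[:, 0] + rng.normal(size=n)

apparent, corrected = optimism_corrected_r2(X, y)
print(f"apparent R^2 = {apparent:.3f}, optimism-corrected R^2 = {corrected:.3f}")
```

With many predictors relative to the sample size, the apparent R-squared is typically noticeably higher than the corrected value; the gap is an estimate of how much the fit flatters itself on its own training data.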

Objectives:

Be familiar with modern methods for fitting multivariable regression models:
1. accurately
2. in a way the sample size will allow, without overfitting
3. uncovering complex non-linear or non-additive relationships
4. testing for and quantifying the association between one or more predictors and the response, with possible adjustment for other factors
Be able to understand the different types of missing data and how to use multiple imputation to incorporate partial covariable information
Be able to validate models for predictive accuracy and to detect overfitting
Be able to interpret fitted models using both parameter estimates and graphics
Be able to critique the literature to detect models that are likely to be unreliable
Understand benefits of covariable adjustment in randomized studies

Outline:

Planning for Modeling
Covariable Adjustment in Randomized Clinical Trials
1. Gaining efficiency
2. Reducing bias even with perfect balance
Notation for Regression Models
Interpreting Model Parameters
1. Nominal Predictors
2. Interactions
Relaxing Linearity Assumption for Continuous Predictors
1. Simple Nonlinear Terms
2. Splines for Estimating Shape of Regression Function and Determining Predictor Transformations
3. Cubic Spline Functions
4. Restricted Cubic Splines
5. Nonparametric regression
6. Advantages of Splines over Other Methods
Tests of Association
Assessment of Model Fit
1. Regression Assumptions
2. Modeling and Testing Interactions
Missing Data
1. Types of Missingness
2. Understanding Patterns of Missing Values
3. Problems with Simple Alternatives to Imputation
4. Strategies for Developing Imputation Algorithms
5. Single Conditional Mean Imputation
6. Multiple Imputation
7. S-Plus Software for Fitting Models and Adjusting Variances for Multiple Imputation
Multivariable Modeling Strategy
1. Pre-Specification of Predictor Complexity
2. Variable Selection
3. Overfitting and Limits on Number of Predictors
4. Shrinkage
5. Data Reduction
Resampling, Validating, Describing, and Simplifying the Model
1. The Bootstrap
2. Model Validation
3. Graphically Describing the Fitted Model
4. Simplifying the Model by Approximating It
S-Plus Design library
Case Study using Least Squares Multiple Regression: Voting Patterns in U.S. Counties
Binary Logistic Regression
1. The Model
2. Assessment of Model Fit
3. Quantifying Predictive Ability
4. Validating & Describing the Fitted Model
5. S-Plus Functions
Interactive Case Study: Binary Logistic Model for Survival of Titanic Passengers
1. Missing Data
2. Nonparametric Regression
3. Development of Logistic Model
4. Multiple Imputation to Handle Missing Passenger Ages
Interactive Case Study: Development of a Long-Term Survival Model for Critically Ill Patients
Cox Proportional Hazards Model
1. The Model
2. Checking Goodness of Fit
3. Quantifying Predictive Ability
4. Validation
5. S-Plus Functions
Case Study in Cox Regression
1. Choosing the Number of Parameters
2. Checking Proportional Hazards
3. Testing Interactions
4. Describing Predictor Effects
5. Validating the Model
6. Presenting the Model

Instructor: Dr. Harrell is Professor of Biostatistics and Statistics and Chief of the Division of Biostatistics and Epidemiology, Department of Health Evaluation Sciences, University of Virginia School of Medicine, Charlottesville. He received his Ph.D. in biostatistics from the University of North Carolina, Chapel Hill in 1979, where he studied under P.K. Sen. Dr. Harrell has been involved in statistical computing since 1969 and is the author of many S-Plus functions and SAS procedures. Since 1973 he has been involved in medical applications of statistics, especially in the area of survival analysis and clinical prediction modeling. He is an editorial consultant for the Journal of Clinical Epidemiology, is on the editorial board of Statistics in Medicine, is co-managing editor of the journal Health Services and Outcomes Research Methodology, and is a consultant to FDA and the pharmaceutical industry.

Handouts: Participants will receive copies of the 206 slides that will be presented and a copy of the book on which the course is based, Regression Modeling Strategies, written by the instructor. See
http://hesweb1.med.virginia.edu/biostat/rms for information about this text.

Background

Regression models are frequently used to develop diagnostic, prognostic, and health resource utilization models in clinical, health services, outcomes, pharmacoeconomic, and epidemiologic research, and in a multitude of non-health-related areas. Regression models are also used to adjust for patient heterogeneity in randomized clinical trials, to obtain tests that are more powerful and valid than unadjusted treatment comparisons. Models must be flexible enough to fit nonlinear and non-additive relationships, but unless the sample size is enormous, the approach to modeling must avoid common problems with data mining or data dredging that result in overfitting and a failure of the predictive model to validate on new subjects. All standard regression models have assumptions that must be verified for the model to have power to test hypotheses and for it to be able to predict accurately. Of the principal assumptions (linearity, additivity, distributional), this short course will emphasize methods for assessing and satisfying the first two. Practical but powerful tools are presented for validating model assumptions and presenting model results. This course provides methods for estimating the shape of the relationship between predictors and response using the widely applicable method of augmenting the design matrix using restricted cubic splines.
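
To make the idea of augmenting the design matrix with restricted cubic splines concrete, here is a minimal Python sketch using the truncated power form of the basis, which is constrained to be linear beyond the outer knots. It is an illustration only and not the course's S-Plus code; the function name rcs_basis, the simulated data, the scaling by the squared knot range, and the knot placement at fixed quantiles of the predictor are assumptions made for this example.

```python
import numpy as np

def rcs_basis(x, knots):
    """Restricted cubic spline columns for x: the linear term plus k-2 nonlinear terms."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(knots, dtype=float)
    k = len(t)
    if k < 3:
        raise ValueError("need at least 3 knots")

    def pos3(u):
        # Truncated cubic: (u)_+^3
        return np.maximum(u, 0.0) ** 3

    scale = (t[-1] - t[0]) ** 2   # keeps the nonlinear columns on the scale of x
    d = t[k - 1] - t[k - 2]       # distance between the last two knots
    cols = [x]
    for j in range(k - 2):
        term = (pos3(x - t[j])
                - pos3(x - t[k - 2]) * (t[k - 1] - t[j]) / d
                + pos3(x - t[k - 1]) * (t[k - 2] - t[j]) / d)
        cols.append(term / scale)
    return np.column_stack(cols)

# Simulated example (hypothetical data): a nonlinear relationship fit by least squares.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(scale=0.3, size=200)

knots = np.quantile(x, [0.05, 0.275, 0.5, 0.725, 0.95])   # 5 knots at fixed quantiles
X = np.column_stack([np.ones_like(x), rcs_basis(x, knots)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("coefficients:", np.round(beta, 3))
```

With k knots the predictor contributes k-1 columns, so five knots cost four degrees of freedom. The same columns can be supplied to a logistic or Cox model in place of the single linear term.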

References

[1]: F. E. Harrell. Regression Modeling Strategies, with Applications to Linear Models, Survival Analysis and Logistic Regression. Springer, New York, 2001.
[2]: F. E. Harrell, K. L. Lee, and D. B. Mark. Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine, 15:361-387, 1996.
[3]: F. E. Harrell, P. A. Margolis, S. Gove, K. E. Mason, E. K. Mulholland, D. Lehmann, L. Muhe, S. Gatchalian, and H. F. Eichenwald. Development of a clinical prediction model for an ordinal outcome: The World Health Organization ARI Multicentre Study of clinical signs and etiologic agents of pneumonia, sepsis, and meningitis in young infants. Statistics in Medicine, 17:909-944, 1998.
[4]: W. W. Hauck, S. Anderson, and S. M. Marcus. Should we adjust for covariates in nonlinear regression analyses of randomized trials? Controlled Clinical Trials, 19:249-256, 1998.
[5]: A. Spanos, F. E. Harrell, and D. T. Durack. Differential diagnosis of acute meningitis: An analysis of the predictive value of initial observations. Journal of the American Medical Association, 262:2700-2707, 1989.

Hardcopy Handouts and Books Supplied to Participants:

http://hesweb1.med.virginia.edu/biostat/rms/rms.pdf
Printout of this web page
Printout of web page http://hesweb1.med.virginia.edu/biostat/s/data
http://hesweb1.med.virginia.edu/biostat/rms/ShortCourse.hw.pdf
http://hesweb1.med.virginia.edu/biostat/teaching/biostat.mod/formulas.pdf
Copy of solutions to the above lab assignments
Self-quizzes and quiz solutions
Copy of Regression Modeling Strategies