- Audience: Statisticians and persons from other quantitative
disciplines who are interested in multivariable regression analysis of
univariate responses and in developing, validating, and graphically
describing multivariable predictive models. The course will be of particular interest to:
- Applied statisticians who want to learn new methodology for
flexibly fitting all types of multivariable regression models while
making estimation of optimal covariable transformations an explicit
part of the modeling process.
- Those who want to learn how to develop models that are likely to
predict future observations as accurately as they predicted responses
from the data used to fit the models.
- Statisticians who want to learn how to graphically present
complex regression models to non-statisticians.
- Analysts who would like to be introduced to multiple
imputation with regression models to handle missing and incomplete data.
- Quantitatively-minded epidemiologists and others who need to use
binary or ordinal logistic models and time-to-event (survival)
models for analyzing and predicting outcomes in observational studies.
- Biostatisticians, health services and outcomes researchers, and
economists who need to study or predict health outcomes or
- Prerequisites: A good general knowledge of statistical estimation
and inference methods and a good command of ordinary linear regression.
Those who want to run the laboratory exercises themselves, or who
want to apply the methods taught in this course in their everyday
work using S-Plus, should have had a previous introduction to S-Plus.
Participants are encouraged to read references [1, 2, 3] in advance.
- Course Format:
This is a short course in two 3.5-hour
segments. The morning session will cover modeling and model
validation methods; the afternoon session will cover the overall
modeling strategy, examples of displaying effects of predictors and
of presenting models graphically to non-statisticians, and case studies.
- Course Description:
The first part of the course presents the following
elements of multivariable predictive modeling for a single response
variable: using regression splines to relax linearity assumptions,
perils of variable selection and overfitting, where to spend degrees
of freedom, shrinkage, imputation of missing data, data reduction,
and interaction surfaces. A default overall modeling strategy is then
described, followed by methods for graphically understanding models
(e.g., using nomograms) and for using resampling to estimate a model's
likely performance on new data. Next comes an overview of the freely
available S-Plus Design library, which facilitates most of the steps
of the modeling process.
Two of the following three case studies will be presented: an
interactive exploration of the survival status of Titanic passengers,
an interactive case study in developing a survival time model for
critically ill patients, and a case study in Cox regression.
The methods covered in this course will apply to almost any regression
model, including ordinary least squares, logistic regression models,
and survival models.
- Be familiar with modern methods for fitting multivariable regression models:
  - in a way the sample size will allow, without overfitting
  - uncovering complex non-linear or non-additive relationships
  - testing for and quantifying the association between one or
    more predictors and the response, with possible adjustment
    for other factors
- Be able to validate models for predictive accuracy and to detect
- Be able to interpret fitted models using both parameter estimates
- Be able to critique the literature to detect models that are likely
to be unreliable
- Planning for Modeling
- Notation for Regression Models
- Interpreting Model Parameters
- Relaxing Linearity Assumption for Continuous Predictors
- Simple Nonlinear Terms
- Splines for Estimating Shape of Regression Function and
Determining Predictor Transformations
- Cubic Spline Functions
- Restricted Cubic Splines
- Nonparametric regression
- Advantages of Splines over Other Methods
- Tests of Association
- Assessment of Model Fit
- Modeling and Testing Interactions
- Missing Data
- Types of Missingness
- Understanding Patterns of Missing Values
- Problems with Simple Alternatives to Imputation
- Strategies for Developing Imputation Algorithms
- Single Conditional Mean Imputation
- Multiple Imputation
- S-Plus Software for Fitting Models and Adjusting Variances for
- Multivariable Modeling Strategy
- Pre-Specification of Predictor Complexity
- Variable Selection
- Overfitting and Limits on Number of Predictors
- Data Reduction
- Resampling, Validating, Describing, and Simplifying the Model
- Model Validation
- Graphically Describing the Fitted Model
- Simplifying the Model by Approximating It
- S-Plus Design library
- Interactive Case Study: Binary Logistic Model for Survival of
Titanic Passengers
- Nonparametric Regression
- Development of Logistic Model
- Multiple Imputation to Handle Missing Passenger Ages
- Interactive Case Study: Development of a Long-Term Survival Model for
Critically Ill Patients
- Case Study in Cox Regression
- Choosing the Number of Parameters
- Checking Proportional Hazards
- Testing Interactions
- Describing Predictor Effects
- Validating the Model
- Presenting the Model
Dr. Harrell is Professor of Biostatistics and Statistics and Chief
of the Division of Biostatistics and Epidemiology, Department of
Health Evaluation Sciences, University of Virginia School of Medicine,
Charlottesville. He received his Ph.D. in biostatistics
from the University of North Carolina, Chapel Hill in 1979, where he
studied under P.K. Sen. Dr. Harrell has been involved in statistical
computing since 1969 and is the author of many S-Plus functions and SAS
procedures. Since 1973 he has been involved in medical applications
of statistics, especially in the area of survival analysis and
clinical prediction modeling. He is an editorial consultant for the
Journal of Clinical Epidemiology, is on the editorial board of
Statistics in Medicine, is co-managing editor of the new
journal Health Services and Outcomes Research Methodology, and is
a consultant to FDA.
- Handouts: Participants will receive copies of the 180 slides that
will be presented and a copy of the 500-page book manuscript on which
the course is based, Regression Modeling Strategies, written by the
instructor. See
http://hesweb1.med.virginia.edu/biostat/rms for information about
- Background: Regression models are frequently used to develop diagnostic,
prognostic, and health resource utilization models in clinical, health
services, outcomes, pharmacoeconomic, and epidemiologic research, and
in a multitude of non-health-related areas. Regression models
are also used to adjust for patient heterogeneity in randomized
clinical trials, to obtain tests that are more powerful and valid than
unadjusted treatment comparisons.
Models must be flexible enough to fit nonlinear and non-additive
relationships, but unless the sample size is enormous, the approach to
modeling must avoid common problems with data mining or data dredging
that result in overfitting and a failure of the predictive model to
validate on new subjects.
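
As a rough illustration of how resampling can reveal overfitting, the following sketch estimates how much an apparent R² is likely to overstate performance on new subjects. It uses the optimism bootstrap with a one-predictor least-squares fit; the choice of technique, the single predictor, and all function names are assumptions made for this example, and it is written in Python rather than the S-Plus software used in the course.

```python
import random


def fit_ols(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def r_squared(coef, xs, ys):
    """R^2 of an already-fitted line, evaluated on the data (xs, ys)."""
    a, b = coef
    my = sum(ys) / len(ys)
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - my) ** 2 for y in ys)
    return 1.0 - sse / sst


def optimism_corrected_r2(xs, ys, n_boot=200, seed=0):
    """Return (apparent R^2, optimism-corrected R^2).

    Hypothetical helper: refits the model on each bootstrap sample and
    compares its R^2 on that sample with its R^2 on the original data.
    The average difference is the estimated optimism.
    """
    rng = random.Random(seed)
    n = len(xs)
    apparent = r_squared(fit_ols(xs, ys), xs, ys)
    optimism = 0.0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        coef = fit_ols(bx, by)
        optimism += r_squared(coef, bx, by) - r_squared(coef, xs, ys)
    optimism /= n_boot
    return apparent, apparent - optimism
```

In a real analysis every modeling step (transformations, variable selection, and so on) would be repeated inside the loop; here the "model" is just a straight line, so the estimated optimism will usually be small.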
All standard regression models have assumptions that must be verified
for the model to have power to test hypotheses and for it to be able
to predict accurately. Of the principal assumptions (linearity,
additivity, distributional), this short course will emphasize methods for
assessing and satisfying the first two. Practical but powerful tools
are presented for validating model assumptions and presenting model
results. This course provides methods for estimating the shape of the
relationship between predictors and response using the widely
applicable method of augmenting the design matrix using restricted
cubic splines.
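
To make the idea of augmenting the design matrix concrete, here is a minimal sketch of a restricted cubic spline basis for one continuous predictor, built on the truncated power basis. With k knots it yields k - 1 columns (the predictor itself plus k - 2 nonlinear terms), and the fitted function is constrained to be linear beyond the outer knots. The function names, the knot locations in the usage note, and the scaling by the squared knot range are assumptions made for this example; the block is in Python, not the S-Plus code used in the course.

```python
def rcs_basis(x, knots):
    """Restricted cubic spline basis values for a single value x.

    Returns a list of length k - 1 for k knots: x itself followed by
    k - 2 nonlinear terms that are zero below the first knot and
    linear in x above the last knot.
    """
    t = sorted(knots)
    k = len(t)
    if k < 3:
        raise ValueError("a restricted cubic spline needs at least 3 knots")

    def pos3(u):
        # Truncated cubic: u^3 when u is positive, otherwise 0.
        return u ** 3 if u > 0 else 0.0

    scale = (t[-1] - t[0]) ** 2   # assumed normalization, keeps columns on the scale of x
    d = t[-1] - t[-2]             # spacing of the last two knots
    row = [float(x)]
    for j in range(k - 2):
        term = (pos3(x - t[j])
                - pos3(x - t[-2]) * (t[-1] - t[j]) / d
                + pos3(x - t[-1]) * (t[-2] - t[j]) / d)
        row.append(term / scale)
    return row


def rcs_design_matrix(xs, knots):
    """One basis row per observation; append these columns to the design matrix."""
    return [rcs_basis(x, knots) for x in xs]
```

For example, `rcs_design_matrix(ages, [20, 40, 60, 80])` (with a hypothetical `ages` list) would return three columns per observation, which can then be passed to any linear, logistic, or survival regression routine in place of the raw predictor.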
[1] F. E. Harrell, K. L. Lee, and D. B. Mark.
Multivariable prognostic models: Issues in developing models,
evaluating assumptions and adequacy, and measuring and reducing errors.
Statistics in Medicine, 15:361-387, 1996.
[2] F. E. Harrell, P. A. Margolis, S. Gove, K. E. Mason, E. K. Mulholland,
D. Lehmann, L. Muhe, S. Gatchalian, and H. F. Eichenwald.
Development of a clinical prediction model for an ordinal outcome:
The World Health Organization ARI Multicentre Study of clinical signs and
etiologic agents of pneumonia, sepsis, and meningitis in young infants.
Statistics in Medicine, 17:909-944, 1998.
[3] A. Spanos, F. E. Harrell, and D. T. Durack.
Differential diagnosis of acute meningitis: An analysis of the
predictive value of initial observations.
Journal of the American Medical Association, 262:2700-2707,