MSCI Biostatistics II - February 2023

Key Persons

Instructor: Frank E. Harrell, Jr. (Zulip: @Frank Harrell)
Teaching Assistant: Heather Prigmore (Zulip: @Heather Prigmore)

Important Items

Course Handouts

Course Format

The Biostatistics II course is designed for students to do concentrated, intensive study before each class so that class time can be devoted to clarification, reviewing key concepts, answering student questions, and especially to problem solving. This design allows students to do the vast majority of "homework" assignments during class.

Pre-class: Intensive study of statistical methods and ideas
  • Read assigned sections of books and/or course notes, listen to the audio narration, and watch the short movies demonstrating statistical methods that are linked from the notes
  • Read assigned supplemental articles
In-class: Clarification, review of key concepts, and problem solving
  • Review key elements of the assigned material
  • Ample time for students' questions about the material and the concepts
  • Interactive demonstrations of the methods using datasets from ABD
  • In-class assignments using Stata
  • Write interpretations of selected analyses done during class
  • Take self-quizzes to gauge understanding of key concepts


Class Announcements & Discussion Board

  • Class announcements and homework assignments will appear on the course Zulip stream. It is the main way to keep in touch with the class and, even more importantly, to ask and answer questions. We hope that all students will use it to:
    • ask or answer any question whatsoever related to group assignments
    • ask or answer any logistical or purely technical questions related to individual work assignments
    • ask or answer any questions about modeling or statistical computing concepts that are not directly related to a pending individual work assignment
    • ask statistical or study design questions related to what's in the course notes
  • Please also take advantage of the general regression modeling strategies discussion board: stats.stackexchange
  • Use for questions and discussion about study design, measurement, clinical trials, epidemiology, machine learning, and medical applications of statistics

High-Level Overview

Multivariable regression models are the fundamental tools used for prediction, effect estimation, and hypothesis testing. This course covers the most commonly used regression models plus general methods applicable to all regression models. There is an emphasis on aspects related to clinical and translational study design.


Accurate estimation of patient prognosis or of the probability of a disease or other outcomes is important for many reasons.
  1. Prognostic estimates can be used to inform the patient about likely outcomes of her disease.
  2. A physician can use estimates of diagnosis or prognosis as a guide for ordering additional tests and selecting appropriate therapies.
  3. Outcome assessments are useful in the evaluation of technologies; for example, diagnostic estimates derived both with and without using the results of a given test can be compared to measure the incremental diagnostic information provided by that test over what is provided by prior information.
  4. A researcher may want to estimate the effect of a single factor (e.g., treatment given) on outcomes in an observational study in which many uncontrolled confounding factors are also measured. Here the simultaneous effects of the uncontrolled variables must be controlled (held constant mathematically if using a regression model) so that the effect of the factor of interest can be more purely estimated. An analysis of how variables (especially continuous ones) affect the patient outcomes of interest is necessary to ascertain how to control their effects.
  5. Predictive modeling is useful in designing randomized clinical trials. Both the decision concerning which patients to randomize and the design of the randomization process (e.g., stratified randomization using prognostic factors) are aided by the availability of accurate prognostic estimates before randomization. It is also important to adjust for prognostic factors in randomized studies to achieve optimum power and precision. Lastly, accurate prognostic models can be used to test for differential therapeutic benefit or to estimate the clinical benefit for an individual patient in a clinical trial, taking into account the fact that low-risk patients must have less absolute benefit (e.g., lower change in survival probability).

To accomplish these objectives, researchers must create multivariable models that accurately reflect the patterns existing in the underlying data and that are valid when applied to comparable data in other settings or institutions. Models may be inaccurate due to violation of assumptions, omission of important predictors, high frequency of missing data and/or improper imputation methods, and, especially with small datasets, overfitting.
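
The remark in item 5 that low-risk patients must have less absolute benefit can be illustrated with a small calculation. The sketch below assumes, purely for illustration, that treatment multiplies every patient's odds of a bad outcome by the same odds ratio; the value 0.7 and the function names are hypothetical and do not come from any study or from the course notes. It is written in Python rather than the Stata used in class.

```python
def treated_risk(baseline_risk: float, odds_ratio: float) -> float:
    """Risk under treatment, assuming treatment multiplies the odds by odds_ratio."""
    odds = baseline_risk / (1.0 - baseline_risk)
    treated_odds = odds * odds_ratio
    return treated_odds / (1.0 + treated_odds)


def absolute_risk_reduction(baseline_risk: float, odds_ratio: float) -> float:
    """Difference between untreated and treated risk for one patient."""
    return baseline_risk - treated_risk(baseline_risk, odds_ratio)


if __name__ == "__main__":
    hypothetical_or = 0.7  # assumed constant relative effect, for illustration only
    for risk in (0.02, 0.10, 0.30):
        arr = absolute_risk_reduction(risk, hypothetical_or)
        print(f"baseline risk {risk:.2f}: absolute risk reduction {arr:.4f}")
```

Under this assumption the absolute reduction grows as the baseline risk rises from 0.02 to 0.30, even though the relative effect is identical for every patient.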


Many types of regression models are increasingly being used in developing clinical prediction models for diagnosis, prognosis, and other applications in epidemiology, health services research, health economics, clinical trials, business, finance, and prediction in general.

The course begins with the basics of multivariable regression models, starting with the ordinary multiple linear regression model (ordinary least squares). Early topics include interpretation of regression coefficients, coding of categorical predictors, meaning of linearity assumptions, estimating the relationships between two variables nonparametrically, and coding and interpretation of interaction terms. Popular models include logistic models for binary and ordinal responses, survival models, and models for longitudinal data analysis, many of which are covered in this course.

All regression models have assumptions that must be verified for the model to have power to test hypotheses and to be able to predict accurately. Of the principal assumptions (linearity, additivity, distributional), this course will emphasize methods for assessing and satisfying the first two, as these methods apply to all regression models. To deal with the linearity assumption, this course provides methods for estimating the shape of the relationship between predictors and response using the widely applicable method of piecewise polynomials. Emphasis will be given to interpreting fitted models using effect plots (e.g., continuous partial effect plots and odds ratio charts) and nomograms.

Even when assumptions are satisfied, overfitting can ruin a model's predictive ability for future observations. Methods for data reduction will be introduced to deal with the common case where the number of potential predictors is large in comparison with the number of observations.
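
The piecewise-polynomial approach to the linearity assumption mentioned above can be sketched with its simplest member, a linear spline: the predictor is expanded into extra columns that let the slope change at chosen knots, and the expanded model is still fit by ordinary least squares. This is only an illustration in Python with NumPy, not the course's Stata workflow; the helper names, the knot locations, and the simulated V-shaped data are all assumptions made for the example, and the course notes may use a different spline basis.

```python
import numpy as np


def linear_spline_basis(x, knots):
    """Design matrix with columns: intercept, x, and (x - k)+ for each knot k."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x]
    cols += [np.maximum(x - k, 0.0) for k in knots]
    return np.column_stack(cols)


def fit_ols(X, y):
    """Ordinary least squares coefficients."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def r_squared(X, y, coef):
    """Proportion of variance in y explained by the fitted values X @ coef."""
    resid = y - X @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=200)
    y = np.abs(x - 5) + rng.normal(0, 0.5, size=200)  # simulated V-shaped relationship

    X_linear = linear_spline_basis(x, knots=[])            # straight line only
    X_spline = linear_spline_basis(x, knots=[2.5, 5, 7.5])  # slope may change at knots

    print("R^2, linear fit:", r_squared(X_linear, y, fit_ols(X_linear, y)))
    print("R^2, spline fit:", r_squared(X_spline, y, fit_ols(X_spline, y)))
```

A straight-line fit cannot represent a relationship that falls and then rises, whereas the spline terms let the data determine the shape.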
Methods of model validation (bootstrap and cross-validation) will be introduced, as will auxiliary topics such as modeling interaction surfaces, dealing with missing data, variable selection, collinearity, and shrinkage. All methods covered will apply to almost any regression model. The course will include detailed case studies in developing, validating, and interpreting clinical prediction and epidemiologic models.
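
As a rough illustration of bootstrap model validation, the sketch below estimates how much an apparent R² overstates performance (its "optimism") and subtracts that estimate. It implements one common variant, the optimism-corrected bootstrap, in Python with NumPy; it is not necessarily the procedure or software used in the course, and the function names, sample size, and simulated data are assumptions made for the example.

```python
import numpy as np


def fit_ols(X, y):
    """Ordinary least squares coefficients."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def r_squared(X, y, coef):
    """R^2 of the predictions X @ coef evaluated against y."""
    resid = y - X @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


def bootstrap_validate(X, y, n_boot=200, seed=0):
    """Return (apparent R^2, estimated optimism, optimism-corrected R^2)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    apparent = r_squared(X, y, fit_ols(X, y))
    optimism = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)        # resample rows with replacement
        Xb, yb = X[idx], y[idx]
        coef_b = fit_ols(Xb, yb)                # refit the model on the bootstrap sample
        boot_r2 = r_squared(Xb, yb, coef_b)     # performance where the model was fit
        test_r2 = r_squared(X, y, coef_b)       # same model evaluated on the original data
        optimism.append(boot_r2 - test_r2)
    mean_optimism = float(np.mean(optimism))
    return float(apparent), mean_optimism, float(apparent) - mean_optimism


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    n, p = 40, 10                                # few observations, many candidate predictors
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
    y = 0.5 * X[:, 1] + rng.normal(size=n)       # only the first predictor matters
    apparent, optimism, corrected = bootstrap_validate(X, y)
    print(f"apparent R^2 {apparent:.3f}, optimism {optimism:.3f}, corrected {corrected:.3f}")
```

With many candidate predictors and few observations, the corrected estimate is typically well below the apparent one, which is the overfitting problem in miniature.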

Additional material for the curious student

  • Steyerberg EW. Clinical Prediction Models, 2nd ed. New York: Springer; 2019.
  • Cosma Shalizi's Undergraduate Advanced Data Analysis course
  • TRIPOD: Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis

Study Questions and Polls

Institutional Memory (Thoughts, Feedback, Ideas for future)

Topic revision: r96 - 26 Jan 2023, FrankHarrell
