STAT 731: Advanced Data Analysis (3 credits)
Frank E. Harrell, Jr. (fharrell@virginia.edu)
Professor of Biostatistics and Statistics
Division of Biostatistics and Epidemiology
Department of Health Evaluation Sciences
School of Medicine
Department of Statistics
Graduate School of Arts and Sciences
924-8712
3 September - 3 December 2001
Monday 8:30-10:50a
(10 Sep, 15 and 22 Oct 9:35-11:00a)
Room 121, Halsey Hall
Open Office Hours: TBD


Class Web Page: http://hesweb1.med.virginia.edu/biostat/teaching/biostat.mod
Advanced Data Analysis is a required course for students in the Biostatistics track of the Statistics program. It is also suitable for students in Health Evaluation Sciences who have completed HES 704 (Biostatistical Modeling). The course expands on topics covered in HES 704 and adds others: penalized estimation; the cluster bootstrap and bootstrap bumping; semiparametric methods for handling clustered and serial data; simulation; multiple imputation; nonlinear principal components; robust, resistant, and rank-based regression; the smearing estimator; generalized additive models and other flexible regression methods such as projection pursuit regression; ordinal regression; recursive partitioning; cluster analysis; model validation; and some elements of exploratory data analysis.
Motivation
The art of data analysis is about choosing and using multiple tools. This course teaches how to use many tools of applied statistics and teaches strategies for solving real problems. Many of the methods presented in this course allow the statistician to analyze messy datasets and variables with strange distributions, and allow for flexible fitting of somewhat complex models without overfitting.
Prerequisites
HES 703 (STAT 301/501) and HES 704 or permission of the instructor.
Learning Objectives
To learn various methods useful for high-level analysis of the kinds of data often encountered in real-world applications.
Texts
Harrell FE: Regression Modeling Strategies with Applications to Linear Models, Logistic Regression, and Survival Analysis. New York: Springer, 2001.
Venables WN, Ripley BD: Modern Applied Statistics with S-Plus. New York: Springer-Verlag, 3rd Edition, 2000.
Recommended Supplemental Reading
Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning. New York: Springer, 2001.
Harrell FE, Alzola CF: An Introduction to S-Plus and to the Hmisc and Design Libraries. Available from the Biostatistics web page or from the UVa bookstore, 2001.
Hastie T, Tibshirani R: Generalized Additive Models. London: Chapman and Hall, 1990.
Collett D: Modelling Binary Data. London: Chapman and Hall, 1994.
Efron B, Tibshirani R: An Introduction to the Bootstrap. New York: Chapman and Hall, 1993.
Davison AC, Hinkley DV: Bootstrap Methods and Their Application. Cambridge University Press, 1997.
Datasets
  1. Datasets from Rosner's Fundamentals of Biostatistics, 4th edition.
  2. Datasets posted on the class web page.
  3. Students are encouraged to find their own datasets for the final project.
See the Rosner and Datasets areas under Teaching Materials on the web page. Several of the Rosner datasets have already been converted to S-Plus data frames. These are in dumpdata.sdd in the Rosner area.
Class E-mail Group
Students will automatically receive clarifications of assignments and notes, news about updates to the web page, class schedule changes, software bug corrections, and other information. To post questions to the instructor and to all class members, send E-mail to HES731-1@toolkit.virginia.edu. You can also access this E-mail address through our web page. Class members who wish to post answers should send their responses to the HES731-1 address so that all students can profit.
Software
S-Plus 2000 Professional Edition on Microsoft Windows 95/98/NT/2000, or Version 6 on UNIX or Linux, with the add-on S-Plus libraries Hmisc and Design. To access Hmisc and Design on the campus-wide Unix system (e.g., by telnetting to blue.unix.virginia.edu), define the following .First function:
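.First <- function() {
  # Runs at session startup. The second argument (T) attaches each
  # library at the front of the search list.
  library(Hmisc, T, lib.loc = "/home/feh3k/slibrary")
  library(Design, T, lib.loc = "/home/feh3k/slibrary")
  invisible()
}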
To use the Statistics Division Unix computers, change lib.loc above to point to '/home/library/Splus'. Note: S-Plus 3.4 was recently removed from all UNIX machines, so UNIX availability is uncertain at present.
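For the Statistics Division machines, the substitution described above would give a startup function along these lines. This is only a sketch: it assumes that both Hmisc and Design are installed under '/home/library/Splus', which the text implies but does not state separately for each library.

```r
.First <- function() {
  # Same startup function as above, with lib.loc changed to the
  # Statistics Division directory (assumed to hold both libraries).
  library(Hmisc, T, lib.loc = "/home/library/Splus")
  library(Design, T, lib.loc = "/home/library/Splus")
  invisible()
}
```

Only one definition of .First should be in effect at a time, so use whichever version matches the machine you are working on.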
Class Format
Lecture, interactive computer demonstrations, and occasional computer labs
Labs
Labs available for doing homework:
Wilson 308 (Mon., Wed. noon-3p when class or lab not in session)
ACHS (Academic Computing Health Sciences) 8a-5p Mon-Fri (3rd floor Hospital West by DHES)
Health Sciences Library Learning Resources Center (1st floor of HSL) 7:30a-midnight Mon-Thurs, 7:30a-7p Fri, 9a-7p Sat, 9a-midnight Sun.
Schedule exceptions
On three dates (10 Sep, 15 Oct, and 22 Oct) class will start a little over an hour late, at 9:35a. Some makeup classes may be scheduled.
Grading
Assignments are due by 5p on the date listed. Projects must be done independently and must include the honor pledge. Small homeworks may be done in groups (these will be designated as group assignments), with all group participants signing a single copy to verify their participation. Each project counts as much as 2 group homeworks, and the final project counts as much as 4 group homeworks. One letter grade is deducted per day late unless prior arrangements are made with the instructor (such arrangements are allowed once or twice per semester).
Work must be as concise as clear communication will allow. The preferred method for producing reports is to print PDF files created by the Biostatistics LaTeX server (at http://fharrell.biostat.virginia.edu/latex). Last day to turn in final project: 14 December.
Exams
Quizzes will be given on random days, each counting 1/2 as much as a group assignment. There will be no makeups for quizzes. Students may choose one quiz to exclude from their final grade.
Assistance
For questions outside class, E-mail or phone the instructor for an appointment. First try the E-mail group, so that other students can see your question and the answers you get from others. Assistance on projects will be given only when the same assistance is available to all students (i.e., through the E-mail group or during class).
Anonymous Feedback to Instructor
See Web page.
Course Outline, Projects, and Approximate Schedule
Numbers in parentheses to the right of topics indicate sequential lecture numbers.
Hn stands for Harrell Section n. VRn stands for Venables & Ripley Section n. HTFn stands for optional readings in Hastie, Tibshirani, and Friedman Section n.
  1. Introduction (1)
    1. Course overview and logistics
    2. Hypothesis testing vs. estimation vs. prediction (H1.1)
    3. Choice of model (H1.4)
  2. General methods for multivariable models (H2)
    1. Nonparametric smoothers (H2.4.6, VR9.1)
    2. Smoothing splines (VR9.1)
    3. Regression splines (H2.4.2-2.4.5) (2)
    4. Modeling interactions (H2.7.2) and tensor splines (HTF5.7)
    5. Recursive partitioning (H2.5, VR10)
  3. Multiple Imputation (H3) (3)
  4. Shrinkage (H4.4-4.5) (4)
  5. Nonlinear principal components and related methods (H4.7.2)
  6. Bootstrap for validating models (H5.2.5, VR5.7, 6.6) (5)
  7. Case study in data reduction and missing value imputation (H8)
  8. Maximum Likelihood Estimation (H9) (6)
    1. Three test statistics (H6.3.3)
    2. Robust covariance matrix estimator (H9.5)
    3. Correcting variances for clustered or serial data using sandwich and bootstrap estimators (H9.5)
    4. Bootstrap simultaneous confidence regions using Tibshirani's bootstrap bumping (H9.7) (7)
    5. S-Plus bootcov and rm.boot functions
    6. Simulations to study coverage of simultaneous bootstrap confidence regions
    7. Further use of the log likelihood (H9.8) (8)
    8. Weighted MLE (H9.9)
    9. Penalized MLE (H9.10, HTF3.4.3)
    10. Effective d.f. (H9.10, HTF7.6)
    11. Tibshirani's lasso (HTF3.4.3, 3.4.5, 10.12.3)
  9. Ordinal Logistic Models (H13, 14) (9)
    1. Models
    2. Using ordinal models and the Cox model for robust rank-based analysis of continuous response data
    3. Special residual plots
    4. Special use of penalized MLE
    5. Case study
    Project: Develop and validate a proportional odds ordinal logistic model
  10. Projection-pursuit regression (VR9.2) and MARS (HTF9.4) (10)
  11. Transform-both-sides Nonparametric Additive Regression Models (H15, VR9.3) (11)
    1. Generalized additive models
    2. ACE
    3. AVAS
    4. S-Plus areg.boot function
    5. Smearing estimator (H15.4)
    Project: Develop and interpret a nonparametric additive model for a continuous response
  12. Other topics such as cluster analysis, correspondence analysis, unsupervised association rules (HTF14.2)
  Final Project