STAT 731: Advanced Data Analysis (3 credits)
Frank E. Harrell, Jr. (fharrell@virginia.edu)
Professor of Biostatistics and Statistics
Division of Biostatistics and Epidemiology
Department of Health Evaluation Sciences
School of Medicine
Department of Statistics,
Graduate School of Arts and Sciences
924-8712
3 September - 3 December 2001
Monday 8:30-10:50a
(10 Sep, 15 and 22 Oct 9:35-11:00a)
Room 121, Halsey Hall
Open Office Hours: TBD
Class Web Page:
http://hesweb1.med.virginia.edu/biostat/teaching/biostat.mod
Advanced Data Analysis is a required course for students in the
Biostatistics track in the Statistics program. This course is also
suitable for students in Health Evaluation Sciences who have completed
HES 704 (Biostatistical Modeling). This course expands on topics
covered in HES 704 and covers other topics such as penalized
estimation, the cluster bootstrap and bootstrap bumping,
semiparametric methods for handling
clustered and serial data, simulation, multiple imputation, nonlinear
principal components, robust, resistant, and rank-based regression,
the smearing estimator, generalized additive models and other flexible
regression methods such as projection pursuit regression, ordinal
regression, recursive partitioning, cluster analysis, model
validation, and some elements of exploratory data analysis.
Motivation
The art of data analysis is about choosing and using multiple
tools. This course teaches how to use many tools of applied
statistics and teaches strategies for solving real problems.
Many of the methods presented in this course allow the statistician to
analyze messy datasets and variables with strange distributions, and
allow for flexible fitting of somewhat complex models without
overfitting.
Prerequisites
HES 703 (STAT 301/501) and HES 704 or permission of the instructor.
Learning Objectives
To learn various methods useful for high-level analysis of the kinds
of data often encountered in real-world applications.
Texts
Harrell FE: Regression Modeling Strategies with Applications
to Linear Models, Logistic Regression, and Survival Analysis. New
York: Springer, 2001.
Venables WN, Ripley BD: Modern Applied Statistics with S-Plus.
New York: Springer-Verlag, 3rd Edition, 2000.
Recommended Supplemental Reading
Hastie T, Tibshirani R, Friedman J: The Elements of Statistical
Learning. New York: Springer, 2001.
Harrell FE, Alzola CF: An Introduction to S-Plus and to the Hmisc and
Design Libraries. Available from the Biostatistics web page or
from the UVa bookstore, 2001.
Hastie T, Tibshirani R: Generalized Additive Models. London:
Chapman and Hall, 1990.
Collett D: Modelling Binary Data. London: Chapman & Hall, 1994.
Efron B, Tibshirani R: An Introduction to the Bootstrap. New
York: Chapman and Hall, 1993.
Davison AC, Hinkley DV: Bootstrap Methods and Their Application.
Cambridge University Press, 1997.
Datasets
- Datasets from Rosner's Fundamentals of Biostatistics,
4th edition.
- Datasets from the class web page
- Students are encouraged to find their own datasets for the
final project
See the Rosner and Datasets area under Teaching
Materials on the web page. Several of the Rosner datasets have
already been converted to S-Plus data frames. These are in
dumpdata.sdd under the Rosner area.
Class E-mail Group
Students will automatically receive clarifications of assignments and
notes, news about updates to the web page, class schedule changes,
software bug corrections, and other information. To post questions to
the instructor and to all class members, send E-mail to
HES731-1@toolkit.virginia.edu. You can also access this
E-mail address through our web page. Class members who wish to post
answers should send their responses to the HES731-1 address so
that all students can profit.
Software
S-Plus 2000 Professional Edition on Microsoft
Windows 95/98/NT/2000 or Version 6 on UNIX or Linux, with
add-on S-Plus libraries Hmisc and Design. To access
Hmisc and Design on the campus-wide Unix system (e.g., by
telnetting to blue.unix.virginia.edu), define
the following .First function:
# Runs at the start of each S-Plus session: attach Hmisc and Design
.First <- function() {
  library(Hmisc, T, lib.loc = "/home/feh3k/slibrary")
  library(Design, T, lib.loc = "/home/feh3k/slibrary")
  invisible()  # return nothing visibly
}
To use the Statistics Division Unix computers, change lib.loc
above to point to '/home/library/Splus'.
Note: S-Plus 3.4 was removed from all UNIX machines recently
so UNIX availability is uncertain at the present time.
Class Format
Lecture, interactive computer demonstrations,
and occasional computer labs
Labs
Labs available for doing homework:
Wilson 308 (Mon., Wed. noon-3p when class or lab not in session)
ACHS (Academic Computing Health Sciences) 8a-5p Mon-Fri (3rd
floor Hospital West by DHES)
Health Sciences Library Learning Resources Center (1st floor
of HSL) 7:30a-midnight Mon-Thurs, 7:30a-7p Fri, 9a-7p Sat,
9a-midnight Sun.
Schedule exceptions
Class will start at 9:35a instead of 8:30a on three dates (10 Sep, 15 and 22 Oct).
Some makeup classes may be scheduled.
Grading
Assignments are due by 5p on the date listed. Projects must be done
independently and must include the honor pledge. Small homeworks may
be done in groups (these will be designated as group assignments), with
all group participants signing a single copy
verifying their participation. Each project counts as much as 2 group
homeworks. The final project counts as much as 4 group homeworks. One
letter grade is deducted per day late unless prior arrangements are
made with the instructor (these prior arrangements are allowed once or
twice per semester).
Work must be as concise as clear communication
will allow. The preferred method for producing reports is to print
PDF files created by the Biostatistics LaTeX server (at
http://fharrell.biostat.virginia.edu/latex).
Last day to turn in final project: 14 December.
Exams
Quizzes will be given on random days, each
counting 1/2 of a group assignment. There will be no
makeups for quizzes. Students will be able to choose one quiz to exclude
from their final grade.
Assistance
For questions for the instructor outside class, E-mail
or phone for an appointment. First try the E-mail
group, so that other students can see your question and the
answers you get from others. Assistance will only be given on
projects when the same assistance is available to all students
(i.e., through the E-mail group or during class).
Anonymous Feedback to Instructor
See Web page.
Course Outline, Projects, and Approximate Schedule
Numbers in parentheses to the right of topics indicate sequential lecture numbers.
Hn stands for Harrell Section n. VRn stands for Venables &
Ripley Section n. HTFn stands for optional readings in Hastie,
Tibshirani, and Friedman Section n.
- Introduction (1)
- Course overview and logistics
- Hypothesis testing vs. estimation vs. prediction (H1.1)
- Choice of model (H1.4)
- General methods for multivariable models (H2)
- Nonparametric smoothers (H2.4.6, VR9.1)
- Smoothing splines (VR9.1)
- Regression splines (H2.4.2-2.4.5) (2)
- Modeling interactions (H2.7.2) and tensor splines (HTF5.7)
- Recursive partitioning (H2.5, VR10)
- Multiple Imputation (H3) (3)
- Shrinkage (H4.4-4.5) (4)
- Nonlinear principal components and related methods (H4.7.2)
- Bootstrap for validating models (H5.2.5, VR5.7, 6.6) (5)
- Case study in data reduction and missing value imputation (H8)
- Maximum Likelihood Estimation (H9) (6)
- Three test statistics (H6.3.3)
- Robust covariance matrix estimator (H9.5)
- Correcting variances for clustered or serial data using
sandwich and bootstrap estimators (H9.5)
- Bootstrap simultaneous confidence regions using Tibshirani's
bootstrap bumping (H9.7) (7)
- S-Plus bootcov and rm.boot functions
- Simulations to study coverage of simultaneous bootstrap
confidence regions
- Further use of the log likelihood (H9.8) (8)
- Weighted MLE (H9.9)
- Penalized MLE (H9.10, HTF3.4.3)
- Effective d.f. (H9.10, HTF7.6)
- Tibshirani's lasso (HTF3.4.3, 3.4.5, 10.12.3)
- Ordinal Logistic Models (H13, 14) (9)
- Models
- Using ordinal models and the Cox model for robust rank-based
analysis of continuous response data
- Special residual plots
- Special use of penalized MLE
- Case study
Project: Develop and validate a proportional odds ordinal
logistic model
- Projection-pursuit regression (VR9.2) and MARS (HTF9.4) (10)
- Transform-both-sides Nonparametric Additive Regression Models
(H15, VR9.3) (11)
- Generalized additive models
- ACE
- AVAS
- S-Plus areg.boot function
- Smearing estimator (H15.4)
Project: Develop and interpret a nonparametric additive model
for a continuous response
- Other topics such as cluster analysis, correspondence analysis,
unsupervised association rules (HTF14.2)
Final Project