CourseDSI5640 < Main < Vanderbilt Biostatistics Wiki

You are here: Vanderbilt Biostatistics Wiki>Main Web>CourseDSI5640 (16 Apr 2024, MattShotwell)Edit Attach

DSI 5640: Machine Learning I

Instructor

Matthew S. Shotwell, PhD
matt.shotwell@vumc.org
11118B: 2525 West End Avenue
615-875-3397
Github: biostatmatt

Teaching Assistants

aleksandra.cvetanovska@vanderbilt.edu; GitHub:Cvetanovska
zhaozhou.lyu@vanderbilt.edu; GitHub: Georgelv1021

Dates, Time, and Location

First meeting: Tue. Jan. 9, 2024; Last meeting: Thu. Apr. 18, 2024
Tuesday, Thursday 2:15PM-3:30PM
Location: Sony Building Chapel
Office hours: By appointment.
We will use the Graduate School Academic Calendar
We will not have classes the week of March 9-17, 2024, for spring break.

Textbook

The main book for this course is listed below, and free to download in PDF format at the book webpage: Hastie, Tibshirani, Friedman. (2009) The elements of statistical learning: data mining, inference and prediction. Springer, 2nd edition.. In the course outline and class schedule, the textbook is abbreviated "HTF", often followed by chapter or page references "Ch. X-Y" or "pp. X-Y", respectively.

Other Resources

Machine Learning with PyTorch and Scikit-Learn: Develop machine learning and deep learning models with Python, by Sebastian Raschka, Yuxi (Hayden) Liu, Vahid Mirjalili, and Dmytro Dzhulgakov, Feb 2022: Google Play
A more applied book (also free to download) with slides, R code, and video tutorials: http://www-bcf.usc.edu/~gareth/ISL/
The Matrix Cookbook (version 15 November 2012): MCB-20121115.pdf

Course Topics

Overview of Supervised Learning and Review of Linear Methods: HTF Ch. 2-4
Splines and Kernel Methods: HTF Ch. 5-6
Model Assessment, Selection, and Inference: HTF Ch. 7-8
Bagging and Boosting: HTF Ch. 8, 10, and 15
Neural Networks: HTF Ch. 11
Unsupervised Learning: HTF Ch. 14

Other information

Unless otherwise stated, assigned homework is due in one week. Late homework will be subject to a penalty of 20% for each day late.
Students are encouraged to work together on homework problems, but must turn in their own write-ups.
Class participation is encouraged.
Please bring a laptop to class, when classes are held in-person.

Grading

Homework: 40%
Midterm Exam: 30%
Final Exam: 30%

Letter Grade	Lowest Score
A+	96.5
A	93.5
A-	90.0
B+	86.5
B	83.5
B-	80.0
C	70.0
F	0.0

Schedule of Topics

Date	Reading (before class)	Homework	Topic/Content	Presentation
1/9/24	none	none	Syllabus, introduction	Intro.pdf
1/11/24	HTF Ch. 1 and Ch. 2.1, 2.2, and 2.3	Homework 1	Least-squares, nearest-neighbors	lecture-1.pdf mixture-data-lin-knn.R ESL.mixture.rda
1/16/24	HTF Ch. 2.4	none	Class cancelled due to poor weather conditions
1/18/24	HTF Ch. 2.4	none	Decision theory	lecture-2.pdf
1/23/24	none	none	Loss functions in practice	lecture-2a.pdf prostate-data-lin.R prostate.csv
1/25/24	HTF Ch. 2.7, 2.8, and 2.9	Homework 2	Structured regression	lecture-3.pdf ex-1.R ex-2.R ex-3.R
1/30/24	HTF Ch. 3.1, 3.2, 3.3, 3.4	none	Linear methods, subset selection, ridge, and lasso	lecture-4a.pdf linear-regression-examples.R lecture-5.pdf lasso-example.R
2/1/24	none	none	Linear methods, subset selection, ridge, and lasso (cont.)	lecture-5.pdf lasso-example.R Suggested supplemental reading: HTF Ch. 3.6, 3.7, 3.8, and 3.9. Suggested supplemental exercises: Ex. 3.12, 3.18
2/6/24	HTF Ch. 3.5 and 3.6	none	Linear methods: principal components regression	lecture-6.pdf pca-regression-example.R
2/8/24	HTF Ch. 4.1, 4.2, and 4.3	none	Linear methods: Linear discriminant analysis	lecture-8.pdf simple-LDA-3D.R
2/13/24	HTF Ch. 5.1 and 5.2	Homework 3	Basis expansions: piecewise polynomials & splines	lecture-11.pdf splines-example.R mixture-data-complete.R
2/15/24	HTF Ch. 6.1-6.5	none	Kernel methods	lecture-13.pdf mixture-data-knn-local-kde.R kernel-methods-examples-mcycle.R
2/20/24	HTF Ch. 7.1, 7.2, 7.3, 7.4	none	Model assessment: Cp, AIC, BIC	lecture-14.pdf effective-df-aic-bic-mcycle.R
2/22/24	HTF Ch. 7.10	none	Cross validation	lecture-15.pdf kNN-CV.R Income2.csv
2/27/24	HTF Ch. 9.2	Homework 4	Classification and Regression Trees	lecture-21.pdf mixture-data-rpart.R
2/29/24	HTF Ch. 8.7, 8.8, 8.9	none	Bagging	lecture-18.pdf mixture-data-rpart-bagging.R nonlinear-bagging.html
3/5/24	HTF Ch. 11.1, 11.2, 11.3, 11.4, 11.5	none	Introduction to Neural networks	lecture-31.pdf nnet.R
3/7/24	HTF Ch. 11.1, 11.2, 11.3, 11.4, 11.5	none	Introduction to Neural networks (cont.)	lecture-31.pdf nnet.R
3/19/24	HTF Ch. 11.1, 11.2, 11.3, 11.4, 11.5	none	Introduction to Neural networks (cont.)	lecture-31.pdf nnet.R
3/21/24	HTF Ch. 15.1, 15.2	none	Random Forest (distribute midterm)	lecture-25.pdf random-forest-example.R
3/26/24	HTF Ch. 10.1	none	Boosting and AdaBoost.M1	lecture-22.pdf boosting-trees.R
3/28/24	HTF Ch. 10.2-10.9	none	Boosting and AdaBoost.M1 (part 2)	lecture-23.pdf
4/2/24	HTF Ch. 10.10, 10.13	none	Boosting and AdaBoost.M1 (part 3)	lecture-24.pdf gradient-boosting-example.R
4/4/24	HTF Ch. 14.5	none	Principal curves and surfaces	lecture-28.pdf principal-curves.R
4/9/24	HTF 14.8	none	Multidimensional scaling	lecture-30.pdf MDS-examples.R
4/11/24	HTF 14.5.3	none	k-means, hierarchical, and spectral clustering	lecture-29.pdf spectral-clustering.R
4/16/24	none	none	Clustering with mixtures	lecture-32.pdf normal-mixture-examples.R

Homework/Laboratory (other than problems listed in HTF)

Homework assignments should be completed in a Github repository using R or Python (unless otherwise noted). Make sure to add the TA(s) as collaborators on your repo. Any reproducible format that renders natively in Github is acceptable. In Rmarkdown, using the 'github_document' or 'md_document' output type in the header will produce a markdown (.md) file that can be rendered within Github, e.g.

---
title: "Homework 1"
author: DS Student
date: January 15, 2020
output: github_document
---

Make sure to list the code, output, and any plots in your repo so that readers can understand everything you did.

Resubmissions are only allowed if the initial submission was made on time. A resubmission is due within one week of receiving feedback from TA and there is a maximum of 2 resubmissions for each assignment.

Homework 1

Using the RMarkdown/Jupyter notebook and Github mechanism, implement the following tasks:

Use the mixture-data-lin-knn.R file as the basis for this homework. You can copy and paste the R code, write code from scratch, or translate to Python using ChatGPT (or any LLM).
Re-write the functions fit_lc and predict_lc using lm (if using R) and the associated methods. If using Python, you may use any linear regression function you like, such as sklearn.linear_model from scikit-learn.
Make the linear classifier more flexible, by adding squared terms for x1 and x2 to the linear model
Describe in one or two sentences how this more flexible model affects the bias-variance tradeoff

Homework 2

Implement the following tasks by extending the example R script or translating to Python ( prostate-data-lin.R):

Write functions that implement the L1 loss and tilted absolute loss functions.
Create a figure that shows lpsa (x-axis) versus lcavol (y-axis). Add and label (e.g., using the 'legend' function in R) the linear model predictors associated with L2 loss, L1 loss, and tilted absolute value loss for tau = 0.25 and 0.75.
Write functions to fit and predict from a simple nonlinear model with three parameters defined by 'beta[1] + beta[2]*exp(-beta[3]*x)'. Hint: make copies of 'fit_lin' and 'predict_lin' and modify them to fit the nonlinear model. Use c(-1.0, 0.0, -0.3) as 'beta_init'.
Create a figure that shows lpsa (x-axis) versus lcavol (y-axis). Add and label the nonlinear model predictions associated with L2 loss, L1 loss, and tilted absolute value loss for tau = 0.25 and 0.75.

Homework 3

Implement the following tasks:

Use the prostate cancer data.
Treat lpsa as the outcome, and use all other variables in the data set as predictors.
With the training subset of the prostate data, train a least-squares regression model with all predictors.
Use the testing subset to compute the test error (average squared-error loss) using the fitted least-squares regression model.
Train a ridge regression model (see glmnet in R or sklearn.linear_model and its alpha argument in Python) and tune the value of lambda, i.e., for a sequence of lambda find the value of lambda that approximately minimizes the test error.
Create a figure that shows the training and test error associated with ridge regression as a function of lambda.
Create a path diagram of the ridge regression analysis, similar to HTF Figure 3.8

Homework 4

Goal: Understand the overall supervised learning process and test error.

Use the training and testing zip code data to develop a k-nearest neighbor (k-NN) model to classify zip code digit [0-9] based on a 16x16 scanned greyscale image of the digit: Info, Training, Testing

Use the Nadaraya-Watson method with the k-NN kernel function, or any other function that implements the k-NN method.
Using the zero-one loss function and 5-fold cross validation, estimate the average test error as a function of the tuning parameter 'k' - the number of nearest neighbors - ranging the value of k from 1 to 20.
Plot the estimated average test error as a function of 'k' with error bars representing its standard error.
Apply the one-standard error rule to select a final value for 'k'.
Fit a final k-NN model using the full training data set and using the selected value for 'k'.
Using the test data set, compute a confusion matrix and an estimate of conditional test error (using the zero-one loss).

Links

https://lagunita.stanford.edu/courses/HumanitiesSciences/StatLearning/Winter2016/about

RStudio/Knitr

I	Attachment	Action	Size	Date	Who	Comment
pdf	A.I._Is_Learning_to_Read_Mammograms_NYTimes.pdf	manage	507 K	27 Jan 2020 - 14:59	NathanTJames
pdf	Artificial_Intelligence_Makes_Bad_Medicine_Even_Worse_WIRED.pdf	manage	58 K	27 Jan 2020 - 14:59	NathanTJames
rda	ESL.mixture.rda	manage	163 K	11 Jan 2024 - 12:17	MattShotwell
csv	Income2.csv	manage	1 K	16 Mar 2021 - 08:42	MattShotwell
pdf	Intro.pdf	manage	783 K	09 Jan 2024 - 12:28	MattShotwell
html	LA_Examples_DS_Bootcamp.html	manage	2 MB	05 Feb 2020 - 08:24	MattShotwell
R	MDS-examples.R	manage	1 K	15 Apr 2020 - 10:07	MattShotwell
pdf	ai_mammography.pdf	manage	997 K	29 Jan 2020 - 11:56	NathanTJames
R	boosting-trees.R	manage	3 K	27 Mar 2020 - 10:35	MattShotwell
R	effective-df-aic-bic-mcycle.R	manage	4 K	13 Mar 2020 - 08:11	MattShotwell
R	gradient-boosting-example.R	manage	8 K	29 Mar 2022 - 14:36	MattShotwell
R	hclust.R	manage	1 K	18 Apr 2023 - 14:22	MattShotwell
R	kNN-CV.R	manage	4 K	16 Mar 2021 - 08:42	MattShotwell
R	kernel-manipulate-example.R	manage	1 K	15 Jan 2020 - 10:18	MattShotwell
R	kernel-methods-examples-mcycle.R	manage	3 K	16 Feb 2020 - 21:22	MattShotwell
R	lasso-example.R	manage	5 K	23 Feb 2021 - 08:43	MattShotwell
pdf	lecture-1.pdf	manage	425 K	11 Jan 2024 - 12:29	MattShotwell
pdf	lecture-10.pdf	manage	170 K	10 Feb 2020 - 10:55	MattShotwell
pdf	lecture-11.pdf	manage	285 K	12 Feb 2020 - 08:43	MattShotwell
pdf	lecture-13.pdf	manage	363 K	16 Feb 2020 - 13:04	MattShotwell
pdf	lecture-14.pdf	manage	382 K	09 Mar 2020 - 10:23	MattShotwell
pdf	lecture-15.pdf	manage	354 K	16 Mar 2021 - 08:41	MattShotwell
pdf	lecture-18.pdf	manage	191 K	23 Mar 2020 - 12:44	MattShotwell
pdf	lecture-2.pdf	manage	243 K	18 Jan 2024 - 11:27	MattShotwell
pdf	lecture-21.pdf	manage	456 K	23 Mar 2020 - 12:44	MattShotwell
pdf	lecture-22.pdf	manage	499 K	27 Mar 2020 - 10:35	MattShotwell
pdf	lecture-23.pdf	manage	292 K	30 Mar 2020 - 09:52	MattShotwell
pdf	lecture-24.pdf	manage	494 K	01 Apr 2020 - 09:47	MattShotwell
pdf	lecture-25.pdf	manage	410 K	25 Mar 2020 - 12:55	MattShotwell
pdf	lecture-28.pdf	manage	326 K	10 Apr 2020 - 12:51	MattShotwell
pdf	lecture-29.pdf	manage	955 K	13 Apr 2020 - 09:36	MattShotwell
pdf	lecture-2a.pdf	manage	97 K	13 Jan 2020 - 12:07	MattShotwell
pdf	lecture-3.pdf	manage	569 K	15 Jan 2020 - 10:17	MattShotwell
pdf	lecture-30.pdf	manage	626 K	15 Apr 2020 - 10:06	MattShotwell
pdf	lecture-31.pdf	manage	3 MB	08 Apr 2020 - 10:18	MattShotwell
pdf	lecture-32.pdf	manage	165 K	17 Apr 2020 - 10:05	MattShotwell
pdf	lecture-4a.pdf	manage	152 K	17 Jan 2020 - 10:09	MattShotwell
pdf	lecture-5.pdf	manage	578 K	22 Jan 2020 - 10:16	MattShotwell
pdf	lecture-6.pdf	manage	97 K	27 Jan 2020 - 11:29	RyanJarrett
pdf	lecture-8.pdf	manage	596 K	31 Jan 2020 - 10:16	MattShotwell
pdf	lecture-9.pdf	manage	1 MB	05 Feb 2020 - 12:54	MattShotwell
R	linear-regression-examples.R	manage	5 K	15 Feb 2021 - 20:34	MattShotwell
R	linear-spline-manipulate-example.R	manage	1 K	15 Jan 2020 - 10:18	MattShotwell
Rmd	mLR-bootstrap.Rmd	manage	2 K	12 Feb 2020 - 09:40	MattShotwell
pdf	medExtractR_lecture.pdf	manage	5 MB	27 Feb 2020 - 14:24	HannahWeeks	medExtractR_lecture
pdf	midterm-review.pdf	manage	563 K	26 Feb 2020 - 10:33	MattShotwell
R	mixture-data-complete.R	manage	5 K	13 Feb 2024 - 13:58	MattShotwell
R	mixture-data-knn-local-kde.R	manage	8 K	09 Mar 2021 - 10:28	MattShotwell
R	mixture-data-knn-local.R	manage	6 K	16 Feb 2020 - 13:05	MattShotwell
R	mixture-data-lin-knn.R	manage	4 K	11 Jan 2024 - 12:19	MattShotwell
R	mixture-data-rpart-bagging.R	manage	3 K	23 Mar 2020 - 07:45	MattShotwell
R	mixture-data-rpart.R	manage	2 K	20 Mar 2020 - 09:11	MattShotwell
html	multivariate-KDE.html	manage	862 K	24 Feb 2020 - 10:07	MattShotwell
R	nnet.R	manage	2 K	05 Apr 2020 - 19:43	MattShotwell
html	nonlinear-bagging.html	manage	655 K	23 Mar 2020 - 10:46	MattShotwell
R	normal-mixture-examples.R	manage	1 K	17 Apr 2020 - 10:04	MattShotwell
R	pca-regression-example.R	manage	3 K	07 Feb 2023 - 14:24	MattShotwell
pdf	presentation.pdf	manage	170 K	10 Feb 2020 - 10:52	MattShotwell
R	principal-curves.R	manage	3 K	10 Apr 2020 - 10:11	MattShotwell
R	prostate-data-lin.R	manage	2 K	25 Jan 2024 - 21:13	MattShotwell
csv	prostate.csv	manage	6 K	23 Jan 2024 - 13:40	MattShotwell
R	random-forest-example.R	manage	1 K	30 Mar 2021 - 10:54	MattShotwell
R	simple-LDA-3D.R	manage	3 K	03 Feb 2020 - 09:01	MattShotwell
R	smooth-splines-manipulate-example.R	manage	1020 bytes	15 Jan 2020 - 10:17	MattShotwell
R	spectral-clustering.R	manage	3 K	13 Apr 2020 - 09:35	MattShotwell
R	sphered-and-canonical-inputs.R	manage	6 K	07 Feb 2020 - 09:39	MattShotwell
R	splines-example.R	manage	3 K	13 Feb 2024 - 13:59	MattShotwell
Rmd	vowel-data-LDA.Rmd	manage	4 K	07 Feb 2020 - 13:08	MattShotwell
Rmd	vowel-data-LR.Rmd	manage	3 K	10 Feb 2020 - 12:18	MattShotwell
txt	zip.info.txt	manage	1 K	27 Feb 2024 - 10:48	MattShotwell
gz	zip.test.gz	manage	428 K	27 Feb 2024 - 10:47	MattShotwell
gz	zip.train.gz	manage	1 MB	27 Feb 2024 - 10:47	MattShotwell

Topic revision: r196 - 16 Apr 2024, MattShotwell

Main

Department Home Page

Biostatistics Graduate Program

Vanderbilt University Medical Center

Biostatistics Webs
- Archive
- Main
- Sandbox
- System

Copyright &© 2013-2022 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Vanderbilt Biostatistics Wiki? Send feedback