DSI 5640: Machine Learning I

Instructor

Teaching Assistants

Dates, Time, and Location
  • First meeting: Tue. Jan. 9, 2024; Last meeting: Thu. Apr. 18, 2024
  • Tuesday, Thursday 2:15PM-3:30PM
  • Location: Sony Building Chapel
  • Office hours: By appointment.
  • We will follow the Graduate School Academic Calendar.
  • We will not have classes the week of March 9-17, 2024, for spring break.

Textbook

The main book for this course, listed below, is free to download in PDF format from the book webpage: Hastie, Tibshirani, and Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd edition. Springer. In the course outline and class schedule, the textbook is abbreviated "HTF", often followed by chapter references ("Ch. X-Y") or page references ("pp. X-Y").

Other Resources

  • Machine Learning with PyTorch and Scikit-Learn: Develop machine learning and deep learning models with Python, by Sebastian Raschka, Yuxi (Hayden) Liu, Vahid Mirjalili, and Dmytro Dzhulgakov, Feb 2022: Google Play
  • A more applied book (also free to download) with slides, R code, and video tutorials: http://www-bcf.usc.edu/~gareth/ISL/
  • The Matrix Cookbook (version 15 November 2012): MCB-20121115.pdf

Course Topics

  • Overview of Supervised Learning and Review of Linear Methods: HTF Ch. 2-4
  • Splines and Kernel Methods: HTF Ch. 5-6
  • Model Assessment, Selection, and Inference: HTF Ch. 7-8
  • Bagging and Boosting: HTF Ch. 8, 10, and 15
  • Neural Networks: HTF Ch. 11
  • Unsupervised Learning: HTF Ch. 14

Other information

  • Unless otherwise stated, assigned homework is due in one week. Late homework will be subject to a penalty of 20% for each day late.
  • Students are encouraged to work together on homework problems, but must turn in their own write-ups.
  • Class participation is encouraged.
  • Please bring a laptop to class when classes are held in person.

Grading

  • Homework: 40%
  • Midterm Exam: 30%
  • Final Exam: 30%

Letter Grade | Lowest Score
A+ | 96.5
A | 93.5
A- | 90.0
B+ | 86.5
B | 83.5
B- | 80.0
C | 70.0
F | 0.0
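
As a rough illustration of how the weights and cutoffs above combine, here is a minimal Python sketch. It assumes each component is scored on a 0-100 scale and that no rounding is applied before the lookup; neither detail is stated in this syllabus, and the example scores are hypothetical.

```python
# Weights and lowest scores copied from the Grading section above.
WEIGHTS = {"homework": 0.40, "midterm": 0.30, "final": 0.30}
CUTOFFS = [("A+", 96.5), ("A", 93.5), ("A-", 90.0), ("B+", 86.5),
           ("B", 83.5), ("B-", 80.0), ("C", 70.0), ("F", 0.0)]


def course_score(scores):
    """Weighted average of component scores (assumed to be on a 0-100 scale)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def letter_grade(score):
    """Return the first letter whose lowest score is met (assumes no rounding)."""
    for letter, lowest in CUTOFFS:
        if score >= lowest:
            return letter
    raise ValueError("score must be non-negative")


# Hypothetical component scores, for illustration only.
example = {"homework": 95.0, "midterm": 88.0, "final": 90.0}
total = course_score(example)
print(round(total, 1), letter_grade(total))
```

With these made-up scores the weighted total is 91.4, which falls in the A- band.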

Schedule of Topics

Date | Reading (before class) | Homework | Topic/Content | Presentation
1/9/24 | none | none | Syllabus, introduction | Intro.pdf
1/11/24 | HTF Ch. 1 and Ch. 2.1, 2.2, and 2.3 | Homework 1 | Least-squares, nearest-neighbors | lecture-1.pdf, mixture-data-lin-knn.R, ESL.mixture.rda
1/16/24 | HTF Ch. 2.4 | none | Class cancelled due to poor weather conditions | none
1/18/24 | HTF Ch. 2.4 | none | Decision theory | lecture-2.pdf
1/23/24 | none | none | Loss functions in practice | lecture-2a.pdf, prostate-data-lin.R, prostate.csv
1/25/24 | HTF Ch. 2.7, 2.8, and 2.9 | Homework 2 | Structured regression | lecture-3.pdf, ex-1.R, ex-2.R, ex-3.R
1/30/24 | HTF Ch. 3.1, 3.2, 3.3, 3.4 | none | Linear methods, subset selection, ridge, and lasso | lecture-4a.pdf, linear-regression-examples.R, lecture-5.pdf, lasso-example.R
2/1/24 | none | none | Linear methods, subset selection, ridge, and lasso (cont.) | lecture-5.pdf, lasso-example.R. Suggested supplemental reading: HTF Ch. 3.6, 3.7, 3.8, and 3.9. Suggested supplemental exercises: Ex. 3.12, 3.18
2/6/24 | HTF Ch. 3.5 and 3.6 | none | Linear methods: principal components regression | lecture-6.pdf, pca-regression-example.R
2/8/24 | HTF Ch. 4.1, 4.2, and 4.3 | none | Linear methods: linear discriminant analysis | lecture-8.pdf, simple-LDA-3D.R
2/13/24 | HTF Ch. 5.1 and 5.2 | Homework 3 | Basis expansions: piecewise polynomials & splines | lecture-11.pdf, splines-example.R, mixture-data-complete.R
2/15/24 | HTF Ch. 6.1-6.5 | none | Kernel methods | lecture-13.pdf, mixture-data-knn-local-kde.R, kernel-methods-examples-mcycle.R
2/20/24 | HTF Ch. 7.1, 7.2, 7.3, 7.4 | none | Model assessment: Cp, AIC, BIC | lecture-14.pdf, effective-df-aic-bic-mcycle.R
2/22/24 | HTF Ch. 7.10 | none | Cross validation | lecture-15.pdf, kNN-CV.R, Income2.csv
2/27/24 | HTF Ch. 9.2 | Homework 4 | Classification and Regression Trees | lecture-21.pdf, mixture-data-rpart.R
2/29/24 | HTF Ch. 8.7, 8.8, 8.9 | none | Bagging | lecture-18.pdf, mixture-data-rpart-bagging.R, nonlinear-bagging.html
3/5/24 | HTF Ch. 11.1, 11.2, 11.3, 11.4, 11.5 | none | Introduction to neural networks | lecture-31.pdf, nnet.R
3/7/24 | HTF Ch. 11.1, 11.2, 11.3, 11.4, 11.5 | none | Introduction to neural networks (cont.) | lecture-31.pdf, nnet.R

Homework/Laboratory (other than problems listed in HTF)

Homework assignments should be completed in a GitHub repository using R or Python (unless otherwise noted). Make sure to add the TA(s) as collaborators on your repo. Any reproducible format that renders natively in GitHub is acceptable. In R Markdown, setting the output type to 'github_document' or 'md_document' in the header produces a Markdown (.md) file that renders within GitHub, e.g.:

---
title: "Homework 1"
author: DS Student
date: January 15, 2020
output: github_document
---

Make sure to include the code, output, and any plots in your repo so that readers can understand everything you did.

Resubmissions are only allowed if the initial submission was made on time. A resubmission is due within one week of receiving feedback from the TA, and there is a maximum of two resubmissions for each assignment.

Homework 1

Using an R Markdown or Jupyter notebook and the GitHub workflow described above, implement the following tasks:

  • Use the mixture-data-lin-knn.R file as the basis for this homework. You can copy and paste the R code, write code from scratch, or translate to Python using ChatGPT (or any LLM).
  • Rewrite the functions fit_lc and predict_lc using lm (if using R) and its associated methods. If using Python, you may use any linear regression implementation you like, such as LinearRegression from sklearn.linear_model in scikit-learn.
  • Make the linear classifier more flexible by adding squared terms for x1 and x2 to the linear model.
  • Describe in one or two sentences how this more flexible model affects the bias-variance tradeoff.
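
For orientation, the two models in this homework can be written as follows. This sketch assumes, as in HTF Section 2.3.1, that the class label is coded 0/1, fit by least squares, and thresholded at 0.5. The coefficient names are notation introduced here and are not taken from mixture-data-lin-knn.R.

```latex
\begin{align*}
\text{linear:} \quad & f(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \\
\text{with squared terms:} \quad & f(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 \\
\text{classification rule:} \quad & \hat{G}(x) =
  \begin{cases} 1 & \text{if } \hat{f}(x) > 0.5 \\ 0 & \text{otherwise} \end{cases}
\end{align*}
```

The second model is still linear in its coefficients, so the same least-squares routine can fit it once the squared columns are added to the inputs.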

Homework 2

Implement the following tasks by extending the example R script (prostate-data-lin.R) or translating it to Python:

  • Write functions that implement the L1 loss and tilted absolute loss functions.
  • Create a figure that shows lpsa (x-axis) versus lcavol (y-axis). Add and label (e.g., using the 'legend' function in R) the linear model predictions associated with L2 loss, L1 loss, and tilted absolute value loss for tau = 0.25 and 0.75.
  • Write functions to fit and predict from a simple nonlinear model with three parameters defined by 'beta[1] + beta[2]*exp(-beta[3]*x)'. Hint: make copies of 'fit_lin' and 'predict_lin' and modify them to fit the nonlinear model. Use c(-1.0, 0.0, -0.3) as 'beta_init'.
  • Create a figure that shows lpsa (x-axis) versus lcavol (y-axis). Add and label the nonlinear model predictions associated with L2 loss, L1 loss, and tilted absolute value loss for tau = 0.25 and 0.75.
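
One common definition of these losses, written on the residual y - yhat, is sketched below in Python. The function names are illustrative and need not match those in prostate-data-lin.R, and the small data set is made up. It is there only to show that minimizing the tilted loss over a constant prediction picks out a sample quantile.

```python
def l1_loss(y, yhat):
    """Absolute-error (L1) loss for a single observation."""
    return abs(y - yhat)


def tilted_abs_loss(y, yhat, tau):
    """Tilted absolute loss for 0 < tau < 1.

    Positive residuals are weighted by tau and negative residuals
    by (1 - tau), so tau = 0.5 gives half the L1 loss.
    """
    r = y - yhat
    return tau * r if r > 0 else (tau - 1.0) * r


def best_constant(ys, tau):
    """Observed value that minimizes the total tilted loss as a constant prediction."""
    return min(ys, key=lambda c: sum(tilted_abs_loss(y, c, tau) for y in ys))


ys = [1.0, 2.0, 3.0, 4.0, 10.0]  # made-up data, for illustration only
for tau in (0.25, 0.5, 0.75):
    print(tau, best_constant(ys, tau))
```

For this toy data the selected constants are 2.0, 3.0, and 4.0 for tau = 0.25, 0.5, and 0.75. The tau = 0.5 case is the median, which is unaffected by the outlying value 10.0.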

Homework 3

Implement the following tasks:
  • Use the prostate cancer data.
  • Treat lpsa as the outcome, and use all other variables in the data set as predictors.
  • With the training subset of the prostate data, train a least-squares regression model with all predictors.
  • Use the testing subset to compute the test error (average squared-error loss) using the fitted least-squares regression model.
  • Train a ridge regression model and tune the value of lambda, i.e., for a sequence of lambda values, find the value that approximately minimizes the test error. In R, see glmnet (alpha = 0 selects the ridge penalty); in Python, see Ridge in sklearn.linear_model, where the penalty strength is the alpha argument.
  • Create a figure that shows the training and test error associated with ridge regression as a function of lambda.
  • Create a path diagram of the ridge regression analysis, similar to HTF Figure 3.8.
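
For reference, the ridge criterion from HTF Section 3.4.1 and the test error used for tuning can be written as below. The test-error notation (N_test and the fitted function indexed by lambda) is introduced here for illustration.

```latex
\begin{align*}
\hat{\beta}^{\text{ridge}}(\lambda) &= \arg\min_{\beta} \left\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\} \\
\widehat{\mathrm{Err}}_{\text{test}}(\lambda) &= \frac{1}{N_{\text{test}}} \sum_{i \in \text{test}} \big( y_i - \hat{f}_{\lambda}(x_i) \big)^2
\end{align*}
```

Note that glmnet and scikit-learn scale the squared-error term differently, so a given numeric penalty value is not directly comparable between the two.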

Homework 4

Goal: Understand the overall supervised learning process and test error.

Use the training and testing zip code data to develop a k-nearest neighbor (k-NN) model that classifies the zip code digit (0-9) based on a 16x16 scanned greyscale image of the digit. Data files: Info (zip.info.txt), Training (zip.train.gz), Testing (zip.test.gz).

  1. Use the Nadaraya-Watson method with the k-NN kernel function, or any other function that implements the k-NN method.
  2. Using the zero-one loss function and 5-fold cross validation, estimate the average test error as a function of the tuning parameter 'k' (the number of nearest neighbors) for k from 1 to 20.
  3. Plot the estimated average test error as a function of 'k' with error bars representing its standard error.
  4. Apply the one-standard error rule to select a final value for 'k'.
  5. Fit a final k-NN model using the full training data set and using the selected value for 'k'.
  6. Using the test data set, compute a confusion matrix and an estimate of conditional test error (using the zero-one loss).
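
Steps 3 and 4 can be sketched in Python as follows. This is a minimal illustration, assuming the per-fold zero-one error rates have already been computed; the numbers below are hypothetical and the function name is introduced here. For k-NN, a larger k gives a smoother, less complex fit, so this sketch takes the largest k whose mean error is within one standard error of the minimum.

```python
from math import sqrt
from statistics import mean, stdev


def one_se_rule(fold_errors):
    """Select k by the one-standard-error rule.

    fold_errors maps each candidate k to a list of per-fold error rates.
    Returns the selected k and a dict of (mean, standard error) per k.
    """
    summary = {k: (mean(errs), stdev(errs) / sqrt(len(errs)))
               for k, errs in fold_errors.items()}
    best_k = min(summary, key=lambda k: summary[k][0])
    threshold = summary[best_k][0] + summary[best_k][1]
    # Largest k (least complex k-NN model) whose mean error is within the threshold.
    eligible = [k for k, (avg, _) in summary.items() if avg <= threshold]
    return max(eligible), summary


# Hypothetical 5-fold error rates, not results from the zip code data.
cv_errors = {
    1: [0.10, 0.12, 0.11, 0.09, 0.13],
    3: [0.06, 0.07, 0.05, 0.08, 0.06],
    5: [0.065, 0.07, 0.06, 0.075, 0.065],
    7: [0.09, 0.10, 0.08, 0.11, 0.10],
}
selected_k, summary = one_se_rule(cv_errors)
print(selected_k)
```

In this made-up example, k = 3 has the lowest mean error, but k = 5 is within one standard error of it and is selected. The standard errors in summary are the values that would be drawn as error bars in step 3.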

Links

RStudio/Knitr

Topic revision: r186 - 27 Feb 2024, MattShotwell
 
