
DSI 5640: Modeling & Machine Learning I

Instructor

Teaching Assistants

  • Shiyao Li
  • Jiabing Ruan

Dates, Time, and Location

  • First meeting: Tue. Jan. 26, 2021; Last meeting: Thu. Apr. 29, 2021
  • Tuesday, Thursday 11:10AM-12:25PM
  • Sony Building 2001-A, and virtually using Zoom via Brightspace
  • Office hours: by appointment initially; a regular schedule will be determined.
  • We will use the Graduate School Academic Calendar.

Textbook

The main book for this course is listed below and is free to download in PDF format from the book webpage: Hastie, Tibshirani, and Friedman (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edition, Springer. In the course outline and class schedule, the textbook is abbreviated "HTF", often followed by chapter or page references ("Ch. X-Y" or "pp. X-Y", respectively).

Other Resources

Course Topics

  • Overview of Supervised Learning and Review of Linear Methods: HTF Ch. 2-4
  • Splines and Kernel Methods: HTF Ch. 5-6
  • Model Assessment, Selection, and Inference: HTF Ch. 7-8
  • Neural Networks: HTF Ch. 11
  • Support Vector Machines: HTF Ch. 12
  • Unsupervised Learning: HTF Ch. 14

Other information

  • Unless otherwise stated, homework is due one week after it is assigned. Late homework is subject to a penalty of 20% for each day late.
  • Students are encouraged to work together on homework problems, but must turn in their own write-ups.
  • Class participation is encouraged.
  • Please bring a laptop when classes are held in person.

Grading

  • Homework: 40%
  • Midterm Exam: 30%
  • Final Exam: 30%

Letter Grade | Lowest Score
A+ | 96.5
A | 93.5
A- | 90.0
B+ | 86.5
B | 83.5
B- | 80.0
C | 70.0
F | 0.0

Schedule of Topics

Date | Reading (before class) | Homework | Topic/Content | Presentation
Tue. 1/26 | none | none | Syllabus, introduction | Intro.pdf
Thu. 1/28 | HTF Ch. 1; Ch. 2.1-2.3 | none | Least-squares, nearest-neighbors | lecture-1.pdf, mixture-data-lin-knn.R
Tue. 2/2 | none | See below: Tue. 2/2 | Least-squares, nearest-neighbors (code) | mixture-data-lin-knn.R
Thu. 2/4 | HTF Ch. 2.4 | none | Decision theory | lecture-2.pdf
Tue. 2/9 | none | See below: Tue. 2/9 | Loss functions in practice | lecture-2a.pdf, prostate-data-lin.R
Thu. 2/11 | HTF Ch. 2.7-2.9 | none | Structured regression | lecture-3.pdf, ex-1.R, ex-2.R, ex-3.R
Tue. 2/16 | HTF Ch. 3.1-3.4 | none | Linear methods, subset selection, ridge, and lasso | lecture-4a.pdf, linear-regression-examples.R, lecture-5.pdf, lasso-example.R
Thu. 2/18 | none | See below: Thu. 2/18 | No class; reading day focused on linear methods for regression. Suggested supplemental reading: Introduction to Statistical Learning, Ch. 3 and laboratory (Section 3.6). | none
Tue. 2/23 | none | none | Linear methods, subset selection, ridge, and lasso (cont.) | lecture-5.pdf, lasso-example.R
Thu. 2/25 | HTF Ch. 3.5 and 3.6 | none | Linear methods: principal components regression | lecture-6.pdf, pca-regression-example.R
Tue. 3/2 | HTF Ch. 4.1-4.3 | See below: Tue. 3/2 | Linear methods: linear discriminant analysis | lecture-8.pdf, simple-LDA-3D.R
Thu. 3/4 | HTF Ch. 5.1 and 5.2 | none | Basis expansions: piecewise polynomials & splines | lecture-11.pdf, splines-example.R, mixture-data-complete.R
Tue. 3/9 | HTF Ch. 6.1-6.5 | none | Kernel methods | lecture-13.pdf, mixture-data-knn-local-kde.R, kernel-methods-examples-mcycle.R
Thu. 3/11 | HTF Ch. 7.1-7.4 | See below: Thu. 3/11 | Model assessment: Cp, AIC, BIC | lecture-14.pdf, effective-df-aic-bic-mcycle.R
Tue. 3/16 | HTF Ch. 7.10 | none | Cross-validation | lecture-15.pdf, kNN-CV.R, Income2.csv
Thu. 3/18 | none | none | Midterm review | none
Tue. 3/23 | HTF Ch. 9.2 | none | Classification and regression trees | lecture-21.pdf, mixture-data-rpart.R
Thu. 3/25 | HTF Ch. 8.7-8.9 | none | Bagging | lecture-18.pdf, mixture-data-rpart-bagging.R, nonlinear-bagging.html
Tue. 3/30 | HTF Ch. 15.1 and 15.2 | See below: Tue. 3/30 | Random forests | lecture-25.pdf, random-forest-example.R
Thu. 4/1 | HTF Ch. 10.1 | none | Boosting and AdaBoost.M1 (part 1) | lecture-22.pdf, boosting-trees.R
Tue. 4/6 | HTF Ch. 10.2-10.9 | Work through this GBM tutorial | Boosting and AdaBoost.M1 (part 2) | lecture-23.pdf
Thu. 4/8 | HTF Ch. 10.10 and 10.13 | none | Boosting and AdaBoost.M1 (part 3) | lecture-24.pdf, gradient-boosting-example.R
Tue. 4/13 | HTF Ch. 10.10 and 10.13 | none | Boosting and AdaBoost.M1 (part 3; cont.) | lecture-24.pdf, gradient-boosting-example.R
Thu. 4/15 | HTF Ch. 11.1-11.5 | none | Introduction to neural networks | lecture-31.pdf, nnet.R
Tue. 4/20 | HTF Ch. 11.1-11.5 | See below: Thu. 4/15 | Introduction to neural networks (cont.) | lecture-31.pdf, nnet.R
Thu. 4/22 | HTF Ch. 11.1-11.5 | none | Introduction to neural networks (cont.) | lecture-31.pdf, nnet.R

Homework/Laboratory (other than problems listed in HTF)

Homework assignments should be completed in a GitHub repository using the R language (unless otherwise noted). Make sure to add the instructor and TAs as collaborators on your repo. Any reproducible format that renders natively in GitHub is acceptable. In R Markdown, using the 'github_document' or 'md_document' output type in the header will produce a markdown (.md) file that renders within GitHub, e.g.

---
title: "Homework 1"
author: DS Student
date: January 15, 2020
output: github_document
---

Make sure to include the raw code (.Rmd), the rendered file (.md), and any plots in your repo. Jupyter notebooks (.ipynb) using the R language are also acceptable.

Resubmissions are allowed only if the initial submission was made on time. A resubmission is due within one week of receiving feedback from the TA, and there is a maximum of two resubmissions per assignment.

Tue. 2/2

Using the R Markdown/knitr/GitHub mechanism, implement the following tasks by extending the example R script mixture-data-lin-knn.R:

  • Paste the code from the mixture-data-lin-knn.R file into the homework template knitr document.
  • Read the help file for R's built-in linear regression function lm.
  • Re-write the functions fit_lc and predict_lc using lm, and the associated predict method for lm objects.
  • Consider making the linear classifier more flexible by adding squared terms for x1 and x2 to the linear model (a hedged sketch follows this list).
  • Describe how this more flexible model affects the bias-variance tradeoff.
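
Below is a minimal sketch of the lm-based re-write, assuming (as in mixture-data-lin-knn.R) that fit_lc takes the outcome y and a two-column feature matrix x, and that predict_lc takes the fitted object and new feature values; adapt the argument names to match the script's actual interface:

## Sketch only: assumes x is a two-column matrix (x1, x2) and y is numeric 0/1,
## as in the mixture data example
fit_lc <- function(y, x) {
  dat <- data.frame(y = y, x1 = x[, 1], x2 = x[, 2])
  ## squared terms make the linear classifier more flexible
  lm(y ~ x1 + x2 + I(x1^2) + I(x2^2), data = dat)
}

predict_lc <- function(fit, x) {
  dat <- data.frame(x1 = x[, 1], x2 = x[, 2])
  predict(fit, newdata = dat)  ## predict method for lm objects
}

The squared terms reduce bias (the decision boundary may curve) at the cost of increased variance.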

Tue. 2/9

Using the R Markdown/knitr/GitHub mechanism, implement the following tasks by extending the example R script prostate-data-lin.R:

  • Write functions that implement the L1 loss and tilted absolute loss functions (a hedged sketch follows this list).
  • Create a figure that shows lpsa (x-axis) versus lcavol (y-axis). Add and label (using the 'legend' function) the linear model predictions associated with L2 loss, L1 loss, and tilted absolute loss for tau = 0.25 and 0.75.
  • Write functions to fit and predict from a simple nonlinear model with three parameters, defined by 'beta[1] + beta[2]*exp(-beta[3]*x)'. Hint: make copies of 'fit_lin' and 'predict_lin' and modify them to fit the nonlinear model. Use c(-1.0, 0.0, -0.3) as 'beta_init'.
  • Create a figure that shows lpsa (x-axis) versus lcavol (y-axis). Add and label (using the 'legend' function) the nonlinear model predictions associated with L2 loss, L1 loss, and tilted absolute loss for tau = 0.25 and 0.75.
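
A minimal sketch of the two loss functions; the function names below are placeholders, chosen to parallel the L2 loss function in prostate-data-lin.R:

## L1 (absolute error) loss
L1_loss <- function(y, yhat)
  abs(y - yhat)

## tilted absolute ("check") loss with quantile parameter tau in (0, 1);
## tau = 0.5 recovers one half of the L1 loss
tilted_abs_loss <- function(y, yhat, tau) {
  r <- y - yhat
  ifelse(r > 0, tau * r, (tau - 1) * r)
}

These return pointwise losses; average them over observations to form the training objective, as with the L2 loss.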

Thu. 2/18

Using the R Markdown/knitr/GitHub mechanism, implement the following tasks:
  • Use the prostate cancer data.
  • Use the cor function to reproduce the correlations listed in HTF Table 3.1, page 50.
  • Treat lcavol as the outcome, and use all other variables in the data set as predictors.
  • With the training subset of the prostate data, train a least-squares regression model with all predictors using the lm function.
  • Use the testing subset to compute the test error (average squared-error loss) using the fitted least-squares regression model.
  • Train a ridge regression model using the glmnet function, and tune the value of lambda (i.e., use guess-and-check to find the value of lambda that approximately minimizes the test error); a hedged sketch follows this list.
  • Create a figure that shows the training and test error associated with ridge regression as a function of lambda.
  • Create a path diagram of the ridge regression analysis, similar to HTF Figure 3.8.
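
A minimal sketch of the glmnet portion, assuming 'train' and 'test' are data frames containing lcavol and the predictor columns (that split is an assumption; construct it from the prostate data's 'train' indicator):

## Sketch only: ridge regression over a grid of lambda values
library(glmnet)

x_train <- as.matrix(train[, setdiff(names(train), "lcavol")])
x_test  <- as.matrix(test[, setdiff(names(test), "lcavol")])

lam <- 10^seq(-3, 1, length.out = 50)                          ## candidate lambdas
fit <- glmnet(x_train, train$lcavol, alpha = 0, lambda = lam)  ## alpha = 0 -> ridge

pred <- predict(fit, newx = x_test)                  ## one column per lambda
test_err <- colMeans((test$lcavol - pred)^2)         ## average squared-error loss

plot(log10(fit$lambda), test_err, type = "l",
     xlab = "log10(lambda)", ylab = "test error")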

Tue. 3/2

Using the R Markdown/knitr/GitHub mechanism, complete the following exercises from Chapter 4, Section 4.7 (beginning p. 168) of the ISLR book (https://www.statlearning.com/):
  • Exercise 4: "When the number of features p is large, there tends to be a deterioration in the performance of KNN and other local approaches that perform prediction using only observations..." Please type your solutions within your R Markdown document. No R coding is required for this exercise.
  • Exercise 10: "This question should be answered using the Weekly data set, which is part of the ISLR package. This data is similar..." This exercise requires R coding; a getting-started sketch follows this list.
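
A hedged getting-started sketch for Exercise 10 (loading the data and one logistic regression fit; not a complete solution):

## Sketch only: Weekly market data from the ISLR package
library(ISLR)

summary(Weekly)   ## numerical summaries of the weekly market data

## logistic regression of Direction on the five lag variables plus Volume
fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
           data = Weekly, family = binomial)
summary(fit)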

Thu. 3/11

Goal: Understand and implement various ways to approximate test error.

In the ISLR book, read Section 6.1.3, "Choosing the Optimal Model", and Section 5.1, "Cross-Validation". Extend and convert the attached effective-df-aic-bic-mcycle.R script into an R Markdown file that accomplishes the following tasks.

  1. Randomly split the mcycle data into training (75%) and validation (25%) subsets.
  2. Using the mcycle data, consider predicting the mean acceleration as a function of time. Use the Nadaraya-Watson method with the k-NN kernel function to create a series of prediction models by varying the tuning parameter over a sequence of values. (hint: the script already implements this)
  3. With the squared-error loss function, compute and plot the training error, AIC, BIC, and validation error (using the validation data) as functions of the tuning parameter.
  4. For each value of the tuning parameter, perform 5-fold cross-validation using the combined training and validation data. This results in 5 estimates of test error per tuning parameter value.
  5. Plot the CV-estimated test error (average of the five estimates from each fold) as a function of the tuning parameter. Add vertical line segments to the figure (using the segments function in R) that represent one “standard error” of the CV-estimated test error (standard deviation of the five estimates from each fold). A hedged sketch of steps 4-5 follows this list.
  6. Interpret the resulting figures and select a suitable value for the tuning parameter.
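
Below is a minimal sketch of steps 4-5 under stated assumptions: it uses a toy k-NN mean predictor as a stand-in for the script's Nadaraya-Watson implementation, and loads mcycle from the MASS package; substitute the routines from effective-df-aic-bic-mcycle.R in your actual homework.

## Sketch only: 5-fold CV over the k-NN tuning parameter
library(MASS)                       ## for the mcycle data
x <- mcycle$times
y <- mcycle$accel

## toy k-NN mean predictor (stand-in for the script's NW/k-NN kernel method)
knn_pred <- function(k, x_tr, y_tr, x_te)
  sapply(x_te, function(x0) mean(y_tr[order(abs(x_tr - x0))[1:k]]))

set.seed(42)
k_seq  <- 1:20
folds  <- sample(rep(1:5, length.out = length(y)))
cv_err <- matrix(NA, nrow = 5, ncol = length(k_seq))

for (f in 1:5) {
  tr <- folds != f
  for (j in seq_along(k_seq)) {
    yhat <- knn_pred(k_seq[j], x[tr], y[tr], x[!tr])
    cv_err[f, j] <- mean((y[!tr] - yhat)^2)    ## squared-error loss
  }
}

cv_mean <- colMeans(cv_err)
cv_sd   <- apply(cv_err, 2, sd)                ## SD of the five fold estimates

plot(k_seq, cv_mean, type = "b", xlab = "k (tuning parameter)",
     ylab = "CV-estimated test error")
segments(k_seq, cv_mean - cv_sd, k_seq, cv_mean + cv_sd)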

Tue. 3/30

Goal: Understand and implement a random forest classifier.

Using the “vowel.train” data and the “randomForest” function in the R package “randomForest”, develop a random forest classifier for the vowel data by doing the following:

  1. Convert the response variable in the “vowel.train” data frame to a factor variable prior to training, so that “randomForest” does classification rather than regression.
  2. Review the documentation for the “randomForest” function.
  3. Fit the random forest model to the vowel data using all of the features and the default values of the tuning parameters.
  4. Use 5-fold CV to tune the model by performing a grid search over the following tuning parameters: 1) the number of variables randomly sampled as candidates at each split (consider values 3, 4, and 5), and 2) the minimum size of terminal nodes (consider the sequence 1, 5, 10, 20, 40, and 80). A hedged sketch follows this list.
  5. With the tuned model, make predictions using the majority vote method, and compute the misclassification rate using the ‘vowel.test’ data.
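
A minimal sketch of the grid search in steps 1-4, assuming vowel.train is already loaded as a data frame with response column y (the data source, e.g. the ElemStatLearn package or the ESL book website, is an assumption):

## Sketch only: 5-fold CV grid search for a random forest classifier
library(randomForest)

vowel.train$y <- factor(vowel.train$y)    ## classification, not regression

set.seed(1)
folds <- sample(rep(1:5, length.out = nrow(vowel.train)))
grid  <- expand.grid(mtry = 3:5, nodesize = c(1, 5, 10, 20, 40, 80))
grid$cv_err <- NA

for (i in seq_len(nrow(grid))) {
  err <- numeric(5)
  for (f in 1:5) {
    fit <- randomForest(y ~ ., data = vowel.train[folds != f, ],
                        mtry = grid$mtry[i], nodesize = grid$nodesize[i])
    pred <- predict(fit, vowel.train[folds == f, ])   ## majority vote
    err[f] <- mean(pred != vowel.train$y[folds == f])
  }
  grid$cv_err[i] <- mean(err)
}

grid[which.min(grid$cv_err), ]   ## tuned mtry and nodesize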

Thu. 4/15

Goal: Get started using Keras to construct simple neural networks

Due: Tuesday, April 27.

  1. Work through the "Image Classification" tutorial on the RStudio Keras website.
  2. Use the Keras library to re-implement the simple neural network discussed during lecture for the mixture data (see nnet.R). Use a single, fully connected hidden layer with 10 nodes. A hedged sketch follows this list.
  3. Create a figure to illustrate that the predictions are (or are not) similar using the 'nnet' function versus the Keras model.
  4. (optional extra credit) Convert the neural network described in the "Image Classification" tutorial to a network that is similar to one of the convolutional networks described during lecture on 4/15 (i.e., Net-3, Net-4, or Net-5) and also described in the ESL book, Section 11.7. See the ConvNet tutorial on the RStudio Keras website.
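
A minimal sketch of step 2, assuming the mixture data is loaded as in nnet.R with x a two-column feature matrix and y a 0/1 class vector (that layout is an assumption; match it to the script):

## Sketch only: single 10-node, fully connected hidden layer in Keras
library(keras)

model <- keras_model_sequential() %>%
  layer_dense(units = 10, activation = "sigmoid", input_shape = 2) %>%
  layer_dense(units = 1, activation = "sigmoid")   ## binary output

model %>% compile(loss = "binary_crossentropy",
                  optimizer = "rmsprop", metrics = "accuracy")

model %>% fit(x, y, epochs = 100, verbose = 0)

## predicted class probabilities, e.g., over a plotting grid of (x1, x2):
## p_hat <- predict(model, grid_matrix)

Compare these predictions against those from the 'nnet' fit to produce the figure in step 3.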

Links

RStudio/Knitr

Topic attachments
Attachment | Size | Date | Who | Comment
A.I._Is_Learning_to_Read_Mammograms_NYTimes.pdf | 507.4 K | 27 Jan 2020 - 14:59 | NathanTJames
Artificial_Intelligence_Makes_Bad_Medicine_Even_Worse_WIRED.pdf | 58.9 K | 27 Jan 2020 - 14:59 | NathanTJames
Income2.csv | 1.6 K | 16 Mar 2021 - 08:42 | MattShotwell
Intro.pdf | 781.9 K | 06 Jan 2020 - 08:15 | MattShotwell
LA_Examples_DS_Bootcamp.html | 2374.0 K | 05 Feb 2020 - 08:24 | MattShotwell
MDS-examples.R | 2.0 K | 15 Apr 2020 - 10:07 | MattShotwell
ai_mammography.pdf | 997.4 K | 29 Jan 2020 - 11:56 | NathanTJames
boosting-trees.R | 3.3 K | 27 Mar 2020 - 10:35 | MattShotwell
effective-df-aic-bic-mcycle.R | 4.3 K | 13 Mar 2020 - 08:11 | MattShotwell
gradient-boosting-example.R | 8.5 K | 08 Apr 2021 - 11:02 | MattShotwell
kNN-CV.R | 4.0 K | 16 Mar 2021 - 08:42 | MattShotwell
kernel-manipulate-example.R | 1.2 K | 15 Jan 2020 - 10:18 | MattShotwell
kernel-methods-examples-mcycle.R | 3.6 K | 16 Feb 2020 - 21:22 | MattShotwell
lasso-example.R | 5.4 K | 23 Feb 2021 - 08:43 | MattShotwell
lecture-1.pdf | 414.8 K | 08 Jan 2020 - 12:42 | MattShotwell
lecture-10.pdf | 170.9 K | 10 Feb 2020 - 10:55 | MattShotwell
lecture-11.pdf | 285.1 K | 12 Feb 2020 - 08:43 | MattShotwell
lecture-13.pdf | 363.7 K | 16 Feb 2020 - 13:04 | MattShotwell
lecture-14.pdf | 382.7 K | 09 Mar 2020 - 10:23 | MattShotwell
lecture-15.pdf | 354.6 K | 16 Mar 2021 - 08:41 | MattShotwell
lecture-18.pdf | 191.9 K | 23 Mar 2020 - 12:44 | MattShotwell
lecture-2.pdf | 243.4 K | 10 Jan 2020 - 09:36 | MattShotwell
lecture-21.pdf | 456.8 K | 23 Mar 2020 - 12:44 | MattShotwell
lecture-22.pdf | 499.9 K | 27 Mar 2020 - 10:35 | MattShotwell
lecture-23.pdf | 292.3 K | 30 Mar 2020 - 09:52 | MattShotwell
lecture-24.pdf | 494.2 K | 01 Apr 2020 - 09:47 | MattShotwell
lecture-25.pdf | 410.4 K | 25 Mar 2020 - 12:55 | MattShotwell
lecture-28.pdf | 326.9 K | 10 Apr 2020 - 12:51 | MattShotwell
lecture-29.pdf | 955.6 K | 13 Apr 2020 - 09:36 | MattShotwell
lecture-2a.pdf | 97.7 K | 13 Jan 2020 - 12:07 | MattShotwell
lecture-3.pdf | 569.0 K | 15 Jan 2020 - 10:17 | MattShotwell
lecture-30.pdf | 626.1 K | 15 Apr 2020 - 10:06 | MattShotwell
lecture-31.pdf | 4059.4 K | 08 Apr 2020 - 10:18 | MattShotwell
lecture-32.pdf | 165.5 K | 17 Apr 2020 - 10:05 | MattShotwell
lecture-4a.pdf | 152.8 K | 17 Jan 2020 - 10:09 | MattShotwell
lecture-5.pdf | 578.2 K | 22 Jan 2020 - 10:16 | MattShotwell
lecture-6.pdf | 97.9 K | 27 Jan 2020 - 11:29 | RyanJarrett
lecture-8.pdf | 596.0 K | 31 Jan 2020 - 10:16 | MattShotwell
lecture-9.pdf | 1199.4 K | 05 Feb 2020 - 12:54 | MattShotwell
linear-regression-examples.R | 5.2 K | 15 Feb 2021 - 20:34 | MattShotwell
linear-spline-manipulate-example.R | 1.2 K | 15 Jan 2020 - 10:18 | MattShotwell
mLR-bootstrap.Rmd | 2.6 K | 12 Feb 2020 - 09:40 | MattShotwell
medExtractR_lecture.pdf | 5878.6 K | 27 Feb 2020 - 14:24 | HannahWeeks | medExtractR_lecture
midterm-review.pdf | 563.7 K | 26 Feb 2020 - 10:33 | MattShotwell
mixture-data-complete.R | 5.8 K | 14 Feb 2020 - 12:32 | MattShotwell
mixture-data-knn-local-kde.R | 8.4 K | 09 Mar 2021 - 10:28 | MattShotwell
mixture-data-knn-local.R | 6.2 K | 16 Feb 2020 - 13:05 | MattShotwell
mixture-data-lin-knn.R | 4.0 K | 28 Jan 2021 - 10:31 | MattShotwell
mixture-data-rpart-bagging.R | 3.7 K | 23 Mar 2020 - 07:45 | MattShotwell
mixture-data-rpart.R | 2.5 K | 20 Mar 2020 - 09:11 | MattShotwell
multivariate-KDE.html | 862.9 K | 24 Feb 2020 - 10:07 | MattShotwell
nnet.R | 3.0 K | 05 Apr 2020 - 19:43 | MattShotwell
nonlinear-bagging.html | 656.0 K | 23 Mar 2020 - 10:46 | MattShotwell
normal-mixture-examples.R | 1.8 K | 17 Apr 2020 - 10:04 | MattShotwell
pca-regression-example.R | 3.4 K | 25 Feb 2021 - 08:28 | MattShotwell
presentation.pdf | 170.9 K | 10 Feb 2020 - 10:52 | MattShotwell
principal-curves.R | 3.7 K | 10 Apr 2020 - 10:11 | MattShotwell
prostate-data-lin.R | 2.5 K | 09 Feb 2021 - 08:46 | MattShotwell
random-forest-example.R | 1.4 K | 30 Mar 2021 - 10:54 | MattShotwell
simple-LDA-3D.R | 3.1 K | 03 Feb 2020 - 09:01 | MattShotwell
smooth-splines-manipulate-example.R | 1.0 K | 15 Jan 2020 - 10:17 | MattShotwell
spectral-clustering.R | 3.1 K | 13 Apr 2020 - 09:35 | MattShotwell
sphered-and-canonical-inputs.R | 6.3 K | 07 Feb 2020 - 09:39 | MattShotwell
splines-example.R | 3.8 K | 14 Feb 2020 - 11:03 | MattShotwell
vowel-data-LDA.Rmd | 4.7 K | 07 Feb 2020 - 13:08 | MattShotwell
vowel-data-LR.Rmd | 3.2 K | 10 Feb 2020 - 12:18 | MattShotwell