You are here: Vanderbilt Biostatistics Wiki > Main Web > CourseDSI5640 (29 Apr 2021, MattShotwell)

- Matthew S. Shotwell, PhD
- matt.shotwell@vanderbilt.edu
- Office: 11118B, 2525 West End Avenue
- 615-875-3397
- Github: biostatmatt

- Shiyao Li
- Jiabing Ruan

- First meeting: Tue. Jan. 26, 2021; Last meeting: Thu. Apr. 29, 2021
- Tuesday, Thursday 11:10AM-12:25PM
- Sony Building 2001-A, and virtually using Zoom via Brightspace
- Office hours: by appointment initially; a regular schedule will be determined.
- We will use the Graduate School Academic Calendar.

- A more applied book (also free to download) with slides, R code, and video tutorials: http://www-bcf.usc.edu/~gareth/ISL/
- The Matrix Cookbook (version 15 November 2012): MCB-20121115.pdf

- Overview of Supervised Learning and Review of Linear Methods: HTF Ch. 2-4
- Splines and Kernel Methods: HTF Ch. 5-6
- Model Assessment, Selection, and Inference: HTF Ch. 7-8
- Neural Networks: HTF Ch. 11
- Support Vector Machines: HTF Ch. 12
- Unsupervised Learning: HTF Ch. 14

- Unless otherwise stated, assigned homework is due in one week. Late homework will be subject to a penalty of 20% for each day late.
- Students are encouraged to work together on homework problems, but must turn in their own write-ups.
- Class participation is encouraged.
- Please bring a laptop to class, when classes are held in-person.

- Homework: 40%
- Midterm Exam: 30%
- Final Exam: 30%

| Letter Grade | Lowest Score |
| --- | --- |
| A+ | 96.5 |
| A | 93.5 |
| A- | 90.0 |
| B+ | 86.5 |
| B | 83.5 |
| B- | 80.0 |
| C | 70.0 |
| F | 0.0 |

| Date | Reading (before class) | Homework | Topic/Content | Presentation |
| --- | --- | --- | --- | --- |
| Tue. 1/26 | none | none | Syllabus, introduction | Intro.pdf |
| Thu. 1/28 | HTF Ch. 1; Ch. 2.1-2.3 | none | Least squares, nearest neighbors | lecture-1.pdf, mixture-data-lin-knn.R |
| Tue. 2/2 | none | See below: Tue. 2/2 | Least squares, nearest neighbors code | mixture-data-lin-knn.R |
| Thu. 2/4 | HTF Ch. 2.4 | none | Decision theory | lecture-2.pdf |
| Tue. 2/9 | none | See below: Tue. 2/9 | Loss functions in practice | lecture-2a.pdf, prostate-data-lin.R |
| Thu. 2/11 | HTF Ch. 2.7-2.9 | none | Structured regression | lecture-3.pdf, ex-1.R, ex-2.R, ex-3.R |
| Tue. 2/16 | HTF Ch. 3.1-3.4 | none | Linear methods, subset selection, ridge, and lasso | lecture-4a.pdf, linear-regression-examples.R, lecture-5.pdf, lasso-example.R |
| Thu. 2/18 | none | See below: Thu. 2/18 | No class: reading day on linear methods for regression. Suggested supplemental reading: Introduction to Statistical Learning Ch. 3 and laboratory (Section 3.6). | none |
| Tue. 2/23 | none | none | Linear methods, subset selection, ridge, and lasso (cont.) | lecture-5.pdf, lasso-example.R |
| Thu. 2/25 | HTF Ch. 3.5 and 3.6 | none | Linear methods: principal components regression | lecture-6.pdf, pca-regression-example.R |
| Tue. 3/2 | HTF Ch. 4.1-4.3 | See below: Tue. 3/2 | Linear methods: linear discriminant analysis | lecture-8.pdf, simple-LDA-3D.R |
| Thu. 3/4 | HTF Ch. 5.1 and 5.2 | none | Basis expansions: piecewise polynomials and splines | lecture-11.pdf, splines-example.R, mixture-data-complete.R |
| Tue. 3/9 | HTF Ch. 6.1-6.5 | none | Kernel methods | lecture-13.pdf, mixture-data-knn-local-kde.R, kernel-methods-examples-mcycle.R |
| Thu. 3/11 | HTF Ch. 7.1-7.4 | See below: Thu. 3/11 | Model assessment: Cp, AIC, BIC | lecture-14.pdf, effective-df-aic-bic-mcycle.R |
| Tue. 3/16 | HTF Ch. 7.10 | none | Cross-validation | lecture-15.pdf, kNN-CV.R, Income2.csv |
| Thu. 3/18 | none | none | Midterm review | none |
| Tue. 3/23 | HTF Ch. 9.2 | none | Classification and regression trees | lecture-21.pdf, mixture-data-rpart.R |
| Thu. 3/25 | HTF Ch. 8.7-8.9 | none | Bagging | lecture-18.pdf, mixture-data-rpart-bagging.R, nonlinear-bagging.html |
| Tue. 3/30 | HTF Ch. 15.1 and 15.2 | See below: Tue. 3/30 | Random forests | lecture-25.pdf, random-forest-example.R |
| Thu. 4/1 | HTF Ch. 10.1 | none | Boosting and AdaBoost.M1 (part 1) | lecture-22.pdf, boosting-trees.R |
| Tue. 4/6 | HTF Ch. 10.2-10.9 | Work through this GBM tutorial | Boosting and AdaBoost.M1 (part 2) | lecture-23.pdf |
| Thu. 4/8 | HTF Ch. 10.10 and 10.13 | none | Boosting and AdaBoost.M1 (part 3) | lecture-24.pdf, gradient-boosting-example.R |
| Tue. 4/13 | HTF Ch. 10.10 and 10.13 | none | Boosting and AdaBoost.M1 (part 3; cont.) | lecture-24.pdf, gradient-boosting-example.R |
| Thu. 4/15 | HTF Ch. 11.1-11.5 | none | Introduction to neural networks | lecture-31.pdf, nnet.R |
| Tue. 4/20 | HTF Ch. 11.1-11.5 | See below: Thu. 4/15 | Introduction to neural networks (cont.) | lecture-31.pdf, nnet.R |
| Thu. 4/22 | HTF Ch. 11.1-11.5 | none | Introduction to neural networks (cont.) | lecture-31.pdf, nnet.R |
| Tue. 4/27 | HTF Ch. 14.5.3 | none | k-means, hierarchical, and spectral clustering | lecture-29.pdf, spectral-clustering.R |
| Thu. 4/29 | none | none | Distribute final exam; last day of class | none |

Use the following R Markdown header for the homework template:

```yaml
---
title: "Homework 1"
author: DS Student
date: January 15, 2020
output: github_document
---
```

Make sure to include the raw code (.Rmd), the rendered file (.md), and any plots in your repo. Jupyter notebooks (.ipynb) using the R language are also acceptable. Resubmissions are allowed only if the initial submission was made on time; a resubmission is due within one week of receiving feedback from the TA, and each assignment allows a maximum of two resubmissions.

Homework: Tue. 2/2

- Paste the code from the mixture-data-lin-knn.R file into the homework template knitr document.
- Read the help file for R's built-in linear regression function `lm`.
- Rewrite the functions `fit_lc` and `predict_lc` using `lm` and the associated `predict` method for `lm` objects.
- Consider making the linear classifier more flexible by adding squared terms for x1 and x2 to the linear model.
- Describe how this more flexible model affects the bias-variance tradeoff.
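A sketch of the `lm`-based rewrite (assuming, as in mixture-data-lin-knn.R, a 0/1 class indicator `y` and a two-column input matrix `x`; the demo data below are simulated stand-ins for the mixture data):

```r
## fit_lc rewritten with lm: regress the 0/1 class indicator on x1 and x2
fit_lc <- function(y, x) {
  dat <- data.frame(y = y, x1 = x[, 1], x2 = x[, 2])
  lm(y ~ x1 + x2, data = dat)
}

## predict_lc rewritten using lm's predict method
predict_lc <- function(fit, x) {
  predict(fit, newdata = data.frame(x1 = x[, 1], x2 = x[, 2]))
}

## more flexible variant: add squared terms via I()
fit_lc2 <- function(y, x) {
  dat <- data.frame(y = y, x1 = x[, 1], x2 = x[, 2])
  lm(y ~ x1 + x2 + I(x1^2) + I(x2^2), data = dat)
}

## demo on simulated two-class data (stand-in for the mixture data)
set.seed(1)
x <- matrix(rnorm(200), 100, 2)
y <- as.numeric(x[, 1] + x[, 2] > 0)
probs <- predict_lc(fit_lc(y, x), x)   # decision boundary where probs == 0.5
```

The squared terms add two fitted parameters, so the boundary can curve: bias decreases while variance increases.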

Homework: Tue. 2/9

- Write functions that implement the L1 loss and tilted absolute loss functions.
- Create a figure that shows lpsa (x-axis) versus lcavol (y-axis). Add and label (using the `legend` function) the linear model predictions associated with L2 loss, L1 loss, and tilted absolute loss for tau = 0.25 and 0.75.
- Write functions to fit and predict from a simple nonlinear model with three parameters, defined by `beta[1] + beta[2]*exp(-beta[3]*x)`. Hint: make copies of `fit_lin` and `predict_lin` and modify them to fit the nonlinear model. Use `c(-1.0, 0.0, -0.3)` as `beta_init`.
- Create a figure that shows lpsa (x-axis) versus lcavol (y-axis). Add and label (using the `legend` function) the nonlinear model predictions associated with L2 loss, L1 loss, and tilted absolute loss for tau = 0.25 and 0.75.
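The two loss functions might be implemented as below; `fit_lin_loss` is a hypothetical helper in the spirit of the course's `fit_lin`, minimizing an arbitrary loss with `optim` (the demo data are simulated, not the prostate data):

```r
## L1 loss: mean absolute error
loss_l1 <- function(y, yhat) mean(abs(y - yhat))

## tilted absolute loss: weight tau on under-prediction (y > yhat),
## weight 1 - tau on over-prediction
loss_tilted <- function(y, yhat, tau) {
  e <- y - yhat
  mean(ifelse(e > 0, tau * e, (tau - 1) * e))
}

## fit a line by minimizing an arbitrary loss with optim (cf. fit_lin)
fit_lin_loss <- function(x, y, loss, beta_init = c(0, 0), ...) {
  obj <- function(beta) loss(y, beta[1] + beta[2] * x, ...)
  optim(beta_init, obj)$par
}

## demo: lines fitted under L1 and tilted losses on simulated data
set.seed(1)
x <- seq(-1, 1, length.out = 50)
y <- 2 + 3 * x + rnorm(50, sd = 0.5)
b_l1 <- fit_lin_loss(x, y, loss_l1)
b_25 <- fit_lin_loss(x, y, loss_tilted, tau = 0.25)
b_75 <- fit_lin_loss(x, y, loss_tilted, tau = 0.75)
```

At tau = 0.5 the tilted loss reduces to half the L1 loss, which is a handy sanity check.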

Homework: Thu. 2/18

- Use the prostate cancer data.
- Use the `cor` function to reproduce the correlations listed in HTF Table 3.1, page 50.
- Treat `lcavol` as the outcome, and use all other variables in the data set as predictors.
- With the training subset of the prostate data, train a least-squares regression model with all predictors using the `lm` function.
- Use the testing subset to compute the test error (average squared-error loss) using the fitted least-squares regression model.
- Train a ridge regression model using the `glmnet` function, and tune the value of `lambda` (i.e., use guess-and-check to find the value of `lambda` that approximately minimizes the test error).
- Create a figure that shows the training and test error associated with ridge regression as a function of `lambda`.
- Create a path diagram of the ridge regression analysis, similar to HTF Figure 3.8.
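The ridge steps might look like the sketch below. The object names `x.train`, `y.train`, `x.test`, and `y.test` are hypothetical placeholders for the predictor matrices and `lcavol` outcomes from your train/test split, and the lambda grid is an arbitrary starting point; assumes the glmnet package is installed:

```r
library(glmnet)

## ridge regression: glmnet with alpha = 0; supply a decreasing lambda grid
lambdas <- 10^seq(3, -3, length.out = 50)
fit <- glmnet(x.train, y.train, alpha = 0, lambda = lambdas)

## training and test error (average squared error) at each lambda;
## predict() returns one column per lambda value
train_err <- colMeans((y.train - predict(fit, newx = x.train))^2)
test_err  <- colMeans((y.test  - predict(fit, newx = x.test))^2)
best_lambda <- lambdas[which.min(test_err)]

## error curves versus lambda
matplot(log(lambdas), cbind(train_err, test_err), type = "l", lty = 1,
        xlab = "log(lambda)", ylab = "average squared error")

## path diagram of coefficient profiles, similar to HTF Figure 3.8
plot(fit, xvar = "lambda", label = TRUE)
```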

Homework: Tue. 3/2

- Exercise 4: "When the number of features p is large, there tends to be a deterioration in the performance of KNN and other local approaches that perform prediction using only observations..." Please type your solutions within your R Markdown document. No R coding is required for this exercise.
- Exercise 10: "This question should be answered using the Weekly data set, which is part of the ISLR package. This data is similar..." This exercise requires R coding.

Homework: Thu. 3/11

- Randomly split the mcycle data into training (75%) and validation (25%) subsets.
- Using the mcycle data, consider predicting the mean acceleration as a function of time. Use the Nadaraya-Watson method with the k-NN kernel function to create a series of prediction models by varying the tuning parameter over a sequence of values. (Hint: the script already implements this.)
- With the squared-error loss function, compute and plot the training error, AIC, BIC, and validation error (using the validation data) as functions of the tuning parameter.
- For each value of the tuning parameter, perform 5-fold cross-validation using the combined training and validation data. This results in 5 estimates of test error per tuning parameter value.
- Plot the CV-estimated test error (average of the five estimates from each fold) as a function of the tuning parameter. Add vertical line segments to the figure (using the `segments` function in R) that represent one "standard error" of the CV-estimated test error (standard deviation of the five estimates from each fold).
- Interpret the resulting figures and select a suitable value for the tuning parameter.
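The cross-validation step can be sketched in base R. Simulated data stand in for the combined mcycle training and validation sets, and the k-NN-kernel Nadaraya-Watson estimator here is a minimal version of what the course script implements:

```r
## Nadaraya-Watson with the k-NN kernel: the prediction at x0 is the
## mean of the k nearest training responses
nw_knn <- function(x_tr, y_tr, x0, k) {
  sapply(x0, function(x) mean(y_tr[order(abs(x_tr - x))[seq_len(k)]]))
}

## simulated stand-in data
set.seed(42)
x <- runif(100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)

## 5-fold CV: five held-out estimates of test error per tuning parameter k
fold <- sample(rep(1:5, length.out = length(x)))
ks <- 1:20
cv_err <- sapply(ks, function(k)
  sapply(1:5, function(f) {
    pred <- nw_knn(x[fold != f], y[fold != f], x[fold == f], k)
    mean((y[fold == f] - pred)^2)          # squared-error loss
  }))
cv_mean <- colMeans(cv_err)                # average of the five estimates
cv_se   <- apply(cv_err, 2, sd)            # SD across folds ("standard error")

## CV error curve with one-standard-error segments
plot(ks, cv_mean, type = "b", xlab = "tuning parameter k", ylab = "CV error")
segments(ks, cv_mean - cv_se, ks, cv_mean + cv_se)
```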

Homework: Tue. 3/30

- Convert the response variable in the vowel.train data frame to a factor variable prior to training, so that `randomForest` does classification rather than regression.
- Review the documentation for the `randomForest` function.
- Fit the random forest model to the vowel data using all 11 features and the default values of the tuning parameters.
- Use 5-fold CV to tune the model by performing a grid search over the following tuning parameters: 1) the number of variables randomly sampled as candidates at each split (consider values 3, 4, and 5), and 2) the minimum size of terminal nodes (consider the sequence 1, 5, 10, 20, 40, and 80).
- With the tuned model, make predictions using the majority-vote method, and compute the misclassification rate using the vowel.test data.
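One way the grid search might be organized, assuming `vowel.train` and `vowel.test` are already loaded (e.g., from the ElemStatLearn package) and the randomForest package is installed; a sketch of the workflow, not a definitive tuning result:

```r
library(randomForest)

## factor response => randomForest does classification
vowel.train$y <- factor(vowel.train$y)
vowel.test$y  <- factor(vowel.test$y)

## baseline fit with default tuning parameters
fit0 <- randomForest(y ~ ., data = vowel.train)

## 5-fold CV grid search over mtry (variables tried per split)
## and nodesize (minimum terminal node size)
set.seed(1)
fold <- sample(rep(1:5, length.out = nrow(vowel.train)))
grid <- expand.grid(mtry = 3:5, nodesize = c(1, 5, 10, 20, 40, 80))
grid$cv_err <- sapply(seq_len(nrow(grid)), function(i)
  mean(sapply(1:5, function(f) {
    fit <- randomForest(y ~ ., data = vowel.train[fold != f, ],
                        mtry = grid$mtry[i], nodesize = grid$nodesize[i])
    pred <- predict(fit, vowel.train[fold == f, ])  # majority-vote class
    mean(pred != vowel.train$y[fold == f])          # misclassification rate
  })))
best <- grid[which.min(grid$cv_err), ]

## refit with the tuned values; test misclassification rate
fit_best <- randomForest(y ~ ., data = vowel.train,
                         mtry = best$mtry, nodesize = best$nodesize)
mean(predict(fit_best, vowel.test) != vowel.test$y)
```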

Homework: Thu. 4/15

- Work through the "Image Classification" tutorial on the RStudio Keras website.
- Use the Keras library to re-implement the simple neural network discussed during lecture for the mixture data (see nnet.R). Use a single, fully connected 10-node hidden layer.
- Create a figure to illustrate that the predictions are (or are not) similar using the `nnet` function versus the Keras model.
- (Optional extra credit) Convert the neural network described in the "Image Classification" tutorial to a network similar to one of the convolutional networks described during the 4/15 lecture (i.e., Net-3, Net-4, or Net-5) and in ESL Section 11.7. See the ConvNet tutorial on the RStudio Keras website.
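The Keras re-implementation might look like the sketch below, assuming the keras R package is installed and configured, and that `x` (an n x 2 input matrix) and `y` (0/1 class labels) come from the mixture data as in nnet.R. Choices beyond the 10-node hidden layer (activations, optimizer, epochs) are illustrative guesses, not the lecture's settings:

```r
library(keras)

## single, fully connected 10-node hidden layer, as in the nnet.R model
model <- keras_model_sequential() %>%
  layer_dense(units = 10, activation = "sigmoid", input_shape = 2) %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = "accuracy"
)

model %>% fit(x, y, epochs = 100, verbose = 0)

## predicted class probabilities, to plot against the nnet predictions
p_keras <- predict(model, x)
```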
