| Assignment    | Points               |
|---------------|----------------------|
| Final Project | 20 pts               |
| Paper Review  | 20 pts               |
| Daily HW      | 4 pts x 15 = 60 pts  |
| Total         | 100 pts              |

The instructors may provide opportunities to earn additional points. Your final grade will be calculated from the final score as follows:
90 - 100 A
80 -  89 B
70 -  79 C
60 -  69 D
00 -  59 F
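
As an illustration of the scale above, the mapping from final score to letter grade can be sketched in Python. This is only a sketch: the function name `letter_grade` is hypothetical, and the treatment of fractional scores (no rounding) and of scores above 100 from additional points (still an A) is an assumption, since the scale lists whole numbers only.

```python
def letter_grade(score):
    """Map a final score to a letter grade using the cutoffs listed above.

    Assumptions (not stated in the syllabus): fractional scores are not
    rounded, and scores above 100 from additional points still earn an A.
    """
    if score < 0:
        raise ValueError("score must be non-negative")
    cutoffs = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]
    for minimum, grade in cutoffs:
        if score >= minimum:
            return grade
    return "F"


# Example: full marks on the project and review plus 13 of 15 homeworks.
print(letter_grade(20 + 20 + 4 * 13))
```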

# ASSIGNMENTS

Final Project
Overview:
You will plan, execute, and present a data analysis on a question of your own choosing. We encourage you to choose a question from your own research.
Due dates:
Analysis plan will be submitted as a HW - see schedule
Presentations will occur in class - see schedule
Presentation slides (if any) will be shared with the entire class (via slack)
Final reports due on the last day of class at 11:59PM
Name the report as follows: lastname-firstname-finalproject. For example, stewart-thomas-finalproject.pdf
Use a header so that each page prints your name in the top margin.
Instructions:
• Choose a research question so that development of a predictive model makes sense.
• Select a dataset that includes predictors of various types (both continuous and categorical).
• Incorporate methods discussed in the course. (For example, transformations, splines, missing data methods, bootstrap, etc.)
• Address model selection and fit.
• Use graphical displays in both the report and the presentation.
• Plan for a 15-minute presentation and 5 minutes of questions.
• Prepare a statistical analysis plan for your final project. The following is a suggested outline for the analysis plan.
1. Introduction. This is a statistical analysis plan, so a (very) short introduction of the scientific / biological content is sufficient. Give the reader enough content to orient themselves.
2. Context. List earlier studies that form the foundation of your current research project. As a statistical analysis plan, focus on the statistical aspects of the earlier studies, i.e., study design, patient population, analysis method, type of hypothesis test. Indicate how the project adds to or replicates earlier work.
3. Research questions.
4. Analysis plan. For each research question, provide information on the following:
• Study design
• Study population, inclusion/exclusion criteria
• Variables
• Statistical methods
• Missing Data
• Sensitivity analyses
• Output. Provide mock-ups of tables and figures that summarize the results of the analyses. Mock-ups should be as complete as possible. Think how you might generate tables and figures that compare results from earlier research with the results from the present project.

Paper Review
Overview:
You will present to the class a critique of a published paper from a medical journal.
Due dates:
Discussion in class - see schedule

Presentation slides (if any) will be shared with the entire class (via slack)
Use a header so that each page prints your name in the top margin.
Instructions:
• Select a paper from the list of papers on slack (or add a paper of your choice)
• Review the paper, focusing on the statistical analysis.
• Identify strengths and weaknesses.
• If there are issues with the analysis, explain what the issue is, how it may affect the analysis, and how you would analyze the data to avoid the issue.
• Be prepared to discuss your critique and lead a discussion in class.
• Read two other papers from the list of papers so that you may participate in the discussions during class

Daily Homework
Due:
The next class period after the question is assigned.
Include code if the problem requires STATA computation
Problems:
1. Read the description of the data in ABD chapter 17 problem 1. Questions here (link).
3. Do ABD chapter 17 problem 8
4. Do ABD chapter 17 problem 16
5. Do ABD chapter 17 problem 20. Data available in ABD Datasets.
6. Do ABD chapter 17 problem 21. Data available in ABD Datasets.
7. Do ABD chapter 17 problem 28. Data available in ABD Datasets.
8. Do ABD chapter 18 problem 10
9. Do ABD chapter 18 problem 11. Data available in ABD Datasets.
10. Do ABD chapter 18 problem 12 with the following modifications:
(a) Generate an ANOVA and partial effect table as described in the BBR notes.  The table will look like this:

| Effect         | DF | F | P |
|----------------|----|---|---|
| Species        | xx | x | x |
|   interactions | xx | x | x |
|----------------|----|---|---|
| Mass           | xx | x | x |
|   interactions | xx | x | x |
|----------------|----|---|---|
| Regression     | xx | x | x |
|----------------|----|---|---|
| Total          | xx |   |   |

(b) How does the model with the SPECIES*MASS term differ from the model without the term? (Answer in the context of the scientific question. For example: In the model with the interaction, the association of brain size and body mass ...; whereas, in the model without the interaction, the association is ....)
(c) As in the book
(d) As in the book (Write the hypotheses in terms of betas, then write the hypotheses in terms of a comparison of two models.)
(e) As in the book
(f) Generate a scatter plot of the data.  Underlay the scatter plot with a plot of the model and confidence bands.

Data available in ABD Datasets: 18-q-12
11. Do ABD chapter 18 problem 15. Data available in ABD Datasets. Reproduce the figure in the question.
12. Do ABD chapter 18 problem 18. Data available in ABD Datasets.
13. Do RMS chapter 2 problem 1.
14. Do RMS chapter 2 problem 2.
15. Do RMS chapter 2 problem 3.
16. Do RMS chapter 2 problem 4.
17. Do RMS chapter 2 problem 5. Data available with command
getvdata sat

18. With the original data from ABD chapter 19 problem 11 (First 18 obs of 19-q-11):
1. Fit a simple linear model with spermstored as outcome and shellvolume as predictor.
E[spermstored] = b0 + b1 shellvolume
2. Plot the data and the regression model.
3. Calculate the slope coefficient (b1).
4. Use bootstrapping to generate a confidence interval for the slope coefficient (b1).
5. Explain the advantages of the bootstrap confidence interval in this situation.
19. With the data from ABD chapter 19 problem 13 (19-q-13):
1. Provide a bootstrap estimate of the 95% confidence interval of the median.
2. Provide a bootstrap estimate of the 95% confidence interval of the 75th percentile.
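
The resampling logic behind the bootstrap intervals requested in problems 18 and 19 can be sketched as follows. This is a minimal illustration of the percentile bootstrap in Python, not the course's Stata workflow; the helper names `percentile_bootstrap_ci` and `third_quartile`, the number of replicates, the seed, and the data values are all assumptions made up for the example. Other interval types (for example, bias-corrected intervals) would give somewhat different limits.

```python
import random
import statistics


def percentile_bootstrap_ci(data, statistic, reps=2000, level=0.95, seed=2019):
    """Percentile bootstrap confidence interval for statistic(data)."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(data)
    estimates = sorted(
        statistic(rng.choices(data, k=n))  # resample n observations with replacement
        for _ in range(reps)
    )
    tail = (1 - level) / 2
    lower = estimates[round(tail * reps)]
    upper = estimates[round((1 - tail) * reps) - 1]
    return lower, upper


def third_quartile(values):
    """75th percentile (default 'exclusive' method of statistics.quantiles)."""
    return statistics.quantiles(values, n=4)[2]


# Made-up measurements, for illustration only (not the 19-q-13 data).
sample = [2.1, 3.4, 3.9, 4.4, 5.0, 5.2, 6.8, 7.1, 7.7, 8.3, 9.6, 12.5, 15.0]

median_ci = percentile_bootstrap_ci(sample, statistics.median)
q75_ci = percentile_bootstrap_ci(sample, third_quartile)
print("median:", statistics.median(sample), "95% CI:", median_ci)
print("75th percentile:", third_quartile(sample), "95% CI:", q75_ci)
```

For a regression slope, the same idea applies with (x, y) pairs resampled together and the slope recomputed in each replicate.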
20. Using the linear model in ABD chapter 17 problem 1,
1. Generate the following diagnostic plots:
1. Q-Q plot of the residuals
2. Residuals by predictor
2. What aspect of model fit does the Q-Q plot communicate?
3. What does the Q-Q plot look like if the model fits well?
4. What does the Q-Q plot look like if the model fits poorly?
5. What aspect of model fit does the residual by predictor plot communicate?
6. What does the residual by predictor plot look like if the model fits well?
7. What does the residual by predictor plot look like if the model fits poorly?
8. Generate the predicted outcome by predictor plot with confidence bands appropriate for forecasting the outcome for a single, previously unseen observation.
21. In section 10.10.3, there is a one-way analysis of covariance of Rosner's lead data. Using the same data (getvdata lead):
1. Write the linear model relating maxfwt and group using reference group parameterization
2. Interpret the model parameters in the context of the scientific question of the analysis
3. Fit the model in STATA and report the parameter estimates
4. Write the linear model relating maxfwt and group using group means parameterization
5. Interpret the model parameters in the context of the scientific question of the analysis
6. Fit the model in STATA and report the parameter estimates
7. Write the linear model relating maxfwt with group and age using reference group parameterization
8. Interpret the model parameters in the context of the scientific question of the analysis
9. Fit the model in STATA and report the parameter estimates
10. Replicate the predicted outcome plot shown in section 10.10.3 of BBR
11. Replicate the ANOVA table with age and group in section 10.10.3 of BBR
12. Interpret the hypothesis test represented by "group" in the context of the scientific question
22. Using the 2.20.Framingham data, fit the following model
(a)  Model E[sbp | sex bmi age scl] =
const + MALE
+ rcs(bmi,5) + rcs(bmi,5) × MALE
+ rcs(age,5) + rcs(age,5) × MALE
+ rcs(scl,5) + rcs(scl,5) × MALE

(b) Generate the following plots:
*   Plot E[sbp|-] over bmi, for males and females
*   Plot E[sbp|-] over age, for males and females

(c) Perform a chunk test to determine if there is a common or sex-specific "slope" of bmi, age, and scl.

23. Using the nhgh data, do the following
(a) Create a variable cluster plot of wt ht bmi leg arml armc waist tri sub age sex re

(b) Generate the first principal component of the variables leg, arml, and ht. Report the internal coefficients of the principal component.

(c) Generate the plot of generalized Spearman's rho for wt ht bmi leg arml armc waist tri sub age sex as predictors of gh.  See http://hbiostat.org/doc/rms.pdf#page=349

24. Using the support data, do the following
(a) Create the potential predictive power plot and the variable cluster plot of the following variables:

sex num_co age meanbp hrt resp bili crea sod slos dzgroup hospdead when predicting ln(totcst)

(b) Based on the output in part (a), decide if you need to combine some variables and decide how many knots to allocate to continuous variables.  Decide on a model that includes all of the variables and include a meaningful interaction of a continuous and categorical variable.  (Hint: interact slos and dzgroup).  Write the model in the following way (replacing ?? with the desired number of knots).

E[ln totcst|-] = sex + num_co + rcs(age, ??)
+ rcs(meanbp, ??) + rcs(hrt, ??)
+ ...
+ ...
+ rcs(slos, ??) + rcs(slos, ??)*dzgroup

(c) Generate residual diagnostic plots (QQ, residual by y_hat)

(d) Generate the calibration plot (y by y_hat).  Add the line of identity or the scatter plot of y by y.

(e) Generate the regression ANOVA table with tests of no association, tests of interaction, and tests of linearity.

25. This question is a continuation of the previous question
(f) Generate the DFFITS influence plot

(g) Generate the leverage plot.

(h) Calculate the mean absolute prediction error of your model.

(i) Calculate a measure of rank discrimination.

(j) Calculate the R^2 optimism.
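
The quantities in items (h) and (i) can be sketched as follows. This is an illustrative Python sketch rather than Stata output: the function names are hypothetical, the numbers are made up, and the pairwise concordance probability shown here is only one possible measure of rank discrimination (the assignment does not say which to use). Pairs tied on the outcome are skipped and pairs tied on the prediction count as one half, which is a common convention but an assumption here.

```python
from itertools import combinations


def mean_absolute_error(y, y_hat):
    """Average absolute difference between observed and predicted outcomes."""
    return sum(abs(obs - pred) for obs, pred in zip(y, y_hat)) / len(y)


def concordance(y, y_hat):
    """Proportion of usable pairs whose predictions are ordered like the outcomes."""
    concordant = 0.0
    usable = 0
    for (y1, p1), (y2, p2) in combinations(zip(y, y_hat), 2):
        if y1 == y2:
            continue  # pairs tied on the outcome carry no ordering information
        usable += 1
        if p1 == p2:
            concordant += 0.5  # tied predictions count as half concordant
        elif (y1 < y2) == (p1 < p2):
            concordant += 1
    return concordant / usable


# Made-up observed and predicted values, for illustration only.
observed = [1, 2, 3, 4]
predicted = [1.5, 1.0, 3.5, 3.0]
print(mean_absolute_error(observed, predicted))
print(concordance(observed, predicted))
```

A concordance of 0.5 corresponds to predictions that order the outcomes no better than chance, and 1.0 to perfect ordering.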

26. Using the 2.20.Framingham data, do the following
1. Calculate the effective sample size. Determine the available degrees of freedom under the 15:1 rule.
2. Perform a variable cluster analysis, and identify two variables to combine into a single principal component.
3. Fit a Cox regression model in which the continuous variables are included as restricted cubic splines with 3 knots. Allow the effect of age to differ between males and females.
4. Examine the proportional hazards assumption graphically by creating 6 groups of patients based on the estimated log hazards.
5. Calculate and report a rank discrimination measure (a concordance probability).
6. Describe how you might examine the model for overly influential observations. (You do not have to report anything from this step, other than to describe what you might do.)
7. Generate the partial effect plot of age
8. Generate the ANOVA table of predictor effects.
getvdata 2.20.Framingham
drop month id
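
The arithmetic behind item 1 (also used in problem 27) can be sketched as follows. This Python sketch assumes the limiting-sample-size convention from the RMS notes: the number of observations for a continuous outcome, the smaller of the two outcome counts for a binary outcome, and the number of events for a time-to-event outcome. The function names and the counts in the example are hypothetical and are not taken from the Framingham or titanic3 data.

```python
def effective_sample_size(outcome_type, n=None, events=None):
    """Limiting sample size m for the given outcome type (assumed convention)."""
    if outcome_type == "continuous":
        return n
    if outcome_type == "binary":
        return min(events, n - events)  # the less frequent outcome category
    if outcome_type == "survival":
        return events  # number of observed events, not number of subjects
    raise ValueError("unknown outcome type: " + outcome_type)


def available_df(m, ratio=15):
    """Degrees of freedom supported by m under a ratio:1 rule (15:1 by default)."""
    return m // ratio


# Made-up counts, for illustration only.
m = effective_sample_size("survival", n=2000, events=450)
print(m, available_df(m))
```

Each knot beyond linear in a spline and each interaction term spends degrees of freedom from this budget.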

27. Using the titanic3 data, do the following
1. Calculate the effective sample size. Determine the available degrees of freedom under the 15:1 rule.
2. Fit the model reported on page 277 (section 12.3) of the RMS course notes.
3. Calculate one rank discrimination measure and one discrimination measure. (See output table reported in RMS notes.)
4. Generate the calibration plot.
5. Generate the ANOVA table of predictor effects.
6. Replicate the partial effect plots reported on page 279 of RMS course notes.
getvdata titanic3

28. Using the MoleRat data (18-e-4), run the STATA code below which fits a regression with caste-specific intercepts and caste-specific slopes. After running the regression,
1. Write the regression model.
Use this notation:
E[lnenergy|lnmass, worker] = ...
V[lnenergy|lnmass, worker] = ...

2. Generate a plot of the model with predicted lnenergy on the Y-axis and lnmass on the X-axis. Add a line and (population mean) confidence band for each caste.
3. Interpret the beta coefficients and the sigma parameter.
4. Generate the forecasted lnenergy and confidence interval for a lazy mole rat and a worker mole rat, each with the median lnmass.
getvdata 18-e-4
generate worker = 1*(caste == "worker")
regress lnenergy worker lnmass i.worker#c.lnmass

29. Replicate via MarkDoc the report posted here. To get you started, a partial .do file is here. Be sure to:
1. Substitute your name into the document
2. Submit the pdf via Slack in a private message to Thomas Stewart and Sam Nwosu
30. Use the nhgh data and do the following:
getvdata nhgh

a. Generate a summary table that reports the amount of missing data for each variable

b. Generate a summary table that reports the patterns of missing data. What percentage of observations have complete data for all variables?

c. Because we will try to understand the associations between bun, age, sex, treatment for diabetes, diagnosis of diabetes, waist, and height, drop all other variables. Because bun is the outcome, drop any observations missing bun.

d. Using predictive mean matching and chained equations, generate 10 multiply imputed datasets

e. Generate a histogram of observed waist values.  Add to the plot a reference line for the multiply imputed values for the first observation that is missing waist in the dataset.

f. Generate restricted cubic splines for waist and ht variables.

g. Fit the following model: bun i.sex age i.tx i.dx c.(rcs_ht*)##c.(rcs_waist*)

h. Test the interaction of ht and waist

Topic revision: r46 - 13 Feb 2019, ThomasStewart