Due: The next class period after the question is assigned.
Include code if the problem requires STATA computation.
Problems:
 Read the description of the data in ABD chapter 17 problem 1. Questions here (link).
 Read ABD chapter 17 problem 1. Questions here (link).
 Do ABD chapter 17 problem 8
 Do ABD chapter 17 problem 16
 Do ABD chapter 17 problem 20. Data available in ABD Datasets.
 Do ABD chapter 17 problem 21. Data available in ABD Datasets.
 Do ABD chapter 17 problem 28. Data available in ABD Datasets.
 Do ABD chapter 18 problem 10
 Do ABD chapter 18 problem 11. Data available in ABD Datasets.
 Do ABD chapter 18 problem 12 with the following modifications:
(a) Generate an ANOVA and partial effect table as described in the BBR notes. The table will look like this:
 Effect          DF   F   P
 Species         xx   x   x
   interactions  xx   x   x
 Mass            xx   x   x
   interactions  xx   x   x
 Regression      xx   x   x
 Total           xx
(b) How does the model with SPECIES*MASS term differ from the model without the term? (Answer in the context of the scientific question. For example: In model with the interaction, the association of brain size and body mass ...; whereas, in the model without the interaction the association is ....)
(c) As in the book
(d) As in the book (Write the hypotheses in terms of betas, then write the hypotheses in terms of a comparison of two models.)
(e) As in the book
(f) Generate a scatter plot of the data. Underlay the scatter plot with a plot of the model and confidence bands.
Data available in ABD Datasets: 18q12
 Do ABD chapter 18 problem 15. Data available in ABD Datasets. Reproduce the figure in the question.
 Do ABD chapter 18 problem 18. Data available in ABD Datasets.
 Do RMS chapter 2 problem 1.
 Do RMS chapter 2 problem 2.
 Do RMS chapter 2 problem 3.
 Do RMS chapter 2 problem 4.
 Do RMS chapter 2 problem 5. Data available with command
getvdata sat
 With the original data from ABD chapter 19 problem 11 (First 18 obs of 19q11):
 Fit a simple linear model with spermstored as outcome and shellvolume as predictor.
E[spermstored] = b0 + b1 shellvolume
 Plot the data and the regression model.
 Calculate the slope coefficient (b1).
 Use bootstrapping to generate a confidence interval for the slope coefficient (b1).
 Explain the advantages of the bootstrap confidence interval in this situation.
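Hint: a minimal sketch of a bootstrap confidence interval for a regression slope. It is not the required solution. It uses Stata's built-in auto data as a stand-in, with price and weight playing the roles of spermstored and shellvolume; the number of replicates and the seed are arbitrary choices.

```stata
* Stand-in data: auto.dta replaces 19q11 (assumption for illustration only).
sysuse auto, clear
keep in 1/18                       // mimic a small sample of 18 observations

* Ordinary least squares fit; the slope b1 is stored in _b[weight].
regress price weight

* Resample rows with replacement, refit the model, and collect the slope.
bootstrap b1 = _b[weight], reps(2000) seed(1234) nodots: regress price weight

* Percentile-based 95% interval from the bootstrap distribution.
estat bootstrap, percentile
```

With the assignment data, replace price and weight with the outcome and predictor named in the question.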
 With the data from ABD chapter 19 problem 13 (19q13):
 Provide a bootstrap estimate of the 95% confidence interval of the median.
 Provide a bootstrap estimate of the 95% confidence interval of the 75th percentile.
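Hint: the same resampling idea applies to quantiles. The sketch below is one possible approach, again on the built-in auto data as a stand-in for 19q13; price is a placeholder for the variable of interest.

```stata
* Stand-in data: auto.dta replaces 19q13 (assumption for illustration only).
sysuse auto, clear

* summarize, detail returns the median in r(p50) and the 75th percentile in r(p75).
bootstrap med = r(p50) p75 = r(p75), reps(2000) seed(1234) nodots: summarize price, detail

* Percentile-based 95% intervals for both statistics.
estat bootstrap, percentile
```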
 Using the linear model in ABD chapter 17 problem 1,
 Generate the following diagnostic plots:
 QQ plot of the residuals
 Residuals by predictor
 What aspect of model fit does the QQ plot communicate?
 What does the QQ plot look like if the model fits well?
 What does the QQ plot look like if the model fits poorly?
 What aspect of model fit does the residual by predictor plot communicate?
 What does the residual by predictor plot look like if the model fits well?
 What does the residual by predictor plot look like if the model fits poorly?
 Generate the predicted outcome by predictor plot with confidence bands appropriate for forecasting the outcome for a single, previously unseen observation.
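Hint: one way to produce these plots, sketched on the built-in auto data with price as a stand-in outcome and weight as a stand-in predictor. The stdf option requests bands based on the standard error of the forecast, which is the version suited to a single new observation.

```stata
* Stand-in data and variables (assumption for illustration only).
sysuse auto, clear
regress price weight

* Residuals from the fitted model.
predict double res, residuals

* QQ plot of the residuals against the normal distribution.
qnorm res, name(qq, replace)

* Residuals by predictor, with a reference line at zero.
scatter res weight, yline(0) name(resid, replace)

* Fitted line with forecast (prediction) bands, overlaid with the data.
twoway (lfitci price weight, stdf) (scatter price weight), name(forecast, replace)
```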
 In section 10.10.3, there is a one-way analysis of covariance of Rosner's lead data. Using the same data (getvdata lead):
 Write the linear model relating maxfwt and group using reference group parameterization
 Interpret the model parameters in the context of the scientific question of the analysis
 Fit the model in STATA and report the parameter estimates
 Write the linear model relating maxfwt and group using group means parameterization
 Interpret the model parameters in the context of the scientific question of the analysis
 Fit the model in STATA and report the parameter estimates
 Write the linear model relating maxfwt with group and age using reference group parameterization
 Interpret the model parameters in the context of the scientific question of the analysis
 Fit the model in STATA and report the parameter estimates
 Replicate the predicted outcome plot shown in section 10.10.3 of BBR
 Replicate the ANOVA table with age and group in section 10.10.3 of BBR
 Interpret the hypothesis test represented by "group" in the context of the scientific question
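Hint: the two parameterizations differ only in how the group indicator is coded. The sketch below shows the Stata syntax on the built-in auto data, where foreign is a stand-in grouping variable, price a stand-in outcome, and weight a stand-in covariate.

```stata
* Stand-in data and variables (assumption for illustration only).
sysuse auto, clear

* Reference group parameterization:
* _cons is the mean of the reference group, 1.foreign is the difference from it.
regress price i.foreign

* Group means parameterization:
* no intercept and no base level, so each coefficient is a group mean.
regress price ibn.foreign, noconstant

* Reference group parameterization with a covariate adjustment.
regress price i.foreign c.weight

* Test of the group effect, adjusted for the covariate.
testparm i.foreign
```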
 Using the 2.20.Framingham data, fit the following model
(a) Model E[sbp | sex bmi age scl] =
const + MALE
+ rcs(bmi,5) + rcs(bmi,5) × MALE
+ rcs(age,5) + rcs(age,5) × MALE
+ rcs(scl,5) + rcs(scl,5) × MALE
(b) Generate the following plots:
* Plot E[sbp] over bmi, for males and females
* Plot E[sbp] over age, for males and females
(c) Perform a chunk test to determine whether there is a common or sex-specific "slope" for bmi, age, and scl.
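Hint: a sketch of restricted cubic splines interacted with a binary variable, followed by a chunk test. It uses the built-in auto data as a stand-in (foreign for sex, weight and length for the continuous predictors) and 3 knots rather than 5 because the stand-in data set is small. The rcs_ prefixes are names chosen here, not names required by the assignment.

```stata
* Stand-in data and variables (assumption for illustration only).
sysuse auto, clear

* Restricted cubic spline bases; 3 knots yields two basis variables each.
mkspline rcs_w = weight, cubic nknots(3)
mkspline rcs_l = length, cubic nknots(3)

* Main effects plus group-by-spline interactions.
regress price i.foreign##c.(rcs_w* rcs_l*)

* Chunk test: all interaction terms are jointly zero (common shape in both groups).
testparm i.foreign#c.(rcs_w* rcs_l*)

* The interaction for a single predictor can be tested the same way.
testparm i.foreign#c.(rcs_w*)
```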
 Using the nhgh data, do the following
(a) Create a variable cluster plot of wt ht bmi leg arml armc waist tri sub age sex re
(b) Generate the first principal component of the variables leg, arml, and ht. Report the internal coefficients of the principal component.
(c) Generate the plot of generalized Spearman's rho for wt ht bmi leg arml armc waist tri sub age sex as predictors of gh. See http://hbiostat.org/doc/rms.pdf#page=349
 Using the support data, do the following
(a) Create the potential predictive power plot and the variable cluster plot of the following variables:
sex num_co age meanbp hrt resp bili crea sod slos dzgroup hospdead when predicting ln(totcst)
(b) Based on the output in part (a), decide if you need to combine some variables and decide how many knots to allocate to continuous variables. Decide on a model that includes all of the variables and include a meaningful interaction of a continuous and categorical variable. (Hint: interact slos and dzgroup). Write the model in the following way (replacing ?? with the desired number of knots).
E[ln totcst] = sex + num_co + rcs(age, ??)
+ rcs(meanbp, ??) + rcs(hrt, ??)
+ ...
+ ...
+ rcs(slos, ??) + rcs(slos, ??)*dzgroup
+ dzgroup + hospdead
(c) Generate residual diagnostic plots (QQ, residual by y_hat)
(d) Generate the calibration plot (y by y_hat). Add the line of identity or the scatter plot of y by y.
(e) Generate the regression anova table with tests of no associations, tests of interaction, tests of
(f) Generate the DFFITS influence plot
(g) Generate the leverage plot.
 This question is a continuation of the previous question
(h) Calculate the mean absolute prediction error of your model.
(i) Calculate a measure of rank discrimination.
(j) Calculate the R^2 optimism.
 Using the 2.20.Framingham data, do the following
 Calculate the effective sample size. Determine the available degrees of freedom under the 15:1 rule.
 Perform a variable cluster analysis, and identify two variables to combine into a single principal component.
 Fit a Cox regression model in which the continuous variables are included as restricted cubic splines with 3 knots. Allow the effect of age to differ between males and females.
 Examine the proportional hazards assumption graphically by creating 6 groups of patients based on the estimated log hazards
 Calculate and report a rank discrimination measure (a concordance probability)
 Describe how you might examine the model for overly influential observations. (You do not have to report anything from this step, other than to describe what you might do.)
 Generate the partial effect plot of age
 Generate the ANOVA table of predictor effects.
getvdata 2.20.Framingham
drop month id
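Hint: a sketch of several of the steps above on Stata's built-in cancer data, used here as a stand-in for the Framingham data (drug plays the role of sex). It forms 3 risk groups rather than 6 because the stand-in data set has only 48 observations. The names rcs_age, xb, and risk3 are choices made for this illustration.

```stata
* Stand-in data (assumption for illustration only).
sysuse cancer, clear
stset studytime, failure(died)

* Effective sample size for survival data is the number of events.
count if _d == 1
display "Events: " r(N) "; parameters supportable at 15:1: " floor(r(N)/15)

* Restricted cubic spline for age with 3 knots, allowed to differ by group.
mkspline rcs_age = age, cubic nknots(3)
stcox i.drug##c.(rcs_age*)

* Rank discrimination: Harrell's C (a concordance probability).
estat concordance

* Group subjects by the estimated log hazard and inspect log-log survival curves;
* roughly parallel curves are consistent with proportional hazards.
predict double xb, xb
xtile risk3 = xb, nquantiles(3)
stphplot, by(risk3)
```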
 Using the titanic3 data, do the following
 Calculate the effective sample size. Determine the available degrees of freedom under the 15:1 rule.
 Fit the model reported on page 277 (section 12.3) of the RMS course notes.
 Calculate one rank discrimination measure and one discrimination measure. (See output table reported in RMS notes.)
 Generate the calibration plot.
 Generate the ANOVA table of predictor effects.
 Replicate the partial effect plots reported on page 279 of RMS course notes.
getvdata titanic3
 Using the MoleRat data (18e4), run the STATA code below, which fits a regression with caste-specific intercepts and caste-specific slopes. After running the regression,
 Write the regression model.
Use this notation:
E[lnenergy | lnmass, worker] = ...
V[lnenergy | lnmass, worker] = ...
 Generate a plot of the model with predicted lnenergy on the Y-axis and lnmass on the X-axis. Add a line and (population mean) confidence band for each caste.
 Interpret the beta coefficients and the sigma parameter.
 Generate the forecasted lnenergy and confidence interval for a lazy and a worker mole rat that has the median lnmass.
getvdata 18e4
generate worker = 1*(caste == "worker")
regress lnenergy worker lnmass i.worker#c.lnmass
 Replicate via MarkDoc the report posted here. To get you started, a partial .do file is here. Be sure to:
 Substitute your name into the document
 Submit the pdf via Slack in a private message to Thomas Stewart and Sam Nwosu
 Use the nhgh data and do the following:
getvdata nhgh
a. Generate a summary table that reports the amount of missing data for each variable
b. Generate a summary table that reports the patterns of missing data. What percentage of observations have complete data on all variables?
c. We will study the associations among bun, age, sex, treatment for diabetes, diagnosis of diabetes, waist, and height, so drop all other variables. Because bun is the outcome, drop any observations missing bun.
d. Using predictive mean matching and chained equations, generate 10 multiply imputed datasets
e. Generate a histogram of observed waist values. Add to the plot a reference line for the multiply imputed values for the first observation that is missing waist in the dataset.
f. Generate restricted cubic splines for waist and ht variables.
g. Fit the following model: bun i.sex age i.tx i.dx c.(rcs_ht*)##c.(rcs_waist*)
h. Test the interaction of ht and waist
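Hint: a sketch of the missing-data workflow on the built-in auto data, used as a stand-in for nhgh. Extra missing values are introduced artificially so that two variables need imputation; the choice of knn(5), the seed, and the analysis model are assumptions made for this illustration only.

```stata
* Stand-in data with artificial missingness (assumption for illustration only).
sysuse auto, clear
replace mpg = . in 1/6

* Amount and patterns of missing data.
misstable summarize
misstable patterns, frequency

* Declare the mi structure and the variables to be imputed.
mi set wide
mi register imputed rep78 mpg

* Chained equations with predictive mean matching; 10 imputed data sets.
mi impute chained (pmm, knn(5)) rep78 mpg = price weight, add(10) rseed(1234)

* Fit the analysis model in each imputed data set and pool the results.
mi estimate: regress price mpg rep78 weight

* Joint test of selected coefficients across imputations.
mi test mpg rep78
```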
