Date | Room | Material to read/watch prior to class |
Session Topics | Assignments | Examples | |
---|---|---|---|---|---|---|
Monday | 1 | Zoom | Course Webpage Syllabus Key Concepts A17.1-17.7, A17.10-1 B10-B10.5.8 H2.4.7 |
Simple Linear Regression | Getting Started | |
2 | Zoom | A18, B10.6-B10.8 | Multiple Linear Regression | HW 1 | Lion Age from Nose | |
3 | Zoom | B 10.9-10.10, 9.4-9.9 | " | HW 2 | ||
4 | Zoom | B13-13.6.1 | Analysis of Covariance in Randomized Studies | HW 3 | Mole Rats | |
5 | Zoom | Hxiii, H1, H2-2.3.1, classification blog | Regression Modeling Strategies: Introduction | Chunk tests Cat vars in regression |
||
Monday | 8 | Zoom | H2.3.2-2.4.1-2.4.6 demos | Methods for Multivariable Models | HW 4 | |
9 | Zoom | H2.4.8-2.7.2 (omit 2.5.1) | " | HW 5 | Nonlinear relationships Plot RCS Missing data |
|
10 | Zoom | H3 (very brief), H4-4.1, 4.3-4.4, 4.6 (brief), 4.7-4.7.1 | Missing Data (brief), Multivariable Modeling Strategies | HW 6 Check-in |
||
11 | Zoom | H4.7.1-4.12.3 and this | " | HW 7 (optional) HW 8 |
Data reductions (principle components) Variable clustering |
|
12 | Zoom | H5-5.4, (omit 5.5), B10.11 | Model Interpretation and Validation | HW 9 | Outliers Internal Validation |
|
Monday | 15 | Zoom | H5.6, B6.8, 6.9, 6.10-6.10.2, A17.9 | Relative Effect Measures, Introduction to LRM | Prep for critique | Simple Logistic Logistic (part2) |
16 | Zoom | Student led discussion of papers | HW 10 | |||
17 | Zoom | A17.9, H10-10.5, 10.8-10.10, NNT | Binary Logistic Models, NNT | |||
18 | Zoom | B6.10.3, B13.6.2-13.7, blog, blog, H11 | " + Case Study | Analysis plan for final project | Bayes | |
19 | Zoom | H12, HTE, HTE, B19, More Analyses of Diagnostic Yield, Simple Decision Analysis |
Binary Logistic Model Case Study, Risk-based diagnostic research, ROC curves, HTE | HW 11 | Ordinal predictors Ordinal regression |
|
Monday | 22 | Zoom | B4.1.2, B5.12.4-.5, Power, B7.6-7.7, H13.1-13.3, H14.1 and Table 14.1 | Proportional Odds Ordinal Logistic Models | HW 12 (Optional) HW 13 |
|
23 | Zoom | H15-15.2, 15.5.2, Figure 15.15 17.1-17.3, 17.5.1, 18.1, 18.4-18.6 |
Ordinal Regression, Survival Analysis | Cox Model | ||
24 | Zoom | Survival Analysis | ||||
25 | Zoom | B15 | Analysis of Serial Measurements Student Presentations of Final Project |
Final Project | ||
26 | Zoom | Student Presentations of Final Project |
getvdata 17-e-1 //Scatter plot twoway (scatter ageinyears proportionblack), ytitle(Age (years)) xtitle(Proportion black) // Regression regress ageinyears proportionblack // Add plot data plotdata, at(proportionblack = .1(.1).9) // Generate yhat and confidence intervals regress_yhat_ci // Plot regression line and CI twoway (rarea ageinyears_hat_lb ageinyears_hat_ub proportionblack if _plotindicator == 1, sort fcolor(bluishgray%60)) /// (line ageinyears_hat proportionblack if _plotindicator == 1, sort lcolor(bluishgray) lwidth(thick)) /// (scatter ageinyears proportionblack), /// ytitle(Age (years)) /// xtitle(Proportion black) /// legend(off) // Generate yhat and forecast confidence intervals (use stub to identify new vars) regress_yhat_ci, forecast stub(f_) // Plot regression line twoway (rarea f_ageinyears_hat_lb f_ageinyears_hat_ub proportionblack if _plotindicator == 1, sort fcolor(lime%70)) /// (rarea ageinyears_hat_lb ageinyears_hat_ub proportionblack if _plotindicator == 1, sort fcolor(bluishgray%60)) /// (line ageinyears_hat proportionblack if _plotindicator == 1, sort lcolor(bluishgray) lwidth(thick)) /// (scatter ageinyears proportionblack), /// ytitle(Age (years)) /// xtitle(Proportion black) /// legend(off) // Generate residuals generate ehat = ageinyears - ageinyears_hat // Studentized residuals predict student_ehat, rstudent // QQplot for residual diagnostic qnorm student_ehat // Constant variance diagnostic twoway (scatter student_ehat proportionblack)
// Read data getvdata 18-e-4 // Generate worker indicator variable encode caste, generate(lazy) gen worker = 1*(caste == "worker") // Construct regression model regress lnenergy lnmass i.lazy c.lnmass#i.lazy // Generate data for plot plotdata, at(lazy=1; lnmass = 4.2(.1)5.5) plotdata, at(lazy=2; lnmass = 3.5(.1)5) // Generate predicted values and CIs regress_yhat_ci // Generate plot of regression model twoway /// (rarea lnenergy_hat_lb lnenergy_hat_ub lnmass if _plotindicator == 1, sort fcolor(red%30)) /// (line lnenergy_hat lnmass if _plotindicator==1, sort lcolor(red) lwidth(vthick)) /// (rarea lnenergy_hat_lb lnenergy_hat_ub lnmass if _plotindicator == 2, sort fcolor(blue%30)) /// (line lnenergy_hat lnmass if _plotindicator==2, sort lcolor(blue) lwidth(vthick)) /// (scatter lnenergy lnmass if lazy==1, mcolor(red)) /// (scatter lnenergy lnmass if lazy==2, mcolor(blue)) /// , legend(order(5 "Lazy" 6 "Worker")) /// ytitle(Energy expenditure (log scale)) /// xtitle(Body mass (log scale))
getvdata 18-e-4 gen worker = 1*(caste == "worker") /* # Chunk test deliciousness In the following example, we are going to perform a series of chunk tests. We are going to demonstrate how to perform a chunk test with two different methods. The first method is a comparison of models. The second is a direct test of the betas. The two methods are equivalent and will give identical answers. Depending on your circumstances, one method may be easier to implement than another. ## METHOD 1: Comparison of models ### Interaction test FULL: E[ lnenergy | - ] = b0 + b1 lnmass + b2 worker + b3 worker * lnmass REDUCED: E[ lnenergy | - ] = b0 + b1 lnmass + b2 worker ### Total impact of lnmass FULL: E[ lnenergy | - ] = b0 + b1 lnmass + b2 worker + b3 worker * lnmass REDUCED: E[ lnenergy | - ] = b0 + b2 worker ### Total impact of caste FULL: E[ lnenergy | - ] = b0 + b1 lnmass + b2 worker + b3 worker * lnmass REDUCED: E[ lnenergy | - ] = b0 + b1 lnmass ### Overall regression FULL: E[ lnenergy | - ] = b0 + b1 lnmass + b2 worker + b3 worker * lnmass REDUCED: E[ lnenergy | - ] = b0 ### Steps to compare two models 1. Fit and store each model (use `est store` command) 2. Use `ftest` command ### Note: + The comparison of models approach is only valid if the regression models are fit using the same observations. This is often an issue if there is missing data as the default practice is to disregard observations that have missing covariates. The consequence is that the REDUCED model may use more observations than the FULL model. We will cover an example of how to compare models when there is missing data. */ regress lnenergy lnmass worker i.worker#c.lnmass est store full regress lnenergy lnmass worker est store no_int // for no interaction regress lnenergy worker est store no_mass // model with no lnmass regress lnenergy lnmass est store no_caste // model with no caste variable regress lnenergy est store no_vars // model with no predictors ftest full no_int // (also available from regress output) ftest full no_mass ftest full no_caste ftest full no_vars // (also available from regress output) /* ## METHOD 2: Betas ### Interaction test Ho: b3 = 0 Ha: b3 != 0 ### Total impact of lnmass Ho: b1 = 0 AND b3 = 0 Ha: b1 != 0 OR b3 != 0 ### Total impact of caste Ho: b1 = 0 AND b2 = 0 Ha: b1 != 0 OR b2 != 0 ### Overall regression Ho: b1 = 0 AND b2 = 0 AND b3 = 0 Ha: b1 != 0 OR b2 != 0 OR b3 != 0 ### Steps to test betas 1. Fit the full model 2. Use the `regress, coeflegend` command to identify labels for betas 3. Use `test` command ### Note: + This approach avoids issues with missing data. */ regress lnenergy lnmass worker i.worker#c.lnmass regress, coeflegend test (1.worker#c.lnmass) // Interaction (also available from regress output) test (lnmass 1.worker#c.lnmass) // Total impact of lnmass test (worker 1.worker#c.lnmass) // Total impact of caste test (lnmass worker 1.worker#c.lnmass) // Overall regression (also available from regress output) /* ## Table of usefull hypothesis tests You will want to organize the results of commonly used hypothesis tests in a table like the following: | Variable | df | Test stat | p-value | |---------------|----|-----------|---------| | lnmass | 2 | | | | interaction | 1 | | | | caste | 2 | | | | interaction | 1 | | | | Overall | 3 | | | */
/*** # Incorporating categorical variables into a regression model Example 1: Height as a function of race-ethnicity (re) + Option 1 - estimate the mean for each re E[ ht | re ] = a1 * Mexican American + a2 * Other Hispanic + a3 * Non-Hispanic White + a4 * Non-Hispanic Black + a5 * Other Race Including Multi-Racial + Option 2 - estimate the mean in a reference group (say Non-Hispanic Black), then estimate the difference in mean height between the reference group and each of the remaining re group E[ ht | re ] = b0 + b1 * Mexican American + b2 * Other Hispanic + b3 * Non-Hispanic White + b4 * Other Race Including Multi-Racial ***/ getvdata nhgh // Group means coding regress ht ibn.re, noconst /* Note: When you use the noconstant option, disregard any thing from the SS table or the overall chunk test */ // Reference group coding regress ht i.re /* Which is the reference group? How do you tell stata which group to use as reference? + help fvvarlist */ regress ht ib3.re // ib#. use # as base, #=value of variable regress ht ib(3).re // ib(##). use the #th ordered value as base (**) regress ht ib(first).re // ib(first). use smallest value as base (the default) regress ht ib(last).re // ib(last). use largest value as base regress ht ib(freq).re // ib(freq). use most frequent value as base /*** Example 2: Height as a function of race-ethnicity (re) and sex ***/ regress ht i.re##i.sex plotdata, at(sex = 1/2; re = 1/5) regress_yhat_ci decode(sex), generate(s_sex) decode(re), generate(s_re) gen label = s_re + " " + s_sex egen float row = rank(-10*re+sex) if _plotindicator == 1 point_ci ht_hat ht_hat_lb ht_hat_ub row if _plotindicator == 1 & sex == 1, /// labelvar(label) /// point_options(mcolor(magenta)) /// ci_options(lwidth(thick)) /// plot_options(title(Mean heights NHANES) xtitle(Height (cm)) ytitle(Race and Sex))
getvdata nhgh /* E[ ht | age, sex] = b0 + b1 sex + b2 age + b3 age' + b4 age'' + b5 age*sex + b6 age'*sex + b7 age''*sex */ * Plot data centile age, centile(2 98) plotdata, at(age=13/80; sex=1) plotdata, at(age=13/80; sex=2) * Generate restricted cubic spline of age with 3 knots mkspline_plotindicator age, nknots(4) * Fit the models for SCr and gh regress ht i.sex##c.(rcs*) * Generate the following diagnostic plots *** QQ plot of studentized residuals predict stud_residual, rstudent qnorm stud_residual *** residual-versus-fitted plot predict yhat, xb twoway (scatter stud_residual yhat) if _plotindicator == 0 *** residual-versus-predictor plot rvpplot rcs_age_1 rvpplot rcs_age_2 rvpplot rcs_age_3 *** LEAVE-ONE-OUT ***** DFBETAS dfbeta gen row = _n twoway (scatter _dfbeta_1 row) if _plotindicator == 0 twoway (scatter _dfbeta_2 row) if _plotindicator == 0 * ... ***** leverage predict leverage, leverage twoway (scatter leverage row) if _plotindicator == 0 ***** DFITS predict dfits, dfits twoway (scatter dfits row) if _plotindicator == 0 ***** Cooks D predict cooksd, cooksd twoway (scatter cooksd row) if _plotindicator == 0 list if cooksd > .005 & _plotindicator == 0 /* Group work E[ gh | age, sex, ...] = */
************************************ * THE SUPPORT STUDY * Dupont text, section 3.25 ************************************ clear getvdata support keep slos meanbp sex label variable slos "Length of Stay (days)" label variable meanbp "Mean Arterial Pressure (mm Hg)" plotdata, at(meanbp = 0, 25.71429, 51.42857, 77.14286, 102.8571, 128.5714, 154.2857, 181) plotdata, at(meanbp = 0, 25.71428, 51.42856, 77.14285, 102.8570, 128.5713, 154.2856, 180.9) scatter slos meanbp twoway (scatter slos meanbp), /// yscale(log) /// ylabel(, angle(horizontal)) egen meanbpcat = cut(meanbp), at(0 25.71429 51.42857 77.14286 102.8571 128.5714 154.2857 181) icodes * STEP FUNCTION regress slos i.meanbpcat regress_yhat_ci, stub(step_) twoway (scatter slos meanbp) /// (line step_slos_hat meanbp if step_slos_hat > 0, sort lcolor(orange%99) lwidth(vvthick)), /// yscale(log) /// ylabel(, angle(horizontal)) /// xline(25.71429 51.42857 77.14286 102.8571 128.5714 154.2857 ) summarize ibn.meanbpcat regress slos ibn.meanbpcat ibn.meanbpcat#c.meanbp, noconstant regress_yhat_ci, stub(pwl_) drop twoway (scatter slos meanbp) /// (line pwl_slos_hat meanbp if pwl_slos_hat > 0, sort lcolor(orange%99) lwidth(vvthick)), /// yscale(log) /// ylabel(, angle(horizontal)) /// xline(25.71429 51.42857 77.14286 102.8571 128.5714 154.2857 ) * LINEAR SPLINE mkspline ls_bp_ 7 = meanbp, displayknots twoway (scatter ls_bp_1 meanbp), xline(25.71429 51.42857 77.14286 102.8571 128.5714 154.2857 ) twoway (scatter ls_bp_2 meanbp), xline(25.71429 51.42857 77.14286 102.8571 128.5714 154.2857) twoway (scatter ls_bp_3 meanbp), xline(25.71429 51.42857 77.14286 102.8571 128.5714 154.2857 ) twoway (scatter ls_bp_4 meanbp), xline(25.71429 51.42857 77.14286 102.8571 128.5714 154.2857 ) list in 1/5 regress slos ls_bp_* regress_yhat_ci, stub(ls_) twoway (scatter slos meanbp) /// (line ls_slos_hat meanbp if ls_slos_hat > 0, sort lcolor(orange%99) lwidth(vvthick)), /// yscale(log) /// ylabel(, angle(horizontal)) /// xline(25.71429 51.42857 77.14286 102.8571 128.5714 154.2857 ) * RESTRICTED SPLINE * mkspline rcs_bp_ = meanbp, nknots(7) cubic displayknots * 40 59.6829 69 78 101.8579 113 139 mkspline rcs_bp_ = meanbp, knots(40 59.6829 69 78 101.8579 113 139) cubic displayknots twoway (scatter rcs_bp_1 meanbp), xline(40 59.6829 69 78 101.8579 113 139 ) twoway (scatter rcs_bp_2 meanbp), xline(40 59.6829 69 78 101.8579 113 139 ) twoway (scatter rcs_bp_3 meanbp), xline(40 59.6829 69 78 101.8579 113 139 ) twoway (scatter rcs_bp_1 meanbp) /// (scatter rcs_bp_2 meanbp) /// (scatter rcs_bp_3 meanbp) /// (scatter rcs_bp_4 meanbp) /// (scatter rcs_bp_5 meanbp) /// (scatter rcs_bp_6 meanbp) /// , xline(40 59.6829 69 78 101.8579 113 139 ) regress slos rcs_bp_* regress_yhat_ci, stub(rcs_) drop twoway (scatter slos meanbp) /// (line rcs_slos_hat meanbp if rcs_slos_hat > 0, sort lcolor(orange%99) lwidth(vvthick)), /// yscale(log) /// ylabel(, angle(horizontal)) /// xline(40 59.6829 69 78 101.8579 113 139 ) * TEST OF LINEARITY test (rcs_bp_2 rcs_bp_3 rcs_bp_4 rcs_bp_5 rcs_bp_6) testparm rcs_bp_2-rcs_bp_6 * OR FULL vs REDUCED est store rcs // full regress slos meanbp regress_yhat_ci, stub(sl_) drop // simple linear est store simple_linear ftest simple_linear rcs * COMPARE CUBIC TO LINEAR label variable rcs_slos_hat "RCS prediction" label variable sl_slos_hat "Linear prediction" twoway (scatter slos meanbp) /// (line rcs_slos_hat meanbp, sort lcolor(orange%99) lwidth(vvthick)) /// (line sl_slos_hat meanbp, sort lcolor(green%99) lwidth(vvthick)), /// yscale(log range(3 250)) /// ylabel(2(2)10 10(10)50 50(50)250, angle(horizontal))
* 1. Specify the model *-------------------------------------------------- * E[ log scl | - ] = b0 + b1 * sex + * b2 * age + b3 * age' * b4 * bmi + b5 * bmi' * b6 * age * sex + b7 * age' * sex * b8 * bmi * sex + b9 * bmi' * sex * V[ log scl | - ] = sigma^2 * 2. Specify the partial effect plots of interest *-------------------------------------------------- * Plot 1: Y = E[ log scl | - ] * X = age * line for males and line for females * other covariates set at median * * Plot 2: Y = E[ log scl | - ] * X = bmi * line for males and line for females * other covariates set at median * Use centile to get range and median of variables getvdata 2.20.Framingham gen logscl = log(scl) label define map_sex 1 "Male" 2 "Female" label values sex map_sex centile age bmi, centile(5 50 95) * 3. Generate data for plotting * All of the predictors in the model need to be * specified. *-------------------------------------------------- plotdata, at(age=34/60; bmi=25; sex=1/2) plotdata, at(age=45; bmi=19/33; sex=1/2) * OR specify each line separately * This approach can simplify the `if` statements * when generating the partial effect plots * plotdata, at(age=34/60; bmi=25; sex=1) * plotdata, at(age=34/60; bmi=25; sex=2) * plotdata, at(age=45; bmi=19/33; sex=1) * plotdata, at(age=45; bmi=19/33; sex=2) * 4. Generate restricted cubic splines * * Why use `mkspline_plotindicator`? * * To correctly calculate the default knots for * the predictor variable after plotting data * has been appended to the dataset. * * If you want to use custom knots, or if no * plotting data have been appended to the data, * then `mkspline` will work just fine. *-------------------------------------------------- mkspline_plotindicator bmi, nknots(3) mkspline_plotindicator age, nknots(3) * 5. Fit the model *-------------------------------------------------- regress logscl rcs_age* rcs_bmi* i.sex i.sex#c.(rcs_bmi*) i.sex#c.(rcs_age*) * Shorter way: regress logscl i.sex##c.(rcs*) * 6. Model diagnostics (more to come) *-------------------------------------------------- regress_yhat_ci predict studres, rstudent qnorm studres * 7. Generate partial effect plots *-------------------------------------------------- twoway /// (rarea logscl_hat_lb logscl_hat_ub age if sex == 1, sort fcolor(cranberry%50)) /// (line logscl_hat age if sex == 1, sort lcolor(cranberry) lwidth(thick)) /// (rarea logscl_hat_lb logscl_hat_ub age if sex == 2, sort fcolor(dknavy%50)) /// (line logscl_hat age if sex == 2, sort lcolor(dknavy) lwidth(thick)) /// if _plotindicator == 1, /// legend(order(2 "Male" 4 "Female")) /// ytitle(Mean log scl) /// xtitle(Age) /// note(BMI = 25) twoway /// (rarea logscl_hat_lb logscl_hat_ub bmi if sex == 1, sort fcolor(cranberry%50)) /// (line logscl_hat bmi if sex == 1, sort lcolor(cranberry) lwidth(thick)) /// (rarea logscl_hat_lb logscl_hat_ub bmi if sex == 2, sort fcolor(dknavy%50)) /// (line logscl_hat bmi if sex == 2, sort lcolor(dknavy) lwidth(thick)) /// if _plotindicator == 2, /// legend(order(2 "Male" 4 "Female")) /// ytitle(Mean log scl) /// xtitle(BMI) /// note(age = 45)
************************************ * THE SUPPORT STUDY * Dupont text, section 3.25 ************************************ getvdata support keep age sex dzgroup num_co edu income scoma meanbp hrt resp temp totcst misstable summarize misstable pattern replace edu = . if edu == .z replace income = . if income == .z misstable summarize mi query mi set flong mi register imputed edu income totcst codebook num_co codebook scoma mi impute chained (pmm, knn(10)) totcst edu (ologit) income = age i.sex i.dzgroup num_co scoma meanbp hrt resp temp, add(10) augment rseed(23450) list income totcst edu _mi_id if _mi_m == 0 & _mi_id < 10 list income totcst edu _mi_id if _mi_m == 1 & _mi_id < 10 list income totcst edu _mi_id if _mi_m == 2 & _mi_id < 10 mkspline rcs_age_=age, cubic nknots(3) mkspline rcs_meanbp_=meanbp, cubic nknots(3) mkspline rcs_hrt_=hrt, cubic nknots(3) list rcs* if _mi_m == 1 & _mi_id < 10 list _mi_miss in 1/10 list _mi_miss _mi_m if _mi_m == 1 & _mi_id < 10 mi estimate : regress totcst rcs_* i.sex i.dzgroup num_co scoma i.income mi estimate, saving("miexample", replace) : regress totcst rcs_* i.sex i.dzgroup num_co scoma i.income help mi estimate mi estimate, coeflegend mi test (rcs_age_1 rcs_age_2) mi test (2.income 3.income 4.income) mi test (2.dzgroup 3.dzgroup 4.dzgroup 5.dzgroup 6.dzgroup 7.dzgroup 8.dzgroup)
getvdata support keep age sex dzclass race meanbp wblc hrt resp temp alb bili crea sod ph glucose bun urine adlsc * findit hcavar tabulate sex, generate(s) tabulate dzclass, generate(dz) hcavar s2 dz2 dz3 dz4 adlsc urine bun glucose ph sod crea bili alb temp resp hrt wblc meanbp age
getvdata nhgh * SUBSET AGE GE 21 and DX = 0 and TX = 0 keep if age >= 21 & dx == 0 & tx == 0 drop dx tx * GENERATE INDICATOR VARIABLES FOR RE quietly tabulate re, generate(re_) list re re_* * PREDICTIVE CAPABILITY OF CONTINUOUS VARIABLES * must have EGENMORE installed, use command "findit egenmore" for user defined function spearman2 gh age wt leg arml armc waist tri sub * VARIABLE CLUSTERING * findit hcavar hcavar wt ht bmi leg arml armc waist tri sub age sex re_* * PRINCIPLE COMPONENTS centile waist bmi, centile(5 95) twoway (scatter waist bmi), xline(20 40) yline(74 125) pca waist bmi, components(2) predict waist_bmi_pc*, score * INTERNAL COEFFICIENTS regress waist_bmi_pc1 waist bmi regress waist_bmi_pc2 waist bmi twoway (scatter waist_bmi_pc2 waist_bmi_pc1) centile waist_bmi_pc1 waist_bmi_pc2, centile(5 95) twoway (scatter waist_bmi_pc2 waist_bmi_pc1), xline(-1.95 2.49) yline(-.466 .473) * USE PC IN REGRESSION regress SCr waist_bmi_pc1 age leg armc
************************* * FRAMINGHAM HEART STUDY * Dupont, Section 3.11 ************************* getvdata 2.20.Framingham * For demonstration purposes: replace age = 103 if id == 2642 generate logsbp = log(sbp) label variable logsbp "Log Systolic Blood Pressure" generate logbmi = log(bmi) label variable logbmi "Log Body Mass Index" generate logscl = log(scl) label variable logscl "Log Serum Cholesterol" label define sexlabel 1 "Male" 2 "Female" label values sex sexlabel * MODEL 3.23 * E[log spb | ... ] = b0 + b1 logbmi + b2 age + b3 logscl + b4 female * + b5 female * logbmi + b6 female * age * + b7 female * logscl regress logsbp logbmi age logscl i.sex i.sex#c.(logbmi age logscl) * OUTLIERS dfbeta * FIT DIAGNOSTICS predict yhat, xb predict ehat, residuals predict standard_resid, rstandard predict student_resid, rstudent predict cooksd, cooksd predict leverage, leverage predict dfits, dfits * PLOTS twoway (scatter student_resid yhat) (lowess student_resid yhat, lwidth(vthick)), yline(2 -2) * COOKS D generate index = _n twoway (scatter cooksd index) * DFIT twoway (scatter dfits index) * DFBETAS graph matrix _dfbeta_*
************************* * FRAMINGHAM HEART STUDY * Dupont, Section 3.11 ************************* getvdata 2.20.Framingham generate logsbp = log(sbp) label variable logsbp "Log Systolic Blood Pressure" generate logbmi = log(bmi) label variable logbmi "Log Body Mass Index" generate logscl = log(scl) label variable logscl "Log Serum Cholesterol" label define sexlabel 1 "male" 2 "female" label values sex sexlabel quietly regress logsbp i.sex##c.(logbmi age logscl) * R2 Bootstrap VALIDATION validate_regress, reps(100) seed(3243) * In-sample calibration plot regress_yhat_ci twoway (scatter logsbp logsbp_hat) /// (line logsbp_hat logsbp_hat, sort) /// (lowess logsbp logsbp_hat) * In-sample Mean absolute prediction error gen abs_error = abs(logsbp - logsbp_hat) sum abs_error * In-sample Rank discrimination ** Spearman spearman logsbp logsbp_hat ** In-sample Kendall's tau ktau logsbp logsbp_hat
/* 0. Question 1. Translate question to model 2. Identify covariates for adjustment a. Missing data strategy 3. Determine parameter budget a. Effective sample size --> 15:1 rule --> budget b. Data reduction needed? c. Interactions? d. Non-linear terms e. Allocation of parameters according to clinical understanding 4. Fit the model 5. Model diagnostics a. Overly influential observations (dfbetas, dfits) b. Residual diagnostics (linear model usually) 6. Measures of model performance (Optimism corrected) a. Discrimination b. Calibration 7. Understanding the model (Partial effect plots) 8. Hypothesis tests a. Total predictor impact b. Linearity c. Interaction */ getvdata acath /* 0. What are the associations of age, cholesterol, and sex with significant coronary disease 1. logodds[ SCD | - ] = 2. Covariates for adjustment */ keep sex age choleste sigdz misstable pattern drop if choleste == .z /* For demonstration purposes. Imputation is a better option. */ /* 3. Paramter budget */ tab sigdz // display 768/15 // Data reduction probably not needed. /* 3c. Interactions? 3d. Non-linear terms 3e. Allocation of parameters according to clinical understanding | Full interaction | A * f(B) + B * f(A) Interaction | f(A*B) | |------------------------------------|---------------------------------|-----------------------------| | 1 | 1 | 1 | | + sex | + sex | + sex | | + rcs(age, 4 knots) | + rcs(age, 4) | + rcs(age, 4) | | + rcs(cholesterol, 4 knots) | + rcs(cholesterol, 4) | + rcs(cholesterol, 4) | | + rcs(age, 4)*rcs(cholesterol, 4) | + age * rcs(cholesterol, 4) | + rcs(age * cholesterol, 4) | | | + cholesterol * rcs(age, 4) | | | + rcs(age, 4)*sex | + rcs(age, 4) * sex | + rcs(age, 4)*sex | |------------------------------------|---------------------------------|-----------------------------| How many parameters do each of these models have? */ /* OUT OF ORDER Need to make plans for partial effect plots. Plan is to generate heatmap of Age | Z = predicted log odds | | | +-------------- cholesterol Holding sex = 1 */ centile age cholest, centile(0 3 50 97 100) plotdata, at(age = 34(2)70; choleste = 150(5)340; sex = 1) // Time permitting, will come back to this part // It isn't essential plotdata, at(age = 17(2)81; choleste = 30(5)575; sex = 1) sum age cholest if _plotindicator == 0 bnormpdf age cholest, dens(npdf) m1(52.2) m2(229.9) s1(9.9) s2(50.6) rho( 0.0139) centile npdf if _plotindicator == 0, centile(5) gen drop5 = 1*(npdf < .000016) if _plotindicator == 2 twoway (scatter age cholest if _plotindicator == 2 & drop5 == 0) /// (scatter age cholest if _plotindicator == 0) /* 4. Fit the model */ mkspline_plotindicator age, nknots(4) mkspline_plotindicator choleste, nknots(4) logistic sigdz i.sex rcs_age* rcs_chol* c.rcs_age_1#c.(rcs_chol*) c.rcs_choleste_1#c.(rcs_age*) c.(rcs_age*)#i.sex /* 5. Diagnostics */ gen row = _n predict leverage, hat predict deltad, ddeviance twoway scatter leverage row if _plotindicator == 0 twoway scatter deltad row if _plotindicator == 0 // DFBETAS not part of STATA base. Can generate with ldfbeta package. // search ldfbeta //ldfbeta // Will be slow for larger datasets. //twoway scatter DFrcs_ag row //twoway scatter DFrcs_ch row //twoway scatter DF2 row /* 6. Apparent measures of performance */ logistic_phat_ci // to get phat brier sigdz sigdz_phat if _plotindicator == 0 lroc //ssc install somersd somersd sigdz sigdz_phat /* Apparent calibration with histogram rug plot */ twoway /// (lowess sigdz sigdz_phat, lcolor(blue) lwidth(thick)) /// (line sigdz sigdz, sort lcolor(red) lwidth(medium)) /// (histogram sigdz_phat, fraction bin(100) fcolor(black) lwidth(none)) /// , title(Apparent calibration) /// legend(order(1 "Apparent calibration" 2 "Ideal calibration")) /// xtitle(Predicted probability) /* Optimism corrected measures of performance */ logistic sigdz i.sex rcs_age* rcs_chol* c.rcs_age_1#c.(rcs_chol*) c.rcs_choleste_1#c.(rcs_age*) c.(rcs_age*)#i.sex validate_logistic, reps(100) seed(023) /* Partial effect plots */ twoway /// (contour sigdz_phat age choleste if _plotindicator == 1, levels(100) format(%0.0g) heatmap zlabel(#7)) twoway /// (contour sigdz_phat age choleste if _plotindicator == 1, levels(100) format(%0.0g) heatmap zlabel(#7)) /// (scatter age choleste if _plotindicator == 0, mcolor(white%0) mfcolor(white%0) mlcolor(black%10)) twoway /// (contour sigdz_phat age choleste if _plotindicator == 2 & drop5 == 0, levels(100) format(%0.0g) heatmap zlabel(#7) interp(none)) /// (scatter age choleste if _plotindicator == 0, mcolor(white%0) mfcolor(white%0) mlcolor(black%8)) /* 8. Hypothesis Tests */ logistic sigdz i.sex rcs_age* rcs_chol* c.rcs_age_1#c.(rcs_chol*) c.rcs_choleste_1#c.(rcs_age*) c.(rcs_age*)#i.sex est store full // Minus sex test (1.sex 1.sex#c.rcs_age_1 1.sex#c.rcs_age_2 1.sex#c.rcs_age_3) testparm i.sex i.sex#c.* // OR quietly logistic sigdz rcs_age* rcs_chol* c.rcs_age_1#c.(rcs_chol*) c.rcs_choleste_1#c.(rcs_age*) est store minus_sex lrtest full minus_sex // Minus age logistic sigdz i.sex rcs_age* rcs_chol* c.rcs_age_1#c.(rcs_chol*) c.rcs_choleste_1#c.(rcs_age*) c.(rcs_age*)#i.sex testparm rcs_age* c.rcs_age*#c.* c.rcs_age*#i.* // | | + Interaction of age with categorical vars // | + Interaction of age with continuous variables // + Age terms without interaction // OR quietly logistic sigdz i.sex rcs_chol* est store minus_age lrtest full minus_age // Minus chol logistic sigdz i.sex rcs_age* rcs_chol* c.rcs_age_1#c.(rcs_chol*) c.rcs_choleste_1#c.(rcs_age*) c.(rcs_age*)#i.sex testparm rcs_chol* c.rcs_chol*#c.* c.rcs_chol*#i.* // OR quietly logistic sigdz i.sex rcs_age* c.(rcs_age*)#i.sex est store minus_chol lrtest full minus_chol // Minus chol nonlinear logistic sigdz i.sex rcs_age* rcs_chol* c.rcs_age_1#c.(rcs_chol*) c.rcs_choleste_1#c.(rcs_age*) c.(rcs_age*)#i.sex // OR quietly logistic sigdz i.sex rcs_age* rcs_choleste_1 c.rcs_choleste_1#c.(rcs_age*) c.(rcs_age*)#i.sex est store minus_chol_nonlinear lrtest full minus_chol_nonlinear // Minus chol interaction quietly logistic sigdz i.sex rcs_age* rcs_chol* c.(rcs_age*)#i.sex est store minus_chol_interaction lrtest full minus_chol_interaction
getvdata sex.age.response display (5/invnormal(0.975))^2 display ((20 / invnormal(0.975)) / 10)^2 bayes, /// prior({response:_cons}, t(0,10,3)) /// prior({response:age}, normal(0,1)) /// prior({response:sex}, normal(0,6.5)) /// mcmcsize(40000) /// thinning(10) /// rseed(123) /// saving(stata-bayes-results.dta, replace) /// coef : logistic response age sex bayesgraph diagnostics _all use stata-bayes-results.dta, clear sunflower eq1_p1 eq1_p2 twoway (scatter eq1_p1 eq1_p2, mcolor(blue%2) msymbol(circle) mlwidth(none)) gen age_gt_zero = 1*(eq1_p1 > 0) sum age_gt_zero gen sex_gt_zero = 1*(eq1_p2 > 0) sum sex_gt_zero gen sex_age_gt_zero = 1*(eq1_p1 > 0 & eq1_p2 > 0) sum sex_age_gt_zero
getvdata support * Collapse levels 6 and 7 codebook num_co replace num_co = 6 if num_co == 7 * To compare models plotdata, at(num_co = 0/6) * Model of mean los where num_co is a nominal categorical variable regress slos i.num_co regress_yhat_ci, stub(i_) /* store results with i_ prefix */ twoway scatter i_slos_hat num_co if _plotindicator == 1 * Model of mean los where num_co association is linear regress slos c.num_co regress_yhat_ci, stub(c_) twoway scatter c_slos_hat num_co if _plotindicator == 1 * Model of mean los where num_co association is quadratic gen num_co_squared = num_co*num_co regress slos c.num_co c.num_co_squared regress_yhat_ci, stub(c2_) twoway scatter c2_slos_hat num_co if _plotindicator == 1 * Model of mean los where num_co association is quadratic with indicator variable for group 0 gen zero_co = 1*(num_co == 0) regress slos c.num_co c.num_co_squared zero_co regress_yhat_ci, stub(c2_gi_) twoway scatter c2_gi_slos_hat num_co if _plotindicator == 1 * Compare estimated means from all models list num_co *_slos_hat if _plotindicator == 1
use "https://biostatdata.app.vumc.org/tgs/simhiv.dta", clear // Simulated HIV data provided by Bryan Shepherd as part of the Liu, Shepherd, Li, and Harrell (2017) publication. encode site, generate(site_id) generate sex = real(male) encode init_class, generate(class) encode route, generate(iroute) encode preARTAIDS_clinical, generate(base_aids) drop site male init_class route preARTAIDS_clinical /* post_cd4 ~ site + male + rcs(age) + init_class + route + rcs(nadir_cd4) +rcs(init_year)+ rcs(pre_rna)+ preARTAIDS_clinical To create the partial-effect plot: What is the variable on the x-axis? What is the range of the x-axis variable? What value are the other predictors held-at? E[post_cd4|-] | **** | * | * | * |* | +-------------- nadir_cd4 */ plotdata, at( /// nadir_cd4 = 2(2)550; /// This is the x-axis variable init_year = 2007; /// All other covariates in the pre_rna = 35000; /// model are set to a specific age = 35; /// value. site_id = 1; /// sex = 1; /// class = 1; /// iroute = 1; /// base_aids = 1) mkspline_plotindicator age, nknots(5) mkspline_plotindicator nadir_cd4, nknots(5) mkspline_plotindicator init_year, nknots(5) mkspline_plotindicator pre_rna, nknots(4) set matsize 800 ologit post_cd4 i.site_id i.sex rcs* i.base_aids i.iroute i.class if _plotindicator == 0 ologit_yhat /* Hypothesis tests (can also perform with lrtest command) */ testparm rcs_age* // total impact of age testparm rcs_nadir* // total impact of nadir cd4 testparm i.site_id // total impact of site /* Partial effect plot */ twoway (line post_cd4_mean nadir_cd4 if _plotindicator == 1, sort)