# In-Class Examples Using Stata

### Tom Stewart, 2022

The lion's nose, ABD example 17.1, page 541
Dataset: 17-e-1
N observations: 32 male lions
Response: Age (years)
Predictor: Proportion of black pigmentation on nose

• Why would understanding the Age-Nose relationship be helpful?
• Construct a scatter plot of the response and explanatory variable.
• How would you describe the relationship in the plot?
• Write down the linear regression model of Age and Proportion of black pigmentation on nose.
• Interpret the constant coefficient.
• Interpret the slope coefficient.
• Interpret the variance coefficient.
• How would the model change if the predictor were on the percent scale?
• How would the model change if Age were on the decade scale?
• What scale is the variance coefficient?
• Fit a linear regression model
• What are the parameter estimates?
• What is the uncertainty of the model parameters?
• Plot the linear regression model
• Plot the linear regression model with CI bands.
• Assess the model fit
• Plot a scatter plot of the studentized residuals and the predictor variable.
• What aspect of model fit does this plot communicate?
• What would the plot look like if the model fits well?
• What would the plot look like if the model fits poorly?
• How might you use the model?
• What is the difference between "estimating the mean" and "prediction of age for a new lion"?
• Is there a difference between estimating a population parameter and predicting an individual outcome?
• Which of the two (population vs individual outcome) are we "more certain"?
• Plot the linear regression model with PI band

```getvdata 17-e-1

//Scatter plot
twoway (scatter ageinyears proportionblack), ytitle(Age (years)) xtitle(Proportion black)

// Regression
regress ageinyears proportionblack

plotdata, at(proportionblack = .1(.1).9)

// Generate yhat and confidence intervals
regress_yhat_ci

// Plot regression line and CI
twoway (rarea ageinyears_hat_lb ageinyears_hat_ub proportionblack if _plotindicator == 1, sort fcolor(bluishgray%60)) ///
(line ageinyears_hat proportionblack if _plotindicator == 1, sort lcolor(bluishgray) lwidth(thick)) ///
(scatter ageinyears proportionblack), ///
ytitle(Age (years)) ///
xtitle(Proportion black) ///
legend(off)

// Generate yhat and forecast confidence intervals (use stub to identify new vars)
regress_yhat_ci, forecast stub(f_)

// Plot regression line
twoway (rarea f_ageinyears_hat_lb f_ageinyears_hat_ub proportionblack if _plotindicator == 1, sort fcolor(lime%70)) ///
(rarea ageinyears_hat_lb ageinyears_hat_ub proportionblack if _plotindicator == 1, sort fcolor(bluishgray%60)) ///
(line ageinyears_hat proportionblack if _plotindicator == 1, sort lcolor(bluishgray) lwidth(thick)) ///
(scatter ageinyears proportionblack), ///
ytitle(Age (years)) ///
xtitle(Proportion black) ///
legend(off)

// Generate residuals
generate ehat = ageinyears - ageinyears_hat

// Studentized residuals
predict student_ehat, rstudent

// QQplot for residual diagnostic
qnorm student_ehat

// Constant variance diagnostic
twoway (scatter  student_ehat proportionblack)
```

Mole rats , ABD example 18.4, page 620
Dataset: 18-e-4
N observations: 35
Response: lnenergy (log daily energy expenditure)
Predictors: lnmas (log body mass); caste (worker, lazy)

Questions to consider:
• What relationship are the researchers studying?
• Plot the data.
• Write down a linear model which will help researchers understand the relationship of interest.
• How many parameters are in the model?
• Write down the null and alternative hypotheses of interaction effect in terms of the model parameters.
• Write down the null and alternative hypotheses of interaction effect in terms of a comparison of two models.
• In terms of the research question, what does the presence an interaction effect mean?
• Fit the model.
• Plot the estimated relationship of the null model and the estimated relationship of the alternative model. See figure 18.4-1.
• Plot model diagnostics of both models (null and alternative).
• Test the hypothesis.

```// Read data
getvdata 18-e-4

// Generate worker indicator variable
encode caste, generate(lazy)
gen worker = 1*(caste == "worker")

// Construct regression model
regress lnenergy lnmass i.lazy c.lnmass#i.lazy

// Generate data for plot
plotdata, at(lazy=1; lnmass = 4.2(.1)5.5)
plotdata, at(lazy=2; lnmass = 3.5(.1)5)

// Generate predicted values and CIs
regress_yhat_ci

// Generate plot of regression model
twoway  ///
(rarea lnenergy_hat_lb lnenergy_hat_ub lnmass if _plotindicator == 1, sort fcolor(red%30)) ///
(line lnenergy_hat lnmass if _plotindicator==1, sort lcolor(red) lwidth(vthick)) ///
(rarea lnenergy_hat_lb lnenergy_hat_ub lnmass if _plotindicator == 2, sort fcolor(blue%30)) ///
(line lnenergy_hat lnmass if _plotindicator==2, sort lcolor(blue) lwidth(vthick)) ///
(scatter lnenergy lnmass if lazy==1, mcolor(red)) ///
(scatter lnenergy lnmass if lazy==2, mcolor(blue)) ///
, legend(order(5 "Lazy" 6 "Worker")) ///
ytitle(Energy expenditure (log scale)) ///
xtitle(Body mass (log scale))
```

Mole rats , ABD example 18.4, page 620
Dataset: 18-e-4
N observations: 35
Response: lnenergy (log daily energy expenditure)
Predictors: lnmas (log body mass); caste (worker, lazy)

Goal: Generate table of standard hypothesis tests.

```getvdata 18-e-4
gen worker = 1*(caste == "worker")

/*
# Chunk test deliciousness

In the following example, we are going to perform a series of chunk tests.  We
are going to demonstrate how to perform a chunk test with two different methods.
The first method is a comparison of models.  The second is a direct test of the
betas.  The two methods are equivalent and will give identical answers.

Depending on your circumstances, one method may be easier to implement than another.

## METHOD 1: Comparison of models

### Interaction test

FULL: E[ lnenergy | - ] = b0 + b1 lnmass + b2 worker + b3 worker * lnmass
REDUCED: E[ lnenergy | - ] = b0 + b1 lnmass + b2 worker

### Total impact of lnmass

FULL: E[ lnenergy | - ] = b0 + b1 lnmass + b2 worker + b3 worker * lnmass
REDUCED: E[ lnenergy | - ] = b0 +             b2 worker

### Total impact of caste

FULL: E[ lnenergy | - ] = b0 + b1 lnmass + b2 worker + b3 worker * lnmass
REDUCED: E[ lnenergy | - ] = b0 + b1 lnmass

### Overall regression

FULL: E[ lnenergy | - ] = b0 + b1 lnmass + b2 worker + b3 worker * lnmass
REDUCED: E[ lnenergy | - ] = b0

### Steps to compare two models

1. Fit and store each model (use `est store` command)
2. Use `ftest` command

### Note:

+ The comparison of models approach is only valid if the regression models are fit
using the same observations.  This is often an issue if there is missing data as
the default practice is to disregard observations that have missing covariates.
The consequence is that the REDUCED model may use more observations than the FULL
model.  We will cover an example of how to compare models when there is missing data.
*/

regress lnenergy lnmass worker i.worker#c.lnmass
est store full

regress lnenergy lnmass worker
est store no_int // for no interaction

regress lnenergy worker
est store no_mass // model with no lnmass

regress lnenergy lnmass
est store no_caste // model with no caste variable

regress lnenergy
est store no_vars // model with no predictors

ftest full no_int // (also available from regress output)
ftest full no_mass
ftest full no_caste
ftest full no_vars // (also available from regress output)

/*
## METHOD 2: Betas

### Interaction test

Ho: b3 = 0
Ha: b3 != 0

### Total impact of lnmass

Ho: b1 = 0 AND b3 = 0
Ha: b1 != 0 OR b3 != 0

### Total impact of caste

Ho: b1 = 0 AND b2 = 0
Ha: b1 != 0 OR b2 != 0

### Overall regression

Ho: b1 = 0 AND b2 = 0 AND b3 = 0
Ha: b1 != 0 OR b2 != 0 OR b3 != 0

### Steps to test betas

1. Fit the full model
2. Use the `regress, coeflegend` command to identify labels for betas
3. Use `test` command

### Note:

+ This approach avoids issues with missing data.
*/

regress lnenergy lnmass worker i.worker#c.lnmass
regress, coeflegend

test (1.worker#c.lnmass)              // Interaction (also available from regress output)
test (lnmass 1.worker#c.lnmass)        // Total impact of lnmass
test (worker 1.worker#c.lnmass)        // Total impact of caste
test (lnmass worker 1.worker#c.lnmass) // Overall regression (also available from regress output)

/*
## Table of usefull hypothesis tests

You will want to organize the results of commonly used hypothesis tests in a
table like the following:

| Variable      | df | Test stat | p-value |
|---------------|----|-----------|---------|
| lnmass        |  2 |           |         |
|   interaction |  1 |           |         |
| caste         |  2 |           |         |
|   interaction |  1 |           |         |
| Overall       |  3 |           |         |

*/
```

##### TASK: Categorical Variables in STATA
Dataset: nhgh
N observations: 6805
Response: Height
Predictors: Race and ethnicity, sex
```/***
# Incorporating categorical variables into a regression model

Example 1: Height as a function of race-ethnicity (re)

+ Option 1 - estimate the mean for each re

E[ ht | re ] =   a1 * Mexican American
+ a2 * Other Hispanic
+ a3 * Non-Hispanic White
+ a4 * Non-Hispanic Black
+ a5 * Other Race Including Multi-Racial

+ Option 2 - estimate the mean in a reference group (say Non-Hispanic Black),
then estimate the difference in mean height between the reference
group and each of the remaining re group

E[ ht | re ] =   b0
+ b1 * Mexican American
+ b2 * Other Hispanic
+ b3 * Non-Hispanic White
+ b4 * Other Race Including Multi-Racial
***/

getvdata nhgh

// Group means coding
regress ht ibn.re, noconst

/*
Note: When you use the noconstant option, disregard any thing from the SS table or
the overall chunk test
*/

// Reference group coding
regress ht i.re

/*
Which is the reference group?
How do you tell stata which group to use as reference?
+ help fvvarlist
*/

regress ht ib3.re //         ib#.           use # as base, #=value of variable
regress ht ib(3).re //       ib(##).        use the #th ordered value as base (**)
regress ht ib(first).re //   ib(first).     use smallest value as base (the default)
regress ht ib(last).re //    ib(last).      use largest value as base
regress ht ib(freq).re //    ib(freq).      use most frequent value as base

/***
Example 2: Height as a function of race-ethnicity (re) and sex
***/

regress ht i.re##i.sex
plotdata, at(sex = 1/2; re = 1/5)
regress_yhat_ci
decode(sex), generate(s_sex)
decode(re), generate(s_re)
gen label = s_re + " " + s_sex
egen float row = rank(-10*re+sex) if _plotindicator == 1
point_ci ht_hat ht_hat_lb ht_hat_ub row if _plotindicator == 1 & sex == 1,  ///
labelvar(label) ///
point_options(mcolor(magenta)) ///
ci_options(lwidth(thick)) ///
plot_options(title(Mean heights NHANES) xtitle(Height (cm)) ytitle(Race and Sex))
```

Dataset: nhgh
N observations: 6805
Response: Height
Predictors: Race and ethnicity, sex
```getvdata nhgh

/*
E[ ht | age, sex] = b0
+ b1 sex
+ b2 age + b3 age' + b4 age''
+ b5 age*sex + b6 age'*sex + b7 age''*sex
*/

* Plot data
centile age, centile(2 98)
plotdata, at(age=13/80; sex=1)
plotdata, at(age=13/80; sex=2)

* Generate restricted cubic spline of age with 3 knots
mkspline_plotindicator age, nknots(4)

* Fit the models for SCr and gh
regress ht i.sex##c.(rcs*)

* Generate the following diagnostic plots
*** QQ plot of studentized residuals
predict stud_residual, rstudent
qnorm stud_residual

*** residual-versus-fitted plot
predict yhat, xb
twoway (scatter stud_residual yhat) if _plotindicator == 0

*** residual-versus-predictor plot
rvpplot rcs_age_1
rvpplot rcs_age_2
rvpplot rcs_age_3

*** LEAVE-ONE-OUT
***** DFBETAS
dfbeta
gen row = _n
twoway (scatter _dfbeta_1 row) if _plotindicator == 0
twoway (scatter _dfbeta_2 row) if _plotindicator == 0
* ...

***** leverage
predict leverage, leverage
twoway (scatter leverage row) if _plotindicator == 0

***** DFITS
predict dfits, dfits
twoway (scatter dfits row) if _plotindicator == 0

***** Cooks D
predict cooksd, cooksd
twoway (scatter cooksd row) if _plotindicator == 0
list if cooksd > .005 & _plotindicator == 0

/*
Group work

E[ gh | age, sex, ...]  =

*/
```

The SUPPORT Study, Dupont text section 3.25, page 138
N observations: 1000
```************************************
* THE SUPPORT STUDY
* Dupont text, section 3.25
************************************
clear
getvdata support
keep slos meanbp sex
label variable slos "Length of Stay (days)"
label variable meanbp "Mean Arterial Pressure (mm Hg)"

plotdata, at(meanbp = 0, 25.71429, 51.42857, 77.14286, 102.8571, 128.5714, 154.2857, 181)
plotdata, at(meanbp = 0, 25.71428, 51.42856, 77.14285, 102.8570, 128.5713, 154.2856, 180.9)

scatter slos meanbp
twoway (scatter slos meanbp), ///
yscale(log) ///
ylabel(, angle(horizontal))

egen meanbpcat = cut(meanbp), at(0 25.71429 51.42857 77.14286 102.8571 128.5714 154.2857 181) icodes

* STEP FUNCTION
regress slos i.meanbpcat
regress_yhat_ci, stub(step_)
twoway (scatter slos meanbp) ///
(line step_slos_hat meanbp if step_slos_hat > 0, sort lcolor(orange%99) lwidth(vvthick)), ///
yscale(log) ///
ylabel(, angle(horizontal)) ///
xline(25.71429   51.42857   77.14286   102.8571  128.5714   154.2857 )

summarize ibn.meanbpcat
regress slos ibn.meanbpcat ibn.meanbpcat#c.meanbp, noconstant
regress_yhat_ci, stub(pwl_) drop
twoway (scatter slos meanbp) ///
(line pwl_slos_hat meanbp if pwl_slos_hat > 0, sort lcolor(orange%99) lwidth(vvthick)), ///
yscale(log) ///
ylabel(, angle(horizontal)) ///
xline(25.71429   51.42857   77.14286   102.8571  128.5714   154.2857 )

* LINEAR SPLINE
mkspline ls_bp_ 7 = meanbp, displayknots
twoway (scatter ls_bp_1 meanbp), xline(25.71429   51.42857   77.14286   102.8571   128.5714   154.2857 )
twoway (scatter ls_bp_2 meanbp), xline(25.71429   51.42857   77.14286   102.8571  128.5714   154.2857)
twoway (scatter ls_bp_3 meanbp), xline(25.71429   51.42857   77.14286   102.8571  128.5714   154.2857 )
twoway (scatter ls_bp_4 meanbp), xline(25.71429   51.42857   77.14286   102.8571  128.5714   154.2857 )
list in 1/5

regress slos ls_bp_*
regress_yhat_ci, stub(ls_)

twoway (scatter slos meanbp) ///
(line ls_slos_hat meanbp if ls_slos_hat > 0, sort lcolor(orange%99) lwidth(vvthick)), ///
yscale(log) ///
ylabel(, angle(horizontal)) ///
xline(25.71429   51.42857   77.14286   102.8571  128.5714   154.2857 )

* RESTRICTED SPLINE

* mkspline rcs_bp_ = meanbp, nknots(7) cubic displayknots
* 40 59.6829 69 78 101.8579 113 139
mkspline rcs_bp_ = meanbp, knots(40 59.6829 69 78 101.8579 113 139) cubic displayknots

twoway (scatter rcs_bp_1 meanbp), xline(40 59.6829 69 78 101.8579 113 139 )
twoway (scatter rcs_bp_2 meanbp), xline(40 59.6829 69 78 101.8579 113 139 )
twoway (scatter rcs_bp_3 meanbp), xline(40 59.6829 69 78 101.8579 113 139 )

twoway (scatter rcs_bp_1 meanbp) ///
(scatter rcs_bp_2 meanbp) ///
(scatter rcs_bp_3 meanbp) ///
(scatter rcs_bp_4 meanbp) ///
(scatter rcs_bp_5 meanbp) ///
(scatter rcs_bp_6 meanbp) ///
, xline(40 59.6829 69 78 101.8579 113 139 )

regress slos rcs_bp_*
regress_yhat_ci, stub(rcs_) drop

twoway (scatter slos meanbp) ///
(line rcs_slos_hat meanbp if rcs_slos_hat > 0, sort lcolor(orange%99) lwidth(vvthick)), ///
yscale(log) ///
ylabel(, angle(horizontal)) ///
xline(40 59.6829 69 78 101.8579 113 139 )

* TEST OF LINEARITY
test (rcs_bp_2 rcs_bp_3 rcs_bp_4 rcs_bp_5 rcs_bp_6)
testparm rcs_bp_2-rcs_bp_6

* OR FULL vs REDUCED
est store rcs // full
regress slos meanbp
regress_yhat_ci, stub(sl_) drop // simple linear
est store simple_linear
ftest simple_linear rcs

* COMPARE CUBIC TO LINEAR
label variable rcs_slos_hat "RCS prediction"
label variable sl_slos_hat "Linear prediction"

twoway  (scatter slos meanbp) ///
(line rcs_slos_hat meanbp, sort lcolor(orange%99) lwidth(vvthick)) ///
(line sl_slos_hat meanbp, sort lcolor(green%99) lwidth(vvthick)), ///
yscale(log range(3 250)) ///
ylabel(2(2)10 10(10)50 50(50)250, angle(horizontal))
```

##### TASK: Restricted Cubic Splines & Partial Effect Plots
Framingham Heart Study, Dupont example 3.11, page 102
Dataset: Levy (1999)
N observations: 4699
Response: log serum cholesterol
Predictors: bmi, age, sex
```* 1. Specify the model
*--------------------------------------------------
* E[ log scl | - ] = b0 + b1 * sex +
*    b2 * age + b3 * age'
*    b4 * bmi + b5 * bmi'
*    b6 * age * sex + b7 * age' * sex
*    b8 * bmi * sex + b9 * bmi' * sex
* V[ log scl | - ] = sigma^2

* 2. Specify the partial effect plots of interest
*--------------------------------------------------
*    Plot 1: Y = E[ log scl | - ]
*            X = age
*            line for males and line for females
*            other covariates set at median
*
*    Plot 2: Y = E[ log scl | - ]
*            X = bmi
*            line for males and line for females
*            other covariates set at median

* Use centile to get range and median of variables

getvdata 2.20.Framingham
gen logscl = log(scl)
label define map_sex 1 "Male" 2 "Female"
label values sex map_sex
centile age bmi, centile(5 50 95)

* 3. Generate data for plotting
*    All of the predictors in the model need to be
*    specified.
*--------------------------------------------------
plotdata, at(age=34/60; bmi=25; sex=1/2)
plotdata, at(age=45; bmi=19/33; sex=1/2)

* OR specify each line separately
*    This approach can simplify the `if` statements
*    when generating the partial effect plots

* plotdata, at(age=34/60; bmi=25; sex=1)
* plotdata, at(age=34/60; bmi=25; sex=2)
* plotdata, at(age=45; bmi=19/33; sex=1)
* plotdata, at(age=45; bmi=19/33; sex=2)

* 4. Generate restricted cubic splines
*
*    Why use `mkspline_plotindicator`?
*
*    To correctly calculate the default knots for
*    the predictor variable after plotting data
*    has been appended to the dataset.
*
*    If you want to use custom knots, or if no
*    plotting data have been appended to the data,
*    then `mkspline` will work just fine.
*--------------------------------------------------
mkspline_plotindicator bmi, nknots(3)
mkspline_plotindicator age, nknots(3)

* 5. Fit the model
*--------------------------------------------------
regress logscl rcs_age* rcs_bmi* i.sex i.sex#c.(rcs_bmi*) i.sex#c.(rcs_age*)
* Shorter way:
regress logscl i.sex##c.(rcs*)

* 6. Model diagnostics (more to come)
*--------------------------------------------------
regress_yhat_ci
predict studres, rstudent
qnorm studres

* 7. Generate partial effect plots
*--------------------------------------------------
twoway ///
(rarea logscl_hat_lb logscl_hat_ub age if sex == 1, sort fcolor(cranberry%50)) ///
(line logscl_hat age if sex == 1, sort lcolor(cranberry) lwidth(thick)) ///
(rarea logscl_hat_lb logscl_hat_ub age if sex == 2, sort fcolor(dknavy%50)) ///
(line logscl_hat age if sex == 2, sort lcolor(dknavy) lwidth(thick)) ///
if _plotindicator == 1, ///
legend(order(2 "Male" 4 "Female")) ///
ytitle(Mean log scl) ///
xtitle(Age) ///
note(BMI = 25)

twoway ///
(rarea logscl_hat_lb logscl_hat_ub bmi if sex == 1, sort fcolor(cranberry%50)) ///
(line logscl_hat bmi if sex == 1, sort lcolor(cranberry) lwidth(thick)) ///
(rarea logscl_hat_lb logscl_hat_ub bmi if sex == 2, sort fcolor(dknavy%50)) ///
(line logscl_hat bmi if sex == 2, sort lcolor(dknavy) lwidth(thick)) ///
if _plotindicator == 2, ///
legend(order(2 "Male" 4 "Female")) ///
ytitle(Mean log scl) ///
xtitle(BMI) ///
note(age = 45)
```

The SUPPORT Study, Dupont text section 3.25, page 138
N observations: 1000
Response: Total RCC cost
```************************************
* THE SUPPORT STUDY
* Dupont text, section 3.25
************************************
getvdata support
keep age sex dzgroup num_co edu income scoma meanbp hrt resp temp totcst
misstable summarize
misstable pattern
replace edu = . if edu == .z
replace income = . if income == .z
misstable summarize
mi query
mi set flong
mi register imputed edu income totcst
codebook num_co
codebook scoma
mi impute chained (pmm, knn(10)) totcst edu (ologit) income = age i.sex i.dzgroup num_co  scoma meanbp hrt resp temp, add(10) augment rseed(23450)

list income totcst edu _mi_id if _mi_m == 0 & _mi_id  < 10
list income totcst edu _mi_id if _mi_m == 1 & _mi_id  < 10
list income totcst edu _mi_id if _mi_m == 2 & _mi_id  < 10

mkspline rcs_age_=age, cubic nknots(3)
mkspline rcs_meanbp_=meanbp, cubic nknots(3)
mkspline rcs_hrt_=hrt, cubic nknots(3)

list rcs* if _mi_m == 1 & _mi_id  < 10
list _mi_miss in 1/10
list _mi_miss _mi_m if _mi_m == 1 & _mi_id  < 10

mi estimate : regress totcst rcs_* i.sex i.dzgroup num_co scoma i.income
mi estimate, saving("miexample", replace) : regress totcst rcs_* i.sex i.dzgroup num_co scoma i.income
help mi estimate
mi estimate, coeflegend
mi test (rcs_age_1 rcs_age_2)
mi test (2.income 3.income 4.income)
mi test (2.dzgroup 3.dzgroup 4.dzgroup 5.dzgroup 6.dzgroup 7.dzgroup 8.dzgroup)
```

The SUPPORT Study, Dupont text section 3.25, page 138
N observations: 1000
```/*
WHY? If we hope to construct stable models, there is a limit to the complexity of the models we construct.  The type of outcome and size of dataset limit complexity.

APPROACH: Determine "effective sample size" and parameter budget.  Live within the parameter budget and allocate parameters to predictors based on research goals and the strength of association with outcome.

CHALLENGE: Lots of predictors with a limited parameter budget.

STRATEGY DEMONSTRATED:
+ Variable clustering
*/
getvdata support
keep age sex dzclass race meanbp wblc hrt resp temp alb bili crea sod ph glucose bun urine adlsc
* findit hcavar
tabulate sex, generate(s)
tabulate dzclass, generate(dz)
hcavar s2 dz2 dz3 dz4 adlsc urine bun glucose ph sod crea bili alb temp resp hrt wblc meanbp age
```

##### TASK: Data Reduction with Principle Components
Glycohemoglobin from NHANES 2009-2010, BBR 4-22, page 51
Dataset: nhgh (from Department collection of datasets)
N observations: 6795
```/*
WHY? If we hope to construct stable models, there is a limit to the complexity of the models we construct.  The type of outcome and size of dataset limit complexity.

APPROACH: Determine "effective sample size" and parameter budget.  Live within the parameter budget and allocate parameters to predictors based on research goals and the strength of association with outcome.

CHALLENGE: Lots of predictors with a limited parameter budget.

STRATEGIES DEMONSTRATED:
+ Generalized Spearman's (predictive potential)
+ Variable clustering
+ Principal components
*/

getvdata nhgh

* SUBSET AGE GE 21 and DX = 0 and TX = 0
keep if age >= 21 & dx == 0 & tx == 0
drop dx tx

* GENERATE INDICATOR VARIABLES FOR RE
quietly tabulate re, generate(re_)
list re re_*

* PREDICTIVE CAPABILITY OF CONTINUOUS VARIABLES
* must have EGENMORE installed, use command "findit egenmore" for user defined function
spearman2 gh age wt leg arml armc waist tri sub

* VARIABLE CLUSTERING
* findit hcavar
hcavar wt ht bmi leg arml armc waist tri sub age sex re_*

* PRINCIPAL COMPONENTS
centile waist bmi, centile(5 95)
twoway (scatter waist bmi), xline(20 40) yline(74 125)
pca waist bmi, components(2)
predict waist_bmi_pc*, score

* INTERNAL COEFFICIENTS
regress waist_bmi_pc1 waist bmi
regress waist_bmi_pc2 waist bmi

twoway (scatter waist_bmi_pc2 waist_bmi_pc1)
centile waist_bmi_pc1 waist_bmi_pc2, centile(5 95)
twoway (scatter waist_bmi_pc2 waist_bmi_pc1), xline(-1.95 2.49) yline(-.466 .473)

* USE PC IN REGRESSION
regress SCr waist_bmi_pc1 age leg armc```

Framingham Heart Study, Dupont example 3.11, page 102
Dataset: Levy (1999) (Available in department collection of datasets.)
N observations: 4699
Response: Log Systolic Blood Pressure
Predictors: log(bmi), age, sex, log(serum cholesterol)
```*************************
* FRAMINGHAM HEART STUDY
* Dupont, Section 3.11
*
* In this example, we are going to purposely include a mis-coded age so that we can see if/how the fit diagnostics identify the miscoded observation.
*************************

getvdata 2.20.Framingham

* For demonstration purposes:
replace age = 103 if id == 2642

generate logsbp = log(sbp)
label variable logsbp "Log Systolic Blood Pressure"
generate logbmi = log(bmi)
label variable logbmi "Log Body Mass Index"
generate logscl = log(scl)
label variable logscl "Log Serum Cholesterol"

label define sexlabel 1 "Male" 2 "Female"
label values sex sexlabel

* MODEL 3.23
* E[log spb | ... ] = b0 + b1 logbmi + b2 age + b3 logscl + b4 female
*                        + b5 female * logbmi + b6 female * age
*                        + b7 female * logscl
regress logsbp logbmi age logscl i.sex i.sex#c.(logbmi age logscl)

* OUTLIERS
dfbeta

* FIT DIAGNOSTICS
predict yhat, xb
predict ehat, residuals
predict standard_resid, rstandard
predict student_resid, rstudent
predict cooksd, cooksd
predict leverage, leverage
predict dfits, dfits

* PLOTS
twoway (scatter student_resid yhat) (lowess student_resid yhat, lwidth(vthick)), yline(2 -2)

* COOKS D
generate index = _n
twoway (scatter cooksd index)

* DFIT
twoway (scatter dfits index)

* DFBETAS
graph matrix _dfbeta_*
```

##### TASK: R2 Internal Validation in STATA
Framingham Heart Study, Dupont example 3.11, page 102
Dataset: Levy (1999) (Available in department collection of datasets.)
N observations: 4699
Response: Log Systolic Blood Pressure
Predictors: log(bmi), age, sex, log(serum cholesterol)
```*************************
* FRAMINGHAM HEART STUDY
* Dupont, Section 3.11
*
* In this example, we will generate in-sample and optimism corrected measures of model performance.
*
*************************

getvdata 2.20.Framingham

generate logsbp = log(sbp)
label variable logsbp "Log Systolic Blood Pressure"
generate logbmi = log(bmi)
label variable logbmi "Log Body Mass Index"
generate logscl = log(scl)
label variable logscl "Log Serum Cholesterol"

label define sexlabel 1 "male" 2 "female"
label values sex sexlabel

quietly regress logsbp i.sex##c.(logbmi age logscl)

* R2 Bootstrap VALIDATION
validate_regress, reps(100) seed(3243)

// What is over-fitting and why is it a concern?
// What is optimism?

* In-sample calibration plot
regress_yhat_ci

// Does an in-sample calibration plot provide meaningful information?

twoway (scatter logsbp logsbp_hat) ///
(line logsbp_hat logsbp_hat, sort) ///
(lowess logsbp logsbp_hat)

* In-sample Mean absolute prediction error
gen abs_error = abs(logsbp - logsbp_hat)
sum abs_error

// Does the in-sample mean absolute prediction over or under estimate the out-of-sample mean absolute prediction error?

* In-sample Rank discrimination
** Spearman
spearman logsbp logsbp_hat

// What is the difference between discrimination and calibration?

** In-sample Kendall's tau
ktau logsbp logsbp_hat

// What is Kendall's tau?
```

##### TASK: Crash course in logistic regression
RMS notes 10.1.3, 10.5 letter W
Dataset: sex.age.response & acath
```#RMS 10.1.3

getvdata sex.age.response

by sex, sort : summarize age

plotdata, at(age=37/70; sex=1)
plotdata, at(age=34/61; sex=2)

logistic response i.sex age, coef

logistic_phat_ci

twoway ///
(rarea response_phat_lb response_phat_ub age if _plotindicator == 1, sort fcolor(lavender%25) lcolor(lavender)) ///
(line response_phat age if _plotindicator == 1, sort lcolor(lavender) lwidth(vthick)) ///
(rarea response_phat_lb response_phat_ub age if _plotindicator == 2, sort fcolor(orange%25) lcolor(orange)) ///
(line response_phat age if _plotindicator == 2, sort lcolor(orange) lwidth(vthick)) ///
, ytitle(P(response | sex, age)) ///
xtitle(Age (years)) ///
legend(order(2 "Female" 4 "Male"))
```

##### TASK: Logistic regression, RCS, interactions
RMS notes 10.5 letter A
Dataset: acath
Response: sigdz
Predictors: age, sex, cholesterol
```/*
0.  Question

1.  Translate question to model

a. Missing data strategy

3.  Determine parameter budget

a.  Effective sample size --> 15:1 rule --> budget
b.  Data reduction needed?
c.  Interactions?
d.  Non-linear terms
e.  Allocation of parameters according to clinical understanding

4.  Fit the model

5.  Model diagnostics

a.  Overly influential observations (dfbetas, dfits)
b.  Residual diagnostics (linear model usually)

6.  Measures of model performance (Optimism corrected)

a.  Discrimination
b.  Calibration

7.  Understanding the model (Partial effect plots)

8.  Hypothesis tests

a.  Total predictor impact
b.  Linearity
c.  Interaction

*/

getvdata acath

/*
0. What are the associations of age, cholesterol, and sex with significant coronary disease
1. logodds[ SCD | - ] =
*/

keep sex age choleste sigdz
misstable pattern
drop if choleste == .z /* For demonstration purposes.  Imputation is a better option. */

/*
3.  Paramter budget
*/
tab sigdz //
display 768/15 // Data reduction probably not needed.
/*
3c.  Interactions?
3d.  Non-linear terms
3e.  Allocation of parameters according to clinical understanding

| Full interaction                   | A * f(B) + B * f(A) Interaction | f(A*B)                      |
|------------------------------------|---------------------------------|-----------------------------|
|   1                                | 1                               | 1                           |
| + sex                              | + sex                           | + sex                       |
| + rcs(age, 4 knots)                | + rcs(age, 4)                   | + rcs(age, 4)               |
| + rcs(cholesterol, 4 knots)        | + rcs(cholesterol, 4)           | + rcs(cholesterol, 4)       |
| + rcs(age, 4)*rcs(cholesterol, 4)  | + age * rcs(cholesterol, 4)     | + rcs(age * cholesterol, 4) |
|                                    | + cholesterol * rcs(age, 4)     |                             |
| + rcs(age, 4)*sex                  | + rcs(age, 4) * sex             | + rcs(age, 4)*sex           |
|------------------------------------|---------------------------------|-----------------------------|

How many parameters do each of these models have?
*/

/* OUT OF ORDER
Need to make plans for partial effect plots.  Plan is to generate heatmap of

Age |                 Z = predicted log odds
|
|
|
+--------------
cholesterol

Holding sex = 1
*/
centile age cholest, centile(0 3 50 97 100)
plotdata, at(age = 34(2)70; choleste = 150(5)340; sex = 1)

// Time permitting, will come back to this part
// It isn't essential
plotdata, at(age = 17(2)81; choleste = 30(5)575; sex = 1)
sum age cholest if _plotindicator == 0
bnormpdf age cholest, dens(npdf) m1(52.2) m2(229.9) s1(9.9) s2(50.6) rho( 0.0139)
centile npdf if _plotindicator == 0, centile(5)
gen drop5 = 1*(npdf < .000016) if _plotindicator == 2
twoway (scatter age cholest if _plotindicator == 2 & drop5 == 0) ///
(scatter age cholest if _plotindicator == 0)

/*
4. Fit the model
*/
mkspline_plotindicator age, nknots(4)
mkspline_plotindicator choleste, nknots(4)
logistic sigdz i.sex rcs_age* rcs_chol* c.rcs_age_1#c.(rcs_chol*) c.rcs_choleste_1#c.(rcs_age*) c.(rcs_age*)#i.sex

/*
5. Diagnostics
*/
gen row = _n
predict leverage, hat
twoway scatter leverage row if _plotindicator == 0
twoway scatter deltad row if _plotindicator == 0

// DFBETAS not part of STATA base.  Can generate with ldfbeta package.
// search ldfbeta
//ldfbeta // Will be slow for larger datasets.
//twoway scatter DFrcs_ag row
//twoway scatter DFrcs_ch row
//twoway scatter DF2 row

/*
6. Apparent measures of performance
*/
logistic_phat_ci // to get phat
brier sigdz sigdz_phat if _plotindicator == 0
lroc
//ssc install somersd
somersd sigdz sigdz_phat

/*
Apparent calibration with histogram rug plot
*/

twoway ///
(lowess sigdz sigdz_phat, lcolor(blue) lwidth(thick)) ///
(line sigdz sigdz, sort lcolor(red) lwidth(medium)) ///
(histogram sigdz_phat, fraction bin(100) fcolor(black) lwidth(none)) ///
, title(Apparent calibration) ///
legend(order(1 "Apparent calibration" 2 "Ideal calibration")) ///
xtitle(Predicted probability)

/*
Optimism corrected measures of performance
*/
logistic sigdz i.sex rcs_age* rcs_chol* c.rcs_age_1#c.(rcs_chol*) c.rcs_choleste_1#c.(rcs_age*) c.(rcs_age*)#i.sex
validate_logistic, reps(100) seed(023)

/*
Partial effect plots
*/
twoway ///
(contour sigdz_phat age choleste if _plotindicator == 1, levels(100) format(%0.0g) heatmap zlabel(#7))

twoway ///
(contour sigdz_phat age choleste if _plotindicator == 1, levels(100) format(%0.0g) heatmap zlabel(#7)) ///
(scatter age choleste if _plotindicator == 0, mcolor(white%0) mfcolor(white%0) mlcolor(black%10))

twoway ///
(contour sigdz_phat age choleste if _plotindicator == 2 & drop5 == 0, levels(100) format(%0.0g) heatmap zlabel(#7) interp(none)) ///
(scatter age choleste if _plotindicator == 0, mcolor(white%0) mfcolor(white%0) mlcolor(black%8))

/*
8.  Hypothesis Tests
*/
logistic sigdz i.sex rcs_age* rcs_chol* c.rcs_age_1#c.(rcs_chol*) c.rcs_choleste_1#c.(rcs_age*) c.(rcs_age*)#i.sex
est store full

// Minus sex
test (1.sex 1.sex#c.rcs_age_1 1.sex#c.rcs_age_2 1.sex#c.rcs_age_3)
testparm i.sex i.sex#c.*
// OR
quietly logistic sigdz rcs_age* rcs_chol* c.rcs_age_1#c.(rcs_chol*) c.rcs_choleste_1#c.(rcs_age*)
est store minus_sex
lrtest full minus_sex

// Minus age
logistic sigdz i.sex rcs_age* rcs_chol* c.rcs_age_1#c.(rcs_chol*) c.rcs_choleste_1#c.(rcs_age*) c.(rcs_age*)#i.sex
testparm rcs_age* c.rcs_age*#c.* c.rcs_age*#i.*
//       |        |              + Interaction of age with categorical vars
//       |        + Interaction of age with continuous variables
//       + Age terms without interaction
// OR
quietly logistic sigdz i.sex rcs_chol*
est store minus_age
lrtest full minus_age

// Minus chol
logistic sigdz i.sex rcs_age* rcs_chol* c.rcs_age_1#c.(rcs_chol*) c.rcs_choleste_1#c.(rcs_age*) c.(rcs_age*)#i.sex
testparm rcs_chol* c.rcs_chol*#c.* c.rcs_chol*#i.*
// OR
quietly logistic sigdz i.sex rcs_age* c.(rcs_age*)#i.sex
est store minus_chol
lrtest full minus_chol

// Minus chol nonlinear
logistic sigdz i.sex rcs_age* rcs_chol* c.rcs_age_1#c.(rcs_chol*) c.rcs_choleste_1#c.(rcs_age*) c.(rcs_age*)#i.sex
// OR
quietly logistic sigdz i.sex rcs_age* rcs_choleste_1 c.rcs_choleste_1#c.(rcs_age*) c.(rcs_age*)#i.sex
est store minus_chol_nonlinear
lrtest full minus_chol_nonlinear

// Minus chol interaction
quietly logistic sigdz i.sex rcs_age* rcs_chol* c.(rcs_age*)#i.sex
est store minus_chol_interaction
lrtest full minus_chol_interaction
```

##### TASK: Bayes, a tantalizing morsel
RMS notes 10.11
Dataset: sex.age.response
```getvdata sex.age.response
display (5/invnormal(0.975))^2
display ((20 / invnormal(0.975)) / 10)^2
bayes, ///
prior({response:_cons}, t(0,10,3)) ///
prior({response:age}, normal(0,1)) ///
prior({response:sex}, normal(0,6.5)) ///
mcmcsize(40000) ///
thinning(10) ///
rseed(123) ///
saving(stata-bayes-results.dta, replace) ///
coef : logistic response age sex

bayesgraph diagnostics _all

use stata-bayes-results.dta, clear

sunflower eq1_p1 eq1_p2
twoway (scatter eq1_p1 eq1_p2, mcolor(blue%2) msymbol(circle) mlwidth(none))
gen age_gt_zero = 1*(eq1_p1 > 0)
sum age_gt_zero
gen sex_gt_zero = 1*(eq1_p2 > 0)
sum sex_gt_zero
gen sex_age_gt_zero = 1*(eq1_p1 > 0 & eq1_p2 > 0)
sum sex_age_gt_zero
```

##### TASK: Regression with ordinal predictors with a moderate (5ish - 20ish) number of levels
If an ordinal variable has few levels, we generally include it into a regression model as a nominal categorical variable. If an ordinal variable has many levels, we generally treat it as a continuous variable and include it into the regression as a restricted cubic spline. However, what are the options for an ordinal variable with a middle number of levels?
Dataset: support
```getvdata support

* Collapse levels 6 and 7
codebook num_co
replace num_co = 6 if num_co == 7

* To compare models
plotdata, at(num_co = 0/6)

* Model of mean los where num_co is a nominal categorical variable
regress slos i.num_co
regress_yhat_ci, stub(i_) /* store results with i_ prefix */
twoway scatter i_slos_hat num_co if _plotindicator == 1

* Model of mean los where num_co association is linear
regress slos c.num_co
regress_yhat_ci, stub(c_)
twoway scatter c_slos_hat num_co if _plotindicator == 1

* Model of mean los where num_co association is quadratic
gen num_co_squared = num_co*num_co
regress slos c.num_co c.num_co_squared
regress_yhat_ci, stub(c2_)
twoway scatter c2_slos_hat num_co if _plotindicator == 1

* Model of mean los where num_co association is quadratic with indicator variable for group 0
gen zero_co = 1*(num_co == 0)
regress slos c.num_co c.num_co_squared zero_co
regress_yhat_ci, stub(c2_gi_)
twoway scatter c2_gi_slos_hat num_co if _plotindicator == 1

* Compare estimated means from all models
list num_co *_slos_hat if _plotindicator == 1
```

##### TASK: Crash course in ordered logistic regression
Dataset: Simulated HIV data provided by Bryan Shepherd as part of the Liu, Shepherd, Li, Harrell (2017) publication.
```use "https://biostatdata.app.vumc.org/tgs/simhiv.dta", clear
// Simulated HIV data provided by Bryan Shepherd as part of the Liu, Shepherd, Li, and Harrell (2017) publication.

encode site, generate(site_id)
generate sex = real(male)
encode init_class, generate(class)
encode route, generate(iroute)
encode preARTAIDS_clinical, generate(base_aids)
drop site male init_class route preARTAIDS_clinical

/*
post_cd4 ~ site + male + rcs(age) + init_class + route +
preARTAIDS_clinical

To create the partial-effect plot:
What is the variable on the x-axis?
What is the range of the x-axis variable?
What value are the other predictors held-at?

E[post_cd4|-] |       ****
|     *
|    *
|  *
|*
|
+--------------
*/

plotdata, at(          ///
nadir_cd4 = 2(2)550; ///  This is the x-axis variable
init_year = 2007;    ///  All other covariates in the
pre_rna = 35000;     ///  model are set to a specific
age = 35;            ///  value.
site_id = 1;         ///
sex = 1;             ///
class = 1;           ///
iroute = 1;          ///
base_aids = 1)

mkspline_plotindicator age, nknots(5)
mkspline_plotindicator init_year, nknots(5)
mkspline_plotindicator pre_rna, nknots(4)

set matsize 800

ologit post_cd4 i.site_id i.sex rcs* i.base_aids i.iroute i.class if _plotindicator == 0

ologit_yhat

/*
Hypothesis tests (can also perform with lrtest command)
*/
testparm rcs_age*   // total impact of age
testparm i.site_id  // total impact of site

/*
Partial effect plot
*/
twoway (line post_cd4_mean nadir_cd4 if _plotindicator == 1, sort)
```

##### TASK: Crash course in Cox regression
RMS notes 18.1
Dataset: prostate

STATA script

Lesson Plans (restricted access)
Topic revision: r1 - 01 Feb 2023, FrankHarrell

• Biostatistics Webs

Copyright &© 2013-2022 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Vanderbilt Biostatistics Wiki? Send feedback