**********************************
* Import data and only keep 1995 *
**********************************
clear
set memory 1g
use "http://biostat.mc.vanderbilt.edu/wiki/pub/Main/CourseBios312/salary.dta"
table year
keep if year==95

**************************************************
* Convert string variables to numeric indicators *
**************************************************
* Make a male indicator variable; Female is reference group
gen male=.
replace male=1 if sex=="M"
replace male=0 if sex=="F"
table sex male

*****************************************
* Save residuals from unadjusted models *
*****************************************
regress salary male
predict resid1, resid
regress yrdeg male
predict resid2, resid

*****************************************
* Plot the residuals; add a lowess line *
*****************************************
scatter resid1 resid2 || lfit resid1 resid2
lowess resid1 resid2, bwidth(.2) addplot((lfit resid1 resid2))

**************************************************
* A simple linear regression using the residuals *
**************************************************
regress resid1 resid2

***********************************
* Now fit the multivariable model *
***********************************
regress salary male yrdeg
* Note that the residuals model doesn't get the residual degrees of freedom exactly right
* The residual d.f. is n-p, where p is the # of estimated coefficients (intercept included)
* The model using the residuals thinks there are only 2 coefficients (intercept and 1 slope), not 3
* Thus, the RMSE and standard error estimates for betas are a little off

**************************************
* The extra problem: Adding rankfull *
**************************************
gen rankfull = .
replace rankfull = 1 if rank=="Full"
replace rankfull = 0 if rank=="Assist" | rank=="Assoc"
table rank rankfull
regress salary male yrdeg
predict resid3, resid
regress rankfull male yrdeg
predict resid4, resid
lowess resid3 resid4, bwidth(.2) addplot((lfit resid3 resid4))
regress resid3 resid4
regress salary male yrdeg rankfull
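
*********************************************************
* Sketch: checking the degrees-of-freedom note directly *
*********************************************************
* A minimal sketch, not part of the original exercise. It assumes every
* regression above used the same observations (no missing values in salary,
* yrdeg, male, or rankfull), so both models share one residual sum of squares.
* The scalar names (se_resid, df_resid, se_full, df_full) are made up here.
quietly regress resid3 resid4
scalar se_resid = _se[resid4]
scalar df_resid = e(df_r)
quietly regress salary male yrdeg rankfull
scalar se_full = _se[rankfull]
scalar df_full = e(df_r)
* Under that assumption the two standard errors should differ only by the
* factor sqrt(df_resid/df_full), so the last two displayed values should agree
display "SE of resid4 in the residual model:  " se_resid
display "Rescaled by sqrt(df_resid/df_full):  " se_resid*sqrt(df_resid/df_full)
display "SE of rankfull in the full model:    " se_full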