**********************************
* Import data and only keep 1995 *
**********************************

clear
* set memory is ignored by Stata 12 and later; kept for older versions
set memory 1g
use "http://biostat.mc.vanderbilt.edu/wiki/pub/Main/CourseBios312/salary.dta"
table year
keep if year==95
de

**************************************************
* Convert string variables to numeric indicators *
**************************************************

* Make a male indicator variable; Female is reference group
gen male=.
replace male=1 if sex=="M"
replace male=0 if sex=="F"
table sex male

* Make degree indicator variables; PhD is reference group
gen degother=.
replace degother=1 if deg=="Other"
replace degother=0 if deg=="PhD" | deg=="Prof"
table degother deg

gen degprof=.
replace degprof=1 if deg=="Prof"
replace degprof=0 if deg=="PhD" | deg=="Other"
table degprof deg

* Make field indicator variables; Arts is reference group
gen fieldother=.
replace fieldother=1 if field=="Other"
replace fieldother=0 if field=="Arts" | field=="Prof"
table fieldother field

gen fieldprof=.
replace fieldprof=1 if field=="Prof"
replace fieldprof=0 if field=="Arts" | field=="Other"
table fieldprof field

* Make rank indicator variables; Assistant is reference group
gen rankassoc=.
replace rankassoc=1 if rank=="Assoc"
replace rankassoc=0 if rank=="Full" | rank=="Assist"
table rankassoc rank

gen rankfull=.
replace rankfull=1 if rank=="Full"
replace rankfull=0 if rank=="Assoc" | rank=="Assist"
table rankfull rank

*********************************************
* Step 1: Univariate descriptive statistics *
*   Look for outlying data points           *
*   Get a feel for the data                 *
*********************************************

summ salary male fieldother fieldprof rankassoc rankfull degother degprof yrdeg startyr, de
tabulate sex, summarize(salary)

* Get mean estimates by sex to get a feel for the data; ignore statistical significance
regress salary male

* Comments
*   Checked continuous variables for outliers: None noted
*   More males, assistant professors, "Other" field, PhD degree

************************************************************************
* Step 2: Specify a model for the primary analysis                     *
* A priori, I will specify what I think is the "best" model            *
* Response: Salary (mean with robust standard errors)                  *
*   I actually think log(salary) would be a better fit,                *
*   but the interpretation would be more difficult                     *
* POI: Gender                                                          *
* Confounders: Year of degree (2 df), Rank, Field, Rank*Field, Degree  *
* Precision: Admin responsibilities                                    *
* Irrelevant (after controlling for above): yrdeg, startyr             *
* No effect modification                                               *
************************************************************************

* Create interaction variables for rank and field
gen assocprof = rankassoc*fieldprof
gen assocother = rankassoc*fieldother
gen fullprof = rankfull*fieldprof
gen fullother = rankfull*fieldother

* Quadratic term for year of degree
gen yrdeg2 = yrdeg*yrdeg

* Two ways to run this regression
* (1) Using the hand-made indicator and interaction variables
regress salary male yrdeg yrdeg2 rankassoc rankfull fieldprof fieldother ///
    degprof degother admin assocprof assocother fullprof fullother, robust

* (2) Letting xi build the indicators and interactions
*     By default xi omits the first category of each variable, so its degree
*     reference is "Other" rather than PhD; the fitted model is the same
xi: regress salary male yrdeg yrdeg2 i.rank*i.field i.deg admin, robust

************************************
* Step 3: Exploratory analysis     *
*   (1) Uncontrolled confounding   *
*   (2) Increased precision        *
************************************

* Associations with outcome
tabulate sex rank, summarize(salary)
tabulate sex field, summarize(salary)
tabulate sex deg, summarize(salary)
tabulate sex yrdeg, summarize(salary)

* Saturated model: a separate mean salary for each sex and year of degree
quietly: xi: regress salary i.sex*i.yrdeg
predict yhat
twoway (scatter yhat yrdeg if male==0, sort) ///
       (scatter yhat yrdeg if male==1, sort), ///
       legend(order(1 "Female" 2 "Male"))

* Linear model: a separate straight line in year of degree for each sex
xi: regress salary i.sex*yrdeg
predict yhat2
twoway (line yhat2 yrdeg if male==0, sort) ///
       (line yhat2 yrdeg if male==1, sort), ///
       legend(order(1 "Female" 2 "Male"))

* Overlay the fitted lines on the saturated-model means
twoway (line yhat2 yrdeg if male==0, sort) ///
       (line yhat2 yrdeg if male==1, sort) ///
       (scatter yhat yrdeg if male==0, sort) ///
       (scatter yhat yrdeg if male==1, sort), ///
       legend(order(1 "Female" 2 "Male"))

* Associations among predictors
tabulate rank field, chi2
tabulate rank sex, chi2
tabulate rank deg, chi2
tabulate rank field, summarize(yrdeg)
tabulate rank field, summarize(startyr)

***************************************************
* Step 4: Consider alternate scientific questions *
*   Transformations of the outcome                *
*   Effect modification?                          *
*   Should be related to the primary question     *
***************************************************

* Percent change in salary
gen lnsalary = log(salary)
regress lnsalary male yrdeg yrdeg2 rankassoc rankfull fieldprof fieldother ///
    degprof degother admin assocprof assocother fullprof fullother, robust
* Exponentiate the coefficient to get the male/female ratio of geometric mean salaries
lincom male, eform

* Look at faculty who started in the last 10 years
* (missing values compare as larger than any number, so exclude them explicitly)
regress salary male yrdeg yrdeg2 rankassoc rankfull fieldprof fieldother ///
    degprof degother admin assocprof assocother fullprof fullother ///
    if startyr>=85 & !missing(startyr), robust

* Look at just assistant professors in stratified analysis (should control for many other factors)
regress salary male yrdeg fieldprof fieldother degprof degother admin if rank=="Assist", robust

* Look at just assistant professors using interactions
* (with Assistant as the reference rank, the coefficient on male is the sex effect among assistant professors)
gen maleassoc = male*rankassoc
gen malefull = male*rankfull
regress salary male maleassoc malefull yrdeg yrdeg2 rankassoc rankfull ///
    fieldprof fieldother degprof degother admin ///
    assocprof assocother fullprof fullother, robust
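
* Possible follow-up to the interaction model above. This is a sketch that is
* not part of the original analysis, and it assumes the regression with
* maleassoc and malefull is the most recent estimation in memory.
* The coefficient on male is the male-female difference among assistant
* professors; lincom adds the interaction terms to give the difference in the
* other two ranks.
lincom male + maleassoc
lincom male + malefull
* Joint test of whether the sex difference varies by rank (effect modification)
test maleassoc malefull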