Purpose: Frank explained the data, its location, and his data manipulation to date, to hand off the project since he's leaving.
Everything is on the drive "koyama on 10.103.16.98" ("barocas" on my machine), under /PCOS.
The data were originally in Excel files. Frank saved them as CSVs in the Work_Data dir.
There were data from surveys, and survival data obtained from registries.
In this study, "baseline" and "6 month" mean the same thing. "6 month" refers to 6 months after diagnosis.
Frank has made two main R data sets, both produced by scripts: Master Data.RData (contains survival data and comorbidity data from the PCOS survey) and PCOS Study Cohort.RData (no survival data; contains demographics, including income and education). Both data sets summarize all the PCOS survey data.
The N for both data sets is 3718.
Frank's scripts use some of his own functions, which are saved in MF.RData and in a script called My Function.R.
Sampling weights
PCOS oversampled young patients and minority races. There are sampling weights in the data. In Frank's data, they are called "weight." The weights were determined using registry, race, and age.
Since the data were obtained using weighted sampling, we must use analysis methods that take this into account, for example, weighted regression.
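A rough illustration of what the weights change, sketched in Python only for illustration (Frank's scripts are in R). The rows, the "age" and "score" fields, and the helper names are made up; only the "weight" column name comes from Frank's data. It compares an unweighted mean with a weighted one and fits a weighted least-squares line. This gives point estimates only; standard errors for survey data need design-based methods, which this sketch does not attempt.

```python
def weighted_mean(values, weights):
    """Mean of values where each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def weighted_ls(x, y, weights):
    """Weighted least squares for y = a + b*x; returns (a, b)."""
    xbar = weighted_mean(x, weights)
    ybar = weighted_mean(y, weights)
    sxy = sum(w * (xi - xbar) * (yi - ybar) for xi, yi, w in zip(x, y, weights))
    sxx = sum(w * (xi - xbar) ** 2 for xi, w in zip(x, weights))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


# Hypothetical rows: the oversampled (younger) patients carry smaller weights.
rows = [
    {"age": 50, "score": 80.0, "weight": 1.0},
    {"age": 55, "score": 78.0, "weight": 1.0},
    {"age": 65, "score": 70.0, "weight": 3.0},
    {"age": 70, "score": 66.0, "weight": 3.0},
]
ages = [r["age"] for r in rows]
scores = [r["score"] for r in rows]
weights = [r["weight"] for r in rows]

unweighted = sum(scores) / len(scores)      # treats the sample as if it were random
weighted = weighted_mean(scores, weights)   # re-balances toward the population
intercept, slope = weighted_ls(ages, scores, weights)
```

In this toy data the unweighted mean overstates the score because the oversampled group scores higher; the weighted mean corrects for that.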
To merge any of the PCOS data, we need to merge by both registry and pcosid, because not all pcosid values are unique.
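A minimal sketch of the composite-key merge, in Python for illustration only (in R, the analogous call would be merge() with by = c("registry", "pcosid")). The rows, the registry values, the "dead" column, and the merge_on helper are hypothetical; only the registry, pcosid, and weight column names come from the notes.

```python
# Hypothetical rows: pcosid 101 appears in two different registries.
survey = [
    {"registry": "A", "pcosid": 101, "weight": 1.5},
    {"registry": "B", "pcosid": 101, "weight": 0.8},
    {"registry": "B", "pcosid": 102, "weight": 1.1},
]
survival = [
    {"registry": "A", "pcosid": 101, "dead": 0},
    {"registry": "B", "pcosid": 101, "dead": 1},
    {"registry": "B", "pcosid": 102, "dead": 0},
]


def merge_on(left, right, keys):
    """Inner join of two lists of dicts on the given key columns.

    Raises ValueError if the key is not unique in the right-hand table.
    """
    index = {}
    for row in right:
        k = tuple(row[c] for c in keys)
        if k in index:
            raise ValueError(f"duplicate key in right table: {k}")
        index[k] = row
    merged = []
    for row in left:
        k = tuple(row[c] for c in keys)
        if k in index:
            merged.append({**row, **index[k]})
    return merged


# Merging on both columns matches each patient to exactly one survival row.
merged = merge_on(survey, survival, ["registry", "pcosid"])

# Merging on pcosid alone is ambiguous here, so the helper refuses.
try:
    merge_on(survey, survival, ["pcosid"])
    pcosid_alone_ok = True
except ValueError:
    pcosid_alone_ok = False
```

The duplicate check is a design choice for this sketch: failing loudly on a non-unique key is safer than silently pairing a patient with another registry's record.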
Fixed some dates that were in the future. The affected fields were five-year survey dates and diagnosis dates. The problem was in my code, and it was caused by using a "y" instead of a "Y" in the date format. One affected field was surveyDate.5y, which comes from pcosnew$q1_date.
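A sanity check along these lines would catch this kind of problem. It is sketched in Python for illustration only, with made-up rows, a made-up cutoff date, and hypothetical helper names; only the surveyDate.5y and pcosid names come from the notes. Python's strptime rejects a four-digit year under "%y" instead of misreading it, so the bad value below is simulated directly.

```python
from datetime import date, datetime


def parse_date(text, fmt="%m/%d/%Y"):
    """Parse a date string; %Y is a four-digit year, %y a two-digit year."""
    return datetime.strptime(text, fmt).date()


def future_dates(rows, field, cutoff):
    """Return the rows whose date in `field` falls after `cutoff`."""
    return [r for r in rows if r[field] is not None and r[field] > cutoff]


# Hypothetical rows; the second date simulates a year that was misread.
rows = [
    {"pcosid": 101, "surveyDate.5y": parse_date("03/15/2000")},
    {"pcosid": 102, "surveyDate.5y": date(2020, 3, 15)},
    {"pcosid": 103, "surveyDate.5y": None},  # missing dates are skipped
]

cutoff = date(2010, 1, 1)  # hypothetical: no valid date should be later than this
suspect = future_dates(rows, "surveyDate.5y", cutoff)

# In Python, the wrong format code fails loudly on a four-digit year.
try:
    parse_date("03/15/2000", "%m/%d/%y")
    wrong_code_accepted = True
except ValueError:
    wrong_code_accepted = False
```

Running a check like this on every date column after import, with a cutoff that suits the study period, flags impossible dates before they reach the analysis.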