PCOS Prostate Cancer Outcomes Study

  • Met with Frank Fan on July 8, 2013. The purpose was for Frank to explain the data, its location, and his data manipulation to date, so that he can hand off the project before he leaves.
  • Everything is on the drive "koyama on 10.103.16.98" (barocas on my machine), under /PCOS.
  • The data were originally in Excel files. Frank saved them as CSV files in the Work_Data directory.
  • There were data from surveys, and survival data obtained from registries.
  • The PCOS data include medical chart abstraction (abstract.csv), cancer registry data (person/sample.csv), 5 sets of surveys, and vital status data from the sites. The "new" vital status data is from Oct. 2012.
  • Indicators of which data sources a patient has: case6, case12, abstract (all in sample.csv).
  • Patients had to have either 6- or 12-month data to be included in PCOS.
  • In this study, "baseline" and "6 month" mean the same thing. "6 month" refers to 6 months after diagnosis.
  • Frank has made two main R data sets which are produced by scripts. The sets are called Master Data.RData (contains survival data, contains comorbidity data from PCOS survey) and PCOS Study Cohort.RData (doesn't contain survival data, contains demographics including income and education). Both data sets summarize all the PCOS survey data.
  • The N for both data sets is 3718.
  • Frank's scripts use some of his own functions, which are saved in MF.RData and in a script called My Function.R.
  • Sampling weights
    • PCOS oversampled young patients and minority races. There are sampling weights in the data. In Frank's data, they are called "weight." The weights were determined using registry, race, and age.
    • Since the data were obtained using weighted sampling, we must use analysis methods that take this into account, for example, weighted regression.
  • To merge any of the PCOS data, merge by both registry and pcosid, because not all pcosid values are unique.
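A minimal sketch of the two-key merge described above, in base R. The data frames are invented toy stand-ins for sample.csv and abstract.csv, and the toy values assume that a repeated pcosid belongs to different registries; only the column names registry, pcosid, agedx, and clinstg come from these notes.

```r
# Toy stand-ins for sample.csv and abstract.csv; all values are invented.
sample_df <- data.frame(
  registry = c(1, 1, 2, 2),
  pcosid   = c(101, 102, 101, 103),
  agedx    = c(61, 70, 58, 66)
)
abstract_df <- data.frame(
  registry = c(1, 2, 2),
  pcosid   = c(101, 101, 103),
  clinstg  = c(11, 13, 15)
)

# pcosid alone does not identify a patient: 101 appears in two registries.
stopifnot(anyDuplicated(sample_df$pcosid) > 0)
# The (registry, pcosid) pair does identify a patient in the toy data.
stopifnot(!anyDuplicated(sample_df[c("registry", "pcosid")]))

# Merging on pcosid only pairs up rows from different registries.
wrong <- merge(sample_df, abstract_df, by = "pcosid")

# Merging on both keys keeps one row per patient in sample_df.
merged <- merge(sample_df, abstract_df,
                by = c("registry", "pcosid"), all.x = TRUE)
stopifnot(nrow(merged) == nrow(sample_df))
```

Checking the row count after each merge, as in the last line, is a cheap way to catch a merge that silently duplicated patients.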

  • Fixed some dates that were in the future. They were five-year survey dates and diagnosis dates. The problem was in my code, and it was caused by using a "y" (two-digit year) instead of a "Y" (four-digit year) in the date format. One of the affected variables was surveyDate.5y, which is from pcosnew$q1_date.
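The sketch below illustrates how this kind of format bug can push dates into the future. The date string and its month/day/year layout are assumptions for illustration, not the actual contents of q1_date, and the cutoff date is hypothetical.

```r
# Invented example string; the real layout of q1_date may differ.
raw <- "05/01/2009"

# Buggy: "%y" is the two-digit year, so only the first two digits ("20")
# are read as the year and the trailing characters are ignored.
buggy <- as.Date(raw, format = "%m/%d/%y")

# Fixed: "%Y" reads the full four-digit year.
fixed <- as.Date(raw, format = "%m/%d/%Y")

# A simple guard: no parsed date should fall after a known cutoff
# (hypothetical cutoff chosen for this example).
cutoff <- as.Date("2014-04-02")
stopifnot(fixed <= cutoff)
```

Running a guard like the last line on every parsed date column would flag this class of error right after import.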

Data

  • The 6-month functional outcomes in PCOS are also in the six-month survey (along with baseline).

Questions to ask Penson/Barocas about data

  • Should treatment variables with missing values be imputed as no? It looks like Frank's code did that for the secondary treatment variables in the pcosnew data only. Do you remember which treatment variables were supposed to use this imputation? _We will defer this question to secondary treatments. We won't need to impute the primary treatment._
  • For q1_y on the 12 month survey, is 99 a legitimate year or the missing code? I'm not sure from just looking at the data dictionary. There were 7 coded 99 and none coded 98.
  • There are two CSV files for the long-term follow-up. One is called pcos.new, and the other is called pcos.new.survey. Do you know of any pre-processing that Frank did to create one from the other, or do you know where they came from? The codebook for the long-term follow-up indicates that there are variables q1_y, q1_m, and q1_d, and no q1_date. However, the data I have in pcos.new.survey.csv has neither q1_m nor q1_d, and it does have q1_date. What are legitimate years for this study? Here are the frequencies:
pcosnew$Q1_Y
      n missing  unique    Mean
   1016       5       6    2009

          1940 1994 1995 2004 2008 2009
Frequency    1    1    1    1    5 1007
%            0    0    0    0    0   99

Completed

  • For insurance, should we use the algorithm Frank used with a8a-a8g from abstract.csv, or should we use insure from the abstract? Here is how they correspond. In the table, the variable "insurance" on the side is the one created with Frank's algorithm, and the one going across the top is the one from abstract.csv. _Answer: First check if they said "no insurance." If they checked that, the value should be "no insurance." If not, look at the variable "insure."_
  • Frank used the variable primtrt from abstract.csv, and he excluded those with blanks or "9-Unknown info" in his main data manipulation file. We have decided to use trtment from sample.csv instead. Should I similarly exclude those with no value coded for this variable? There are 1593 of them. _Yes, we can throw them out because, Dave thinks, they are people who weren't even part of PCOS._ For now, I'm not going to exclude these people. I will exclude based on whether they have either 6- or 12-month data, and then see if they have missing treatment.
  • What should we use as the source of age? agedx, the age at diagnosis, from sample.csv? _Yes._
  • For the primary treatment variable, we don't need to incorporate info from sources other than trtment, like "Treatments of survived patients.0614.csv," or "other Treatment Patients.csv," right? _No!_
  • For the purposes of Dan's AS R03, we have decided to group the primary treatment into the following categories: no treatment, active treatment, and hormone only. This is how I am grouping the codes: list("No treatment" = c(1), "Active treatment" = c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13), "Hormone only" = c(14)). Some of the codes 2-13 include "unable to sequence" as part of the label; for example, 5 is "Rad prost + XRT - unable to sequence." The trtment codes are:
1 = Watchful waiting
2 = Rad prost + XRT + hormone
3 = Rad prost + XRT
4 = XRT + prost
5 = Rad prost + XRT - unable to sequence
6 = Rad prost + horm
7 = Hormone + rad prost
8 = Rad prost + hormone - unable to sequence
9 = XRT + hormone
10 = hormone + XRT
11 = XRT + hormone - unable to sequence
12 = Rad prost only
13 = XRT only
14 = Hormone only
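One way the grouping above could be applied, sketched in base R. The trtment values here are invented, and the names trt_groups, lookup, and trt3 are hypothetical; only the code-to-category mapping comes from these notes.

```r
# Mapping from trtment codes to the three analysis categories.
trt_groups <- list(
  "No treatment"     = 1,
  "Active treatment" = 2:13,
  "Hormone only"     = 14
)

# Invented trtment values, including a missing one.
trtment <- c(1, 12, 13, 14, 7, NA)

# Build a code -> category lookup vector, named by the code as a string.
lookup <- rep(names(trt_groups), lengths(trt_groups))
names(lookup) <- unlist(trt_groups)

# Recode; codes outside 1-14 and missing codes stay NA.
trt3 <- factor(unname(lookup[as.character(trtment)]),
               levels = names(trt_groups))
table(trt3, useNA = "ifany")
```

Keeping the mapping in a single list makes it easy to revise the grouping later without touching the recoding step.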
  • Comment: we also decided to impute any missing values of trtment with primtrts. However, there are no patients with missing trtment who have nonmissing values of primtrts.
> with(trts, table(trtment, primtrts, exclude = NULL))
       primtrts
trtment    1    2    3    4    5    6    7    8    9   10   99 <NA>
   1     561    0    0    0    0    0    0    0    0    7  110    0
   2       1   14    2    2    0   12    0   13    0    0    0    0
   3       0    2   74    0    0    0    0   19    1    0    0    0
   6       0    0    0   21    0    0    0    5    0    0    0    0
   7       0    0    0    0    0  130    0   14    0    5    1    0
   8       2    0    0    0    0    1    0    4    0    0    0    0
   9       0    0    0    0   23    0    0    0    5    1    0    0
   10      3    0    0    0    1    0  112    0   10   15    4    0
   11      3    0    0    0    0    0    0    0    2    5    6    0
   12     33    0    0    0    0    0    0 1392    0    0    4    0
   13     35    0    0    0    2    0    0    0  607    1   23    0
   14     30    0    0    0    0    0    0    0    0  485   25    0
   <NA>    0    0    0    0    0    0    0    0    0    0    0 1593
  • Double check inclusion criteria "clinically localized." Frank used metastasis (E6) and clinstg from abstract:
allg$Stage <- ifelse(allg$metastasis == 1, "Metastatic",
              ifelse(allg$clinstg %in% CL, "Clinically Localized",
              ifelse(allg$clinstg %in% RA, "Regionally Advanced", "NA")))
_Answer: high PSA is generally greater than 20. First throw out people with PSA over 50. Then use clinstg to include only codes 11-15. We can ignore E6, which is metastasis._
correct!
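A minimal sketch of the inclusion rule from the answer above. The rows are invented, and `psa` is a hypothetical column name that is not taken from the data dictionary. Treating a missing PSA as "not over 50" is also an assumption of this sketch, not a decision recorded in these notes.

```r
# Invented rows; `psa` is a hypothetical column name.
d <- data.frame(
  psa     = c(4.2, 18, 35, 62, NA, 9),
  clinstg = c(11, 15, 12, 13, 14, 21)
)

# Step 1: flag PSA over 50 (missing PSA is not flagged; assumption).
over50 <- !is.na(d$psa) & d$psa > 50

# Step 2: keep only clinstg codes 11-15. Metastasis (E6) is ignored.
keep <- !over50 & d$clinstg %in% 11:15
cohort <- d[keep, ]
```

Tabulating `over50` and `d$clinstg` before subsetting shows how many patients each step removes.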

  • Penson says we need to check if the updated vital status data are really newer. _Still haven't done this._
  • We noticed some discrepancies between versions of the data dictionaries. This may be the reason primtrt1 is in one version while primtrts is in another.
  • There is another variable called f2 that clinstage was calculated from (with other variables).

-- JoAnnAlvarez - 02 Apr 2014
Topic revision: r3 - 19 Sep 2014, JoAnnAlvarez
 
