Procedures for Managing MOON Analysis Database

-- BillRenfrew - 04 Mar 2004

Sports Medicine Work Log (and what may be the beginnings of Standard Operating Procedures (SOPs) for Processing Sports Medicine Data)

LOG DATE: 03/03/04

Converting Access to SAS:

Each component of the ACL data resides in an Access database. To be able to run consistency checks and re-code values, the data is converted to SAS. DBMScopy and STAT TRANSFER had been tried in the past and errors were found with each, so it was decided to convert the files using PROC IMPORT. The syntax is as follows:

PROC IMPORT OUT= WORK.TESTFORBILL
            DATATABLE= "moon_evaluation_entry"
            DBMS=ACCESS2000 REPLACE;
     DATABASE="H:\Projects\Sports Medicine\11-17-03.MDB";
RUN;

Each component of the MDB files was converted in this manner. The names given to the newly created datasets were as follows:

Form3.sas7bdat Form5_1.sas7bdat Form5_2.sas7bdat Kler.sas7bdat Nqtp12.sas7bdat Nques_1.sas7bdat Nques_2.sas7bdat Nr474.sas7bdat

The data were stored in 3 master tables: one for the oldest data, one for the next oldest, and one for the newest. The files were distinguished by the letter ‘o’ (oldest data), ‘n’ (next oldest), or no letter for the newest data. These datasets were, and should be, combined when using the first-year data.
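As a minimal sketch, the combination can be done with a single SET statement; the ‘o’/‘n’ dataset-name prefixes below are an assumption about how the lettering convention was applied:

```sas
/* Hedged sketch: stack the three vintages of Form 5 into one first-year
   dataset. The o/n prefixes on the dataset names are an assumption. */
data form5_all;
  set oform5_1 nform5_1 form5_1;
run;
```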

Data Combination:

To match all variables by patient ID for each form, the SSN was extracted as a substring of the id variable; the IDs were consistent with the exception of 5 cases out of 596.
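A sketch of the matching step, assuming the SSN occupies the first nine characters of the id variable (the exact position is not stated above, and the dataset names are illustrative):

```sas
/* Hedged sketch: pull the SSN out of the id variable on each form,
   then merge the forms by SSN. Names and positions are assumptions. */
data form5_1;
  set form5_1;
  ssn = substr(id, 1, 9);   /* assumed SSN position within id */
run;

proc sort data=form3;   by ssn; run;
proc sort data=form5_1; by ssn; run;

data combined;
  merge form3(in=in3) form5_1(in=in5);
  by ssn;
  match_all = (in3 and in5);   /* flags the cases that fail to match */
run;
```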

Priority of analyses:

An analysis priority was established, and it was decided that the MOON (Multi-Center Orthopaedic Outcomes Network) Form 5: Operative Information Questionnaire would be first, but only through OCD Lesion, which may be dropped at a later date. The data components used would be the form5_1.sas7bdat files from each of the 3 components.

Consistency Checks:

The primary concern has been the accuracy of each data element. To assess that, each variable was first formatted as follows:

proc format;
value validx
  .='MISSING'
  9='N/A'
  .i='INVALID'
  other='IN RANGE';
run;

Data Dictionary:

.i (SAS Special Missing) was used if a response was outside the range specified in the data dictionary (DD):

(H:\Projects\Sports_Medicine\data_dictionary\FORM 5 Data dictionary_Revised by BR.sxc). This document was revised by BR because an escape sequence (or something like it) in the original version kept crashing my computer. Updates have included adding, in italics, questionnaire variables that were missing in the original DD spreadsheet. Surprisingly, up to now, only a small number of observations have shown invalid data.

Skip Patterns:

The next concern was determining whether skip patterns were consistent. For example, if a surgery was performed and several questions were asked relating to that surgery, consistency checks were performed to make sure that those questions were in fact answered. Similarly, if the surgery question was negative, checks were performed to make sure the following items were coded as N/A. There was variation in the coding scheme: N/A was sometimes coded as 9 (N/A), 0, or missing. All of these were combined into the ‘9’ category and formatted as N/A, but only for variables nested within skip patterns. Additional formats were written for variables not readily covered by the INVALID format and are listed below. To match a variable with its format, use the first 6 letters of the variable name followed by ‘x.’. The yesno format was used universally for the lead surgery variables.
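As a sketch, one such skip-pattern check might cross-tab a lead surgery variable against each nested item; surg_var and item_var are hypothetical variable names, and the yesnox/validx formats are the ones described in this log:

```sas
/* Hedged sketch: verify a skip pattern by cross-tabbing the lead surgery
   variable against a nested item. Variable names are hypothetical. */
proc freq data=form5_1;
  tables surg_var*item_var / missing;
  format surg_var yesnox. item_var validx.;
run;
```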

SAS Formats:

proc format;

value yesnox
  .="Missing"
  1,2="NO SURG"
  3,4="SURG";

value yesno2x
  .="Missing"
  1="NO SURG"
  2="SURG";

value doc_typx
  .="Missing"
  1-3="Valid"
  other="Out of Range";

value loosex
  .="Missing"
  1-4="Valid"
  other="Out of Range";

value synovitx
  .="Missing"
  0-2,8="Valid"
  other="Out of Range";

value syn_tx_x
  .="Missing"
  0,1,9="Valid"
  other="Out of Range";

value acl_teax
  .="Missing"
  0,1,2="Valid"
  other="Out of Range";

value ACL_PERx
  .="Missing"
  0-99="Valid"
  other="Out of Range";

value aclg_prx
  .="Missing"
  0,1,2="Valid"
  other="Out of Range";

value aclg_pex
  .="Missing"
  0-99="Valid"
  other="Out of Range";

value pcl_teax
  .="Missing"
  0,1,2="Valid"
  other="Out of Range";

/********Missing a couple vars here********/

value pcl_perx
  .="Missing"
  0-99="Valid"
  other="Out of Range";

value MCLx
  .="Missing"
  0,1,2,3,4="Valid"
  other="Out of Range";

value mcl_teax
  .="Missing"
  0,1,2,3,4,5="Valid"
  9="N/A"
  other="Out of Range";

value lclx
  .="Missing"
  1,2,3,4="Valid"
  other="Out of Range";

value LCL_teax
  .="Missing"
  0,1,2,3,4,5,6,7="Valid"
  9="N/A"
  other="Out of Range";

value mtreatx
  .="Missing"
  0='NO TX FOR TEAR'
  1,2,3,4="VALID"
  9="N/A"
  other="OUT OF RANGE";

run;

Treatment Variables:

The TREATMENT variables within the Meniscus Tear Surgeries (Lateral 1 and 2, Medial 1 and 2) had small skip patterns of their own and were handled separately from the general skip pattern, since they may or may not have been filled out when a surgery took place. It was determined that none were filled out when there WASN’T a surgery, though, which was good news.

Articular Lesion Sections:

The Articular Lesion variables (MTP, LTP, PAT and TRO) were slightly more problematic. Even though the questionnaire states to fill out the following variables if a surgery is performed, there were cases where no surgery was performed and N/A was still the proper answer. The confusion lies in N/A being the standard re-code for secondary variables following skip patterns. Therefore, missings, 0s and 9s were NOT all combined into the “N/A” category, and these secondary variables were not cross-tabbed with the lead surgery variable EXCEPT to check that, when there was no surgery, the following variables did not have IN RANGE or VALID responses. Once that was ruled out, the individual secondary variables were checked independently for validity.
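As a sketch, the exception check described above might look like the following; mtp_surg and mtp_item are hypothetical variable names, and the yesnox format is the one defined in this log:

```sas
/* Hedged sketch: for an articular lesion section, restrict to the
   "no surgery" cases and confirm no nested item has a valid response.
   Variable names are hypothetical. */
proc freq data=form5_1;
  where put(mtp_surg, yesnox.) = "NO SURG";
  tables mtp_item / missing;
  format mtp_item validx.;
run;
```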

Plans for improvement:

Macro-ize the cleaning program. Make the formats more consistent (the INVALID vs. OUT OF RANGE distinction is confusing, etc.).

LOG DATE 03/04/04

New Derived variable created:

/* New derived variable to account for design flaw in LFC_NOFX and LFC_NOFXNUM */
if lfc_fx in (.,0,9) then D_LFC_NOFXNUM = 99;
else if lfc_fx = 1 then D_LFC_NOFXNUM = LFC_NOFXNUM;
else D_LFC_NOFXNUM = .i;

I went back to the beginning to fix some things. A basic set of rules was established that can be discussed in meeting tomorrow:

1. If a variable is outside a skip pattern (independent) and the legitimate range of values includes 0 and 9 (N/A), no recoding will be done. Values out of range will be identified as such by the format statement.

2. If a variable is independent and the legit values include 9, but NOT 0, any 0s that occur are re-coded to 9s. Missings are left alone.

3. If a variable is independent and the legit values include neither 9 nor 0, no re-codes are performed on values of 0, 9, or missing. Values outside the range are flagged as INVALID by the format statement.

4. If a variable is inside a skip pattern (dependent) and the legit values include 0 and 9, missings will be re-coded to 9 (N/A).

5. If a variable is dependent and the legit values include 9 but not 0, then all missings and 0s are recoded to 9s.

6. If a variable is dependent and the legit values include neither 9 nor 0, values of 0 and 9 are still coded as 9s. Other values outside the range are flagged as such by the format.
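A sketch of how rules 2, 4, and 5 translate into a data step; indep_var, dep_var and dep_var2 are hypothetical variable names standing in for variables of each type:

```sas
/* Hedged sketch of recode rules 2, 4, and 5. Variable names are
   hypothetical placeholders, one per rule type. */
data form5_clean;
  set form5_1;
  /* Rule 2: independent, legit values include 9 but not 0 -> 0 becomes 9 */
  if indep_var = 0 then indep_var = 9;
  /* Rule 4: dependent, legit values include 0 and 9 -> missing becomes 9 */
  if dep_var2 = . then dep_var2 = 9;
  /* Rule 5: dependent, legit values include 9 but not 0 -> missing and 0 become 9 */
  if dep_var in (., 0) then dep_var = 9;
run;
```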

LOG DATE 03/05/04

Met with the group and decided that the above categories for treating missings, 0s and 9s (N/A) were sufficient. Will continue as planned.

BR

LOG DATE 03/18/04

Finished Form 5 (part 1, that is... part 2 may not be used). Looked for the ID number and only found it on the "Old" data (see above). A list has been generated for Allison, and I'm waiting on her response with a list of IDs and updated/corrected values. Meeting to take place Friday morning, 9:00 a.m., at the Sports Medicine suites.

Topics:

1. Timetable for getting back changes/edits on cases that have been flagged for accuracy checks.

2. Chang Yu's idea of combining co-variates into a single co-variate. Background information available soon.

3. Agreement on the next level of priority for accuracy checks. I believe it is the MOON Intraoperative Data. Timetable to be discussed. Discussion of other issues to follow.

4. One routine of Dr. Harrell's, called describe, is in the R supplemental library Hmisc. It basically runs frequencies on each variable in the file: frequencies on categorical/limited numeric variables, quantiles on continuous ones. May be a good place to start to look at what we have and see where we are. Then perhaps run some simple means, plots, etc. But the 1st priority is the MOON Intraoperative Data.

Till next time.

b

LOG DATE 05/06/04

Dear Sports fans,

All along there have been merge errors because of typos in the REGN field for the SSN. I finally dealt with these. Instead of writing some complicated program to find close matches and do string searches to fix and replace text in the SSNs, etc., I did it the old-fashioned way (only possible with a small data set like ours, obviously): I sorted it, printed it out, and looked at it. :>). There were a lot more errors than I thought, so it was a bit of a hassle, but they’ve all been addressed and we are a stone’s throw away from having a file. Here are the details:

The solution was fairly basic. I looked to see which SSN appeared on more of the file components, and then changed the erroneous one(s) to that number. (If it was 111111111 on 1 out of 5 components and 121111111 on 4 out of 5 of the components, I changed the 111111111 to 121111111 so all 5 had 121111111.) Of course there would HAVE to be a tie or it wouldn’t be our data, right? (It’s at the bottom.)

The other funky SSNs that I came up with were as follows, and I’m not sure how to deal with them. Most of them are simply missing SSNs from form 5, which means the individual has everything EXCEPT form 5 in the data.

270880200 missing from f4
284906695 missing from f5
291886221 missing from f5

(I think these next two may be related, but I can’t prove it. They’re suspiciously close together. ;>/ )
295543808 only on f5
295649510 missing from f5

336501922 missing from q1
380748472 missing from f5

383112896 missing from q5
483961382 missing from f5

Then there’s the little ‘tie’ problem mentioned above. Anyone know what the actual SSN should be? Below are the SSNs shown on the different file components.

    k_ssn              q1_ssn           f3_ssn       f5_ssn               r4_ssn
        .           483961362                .            .            483961362
483961382                   .        483961382            .                    .

Finally, there are 3 records in there with missing SSNs on all 5 components. Not sure if there’s any data…I’ll have to check. I’m sure we’ll just delete these.

Everything else is CLEAN!!!!!!

Until next time.

LOG DATE 09/01/04 (this entry is being added much later than when the code was run; this was actually run back in June)
/******************CUTS FOR FINAL DATASET- ONLY WANT UNILATERAL*******/

/*******************************from FORM 5***************************/
if acl_recon ne 1 then cut_flag=1;

/*******************************from kler*****************************/
if contralat in (3,4) then cut_flag=1;  

if (d_knee=" LEFT" and PRESUR_RT = 1) then do;
   if ACLREP_RT=1 or INTACL_RT=1 OR EXTACL_RT=1 THEN CUT_FLAG=1;
END;

if (d_knee="RIGHT" and presur_lt=2) then do;
   if ACLREP_LT=1 or INTACL_LT=1 OR EXTACL_LT=1 THEN CUT_FLAG=1;
END;
/****************************from Ph Exam (Form 3) *******************/
if eff_uni gt 1 or end_uni =0 then cut_flag=1;

IF CUT_FLAG=1 THEN DELETE;

LOG DATE 08/26/04

Warren Dunn's specs on sorting out people with missing primary/secondary status:

First, make sure they had an ACL reconstruction by confirming that acl_recon=1 and acl_gtype ne 9. Then check prev_surgrt and prev_surg_lt; if these are missing, then it is likely a primary. If they are not missing, check the side against the knee variable. Also, we can use the previous graft harvest variables: acl_gpat, acl_gham, and acl_gquad. These are coded 1=right and 2=left. Using the knee variable, coded the same way (1=right and 2=left), we can determine revisions. If they had a previous harvest on the same side, it is a revision. The problem here is that if the primary was an allograft we will miss it.
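A sketch of this logic as a data step; the variable names follow the text above, but d_revision is a hypothetical derived flag, and the exact missing-value handling is an assumption:

```sas
/* Hedged sketch of the primary-vs-revision logic described above.
   d_revision is a hypothetical flag: 0=likely primary, 1=revision. */
if acl_recon = 1 and acl_gtype ne 9 then do;   /* confirmed ACL reconstruction */
   if prev_surgrt = . and prev_surg_lt = . then
      d_revision = 0;                          /* no prior surgery: likely primary */
   /* previous graft harvest on the same side implies a revision;
      harvest and knee variables are both coded 1=right, 2=left */
   else if acl_gpat = knee or acl_gham = knee or acl_gquad = knee then
      d_revision = 1;
end;
```

Note the caveat from the text still applies: a primary done with an allograft leaves no harvest site, so this check will miss it.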

b
Topic revision: r13 - 27 Nov 2017, DalePlummer