Procedures for Preparing Datasets for Analysis and for Transmission to the Department of Biostatistics

Preferred Method for Transmitting Datasets to the Department of Biostatistics

  • Data-Hippo a secure file transfer application for confidential information.

Preferred Dataset Formats

The first two formats are desirable because they facilitate transmission of variable labels (long variable names) and value labels (definitions for coded variables).
  • Stata .dta files
  • SPSS .sav files
  • Comma separated variables with field names in one row and optionally with long field labels (descriptions) in another row
  • Spreadsheets, formatted as described under comma separated variables
  • Many others are possible

Stata and SPSS files can be read into R using the functions stata.get and spss.get in the Hmisc package. These functions have many options. stata.get automatically senses date variables. With spss.get you can specify a list of variables to be converted to dates in R.

For Department members: Stat/Transfer runs under Windows or Linux and can convert any SAS binary format to Stata to import into R.

Example Spreadsheets | Dallal's Notes

  • Spreadsheet from Heaven: Excel spreadsheet to demonstrate the proper way of entering data for a clinical research project.

  • Spreadsheet from hell: Excel spreadsheet to demonstrate improper ways of entering data for a clinical research project.

How to enter research data in a computer spreadsheet for optimal statistical analysis

10 Data Entry Commandments

  1. Enter all or most of the data as numbers. Avoid entering letters, words, string variables (e.g.,NA, 22%, <3.6), or anything that resembles a cartoon curse word, @#&*%,. In Excel, all columns, with the exception of names and text comments, should be formatted as numbers or dates (not as general or text).
  2. Give each column a unique, simple, 1-word name, 8 characters or less with no spaces, beginning with a letter, and place this name in the first row.
  3. Put only one variable in a column. Do not combine variables in the same column.
  4. Enter each patient (or unit of analysis) on a separate line, beginning on the second line.
  5. Give each research participant or patient a unique case number (1,2,3, etc.)- in the first column. Delete patient name, SS#, MR#, and any identifying information before sending it to a statistician. Always, save the spreadsheet with a password.
  6. Enter cases and controls in the same spreadsheet. Use one variable to define the control group (TREATED 0=no, 1=yes or GROUP 1=Drug A, 2=Drug B).
  7. Quantify. Enter continuous measurements when possible.
  8. Create a simple guide (or key) using a word processor to explain variables abbreviations, value coding, and how missing values were entered. Be consistent.
  9. Think through the analysis before collecting any data.
  10. Have a biostatistician review the coding before data entry and again after the first 10 patients have been entered.

"Research demands involvement. It cannot be delegated very far."
Topic attachments
I Attachment Action Size Date Who Comment
Spreadsheetfromhell.xlsxls Spreadsheetfromhell.xls manage 17.5 K 17 May 2007 - 11:37 ColeBeck  
spreadsheetfromheaven.csvcsv spreadsheetfromheaven.csv manage 1.1 K 28 Jan 2008 - 22:20 FrankHarrell Spreadsheet from Heaven (csv version exported from spreadsheet)
spreadsheetfromheaven.xlsxls spreadsheetfromheaven.xls manage 16.5 K 17 May 2007 - 11:37 ColeBeck  
Edit | Attach | Print version | History: r16 | r12 < r11 < r10 < r9 | Backlinks | View wiki text | Edit WikiText | More topic actions...
Topic revision: r11 - 14 Jan 2009, FrankHarrell
 

This site is powered by FoswikiCopyright © 2013-2022 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Vanderbilt Biostatistics Wiki? Send feedback