Procedures for Preparing Datasets for Analysis and for Transmission to the Department of Biostatistics
Preferred Method for Transmitting Datasets to the Department of Biostatistics
- Data-Hippo, a secure file transfer application for confidential information.
Preferred Dataset Formats
The first two formats are desirable because they facilitate transmission of variable labels (long variable names) and value labels (definitions for coded variables).
- Stata .dta files
- SPSS .sav files
- Comma-separated values (CSV) with field names in one row and, optionally, long field labels (descriptions) in another row
- Spreadsheets, formatted as described for comma-separated values
- Many others are possible
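As an illustration of the comma-separated layout described above, the following is a minimal sketch in Python using only the standard library. It assumes the field names are in row 1, the optional long labels in row 2, and the data from row 3 onward. The sample data, the column names, and the helper `read_labeled_csv` are hypothetical and are not part of any Department tool.

```python
import csv
import io

# Hypothetical example data: row 1 holds short field names,
# row 2 holds the optional long field labels, data start on row 3.
raw = """id,age,sbp,treated
Case number,Age in years,Systolic blood pressure (mmHg),Treated (0=no 1=yes)
1,54,132,0
2,61,145,1
"""

def read_labeled_csv(text):
    """Return (labels, records) from CSV text with a names row and a labels row."""
    rows = list(csv.reader(io.StringIO(text)))
    names, long_labels, data = rows[0], rows[1], rows[2:]
    # Map each short field name to its long label.
    labels = dict(zip(names, long_labels))
    # One dict per data row, keyed by the short field names.
    records = [dict(zip(names, row)) for row in data]
    return labels, records

labels, records = read_labeled_csv(raw)
print(labels["sbp"])
print(records[1]["age"])
```

Keeping the labels in their own row lets the short names stay simple while the descriptions still travel with the file. Values are read as text here; converting them to numbers is left to the analysis software.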
Stata and SPSS files can be read into R using the functions stata.get and spss.get in the Hmisc package. These functions have many options. stata.get automatically senses date variables. With spss.get you can specify a list of variables to be converted to dates in R.
For Department members: Stat/Transfer runs under Windows or Linux and can convert any SAS binary format to Stata for import into R.
- Spreadsheet from Heaven: Excel spreadsheet to demonstrate the proper way of entering data for a clinical research project.
- Spreadsheet from Hell: Excel spreadsheet to demonstrate improper ways of entering data for a clinical research project.
How to enter research data in a computer spreadsheet for optimal statistical analysis
10 Data Entry Commandments
- Enter all or most of the data as numbers. Avoid entering letters, words, string variables (e.g., NA, 22%, <3.6), or anything that resembles a cartoon curse word (@#&*%). In Excel, all columns, with the exception of names and text comments, should be formatted as numbers or dates (not as general or text).
- Give each column a unique, simple, 1-word name of 8 characters or fewer, with no spaces, beginning with a letter, and place this name in the first row.
- Put only one variable in a column. Do not combine variables in the same column.
- Enter each patient (or unit of analysis) on a separate line, beginning on the second line.
- Give each research participant or patient a unique case number (1, 2, 3, etc.) in the first column. Delete patient name, SS#, MR#, and any other identifying information before sending the spreadsheet to a statistician. Always save the spreadsheet with a password.
- Enter cases and controls in the same spreadsheet. Use one variable to define the control group (e.g., TREATED: 0=no, 1=yes; or GROUP: 1=Drug A, 2=Drug B).
- Quantify. Enter continuous measurements when possible.
- Create a simple guide (or key) using a word processor to explain variable abbreviations, value coding, and how missing values were entered. Be consistent.
- Think through the analysis before collecting any data.
- Have a biostatistician review the coding before data entry and again after the first 10 patients have been entered.
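Several of the commandments above can be checked mechanically before a file is sent. The following is a minimal sketch in Python using only the standard library, not a Department tool. The function `check_sheet`, its `skip_columns` parameter, and the sample rows are hypothetical. It assumes that column names contain only letters and digits, that blank cells denote missing values, and that name, comment, and date columns are listed in `skip_columns` so they are exempt from the numeric check.

```python
import re

# Assumed reading of the naming rule: starts with a letter,
# letters and digits only, at most 8 characters in total.
NAME_RULE = re.compile(r"^[A-Za-z][A-Za-z0-9]{0,7}$")
# Plain integers or decimals only, so entries like "NA", "22%", "<3.6" fail.
NUMBER_RULE = re.compile(r"^-?\d+(\.\d+)?$")

def check_sheet(rows, skip_columns=()):
    """rows: list of lists of strings; the first row holds the column names.
    Returns a list of problem descriptions (empty if none were found)."""
    problems = []
    names = rows[0]
    for name in names:
        if not NAME_RULE.match(name):
            problems.append(f"bad column name: {name!r}")
    if len(set(names)) != len(names):
        problems.append("column names are not unique")
    seen_ids = set()
    # Data begin on line 2 of the sheet; the first column is the case number.
    for line_no, row in enumerate(rows[1:], start=2):
        case_id = row[0]
        if case_id in seen_ids:
            problems.append(f"line {line_no}: duplicate case number {case_id}")
        seen_ids.add(case_id)
        for name, cell in zip(names, row):
            if name in skip_columns or cell == "":
                continue  # exempt column, or blank cell treated as missing
            if not NUMBER_RULE.match(cell):
                problems.append(f"line {line_no}: non-numeric entry {cell!r} in {name}")
    return problems

# Hypothetical sheet with several deliberate violations.
sheet = [
    ["id", "age", "sbp", "pct change"],
    ["1", "54", "132", "22%"],
    ["2", "NA", "<3.6", "5"],
    ["2", "61", "145", "7"],
]
for problem in check_sheet(sheet):
    print(problem)
```

A check like this catches only formatting problems. It does not replace the review by a biostatistician, which also covers the coding scheme and whether the planned analysis is feasible.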
"Research demands involvement. It cannot be delegated very far."