Data Cleaning and Analysis Tips

Statisticians are the last bastion against publishing research based on specious numbers. It's up to us to catch bad data when it's given to us and to make sure we don't create it ourselves when constructing analyzable datasets.
This page is being constructed to collect tips and anecdotes on data cleaning, data merging, and data manipulation. Please contribute to this page freely. Create new sections, e.g. Dan's Tips, or add to any of the existing sections, e.g. Miscellaneous Tips.

See Also / Return To:

Miscellaneous Tips

  1. Before doing anything else, resave the original file under a special name, e.g. filename_date_v00 or filename_date_original. Never touch this file again. I (Rob) once had a random key stroke change a systolic blood pressure of 152 to 112 early in the data manipulation. Without the original file, I never would have been able to recover the actual number.
  2. Track your Ns. When doing data manipulation and analyses, double check your sample sizes after every step. You'll be surprised how often this helps you catch errors.
  3. Create summary stats early and check them as you go for consistency. With regards to the changed systolic BP, I picked up on the error by seeing a difference in the BP mean in the final analysis vs the mean in my original summary stats. It was a small difference, but I knew they should have been exactly the same.
  4. Make histograms, boxplots, and scatterplots. When you see an unusual observation, check it out. Maybe it's legit, but maybe it's not. It can be helpful to talk with your collaborators about the unusual observations (though don't get talked into simply dropping all unusual observations).
  5. Look for logical inconsistencies. This takes creativity and effort, but is one of the most important things we do we checking data. Look for the people who filled their last prescription a week after their death. When dealing with multiple datasets on the same subjects, look for the patients who are classified as white women in dataset A and Hispanic men in set B.
  6. Ask what could be wrong with this dataset and devise a way check it. Like looking for inconsistencies, this takes creativity, but it is where we really earn our paychecks. Shirley and I (Rob) were recently given 72 files which were supposed to cover 5 years of data. We asked the question, "Did we get all the data?" We plotted data entries by date and discovered two big valleys. It turned out two 3 month time periods were incorrectly queried, and we were able to ask for and quickly get the missing data.
  7. Check out variables not directly related to the analysis. You'll sometimes discover relevant errors in your dataset when exploring seemingly irrelevant variables.

Tips for Spreadsheets; Excel, Open Office, etc.

  1. Make sure the dataset does not exceed the columns and rows limit. Excel will truncate the data to make it fit, so you may be throwing away data without realizing it.
  2. Remove any blank columns between the data columns. In at least excel, blank columns mess up sorting the data unless you are very careful to select all the columns first.
  3. Choose your missing character. In most cases, using a real character for missing, e.g. a period or NA, is better than leaving a blank. There are exceptions to this rule, but I (Rob) prefer a blank to be a red flag that something is off, e.g. a cell was accidentally deleted.
  4. Keep a log. If you can create an extra worksheet in the spreadsheet, create one and name it LOG. That way your log file goes wherever the data does.

Tips for Reproducible Research

  1. Code your cleaning. Especially when it comes to data manipulation and the construction of analyzable datasets, it is preferable to write code that does the job, as opposed to doing everything by hand. That way if it ever has to be redone, it can be done with a click of a button. Moreover, if something went wrong it can be discovered in the code. A cut and paste by hand leaves no record. When something has to be done by hand, keep a log. (Note don't force yourself to use code when you're still figuring out what you need to do. If you're experimenting to figure out what you need to do, that is often done better and faster by hand. Once you know what to do, you can go back and code it.)

Topic revision: r10 - 03 Jun 2009, RobertGreevy

This site is powered by FoswikiCopyright © 2013-2022 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Vanderbilt Biostatistics Wiki? Send feedback