Reproducible Research Tutorial

Introduction

Reproducible research (RR) is the practice of conducting and presenting research in such a way that others, and you yourself, can later re-implement your research strategy without ambiguity. In the context of statistical collaboration, this means that you or someone else can easily reproduce all of your actions relating to data management and data analysis, and reach the same result. Since the statistical collaborator's work is mostly done using computer tools, reproducible research means documenting all of the tools and procedures (applications, data storage formats, programs/scripts) that were used.

In surveying the members of the Department of Biostatistics at Vanderbilt University, we found that the most significant barrier to adopting RR practices was the perceived additional cost in time. While the initial cost of RR may be greater than that of some non-RR practices, there is a strong argument that RR actually saves time over the course of a research project, because it streamlines actions that are often repeated. For example, data handling steps may be repeated many times because the data are updated, e.g., because errors are found or new data become available. By implementing an RR framework, the statistical collaborator avoids having to remember and manually repeat the data management and analysis steps every time the data are updated. As another example, peer-reviewed manuscripts often require revision, and revising may be greatly simplified when RR practices are used.

When working with data from collaborators, reproducibility should be considered as early as possible. Suppose that data are received from a collaborator in Microsoft Excel format, but that it's necessary to convert the data to CSV format. A copy of the original Excel file should be kept for posterity, and it should be documented (e.g., in a README file, or in a comment in an R script) that the original data were received in Excel format and were converted to CSV format. How the conversion was accomplished (e.g., using the export feature of Microsoft Excel) should also be recorded.
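
One way to record this provenance is a short R snippet that writes a README file next to the data. The sketch below is only an illustration: the helper write_provenance, the file names, and the README wording are all hypothetical, and placeholder files in a temporary directory stand in for the real Excel and CSV files.

```r
# Hypothetical helper: record where a converted data file came from.
write_provenance <- function(original, converted, method, readme) {
  # MD5 checksums let a later reader confirm the files are unchanged
  sums <- tools::md5sum(c(original, converted))
  lines <- c(
    sprintf("Original file:  %s (md5 %s)", basename(original), sums[[1]]),
    sprintf("Converted file: %s (md5 %s)", basename(converted), sums[[2]]),
    sprintf("Conversion:     %s", method),
    sprintf("Recorded on:    %s", format(Sys.Date()))
  )
  writeLines(lines, readme)
  invisible(lines)
}

# Placeholder files stand in for the real data (the names are made up)
data_dir  <- file.path(tempdir(), "provenance-demo")
dir.create(data_dir, showWarnings = FALSE)
xlsx_file <- file.path(data_dir, "collaborator-data.xlsx")
csv_file  <- file.path(data_dir, "collaborator-data.csv")
writeLines("placeholder for the original Excel file", xlsx_file)
writeLines("id,value", csv_file)

# Write the README describing the conversion
readme_file <- file.path(data_dir, "README.txt")
write_provenance(xlsx_file, csv_file,
                 method = "exported to CSV using the export feature of Microsoft Excel",
                 readme = readme_file)
```

Storing checksums is a design choice rather than a requirement; a plain sentence in the README would also satisfy the documentation step described above.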

There are many ways to implement reproducible research, and many software tools to facilitate it. How RR is implemented is a matter of preference. Note, however, that some strategies may impose fewer barriers to RR (e.g., time, financial, interoperability) than others. Because of widespread adoption, ease of use, and zero financial cost, this tutorial will focus on RR tools related to the R Project. The software company RStudio has developed a free graphical environment for R that has additional features to facilitate RR. The remainder of this tutorial will use R and R code to illustrate some reproducible research techniques. In order to follow along, it will be necessary to have access to R or, preferably, RStudio. There are many on-line documents and tutorials for R and RStudio beginners. I recommend the official introduction to R, titled "An Introduction to R", and the on-line RStudio documentation.

R Scripting

R is a useful tool for RR, mostly because data manipulation and data analysis procedures can be scripted (i.e. recorded as a list of instructions in R code). R scripts are easily re-run, and therefore facilitate the task of reproducing your work. The following scenario exemplifies the use of R scripting in a statistical collaboration setting.

Dr. Bosenberg has a data set that lists the weight (kg) and Skin-to-Epidural Distance (abbreviated SED, mm) for a cohort of patients. Conveniently, these data are available in CSV format, and are attached to this wiki page (Bosenberg1995.csv). Dr. Bosenberg requests a scatterplot and caption for these data. The steps to create the scatterplot in R are:

  1. download and read the data
  2. create a scatterplot
  3. count the number of rows (for use in the caption)

The R code corresponding to these steps is:

# download and read SED data
SEData <- read.csv(paste("http://biostat.mc.vanderbilt.edu/",
                         "wiki/pub/Main/ReproducibleResearchTutorial/",
                         "Bosenberg1995.csv", sep=""))

# create a scatterplot
plot(SEData$WT, SEData$SED, xlab="Weight (kg)",
     ylab="Skin to Epidural Distance (mm)")

# count the number of rows
nrow(SEData)

# caption:
# "Scatterplot of weight versus skin-to-epidural distance (SED) for X subjects."
# where X is substituted with the result of nrow(SEData)

These commands may be copied and pasted into R or RStudio, or saved in an R script file (i.e., a plain text file with a .R extension) and evaluated (executed) as a script. From the R prompt, an R script can be evaluated using the source function. It's not necessary to download the CSV file beforehand. Because the file name is a URL beginning with http://, read.csv downloads the file before reading the CSV contents. In order to send the scatterplot and caption to Dr. Bosenberg, it would be necessary to save the scatterplot image to a file. The image file and caption could then be sent as part of an email.
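
As a minimal sketch of the saving step, the code below draws the scatterplot into a file using the pdf graphics device. The data values, the output file name, and the choice of PDF output are assumptions made so that the example runs without network access; the real script would read Bosenberg1995.csv as shown above.

```r
# Made-up stand-in data with the same column names (WT, SED) as the real
# file; the values are for illustration only.
SEData <- data.frame(WT = c(5, 10, 20, 40), SED = c(8, 10, 14, 26))

# Open a PDF graphics device, draw the plot, then close the device.
# Closing the device is what writes the finished file to disk.
plot_file <- file.path(tempdir(), "sed-scatterplot.pdf")
pdf(plot_file, width = 6, height = 6)
plot(SEData$WT, SEData$SED, xlab = "Weight (kg)",
     ylab = "Skin to Epidural Distance (mm)")
dev.off()

# If these commands were saved in a file named, say, sed-scatterplot.R
# (a hypothetical name), the script could be evaluated from the R prompt with:
# source("sed-scatterplot.R")
```

The png function opens a bitmap device in the same way, with the width and height given in pixels by default.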

Since all of the steps for the task of generating the scatterplot and caption are documented, the procedure is reproducible. Hence, if Dr. Bosenberg later updates the SED data file, a new scatterplot may be generated by re-running the script above. The main drawback to this strategy is that completing the caption requires extra work (manually substituting the value of nrow(SEData) for X). A technique called "report templating" (or "template reporting") was designed to remedy this.
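
Part of this work can be scripted even without templating. The sketch below builds the caption text with sprintf, using a made-up stand-in for SEData. The finished caption would still have to be copied into the email or report by hand, which is the step that report templating removes.

```r
# Made-up stand-in for the real data; only the number of rows matters here
SEData <- data.frame(WT = c(5, 10, 20, 40), SED = c(8, 10, 14, 26))

# Build the caption text with the row count filled in
caption <- sprintf(
  "Scatterplot of weight versus skin-to-epidural distance (SED) for %d subjects.",
  nrow(SEData)
)
cat(caption, "\n")
```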

Report Templating

In the context of statistical collaboration, report templating combines scripted data analysis with narrative reporting. Rather than writing a report after running a data analysis script, both can be done simultaneously, avoiding the costly and error-prone transcription of results. There are many implementations of the report templating strategy. Again, however, this tutorial will use an implementation based on R (or RStudio): the Sweave framework in particular.

Although Sweave has many features, it was originally designed to combine R code and its output with LaTeX markup. Hence, the final result is usually a typeset document containing narrative text combined with R output.

Below is the Sweave file corresponding to the R script above. Most of the file is LaTeX markup. The R code portions, called chunks, begin with <<>>= and end with @; short R expressions may also be placed inline within \Sexpr{}. By using these delimiters, Sweave knows that the enclosed text is R code that should be evaluated. Because the chunk in this example generates a plot and sets the option fig=TRUE, the plot is automatically inserted into the document. Also notice that the value of R objects can be output within the body of a LaTeX paragraph using the \Sexpr syntax, as below within the caption. This way, the number of subjects can be output directly into the caption text, rather than having to insert it manually.

\documentclass{article}

% Use the geometry package to set the margins to 1in
\usepackage[margin=1.0in]{geometry}

% Set the project title and analyst's name
\newcommand{\project}{Bosenberg's SED Data}
\newcommand{\analyst}{M. S. Shotwell, PhD}

\begin{document}

% Sweave options
\SweaveOpts{concordance=TRUE, keep.source=TRUE}

% Tell Sweave to show images at 0.7 times the width of text
\setkeys{Gin}{width=0.7\textwidth}

% Create a title
\begin{center} {\Huge \project} \\\vspace{0.25in} {\large\analyst}\end{center}

\begin{figure}[h!]
\begin{center}
<<fig=TRUE>>=
# download and read SED data
SEData <- read.csv(paste("http://biostat.mc.vanderbilt.edu/",
                         "wiki/pub/Main/ReproducibleResearchTutorial/",
                         "Bosenberg1995.csv", sep=""))
# create a scatterplot
plot(SEData$WT, SEData$SED, xlab="Weight (kg)",
     ylab="Skin to Epidural Distance (mm)")
@
\caption{Scatterplot of weight versus skin-to-epidural distance (SED) for \Sexpr{nrow(SEData)} subjects.}
\end{center}
\end{figure}
\end{document}

An Sweave file usually has the extension .Rnw. From the R prompt, the Sweave function will read the .Rnw file and output a .tex file, which may then be compiled into a PostScript or PDF document. Alternatively, RStudio has a convenient feature in its editor where .Rnw files may be compiled directly to PDF with the click of a button. The RStudio method was used to compile the .Rnw file above to PDF; the result is attached to this page as Bosenberg1995-report.pdf.
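
The R-prompt route can be sketched as follows. To keep the example self-contained, it writes a tiny Sweave file (hypothetical name and contents) to a temporary directory instead of using the Bosenberg file. The final compilation step is left as a comment because it assumes a working LaTeX installation.

```r
# A minimal, hypothetical Sweave file written to a temporary directory
rnw_file <- file.path(tempdir(), "sweave-demo.Rnw")
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "<<>>=",
  "x <- c(1, 2, 3)",
  "@",
  "The mean of x is \\Sexpr{mean(x)}.",
  "\\end{document}"
), rnw_file)

# Sweave writes its .tex output to the current working directory,
# so switch to the temporary directory while it runs
old_wd <- setwd(tempdir())
Sweave(rnw_file)
tex_file <- file.path(getwd(), "sweave-demo.tex")
setwd(old_wd)

# Compiling the .tex file to PDF requires a LaTeX installation:
# tools::texi2pdf(tex_file)
```

In the generated .tex file, the \Sexpr expression has been replaced by its value, so the sentence reads "The mean of x is 2."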

The report templating technique makes the results more reproducible, because it eliminates a source of human error and further automates the reporting task. Again, there are many other techniques for report templating, even within the R framework (e.g., using knitr, or Jeff Horner's brew package).

Assignment

Modify the Bosenberg Sweave file as follows, and regenerate the PDF report. Ensure that the document remains reproducible.
  1. Change the analyst's name to your name.
  2. Use linear regression (i.e., the lm function) to regress SED on weight.
  3. Add the fitted regression line to the scatterplot.
  4. Add a description of the regression line to the figure caption, including the estimated intercept and slope.
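
As a hint, the R calls involved in steps 2 to 4 can be sketched on made-up data. The values below are invented for illustration and lie exactly on a line. The assignment itself should use the real SEData inside the Sweave file, and the sketch is only one possible approach.

```r
# Made-up stand-in data with the same column names as the real file
SEData <- data.frame(WT = c(5, 10, 20, 40), SED = c(7.5, 10, 15, 25))

# Regress SED on weight
fit <- lm(SED ~ WT, data = SEData)

# Draw the scatterplot and add the fitted line. A temporary PDF device is
# opened here only so the sketch runs anywhere; in a Sweave chunk with
# fig=TRUE, Sweave manages the graphics device.
pdf(file.path(tempdir(), "regression-demo.pdf"))
plot(SEData$WT, SEData$SED, xlab = "Weight (kg)",
     ylab = "Skin to Epidural Distance (mm)")
abline(fit)
dev.off()

# Estimated intercept and slope, e.g., for use in \Sexpr{} within the caption
round(coef(fit), 2)
```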

Additional Resources

Topic attachments
Attachment                          Size      Date                  Who           Comment
Bosenberg1995-report.pdf            66.3 K    19 Aug 2013 - 08:56   MattShotwell  Bosenberg's SED report
Bosenberg1995-scatterplot.png       6.1 K     19 Aug 2013 - 08:57   MattShotwell  Bosenberg's SED data scatterplot
Bosenberg1995.csv                   2.8 K     26 Aug 2012 - 08:32   MattShotwell  Bosenberg's SED data
HarrellScottTutorial-useR2012.pdf   2438.4 K  26 Aug 2012 - 08:04   MattShotwell  Handouts for Frank Harrell and Terri Scott's useR! 2012 tutorial on Reproducible Research
Topic revision: r8 - 19 Jul 2017, FrankHarrell
 

Copyright © 2013-2020 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.