Department of Biostatistics Seminar/Workshop Series

Sweave for Reproducible Research and Beautiful Statistical Reports

Frank E. Harrell, Jr., PhD

Professor of Biostatistics, Department Chair, Department of Biostatistics, Vanderbilt University School of Medicine

Wednesday, May 12, 1:30-2:30pm, MRBIII Conference Room 1220

Much research that relies on data analysis is not easily reproducible. The reasons vary: tweaking of instrumentation, the use of poorly studied high-dimensional feature selection algorithms, programming errors, inadequate documentation of what was done, too much copying and pasting of results into manuscripts, and the use of spreadsheets and other interactive data manipulation and analysis tools that do not provide a usable audit trail of how results were obtained. Even when a research journal allows the authors the "luxury" of space to describe their methods, such text can never be specific enough for readers to reproduce exactly what was done. All too often, the authors themselves cannot reproduce their own results. Being able to regenerate an entire report or manuscript by issuing a single operating system command whenever any element of the data changes, the statistical computing system is updated, graphics engines are improved, or the approach to analysis is improved is also a major time saver.

It has been said that the analysis code provides the ultimate documentation of the "what, when, and how" for data analyses. Eminent computer scientist Donald Knuth invented literate programming in 1984 to give programmers the ability to mix code with documentation in the same file, with "pretty printing" customized to each. Lamport's LaTeX, an offshoot of Knuth's TeX typesetting system, became a prime tool for printing beautiful program documentation and manuals. When Friedrich Leisch developed Sweave in 2002, Knuth's literate programming model exploded onto the statistical computing scene with a highly functional and easy-to-use coding standard based on R and LaTeX, for which the Emacs text editor provides special dual editing modes through ESS. This approach has now been extended to other computing systems and to word processors. Using R with LaTeX to construct reproducible statistical reports remains the most flexible approach and yields the most beautiful reports, while using only free software. One advantage of this platform is that there are many high-level R functions for producing LaTeX markup code directly, and the output of these functions is easily directed to the LaTeX output stream created by Sweave.

This tutorial covers the basics of Sweave and shows how to enhance the default output in various ways by using latex methods for converting R objects to LaTeX markup, your own floating figure environments, and the LaTeX listings package to pretty-print R code and its output.

These methods apply to everyday statistical reports and to the production of "live" journal articles and books.
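
As a rough illustration of the basics mentioned above, a minimal Sweave source file might look like the sketch below. It is not taken from the tutorial itself: the file name report.Rnw, the chunk labels, and the data are made up for illustration, and the small table is written by hand with cat() rather than by one of the high-level latex methods the tutorial covers.

```latex
% report.Rnw: hypothetical file name for a minimal Sweave source
\documentclass{article}
\begin{document}

\section*{A tiny reproducible report}

% A default code chunk: Sweave echoes the R code and its printed output.
<<summary-stats>>=
# Made-up data, for illustration only
x <- c(2.1, 3.4, 1.9, 4.2)
summary(x)
@

% results=tex sends whatever the chunk prints straight into the LaTeX
% output stream, so the chunk can emit LaTeX markup itself.
<<tiny-table, echo=FALSE, results=tex>>=
cat("\\begin{tabular}{lr}\n")
cat("Statistic & Value \\\\ \\hline\n")
cat("Mean &", format(mean(x), digits = 3), "\\\\\n")
cat("SD &", format(sd(x), digits = 3), "\\\\\n")
cat("\\end{tabular}\n")
@

% \Sexpr{} inserts the value of an R expression into running text.
The sample contains \Sexpr{length(x)} observations.

\end{document}
```

Assuming the file is saved as report.Rnw, running `R CMD Sweave report.Rnw` should produce report.tex, which can then be compiled with LaTeX. Because the numbers in the table and the text are computed from x each time, changing the data and rerunning regenerates the whole report with no copy and paste.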
Topic revision: r2 - 26 Apr 2013, JohnBock
