S for Statistical Data Analysis and Graphics
Frank E. Harrell, Jr. (f.harrell@vanderbilt.edu)
Professor of Biostatistics and Statistics
Chair, Department of Biostatistics
Vanderbilt University School of Medicine
Course Web Page: StatCompCourse

Participants will learn to use the high-level object-oriented S language to
  1. recode, manipulate, and reshape data
  2. annotate & document data elements
  3. inspect and manipulate data
  4. compute and display summary statistics
  5. make somewhat complex tables
  6. use Trellis and other S graphics to display information in multiple variables or multiple summary statistics
  7. use the Hmisc library to extend S graphics by automatically labeling curves, drawing legends, drawing error bars and bands, and plotting aggregate data
  8. compute sample size and power
  9. be introduced to missing value imputation, simulation, and bootstrapping using S
  10. conduct reproducible analysis

The course is intended for applied statisticians and data analysts who already know the basics of S. It will be of particular interest to:
  • applied statisticians who want to incorporate new statistical methodology and graphical methods into their everyday work
  • data analysts who want to increase productivity in analyzing data and generating statistical graphics
  • data analysts who want to learn new ways to prepare graphical and tabular reports
  • research statisticians who want to acquire new tools for conducting and presenting methodologic research
  • SAS users who want to learn S or to enhance their S skills
  • LaTeX users who want to interface LaTeX to S
  • statisticians responsible for producing data and safety monitoring and other clinical reports for pharmaceutical and medical device studies who seek alternatives to Microsoft Word

Statistical Software used during the Course
Operating System: Microsoft Windows or Linux
Add-on S library: Hmisc (built-in to S; may be obtained from Hmisc for use with R or S-Plus). For R, Hmisc may easily be downloaded and installed from CRAN.

Course Overview
This short course begins with a quick summary of the S language and objects and a comparison of S and SAS. This is followed by methods for inspecting, recoding, reshaping, and manipulating data, some of which use extensions of S found in the freely available Hmisc package written by the instructor. Usage of S for more advanced statistical methods such as simulation and bootstrapping is surveyed. Then basic sample size and power calculations are discussed, followed by an example using the Hmisc library to simulate power for the Cox-logrank test for comparing event times in a clinical trial in which drop-in, drop-out, and delayed treatment response are present. Then major emphasis is given to table making and to S and Hmisc graphical functions useful in everyday exploratory work and in presenting final summary statistics. An overview of principles of constructing good graphics will be given. The course will cover how data may be aggregated in a variety of ways, including many suitable for use with Trellis graphics, and how Trellis may be extended to produce more flexible graphics that, for example, display confidence bands and error bars, automatically label curves, and produce keys. The course concludes with a discussion of some tools for making analyses reproducible, an important facet of regulatory and other scientific review of final results. Advantages of using S in conjunction with the LaTeX document processing / typesetting system for constructing reproducible statistical reports and hyperlinked PDF documents are discussed.

The short course is based on the premises that (1) graphical presentation of information is fundamental to understanding data and presenting research findings and (2) modern computing tools are required for extracting information and drawing inferences from data and for constructing the best graphical displays. Such computing tools allow researchers to generate highly informative graphics in a flexible fashion with a minimum of programming. These tools also allow researchers having a grasp of statistical concepts to use modern computationally-intensive techniques to draw inferences without making as many strict assumptions (e.g., about data distributions) as in the past.

The course will use both lecture and interactive laboratory formats. Lectures will be used to teach students about the S language and the add-on S Hmisc library written by the instructor. During interactive labs students will learn how to use S to manipulate, inspect, analyze, and graph raw and summary data, and to report statistical summaries. The S/LaTeX interface written by Richard Heiberger and Frank Harrell is briefly introduced.

Prerequisites: at least several months' experience using S, including good facility with the S command line, and experience in statistical data analysis.

  1. S Statistical Computing and Graphics Language (Chapter 1 of Alzola & Harrell)
    • Why S?
    • Starting S on Windows
    • Commands vs. GUIs
    • Basic commands
    • Entering and saving commands
    • Differences between S and SAS
    • A quick tour of useful system tools (see also EmacsLaTeXTools)
  2. Objects, Getting Help, Functions, Attributes, and Libraries (Chapter 2 of A&H)
    • Objects
    • Getting help
    • Functions
    • Vectors
    • Matrices, lists, data frames
    • Attributes
    • Add-on function library: Hmisc
  3. Data in S (Chapter 3)
    • Importing data
    • Adjustments to variables after input
    • Cleaning up imported data (3.1)
    • Inspecting data (descriptive statistics, checking quality, 3.2.3-3.4)
  4. Operating in S (Chapter 4)
    • Reading and writing data frames and variables
    • Functions for manipulating and summarizing data (4.2.2-4.2.8, 11.4.3-11.3.3)
    • Missing value imputation (4.5)
    • Creating derived variables
  5. Review of Data Frame Creation, Annotation, and Analysis
    • Importing external data
    • Making global changes to a data frame
    • Changing variables within a data frame
    • Analyses of the entire data frame
    • Analysis of individual variables
  6. Brief Introduction to Missing Value Imputation, Simulation & Bootstrapping (4.5-4.6)
    • Simple imputation and bookkeeping of imputed values
    • Simulation example
    • Basic bootstrap example
  7. Probability and Statistical Functions (Chapter 5)
    • Statistical summaries
    • Probability distributions
    • Statistical tests
    • Basic power/sample size calculations; one example of using simulation to compute power in a complex situation
    • Computing confidence limits using commands & menus
  8. Making Tables (Chapter 6 of A&H)
  9. Some Principles of Graph Construction (A & H 10)
  10. Some Useful Traditional Graphics and the S-Plus GUI
    • Introduction to the GUI (see Pam Goodman handout at gui.graphics.tricks.pdf)
    • Histograms and density plots (hist.data.frame, histSpike)
    • Cumulative distribution plots (Hmisc ecdf function, 11.3)
    • Box plots
  11. Using S Functions for Graphing Data (Chapters 11-12 in A & H)
    • Basic plotting commands
    • Graphical snapshot of a dataset: Hmisc datadensity
    • Graphical summary of missing data patterns: Hmisc naclus
    • Graphical summary of interrelationships of variables: Hmisc varclus
    • Basic scatterplots and rug plots; scatterplot matrices
    • Nonparametric regression fits (trends without linearity assumptions) using Hmisc plsmo
    • Turning tables into plots (review of summary.formula)
    • Trellis graphics (11.4)
      • bwplot, densityplot, dotplot, histogram, splom, density, stripplot, xyplot functions
      • Hmisc panel.bpplot Trellis panel function for extended box plots
      • Hmisc xYplot and Dotplot functions for error bars and bands (11.4.1)
    • Plotting summary statistics using Hmisc summarize and xYplot and Dotplot functions and other Trellis graphics functions (11.4.3)
  12. Tools for Reproducible Analyses and Reports (Chapter 13 in A&H)
    • Batch processing and managing analyses: The Hmisc do function (13.2)
    • Reproducible analysis using Makefiles (13.3)
    • Reproducible reports (see sintro.pdf)
      • A 15-minute course in LaTeX
      • S/LaTeX interface and summary.formula's latex function
      • Producing HTML and hyperlinked PDF documents

Alzola CF, Harrell FE: An Introduction to S and to the Hmisc and Design Libraries, 2002. sintro.pdf.
Harrell FE: Statistical Tables and Plots using S and LaTeX, 1999. summary.pdf.

References for Statistical Graphics
Cleveland WS: The Elements of Graphing Data, Summit NJ: Hobart Press, 1994.
Tufte ER: The Visual Display of Quantitative Information, Cheshire CT: Graphics Press, 1983.
Tufte ER: Envisioning Information, Cheshire CT: Graphics Press, 1990.
Tufte ER: Visual Explanations, Cheshire CT: Graphics Press, 1997.
Cleveland WS: Visualizing Data, Summit NJ: Hobart Press, 1993.
S Manuals. (Also see General Bibliography in Alzola & Harrell)
Wilkinson L: The Grammar of Graphics, New York: Springer, 1999.
Wallgren A, Wallgren B, Persson R, Jorner U, Haaland J: Graphing Statistics & Data, Thousand Oaks: SAGE Publications, 1996.

References for LaTeX
Lamport L: LaTeX: A Document Preparation System, Reading MA: Addison-Wesley, 2nd edition, 1994.
Oetiker T, Partl H, Hyna I, Schlegl E: The Not So Short Introduction to LaTeX2e, 2005.

References for S and Statistical Methods
Venables WN, Ripley BD: Modern Applied Statistics with S, New York: Springer-Verlag, 4th edition, 2002.


Hardcopy Handouts and Books Supplied to Participants
  1. sCompGraph.pdf
  2. sintro.pdf
  3. Printout of this web page
  4. Printout of web page DataSets
  5. gui.graphics.tricks.pdf
  6. ShortCourse.hw.pdf
  7. Copy of solutions to the above lab assignments
  8. Copy of Cleveland's Elements of Graphing Data

-- ColeBeck - 25 Aug 2005
Topic revision: r4 - 29 Aug 2005, FrankHarrell
