S-Plus for Statistical Data Analysis and Graphics
Frank E. Harrell, Jr. (f.harrell@vanderbilt.edu)
Professor of Biostatistics and Statistics
Chair, Department of Biostatistics
Vanderbilt University School of Medicine
Course Web Page: http://hesweb1.med.virginia.edu/biostat/teaching/statcomp

Objectives
To be able to use the high-level object-oriented S-Plus language to
  1. recode, manipulate, and reshape data
  2. annotate & document data elements
  3. inspect and manipulate data
  4. compute and display summary statistics
  5. make somewhat complex tables
  6. use Trellis and other S-Plus graphics to display information in multiple variables or multiple summary statistics
  7. use the Hmisc library to extend S-Plus graphics by automatically labeling curves, drawing legends, drawing error bars and bands, and plotting aggregate data
  8. compute sample size and power
  9. be introduced to missing value imputation, simulation, and bootstrapping using S-Plus
  10. conduct reproducible analysis

Audience
Applied statisticians and data analysts who already know the basics of S-Plus. The course will be of particular interest to:
  • applied statisticians who want to incorporate new statistical methodology and graphical methods into their everyday work
  • data analysts who want to increase productivity in analyzing data and generating statistical graphics
  • data analysts who want to learn new ways to prepare graphical and tabular reports
  • research statisticians who want to acquire new tools for conducting and presenting methodologic research
  • SAS users who want to learn S-Plus or to enhance their S-Plus skills
  • LaTeX users who want to interface LaTeX to S-Plus
  • statisticians responsible for producing data and safety monitoring and other clinical reports for pharmaceutical and medical device studies who seek alternatives to Microsoft Word

Statistical Software used during the Course
S-Plus 6
Operating System: Microsoft Windows 95/98/NT/2000/XP
Add-on S-Plus library: Hmisc (built into S-Plus; may be obtained from Hmisc for use with S-Plus 3.4 on Unix, S-Plus 2000 on Windows/NT, and S-Plus 6 under Linux and Unix).

Course Overview
This short course begins with a quick summary of the S-Plus language and objects and a comparison of S-Plus and SAS. This is followed by methods for inspecting, recoding, reshaping, and manipulating data, some of which use extensions of S-Plus found in the freely available Hmisc library written by the instructor. The use of S-Plus for more advanced statistical methods such as simulation and bootstrapping is surveyed. Basic sample size and power calculations are then discussed, followed by an example using the Hmisc library to simulate power for the Cox-logrank test for comparing event times in a clinical trial in which drop-in, drop-out, and delayed treatment response are present. Major emphasis is then given to table making and to S-Plus and Hmisc graphical functions useful in everyday exploratory work and in presenting final summary statistics. An overview of principles for constructing good graphics will be given. The course will cover how data may be aggregated in a variety of ways, including many suitable for use with Trellis graphics, and how Trellis may be extended to produce more flexible graphics that, for example, display confidence bands and error bars, automatically label curves, and produce keys. The course concludes with a discussion of some tools for making analyses reproducible, an important facet of regulatory and other scientific review of final results. Advantages of using S-Plus in conjunction with the LaTeX document processing and typesetting system for constructing reproducible statistical reports and hyperlinked PDF documents are discussed.
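
The idea of estimating power by simulation, mentioned above, can be sketched in a few lines. The example below is only an illustration and is not the course's Hmisc-based Cox-logrank example: it swaps in a much simpler setting (a two-sample comparison of means using a large-sample z test, with no drop-in, drop-out, or delayed response), and it is written in Python rather than S-Plus. The function name `simulate_power` and all parameter values are hypothetical.

```python
import random
from statistics import NormalDist, mean, stdev


def simulate_power(n_per_group, effect, sd=1.0, alpha=0.05, n_sim=1000, seed=1):
    """Estimate power as the fraction of simulated trials that reject H0.

    Assumption for illustration: normally distributed outcomes and a
    two-sided large-sample z test on the difference in means.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    rejections = 0
    for _ in range(n_sim):
        # Simulate one trial: control group and treated group
        control = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        treated = [rng.gauss(effect, sd) for _ in range(n_per_group)]
        # Standard error of the difference in means
        se = (stdev(control) ** 2 / n_per_group
              + stdev(treated) ** 2 / n_per_group) ** 0.5
        z = (mean(treated) - mean(control)) / se
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_sim


# Hypothetical design: 50 subjects per group
power_null = simulate_power(n_per_group=50, effect=0.0)  # no true difference
power_alt = simulate_power(n_per_group=50, effect=0.8)   # true difference of 0.8 SD
print("Rejection rate under no effect:", power_null)
print("Estimated power at effect 0.8:", power_alt)
```

With no true difference the rejection rate should land near the nominal 0.05 level, which serves as a check on the simulation itself.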

The short course is based on the premises that (1) graphical presentation of information is fundamental to understanding data and presenting research findings and (2) modern computing tools are required for extracting information and drawing inferences from data and for constructing the best graphical displays. Such computing tools allow researchers to generate highly informative graphics in a flexible fashion with a minimum of programming. These tools also allow researchers having a grasp of statistical concepts to use modern computationally-intensive techniques to draw inferences without making as many strict assumptions (e.g., about data distributions) as in the past.
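
As a small illustration of drawing inferences with fewer distributional assumptions, the sketch below computes a percentile bootstrap confidence interval for a median. It is written in Python rather than S-Plus, the data are made up, and `bootstrap_ci` is a hypothetical helper, not a function from S-Plus or Hmisc.

```python
import random
import statistics


def bootstrap_ci(values, stat=statistics.median, n_boot=2000, alpha=0.05, seed=1):
    """Percentile bootstrap confidence interval for an arbitrary statistic."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and recompute the statistic each time
    estimates = sorted(stat(rng.choices(values, k=n)) for _ in range(n_boot))
    # Approximate positions of the lower and upper percentiles
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Made-up, right-skewed measurements (illustrative only)
data = [2.1, 3.4, 1.9, 5.6, 4.4, 3.8, 2.7, 9.5, 3.1, 4.0]
lo, hi = bootstrap_ci(data)
print("Sample median:", statistics.median(data))
print("Approximate 95% bootstrap interval:", (lo, hi))
```

No normality assumption is needed here; the only inputs are the observed data and the statistic of interest.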

The course will use both lecture and interactive laboratory formats. Lectures will teach students about the S-Plus language and the add-on S-Plus Hmisc library written by the instructor. During interactive labs students will learn how to use S-Plus to manipulate, inspect, analyze, and graph raw and summary data and to report statistical summaries. The S-Plus/LaTeX interface written by Richard Heiberger and Frank Harrell is briefly introduced.

Prerequisites
At least several months' experience using S-Plus, including good facility with the S-Plus command line, and experience in statistical data analysis.

Outline
  • S-Plus Statistical Computing and Graphics Language (Chapter 1 of Alzola & Harrell)
    1. Why S-Plus?
    2. Starting S-Plus on Windows
    3. Commands vs. GUIs
    4. Brief introduction to S-Plus GUI
    5. Basic commands
    6. Entering and saving commands
    7. Differences between S-Plus and SAS
    8. A quick tour of useful system tools (see also http://hesweb1.med.virginia.edu/biostat/s/EmacsTeX/)
  • Objects, Getting Help, Functions, Attributes, and Libraries (Chapter 2 of A&H)
    1. Objects
    2. Getting help
    3. Functions
    4. Vectors
    5. Matrices, lists, data frames
    6. Attributes
    7. Add-on function library: Hmisc
  • Data in S-Plus (Chapter 3)
    1. Importing data
    2. Adjustments to variables after input
    3. Cleanup imported data (3.1)
    4. Inspecting data (descriptive statistics, checking quality, 3.2.3-3.4)
  • Operating in S-Plus (Chapter 4)
    1. Reading and writing data frames and variables
    2. Functions for manipulating and summarizing data (4.2.2-4.2.8, 11.4.3-11.3.3)
    3. Missing value imputation (4.5)
    4. Creating derived variables
  • Review of Data Frame Creation, Annotation, and Analysis
    1. Importing external data
    2. Making global changes to a data frame
    3. Changing variables within a data frame
    4. Analyses of the entire data frame
    5. Analysis of individual variables
  • Brief Introduction to Missing Value Imputation, Simulation & Bootstrapping (4.5-4.6)
    1. Simple imputation and bookkeeping of imputed values
    2. Simulation example
    3. Basic bootstrap example
  • Probability and Statistical Functions (Chapter 5)
    1. Statistical summaries
    2. Probability distributions
    3. Statistical tests
    4. Basic power/sample size calculations; one example of using simulation to compute power in a complex situation
    5. Computing confidence limits using commands & menus
  • Making Tables (Chapter 6 of A&H)
  • Some Principles of Graph Construction (A & H 10)
  • Some Useful Traditional Graphics and the S-Plus GUI
    1. Introduction to the GUI (see Pam Goodman handout at http://hesweb1.med.virginia.edu/biostat/s/doc/gui.graphics.tricks.pdf)
    2. Histograms and density plots (hist.data.frame, histSpike)
    3. Cumulative distribution plots (Hmisc ecdf function, 11.3)
    4. Box plots
  • Using S-Plus Functions for Graphing Data (Chapters 11-12 in A & H)
    1. Basic plotting commands
    2. Graphical snapshot of a dataset: Hmisc datadensity
    3. Graphical summary of missing data patterns: Hmisc naclus
    4. Graphical summary of interrelationships of variables: Hmisc varclus
    5. Basic scatterplots and rug plots; scatterplot matrices
    6. Nonparametric regression fits (trends without linearity assumptions) using Hmisc plsmo
    7. Turning tables into plots (review of summary.formula)
    8. Trellis graphics (11.4)
      1. bwplot, densityplot, dotplot, histogram, splom, density, stripplot, xyplot functions
      2. Hmisc panel.bpplot Trellis panel function for extended box plots
      3. Hmisc xYplot and Dotplot functions for error bars and bands (11.4.1)
      4. Plotting summary statistics using Hmisc summarize and xYplot and Dotplot functions and other Trellis graphics functions (11.4.3)
  • Tools for Reproducible Analyses and Reports (Chapter 13 in A&H)
    1. Batch processing and managing analyses: The Hmisc do function (13.2)
    2. Reproducible analysis using Makefiles (13.3)
    3. Reproducible reports (see http://hesweb1.med.virginia.edu/biostat/s/doc/splus.pdf)
      1. A 15-minute course in LaTeX
      2. S-Plus/LaTeX interface and summary.formula's latex function
      3. Producing HTML and hyperlinked PDF documents

Texts
Alzola CF, Harrell FE: An Introduction to S-Plus and to the Hmisc and Design Libraries, 2002. http://hesweb1.med.virginia.edu/biostat/s/doc/splus.pdf
Harrell FE: Statistical Tables and Plots using S-Plus and LaTeX, 1999. http://hesweb1.med.virginia.edu/biostat/s/doc/summary.pdf

References for Statistical Graphics
Cleveland W: The Elements of Graphing Data, Summit NJ: Hobart Press, 1994.
Tufte ER: The Visual Display of Quantitative Information, Cheshire CT: Graphics Press, 1983.
Tufte ER: Envisioning Information, Cheshire CT: Graphics Press, 1990.
Tufte ER: Visual Explanations, Cheshire CT: Graphics Press, 1997.
Cleveland WS: Visualizing Data, Summit NJ: Hobart Press, 1993.
S-Plus Manuals. (Also see General Bibliography in Alzola & Harrell)
Wilkinson L: The Grammar of Graphics, New York: Springer, 1999.
Wallgren A, Wallgren B, Persson R, Jorner U, Haaland J: Graphing Statistics & Data, Thousand Oaks: SAGE Publications, 1996.

References for LaTeX
Lamport L: LaTeX: A Document Preparation System, Reading MA: Addison-Wesley, 2nd edition, 1994.
Oetiker T, Partl H, Hyna I, Schlegl E: The Not So Short Introduction to LaTeX2e, 2000. http://ctan.tug.org/tex-archive/info/lshort/english/lshort.pdf

References for S-Plus and Statistical Methods
Venables WN, Ripley BD: Modern Applied Statistics with S-Plus, New York: Springer-Verlag, 3rd edition, 1999.

Datasets
http://hesweb1.med.virginia.edu/biostat/teaching/statcomp

Hardcopy Handouts and Books Supplied to Participants:
  1. http://hesweb1.med.virginia.edu/biostat/teaching/statcomp/notes.pdf
  2. http://hesweb1.med.virginia.edu/biostat/s/doc/splus.pdf
  3. Printout of this web page
  4. Printout of web page http://hesweb1.med.virginia.edu/biostat/s/data
  5. http://hesweb1.med.virginia.edu/biostat/s/doc/gui.graphics.tricks.pdf
  6. http://hesweb1.med.virginia.edu/biostat/teaching/statcomp/ShortCourse.hw.pdf
  7. Copy of solutions to the above lab assignments
  8. Copy of Cleveland's Elements of Graphing Data

-- ColeBeck - 25 Aug 2005
