Statistical Graphics for Exploring Data, Presenting Information, and Understanding Statistical Models | Scripts

Graphical methods are being increasingly used for exploratory data analysis. Some of the many graphical tools that are useful in this setting are scatterplot matrices, nonparametric smoothers, and tree diagrams. Statistical graphics for presenting information have been used much longer, but most of the commonly used graphics used in papers, presentations, and the popular media, such as bar charts and pie charts, are either poor or misleading in communicating information to the reader. This short course begins with a series of graphical horror stories from the scientific and lay press. Then elements of graphical perception and good graph construction, many from the writings of Bill Cleveland, are covered. Practical suggestions for choosing the best chart or graph type, making good and clear graphics, and formatting are covered. Techniques for simultaneous presentation of multiple variables are described.

Complex outcome or risk adjustment models are not easily grasped by non-statisticians. Special graphics such as effect charts and nomograms can assist physicians and other consumers of statistical analysis in understanding statistical models and in using them for obtaining predictions for individual subjects. Examples of model presentation graphics will be given.

At the close of the short course some graphical marvels from the literature (especially from Edward Tufte and Howard Wainer) are presented.

S script developed in the S-Plus International User Conference in Philadelphia, 16Oct01

S-PLUS : Copyright (c) 1988, 2000 MathSoft, Inc.
S : Copyright Lucent Technologies, Inc.
Version 6.0 Release 1 for Linux 2.2.12 : 2000 
Working data will be in .Data 
Hmisc library by Frank E Harrell Jr, Tue Jul 17 19:34:31 EDT 2001 

This library is provided by FE Harrell <fharrell@virginia.edu>.
It is not supported by Insightful Corp.
Type library(Hmisc, help=T) to view the readme file for the Hmisc library.
Type help(library='Hmisc') to see detailed documentation.

Hmisc redefines [.factor to drop unused levels of factor variables
when subscripting. To prevent this behaviour, issue the command
options(drop.unused.levels=F). You must use
options(drop.unused.levels=F) for multicomp to calculate correctly.

Design library by Frank E Harrell Jr, Tue Jul 17 19:34:50 EDT 2001 

This library is provided by FE Harrell <fharrell@virginia.edu>.
It is not supported by Insightful Corp.
Type library(Design, help=T) to view the readme file for the Design library.
 Type help(library='Design') to see detailed documentation.
options(contrasts=c('contr.treatment','contr.poly')) set

> ls()
 [1] ".First"        ".Last.fixed"   ".Last.value"   ".Random.seed" 
 [5] ".orig.cal"     "ABM"           "FEV"           "birth.estriol"
 [9] "boston"        "cal"           "counties"      "dd"           
[13] "diabetes"      "f"             "g"             "hospital"     
[17] "last.dump"     "last.warning"  "lead"          "nac"          
[21] "olympics.1996" "pbc"           "r"             "s"            
[25] "support"       "titanic.trans" "titanic3"      "v"            
[29] "w"             "z"            
> datadensity(diabetes,group=diabetes$gender)
> datadensity(diabetes[diabetes$gender=='female',])
> describe(diabetes$gender)
diabetes$gender 
   n missing unique 
 403       0      2

male (169, 42%), female (234, 58%) 
> nac <- naclus(diabetes)
> plot(nac)
> w <- latex(describe(pbc))
> describe(pbc)
pbc 

 19  Variables      418  Observations
---------------------------------------------------------------------------
bili : Serum Bilirubin (mg/dl) 
   n missing unique  Mean   .05   .10   .25   .50   .75   .90   .95 
 418 0       98     3.221  0.50  0.60  0.80  1.40  3.40  8.03 14.00

lowest :  0.3  0.4  0.5  0.6  0.7, highest: 21.6 22.5 24.5 25.5 28.0 
---------------------------------------------------------------------------
....
  Conflicting definitions of "[.data.frame" on databases "Hmisc" and  "splus" 
   in: x[ - yvar]
> par(mfrow=c(1,1))
> plot(f);text(f)
> describe(titanic3)
titanic3 

 14  Variables      1309  Observations
---------------------------------------------------------------------------
pclass : Passenger Class 
    n missing unique 
 1309       0      3

1st (323, 25%), 2nd (277, 21%), 3rd (709, 54%) 
---------------------------------------------------------------------------
. . .
> datadensity(titanic3)
> f <- rpart(is.na(age) ~ pclass + sex + survived,data=titanic3)
> plot(f);text(f)
> v <- varclus(~., data=diabetes)
> plot(v)
> attach(diabetes)
> smean.cl.normal(glyhb)
    Mean    Lower    Upper 
 5.58977 5.366506 5.813035
> smean.cl.boot(glyhb)
     Mean    Lower    Upper 
 5.589769 5.356792 5.810075
> smean.cl.boot(glyhb)
     Mean    Lower    Upper 
 5.589769 5.373114 5.817752
> set.seed(17)
> smean.cl.boot(glyhb)
     Mean    Lower    Upper 
 5.589769 5.380151 5.812173
> set.seed(17)
> smean.cl.boot(glyhb)
     Mean    Lower    Upper 
 5.589769 5.380151 5.812173
> tapply(glyhb, gender, mean)
 male female 
   NA     NA
> tapply(glyhb, gender, mean, na.rmT)
Problem: Object "na.rmT" not found 
Use traceback() to see the call stack
> tapply(glyhb, gender, mean, na.rm=T)
     male   female 
 5.724074 5.494342
> tapply(glyhb, gender, quantile, na.rm=T)
$male:
   0%    25%  50%   75%  100% 
 2.68 4.3825  4.9 5.695 16.11

$female:
   0%    25%   50%    75%  100% 
 2.73 4.3775 4.785 5.5525 14.94

> tapply(glyhb, gender, smean.cl.boot)
$male:
     Mean   Lower   Upper 
 5.724074 5.37528 6.09749

$female:
     Mean    Lower    Upper 
 5.494342 5.250717 5.757164

> summarize(glyhb, gender, na.rm=T)
Problem in summarize(glyhb, gender, na.rm = T): Argument "" is missing, with no default 
Use traceback() to see the call stack
> summarize(glyhb, gender, mean, na.rm=T)
  gender    glyhb 
1 female 5.494342
2   male 5.724074
> summarize(glyhb, gender, smean.sd)
  gender    glyhb       SD 
1 female 5.494344 2.133675
2   male 5.724074 2.387779
> summarize(glyhb, gender, smean.cl.boot)
  gender    glyhb    Lower    Upper 
1 female 5.494342 5.221705 5.790896
2   male 5.724074 5.391357 6.103901
> summarize(glyhb, llist(gender,frame), smean.cl.boot)
  gender  frame    glyhb    Lower    Upper 
1 female     NA 5.004286 4.247143 6.169999
2 female  large 6.407250 5.734312 7.089687
3 female medium 5.390177 5.031194 5.801590
4 female  small 5.180882 4.735415 5.689444
5   male     NA 5.595000 4.455000 7.067500
6   male  large 5.901186 5.428293 6.471082
7   male medium 6.075077 5.368205 6.812477
8   male  small 4.760882 4.389346 5.217780
> summarize(glyhb, llist(gender,cut2(age,g=3)), smean.cl.boot)
  gender cut2(age, g = 3)    glyhb    Lower    Upper 
1 female          [19,39) 4.690118 4.411160 5.032921
2 female          [39,55) 5.389333 4.934533 5.839916
3 female          [55,92] 6.615441 6.055290 7.182089
4   male          [19,39) 4.859636 4.609996 5.160201
5   male          [39,55) 5.486383 4.884255 6.256220
6   male          [55,92] 6.702667 6.074722 7.425051
> summarize(glyhb, llist(Sex=gender,Age=cut2(age,g=2)), smean.cl.boot)
     Sex     Age    glyhb    Lower    Upper 
1 female [19,46) 4.797031 4.572688 5.040264
2 female [46,92] 6.386900 5.920603 6.875483
3   male [19,46) 4.812667 4.570495 5.102008
4   male [46,92] 6.509770 5.939091 7.161430
> s <- summary(survived ~ age + sex + pclass,data=titanic3)
> s
Survived    N=1309

---------------+-------------+----+---------+
               |             |N   |survived |
---------------+-------------+----+---------+
Age            |[ 0.167,22.0)| 290|0.4310345|
               |[22.000,28.5)| 246|0.3861789|
               |[28.500,40.0)| 265|0.4188679|
               |[40.000,80.0]| 245|0.3918367|
               |Missing      | 263|0.2775665|
---------------+-------------+----+---------+
Sex            |female       | 466|0.7274678|
               |male         | 843|0.1909846|
---------------+-------------+----+---------+
Passenger Class|1st          | 323|0.6191950|
               |2nd          | 277|0.4296029|
               |3rd          | 709|0.2552891|
---------------+-------------+----+---------+
Overall        |             |1309|0.3819710|
---------------+-------------+----+---------+
> plot(s)
> search()
 [1] ".Data"    "diabetes" "Design"   "Hmisc"    "splus"    "stat"    
 [7] "data"     "trellis"  "nlme3"    "rpart"    "main"    
> detach(2)
> attach(titanic3)
> plsmo(age,survived,group=pclass,datadensity=T)
> plsmo(age,survived,group=interaction(sex,pclass),col=1:6,datadensity=T)
> 
> detach(2)
> bwplot(cut2(age,g=4)~glyhb|frame*gender,data=diabetes)
> bwplot(cut2(age,g=4)~glyhb|frame*gender,data=diabetes,panel=panel.bpplot)
> args(xYplot)
function(formula, data = sys.parent(1), groups = NULL, prepanel = 
   prepanel.xYplot, panel = "panel.xYplot", scales = NULL, ..., xlab = 
   NULL, ylab = NULL, subset = TRUE, minor.ticks = NULL)
NULL
> search()
 [1] ".Data"   "Design"  "Hmisc"   "splus"   "stat"    "data"    "trellis"
 [8] "nlme3"   "rpart"   "main"   
> attach(diabetes)
> s <- summarize(glyhb, llist(frame,gender), smean.cl.boot)
> s
   frame gender    glyhb    Lower    Upper 
1     NA female 5.004286 4.248535 6.200000
2     NA   male 5.595000 4.455000 7.067500
3  large female 6.407250 5.684419 7.207156
4  large   male 5.901186 5.424217 6.485382
5 medium female 5.390177 5.025188 5.794829
6 medium   male 6.075077 5.419884 6.819735
7  small female 5.180882 4.746453 5.672276
8  small   male 4.760882 4.394985 5.251574
> names(s)
[1] "frame"  "gender" "glyhb"  "Lower"  "Upper" 
> Dotplot(frame ~ Cbind(glyhb,Lower,Upper)|gender, data=s)


> d <- expand.grid(x=seq(0,2*pi,length=150), p=1:3, shift=c(0,pi)) 
> xYplot(sin(x+shift)^p ~ x | shift, groups=p, data=d, type='l') 

> > > dfr <- expand.grid(month=1:12, continent=c('Europe','USA'), 
                   sex=c('female','male'))
attach(dfr)
set.seed(13)
y <- month/10 + 1*(sex=='female') + 2*(continent=='Europe') + 
  runif(48,-.15,.15)
lower <- y - runif(48,.05,.15)
upper <- y + runif(48,.05,.15)

xYplot(Cbind(y,lower,upper) ~ month,subset=sex=='male' & continent=='USA') 
# add ,label.curves=F to suppress use of labcurve to label curves where
farthest apart
+ > > > + > > > > Warning messages:
  Condition has 7 elements: only the first used in: e1 || e2
> > Problem: Object "farthest apart" not found
> xYplot(Cbind(y,lower,upper) ~ month|continent,groups=sex); Key() 
Warning messages:
1: Condition has 7 elements: only the first used in: e1 || e2
2: Condition has 7 elements: only the first used in: e1 || e2
> Key
function(x = NULL, y = NULL, lev = c("female", "male"), cex = c(1., 1.), col = 
   c(1, 1), font = c(1, 1), pch = c("o", "+"), ...)
{
   if(length(x)) {
      if(is.list(x)) {
         y <- x$y
         x <- x$x
      }
      key(x = x, y = y, text = list(lev, col = col), points = list(
         cex = cex, col = col, font = font, pch = pch), 
         transparent = TRUE, ...)
   }
   else key(text = list(lev, col = col), points = list(cex = cex, col = 
         col, font = font, pch = pch), transparent = TRUE, ...)
   invisible()
}
> xYplot(Cbind(y,lower,upper) ~
month,groups=sex,subset=continent=='Europe',keys='lines') 

> > xYplot(Cbind(y,lower,upper) ~
month,groups=sex,subset=continent=='Europe',method='bands') 

+ > > label(y) <- 'Quality of Life Score'   
# label is in Hmisc library = attr(y,'label') <- 'Quality...'; will be
# y-axis label 
# can also specify Cbind('Quality of Life Score'=y,lower,upper) 
xYplot(Cbind(y,lower,upper) ~ month, groups=sex,
       subset=continent=='Europe', method='alt bars',
        offset=.4)   
# 

> setTrellis
function(strip.blank = TRUE, lty.dot.line = 2, lwd.dot.line = 1)
{
   if(strip.blank)
      trellis.strip.blank()
   # in Hmisc Misc.s
   dot.line <- trellis.par.get("dot.line")
   dot.line$lwd <- lwd.dot.line
   dot.line$lty <- lty.dot.line
   trellis.par.set("dot.line", dot.line)
   invisible()
}
> setTrellis()
Problem in trellis.par.set: No data to interpret as logical value: if(p[i] != q[i]) stop("inconsistent component names") 
Use traceback() to see the call stack
> trellis.strip.blank()
Problem in trellis.par.set: No data to interpret as logical value: if(p[i] != q[i]) stop("inconsistent component names") 
Use traceback() to see the call stack
> ecdf(~ glyhb | frame*gender, data=diabetes)
> ecdf(~ glyhb | frame*gender, groups=cut2(age,g=2), data=diabetes)
> xyplot( y ~ time,  groups=subject.id, type='b') # or | subject.id
Problem in abs(x): needed atomic data, got an object of class "function" 

Script developed in the one-day version of the course given at the S-Plus Users Conference, Seattle, October 2000

#page(describe(pbc), multi=T)
datadensity(pbc)
nac <- naclus(pbc)
plot(nac)
naplot(nac)
plot(varclus(~.,data=diabetes))

x <- c(1,2,6,NA)
x <- impute(x)
x
is.imputed(x)
attributes(x)
describe(x)

x <- runif(100)
smean.cl.normal(x)
smean.cl.boot(x)
smean.sd(x)
smean.sdl(x)

binconf(6, 10, .01)

bpower(.1, .2, n=500)
bpower.sim(.1, .2, n=500)

spearman2(glyhb ~ age + gender + frame,
  data=diabetes, p=2)

s <- summary(glyhb ~ age + chol + gender +
  frame, data=diabetes)
options(digits=3)
s # same print(s)
plot(s)
s <- summary(glyhb ~ age + chol + gender +
  frame, data=diabetes,
  fun=smean.cl.boot)
s
plot(s, which=1:3, pch=c('.','[',']'))
show.pch()

s <- summary(drug ~ age + platelet + chol +
 spiders + edema,data=pbc)
options(width=100)
#page(s)
plot(s)

hist.data.frame(pbc)
#attach(pbc)
ecdf(albumin)
ecdf(albumin,q=c(.25,.5,.75))
ecdf(albumin,group=drug)
ecdf(diabetes,group=diabetes$gender,
 col=1:2, label.curve=F)
ecdf(albumin, datadensity='rug')
ecdf(albumin, datadensity='density')
scat1d(albumin)

attach(titanic3)
plsmo(age, survived, group=sex,
  col=1:2, datadensity=T)

ecdf(~age|pclass,groups=sex,q=.5)
bwplot(pclass~age|sex)
bwplot(sex~age|pclass)

bwplot(sex ~ age | pclass,
  panel=panel.bpplot)

set.seed(13)
x <- rnorm(1000)
g <- sample(1:6, 1000, replace=T)
x[g==1][1:20] <- rnorm(20)+3   # contaminate 20 x's for group 1

# default trellis box plot
bwplot(g ~ x)

# box-percentile plot with data density (rug plot)
bwplot(g ~ x, panel=panel.bpplot, probs=seq(.01,.49,by=.01), datadensity=T)
# add ,scat1d.opts=list(tfrac=1) to make all tick marks the same size
# when a group has > 125 observations

# small dot for means, show only .05,.125,.25,.375,.625,.75,.875,.95 quantiles
bwplot(g ~ x, panel=panel.bpplot, cex=.3)

# suppress means and reference lines for lower and upper quartiles
bwplot(g ~ x, panel=panel.bpplot, probs=c(.025,.1,.25), means=F, qref=F)

# continuous plot up until quartiles ("Tootsie Roll plot")
bwplot(g ~ x, panel=panel.bpplot, probs=seq(.01,.25,by=.01))

# start at quartiles then make it continuous ("coffin plot")
bwplot(g ~ x, panel=panel.bpplot, probs=seq(.25,.49,by=.01))

# same as previous but add a spike to give 0.95 interval
bwplot(g ~ x, panel=panel.bpplot, probs=c(.025,seq(.25,.49,by=.01)))

# decile plot with reference lines at outer quintiles and median
bwplot(g ~ x, panel=panel.bpplot, probs=c(.1,.2,.3,.4), qref=c(.5,.2,.8))

# default plot with tick marks showing all observations outside the outer
# box (.05 and .95 quantiles), with very small ticks
bwplot(g ~ x, panel=panel.bpplot, nout=.05, scat1d.opts=list(frac=.01))

# show 5 smallest and 5 largest observations
bwplot(g ~ x, panel=panel.bpplot, nout=5)

# Use a scat1d option (preserve=T) to ensure that the right peak extends 
# to the same position as the extreme scat1d

bwplot(~x , panel=panel.bpplot, probs=seq(.00,.5,by=.001), 
       datadensity=T, scat1d.opt=list(preserve=T))

# attach(diabetes)
label(chol)
xYplot(chol ~ age, groups=gender, type='p')
Key()
xYplot(chol ~ age|location, groups=gender, type='p')
Key()
?xYplot

dfr <- expand.grid(month=1:12, continent=c('Europe','USA'), 
                   sex=c('female','male'))
attach(dfr)
set.seed(13)
y <- month/10 + 1*(sex=='female') + 2*(continent=='Europe') + 
  runif(48,-.15,.15)
lower <- y - runif(48,.05,.15)
upper <- y + runif(48,.05,.15)

xYplot(Cbind(y,lower,upper) ~ month,subset=sex=='male'&continent=='USA')
# add ,label.curves=F to suppress use of labcurve to label curves where farthest apart

xYplot(Cbind(y,lower,upper) ~ month|continent,subset=sex=='male')
xYplot(Cbind(y,lower,upper) ~ month|continent,groups=sex); Key()
xYplot(Cbind(y,lower,upper) ~ month,groups=sex,subset=continent=='Europe')
xYplot(Cbind(y,lower,upper) ~ month,groups=sex,subset=continent=='Europe',keys='lines')
# keys='lines' causes labcurve to draw a legend where the panel is most empty


s <- summarize(y, llist(month,year), smean.cl.boot)
s
xYplot(Cbind(y, Lower, Upper) ~ month | year, data=s)

xYplot(glyhb ~ chol|gender,
 method=smean.cl.normal)

Dotplot(month ~ Cbind(y, Lower, Upper) | year, data=s) 
# or Cbind(...), groups=year, data=s

# Display a 5-number (5-quantile) summary (2 intervals, dot=median)
# Note that summarize produces a matrix for y, and Cbind(y) trusts the
# first column to be the point estimate (here the median)
s <- summarize(y, llist(month,year), quantile,
               probs=c(.5,.05,.25,.75,.95), type='matrix')
Dotplot(month ~ Cbind(y) | year, data=s)
# Use factor(year) to make actual years appear in conditioning title strips
dim(s)
names(s)
dim(s$y)

-- FrankHarrell - 04 Feb 2004; modified 11 Jul 2004
Topic revision: r4 - 01 Sep 2008, FrankHarrell
 

This site is powered by FoswikiCopyright © 2013-2020 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Vanderbilt Biostatistics Wiki? Send feedback