Statistical Graphics for Exploring Data, Presenting Information, and Understanding Statistical Models | Scripts
Graphical methods are being increasingly used for exploratory data
analysis. Some of the many graphical tools that are useful in this
setting are scatterplot matrices, nonparametric smoothers, and tree
diagrams. Statistical graphics for presenting information have been
used much longer, but most of the commonly used graphics used in
papers, presentations, and the popular media, such as bar charts and
pie charts, are either poor or misleading in communicating information
to the reader. This short course begins with a
series of graphical horror stories from the scientific and lay press.
Then elements of graphical perception and good graph construction,
many from the writings of Bill Cleveland, are covered. Practical
suggestions for choosing the best chart or graph type, making good and
clear graphics, and formatting are covered. Techniques for
simultaneous presentation of multiple variables are described.
Complex outcome or risk adjustment models are not easily grasped by
non-statisticians. Special graphics such as effect charts and
nomograms can assist physicians and other consumers of statistical
analysis in understanding statistical models and in using them for
obtaining predictions for individual subjects. Examples of model
presentation graphics will be given.
At the close of the short course some graphical marvels from
the literature (especially from Edward Tufte and Howard Wainer) are
presented.
S script developed in the S-Plus International User Conference in Philadelphia, 16Oct01
S-PLUS : Copyright (c) 1988, 2000 MathSoft, Inc.
S : Copyright Lucent Technologies, Inc.
Version 6.0 Release 1 for Linux 2.2.12 : 2000
Working data will be in .Data
Hmisc library by Frank E Harrell Jr, Tue Jul 17 19:34:31 EDT 2001
This library is provided by FE Harrell <fharrell@virginia.edu>.
It is not supported by Insightful Corp.
Type library(Hmisc, help=T) to view the readme file for the Hmisc library.
Type help(library='Hmisc') to see detailed documentation.
Hmisc redefines [.factor to drop unused levels of factor variables
when subscripting. To prevent this behaviour, issue the command
options(drop.unused.levels=F). You must use
options(drop.unused.levels=F) for multicomp to calculate correctly.
Design library by Frank E Harrell Jr, Tue Jul 17 19:34:50 EDT 2001
This library is provided by FE Harrell <fharrell@virginia.edu>.
It is not supported by Insightful Corp.
Type library(Design, help=T) to view the readme file for the Design library.
Type help(library='Design') to see detailed documentation.
options(contrasts=c('contr.treatment','contr.poly')) set
> ls()
[1] ".First" ".Last.fixed" ".Last.value" ".Random.seed"
[5] ".orig.cal" "ABM" "FEV" "birth.estriol"
[9] "boston" "cal" "counties" "dd"
[13] "diabetes" "f" "g" "hospital"
[17] "last.dump" "last.warning" "lead" "nac"
[21] "olympics.1996" "pbc" "r" "s"
[25] "support" "titanic.trans" "titanic3" "v"
[29] "w" "z"
> datadensity(diabetes,group=diabetes$gender)
> datadensity(diabetes[diabetes$gender=='female',])
> describe(diabetes$gender)
diabetes$gender
n missing unique
403 0 2
male (169, 42%), female (234, 58%)
> nac <- naclus(diabetes)
> plot(nac)
> w <- latex(describe(pbc))
> describe(pbc)
pbc
19 Variables 418 Observations
---------------------------------------------------------------------------
bili : Serum Bilirubin (mg/dl)
n missing unique Mean .05 .10 .25 .50 .75 .90 .95
418 0 98 3.221 0.50 0.60 0.80 1.40 3.40 8.03 14.00
lowest : 0.3 0.4 0.5 0.6 0.7, highest: 21.6 22.5 24.5 25.5 28.0
---------------------------------------------------------------------------
....
Conflicting definitions of "[.data.frame" on databases "Hmisc" and "splus"
in: x[ - yvar]
> par(mfrow=c(1,1))
> plot(f);text(f)
> describe(titanic3)
titanic3
14 Variables 1309 Observations
---------------------------------------------------------------------------
pclass : Passenger Class
n missing unique
1309 0 3
1st (323, 25%), 2nd (277, 21%), 3rd (709, 54%)
---------------------------------------------------------------------------
. . .
> datadensity(titanic3)
> f <- rpart(is.na(age) ~ pclass + sex + survived,data=titanic3)
> plot(f);text(f)
> v <- varclus(~., data=diabetes)
> plot(v)
> attach(diabetes)
> smean.cl.normal(glyhb)
Mean Lower Upper
5.58977 5.366506 5.813035
> smean.cl.boot(glyhb)
Mean Lower Upper
5.589769 5.356792 5.810075
> smean.cl.boot(glyhb)
Mean Lower Upper
5.589769 5.373114 5.817752
> set.seed(17)
> smean.cl.boot(glyhb)
Mean Lower Upper
5.589769 5.380151 5.812173
> set.seed(17)
> smean.cl.boot(glyhb)
Mean Lower Upper
5.589769 5.380151 5.812173
> tapply(glyhb, gender, mean)
male female
NA NA
> tapply(glyhb, gender, mean, na.rmT)
Problem: Object "na.rmT" not found
Use traceback() to see the call stack
> tapply(glyhb, gender, mean, na.rm=T)
male female
5.724074 5.494342
> tapply(glyhb, gender, quantile, na.rm=T)
$male:
0% 25% 50% 75% 100%
2.68 4.3825 4.9 5.695 16.11
$female:
0% 25% 50% 75% 100%
2.73 4.3775 4.785 5.5525 14.94
> tapply(glyhb, gender, smean.cl.boot)
$male:
Mean Lower Upper
5.724074 5.37528 6.09749
$female:
Mean Lower Upper
5.494342 5.250717 5.757164
> summarize(glyhb, gender, na.rm=T)
Problem in summarize(glyhb, gender, na.rm = T): Argument "" is missing, with no default
Use traceback() to see the call stack
> summarize(glyhb, gender, mean, na.rm=T)
gender glyhb
1 female 5.494342
2 male 5.724074
> summarize(glyhb, gender, smean.sd)
gender glyhb SD
1 female 5.494344 2.133675
2 male 5.724074 2.387779
> summarize(glyhb, gender, smean.cl.boot)
gender glyhb Lower Upper
1 female 5.494342 5.221705 5.790896
2 male 5.724074 5.391357 6.103901
> summarize(glyhb, llist(gender,frame), smean.cl.boot)
gender frame glyhb Lower Upper
1 female NA 5.004286 4.247143 6.169999
2 female large 6.407250 5.734312 7.089687
3 female medium 5.390177 5.031194 5.801590
4 female small 5.180882 4.735415 5.689444
5 male NA 5.595000 4.455000 7.067500
6 male large 5.901186 5.428293 6.471082
7 male medium 6.075077 5.368205 6.812477
8 male small 4.760882 4.389346 5.217780
> summarize(glyhb, llist(gender,cut2(age,g=3)), smean.cl.boot)
gender cut2(age, g = 3) glyhb Lower Upper
1 female [19,39) 4.690118 4.411160 5.032921
2 female [39,55) 5.389333 4.934533 5.839916
3 female [55,92] 6.615441 6.055290 7.182089
4 male [19,39) 4.859636 4.609996 5.160201
5 male [39,55) 5.486383 4.884255 6.256220
6 male [55,92] 6.702667 6.074722 7.425051
> summarize(glyhb, llist(Sex=gender,Age=cut2(age,g=2)), smean.cl.boot)
Sex Age glyhb Lower Upper
1 female [19,46) 4.797031 4.572688 5.040264
2 female [46,92] 6.386900 5.920603 6.875483
3 male [19,46) 4.812667 4.570495 5.102008
4 male [46,92] 6.509770 5.939091 7.161430
> s <- summary(survived ~ age + sex + pclass,data=titanic3)
> s
Survived N=1309
---------------+-------------+----+---------+
| |N |survived |
---------------+-------------+----+---------+
Age |[ 0.167,22.0)| 290|0.4310345|
|[22.000,28.5)| 246|0.3861789|
|[28.500,40.0)| 265|0.4188679|
|[40.000,80.0]| 245|0.3918367|
|Missing | 263|0.2775665|
---------------+-------------+----+---------+
Sex |female | 466|0.7274678|
|male | 843|0.1909846|
---------------+-------------+----+---------+
Passenger Class|1st | 323|0.6191950|
|2nd | 277|0.4296029|
|3rd | 709|0.2552891|
---------------+-------------+----+---------+
Overall | |1309|0.3819710|
---------------+-------------+----+---------+
> plot(s)
> search()
[1] ".Data" "diabetes" "Design" "Hmisc" "splus" "stat"
[7] "data" "trellis" "nlme3" "rpart" "main"
> detach(2)
> attach(titanic3)
> plsmo(age,survived,group=pclass,datadensity=T)
> plsmo(age,survived,group=interaction(sex,pclass),col=1:6,datadensity=T)
>
> detach(2)
> bwplot(cut2(age,g=4)~glyhb|frame*gender,data=diabetes)
> bwplot(cut2(age,g=4)~glyhb|frame*gender,data=diabetes,panel=panel.bpplot)
> args(xYplot)
function(formula, data = sys.parent(1), groups = NULL, prepanel =
prepanel.xYplot, panel = "panel.xYplot", scales = NULL, ..., xlab =
NULL, ylab = NULL, subset = TRUE, minor.ticks = NULL)
NULL
> search()
[1] ".Data" "Design" "Hmisc" "splus" "stat" "data" "trellis"
[8] "nlme3" "rpart" "main"
> attach(diabetes)
> s <- summarize(glyhb, llist(frame,gender), smean.cl.boot)
> s
frame gender glyhb Lower Upper
1 NA female 5.004286 4.248535 6.200000
2 NA male 5.595000 4.455000 7.067500
3 large female 6.407250 5.684419 7.207156
4 large male 5.901186 5.424217 6.485382
5 medium female 5.390177 5.025188 5.794829
6 medium male 6.075077 5.419884 6.819735
7 small female 5.180882 4.746453 5.672276
8 small male 4.760882 4.394985 5.251574
> names(s)
[1] "frame" "gender" "glyhb" "Lower" "Upper"
> Dotplot(frame ~ Cbind(glyhb,Lower,Upper)|gender, data=s)
> d <- expand.grid(x=seq(0,2*pi,length=150), p=1:3, shift=c(0,pi))
> xYplot(sin(x+shift)^p ~ x | shift, groups=p, data=d, type='l')
> > > dfr <- expand.grid(month=1:12, continent=c('Europe','USA'),
sex=c('female','male'))
attach(dfr)
set.seed(13)
y <- month/10 + 1*(sex=='female') + 2*(continent=='Europe') +
runif(48,-.15,.15)
lower <- y - runif(48,.05,.15)
upper <- y + runif(48,.05,.15)
xYplot(Cbind(y,lower,upper) ~ month,subset=sex=='male' & continent=='USA')
# add ,label.curves=F to suppress use of labcurve to label curves where
farthest apart
+ > > > + > > > > Warning messages:
Condition has 7 elements: only the first used in: e1 || e2
> > Problem: Object "farthest apart" not found
> xYplot(Cbind(y,lower,upper) ~ month|continent,groups=sex); Key()
Warning messages:
1: Condition has 7 elements: only the first used in: e1 || e2
2: Condition has 7 elements: only the first used in: e1 || e2
> Key
function(x = NULL, y = NULL, lev = c("female", "male"), cex = c(1., 1.), col =
c(1, 1), font = c(1, 1), pch = c("o", "+"), ...)
{
if(length(x)) {
if(is.list(x)) {
y <- x$y
x <- x$x
}
key(x = x, y = y, text = list(lev, col = col), points = list(
cex = cex, col = col, font = font, pch = pch),
transparent = TRUE, ...)
}
else key(text = list(lev, col = col), points = list(cex = cex, col =
col, font = font, pch = pch), transparent = TRUE, ...)
invisible()
}
> xYplot(Cbind(y,lower,upper) ~
month,groups=sex,subset=continent=='Europe',keys='lines')
> > xYplot(Cbind(y,lower,upper) ~
month,groups=sex,subset=continent=='Europe',method='bands')
+ > > label(y) <- 'Quality of Life Score'
# label is in Hmisc library = attr(y,'label') <- 'Quality...'; will be
# y-axis label
# can also specify Cbind('Quality of Life Score'=y,lower,upper)
xYplot(Cbind(y,lower,upper) ~ month, groups=sex,
subset=continent=='Europe', method='alt bars',
offset=.4)
#
> setTrellis
function(strip.blank = TRUE, lty.dot.line = 2, lwd.dot.line = 1)
{
if(strip.blank)
trellis.strip.blank()
# in Hmisc Misc.s
dot.line <- trellis.par.get("dot.line")
dot.line$lwd <- lwd.dot.line
dot.line$lty <- lty.dot.line
trellis.par.set("dot.line", dot.line)
invisible()
}
> setTrellis()
Problem in trellis.par.set: No data to interpret as logical value: if(p[i] != q[i]) stop("inconsistent component names")
Use traceback() to see the call stack
> trellis.strip.blank()
Problem in trellis.par.set: No data to interpret as logical value: if(p[i] != q[i]) stop("inconsistent component names")
Use traceback() to see the call stack
> ecdf(~ glyhb | frame*gender, data=diabetes)
> ecdf(~ glyhb | frame*gender, groups=cut2(age,g=2), data=diabetes)
> xyplot( y ~ time, groups=subject.id, type='b') # or | subject.id
Problem in abs(x): needed atomic data, got an object of class "function"
Script developed in the one-day version of the course given at the S-Plus Users Conference, Seattle, October 2000
#page(describe(pbc), multi=T)
datadensity(pbc)
nac <- naclus(pbc)
plot(nac)
naplot(nac)
plot(varclus(~.,data=diabetes))
x <- c(1,2,6,NA)
x <- impute(x)
x
is.imputed(x)
attributes(x)
describe(x)
x <- runif(100)
smean.cl.normal(x)
smean.cl.boot(x)
smean.sd(x)
smean.sdl(x)
binconf(6, 10, .01)
bpower(.1, .2, n=500)
bpower.sim(.1, .2, n=500)
spearman2(glyhb ~ age + gender + frame,
data=diabetes, p=2)
s <- summary(glyhb ~ age + chol + gender +
frame, data=diabetes)
options(digits=3)
s # same print(s)
plot(s)
s <- summary(glyhb ~ age + chol + gender +
frame, data=diabetes,
fun=smean.cl.boot)
s
plot(s, which=1:3, pch=c('.','[',']'))
show.pch()
s <- summary(drug ~ age + platelet + chol +
spiders + edema,data=pbc)
options(width=100)
#page(s)
plot(s)
hist.data.frame(pbc)
#attach(pbc)
ecdf(albumin)
ecdf(albumin,q=c(.25,.5,.75))
ecdf(albumin,group=drug)
ecdf(diabetes,group=diabetes$gender,
col=1:2, label.curve=F)
ecdf(albumin, datadensity='rug')
ecdf(albumin, datadensity='density')
scat1d(albumin)
attach(titanic3)
plsmo(age, survived, group=sex,
col=1:2, datadensity=T)
ecdf(~age|pclass,groups=sex,q=.5)
bwplot(pclass~age|sex)
bwplot(sex~age|pclass)
bwplot(sex ~ age | pclass,
panel=panel.bpplot)
set.seed(13)
x <- rnorm(1000)
g <- sample(1:6, 1000, replace=T)
x[g==1][1:20] <- rnorm(20)+3 # contaminate 20 x's for group 1
# default trellis box plot
bwplot(g ~ x)
# box-percentile plot with data density (rug plot)
bwplot(g ~ x, panel=panel.bpplot, probs=seq(.01,.49,by=.01), datadensity=T)
# add ,scat1d.opts=list(tfrac=1) to make all tick marks the same size
# when a group has > 125 observations
# small dot for means, show only .05,.125,.25,.375,.625,.75,.875,.95 quantiles
bwplot(g ~ x, panel=panel.bpplot, cex=.3)
# suppress means and reference lines for lower and upper quartiles
bwplot(g ~ x, panel=panel.bpplot, probs=c(.025,.1,.25), means=F, qref=F)
# continuous plot up until quartiles ("Tootsie Roll plot")
bwplot(g ~ x, panel=panel.bpplot, probs=seq(.01,.25,by=.01))
# start at quartiles then make it continuous ("coffin plot")
bwplot(g ~ x, panel=panel.bpplot, probs=seq(.25,.49,by=.01))
# same as previous but add a spike to give 0.95 interval
bwplot(g ~ x, panel=panel.bpplot, probs=c(.025,seq(.25,.49,by=.01)))
# decile plot with reference lines at outer quintiles and median
bwplot(g ~ x, panel=panel.bpplot, probs=c(.1,.2,.3,.4), qref=c(.5,.2,.8))
# default plot with tick marks showing all observations outside the outer
# box (.05 and .95 quantiles), with very small ticks
bwplot(g ~ x, panel=panel.bpplot, nout=.05, scat1d.opts=list(frac=.01))
# show 5 smallest and 5 largest observations
bwplot(g ~ x, panel=panel.bpplot, nout=5)
# Use a scat1d option (preserve=T) to ensure that the right peak extends
# to the same position as the extreme scat1d
bwplot(~x , panel=panel.bpplot, probs=seq(.00,.5,by=.001),
datadensity=T, scat1d.opt=list(preserve=T))
# attach(diabetes)
label(chol)
xYplot(chol ~ age, groups=gender, type='p')
Key()
xYplot(chol ~ age|location, groups=gender, type='p')
Key()
?xYplot
dfr <- expand.grid(month=1:12, continent=c('Europe','USA'),
sex=c('female','male'))
attach(dfr)
set.seed(13)
y <- month/10 + 1*(sex=='female') + 2*(continent=='Europe') +
runif(48,-.15,.15)
lower <- y - runif(48,.05,.15)
upper <- y + runif(48,.05,.15)
xYplot(Cbind(y,lower,upper) ~ month,subset=sex=='male'&continent=='USA')
# add ,label.curves=F to suppress use of labcurve to label curves where farthest apart
xYplot(Cbind(y,lower,upper) ~ month|continent,subset=sex=='male')
xYplot(Cbind(y,lower,upper) ~ month|continent,groups=sex); Key()
xYplot(Cbind(y,lower,upper) ~ month,groups=sex,subset=continent=='Europe')
xYplot(Cbind(y,lower,upper) ~ month,groups=sex,subset=continent=='Europe',keys='lines')
# keys='lines' causes labcurve to draw a legend where the panel is most empty
s <- summarize(y, llist(month,year), smean.cl.boot)
s
xYplot(Cbind(y, Lower, Upper) ~ month | year, data=s)
xYplot(glyhb ~ chol|gender,
method=smean.cl.normal)
Dotplot(month ~ Cbind(y, Lower, Upper) | year, data=s)
# or Cbind(...), groups=year, data=s
# Display a 5-number (5-quantile) summary (2 intervals, dot=median)
# Note that summarize produces a matrix for y, and Cbind(y) trusts the
# first column to be the point estimate (here the median)
s <- summarize(y, llist(month,year), quantile,
probs=c(.5,.05,.25,.75,.95), type='matrix')
Dotplot(month ~ Cbind(y) | year, data=s)
# Use factor(year) to make actual years appear in conditioning title strips
dim(s)
names(s)
dim(s$y)
--
FrankHarrell - 04 Feb 2004; modified 11 Jul 2004