Graphical methods are being increasingly used for exploratory data analysis. Some of the many graphical tools that are useful in this setting are scatterplot matrices, nonparametric smoothers, and tree diagrams. Statistical graphics for presenting information have been used much longer, but most of the commonly used graphics used in papers, presentations, and the popular media, such as bar charts and pie charts, are either poor or misleading in communicating information to the reader. This short course begins with a series of graphical horror stories from the scientific and lay press. Then elements of graphical perception and good graph construction, many from the writings of Bill Cleveland, are covered. Practical suggestions for choosing the best chart or graph type, making good and clear graphics, and formatting are covered. Techniques for simultaneous presentation of multiple variables are described.
Complex outcome or risk adjustment models are not easily grasped by non-statisticians. Special graphics such as effect charts and nomograms can assist physicians and other consumers of statistical analysis in understanding statistical models and in using them for obtaining predictions for individual subjects. Examples of model presentation graphics will be given.
At the close of the short course some graphical marvels from the literature (especially from Edward Tufte and Howard Wainer) are presented.S-PLUS : Copyright (c) 1988, 2000 MathSoft, Inc. S : Copyright Lucent Technologies, Inc. Version 6.0 Release 1 for Linux 2.2.12 : 2000 Working data will be in .Data Hmisc library by Frank E Harrell Jr, Tue Jul 17 19:34:31 EDT 2001 This library is provided by FE Harrell <fharrell@virginia.edu>. It is not supported by Insightful Corp. Type library(Hmisc, help=T) to view the readme file for the Hmisc library. Type help(library='Hmisc') to see detailed documentation. Hmisc redefines [.factor to drop unused levels of factor variables when subscripting. To prevent this behaviour, issue the command options(drop.unused.levels=F). You must use options(drop.unused.levels=F) for multicomp to calculate correctly. Design library by Frank E Harrell Jr, Tue Jul 17 19:34:50 EDT 2001 This library is provided by FE Harrell <fharrell@virginia.edu>. It is not supported by Insightful Corp. Type library(Design, help=T) to view the readme file for the Design library. Type help(library='Design') to see detailed documentation. options(contrasts=c('contr.treatment','contr.poly')) set > ls() [1] ".First" ".Last.fixed" ".Last.value" ".Random.seed" [5] ".orig.cal" "ABM" "FEV" "birth.estriol" [9] "boston" "cal" "counties" "dd" [13] "diabetes" "f" "g" "hospital" [17] "last.dump" "last.warning" "lead" "nac" [21] "olympics.1996" "pbc" "r" "s" [25] "support" "titanic.trans" "titanic3" "v" [29] "w" "z" > datadensity(diabetes,group=diabetes$gender) > datadensity(diabetes[diabetes$gender=='female',]) > describe(diabetes$gender) diabetes$gender n missing unique 403 0 2 male (169, 42%), female (234, 58%) > nac <- naclus(diabetes) > plot(nac) > w <- latex(describe(pbc)) > describe(pbc) pbc 19 Variables 418 Observations --------------------------------------------------------------------------- bili : Serum Bilirubin (mg/dl) n missing unique Mean .05 .10 .25 .50 .75 .90 .95 418 0 98 3.221 0.50 0.60 0.80 1.40 3.40 8.03 14.00 lowest : 0.3 0.4 0.5 0.6 0.7, highest: 21.6 22.5 24.5 25.5 28.0 --------------------------------------------------------------------------- .... Conflicting definitions of "[.data.frame" on databases "Hmisc" and "splus" in: x[ - yvar] > par(mfrow=c(1,1)) > plot(f);text(f) > describe(titanic3) titanic3 14 Variables 1309 Observations --------------------------------------------------------------------------- pclass : Passenger Class n missing unique 1309 0 3 1st (323, 25%), 2nd (277, 21%), 3rd (709, 54%) --------------------------------------------------------------------------- . . . > datadensity(titanic3) > f <- rpart(is.na(age) ~ pclass + sex + survived,data=titanic3) > plot(f);text(f) > v <- varclus(~., data=diabetes) > plot(v) > attach(diabetes) > smean.cl.normal(glyhb) Mean Lower Upper 5.58977 5.366506 5.813035 > smean.cl.boot(glyhb) Mean Lower Upper 5.589769 5.356792 5.810075 > smean.cl.boot(glyhb) Mean Lower Upper 5.589769 5.373114 5.817752 > set.seed(17) > smean.cl.boot(glyhb) Mean Lower Upper 5.589769 5.380151 5.812173 > set.seed(17) > smean.cl.boot(glyhb) Mean Lower Upper 5.589769 5.380151 5.812173 > tapply(glyhb, gender, mean) male female NA NA > tapply(glyhb, gender, mean, na.rmT) Problem: Object "na.rmT" not found Use traceback() to see the call stack > tapply(glyhb, gender, mean, na.rm=T) male female 5.724074 5.494342 > tapply(glyhb, gender, quantile, na.rm=T) $male: 0% 25% 50% 75% 100% 2.68 4.3825 4.9 5.695 16.11 $female: 0% 25% 50% 75% 100% 2.73 4.3775 4.785 5.5525 14.94 > tapply(glyhb, gender, smean.cl.boot) $male: Mean Lower Upper 5.724074 5.37528 6.09749 $female: Mean Lower Upper 5.494342 5.250717 5.757164 > summarize(glyhb, gender, na.rm=T) Problem in summarize(glyhb, gender, na.rm = T): Argument "" is missing, with no default Use traceback() to see the call stack > summarize(glyhb, gender, mean, na.rm=T) gender glyhb 1 female 5.494342 2 male 5.724074 > summarize(glyhb, gender, smean.sd) gender glyhb SD 1 female 5.494344 2.133675 2 male 5.724074 2.387779 > summarize(glyhb, gender, smean.cl.boot) gender glyhb Lower Upper 1 female 5.494342 5.221705 5.790896 2 male 5.724074 5.391357 6.103901 > summarize(glyhb, llist(gender,frame), smean.cl.boot) gender frame glyhb Lower Upper 1 female NA 5.004286 4.247143 6.169999 2 female large 6.407250 5.734312 7.089687 3 female medium 5.390177 5.031194 5.801590 4 female small 5.180882 4.735415 5.689444 5 male NA 5.595000 4.455000 7.067500 6 male large 5.901186 5.428293 6.471082 7 male medium 6.075077 5.368205 6.812477 8 male small 4.760882 4.389346 5.217780 > summarize(glyhb, llist(gender,cut2(age,g=3)), smean.cl.boot) gender cut2(age, g = 3) glyhb Lower Upper 1 female [19,39) 4.690118 4.411160 5.032921 2 female [39,55) 5.389333 4.934533 5.839916 3 female [55,92] 6.615441 6.055290 7.182089 4 male [19,39) 4.859636 4.609996 5.160201 5 male [39,55) 5.486383 4.884255 6.256220 6 male [55,92] 6.702667 6.074722 7.425051 > summarize(glyhb, llist(Sex=gender,Age=cut2(age,g=2)), smean.cl.boot) Sex Age glyhb Lower Upper 1 female [19,46) 4.797031 4.572688 5.040264 2 female [46,92] 6.386900 5.920603 6.875483 3 male [19,46) 4.812667 4.570495 5.102008 4 male [46,92] 6.509770 5.939091 7.161430 > s <- summary(survived ~ age + sex + pclass,data=titanic3) > s Survived N=1309 ---------------+-------------+----+---------+ | |N |survived | ---------------+-------------+----+---------+ Age |[ 0.167,22.0)| 290|0.4310345| |[22.000,28.5)| 246|0.3861789| |[28.500,40.0)| 265|0.4188679| |[40.000,80.0]| 245|0.3918367| |Missing | 263|0.2775665| ---------------+-------------+----+---------+ Sex |female | 466|0.7274678| |male | 843|0.1909846| ---------------+-------------+----+---------+ Passenger Class|1st | 323|0.6191950| |2nd | 277|0.4296029| |3rd | 709|0.2552891| ---------------+-------------+----+---------+ Overall | |1309|0.3819710| ---------------+-------------+----+---------+ > plot(s) > search() [1] ".Data" "diabetes" "Design" "Hmisc" "splus" "stat" [7] "data" "trellis" "nlme3" "rpart" "main" > detach(2) > attach(titanic3) > plsmo(age,survived,group=pclass,datadensity=T) > plsmo(age,survived,group=interaction(sex,pclass),col=1:6,datadensity=T) > > detach(2) > bwplot(cut2(age,g=4)~glyhb|frame*gender,data=diabetes) > bwplot(cut2(age,g=4)~glyhb|frame*gender,data=diabetes,panel=panel.bpplot) > args(xYplot) function(formula, data = sys.parent(1), groups = NULL, prepanel = prepanel.xYplot, panel = "panel.xYplot", scales = NULL, ..., xlab = NULL, ylab = NULL, subset = TRUE, minor.ticks = NULL) NULL > search() [1] ".Data" "Design" "Hmisc" "splus" "stat" "data" "trellis" [8] "nlme3" "rpart" "main" > attach(diabetes) > s <- summarize(glyhb, llist(frame,gender), smean.cl.boot) > s frame gender glyhb Lower Upper 1 NA female 5.004286 4.248535 6.200000 2 NA male 5.595000 4.455000 7.067500 3 large female 6.407250 5.684419 7.207156 4 large male 5.901186 5.424217 6.485382 5 medium female 5.390177 5.025188 5.794829 6 medium male 6.075077 5.419884 6.819735 7 small female 5.180882 4.746453 5.672276 8 small male 4.760882 4.394985 5.251574 > names(s) [1] "frame" "gender" "glyhb" "Lower" "Upper" > Dotplot(frame ~ Cbind(glyhb,Lower,Upper)|gender, data=s) > d <- expand.grid(x=seq(0,2*pi,length=150), p=1:3, shift=c(0,pi)) > xYplot(sin(x+shift)^p ~ x | shift, groups=p, data=d, type='l') > > > dfr <- expand.grid(month=1:12, continent=c('Europe','USA'), sex=c('female','male')) attach(dfr) set.seed(13) y <- month/10 + 1*(sex=='female') + 2*(continent=='Europe') + runif(48,-.15,.15) lower <- y - runif(48,.05,.15) upper <- y + runif(48,.05,.15) xYplot(Cbind(y,lower,upper) ~ month,subset=sex=='male' & continent=='USA') # add ,label.curves=F to suppress use of labcurve to label curves where farthest apart + > > > + > > > > Warning messages: Condition has 7 elements: only the first used in: e1 || e2 > > Problem: Object "farthest apart" not found > xYplot(Cbind(y,lower,upper) ~ month|continent,groups=sex); Key() Warning messages: 1: Condition has 7 elements: only the first used in: e1 || e2 2: Condition has 7 elements: only the first used in: e1 || e2 > Key function(x = NULL, y = NULL, lev = c("female", "male"), cex = c(1., 1.), col = c(1, 1), font = c(1, 1), pch = c("o", "+"), ...) { if(length(x)) { if(is.list(x)) { y <- x$y x <- x$x } key(x = x, y = y, text = list(lev, col = col), points = list( cex = cex, col = col, font = font, pch = pch), transparent = TRUE, ...) } else key(text = list(lev, col = col), points = list(cex = cex, col = col, font = font, pch = pch), transparent = TRUE, ...) invisible() } > xYplot(Cbind(y,lower,upper) ~ month,groups=sex,subset=continent=='Europe',keys='lines') > > xYplot(Cbind(y,lower,upper) ~ month,groups=sex,subset=continent=='Europe',method='bands') + > > label(y) <- 'Quality of Life Score' # label is in Hmisc library = attr(y,'label') <- 'Quality...'; will be # y-axis label # can also specify Cbind('Quality of Life Score'=y,lower,upper) xYplot(Cbind(y,lower,upper) ~ month, groups=sex, subset=continent=='Europe', method='alt bars', offset=.4) # > setTrellis function(strip.blank = TRUE, lty.dot.line = 2, lwd.dot.line = 1) { if(strip.blank) trellis.strip.blank() # in Hmisc Misc.s dot.line <- trellis.par.get("dot.line") dot.line$lwd <- lwd.dot.line dot.line$lty <- lty.dot.line trellis.par.set("dot.line", dot.line) invisible() } > setTrellis() Problem in trellis.par.set: No data to interpret as logical value: if(p[i] != q[i]) stop("inconsistent component names") Use traceback() to see the call stack > trellis.strip.blank() Problem in trellis.par.set: No data to interpret as logical value: if(p[i] != q[i]) stop("inconsistent component names") Use traceback() to see the call stack > ecdf(~ glyhb | frame*gender, data=diabetes) > ecdf(~ glyhb | frame*gender, groups=cut2(age,g=2), data=diabetes) > xyplot( y ~ time, groups=subject.id, type='b') # or | subject.id Problem in abs(x): needed atomic data, got an object of class "function"
#page(describe(pbc), multi=T) datadensity(pbc) nac <- naclus(pbc) plot(nac) naplot(nac) plot(varclus(~.,data=diabetes)) x <- c(1,2,6,NA) x <- impute(x) x is.imputed(x) attributes(x) describe(x) x <- runif(100) smean.cl.normal(x) smean.cl.boot(x) smean.sd(x) smean.sdl(x) binconf(6, 10, .01) bpower(.1, .2, n=500) bpower.sim(.1, .2, n=500) spearman2(glyhb ~ age + gender + frame, data=diabetes, p=2) s <- summary(glyhb ~ age + chol + gender + frame, data=diabetes) options(digits=3) s # same print(s) plot(s) s <- summary(glyhb ~ age + chol + gender + frame, data=diabetes, fun=smean.cl.boot) s plot(s, which=1:3, pch=c('.','[',']')) show.pch() s <- summary(drug ~ age + platelet + chol + spiders + edema,data=pbc) options(width=100) #page(s) plot(s) hist.data.frame(pbc) #attach(pbc) ecdf(albumin) ecdf(albumin,q=c(.25,.5,.75)) ecdf(albumin,group=drug) ecdf(diabetes,group=diabetes$gender, col=1:2, label.curve=F) ecdf(albumin, datadensity='rug') ecdf(albumin, datadensity='density') scat1d(albumin) attach(titanic3) plsmo(age, survived, group=sex, col=1:2, datadensity=T) ecdf(~age|pclass,groups=sex,q=.5) bwplot(pclass~age|sex) bwplot(sex~age|pclass) bwplot(sex ~ age | pclass, panel=panel.bpplot) set.seed(13) x <- rnorm(1000) g <- sample(1:6, 1000, replace=T) x[g==1][1:20] <- rnorm(20)+3 # contaminate 20 x's for group 1 # default trellis box plot bwplot(g ~ x) # box-percentile plot with data density (rug plot) bwplot(g ~ x, panel=panel.bpplot, probs=seq(.01,.49,by=.01), datadensity=T) # add ,scat1d.opts=list(tfrac=1) to make all tick marks the same size # when a group has > 125 observations # small dot for means, show only .05,.125,.25,.375,.625,.75,.875,.95 quantiles bwplot(g ~ x, panel=panel.bpplot, cex=.3) # suppress means and reference lines for lower and upper quartiles bwplot(g ~ x, panel=panel.bpplot, probs=c(.025,.1,.25), means=F, qref=F) # continuous plot up until quartiles ("Tootsie Roll plot") bwplot(g ~ x, panel=panel.bpplot, probs=seq(.01,.25,by=.01)) # start at quartiles then make it continuous ("coffin plot") bwplot(g ~ x, panel=panel.bpplot, probs=seq(.25,.49,by=.01)) # same as previous but add a spike to give 0.95 interval bwplot(g ~ x, panel=panel.bpplot, probs=c(.025,seq(.25,.49,by=.01))) # decile plot with reference lines at outer quintiles and median bwplot(g ~ x, panel=panel.bpplot, probs=c(.1,.2,.3,.4), qref=c(.5,.2,.8)) # default plot with tick marks showing all observations outside the outer # box (.05 and .95 quantiles), with very small ticks bwplot(g ~ x, panel=panel.bpplot, nout=.05, scat1d.opts=list(frac=.01)) # show 5 smallest and 5 largest observations bwplot(g ~ x, panel=panel.bpplot, nout=5) # Use a scat1d option (preserve=T) to ensure that the right peak extends # to the same position as the extreme scat1d bwplot(~x , panel=panel.bpplot, probs=seq(.00,.5,by=.001), datadensity=T, scat1d.opt=list(preserve=T)) # attach(diabetes) label(chol) xYplot(chol ~ age, groups=gender, type='p') Key() xYplot(chol ~ age|location, groups=gender, type='p') Key() ?xYplot dfr <- expand.grid(month=1:12, continent=c('Europe','USA'), sex=c('female','male')) attach(dfr) set.seed(13) y <- month/10 + 1*(sex=='female') + 2*(continent=='Europe') + runif(48,-.15,.15) lower <- y - runif(48,.05,.15) upper <- y + runif(48,.05,.15) xYplot(Cbind(y,lower,upper) ~ month,subset=sex=='male'&continent=='USA') # add ,label.curves=F to suppress use of labcurve to label curves where farthest apart xYplot(Cbind(y,lower,upper) ~ month|continent,subset=sex=='male') xYplot(Cbind(y,lower,upper) ~ month|continent,groups=sex); Key() xYplot(Cbind(y,lower,upper) ~ month,groups=sex,subset=continent=='Europe') xYplot(Cbind(y,lower,upper) ~ month,groups=sex,subset=continent=='Europe',keys='lines') # keys='lines' causes labcurve to draw a legend where the panel is most empty s <- summarize(y, llist(month,year), smean.cl.boot) s xYplot(Cbind(y, Lower, Upper) ~ month | year, data=s) xYplot(glyhb ~ chol|gender, method=smean.cl.normal) Dotplot(month ~ Cbind(y, Lower, Upper) | year, data=s) # or Cbind(...), groups=year, data=s # Display a 5-number (5-quantile) summary (2 intervals, dot=median) # Note that summarize produces a matrix for y, and Cbind(y) trusts the # first column to be the point estimate (here the median) s <- summarize(y, llist(month,year), quantile, probs=c(.5,.05,.25,.75,.95), type='matrix') Dotplot(month ~ Cbind(y) | year, data=s) # Use factor(year) to make actual years appear in conditioning title strips dim(s) names(s) dim(s$y)-- FrankHarrell - 04 Feb 2004; modified 11 Jul 2004