You are here:
Vanderbilt Biostatistics Wiki
>
Main Web
>
Education
>
StatGraphCourse
>
GraphExplorePresent
(01 Sep 2008,
FrankHarrell
)
(raw view)
E
dit
A
ttach
---+ Statistical Graphics for Exploring Data, Presenting Information, and Understanding Statistical Models | [[#GraphScripts][Scripts]] Graphical methods are being increasingly used for exploratory data analysis. Some of the many graphical tools that are useful in this setting are scatterplot matrices, nonparametric smoothers, and tree diagrams. Statistical graphics for presenting information have been used much longer, but most of the commonly used graphics used in papers, presentations, and the popular media, such as bar charts and pie charts, are either poor or misleading in communicating information to the reader. This short course begins with a series of graphical horror stories from the scientific and lay press. Then elements of graphical perception and good graph construction, many from the writings of Bill Cleveland, are covered. Practical suggestions for choosing the best chart or graph type, making good and clear graphics, and formatting are covered. Techniques for simultaneous presentation of multiple variables are described. Complex outcome or risk adjustment models are not easily grasped by non-statisticians. Special graphics such as effect charts and nomograms can assist physicians and other consumers of statistical analysis in understanding statistical models and in using them for obtaining predictions for individual subjects. Examples of model presentation graphics will be given. At the close of the short course some graphical marvels from the literature (especially from Edward Tufte and Howard Wainer) are presented. ----- #GraphScripts ---++ S script developed in the S-Plus International User Conference in Philadelphia, 16Oct01 <verbatim> S-PLUS : Copyright (c) 1988, 2000 MathSoft, Inc. S : Copyright Lucent Technologies, Inc. Version 6.0 Release 1 for Linux 2.2.12 : 2000 Working data will be in .Data Hmisc library by Frank E Harrell Jr, Tue Jul 17 19:34:31 EDT 2001 This library is provided by FE Harrell <fharrell@virginia.edu>. It is not supported by Insightful Corp. Type library(Hmisc, help=T) to view the readme file for the Hmisc library. Type help(library='Hmisc') to see detailed documentation. Hmisc redefines [.factor to drop unused levels of factor variables when subscripting. To prevent this behaviour, issue the command options(drop.unused.levels=F). You must use options(drop.unused.levels=F) for multicomp to calculate correctly. Design library by Frank E Harrell Jr, Tue Jul 17 19:34:50 EDT 2001 This library is provided by FE Harrell <fharrell@virginia.edu>. It is not supported by Insightful Corp. Type library(Design, help=T) to view the readme file for the Design library. Type help(library='Design') to see detailed documentation. options(contrasts=c('contr.treatment','contr.poly')) set > ls() [1] ".First" ".Last.fixed" ".Last.value" ".Random.seed" [5] ".orig.cal" "ABM" "FEV" "birth.estriol" [9] "boston" "cal" "counties" "dd" [13] "diabetes" "f" "g" "hospital" [17] "last.dump" "last.warning" "lead" "nac" [21] "olympics.1996" "pbc" "r" "s" [25] "support" "titanic.trans" "titanic3" "v" [29] "w" "z" > datadensity(diabetes,group=diabetes$gender) > datadensity(diabetes[diabetes$gender=='female',]) > describe(diabetes$gender) diabetes$gender n missing unique 403 0 2 male (169, 42%), female (234, 58%) > nac <- naclus(diabetes) > plot(nac) > w <- latex(describe(pbc)) > describe(pbc) pbc 19 Variables 418 Observations --------------------------------------------------------------------------- bili : Serum Bilirubin (mg/dl) n missing unique Mean .05 .10 .25 .50 .75 .90 .95 418 0 98 3.221 0.50 0.60 0.80 1.40 3.40 8.03 14.00 lowest : 0.3 0.4 0.5 0.6 0.7, highest: 21.6 22.5 24.5 25.5 28.0 --------------------------------------------------------------------------- .... Conflicting definitions of "[.data.frame" on databases "Hmisc" and "splus" in: x[ - yvar] > par(mfrow=c(1,1)) > plot(f);text(f) > describe(titanic3) titanic3 14 Variables 1309 Observations --------------------------------------------------------------------------- pclass : Passenger Class n missing unique 1309 0 3 1st (323, 25%), 2nd (277, 21%), 3rd (709, 54%) --------------------------------------------------------------------------- . . . > datadensity(titanic3) > f <- rpart(is.na(age) ~ pclass + sex + survived,data=titanic3) > plot(f);text(f) > v <- varclus(~., data=diabetes) > plot(v) > attach(diabetes) > smean.cl.normal(glyhb) Mean Lower Upper 5.58977 5.366506 5.813035 > smean.cl.boot(glyhb) Mean Lower Upper 5.589769 5.356792 5.810075 > smean.cl.boot(glyhb) Mean Lower Upper 5.589769 5.373114 5.817752 > set.seed(17) > smean.cl.boot(glyhb) Mean Lower Upper 5.589769 5.380151 5.812173 > set.seed(17) > smean.cl.boot(glyhb) Mean Lower Upper 5.589769 5.380151 5.812173 > tapply(glyhb, gender, mean) male female NA NA > tapply(glyhb, gender, mean, na.rmT) Problem: Object "na.rmT" not found Use traceback() to see the call stack > tapply(glyhb, gender, mean, na.rm=T) male female 5.724074 5.494342 > tapply(glyhb, gender, quantile, na.rm=T) $male: 0% 25% 50% 75% 100% 2.68 4.3825 4.9 5.695 16.11 $female: 0% 25% 50% 75% 100% 2.73 4.3775 4.785 5.5525 14.94 > tapply(glyhb, gender, smean.cl.boot) $male: Mean Lower Upper 5.724074 5.37528 6.09749 $female: Mean Lower Upper 5.494342 5.250717 5.757164 > summarize(glyhb, gender, na.rm=T) Problem in summarize(glyhb, gender, na.rm = T): Argument "" is missing, with no default Use traceback() to see the call stack > summarize(glyhb, gender, mean, na.rm=T) gender glyhb 1 female 5.494342 2 male 5.724074 > summarize(glyhb, gender, smean.sd) gender glyhb SD 1 female 5.494344 2.133675 2 male 5.724074 2.387779 > summarize(glyhb, gender, smean.cl.boot) gender glyhb Lower Upper 1 female 5.494342 5.221705 5.790896 2 male 5.724074 5.391357 6.103901 > summarize(glyhb, llist(gender,frame), smean.cl.boot) gender frame glyhb Lower Upper 1 female NA 5.004286 4.247143 6.169999 2 female large 6.407250 5.734312 7.089687 3 female medium 5.390177 5.031194 5.801590 4 female small 5.180882 4.735415 5.689444 5 male NA 5.595000 4.455000 7.067500 6 male large 5.901186 5.428293 6.471082 7 male medium 6.075077 5.368205 6.812477 8 male small 4.760882 4.389346 5.217780 > summarize(glyhb, llist(gender,cut2(age,g=3)), smean.cl.boot) gender cut2(age, g = 3) glyhb Lower Upper 1 female [19,39) 4.690118 4.411160 5.032921 2 female [39,55) 5.389333 4.934533 5.839916 3 female [55,92] 6.615441 6.055290 7.182089 4 male [19,39) 4.859636 4.609996 5.160201 5 male [39,55) 5.486383 4.884255 6.256220 6 male [55,92] 6.702667 6.074722 7.425051 > summarize(glyhb, llist(Sex=gender,Age=cut2(age,g=2)), smean.cl.boot) Sex Age glyhb Lower Upper 1 female [19,46) 4.797031 4.572688 5.040264 2 female [46,92] 6.386900 5.920603 6.875483 3 male [19,46) 4.812667 4.570495 5.102008 4 male [46,92] 6.509770 5.939091 7.161430 > s <- summary(survived ~ age + sex + pclass,data=titanic3) > s Survived N=1309 ---------------+-------------+----+---------+ | |N |survived | ---------------+-------------+----+---------+ Age |[ 0.167,22.0)| 290|0.4310345| |[22.000,28.5)| 246|0.3861789| |[28.500,40.0)| 265|0.4188679| |[40.000,80.0]| 245|0.3918367| |Missing | 263|0.2775665| ---------------+-------------+----+---------+ Sex |female | 466|0.7274678| |male | 843|0.1909846| ---------------+-------------+----+---------+ Passenger Class|1st | 323|0.6191950| |2nd | 277|0.4296029| |3rd | 709|0.2552891| ---------------+-------------+----+---------+ Overall | |1309|0.3819710| ---------------+-------------+----+---------+ > plot(s) > search() [1] ".Data" "diabetes" "Design" "Hmisc" "splus" "stat" [7] "data" "trellis" "nlme3" "rpart" "main" > detach(2) > attach(titanic3) > plsmo(age,survived,group=pclass,datadensity=T) > plsmo(age,survived,group=interaction(sex,pclass),col=1:6,datadensity=T) > > detach(2) > bwplot(cut2(age,g=4)~glyhb|frame*gender,data=diabetes) > bwplot(cut2(age,g=4)~glyhb|frame*gender,data=diabetes,panel=panel.bpplot) > args(xYplot) function(formula, data = sys.parent(1), groups = NULL, prepanel = prepanel.xYplot, panel = "panel.xYplot", scales = NULL, ..., xlab = NULL, ylab = NULL, subset = TRUE, minor.ticks = NULL) NULL > search() [1] ".Data" "Design" "Hmisc" "splus" "stat" "data" "trellis" [8] "nlme3" "rpart" "main" > attach(diabetes) > s <- summarize(glyhb, llist(frame,gender), smean.cl.boot) > s frame gender glyhb Lower Upper 1 NA female 5.004286 4.248535 6.200000 2 NA male 5.595000 4.455000 7.067500 3 large female 6.407250 5.684419 7.207156 4 large male 5.901186 5.424217 6.485382 5 medium female 5.390177 5.025188 5.794829 6 medium male 6.075077 5.419884 6.819735 7 small female 5.180882 4.746453 5.672276 8 small male 4.760882 4.394985 5.251574 > names(s) [1] "frame" "gender" "glyhb" "Lower" "Upper" > Dotplot(frame ~ Cbind(glyhb,Lower,Upper)|gender, data=s) > d <- expand.grid(x=seq(0,2*pi,length=150), p=1:3, shift=c(0,pi)) > xYplot(sin(x+shift)^p ~ x | shift, groups=p, data=d, type='l') > > > dfr <- expand.grid(month=1:12, continent=c('Europe','USA'), sex=c('female','male')) attach(dfr) set.seed(13) y <- month/10 + 1*(sex=='female') + 2*(continent=='Europe') + runif(48,-.15,.15) lower <- y - runif(48,.05,.15) upper <- y + runif(48,.05,.15) xYplot(Cbind(y,lower,upper) ~ month,subset=sex=='male' & continent=='USA') # add ,label.curves=F to suppress use of labcurve to label curves where farthest apart + > > > + > > > > Warning messages: Condition has 7 elements: only the first used in: e1 || e2 > > Problem: Object "farthest apart" not found > xYplot(Cbind(y,lower,upper) ~ month|continent,groups=sex); Key() Warning messages: 1: Condition has 7 elements: only the first used in: e1 || e2 2: Condition has 7 elements: only the first used in: e1 || e2 > Key function(x = NULL, y = NULL, lev = c("female", "male"), cex = c(1., 1.), col = c(1, 1), font = c(1, 1), pch = c("o", "+"), ...) { if(length(x)) { if(is.list(x)) { y <- x$y x <- x$x } key(x = x, y = y, text = list(lev, col = col), points = list( cex = cex, col = col, font = font, pch = pch), transparent = TRUE, ...) } else key(text = list(lev, col = col), points = list(cex = cex, col = col, font = font, pch = pch), transparent = TRUE, ...) invisible() } > xYplot(Cbind(y,lower,upper) ~ month,groups=sex,subset=continent=='Europe',keys='lines') > > xYplot(Cbind(y,lower,upper) ~ month,groups=sex,subset=continent=='Europe',method='bands') + > > label(y) <- 'Quality of Life Score' # label is in Hmisc library = attr(y,'label') <- 'Quality...'; will be # y-axis label # can also specify Cbind('Quality of Life Score'=y,lower,upper) xYplot(Cbind(y,lower,upper) ~ month, groups=sex, subset=continent=='Europe', method='alt bars', offset=.4) # > setTrellis function(strip.blank = TRUE, lty.dot.line = 2, lwd.dot.line = 1) { if(strip.blank) trellis.strip.blank() # in Hmisc Misc.s dot.line <- trellis.par.get("dot.line") dot.line$lwd <- lwd.dot.line dot.line$lty <- lty.dot.line trellis.par.set("dot.line", dot.line) invisible() } > setTrellis() Problem in trellis.par.set: No data to interpret as logical value: if(p[i] != q[i]) stop("inconsistent component names") Use traceback() to see the call stack > trellis.strip.blank() Problem in trellis.par.set: No data to interpret as logical value: if(p[i] != q[i]) stop("inconsistent component names") Use traceback() to see the call stack > ecdf(~ glyhb | frame*gender, data=diabetes) > ecdf(~ glyhb | frame*gender, groups=cut2(age,g=2), data=diabetes) > xyplot( y ~ time, groups=subject.id, type='b') # or | subject.id Problem in abs(x): needed atomic data, got an object of class "function" </verbatim> ---++ Script developed in the one-day version of the course given at the S-Plus Users Conference, Seattle, October 2000 <verbatim> #page(describe(pbc), multi=T) datadensity(pbc) nac <- naclus(pbc) plot(nac) naplot(nac) plot(varclus(~.,data=diabetes)) x <- c(1,2,6,NA) x <- impute(x) x is.imputed(x) attributes(x) describe(x) x <- runif(100) smean.cl.normal(x) smean.cl.boot(x) smean.sd(x) smean.sdl(x) binconf(6, 10, .01) bpower(.1, .2, n=500) bpower.sim(.1, .2, n=500) spearman2(glyhb ~ age + gender + frame, data=diabetes, p=2) s <- summary(glyhb ~ age + chol + gender + frame, data=diabetes) options(digits=3) s # same print(s) plot(s) s <- summary(glyhb ~ age + chol + gender + frame, data=diabetes, fun=smean.cl.boot) s plot(s, which=1:3, pch=c('.','[',']')) show.pch() s <- summary(drug ~ age + platelet + chol + spiders + edema,data=pbc) options(width=100) #page(s) plot(s) hist.data.frame(pbc) #attach(pbc) ecdf(albumin) ecdf(albumin,q=c(.25,.5,.75)) ecdf(albumin,group=drug) ecdf(diabetes,group=diabetes$gender, col=1:2, label.curve=F) ecdf(albumin, datadensity='rug') ecdf(albumin, datadensity='density') scat1d(albumin) attach(titanic3) plsmo(age, survived, group=sex, col=1:2, datadensity=T) ecdf(~age|pclass,groups=sex,q=.5) bwplot(pclass~age|sex) bwplot(sex~age|pclass) bwplot(sex ~ age | pclass, panel=panel.bpplot) set.seed(13) x <- rnorm(1000) g <- sample(1:6, 1000, replace=T) x[g==1][1:20] <- rnorm(20)+3 # contaminate 20 x's for group 1 # default trellis box plot bwplot(g ~ x) # box-percentile plot with data density (rug plot) bwplot(g ~ x, panel=panel.bpplot, probs=seq(.01,.49,by=.01), datadensity=T) # add ,scat1d.opts=list(tfrac=1) to make all tick marks the same size # when a group has > 125 observations # small dot for means, show only .05,.125,.25,.375,.625,.75,.875,.95 quantiles bwplot(g ~ x, panel=panel.bpplot, cex=.3) # suppress means and reference lines for lower and upper quartiles bwplot(g ~ x, panel=panel.bpplot, probs=c(.025,.1,.25), means=F, qref=F) # continuous plot up until quartiles ("Tootsie Roll plot") bwplot(g ~ x, panel=panel.bpplot, probs=seq(.01,.25,by=.01)) # start at quartiles then make it continuous ("coffin plot") bwplot(g ~ x, panel=panel.bpplot, probs=seq(.25,.49,by=.01)) # same as previous but add a spike to give 0.95 interval bwplot(g ~ x, panel=panel.bpplot, probs=c(.025,seq(.25,.49,by=.01))) # decile plot with reference lines at outer quintiles and median bwplot(g ~ x, panel=panel.bpplot, probs=c(.1,.2,.3,.4), qref=c(.5,.2,.8)) # default plot with tick marks showing all observations outside the outer # box (.05 and .95 quantiles), with very small ticks bwplot(g ~ x, panel=panel.bpplot, nout=.05, scat1d.opts=list(frac=.01)) # show 5 smallest and 5 largest observations bwplot(g ~ x, panel=panel.bpplot, nout=5) # Use a scat1d option (preserve=T) to ensure that the right peak extends # to the same position as the extreme scat1d bwplot(~x , panel=panel.bpplot, probs=seq(.00,.5,by=.001), datadensity=T, scat1d.opt=list(preserve=T)) # attach(diabetes) label(chol) xYplot(chol ~ age, groups=gender, type='p') Key() xYplot(chol ~ age|location, groups=gender, type='p') Key() ?xYplot dfr <- expand.grid(month=1:12, continent=c('Europe','USA'), sex=c('female','male')) attach(dfr) set.seed(13) y <- month/10 + 1*(sex=='female') + 2*(continent=='Europe') + runif(48,-.15,.15) lower <- y - runif(48,.05,.15) upper <- y + runif(48,.05,.15) xYplot(Cbind(y,lower,upper) ~ month,subset=sex=='male'&continent=='USA') # add ,label.curves=F to suppress use of labcurve to label curves where farthest apart xYplot(Cbind(y,lower,upper) ~ month|continent,subset=sex=='male') xYplot(Cbind(y,lower,upper) ~ month|continent,groups=sex); Key() xYplot(Cbind(y,lower,upper) ~ month,groups=sex,subset=continent=='Europe') xYplot(Cbind(y,lower,upper) ~ month,groups=sex,subset=continent=='Europe',keys='lines') # keys='lines' causes labcurve to draw a legend where the panel is most empty s <- summarize(y, llist(month,year), smean.cl.boot) s xYplot(Cbind(y, Lower, Upper) ~ month | year, data=s) xYplot(glyhb ~ chol|gender, method=smean.cl.normal) Dotplot(month ~ Cbind(y, Lower, Upper) | year, data=s) # or Cbind(...), groups=year, data=s # Display a 5-number (5-quantile) summary (2 intervals, dot=median) # Note that summarize produces a matrix for y, and Cbind(y) trusts the # first column to be the point estimate (here the median) s <- summarize(y, llist(month,year), quantile, probs=c(.5,.05,.25,.75,.95), type='matrix') Dotplot(month ~ Cbind(y) | year, data=s) # Use factor(year) to make actual years appear in conditioning title strips dim(s) names(s) dim(s$y) </verbatim> -- Main.FrankHarrell - 04 Feb 2004; modified 11 Jul 2004
E
dit
|
A
ttach
|
P
rint version
|
H
istory
: r4
<
r3
<
r2
<
r1
|
B
acklinks
|
V
iew topic
|
Edit
w
iki text
|
M
ore topic actions
Topic revision: r4 - 01 Sep 2008,
FrankHarrell
Main
Department Home Page
Biostatistics Graduate Program
Vanderbilt University Medical Center
Main Web
Main Web Home
Search
Recent Changes
Changes
Topic list
Biostatistics Webs
Archive
Main
Sandbox
System
Register
|
Log In
Copyright © 2013-2022 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Vanderbilt Biostatistics Wiki?
Send feedback