Conversion of SAS Code to Compute Multiple Stratified Statistics on More than One Analysis Variable
Greg Adkison
gadkison@email.wcu.edu writes:
Say for example that I have a dataset named "dat1" that includes five
variables: wshed, site, species, bda, and sla. I can calculate with the
following SAS code the mean, CV, se, and number of observations of
"bda" and "sla" for each combination of "wshed," "species," and "site,"
restricting the species considered to only three of several species in
dat1 (b, c, and p). Moreover, I can output these calculations and
grouping variables to a dataset named "dat2" that will reside in RAM
and include the variables wshed, site, species, mBdA, msla, cBda,
sBdA, ssla, nBda, and nsla.
proc sort data=dat1;
by wshed site species;
proc means data=dat1 noprint mean cv stderr n;
by wshed site species;
where species in ('b', 'c', 'p');
var BdA sla;
output out=dat2
mean=mBdA msla
cv=cBdA csla
stderr=sBdA ssla
n=nBdA nsla;
The following handles any number of analysis variables, with automatic
naming of all statistics computed from them. Note: as of 15Oct06 you will have to get a new version of
summarize
to fix a bug with stat.name=NULL. To do so do the following. Note that
getLatestSource
will be in the next version of
Hmisc
.
source('http://biostat.mc.vanderbilt.edu/tmp/getLatestSource.s')
getLatestSource(c('summary.formula','mApply'))
# Generate some data
set.seed(1)
dat1 <- expand.grid(wshed=1:2, site=c('A','B'),
species=c('a','b','c','p'),
reps=1:10)
n <- nrow(dat1)
dat1 <- transform(dat1,
BdA = rnorm(n, 100, 20),
sla = c(rnorm(n-1, 200, 30), NA))
# Can use upData function in Hmisc in place of transform
# Summarization function, per stratum
g <- function(y) {
s <- apply(y, 2,
function(z) {
z <- z[!is.na(z)]
n <- length(z)
if(n==0) c(NA,NA,NA,0) else
if(n==1) c(z, NA,NA,1) else {
m <- mean(z)
s <- sd(z)
c(Mean=m, CV=s/m, SE=s/sqrt(n), N=n)
}
})
w <- as.vector(s)
names(w) <- as.vector(outer(rownames(s), colnames(s), paste, sep=''))
w
}
library(Hmisc)
dat2 <- with(dat1,
summarize(cbind(BdA, sla),
llist(wshed, site, species),
g,
subset=species %in% c('b','c','p'),
stat.name=NULL))
options(digits=3)
dat2 # is a data frame
wshed site species MeanBdA CVBdA SEBdA NBdA Meansla CVsla SEsla Nsla
1 1 A b 100.5 0.133 4.23 10 195 0.1813 11.20 10
2 1 A c 99.7 0.101 3.17 10 206 0.1024 6.68 10
3 1 A p 101.4 0.239 7.65 10 188 0.1580 9.39 10
4 1 B b 109.9 0.118 4.09 10 203 0.1433 9.21 10
5 1 B c 98.4 0.193 6.01 10 221 0.1250 8.72 10
6 1 B p 102.9 0.216 7.03 10 203 0.1446 9.29 10
7 2 A b 95.8 0.241 7.31 10 195 0.2011 12.40 10
8 2 A c 98.7 0.194 6.04 10 207 0.1274 8.33 10
9 2 A p 102.2 0.217 7.01 10 191 0.1709 10.31 10
10 2 B b 97.8 0.235 7.27 10 191 0.2079 12.58 10
11 2 B c 100.9 0.164 5.24 10 194 0.0987 6.07 10
12 2 B p 103.0 0.144 4.69 10 209 0.0769 5.35 9
# Another approach, but does casewise deletion of NAs and computes all
# possible marginal summaries
dat2 <- summary(cbind(BdA, sla) ~ wshed + site + species,
data=dat1, fun=g, method='cross')
print.data.frame(dat2) # after Sept 2004 just say dat2
# dat2 is similar to but not really a data frame, with a matrix of statistics S
# Remove method='cross' to get all one-way summaries
--
FrankHarrell - 17-18 Jul 2004