# Journal Club for Analysis of Complex Datasets

## Sebastiani et al, Nature Genetics 37:435;2005: Genetic dissection and prognostic modeling of overt stroke in sickle cell anemia.

### Pattern Recognition using Bayesian Network

• Similar to recursive partitioning
• Example simulations showing instability
• Simulated data from additive logistic model with sample size of 1490 patients having a target of 92 strokes
• 108 SNPs (binary, prevalence 0.5) having importance (weights) shown below
• 30 clincial variables (normal with mean 0 and standard deviation .5), weights uniform between -1 and 1
• 3 independent repetitions of recursive partitional
• Most important SNPs: s18 and s27; most important clinical variables: c13 and c8
• Multiply sample size by 10 and repeat

library(Hmisc); library(rpart); library(lattice)

strokes  <- 92
n        <- 92 + 1398
snips    <- 108
clinical <- 30

## Generate set of true regression coefficients
set.seed(1)
snipImportance <- c(runif(snips/3, -2.5, 2.5), runif(2*snips/3, -1, 1))
png('/tmp/snipImportance.png')
hist(snipImportance, nclass=20)
dev.off()

clinImportance <- runif(clinical, -1, 1)
which(snipImportance==max(snipImportance))  #  18
which(snipImportance==min(snipImportance))  #  27
which(clinImportance==max(clinImportance))  #  13
which(clinImportance==min(clinImportance))  #   8

## Generate three sets of data under the fixed model, for original
## sample size and n*10
set.seed(2)
for(m in c(1,10)) {
for(i in 1:3) {
s <- matrix(sample(0:1, m*n*snips, replace=TRUE), ncol=snips)
c <- matrix(.5*rnorm(m*n*clinical), ncol=clinical)

## Generate stroke (y=1) and non-stroke (0) under additive logistic model
L <- .5*matxv(s, snipImportance) +
.5*matxv(c, clinImportance) - 6
y <- rbinom(m*n, 1, plogis(L))
prn(c(mean(y), strokes/n))

f <- rpart(y ~ c + s, control=rpart.control(minsplit=m*40,cp=.001))
prn(f\$cptable)
options(digits=3)
png(paste('/tmp/seb',m,'-',i,'.png',sep=''), pointsize=8)
plot(f); text(f)
dev.off()
}
}

### Quantifying Predictive Accuracy

• Proportional classified correctly is an improper scoring rule (optimized by a miscalibrated model)
• Arbitrary and covers up "gray zone" (authors used cutoff of 0.5 risk)
• Loss of statistical power
• Authors did independent validation on 114 patients, got 98.2% correct (112/114)
• Authors gave no confidence limits on their validated proportion correct [0.938, 0.995] but will the classification rule transport?

binconf(112,114)
• Need a more continuous measure of discrimination ability that does not require arbitrary classification
• Area under ROC curve (C-index), Brier score

### Difficulty in Estimating Accuracy in Small Samples

library(Hmisc)
P <- .8   # true probability of a correct prediction
N <- round(10*(2^seq(0,13,by=.25)))
p <- low <- hi <- N
set.seed(3)
y <- rbinom(max(N), 1, P)
i <- 0
for(n in N) {
i <- i+1
s <- sum(y[1:n])
p[i] <- s/n
cat(n,'')
lim <- binconf(s, n, method='wilson')
low[i] <- lim[,'Lower']
hi[i] <- lim[,'Upper']
}
#pdf('/tmp/validation.pdf')
png('/tmp/propCorrect.png')
plot(log2(N), p, ylim=range(c(low,hi)), axes=FALSE, type='b',
xlab='Number of Patients in Validation Sample',
ylab='Estimated Accuracy of Diagnostic Patterns',
main='Estimated Accuracy and Its Margin of Error\nWhen True Classification Accuracy is 0.8')
axis(2)
w <- 10*(2^seq(0,13,by=1))
axis(1, log2(w), w)
lines(log2(N), low, col=gray(.7))
lines(log2(N), hi, col=gray(.7))
abline(h=P, lty=2, col=gray(.7))
dev.off()
Topic attachments
I Attachment Action Size Date Who Comment
png propCorrect.png manage 5.9 K 14 Apr 2006 - 16:20 FrankHarrell Accuracy of Classification Accuracy as a Function of Sample Size
png seb1-1.png manage 4.4 K 17 Apr 2006 - 17:11 FrankHarrell
png seb1-2.png manage 4.6 K 17 Apr 2006 - 17:11 FrankHarrell
png seb1-3.png manage 4.5 K 17 Apr 2006 - 17:12 FrankHarrell
png seb10-1.png manage 4.3 K 17 Apr 2006 - 17:12 FrankHarrell
png seb10-2.png manage 4.3 K 17 Apr 2006 - 17:12 FrankHarrell
png seb10-3.png manage 4.6 K 17 Apr 2006 - 17:13 FrankHarrell
png snipImportance.png manage 3.1 K 17 Apr 2006 - 17:11 FrankHarrell
Edit | Attach | Print version |  | Backlinks | View wiki text | Edit WikiText | More topic actions...
Topic revision: r4 - 17 Apr 2006, FrankHarrell

• Biostatistics Webs

Copyright © 2013-2022 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Vanderbilt Biostatistics Wiki? Send feedback