Journal Club for Analysis of Complex Datasets

Sebastiani et al, Nature Genetics 37:435;2005: Genetic dissection and prognostic modeling of overt stroke in sickle cell anemia.

Goals and Data

Pattern Recognition using Bayesian Network

  • Similar to recursive partitioning
  • Example simulations showing instability
  • Simulated data from additive logistic model with sample size of 1490 patients having a target of 92 strokes
  • 108 SNPs (binary, prevalence 0.5) having importance (weights) shown below snipImportance.png
  • 30 clincial variables (normal with mean 0 and standard deviation .5), weights uniform between -1 and 1
  • 3 independent repetitions of recursive partitional
  • Most important SNPs: s18 and s27; most important clinical variables: c13 and c8
seb1-1.png seb1-2.png seb1-3.png
  • Multiply sample size by 10 and repeat
seb10-1.png seb10-2.png seb10-3.png

library(Hmisc); library(rpart); library(lattice)

strokes  <- 92
n        <- 92 + 1398
snips    <- 108
clinical <- 30

## Generate set of true regression coefficients
set.seed(1)
snipImportance <- c(runif(snips/3, -2.5, 2.5), runif(2*snips/3, -1, 1))
png('/tmp/snipImportance.png')
hist(snipImportance, nclass=20)
dev.off()
                    
clinImportance <- runif(clinical, -1, 1)
which(snipImportance==max(snipImportance))  #  18
which(snipImportance==min(snipImportance))  #  27
which(clinImportance==max(clinImportance))  #  13
which(clinImportance==min(clinImportance))  #   8

## Generate three sets of data under the fixed model, for original
## sample size and n*10
set.seed(2)
for(m in c(1,10)) {
  for(i in 1:3) {
    s <- matrix(sample(0:1, m*n*snips, replace=TRUE), ncol=snips)
    c <- matrix(.5*rnorm(m*n*clinical), ncol=clinical)

    ## Generate stroke (y=1) and non-stroke (0) under additive logistic model
    L <- .5*matxv(s, snipImportance) +
      .5*matxv(c, clinImportance) - 6
    y <- rbinom(m*n, 1, plogis(L))
    prn(c(mean(y), strokes/n))
    
    f <- rpart(y ~ c + s, control=rpart.control(minsplit=m*40,cp=.001))
    prn(f$cptable)
    options(digits=3)
    png(paste('/tmp/seb',m,'-',i,'.png',sep=''), pointsize=8)
    plot(f); text(f)
    dev.off()
  }
}

Quantifying Predictive Accuracy

  • Proportional classified correctly is an improper scoring rule (optimized by a miscalibrated model)
  • Arbitrary and covers up "gray zone" (authors used cutoff of 0.5 risk)
  • Loss of statistical power
  • Authors did independent validation on 114 patients, got 98.2% correct (112/114)
  • Authors gave no confidence limits on their validated proportion correct [0.938, 0.995] but will the classification rule transport?

binconf(112,114)
  • Need a more continuous measure of discrimination ability that does not require arbitrary classification
    • Area under ROC curve (C-index), Brier score

Difficulty in Estimating Accuracy in Small Samples

propCorrect.png

library(Hmisc)
P <- .8   # true probability of a correct prediction
N <- round(10*(2^seq(0,13,by=.25)))
p <- low <- hi <- N
set.seed(3)
y <- rbinom(max(N), 1, P)
i <- 0
for(n in N) {
  i <- i+1
  s <- sum(y[1:n])
  p[i] <- s/n
  cat(n,'')
  lim <- binconf(s, n, method='wilson')
  low[i] <- lim[,'Lower']
  hi[i] <- lim[,'Upper']
}
#pdf('/tmp/validation.pdf')
png('/tmp/propCorrect.png')
plot(log2(N), p, ylim=range(c(low,hi)), axes=FALSE, type='b',
     xlab='Number of Patients in Validation Sample',
     ylab='Estimated Accuracy of Diagnostic Patterns',
     main='Estimated Accuracy and Its Margin of Error\nWhen True Classification Accuracy is 0.8')
axis(2)
w <- 10*(2^seq(0,13,by=1))
axis(1, log2(w), w)
lines(log2(N), low, col=gray(.7))
lines(log2(N), hi, col=gray(.7))
abline(h=P, lty=2, col=gray(.7))
dev.off()
Topic attachments
I Attachment Action Size Date Who Comment
propCorrect.pngpng propCorrect.png manage 5.9 K 14 Apr 2006 - 16:20 FrankHarrell Accuracy of Classification Accuracy as a Function of Sample Size
seb1-1.pngpng seb1-1.png manage 4.4 K 17 Apr 2006 - 17:11 FrankHarrell  
seb1-2.pngpng seb1-2.png manage 4.6 K 17 Apr 2006 - 17:11 FrankHarrell  
seb1-3.pngpng seb1-3.png manage 4.5 K 17 Apr 2006 - 17:12 FrankHarrell  
seb10-1.pngpng seb10-1.png manage 4.3 K 17 Apr 2006 - 17:12 FrankHarrell  
seb10-2.pngpng seb10-2.png manage 4.3 K 17 Apr 2006 - 17:12 FrankHarrell  
seb10-3.pngpng seb10-3.png manage 4.6 K 17 Apr 2006 - 17:13 FrankHarrell  
snipImportance.pngpng snipImportance.png manage 3.1 K 17 Apr 2006 - 17:11 FrankHarrell  
Edit | Attach | Print version | History: r36 | r5 < r4 < r3 < r2 | Backlinks | View wiki text | Edit WikiText | More topic actions...
Topic revision: r4 - 17 Apr 2006, FrankHarrell
 

This site is powered by FoswikiCopyright © 2013-2022 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Vanderbilt Biostatistics Wiki? Send feedback