Journal Club for Analysis of Complex Datasets

Sebastiani et al, Nature Genetics 37:435;2005: Genetic dissection and prognostic modeling of overt stroke in sickle cell anemia.

Goals and Data

Pattern Recognition using Bayesian Network

  • Similar to recursive partitioning
  • Example simulations showing instability
  • Simulated data from additive logistic model with sample size of 1490 patients having a target of 92 strokes
  • 108 SNPs (binary, prevalence 0.5) having importance (weights) shown below snipImportance.png
  • 30 clincial variables (normal with mean 0 and standard deviation .5), weights uniform between -1 and 1
  • 3 independent repetitions of recursive partitional
  • Most important SNPs: s18 and s27; most important clinical variables: c13 and c8
seb1-1.png seb1-2.png seb1-3.png
  • Multiply sample size by 10 and repeat
seb10-1.png seb10-2.png seb10-3.png library(Hmisc); library(rpart); library(lattice)

strokes <- 92 n <- 92 + 1398 snips <- 108 clinical <- 30

## Generate set of true regression coefficients set.seed(1) snipImportance <- c(runif(snips/3, -2.5, 2.5), runif(2*snips/3, -1, 1)) png('/tmp/snipImportance.png') hist(snipImportance, nclass=20) dev.off()

clinImportance <- runif(clinical, -1, 1) which(snipImportance==max(snipImportance)) # 18 which(snipImportance==min(snipImportance)) # 27 which(clinImportance==max(clinImportance)) # 13 which(clinImportance==min(clinImportance)) # 8

## Generate three sets of data under the fixed model, for original ## sample size and n*10 set.seed(2) for(m in c(1,10)) { for(i in 1:3) { s <- matrix(sample(0:1, m*n*snips, replace=TRUE), ncol=snips) c <- matrix(.5*rnorm(m*n*clinical), ncol=clinical)

## Generate stroke (y=1) and non-stroke (0) under additive logistic model L <- .5*matxv(s, snipImportance) + .5*matxv(c, clinImportance) - 6 y <- rbinom(m*n, 1, plogis(L)) prn(c(mean(y), strokes/n))

f <- rpart(y ~ c + s, control=rpart.control(minsplit=m*40,cp=.001)) prn(f$cptable) options(digits=3) png(paste('/tmp/seb',m,'-',i,'.png',sep=''), pointsize=8) plot(f); text(f) dev.off() } }

Quantifying Predictive Accuracy

  • Proportional classified correctly is an improper scoring rule (optimized by a miscalibrated model)
  • Arbitrary and covers up "gray zone" (authors used cutoff of 0.5 risk)
  • Loss of statistical power
  • Authors did independent validation on 114 patients, got 98.2% correct (112/114)
  • Authors gave no confidence limits on their validated proportion correct [0.938, 0.995] but will the classification rule transport?
binconf(112,114)
  • Need a more continuous measure of discrimination ability that does not require arbitrary classification
    • Area under ROC curve (C-index), Brier score

Difficulty in Estimating Accuracy in Small Samples

propCorrect.png library(Hmisc) P <- .8 # true probability of a correct prediction N <- round(10*(2^seq(0,13,by=.25))) p <- low <- hi <- N set.seed(3) y <- rbinom(max(N), 1, P) i <- 0 for(n in N) { i <- i+1 s <- sum(y[1:n]) p[i] <- s/n cat(n,'') lim <- binconf(s, n, method='wilson') low[i] <- lim[,'Lower'] hi[i] <- lim[,'Upper'] } #pdf('/tmp/validation.pdf') png('/tmp/propCorrect.png') plot(log2(N), p, ylim=range(c(low,hi)), axes=FALSE, type='b', xlab='Number of Patients in Validation Sample', ylab='Estimated Accuracy of Diagnostic Patterns', main='Estimated Accuracy and Its Margin of Error\nWhen True Classification Accuracy is 0.8') axis(2) w <- 10*(2^seq(0,13,by=1)) axis(1, log2(w), w) lines(log2(N), low, col=gray(.7)) lines(log2(N), hi, col=gray(.7)) abline(h=P, lty=2, col=gray(.7)) dev.off()
Topic attachments
I AttachmentSorted ascending Action Size Date Who Comment
propCorrect.pngpng propCorrect.png manage 5.9 K 14 Apr 2006 - 16:20 FrankHarrell Accuracy of Classification Accuracy as a Function of Sample Size
seb1-1.pngpng seb1-1.png manage 4.4 K 17 Apr 2006 - 17:11 FrankHarrell  
seb1-2.pngpng seb1-2.png manage 4.6 K 17 Apr 2006 - 17:11 FrankHarrell  
seb1-3.pngpng seb1-3.png manage 4.5 K 17 Apr 2006 - 17:12 FrankHarrell  
seb10-1.pngpng seb10-1.png manage 4.3 K 17 Apr 2006 - 17:12 FrankHarrell  
seb10-2.pngpng seb10-2.png manage 4.3 K 17 Apr 2006 - 17:12 FrankHarrell  
seb10-3.pngpng seb10-3.png manage 4.6 K 17 Apr 2006 - 17:13 FrankHarrell  
snipImportance.pngpng snipImportance.png manage 3.1 K 17 Apr 2006 - 17:11 FrankHarrell  
Edit | Attach | Print version | History: r36 | r5 < r4 < r3 < r2 | Backlinks | View wiki text | Edit WikiText | More topic actions...
Topic revision: r4 - 17 Apr 2006, FrankHarrell
 

This site is powered by FoswikiCopyright © 2013-2022 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Vanderbilt Biostatistics Wiki? Send feedback