Journal Club for Analysis of Complex Datasets

Sebastiani et al., Nature Genetics 37:435, 2005: Genetic dissection and prognostic modeling of overt stroke in sickle cell anemia.

Goals and Data

  • Find persons with sickle cell anemia at risk of overt stroke (6-8% incidence)
  • Genetic dissection: disentangling interactions among genes, environment, phenotype
  • 108 SNPs in 80 (?39) candidate genes
  • Unspecified extensive clinical variables also used
  • 1398 African Americans with SCA, 92 strokes

Pattern Recognition using Bayesian Networks | Aliferis Commentary

  • Focused on networks describing dependencies of genotypes on phenotype
  • Authors state that strategy is more diagnostic than prognostic
  • Did not take variable follow-up into account
  • Used Bayesware Discoverer which claims to do "automated discovery of Bayesian networks from data"
  • $5,500 for one user license; first 2 authors are owners; not open source
  • Found that 31 SNPs in 12 genes interact with fetal hemoglobin to predict stroke risk
  • Final network had 73 nodes (fig. 2, 69 SNPs, avg. 1.3 strokes per node)
  • "the model is able to describe the determinant effects of genetic variants on stroke"
  • Stability benefits from smart choices of candidate genes
  • Bayes factor of $10^{198}$ quantifying the relative likelihood that a SNP is associated with stroke vs. being independent may be problematic
  • BN is not dissimilar to tree modeling using recursive partitioning
  • Example simulations showing instability
  • Simulated data from additive logistic model with sample size of 1490 patients having a target of 92 strokes
  • 108 SNPs (binary, prevalence 0.5) having importance (weights) shown in snipImportance.png
  • 30 clinical variables (normal with mean 0 and standard deviation .5), weights uniform between -1 and 1
  • 3 independent repetitions of recursive partitioning
  • Most important SNPs: s18 and s27; most important clinical variables: c13 and c8
seb1-1.png seb1-2.png seb1-3.png
  • Multiply sample size by 10 and repeat
seb10-1.png seb10-2.png seb10-3.png

library(Hmisc); library(rpart); library(lattice)

strokes  <- 92
n        <- 92 + 1398
snips    <- 108
clinical <- 30

## Generate set of true regression coefficients
snipImportance <- c(runif(snips/3, -2.5, 2.5), runif(2*snips/3, -1, 1))
hist(snipImportance, nclass=20)
clinImportance <- runif(clinical, -1, 1)
which(snipImportance==max(snipImportance))  #  18
which(snipImportance==min(snipImportance))  #  27
which(clinImportance==max(clinImportance))  #  13
which(clinImportance==min(clinImportance))  #   8

## Generate three sets of data under the fixed model, for original
## sample size and n*10
for(m in c(1,10)) {
  for(i in 1:3) {
    s <- matrix(sample(0:1, m*n*snips, replace=TRUE), ncol=snips)
    c <- matrix(.5*rnorm(m*n*clinical), ncol=clinical)

    ## Generate stroke (y=1) and non-stroke (0) under additive logistic model
    L <- .5*matxv(s, snipImportance) +
      .5*matxv(c, clinImportance) - 6
    y <- rbinom(m*n, 1, plogis(L))
    prn(c(mean(y), strokes/n))
    f <- rpart(y ~ c + s, control=rpart.control(minsplit=m*40, cp=.001))
    png(paste('/tmp/seb', m, '-', i, '.png', sep=''), pointsize=8)
    plot(f); text(f)
    dev.off()   # close the png device before the next iteration
  }
}

Quantifying Predictive Accuracy

  • Proportion classified correctly is an improper scoring rule (it is optimized by a miscalibrated model)
  • Arbitrary and covers up "gray zone" (authors used cutoff of 0.5 risk)
  • Loss of statistical power
  • Authors did independent validation on 114 patients (7 with stroke), got 98.2% correct (112/114, predicted all strokes correctly, predicted 2 strokes in patients without stroke)
  • Fig. 3 treats stroke as the predictor and the probability of stroke as the response variable
  • Authors gave no confidence limits on their validated proportion correct (the 0.95 Wilson interval is [0.938, 0.995]); and will the classification rule transport?

  • Need a more continuous measure of discrimination ability that does not require arbitrary classification
    • Area under ROC curve (C-index), Brier score
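As an illustration of these continuous measures, here is a base-R sketch (toy data, not the authors' validation set; the predicted risks are fabricated for illustration) computing the C-index from the Mann-Whitney statistic and the Brier score:

```r
## Sketch (base R, hypothetical data): continuous accuracy measures
## instead of proportion classified correctly at an arbitrary 0.5 cutoff
set.seed(1)
n    <- 114
y    <- rep(0:1, c(107, 7))                 # 7 strokes, as in the validation set
phat <- plogis(-2 + 1.5 * y + rnorm(n))     # hypothetical predicted risks

## C-index (area under the ROC curve): probability that a randomly
## chosen stroke patient receives a higher predicted risk than a
## randomly chosen non-stroke patient; Mann-Whitney statistic / (n1*n0)
cindex <- function(y, p) {
  w <- suppressWarnings(wilcox.test(p[y == 1], p[y == 0])$statistic)
  as.numeric(w) / (sum(y == 1) * sum(y == 0))
}
cindex(y, phat)

## Brier score: mean squared error of the predicted probabilities
mean((phat - y)^2)
```

Neither measure requires choosing a classification cutoff, and both penalize miscalibrated risk estimates less crudely than proportion correct.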

Difficulty in Estimating Accuracy in Small Samples

The following plot shows the results of a simulation of an ever-increasing test sample used to validate the simplest (and inappropriate) accuracy measure, the proportion classified correctly, when the true proportion is 0.8. This is also the least stringent validation, because if the outcome incidence is 0.2, one would be 0.8 accurate just by predicting that no patient will suffer the event. Validations need to be two-sided (e.g., false positive and false negative rates), or a better measure such as overall calibration accuracy should be used. Outer bands are 0.95 confidence intervals for the presumably unknown true probability of correct prediction.


P <- .8   # true probability of a correct prediction
N <- round(10*(2^seq(0,13,by=.25)))
p <- low <- hi <- numeric(length(N))   # pre-allocate result vectors
y <- rbinom(max(N), 1, P)
i <- 0
for(n in N) {
  i <- i+1
  s <- sum(y[1:n])
  p[i] <- s/n
  lim <- binconf(s, n, method='wilson')   # Wilson interval (Hmisc)
  low[i] <- lim[,'Lower']
  hi[i] <- lim[,'Upper']
}
plot(log2(N), p, ylim=range(c(low,hi)), axes=FALSE, type='b',
     xlab='Number of Patients in Validation Sample',
     ylab='Estimated Accuracy of Diagnostic Patterns',
     main='Estimated Accuracy and Its Margin of Error\nWhen True Classification Accuracy is 0.8')
w <- 10*(2^seq(0,13,by=1))
axis(1, log2(w), w)
lines(log2(N), low, col=gray(.7))
lines(log2(N), hi, col=gray(.7))
abline(h=P, lty=2, col=gray(.7))
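The margin-of-error idea can also be inverted to ask how large a validation sample is needed. A quick sketch using the normal (Wald) approximation n = z² p (1 − p) / d², with the same assumed true accuracy of 0.8:

```r
## Sketch: validation sample size needed to estimate a classification
## accuracy of 0.8 to within a given confidence-interval half-width d,
## via the Wald approximation n = z^2 * p * (1 - p) / d^2
acc <- 0.8                 # assumed true proportion correct
z   <- qnorm(0.975)        # two-sided 0.95 critical value
for (d in c(0.10, 0.05, 0.02))
  cat(sprintf("half-width %.2f: n = %4.0f\n", d,
              ceiling(z^2 * acc * (1 - acc) / d^2)))
## e.g., a half-width of 0.05 requires about 246 patients
```

The authors' validation sample of 114 patients thus cannot estimate accuracy to much better than about ±0.07, consistent with the width of the bands in the plot above.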


  • Meschia and Pankratz (p. 340-1): "Despite their promise, it is important to temper enthusiasm for Bayesian networks and similar analytical techniques. Results from these analytical techniques are dependent on the structure of the model, and their statistical properties have not yet been fully studied. The complex data analytical techniques that can be applied to large data sets make it possible to obtain persuasive results for any given data set. If nuances in sample set identification, data collection and application of the method are overlooked, then the results that are obtained may be difficult to replicate, or even misleading. ... we recommend that extensive detail be provided about every component of the study design and analysis, including a detailed description of the intermediate steps in model creation."

Lewin et al (2006), Biometrics 62(2):1-9: Bayesian modeling of differential gene expression; Gottardo et al (2006), Biometrics 62(2):10-18: Bayesian robust inference for differential gene expression in microarray with multiple samples.

Goals and Data - Lewin et al (2006)

  • Detecting differentially expressed genes
  • Simultaneously model expression-dependent array effect
  • Simple decision rule taking into account uncertainty in parameter estimates
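As a toy illustration of such a decision rule (not Lewin et al's actual hierarchical model; the prior, variances, and thresholds here are all assumptions), one can flag a gene when the posterior probability that its log fold change exceeds a threshold is high, rather than thresholding a point estimate:

```r
## Toy sketch (not the authors' model): Bayesian decision rule for
## differential expression. Per gene, put a normal prior d ~ N(0, tau^2)
## on the log fold change, assume known sampling variance, and flag the
## gene when P(|d| > delta | data) is large. All numbers are assumptions.
set.seed(2)
G <- 1000; nrep <- 4; tau <- 1; sigma <- 0.5; delta <- 0.5
d.true <- c(rnorm(G * 0.1, 0, tau), rep(0, G * 0.9))  # 10% truly DE
dbar   <- d.true + rnorm(G, 0, sigma / sqrt(nrep))    # observed mean difference

## Conjugate normal posterior for d given dbar: shrunken mean, reduced variance
v.post <- 1 / (1 / tau^2 + nrep / sigma^2)
m.post <- v.post * nrep * dbar / sigma^2
p.de   <- pnorm(-delta, m.post, sqrt(v.post)) +
          1 - pnorm(delta, m.post, sqrt(v.post))      # P(|d| > delta | data)

flagged <- p.de > 0.9
table(flagged, truly.de = abs(d.true) > 0)
```

The decision is based on the full posterior, so genes with large but noisily estimated fold changes are shrunk toward zero and flagged less readily, which is the sense in which uncertainty in parameter estimates enters the rule.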
Topic revision: r8 - 26 Apr 2006, ChuanZhou
