Validation of Discoveries and Validation of the Predictive Performance of Clinical Models

Note: On 1Nov04, Steve Goodman, statistical editor of the Annals of Internal Medicine, invited FrankHarrell to write an article on validation for that journal. This would be an excellent forum for us.

Development of a Manuscript with Possible Co-Authors Ewout Steyerberg, Karel Moons, Frank Harrell, Dean Billheimer, David Ransohoff

Goals of Paper

To explain in non-technical terms and to demonstrate with a simulation the following concepts:
  • internal and external validation, bootstrapping, cross-validation, data-splitting
  • why data-splitting results in low-precision accuracy estimates and wastes training data
  • bias and optimism, overfitting
  • need for estimates of precision of accuracy, and why a statistical index of model performance computed in a validation is only an estimate
  • how to do validation incorrectly
    • contamination of test sample with training data
    • failure to freeze model, e.g. not repeating variable selection for each resample
    • inadequate test sample size (or for resampling methods, inadequate total sample size)
    • validating more than a few models or markers and choosing the one that validates the best
Target Journal: JAMA, Annals, NEJM
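
A minimal simulation sketch of the data-splitting goal listed above, in Python with only the standard library. Everything in it is an assumption made for illustration: the sample size (150), a single hypothetical risk score, a logistic outcome model, and a 50-subject test sample. The score is fixed in advance, so no model is fitted, and all variation among the estimates comes from which subjects happen to land in the test sample.

```python
import math
import random
import statistics


def c_index(scores, outcomes):
    """ROC area: share of (event, non-event) pairs ranked correctly; ties count 1/2."""
    events = [s for s, y in zip(scores, outcomes) if y == 1]
    non_events = [s for s, y in zip(scores, outcomes) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                concordant += 1.0
            elif e == n:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))


def split_sample_estimates(n=150, n_test=50, n_splits=200, seed=1):
    """Return (c-index on all subjects, c-indexes from repeated random test samples)."""
    rng = random.Random(seed)
    # Hypothetical data: one risk score x, with P(disease) = logistic(x).
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [1 if rng.random() < 1.0 / (1.0 + math.exp(-xi)) else 0 for xi in x]
    index = list(range(n))
    estimates = []
    for _ in range(n_splits):
        rng.shuffle(index)
        test = index[:n_test]  # the held-out test sample for this split
        test_y = [y[i] for i in test]
        if len(set(test_y)) < 2:
            continue  # a one-class test sample cannot be scored
        estimates.append(c_index([x[i] for i in test], test_y))
    return c_index(x, y), estimates


if __name__ == "__main__":
    full, estimates = split_sample_estimates()
    print(f"c-index using all subjects: {full:.3f}")
    print(f"test-sample c-index: min {min(estimates):.3f}, "
          f"max {max(estimates):.3f}, SD {statistics.stdev(estimates):.3f}")
```

The spread of the test-sample estimates across splits of the same data is the low-precision point. A fuller demonstration for the paper would also refit the model in each training portion.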


  1. Abstract
  2. Introduction
  3. What is Validation?
    1. Definition in terms of assessing the likely future performance of a prediction model or of a categorical discovery
      1. discrimination
        1. measures to validate
      2. calibration
      3. what is wrong with percent classified correctly
        1. insensitive
        2. arbitrary
        3. improper scoring rule
    2. Validated performance estimates are only estimates
  4. What happens when you don't validate a model or discovery?
    1. bias or over-optimism and overfitting
    2. public dissatisfaction with epidemiologic "discoveries" about putative risk factors, bad effect on health behaviors
  5. How to improperly validate performance of a model
  6. Correct methods of validating the accuracy of predictions
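
A small worked sketch for the outline item on percent classified correctly, assuming a single subject whose true risk is known (0.7 here, a made-up value) and a 0.5 classification cutoff. It compares the expected proportion classified correctly with the expected Brier score (a proper scoring rule, chosen here only as an example) for an honest forecast and for two distorted ones.

```python
def expected_brier(true_risk, forecast):
    """Expected squared error of a probability forecast for one subject."""
    return true_risk * (1 - forecast) ** 2 + (1 - true_risk) * forecast ** 2


def expected_accuracy(true_risk, forecast, cutoff=0.5):
    """Expected proportion classified correctly when forecasts above
    the cutoff are classified as 'disease'."""
    return true_risk if forecast > cutoff else 1 - true_risk


if __name__ == "__main__":
    true_risk = 0.7  # hypothetical true probability of disease
    for forecast in (0.55, 0.70, 0.99):
        acc = expected_accuracy(true_risk, forecast)
        brier = expected_brier(true_risk, forecast)
        print(f"forecast {forecast:.2f}: accuracy {acc:.2f}, Brier {brier:.4f}")
```

All three forecasts give the same expected accuracy, so percent classified correctly cannot tell a timid forecast (0.55) or an overconfident one (0.99) from the honest one (0.70). The expected Brier score is lowest at the honest forecast.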


Background Papers to Review

  • Rich Simon/Lisa McShane JNCI 2003, where they describe those reasons and how the Netherlands BrCa prognosis group did it WRONG in Nature 2002 (see p16, 2nd column, top paragraph).
  • Keith Baggerly, Jeff Morris and Kevin Coombes paper in Bioinformatics 2004
  • Papers by Steyerberg and Moons
  • General paper on validation by Amy Justice
  • Editorial by Edison T Liu and Krishna R Karuturi Microarrays and Clinical Investigations. NEJM 350;1596 15Apr04
  • Editorial by David Ransohoff Bias as a threat to the validity of cancer molecular-marker research. Nature Reviews 5;142-149 Feb05
  • Michiels, Koscielny, Hill Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet 365;488-492 Feb 05. Comment on p. 454.

Discussions Not Yet Incorporated into Manuscript

Baggerly, Morris and Coombes (BMC, Bioinformatics 2004)

BMC assess the data and analysis approach used by Petricoin et al. (2002). Petricoin and colleagues developed an ovarian cancer classification algorithm based on 100 SELDI serum spectra (training set, 50 cancers and 50 normals). Their resulting algorithm correctly classified 50 of 50 cancer cases and 47 of 50 normals in a test set. Further, the algorithm correctly identified 16 of 16 cases of benign disease as "other" than normal or cancer. BMC's subsequent evaluation of the Petricoin data (as well as two supporting data sets) indicates that the structural features found to distinguish cancer from normal appear to be attributable to artifacts caused by the measurement technology or by differential sample handling/processing. They argue convincingly that the classification results are driven by systematic differences other than biology, and are not useful for cancer detection in a novel sample. BMC propose that better use of (statistical) experimental design and better external validation would "help". (They don't say precisely how it would help.)

My interpretation of this is as follows. I agree with BMC that Petricoin et al. inadvertently introduced systematic bias into their SELDI spectra. Further, this bias was introduced in such a way that no form of internal validation would have detected it. Both the training and testing sets were contaminated with this bias. (Another way of saying this is that the bias factor is completely confounded with the cancer/normal classification.) It is possible that better experimental design (e.g., randomization - statistical design; protocol standardization - logical design ?) would have reduced the effect of confounding.

DeanBillheimer 01 Apr 04

Model Validation vs. Model Assessment

This is a picky idiosyncrasy of mine. I want to mention it, and if I'm voted down, I'll live with it. I find the term model validation to be seriously misleading. I think this is because I don't know what it means for a model to be valid. Non-technically, we tend to equate "valid" with "true", but there is no such thing as a "true model" (except as a thought experiment or simulation). Further, we know that no amount of observed data can confirm a model's truth ( = validity). Is Newton's model for gravity "valid"? At the level of predicting planetary motion, it seems so; at other levels (e.g. where quantum or relativistic effects interfere) it's not so good (I think?). When trying to predict a falling body's velocity profile, the gravity model is incomplete; it's not really valid or invalid. Instead of describing a model as valid or not, I think it would be better to describe the situations under which a model predicts well.

Operationally, I'm not sure what it means for a model to be valid. For example, what level of predictive performance is required for validation? Is 90% correct classification good enough to claim a valid model? Clearly, if we change the population to which the modeling results are applied we can change the predictive performance. Also, in this setting we already have sensitivity, specificity, predictive value positive, and predictive value negative to describe different aspects of a model's predictive behavior. These seem to me to be reasonable measures for assessing its performance. I don't understand how they relate to validity.

Relatedly, is predictive performance the only characteristic of a model relevant for validation? We (statisticians, at least) don't use the term "validity" when evaluating the effect of a regression variable.

Clearly, I'm a neophyte at model validation.

So why the rant? I am wary of anyone who claims that their model has been "validated", without providing specific performance details. More importantly, I think that non-technical (non-statistical?, non-critical?) consumers of models can be lulled into false security by using "validated" models (an aside: Why aren't they called "valid" models?). A claim of validity confuses the mathematical model with the phenomenon being described by mathematics. Finally, the term "validation" tends to subvert critical evaluation. To me, "assessment" implies critical evaluation.

As a working title, let me propose a start: The Assessment of Predictive Performance of Clinical Models.

DeanBillheimer 01 Apr 04
Dean, you make some great points. My personal definition of validation does not include saying the model is actually useful but rather means that the "validated" performance of the model in new, similarly collected, samples is on the average going to be as good as we estimate in the validation. Validation doesn't mean the model is right. For example, a calibration validation means that when you predict a 0.4 chance of disease, on the average 0.4 of future subjects with those characteristics will have the disease. This might be a mixture of a group with 0.3 and a group with 0.5 disease probability due to an unknown binary predictor, but the 0.4 estimate that is not conditional on this unobserved variable is correct.

My suggestion for the title is to include discovery not just model validation.

FrankHarrell 2Apr04
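
The calibration example in the comment above (a predicted 0.4 that is really a mixture of 0.3 and 0.5) can be checked with a short simulation. This is only a sketch: it assumes the unobserved binary predictor splits subjects evenly, which the comment implies but does not state, and the function name and sample size are made up for illustration.

```python
import random


def observed_rate_at_prediction(n_subjects=100_000, seed=1):
    """Observed disease rate among subjects who all receive a predicted risk of 0.4."""
    rng = random.Random(seed)
    events = 0
    for _ in range(n_subjects):
        # Assumed: the unobserved binary predictor puts half of the
        # subjects at a true risk of 0.3 and the other half at 0.5.
        true_risk = 0.3 if rng.random() < 0.5 else 0.5
        events += rng.random() < true_risk
    return events / n_subjects


if __name__ == "__main__":
    print(f"observed rate among those predicted 0.4: {observed_rate_at_prediction():.3f}")
```

The observed rate lands close to 0.4, so the prediction is calibrated even though no individual subject has a true risk of exactly 0.4.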
Frank, Thank you for the explanation. The validation (verification) of model performance makes perfect sense. If my own experience is any guide, I think this point is not well understood by non-specialists. Given the intended audience for the paper, I think this is an important point to make: performance validation, not model validation.

revised working title: Discovery and Predictive Performance Validation of Clinical Models.

DeanBillheimer 2 Apr 04
I like it.

FrankHarrell 2 Apr 04
It would be interesting to know the "false validation rate", i.e., if you independently validated 100 validated findings, how many would validate, and to what extent this is a function of how hard the researchers had to search for discoveries.

FrankHarrell 22 Apr 04
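
One way to explore the question above is a toy simulation, sketched here under strong assumptions that are not from the discussion: each study screens a set of candidate markers of which exactly one is real, the test statistic for the real marker is normal with mean 2.5 and the others have mean 0, the researchers report the marker with the largest statistic, and an independent study of the same size "validates" it if its statistic exceeds 1.645.

```python
import random


def validation_rate(n_markers, n_studies=1000, true_mean_z=2.5, z_crit=1.645, seed=1):
    """Share of reported findings that hold up in an independent validation study."""
    rng = random.Random(seed)
    validated = 0
    for _ in range(n_studies):
        # Discovery study: marker 0 is real (hypothetical effect), the rest are noise.
        discovery_z = [rng.gauss(true_mean_z if j == 0 else 0.0, 1.0)
                       for j in range(n_markers)]
        best = max(range(n_markers), key=discovery_z.__getitem__)
        # Independent validation of the single reported marker.
        validation_z = rng.gauss(true_mean_z if best == 0 else 0.0, 1.0)
        if validation_z > z_crit:
            validated += 1
    return validated / n_studies


if __name__ == "__main__":
    for n_markers in (1, 20, 200):
        print(f"{n_markers:4d} markers screened: validation rate {validation_rate(n_markers):.2f}")
```

In this toy setup the validation rate falls as more markers are screened, because the reported "best" marker is more often a noise marker that happened to look good. The numbers depend entirely on the assumed effect size and threshold.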
One validation issue that I'm personally attached to is the issue of validation in populations that differ in the prevalence of disease. In 1990 we studied the use of Bayes' theorem to adjust the probability of disease given a chest pain score for the overall prevalence of CAD in the population, comparing 2 primary care populations to 2 populations admitted for coronary arteriography.

Sox HC, Hickam DH, Marton KI, Skeff KS, Sox CH, Moses L, Neal A. Using the patient's history to estimate the probability of coronary artery disease: a comparison of referral and primary care practice. Am J Medicine. 1990;89:7-14.

Hal Sox 24 May 05
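
For readers unfamiliar with the prevalence adjustment mentioned above, here is a generic sketch of the odds form of Bayes' theorem. It is not taken from the cited paper. It assumes that the likelihood ratio attached to a given score is the same in both populations, which is exactly the assumption a validation across populations would need to check. The function name and arguments are hypothetical.

```python
def adjust_for_prevalence(p_source, prev_source, prev_target):
    """Re-express a predicted probability for a population with a different prevalence.

    Assumes the likelihood ratio of the finding is the same in both populations,
    so posterior odds scale by the ratio of the prior (prevalence) odds.
    """
    for value in (p_source, prev_source, prev_target):
        if not 0.0 < value < 1.0:
            raise ValueError("probabilities must be strictly between 0 and 1")

    def odds(p):
        return p / (1.0 - p)

    target_odds = odds(p_source) * odds(prev_target) / odds(prev_source)
    return target_odds / (1.0 + target_odds)


if __name__ == "__main__":
    # Hypothetical numbers: a score giving 0.5 probability where prevalence is 0.5,
    # carried over to a population where prevalence is 0.1.
    print(round(adjust_for_prevalence(0.5, 0.5, 0.1), 3))
```

If the adjusted probabilities do not match what is observed in the new population, the likelihood ratio itself differs between settings and a simple prevalence correction is not enough.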

Links to Other Sites

Ewout: The slides are nice. You may want to incorporate some of the points there into the outline at the top -FH

Review and cite the REMARK guidelines (JNCI 97:1180, 2005)
Topic attachments
  • Validationofpredictivemodels.pdf (88.0 K, 30 Mar 2004, EwoutSteyerberg): ES: educational material on validation
  • Validationscheme.ppt (26.0 K, 01 Apr 2004, EwoutSteyerberg): A scheme for internal and external validation/validity
Topic revision: r15 - 12 Feb 2007, FrankHarrell
