Statistical Thinking in Biomedical Research
Division of Biostatistics & Epidemiology
Department of Health Evaluation Sciences
University of Virginia School of Medicine
Rob Abbott, Viktor Bovbjerg, Mark Conaway, Frank Harrell
hesweb1.med.virginia.edu/biostat/teaching/clinicians
Office of Continuing Medical Education
Fall 2002
Slide: 1
Section 1: Introduction to Statistical Concepts
Frank Harrell
-
What do biostatistics & epidemiology have to offer to
biomedical research?
- Statistical inference
- Study design issues
- Descriptive statistics
- Measuring change
- Statistical resources at UVa
Slide: 2
What Do Biostatistics & Epidemiology Offer?
-
Help in developing concrete objectives and data acquisition
methods that meet the objectives, concrete descriptions
of primary and secondary endpoints
- Appropriate experimental and study design
-
Sources of bias
- Measurement issues
- Efficiency/power
- Maximizing use of a given number of animals
- Interpretability of findings
- Reproducibility of analyses
- Choice of appropriate design depends crucially on the type of
experiment/disease and treatment being studied
- likelihood that sample will yield estimates of adequate
precision to make experiments conclusive/affect medical practice
- More efficient use of data
- Formulate analysis plans without making inappropriate assumptions
- Estimate sample size (if fixed)
Slide: 3
Example
-
Objective: Does an intervention (could be a new drug,
patient education intervention, method of administration) improve
a response?
- What is meant by "response" and how will it be
measured?
- Is it based on symptoms, physiologic measurements, anatomical
measurements, or a combination?
- Does the "response" variable truly
measure the effect of treatment?
- When is the response measured: 30 days after enrollment,
30 days after discharge from the hospital, anytime within a
5-year follow-up period, during the procedure, immediately
upon reperfusion after a coronary artery is unclamped,
after steady state, ...?
- If a "time to" endpoint, how much time is given to
enrolling patients and how much to follow-up?
- Where will patients come from and which group do they represent?
- How will the study results be used?
Slide: 4
Statistical Inference: Examples
-
"How did my 5 patients do after I put them on an
ACE inhibitor?": Describe results.
- "How do patients with condition x respond after being on
an ACE inhibitor for 6 months?": Infer → need to take a sample of
patients of interest to approximate what would be observed
had all such patients been treated that way.
- "What is the in-hospital mortality after open heart surgery
at my hospital so far this year?": Describe; whole population
captured.
- "What is the in-hospital mortality after open heart surgery
likely to be this year, given results from last year?": Infer →
estimate the probability of death for patients like those seen in
1996.
- Inference = observations → some general truth
- Answering research questions usually requires inferential
reasoning because you want to make a statement in general, not
just a statement about your specific study.
- The ability to do so depends on how the observations were
collected as well as on their number
Slide: 5
Infinite Data Case
-
Suppose that one had an infinitely large amount of data of the
kind under consideration
- Inference not required
- Do need to determine if the infinite dataset would answer the
question of interest (QOI)
-
Subjects relevant?
- Measurements biased?
- Measurements relevant (e.g., measure cholesterol reduction
but not survival time)?
- Data collection process adequate?
- Patient-to-patient variability still too great for
conclusions to be applied to individuals?
Slide: 6
Finite Dataset
-
Compute an estimate of something, e.g. expected reduction in blood pressure
- Approximates what would have been observed if one had infinite
data
- Can estimate likely |error| in this approximation
- Probabilistic thinking: likely absolute error is a function of:
-
Sample size
- Subject to subject variability
- Intra-subject variability if using multiple
observations/subject
- Systematic bias
- Some subjects not getting desired experimental condition
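The dependence of the likely absolute error on sample size can be sketched with a small simulation (the Normal population, mean blood-pressure reduction of 10 mmHg, and SD of 8 mmHg are hypothetical choices, not from the slides):

```python
import statistics as stats
import random

random.seed(1)

def likely_error(n, mu=10.0, sigma=8.0, reps=2000):
    """Average absolute error of the sample mean as an estimate of mu,
    for samples of size n from a Normal(mu, sigma) population."""
    errs = []
    for _ in range(reps):
        sample = [random.gauss(mu, sigma) for _ in range(n)]
        errs.append(abs(stats.fmean(sample) - mu))
    return stats.fmean(errs)

# Quadrupling the sample size roughly halves the likely absolute error
for n in (10, 40, 160):
    print(n, round(likely_error(n), 2))
```

Subject-to-subject variability enters through sigma: doubling it doubles the likely error at any fixed n.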
Slide: 7
Steps Involved in Statistical Inference
-
Statistical inference based on the fact that when laws of
probability govern data collection, can infer from sample to
infinite data results
- First ensure that an infinite dataset would answer the QOI
- Assess results using a sample
- Compute likely closeness with which sample results
approximates infinite dataset results
- Internal validity (chance), external validity
(generalizability)
Slide: 8
Study Design Issues
-
Concretely define study objectives
- Design the study so that, given infinite data, it would answer the QOI
- Conduct experiment making efficient use of resources,
minimize likely |error|
- Standardize measurement devices
- Quantify and minimize intra- and inter-observer
variability of measurements
- Define terms: symptoms, signs, diagnoses, disease severity,
risk factors,
treatment or experimental conditions, control condition, events
- Use standardized assessment instruments when possible
- Define animal/patient entry criteria
- Concomitant therapies/laboratory conditions
- Dosing of active control agents must be optimal
- Account for accommodation (tolerance) to drug effect
- In comparison studies, masked and random assignment of
experimental conditions
- Masked assessment of specimens/subject responses
- Masked specification of analysis [6]
- Masked reporting: write manuscript before data
analysis [7]
Slide: 9
Response Variables
-
Continuous measurements are best (e.g., mmHg, not
``hypertension'')
- Time lapse after experimental condition/how often to measure
- May have to wait until after an acute but temporary
derangement
- Time of assessment may need to correspond with phases of
disease development/trajectory of disease severity as well as
lifespan of the technology
- Binary response when time of event not important (e.g.,
procedural death); still need to justify duration of observation
- Time to event: all subjects without event need to be
followed a minimum duration to capture some of the
clinically relevant period.
Sufficiently large subset of
subjects should be followed until the end of the clinically
relevant period.
- Ordinal responses can be useful and have good statistical
power, e.g., no event within 30d, mild myocardial infarction,
moderate MI, severe MI, death within 30d. For diagnostic
studies may need to at least include a ``gray zone''.
- When have multiple responses they should be (1) prioritized,
or (2) combined into a summary scale. Need to ``go out on a limb''
and pre--specify which results will be emphasized when study
results are publicized.
- Example: may combine systolic and diastolic b.p. into
mean arterial b.p.
- Otherwise, have multiple comparison problems. To
preserve the overall type I error (false positive rate), one would need to
be more conservative → lower power.
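The multiple-comparison penalty can be illustrated numerically; this sketch uses 5 hypothetical endpoints and the Bonferroni correction as one standard conservative adjustment (both are illustrative choices, not from the slides):

```python
# Family-wise type I error when k independent null hypotheses are
# each tested at level alpha.
def fwer(alpha: float, k: int) -> float:
    """Probability of at least one false positive among k independent tests."""
    return 1 - (1 - alpha) ** k

alpha, k = 0.05, 5
print(round(fwer(alpha, k), 3))      # ≈ 0.226: far above the nominal 0.05
print(round(fwer(alpha / k, k), 3))  # ≈ 0.049: Bonferroni restores <= 0.05
```

The price of testing at alpha/k instead of alpha is the reduced power the slide warns about.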
Slide: 10
Types of Studies/Believability of Results
-
Single--arm (pilot, Phase I--II)
- Comparisons with a reference standard
- Toxicity
- Pharmacokinetic
- Correlating two responses
- Dose--finding/dose titration
- Estimation of dose-- or time--response curves within subjects
- Beware of problems with noncomparative studies of therapies:
Treatment response = natural history + Hawthorne effect +
placebo effect + bias caused by investigator enthusiasm + real
treatment effect
- Comparative: ≥ 2 arms
- Unacceptable Studies:
-
Observational studies where subjects were selected on
the basis of their outcomes (e.g., consecutive series of
100 open heart patients who lived)
- Comparison with historical controls unless time-trends are
fully understood and excellent subject baseline
descriptors are available in both studies
- Randomized controlled trial (RCT) where physicians only
allowed patients to be randomized who were invincible
- RCT where entry criteria do not reflect patients seen in practice
- RCT of a procedure or therapy that is obsolete by the
time the results are disseminated (or mode of use is
obsolete)
- Any study where positive results were
derived only after torturing the data (multiple subgroups
or response variables examined)
- Experiment in which measurements have extremely large
variability across replications within the same animal,
or ones in which measurements were ``optimized'' by
non--replicable ``tweaking''
- An average ranking of quality for comparative
studies (footnote 1):
-
double masked RCT with masked analysis &
manuscript writing
- double masked RCT
- single masked RCT
- unmasked RCT
- prospectively designed and conducted
cohort study
- prospective case--control study
- retrospective cohort study
- retrospective case--control study
- See Chalmers et al. [4] for a rating scale for
study quality. Also see [3].
Slide: 11
Randomized Experiments
-
Randomly allocate patients to treatments while
masked
- If sample size is at all reasonable, should balance all known and
unknown risk factors
- Even if there is an apparent imbalance in one factor, you'll
see imbalances in the other direction if you look at enough
other factors
- Best not to look at patient characteristics stratified by
treatment; report statistics for overall sample
- Randomization is best done using a computer program, with
treatment assignments revealed at the last moment
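As a sketch of computer-generated randomization, the fragment below uses permuted-block allocation, one common scheme that keeps group sizes balanced over time; the block size, seed, and two-arm setup are illustrative assumptions:

```python
import random

def randomization_list(n_subjects, treatments=("A", "B"), block_size=4,
                       seed=20021104):
    """Permuted-block randomization: within each block every treatment
    appears equally often, so arm sizes stay balanced as accrual proceeds."""
    rng = random.Random(seed)
    per_block = block_size // len(treatments)
    schedule = []
    while len(schedule) < n_subjects:
        block = list(treatments) * per_block   # e.g., [A, A, B, B]
        rng.shuffle(block)                     # random order within the block
        schedule.extend(block)
    return schedule[:n_subjects]

print(randomization_list(10))
```

In practice the generated list would be held centrally and each assignment revealed only at the moment of randomization, consistent with the masking advice above.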
Slide: 12
External Validity of Study Findings
-
Knowledge of pathophysiology can allow extrapolation of
results to a group of subjects not represented in study
- Example: Reduction of probability of myocardial infarction by
aspirin in men → reduction in women
- But what if aspirin increases GI bleeding in women more than
in men?
- Differences in dosing, side effects, compliance can cause
different results in another population
- Relative effects of treatments frequently carry over to
other types of subjects even though absolute effects do
not (footnote 2)
Slide: 13
Pitfalls in Analysis & Interpretation
-
Highlighting results found by data dredging; need to at
least document the context
- Deleting ``outliers'' based on observed response values
-
Unscientific, results non--replicable
- Instead use robust statistical methods
- Irreproducible analyses based on point--and--click software without
audit trail
- Concentrating exclusively on hypothesis testing. Null hypotheses
are generally boring and do not answer questions about clinical
significance. It's better to think in terms of being able to
estimate, with sufficient precision, the effects of interest.
- Using P-values to provide evidence supporting a hypothesis;
they can only be used to quantify evidence against a
hypothesis:
"Absence of evidence is not evidence of absence" [1]
- P=0.4 → insufficient sample size or no effect; don't know which
- Relying too much on standard deviations as descriptive statistics.
Standard deviations are not very meaningful if the distribution of
the data is non--Gaussian and especially if asymmetric.
- Using standard errors to describe anything other than the
precision of a summary estimate. Standard errors do not
describe variability across subjects. To describe precision,
it's better to use confidence limits on summary statistics.
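The SD/SE distinction can be made concrete in a few lines (the blood-pressure data are hypothetical, and the large-sample 1.96 multiplier is a simplification of exact t-based limits):

```python
import math
import statistics as stats

# Hypothetical reductions in systolic blood pressure (mmHg) for 10 subjects
bp_change = [4, -2, 7, 11, 0, 5, 9, 3, -1, 6]

n = len(bp_change)
mean = stats.fmean(bp_change)
sd = stats.stdev(bp_change)      # describes subject-to-subject variability
se = sd / math.sqrt(n)           # describes only the precision of the mean
# Approximate 95% confidence limits on the mean (exact limits would use
# the t distribution with n - 1 degrees of freedom)
ci = (mean - 1.96 * se, mean + 1.96 * se)
print(mean, sd, se, ci)
```

Reporting "mean ± SE" invites readers to mistake the SE for between-subject spread; the confidence limits say directly how precisely the mean is estimated.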
Slide: 14
Descriptive Statistics
-
Number of non--missing measurements, central tendency, perhaps
inter--subject variability
- Mean and especially standard deviation may not be meaningful
unless data normally distributed
- Don't expect normality for biological variables
- Deciding on statistics to use on basis of test of normality
assumes such tests have power near 1.0
- For continuous variables, a good summary is obtained from
the 3 quartiles (25th, 50th, 75th percentiles,
50th = median)
- Describes central tendency, spread, symmetry
- For continuous variables for which totals may be relevant
(e.g., costs), supplement this with the sample mean
- Computing means on transformations (e.g., geometric mean) and
then back--transforming is problematic
- Standard errors are not descriptive statistics
- For discrete numeric variables representing counts or
interval scale values, where the number of possible
categories <10, use the mean and outer
quartiles or mean and selected proportions.
Median will not be sensitive and is erratic
because of heavy ties in data.
- Nominal (polytomous) variables → proportions in k-1 of the
k categories
- Binary variables → mean (proportion of "positives")
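A minimal sketch of the three-quartile summary, using Python's statistics module and hypothetical right-skewed cost data, shows why the quartiles can be more informative than mean and SD:

```python
import statistics as stats

# Hypothetical right-skewed hospital costs (in $1000s); one large outlier
costs = [2, 3, 3, 4, 4, 5, 6, 8, 12, 40]

# 25th, 50th, 75th percentiles: central tendency, spread, and symmetry
q1, median, q3 = stats.quantiles(costs, n=4)
print(q1, median, q3)

# For variables where totals matter (e.g., costs), supplement with the mean
print(stats.fmean(costs))
```

Here the mean (8.7) is pulled above the 75th percentile territory by a single subject, while the quartiles (3.0, 4.5, 9.0) describe the typical patient.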
Slide: 15
Analysis of Paired Observations
-
Frequently one makes multiple observations on same
experimental unit
- Can't analyze as if independent
- When two observations are made on each unit (e.g., pre-post),
it is common to summarize each pair using a measure of effect
→ analyze the effects as if they were (unpaired) raw data
- Most common: simple difference, ratio, percent change
- Can't take effect measure for granted
- Subjects having large initial values may have largest
differences
- Subjects having very small initial values may have
largest post/pre ratios
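The pair-to-effect reduction can be sketched as follows (the pre/post blood pressures are hypothetical):

```python
import math
import statistics as stats

# Hypothetical pre/post systolic blood pressures for 8 subjects
pre  = [152, 148, 160, 141, 155, 163, 149, 158]
post = [144, 143, 150, 140, 146, 151, 147, 149]

# Reduce each pair to one effect measure per subject...
diffs = [b - a for a, b in zip(pre, post)]

# ...then analyze the effects as a single (unpaired) sample
mean_diff = stats.fmean(diffs)
se = stats.stdev(diffs) / math.sqrt(len(diffs))
print(mean_diff, se)
```

Analyzing the 16 raw values as if they were independent would use the wrong variance; the per-subject differences respect the pairing.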
Slide: 16
What's Wrong with Percent Change?
-
Depends on the point of reference: which term is used in the
denominator?
- Example:
Treatment A: 0.05 proportion having stroke
Treatment B: 0.09 proportion having stroke
Treatment A reduced the proportion with stroke by 44% (relative to B)
Treatment B increased the proportion by 80% (relative to A)
- Two increases of 50% result in a total increase of 125%, not
100%
- Percent change (or ratio) not a symmetric measure
- Simple difference or log ratio are symmetric
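These asymmetries are easy to verify numerically; this sketch uses the stroke proportions from the example above:

```python
import math

# Percent change is asymmetric: 0.05 -> 0.09 vs. 0.09 -> 0.05
pct_up = (0.09 - 0.05) / 0.05 * 100    # about +80%
pct_down = (0.05 - 0.09) / 0.09 * 100  # about -44%, not -80%

# Log ratios are symmetric: the two directions differ only in sign
log_up = math.log(0.09 / 0.05)
log_down = math.log(0.05 / 0.09)
print(pct_up, pct_down)
print(log_up, log_down)

# Two 50% increases compound to +125%, but log ratios simply add
print((1.5 * 1.5 - 1) * 100)  # 125.0
print(math.log(1.5) + math.log(1.5) == math.log(1.5 * 1.5))
```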
Slide: 17
Objective Method for Choosing Effect Measure
-
Goal: Measure of effect should be as independent of
the baseline value as possible (footnote 3)
- Plot difference in pre and post values vs. the average of the
pre and post values. If this shows no trend, the simple
differences are adequate summaries of the effects, i.e., they
are independent of initial measurements.
- If a systematic pattern is observed, consider repeating the
previous step after taking logs of both the pre and post
values. If this removes any systematic relationship between
the average and the difference in logs, summarize the data
using logs, i.e., take the effect measure as the log ratio.
- Other transformations may also need to be examined
Slide: 18
Biostat / Epi Resources at UVa
-
We profit from strong ties with Statistics in the College
of Arts & Sciences
- Who to contact: Faculty & Executive Secretary of the Division of
Biostatistics & Epidemiology, Dept. of HES, Box 800717,
-712
-
Frank E Harrell Jr, Professor and Chief
fharrell@virginia.edu, 4-8712
General, cardiovascular, nephrology, critical care medicine, health
services research and outcomes evaluation, pharmaceutical research
- Robert D Abbott, Professor
abbott@virginia.edu, 4-1687
GCRC, cardiovascular epidemiology, dementia studies, Medicine
- Mark R Conaway,
Professor
mconaway@virginia.edu, 4-8510
General, clinical trial design, longitudinal data, oncology
- Gina R Petroni, Associate Research Professor
gpetroni@virginia.edu, 4-8363
Cancer Center
- Viktor E Bovbjerg,
Assistant Professor
bovbjerg@virginia.edu, 4-8430
Epidemiology, chronic disease studies, especially
cardiovascular disease and diabetes
- Jae K Lee, Assistant Professor
jklee@virginia.edu, 4-8712
Statistical genetics, genomics,
bioinformatics
- Jim Patrie MS, Biostatistician III
jpatrie@virginia.edu, 4-8576
GCRC, general, experimental design
- Eric Bissonette MS, Biostatistician III
ebisson@virginia.edu, 4-8586
Cancer Center, interventional radiology, general
- Xin Wang MS, Biostatistician III
xwang@virginia.edu, 4-8519
Health services research and outcomes evaluation,
cardiology, general
- Mark Smolkin MS, Biostatistician II
msmolkin@virginia.edu, 4-8712
Cancer Center
- Mir Siadaty MD, MS, Biostatistician II
msiadaty@virginia.edu, 4-8712
Bioinformatics, general
- Brenda Lee
Executive Secretary
-712
Inquiries, appointments
- General E--mail address: biostat@virginia.edu
- Department of Statistics Consulting Lab
- Biostatistics Consulting Service (hourly charge for data
analysis); see hesweb1.med.virginia.edu/biostat/services.htm
- Free (but currently limited because almost all our time is now
funded by successful grant applications): developing grant
proposals, study planning meetings
-
For Divisions not providing general ongoing support of a
portion of an M.S. biostatistician and a faculty biostatistician
or epidemiologist, assistance is on a first-come, first-served basis
with a minimum of 120 days advance notice before the grant
application deadline
- Clinical Trials Office (Lori Elder)
Slide: 19
How to Collaborate with Statisticians & Epidemiologists?
-
Willingness to explain the
details of the study. An
appropriate choice of the outcome, design, sample size,
data collection, etc. requires some knowledge of
the area being studied. Explaining these things can
not only provide a more efficient way of doing the
study, but can sometimes help to clarify issues that
may have been taken for granted.
Slide: 20
Collaboration Issues
-
Collaboration should begin early
- Too often the statistician will uncover a fatal flaw in data
collection too late, e.g., recording measurements
as ranges rather than raw data
- Early understanding on authorship; depends on whether e.g.
statistician serves as a ``number cruncher'' vs. as part of
the investigation or manuscript writing or she
develops/assimilates new methods for the purpose
of the project
- Best way to fund epi/biostat involvement is through %FTE
of grant support or general ongoing funding from the
divisional level
- The long-term goal of the Division of Biostatistics &
Epidemiology is to have statisticians on staff who have
long-term collegial relationships with biomedical
researchers and a good understanding of specific
subject matter areas
Slide: 21
Education Opportunities at UVa
-
One-semester clinical trials methodology course
taught once/year by Lori Elder and Gina Petroni
- Short courses: Statistical Thinking in Biomedical
Research offered twice yearly through CME
- M.S. in Health Evaluation Sciences: Clinical
Investigation, Health Services Research and Outcomes
Evaluation, and Epidemiology tracks
- Comprehensive Introduction to Clinical Investigation,
intensive 4-week course for housestaff and other physicians
(month of July, beginning 2001)
- Health Evaluation Sciences Research Conference: Wednesdays
4:00-5:00p, 3rd floor Hospital West, Room 3182
- First Wednesday of the month:
Clinical Investigation Seminar sponsored jointly by the Center for
the Study of Complementary and Alternative medicine and DHES
-
Forum for research methodology, clinical trials design
- Ideal for presenting work/grant proposals in progress, for
critique by methodologists
- Contact fharrell@virginia.edu if you want to be put on
the e-mailing list
- Short courses in clinical investigation and biomedical ethics
References
[1] D. G. Altman and J. M. Bland. Absence of evidence is not
evidence of absence. British Medical Journal, 311:485, 1995.
[2] J. C. Bailar III and F. Mosteller. Medical Uses of Statistics.
NEJM Books, Boston, second edition, 1995.
[3] C. Begg, M. Cho, S. Eastwood, R. Horton, D. Moher, I. Olkin,
et al. Improving the quality of reporting of randomized controlled
trials: The CONSORT statement. Journal of the American Medical
Association, 276:637-639, 1996.
[4] T. C. Chalmers, H. Smith, B. Blackburn, B. Silverman,
B. Schroeder, D. Reitman, and A. Ambroz. A method for assessing
the quality of a randomized control trial. Controlled Clinical
Trials, 2:31-49, 1981.
[5] T. J. Cole. Sympercents: symmetric percentage differences on
the 100 log_e scale simplify the presentation of log transformed
data. Statistics in Medicine, 19:3109-3125, 2000.
[6] CPMP Working Party. Biostatistical methodology in clinical
trials in applications for marketing authorizations for medicinal
products. Statistics in Medicine, 14:1659-1682, 1995.
[7] P. C. Gøtzsche. Blinding during data analysis and writing of
manuscripts. Controlled Clinical Trials, 17:285-293, 1996.
[8] L. Kaiser. Adjusting for baseline: Change or percentage
change? Statistics in Medicine, 8:1183-1190, 1989.
[9] R. A. Kronmal. Spurious correlation and the fallacy of the
ratio standard revisited. Journal of the Royal Statistical
Society A, 156:379-392, 1993.
[10] T. A. Lang and M. Secic. How to Report Statistics in
Medicine: Annotated Guidelines for Authors, Editors, and
Reviewers. American College of Physicians, Philadelphia, 1997.
[11] J. S. Maritz. Models and the use of signed rank tests.
Statistics in Medicine, 4:145-153, 1985.
[12] L. Törnqvist, P. Vartia, and Y. O. Vartia. How should
relative changes be measured? American Statistician, 39:43-46,
1985.
Footnotes
1. RCTs include crossover studies, which can be of
excellent quality when there are no carryover effects or when
carryover effects are understood well enough to be "subtracted out".
2. Because absolute risks of events vary with
disease severity, dictating that risk differences must vary.
3. Because of regression to
the mean, it may be impossible to make the measure of change
truly independent of the initial value. A high initial value
may be that way because of measurement error. The high value will
cause the change to be less than it would have been had the
initial value been measured without error. Plotting
differences against averages rather than against initial
values will help reduce the effect of regression to the mean.