Applied Nonparametric Bootstrap with Hierarchical and Correlated Data

Introduction

Hierarchy. Correlated data may often be represented using a nested hierarchy. Consider a clinical trial where repeat measurements are collected on each patient in a cohort selected uniformly at random from one of several hospitals, which are selected uniformly at random from a network of hospitals. Repeat measurements are nested within patients, and patients within hospitals. A portion of the hierarchy may look like this:

Bootstrap. The goal is to approximate a data generating mechanism (the data distribution function) using an empirical estimate. In designing a bootstrap technique, the strategy is to approximate the data generating mechanism as well as possible. In the absence of hierarchical structure, sampling from the empirical distribution (simple bootstrap) may be sufficient. When there is hierarchical structure, the simple bootstrap may misrepresent variability in the sampling distribution of statistics.

Correlation. In order to preserve this nested correlation structure in resamples, we may mimic the data generating mechanism, e.g. by resampling in a nested fashion. To illustrate, first resample among hospitals, then among patients within hospitals, and finally among measurements within patients. At each level, we can sample with replacement, or without replacement. Sampling with replacement at each level is called multi-stage bootstrap (the reverse multi-stage bootstrap starts at the bottom of the hierarchy instead of the top). It doesn't make much sense to sample without replacement at all levels (why?). There are arguments that favor sampling with replacement at the top level, and without replacement at lower levels. This is called the cluster bootstrap.

Simplifications. Sometimes one or more levels of a nested hierarchy may be ignored, with regard to resampling, when the corresponding factors are thought not to induce correlation among observations. For example, the patient factor is generally thought to induce correlation among repeated measurements. The hospital factor is less direct. We might expect correlation among measurements on patients sampled from the same hospital, but not for patients drawn from different hospitals. There might be an exception if two or more hospitals were located (nested) within a small geographical region, that is, if there were yet another level of hierarchy.

Implementation

The R code below implements a multi-stage bootstrap in a recursive manner. In the context of this example, the dat argument is a data frame having a factor field for each level of hierarchy (i.e., hospital, patient, measurement), and a numeric field that contains the measured values. The cluster argument should be a character vector that identifies the hierarchy in order from top to bottom (i.e., c('hospital','patient','measurement')). The replace argument is a logical vector that indicates whether sampling should be with or without replacement at the corresponding level of hierarchy (e.g., c(TRUE,FALSE,FALSE) for the cluster bootstrap).

resample <- function(dat, cluster, replace) {
  
  # exit early for trivial data
  if(nrow(dat) == 1 || all(replace==FALSE))
      return(dat)
  
  # sample the clustering factor
  cls <- sample(unique(dat[[cluster[1]]]), replace=replace[1])
  
  # subset on the sampled clustering factors
  sub <- lapply(cls, function(b) subset(dat, dat[[cluster[1]]]==b))
  
  # sample lower levels of hierarchy (if any)
  if(length(cluster) > 1)
    sub <- lapply(sub, resample, cluster=cluster[-1], replace=replace[-1])
  
  # join and return samples
  do.call(rbind, sub)
  
}

Simulation

The following R code simulates a dataset with 5 correlated (rho=0.4) repeat measurements on each of 10 patients from each of 5 hospitals; 250 measurements total, 50 patients total. In these data, patients are totally independent. The functions covimage and datimage generate a levelplot representation of the covariance matrix and data matrix, respectively, for the simulated data.

# simulate correlated data
rho <- 0.4
dat <- expand.grid(
  measurement=factor(1:5),
  patient=factor(1:10),
  hospital=factor(1:5))
sig <- rho * tcrossprod(model.matrix(~ 0 + patient:hospital, dat))
diag(sig) <- 1
dat$value <- chol(sig) %*% rnorm(250, 0, 1)

library("lattice")

covimage <- function(x)
   levelplot(as.matrix(x), aspect="fill", scales=list(draw=FALSE),
      xlab="", ylab="", colorkey=FALSE, col.regions=rev(gray.colors(100, end=1.0)),
      par.settings=list(axis.line=list(col=NA,lty=1,lwd=1)))
  
datimage <- function(x) {
   mat <- as.data.frame(lapply(x, as.numeric))
   levelplot(t(as.matrix(mat)), aspect="fill", scales=list(cex=1.2, y=list(draw=FALSE)),
      ylab="", xlab="", colorkey=FALSE, col.regions=gray.colors(100),
      par.settings=list(axis.line=list(col=NA,lty=1,lwd=1)))
}

datimage(dat)
covimage(sig)

Bootstrap

The following R code generates various boostrap distributions for the sample mean, and approximates the 'true' sampling distribution by Monte Carlo. The final boxplot illustrates that the bootstrap strategy has a significant impact on the bootstrap distribution of sample statistics. Hence, it's important to think carefully about the strategy, and ensure that it most closely reflects the 'true' data generating mechanism.
# bootstrap ignoring hospital and patient levels
cluster <- c("measurement")
system.time(mF <- replicate(200, mean(resample(dat, cluster, c(F))$val)))
system.time(mT <- replicate(200, mean(resample(dat, cluster, c(T))$val)))
#boxplot(list("F" = mF, "T" = mT))

# bootstrap ignoring hospital level
cluster <- c("patient","measurement")
system.time(mFF <- replicate(200, mean(resample(dat, cluster, c(F,F))$val)))
system.time(mTF <- replicate(200, mean(resample(dat, cluster, c(T,F))$val)))
system.time(mTT <- replicate(200, mean(resample(dat, cluster, c(T,T))$val)))
#boxplot(list("FF" = mFF, "TF" = mTF, "TT" = mTT))

# bootstrap accounting for full hierarchy
cluster <- c("hospital","patient","measurement")
system.time(mFFF <- replicate(200, mean(resample(dat, cluster, c(F,F,F))$val)))
system.time(mTFF <- replicate(200, mean(resample(dat, cluster, c(T,F,F))$val)))
system.time(mTTF <- replicate(200, mean(resample(dat, cluster, c(T,T,F))$val)))
system.time(mTTT <- replicate(200, mean(resample(dat, cluster, c(T,T,T))$val)))
#boxplot(list("FFF" = mFFF, "TFF" = mTFF, "TTF" = mTTF, "TTT" = mTTT))

# Monte Carlo for the true sampling distribution
system.time(mMC <- replicate(200, mean(chol(sig) %*% rnorm(250, 0, 1))))
#boxplot(list("MC" = mMC))

boxplot(list("MC" = mMC,
             "F" = mF, "T" = mT,
             "FF" = mFF, "TF" = mTF, "TT" = mTT,
             "FFF" = mFFF, "TFF" = mTFF, "TTF" = mTTF, "TTT" = mTTT))
Topic revision: r35 - 23 Jul 2015, WikiGuest
 

This site is powered by FoswikiCopyright © 2013-2022 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Vanderbilt Biostatistics Wiki? Send feedback