Restricted cubic splines for longitudinal data

By Scot McNary on Wednesday, October 16, 2002 - 05:16 pm:

Hi,

I've been using the Design package, ported to R last spring, to create restricted cubic splines for time (rcs) and have been using these to model behavior over time in some data I have.

The data are gathered weekly, and there are 4 weeks per treatment phase. I like applying the idea of the rcs since it shows a brief, transitory effect that I believe exists immediately following a change in treatment phase. For theoretical reasons I believe this effect is too fleeting to be captured by a coarser model, e.g., mean differences between treatment phases. So I'd like to use rcs to explain my results to a larger audience.

However, I've been wondering whether there are problems with the standard errors for the rcs variables with longitudinal data, due to autocorrelation among the observations. Specifically, positive autocorrelation biases the SEs for time downward (making effects more likely to appear significant) and negative autocorrelation biases them upward (less likely to appear significant). Do these kinds of concerns apply to the rcs SEs?

A second point: I've heard mention of some people using B-splines and C-splines for longitudinal data. Is there a source that compares or recommends rcs vs. B- or C-splines for longitudinal, or more generally, clustered data?

Any thoughts and/or references would be appreciated.

Thanks,

Scot

By Frank E Harrell Jr (Feh3k) on Thursday, October 17, 2002 - 02:55 pm:

Scot - These are excellent questions. You are using the "working independence model", which in many cases provides estimates of regression coefficients that are fairly efficient compared with methods (such as mixed models) that explicitly take the correlation structure into account when fitting the model. You can do a full after-the-fit intra-cluster correlation correction by running the fit object through bootcov (which I slightly prefer) or robcov (the cluster sandwich covariance estimator).
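
As a minimal sketch of that workflow in Design syntax (the data frame mydata and the variables y, week, phase, and id are hypothetical names, not from the original analysis):

  library(Design)   # in later R versions this functionality lives in the rms package

  # Hypothetical setup: one row per weekly observation, id identifies the subject
  dd <- datadist(mydata); options(datadist = "dd")

  # Working-independence fit: restricted cubic spline in time, clustering ignored
  f <- ols(y ~ rcs(week, 4) * phase, data = mydata, x = TRUE, y = TRUE)

  # After-the-fit intra-cluster correlation corrections
  f.boot <- bootcov(f, cluster = mydata$id, B = 500)   # cluster bootstrap
  f.rob  <- robcov(f, cluster = mydata$id)             # cluster sandwich estimator

  anova(f.boot)   # Wald tests now use the corrected covariance matrix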

Regarding the other spline bases, in my experience the fits are similar; I use natural splines because of their simple algebraic representation. I'm sure there are some papers comparing various bases, but I don't have my finger on them. One book that does touch on this is Therneau & Grambsch's "Modeling Survival Data: Extending the Cox Model".

By Scot McNary on Monday, October 21, 2002 - 02:51 pm:

Dr. Harrell,

Thanks for your note and spline reference. This discussion board is a generous offer of a valuable resource. I'll try to resist the temptation to flood it with questions! I have one follow-up question, though.

I took a look at bootcov and where its use is discussed in your text, and found the citation for the Feng, McLerran & Grizzle (1996) paper on clustered data with Gaussian errors.

My particular situation is addressed at the end of FM&G's simulation reports. I have few clusters (10) and unbalanced observations (8-54) in each cluster. FM&G recommend the bootstrap method when there are few clusters (K < 20), unless the data are seriously unbalanced; for that situation they recommend a method called the "four-stage method".

Given the ease of implementation of the bootstrap (via bootcov), and the fact that I can remove one case from my analysis to improve the balance of the data, it seems like the bootstrap is the best way to go.

Here is my follow-up question: when implementing the bootstrap, FM&G state that the bootstrap sample is drawn from the K clusters with size K. I interpret this to mean that if there are 10 clusters, the sample size for each of the B = 500 bootstrap samples is 10 clusters. Is this how the bootstrapping is implemented with the clustering option in bootcov?

Thanks,
Scot

By Frank E Harrell Jr (Feh3k) on Monday, October 21, 2002 - 08:56 pm:

I have not studied the four-stage method.

In cluster bootstrapping you select clusters with replacement, then take all of the observations within a cluster if that cluster is chosen. Think of the clusters as the primary data objects: do ordinary bootstrapping on the clusters and let the elements within each cluster tag along.
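
A toy illustration of one such resample in base R (mydata and id are the same hypothetical names as above):

  ids  <- unique(mydata$id)                          # the K clusters
  draw <- sample(ids, length(ids), replace = TRUE)   # draw K clusters with replacement
  # keep every observation from each drawn cluster; a cluster drawn twice
  # contributes all of its rows twice
  boot.sample <- do.call(rbind, lapply(draw, function(k) mydata[mydata$id == k, ]))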

By Frank E Harrell Jr (Feh3k) on Wednesday, April 23, 2003 - 06:20 am:

A new function in Design called glsD makes gls, the generalized least squares function in Pinheiro and Bates' nlme package, easy to use within Design. For example, using anova after glsD gives you automatic tests of linearity in time, time-by-covariate interactions, linearity of covariates, etc. glsD also does the cluster bootstrap if desired. Simulations seem to show that the bootstrap is not needed if the clusters are multivariately normally distributed, because the standard errors from MLE or REML work quite well. These simulations were done for the balanced case. With heavy imbalances, the parametric standard errors may perform better than the cluster bootstrap.
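
A rough sketch of what a call might look like, using the same hypothetical variable names as above and nlme's corCAR1 (a continuous-time AR(1) structure chosen here only for illustration; the exchange below discusses corSymm instead):

  library(Design)
  library(nlme)

  dd <- datadist(mydata); options(datadist = "dd")

  # Generalized least squares with a spline in time and a within-subject
  # correlation structure
  g <- glsD(y ~ rcs(week, 4) * phase, data = mydata,
            correlation = corCAR1(form = ~ week | id))

  anova(g)   # tests of linearity in week, week-by-phase interactions, etc.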

By Scot McNary on Thursday, June 19, 2003 - 01:36 pm:

Dr. Harrell,

I just saw your message from April about the glsD function. I've updated my Design source to 1.1-6 and have been experimenting with it. I'm in between help pages, though. The gls help page got me started on the syntax of the glsD function, but I have a couple of questions that go beyond it. First, what's a good resource that describes how to call the cluster bootstrap in glsD -- is it the same as in bootcov? A peek at the source shows a B= parameter, but I can't quite make out what else is needed.

Second, is there some specification of a time variable in glsD that produces the tests of linearity in time you mention? I understand anova will return those tests, but doesn't glsD need to know which of my variables are covariates and which is the time variable? Is it in the correlation = corSymm(form = ~ time | subject) statement? If so, this only seems to accept integer values for time.

Speaking of bootcov, the group specification was a nice addition. Is there some background reading you could point to on the kinds of situations in which it might be used? In my situation it was useful because I had indicators for each individual in my model (few individuals, many repeated measures) used as nuisance variables. When running the bootstrap, not all individuals would be pulled into a given sample, so the parameter estimates for these nuisance variables could not be estimated and bootcov could not produce bootstrap estimates.

Specifying "group = id" made sure that the number of individuals with observations was the same in each bootstrap, which matched the number of individuals in the original sample. This allowed the id indicators parameter estimates to be calculated. Very helpful. Just for general curiosity purposes, can you give other examples where this might be used?

Thanks a lot,

Scot McNary

By Frank E Harrell Jr (Feh3k) on Sunday, June 22, 2003 - 07:06 am:

Unlike with other fitting functions in Design, B= causes the bootstrapping to be done within glsD itself. If you specify B, the other functions such as plot, anova, and summary will use the bootstrap covariance matrix (which is based on the cluster bootstrap). You don't use bootcov with glsD.
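
Continuing the hypothetical example from above, adding B= to the glsD call would then make the downstream Design functions use the cluster-bootstrap covariance matrix:

  g.b <- glsD(y ~ rcs(week, 4) * phase, data = mydata,
              correlation = corCAR1(form = ~ week | id),
              B = 500)                 # cluster bootstrap done within glsD

  anova(g.b)     # Wald-type tests based on the cluster-bootstrap covariance
  summary(g.b)   # effect estimates likewise use the bootstrap covariance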

Time is just like any other covariate, except that you also specify time in the correlation argument. But I have been unable to get corSymm to work with glsD.

group in bootcov is intended for the k-sample problem, to get the same number of observations for each treatment group in every resample. It is not for subject clusters (the cluster argument handles that). The cluster bootstrap cannot select every subject and still be correct. The cluster bootstrap works for generalized least squares but not for models in which the subject ID is part of the "mean" portion of the model.
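
Schematically, with the same hypothetical names (and phase standing in here for a between-subject treatment factor in a k-sample design):

  f.k <- bootcov(f, group = mydata$phase, B = 500)   # k-sample problem: each resample keeps
                                                     # the original count per treatment group
  f.c <- bootcov(f, cluster = mydata$id, B = 500)    # repeated measures: resample whole subjects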