# General Modeling Issues Discussion Board

### What to Do with a Dataset that is Too Small?

John Fieberg (John.Fieberg@dnr.state.mn.us) asked the following on 2 Mar 04:

Do you have any suggestions for approaching problems where data are extremely limited [e.g., a logistic regression setting with min(0's, 1's) ~ 20]. I encounter these situations frequently in my work as a wildlife/fisheries statistician (logistics and funding constraints often result in small datasets). Obviously, we are limited in what we can learn by such studies - but at the same time, we want to do the best we can.

Based on the guidelines in your book (p. 61), in these situations I have typically tried to work with project investigators to limit the number of candidate predictors to the 1 or 2 most promising variables, assuming linearity and additivity hold. I have started to consider using the bootstrap validation/calibration approach outlined in your book, but wondered whether you thought this was worth attempting with such small sample sizes. I would expect the model not to converge for some bootstrap samples, and also expect the estimates of optimism, etc. to be very imprecise.
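The bootstrap validation/calibration approach the question refers to estimates the optimism of an apparent performance measure by refitting the model on resamples and evaluating each refit on the original data. A minimal sketch in Python with simulated data (the dataset, sample size, and use of AUC as the performance index are illustrative, not from the original post):

```python
# Sketch of optimism-corrected bootstrap validation for a logistic model
# at small n (~20 events, ~20 non-events, as in the question).
# All data here are simulated for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 40
X = rng.normal(size=(n, 2))
y = (rng.random(n) < 1 / (1 + np.exp(-X[:, 0]))).astype(int)

def fit_auc(X_fit, y_fit, X_eval, y_eval):
    # C=1e6 makes the fit effectively unpenalized
    m = LogisticRegression(C=1e6, max_iter=5000).fit(X_fit, y_fit)
    return roc_auc_score(y_eval, m.predict_proba(X_eval)[:, 1])

apparent = fit_auc(X, y, X, y)

optimisms = []
for _ in range(200):
    idx = rng.integers(0, n, n)          # bootstrap resample
    if len(np.unique(y[idx])) < 2:       # degenerate resample: skip it,
        continue                         # analogous to dropping non-converged fits
    boot_apparent = fit_auc(X[idx], y[idx], X[idx], y[idx])
    boot_test = fit_auc(X[idx], y[idx], X, y)
    optimisms.append(boot_apparent - boot_test)

corrected = apparent - np.mean(optimisms)
print(f"apparent AUC {apparent:.2f}, optimism-corrected {corrected:.2f}")
```

At this sample size the spread of the per-resample optimism estimates is itself worth inspecting; a wide spread is exactly the imprecision the question anticipates.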

**Reply**: There are two hopes for dealing with very small datasets: data reduction and penalization (shrinkage). To obtain a reliable analysis, the effective number of parameters estimated against the outcome must be in line with the effective sample size. Data reduction, such as variable clustering followed by scoring each group of related variables into a single scale, is a fairly simple and very interpretable method. Another approach is to fit an overly complex model and to obtain an unbiased estimate of its likely future performance (hope for the best but expect the worst). You raise a good point that the bootstrap and other model validation methods may themselves be unreliable when the sample size is small. The problem of non-convergence in some of the resamples can usually be dealt with by ignoring those samples.
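The two strategies in the reply can be sketched on simulated data: score a cluster of correlated predictors into a single scale (here via the first principal component, one reasonable scoring choice among several), or keep all predictors but shrink their coefficients with an L2 (ridge) penalty. The data, cluster structure, and penalty strength are illustrative assumptions:

```python
# (1) Data reduction: replace a cluster of 4 correlated predictors with
#     one summary score, so only 1 df is spent against the outcome.
# (2) Penalization: fit all predictors but shrink coefficients (ridge/L2).
# All data are simulated for illustration.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 40                                            # roughly the n in question
z = rng.normal(size=n)                            # latent trait
X = z[:, None] + 0.5 * rng.normal(size=(n, 4))    # 4 correlated predictors
y = (rng.random(n) < 1 / (1 + np.exp(-z))).astype(int)

# (1) one score replaces four parameters
score = PCA(n_components=1).fit_transform(X)
reduced = LogisticRegression(C=1e6, max_iter=5000).fit(score, y)

# (2) shrinkage: small C = strong L2 penalty; C=1e6 ~ unpenalized
unpen = LogisticRegression(C=1e6, max_iter=5000).fit(X, y)
ridge = LogisticRegression(C=0.2, max_iter=5000).fit(X, y)

print("reduced model fits", score.shape[1], "parameter(s) against the outcome")
print("ridge shrinks sum|beta| from",
      np.abs(unpen.coef_).sum().round(2), "to",
      np.abs(ridge.coef_).sum().round(2))
```

The key point is that the clustering and scoring in (1) use only the predictors, not the outcome, so the degrees of freedom spent against the outcome stay small.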

--

FrankHarrell - 02 Mar 2004