# A plot that illustrates the effect of categorizing continuous covariates in a logistic regression model

The idea behind this problem is the fact that fitting a categorized continuous variable in a logistic regression model can produce very misleading estimates when compared to fitting the original continuous variable. To illustrate this, a plot was generated that plotted the effects of a specific continuous variable from two separate logistic regresison models: (1) fitting the categorized version of the continuous variable in the model; and (2) keeping the continuous variable, continuous.

Let's first read in the data set (`data.csv`). The data set is a simulated data set that was designed for our first Puzzler --- the data was simulated such that there is a nonlinear interaction between sex and age. library(Design)

Now let's define the logistic regression model in which the continuous variables are kept continuous (and are fitted with restricted cubic splines). ff <- lrm(dz ~ sex*rcs(age, 4) + rcs(sysbp, 4), data=x) anova(ff) # Total D.F. = 10 # sex*age interaction is significant (p < 0.05)

Now let's created categorized versions of our continuous variables in order to fit the second model. x <- upData(x, agecut = cut(age, breaks=quantile(subset(age, dz==0), probs=c(0, 0.25, 0.50, 0.75, 1), na.rm=TRUE)), sysbpcut = cut(sysbp, breaks=quantile(subset(sysbp, dz==0), probs=c(0, 0.25, 0.50, 0.75, 1), na.rm=TRUE)) )

fcut <- lrm(dz ~ sex*agecut + sysbpcut, data=x) anova(fcut) # TOTAL D.F. = 10 # sex*age interaction no longer significant (p > 0.05)

Just to see the results of each model, let's plot the predicted effect of age across sex in each model. dd <- datadist(x) options (datadist = 'dd') par(mfrow=c(1,2)) plot(ff, age=NA, sex=NA, conf.int=FALSE) plot(fcut, age=NA, sex=NA, conf.int=FALSE)

We can use the `Hmisc` package's `print.plot.Design()` function to return what was plotted: print.plot.Design(plot(ff, age=NA, sex=NA, conf.int=FALSE)) # Adjusted to medians: sysbp = 120 print.plot.Design(plot(fcut, age=NA, sex=NA, conf.int=FALSE)) # Adjusted to most frequent level: sysbpcut=(117, 128] It is important to make sure that both model are adjusted to the same things. For example, if `fcut` wasn't adjusted to the quantile that spanned 120, we would use: par(mfrow=c(1,2)) plot(ff, age=NA, sex=NA, conf.int=FALSE) plot(fcut, age=NA, sex=NA, conf.int=FALSE, sysbpcut="(117,128]")

Now let's put the two plots together. plff <- plot(ff, age=NA, sex=NA, conf.int=FALSE) plfcut <- plot(fcut, age=NA, sex=NA, conf.int=FALSE)

par (mfrow =c(1,1), mar=c(5,5,4,2)+0.1, lwd=2) # First plot the spline regression results: age across sex plot(ff, age=NA, sex=NA, conf.int=FALSE, # ylim = range(c(range(plff\$x.xbetaLogOdds), # range(plfcut\$x.xbetaLogOdds))), ylim = c(-2, 2), label.curves=FALSE, lty=1, col=c("gray", "black")) # Now plot the categorical regression results: age across sex Fstep <- stepfun(x=with(x, quantile(subset(age, dz==0), prob=c(0.25, 0.50, 0.75), na.rm=TRUE)), y = subset(plfcut\$x.xbeta, sex=="female")\$"log odds") par(new=TRUE, col="gray", lty=2) lines(Fstep, xlim=c(0, 150)) Mstep <- stepfun(x=with(x, quantile(subset(age, dz==0), prob=c(0.25, 0.50, 0.75), na.rm=TRUE)), y = subset(plfcut\$x.xbeta, sex=="male")\$"log odds") par(new=TRUE, col="black", lty=2) lines(Mstep, xlim=c(0, 150)) par(new=TRUE, col="black", lty=1, lwd=1) legend(x=40, y=2, # y=max(c(range(plff\$x.xbetaLogOdds), # range(plfcut\$x.xbetaLogOdds))), xjust=0, yjust=1, legend = c("Female - Spline Regression", "Male - Spline Regression", "Female - Categorical Regression", "Male - Categorical Regression"), lty=rep(1:2, each=2), col = rep(c("gray", "black"), 2), lwd=2)
Topic revision: r2 - 15 Nov 2006, TheresaScott

• Biostatistics Webs

Copyright &© 2013-2022 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Vanderbilt Biostatistics Wiki? Send feedback