Ordinal Logistic Regression

UVA Biostatistics Discussion Board: Regression Modeling Strategies: Ordinal Logistic Regression
By Frank E Harrell Jr (Feh3k) on Friday, November 08, 2002 - 08:36 am:

This topic is for discussions related to modeling ordinal response data.

By Frank E Harrell Jr (Feh3k) on Friday, November 08, 2002 - 09:08 am:

Discrimination and Calibration of Proportional Odds and Continuation Ratio Models


Javier Seoane posted the following questions.

I am having some problems trying to implement an ordinal regression with the functions in your Design library for S-Plus 2000 (the readme.txt file of the library is dated 05/november/99). I have been reading your book "Regression Modeling Strategies" (2001, first printing; please allow me to call it RMS hereafter) but cannot get rid of some doubts, so I dare ask you some questions about PO and CR models. I realize that this message is very long and contains many questions, but after reading RMS thoroughly and struggling with S-plus (whose intricacies I am learning on my own) I seem unable to move forward. I will really appreciate your advice and will welcome any comments you could kindly give me.

Let me first briefly present my study.
My aim is to estimate the severity of recent fires in Mediterranean forests with satellite imagery, to help in restoration. Severe fires (those that last long and reach extreme temperatures) burn most of the trees, leaving them leafless and branchless and producing a great amount of ash and charcoal. On the contrary, slight fires (those that pass quickly and do not reach very high temperatures) burn only a fraction of the trees, producing less ash and charcoal. To restore burnt landscapes efficiently we need to know the degree of damage each area suffered, because different measures will be implemented in each case. We designed a field survey in which we recorded the severity of recent fires in 6 categories, ranging roughly from "no leaves damaged" (a very slight fire) to "no leaves nor branches kept alive" (a definitely severe fire). However, field surveys are expensive and time-consuming, so a beneficial alternative would be to relate our field categories to the information provided by satellite imagery (cheaper for covering large areas). Sensors carried on board satellites are able to detect charcoal and ashes. Each of the satellites' several sensors renders a value for each unit of area (typically 30x30 meters) that can be considered continuous (they are originally on a 0-256 scale).

Then the question is, can I predict with the sensors the severity of fire as measured in the field?

The response variable is "severidad" (an ordered factor with 6 levels), and the predictors are 6 particular sensors (named TM1, TM2, TM3, TM4, TM5 and TM7, each detecting a different part of the electromagnetic spectrum). I followed chapters 13 and 14 of your book and first assessed the ordinality assumption for each sensor by inspecting graphs from plot.xmean.ordinaly. The PO and, especially, the CR assumptions appeared to be satisfied. Smoothed partial residuals (with resid(..., pl=T, ...)) showed reasonably linear relationships. I used the following code:


> version
S-PLUS 2000 Professional Edition Release 2 for Microsoft Windows : 1999
attach(FUEGOS2)
> levels(severidad)
[1] "1" "3" "4" "5" "7" # note that I dropped level 2 (incoherent with the rest) and 6 (unreliably measured)
plot.xmean.ordinaly(severidad ~ TM1 + TM2 + TM3 + TM4 + TM5 + TM7, cr=T)
f1 <- lrm(severidad ~ TM1 + TM2 + TM3 + TM4 + TM5 + TM7, x=T, y=T)
resid(f1, 'partial', pl=T)


1. Let's take a look at this first PO model (whose output I do not seem to understand):


> f1
Logistic Regression Model
lrm(formula = severidad ~ TM1 + TM2 + TM3 + TM4 + TM5 + TM7, x = T, y = T)

Frequencies of Responses
1 3 4 5 7
2386 3951 32766 10420 116

Obs Max Deriv Model L.R. d.f. P C Dxy Gamma Tau-a R2 Brier
49639 9e-011 1369.73 6 0 0.592 0.184 0.187 0.094 0.032 0.045

Coef S.E. Wald Z P
y>=3 2.113 0.05538 38.17 0
y>=4 1.035 0.05312 19.48 0
y>=5 -2.274 0.05414 -42.01 0
y>=7 -7.038 0.10728 -65.61 0
TM1 -5.517 0.97349 -5.67 0
TM2 11.896 2.12039 5.61 0
TM3 -6.798 0.89274 -7.61 0
TM4 7.769 0.48203 16.12 0
TM5 -20.160 0.92295 -21.84 0
TM7 23.497 0.76829 30.58 0


1.1. What do the C and Dxy above refer to? I understand that C is the area under a ROC curve (the measure of discrimination I am interested in). I am used to calculating C for dichotomous outcomes, but here I have 5 possible outcomes. Does this C=0.59 mean that model f1 will rank correctly a group of 5 observations (one belonging to level 1, one to 3, and so on) in 6 out of 10 trials? (I am trying an analogy with the definition of C, for example in RMS p.335.)

1.2. I think I have to worry about calibration. In RMS (p.357) you assess the calibration of a model by computing "... predicted probabilities of Y=2 for all infants from the model and compared these with predictions from a customized binary logistic model derived to predict Pr(Y=2). The mean absolute difference... is only 0.02, but the 0.90 quantile of that difference is 0.059." Your response variable there had 3 levels (named 0, 1 and 2). Similarly, I will need to predict with model f1 the probability of severidad=7 and compare it with a binary logistic model in which the "0's" are severity levels 1, 3, 4, 5 and the "1's" are severity level 7. Is this correct?
The code I tried is:


quantile(abs(predict(f1, type="fitted.ind")[,"severidad=7"] -
  predict(lrm(FUEGOS2$severidad3 ~ TM1 + TM2 + TM3 + TM4 + TM5 + TM7, x=T, y=T),
  type="fitted")), 0.90)
# result: 0.002273096 (the mean absolute difference is 0.001)


where severidad3 was created with FUEGOS2$severidad3 <- as.factor(ifelse(FUEGOS2$severidad != 7, "0", "1"))

I think this piece of code may be right but the results are surprisingly good (at least for this level of fire severity).

1.3. How can I assess the prediction success for each particular level of fire severity? (e.g., how many observations measured in the field as level-7 severity are correctly predicted as such by the model?).
My clumsy approach is:


> somers2(predict(f1, type = "fitted.ind")[, "severidad=7"], as.numeric(FUEGOS2$severidad3) - 1)
C Dxy n Missing
0.7287668 0.4575336 27988 0


This is a bit clumsy because I first need to create severidad3 as a factor (1,0) and then convert it to numeric (1,0) for the function somers2 to work. But at this point I am more worried about the soundness of the approach than about its clumsiness. Can you comment on the approach?

1.4. And, finally, what are the coefficients in the table above? I expect to find 5 intercepts, one for each level of the response (according to the model stated in RMS p.333), and a common vector of 6 slopes for the predictors. Is Coef(y>=3) = 2.113 the intercept for Pr[Y>=3|X]? If so, what is the intercept for Pr[Y>=1|X]; is it 0? Are the TM coefficients the slopes for the predictors? (If so, I am afraid there is not a common vector of regression coefficients in this case.) In the end I want to have an equation to calculate Pr[Y=1|X], Pr[Y=3|X], and so on. Am I correct if I try to do so by Pr[Y=3|X] = Pr[Y>=3|X] - Pr[Y>=4|X] - Pr[Y>=5|X] - Pr[Y>=7|X]? (Although I could use predict(f1, type="fitted.ind"), I would like to be able to write the equations.)

2. But a more appropriate model for my study would probably be the CR, am I right? (again, I have a bunch of questions):
This is what I try:


u <- cr.setup(severidad)
attach(FUEGOS2[u$subs,])
y <- u$y
cohort <- u$cohort
full.2 <- lrm(y ~ cohort*(TM1+TM2+TM3+TM4+TM5+TM7), x=T, y=T)
anova(full.2)

> anova(full.2)
Wald Statistics Response: y

Factor Chi-Square d.f. P
cohort (Factor+Higher Order Factors) 23331.38 21 <.0001
All Interactions 1229.81 18 <.0001
TM1 (Factor+Higher Order Factors) 40.86 4 <.0001
All Interactions 12.46 3 0.006
TM2 (Factor+Higher Order Factors) 64.40 4 <.0001
All Interactions 34.86 3 <.0001
TM3 (Factor+Higher Order Factors) 125.39 4 <.0001
All Interactions 62.80 3 <.0001
TM4 (Factor+Higher Order Factors) 356.48 4 <.0001
All Interactions 248.81 3 <.0001
TM5 (Factor+Higher Order Factors) 320.52 4 <.0001
All Interactions 81.87 3 <.0001
TM7 (Factor+Higher Order Factors) 710.90 4 <.0001
All Interactions 40.48 3 <.0001
cohort * TM1 (Factor+Higher Order Factors) 12.46 3 0.006
cohort * TM2 (Factor+Higher Order Factors) 34.86 3 <.0001
cohort * TM3 (Factor+Higher Order Factors) 62.80 3 <.0001
cohort * TM4 (Factor+Higher Order Factors) 248.81 3 <.0001
cohort * TM5 (Factor+Higher Order Factors) 81.87 3 <.0001
cohort * TM7 (Factor+Higher Order Factors) 40.48 3 <.0001
TOTAL INTERACTION 1229.81 18 <.0001
TOTAL 23446.83 27 <.0001


2.1. There is strong evidence against the constant slopes assumption (chi-square=1230 with 18 d.f., P<0.0001), and the values from the sensors (factors TM1 to TM7) seem to be significantly associated with the response; is that interpretation right? But then, why does the approximate model tried with fastbw(full.2, aics=1e10) retain no significant factors? Maybe because the factors are only significant through their interaction with cohort?


> fastbw(full.2, aics = 10000000000)

Deleted Chi-Sq d.f. P Residual d.f. P AIC
TM4 1.48 1 0.2243 1.48 1 0.2243 -0.52
TM1 7.48 1 0.0062 8.96 2 0.0113 4.96
cohort * TM1 35.09 3 0.0000 44.05 5 0.0000 34.05
cohort * TM7 56.65 3 0.0000 100.70 8 0.0000 84.70
cohort * TM3 51.78 3 0.0000 152.48 11 0.0000 130.48
TM2 52.93 1 0.0000 205.41 12 0.0000 181.41
TM3 21.45 1 0.0000 226.86 13 0.0000 200.86
cohort * TM2 140.59 3 0.0000 367.45 16 0.0000 335.45
TM5 327.54 1 0.0000 694.99 17 0.0000 660.99
cohort * TM4 743.38 3 0.0000 1438.37 20 0.0000 1398.37
cohort * TM5 254.82 3 0.0000 1693.19 23 0.0000 1647.19
TM7 609.51 1 0.0000 2302.70 24 0.0000 2254.70
cohort 21144.13 3 0.0000 23446.83 27 0.0000 23392.83

Approximate Estimates after Deleting Factors

Coef S.E. Wald Z P
Intercept -0.5642 0.01039 -54.32 0

Factors in Final Model

None


2.2. I interpret the above table as TM4 being the first factor to be deleted and TM7 the last. Moreover, there seems to be a relatively important effect of cohort (but what does it mean? Is it that the change between levels of severidad is not constant?).

2.3.a. I do not fully understand the table with the coefficients for this model:


> full.2

Logistic Regression Model

lrm(formula = y ~ cohort * (TM1 + TM2 + TM3 + TM4 + TM5 + TM7), x = T, y = T)


Frequencies of Responses
0 1
52579 27930

Obs Max Deriv Model L.R. d.f. P C Dxy Gamma Tau-a R2 Brier
80509 7e-008 43354.01 27 0 0.89 0.78 0.782 0.354 0.574 0.115

Coef S.E. Wald Z P
Intercept -1.4605 0.1308 -11.16 0.0000
cohort=severidad>=3 0.6042 0.1667 3.62 0.0003
cohort=severidad>=4 3.4623 0.1592 21.74 0.0000
cohort=severidad>=5 7.5171 0.8867 8.48 0.0000
TM1 5.4853 2.2841 2.40 0.0163
TM2 -37.0759 5.0946 -7.28 0.0000
TM3 19.8835 1.9553 10.17 0.0000
TM4 -1.1661 0.9596 -1.22 0.2243
TM5 20.0133 1.9893 10.06 0.0000
TM7 -30.6859 1.6974 -18.08 0.0000
cohort=severidad>=3 * TM1 5.3304 2.9562 1.80 0.0714
cohort=severidad>=4 * TM1 -3.4826 2.8239 -1.23 0.2175
cohort=severidad>=5 * TM1 4.2484 15.1476 0.28 0.7791
cohort=severidad>=3 * TM2 24.8788 6.5049 3.82 0.0001
cohort=severidad>=4 * TM2 35.0899 6.2576 5.61 0.0000
cohort=severidad>=5 * TM2 85.1271 33.9979 2.50 0.0123
cohort=severidad>=3 * TM3 -19.7647 2.5794 -7.66 0.0000
cohort=severidad>=4 * TM3 -13.0976 2.4704 -5.30 0.0000
cohort=severidad>=5 * TM3 -41.1734 16.0831 -2.56 0.0105
cohort=severidad>=3 * TM4 3.2222 1.2567 2.56 0.0103
cohort=severidad>=4 * TM4 -13.7864 1.2517 -11.01 0.0000
cohort=severidad>=5 * TM4 20.3277 12.2499 1.66 0.0970
cohort=severidad>=3 * TM5 -15.5265 2.6500 -5.86 0.0000
cohort=severidad>=4 * TM5 3.6152 2.5768 1.40 0.1606
cohort=severidad>=5 * TM5 -43.7141 11.2278 -3.89 0.0001
cohort=severidad>=3 * TM7 13.9246 2.2167 6.28 0.0000
cohort=severidad>=4 * TM7 9.8559 2.1648 4.55 0.0000
cohort=severidad>=5 * TM7 6.1849 10.5633 0.59 0.5582


2.3.b. What does the frequency of responses (1 and 0) refer to? Is the coefficient for the Intercept to be interpreted as the coefficient for cohort=severidad>=1, and are the rest of the coefficients to be interpreted as increments from this (RMS p.339)? Again I am interested in giving a final equation for the (unconditional) probabilities of each level. Following RMS (p.338):

Pr(Y>1|X) = Pr(Y>1|X, Y>=1) * Pr(Y>=1|X)
          = (1 - logistic(t1 + X*g)) * (1 - logistic(t0 + X*g))

where logistic(z) = 1/(1+exp(-z)). I guess that in my study t0 is -1.46 and t1 is (-1.46 + 0.60); is that correct?
But this formulation ignores the cohort*TM interactions. What would be the formulation with the interactions incorporated?
I am thinking of something like:

Pr(Y>1|X) = (1 - logistic(t1 + X*g + X*(interaction coefficients for level 1)))
          * (1 - logistic(t0 + X*g + X*(interaction coefficients for level 0)))

2.4. Wald tests for many of the terms in the formula (the call creating full.2) give significant results, which resembles the table from anova(full.2) but contrasts with the results from fastbw(full.2). My aim is to provide a simple formula to predict fire severity as it would be measured in the field, so I am concerned with getting rid of some irrelevant TM factors (most of them are correlated). Following one of the more important messages in RMS, I should have selected the TM factors to include in the model beforehand, and I will (this is only a first training example). But let's say I select four TM factors beforehand and want to try an approximate model; should I guide myself by the results from fastbw (as in RMS p.366) or could I rely on the results from anova?

2.5. To assess the predictive ability of the model I use (following RMS p.362):

> somers2(predict(full.2)[cohort == "all"], y[cohort == "all"])
C Dxy n Missing
0.6934589 0.3869177 27988 0
> somers2(predict(full.2)[cohort == "severidad>=3"], y[cohort == "severidad>=3"])
C Dxy n Missing
0.6479558 0.2959117 25602 0


That means C=0.69 for severidad>=1 and C=0.65 for severidad>=3. I would estimate an overall C by the mean over the k-1 levels of the response (or is the C=0.78 given in 2.3.a by printing full.2 better?). To have an estimate for severidad=1 I would follow an approach similar to the one in part 1.3 of this message (but I have not been able to figure out the code yet).

2.6. Finally, to validate the final model (before simplifications) I will use:
validate(final.model, B=..., cluster=u$subs, subset=cohort=='all'), as in RMS p.369.
The estimates of, say, Dxy given by the above formulation refer to Pr(Y=0). I guess that I should write subset=cohort=='severidad>=3', subset=cohort=='severidad>=5', and so on to have estimates of discrimination ability for the rest of the levels. But what if I want an overall estimate of success for the model? Would the mean of the estimates for the particular levels be sensible?


Javier Seoane
Department of Applied Biology
Estación Biológica de Doñana, CSIC
Avda. María Luisa s/n
41013, Sevilla
SPAIN

By Frank E Harrell Jr (Feh3k) on Friday, November 15, 2002 - 08:35 pm:

Here are some answers to many of your questions.

1.1.
C here is the proportion of concordant pairs of (linear predictor, Y), i.e., the fraction of pairs differing on Y for which the linear predictor was larger when Y was larger. A large C index is more difficult to obtain the more categories there are. Still, the predictive discrimination of your model is modest. Nonlinear terms or interactions may help.
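To make that definition concrete, a brute-force version of C might look like the sketch below. It loops over all pairs, so it is only practical for small samples (rcorr.cens in the Hmisc library computes the same quantity efficiently):

cindex <- function(lp, y) {
  # among pairs differing on y, the fraction for which the linear
  # predictor lp is larger when y is larger; ties on lp count 1/2
  conc <- npairs <- 0
  n <- length(y)
  for(i in 1:(n - 1)) for(j in (i + 1):n)
    if(y[i] != y[j]) {
      npairs <- npairs + 1
      if(lp[i] == lp[j]) conc <- conc + 0.5
      else if((lp[i] - lp[j]) * (y[i] - y[j]) > 0) conc <- conc + 1
    }
  conc / npairs
}
# e.g., on a small subsample:
# cindex(predict(f1)[1:500], as.numeric(severidad)[1:500])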


1.2.
Yes, although by this operation you assume that the binary model fits the data, i.e., that everything is linear and additive. This is mainly checking the PO assumption.

"I think this piece of code may be right but the results are surprisingly good (at least for this level of fire severity)."

That all looks good. For making the binary variable you can just use severidad3 <- 1*(severidad==7) (or 1*(severidad=='7') if severidad is a factor).

1.3.
In terms of 'clumsiness', you can just do somers2(..., FUEGOS2$severidad=='7'). This approach is not looking at "correctly predicted" but at predictive discrimination. A C of 0.73 is generally thought to be not quite useful for individual prediction. But another way to answer that is to look at the frequency distribution of Y when the predicted values are in a small interval, to see how much residual variation there is. This is a direct way to judge the utility of the predictions.
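A sketch of that direct check, assuming the objects from the question (f1, FUEGOS2) and the cut2 function from Hmisc:

p7 <- predict(f1, type='fitted.ind')[, 'severidad=7']  # estimated Prob[Y=7]
g  <- cut2(p7, g=10)           # bin predictions into 10 groups of equal size
table(g, FUEGOS2$severidad)    # distribution of observed Y within each bin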

1.4.
You are right about the intercepts, except that since Y=1 is the lowest level it has no intercept (Prob[Y>=1] = 1.0). You can only compute Prob[Y=1] which you get by taking 1 - Prob[Y>=3].
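On the last equation in 1.4: individual level probabilities come from differencing adjacent exceedance probabilities only, e.g. Prob[Y=3|X] = Prob[Y>=3|X] - Prob[Y>=4|X], not from subtracting all higher ones. A sketch using the fit f1 from the question:

cp <- predict(f1, type='fitted')    # columns y>=3, y>=4, y>=5, y>=7
p1 <- 1 - cp[, 'y>=3']              # Prob[Y=1]
p3 <- cp[, 'y>=3'] - cp[, 'y>=4']   # Prob[Y=3]; levels 4 and 5 are analogous
p7 <- cp[, 'y>=7']                  # Prob[Y=7], the highest level
# these reproduce the columns of predict(f1, type='fitted.ind')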

2.
The CR model in its plain form can fit worse than PO, but it can also fit better (as in my case study). Its worth is in how easy it is to extend.

2.1.
Yes, your interpretation is right.

Use aics=1e10 just to get fastbw to remove ALL variables, for use in model approximation. And note that fastbw does not currently impose the right hierarchical structure on the predictors, i.e., it can delete main effects before interactions. So I would not use it for your purpose.

2.2.

2.3.a
That is a complex model. You have to state it in terms of Prob[Y=j | Y>=j] and apply the right cohort terms to get that. It's too late in the evening for that!

2.3.b.
The frequency of 1s and 0s is for all cohorts combined, so it is not very meaningful, due to the dummy Ys that are constructed to make the CR method work with regular MLE.

Your interpretation of the Intercept is pretty much correct. By the way, if you are going to allow all slopes to be unequal (which your sample size is large enough to allow), you may get an easier-to-interpret model using polytomous logistic regression (see the multinom function in the nnet library).
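A sketch of that alternative, assuming the data frame FUEGOS2 from the original question:

library(nnet)
m <- multinom(severidad ~ TM1 + TM2 + TM3 + TM4 + TM5 + TM7, data=FUEGOS2)
# a separate coefficient vector per response level, at the cost of
# ignoring the ordering of the categories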

I didn't take time to check all your other equations, but be sure not to multiply anything on the logit (linear predictor) scale.
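As a sketch of what that means, take the intercept and the cohort=severidad>=3 increment from the fit above, suppressing the TM terms and their interactions for brevity:

lp0 <- -1.4605            # logit for cohort 'all' (X terms omitted)
lp1 <- -1.4605 + 0.6042   # logit for cohort severidad>=3 (X terms omitted)
# combine on the probability scale:
p.gt1 <- (1 - plogis(lp0)) * (1 - plogis(lp1))   # Pr(Y>1|X)
# never multiply lp0 and lp1 themselves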

2.4.
As stated above, ignore the fastbw output. Deleting "unimportant" terms can lead to problems, but if you want to proceed, do a manual backward stepdown based on anova(), looking at the pooled interaction+main effect tests first.
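A sketch of one such manual step, under the assumption that you drop the variable with the weakest pooled test (TM1 above, with the smallest pooled chi-square) together with its cohort interaction, then re-examine:

anova(full.2)   # inspect the pooled main effect + interaction tests
f.r <- lrm(y ~ cohort * (TM2 + TM3 + TM4 + TM5 + TM7), x=T, y=T)
anova(f.r)      # repeat until all remaining pooled tests are important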

2.5.
It's hard to get an overall C when in effect you have multiple models due to the unequal slopes. You may need to stop with C indexes for individual binary events.

2.6.
No, I don't think the mean is very sensible. At present there is no way to get overall statistics like you want for the extended CR model.

By Frank E Harrell Jr (Feh3k) on Saturday, December 14, 2002 - 07:30 am:

Variable Selection and Functional Form Specification


Hongjie Wang e-mailed the following questions, to which I have interspersed some responses.

I am trying to find some references and routines to perform variable selection and transformation for building an ordinal logit model. I start with 500 some variables (in a database marketing application) with outcome variable y=0,1,2.

Here is what I am going to do. For each variable x, assess the proportional odds assumption. Let F_j denote P(y<=j), j=0,1,2. The model is then log(F_j/(1-F_j)) = alpha_j + beta*X. Thus, logit(F_j) - logit(F_i) is independent of X. Harrell suggests lots of graphical methods in his book to perform the assessment. But my question is, if a particular variable x does not seem to meet this assumption, should I automatically drop it? In my case, I also did chi-square tests between pairs of variables. I noticed lots of x's where the chi-square is quite big but the PO assumption is not met. Is there any transformation that can be applied to such variables?

A few observations:


For the other variables that met PO, and after some reduction (say factor analysis, PCA, or variable clustering), I come up with 50 some variables. I wonder what would be a practical way to detect the appropriate functional form between x and y. It seems to me that the model is so restrictive that almost no nonlinear relationship is going to be statistically significant.

I would do data reduction first, before looking at the PO assumption. Nonlinear principal components (e.g., the transcan function in Hmisc) may help.
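A sketch of that suggestion, following the transcan-then-princomp pattern used in the RMS case studies (the names x1, x2, x3 and the data frame d are placeholders):

library(Hmisc)
w  <- transcan(~ x1 + x2 + x3, transformed=TRUE, pl=FALSE, data=d)
pc <- princomp(w$transformed)  # components of the nonlinearly transformed X's
# use the first few components as candidate predictors in the model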

Since beta (the coefficient of x) is independent of category of y, if there is a negative relationship between logit(F1) and x and a positive one between logit(F2) and x, overall, beta could be close to 0 and not significant, although x is indeed very different from groups formed by categories of y.

If I detect an x^2 relationship between logit(F1) and x, I have to make sure the same relationship holds between logit(F2) and x to apply the transformation. I realize that such a restriction can be relaxed if I use a multinomial model, but I am afraid of losing the ordering information, which is important for the application.

This is where, as in my case study, I would consider allowing unequal slopes and unequal x^2 coefficients across categories, but differentially penalizing the two sets of interaction terms. In some situations nonlinear effects may be more similar across categories than linear effects. But this is only easy to do for the continuation ratio model.

Can someone give me some insights? For example, I have an income variable that looks like this:

y=0 y=1 y=2
mean income 200 100 200


A simple t test shows the mean difference is big enough. However, when fitting an ordinal logit model, this variable is deemed insignificant. But even from a descriptive standpoint, we can say that if your income is low, you tend to be in group y=1. Is there any way such a variable can be used after an appropriate transformation?

Transformations of X don't generally fix non-PO. If you use the extended continuation ratio model you can allow for interactions between selected variables and categories of Y.

The ordinal logit model is gaining lots of popularity in the database marketing area, specifically in building predictive models for situations where a campaign was sent out and then 5% of the customers bought over $1000 in goods, another 10% bought in the (0, 1000) range, and the majority did not buy. An OLS model does not work so well, especially when I also need the probability of buying. One of the unfortunate peculiarities of database marketing is that the data often contain lots of noise along with missing values. That makes variable clustering (which one way or another relies upon a covariance matrix) unreliable. Often the number of variables is up to 1000, with much redundancy.

For those kind of data, economists frequently use a Heckman two-stage model in which one uses a binary logistic model to model the probability of any purchase, and an OLS model for predicting the amount of purchase given a purchase was made. Multiplying these two models results in the expected purchase amount, and the two models can have different structures.
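A minimal sketch of the two-part version of that idea (a full Heckman model also adds a selection-correction term); the names bought, amount, income, recency, and the data frame d are all hypothetical:

stage1 <- glm(bought ~ income + recency, family=binomial, data=d)   # Pr(any purchase)
stage2 <- lm(amount ~ income + recency, data=d, subset=bought == 1) # amount given purchase
expected <- predict(stage1, newdata=d, type='response') *
            predict(stage2, newdata=d)                              # expected purchase amount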

I do have two comments on your reply. The CR approach, as I understand it, assumes the categories represent some sort of a progression, and it is most appropriate when a subject went through them in sequence. This might be appropriate in a medical study, but not necessarily in marketing, where the decision to respond is based on the situational variables, the utility, and the marketing mix at that time. In this case, is there a way I can somehow still use variables that violate the PO assumption?

The assumptions of the CR model can be thought of in the way you stated. However even when the process you are modeling does not work like that, the CR model may empirically provide a good fit.

In your book, variable clustering is done on sets of binary variables. I do variable clustering quite a lot, but mostly on interval variables only. Since I use the distance matrix, which is essentially a function of the correlation matrix, the regular Pearson correlation might not be valid for binary items. (I do almost everything in SAS, but find your book incredibly helpful.)

I routinely do variable clustering on mixtures of binary (and polytomous) variables and continuous variables. The Spearman correlation coefficient is not ideal for binary variables but I think it works well enough.
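A sketch of that practice using the Hmisc varclus function, whose default similarity measure is the squared Spearman correlation and which accepts binary and continuous variables together (the variable names here are hypothetical):

library(Hmisc)
v <- varclus(~ age + income + owns.home + responded.before,
             similarity='spearman', data=d)
plot(v)   # dendrogram of variable clusters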

By Javier Seoane on Monday, March 24, 2003 - 04:27 pm:

Dear all,

Two simple questions that, nevertheless, are getting me stuck.

After building a proportional odds logistic model for bird species abundances (a 6-level ordered factor) in several plots, I want to assign a predicted abundance to each plot.

The following:
predict(lrm.model, type="fitted")

gives the predicted probabilities of exceeding each boundary between levels; these are the same as:

Pr(Y>=2|X) = 1 / (1 + exp(-(boundary intercept 1/2 + slope-1*predictor-1 + slope-2*predictor-2 + ...
             + slope-p*predictor-p)))
Pr(Y>=3|X) = 1 / (1 + exp(-(boundary intercept 2/3 + slope-1*predictor-1 + slope-2*predictor-2 + ...
             + slope-p*predictor-p)))
...

for the k boundaries between the j levels (k=5 and j=6 in my case) and the p predictors.

1. But, what are the values of:
predict(lrm.model, type="fitted.ind")?

Are they really the probabilities for each level?

2. And, what is the suggested way to select a single prediction for each case (that is, to predict a single category)? The first thing that comes to mind is to select the category with the highest predicted probability:


> predict(model.lrm, type = "fitted.ind")
ERIRUBc=1 ERIRUBc=2 ERIRUBc=3 ERIRUBc=4 ERIRUBc=5 ERIRUBc=6
1 0.2424018 0.1773136 0.07416807 0.13623209 0.17147442 0.19841002
2 0.2469122 0.1787591 0.07431371 0.13580089 0.16965735 0.19455672
3 0.2469122 0.1787591 0.07431371 0.13580089 0.16965735 0.19455672
...
37 0.5685846 0.1801159 0.05207815 0.07448775 0.06804894 0.05668466
...

which would lead to predicting level 1 for cases 1, 2, 3, ..., 37 and, actually, for my whole set of observations.

A second possibility (with type="fitted") could be to look for a 0.5 threshold and thus select level 5 for cases 1 to 3, and level 1 for case 37:

> predict(model.lrm, type = "fitted")
y>=2 y>=3 y>=4 y>=5 y>=6
1 0.7575982 0.5802846 0.5061165 0.36988445 0.19841002
2 0.7530878 0.5743287 0.5000150 0.36421407 0.19455672
3 0.7530878 0.5743287 0.5000150 0.36421407 0.19455672
...
37 0.4314154 0.2512995 0.1992213 0.12473360 0.05668466


I would appreciate any suggestion,


> version
S-PLUS 2000 Professional Edition Release 2 for Microsoft Windows : 1999

By Frank E Harrell Jr (Feh3k) on Monday, March 24, 2003 - 10:27 pm:

Good questions. Yes, type='fitted.ind' gives the individual estimated probabilities of class membership. Choosing the most likely category is one good choice (apply() and order() will help). If the category levels are valid numeric values that are meaningful, you can specify type='mean' to get a predicted mean score. You could also use a median category but with the discreteness of the categorical response this may not work well.

I haven't checked this but code for getting the modal category might look like this:

(t(apply(-predict(f,type='fitted.ind')[,,drop=FALSE],1,order)))[,1]

By Javier Seoane on Thursday, March 27, 2003 - 02:04 pm:

Thank you for your help, Dr. Harrell,

your code worked nicely; after some trials I got a little simplification by subscripting directly the result of apply (no need to transpose and then subscript it):

apply(-predict(erirub.lrm,type='fitted.ind')[,,drop=FALSE],1,order)[1,]

where erirub.lrm is the model. I do not understand the "[,,drop=FALSE]", but I guess it avoids problems with unused levels (if I had any). Dropping "[,,drop=FALSE]" or changing it to "[,,drop=TRUE]" does not change the results.

What worries me, though, are the results from:
predict(erirub.lrm, type="mean")

which are generally similar to the modal categories but not identical. For example:
> cbind(
round(predict(erirub.lrm, type = "mean"), 0),
apply( - predict(erirub.lrm, type = "fitted.ind")[, , drop = FALSE], 1, order)[1, ])[1:10, ]

[,1] [,2]
1 1 1
2 3 2
3 5 6
4 3 2
5 3 2
6 5 6
7 5 6
8 5 6
9 4 5
10 3 2

My category levels are meaningful values (counts categorized as 1 if count=0, 2 = counts higher than zero up to the 0.2 quantile, 3 = counts up to the 0.4 quantile, and so on). Knowing how "predict" calculates the mean score would help me decide between these two ways of assigning class membership (I thought the mean score for case 1 would be a weighted mean, but I cannot reproduce it).

Could you give me a clue as to how the mean score is calculated?

By Frank E Harrell Jr (Feh3k) on Thursday, March 27, 2003 - 08:21 pm:

Javier, the calculation uses the definition of the mean for a discrete distribution, namely the sum over categories of the category code times the probability of being in that category. Your response variable is not a count because it uses quantiles of the original counts, resulting in, among other things, unequal-width intervals. You will have to decide if the integer codes for the intervals are meaningful enough to be used in computing the mean. You might consider using more categories and/or assigning the mean values within categories as the category levels (see the levels.mean parameter of the cut2 function). Then there's the decision about mean vs. mode.
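In code, that definition might be reproduced as follows, assuming the erirub.lrm fit from the question and integer codes 1-6 for the categories:

P      <- predict(erirub.lrm, type='fitted.ind')  # n x 6 matrix of Prob(Y=j)
codes  <- 1:6                                     # the category codes
mscore <- drop(P %*% codes)   # should agree with predict(erirub.lrm, type='mean')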