Regression Modeling Strategies Assignments Discussion Board

Assignment 1

What should we include in our .pdf?

Do you want us to show our code? Should we include raw table output from regressions, anova, etc.?
Reply : Yes, include all code in the `.pdf`.

Question # 5: any response before a)?

We're not sure how to interpret the first paragraph; are you looking for anything in addition to our responses to parts a), b), and c)?

Question # 6: "no cubing or squaring"?

What exactly do you mean "without cubing or squaring any terms"? The answer that we are trying to get to in the middle of page 36 has cubed terms in it.
Reply : What I meant was: don't expand terms in the original cubic formulation. For example, if a term was squared, don't expand (a-b)^2 into a^2 - 2ab + b^2.

Question # 7: "general principles"

We're not sure what this means; would you give us some examples of what you would consider general principles?

"Linear combinations of things" still doesn't clarify things (at least for me). I feel it still requires some algebra. Any other hints?
Reply : This can be reasoned in words without explicitly using algebra. You might think about what happens if a piece of a linear combination is nonlinear.

Question # 8a: second derivatives

The directions specify that the mixed second derivative should agree at the rectangle boundaries. What about the second derivatives with respect to X1 and X2?
Reply : Concentrate on derivatives with respect to a single variable

I have the same question. I tried to do part (b) without restricting the second derivatives with respect to X1 or X2 individually to be continuous (just as the question states), but ended up with a lot of coefficients to estimate and had difficulty simplifying the form of f() any further. It is a good lesson in the "curse of dimensionality", even though we only increased the dimension from 1 to 2! Besides, in the one-dimensional case we require both f'() and f''() to be continuous, so I wonder whether it might be reasonable to restrict the second derivatives with respect to X1 and X2 individually to be continuous as well. We do need clarification about this.
Reply: For this question I would like all the students to provide their insights, with everyone helping everyone else. Anyone is welcome to reformulate the question to make the problem cleaner, including using a quadratic response surface instead of cubic.
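Editorial note for readers following this thread: as a one-dimensional reference point (these are standard cubic-spline facts, not the assignment's answer), a cubic spline f with interior knots t_1 < ... < t_k imposes at each knot t_j

```latex
f(t_j^-) = f(t_j^+), \qquad
f'(t_j^-) = f'(t_j^+), \qquad
f''(t_j^-) = f''(t_j^+),
```

so each knot contributes exactly one new free parameter, the coefficient of (x - t_j)_+^3. The two-dimensional question above is essentially which of these matching conditions to carry over to the rectangle boundaries.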

Assignment 3

Question 6: role of totcst

In the earlier problems, we thought we were treating totcst as a predictor. In Question 6, is it the outcome? We thought it wasn't a good idea to impute Y, but if we understand the question correctly, we are supposed to keep the imputed values for Y here--- do we have that right?
Reply : `totcst` is the outcome. And this is a special imputation problem because we have a powerful surrogate imputer in total charges.

Assignment 5

Question 1-7: Outcome

Horrible question of the day: is the outcome of interest in the pbc data serum bilirubin?
Another student's thoughts: I thought it was a survival study, and that the outcome was `fu.days` (with `status` as the event indicator). I used `bili` as one of the predictors...
Reply : The last statement is correct.

Question 2: Replacing NAs

I am trying to replace the NAs with the imputed values for this question. I am using impute(pbc.sub.trans, chol, data=m) and getting errors. If I use impute(pbc.sub.trans, data = m) to try to impute all variables, I don't get any errors, but the NAs have not been replaced. Any ideas on what I am doing wrong?
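No instructor reply was recorded for this one. One common cause of this error (an editorial guess, illustrated with made-up data; the variable names chol, bili, and age are stand-ins) is that Hmisc's transcan() was not run with imputed=TRUE, which it needs before impute() can pull out the imputed values:

```r
library(Hmisc)

set.seed(3)
d <- data.frame(chol = c(rnorm(45, 280, 60), rep(NA, 5)),  # 5 missing values
                bili = rexp(50, 1/3),
                age  = rnorm(50, 50, 10))

# imputed=TRUE stores the imputed values in the transcan object
m <- transcan(~ chol + bili + age, data = d, imputed = TRUE, pl = FALSE)

chol.i <- impute(m, chol, data = d)  # chol with its NAs filled in
sum(is.na(chol.i))                   # 0 if the imputation worked
```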

Question 1: Sample sd

Do we know anything about the sample s.d. in question 1? I'm wondering if we should be reading anything into the choice of sigma=5 in the log likelihoods.
Reply : You only know what you are given. Use heuristics and not math.
Hmm. So far my heuristic has been "look at a bunch of plots of likelihood functions." Does this seem like a reasonable approach?
Reply : No, it's much quicker than that.
Another student's thoughts: I also thought it was helpful to see a few plots to get a gestalt feeling of the problem (or at least provide evidence that my heuristic argument was correct). In case others may need a little help getting started with this, I used the following code:

logL.normal <- function(n = 100, mu = 10, sig = 5, true.mu = 10, true.sig = 5) {
  # Simulate a sample from the true distribution, then evaluate the
  # Normal log likelihood at the candidate parameters (mu, sig)
  Y <- rnorm(n = n, mean = true.mu, sd = true.sig)
  sum(dnorm(Y, mean = mu, sd = sig, log = TRUE))
}

replicate(n = 10000, logL.normal())
Reply : Drawing pictures never hurts. This problem is simpler than people are making it out to be, however.
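For anyone who wants a picture, here is a minimal sketch of the plotting approach discussed above (an illustration, not the instructor's shortcut): the Normal log likelihood as a function of mu for a single simulated sample, with sigma held at 5.

```r
set.seed(1)
y <- rnorm(100, mean = 10, sd = 5)        # one simulated sample

mu.grid <- seq(5, 15, length.out = 201)   # candidate values of mu
logL <- sapply(mu.grid,
               function(m) sum(dnorm(y, mean = m, sd = 5, log = TRUE)))

plot(mu.grid, logL, type = "l", xlab = "mu", ylab = "log likelihood")
abline(v = mu.grid[which.max(logL)], lty = 2)  # grid maximum, close to mean(y)
```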

Question 4: saturated model

My understanding of a saturated model is a model that has one parameter per data point--- I think this is the kind of model we used in calculating the deviance in Ben's class--- is that what other people remember? Is the phrase being used differently here? If so, what does it mean? and if not, how many data points are there?
Reply : I should use better terminology. By saturated I meant to say that in the class of subject-matter-plausible models that force smooth (but not linear) effects, the chosen model will be almost guaranteed to fit. I could be more specific than that to consider (1) the universe of smooth non-additive models or (2) the universe of smooth additive models.

Does this mean that we can use either an additive or a non-additive saturated model? If not, would you prefer that we stick with a non-additive model? By a non-additive saturated model, do you mean a model with all the main effects and their interactions?
Reply : For this problem there are only two variables and what I meant to imply by a saturated model was one that included smooth non-additive effects.
Still not sure: the way I'm looking at my model, there are four main variables: age, age^2, and two dummies for treatment group. Then I put in an intercept and interactions, so I wound up with nine terms in my model. Is this not what you had in mind?
Reply : That may give away too much.

Question 5: overall setup

I'm not sure I understand what we should do in question 5. I think the first step is to generate data, where p= E[Y], logit(p) = XB, and X is a 300x6 matrix, with the first column all ones and the other 5 columns any continuous values that let us wind up with Y such that mean(Y) is between 0.15 and 0.3. Do I have the right idea here? Once we have our simulated data, are we then doing something similar to the situation in Fig 10.6 of the revised version of the textbook, fitting a too-complicated (but in this case, still linear) model?
Another student's thoughts: I have to confess that the setup isn't totally clear to me, but this is my approach: 1) generate binary Y such that the mean(Y) is between 0.15 and 0.30, 2) generate 5 random continuous variables (N(0,1)?) and 3) follow example on pg 178 but analyze the data using a logistic model where all terms are linear. I am not sold on my results, so take this with a grain of salt. As an FYI, it may be helpful to check out page 31 of the text for additional parts of this problem.
Still wondering: so you didn't use the X's (the 5 continuous RV's) to generate Y? Also, do you mean the revised or the original text? Thank you!
Reply : The model would not be very meaningful if you didn't use at least some of the Xs to simulate Y. The art of simulating Y is coming up with a model for the Xs including the true population beta coefficients and any nonlinearities and interactions (this problem is easier on those two counts as the model is linear and additive). Depending on how you program it the initial column of 1s may be already handled automatically. Figure 10.6 is relevant but you now have 5 Xs and a logistic model instead of OLS [the penalization syntax is identical for the two models though]. If you look ahead to Problem 6 you'll see a very simple simulation of binary Y. I'll be watching for follow-up questions if this isn't clear.
Still not clear to me: I'm afraid I don't understand where the art comes in. What are we trying to do? Is the model supposed to match the way we generated the data, or is the goal here to fit a model that is unnecessarily complicated to see what happens?
Reply : Maybe it's more trial and error than art, but the goal is to generate strengths of covariate effects that make for interesting relationships between X and Y. You wouldn't want to use coefficients that are so large that probabilities of Y=1 are between 0 and .03 and between .97 and 1. And be sure to distinguish between the population model used to simulate the data and the model you fit in absence of knowledge about how the data were actually generated in the population.
A wee bit confused: from Problem 6, it looks like you chose betas equal to 1 and 0.5 (logit <- 1 + x1/2). Is this where the "art" comes into play? Can this be modified with a few additional x's, so that we then need to pick betas such that mean(y) is between 0.15 and 0.30?
Reply : Yes that is the setup given to you for problem 6. For problem 5 you should choose a different set of coefficients without having most of them zero, then just look at a histogram of population probabilities for reasonableness. I often simulate a large sample size for this type of pre-simulation look.
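Pulling this thread together, here is a hedged sketch of one way to simulate the Problem 5 data. The coefficient values are illustrative choices tuned so that mean(Y) lands in the required range; they are not the "right" answer.

```r
set.seed(2)
n <- 300
X <- matrix(rnorm(n * 5), nrow = n)   # 5 continuous covariates
beta <- c(0.5, -0.4, 0.3, 0.2, -0.2)  # nonzero population coefficients
intercept <- -1.3                     # tuned so mean(Y) lands near 0.15-0.30

p <- plogis(intercept + X %*% beta)   # population probabilities of Y = 1
hist(p)                               # sanity check: avoid piling up near 0 or 1

Y <- rbinom(n, size = 1, prob = p)
mean(Y)                               # should fall roughly between 0.15 and 0.30
```

Note the intercept column of 1s is handled by writing the intercept separately here; model-fitting functions such as lrm add it automatically.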

Assignment 7

Overall setup

When you say "Test for the association between disease group and total hospital cost," do you mean that you want us to do this without adjusting for any of the other variables?
Another student's thoughts: I simply ran with a model only adjusting for disease group (i.e., total cost ~ disease group) since part (a) asks us to perform a simple rank based test and the remaining questions build off that crude comparison.
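A minimal sketch of that crude comparison (using made-up stand-in data; in the assignment these would be the SUPPORT variables for total cost and disease group):

```r
set.seed(4)
d <- data.frame(dzgroup = factor(rep(c("A", "B", "C"), each = 30)),  # disease group
                totcst  = rexp(90, 1/20000))                         # total cost

kruskal.test(totcst ~ dzgroup, data = d)  # simple rank-based test, no adjustment
```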

Question 4: efficiency of the analysis

I'm sorry if I have just missed this point in the lectures or the reading... In other courses we have looked at the efficiency of various estimators under different scenarios. When we do that, we compare the variances of the estimators (efficiency of a vs b = var(b)/var(a) ). So if we were simply testing whether a particular parameter is equal to zero, and we were using a Wald test, I think the "efficiency of the analysis" would be the same thing as the efficiency of that particular parameter estimate. But what exactly does "efficiency of the analysis" mean when a) we are testing more than one parameter at a time and b) we are using a LR test, not a Wald test?
Another student's thoughts: I wasn't sure about this one, but I interpreted efficiency as the magnitude of the model based LR with larger values being indicative of higher efficiency. I did just eat 12 cookies, so take this with a grain of salt .... or sugar.

Question 5: Which test?

In the questions preceding this one, we performed several tests using different groupings of totcst. When you ask "which test?", are you asking about the tests (e.g., Kruskal-Wallis, or the proportional-odds generalization of the Wilcoxon-Mann-Whitney/Kruskal-Wallis test) or the models (e.g., using 2, 3, 6, or 20 groups of totcst)? Somebody please fafanua (clarify). Thank you.
Another student's thoughts: My interpretation was that we could choose from any of those.

Assignment 8

Question 2: "remaining steps"

When you say "treat this predictor as nominal ... in remaining steps," do you mean the remaining steps of Question 2 only, or the rest of Question 2 and all remaining problems?
Reply: Treat it as nominal throughout, i.e., make the model saturated with respect to that predictor main effect.
Topic revision: r44 - 17 Mar 2014, BenBulen


Copyright © 2013-2022 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.