Notes from Tim Hesterberg, Insightful Corporation, on why Y should be used
when doing multiple imputation (July 1998).  Tim included several simulated
examples using S-Plus.

------------------------------------------------------------------------

When doing imputations, you should not use either E(X2|X1) or the
conditional distribution of X2 | X1, but rather the conditional
distribution of X2 | X1 and Y.

To see this, assume the multivariate Gaussian model is correct, that Y and
X2 are highly correlated (say R=.99), and that X1 is independent of both X2
and Y.  I'll look at all three cases.

Case 1:  Imputing with E(X2|X1)

This corresponds to using mean imputation for X2.  Having done that, we can
basically ignore X1, because it is independent of the other variables and
its regression coefficient will be near zero.

The scatterplot of Y vs X2 will show two distinct groups: (1) the
completely observed points fall on a diagonal line plus a bit of noise,
(2) the imputed points fall on a vertical line, with X2 set at its
observed mean.

If the values of X2 are Missing Completely at Random, then the regression
coefficients (slope and intercept) are unbiased, but the standard errors
are completely incorrect.

X2 _ rnorm(40)
Y _ .99*X2 + sqrt(1-.99^2)*rnorm(40)
X2[ runif(40) < .2 ] _ NA
X2i _ X2
X2i[ which.na(X2) ] _ mean(X2, na.rm=T)
plot(X2i, Y)
abline( lm(Y~X2i, na.action=na.omit))

But in the more common case that X2 is Missing at Random (i.e. whether X2
is missing is independent of Y given X2), it may be that, say, large values
of X2 are more likely to be missing.  The regression slope is still
unbiased, but the intercept is biased.

X2 _ rnorm(40)
Y _ .99*X2 + sqrt(1-.99^2)*rnorm(40)
X2[ X2 > .5 ] _ NA
X2i _ X2
X2i[ which.na(X2) ] _ mean(X2, na.rm=T)
plot(X2i, Y)
abline( lm(Y~X2i, na.action=na.omit))

My intuition is that in higher-dimensional problems, or where X1 is not
independent of everything else, all regression coefficients will be biased.

Case 2:  Imputing with the conditional distribution of X2 | X1

This corresponds to using the unconditional distribution of X2, since X1 is
independent of X2.  Again we ignore X1.

The scatterplot of Y vs X2 will show two distinct groups: (1) the
completely observed points fall on a diagonal line plus a bit of noise,
(2) the imputed points form an uncorrelated cloud.

All regression coefficients are biased.

X2 _ rnorm(40)
Y _ .99*X2 + sqrt(1-.99^2)*rnorm(40)
X2[ X2 > .5 ] _ NA
X2i _ X2
w _ which.na(X2)
X2i[ w ] _ rnorm(length(w))
plot(X2i, Y)
abline( lm(Y~X2i, na.action=na.omit))

Case 3:  Imputing with the conditional distribution of X2 | X1,Y

Again we ignore X1.

The scatterplot of Y vs X2 shows only one group: (1) the completely
observed points fall on a diagonal line plus a bit of noise, (2) so do the
imputed points.

X2 _ rnorm(40)
Y _ .99*X2 + sqrt(1-.99^2)*rnorm(40)
X2[ X2 > .5 ] _ NA
X2i _ X2
w _ which.na(X2)
X2i[ w ] _ .99*Y[ w ] + sqrt(1-.99^2)*rnorm(length(w))
plot(X2i, Y)
points(X2i[w], Y[w], col=2)
abline( lm(Y~X2i, na.action=na.omit))

Caveat: In practice you don't know the conditional distribution, so the
multiple imputations also incorporate uncertainty in the model parameters.
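As a rough sketch (not from Tim's note), one such "proper" imputation draw
for the simple normal model used above (X2 regressed on Y, still ignoring
X1) might look like the following; the object names (fit, sigma2,
beta.star) are made up for the example.  The residual variance and
regression coefficients are drawn from their approximate posterior before
the missing X2's are filled in:

# Illustrative sketch: one proper-imputation draw for X2 given Y
w _ which.na(X2)                     # cases needing imputation
X2.obs _ X2[-w]
Y.obs _ Y[-w]
fit _ lm(X2.obs ~ Y.obs)             # regression fit to observed cases
n _ length(X2.obs)
sigma2 _ sum(residuals(fit)^2) / rchisq(1, n - 2)   # draw residual variance
X _ cbind(1, Y.obs)
V _ solve(t(X) %*% X)                # (X'X)^{-1}
beta.star _ coef(fit) + t(chol(sigma2 * V)) %*% rnorm(2)  # draw coefficients
X2i _ X2
X2i[w] _ beta.star[1] + beta.star[2]*Y[w] + sqrt(sigma2)*rnorm(length(w))

Repeating this draw several times gives the multiple imputations; because
the parameters are redrawn each time, the imputations reflect parameter
uncertainty as well as residual noise.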
Now I'll return to the "ambiguity" mentioned above.  Werner's statement
mentions "imputation of conditional expectations", but doesn't indicate
whether he means conditional on Y and other X's, or just on other X's.
The former would correspond to case 4.

Case 4:  Imputing with the conditional mean of X2 | X1,Y

Again we ignore X1.

The scatterplot of Y vs X2 shows two groups: (1) the completely observed
points fall on a diagonal line plus a bit of noise, (2) the imputed points
fall exactly on the line.

X2 _ rnorm(40)
Y _ .99*X2 + sqrt(1-.99^2)*rnorm(40)
X2[ X2 > .5 ] _ NA
X2i _ X2
w _ which.na(X2)
X2i[ w ] _ .99*Y[ w ]
plot(X2i, Y)
points(X2i[w], Y[w], col=2)
abline( lm(Y~X2i, na.action=na.omit))

Note that the line for the (2) points is X2 = E[X2 | Y], not Y = E[Y | X2].
This would be more noticeable with lower correlation:

X2 _ rnorm(200)
Y _ .7*X2 + sqrt(1-.7^2)*rnorm(200)
X2[ X2 > .5 ] _ NA
X2i _ X2
w _ which.na(X2)
X2i[ w ] _ .7*Y[ w ]
plot(X2i, Y)
points(X2i[w], Y[w], col=2)
abline( lm(Y~X2i, na.action=na.omit), col=3)
abline( lm(Y~X2, na.action=na.omit))

Using this method gives biased coefficients.
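To see the bias numerically, one could compare the fitted coefficients from
the complete cases with those obtained after conditional-mean imputation (a
small illustrative addition, reusing the objects from the last example; in
a typical run the two slopes differ noticeably):

coef( lm(Y ~ X2,  na.action=na.omit) )   # complete cases only
coef( lm(Y ~ X2i) )                      # after imputing X2i[w] _ .7*Y[w]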