Plotting Model Results

Problem:

Often times, you would like to generate graphics based on a model you fit in R. These plots may include a number of diagnostic plots, or just plotting the fitted line with prediction and confidence bands.

"Solution":

NOTE: I use the term "Solution" loosely; in R, there is never just one solution. This is just the solution I have found.

NOTE: The code distinguished with a ">" is what you would type at the R command line. The R output, if any, and any of my comments have been commented out using a "#".

This solution is a nearly exact excerpt from specific sections in the following book:
  • Introductory Statistics with R by Peter Dalgaard (Springer, 2002).

Specifically, I have excerpted paragraphs and have used and modified the R code from Chapter 5: Regression and Correlation Section 1, 2, and 3 and Chapter 10: Linear Models Section 8.

All of the data sets used in Peter Dalgaard's book can be found in the ISwR add-on package, which is available for download/installation on the CRAN website (http://cran.r-project.org/).

For this solution, we will be using the thuesen built-in data set from the ISwR add-on package. According to the ISwR description, the thuesen data set "contains ventricular shortening velocity and blood glucose for type 1 diabetic patients." thuesen contains two columns:
  • blood.glucose - fasting blood glucose (mmol/l) (numeric)
  • short.velocity - mean circumferential shortening velocity (%/s) (numeric)

In case some of you have problems installing the ISwR add-on package, I decided to generate the thuesen data set "from scratch" and proceed from there.

> thuesen<-data.frame(
   blood.glucose=c(15.3, 10.8, 8.1, 19.5, 7.2, 5.3, 9.3,
      11.1, 7.5, 12.2, 6.7, 5.2, 19.0, 15.1, 6.7, 8.6, 4.2, 10.3,
      12.5, 16.1, 13.3, 4.9, 8.8, 9.5),
   short.velocity=c(1.76, 1.34, 1.27, 1.47, 1.27, 1.49, 1.31, 1.09,
      1.18, 1.22, 1.25, 1.19, 1.95, 1.28, 1.52, NA, 1.12, 1.37, 1.19,
      1.05, 1.32, 1.03, 1.12, 1.70))
> thuesen
#    blood.glucose short.velocity
# 1           15.3           1.76
# 2           10.8           1.34
# 3            8.1           1.27
# 4           19.5           1.47
# 5            7.2           1.27
# 6            5.3           1.49
# 7            9.3           1.31
# 8           11.1           1.09
# 9            7.5           1.18
# 10          12.2           1.22
# 11           6.7           1.25
# 12           5.2           1.19
# 13          19.0           1.95
# 14          15.1           1.28
# 15           6.7           1.52
# 16           8.6             NA
# 17           4.2           1.12
# 18          10.3           1.37
# 19          12.5           1.19
# 20          16.1           1.05
# 21          13.3           1.32
# 22           4.9           1.03
# 23           8.8           1.12
# 24           9.5           1.70

For those of you who are able to properly install the ISwR package, you can load the library and read in the thuesen data set as follows:
library(ISwR)
data(thuesen)

This will then add the thuesen data set as an object to your current workspace:
ls()
# [1] "last.warning" "thuesen"

Either way, you will end up with a data set called thuesen.

Let's fit a simple linear regression model using the thuesen data. For convenience, let's assign the value(s) returned by the lm function under the name lm.velo:
lm.velo<-lm(short.velocity ~ blood.glucose, data=thuesen)
lm.velo
# 
# Call:
# lm(formula = short.velocity ~ blood.glucose, data = thuesen)
#
# Coefficients:
#   (Intercept)  blood.glucose
#       1.09781        0.02196

As Peter Daalgard states on page 97: "The result of lm is a model object. This is a distinctive concept of the S language (of which R is a dialect). Where other statistical systems focus on generating printed output that can be controlled by setting options, you get instead the result of a model fit encapsulated in an object from which the desired quantities can be obtained using extractor functions. An lm object does in fact contain much more information than you see when it is printed."

A basic extractor function is the summary function:
> summary(lm.velo)
#  
# Call:
# lm(formula = short.velocity ~ blood.glucose, data = thuesen)
# 
# Residuals:
#      Min       1Q   Median       3Q      Max
# -0.40141 -0.14760 -0.02202  0.03001  0.43490
# 
# Coefficients:
#               Estimate Std. Error t value Pr(>|t|)
# (Intercept)    1.09781    0.11748   9.345 6.26e-09 ***
# blood.glucose  0.02196    0.01045   2.101   0.0479 *
# ---
# Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
# 
# Residual standard error: 0.2167 on 21 degrees of freedom
# Multiple R-Squared: 0.1737,     Adjusted R-squared: 0.1343
# F-statistic: 4.414 on 1 and 21 DF,  p-value: 0.0479

Let's first plot just the points and the fitted line of the model fit :
plot(thuesen$blood.glucose, thuesen$short.velocity,
   xlab="Blood Glucose", ylab="Short Velocity",
   main="Plot of Thuesen Data")
abline(lm.velo)

Now let's consider residuals and fitted values. Two further extractor functions are fitted and resid.

As Peter Daalgard states on page 100 - 101:
  • "The fitted function returns fitted values - the y-values that you would expect for the given x-values according to the best fitting straight line", which was found by lm.
  • "The residuals shown by the resid function is the difference between this and the observed *short.velocity*".
> fitted(lm.velo)
#        1        2        3        4        5        6        7        8
# 1.433841 1.335010 1.275711 1.526084 1.255945 1.214216 1.302066 1.341599
#        9       10       11       12       13       14       15       17
# 1.262534 1.365758 1.244964 1.212020 1.515103 1.429449 1.244964 1.190057
#       18       19       20       21       22       23       24
# 1.324029 1.372346 1.451411 1.389916 1.205431 1.291085 1.306459
> resid(lm.velo)
#            1            2            3            4            5            6
#  0.326158532  0.004989882 -0.005711308 -0.056084062  0.014054962  0.275783754
#            7            8            9           10           11           12
#  0.007933665 -0.251598875 -0.082533795 -0.145757649  0.005036223 -0.022019994
#           13           14           15           17           18           19
#  0.434897199 -0.149448964  0.275036223 -0.070057471  0.045971143 -0.182346406
#           20           21           22           23           24
# -0.401411486 -0.069916424 -0.175431237 -0.171085074  0.393541161

Notice, observation 16 has a missing short.velocity, so no fitted value or residual is calculated. For ease of plotting, we will want to have calculated fitted values and residuals for all observations. A easy way to handle missing values is by using the na.exclude method for missing value (i.e. NA) handling. We do this using the options function:
> options(na.action="na.exclude")

This overrides the default option of na.action = "na.omit". na.omit removes any incomplete cases from the data set before calculating the fitted values and residuals. On the other hand, na.exclude inset NA=s for cases omitted by =na.omit in order to pad the residuals and predictions to the correct length.

In the future, you can determine the default of a particular option by using the getOption function in a similar fashion to this:
# To see default na.action option:
> getOption("na.action")
# [1] "na.omit"

Peter Dalgaard discuss more ways to handle missing values on page 101.

We now have to refit lm.velo after changing the na.action option, and if we look at the fitted values, we see that observation 16 has a missing fitted value (NA):
> lm.velo<-lm(short.velocity ~ blood.glucose, data=thuesen)
> fitted(lm.velo)
#        1        2        3        4        5        6        7        8
# 1.433841 1.335010 1.275711 1.526084 1.255945 1.214216 1.302066 1.341599
#        9       10       11       12       13       14       15       16
# 1.262534 1.365758 1.244964 1.212020 1.515103 1.429449 1.244964       NA
#       17       18       19       20       21       22       23       24
# 1.190057 1.324029 1.372346 1.451411 1.389916 1.205431 1.291085 1.306459

Now let's create a plot where residuals are displayed by connecting observations to corresponding points on the fitted line :
> plot(thuesen$blood.glucose, thuesen$short.velocity,
   xlab="Blood Glucose", ylab="Short Velocity",
   main="Plot of Thuesen Data")
> abline(lm.velo)
> segments(x0=thuesen$blood.glucose, 
   y0=fitted(lm.velo),
   x1=thuesen$blood.glucose, 
   y1=thuesen$short.velocity)

We can also generate a simple plot of residuals versus fitted values :
> plot(fitted(lm.velo), resid(lm.velo),
   xlab="Fitted Values from lm.velo",
   ylab="Residuals from lm.velo",
   main="Residuals versus Fitted values from lm.velo")

We can also get an indication of whether the residuals might have come from a normal distribution by checking for a straight line on a %BLUR% Q-Q plot :
> qqnorm(resid(lm.velo))
> qqline(resid(lm.velo))

Illustration and explanation of further regression diagnostics is given in section 10.8. A basic set of four regression diagnostic plots is available via the plot method for the lm objects:
  1. residual versus fitted values
  2. Q-Q normal distribution plot of standardized residuals
  3. plot of the square root of the absolute value of the standardized residuals
  4. plot of "Cook's distance", which is a measure of the influence of each observation on the regression coefficients
> par(mfro=c(2,2))
> plot(lm.velo)
# To return to the default of one graph:
> par(mfrow=c(1,1))

The last thing I am going to demonstrate how to plot the fitted line of your model including the prediction and confidence bands . There is a complete explanation of the procress to produce this plot in section 5.3.

As Peter Dalgaard explains on page 106: "First we create a new data frame in which the blood.glucose variable contains the values at which we want predictions to be made":
> pred.frame<-data.frame(blood.glucose=4:20)

Next we generate two vectors which "contain the result of predict for the new data in =pred.frame=", specifically (1) the prediction limits, and (2) the confidence limits:
> pp<-predict(lm.velo, int="p", newdata=pred.frame)
> pc<-predict(lm.velo, int="c", newdata=pred.frame)

"Now for the plotting: first we create a standard scatterplot, except we ensure that it has enough room for the prediction limits", by incorporating the ylim argument.
> plot(thuesen$blood.glucose, thuesen$short.velocity,
   ylim=range(thuesen$short.velocity, pp, na.rm=T),
   xlab="Blood glucose", ylab="Short Velocity",
   main="Plot with Confidence and Prediction Bands")

"Finally the curves are added, using as c-values the blood.glucose used for the prediction, and setting the line types and colours to more sensible values."
 matlines(pred.frame$blood.glucose, pc, lty=c(1,2,2), col="black")
 matlines(pred.frame$blood.glucose, pp, lty=c(1,3,3), col="black")

We now probably want to set the na.action option back to "na.omit" instead of ="na.exclude"=":
options(na.action="na.omit")

Acknowledgements:

I would like to thank Richard Urbano for posing this problem.
Edit | Attach | Print version | History: r5 | r4 < r3 < r2 < r1 | Backlinks | View wiki text | Edit WikiText | More topic actions...
Topic revision: r2 - 08 Nov 2006, TheresaScott
 

This site is powered by FoswikiCopyright © 2013-2022 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Vanderbilt Biostatistics Wiki? Send feedback