R Code and Examples from Class
Introduction
R is an powerful statistics package that is also capable of creating publication-quality graphics. The experienced user of R will use the command line to conduct analyses and create plots. However, there exists an add-on package to R, called R-commander, which provides a menu-based interface to assist new users of R. R-commander automatically translates menu choices and mouse clicks into the underlying R commands, which can be saved, modified, and re-run by the user. The following notes contain the R commands, which can be run as is or, in most cases, can be replicated using menu choices with R-commander.
Notes on Installing R and R-commander
Download and install the executable form of R (a .exe file on Windows) for your particular operating system. Links to the appropriate files are provided in the syllabus. After starting R, install the Rcmdr package by either (1) navigating the menu system to install the Rcmr package, or (2) entering
install.package("Rcmdr") at the R command line (the '>' prompt)
After successfully installing Rcmdr, you should be able to load the Rcmdr library. Again, either (1) navigate the menu system, this time loading the Rcmdr package, or (2) enter
library(Rcmdr) at the R command line (the '>' prompt)
Each time you start R, you will need to load the Rcmdr library to start R-commander. The first time you load the Rcmdr library, you will be asked to install some additional packages. Install the packages from CRAN.
We have learned that installing R commander on a Mac requires a few more steps. See
this link for more information.
Importing data, defining factors, simple box-plot (01/15/08)
- Import heaven dataset from here.
Using the 'import data' menus in R-commander may be easier.
- Show the data (check for import errors)
- Box plot of weight, Consider changing the y-axis label
- Change sex to a factor or categorical variable (by default R thinks it continuous variable). Can also be done with the R-commander menus by following Data... Manage variables in active dataset... Convert numeric variables to factors
- Using the newly created factor variable for sex, make a box plot for males and females
heaven <-
read.csv("http://biostat.mc.vanderbilt.edu/twiki/pub/Main/DataTransmissionProcedures/spreadsheetfromheaven.csv")
showData(heaven, placement='-20+200', font=getRcmdr('logFont'), maxwidth=80, maxheight=30)
boxplot(heaven$WT, ylab="WT")
heaven$SEX <- factor(heaven$SEX, labels=c('Male','Female'))
boxplot(WT~SEX, ylab="WT", xlab="SEX", data=heaven)
Calculate new variables, Numerical Summaries, Histogram, Dot Plot/Strip plot (01/17/08)
(1) Start with the 'heaven' dataset imported in the 01/15/08 notes
(2)-(4) Define variables as factors
(5) Means, median, and quartiles
(6) - (9) Frequency table for Race variable
(10) Create a new variable from weight and height, BMI
(11) Histogram of HCT (Hematocrit)
(12) (Ugly) Dot plot of HCT by treatment group (no jittering)
(13) Load the lattice package. If it had not been installed, navigate the menus to install the lattice packages or enter the command
install.packages("lattice") at the R command prompt
(14) Nice dot plot of HCT using the graphics in the lattice package Points are jittered to display overlapping values
heaven <-
read.csv("http://biostat.mc.vanderbilt.edu/twiki/pub/Main/DataTransmissionProcedures/spreadsheetfromheaven.csv")
heaven$GROUP <- factor(heaven$GROUP, labels=c('Treatment','Control'))
heaven$SEX <- factor(heaven$SEX, labels=c('Male','Female'))
heaven$RACE <- factor(heaven$RACE, labels=c('White','Black','Asian','Native American','Other'))
summary(heaven)
.Table <- table(heaven$RACE)
.Table # counts for RACE
100*.Table/sum(.Table) # percentages for RACE
remove(.Table)
heaven$BMI <- with(heaven, 703 * WT / HT^2)
Hist(heaven$HCT, scale="frequency", breaks="Sturges", col="darkgray")
library(lattice)
stripchart(heaven$HCT~heaven$GROUP, vertical=TRUE)
stripplot(heaven$HCT ~ heaven$GROUP, ylab="Hematocrit", main="Example of stripplot from the lattice library", jitter=T)
NOTE: The above commands are not run using the
Rcmdr
menus but are for pasting into the
Rcmdr Script Window
.
stripchart
and
stripplot
are functions in the R
lattice
package which is loaded by the command
library(lattice)
or by loading that package from the
Package
menu. If you get an error message when running the command
library(lattice)
then you need to use the
Package
menu to install
lattice
, or use the command
install.packages('lattice')
to fetch and install it.
One sample and two sample t-tests (01/29/08)
(1) - (3) Load the sleep dataset, get a brief description, and show the contents. The sleep dataset is included with R as a built-in example dataset, so to load it use the menus "Data... Data in packages... Read dataset" From there, either enter sleep as the dataset name or you can find it in the datasets packages using the mouse
(4) Box plot of sleep data
(5) Two-sample t-test, assuming equal variances, alpha = .05, two-sided
(6) -(9) Reorganize the sleep dataset into "stacked" form and give the columns appropriate labels. These commands cannot be replicated in R commander
(10) Paired t-test on extra hours of sleep, alpha = .05, two sided. If doing this test in R-commander, you will need to make "sleep.flat" the active dataset before you are able to run the test.
data(sleep)
help("sleep")
showData(sleep, placement='-20+200', font=getRcmdr('logFont'), maxwidth=80, maxheight=30)
boxplot(extra~group, ylab="extra", xlab="group", data=sleep)
t.test(extra~group, alternative='two.sided', conf.level=.95, var.equal=TRUE, data=sleep)
drug1 <- subset(sleep, group==1)
drug2 <- subset(sleep, group==2)
sleep.flat <- cbind(drug1, drug2)
names(sleep.flat) <- c("drug1","grp1","drug2","grp2")
t.test(sleep.flat$drug1, sleep.flat$drug2, alternative='two.sided', conf.level=.95, paired=TRUE)
R Code for Demonstrations using Simulation
Standard deviations, sampling distribution of the mean, and standard errors
# Show that as n gets large, the sample mean and standard deviation approach population values
# Use a population that has a mean of 70 and standard deviation of 15
n <- 10
x <- rnorm(n, 70, 15)
mean(x); sd(x)
n <- 100
x <- rnorm(n, 70, 15)
mean(x); sd(x)
n <- 100000
x <- rnorm(n, 70, 15)
mean(x); sd(x)
# Draw a sample of size 20 from a population normal distribution with with mean 70 and SD 15
set.seed(1) # so everyone will get the same result
x <- rnorm(20, 70, 15)
# Compute the mean, variance, and SD of the sample
mean(x); var(x); sd(x)
# Draw 1000 samples from the same population and save their means
means <- numeric(1000)
for(i in 1:1000) means[i] <- mean(rnorm(20, 70, 15))
# Show histogram of raw data from one sample along with histogram of means
par(mfrow=c(1,2))
hist(x, xlim=c(30,110))
hist(means, xlim=c(30,110), nclass=30)
# Compute variance and SD of means
var(means); sd(means)
# Compare this variance to the variance estimate from the first sample, divided by 20
var(x)/20
# Compare this SD to the standard error from the first sample
sd(x)/sqrt(20)
# Show the sampling distributions of the sample mean and the sample variance for two different sample size
# Draw samples of size 10 and 40 from a N ~ (70, 15^2) population
means <- numeric(5000)
vars <- numeric(5000)
means2 <- numeric(5000)
vars2 <- numeric(5000)
for(i in 1:5000) {
means[i] <- mean(rnorm(10, 70, 15))
vars[i] <- var(rnorm(10,70,15))
means2[i] <- mean(rnorm(40, 70, 15))
vars2[i] <- var(rnorm(40,70,15))
}
15^2
par(mfrow=c(2,2))
hist(means,xlim=c(50,90),main="Sampling distribution of sample mean \n n = 10", nclass=30)
hist(vars,xlim=c(0,800), main="Sampling distribution of sample variance \n n = 10", nclass=30)
hist(means2, xlim=c(50,90), main="Sampling distribution of sample mean \n n = 40", nclass=30)
hist(vars2, xlim=c(0,800), main="Sampling distribution of sample variance \n n = 40", nclass=30)
Regression to the mean
# Generate some random Normal data where subjects are selected for treatment based on having high pre (x) values
set.seed(10)
data <- data.frame(x=rep(NA,40),x.post=rep(NA,40),diff=rep(NA,40),grp=rep(NA,40))
data$x <- rnorm(40, 120,10)
data$x.post <- rnorm(40, 120,10)
data$diff <- data$x.post - data$x
data$grp <- (data$x>=median(data$x)) + 1
plot(data$x)
plot(data$x, col=data$grp)
abline(v=median(data$x))
plot(data$x.post, col=data$grp)
# Plot showing induced relationship between group and the difference
library(lattice)
stripplot(grp ~ diff, data=data)
Histograms and box plots for symmetric and skewed distributions
# Create a function for displaying a histogram in the top panel and a box plot in the bottom panel
boxhist <-function(x, lims) {
X11()
par(mfrow=c(2,1))
hist(x, xlim=lims, main="Histogram")
boxplot(x, horizontal=TRUE, ylim=lims)
}
set.seed(1)
x <- rnorm(1000)
boxhist(x, c(-4,4))
y <- 1:100
boxhist(y, c(1,100))
a <- rep(NA, 1000)
a[1:500] <- rnorm(500, 4, 1)
a[501:1000] <- -a[1:500]
boxhist(a, c(-8,8))
z <- rchisq(1000, df=3)
boxhist(z, c(0,15))
boxhist(15-z, c(0,18))
The impact of adding an extreme observation on the 2-sample t-test and Wilcoxon rank sum test
This code uses the
dot plot function to create a fancier dot plot of the data than stripchart()
# Search for a dataset where, for samples x and y,
# (1) The mean(x) < mean(y)
# (2) A two-sample t-test (two-sided) is significant
# (3) Adding a new observation to y that is larger than any current
# observation in y makes the t-test not significant
thesum <- 0
while(thesum =3) {
x <- rnorm(5, 5, 1)
y <- rnorm(4, 6, 1)
y2 <- c(y, max(y) + runif(1,0,3))
thesum <- (t.test(x,y2, var.equal=T)$p.value > .05) + (t.test(x,y, var.equal=T)$p.value < .05) + (mean(x) < mean(y))
}
# One such dataset
x <- c(4.98, 4.96, 5.22, 4.47, 4.8)
y <- c(5.2, 5.77, 5.37, 5.11)
y2 <- c(5.2, 5.77, 5.37, 5.11, 8.51)
data1 <- c(x,y)
group1 <- factor(c(rep(0,5), rep(1,4)), levels=0:1)
data2 <- c(x,y2)
group2 <- factor(c(rep(0,5), rep(1,5)), levels=0:1)
cbind(data1,group1)
cbind(data2,group2)
# A plot of the data
library(lattice)
stripplot(data2~group2, jitter=T, ylab="Dependant Variable", xlab="Group")
# A fancier plot of the data
source("~/rlib/dotplot.R")
dotplot(data2~group2, show.n=TRUE, median.line=TRUE, ylab="Dependant Variable", xlab="Group")
points(2, max(y2), col=2,pch=20)
legend("topleft",inset=.05, c("Original dataset","Added data point"), pch=20, col=c(1,2))
text(1,7,"p = ???", cex=1.5)
# T-test gives unanticipated answer-- a larger p-value for dataset 2
t.test(x,y,var.equal=TRUE,paired=FALSE)
t.test(data1~group1, var.equal=TRUE, paired=FALSE)
t.test(x,y2,var.equal=TRUE,paired=FALSE)
t.test(data2~group2, var.equal=TRUE, paired=FALSE)
#Non-parametric test is more predictable-- the p-value decreases for dataset2
wilcox.test(x, y)
wilcox.test(data1~group1)
wilcox.test(x, y2)
wilcox.test(data2~group2)
Analysis of Serial Data Using Summary Measures (17Apr08)
d <- read.csv('MuticData.csv') # file is attached to this web page
with(d, table(Mouse,Week))
with(d, table(Mouse,Treatment))
library(lattice)
xyplot(T ~ Week | Treatment, groups=Mouse, data=d, type='b')
xyplot(T/H ~ Week | Treatment, groups=Mouse, data=d, type='b')
slope.int <- function(z) coef(lsfit(z[,1], z[,2]))
library(Hmisc)
s <- with(d, summarize(cbind(Week,T/H), llist(Treatment,Mouse), slope.int,
stat.name=c('intercept','slope')))
s
## Mean of each fitted line is line evaluated at 2 weeks. Add this to data.
## Since we are assume a straight line for each mouse, the estimated mean
## will be the observed response at 2 weeks if there was a measurement at
## 2 weeks but not at three weeks. Mean = AUC/(3-1)
s$mean <- with(s, intercept + 2*slope)
s
## Plot means
stripplot(Treatment ~ mean, data=s)
## Do nonparametric ANOVA on the mean response
kruskal.test(mean ~ Treatment, data=s)