R Code and Examples from Class

Introduction

R is an powerful statistics package that is also capable of creating publication-quality graphics. The experienced user of R will use the command line to conduct analyses and create plots. However, there exists an add-on package to R, called R-commander, which provides a menu-based interface to assist new users of R. R-commander automatically translates menu choices and mouse clicks into the underlying R commands, which can be saved, modified, and re-run by the user. The following notes contain the R commands, which can be run as is or, in most cases, can be replicated using menu choices with R-commander.

Notes on Installing R and R-commander

Download and install the executable form of R (a .exe file on Windows) for your particular operating system. Links to the appropriate files are provided in the syllabus. After starting R, install the Rcmdr package by either (1) navigating the menu system to install the Rcmr package, or (2) entering install.package("Rcmdr") at the R command line (the '>' prompt)

After successfully installing Rcmdr, you should be able to load the Rcmdr library. Again, either (1) navigate the menu system, this time loading the Rcmdr package, or (2) enter library(Rcmdr) at the R command line (the '>' prompt)

Each time you start R, you will need to load the Rcmdr library to start R-commander. The first time you load the Rcmdr library, you will be asked to install some additional packages. Install the packages from CRAN.

We have learned that installing R commander on a Mac requires a few more steps. See this link for more information.

Importing data, defining factors, simple box-plot (01/15/08)

  1. Import heaven dataset from here.
Using the 'import data' menus in R-commander may be easier.
  1. Show the data (check for import errors)
  2. Box plot of weight, Consider changing the y-axis label
  3. Change sex to a factor or categorical variable (by default R thinks it continuous variable). Can also be done with the R-commander menus by following Data... Manage variables in active dataset... Convert numeric variables to factors
  4. Using the newly created factor variable for sex, make a box plot for males and females
heaven <- read.csv("http://biostat.mc.vanderbilt.edu/twiki/pub/Main/DataTransmissionProcedures/spreadsheetfromheaven.csv") showData(heaven, placement='-20+200', font=getRcmdr('logFont'), maxwidth=80, maxheight=30) boxplot(heaven$WT, ylab="WT") heaven$SEX <- factor(heaven$SEX, labels=c('Male','Female')) boxplot(WT~SEX, ylab="WT", xlab="SEX", data=heaven)

Calculate new variables, Numerical Summaries, Histogram, Dot Plot/Strip plot (01/17/08)

(1) Start with the 'heaven' dataset imported in the 01/15/08 notes

(2)-(4) Define variables as factors

(5) Means, median, and quartiles

(6) - (9) Frequency table for Race variable

(10) Create a new variable from weight and height, BMI

(11) Histogram of HCT (Hematocrit)

(12) (Ugly) Dot plot of HCT by treatment group (no jittering)

(13) Load the lattice package. If it had not been installed, navigate the menus to install the lattice packages or enter the command install.packages("lattice") at the R command prompt

(14) Nice dot plot of HCT using the graphics in the lattice package Points are jittered to display overlapping values heaven <- read.csv("http://biostat.mc.vanderbilt.edu/twiki/pub/Main/DataTransmissionProcedures/spreadsheetfromheaven.csv") heaven$GROUP <- factor(heaven$GROUP, labels=c('Treatment','Control')) heaven$SEX <- factor(heaven$SEX, labels=c('Male','Female')) heaven$RACE <- factor(heaven$RACE, labels=c('White','Black','Asian','Native American','Other')) summary(heaven) .Table <- table(heaven$RACE) .Table # counts for RACE 100*.Table/sum(.Table) # percentages for RACE remove(.Table) heaven$BMI <- with(heaven, 703 * WT / HT^2) Hist(heaven$HCT, scale="frequency", breaks="Sturges", col="darkgray") library(lattice) stripchart(heaven$HCT~heaven$GROUP, vertical=TRUE) stripplot(heaven$HCT ~ heaven$GROUP, ylab="Hematocrit", main="Example of stripplot from the lattice library", jitter=T)

NOTE: The above commands are not run using the Rcmdr menus but are for pasting into the Rcmdr Script Window. stripchart and stripplot are functions in the R lattice package which is loaded by the command library(lattice) or by loading that package from the Package menu. If you get an error message when running the command library(lattice) then you need to use the Package menu to install lattice, or use the command install.packages('lattice') to fetch and install it.

One sample and two sample t-tests (01/29/08)

(1) - (3) Load the sleep dataset, get a brief description, and show the contents. The sleep dataset is included with R as a built-in example dataset, so to load it use the menus "Data... Data in packages... Read dataset" From there, either enter sleep as the dataset name or you can find it in the datasets packages using the mouse

(4) Box plot of sleep data

(5) Two-sample t-test, assuming equal variances, alpha = .05, two-sided

(6) -(9) Reorganize the sleep dataset into "stacked" form and give the columns appropriate labels. These commands cannot be replicated in R commander

(10) Paired t-test on extra hours of sleep, alpha = .05, two sided. If doing this test in R-commander, you will need to make "sleep.flat" the active dataset before you are able to run the test.

data(sleep) help("sleep") showData(sleep, placement='-20+200', font=getRcmdr('logFont'), maxwidth=80, maxheight=30) boxplot(extra~group, ylab="extra", xlab="group", data=sleep) t.test(extra~group, alternative='two.sided', conf.level=.95, var.equal=TRUE, data=sleep) drug1 <- subset(sleep, group==1) drug2 <- subset(sleep, group==2) sleep.flat <- cbind(drug1, drug2) names(sleep.flat) <- c("drug1","grp1","drug2","grp2") t.test(sleep.flat$drug1, sleep.flat$drug2, alternative='two.sided', conf.level=.95, paired=TRUE)

R Code for Demonstrations using Simulation

Standard deviations, sampling distribution of the mean, and standard errors

# Show that as n gets large, the sample mean and standard deviation approach population values # Use a population that has a mean of 70 and standard deviation of 15 n <- 10 x <- rnorm(n, 70, 15) mean(x); sd(x) n <- 100 x <- rnorm(n, 70, 15) mean(x); sd(x) n <- 100000 x <- rnorm(n, 70, 15) mean(x); sd(x)

# Draw a sample of size 20 from a population normal distribution with with mean 70 and SD 15 set.seed(1) # so everyone will get the same result x <- rnorm(20, 70, 15) # Compute the mean, variance, and SD of the sample mean(x); var(x); sd(x) # Draw 1000 samples from the same population and save their means means <- numeric(1000) for(i in 1:1000) means[i] <- mean(rnorm(20, 70, 15)) # Show histogram of raw data from one sample along with histogram of means par(mfrow=c(1,2)) hist(x, xlim=c(30,110)) hist(means, xlim=c(30,110), nclass=30) # Compute variance and SD of means var(means); sd(means) # Compare this variance to the variance estimate from the first sample, divided by 20 var(x)/20 # Compare this SD to the standard error from the first sample sd(x)/sqrt(20)

# Show the sampling distributions of the sample mean and the sample variance for two different sample size # Draw samples of size 10 and 40 from a N ~ (70, 15^2) population

means <- numeric(5000) vars <- numeric(5000) means2 <- numeric(5000) vars2 <- numeric(5000) for(i in 1:5000) { means[i] <- mean(rnorm(10, 70, 15)) vars[i] <- var(rnorm(10,70,15)) means2[i] <- mean(rnorm(40, 70, 15)) vars2[i] <- var(rnorm(40,70,15)) }

15^2

par(mfrow=c(2,2)) hist(means,xlim=c(50,90),main="Sampling distribution of sample mean \n n = 10", nclass=30) hist(vars,xlim=c(0,800), main="Sampling distribution of sample variance \n n = 10", nclass=30) hist(means2, xlim=c(50,90), main="Sampling distribution of sample mean \n n = 40", nclass=30) hist(vars2, xlim=c(0,800), main="Sampling distribution of sample variance \n n = 40", nclass=30)

Regression to the mean

# Generate some random Normal data where subjects are selected for treatment based on having high pre (x) values set.seed(10) data <- data.frame(x=rep(NA,40),x.post=rep(NA,40),diff=rep(NA,40),grp=rep(NA,40)) data$x <- rnorm(40, 120,10) data$x.post <- rnorm(40, 120,10) data$diff <- data$x.post - data$x data$grp <- (data$x>=median(data$x)) + 1

plot(data$x) plot(data$x, col=data$grp) abline(v=median(data$x))

plot(data$x.post, col=data$grp)

# Plot showing induced relationship between group and the difference library(lattice) stripplot(grp ~ diff, data=data)

Histograms and box plots for symmetric and skewed distributions

# Create a function for displaying a histogram in the top panel and a box plot in the bottom panel boxhist <-function(x, lims) { X11() par(mfrow=c(2,1)) hist(x, xlim=lims, main="Histogram") boxplot(x, horizontal=TRUE, ylim=lims) }

set.seed(1) x <- rnorm(1000) boxhist(x, c(-4,4))

y <- 1:100 boxhist(y, c(1,100))

a <- rep(NA, 1000) a[1:500] <- rnorm(500, 4, 1) a[501:1000] <- -a[1:500] boxhist(a, c(-8,8))

z <- rchisq(1000, df=3) boxhist(z, c(0,15))

boxhist(15-z, c(0,18))

The impact of adding an extreme observation on the 2-sample t-test and Wilcoxon rank sum test

This code uses the dot plot function to create a fancier dot plot of the data than stripchart()

# Search for a dataset where, for samples x and y, # (1) The mean(x) < mean(y) # (2) A two-sample t-test (two-sided) is significant # (3) Adding a new observation to y that is larger than any current # observation in y makes the t-test not significant

thesum <- 0

while(thesum =3) {

x <- rnorm(5, 5, 1) y <- rnorm(4, 6, 1)

y2 <- c(y, max(y) + runif(1,0,3))

thesum <- (t.test(x,y2, var.equal=T)$p.value > .05) + (t.test(x,y, var.equal=T)$p.value < .05) + (mean(x) < mean(y))

}

# One such dataset x <- c(4.98, 4.96, 5.22, 4.47, 4.8) y <- c(5.2, 5.77, 5.37, 5.11) y2 <- c(5.2, 5.77, 5.37, 5.11, 8.51)

data1 <- c(x,y) group1 <- factor(c(rep(0,5), rep(1,4)), levels=0:1)

data2 <- c(x,y2) group2 <- factor(c(rep(0,5), rep(1,5)), levels=0:1)

cbind(data1,group1) cbind(data2,group2)

# A plot of the data library(lattice) stripplot(data2~group2, jitter=T, ylab="Dependant Variable", xlab="Group")

# A fancier plot of the data source("~/rlib/dotplot.R") dotplot(data2~group2, show.n=TRUE, median.line=TRUE, ylab="Dependant Variable", xlab="Group") points(2, max(y2), col=2,pch=20) legend("topleft",inset=.05, c("Original dataset","Added data point"), pch=20, col=c(1,2)) text(1,7,"p = ???", cex=1.5)

# T-test gives unanticipated answer-- a larger p-value for dataset 2 t.test(x,y,var.equal=TRUE,paired=FALSE) t.test(data1~group1, var.equal=TRUE, paired=FALSE)

t.test(x,y2,var.equal=TRUE,paired=FALSE) t.test(data2~group2, var.equal=TRUE, paired=FALSE)

#Non-parametric test is more predictable-- the p-value decreases for dataset2 wilcox.test(x, y) wilcox.test(data1~group1)

wilcox.test(x, y2) wilcox.test(data2~group2)

Analysis of Serial Data Using Summary Measures (17Apr08)

d <- read.csv('MuticData.csv') # file is attached to this web page with(d, table(Mouse,Week)) with(d, table(Mouse,Treatment)) library(lattice) xyplot(T ~ Week | Treatment, groups=Mouse, data=d, type='b') xyplot(T/H ~ Week | Treatment, groups=Mouse, data=d, type='b')

slope.int <- function(z) coef(lsfit(z[,1], z[,2]))

library(Hmisc) s <- with(d, summarize(cbind(Week,T/H), llist(Treatment,Mouse), slope.int, stat.name=c('intercept','slope'))) s

## Mean of each fitted line is line evaluated at 2 weeks. Add this to data. ## Since we are assume a straight line for each mouse, the estimated mean ## will be the observed response at 2 weeks if there was a measurement at ## 2 weeks but not at three weeks. Mean = AUC/(3-1) s$mean <- with(s, intercept + 2*slope) s

## Plot means stripplot(Treatment ~ mean, data=s)

## Do nonparametric ANOVA on the mean response kruskal.test(mean ~ Treatment, data=s)
Topic attachments
I Attachment Action Size Date Who Comment
MuticData.csvcsv MuticData.csv manage 4.7 K 17 Apr 2008 - 14:18 FrankHarrell Serial example data from Nathan Mutic
Topic revision: r16 - 02 Feb 2009, ChrisSlaughter
 

This site is powered by FoswikiCopyright © 2013-2022 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Vanderbilt Biostatistics Wiki? Send feedback