---
title: "R Exercise Solutions"
author: "Cole Beck"
output:
  html_document:
    number_sections: no
  pdf_document:
    number_sections: no
---

```{r setup,echo=FALSE}
require(Hmisc)
knitrSet(lang='markdown')
```

# Manipulating Vectors

1. Modify the following character vector to keep only street names, then sort and remove duplicates.

```{r}
x <- c("120 Main St", "231 Walnut Grove", "374 Central Pk", "402 Providence Ln", "555 Central Pk")
```

```{r}
# option 1: strip the leading digits and spaces
sort(unique(sub("^[0-9 ]+", "", x)))

# option 2: remove all digits, then the first space
sort(unique(sub(" ", "", gsub("[0-9]", "", x))))

# option 3: split on spaces and drop the first token
xx <- strsplit(x, " ")
res <- character(length(xx))
for(i in seq_along(xx)) {
  res[i] <- paste(xx[[i]][-1], collapse=' ')
}
sort(unique(res))
```

2. How could you sum all of the numbers between 1 and 1,000 that are evenly divisible by 3 or 5? What about numbers between 1 and 100,000 divisible by 4, 7, or 13?

```{r}
# option 1: build each sequence of multiples, then drop duplicates
sum(unique(c(seq(3, 1000, by=3), seq(5, 1000, by=5))))
sum(unique(c(seq(4, 100000, by=4), seq(7, 100000, by=7), seq(13, 100000, by=13))))

# option 2: filter with the modulo operator
x <- seq(1000)
sum(x[x %% 3 == 0 | x %% 5 == 0])
x <- seq(100000)
sum(x[x %% 4 == 0 | x %% 7 == 0 | x %% 13 == 0])
```

# Writing Functions

Celsius to Fahrenheit: $f(x) = (x*9/5) + 32$

Celsius to Kelvin: $f(x) = x + 273.15$

1. Write a temperature conversion function. It should take a vector of temperatures, the `from` type, and the `to` type.

```{r}
temp <- function(x, from='C', to='F') {
  # first convert the input to Celsius
  if(from == 'F') {
    x <- (x - 32)*5/9
  } else if(from == 'K') {
    x <- x - 273.15
  }
  # then convert Celsius to the requested scale
  if(to == 'F') {
    x <- x*9/5 + 32
  } else if(to == 'K') {
    x <- x + 273.15
  }
  x
}
```

```{r}
# test temp function with this data
set.seed(20)
x <- round(rnorm(30, 10, 10))
xf <- temp(x, from='C', to='F')
xk <- temp(x, from='C', to='K')
all.equal(temp(xf, from='F', to='K'), xk)
```

# Manipulating Data Frames

1. Read in the CSV file

```{r}
dat <- read.csv("https://github.com/fonnesbeck/Bios6301/raw/master/datasets/haart.csv", stringsAsFactors = FALSE)
```

2. Describe the data set

```{r}
describe(dat)
```

3. Create a categorical variable `gender`, using `male`

```{r}
# two equivalent options
dat[,'gender'] <- factor(ifelse(dat[,'male'] == 0, 'Female', 'Male'))
dat[,'gender'] <- factor(dat[,'male'], labels=c('Female','Male'))
```

4. Convert `init.date` and `last.visit` into Date variables

```{r}
dat[,'init.date'] <- as.Date(dat[,'init.date'], format='%m/%d/%y')
dat[,'last.visit'] <- as.Date(dat[,'last.visit'], format='%m/%d/%y')
```

5. Create the column `daysbetween` by calculating the number of days between visits

```{r}
# two equivalent options
dat[,'daysbetween'] <- dat[,'last.visit'] - dat[,'init.date']
dat[,'daysbetween'] <- difftime(dat[,'last.visit'], dat[,'init.date'], units='days')
```

6. Subset the data where `age` is greater than 40 and `death` is zero. Only keep the following columns: `gender`, `age`, `cd4baseline`, `weight`, `daysbetween`

```{r}
# option 1: subset(), stored under a new name
dat_subset <- subset(dat, age > 40 & death == 0, c(gender, age, cd4baseline, weight, daysbetween))

# option 2: bracket indexing, overwriting dat
dat <- dat[dat[,'age'] > 40 & dat[,'death'] == 0, c('gender','age','cd4baseline','weight','daysbetween')]
```

7. Reorder the data by `age`

```{r}
dat <- dat[order(dat[,'age']),]
```

# Models

```{r}
gender <- c('M','M','F','M','F','F','M','F','M')
age <- c(34, 64, 38, 63, 40, 73, 27, 51, 47)
smoker <- c('no','yes','no','no','yes','no','no','no','yes')
exercise <- factor(c('moderate','frequent','some','some','moderate','none',
                     'none','moderate','moderate'),
                   levels=c('none','some','moderate','frequent'), ordered=TRUE
)
los <- c(4,8,1,10,6,3,9,4,8)
x <- data.frame(gender, age, smoker, exercise, los)
```

1. Create a linear model using `x`, estimating the association between `los` and all remaining variables

```{r}
# two equivalent options; `.` stands for every column other than the response
lm(los ~ gender + age + smoker + exercise, data=x)
lm(los ~ ., data=x)
```

2. Create a new model, this time predicting `los` by `gender`; show the model summary

```{r}
mod <- lm(los ~ gender, data=x)
summary(mod)
```

3. What is the estimate for the intercept? What is the estimate for gender?

```{r}
coef(mod)[1]
coef(mod)[2]
```

4. Re-calculate the standard errors by taking the square root of the diagonal of the variance-covariance matrix of the summary of the linear model

```{r}
sqrt(diag(vcov(summary(mod))))
```

5. Predict `los` with the following new data set

```{r}
newdat <- data.frame(gender=c('F','M','F'))
predict(mod, newdat)
```

6. Sum the squares of the residuals of the model. Compare this to passing the model to the `deviance` function.

```{r}
sum(residuals(mod)^2)
deviance(mod)
```

7. Create a subset of `x` by taking all records where `gender` is 'M' and assigning it to the variable `men`. Do the same for the variable `women`.

```{r}
men <- x[x[,'gender'] == 'M',]
women <- x[x[,'gender'] == 'F',]
```

8. Call the `t.test` function, where the first argument is `los` for women and the second argument is `los` for men. Add the argument `var.equal` and set it to `TRUE`. Does this match the p-value computed in the model summary?

```{r}
t.test(women$los, men$los, var.equal=TRUE)
```

# Generating Plots

Given the `vlbw` data set, use `qplot` from `ggplot2` to create several plots.

```{r}
require(ggplot2)
getHdata(vlbw)
```

1. Scatterplot of `gest` VS `bwt`

```{r}
qplot(gest, bwt, data=vlbw)
```

2. Scatterplot of `gest` VS `bwt`, add color and shape using variable `sex`

```{r}
qplot(gest, bwt, data=vlbw, color=sex, shape=sex)
```

3. Boxplot of `bwt` by `sex`

```{r}
qplot(sex, bwt, data=vlbw, geom='boxplot')
```

4. Scatterplot of `gest` VS `bwt`, facet by `race`

```{r}
qplot(gest, bwt, data=vlbw, facets=race~.)
```

5. Scatterplot of `gest` VS `bwt`, add regression line

```{r}
qplot(gest, bwt, data=vlbw) + geom_smooth(method="lm")
```
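
The `qplot` calls above can also be written with the full `ggplot()` interface, which may be preferable because `qplot` is marked as deprecated in ggplot2 3.4.0 and later. The chunk below is a minimal sketch that combines exercises 2 and 5 (color and shape by `sex`, plus a regression line). It assumes `ggplot2` is installed; `toy` is a hypothetical stand-in for `vlbw` with made-up values so the chunk runs without downloading data, and `p` is just an illustrative name.

```{r}
library(ggplot2)

# Hypothetical stand-in for vlbw; the values are invented for illustration only.
toy <- data.frame(
  gest = c(26, 28, 29, 30, 31, 33),
  bwt  = c(800, 950, 1100, 1200, 1300, 1450),
  sex  = factor(c('female', 'male', 'female', 'male', 'female', 'male'))
)

# Map gest and bwt once in ggplot(); color and shape apply to the points only,
# so geom_smooth() fits a single regression line across both sexes.
p <- ggplot(toy, aes(x = gest, y = bwt)) +
  geom_point(aes(color = sex, shape = sex)) +
  geom_smooth(method = "lm", formula = y ~ x)
print(p)
```

With the real data, replacing `toy` with `vlbw` should give the same kind of plot as the `qplot` versions. Moving `color = sex` into the top-level `aes()` would instead fit one line per sex.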