---
title: "R Exercise Solutions"
author: "Cole Beck"
output:
  html_document:
    number_sections: no
  pdf_document:
    number_sections: no
---

```{r setup,echo=FALSE}
library(Hmisc)
knitrSet(lang='markdown')
```

# Manipulating Vectors

1. Modify the following character vector to keep only street names, then sort and remove duplicates.

```{r}
x <- c("120 Main St", "231 Walnut Grove", "374 Central Pk",
       "402 Providence Ln", "555 Central Pk")
```

```{r}
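# 1. strip the leading house number and the spaces after it with one regex
sort(unique(sub("^[0-9 ]+", "", x)))
# 2. remove every digit, then the first remaining space
sort(unique(sub(" ", "", gsub("[0-9]", "", x))))
# 3. split on spaces, drop the first token, and paste the rest back together
xx <- strsplit(x, " ")
res <- character(length(xx))
for(i in seq_along(xx)) {
  res[i] <- paste(xx[[i]][-1], collapse=' ')
}
sort(unique(res))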
```

2. How could you sum all of the numbers between 1 and 1,000 that are evenly divisible by 3 or 5?  What about numbers between 1 and 100,000 divisible by 4, 7, or 13?

```{r}
# combine the multiples of each divisor, drop the overlap, and sum
sum(unique(c(seq(3, 1000, by=3), seq(5, 1000, by=5))))
sum(unique(c(seq(4, 100000, by=4), seq(7, 100000, by=7), seq(13, 100000, by=13))))

# or test every candidate with the modulo operator
x <- seq(1000)
sum(x[x %% 3 == 0 | x %% 5 == 0])
x <- seq(100000)
sum(x[x %% 4 == 0 | x %% 7 == 0 | x %% 13 == 0])
```
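Both approaches above enumerate every candidate. As a cross-check, the same totals can be reached with the arithmetic-series formula and inclusion-exclusion. This is a minimal sketch, not part of the original solution; `sum_multiples`, `upper`, and `cand` are hypothetical names, and it relies on the divisors in each question being pairwise coprime so that each least common multiple is just the product.

```{r}
# Sum of all positive multiples of k that are <= n:
# k * (1 + 2 + ... + m), where m = floor(n / k)
sum_multiples <- function(k, n) {
  m <- n %/% k
  k * m * (m + 1) / 2
}

# divisible by 3 or 5: add both sets, subtract the overlap (multiples of 15)
closed_1k <- sum_multiples(3, 1000) + sum_multiples(5, 1000) -
  sum_multiples(3 * 5, 1000)

# divisible by 4, 7, or 13: add singles, subtract pairs, add back the triple
upper <- 100000
closed_100k <- sum_multiples(4, upper) + sum_multiples(7, upper) +
  sum_multiples(13, upper) -
  sum_multiples(4 * 7, upper) - sum_multiples(4 * 13, upper) -
  sum_multiples(7 * 13, upper) +
  sum_multiples(4 * 7 * 13, upper)

# compare against the brute-force answers
cand <- seq(1000)
closed_1k == sum(cand[cand %% 3 == 0 | cand %% 5 == 0])
cand <- seq(upper)
closed_100k == sum(cand[cand %% 4 == 0 | cand %% 7 == 0 | cand %% 13 == 0])
```

The closed form does a fixed amount of arithmetic regardless of the upper limit, which matters only when the range is far larger than the ones in this exercise.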

# Writing Functions

Celsius to Fahrenheit: $f(x) = (x*9/5) + 32$

Celsius to Kelvin: $f(x) = x + 273.15$

1. Write a temperature conversion function.  It should take a vector of temperatures, the `from` type, and the `to` type.

```{r}
temp <- function(x, from='C', to='F') {
  if(from == 'F') {
    x <- (x - 32)*5/9
  } else if(from == 'K') {
    x <- x - 273.15
  }
  if(to == 'F') {
    x <- x*9/5 + 32
  } else if(to == 'K') {
    x <- x + 273.15
  }
  x
}
```

```{r}
# test temp function with this data
set.seed(20)
x <- round(rnorm(30, 10, 10))
xf <- temp(x, from='C', to='F')
xk <- temp(x, from='C', to='K')
all.equal(temp(xf, from='F', to='K'), xk)
```
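One property of `temp` worth noting: a scale code it does not recognize (for example `from='X'`) falls through both `if` chains, so the input is silently treated as Celsius. The sketch below is one possible variant that rejects unknown codes with `match.arg()`. The name `convert_temp` is hypothetical and this is not the original solution; the default for `to` is kept as `'F'` by listing it first.

```{r}
# `convert_temp` is a hypothetical variant of `temp` with argument checking
convert_temp <- function(x, from=c('C','F','K'), to=c('F','C','K')) {
  from <- match.arg(from)
  to <- match.arg(to)
  # step 1: convert the input to Celsius
  celsius <- switch(from,
    C = x,
    F = (x - 32)*5/9,
    K = x - 273.15
  )
  # step 2: convert Celsius to the requested scale
  switch(to,
    C = celsius,
    F = celsius*9/5 + 32,
    K = celsius + 273.15
  )
}

convert_temp(c(0, 100), from='C', to='F')
convert_temp(32, from='F', to='K')
# an unknown scale now raises an error instead of passing through unchanged
try(convert_temp(10, from='X', to='F'))
```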

# Manipulating Data Frames

1. Read in the CSV file

```{r}
dat <- read.csv("https://github.com/fonnesbeck/Bios6301/raw/master/datasets/haart.csv", stringsAsFactors = FALSE)
```

2. Describe the data set

```{r}
describe(dat)
```

3. Create a categorical variable `gender`, using `male`

```{r}
# either line works; the second relies on the level 0 sorting before 1
dat[,'gender'] <- factor(ifelse(dat[,'male'] == 0, 'Female', 'Male'))
dat[,'gender'] <- factor(dat[,'male'], labels=c('Female','Male'))
```
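The `labels=` form depends on `factor()` sorting the observed values, so that 0 becomes the first level and 1 the second. Where that ordering should not be left implicit, the levels can be named as well. A small sketch with made-up data (`male_demo` and `gender_demo` are hypothetical names, not columns of `dat`):

```{r}
# made-up 0/1 indicator standing in for dat[,'male']
male_demo <- c(1, 0, 0, 1, 1)
# pair each level with its label explicitly
gender_demo <- factor(male_demo, levels=c(0, 1), labels=c('Female','Male'))
gender_demo
# cross-tabulate to confirm the mapping
table(male_demo, gender_demo)
```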

4. Convert `init.date` and `last.visit` into Date variables

```{r}
dat[,'init.date'] <- as.Date(dat[,'init.date'], format='%m/%d/%y')
dat[,'last.visit'] <- as.Date(dat[,'last.visit'], format='%m/%d/%y')
```
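The `%y` code reads a two-digit year, and R resolves the century by rule: on input, 00 to 68 become 20xx and 69 to 99 become 19xx. The sketch below illustrates this with made-up strings in the same month/day/year layout (`d_demo` is a hypothetical name); it also shows that a string that does not match the format becomes `NA` without raising an error, so checking for unexpected `NA` values after conversion is a reasonable habit.

```{r}
# made-up date strings, not taken from the data set
d_demo <- as.Date(c('7/1/03', '12/31/68', '1/1/69'), format='%m/%d/%y')
d_demo
# the four-digit years R inferred
format(d_demo, '%Y')
# a non-matching string is converted to NA
as.Date('unknown', format='%m/%d/%y')
```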

5. Create the column `daysbetween` by calculating the number of days between visits

```{r}
# subtracting Dates returns a difftime in days; difftime() states the unit explicitly
dat[,'daysbetween'] <- dat[,'last.visit'] - dat[,'init.date']
dat[,'daysbetween'] <- difftime(dat[,'last.visit'], dat[,'init.date'], units='days')
```
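Both forms store a `difftime` object rather than a plain number. If a numeric column is preferred, for instance before passing it to a function that does not understand `difftime`, it can be converted with `as.numeric()`. A minimal sketch with made-up dates (`start_demo`, `end_demo`, and `gap_demo` are hypothetical names):

```{r}
# made-up dates standing in for init.date and last.visit
start_demo <- as.Date(c('2003-01-01', '2004-02-01'))
end_demo <- as.Date(c('2003-01-31', '2004-03-01'))
gap_demo <- end_demo - start_demo
class(gap_demo)
# drop the difftime class to get a plain count of days
as.numeric(gap_demo, units='days')
```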

6. Subset the data where `age` is greater than 40 and `death` is zero.  Only keep the following columns: `gender`, `age`, `cd4baseline`, `weight`, `daysbetween`

```{r}
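# subset() evaluates the conditions and column names inside `dat`
dat_subset <- subset(dat, age > 40 & death == 0, c(gender, age, cd4baseline, weight, daysbetween))
# bracket-indexing version; this overwrites `dat`, which the next exercise reorders
dat <- dat[dat[,'age'] > 40 & dat[,'death'] == 0, c('gender','age','cd4baseline','weight','daysbetween')]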
```

7. Reorder the data by `age`

```{r}
dat <- dat[order(dat[,'age']),]
```

# Models

```{r}
gender <- c('M','M','F','M','F','F','M','F','M')
age <- c(34, 64, 38, 63, 40, 73, 27, 51, 47)
smoker <- c('no','yes','no','no','yes','no','no','no','yes')
exercise <- factor(c('moderate','frequent','some','some','moderate','none',
                     'none','moderate','moderate'),
                    levels=c('none','some','moderate','frequent'), ordered=TRUE
)
los <- c(4,8,1,10,6,3,9,4,8)
x <- data.frame(gender, age, smoker, exercise, los)
```

1. Create a linear model using `x`, estimating the association between `los` and all remaining variables

```{r}
# spell out `data=`; `dat=` only works through partial argument matching
lm(los ~ gender + age + smoker + exercise, data=x)
lm(los ~ ., data=x)
```

2. Create a new model, this time predicting `los` by `gender`; show the model summary

```{r}
mod <- lm(los ~ gender, data=x)
summary(mod)
```

3. What is the estimate for the intercept?  What is the estimate for gender?

```{r}
coef(mod)[1]
coef(mod)[2]
```

4. Recalculate the standard errors by taking the square root of the diagonal of the variance-covariance matrix of the linear model summary

```{r}
sqrt(diag(vcov(summary(mod))))
```
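To confirm that the recalculated values agree with the ones `summary()` reports, they can be compared against the `Std. Error` column of the coefficient table. The sketch below rebuilds only the two columns the model needs so it runs on its own; `se_demo`, `se_fit`, `se_manual`, and `se_table` are hypothetical names. It calls `vcov()` on the fitted model directly, which is assumed here to be interchangeable with calling it on the summary.

```{r}
# the same gender and los values as above, under separate names
se_demo <- data.frame(
  gender = c('M','M','F','M','F','F','M','F','M'),
  los = c(4,8,1,10,6,3,9,4,8)
)
se_fit <- lm(los ~ gender, data=se_demo)
# standard errors from the variance-covariance matrix
se_manual <- sqrt(diag(vcov(se_fit)))
# standard errors as reported in the coefficient table
se_table <- coef(summary(se_fit))[, 'Std. Error']
all.equal(se_manual, se_table)
```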

5. Predict `los` with the following new data set

```{r}
newdat <- data.frame(gender=c('F','M','F'))
predict(mod, newdat)
```

6. Sum the square of the residuals of the model.  Compare this to passing the model to the `deviance` function.

```{r}
sum(residuals(mod)^2)
deviance(mod)
```

7. Create a subset of `x` containing all records where `gender` is 'M' and assign it to the variable `men`. Do the same for 'F', assigning it to the variable `women`.

```{r}
men <- x[x[,'gender'] == 'M',]
women <- x[x[,'gender'] == 'F',]
```

8. Call the `t.test` function, where the first argument is `los` for women and the second argument is `los` for men.  Add the argument `var.equal` and set it to `TRUE`.  Does this match the p-value computed in the model summary?

```{r}
t.test(women$los, men$los, var.equal=TRUE)
```
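The question can be checked numerically. With `var.equal=TRUE` the two-sample t-test uses the same pooled variance as the linear model with a single two-level predictor, so the two p-values are expected to agree. A self-contained sketch (`pv_demo`, `pv_fit`, `pv_model`, and `pv_ttest` are hypothetical names; the data repeat the `gender` and `los` values defined earlier):

```{r}
pv_demo <- data.frame(
  gender = c('M','M','F','M','F','F','M','F','M'),
  los = c(4,8,1,10,6,3,9,4,8)
)
pv_fit <- lm(los ~ gender, data=pv_demo)
# p-value for the gender coefficient in the model summary
pv_model <- coef(summary(pv_fit))['genderM', 'Pr(>|t|)']
# p-value from the pooled-variance t-test, women first and men second
pv_ttest <- t.test(pv_demo$los[pv_demo$gender == 'F'],
                   pv_demo$los[pv_demo$gender == 'M'],
                   var.equal=TRUE)$p.value
all.equal(pv_model, pv_ttest)
```

Dropping `var.equal=TRUE` gives the Welch test, which does not pool the variances and so generally reports a different p-value.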

# Generating Plots

Given the `vlbw` data set, use `qplot` from `ggplot2` to create several plots.

```{r}
library(ggplot2)
getHdata(vlbw)
```

1. Scatterplot of `gest` vs. `bwt`

```{r}
qplot(gest, bwt, data=vlbw)
```

2. Scatterplot of `gest` vs. `bwt`, add color and shape using variable `sex`

```{r}
qplot(gest, bwt, data=vlbw, color=sex, shape=sex)
```

3. Boxplot of `bwt` by `sex`

```{r}
qplot(sex, bwt, data=vlbw, geom='boxplot')
```

4. Scatterplot of `gest` vs. `bwt`, facet by `race`

```{r}
qplot(gest, bwt, data=vlbw, facets=race~.)
```

5. Scatterplot of `gest` vs. `bwt`, add regression line

```{r}
qplot(gest, bwt, data=vlbw) + geom_smooth(method="lm")
```
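Recent releases of ggplot2 (3.4.0 and later) mark `qplot()` as deprecated, so the same plots may eventually need to be written with `ggplot()`. The sketch below shows what two of them might look like in that form. It uses a small made-up data frame (`plot_demo`, a hypothetical name) that only borrows the column names `gest`, `bwt`, and `sex`, so that it runs without downloading `vlbw`; the values are not real data.

```{r}
library(ggplot2)
# made-up values that only borrow the column names used above
plot_demo <- data.frame(
  gest = c(26, 28, 29, 31, 32, 34),
  bwt = c(800, 950, 1100, 1250, 1300, 1450),
  sex = c('female','male','female','male','female','male')
)
# scatterplot with a least-squares line, as in exercise 5
p_demo <- ggplot(plot_demo, aes(x=gest, y=bwt)) +
  geom_point() +
  geom_smooth(method='lm')
p_demo
# boxplot by group, as in exercise 3
ggplot(plot_demo, aes(x=sex, y=bwt)) + geom_boxplot()
```

Replacing `plot_demo` with `vlbw` should give the counterparts of the `qplot` calls above.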