The syllabus will be updated weekly.
Thursday, August 21
Learning Objectives (LOs)
- Get a sense of the class; determine if it is a good fit for you. See Bios 311 Class Details and Bios 311 Syllabus 2013 for more information.
- Learn some of your classmates' names. Get to know them a little.
- Learn all of the pig rolls, terminology, and scoring values. See roll values.
Homework Assignments (HW)
- Read Rosner Ch 1 and 2. By "read", I mean: skim to remind yourself of the parts you already know (should be most of it) and read to learn the parts you don't know. I'll lecture very briefly on these chapters Tuesday.
- Borrow a pig to take home for the weekend.
- Roll your pig 100 times and collect the outcomes, i.e. the number of times it landed on each of the six possibilities.
- dot side up
- dot side down
- razorback
- trotter
- snouter
- leaning jowler
- On Tuesday, you'll enter your data into a spreadsheet and return your pig.
- Install R and RStudio on your laptop.
- On Tuesday, bring your laptops and books for class and lab. You'll be working problems from the text and a group quiz I'll hand out.
Tuesday, August 26
Learning Objectives (LOs)
- Sampling Distributions!!!
- Exploring R.
- Chapter 2 concepts, especially summarizing and describing data/distributions.
Class Outline/Summary
Class Pig Data
Introduction to sampling distributions:
This is the most important idea for understanding the "why" of statistical methods in the traditional paradigm, aka the frequentist paradigm.
Introduction to sampling and functions in R
pigSides <- c( 'dot', 'nodot', 'razorback', 'trotter', 'snouter', 'leaningjowler' )
pigSides
sample( pigSides, replace = F )
sample( pigSides, replace = T )
sample( pigSides, 20, replace = T )
ClassRandomizer <- function(){
# function that splits the class into random groups
# this function has no inputs, e.g. you can't specify the number of groups
class <- c(
'Alex',
'Alice C.',
'Alice T.',
'Andrew',
'Christopher',
'Derek',
'Jea Young',
'Jie',
'Jonathan',
'Lauren',
'Linda',
'Ryan',
'Sam',
'Svetlana',
'Travis',
'Ying'
)
# randomly shuffle the class
classSample <- sample( class, replace=F )
# print out the groups
# do I need to add Meredith?
# printing would be prettier with commas between names
cat( c( "\nGroup A:", classSample[1:4]), "\n\n" )
cat( c( "Group B:", classSample[5:8]), "\n\n" )
cat( c( "Group C:", classSample[9:12]), "\n\n" )
cat( c( "Group D:", classSample[13:16]), "\n\n" )
}
Experiment Simulator Given The TRUTH
- Let R = one roll of a pig
- R = 1 if roll is a razorback
- R = 0 if roll is anything other than a razorback
- Let X = sum of the results for 100 rolls
- X = R1 + R2 + R3 + ... + R100
- Let theta = the true probability of rolling a razorback
- Let theta.hat = your estimate for the probability of rolling a razorback
What is the distribution of theta.hat for a given theta? Can you run a bunch of experiments quickly?
SamplingDistribution <- function( nExperiments = 10^4, nPerExperiment = 100, theta = 0.40 ){
# simulate a bunch of experiments of size nPerExperiment and where theta is known (a bunch = nExperiments)
# think about how you'd write this function using just the sample function
# let this next step just be magic for now; in short, R already has a function to do exactly what we want
rbinom( nExperiments, nPerExperiment, theta )
}
dist1 <- SamplingDistribution()
print( summary(dist1) )
Quiz 01
Due via email by the end of the day; aim to finish by the end of class or lab to keep from overdoing it. Here is how to read in the data.
data <- read.csv( "http://biostat.mc.vanderbilt.edu/wiki/pub/Main/Bios311Syllabus2014/20140826-ClassPigData.csv" )
Q01) Using many of the summary statistics and plots from Rosner Ch 2, summarize the class' sampling distribution for theta.hat = the estimated probability of rolling a razorback from an experiment where N = 100 rolls.
Q02) Using a few useful summary statistics and plots of your choice, describe the five sampling distributions for theta.hat assuming five different values for theta. Include a broad range of values for theta and justify your choices. You may want to play with a lot more than five values.
Q03) Pick one of your group's experimental results for theta.hat from the class' data. Based on what you've learned about sampling distributions from Q02, come up with a way of expressing your uncertainty about the true theta = probability of rolling a razorback. Justify your suggestion. I'm not looking for a certain answer here. I'm looking for deep thought about what sampling distributions reveal when you only get to see one experiment with an unknown theta actually performed.
Thursday, August 28
Learning Objectives (LOs)
- Get a vision of where we are going -- we want to carefully describe the uncertainty of our estimates.
- Introduction to probability
- S = Sample space, A = event, Pr( A ) or simply P( A ) = probability.
- Venn Diagrams. Unions and intersections.
- Complements of sets and the Null set.
- Mutually exclusive events. P(A or B) = P(A U B) = P(A) + P(B).
- The addition law of probability. P(A or B) = P(A U B) = P(A) + P(B) - P(AB).
- Independence. Multiplication law of probability. P(AB) = P(A)*P(B). What this looks like in a Venn diagram.
- Conditional probability. P(AB)/P(B) = the conditional probability of A given B. What this looks like in a Venn diagram.
- Probability tree diagrams. Not in the book, but can help thinking through some problems of modest scale.
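These laws are easy to sanity-check by simulation. Here is a quick sketch (the die example is mine, not from Rosner): for a fair six-sided die, let A = roll is even and B = roll is 4 or less; A and B happen to be independent, so the simulated P(AB) should match P(A)*P(B).

```{r}
# Illustrative check of the multiplication law and conditional probability
# by simulation. A = roll is even, B = roll is 4 or less (independent events).
set.seed(311)
rolls <- sample(1:6, 10^5, replace = TRUE)
A <- rolls %% 2 == 0
B <- rolls <= 4
mean(A & B)            # estimates P(AB); true value is 1/3
mean(A) * mean(B)      # P(A)*P(B); also near 1/3, consistent with independence
mean(A & B) / mean(B)  # conditional probability P(A|B); true value is 1/2
```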
Class Outline/Summary
Fun asides:
- Donating-vs-Death graph from Vox.
- NIH funding.
- An interesting blog.
Quiz discussion: describing uncertainty and reaching an impasse. Why we need to add some basic probability to our toolkit.
Supplemental lecture notes: 010ProbabilityLecture.pdf. This is optional reading, which you may find helpful.
Homework
- Read Rosner Ch 3. Reading guidance:
- 3.1-3.6 Intro, Definitions, Notation, Multiplication law, Addition law, Conditional probability - these are key ideas we need to move our discussion forward.
- 3.7-3.10 Bayes' rule and screening tests, Bayesian inference, ROC curves, Prevalence and incidence - we don't need these ideas for the immediate discussion, but the ideas are very important (and tend to show up on Biostat comps).
- 3.11 Summary - usually a better intro than the intro.
- Work the problems from the beginning through the Genetics group of problems (3.30 etc.). The Genetics and Mental Health groups of problems are especially nice for practicing the basic probability skills.
- Start Reading Rosner Ch 4. Reading guidance:
- 4.1-4.3 and 4.8 Intro, RV's, PMF (PDF for discrete distributions), Binomial Distribution - this is the other set of key ideas we need to move our discussion forward.
Tuesday, September 02
Learning Objectives
- Binomial distribution
- Confidence intervals for proportions
- Operating characteristics of confidence intervals (performance metrics)
Class Outline/Summary
Quiz 02
Due via email by the end of the day; aim to finish by the end of class or lab to keep from overdoing it.
Randomly assigned groups:
- Group A: Lauren Alex Travis Sam
- Group B: Jonathan Jea Young Svetlana Alice T.
- Group C: Christopher Jie Alice C. Derek
- Group D: Linda Ying Ryan Andrew
Q01) Using the rbinom() and hist() commands, graphically display an approximate pdf for X ~ Binom(100, 0.25). Comment on why this figure is an approximation and what influences the accuracy of the approximation.
Q02) Using the dbinom() and plot() commands, graphically display an exact pdf for X ~ Binom(100, 0.25).
Q03) Suppose you had rolled 22 razorbacks out of 100 rolls. Find a lower and upper bound (LB, UB) for the true probability of rolling a razorback using the following criteria.
- argmin_UB[ P(X <= 22 | theta = UB) <= 0.025 ]
- argmax_LB[ P(X >= 22 | theta = LB) <= 0.025 ]
- Note the pbinom() function and some trial and error may come in handy here.
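If the trial and error gets tedious, here is an optional sketch (not required for the quiz) that hands the search to R's uniroot() function, treating each quiz criterion as a root-finding problem. The search intervals are my rough guesses around 0.22.

```{r}
# Optional: solve the Q03 criteria numerically with uniroot() instead of
# manual trial and error. Search intervals are rough guesses around 0.22.
UB <- uniroot(function(theta) pbinom(22, 100, theta) - 0.025,
              interval = c(0.22, 0.99))$root         # P(X <= 22 | UB) = 0.025
LB <- uniroot(function(theta) (1 - pbinom(21, 100, theta)) - 0.025,
              interval = c(0.01, 0.22))$root         # P(X >= 22 | LB) = 0.025
round(c(LB, UB), 4)
```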
Q04) Check your answer for Q03 using the binconf() command in the Hmisc package for R.
Q05) Let X ~ Binom(100, 0.25). Let C = 1 if the exact 95% CI (akin to Q03 and Q04) contains the true value of theta, 0.25. Let C = 0 if it does not. Calculate E[ C ], i.e. calculate the true coverage probability for the 95% exact CI given X is Binom(100, 0.25).
Q06) Let theta vary. Plot E[ C ] for a bunch of thetas ranging from 0 to 1.
Thursday, September 04
Learning Objectives
- Confidence intervals for proportions
- Operating characteristics of confidence intervals (performance metrics)
Class Outline/Summary
- Quiz 02 feedback.
- Quiz 02 follow-up exercise.
- Compare the coverage of the exact 95% CI with the Bootstrap Credible Interval (see below) over a range of thetas, i.e. compare the "correct" answer to Q6 to the "incorrect" answer.
- Discuss how the answers compare. Is one method clearly preferable? Which would you choose and why?
- Look at another performance metric, CI widths. Discuss how the answers compare. Is one method clearly preferable? Which would you choose and why?
- Time permitting, try a few different N's, say 40, 500, and 3000. How does this impact your decision of which method to choose?
A common, incorrect, but well-reasoned solution for Q3-Q6
A common solution goes like this. Note, we will call this the Bootstrap Credible Interval.
- We rolled 22 razorbacks out of 100 rolls.
- Suppose the true theta was 0.22. What is a reasonable range for the number of razorbacks we could roll if theta=0.22? Specifically, solve for the 2.5th and 97.5th percentiles of Binom(100, 0.22). The command qbinom() may be helpful here. This yields a set of bounds on the scale of X, the number of razorbacks rolled.
- Convert that range of razorback rolls into an upper and lower bound for theta, i.e. divide by 100.
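Those two steps can be sketched in one line of R (the numbers just apply the recipe to our 22/100 example):

```{r}
# Bootstrap Credible Interval recipe: bound the plausible razorback counts
# assuming theta = 0.22, then divide by 100 to move to the theta scale.
qbinom(c(0.025, 0.975), size = 100, prob = 0.22) / 100
```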
It turns out the logic of this approach is not far off from the logic of what we'll learn as the asymptotic normal confidence interval.
The logic for what's called the Exact confidence interval runs that approach in reverse. It goes like this:
- We rolled 22 razorbacks out of 100 rolls.
- Sequentially suppose a whole bunch of thetas going from 0 to 1. Which of those thetas are plausible given that we rolled 22 razorbacks?
- Eliminate any thetas where the probability of rolling a 22 or anything even more extreme for that theta is < 0.025.
- One way of expressing that is the two equations given in the quiz. Another is equation 6.20 in Rosner.
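That elimination step can be sketched as a brute-force grid search (an alternative to the trial and error used in the solutions; the grid width here is my choice):

```{r}
# Grid-search version of the exact-interval logic: keep every theta for
# which rolling 22 (or something more extreme) is not too surprising.
thetas <- seq(0, 1, by = 0.0001)
upper.tail <- 1 - pbinom(21, 100, thetas)  # P(X >= 22 | theta)
lower.tail <- pbinom(22, 100, thetas)      # P(X <= 22 | theta)
plausible <- thetas[ upper.tail >= 0.025 & lower.tail >= 0.025 ]
range(plausible)
```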
My Quiz 02 Solutions
*Q01)* Using the rbinom() and hist() commands, graphically display an approximate pdf for X ~ Binom(100, 0.25). Comment on why this figure is an approximation and what influences the accuracy of the approximation.
```{r}
hist( rbinom(10^5,100,0.25) )
```
*Q02)* Using the dbinom() and plot() commands, graphically display an exact pdf for X ~ Binom(100, 0.25).
```{r}
x <- 0:100
plot( x, dbinom(x,100,0.25), type='s' )
```
*Q03)* Suppose you had rolled 22 razorbacks out of 100 rolls. Find a lower and upper bound (LB, UB) for the true probability of rolling a razorback using the following criteria.
argmin_UB[ P(X <= 22 | theta = UB) <= 0.025 ]
argmax_LB[ P(X >= 22 | theta = LB) <= 0.025 ]
Note the pbinom() function and some trial and error may come in handy here.
```{r}
# solve for lower bound
# tricky part: you need to make sure 22 is included in your sum of the upper tail, i.e. sum from 0 to 21 and subtract that prob from 1
# First, get a broad sense of where the threshold is.
1-pbinom(22-1,100,0.14,lower.tail=T)
1-pbinom(22-1,100,0.15,lower.tail=T)
# Then, iteratively zoom in ...
1-pbinom(22-1,100,0.140,lower.tail=T)
1-pbinom(22-1,100,0.145,lower.tail=T)
# until you reach your desired level of precision.
1-pbinom(22-1,100,0.1433,lower.tail=T)
1-pbinom(22-1,100,0.1434,lower.tail=T)
# Likewise, solve for upper bound.
pbinom(22,100,0.3140,lower.tail=T)
pbinom(22,100,0.3139,lower.tail=T)
```
Answer: (0.143, 0.314)
*Q04)* Check your answer for Q03 using the binconf() command in the Hmisc package for R.
```{r}
library(Hmisc)
binconf(x=22,n=100,method="exact")
```
Answer: (0.1433, 0.3139)
*Q05)* Let X ~ Binom(100, 0.25). Let C = 1 if the exact 95% CI (akin to Q03 and Q04) contains the true value of theta, 0.25. Let C = 0 if it doesn't. Calculate E[ C ], i.e. calculate the true coverage probability for the 95% exact CI given X is Binom(100, 0.25).
```{r}
# Create a vector of the sample space for X,
# the number of razorbacks rolled.
x <- 0:100
# Calculate P(x) for each x
Px <- dbinom(x,100,0.25)
# Data check: make sure Px has the correct mode
# Prevent scientific notation when displaying numbers
# and look at Px rounded to four decimals
options( scipen= 100 )
round(Px,4)
# Create vectors for the 95% exact CI LB and UB
# Note, I couldn't get binconf to handle the vector,
# so I ran this through a for loop.
LB = rep(NA,101)
for(i in 0:100){
LB[i+1] <- binconf(x=i,n=100,method="exact")[2]
}
UB = rep(NA,101)
for(i in 0:100){
UB[i+1] <- binconf(x=i,n=100,method="exact")[3]
}
# Data check - look at the result
cbind(LB,UB)
round( cbind(x,LB,UB), 4 )
# Calculate C with a logical expression
C <- ( LB <= 0.25 & 0.25 <= UB )
# Data check - look at the result
C
round( cbind(x,LB,UB,C), 4 )
round( cbind(x,LB,UB,C,Px), 4 )
# Calculate E[C]
sum( C*Px )
# This is a very precise calculation; however,
# reporting to four decimals is plenty.
round( sum( C*Px ), 4 )
```
So for a Binom(100, 0.25), the 95% Exact Confidence Interval actually has 96.25% coverage.
*Q06)* Let theta vary. Plot E[ C ] for a bunch of thetas ranging from 0 to 1.
```{r}
# First, let's turn Q05 into a function.
CoverageCalc <- function(theta){
# Function to calculate the true coverage of
# the exact 95% CI for a Binom(100,theta).
# Note, R functions return the last item calculated,
# unless specified otherwise with return().
x <- 0:100
Px <- dbinom(x,100,theta)
LB = rep(NA,101)
for(i in 0:100){
LB[i+1] <- binconf(x=i,n=100,method="exact")[2]
}
UB = rep(NA,101)
for(i in 0:100){
UB[i+1] <- binconf(x=i,n=100,method="exact")[3]
}
C <- ( LB <= theta & theta <= UB )
sum( C*Px )
}
# Data check - test the function at 0.25
CoverageCalc (0.25)
# Second, calculate the coverage for
# a bunch of thetas.
N.thetas <- 100
theta <- seq(from=0,to=1,by=1/N.thetas)
theta
Coverage <- rep( NA, length(theta) )
for(i in 1:length(theta)){
Coverage[i] <- CoverageCalc (theta[i])
}
plot( theta, Coverage,
xlim = c(0,1),
ylim = c(0.9,1),
type = 'l',
lwd = 2
)
lines( c(0,1),c(0.95,0.95) )
```
*Extension:* Now compare this to the bootstrap credible interval.
```{r}
# Notice the CIs are fixed for any given X
# so I will run this outside the function to save CPU
x <- 0:100
LB <- qbinom(0.025,100,x/100)/100
UB <- qbinom(0.975,100,x/100)/100
CoverageCalcBoot <- function(theta){
# depends on x, LB, and UB existing
Px <- dbinom(x,100,theta)
C <- ( LB <= theta & theta <= UB )
sum( C*Px )
}
# Data check - test the function at 0.25
CoverageCalcBoot (0.25)
# Second, calculate the coverage for
# a bunch of thetas.
N.thetas <- 100
theta <- seq(from=0,to=1,by=1/N.thetas)
theta
CoverageBoot <- rep( NA, length(theta) )
for(i in 1:length(theta)){
CoverageBoot [i] <- CoverageCalcBoot (theta[i])
}
plot( theta, CoverageBoot,
xlim = c(0,1),
ylim = c(0.9,1),
type = 'l',
lwd = 2
)
lines( c(0,1),c(0.95,0.95) )
# plot both on the same graph
plot( theta, Coverage,
xlim = c(0,1),
ylim = c(0.85,1),
type = 'l',
lwd = 2,
col='red'
)
lines( c(0,1),c(0.95,0.95) )
# par new=T means plot on top of the current plot
# yes, it's a completely counter-intuitive command
par(new=T)
plot( theta, CoverageBoot,
xlim = c(0,1),
ylim = c(0.85,1),
type = 'l',
lwd = 2,
col='blue',
xlab="",
ylab="",
axes=F
)
```
Tuesday, September 09
Learning Objectives
- Review the logic and performance of the exact and bootstrap confidence intervals for a proportion. Can you express these in words?
- Lay the foundation for the traditional confidence interval for a proportion, better called the asymptotic Normal confidence interval.
- Continuous distributions and the Normal distribution.
- Formal definitions of expectation and variance.
- Properties of expectation and variance, especially linear transformations vs. nonlinear transformations.
- Properties of sums of random variables.
- Touch upon the Central Limit Theorem.
Reading
- Rosner 4.4 and 4.5. E[X] and V[X] for a discrete distribution.
- Rosner 4.9. E[X] and V[X] for Binomial.
- Rosner 5.1 - 5.7. Continuous distributions, the Normal distribution, sums of RVs, and the Normal approximation of the Binomial.
Class Outline/Summary
I will lecture for the first hour and leave the remaining class time and all of lab for the quiz.
Lecture Notes
Quiz 03
Due via email before class on Thursday in .pdf or .html format only. If your solution requires R code, please include the code in your document.
*Randomly assigned groups*
Group A: Jonathan, Ying, Christopher, Derek.
Group B: Travis, Alice T., Jea Young, Svetlana.
Group C: Lauren, Ryan, Alice C., Andrew.
Group D: Sam, Jie, Linda, Alex.
Q1 Rosner's Ch 5 Nutrition set of problems, 5.6 - 5.9.
Q2 Rosner's Ch 5 Nutrition set of problems, 5.21 - 5.24.
Q3a Let X ~ Binom(20, 0.17). Let Y ~ N( mu=E[X], sigma=sqrt(V[X]) ), i.e. Y is Normal with mean and variance equal to that of the Binomial X. Take a large sample from each of X and Y and sort them from smallest to largest. Plot Y by X. By large, I mean try getting samples of 10^6. If that crashes your laptop, go with 10^5.
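A sketch of the Q3a mechanics (the comparison and commentary are the actual work; the seed and sample size of 10^5 are my choices):

```{r}
# Q3a setup sketch: large sorted samples from the Binomial and its
# matching Normal, plotted against each other (a QQ-style comparison).
set.seed(311)
n <- 20; theta <- 0.17
X <- sort( rbinom(10^5, n, theta) )
Y <- sort( rnorm(10^5, mean = n*theta, sd = sqrt(n*theta*(1-theta))) )
plot(X, Y)
abline(0, 1)  # perfect agreement would hug this line
```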
Q3b Using the samples from 3, estimate the following probabilities:
- P( X > E[X] + 1*sqrt(V[X]) )
- P( Y > E[X] + 1*sqrt(V[X]) )
- P( X > E[X] + 2*sqrt(V[X]) )
- P( Y > E[X] + 2*sqrt(V[X]) )
- P( X > E[X] + 2.5*sqrt(V[X]) )
- P( Y > E[X] + 2.5*sqrt(V[X]) )
- P( X > E[X] + 3*sqrt(V[X]) )
- P( Y > E[X] + 3*sqrt(V[X]) )
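Each probability in the list can be estimated with the mean of a logical vector. A self-contained sketch for the first pair (using n = 20 as in Q3; the seed is my choice):

```{r}
# Estimating one pair of tail probabilities from large samples.
set.seed(311)
n <- 20; theta <- 0.17
X <- rbinom(10^5, n, theta)
Y <- rnorm(10^5, mean = n*theta, sd = sqrt(n*theta*(1-theta)))
mu <- n*theta
sigma <- sqrt( n*theta*(1-theta) )
mean( X > mu + 1*sigma )  # estimates P( X > E[X] + 1*sqrt(V[X]) )
mean( Y > mu + 1*sigma )  # the Normal sample's version of the same tail
```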
Q4a-b Repeat Q3a-b for Binom(100, 0.17).
Q5a-b Repeat Q3a-b for Binom(1000, 0.17).
Q6 Discuss what Q3 - Q5 teaches us. A bullet point discussion is fine.
Quiz 03 Solutions
Quiz_3.pdf
Thursday, September 11
Learning Objectives
- Review Quiz 03.
- Sums of independent random variables.
- Law of large numbers and central limit theorem.
Reading
- Rosner 4.4 and 4.5. E[X] and V[X] for a discrete distribution.
- Rosner 4.9. E[X] and V[X] for Binomial.
- Rosner 5.1 - 5.7. Continuous distributions, the Normal distribution, sums of RVs, and the Normal approximation of the Binomial.
Class Outline/Summary
Lecture Notes
Quiz 04
Due via email before class on Tuesday in .pdf or .html format only, including "Quiz 04" in the subject line. If your solution requires R code, please include the code in your document. This is take-home, solo submission, open everything and everyone. Seriously, talk to each other, talk to family/friends/professors, search the web, start a web forum, etc. Just remember you are academics and cite your sources.
G0 Make sure you've replied to the email regarding the midterm dates.
Rosner 4.37 - 4.41 (Renal) Hint: answers should be in the back of the book.
Rosner 4.46 - 4.51 (Cancer Epidemiology) Hint: can use dbinom() in R instead of the Excel command suggested.
G4.51a Following up on Rosner 4.51, let D = the total number of deaths over 8 hours. Solve for E[D].
G4.51b Following up on Rosner 4.51, let D = the total number of deaths over 8 hours. Solve for V[D].
Rosner 5.114 - 5.116 Hint: if I'm thinking about it correctly, 5.116 is a three star level problem, i.e. it combines several ideas and requires careful thought. Think about your rules for sums of independent random variables and linear transformations of variables.
Rosner 5.120 - 5.122
G1 Let X ~ N(0,1). Let L = 4*X + 17. Solve for E[L]. Hint: this is easy.
G2 Solve for V[L]. Hint: this is still easy.
G3 Let X ~ N(0,1). Let C = cos(X), where X is treated as radians so cos(3.141593) = -1. Solve for E[C] accurate to four decimal places.
Hint: this is not easy. Before you try getting a precise solution, try taking a big sample of X and go from there.
To work on a precise solution, first in R, try selecting a useful range for X then plot the pdf of X using dnorm(), plot C, and plot the pdf of C. To plot the pdf of C, think about what values should be on the x-axis and what values on the y-axis and how each relates to X.
The following is the start of my solution, which is organized to help me easily sum up four values for P[cos(x)=a] for a bunch of values of a.
# Split up X by the half-period of the cosine.
# These are ordered to emphasize the symmetry, which will help when I need to sum them up.
# delta = the rectangle width for the numeric integration. This will let me control my level of accuracy for E[C].
delta <- 0.001
# Think about why going from -2*pi to 2*pi is sufficient for this problem.
x1pos <- seq(0, pi, delta)
x1neg <- seq(0, -pi, -delta)
x2pos <- seq(2*pi, pi, -delta)
x2neg <- seq(-2*pi, -pi, delta)
# Evaluate C over this range.
c1pos <- cos(x1pos)
c1neg <- cos(x1neg)
c2pos <- cos(x2pos)
c2neg <- cos(x2neg)
# Eyeball check that I've ordered them by the value of cos(x).
w <- sample( 1:length(c2neg), 10, replace = F )
round( c2neg[w], 3 )
round( c1neg[w], 3 )
round( c2pos[w], 3 )
round( c1pos[w], 3 )
# Look at the whole plot of C by X
par( mfrow = c(1,1) )
plot( c(x2neg,x1neg,x1pos,x2pos), c(c2neg,c1neg,c1pos,c2pos) )
# Look at the sections of C by X
par( mfrow = c(2,2) )
plot( x2neg, c2neg )
plot( x1neg, c1neg )
plot( x1pos, c1pos )
plot( x2pos, c2pos )
Second, think carefully about the formal definition of expectation.
Third, remember numerical integration from Calc 101. You don't have to get fancy with the numerical integration. I bet you can get four decimals of accuracy in R simply using very skinny rectangles.
G4 Solve for V[C]. Hint: use your techniques and results from G3.
Tuesday, September 16
Learning objectives
- Quiz 04, understanding expectation and variance a little better and thinking about the precision of estimators vs. the precision of calculations.
- Law of large numbers and Central limit theorem.
- Asymptotic Normal (Wald) interval for a proportion.
- Bring it all back to operating characteristics.
Lecture Notes
My Quiz 04 G3-4 Solution
# First, let's do a quick and dirty solution
# to get our bearings.
# Let's just take a big sample from X, calculate C,
# and compute the mean and variance.
set.seed(7)
x <- rnorm(10^6)
c <- cos(x)
par(mfrow=c(1,2))
hist(x)
hist(c)
round( c( mean(x), var(x) ), 4 )
round( c( mean(c), var(c) ), 4 )
# Now let's see if we can get a more precise answer.
# Split up X by the half-period of the cosine.
# These are ordered to emphasize the symmetry, which will help when I need to sum them up.
# delta = the rectangle width for the numeric integration. This will let me control my level of accuracy for E[C].
delta <- 0.001
# Think about why going from -2*pi to 2*pi is sufficient for this problem.
x1pos <- seq(0, pi, delta)
x1neg <- seq(0, -pi, -delta)
x2pos <- seq(2*pi, pi, -delta)
x2neg <- seq(-2*pi, -pi, delta)
# Evaluate C over this range.
c1pos <- cos(x1pos)
c1neg <- cos(x1neg)
c2pos <- cos(x2pos)
c2neg <- cos(x2neg)
# Eyeball check that I've ordered them by the value of cos(x).
w <- sample( 1:length(c2neg), 10, replace = F )
round( c2neg[w], 3 )
round( c1neg[w], 3 )
round( c2pos[w], 3 )
round( c1pos[w], 3 )
# Look at the whole plot of C by X
par( mfrow = c(1,1) )
plot( c(x2neg,x1neg,x1pos,x2pos), c(c2neg,c1neg,c1pos,c2pos) )
# Look at the sections of C by X
par( mfrow = c(2,2) )
plot( x2neg, c2neg )
plot( x1neg, c1neg )
plot( x1pos, c1pos )
plot( x2pos, c2pos )
# The pdf of C is related to the pdf of X.
f1pos <- dnorm(x1pos)
f1neg <- dnorm(x1neg)
f2pos <- dnorm(x2pos)
f2neg <- dnorm(x2neg)
# Now look at the sections of f(C) by X
par( mfrow = c(2,2) )
ylims <- c(0,0.4)
plot( x2neg, f2neg, ylim=ylims )
plot( x1neg, f1neg, ylim=ylims )
plot( x1pos, f1pos, ylim=ylims )
plot( x2pos, f2pos, ylim=ylims )
# Here's the trick, we can sum these up
# to get the pdf for f(c). Remember the
# values of c are the same at each position
# in the arrays c2neg, ..., c2pos.
f <- f2neg + f1neg + f1pos + f2pos
# We can use any of the segments to plot the pdf.
par( mfrow = c(2,2) )
plot( c2neg, f )
plot( c1neg, f )
plot( c1pos, f )
plot( c2pos, f )
# Let's pick one set of c values for the
# next steps.
par( mfrow = c(1,1) )
plot( c2pos, f )
# The expectation is the integral of c*f(c)
# over the span of c, [-1, 1].
plot( c2pos, c2pos*f )
# We can integrate numerically.
Ec <- sum( c2pos*f )*delta
round( Ec, 4 )
# The variance is the integral
# of (c-E[c])^2*f(c)
# over the span of c, [-1, 1].
plot( c2pos, (c2pos-Ec)^2*f )
# We can integrate numerically.
Vc <- sum( (c2pos-Ec)^2*f )*delta
round( Vc, 4 )
# Finally, let's clean this up and try
# different deltas until it stabilizes
# at four decimal places.
EVc <- function(delta=0.001){
# Segment x by half-period
x1pos <- seq(0, pi, delta)
x1neg <- seq(0, -pi, -delta)
x2pos <- seq(2*pi, pi, -delta)
x2neg <- seq(-2*pi, -pi, delta)
# Evaluate C over these ranges.
c1pos <- cos(x1pos)
c1neg <- cos(x1neg)
c2pos <- cos(x2pos)
c2neg <- cos(x2neg)
# The pdf of C is related to the pdf of X.
f1pos <- dnorm(x1pos)
f1neg <- dnorm(x1neg)
f2pos <- dnorm(x2pos)
f2neg <- dnorm(x2neg)
# Sum to get the pdf.
f <- f2neg + f1neg + f1pos + f2pos
# Estimate E[C] and V[C]
Ec <- sum( c2pos*f )*delta
Vc <- sum( (c2pos-Ec)^2*f )*delta
cat( round( c(Ec,Vc), 4 ) )
}
EVc( delta = 0.1 )
EVc( delta = 0.01 )
EVc( delta = 0.001 )
EVc( delta = 0.0001 )
EVc( delta = 0.00001 )
# These next two take a while to run,
# but both yield 0.6065 0.1998
EVc( delta = 0.000001 )
EVc( delta = 0.0000001 )
Thursday, September 18
Quiz 05
Due via email by the end of the day Thursday in .pdf or .html format only. If your solution requires R code, please include the code in your document. Please cc Lucy; some of these emails are getting buried in my spam folder, and it will be nice to have a backup to make sure nothing gets lost.
Groups:
- Group A: Alex, Travis, Ying, Alice C.
- Group B: Jonathan, Sam, Lauren, Svetlana
- Group C: Christopher, Andrew, Alice T., Jie
- Group D: Derek, Jea Young, Ryan, Linda
Q1 The asymptotic Normal confidence interval for a proportion, aka the Wald interval, is theta_hat ± z_alpha/2 * sqrt( theta_hat*(1-theta_hat)/n ). See Rosner Eq 6.19. In words, fancy statistical jargon, and derivations, justify the use of each of the following quantities in the formula:
- theta_hat -- Hint: what nice property does this estimator have?
- z_alpha/2 -- Hint: what is this and what important theorem justifies its use?
- theta_hat*(1-theta_hat) -- Hint: what is this estimating?
- sqrt( theta_hat*(1-theta_hat)/n ) -- Hint: what is this estimating and how would you derive it from the properties of sums of random variables?
Q2 Suppose you rolled 29 razorbacks out of 100 rolls. By hand, calculate the asymptotic Normal 95% CI. Check your answer using binconf() in Hmisc.
Q3 By default, binconf() calculates the Wilson Score Interval. See Wikipedia. This interval and another popular interval, Agresti-Coull, are well approximated by simply adding two successes and two failures to your dataset and calculating the Wald interval using that updated data. Suppose you rolled 29 razorbacks out of 100 rolls. By hand, calculate the "add two successes and two failures" 95% CI. Compare your answer to the Wilson interval using binconf() in Hmisc.
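A sketch of the "add two successes and two failures" arithmetic for Q3 (the comparison to binconf() is still on you):

```{r}
# "Add two successes and two failures", then compute the Wald interval
# on the updated counts.
x <- 29 + 2; n <- 100 + 4
p.tilde <- x / n
se <- sqrt( p.tilde * (1 - p.tilde) / n )
round( p.tilde + c(-1, 1) * qnorm(0.975) * se, 4 )
# Compare: library(Hmisc); binconf(x = 29, n = 100, method = "wilson")
```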
Q4 Let X ~ Bin( 40, theta ). Create a plot comparing the true coverage rates of the 95% CI over thetas ranging from 0 to 1 for the asymptotic Normal (Wald) interval, the Exact interval, and the Wilson interval.
Q5 Following up on Q4, calculate the expected true coverage rate given all values of theta are equally likely. In mathspeak, let C = 1 if theta is contained in the CI and 0 otherwise. E[C | theta] is the true coverage rate for a given theta. E[ C | theta ~ Unif[0,1] ] is the expected true coverage rate given theta is uniformly distributed. Accurate to two decimal places, find E[ C | theta ~ Unif[0,1] ] for each of the three methods used in Q4.
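Here is a sketch of the Q5 calculation for one of the three methods (the Wald interval, N = 40); the rectangle width for the numerical integration over theta is my choice, and Q5 asks for all three methods:

```{r}
# Expected coverage of the Wald 95% CI for Binom(40, theta),
# averaging the true coverage curve over theta ~ Unif(0,1)
# with skinny rectangles (midpoint rule).
n <- 40
x <- 0:n
p.hat <- x / n
se <- sqrt( p.hat * (1 - p.hat) / n )
LB <- p.hat - qnorm(0.975) * se
UB <- p.hat + qnorm(0.975) * se
coverage <- function(theta) sum( (LB <= theta & theta <= UB) * dbinom(x, n, theta) )
thetas <- seq(0.0005, 0.9995, by = 0.001)  # rectangle midpoints
round( mean( sapply(thetas, coverage) ), 2 )
```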
Q6 Repeat Q5 for Bin( 400, theta ) and Bin( 4000, theta ). Explain the practical implication of the results. Include any caveats you wish.
Tuesday, September 23
Learning Objectives
- Generic descriptions of asymptotically Normal CIs, "Exact" CIs, and bootstrap CIs.
- Applying those three approaches to data coming from other distribution/estimator dyads.
- Normal, Sample_mean with variance known (see Rosner Eq 6.6, 6.7).
- Normal, Sample_mean with variance estimated (see Rosner Eq 6.6, 6.7).
- Normal, Sample_variance (see Rosner Eq 6.15).
- Poisson, Lambda_hat (see Rosner Eq 6.23).
- Evaluating the performance of those approaches.
- Practice some basics (problems from Rosner).
Practice Exercises
In class: Work in groups of any size of your choosing. No submission, focus on getting the ideas and answers. Write them up only as much as helps that process.
- Pulmonary Disease 6.5 - 6.14 (Normal, Sample_mean with variance estimated)
- Obstetrics, Serology 6.36 - 6.39 (Normal under a transformation)
- Microbiology 6.18 - 6.22 (Normal, Sample_variance)
- Environmental Health 6.33 - 6.35 (Use Poisson to answer)
Lecture Notes
Quiz 5 Solutions
Quiz_5.pdf
Thursday, September 25
Quiz 06
Due via email by the end of the day Tuesday in .pdf or .html format only. If your solution requires R code, please include the code in your document and/or as a separate file. The quiz is
open everything.
Randomly assigned groups
Group A: Travis, Ryan, Svetlana, Linda
Group B: Derek, Ying, Jea Young, Alice C.
Group C: Sam, Lauren, Andrew, Christopher
Group D: Alex, Jonathan, Alice T.
Q1 Suppose the number of wild turkeys I see in my backyard each morning is well modeled as a Poisson random variable. This morning I saw 3 turkeys. Using the trial and error approach akin to when we calculated the exact Binomial CI, calculate an exact 95% CI for the true mean of wild turkeys I should see in my backyard each morning. Report two accurate decimal places.
Q2 Find a statistical package or fancy formula (hint: the Poisson has an association with the Chi-sq distribution) to quickly calculate the exact CI for Q1. Report two accurate decimal places. What did you use and how do your answers compare?
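One ready-made option is poisson.test() in base R's stats package; the chi-squared formula behind it is shown as well. Both are standard, but double-check the output against your Q1 answer:

```{r}
# Exact 95% CI for the Poisson mean given 3 observed events.
poisson.test(3)$conf.int
# The chi-squared formula behind it (x = 3 observed events):
x <- 3
c( qchisq(0.025, 2*x) / 2, qchisq(0.975, 2*(x + 1)) / 2 )
```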
Q3 Suppose X ~ Pois(2). Calculate the true coverage rate of the "exact" 95% CI. Report two accurate decimal places.
Q4 Suppose X ~ Pois(2). Calculate the true coverage rate of the asymptotic Normal 95% CI. Report two accurate decimal places.
Q5 Suppose X ~ Pois(2). Calculate the true coverage rate of the percentile bootstrap 95% CI. Report two accurate decimal places.
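For Q5, if you follow the same pattern as the Quiz 02 bootstrap credible interval (my reading of "percentile bootstrap" here; adjust if the class settles on a different definition), the calculation sketches out as:

```{r}
# Coverage of the qpois-based percentile interval when X ~ Pois(2),
# following the Quiz 02 bootstrap pattern. The truncation at x = 50 is
# an assumption; P(X > 50) is negligible for lambda = 2.
x <- 0:50
LB <- qpois(0.025, lambda = x)
UB <- qpois(0.975, lambda = x)
lambda <- 2
Px <- dpois(x, lambda)
C <- ( LB <= lambda & lambda <= UB )
round( sum(C * Px), 2 )
```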
Q6 Suppose X ~ Pois(lambda). Explore the impact of different values of lambda on the true coverage rate of the "exact" 95% CI. Include in your presentation the smallest value of lambda where the exact interval behaved well, i.e. its true coverage rate was equal to 95% to two decimal places.
Q7 Suppose X ~ Pois(lambda). Explore the impact of different values of lambda on the true coverage rate of the asymptotic Normal 95% CI. Include in your presentation the smallest value of lambda where the Normal interval behaved well, i.e. its true coverage rate was equal to 95% to two decimal places.
Q8 Suppose X ~ Pois(lambda). Explore the impact of different values of lambda on the true coverage rate of the percentile bootstrap 95% CI. Include in your presentation the smallest value of lambda where the bootstrap interval behaved well, i.e. its true coverage rate was equal to 95% to two decimal places.
Q9 On a random sample of N=10 patients on a new medication, the sample standard deviation of their blood pressures was a shocking 15 mmHg (I'd expect it to be around 10 mmHg). Using the confidence interval based on the Chi-squared distribution, calculate a 95% CI for the true standard deviation of blood pressure. Comment on whether I have reason to be concerned.
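The Q9 mechanics, based on (N-1)*s^2/sigma^2 having a Chi-squared(N-1) distribution (see Rosner Eq 6.15); whether 15 mmHg is cause for concern is the actual question:

```{r}
# 95% CI for the true SD from a sample SD of 15 with N = 10,
# using the chi-squared pivot (N-1)*s^2 / sigma^2.
n <- 10; s <- 15
round( sqrt( (n - 1) * s^2 / qchisq(c(0.975, 0.025), df = n - 1) ), 2 )
```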
Q10 Suppose X ~ N( 120, sigma=10 ) and a random sample of N=5 is taken. Using the confidence interval based on the Chi-squared distribution, calculate the true coverage rate accurate to two decimals.
Q11 Suppose X ~ N( 120, sigma=10 ) and a random sample of N is taken. Explore the impact of different values of N on the true coverage rate of the Chi-squared based 95% CI for the standard deviation. Include in your presentation the smallest value of N where the interval behaved well, i.e. its true coverage rate was equal to 95% to two decimal places.
Q12 Suppose X ~ Uniform( 100, 140 ) and a random sample of N is taken. Explore the impact of different values of N on the true coverage rate of the Chi-squared based 95% CI for the standard deviation. Include in your presentation the smallest value of N where the interval behaved well, i.e. its true coverage rate was equal to 95% to two decimal places.
Q13 Suppose X ~ N( 120, sigma=10 ) and a random sample of N is taken. Explore the impact of different values of N on the true coverage rate of the percentile bootstrap 95% CI for the standard deviation. Include in your presentation the smallest value of N where the interval behaved well, i.e. its true coverage rate was equal to 95% to two decimal places.
Q14 Suppose X ~ Uniform( 100, 140 ) and a random sample of N is taken. Explore the impact of different values of N on the true coverage rate of the percentile bootstrap 95% CI for the standard deviation. Include in your presentation the smallest value of N where the interval behaved well, i.e. its true coverage rate was equal to 95% to two decimal places.
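As a starting point for Q2 and Q3, the Chi-squared relationship gives the exact Poisson interval in closed form, and the true coverage can be computed by summing Poisson probabilities over the counts whose interval covers the true mean. A sketch in base R (the function name `poisExactCI` is my own; the count of 3 and lambda = 2 come from the questions above):

```r
# Exact 95% CI for a Poisson mean from a single observed count x,
# via the chi-squared relationship (the Garwood interval).
poisExactCI <- function( x, conf = 0.95 ){
  alpha <- 1 - conf
  lower <- if( x == 0 ) 0 else qchisq( alpha / 2, 2 * x ) / 2
  upper <- qchisq( 1 - alpha / 2, 2 * ( x + 1 ) ) / 2
  c( lower, upper )
}

round( poisExactCI( 3 ), 2 )   # interval for this morning's 3 turkeys

# True coverage for X ~ Pois(2): sum P(X = x) over the counts x
# whose interval contains lambda (counts beyond 100 are negligible).
lambda <- 2
xs     <- 0:100
covers <- sapply( xs, function( x ){
  ci <- poisExactCI( x )
  ci[1] <= lambda && lambda <= ci[2]
} )
sum( dpois( xs[ covers ], lambda ) )
```

The same coverage loop, with the CI line swapped out, is one route into the Normal and bootstrap questions, though those call for simulation rather than exact summation.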
Tuesday, September 30
Learning Objectives
- Confidence interval methods in the specific, e.g. the percentile bootstrap CI for the sample standard deviation.
- Confidence interval methods in the abstract, e.g. the percentile bootstrap approach.
- Examining the performance of confidence interval methods in the very specific, e.g. the true coverage of the percentile bootstrap 95% CI for the sample standard deviation for X ~ N(120, sigma=10) and N = 5.
- Examining the performance of confidence interval methods in the moderately specific, e.g. the true coverage of the percentile bootstrap 95% CI for the sample standard deviation for X ~ N(120, sigma=10) and N varies between 3 and 200.
Class Outline
In class, you'll get your first formal experience with larger-group collaboration: you'll compare your Quiz 6 solutions with each other, resolve disagreements, and work together on the problems you're stuck on. You'll want your laptops for this.
We'll work through three rounds of pairwise consultations, starting with Phase 1.
- Phase 1) Groups A + D discuss/collaborate and B + C discuss/collaborate
- Phase 2) Groups A + B discuss/collaborate and C + D discuss/collaborate
- Phase 3) Groups A + C discuss/collaborate and B + D discuss/collaborate
The Quiz 06 groups were:
Group A: Travis, Ryan, Svetlana, Linda
Group B: Derek, Ying, Jea Young, Alice C.
Group C: Sam, Lauren, Andrew, Christopher
Group D: Alex, Jonathan, Alice T.
Thursday, October 02
Learning Objectives
- Overview of confidence intervals
- Introduction to one-sample hypothesis testing
Homework
Class outline
Tuesday, October 07
Learning Objectives
Class outline
- On the board lecture.
- Walkthrough of 2013 midterm.
Thursday, October 09
Midterm Exam Part 1 In-class, solo work: closed to other people, open book, open laptop/calculator, closed to wireless signals (place phones and laptops in airplane mode).
Midterm Exam Part 2 Begins right after everyone has submitted their Part 1. Take-home, solo submission, open to other people (classmates, professors, web forums, etc.), open laptop/calculator.
Tuesday, October 14
Midterm Exam Part 2 Due via email to Robert and Lucy before the start of class as a pdf, html, or Word docx only. Make sure to show your work, potentially including R code, for all problems.
Midterm Exam Part 3 Due via email to Robert and Lucy by the end of the day as a pdf, html, or Word docx only. Make sure to show your work, potentially including R code, for all problems. One submission for the entire class; both the class and lab periods are devoted to coming to a consensus on your answers.
Thursday, October 16
Fall Break. No Class.
Tuesday, October 21
Learning Objectives
- Confidence intervals for the difference between two sample means.
- One- and two-sample hypothesis testing for sample means.
- When the null hypothesis informs the variance.
Homework
- Read Rosner Chapter 8 and practice problems from chapters 7 and 8.
Class outline / Lecture notes
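As a concrete entry point for the objectives above, base R's `t.test()` returns both the confidence interval for the difference in two sample means and the test in one call. A sketch with simulated data (the group sizes, means, and SDs are made up for illustration):

```r
# Two-sample comparison with simulated blood-pressure-like data.
set.seed( 311 )
groupA <- rnorm( 20, mean = 120, sd = 10 )
groupB <- rnorm( 20, mean = 128, sd = 10 )

# Pooled (equal-variance) t-test; var.equal = FALSE gives Welch's test.
out <- t.test( groupA, groupB, var.equal = TRUE )
out$conf.int   # 95% CI for the difference in means
out$p.value    # two-sided p-value
```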
Thursday, October 23
Learning Objectives
- Continuum of operating characteristics: Type I error vs. Type II error (Power) vs. Robustness vs. Persuasiveness.
- One-sample exact test for a proportion, Fisher's Exact Test.
- Thinking through picking a method to use, case study of equal variance vs. unequal variance two-sample t-test.
Homework
- Read Rosner Ch 8 and practice problems for 7 and 8. Make sure to try each method once, e.g. do at least one problem that uses Fisher's Exact Test, do at least one that uses the asymptotic Normal test for a proportion, etc. Focus repetition on the methods you find hard, not the ones you've mastered.
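Both exact methods named above are one-liners in base R; a sketch with made-up counts:

```r
# One-sample exact test for a proportion: 12 successes in 40 trials
# against H0: p = 0.5 (exact Binomial, not the asymptotic Normal test).
binom.test( 12, 40, p = 0.5 )

# Fisher's Exact Test on a 2x2 table of made-up counts.
tab <- matrix( c( 8, 2, 3, 9 ), nrow = 2 )
fisher.test( tab )
```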
Quiz 07
Due via email before class on Thursday in .pdf or .html format only. If your solution requires R code, please include the code in your document and/or as a separate file. The quiz is open everything, including each other, me, Lucy, and other instructors.
Randomly assigned groups
Group A: Andrew, Jea Young, Christopher, Alice C.
Group B: Sam, Travis, Jonathan, Lauren
Group C: Ying, Alex, Alice T., Svetlana
Group D: Derek, Ryan, Linda
Q1) When choosing between the equal and unequal variance two-sample t-test, Rosner suggests performing an F test on the sample variances and using the unequal variance test only if the F test is significant (presumably at a 5% level). Other introductory texts will commonly give the rule of thumb to use the unequal variance test only if the ratio of the sample variances is outside of (1/2, 2), i.e. the larger is more than twice the size of the smaller. Which rule do you recommend, or would you recommend a different rule? This is intentionally a very open-ended question designed to let you think through what would lead you to prefer one rule over another and explore all the different factors that could impact your answer.
Q2) Prepare a brief presentation (< 5 minutes) to share and support your conclusions. We'll share these on Thursday.
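One way into Q1 is to simulate how often each rule routes you to the unequal-variance test when the true variances are in fact equal. A sketch (the sample size, simulation count, and seed are my own choices):

```r
# Under equal true variances, how often does each rule pick the
# unequal-variance (Welch) test?
set.seed( 311 )
nSim <- 2000
n    <- 15
pickF <- pickRatio <- logical( nSim )
for( i in 1:nSim ){
  x <- rnorm( n )
  y <- rnorm( n )
  pickF[ i ]     <- var.test( x, y )$p.value < 0.05   # Rosner's F-test rule
  ratio          <- var( x ) / var( y )
  pickRatio[ i ] <- ratio < 0.5 || ratio > 2          # rule of thumb
}
mean( pickF )       # near 0.05 by construction of the F test
mean( pickRatio )
```

Repeating this with genuinely unequal variances, non-Normal data, or other sample sizes is where the question gets interesting.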
Tuesday, October 28
Learning Objectives
- Continuum of operating characteristics
- Power
- Nonparametric methods
- Sign test
- Wilcoxon rank sum test (Mann-Whitney)
- Wilcoxon signed rank test
Class Outline
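All three tests above are available in base R (the sign test as a Binomial test on the signs); a sketch with simulated data:

```r
set.seed( 311 )
x <- rnorm( 15, mean = 1 )   # one sample (or paired differences)
y <- rnorm( 15, mean = 0 )   # second, independent sample

# Sign test for H0: median of x is 0 -- a Binomial test on the signs.
binom.test( sum( x > 0 ), length( x ) )

# Wilcoxon rank sum (Mann-Whitney) test for two independent samples.
wilcox.test( x, y )

# Wilcoxon signed rank test for one sample or paired differences.
wilcox.test( x )
```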
Thursday, October 30
Learning Objectives
- Continuum of operating characteristics
- Two-sample t-test (equal vs. unequal variance)
- F test for the equality of two sample variances
- Application of statistical methods
Class Outline
- Quiz 07 presentations
- Consulting role playing; project introduction
Homework Assignments (HW)
- Read Rosner Chapter 9 with focus on:
- Sign test
- Wilcoxon rank sum test (Mann-Whitney)
- Wilcoxon signed rank test
Tuesday, November 04
Learning Objectives
Class Outline
Thursday, November 06
Learning Objectives
Class Outline
Tuesday, November 11
Learning Objectives
- Chapters 7-10 review
- Delta Method
- Common Odds Ratios
Class Outline
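A standard Delta Method result worth having in hand: the approximate standard error of the log odds ratio from a 2x2 table is the square root of the sum of the reciprocal cell counts. A sketch with made-up counts:

```r
# Delta-method 95% CI for an odds ratio from a 2x2 table.
n11 <- 20; n12 <- 10; n21 <- 8; n22 <- 22
logOR   <- log( ( n11 * n22 ) / ( n12 * n21 ) )
seLogOR <- sqrt( 1/n11 + 1/n12 + 1/n21 + 1/n22 )
exp( logOR + c( -1, 1 ) * qnorm( 0.975 ) * seLogOR )
```

For a common odds ratio across strata, `mantelhaen.test()` in base R implements the Mantel-Haenszel test.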
Thursday, November 13
Learning Objectives
- Paired two-sample data, continuous and dichotomous outcomes
- Omnibus testing, Chi-squared H0 revisited, comment on ANOVA
- Multiple comparisons, Bonferroni correction, alternative adjustments
Class Outline
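Two of the objectives above map directly onto base R functions; a sketch with made-up numbers:

```r
# Bonferroni and Holm adjustments on a vector of raw p-values.
p <- c( 0.001, 0.012, 0.030, 0.200 )
p.adjust( p, method = "bonferroni" )
p.adjust( p, method = "holm" )

# Paired dichotomous outcomes: McNemar's test, driven by the
# discordant pairs (the off-diagonal counts).
tab <- matrix( c( 30, 4, 12, 14 ), nrow = 2 )
mcnemar.test( tab )
```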
Tuesday, November 18
Learning Objectives
Class Outline
Thursday, November 20
Learning Objectives
Tuesday, November 25 and Thursday, November 27
Thanksgiving Break. No Class or lab.
Tuesday, December 02
Learning Objectives
Class Outline
Thursday, December 04
Final Exam Part 1
Last regular semester class.