# from ChatGPT
set.seed(123) # For reproducibility
# Parameters
<- 0 # Null hypothesis mean
mu_0 <- 1 # Standard deviation of the population
sigma <- 10000 # Number of simulations
n_sims <- 0.05 # Significance level
alpha
# Function to perform the sequential testing
<- function() {
sequential_test <- 10
n1 <- 20
n2 <- 30
n3 # First sample
<- rnorm(n1, mean = mu_0, sd = sigma)
sample1 <- t.test(sample1, mu = mu_0)
t1 if (t1$p.value < alpha) {
return(1) # Reject H0 ## Exit this function.
}
# Second sample
<- rnorm(n1, mean = mu_0, sd = sigma)
sample2 <- c(sample1, sample2)
combined_sample2 <- t.test(combined_sample2, mu = mu_0)
t2 if (t2$p.value < alpha) {
return(1) # Reject H0
}
# Third sample
<- rnorm(n1, mean = mu_0, sd = sigma)
sample3 <- c(combined_sample2, sample3)
combined_sample3 <- t.test(combined_sample3, mu = mu_0)
t3 if (t3$p.value < alpha) {
return(1) # Reject H0
}
return(0) # Fail to reject H0
}
# Run simulations
<- replicate(n_sims, sequential_test())
results
# Calculate empirical type I error rate
<- mean(results)
type_I_error_rate type_I_error_rate
Two-stage designs
Simon’s designs and related topics
1 Repeat until significance
Suppose we would like to test
\(H_0: \mu = 0\)
\(H_1: \mu > 0\)
with a type I error rate of \(5\%\).
A sample of size 10 is taken, and if \(H_0\) is rejected, the experiment is over. If not, another sample of size 10 is taken, and \(H_0\) is tested using all 20 data. If \(H_0\) is not rejected, another sample of size 10 is taken, and \(H_0\) is tested using all 30 data. If each test is conducted with \(\alpha=0.05\), what is the actual type I error rate?
The analytical solution is relatively simple if there is only two stages. For three or more stages, a simulation will be much easier approach.
from ChatGPT
with full options
<- function(B=10, n1=n,n2=n,n3=n, seed=NA){
rep_sig # B <- 10 ; n <- 10 ; n1 <- n ; n2 <- n ; n3 <- n ; seed <- 620
<- ifelse(is.na(seed), sample(1:10^6,1), seed)
seed set.seed(seed)
<- matrix(rnorm(B*(n1+n2+n3)), ncol=B)
DATA
<- function(v, n=c(n1,n2,n3)){
ttes <- t.test(v[1:n1], alt='greater')$p.val
p1 <- t.test(v[1:(n1+n2)], alt='greater')$p.val
p2 <- t.test(v[1:(n1+n2+n3)], alt='greater')$p.val
p3 c(p1,p2,p3)
}
<- apply(DATA, 2, FUN=ttes, n=c(n1,n2,n3))
pval <- apply(pval, 2, min)
ov_pval list(seed, final_p=ov_pval, DATA, stage_p=pval)
}
# n <- 10
# sim1 <- rep_sig(B=10000, n1=n, n2=n, n3=n, seed=NA)
# sim0 <- rep_sig(B=10000, n1=n*3, n2=0, n3=0, seed=NA)
Speed comparison
# system.time( rep_sig(B=10000, n1=n, n2=n, n3=n, seed=NA) )
# system.time( results <- replicate(n_sims, sequential_test()) )
Simple adjustment
Will simple \(\alpha\) adjustment over- or under-correct?
- Bonferroni-correction: \(\alpha_3 = \alpha/3 = 0.0167\)
- \(1-(1-\alpha_3)^3 = 0.05\). \(\Rightarrow\) \(\alpha_3 = 0.0170\)
<- rep_sig(B=10000, n1=10, n2=10, n3=10, seed=NA)
sim1 table( sim1$final_p < 0.05 )
table( sim1$final_p < 0.05/3 )
plot(sim1$stage_p[1,], sim1$stage_p[2,], cex=0.8, pch=19,
col='#66666666', xlab='Stage 1 p-value', ylab='stage2 p-value',
xlim=c(0,1), ylim=c(0,1))
<- data.frame(ff(cor(t(sim1$stage_p)),3))
cdf names(cdf) <- row.names(cdf) <- c('Stage_1','Stage_2','Stage_3')
if(F){
%>%
cdf kable("html", booktabs=TRUE, align='c') %>%
# column_spec(3, color='darkgreen', bold=TRUE) %>%
# column_spec(4, color='orange', bold=TRUE) %>%
kable_styling(bootstrap_options = c("striped", "hover"), position='left') %>%
kable_classic_2(full_width = FALSE)
}
2 Stop when significant
Suppose we would like to test
\(H_0 : p = 0.1\)
\(H_1 : p > 0.1\)
by taking a random sample of size 40 from a Bernoulli(p) distribution. Under \(H_0\), the number of successes out of 40 (\(X\)) follows a Binomial(40, 0.1) distribution. We compute \(P_0[X \geq 8] = 0.042\), so the test that rejects \(H_0\) if \(X \geq 8\) has a type I error rate of 0.042.
The data were observed sequentially, and the 8th success happened at \(N = 32\), and we decided to reject \(H_0\) and terminate the study. The conclusion was that \(p\) was significantly greater than 0.1, and an estimate of \(p\) is \(\tilde{p} = 8/32 = 0.25\).
問い:
- Is type I error rate controlled?
- Is \(\tilde{p}\) unbiased?
To answer these questions, we must assume that these actions were pre-specified.
- A sample of size at most 40 will be taken.
- If X = 8 at any point, then \(H_0\) is rejected, and \(\tilde{p}\) will be calculated as \(\tilde{p} = x/n\) at that point.
答:
- Obviously yes.
- No.
\(\tilde{p}\) is an estimator such that
\[ \begin{align*} \tilde{p} = \begin{cases} X/40 & \text{if $X < 8$,} \\ 8 /Y & \text{if $Y \leq 40$,} \end{cases} \end{align*} \] where \(Y\) is the number of trials when 8th success happened. (\(Y\) has a \(\phantom{ 負の二項}\)分布.)
Before examining \(E_p[\tilde{p}]\), let’s consider \(\hat{p} = X/40\). We know that this (mle of \(p\)) is unbiased.
\[ \begin{align*} E_p[\hat{p}] &= \sum_{x=0}^{40} \frac{x}{40} P_p[X=x] \\ &= \sum_{x=0}^{7} \frac{x}{40} P_p[X=x] + \sum_{x=8}^{40} \frac{x}{40} P_p[X=x] \\ & = p \end{align*} \]
Now we write:
\[ \begin{align*} E_p[\tilde{p}] &= \sum_{x=0}^{7} \frac{x}{40} P_p[X=x] + \sum_{y=8}^{40} \frac{8}{y} P_p[Y=y]. \end{align*} \]
So, we need to compare the second term to calculate the bias. (Homework…)
Let’s use R to compute the bias.
<- function(p, N, x){
bias_sws <- 0:(x-1)
xv <- sum( xv/N * dbinom(xv, N, p) )
one <- x:N
yv <- sum( x/yv * dnbinom(yv-x, x, p) )
two ## dnbinom(x,k,p) ; x is number of failures until k-th success.
## dnbinom(8,4, 0.2) = choose(8+3,3) * 0.2^3 * 0.8^8 * 0.2
+ two
one
}
<- seq(0.05, 0.95, by=0.05)
trueP <- sapply(as.list(trueP), bias_sws, N=40, x=8) p.tilde
Under \(H_0\) (\(p=0.1\)), bias is negligible. But that is not interesting. When the true \(p\) is higher, the bias may not be negligible. That is concerning.
Can we intuitively understand why the bias is positive. (\(E_p[\tilde{p}] > p\).)
Suppose that the true \(p = 0.4\). Say, we are at n=17:
x | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | >8 |
---|---|---|---|---|---|---|---|---|---|---|
P[X=x] | 0.000 | 0.002 | 0.010 | 0.034 | 0.080 | 0.138 | 0.184 | 0.193 | 0.161 | 0.199 |
\(\hat{p}\) | 0.000 | 0.059 | 0.118 | 0.176 | 0.235 | 0.294 | 0.353 | 0.412 | 0.471 | higher than 0.471 |
At an interim n, if \(\hat{p}\) is low, the trial tends to continue, and the downward bias tends to get corrected; however, when \(\hat{p}\) is high, the trial tends to have stopped, and the upward bias remains uncorrected. Thus \(\tilde{p}\) tends to be higher than the true \(p\).
3 Simon’s design
In phase II studies, it is crucial to have mechanisms to stop early due to excessive toxicity, typically overseen by a Data Monitoring Committee. Additionally, it is desirable to discard ineffective treatments early, for which two-stage designs with a futility stop are commonly used in oncology.
Notation:
- Stage I sample size: \(n_1\).
- Stage I data: \(X_1 \sim Binomial(n_1, p)\).
- Stage I critical value: \(r_1\) so that if \(X_1 \leq r_1\) then terminate the study for futility.
- Stage II sample size: \(n_2\).
- Stage II data: \(X_2 \sim Binomial(N_2, p)\).
- Total sample size: \(N_t = n_1 + N_2\).
- Total data \(X_t = X_1 + X_2\).
- Stage II critical value \(r_t\) so that if \(X_t \leq r_t\) then terminate the study for futility, otherwise conclude efficacy.
Also we define \(R_1 = r_1 + 1\) and \(R_t = r_t + 1\).
- If \(X_1 \leq r_1\), then stop for futility.
- If \(X_1 \geq R_1\), then continue.
In his 1989 paper, Simon introduced two criteria for selecting a two-stage design for single-arm studies with a one sided test.
- The optimal design has the smallest expected sample size under \(H_0\), \(n_1 + E_0[N_2]\).
- The minimax design has the smallest total sample size, \(N_t\).
3.1 Bad Simon’s designs
For \(p_0 = 0.17\), \(p_1 = 0.32\), \(\alpha=0.1\) and \(\beta=0.1\), we have:
::ph2simon(pu=0.17, pa=0.32, ep1=0.10, ep2=0.10) clinfun
Simon 2-stage Phase II design
Unacceptable response rate: 0.17
Desirable response rate: 0.32
Error rates: alpha = 0.1 ; beta = 0.1
r1 n1 r n EN(p0) PET(p0) qLo qHi
Minimax 9 45 12 53 46.80 0.7748 0.525 1.000
Admissible 5 30 13 58 41.28 0.5972 0.032 0.525
Optimal 4 25 14 63 41.12 0.5759 0.000 0.032
For this parameter setting, the smallest \(N_t\) is 53, and the smallest \(n_1 + E_0[N_2]\) is 41.12.
Let’s not worry about “Admissible” design for now. Why might you not want to use this Minimax design? Or this Optimal design?
The single-stage design for the same parameter setting is:
ph2single(pu=0.17, pa=0.32, ep1=0.10, ep2=0.10)
n r Type I error Type II error
1 57 13 0.09349 0.08644
2 60 14 0.07453 0.09428
3 61 14 0.08394 0.08134
4 62 14 0.09407 0.06992
5 64 15 0.06706 0.08854
## Single-stage Binomial test
## Sample size and power
<- function(n.min, n.max, p0, p1, alp){
ssbt # n.min <- 10 ; n.max <- 40 ; p0 <- 0.17 ; p1 <- 0.32 ; alp=0.10
<- seq(n.min,n.max)
nv <- qbinom(1-alp,nv,p0)
cv <- 1-pbinom(cv,nv,p0)
type_I <- 1-pbinom(cv,nv,p1)
powerx <- c(p0=p0, p1=p1, alp=alp)
para list(para,nv=nv,cv=cv,type_I=type_I,power=powerx)
}
<- ssbt(n.min=50, n.max=70, p0=0.17, p1=0.32, alp=0.10)
ex1
plot(c(50,70), c(0,1), type='n', xlab='N', ylab='Power',
panel.first=grid(nx=NULL, ny=NULL))
abline(h=0.10, col='royalblue')
abline(h=0.90, col='royalblue')
title(main='Power and Type I Error Rate', adj=0, font.main=3)
lines(ex1$nv, ex1$type_I, col='darkgreen', type='b')
<- c(1,19)[1+(ex1$power > 0.90)]
pc lines(ex1$nv, ex1$power, col='darkgreen', type='b', pch=pc)
Some bad Simon’s desings
library(clinfun)
ph2simon(pu=0.23, pa=0.42, ep1=0.05, ep2=0.1)
Simon 2-stage Phase II design
Unacceptable response rate: 0.23
Desirable response rate: 0.42
Error rates: alpha = 0.05 ; beta = 0.1
r1 n1 r n EN(p0) PET(p0) qLo qHi
Minimax 14 47 16 50 47.31 0.8968 0.739 1.000
Admissible 7 28 17 54 35.96 0.6938 0.135 0.739
Admissible 7 27 18 58 35.34 0.7311 0.006 0.135
Optimal 5 21 19 62 35.31 0.6510 0.000 0.006
ph2simon(pu=0.29, pa=0.47, ep1=0.05, ep2=0.1)
Simon 2-stage Phase II design
Unacceptable response rate: 0.29
Desirable response rate: 0.47
Error rates: alpha = 0.05 ; beta = 0.1
r1 n1 r n EN(p0) PET(p0) qLo qHi
Minimax 22 59 23 61 59.13 0.9362 0.840 1.000
Admissible 8 28 24 64 43.34 0.5740 0.139 0.840
Optimal 10 32 25 67 42.85 0.6899 0.000 0.139
ph2simon(pu=0.17, pa=0.34, ep1=0.05, ep2=0.2)
Simon 2-stage Phase II design
Unacceptable response rate: 0.17
Desirable response rate: 0.34
Error rates: alpha = 0.05 ; beta = 0.2
r1 n1 r n EN(p0) PET(p0) qLo qHi
Minimax 9 36 10 39 36.22 0.9272 0.693 1.000
Admissible 3 18 11 43 27.17 0.6331 0.583 0.693
Optimal 3 17 11 44 25.78 0.6749 0.000 0.583
3.2 Admissible designs
The two criteria, “minimizing \(N\)” and “minimizing \(n_1 + E_0[N_2]\), give two extreme designs; i.e., among all the designs that satisfy \(\alpha\) and \(\beta\) conditions, minimizing \(N\) maximizes \(n_1 + E_0[N_2]\), and vice versa.
To consider the two criteria simultaneously, let’s define \(\omega(q) = qN + (1-q)(n_1 + E_0[N_2])\), and find a design that minimizes \(\omega(q)\) given \(q\). Clearly, when \(q=1\), the resulting design is the minimax design since \(\omega(1) = N\). And when \(q=0\), the resulting design is the optimal design since \(\omega(0) = n_1 + E_0[N_2]\).
“Admissible” in this context implies that these designs are acceptable although they are not the minimax or optimal ones.
In the output of ph2simon
, for each design the range of \(q\) values are given that each design minimizes \(\omega(q)\) function. In the final example above, for \(q\) in \((0, 0.583)\) the optimal design minimizes the value of \(\omega(q) = qN + (1-q)(n_1 + E_0[N_2])\). Unfortunately, for this parameter setting, no good design exists.
Professor Ana Ivanova’s webpage also solves for Simon’s designs.
4 Inference from Simon’s design
This section is roughly based on Koyama T and Chen H. Proper inference from Simon’s two-stage designs. Stat Med, 27(16):3145-3154, 2008. PMID: 17960777. PMCID: PMC6047527.
Problems
The p-values that are computed as if the data are observed in a single design (\(p_c\)) may contradict the hypothesis testing. That is, \(H_0\) is rejected, yet \(p_c > \alpha\).
Because a Simon’s design is given in terms of sample sizes and critical values, when the actual sample size deviates from the planned one, conducting hypothesis testing is not straightforward.
The maximum likelihood estimate \(\hat{\pi} = X_t / N_t\) of \(\pi\) is biased.
4.1 Example
For testing \(H_0: \pi = 0.1\), \(H_1: \pi > 0.1\), let’s set \(\alpha=0.05\) and \(\beta=0.2\) at \(\pi=0.3\). (\(\pi_0=0.1\), and \(\pi_1 = 0.3\)). Here are the Simon’s designs:
<- 0.10
pi0 <- 0.30
pi1 <- 0.05
alp <- 0.20
bet <- 1-bet
pow
<- ph2simon(pu=pi0, pa=pi1, ep1=alp, ep2=bet) ) ( ex_sim
Simon 2-stage Phase II design
Unacceptable response rate: 0.1
Desirable response rate: 0.3
Error rates: alpha = 0.05 ; beta = 0.2
r1 n1 r n EN(p0) PET(p0) qLo qHi
Minimax 1 15 5 25 19.51 0.5490 0.732 1.000
Admissible 1 12 5 26 16.77 0.6590 0.482 0.732
Admissible 1 11 5 27 15.84 0.6974 0.293 0.482
Optimal 1 10 5 29 15.01 0.7361 0.000 0.293
Let’s use the Optimal design. The stage 1 sample size is \(10\) and the stage 1 critical value is \(1\), i.e., the trial stops if \(X_1 \leq 1\). If continued, additional sample of size \(19\) is taken, and \(H_0\) will be rejected if \(X_t > 5\). Suppose we observe \(X_1 = 2\) and \(X_t = 6\). This sample path leads to rejection of \(H_0\) in Stage 2.
What is the conventional p-value if we compute it as if data were observed in a single stage.
\[ p_c = P[ X \geq 6 ], \] where \(X \sim Binom(29, 0.1)\).
1-pbinom(xt-1, nt, pi0)
[1] 0.06372
<- sum(dbinom(xt:nt, nt, pi0)) pc
P-value is 0.064; \(H_0\) is rejected even though this p-value is greater than \(\alpha\).
4.2 Proper P-value
The conventional p-value can be expressed as below.
\[ p_c = P_0[ X \geq x_t ] = \sum_{x_1=0}^{n_1} P_0[X_2 \geq x_t - x_1] P_0[X_1 = x_1], \] where
- \(X \sim Binom(29, 0.1)\).
- \(X_1 \sim Binom(10, 0.1)\).
- \(X_2 \sim Binom(19, 0.1)\).
For the current Simon’s design, this summation includes impossible sample paths because when \(X_1 \leq 1\), there is no stage 2.
Therefore, the proper p-value for a Simon’s design is
\[ p_p = \sum_{x_1=R_1}^{n_1} P_0[X_2 \geq x_t - x_1] P_0[X_1 = x_1]. \]
Clearly, \(p_p \leq p_c\), and this can be computed using twostage.inference()
function in the clinfun
package.
twostage.inference(x=6, r1=1, n1=10, n=29, pu=0.1)
pumvue p.value 5% LCL 95% UCL
0.26131 0.04709 0.10160 0.40070
We define the P-value as the probability of obtaining the result that is at least as extreme as the observed one under the null hypothesis. And we are assuming that the larger value of \(X_t\), regardless of \(X_1\) and \(X_2\), is more extreme. For example, two sample paths, (\(X_1=2\), \(X_2=6\)) and (\(X_1=3\), \(X_2=5\)), lead to the same p-value.
4.3 Confidence interval
Because Simon’s design is used for a one-sided hypothesis test, an accompanying confidence interval should be one-sided of the form, \((\pi_L, 1)\). The method described below can be used to form a more common two-sided confidence interval of the form, \((\pi_L, \pi_U)\). We will use a \(90\%\) two-sided confidence interval, so that two statements, “\(H_0\) is rejected.” and “\(pi_0\) is outside of \((\pi_L, \pi_U)\)” are equivalent.
We can compute the proper p-value (\(p_p\)) for testing \(H_0: \pi = \pi'_0\) for any \(\pi'_0\). A \(90\%\) confidence interval is a collection of \(\pi'_0\) such that the corresponding p-value is within \([0.05, 0.95]\) (for one-sided \(\alpha=0.05\)).
Continuing with the example, we have \(X_1 = 2\) and \(X_2 = 4\) for the design with \(n_1=10\), \(r_1=1\), \(n_2 = 19\). Using twostage.inference
function with various values of pu
we can find \(\pi'_0\) that gives a p-value of \(0.05\) and \(0.95\).
twostage.inference(x=6, r1=1, n1=10, n=29, pu=0.101505)
pumvue p.value 5% LCL 95% UCL
0.2613 0.0500 0.1016 0.4007
twostage.inference(x=6, r1=1, n1=10, n=29, pu=0.4007)
pumvue p.value 5% LCL 95% UCL
0.2613 0.9500 0.1016 0.4007
4.4 Median estimator
Similarly, the value of \(p^*_0\) such that the p-value for testing \(H_0: p=p^*_0\) is \(0.5\) with the observed sample path may be used as an estimate for \(\pi\). If the test statistic is continuous, this estimator is known as the median unbiased estimator (Cox and Hinkley 1974). It is unbiased for the median. The proof uses the fact that the p-value is distributed \(Unif(0,1)\) under \(H_0\).
- Cox, D.R., Hinkley, D.V. (1974). Theoretical Statistics. Chapman & Hall, London.
twostage.inference(x=6, r1=1, n1=10, n=29, pu=0.2147)
pumvue p.value 5% LCL 95% UCL
0.2613 0.5001 0.1016 0.4007
4.5 R program for inference
I have written an R code to compute p-values, confidence interval, and an estimate.
https://biostat.app.vumc.org/wiki/pub/Main/TatsukiRcode/twostage2020Web.R
source('https://biostat.app.vumc.org/wiki/pub/Main/TatsukiRcode/twostage2020Web.R')
TwoStageInf(p0=0.1, p1=0.3, n1=10, R1=2, nT=29, RT=6, x1=2, x2=4)
$parameter
p0 p1 n1 R1 nT RT
1 0.1 0.3 10 2 29 6
$character
POWE PET EN
NULL 0.04709 0.7361 15.01
ALT 0.80506 0.1493 26.16
$data
x1 n2 m2 x2
1 2 19 19 4
$p.value
[1] 0.04709
$conf
[1] 0.1015 0.4008
$med.esti
[1] 0.2147
4.6 Aside: Programming notes
Confidence interval for a binomial proportion
Let’s say we have \(X=7\) where \(X \sim Binom(20, \pi)\). We can estimate \(\pi\) using binconf
function in Hmisc
library.
library(Hmisc)
<- 7
x <- 20
n binconf(x=x, n=n, method='all')
PointEst Lower Upper
Exact 0.35 0.1539 0.5922
Wilson 0.35 0.1812 0.5671
Asymptotic 0.35 0.1410 0.5590
Asymptotic
method.
<- x / n
phat <- qnorm(0.975)
z + c(-1,1) * z * sqrt(phat * (1 - phat)/n) phat
[1] 0.141 0.559
Wilson
score method. \[
\frac{1}{1 + z^2/n} \times (\hat{p} + z^2/2n \pm z \sqrt{\hat{p}(1-\hat{p})/n + z^2/4n^2}
\]
1/(1 + z^2/n)) * (phat + z^2/(2 * n) + c(-1, 1) * z * sqrt(phat * (1 - phat)/n + z^2/(4 * n^2))) (
[1] 0.1812 0.5671
Finally, the Exact
method (Clopper-Pearson) confidence interval is \((p_L, p_U)\), where \(p_L\) and \(p_U\) satisfy the following:
- \(P[ X \geq x | p=p_L] = \alpha/2\)
- \(P[X \leq x|p=p_U] = \alpha/2\)
This confidence interval is the collection of \(p^*\) such that \(H^*_0: p=p^*\) would not be rejected by the observed sample. This requires computation of \(p\)-values for various values of \(p^*\).
<- function(p.star, x, n){
f1 1-pbinom(x-1, n, p.star)
## or sum(dbinom(x:n, n, p.star))
}
<- function(p.star, x, n){
f2 pbinom(x, n, p.star)
## or sum(dbinom(0:x, n, p.star)
}
And we need to find p.star
for f1
and f2
that the resulting p-values are 0.05 for f1
and 0.95 for f2
.
f1(0.1539, x=7, n=20)
[1] 0.02499
f2(0.5922, x=7, n=20)
[1] 0.02499
Rather than using a trial-and-error method, we can use uniroot
to find the root.
<- function(p.star, x, n, alp){
g1 - f1(p.star, x, n)
alp
}uniroot(g1, x=7, n=20, alp=0.025, lower=0, upper=1)$root
[1] 0.1539
<- function(p.star, x, n, alp){
g2 - f2(p.star, x, n)
alp
}uniroot(g2, x=7, n=20, alp=0.025, lower=0, upper=1)$root
[1] 0.5922
To solve for \(p^*\) in a more straightforward way without iterations, we can use the following relationship between a Binomial random variable and a Beta random variable.
If \(X \sim Binomial(n, p)\) and \(Y \sim Beta(k, n-k+1)\), then \[ P[X \geq k] = P[Y \leq p]. \]
For the lower bound, instead of solving for \(p_L\) in \(P[X \geq 7 | p = p_L] = 0.025\) iteratively, we can solve \(P[Y \leq p_L] = 0.025\), where \(Y \sim Beta(7, 20-7+1)\).
qbeta(0.025, 7, 20-7+1)
[1] 0.1539
For the upper bound, we needed to solve \(P[X \leq 7 | p = p_U] = 0.025\), which is \(P[X \geq 8|p = p_U] = 0.975\). Instead, we can solve \(P[Y \leq p_U] = 0.975\), where \(Y \sim Beta(8,20-8+1)\).
qbeta(0.975, 8, 20-8+1)
[1] 0.5922
5 Extending or shortening the study
As an example, consider the design \((n_1=19, R_1=7, n_t=39, R_t=17)\). This is the minimax design for \(p_0=0.3\), \(p_1=0.5\) with \(\alpha=0.05\) and \(\beta=0.2\).
ph2simon(pu=0.3, pa=0.5, ep1=0.05, ep2=0.20)
Simon 2-stage Phase II design
Unacceptable response rate: 0.3
Desirable response rate: 0.5
Error rates: alpha = 0.05 ; beta = 0.2
r1 n1 r n EN(p0) PET(p0) qLo qHi
Minimax 6 19 16 39 25.69 0.6655 0.252 1.000
Admissible 6 18 17 42 24.68 0.7217 0.208 0.252
Optimal 5 15 18 46 23.63 0.7216 0.000 0.208
We assume that the sample size for stage 1 is always as planned. Let’s say we observe in stage 1 \(X_1=7\). In stage 2, the planned sample size is \(n_2=20\), but the actual sample size is \(n'_2=23\) and we observe \(X'_2=10\). Should \(H_0\) be rejected?
5.1 Conditional p-value
Conditional p-value is the p-value in the second stage conditioned on the first stage data. That is \(P_0[X'_2 \geq x'_2 | X_1=x_1]\), where \(X'_2 \sim Binom(n'_2, \pi_0)\). It does not depend on \(X_1\) other than that \(X'_2\) is not defined for \(X_1 < R_1\).
For \(n'_2= 23\) and \(X'_2 = 10\), we have \(P_0[ X'_2 \geq 10 ]\), where \(X'_2 \sim Binom(23, 0.3)\).
1-pbinom(9, 23, 0.3)
[1] 0.1201
This is compared to the conditional type I error rate computed with the original \(n_2\) and \(R_2\).
The proper conditional type I error rate at \(X_1 = 7\) is
\[ P_0[X_t \geq R_t | X_1=x_1 ] = P_0[ X_t \geq 17 | X_1 = 7] = P_0[X_2 \geq 10] \]
1-pbinom(9, 20, 0.3)
[1] 0.04796
\(H_0\) is not rejected because the conditional p-value is greater than the conditional type I error rate.
5.2 Unconditional p-value
The unconditional p-value without sample size change is \[ p_p = \sum_{x_1=R_1}^{n_1} P_0[X_2 \geq x_t - x_1] P_0[X_1 = x_1]. \]
When \(n_2\) is changed, we want to similarly compute
\[ p_p = \sum_{x_1=R_1}^{n_1} P_0[X'_2 \geq x'_t - x_1] P_0[X_1 = x_1]. \]
But this does not work…
When, however, the stage 2 sample size is changed, we cannot extend the conditional P-value based on \(X'_t\). It is demonstrated in Section 4.1 that the same value of \(X'_t\) may or may not lead to the rejection of H0 depending on the sample paths. … extending the conditional P-value based on \(X'_t\) would not make sense.
The conditional p-value, \(P_0[X'_2 \geq x'_t - x_1 ]\) is easily computed for the observed value of \(X_1=x_1\), but to compute the unconditional we need it for all values of \(X_1\) in the continuation region.
To do this, let’s examine the following plot, which shows the conditional type I error rate and the conditional power at the alternative. The two lines shown below are
- \(P_{\pi=\pi_0} [X_2 \geq R_t - x_1 | X_1=x_1]\)
- \(P_{\pi=\pi_1} [X_2 \geq R_t - x_1 | X_1=x_1]\)
Blue line = The original conditional type I error rate \[ P_0[ X_2 \geq R_t - x_1 | X_1 = x_1 ] = P_0[ X_2 \geq 17 - x_1 | X_1 = x_1 ], \] where \(X_2 \sim Binom(20, 0.3)\) under \(H_0\).
<- 0:19
x1v <- 20 # Planned
n2 <- 0.3 # Null
p0 <- 1-pbinom(17-x1v-1, n2, p0)
cp0 <- replace(cp0, x1v < 7, 0)
cp0
data.frame(x1v, cp0)
x1v cp0
1 0 0.00000
2 1 0.00000
3 2 0.00000
4 3 0.00000
5 4 0.00000
6 5 0.00000
7 6 0.00000
8 7 0.04796
9 8 0.11333
10 9 0.22773
11 10 0.39199
12 11 0.58363
13 12 0.76249
14 13 0.89291
15 14 0.96452
16 15 0.99236
17 16 0.99920
18 17 1.00000
19 18 1.00000
20 19 1.00000
The actual sample size for stage 2 is 23, and \(X'_2 = 10\). \(X'_2 \sim Binom(23,0.3)\) under \(H_0\).
The observed conditional p-value is 0.1201. (The black dot)
We proposed to extend the conditional p-value to other values of \(X_1\) as follows:
- Compute the conditional p-value. \(P_{\pi=\pi_0} [ X'_2 \geq X'_t - x_1]\).
- Find \(\pi^*\) such that \(P_{\pi=\pi_0} [ X'_2 \geq X'_t - x_1] = P_{\pi=\pi^*} [ X_2 \geq X_t - x_1 ]\) at the observed \(X_1\).
- Extend \(P_{\pi=\pi^*} [ X_2 \geq X_t - x_1 ]\) to other values of \(X_1\) using the original design.
For the current example, \(\pi^* = 0.3491\).
1-pbinom(9, 20, 0.3491) ) (
[1] 0.12
- \(P_{\pi=\pi_0} [X_2 \geq R_t - x_1 | X_1=x_1]\)
- \(P_{\pi=\pi_1} [X_2 \geq R_t - x_1 | X_1=x_1]\)
- \(P_{\pi=\pi^*} [X_2 \geq R_t - x_1 | X_1=x_1]\)
The unconditional p-value is
\[ p_p = \sum_{x_1=R_1}^{n_1} P_{\pi^*}[X_2 \geq x_t - x_1] P_{\pi^*}[X_1 = x_1], \] where \(\pi^* = 0.3491\).
P-value is \(0.0828\).
TwoStageInf(p0=0.3, p1=0.5, n1=19, R1=7, nT=39, RT=17, x1=7, m2=23, x2=10)
$parameter
p0 p1 n1 R1 nT RT
1 0.3 0.5 19 7 39 17
$character
POWE PET EN
NULL 0.0455 0.66550 25.69
ALT 0.8036 0.08353 37.33
$data
x1 n2 m2 x2
1 7 20 23 10
$p.value
[1] 0.08279
$conf
[1] 0.2821 0.5460
$med.esti
[1] 0.4046
5.3 Confidence interval and an estimate
Once we have a method to compute a p-value, we can compute a confidence interval and an estimate by inverting hypothesis testing.