Two-stage designs

Simon’s designs and related topics

Author

Tatsuki Koyama, PhD

Published

June 24, 2024

1 Repeat until significance

Suppose we would like to test

\(H_0: \mu = 0\)
\(H_1: \mu > 0\)

with a type I error rate of \(5\%\).

A sample of size 10 is taken, and if \(H_0\) is rejected, the experiment is over. If not, another sample of size 10 is taken, and \(H_0\) is tested using all 20 observations. If \(H_0\) is still not rejected, a third sample of size 10 is taken, and \(H_0\) is tested using all 30 observations. If each test is conducted with \(\alpha=0.05\), what is the actual type I error rate?

The analytical solution is relatively simple when there are only two stages. For three or more stages, a simulation is a much easier approach.
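For the two-stage case, here is a minimal sketch of the analytical calculation. It uses the mvtnorm package and a z-test approximation (known variance) in place of the t-test, under which the two cumulative test statistics are bivariate normal with correlation \(\sqrt{n_1/(n_1+n_2)}\).

library(mvtnorm)
n1 <- 10 ; n2 <- 10
rho  <- sqrt(n1 / (n1 + n2))   # correlation of the stage 1 and stage 2 z statistics
crit <- qnorm(0.95)            # one-sided 5% critical value at each look
S    <- matrix(c(1, rho, rho, 1), 2, 2)
accept_both <- pmvnorm(lower = c(-Inf, -Inf), upper = c(crit, crit), sigma = S)
1 - as.numeric(accept_both)    # overall type I error rate, roughly 0.08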

from ChatGPT

# from ChatGPT
set.seed(123)  # For reproducibility

# Parameters
mu_0 <- 0  # Null hypothesis mean
sigma <- 1  # Standard deviation of the population
n_sims <- 10000  # Number of simulations
alpha <- 0.05  # Significance level

# Function to perform the sequential testing
sequential_test <- function() {
  n1 <- 10  # stage 1 sample size
  n2 <- 10  # additional observations at stage 2
  n3 <- 10  # additional observations at stage 3
  # First sample
  sample1 <- rnorm(n1, mean = mu_0, sd = sigma)
  t1 <- t.test(sample1, mu = mu_0, alternative = "greater")  # one-sided, matching H1: mu > 0
  if (t1$p.value < alpha) {
    return(1)  # Reject H0 and exit this function
  }
  
  # Second sample: 10 more observations, test all 20
  sample2 <- rnorm(n2, mean = mu_0, sd = sigma)
  combined_sample2 <- c(sample1, sample2)
  t2 <- t.test(combined_sample2, mu = mu_0, alternative = "greater")
  if (t2$p.value < alpha) {
    return(1)  # Reject H0
  }
  
  # Third sample: 10 more observations, test all 30
  sample3 <- rnorm(n3, mean = mu_0, sd = sigma)
  combined_sample3 <- c(combined_sample2, sample3)
  t3 <- t.test(combined_sample3, mu = mu_0, alternative = "greater")
  if (t3$p.value < alpha) {
    return(1)  # Reject H0
  }
  
  return(0)  # Fail to reject H0 at all three looks
}


# Run simulations
results <- replicate(n_sims, sequential_test())

# Calculate empirical type I error rate
type_I_error_rate <- mean(results)
type_I_error_rate

with full options

rep_sig <- function(B=10, n1=10, n2=10, n3=10, seed=NA){
    # B: number of simulated trials; n1, n2, n3: per-stage sample sizes
    # B <- 10 ; n1 <- 10 ; n2 <- 10 ; n3 <- 10 ; seed <- 620
    seed <- ifelse(is.na(seed), sample(1:10^6,1), seed)
    set.seed(seed)
    
    # Each column is one simulated trial of n1+n2+n3 observations under H0
    DATA <- matrix(rnorm(B*(n1+n2+n3)), ncol=B)
        
    # One-sided t-test p-values at the three looks
    ttes <- function(v){
        p1 <- t.test(v[1:n1], alt='greater')$p.val 
        p2 <- t.test(v[1:(n1+n2)], alt='greater')$p.val
        p3 <- t.test(v[1:(n1+n2+n3)], alt='greater')$p.val
        c(p1,p2,p3)
    }
    
    pval <- apply(DATA, 2, FUN=ttes)
    ov_pval <- apply(pval, 2, min)  # overall "p-value" = smallest of the three
    list(seed=seed, final_p=ov_pval, DATA=DATA, stage_p=pval)
}

# n <- 10
# sim1 <- rep_sig(B=10000, n1=n, n2=n, n3=n, seed=NA)
# sim0 <- rep_sig(B=10000, n1=n*3, n2=0, n3=0, seed=NA)

Speed comparison

# system.time( rep_sig(B=10000, n1=n, n2=n, n3=n, seed=NA) )
# system.time( results <- replicate(n_sims, sequential_test()) )

Simple adjustment

Will simple \(\alpha\) adjustment over- or under-correct?

  • Bonferroni-correction: \(\alpha_3 = \alpha/3 = 0.0167\)
  • Šidák-type correction: \(1-(1-\alpha_3)^3 = 0.05\) \(\Rightarrow\) \(\alpha_3 = 0.0170\)
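The second value can be computed directly:

1 - (1 - 0.05)^(1/3)   # per-look alpha so that the overall rate is 0.05; about 0.0170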
sim1 <- rep_sig(B=10000, n1=10, n2=10, n3=10, seed=NA)
table( sim1$final_p < 0.05 )
table( sim1$final_p < 0.05/3 ) 

plot(sim1$stage_p[1,], sim1$stage_p[2,], cex=0.8, pch=19, 
     col='#66666666', xlab='Stage 1 p-value', ylab='Stage 2 p-value', 
     xlim=c(0,1), ylim=c(0,1))

cdf <- data.frame(round(cor(t(sim1$stage_p)), 3))  # correlation of the stage-wise p-values
    names(cdf) <- row.names(cdf) <- c('Stage_1','Stage_2','Stage_3')

if(F){  # not run; requires knitr, kableExtra, and magrittr
cdf %>%
    kable("html", booktabs=TRUE, align='c') %>%
    # column_spec(3, color='darkgreen', bold=TRUE) %>%
    # column_spec(4, color='orange', bold=TRUE) %>%
    kable_styling(bootstrap_options = c("striped", "hover"), position='left') %>%
    kable_classic_2(full_width = FALSE)
}

2 Stop when significant

Suppose we would like to test

\(H_0 : p = 0.1\)
\(H_1 : p > 0.1\)

by taking a random sample of size 40 from a Bernoulli(p) distribution. Under \(H_0\), the number of successes out of 40 (\(X\)) follows a Binomial(40, 0.1) distribution. We compute \(P_0[X \geq 8] = 0.042\), so the test that rejects \(H_0\) if \(X \geq 8\) has a type I error rate of 0.042.
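In R:

1 - pbinom(7, 40, 0.1)   # P[X >= 8] under H0, about 0.042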

The data were observed sequentially; the 8th success occurred at \(N = 32\), and we decided to reject \(H_0\) and terminate the study. The conclusion was that \(p\) was significantly greater than 0.1, with an estimate of \(\tilde{p} = 8/32 = 0.25\).

Questions:

  1. Is type I error rate controlled?
  2. Is \(\tilde{p}\) unbiased?

To answer these questions, we must assume that these actions were pre-specified.

  • A sample of size at most 40 will be taken.
  • If \(X = 8\) at any point, then \(H_0\) is rejected, and \(\tilde{p}\) will be calculated as \(\tilde{p} = x/n\) at that point.

Answers:

  1. Yes. Stopping as soon as the 8th success occurs rejects \(H_0\) for exactly the same samples as the fixed test that rejects when \(X \geq 8\) out of 40, so the type I error rate is still 0.042.
  2. No.

\(\tilde{p}\) is an estimator such that

\[ \begin{align*} \tilde{p} = \begin{cases} X/40 & \text{if $X < 8$,} \\ 8 /Y & \text{if $Y \leq 40$,} \end{cases} \end{align*} \] where \(Y\) is the number of trials at which the 8th success occurs. (\(Y\) has a \(\phantom{\text{negative binomial}}\) distribution.)

Before examining \(E_p[\tilde{p}]\), let’s consider \(\hat{p} = X/40\). We know that this estimator (the MLE of \(p\)) is unbiased.

\[ \begin{align*} E_p[\hat{p}] &= \sum_{x=0}^{40} \frac{x}{40} P_p[X=x] \\ &= \sum_{x=0}^{7} \frac{x}{40} P_p[X=x] + \sum_{x=8}^{40} \frac{x}{40} P_p[X=x] \\ & = p \end{align*} \]

Now we write:

\[ \begin{align*} E_p[\tilde{p}] &= \sum_{x=0}^{7} \frac{x}{40} P_p[X=x] + \sum_{y=8}^{40} \frac{8}{y} P_p[Y=y]. \end{align*} \]

So, to calculate the bias, we need to compare the second terms of these two expressions. (Homework…)

Let’s use R to compute the bias.

bias_sws <- function(p, N, x){
    # Returns E_p[p-tilde] for the stop-when-significant rule (bias = E_p[p-tilde] - p)
    xv <- 0:(x-1)
        one <- sum( xv/N * dbinom(xv, N, p) )    # paths that never reach x successes
    yv <- x:N
        two <- sum( x/yv * dnbinom(yv-x, x, p) ) # paths stopped at the x-th success
            ## dnbinom(k, x, p) ; k is the number of failures before the x-th success.
            ## e.g., dnbinom(8, 4, 0.2) = choose(8+3,3) * 0.2^3 * 0.8^8 * 0.2
    one + two
}

trueP <- seq(0.05, 0.95, by=0.05)
p.tilde <- sapply(as.list(trueP), bias_sws, N=40, x=8)
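A quick look at the bias, reusing the trueP and p.tilde vectors just computed:

round(data.frame(trueP, E.p.tilde = p.tilde, bias = p.tilde - trueP), 4)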

Under \(H_0\) (\(p=0.1\)), bias is negligible. But that is not interesting. When the true \(p\) is higher, the bias may not be negligible. That is concerning.

Can we intuitively understand why the bias is positive (\(E_p[\tilde{p}] > p\))?

Suppose that the true \(p = 0.4\), and we are at an interim look with \(n=17\):

x            0     1     2     3     4     5     6     7     8     >8
P[X=x]       0.000 0.002 0.010 0.034 0.080 0.138 0.184 0.193 0.161 0.199
\(\hat{p}\)  0.000 0.059 0.118 0.176 0.235 0.294 0.353 0.412 0.471 >0.471
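These numbers can be reproduced as follows (with \(p=0.4\) and \(n=17\)):

x <- 0:8
round(rbind("P[X=x]" = dbinom(x, 17, 0.4), "p.hat" = x/17), 3)
1 - pbinom(8, 17, 0.4)   # P[X > 8], about 0.199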

At an interim n, if \(\hat{p}\) is low, the trial tends to continue, and the downward bias tends to get corrected; however, when \(\hat{p}\) is high, the trial tends to have stopped, and the upward bias remains uncorrected. Thus \(\tilde{p}\) tends to be higher than the true \(p\).

3 Simon’s design

In phase II studies, it is crucial to have mechanisms to stop early due to excessive toxicity, typically overseen by a Data Monitoring Committee. Additionally, it is desirable to discard ineffective treatments early, for which two-stage designs with a futility stop are commonly used in oncology.

Notation:

  • Stage I sample size: \(n_1\).
  • Stage I data: \(X_1 \sim Binomial(n_1, p)\).
  • Stage I critical value: \(r_1\), chosen so that if \(X_1 \leq r_1\), the study is terminated for futility.
  • Stage II sample size: \(n_2\) (taken only if the study continues).
  • Stage II data: \(X_2 \sim Binomial(n_2, p)\).
  • Total sample size: \(N_t = n_1 + n_2\).
  • Total data: \(X_t = X_1 + X_2\).
  • Stage II critical value: \(r_t\), chosen so that if \(X_t \leq r_t\), the treatment is declared futile; otherwise efficacy is concluded.
  • Realized Stage II sample size: \(N_2\), which is \(0\) if the study stops after Stage I and \(n_2\) otherwise.

Also we define \(R_1 = r_1 + 1\) and \(R_t = r_t + 1\).

  • If \(X_1 \leq r_1\), then stop for futility.
  • If \(X_1 \geq R_1\), then continue to Stage II.

In his 1989 paper, Simon introduced two criteria for selecting a two-stage design for single-arm studies with a one-sided test.

  1. The optimal design has the smallest expected sample size under \(H_0\), \(n_1 + E_0[N_2]\).
  2. The minimax design has the smallest total sample size, \(N_t\).

3.1 Bad Simon’s designs

For \(p_0 = 0.17\), \(p_1 = 0.32\), \(\alpha=0.1\) and \(\beta=0.1\), we have:

clinfun::ph2simon(pu=0.17, pa=0.32, ep1=0.10, ep2=0.10)

 Simon 2-stage Phase II design 

Unacceptable response rate:  0.17 
Desirable response rate:  0.32 
Error rates: alpha =  0.1 ; beta =  0.1 

           r1 n1  r  n EN(p0) PET(p0)   qLo   qHi
Minimax     9 45 12 53  46.80  0.7748 0.525 1.000
Admissible  5 30 13 58  41.28  0.5972 0.032 0.525
Optimal     4 25 14 63  41.12  0.5759 0.000 0.032

For this parameter setting, the smallest \(N_t\) is 53, and the smallest \(n_1 + E_0[N_2]\) is 41.12.
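These characteristics follow directly from the design parameters; for example, for the Optimal design (\(r_1=4\), \(n_1=25\), \(n=63\)):

pet0 <- pbinom(4, 25, 0.17)            # probability of early termination under H0
en0  <- 25 + (1 - pet0) * (63 - 25)    # expected sample size under H0
c(PET = pet0, EN = en0)                # about 0.576 and 41.12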

Let’s not worry about “Admissible” design for now. Why might you not want to use this Minimax design? Or this Optimal design?

The single-stage design for the same parameter setting is:

ph2single(pu=0.17, pa=0.32, ep1=0.10, ep2=0.10)
   n  r Type I error Type II error
1 57 13      0.09349       0.08644
2 60 14      0.07453       0.09428
3 61 14      0.08394       0.08134
4 62 14      0.09407       0.06992
5 64 15      0.06706       0.08854
## Single-stage Binomial test
## Sample size and power

ssbt <- function(n.min, n.max, p0, p1, alp){
    # n.min <- 10 ; n.max <- 40 ; p0 <- 0.17 ; p1 <- 0.32 ; alp=0.10
    nv <- seq(n.min,n.max)
    cv <- qbinom(1-alp,nv,p0)            # reject H0 if X > cv
        type_I <- 1-pbinom(cv,nv,p0)     # P[X > cv | p0]
        powerx <- 1-pbinom(cv,nv,p1)     # P[X > cv | p1]
    para <- c(p0=p0, p1=p1, alp=alp)
    list(para,nv=nv,cv=cv,type_I=type_I,power=powerx)
}

ex1 <- ssbt(n.min=50, n.max=70, p0=0.17, p1=0.32, alp=0.10)

plot(c(50,70), c(0,1), type='n', xlab='N', ylab='Power', 
     panel.first=grid(nx=NULL, ny=NULL))
abline(h=0.10, col='royalblue')
abline(h=0.90, col='royalblue')
    title(main='Power and Type I Error Rate', adj=0, font.main=3)
lines(ex1$nv, ex1$type_I, col='darkgreen', type='b')
    pc <- c(1,19)[1+(ex1$power > 0.90)]
lines(ex1$nv, ex1$power, col='darkgreen', type='b', pch=pc)

Some bad Simon’s designs

library(clinfun)
ph2simon(pu=0.23, pa=0.42, ep1=0.05, ep2=0.1)

 Simon 2-stage Phase II design 

Unacceptable response rate:  0.23 
Desirable response rate:  0.42 
Error rates: alpha =  0.05 ; beta =  0.1 

           r1 n1  r  n EN(p0) PET(p0)   qLo   qHi
Minimax    14 47 16 50  47.31  0.8968 0.739 1.000
Admissible  7 28 17 54  35.96  0.6938 0.135 0.739
Admissible  7 27 18 58  35.34  0.7311 0.006 0.135
Optimal     5 21 19 62  35.31  0.6510 0.000 0.006
ph2simon(pu=0.29, pa=0.47, ep1=0.05, ep2=0.1)

 Simon 2-stage Phase II design 

Unacceptable response rate:  0.29 
Desirable response rate:  0.47 
Error rates: alpha =  0.05 ; beta =  0.1 

           r1 n1  r  n EN(p0) PET(p0)   qLo   qHi
Minimax    22 59 23 61  59.13  0.9362 0.840 1.000
Admissible  8 28 24 64  43.34  0.5740 0.139 0.840
Optimal    10 32 25 67  42.85  0.6899 0.000 0.139
ph2simon(pu=0.17, pa=0.34, ep1=0.05, ep2=0.2)

 Simon 2-stage Phase II design 

Unacceptable response rate:  0.17 
Desirable response rate:  0.34 
Error rates: alpha =  0.05 ; beta =  0.2 

           r1 n1  r  n EN(p0) PET(p0)   qLo   qHi
Minimax     9 36 10 39  36.22  0.9272 0.693 1.000
Admissible  3 18 11 43  27.17  0.6331 0.583 0.693
Optimal     3 17 11 44  25.78  0.6749 0.000 0.583

3.2 Admissible designs

The two criteria, “minimizing \(N\)” and “minimizing \(n_1 + E_0[N_2]\),” give two extreme designs; i.e., among all the designs that satisfy the \(\alpha\) and \(\beta\) conditions, minimizing \(N\) maximizes \(n_1 + E_0[N_2]\), and vice versa.

To consider the two criteria simultaneously, let’s define \(\omega(q) = qN + (1-q)(n_1 + E_0[N_2])\), and find a design that minimizes \(\omega(q)\) given \(q\). Clearly, when \(q=1\), the resulting design is the minimax design since \(\omega(1) = N\). And when \(q=0\), the resulting design is the optimal design since \(\omega(0) = n_1 + E_0[N_2]\).

“Admissible” in this context implies that these designs are acceptable although they are not the minimax or optimal ones.

In the output of ph2simon, each design is accompanied by the range of \(q\) values (qLo to qHi) over which that design minimizes \(\omega(q)\). In the final example above, for \(q\) in \((0, 0.583)\) the optimal design minimizes \(\omega(q) = qN + (1-q)(n_1 + E_0[N_2])\). Unfortunately, for this parameter setting, no good design exists.
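To see how these ranges arise, we can compute \(\omega(q)\) for the three designs in the last example, using the N and EN(p0) values printed above:

q  <- seq(0, 1, by = 0.01)
N  <- c(Minimax = 39, Admissible = 43, Optimal = 44)
EN <- c(Minimax = 36.22, Admissible = 27.17, Optimal = 25.78)
omega <- sapply(seq_along(N), function(i) q * N[i] + (1 - q) * EN[i])
colnames(omega) <- names(N)
# Design minimizing omega(q) at each q: Optimal for q below about 0.583,
# then Admissible, then Minimax for q above about 0.693
table(names(N)[apply(omega, 1, which.min)])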

Professor Ana Ivanova’s webpage also solves for Simon’s designs.

4 Inference from Simon’s design

This section is roughly based on Koyama T and Chen H. Proper inference from Simon’s two-stage designs. Stat Med, 27(16):3145-3154, 2008. PMID: 17960777. PMCID: PMC6047527.

Problems

  • The p-value computed as if the data were observed in a single-stage design (\(p_c\)) may contradict the result of the hypothesis test. That is, \(H_0\) may be rejected even though \(p_c > \alpha\).

  • Because a Simon’s design is specified in terms of sample sizes and critical values, carrying out the hypothesis test is not straightforward when the actual sample size deviates from the planned one.

  • The maximum likelihood estimate \(\hat{\pi} = X_t / N_t\) of \(\pi\) is biased.

4.1 Example

For testing \(H_0: \pi = 0.1\), \(H_1: \pi > 0.1\), let’s set \(\alpha=0.05\) and \(\beta=0.2\) at \(\pi=0.3\). (\(\pi_0=0.1\), and \(\pi_1 = 0.3\)). Here are the Simon’s designs:

pi0 <- 0.10
pi1 <- 0.30
alp <- 0.05
bet <- 0.20
pow <- 1-bet

( ex_sim <- ph2simon(pu=pi0, pa=pi1, ep1=alp, ep2=bet) )

 Simon 2-stage Phase II design 

Unacceptable response rate:  0.1 
Desirable response rate:  0.3 
Error rates: alpha =  0.05 ; beta =  0.2 

           r1 n1 r  n EN(p0) PET(p0)   qLo   qHi
Minimax     1 15 5 25  19.51  0.5490 0.732 1.000
Admissible  1 12 5 26  16.77  0.6590 0.482 0.732
Admissible  1 11 5 27  15.84  0.6974 0.293 0.482
Optimal     1 10 5 29  15.01  0.7361 0.000 0.293

Let’s use the Optimal design. The stage 1 sample size is \(10\) and the stage 1 critical value is \(1\); i.e., the trial stops if \(X_1 \leq 1\). If the trial continues, an additional sample of size \(19\) is taken, and \(H_0\) is rejected if \(X_t > 5\). Suppose we observe \(X_1 = 2\) and \(X_t = 6\). This sample path leads to rejection of \(H_0\) in stage 2.

What is the conventional p-value if we compute it as if the data were observed in a single stage?

\[ p_c = P[ X \geq 6 ], \] where \(X \sim Binom(29, 0.1)\).

xt <- 6 ; nt <- 29
1-pbinom(xt-1, nt, pi0)
[1] 0.06372
    pc <- sum(dbinom(xt:nt, nt, pi0))  ## the same computation

The p-value is 0.064; \(H_0\) is rejected even though this p-value is greater than \(\alpha\).

4.2 Proper P-value

The conventional p-value can be expressed as below.

\[ p_c = P_0[ X \geq x_t ] = \sum_{x_1=0}^{n_1} P_0[X_2 \geq x_t - x_1] P_0[X_1 = x_1], \] where

  • \(X \sim Binom(29, 0.1)\).
  • \(X_1 \sim Binom(10, 0.1)\).
  • \(X_2 \sim Binom(19, 0.1)\).

For the current Simon’s design, this summation includes impossible sample paths because when \(X_1 \leq 1\), there is no stage 2.

Therefore, the proper p-value for a Simon’s design is

\[ p_p = \sum_{x_1=R_1}^{n_1} P_0[X_2 \geq x_t - x_1] P_0[X_1 = x_1]. \]

Clearly, \(p_p \leq p_c\), and it can be computed using the twostage.inference() function in the clinfun package.

twostage.inference(x=6, r1=1, n1=10, n=29, pu=0.1)
 pumvue p.value  5% LCL 95% UCL 
0.26131 0.04709 0.10160 0.40070 
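As a check, the sum above can be evaluated directly (here \(R_1=2\), \(n_1=10\), \(n_2=19\), and \(x_t=6\)):

x1v <- 2:10                        # continuation region: X1 >= R1 = 2
# P0[X2 >= xt - x1] equals 1 whenever xt - x1 <= 0
sum( (1 - pbinom(6 - x1v - 1, 19, 0.1)) * dbinom(x1v, 10, 0.1) )   # about 0.047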

We define the p-value as the probability, under the null hypothesis, of obtaining a result at least as extreme as the observed one, and we assume that a larger value of \(X_t\), regardless of the individual \(X_1\) and \(X_2\), is more extreme. For example, the two sample paths (\(X_1=2\), \(X_2=6\)) and (\(X_1=3\), \(X_2=5\)) give the same \(X_t\) and hence the same p-value.

4.3 Confidence interval

Because Simon’s design is used for a one-sided hypothesis test, an accompanying confidence interval should be one-sided, of the form \((\pi_L, 1)\). The method described below can also be used to form a more common two-sided confidence interval of the form \((\pi_L, \pi_U)\). We will use a \(90\%\) two-sided confidence interval, so that the two statements “\(H_0\) is rejected” and “\(\pi_0\) is outside of \((\pi_L, \pi_U)\)” are equivalent.

We can compute the proper p-value (\(p_p\)) for testing \(H_0: \pi = \pi'_0\) for any \(\pi'_0\). A \(90\%\) confidence interval is a collection of \(\pi'_0\) such that the corresponding p-value is within \([0.05, 0.95]\) (for one-sided \(\alpha=0.05\)).

Continuing with the example, we have \(X_1 = 2\) and \(X_2 = 4\) for the design with \(n_1=10\), \(r_1=1\), \(n_2 = 19\). Using the twostage.inference function with various values of pu, we can find the values of \(\pi'_0\) that give p-values of \(0.05\) and \(0.95\).

twostage.inference(x=6, r1=1, n1=10, n=29, pu=0.101505)
 pumvue p.value  5% LCL 95% UCL 
 0.2613  0.0500  0.1016  0.4007 
twostage.inference(x=6, r1=1, n1=10, n=29, pu=0.4007)
 pumvue p.value  5% LCL 95% UCL 
 0.2613  0.9500  0.1016  0.4007 
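Rather than trying values of pu by hand, we can solve for the two limits with uniroot. A sketch, assuming twostage.inference() returns the named vector printed above (pfun below is just a helper defined for this purpose):

pfun <- function(p) twostage.inference(x=6, r1=1, n1=10, n=29, pu=p)["p.value"]
uniroot(function(p) pfun(p) - 0.05, lower=0.001, upper=0.999)$root   # lower limit, about 0.102
uniroot(function(p) pfun(p) - 0.95, lower=0.001, upper=0.999)$root   # upper limit, about 0.401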

4.4 Median estimator

Similarly, the value \(p^*_0\) such that the p-value for testing \(H_0: p=p^*_0\) equals \(0.5\) for the observed sample path may be used as an estimate of \(\pi\). If the test statistic is continuous, this estimator is known as the median unbiased estimator (Cox and Hinkley 1974); its median is the true parameter value. The proof uses the fact that the p-value is distributed \(Unif(0,1)\) under \(H_0\).

  • Cox, D.R., Hinkley, D.V. (1974). Theoretical Statistics. Chapman & Hall, London.
twostage.inference(x=6, r1=1, n1=10, n=29, pu=0.2147)
 pumvue p.value  5% LCL 95% UCL 
 0.2613  0.5001  0.1016  0.4007 

4.5 R program for inference

I have written R code to compute p-values, a confidence interval, and an estimate.

https://biostat.app.vumc.org/wiki/pub/Main/TatsukiRcode/twostage2020Web.R

source('https://biostat.app.vumc.org/wiki/pub/Main/TatsukiRcode/twostage2020Web.R')
TwoStageInf(p0=0.1, p1=0.3, n1=10, R1=2, nT=29, RT=6, x1=2, x2=4)
$parameter
   p0  p1 n1 R1 nT RT
1 0.1 0.3 10  2 29  6

$character
        POWE    PET    EN
NULL 0.04709 0.7361 15.01
ALT  0.80506 0.1493 26.16

$data
  x1 n2 m2 x2
1  2 19 19  4

$p.value
[1] 0.04709

$conf
[1] 0.1015 0.4008

$med.esti
[1] 0.2147

4.6 Aside: Programming notes

Confidence interval for a binomial proportion

Let’s say we have \(X=7\) where \(X \sim Binom(20, \pi)\). We can estimate \(\pi\) using the binconf function in the Hmisc package.

library(Hmisc)
x <- 7
n <- 20
binconf(x=x, n=n, method='all')
           PointEst  Lower  Upper
Exact          0.35 0.1539 0.5922
Wilson         0.35 0.1812 0.5671
Asymptotic     0.35 0.1410 0.5590

The asymptotic (Wald) method.

phat <- x / n
z <- qnorm(0.975)
phat + c(-1,1) * z * sqrt(phat * (1 - phat)/n)
[1] 0.141 0.559

Wilson score method. \[ \frac{1}{1 + z^2/n} \left(\hat{p} + \frac{z^2}{2n} \pm z \sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z^2}{4n^2}}\right) \]

(1/(1 + z^2/n)) * (phat + z^2/(2 * n) + c(-1, 1) * z * sqrt(phat * (1 - phat)/n + z^2/(4 * n^2)))
[1] 0.1812 0.5671

Finally, the Exact method (Clopper-Pearson) confidence interval is \((p_L, p_U)\), where \(p_L\) and \(p_U\) satisfy the following:

  • \(P[ X \geq x | p=p_L] = \alpha/2\)
  • \(P[X \leq x|p=p_U] = \alpha/2\)

This confidence interval is the collection of \(p^*\) such that \(H^*_0: p=p^*\) would not be rejected by the observed sample. This requires computation of \(p\)-values for various values of \(p^*\).

f1 <- function(p.star, x, n){
    1-pbinom(x-1, n, p.star)
    ## or sum(dbinom(x:n, n, p.star))
}

f2 <- function(p.star, x, n){ 
    pbinom(x, n, p.star) 
    ## or sum(dbinom(0:x, n, p.star))
}

And we need to find the values of p.star at which f1 and f2 equal \(\alpha/2 = 0.025\); these are \(p_L\) and \(p_U\), respectively.

f1(0.1539, x=7, n=20)
[1] 0.02499
f2(0.5922, x=7, n=20)
[1] 0.02499

Rather than using a trial-and-error method, we can use uniroot to find the root.

g1 <- function(p.star, x, n, alp){
    alp - f1(p.star, x, n)
}
uniroot(g1, x=7, n=20, alp=0.025, lower=0, upper=1)$root
[1] 0.1539
g2 <- function(p.star, x, n, alp){
    alp - f2(p.star, x, n)
}
uniroot(g2, x=7, n=20, alp=0.025, lower=0, upper=1)$root
[1] 0.5922

To solve for \(p^*\) in a more straightforward way without iterations, we can use the following relationship between a Binomial random variable and a Beta random variable.

If \(X \sim Binomial(n, p)\) and \(Y \sim Beta(k, n-k+1)\), then \[ P[X \geq k] = P[Y \leq p]. \]

For the lower bound, instead of solving for \(p_L\) in \(P[X \geq 7 | p = p_L] = 0.025\) iteratively, we can solve \(P[Y \leq p_L] = 0.025\), where \(Y \sim Beta(7, 20-7+1)\).

qbeta(0.025, 7, 20-7+1)
[1] 0.1539

For the upper bound, we needed to solve \(P[X \leq 7 | p = p_U] = 0.025\), which is \(P[X \geq 8|p = p_U] = 0.975\). Instead, we can solve \(P[Y \leq p_U] = 0.975\), where \(Y \sim Beta(8,20-8+1)\).

qbeta(0.975, 8, 20-8+1)
[1] 0.5922

5 Extending or shortening the study

As an example, consider the design \((n_1=19, R_1=7, N_t=39, R_t=17)\). This is the minimax design for \(p_0=0.3\) and \(p_1=0.5\) with \(\alpha=0.05\) and \(\beta=0.2\).

ph2simon(pu=0.3, pa=0.5, ep1=0.05, ep2=0.20)

 Simon 2-stage Phase II design 

Unacceptable response rate:  0.3 
Desirable response rate:  0.5 
Error rates: alpha =  0.05 ; beta =  0.2 

           r1 n1  r  n EN(p0) PET(p0)   qLo   qHi
Minimax     6 19 16 39  25.69  0.6655 0.252 1.000
Admissible  6 18 17 42  24.68  0.7217 0.208 0.252
Optimal     5 15 18 46  23.63  0.7216 0.000 0.208

We assume that the sample size for stage 1 is always as planned. Let’s say we observe \(X_1=7\) in stage 1. In stage 2, the planned sample size is \(n_2=20\), but the actual sample size is \(n'_2=23\), and we observe \(X'_2=10\). Should \(H_0\) be rejected?

5.1 Conditional p-value

The conditional p-value is the p-value for the second stage conditional on the first-stage data: \(P_0[X'_2 \geq x'_2 \mid X_1=x_1]\), where \(X'_2 \sim Binom(n'_2, \pi_0)\). It does not depend on \(X_1\), except that \(X'_2\) is not defined when \(X_1 < R_1\).

For \(n'_2= 23\) and \(X'_2 = 10\), we have \(P_0[ X'_2 \geq 10 ]\), where \(X'_2 \sim Binom(23, 0.3)\).

1-pbinom(9, 23, 0.3)
[1] 0.1201

This is compared to the conditional type I error rate computed with the original \(n_2\) and \(R_t\).

The proper conditional type I error rate at \(X_1 = 7\) is

\[ P_0[X_t \geq R_t | X_1=x_1 ] = P_0[ X_t \geq 17 | X_1 = 7] = P_0[X_2 \geq 10] \]

1-pbinom(9, 20, 0.3)
[1] 0.04796

\(H_0\) is not rejected because the conditional p-value is greater than the conditional type I error rate.

5.2 Unconditional p-value

The unconditional p-value without sample size change is \[ p_p = \sum_{x_1=R_1}^{n_1} P_0[X_2 \geq x_t - x_1] P_0[X_1 = x_1]. \]

When \(n_2\) is changed, we want to similarly compute

\[ p_p = \sum_{x_1=R_1}^{n_1} P_0[X'_2 \geq x'_t - x_1] P_0[X_1 = x_1]. \]

But this does not work…

When, however, the stage 2 sample size is changed, we cannot extend the conditional P-value based on \(X'_t\). It is demonstrated in Section 4.1 that the same value of \(X'_t\) may or may not lead to the rejection of H0 depending on the sample paths. … extending the conditional P-value based on \(X'_t\) would not make sense.

The conditional p-value \(P_0[X'_2 \geq x'_t - x_1]\) is easily computed for the observed value of \(X_1=x_1\), but to compute the unconditional p-value we need it for all values of \(X_1\) in the continuation region.

To do this, let’s examine the following plot, which shows the conditional type I error rate and the conditional power at the alternative. The two lines shown below are

  • \(P_{\pi=\pi_0} [X_2 \geq R_t - x_1 | X_1=x_1]\)
  • \(P_{\pi=\pi_1} [X_2 \geq R_t - x_1 | X_1=x_1]\)

Blue line = The original conditional type I error rate \[ P_0[ X_2 \geq R_t - x_1 | X_1 = x_1 ] = P_0[ X_2 \geq 17 - x_1 | X_1 = x_1 ], \] where \(X_2 \sim Binom(20, 0.3)\) under \(H_0\).

x1v <- 0:19
n2 <- 20 # Planned
p0 <- 0.3 # Null
cp0 <- 1-pbinom(17-x1v-1, n2, p0)
cp0 <- replace(cp0, x1v < 7, 0)

data.frame(x1v, cp0)
   x1v     cp0
1    0 0.00000
2    1 0.00000
3    2 0.00000
4    3 0.00000
5    4 0.00000
6    5 0.00000
7    6 0.00000
8    7 0.04796
9    8 0.11333
10   9 0.22773
11  10 0.39199
12  11 0.58363
13  12 0.76249
14  13 0.89291
15  14 0.96452
16  15 0.99236
17  16 0.99920
18  17 1.00000
19  18 1.00000
20  19 1.00000

The actual sample size for stage 2 is 23, and \(X'_2 = 10\); under \(H_0\), \(X'_2 \sim Binom(23, 0.3)\).

The observed conditional p-value is 0.1201 (the black dot in the plot).

We proposed to extend the conditional p-value to other values of \(X_1\) as follows:

  1. Compute the conditional p-value, \(P_{\pi=\pi_0} [ X'_2 \geq x'_t - x_1]\).
  2. Find \(\pi^*\) such that \(P_{\pi=\pi_0} [ X'_2 \geq x'_t - x_1] = P_{\pi=\pi^*} [ X_2 \geq R_t - x_1 ]\) at the observed \(x_1\).
  3. Extend \(P_{\pi=\pi^*} [ X_2 \geq R_t - x_1 ]\) to the other values of \(X_1\) using the original design.

For the current example, \(\pi^* = 0.3491\).

( 1-pbinom(9, 20, 0.3491) )
[1] 0.12
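Equivalently, \(\pi^*\) can be found with uniroot:

cp.obs <- 1 - pbinom(10 - 1, 23, 0.3)   # observed conditional p-value, 0.1201
uniroot(function(p) (1 - pbinom(10 - 1, 20, p)) - cp.obs,
        lower = 0.01, upper = 0.99)$root    # about 0.3491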

The plot now shows three conditional probability curves:

  • \(P_{\pi=\pi_0} [X_2 \geq R_t - x_1 | X_1=x_1]\)
  • \(P_{\pi=\pi_1} [X_2 \geq R_t - x_1 | X_1=x_1]\)
  • \(P_{\pi=\pi^*} [X_2 \geq R_t - x_1 | X_1=x_1]\)

The unconditional p-value is

\[ p_p = \sum_{x_1=R_1}^{n_1} P_{\pi^*}[X_2 \geq R_t - x_1]\, P_{\pi_0}[X_1 = x_1], \] where \(\pi^* = 0.3491\) and \(X_1 \sim Binom(n_1, \pi_0)\).

P-value is \(0.0828\).
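As a check, the sum can be evaluated directly using the design parameters (\(R_1=7\), \(n_1=19\), \(n_2=20\), \(R_t=17\)) and \(\pi^* = 0.3491\):

x1v  <- 7:19                                   # continuation region: X1 >= R1 = 7
cond <- 1 - pbinom(17 - x1v - 1, 20, 0.3491)   # extended conditional p-value at each x1
sum( cond * dbinom(x1v, 19, 0.3) )             # about 0.0828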

TwoStageInf(p0=0.3, p1=0.5, n1=19, R1=7, nT=39, RT=17, x1=7, m2=23, x2=10)
$parameter
   p0  p1 n1 R1 nT RT
1 0.3 0.5 19  7 39 17

$character
       POWE     PET    EN
NULL 0.0455 0.66550 25.69
ALT  0.8036 0.08353 37.33

$data
  x1 n2 m2 x2
1  7 20 23 10

$p.value
[1] 0.08279

$conf
[1] 0.2821 0.5460

$med.esti
[1] 0.4046

5.3 Confidence interval and an estimate

Once we have a method to compute a p-value, we can compute a confidence interval and an estimate by inverting the hypothesis test.