---
title: "Hypothesis test for the difference between proportions"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Hypothesis test for the difference between proportions}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,comment=NA,fig.width=7,fig.height=5)
library(interpretCI)
library(glue)
```

```{r,echo=FALSE,message=FALSE}
x=propCI(n1=150,n2=100,p1=0.71,p2=0.63,P=0,alternative="greater")
  
two.sided<-greater<-less<-FALSE
if(x$result$alternative=="two.sided") two.sided=TRUE
if(x$result$alternative=="less") less=TRUE
if(x$result$alternative=="greater") greater=TRUE

twoS="The null hypothesis will be rejected if the proportion from population 1 is too big or if it is too small."
lessS="The null hypothesis will be rejected if the proportion from population 1 is too small."
greaterS="The null hypothesis will be rejected if the proportion from population 1 is too big."
```

This document is prepared automatically using the following R command.

```{r,echo=FALSE}
call=paste0(deparse(x$call),collapse="")
x1=paste0("library(interpretCI)\nx=",call,"\ninterpret(x)")
textBox(x1,italic=TRUE,bg="grey95",lcolor="grey50")
```

## Problem


```{r,echo=FALSE}

string=glue("Suppose the Acme Drug Company develops a new drug, designed to prevent colds. The company states that the drug is equally effective for men and women. To test this claim, they choose a a simple random sample of {x$result$n1} women and {x$result$n2} men from a population of {(x$result$n1+x$result$n2)*50} volunteers.

At the end of the study, {x$result$p1*100}% of the women caught a cold; and {x$result$p2*100}% of the men caught a cold. Based on these findings, can we reject the company's claim that the drug is {ifelse(two.sided,'equally',ifelse(less,'more','less'))} effective for men {ifelse(two.sided,'and','compared to')} women? Use a {x$result$alpha} level of significance.")

textBox(string)
```



## Solution

This lesson explains how to conduct a hypothesis test to determine whether the difference between two proportions is significant.

The test procedure, called the **two-proportion z-test**, is appropriate when the following conditions are met:

- The sampling method for each population is simple random sampling.

- The samples are independent.

- Each sample includes at least 10 successes and 10 failures.

- Each population is at least 20 times as big as its sample.

This approach consists of four steps: (1) state the hypotheses, (2) formulate an analysis plan, (3) analyze sample data, and (4) interpret results.


Since the above requirements are satisfied, we can use the following four-step approach to construct a confidence interval.


### 1. State the hypotheses 

The first step is to state the null hypothesis and an alternative hypothesis.

$$Null\ hypothesis(H_0): P_1 `r ifelse(two.sided,"=",ifelse(less,"\\geqq","\\leqq"))` P_2$$
$$Alternative\ hypothesis(H_1): P_1 `r ifelse(two.sided, "\\neq" ,ifelse(less,"<",">"))` P_2$$

Note that these hypotheses constitute a `r ifelse(two.sided,"two","one")`-tailed test. `r ifelse(two.sided,twoS,ifelse(less,lessS,greaterS))`.

### 2. Formulate an analysis plan

For this analysis, the significance level is `r x$result$alpha``. The test method, shown in the next section, is a **two-proportion z-test**.


### 3. Analyze sample data

Using sample data, we calculate the pooled sample proportion (p) and the standard error (SE). Using those measures, we compute the z-score test statistic (z).

$$p=\frac{p_1 \times n_1+ p_2 \times n_2}{n1+n2}$$
$$p=\frac{`r x$result$p1` \times `r x$result$n1`+ `r x$result$p2` \times `r x$result$n2`}{`r x$result$n1`+`r x$result$n2`}$$

$$p=`r x$result$p1*x$result$n1+x$result$p2*x$result$n2`/`r x$result$n1+x$result$n2`=`r round(x$result$ppooled,3)`$$

$$SE=\sqrt{p\times(1-p)\times[1/n_1+1/n_2]}$$

$$SE=\sqrt{`r round(x$result$ppooled,3)`\times`r round(1-x$result$ppooled,3)`\times[1/`r x$result$n1`+1/`r x$result$n2`]}=`r round(x$result$se,3)`$$


$$z=\frac{p_1-p_2}{SE}=\frac{`r x$result$p1`-`r x$result$p2`}{`r round(x$result$se,3)`}=`r round(x$result$z,2)`$$

where $p_1$ is the sample proportion in sample 1, where $p_2$ is the sample proportion in sample 2, $n_1$ is the size of sample 1, and $n_2$ is the size of sample 2.

Since we have a `r ifelse(two.sided,"two","one")`-tailed test, the P-value is the probability that the z statistic is `r if(!greater) "less than"` `r if(!greater) round(-abs(x$result$z),2)` `r if(!less) "or greater than "` `r if(!less) round(abs(x$result$z),2)`.

We can use following R code to find the p value.

```{r,echo=FALSE}

if(two.sided){
               string=glue("pnorm(-abs({round(x$result$z,2)}))\\times2")
} else if(greater){
               string=glue("pnorm({round(x$result$z,2)},lower.tail=FALSE)")
} else{
               string=glue("pnorm({round(x$result$z,2)})")
          }
```

$$p=`r string`=`r round(x$result$pvalue,3)`$$


Alternatively,we can use the Normal Distribution curve to find p value.

```{r}
draw_n(z=x$result$z,alternative=x$result$alternative)
```

### 4. Interpret results. 

Since the P-value (`r round(x$result$pvalue,3)`) is `r ifelse(x$result$pvalue>x$result$alpha,"greater","less")` than the significance level (`r x$result$alpha`), we cannot `r ifelse(x$result$pvalue>x$result$alpha,"reject","accept")` the null hypothesis.


### Result of propCI()

```{r,echo=FALSE}
print(x)
```

### Reference

The contents of this document are modified from StatTrek.com.
Berman H.B., "AP Statistics Tutorial", [online] Available at:  https://stattrek.com/hypothesis-test/difference-in-proportions.aspx?tutorial=AP URL[Accessed Data: 1/23/2022]. 

