Want to get started on ab testing? You probably have to first understand hypothesis testing and the following 4 terms:

- Type I Error
- Type II Error
- Sample Size
- Power

- Some python (or any other programming languages)
- Understanding of basic statistics (High school would be sufficient)
- Basic distributions, pdf, cdf.
- expected value, variance.
- central limit theorem, confidence intervals, z-score.

`requirements.txt`

:

```
numpy==1.19.4
pandas==1.1.5
statsmodels==0.12.1
matplotlib==3.3.3
```

Before we dive in, let's start by generating some random numbers:

In this example, we will use exponential distribution (but you wouldn't know this in real life) - so let's play along:

The exponential function is defined as :

f(x, \beta) =
\begin{cases}
\frac{1}{\beta}e^{(-x/\beta)}& x \ge 0 \\\\
0 & x < 0
\end{cases}

E[X] = \frac{1}{\beta},
Var[x] =\frac{1}{\beta^2}

Note, some sites refer exponential function with the rate \lambda parameter with \lambda = 1/\beta.

Now let's import some libraries, define a python function, and generate some random numbers:

```
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as st
def gen_exp(n_samples: int, size_sample: int, scale_input: float) -> np.ndarray:
"""[summary]
Args:
n_samples (int): [number of samples]
size_sample (int): [each sample size]
scale_input (float): [the exponential distribution scale parameter]
Returns:
np.ndarray: [returns a n_sample x size_sample ndarry]
"""
np.random.seed(100)
values: np.ndarray = np.random.exponential(
scale=scale_input, size=(n_samples, size_sample)
)
return values
eg1_n = 10000
eg1_size = 500
eg1_scale = 0.5
eg1_var = eg1_scale ** 2
eg1_null_mean = eg1_scale
eg1_data = gen_exp(n_samples=eg1_n, size_sample=eg1_size, scale_input=eg1_scale)
eg1_means = np.mean(eg1_data, axis=1)
```

To summarize, this is what the code did:

- generate some data (that follows exponential distribution)
- calculate sample mean
- The population variance is known (more on this in t-test if you do not know your population variance) to calculate z-statistics \frac{z-\mu}{\sigma/\sqrt{n}}.

In the above setup, 10000 different samples of size 500 each are generated. In reality, you would only have 1 such sample. E.g suppose you are running a factory line, and you take 0.1% of your supply chain to take measurements. But lets assume somehow you magically are able to take measurements of all your supply chain.

Now, suppose you want to test whether the population mean is equals to `0.5`

(which is the scale parameter in this case) we have to calculate the standardized z-score:

`eg1_z_score = ((eg1_means) - 0.5) / (eg1_var ** 0.5 / eg1_size ** 0.5)`

Recall that the distribution of sample means follows normal distribution (due to central limit theorem).

Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}}

Where:

- \bar{X} is the sample mean.
- \mu is the population mean.
- \sigma is the population variance.
- n is the sample size.

Further more, your management decides to have a 95% confidence level (explained further later).

```
error_rate = 0.05
critical_region = st.norm.ppf(error_rate / 2)
eg1_rejected = (eg1_z_score > -critical_region) | (eg1_z_score < critical_region)
print(sum(eg1_rejected) / eg1_n)
"""
0.049
# Notice that this is almost the same as error_rate
"""
```

We will now introduce a few terms:

- The
`hypothesis test`

`alpha`

\alpha`Type I error`

`P-value`

In hypothesis testing, we always come up with a `null`

hypothesis and an `alternative`

hypothesis.

The `null`

and `alternative`

are two opposing roads. The null is either rejected or it is not. Only if the null is rejected, can we proceed to the alternative. The "alternative" is simply the other option when the null is rejected; nothing more.

H_0 | H_a |
---|---|

Assumption, status quo, nothing new | Rejection of an assumption |

Assumed to be "True"; a given. | Rejection of an assumption or the given |

Negation of the research question | Research question to be "proven" |

Always contains an equality =, \leq,\geq | Does not contain equality \neq,<,> |

In any hypothesis testing, we assume that the null is equal to some values, while the alternative is not equal, i.e a two tailed test such as the example above.

We can also assume the null to be less than or greater than a certain value while the alternative is the opposite that would result in a one tailed test.

- H_0 =, H_a \neq
- H_0 \leq, H_a >,
- H_0 \geq, H_a <

In a two tail test with error rate `0.05`

, it means that for any `z-statistics`

lower than `-1.96`

or higher than `1.96`

are rejected (this has to do with confidence interval).

```
st.norm.ppf(0.975) # 1 - alpha/2, alpha = 0.05
"""
1.959963984540054
"""
```

In the earlier example, this would be our hypothesis:

H_0 = 0.5, H_a \neq 0.5

In some sense, z-scores that are above or lower than the critical region are out-liers or extreme points under the (normal) distribution curve that will result in rejection of the null hypothesis.

- All statistical conclusions are made in reference to the null hypothesis.
- As researchers we either
**reject**the null hypothesis or**fail to reject**the null hypothesis; we do not**accept**the null. - This is due to the fact that the null hypothesis is assumed to be true from the start; rejecting or failing to rejection an assumption.
- If we
**reject**the null hypothesis, then we conclude the data supports the alternative hypothesis. - however if we
**fail to reject**the null hypothesis, it does not mean we have proven the null hypothesis is "true".

In the earlier example, **all samples** are actually all **true**; they came from the same distribution, they are **expected** to have the mean 0.5 but yet some of them got rejected!

let's take a look at some of the sample means that got rejected:

```
eg1_means[eg1_rejected][0:10]
"""
array([0.54408767, 0.54971147, 0.45515848, 0.55697267, 0.54692577,
0.57487153, 0.45142467, 0.55958869, 0.44157481, 0.45038526])
"""
```

Hmm, the values are about 10% (0.05) away from the mean (0.5)!

Indeed, if we see the rejected z-scores means they do seem like outlier if we observe their z-statistics!

```
eg1_z_score[eg1_rejected][0:10]
"""
eg1_z_score[eg1_rejected][0:10]
array([ 1.97166068, 2.22316444, -2.00537382, 2.54789543, 2.09858404,
3.3483564 , -2.17235491, 2.66488728, -2.61285386, -2.21883879])
"""
```

This error, is known as Type I error and is represented by \alpha.

The formal definition:

Therefore, we observe that

`alpha`

\alpha is the probability of rejecting the null hypothesis when the null hypothesis is true. This is also sometimes referred as`significance level`

.

The above concept is also known as `Type I error`

, also known as `false positive`

.

The formal definition:

The type I error rate or significance level is the probability of rejecting the null hypothesis given that it is true. It is denoted by the Greek letter α (alpha) and is also called the alpha level.

In the earlier example, when we look at the "extreme" z-scores, we notice that they are "rare" chances of occurring, the question is, how rare? This is where p-values come in.

For the formal definition (which is kinda hard to understand):

In null hypothesis significance testing, the p-value is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct.

Let's take a look at the same example:

Note that the P-value for a two-tailed test is always two times the P-value for either of the one-tailed tests.

```
alpha = 0.05
p_values = st.norm.sf(abs(eg1_z_score)) # one-sided
p_values = st.norm.sf(abs(eg1_z_score)) * 2 # twosided
print(sum(p_values < alpha))
print(p_values[eg1_rejected][0:10])
"""
490
[0.04864836 0.02620471 0.0449231 0.0108375 0.03585358 0.00081292
0.0298289 0.00770141 0.00897897 0.02649769]
"""
```

This means that the probability of the rejected samples are quite rare!

- p-value is the chance that your sample data correctly reflects the population assuming your null hypothesis is true
- High p value : low surprise. A low p-value makes your default assumption looks stupid.
- Does the evidence we collected make our null hypothesis looks ridiculous?

The above example assumes that we know or is able to assume our population variance, this might not always be true. In that scenario, we use t-test.

As compared to the Z-distribution, the T-distribution is flatter - you want to "play safe" and act upon it. but as n \geq 30 it does not matter that much.

- A Smaller sample size means more sampling error. This sampling error due to small n means a higher probability of extreme sample means. When you only collect 1 sample, the sample could be extreme.
- Given the same \alpha and s, a smaller n will push the critical values further outward in the tail(s) due to the uncertainty associated with small n. - think of this as collecting evidence.
- As sample size increases for t-distribution, it "narrows" down the curve and making it sharper. The larger sample size n leads to a higher degree of freedom df.

For the T-test, we need to calculate Sample variance and degree of freedom.

S^2 = \frac{1}{n-1}\sum_i^n (x_i - \bar{x})^2

df = n-1

```
eg2_n = 20000
eg2_size = 20 # smaller sample size to show the difference
eg2_scale = 0.5
eg2_var = eg2_scale ** 2
eg2_null_mean = eg2_scale
eg2_data = gen_exp(n_samples=eg2_n, size_sample=eg2_size, scale_input=eg2_scale)
alpha = 0.05
null_hypo_mean = 0.5
eg2_means = np.mean(eg2_data, axis=1)
eg2_z_score = ((eg2_means) - null_hypo_mean) / (eg2_var ** 0.5 / eg2_size ** 0.5)
critical_region = st.norm.ppf(alpha / 2)
eg2_z_rejected = (eg2_z_score > -critical_region) | (eg2_z_score < critical_region)
print(sum(eg2_z_rejected) / eg2_n) # still approximately alpha
sample_s_vars = np.var(eg2_data, axis=1, ddof=1) # to calculate sample variance
t_statistics = ((eg2_means) - null_hypo_mean) / (
sample_s_vars ** 0.5 / eg2_size ** 0.5
)
critical_region = st.t.ppf(alpha / 2, df=len(t_statistics) - 1)
eg2_t_rejected = (t_statistics > -critical_region) | (t_statistics < critical_region)
print(sum(eg2_t_rejected) / eg2_n)
"""
0.0464 # % of type 1 error for z-test - still approximately alpha
0.09655 # % of type 1 error for t-test
"""
```

In the earlier example when we look at the rejected means,

```
eg1_means[eg1_rejected][0:10]
"""
eg1_means[eg1_rejected][0:10]
array([0.54408767, 0.54971147, 0.45515848, 0.55697267, 0.54692577,
0.57487153, 0.45142467, 0.55958869, 0.44157481, 0.45038526])
"""
```

While for some cases,

```
eg1_means[~eg1_rejected][0:10]
"""
eg1_means[~eg1_rejected][0:10]
array([0.50777865, 0.50289682, 0.46315811, 0.46417931, 0.47596631,
0.50782371, 0.52971846, 0.51069177, 0.5045483 , 0.50073808])
"""
```

Notice that for values such as `0.463`

which is `0.47`

off but the hypothesis test did not reject the null hypothesis.

You are not comfortable with this \pm 0.05 of error, and you prefer it to be closer to \pm 0.035

In hypothesis testing, this is increasing the confidence level by increasing the sample size, but how much to increase?

In confidence interval, we have

\bar{X} \pm z_{\alpha/2} * \frac{\sigma}{\sqrt{n}}

If we want to have an interval such that:

\bar{X} \pm E

We can set:

E = z_{\alpha/2} * \frac{\sigma}{\sqrt{n}}

E is also known as the Margin of Error - sometimes annotated with m or MOE in literature.

The formal definition:

The margin of error is a statistic expressing the amount of random sampling error in the results of a survey.

To decrease E, we can decrease z which can be done by reducing \alpha, or we can increase our sample size n.

Rearranging the variables,

n = \bigg(\frac{z_{\alpha/2}* \sigma}{E} \bigg)^2

In the earlier example, we set \alpha = 0.05\ z_{\alpha/2} = 1.96\ \sigma=0.5 Suppose desired error is 0.035,

then:

n = (1.96 * 0.5 / 0.035)^2 = 784

```
eg3_n = 10000
eg3_size = 784
eg3_scale = 0.5
eg3_var = eg3_scale ** 2
eg3_null_mean = eg3_scale
eg3_data = gen_exp(n_samples=eg3_n, size_sample=eg3_size, scale_input=eg3_scale)
alpha = 0.05
null_hypo_mean = 0.5
eg3_means = np.mean(eg3_data, axis=1)
eg3_z_score = ((eg3_means) - null_hypo_mean) / (eg3_var ** 0.5 / eg3_size ** 0.5)
critical_region = st.norm.ppf(alpha / 2)
eg3_z_rejected = (eg3_z_score > -critical_region) | (eg3_z_score < critical_region)
print(sum(eg3_z_rejected) / eg3_n)
print(eg3_means[eg3_z_rejected][0:10])
print(eg3_means[~eg3_z_rejected][0:10])
"""
0.0494 # still alpha
[0.46383495 0.53533569 0.53716038 0.5378225 0.5455015 0.45397126
0.5419563 0.44747049 0.45018693 0.54140721] # we see values +- 0.035 getting rejected
[0.50698089 0.46728355 0.47093244 0.5143852 0.51581374 0.50987326
0.49651999 0.52181904 0.50310289 0.48981529] # means that we get accept
"""
```

Now values outside the range of 0.5 \pm 0.035 are rejected.

In type I error, we wrongly rejected the null hypothesis when it is true. There is also another type of error - what if we fail to reject the null hypothesis, when it is actually false?

Let's take a look at an example:

```
eg4_n = 20000
eg4_size = 100
eg4_scale = 0.7
eg4_var = eg4_scale ** 2
eg4_null_mean = eg4_scale
eg4_data = gen_exp(n_samples=eg4_n, size_sample=eg4_size, scale_input=eg4_scale)
alpha = 0.05
eg4_means = np.mean(eg4_data, axis=1)
null_hypo_mean = 0.5
eg4_z_score = ((eg4_means) - null_hypo_mean) / (eg4_var ** 0.5 / eg4_size ** 0.5)
critical_region = st.norm.ppf(alpha / 2)
num_rejected = (eg4_z_score > -critical_region) | (eg4_z_score < critical_region)
print(sum(num_rejected))
print(sum(num_rejected) / eg4_n) # Power
print(1-sum(num_rejected) / eg4_n) # Type 2 error
"""
16274
0.8137 # Power
0.18630000000000002 # Type 2 error
"""
```

In reality, we should have rejected 100% of them, but we only rejected about 81% of them! So we failed to reject 19% when the alternative is true.

(In this case we know the alternative is 0.7, more on that later)

The formal definition:

The second kind of error is the failure to reject a false null hypothesis as the result of a test procedure. This sort of error is called a type II error (false negative) and is also referred to as an error of the second kind.

Type II error is also known as false negative, and the probability is represented \beta.

To calculate the Type II error, we need to assume an alternative hypothesis.

To calculate the probability of type II error, we have to select a \mu_a that satisfies the alternative hypothesis.

There is no "one" type II error. It is different for every \mu_a that satisfies the alternative hypothesis.

Suppose:

H_0 = 0.5, H_a = 0.7

```
null_scale = 0.5 # null hypothesis
true_scale = 0.7
pop_var = true_scale ** 2 # population variance is fixed at null scale
n = 100
alpha = 0.05
z_alpha = -st.norm.ppf(alpha / 2)
left_tail = -z_alpha + (null_scale - true_scale) / (pop_var ** 0.5 / n ** 0.5)
right_tail = z_alpha + (null_scale - true_scale) / (pop_var ** 0.5 / n ** 0.5)
type_2_error = st.norm.cdf(right_tail) - st.norm.cdf(left_tail)
"""
0.18481101024000066
"""
```

Which is approximately same as the value above in example4.

A little bit of mathematics to understand what is going on :

We are trying to calculate Type II error, by definition:

P( \text{accept }H_0 | H_a \text{ is true})

This happens when there is an intersection of of two distributions. To calculate this value, we need to calculate area under the curve where the two distributions overlap.

In a two tailed test:

\begin{align}
& P(|\bar{X}| < \bar{x}_{crit|H_0} | H_a ) \\
& = P( \bar{x}_{left,crit|H_0} < \bar{X} < \bar{x}_{right,crit|H_0} | H_a) \\
& = P( \frac{\bar{x}_{l,c|H_0} - \mu_a}{\sigma/\sqrt{n}} < Z < \frac{\bar{x}_{r,c|H_0}-\mu_a}{\sigma/\sqrt{n}} ) \\
& = P( \frac{\bar{x}_{l,c|H_0} - \mu_0 + \mu_0 - \mu_a}{\sigma/\sqrt{n}} < Z < \frac{\bar{x}_{r,c|H_0}- \mu_0 + \mu_0-\mu_a}{\sigma/\sqrt{n}} )\\
&= P( -z_{\alpha/2}+ \frac{\mu_0-\mu_a}{\sigma/\sqrt{n}}< Z < z_{\alpha/2} + \frac{\mu_0-\mu_a}{\sigma/\sqrt{n}} )
\end{align}

\frac{\mu_1 - \mu_2}{s} is also known as the Cohen's d effect size.

Thus the new left (and right) tail is given by the sum of null hypothesis z_\alpha and the difference between the two means divided by the sample variance.

`left_tail = -z_alpha + (null_scale - true_scale) / (pop_var ** 0.5 / n ** 0.5)`

The formal definition:

The power of a binary hypothesis test is the probability that the test rejects the null hypothesis when a specific alternative hypothesis ) is true — i.e., it indicates the probability of avoiding a type II error.

Thus, by definition,

\text{Power} = 1 - \beta

Example:

Suppose our population mean is 5 and we want to estimate with 95% confidence with 0.5 margin of error.

```
Error_ci = 0.5
p_mean = 5
p_sd = p_mean
alpha = 0.05
z_alpha = -st.norm.ppf(alpha / 2)
sample_size = (critical_region * p_sd / Error_ci) ** 2
print(sample_size)
"""
384.1458820694127
"""
```

Suppose we come up with the alternative hypothesis H_a = 0.55, the z-score would be exactly in the confidence interval boundary.

```
sample_mean = 5.5
z_stats = (sample_mean - p_mean) / (p_sd / sample_size ** 0.5)
print(z_stats)
"""
1.9599639845400547
"""
```

As well as the power (what do you think the power is when H_a = 0.55?):

```
left_tail = -z_alpha + (p_mean - sample_mean) / (p_sd / sample_size ** 0.5)
right_tail = z_alpha + (p_mean - sample_mean) / (p_sd / sample_size ** 0.5)
type_2_error = st.norm.cdf(right_tail) - st.norm.cdf(left_tail)
power = 1 - type_2_error # %%
print(power)
"""
0.5000442877191609
"""
```

Naturally, the power is 0.5! You are equally unsure (at 95% confidence level) whether the null hypothesis or the alternative is true.

- Controlling \alpha is easy, simply choose
- Controlling \beta is abit more complicated
- The goal is to align the \alpha and \beta regions in the \mu_0 and \mu_a distributions respectively
- Since the means \sigma and critical values are set, we can "manipulate" the sample size n to generate a standard error that brings \alpha and \beta into alignment at c
- This new n will create a new \bar{x}_{crit} value for our decision rule.
- Since we are now controlling Type II error, we CAN use the phrases "reject H_0" or "accept H_0.

What if we want to control the Type II error (or the power?). This is known as power analysis. Power analysis is normally conducted before the data collection.

The formal definition:

Power analysis is the procedure that researchers can use to determine if the test contains enough power to make a reasonable conclusion.

Alternatively,

Power is the probability that a test of significance will detect a deviation from the null hypothesis, should such a deviation exist.

There are a few ways:

- Adjusting \alpha - but unlikely to happen
- increasing sample size, in that case, how much to increase?

(Note, there are other ways but usually not feasible (e.g out of your control) such as increasing the effect size, changing it to a one tailed test or decrease random error)

Example:

Suppose we have the following hypothesis, and you want to have a 80% chance that if a that a deviation of 0.5 is present, to be detected.

H_0 = 5, H_a = 5.5

There are online calculators (or libraries) that are available for free. Feel free to try it out with:

- Power 1-\beta = 0.8
- Type I error rate \alpha = 0.05 or 5\%
- True mean \mu_a = 5
- Null mean \mu_0 = 5.5
- s.d \sigma = 5.5

```
z_alpha = st.norm.ppf(0.975)
# Alpha value at right tail since alternative is bigger than null
z_beta = st.norm.ppf(0.8) # Beta power = 0.8
pop_var = p_sd ** 2
mean_null = p_mean
mean_true = 5.5
sample_size = (z_alpha + z_beta) ** 2 * pop_var / (mean_null - mean_true) ** 2
print(sample_size)
"""
784.8879734349089
"""
```

To control our type II error, we basically want to reduce the overlap between the two distributions.

In this case, the intersection point is where the right tail of the null hypothesis equates to the left tail of the hypothesis:

\mu_0 + z_{\alpha/2} \frac{\sigma}{\sqrt{n}}=\mu_a - z_\beta\frac{\sigma}{\sqrt{n}}

Solving for n gives us:

n = \frac{(z_a+z_\beta)^2 \sigma^2}{(\mu_0 - \mu_a)^2}

To verify our new sample size,

```
Error_ci = 0.5
p_mean = 5
p_sd = p_mean
alpha = 0.05
z_alpha = -st.norm.ppf(alpha / 2)
# our new sample size
sample_size = (z_alpha + z_beta) ** 2 * pop_var / (mean_null - mean_true) ** 2
sample_mean = 5.5
print(sample_size)
z_stats = (sample_mean - p_mean) / (p_sd / sample_size ** 0.5)
print(z_stats)
left_tail = -z_alpha + (p_mean - sample_mean) / (p_sd / sample_size ** 0.5)
right_tail = z_alpha + (p_mean - sample_mean) / (p_sd / sample_size ** 0.5)
type_2_error = st.norm.cdf(right_tail) - st.norm.cdf(left_tail)
power = 1 - type_2_error # %%
print(power)
"""
2.801585218112969
0.8000009605622265
"""
```

As we are collecting more data which resulted in a larger sample size, we are now "more" certain that the null hypothesis should be rejected.

Alternatively, we can also `decrease the error`

.

In the earlier example, when the alternative hypothesis H_a = 0.55 with 95% confidence interval, 0.05 margin of error, the power \beta = 0.5. To increase the power by 0.3, we can decrease the margin of error by 30%.

```
# Increase in power = 0.3
# 0.5 * (1- increase in power ) = 0.5 * 0.7 = 0.35
Error_ci = 0.35
p_mean = 5
p_sd = p_mean
alpha = 0.05
z_alpha = -st.norm.ppf(alpha / 2)
sample_size = (critical_region * p_sd / Error_ci) ** 2
print(sample_size)
sample_mean = 5.5
z_stats = (sample_mean - p_mean) / (p_sd / sample_size ** 0.5)
print(z_stats)
left_tail = -z_alpha + (p_mean - sample_mean) / (p_sd / sample_size ** 0.5)
right_tail = z_alpha + (p_mean - sample_mean) / (p_sd / sample_size ** 0.5)
type_2_error = st.norm.cdf(right_tail) - st.norm.cdf(left_tail)
power = 1 - type_2_error # %%
print(power)
"""
783.9711878967609 # Sample size required
2.7999485493429357 # z-score
0.7995424479338836 # power
"""
```

As expected, the result is similar. Thus, with a power of 80% to detect an difference of 0.5, any deviation from 5 \pm 0.35 will result in the rejection of the null hypothesis and conclude that the evidence is supportive of the alternative hypothesis.

How much to adjust alpha is not straight forward, and is usually done by plotting power curves.

For example, to get a power of 0.8 in the same example, we increase \alpha from 0.05 to 0.266.

```
Error_ci = 0.5
p_mean = 5
p_sd = p_mean
alpha = 0.266 # increase from 0.05 to 0.25
z_alpha = -st.norm.ppf(alpha / 2)
sample_size = (critical_region * p_sd / Error_ci) ** 2
print(sample_size)
sample_mean = 5.5
z_stats = (sample_mean - p_mean) / (p_sd / sample_size ** 0.5)
print(z_stats)
left_tail = -z_alpha + (p_mean - sample_mean) / (p_sd / sample_size ** 0.5)
right_tail = z_alpha + (p_mean - sample_mean) / (p_sd / sample_size ** 0.5)
type_2_error = st.norm.cdf(right_tail) - st.norm.cdf(left_tail)
power = 1 - type_2_error # %%
print(power)
"""
384.1458820694127
1.9599639845400547
0.8027436164758492
"""
```

The following table gives a summary of the 4 scenarios of hypothesis testing.

Null is true | Null is false | |
---|---|---|

Decision about null Don't reject |
Correct Inference (true negative = 1-\alpha) significance level |
Type II error (False negative = \beta) |

Decision about null Reject |
Type I error (false positive = \alpha) |
correct Inference (true positive = 1-\beta) power |