Python – Statistics – Experiments and Significance Testing

Statistical Experiments and Significance Testing

Confidence Intervals

Intro to sampling

Concept of sampling – a sample is a collection of data from a certain population that is meant to represent the whole. It will usually make up only a small portion of the total. The idea is that we can make conclusions about the sample and generalize it to a broader group.

What is a confidence interval

A confidence interval is a range of values that we are fairly sure includes the true value of an unknown population parameter. It has an associated confidence level that represents the frequency in which the interval will contain this value. So we have a 95% confidence interval this means that 95 times out of 100 we can expect our interval to hold the true parameter value of the population.

Calculating confidence intervals

For means , you take the sample mean then add and subtract the appropriate z-score for your confidence level with the population standard deviation over the square root of the number of samples. This takes a slightly different form if you don’t know the population variance

For proportions, similarly, you take the mean plus minus the z score times the square root of the sample proportion times its inverse, over the number of samples.

Both of these formulas are alike in the sense that they take the mean plus minus some value that we compute. This value is referred to as the margin of error. Adding it to the mean gives up the upper threshold of our interval, whereas subtracting it from the mean gives us the lower threshold

Example: means

sem –> standard error compute function
In this scenario, our sample of 10, 11, 12, 13 gives us a 95 percent confidence interval of (9.446, 13.554) meaning that 95 times out of 100 the true mean should fall in this range

Example: proportions

We can pass the proportion_confint function the number of successes, number of trials and the alpha value represented by 1 minus our confidence level. Here we can see a 95 percent confidence interval for 4 successes out of 10 trials.

Exercise confidence interval by hand


exercise confidence intervals

In this exercise a binomial sample of number of heads in 50 fair coin flips –> heads.

You might see at least one confidence interval that does not contain 0.5, the true population proportion for a fair coin flip. You could decrease the likelihood of this happening by increasing your confidence level or lowering the alpha value.

Hypothesis Test

We’ll go over the logistics of running a test for both means and proportions

Quick review

Hypothesis testing is really just a means of coming to some statistical inference. This is to say that we want to look at the distribution of our data and come to some conclusion about something that we think may or may not be true


Before we run a hypothesis test , there are a couple of assumptions that we need to check. Our assumptions include that :

    • the sample must be taken randomly,
    • each observation must be independent, and
    • the sample data must be normally distributed around the sample mean which will naturally occur in sufficiently large samples due to the Central Limit Theorem.
    • Lastly the variance between the sample and the population must be constant

Generating hypotheses

After checking the assumptions, we need to generate both our null and alternate hypotheses before we can run our test. The null hypothesis represents the treatment not effecting the outcome in any way. The alternate hypothesis on the other hand represents the outcome that the treatment does have a conclusive effect. As we can see the null hypothesis (H0) and the alternate(H1) change depending on the type of test. However the consistent theme is that we are taking the sample estimate and comparing it to the expected value from our control.

Which test to use ?

Focus on the two most common hypothesis tests: z-tests and t-tests. The test that you use depends on the situation. If you know the population standard deviation and you have a sufficient sample size, you will probably want a z-test, otherwise break out a t-test. In python –> proportions_ztest and ttest_ind functions .

Evaluating results

When you run the test, your result will be generated in the form of a test statistic, either a z score or t statistic. Using this, you can compute the p-value, which represents the probability of obtaining the sample results you got, given that the null hypothesis is true. It’s intuitive that if your p-value is small enough, falling in yellow here that you can reject the null. We use the significance level to determine how large of an effect you need to reject the null hypothesis, or how certain you need to be. A common alpha value is 0.05, which represents 95 % confidence in your test.

Types of errors

When you get the outcome, there will always be a probability of obtaining false results; this is what your significance level and power are for. There are two types of errors that you can get. See the confusion matrix , with the predictions on the y-axis.

    • Type I errors or false positives, occur when you incorrectly reject a true null hypothesis.
    • Type II errors or false negatives occur when you accept a null hypotheses when an effect really exists. This means that we predicted no effect when there really was an effect.

confusion matrix

Exercise : 1-tailed z-test

data :

You see that our test gave us a resulting p-value of .009 which falls under our alpha value of .05, so we can conclude that there is an effect and, therefore, we reject the null hypothesis. It looks like the change actually did have a noticeable positive effect on conversion rate!

Exercise : Two tailed t-test

In this exercise, you’ll tackle another type of hypothesis test with the two tailed t-test for means. More concretely, you’ll run the test on our laptops dataset from before and try to identify a significant difference in price between Asus and Toshiba.

With a p-value of .133, we cannot reject the null hypothesis! There’s not enough evidence here to conclude that Toshiba laptops are significantly more expensive than Asus. With that being said, .133 is fairly close to reasonable significance so we may want to run another test or examine this further. Interviewers won’t hesitate to throw you tricky situations like this to see how you handle them

Power and sample size

Power analysis

When running an experiment, how do you decide how long it should run OR how many observations are needed per group ? This question is relevant because it’s normally advised that you decide on a sample size before you start an experiment. Despite what you may read in many guides to A/B testing, there is no good general guidance here (as usual) the answer : it depends. In practice, the approach to use this problem is referred as power analysis.

Moving parts

  1. First you need to know the minimum size of the effect that you want to detect in a test, example : 20 percent improvement.
  2. Second is the significance level at which the test will be conducted, commonly known as alpha value.
  3. Lastly power is the probability of detecting an effect. If we see something interesting, we want to make sure we have enough power to conclude with high probability that the result is statistically significant.
  4. If we change 1+ of these parameters the needed sample size changes.

More power, smaller significance level or detecting a smaller effect all lead to a larger sample size. The python plot_power function does a good job visualizing this phenomenon

Calculating sample size

module : statsmodel.stats.power

    • zt_ind_solve_power() ,
    • tt_ind_solve_power()
    • One preliminary step must be taken; the power functions above require standardized minimum effect difference. T get this we can use the proportion_effectsize by inputting our baseline and desired minimum conversion rates

Example : conversion rates:

We’ll set power to 80 %, significance at 5 % and minimum effect size at 5 % as well. We compute the standard effect size and once we run we get our desired sample of +- 1091 impressions.

Changing input values :

We require 1807 observations since power and sample size are inversely related.

Exercise : Calculating sample size

Let’s finish up our dive into statistical tests by performing power analysis to generate needed sample size. Power analysis involves four moving parts: Sample size,Effect size,Minimum effect, Power
In this exercise, you’re working with a website and want to test for a difference in conversion rate. Before you begin the experiment, you must decide how many samples you’ll need per variant using 5% significance and 95% power.

Notice how lowering the power allowed you fewer observations in your sample, yet increased your chance of a Type II error. Remember that doing these calculations by hand is quite difficult, so you may be asked to show or explain these trade offs with white boarding rather than programming

Exercise: Visualizing the relationship

Now that we’ve gone over the effect on certain errors and calculated the necessary sample size for different power values, let’s take a step back and look at the relationship between power and sample size with a useful plot.

In this exercise, we’ll switch gears and look at a t-test rather than a z-test. In order to visualize this, use the plot_power() function that shows sample size on the x-axis with power on the y-axis and different lines representing different minimum effect sizes.

Notice that not only does an increase in power result in a larger sample size, but this increase grows exponentially as the minimum effect size is increased. Once again, power analysis can get confusing with all of these interconnected moving part

Multiple testing

Multiple comparisons problem

When running a typical hypothesis test with the significance level set to .05 there is a 5 percent chance that you’ll make a type I error and detect an effect that doesn’t exist. This is a risk that we are normally willing to take. The multiple comparisons problem arises when you run several sequential hypothesis tests. Some quick math explains this phenomenon quite easily. Since each test is independent, you can multiply the probability of each type I error to get our combined probability of an error. For instance , if we test linkage of 20 different colors of jelly beans to acne with 5% significance, there’s around 65 percent chance of at least one error; in this case it was the green jelly bean that were linked to acne.

Correcting for multiple comparisons

When you run multiple tests, the p-values have to be adjusted for the number of hypothesis tests you are running to control the type I error rate discussed earlier.

Common approaches

There isn’t a universally accepted way to control for the problem of multiple testing, but there a few common ones… :

    • Bonferroni correction,
    • Sidak correction,
    • step-based procedures,
    • Tukey’s procedure,
    • Dunnet’s correction.

Bonferroni correction.

The most conservative correction = most straightforward. is by dividing the alpha level (significance level) by number of tests. If we have had a significance level of .O5 and wanted to run 10 tests, our corrected p-value would come out to .005 for each test.


Side effects

With many tests, the corrected significance level will be come very very small . This reduces power which means you increasingly unlikely to detect a true effect when it occurs.

Exercise Calculating error rates

the probability of encountering an error is still extremely high. This is where the Bonferroni correction comes in. While a bit conservative, it controls the family-wise error rate for circumstances like these to avoid the high probability of a Type I error. 

Exercise : Bonferroni correction

Let’s implement multiple hypothesis tests using the Bonferroni correction approach that we discussed in the slides. You’ll use the imported multipletests() function in order to achieve this. Use a single-test significance level of .05 and observe how the Bonferroni correction affects our sample list of p-values already created.

As you can see, the Bonferroni correction did it’s job and corrected the family-wise error rate for our 5 hypothesis test results. In the end, only one of the tests remained significant. If you’re interested, check out some of the other methods,