Python – Statistics – Probability & Sample Distribution

Probability & Sample Distribution

Conditional Probabilities

You’re testing for a disease and advertising that the test is 99% accurate:

    • if you have the disease, you will test positive 99% of the time,
    • if you don’t have the disease, you will test negative 99% of the time.

Let’s say that 1% of all people have the disease and someone tests positive.

What’s the probability that the person has the disease? Select the correct set up for this problem.

# P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
# where P(positive) = P(positive | disease) * P(disease)
#                   + P(positive | no disease) * P(no disease)
print(.99 * .01 / ((.99 * .01) + (.01 * .99)))
> 0.5

Bayes’ theorem applied

You have two coins in your hand. Out of the two coins, one is a real coin (heads and tails) and the other is a faulty coin with tails on both sides.

You are blindfolded and forced to choose a random coin and then toss it in the air. The coin lands with tails facing upwards. Find the probability that this is the faulty coin.

# Print the probability of the coin landing tails --> P(tails)
# Print the probability of the coin being faulty. --> P(faulty)
# Print the probability of the coin being faulty and landing tails. --> P(tails and faulty)
# Print and solve for the probability that the coin is faulty, given it came down on tails.
# Print P(faulty | tails)
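A minimal sketch of the calculation, assuming each coin is equally likely to be picked:

```python
# The real coin lands tails half the time; the faulty coin always lands tails.
P_faulty = 0.5
P_tails = 0.5 * 0.5 + 0.5 * 1.0         # real-coin tails + faulty-coin tails
P_tails_and_faulty = 0.5 * 1.0          # picked the faulty coin AND it landed tails

# Bayes' theorem: P(faulty | tails) = P(tails and faulty) / P(tails)
P_faulty_given_tails = P_tails_and_faulty / P_tails

print(P_tails)                # 0.75
print(P_faulty)               # 0.5
print(P_tails_and_faulty)     # 0.5
print(P_faulty_given_tails)   # 0.666...
```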


Central Limit Theorem

With a large enough collection of samples from the same population, the sample means will be normally distributed.
Note that this doesn’t make any assumptions about the underlying distribution of the data; with a reasonably large sample of roughly 30 or more, the theorem holds no matter what the population looks like.
The central limit theorem matters because it promises that our distribution of sample means will be normal, so we can perform hypothesis tests.
More concretely, we can assess the likelihood that a given mean came from a particular distribution and then, based on this, reject or fail to reject our hypothesis. This empowers all of the A/B testing you see in practice.
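A sketch of how that assessment works in practice (the observed sample mean of 4.1 here is a made-up value): by the CLT, the mean of 30 fair die rolls is approximately normal with mean 3.5 and standard error sd/sqrt(n), so an observed mean converts into a z-score and a p-value.

```python
from math import sqrt
from scipy.stats import norm

# Fair-die population: mean 3.5, variance 35/12
pop_mean = 3.5
pop_sd = sqrt(35 / 12)
n = 30
sample_mean = 4.1                     # hypothetical observed sample mean

se = pop_sd / sqrt(n)                 # standard error of the mean (CLT)
z = (sample_mean - pop_mean) / se     # how many standard errors away we are
p_value = 2 * (1 - norm.cdf(abs(z)))  # two-sided p-value
print(z, p_value)
```

A small p-value would lead us to reject the hypothesis that the rolls came from a fair die.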


Law of Large Numbers

The law of large numbers states that as the size of a sample increases, the estimate of the sample mean will more accurately reflect the population mean.


Handy tool –> list comprehension

# code with for loop
x = [1, 2, 3, 4]
out = []
for item in x:
    out.append(item ** 2)
print(out)
> [1, 4, 9, 16]

# list comprehension code:
out = [item**2 for item in x]
print(out)
> [1, 4, 9, 16]


Samples of a rolled die

from numpy.random import randint

# Create a sample of 10 die rolls
small = randint(1, 7, size = 10)

# Calculate and print the mean of the sample
small_mean = small.mean()
print(small_mean)
> 3.4

# Create a sample of 1000 die rolls
large = randint(1, 7, size = 1000)

# Calculate and print the mean of the large sample
large_mean = large.mean()
print(large_mean)
> 3.562

The law of large numbers is shown here: the more rolls in our sample, the closer the sample mean gets to the expected value of 3.5. It’s important to distinguish between the law of large numbers and the central limit theorem in interviews.

Simulating the central limit theorem

from numpy.random import randint

# Create a list of 1000 sample means of size 30
means = [randint(1, 7, 30).mean() for i in range(1000)]

# Create and show a histogram of the means
import matplotlib.pyplot as plt
plt.hist(means)
plt.show()


from numpy.random import randint

# Adapt code for 100 samples of size 30
means = [randint(1, 7, 30).mean() for i in range(100)]

# Create and show a histogram of the means
import matplotlib.pyplot as plt
plt.hist(means)
plt.show()


Whether we took 100 or 1000 sample means, the distribution was still approximately normal. This will always be the case when we have a large enough sample (typically above 30). That’s the central limit theorem at work.


Probability Distributions

What is it?

  • describes the likelihood of an outcome
  • the probabilities must all add up to 1 and the distribution can be
    • discrete (like the roll of a die)
    • continuous (like the amount of rainfall, an example of a continuous probability distribution where the total area under the curve adds up to 1)

  • Most common:
    • Bernoulli
    • Binomial
    • Poisson
    • Normal
      –> use rvs in scipy to simulate all of these distributions, and visualize with matplotlib

Bernoulli distribution

Discrete distribution that models the probability of two outcomes, such as a coin flip.

plt.hist(bernoulli.rvs(p=0.5, size=1000))

Both heads and tails have the same probability of 0.5, so the values are even in this sample.
Since there are only 2 possible outcomes in a Bernoulli trial, the probability of one is always 1 minus the probability of the other.
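That complement relation can be checked directly with scipy’s pmf (the probability 0.3 here is an arbitrary example value):

```python
from scipy.stats import bernoulli

# P(success) = 0.3, so P(failure) must be 1 - 0.3 = 0.7
p = 0.3
print(bernoulli.pmf(1, p))   # probability of a success
print(bernoulli.pmf(0, p))   # probability of a failure
```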


# Generate bernoulli data
from scipy.stats import bernoulli
data = bernoulli.rvs(p=0.5, size=1000)

# Plot distribution
import matplotlib.pyplot as plt
plt.hist(data)
plt.show()



Binomial distribution

The sum of the outcomes of multiple Bernoulli trials, meaning trials that have an established success and failure.

plt.hist(binom.rvs(2, 0.5, size = 10000))

–> Result of a sample representing the number of heads in two consecutive coin flips using a fair coin, taking the form of a binomial distribution

It’s used to model the number of successful outcomes in trials where there is some consistent probability of success.
These parameters are often referred to as:
k – number of successes
n – number of trials
p – probability of success

For this exercise, consider a game where you are trying to make a ball in a basket. You are given 10 shots and you know that you have an 80% chance of making a given shot. To simplify things, assume each shot is an independent event.

# Generate binomial data
from scipy.stats import binom
data = binom.rvs(n=10, p=0.8, size=1000)

# Plot the distribution
import matplotlib.pyplot as plt
plt.hist(data)
plt.show()

# Assign and print probability of 8 or less successes
prob1 = binom.cdf(k=8, n=10, p=0.8)
print(prob1)
> 0.6241903616

# Assign and print probability of all 10 successes
prob2 = binom.pmf(k=10, n=10, p=0.8)
print(prob2)
> 0.1073741824


Normal distribution

The normal distribution is a bell-curve shaped continuous probability distribution.

The overlay serves as a reminder of the 68-95-99.7 rule:
68% falls within 1 standard deviation
95% falls within 2 standard deviations
99.7% falls within 3 standard deviations
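The coverage numbers in that rule can be verified with norm.cdf:

```python
from scipy.stats import norm

# Probability mass within k standard deviations of the mean
for k in [1, 2, 3]:
    coverage = norm.cdf(k) - norm.cdf(-k)
    print(k, round(coverage, 4))   # 1 0.6827 / 2 0.9545 / 3 0.9973
```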

# Generate normal data
from scipy.stats import norm
data = norm.rvs(size=1000)

# Plot distribution
import matplotlib.pyplot as plt
plt.hist(data)
plt.show()

# Compute and print true probability for greater than 2
true_prob = 1 - norm.cdf(2)
print(true_prob)

# Compute and print sample probability for greater than 2
sample_prob = sum(obs > 2 for obs in data) / len(data)
print(sample_prob)



Poisson distribution

Like the binomial distribution, the Poisson distribution represents a count, or the number of times something happened.

It’s calculated not from a probability p and a number of trials n, but from an average rate, denoted by lambda.
As the rate of events changes, the distribution changes as well.
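A quick sketch of that last point, simulating two Poisson distributions with different rates (the lambda values are chosen arbitrarily):

```python
from scipy.stats import poisson

# Simulate 1000 counts at a low rate and at a high rate
low_rate = poisson.rvs(mu=1, size=1000)     # on average 1 event per interval
high_rate = poisson.rvs(mu=10, size=1000)   # on average 10 events per interval

# The sample means land near each lambda
print(low_rate.mean(), high_rate.mean())
```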

In a 15-minute interval, there is a 20% probability that you will see at least one shooting star. What is the probability that you see at least one shooting star in the period of 1 hour?
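A worked sketch of this puzzle: the hour splits into four independent 15-minute intervals, and it’s easiest to work with the complement, i.e. seeing no star at all.

```python
# P(at least one star in 15 min) = 0.2, so P(none in 15 min) = 0.8.
# Four independent 15-minute intervals make up an hour:
p_none_hour = 0.8 ** 4                  # P(no star in 60 min)
p_at_least_one_hour = 1 - p_none_hour
print(round(p_at_least_one_hour, 4))    # 0.5904
```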