Python – Statistics – Probability & Sample Distribution

Probability & Sample Distribution

Conditional Probabilities

You’re testing for a disease and advertising that the test is 99% accurate:

    • if you have the disease, you will test positive 99% of the time,
    • if you don’t have the disease, you will test negative 99% of the time.

Let’s say that 1% of all people have the disease and someone tests positive.

What’s the probability that the person has the disease? Select the correct set up for this problem.

# P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
# where P(positive) = P(positive | disease) * P(disease)
#                   + P(positive | no disease) * P(no disease)
print(.99 * .01 / ((.99 * .01) + (.01 * .99)))
> 0.5

Bayes’ theorem applied

You have two coins in your hand. Out of the two coins, one is a real coin (heads and tails) and the other is a faulty coin with tails on both sides.

You are blindfolded and forced to choose a random coin and then toss it in the air. The coin lands with tails facing upwards. Find the probability that this is the faulty coin.

# Print the probability of the coin landing tails --> P(tails)
# Print the probability of the coin being faulty. --> P(faulty)
# Print the probability of the coin being faulty and landing tails. --> P(tails and faulty)
# Print and solve for the probability that the coin is faulty, given it came down on tails.
# Print P(faulty | tails)
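A minimal sketch of the calculation, assuming each coin is equally likely to be picked:

```python
# The real coin lands tails half the time; the faulty coin always lands tails.
P_faulty = 0.5
P_tails = 0.5 * 0.5 + 0.5 * 1.0         # real-coin tails + faulty-coin tails
P_tails_and_faulty = 0.5 * 1.0          # picked the faulty coin AND it landed tails

# Bayes' theorem: P(faulty | tails) = P(tails and faulty) / P(tails)
P_faulty_given_tails = P_tails_and_faulty / P_tails

print(P_tails)                # 0.75
print(P_faulty)               # 0.5
print(P_tails_and_faulty)     # 0.5
print(P_faulty_given_tails)   # 0.666...
```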


Central Limit Theorem

With a large enough collection of samples from the same population, the sample means will be normally distributed.
Note that this doesn’t make any assumptions about the underlying distribution of the data; with a reasonably large sample of roughly 30 or more, the theorem holds no matter what the population looks like.
The central limit theorem matters because it promises that our distribution of sample means will be normal, so we can perform hypothesis tests.
More concretely, we can assess the likelihood that a given mean came from a particular distribution and then, based on this, reject or fail to reject our hypothesis. This empowers all of the A/B testing you see in practice.
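A sketch of how that assessment works in practice (the observed sample mean of 4.1 here is a made-up value): by the CLT, the mean of 30 fair die rolls is approximately normal with mean 3.5 and standard error sd/sqrt(n), so an observed mean converts into a z-score and a p-value.

```python
from math import sqrt
from scipy.stats import norm

# Fair-die population: mean 3.5, variance 35/12
pop_mean = 3.5
pop_sd = sqrt(35 / 12)
n = 30
sample_mean = 4.1                     # hypothetical observed sample mean

se = pop_sd / sqrt(n)                 # standard error of the mean (CLT)
z = (sample_mean - pop_mean) / se     # how many standard errors away we are
p_value = 2 * (1 - norm.cdf(abs(z)))  # two-sided p-value
print(z, p_value)
```

A small p-value would lead us to reject the hypothesis that the rolls came from a fair die.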


Law of Large Numbers

The law of large numbers states that as the size of a sample increases, the estimate of the sample mean will more accurately reflect the population mean.


Handy tool –> list comprehension

# code with for loop
x = [1, 2, 3, 4]
out = []
for item in x:
    out.append(item ** 2)
print(out)
> [1, 4, 9, 16]

# list comprehension code:
out = [item**2 for item in x]
print(out)
> [1, 4, 9, 16]


Samples of a rolled die

from numpy.random import randint

# Create a sample of 10 die rolls
small = randint(1, 7, size = 10)

# Calculate and print the mean of the sample
small_mean = small.mean()
print(small_mean)
> 3.4

# Create a sample of 1000 die rolls
large = randint(1, 7, size = 1000)

# Calculate and print the mean of the large sample
large_mean = large.mean()
print(large_mean)
> 3.562

The law of large numbers is shown here: the more rolls in our sample, the closer the sample mean gets to the expected value of 3.5. It’s important to distinguish between the law of large numbers and the central limit theorem in interviews.

Simulating the central limit theorem

from numpy.random import randint

# Create a list of 1000 sample means of size 30
means = [randint(1, 7, 30).mean() for i in range(1000)]

# Create and show a histogram of the means
import matplotlib.pyplot as plt
plt.hist(means)
plt.show()


from numpy.random import randint

# Adapt code for 100 samples of size 30
means = [randint(1, 7, 30).mean() for i in range(100)]

# Create and show a histogram of the means
import matplotlib.pyplot as plt
plt.hist(means)
plt.show()


Whether we took 100 or 1000 sample means, the distribution was still approximately normal. This will always be the case when we have a large enough sample (typically above 30). That’s the central limit theorem at work.


Probability Distributions

What is it?

  • describes the likelihood of an outcome
  • the probabilities must all add up to 1 and the distribution can be
    • discrete (like the roll of a die)
    • continuous (like the amount of rainfall, an example of a continuous probability distribution where the total area under the curve adds up to 1)

  • Most common:
    • Bernoulli
    • Binomial
    • Poisson
    • Normal
      –> use rvs in scipy to simulate all of these distributions, and visualize with matplotlib

Bernoulli distribution

Discrete distribution that models the probability of two outcomes, such as a coin flip.

plt.hist(bernoulli.rvs(p=0.5, size=1000))

Both heads and tails have the same probability of 0.5, so the values are even in this sample.
Since there are only 2 possible outcomes in a Bernoulli trial, the probability of one is always 1 minus the probability of the other.
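That complement relation can be checked directly with scipy’s pmf (the probability 0.3 here is an arbitrary example value):

```python
from scipy.stats import bernoulli

# P(success) = 0.3, so P(failure) must be 1 - 0.3 = 0.7
p = 0.3
print(bernoulli.pmf(1, p))   # probability of a success
print(bernoulli.pmf(0, p))   # probability of a failure
```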


# Generate bernoulli data
from scipy.stats import bernoulli
data = bernoulli.rvs(p=0.5, size=1000)

# Plot distribution
import matplotlib.pyplot as plt
plt.hist(data)
plt.show()



Binomial distribution

The sum of the outcomes of multiple Bernoulli trials, meaning trials that have an established success and failure.

plt.hist(binom.rvs(2, 0.5, size = 10000))

–> Result of a sample representing the number of heads in two consecutive coin flips using a fair coin, taking the form of a binomial distribution

It’s used to model the number of successful outcomes in trials where there is some consistent probability of success.
These parameters are often referred to as:
k – number of successes
n – number of trials
p – probability of success

For this exercise, consider a game where you are trying to make a ball in a basket. You are given 10 shots and you know that you have an 80% chance of making a given shot. To simplify things, assume each shot is an independent event.

# Generate binomial data
from scipy.stats import binom
data = binom.rvs(n=10, p=0.8, size=1000)

# Plot the distribution
import matplotlib.pyplot as plt
plt.hist(data)
plt.show()

# Assign and print probability of 8 or less successes
prob1 = binom.cdf(k=8, n=10, p=0.8)
print(prob1)
> 0.6241903616

# Assign and print probability of all 10 successes
prob2 = binom.pmf(k=10, n=10, p=0.8)
print(prob2)
> 0.1073741824


Normal distribution

The normal distribution is a bell-curve shaped continuous probability distribution.

The overlay serves as a reminder of the 68-95-99.7 rule:
68% falls within 1 standard deviation
95% falls within 2 standard deviations
99.7% falls within 3 standard deviations
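The coverage numbers in that rule can be verified with norm.cdf:

```python
from scipy.stats import norm

# Probability mass within k standard deviations of the mean
for k in [1, 2, 3]:
    coverage = norm.cdf(k) - norm.cdf(-k)
    print(k, round(coverage, 4))   # 1 0.6827 / 2 0.9545 / 3 0.9973
```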

# Generate normal data
from scipy.stats import norm
data = norm.rvs(size=1000)

# Plot distribution
import matplotlib.pyplot as plt
plt.hist(data)
plt.show()

# Compute and print true probability for greater than 2
true_prob = 1 - norm.cdf(2)
print(true_prob)

# Compute and print sample probability for greater than 2
sample_prob = sum(obs > 2 for obs in data) / len(data)
print(sample_prob)



Poisson distribution

Like the binomial distribution, the Poisson distribution represents a count, or the number of times something happened.

It’s calculated not from a probability p and a number of trials n, but from an average rate, denoted by lambda.
As the rate of events changes, the distribution changes as well.
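A quick sketch of that last point, simulating two Poisson distributions with different rates (the lambda values are chosen arbitrarily):

```python
from scipy.stats import poisson

# Simulate 1000 counts at a low rate and at a high rate
low_rate = poisson.rvs(mu=1, size=1000)     # on average 1 event per interval
high_rate = poisson.rvs(mu=10, size=1000)   # on average 10 events per interval

# The sample means land near each lambda
print(low_rate.mean(), high_rate.mean())
```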

In a 15-minute interval, there is a 20% probability that you will see at least one shooting star. What is the probability that you see at least one shooting star in the period of 1 hour?
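A worked sketch of this puzzle: the hour splits into four independent 15-minute intervals, and it’s easiest to work with the complement, i.e. seeing no star at all.

```python
# P(at least one star in 15 min) = 0.2, so P(none in 15 min) = 0.8.
# Four independent 15-minute intervals make up an hour:
p_none_hour = 0.8 ** 4                  # P(no star in 60 min)
p_at_least_one_hour = 1 - p_none_hour
print(round(p_at_least_one_hour, 4))    # 0.5904
```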