Python – Statistics – EDA

Exploratory Data Analysis

  • Descriptive Statistics
  • Working with categorical data
  • Relationship between two variables

Descriptive Statistics

  • What it sounds like
  • Helps you describe the data with numerical calculations or plots
  • Many types – we focus on the most common ones:
    • measures of centrality
    • measures of variability

Measures of centrality

  • Mean
    • The mean is the average, which is the sum divided by the number of observations
  • Median
    • The median is the middle value when all the observations are sorted
  • Mode
    • The mode is the most common observation (or the peak of the distribution)

If the distribution is perfectly normal (symmetric), the mean, median and mode are all the same.
If the distribution is skewed, the values differ, as the quick sketch below illustrates.
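As a quick illustration (a hypothetical right-skewed sample; assuming NumPy and pandas are available), the mean gets pulled toward the long tail while the median stays near the bulk of the data:

import numpy as np
import pandas as pd

# Hypothetical right-skewed sample: many small values, a few large ones
sample = pd.Series(np.random.default_rng(0).exponential(scale=10, size=1000))

print('Mean:  ', sample.mean())                   # pulled up by the long right tail
print('Median:', sample.median())                 # stays near the bulk of the data
print('Mode:  ', sample.round().mode().iloc[0])   # most common (rounded) value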

Measures of variability

  • Variance
  • Standard deviation
  • Range

These are used to describe how spread out the data is (see the sketch below).
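A minimal sketch of these three measures on a hypothetical NumPy array (note that np.var and np.std default to the population versions, ddof=0):

import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

print('Variance:', np.var(data))              # average squared deviation from the mean
print('Std dev:', np.std(data))               # square root of the variance
print('Range:', data.max() - data.min())      # maximum minus minimum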

Modality

The modality of a distribution is determined by the number of peaks it contains.
Most commonly there is a single peak (unimodal), but a distribution can also have two peaks (bimodal), as in the sketch below.
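A small sketch of what a bimodal distribution looks like (hypothetical data, assuming NumPy and matplotlib are imported):

import numpy as np
import matplotlib.pyplot as plt

# Mix two normal distributions to create a bimodal sample
rng = np.random.default_rng(0)
bimodal = np.concatenate([rng.normal(0, 1, 500), rng.normal(5, 1, 500)])

plt.hist(bimodal, bins=30)   # the histogram shows two distinct peaks
plt.show()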

Skewness

Another concept is skewness, which is a measure of the asymmetry of the distribution.
It also shows up in the centrality measures: when a distribution is asymmetric, with most of the data piled up on the right and a long tail stretching to the left, the distribution is skewed left, and the mean gets pulled toward the tail, below the median.
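As a quick sketch, pandas exposes a skewness coefficient directly (here using the weather DataFrame from the exercise below; a negative value indicates a left skew, a positive value a right skew):

# Skewness coefficient: negative -> skewed left, positive -> skewed right
print(weather['Temp3pm'].skew())
print(weather['Temp9am'].skew())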

 

 

Exercise – Mean or Median

# visualize the distribution
plt.hist(weather['Temp3pm'])
plt.show()

# Assign the mean to the variable and print the result
mean = weather['Temp3pm'].mean()
print('Mean:', mean)

# Assign the median to the variable and print the result
median = weather['Temp3pm'].median()
print('Median:', median)

Mean: 22.873684210526317
Median: 23.1

 

# Visualize the distribution
plt.hist(weather['Temp9am'])
plt.show()

# Assign the mean to the variable and print the result
mean = weather['Temp9am'].mean()
print('Mean:', mean)

# Assign the median to the variable and print the result
median = weather['Temp9am'].median()
print('Median:', median)

Mean: 16.989473684210527
Median: 16.2

 

The distribution for Temp3pm was skewed left, while the distribution for Temp9am was skewed right.

For this reason, the median was higher than the mean for Temp3pm, but for Temp9am the mean is higher than the median, because the skewness points the other way.

While neither column is strongly skewed, it is enough that we should consider using the median instead of the mean, since the median is the more robust metric.
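A small sketch of that robustness (a hypothetical temperature list with one outlier appended; the outlier drags the mean noticeably while the median barely moves):

import pandas as pd

temps = pd.Series([21, 22, 23, 23, 24, 25])
temps_with_outlier = pd.concat([temps, pd.Series([45])], ignore_index=True)

# Compare mean and median before and after adding the outlier
print(temps.mean(), temps.median())
print(temps_with_outlier.mean(), temps_with_outlier.median())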

Plotting a histogram

# Import plotting modules
import matplotlib.pyplot as plt
import seaborn as sns

# Set default Seaborn style
sns.set()

# Plot histogram of versicolor petal lengths
_ = plt.hist(versicolor_petal_length)

# Label axes
_ = plt.ylabel("count")
_ = plt.xlabel("petal length (cm)")

# Show histogram
plt.show()

Adjusting the number of bins in a histogram

# Import numpy
import numpy as np

# Compute number of data points: n_data
n_data = len(versicolor_petal_length)

# Number of bins is the square root of number of data points: n_bins
n_bins = (np.sqrt(n_data))

# Convert number of bins to integer: n_bins
n_bins = int(n_bins)

# Plot the histogram
_ = plt.hist(versicolor_petal_length, bins = n_bins)

# Label axes
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('count')

# Show histogram
plt.show()

Standard deviation by hand

# Imports
import math
import numpy as np

# Create a sample list
nums = [1, 2, 3, 4, 5]

# Compute the mean of the list
mean = sum(nums) / len(nums)

# Compute the variance and print the std of the list
variance = sum(pow(x - mean, 2) for x in nums) / len(nums)
std = math.sqrt(variance)
print(std)

# Compute and print the actual result from numpy
real_std = np.array(nums).std()
print(real_std)

1.4142135623730951
1.4142135623730951

 

Categorical data

Categorical features can only take a limited and usually fixed number of possible values.

Types of variables

  – ordinal variables: the values have a natural order (e.g. the number of stars given in a movie review)
  – nominal variables: the order doesn’t matter (e.g. eye color, gender)
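A minimal pandas sketch of the distinction (hypothetical values; pd.Categorical with ordered=True captures an ordinal variable, ordered=False a nominal one):

import pandas as pd

# Ordinal: the categories have a meaningful order
stars = pd.Categorical([3, 5, 1, 4], categories=[1, 2, 3, 4, 5], ordered=True)

# Nominal: the categories have no order
eye_color = pd.Categorical(['brown', 'blue', 'green'], ordered=False)

print(stars.min(), stars.max())   # ordering makes min/max meaningful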

Encoding categorical data

In machine learning, you may have to encode the categorical variables as numbers:
– label encoding –> maps each value to a number, as in the table below

  Food       Category #   Calories
  Apple          1            95
  Chicken        2           231
  Broccoli       3            50

The numbers have no relationship with each other; the encoding is arbitrary.

One-Hot Encoding

  Apple  Chicken  Broccoli  Calories
    1       0        0         95
    0       1        0        231
    0       0        1         50
One-hot encoding maps each category to its own column, which contains a 1 if the observation belongs to that category and a 0 otherwise.

The preprocessing module in scikit-learn (and its fit_transform method) is helpful here, together with the pandas get_dummies function.

Example laptop models
  Company  Product       Price
0 Apple    Macbook Pro   1339.69
1 Apple    Macbook Air    898.94
2 Apple    Macbook Pro   2539.45
3 Apple    Macbook Pro   1800.60
4 Apple    Macbook Pro   2139.97

 

Barplot

company_count = df['Company'].value_counts()
sns.barplot(x=company_count.index, y=company_count.values)
plt.show()

Bees Swarm Plot

Histograms are great, but:

    • a dataset can look different depending on how the bins are chosen (and the number of bins is arbitrary); this is known as binning bias – the same data may be interpreted differently for two different choices of bin number.
    • histograms don’t plot all the data: the actual values are hidden inside the bins.

To remedy these problems –> the bee swarm plot. Each point represents one observation (here, a county's share of the vote). The position along the y-axis carries the quantitative information; the points are spread out along x only to make them visible, so their precise x-location is unimportant. This solves the binning-bias problem and displays every data point. The requirement is that the data are organised in a tidy pandas DataFrame, where each column is a feature and each row is an observation.

_ = sns.swarmplot(x='state', y='dem_share', data=df_swing)
_ = plt.xlabel('state')
_ = plt.ylabel('percent of vote for Obama')
plt.show()

 

 

Boxplot

Excellent tool for packaging a lot of information in one visualization.

Example laptop prices for each brand

df.boxplot('Price', 'Company', rot=30, figsize=(12, 8), vert=False)

 

Exercise : Encoding techniques

from sklearn import preprocessing

# Create the label encoder and print our encoded new_vals
encoder = preprocessing.LabelEncoder()
new_vals = encoder.fit_transform(laptops['Company'])
print(new_vals)

[2 0 1 2 2 1 1 1 2 1 2 2 0 1 1 2 2 2 1 1 1 2 1 2 1 1 2 1 2 1 1 2 1 1 2 1 0
2 1 1 2 1 2 2 1 2 1 2 2 1 2 2 1 2 1 0 1 1 2 2 2 1 1 2 1 1 1 1 2 2 1 2 1 1]

# One-hot encode Company for laptops2
laptops2 = pd.get_dummies(data=laptops, columns=['Company'])
print(laptops2.head())

  Product             Price    Company_Apple Company_Dell Company_Lenovo
0 ThinkPad 13          960.00      0             0            1 
1 MacBook Pro         1518.55      1             0            0
2 XPS 13              1268.00      0             1            0
3 ThinkPad Yoga       2025.00      0             0            1
4 IdeaPad 520S-14IKB   599.00      0             0            1

 

Exploring laptop prices

# Get some initial info about the data
laptops.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 309 entries, 0 to 308
Data columns (total 3 columns):
Company 309 non-null object
Product 309 non-null object
Price 309 non-null float64
dtypes: float64(1), object(2)
memory usage: 7.3+ KB

# Produce a countplot of companies
sns.countplot(x='Company', data=laptops)
plt.show()

# Visualize the relationship with price
laptops.boxplot('Price', 'Company', rot=30)
plt.show()

It appears that Asus is the most common brand while Toshiba is less common. Furthermore, despite a few outliers, there is a steady increase in price as we move from Acer to Asus to Toshiba models. During your interview prep, don’t forget to emphasize communicating results. Companies are looking for someone who can break down insights and share them effectively with non-technical team members. Recording yourself is an excellent way to practice this.

 

Empirical Cumulative Distribution Function

There is a limit to the efficacy of bee swarm plots: with too many data points, the points start to overlap. In that case we can compute an empirical cumulative distribution function (ECDF). The x-value of an ECDF is the quantity you are measuring (in this case the percent of the vote that went to Obama). The y-value is the fraction of data points that have a value smaller than or equal to the corresponding x-value.

Example: 20% of counties in the swing states had 36% or less of their vote go to Obama. Example 2: 75% of counties in the swing states had 50% or less of their vote go to Obama.

How to make an ECDF

The x-axis is the sorted data, which we generate using the NumPy sort function.
The y-axis is evenly spaced data points with a maximum of 1, which we can generate using the np.arange function and then dividing by the total number of data points.

import numpy as np
x = np.sort(df_swing['dem_share'])
y = np.arange(1, len(x) + 1) / len(x)
_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.xlabel('percent vote Obama')
_ = plt.ylabel('ECDF')
plt.margins(0.02)
plt.show()

 

Interpretation with 3 ECDFs

Ohio and Pennsylvania were similar, with PA having slightly more Democratic counties. Florida had a greater fraction of heavily Republican counties.

Function for an ECDF

def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    # Number of data points: n
    n = len(data)

    # x-data for the ECDF: x
    x = np.sort(data)

    # y-data for the ECDF: y
    y = np.arange(1, n + 1) / n

    return x, y

 

Plotting the ECDF

# Compute ECDF for versicolor data: x_vers, y_vers
x_vers, y_vers = ecdf(versicolor_petal_length)

# Generate plot
_ = plt.plot(x_vers,y_vers, marker = '.', linestyle = 'none')

# Make the margins nice
plt.margins(0.02)

# Label the axes
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('ECDF')

# Display the plot
plt.show()

 

 

Comparison of ECDFs

# Compute ECDFs
x_set, y_set = ecdf(setosa_petal_length)
x_vers, y_vers = ecdf(versicolor_petal_length)
x_virg, y_virg = ecdf(virginica_petal_length)

# Plot all ECDFs on the same plot
_ = plt.plot(x_set,y_set, marker = '.', linestyle = 'none')
_ = plt.plot(x_vers,y_vers, marker = '.', linestyle = 'none')
_ = plt.plot(x_virg,y_virg, marker = '.', linestyle = 'none')

# Make nice margins
plt.margins(0.02)

# Annotate the plot
plt.legend(('setosa', 'versicolor', 'virginica'), loc='lower right')
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('ECDF')

# Display the plot
plt.show()

 

The ECDFs expose clear differences among the species. The petal lengths of setosa are much shorter, and show less absolute variability, than those of versicolor and virginica.

 

 

Two or more variables

This part is about analyzing the relationship between two or more numerical variables. We’ll go over different types of relationships, correlation and more.

Types of relationship

By using scatterplots to compare one variable against another, we can get a feel for the kind of relationship we’re dealing with. Two common distinctions are strong vs. weak and positive vs. negative.
Note that not all relationships are linear; they can be quadratic or exponential as well, and often there is no apparent relationship between the variables at all.

What’s correlation?

Correlation describes the relatedness between variables, meaning how much information variables reveal about each other.

If two variables are positively correlated, then when one increases the other is likely to increase as well.

In Python, the plt.scatter, sns.pairplot and pandas .corr() functions are helpful here.

Covariance

The covariance of two variables is calculated as the average of the product of the paired values from the two samples, after each value has had its sample mean subtracted.
Covariance falls short when it comes to interpretability, since the magnitude on its own tells us little (it depends on the units of the variables).
It is mainly important because we can use the covariance to calculate the Pearson correlation coefficient, a metric that is far more interpretable.
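A short NumPy sketch of that calculation (hypothetical arrays; bias=True makes np.cov divide by n, matching the plain average described above):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

# Covariance by hand: the average product of the mean-centred values
cov_manual = np.mean((x - x.mean()) * (y - y.mean()))

# NumPy equivalent: the off-diagonal entry of the covariance matrix
cov_np = np.cov(x, y, bias=True)[0, 1]

print(cov_manual, cov_np)   # the two agree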

Pearson Correlation

To get the Pearson correlation coefficient, we take the covariance and divide it by the product of the standard deviations of the two variables.

A positive value means there is a positive relationship, while a negative value means there is a negative relationship. A value of 1 (or -1) means the variables are perfectly correlated, and a value of 0 means there is no correlation; the coefficient always falls between -1 and +1. Closely related is the R-squared value, which is simply the Pearson correlation squared.
R-squared is often interpreted as the share of the variability in Y that is explained by X, and is great to include in interview answers.
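A brief sketch of that formula (hypothetical arrays; the result should agree with np.corrcoef):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

# Pearson correlation: covariance divided by the product of the standard deviations
cov = np.mean((x - x.mean()) * (y - y.mean()))
r = cov / (x.std() * y.std())

# R squared: the share of the variability in y explained by x
r2 = r ** 2

print(r, np.corrcoef(x, y)[0, 1])   # the two should match
print(r2)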

Correlation vs causation

Our goal is often to find out if there is a relationship between two variables, that is, whether information about one of the variables tells us more about its counterpart.


But how do we know that variables are actually related and don’t just appear that way by chance? How can we be sure that one variable actually causes the other?
The graph referenced here compares margarine consumption with divorce rates: does this mean that margarine causes divorce, or are the two simply correlated?

Exercise : Types of relationships :

# Display a scatter plot and examine the relationship
plt.scatter(weather['MinTemp'], weather['MaxTemp'])
plt.show()

# Display a scatter plot and examine the relationship
plt.scatter(weather['MaxTemp'], weather['Humidity9am'])
plt.show()

# Display a scatter plot and examine the relationship
plt.scatter(weather['MinTemp'], weather['Humidity3pm'])
plt.show()

 

Exercise : Pearson correlation

# Generate the pair plot for the weather dataset
sns.pairplot(weather)
plt.show()

# Look at the scatter plot for the humidity variables
plt.scatter(weather['Humidity9am'], weather['Humidity3pm'])
plt.show()

# Compute and print the Pearson correlation
r = weather['Humidity9am'].corr(weather['Humidity3pm'])
print(r)
> 0.6682963889852788

# Compute and print the Pearson correlation
r = weather['Humidity9am'].corr(weather['Humidity3pm'])

# Calculate the r-squared value and print the result
r2 = r ** 2
print(r2)
> 0.446620063530763

We see here that humidity in the morning has a moderately strong correlation with afternoon humidity, giving us a Pearson coefficient of ~0.67. When we square that result, we get an r-squared value of ~0.45, meaning that Humidity9am explains around 45% of the variability in Humidity3pm.

Exercise : Sensitivity to outliers

Plot and compute the correlation for a dataset with an outlier and then remove it and see what changes. In the end, you want to see how correlation performs and come to a conclusion about when and where you should use it.

A sample dataset from the famous Anscombe’s quartet has been imported for you as the df variable, along with all the packages used previously in this chapter.

# Display the scatter plot of X and Y
plt.scatter(df.X, df.Y)
plt.show()

# Compute and print the correlation
corr  = df['X'].corr(df['Y'])
print(corr)
0.8162867394895984

# Drop the outlier from the dataset
df = df.drop(2)

# Compute and print the correlation once more
new_corr  = df['X'].corr(df['Y'])
print(new_corr)
0.9999965537848281

Notice how our correlation initially suffered because of the outlier but became nearly perfect once it was removed. This effectively conveys the biggest complaint about Pearson correlation: the metric is not very robust to outliers, so data scientists often must seek other tools when the outliers can’t or shouldn’t simply be removed.