Exploratory Data Analysis
- Descriptive Statistics
- Working with categorical data
- Relationship between two variables
Descriptive Statistics
- what it sounds like
- helps you describe the data with numerical calculations or plots
- many types – focus on the most common ones:
  - measures of centrality
  - measures of variability
Measures of centrality
- Mean
- The mean is the average, which is the sum divided by the number of observations
- Median
- The median is the middle value when all the observations are sorted
- Mode
- The mode is the most common observation (or the peak of the distribution)
If the distribution is perfectly normal, then all three values are the same.
If the distribution is skewed, the values will differ (a short example follows below).
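A minimal sketch of computing all three measures with pandas; the numbers are a made-up, right-skewed toy sample, not course data:

import pandas as pd

# Made-up, right-skewed toy sample (one large value drags the mean upward)
s = pd.Series([2, 3, 3, 4, 5, 6, 14])

print('Mean:', s.mean())      # ~5.29 -> pulled up by the large value
print('Median:', s.median())  # 4.0   -> middle of the sorted values
print('Mode:', s.mode()[0])   # 3     -> most common value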
Measures of variability
- Variance
- Standard deviation
- Range
These are used to describe how spread out the data is (a quick sketch follows below).
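A minimal sketch of the three variability measures with NumPy; the sample list is arbitrary, and the by-hand standard deviation calculation appears further down in these notes:

import numpy as np

nums = np.array([1, 2, 3, 4, 5])

print('Variance:', nums.var())            # average squared deviation from the mean
print('Std dev:', nums.std())             # square root of the variance
print('Range:', nums.max() - nums.min())  # max minus min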
Modality
The modality of a distribution is determined by the number of peaks it contains.
Most commonly a distribution has one peak (unimodal); the figure shown here was a bimodal distribution, which has two peaks.
Skewness
Another concept is skewness, which is a measurement of the symmetry of the distribution.
Skewness can also be seen from the centrality measures: if the distribution is asymmetrical, with more of the data piled up on the right than on the left, the distribution is skewed left (and the mean is pulled below the median). A short example follows below.
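A minimal sketch using scipy.stats.skew; the array is a made-up, left-skewed toy sample:

import numpy as np
from scipy.stats import skew

# Made-up left-skewed sample: most values on the right, long tail to the left
left_skewed = np.array([1, 7, 8, 8, 9, 9, 9, 10, 10, 10])

print('Skewness:', skew(left_skewed))     # negative value -> skewed left
print('Mean:', left_skewed.mean())        # 8.1, pulled toward the tail
print('Median:', np.median(left_skewed))  # 9.0, higher than the mean here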
Exercise – Mean or Median
# Visualize the distribution
plt.hist(weather['Temp3pm'])
plt.show()
# Assign the mean to the variable and print the result
mean = weather['Temp3pm'].mean()
print('Mean:', mean)

# Assign the median to the variable and print the result
median = weather['Temp3pm'].median()
print('Median:', median)

Mean: 22.873684210526317
Median: 23.1
# Visualize the distribution
plt.hist(weather['Temp9am'])
plt.show()

# Assign the mean to the variable and print the result
mean = weather['Temp9am'].mean()
print('Mean:', mean)

# Assign the median to the variable and print the result
median = weather['Temp9am'].median()
print('Median:', median)

Mean: 16.989473684210527
Median: 16.2
The distribution for Temp3pm was skewed left, while the distribution for Temp9am was skewed right.
For this reason, the median was higher than the mean initially, but after adapting our code we see that the mean became higher than the median due to the change in skewness.
While neither column is severely skewed, it is enough that we should consider using the median instead of the mean, since the median is the more robust metric.
Plotting a histogram
# Import plotting modules
import matplotlib.pyplot as plt
import seaborn as sns

# Set default Seaborn style
sns.set()

# Plot histogram of versicolor petal lengths
_ = plt.hist(versicolor_petal_length)

# Label axes
_ = plt.ylabel('count')
_ = plt.xlabel('petal length (cm)')

# Show histogram
plt.show()
Adjusting the number of bins in a histogram
# Import numpy
import numpy as np

# Compute number of data points: n_data
n_data = len(versicolor_petal_length)

# Number of bins is the square root of number of data points: n_bins
n_bins = np.sqrt(n_data)

# Convert number of bins to integer: n_bins
n_bins = int(n_bins)

# Plot the histogram
_ = plt.hist(versicolor_petal_length, bins=n_bins)

# Label axes
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('count')

# Show histogram
plt.show()
Standard deviation by hand
# Create a sample list
import math
import numpy as np

nums = [1, 2, 3, 4, 5]

# Compute the mean of the list
mean = sum(nums) / len(nums)

# Compute the variance and print the std of the list
variance = sum(pow(x - mean, 2) for x in nums) / len(nums)
std = math.sqrt(variance)
print(std)

# Compute and print the actual result from numpy
real_std = np.array(nums).std()
print(real_std)

1.4142135623730951
1.4142135623730951
Categorical data
Categorical features can only take a limited and usually fixed number of possible values.
Types of variables
– ordinal: the values have a natural order (e.g. the number of stars given in a movie review)
– nominal: order doesn’t matter (e.g. eye color, gender)
Encoding categorical data
In Machine Learning, you may have to encode the categorical variables as something else:
– label encoding –> mapping each category to a number, as in the table below
Food       Categorical #   Calories
Apple      1               95
Chicken    2               231
Broccoli   3               50
The numbers themselves have no relationship with each other.
One-Hot Encoding
Apple   Chicken   Broccoli   Calories
1       0         0          95
0       1         0          231
0       0         1          50
One-hot encoding maps each category to its own column, containing a 1 if the observation has that category and a 0 otherwise.
The preprocessing package in scikit-learn and its fit_transform function are helpful here, together with the pandas get_dummies function (a small sketch follows below).
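A minimal sketch of both encodings, using a DataFrame built by hand from the toy food table above (the column names here are illustrative, not from the course data):

import pandas as pd
from sklearn import preprocessing

# Toy data from the table above
food = pd.DataFrame({'Food': ['Apple', 'Chicken', 'Broccoli'],
                     'Calories': [95, 231, 50]})

# Label encoding: map each category to an integer
encoder = preprocessing.LabelEncoder()
food['Food_label'] = encoder.fit_transform(food['Food'])
print(food)

# One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(food[['Food', 'Calories']], columns=['Food'])
print(one_hot)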
Example laptop models
Company Product Price
0 Apple Macbook Pro 1339.69
1 Apple Macbook Air 898.94
2 Apple Macbook Pro 2539.45
3 Apple Macbook Pro 1800.60
4 Apple Macbook Pro 2139.97
Barplot
company_count = df['company'].value_counts()
sns.barplot(company_count.index, company_count.values)
Bee Swarm Plot
Histograms are great, but:
- a dataset can look different depending on how the bins are chosen (and the number of bins is arbitrary); this is known as binning bias. The same data may be interpreted differently for two different choices of bin number.
- histograms don’t plot all the data: the actual values are lost once they are aggregated into bins.
To remedy these problems –> bee swarm plot. Each point represents a single observation (here, one county's share of the vote). The position along the y-axis is the quantitative information; the data are spread out in x just to make them visible, so the precise x-location is unimportant. This solves the problem of binning bias, and all the data are displayed. The requirement is that the data are well organised in a pandas DataFrame where each column is a feature and each row is an observation.
_ = sns.swarmplot(x='state', y='dem_share', data=df_swing)
_ = plt.xlabel('state')
_ = plt.ylabel('percent of vote for Obama')
plt.show()
Boxplot
Excellent tool for packaging a lot of information in one visualization.
Example laptop prices for each brand
df.boxplot('Price', 'Company', rot=30, figsize=(12, 8), vert=False)
Exercise: Encoding techniques
from sklearn import preprocessing

# Create the label encoder and print our encoded new_vals
encoder = preprocessing.LabelEncoder()
new_vals = encoder.fit_transform(laptops['Company'])
print(new_vals)

[2 0 1 2 2 1 1 1 2 1 2 2 0 1 1 2 2 2 1 1 1 2 1 2 1 1 2 1 2 1 1 2 1 1 2 1 0 2 1 1 2 1 2 2 1 2 1 2 2 1 2 2 1 2 1 0 1 1 2 2 2 1 1 2 1 1 1 1 2 2 1 2 1 1]

# One-hot encode Company for laptops2
laptops2 = pd.get_dummies(data=laptops, columns=['Company'])
print(laptops2.head())

              Product    Price  Company_Apple  Company_Dell  Company_Lenovo
0         ThinkPad 13   960.00              0             0               1
1         MacBook Pro  1518.55              1             0               0
2              XPS 13  1268.00              0             1               0
3       ThinkPad Yoga  2025.00              0             0               1
4  IdeaPad 520S-14IKB   599.00              0             0               1
Exploring laptop prices
# Get some initial info about the data
laptops.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 309 entries, 0 to 308
Data columns (total 3 columns):
Company    309 non-null object
Product    309 non-null object
Price      309 non-null float64
dtypes: float64(1), object(2)
memory usage: 7.3+ KB

# Produce a countplot of companies
sns.countplot(laptops['Company'])
plt.show()

# Visualize the relationship with price
laptops.boxplot('Price', 'Company', rot=30)
plt.show()
It appears that Asus is the most common brand while Toshiba is the least common. Furthermore, despite a few outliers, there is a steady increase in price as we move from Acer to Asus to Toshiba models. During your interview prep, don’t forget to emphasize communicating results. Companies are looking for someone who can break down insights and share them effectively with non-technical team members. Recording yourself is an excellent way to practice this.
Empirical Cumulative Distribution Function
There is a limit to the efficacy of bee swarm plots: the number of data points. If there are too many, the points overlap. Instead we can compute an empirical cumulative distribution function (ECDF). The x-value of an ECDF is the quantity you are measuring (in this case the percent of the vote that went to Obama). The y-value is the fraction of data points that have a value smaller than the corresponding x-value.
Example: 20% of counties in the swing states had 36% or less of their people vote for Obama. Example 2: 75% of counties in the swing states had 50% or less of their people vote for Obama.
How to make an ECDF
The x-axis is the sorted data, which we can generate using the NumPy function np.sort.
The y-axis is evenly spaced data points with a maximum of 1, which we can generate using the np.arange function and then dividing by the total number of data points.
import numpy as np

x = np.sort(df_swing['dem_share'])
y = np.arange(1, len(x) + 1) / len(x)

_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.xlabel('percent vote Obama')
_ = plt.ylabel('ECDF')
plt.margins(0.02)
Interpretation with 3 ECDFs
Ohio and Pennsylvania were similar, with PA having slightly more Democratic counties. Florida had a greater fraction of heavily Republican counties.
Function for an ECDF
def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    # Number of data points: n
    n = len(data)

    # x-data for the ECDF: x
    x = np.sort(data)

    # y-data for the ECDF: y
    y = np.arange(1, n + 1) / n

    return x, y
Plotting the ECDF
# Compute ECDF for versicolor data: x_vers, y_vers
x_vers, y_vers = ecdf(versicolor_petal_length)

# Generate plot
_ = plt.plot(x_vers, y_vers, marker='.', linestyle='none')

# Make the margins nice
plt.margins(0.02)

# Label the axes
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('ECDF')

# Display the plot
plt.show()
Comparison of ECDFs
# Compute ECDFs
x_set, y_set = ecdf(setosa_petal_length)
x_vers, y_vers = ecdf(versicolor_petal_length)
x_virg, y_virg = ecdf(virginica_petal_length)

# Plot all ECDFs on the same plot
_ = plt.plot(x_set, y_set, marker='.', linestyle='none')
_ = plt.plot(x_vers, y_vers, marker='.', linestyle='none')
_ = plt.plot(x_virg, y_virg, marker='.', linestyle='none')

# Make nice margins
plt.margins(0.02)

# Annotate the plot
plt.legend(('setosa', 'versicolor', 'virginica'), loc='lower right')
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('ECDF')

# Display the plot
plt.show()
The ECDFs expose clear differences among the species. Setosa has much shorter petals, with less absolute variability in petal length than versicolor and virginica.
Two or more variables
This part is about analyzing the relationship between variables, in the context of two or more numerical variables. We’ll go over different types of relationships, correlation, and more.
Types of relationship
By using scatterplots to compare one variable against another, we can get a feel for the kind of relationship we’re dealing with. Here we see two common distinctions: strong vs. weak and positive vs. negative.
Note that not all relationships are linear; they can be quadratic or exponential as well, and often there is no apparent relationship between the variables at all.
What’s correlation?
Correlation describes the relatedness between variables, meaning how much information variables reveal about each other.
If two variables are positively correlated, like we see on the far left, then if one increases the other is likely to increase as well.
In Python, the scatter, pairplot, and corr functions are helpful here.
Covariance
The covariance of two variables is calculated as the average of the product between the values from each sample, where the values have each had their mean subtracted.
Covariance falls short when it comes to interpretability, since the magnitude on its own doesn’t tell us much.
It’s mainly important because we can use the covariance to calculate the Pearson Correlation Coefficient, a metric that’s more interpretable and important.
Pearson Correlation
To get the Pearson Correlation Coefficient, we take the covariance and divide it by the product of the sample standard deviations of each variable.
A positive value means there is a positive relationship, while a negative value means there’s a negative relationship. A value of 1 (or -1) means the variables are perfectly correlated, and a value of 0 means there is no correlation; the value always falls between -1 and +1. A related concept is the R-squared value, which is the Pearson correlation squared.
R-squared is often interpreted as the proportion of the variance in Y that is explained by X, and is great to include in interview answers. A small by-hand sketch follows below.
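A minimal by-hand sketch using made-up x and y arrays (not course data), checked against NumPy's built-in corrcoef:

import numpy as np

# Made-up toy data, just for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Covariance: average of the products of the mean-centered values
cov = np.mean((x - x.mean()) * (y - y.mean()))

# Pearson correlation: covariance divided by the product of the standard deviations
r = cov / (x.std() * y.std())
r_squared = r ** 2

print('Covariance:', cov)
print('Pearson r:', r)
print('R squared:', r_squared)

# Check against NumPy's built-in correlation matrix
print(np.corrcoef(x, y)[0, 1])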
Correlation vs causation
Our goal is often to find out if there is a relationship between two variables, that is, does information about one of the variables tell us more about its counterpart?
But how do we know that variables are actually related and don’t just appear that way by chance? How can we be sure that one variable actually causes the other?
The graph shown here plots margarine consumption against the divorce rate: does this mean that margarine causes divorce, or are the two simply correlated?
Exercise: Types of relationships
# Display a scatter plot and examine the relationship
plt.scatter(weather['MinTemp'], weather['MaxTemp'])
plt.show()

# Display a scatter plot and examine the relationship
plt.scatter(weather['MaxTemp'], weather['Humidity9am'])
plt.show()

# Display a scatter plot and examine the relationship
plt.scatter(weather['MinTemp'], weather['Humidity3pm'])
plt.show()
Exercise: Pearson correlation
# Generate the pair plot for the weather dataset
sns.pairplot(weather)
plt.show()

# Look at the scatter plot for the humidity variables
plt.scatter(weather['Humidity9am'], weather['Humidity3pm'])
plt.show()

# Compute and print the Pearson correlation
r = weather['Humidity9am'].corr(weather['Humidity3pm'])
print(r)

0.6682963889852788

# Calculate the r-squared value and print the result
r2 = r ** 2
print(r2)

0.446620063530763
We see here that humidity in the morning has a moderately strong correlation with afternoon humidity, giving us a Pearson coefficient of ~0.67. When we square that result, we get an r-squared value of ~0.45, meaning that Humidity9am explains around 45% of the variability in the Humidity3pm variable.
Exercise: Sensitivity to outliers
Plot and compute the correlation for a dataset with an outlier, then remove the outlier and see what changes. In the end, you want to see how correlation performs and come to a conclusion about when and where you should use it.
A sample dataset from the famous Anscombe’s quartet has been imported for you as the df variable, along with all the packages used previously in this chapter.
# Display the scatter plot of X and Y
plt.scatter(df.X, df.Y)
plt.show()

# Compute and print the correlation
corr = df['X'].corr(df['Y'])
print(corr)

0.8162867394895984

# Drop the outlier from the dataset
df = df.drop(2)

# Compute and print the correlation once more
new_corr = df['X'].corr(df['Y'])
print(new_corr)

0.9999965537848281
Notice how our correlation initially suffered due to the outlier, but became nearly perfect once it was removed. This effectively conveys the biggest complaint about Pearson correlation: the metric is not very robust to outliers, so data scientists often must seek other tools when the outliers can’t or shouldn’t simply be removed.