Python – Statistics – EDA

Exploratory Data Analysis

  • Descriptive Statistics
  • Working with categorical data
  • Relationship between two variables

Descriptive Statistics

  • what it sounds like
  • helps you describe the data with numerical calculations or plots
  • many types – focus on the most common ones:
    • measures of centrality
    • measures of variability

Measures of centrality

  • Mean
    • The mean is the average, which is the sum divided by the number of observations
  • Median
    • The median is the middle value when all the observations are sorted
  • Mode
    • The mode is the most common observation (or the peak of the distribution)

If the distribution is perfectly normal, then the mean, median and mode are all the same.
If the distribution is skewed, the three values will differ.
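
A minimal sketch of the three measures, using Python's standard statistics module. Both samples are made up for illustration: the first is symmetric, the second adds one large value to create a long right tail.

```python
import statistics

# Hypothetical symmetric sample: mean, median and mode agree.
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]

# Same sample with one large value added, giving a long right tail.
right_skewed = symmetric + [30]

for name, data in [("symmetric", symmetric), ("right_skewed", right_skewed)]:
    mean = statistics.mean(data)      # sum divided by number of observations
    median = statistics.median(data)  # middle value of the sorted data
    mode = statistics.mode(data)      # most common observation
    print(f"{name}: mean={mean:.2f}, median={median}, mode={mode}")
```

In the skewed sample the mean is pulled toward the tail, while the median and mode stay where most of the data sits.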

Measures of variability

  • Variance
  • Standard deviation
  • Range
    All three describe how spread out the data is.
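
To make "spread" concrete, the sketch below compares two made-up samples that share the same mean but differ in variability. It uses the population versions (pvariance, pstdev) from the standard statistics module; that choice is an assumption, since the notes do not say which version is meant.

```python
import statistics

# Two hypothetical samples with the same mean (10) but different spread.
tight = [9, 10, 10, 10, 11]
wide = [2, 6, 10, 14, 18]

for name, data in [("tight", tight), ("wide", wide)]:
    variance = statistics.pvariance(data)  # mean squared deviation from the mean
    std_dev = statistics.pstdev(data)      # square root of the variance
    value_range = max(data) - min(data)    # largest minus smallest value
    print(f"{name}: variance={variance}, std={std_dev:.2f}, range={value_range}")
```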

Modality

The modality of a distribution is determined by the number of peaks it contains.
Most commonly there is a single peak (unimodal); the picture for this section shows a bimodal distribution, which has two.

Skewness

Another concept is skewness, which is a measurement of the asymmetry of the distribution.
It can also be seen from the centrality measures. Here the distribution is asymmetrical, with more data on the right than on the left, so the distribution is skewed left.
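
The notes do not give a formula for skewness. As an assumption, the sketch below uses the common moment-based coefficient (third central moment divided by the second central moment to the power 1.5). The sample is made up: most values sit on the right, with a tail to the left.

```python
def skewness(values):
    """Moment-based skewness: negative means skewed left, positive skewed right."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical left-skewed sample: bulk of the data on the right, tail on the left.
left_skewed = [1, 6, 7, 8, 8, 9, 9, 9, 10, 10]

print("skewness:", skewness(left_skewed))
print("mean:", sum(left_skewed) / len(left_skewed))
```

For a left-skewed sample like this one, the coefficient is negative and the mean falls below the median.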

 

 

Exercise – Mean or Median

 

 

The distribution for Temp3pm was skewed left, while the distribution for Temp9am was skewed more to the right.

For this reason, the median was higher than the mean initially, but after adapting our code we see that the mean became higher than the median due to the change in skewness.

While neither column is heavily skewed, the skew is enough that we should consider using the median instead of the mean, because the median is the more robust metric.

Plotting a histogram

Adjusting the number of bins in a histogram
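
A rough sketch of how the bin count changes what a histogram shows. It uses np.histogram to compute the counts without drawing anything. The bimodal sample is synthetic and the three bin counts are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic bimodal sample: two normal clusters around 15 and 25.
data = np.concatenate([rng.normal(15, 2, 200), rng.normal(25, 2, 200)])

for bins in (3, 10, 30):
    # counts: observations per bin; edges: the bins + 1 bin boundaries
    counts, edges = np.histogram(data, bins=bins)
    print(f"bins={bins:>2}: counts={counts.tolist()}")
```

To draw the plot itself, matplotlib's plt.hist(data, bins=bins) accepts the same bins argument.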

Standard deviation by hand
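
One way to carry out the calculation step by step is sketched below. The temperature values are made up, and the sample version (dividing by n - 1) is an assumption. The result is checked against statistics.stdev.

```python
import math
import statistics

temps = [21.5, 23.0, 19.5, 25.0, 22.0]  # hypothetical observations

n = len(temps)
mean = sum(temps) / n                            # step 1: the mean
squared_devs = [(t - mean) ** 2 for t in temps]  # step 2: squared deviations
sample_variance = sum(squared_devs) / (n - 1)    # step 3: average them (n - 1)
sample_std = math.sqrt(sample_variance)          # step 4: square root

print("by hand:   ", sample_std)
print("statistics:", statistics.stdev(temps))
```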

 

Categorical data

Categorical features can only take a limited and usually fixed number of possible values.

Types of variables

– ordinal variables: the values have a natural order (number of stars given in a movie review)
– nominal variables: order doesn’t matter (eye color, gender)

Encoding categorical data

In Machine Learning, you may have to encode the categorical variables as something else:
– label encoding –> mapping each value to a number
Food       Categorical #   Calories
Apple      1                 95
Chicken    2                231
Broccoli   3                 50
The numbers have no relationship with each other.

One-Hot Encoding

Apple Chicken Broccoli Calories
 1       0       0      95
 0       1       0     231
 0       0       1      50
One-hot encoding maps each category to its own column, which contains a 1 if the observation has that feature and a 0 if it does not.

The preprocessing package in scikit-learn (through the fit_transform method of its encoders) is helpful here, together with the pandas get_dummies function.
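
A minimal sketch of both encodings on the food table, using pandas only. pd.factorize stands in for the scikit-learn label encoder mentioned above (an assumption made to keep the example dependency-light). The column name Food_label is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Food": ["Apple", "Chicken", "Broccoli"],
    "Calories": [95, 231, 50],
})

# Label encoding: map each category to an integer code.
# pd.factorize numbers categories by first appearance, starting at 0.
codes, categories = pd.factorize(df["Food"])
df_label = df.assign(Food_label=codes + 1)  # +1 to match the 1, 2, 3 table above

# One-hot encoding: one 0/1 column per category (Food_Apple, Food_Broccoli, ...).
df_onehot = pd.get_dummies(df, columns=["Food"], dtype=int)

print(df_label)
print(df_onehot)
```

Each row of the one-hot result has exactly one 1 among the Food_ columns.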

Example laptop models
  Company  Product       Price
0 Apple    Macbook Pro   1339.69
1 Apple    Macbook Air    898.94
2 Apple    Macbook Pro   2539.45
3 Apple    Macbook Pro   1800.60
4 Apple    Macbook Pro   2139.97

 

Barplot

Bee Swarm Plot

Histograms are great, but:

    • a dataset can look different depending on how the bins are chosen (and the number of bins is arbitrary); this is known as binning bias. The same data may be interpreted differently for two different choices of bin number.
    • histograms don’t plot all the data. The data are swept into bins, so the actual values are not shown.

To remedy these problems –> bee swarm plot. Each point represents one observation (a share, in this example). The position along the y-axis is the quantitative information. The data are spread out along x to make them visible, but the precise x location is unimportant. This solves the problem of binning bias, and all data are displayed. The requirement is that the data are well organised in a pandas DataFrame, where each column is a feature and each row an observation.

 

 

Boxplot

Excellent tool for packaging a lot of information in one visualization.

Example: laptop prices for each brand

df.boxplot(column='Price', by='Company', rot=30, figsize=(12, 8), vert=False)

 

Exercise : Encoding techniques

 

Exploring laptop prices

It appears that Asus is the most common brand while Toshiba is less common. Furthermore, despite a few outliers, there is a steady increase in price as we move from Acer to Asus to Toshiba models. During your interview prep, don’t forget to emphasize communicating results. Companies are looking for someone that can break down insights and share them effectively with non-technical team members. Recording yourself is an excellent way to practice this.

 

Empirical Cumulative Distribution Function

There is a limit to the efficacy of bee swarm plots: the number of data points. If there are too many, the points overlap. Instead, we can compute an empirical cumulative distribution function (ECDF). The x-value of an ECDF is the quantity you are measuring (in this case, the percent of the vote that went to Obama). The y-value is the fraction of data points that have a value smaller than the corresponding x-value.

Example 1: 20% of counties in the swing states had 36% or less of their people vote for Obama. Example 2: 75% of counties in the swing states had 50% or less of their people vote for Obama.

How to make an ECDF

The x-axis is the sorted data. We generate it using the NumPy function np.sort.
The y-axis is evenly spaced data points with a maximum of 1, which we can generate using the np.arange function and then dividing by the total number of data points.
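
The two steps above can be sketched as a small helper. The function name ecdf and the sample values are illustrative assumptions. The y-values start at 1/n so that the largest observation maps to 1.

```python
import numpy as np


def ecdf(data):
    """Return x and y arrays for the ECDF of a one-dimensional sample."""
    x = np.sort(data)            # x-axis: the sorted data
    n = len(x)
    y = np.arange(1, n + 1) / n  # y-axis: 1/n, 2/n, ..., 1
    return x, y


sample = [36.0, 52.5, 41.2, 60.1, 47.8]  # hypothetical vote shares (%)
x, y = ecdf(sample)
print(x)
print(y)
```

To plot the result as points rather than a line, one option is plt.plot(x, y, marker='.', linestyle='none') from matplotlib.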

 

Interpretation with 3 ECDFs

Ohio and Pennsylvania were similar, with PA having slightly more Democratic counties. Florida had a greater fraction of heavily Republican counties.

Function for an ECDF

 

Plotting the ECDF

 

 

Comparison of ECDFs

 

The ECDFs expose clear differences among the species. Setosa is much shorter, also with less absolute variability in petal length than versicolor and virginica.

 

 

Two or more variables

This part is about analyzing the relationship between two or more numerical variables. We’ll go over different types of relationships, correlation and more.

Types of relationship

By using scatterplots to compare one variable against another, we can get a feel for the kind of relationship we’re dealing with. Here we’re seeing two common distinctions: strong or weak, and positive or negative.
Note that not all relationships are linear; they can be quadratic or exponential as well, and often there is no apparent relation between the variables at all.

What’s correlation ?

Correlation describes the relatedness between variables, meaning how much information variables reveal about each other.

If two variables are positively correlated, like we see on the far left, then if one increases the other is likely to increase as well.

Python: the scatter, pairplot and corr functions are helpful.

Covariance

The covariance of two variables is calculated as the average of the product between the values from each sample, where the values have each had their mean subtracted.
Covariance falls short when it comes to interpretability, since its magnitude depends on the units of the variables and tells us little on its own.
It’s mainly important because we can use the covariance to calculate the Pearson Correlation Coefficient, a metric that’s more interpretable and important.
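
A small sketch of the calculation, following the "average of the products" wording literally (dividing by n; the sample version divides by n - 1). The height and weight values are made up. Changing the units of one variable changes the covariance, which is why its magnitude is hard to read.

```python
def covariance(xs, ys):
    """Average product of the mean-centred values (population covariance)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


# Hypothetical measurements.
height_m = [1.60, 1.70, 1.75, 1.80, 1.90]
weight_kg = [55.0, 65.0, 70.0, 78.0, 90.0]
height_cm = [h * 100 for h in height_m]  # same heights, different unit

cov_m = covariance(height_m, weight_kg)
cov_cm = covariance(height_cm, weight_kg)
print("covariance with height in m: ", cov_m)
print("covariance with height in cm:", cov_cm)
```

The relationship is identical in both cases, yet the second value is 100 times the first.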

Pearson Correlation

To get the Pearson Correlation Coefficient, we take the covariance and divide it by the product of the sample standard deviations of the two variables.

A positive value means there is a positive relationship, while a negative value means there’s a negative relationship. A value of 1 (or -1) means the variables are perfectly correlated, and a value of 0 means there is no linear correlation. The value always falls between -1 and 1. A related measure is the R squared value, which is the Pearson correlation squared.
R squared is often interpreted as the proportion of the variability in Y that is explained by X, and is great to include in interview answers.
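
The definition can be written out directly, as in the sketch below. The function name pearson_r and the data are hypothetical. The covariance and both standard deviations use n - 1 so that they are consistent with each other.

```python
import math


def pearson_r(xs, ys):
    """Sample covariance divided by the product of sample standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return cov / (std_x * std_y)


# Hypothetical data: hours studied and test score.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 70, 72]

r = pearson_r(hours, score)
r_squared = r ** 2  # share of the variability in score explained by hours
print("r:", r)
print("r squared:", r_squared)
```

Unlike covariance, the result does not change when either variable is rescaled, for example from hours to minutes.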

Correlation vs causation

Our goal is often to find out if there is a relationship between two variables, that is, does information about one of the variables tell us more about its counterpart?


But how do we know that variables are actually related and don’t just appear that way by chance? How can we be sure that one variable actually causes the other?
Graph –> Does this mean that margarine causes divorce, or are they simply just correlated?

Exercise : Types of relationships :

 

Exercise : Pearson correlation

We see here that humidity in the morning has a moderately strong correlation with afternoon humidity, giving us a ~0.67 Pearson coefficient. When we square that result, we get an r-squared value of ~0.45, meaning that Humidity9am explains around 45% of the variability in the Humidity3pm variable.

Exercise : Sensitivity to outliers

Plot and compute the correlation for a dataset with an outlier and then remove it and see what changes. In the end, you want to see how correlation performs and come to a conclusion about when and where you should use it.

A sample dataset from the famous Anscombe’s quartet has been imported for you as the df variable, along with all the packages used previously in this chapter.

Notice how our correlation initially suffered due to the outlier, but became nearly perfect once it was removed. This effectively conveyed the biggest complaint about Pearson correlation: the metric is not very robust to outliers, so data scientists often must seek other tools when the outliers can’t or shouldn’t be easily removed.
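
The effect can be reproduced with a small synthetic example. This is not the Anscombe data from the exercise: the points below are made up, with one outlier added to an otherwise perfectly linear relationship. np.corrcoef returns the correlation matrix, and its off-diagonal entry is the Pearson coefficient.

```python
import numpy as np

# Synthetic, perfectly linear data (not the exercise dataset).
x = np.arange(1, 11, dtype=float)
y = 2 * x + 1

# The same data with a single outlier far below the line.
x_out = np.append(x, 11.0)
y_out = np.append(y, 0.0)

r_clean = np.corrcoef(x, y)[0, 1]
r_outlier = np.corrcoef(x_out, y_out)[0, 1]

print("without outlier:", r_clean)
print("with outlier:   ", r_outlier)
```

A single point out of eleven is enough to pull the coefficient far below 1.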