Python – Statistics – Regression and Classification

Regression and Classification

Regression models

Getting started

Regression is a technique used to model and analyze the relationships between the variables that contribute to producing a particular outcome. More concretely, it’s a way to determine which variables have an impact, which don’t, which factors interact, and how certain we are about all of this. The most common techniques are linear and logistic regression.

 

Assumptions

To effectively leverage regression models, we need the true relationship between the variables to be linear, the errors to be normally distributed and homoscedastic (uniform variance), and each observation to be independent.

Linear regression

Simple linear regression involves one independent and one dependent variable with a linear relationship, which results in a straight-line fit through the data.

We are solving for the Y-value, the dependent variable, which is our output. This is calculated by taking the y-intercept plus the population slope coefficient times the independent variable X, plus a random irreducible error term. More variables can be included by simply adding a beta coefficient for each additional factor. Note that sometimes you will only see the linear component, the intercept and slope, without the random error component.
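Written out (a standard formulation consistent with the description above, with \beta_0 the intercept, \beta_1 the slope, and \varepsilon the random error term):

Y = \beta_0 + \beta_1 X + \varepsilon

and, with additional factors:

Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon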

 

 

Example: linear regression

To implement linear regression in Python, we’ll call on the scikit-learn package.

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)

After creating the linear regression object and changing any default parameters, simply call the fit function to create your model.

coef = lm.coef_
print(coef)
[0.79086669]
Looking at the coefficient: since we have only one independent variable in this example, there is only one coefficient. It is essentially the slope of the line, and it tells us that for every 1-unit increase in the independent variable, the dependent variable increases by about 0.8 units.

Logistic regression

Another regression technique is logistic regression, one of the most common machine learning techniques for two-class classification. While linear regression gives us a continuous output, logistic regression produces a discrete output. This allows us to compute the probability that each observation belongs to a class, thanks to the sigmoid function.

The sigmoid function is also called the logistic function. It gives us an S-shaped curve that takes any real number and maps it to a value between 0 and 1.
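In its standard form, the sigmoid of a real input z is:

\sigma(z) = \frac{1}{1 + e^{-z}}

which approaches 0 for large negative z and 1 for large positive z.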

Example: logistic regression

Similar to linear regression, we can implement logistic regression and then fit the model to our data.

from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(solver='lbfgs')
clf.fit(X_train, y_train)
coefs = clf.coef_
print(coefs)
[[0.401 3.851]]

Since we used two independent variables, we got two coefficients back. Note that these are only interpretable when you normalize your data first, since you can’t draw any conclusions from their magnitudes otherwise.
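As a minimal sketch of that scaling step (using scikit-learn's StandardScaler, one common way to put features on a comparable scale; the variable names match the example above):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit the scaler on training data only
X_test_scaled = scaler.transform(X_test)        # apply the same scaling to the test set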

We can also print the accuracy to see how our model performed:
accuracy = clf.score(X_test, y_test)
print(accuracy)
0.85833333333
Here it accurately identified around 85% of the observations in the test set.
Other noteworthy functions include predict() for generating predictions and ravel() for data preparation.
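As a quick sketch of those two helpers (the variable names carry over from the example above; the ravel usage is illustrative):

import numpy as np

# Predict class labels for new observations
preds = clf.predict(X_test)

# ravel() flattens an (n, 1) column vector into the 1-D array scikit-learn expects for y
y_1d = np.ravel(y_train)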

Exercise: Linear Regression

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# The weather DataFrame is assumed to be loaded already
X = np.array(weather['Humidity9am']).reshape(-1,1)
y = weather['Humidity3pm']

# Create and fit your linear regression model
lm = LinearRegression()
lm.fit(X, y)

# Assign and print predictions
preds = lm.predict(X)
print(preds)

[62.90599123 54.20645768 43.92519074 52.6247243  51.04299093 61.32425786
 64.48772461 60.53339117 47.87952418 65.27859129 59.74252449 37.59825725
 70.8146581  74.76899154 27.31699032 38.38912393 70.8146581  48.67039087
 47.08865749 68.44205804 36.80739056 70.02379142 58.16079111 59.74252449
 62.11512455 45.50692412 28.89872369 69.23292473 59.74252449 48.67039087
 71.60552479 57.36992442 47.08865749 51.83385761 57.36992442 26.52612363
 66.86032467 32.06219044 58.16079111 47.87952418 50.25212424 ... ]

# Plot your fit to visualize your model
plt.scatter(X, y)
plt.plot(X, preds, color='red')
plt.show()

# Assign and print coefficient 
coef = lm.coef_
print(coef)
[0.79086669]

Despite some noise in the plot, we have a decent-looking fit here, using Humidity9am to predict the dependent variable Humidity3pm with a linear model. Furthermore, take another look at the coefficient: it means that for every 1 unit of humidity in the morning, we can expect about 0.80 units of humidity in the afternoon. More practically, this tells us that humidity drops about 20% from morning to afternoon!

Exercise: Logistic Regression

from sklearn.linear_model import LogisticRegression

# Create and fit your model
clf = LogisticRegression()
clf.fit(X_train, y_train)

# Compute and print the accuracy
acc = clf.score(X_test, y_test)
print(acc)
0.7875

# Assign and print the coefficients
coefs = clf.coef_
print(coefs)
[[0.20747557 3.17409056]]

Since our features were normalized beforehand, we can look at the magnitudes of our coefficients to tell us the importance of each independent variable. Here you can see that the second variable, Humidity3pm, was much more important to our outcome than the humidity from that morning. This is intuitive, since we are trying to predict rain for tomorrow!

Evaluating models

Common evaluation metrics for regression techniques are R-squared (the coefficient of determination), the mean absolute error (MAE), and the mean squared error (MSE).

R-squared

R-squared was also discussed when analyzing relationships between two or more variables. R-squared tells us the proportion of the variance of the dependent variable that is explained by the regression model; plotting the residuals shows how good of a fit the model is. This is often the first metric data scientists go to when evaluating a model. In Python, we use the score function.
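For reference, the standard definition, where \hat{y}_i are the model's predictions and \bar{y} is the mean of the observed values:

R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}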

MAE vs. MSE

MAE is the sum of the absolute residuals over the number of points, and MSE is the sum of the squared residuals over the number of points. In the resulting penalty functions, absolute error scales linearly with the size of the residual, while squared error scales quadratically. As a result, different scenarios call for different metrics. In the exercises, you can leverage the mean_absolute_error function from sklearn.metrics.
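The standard definitions over n points, with residuals y_i - \hat{y}_i:

\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert
\qquad
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2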

MAE vs. MSE question

What differences would you expect between a model that minimizes squared error and one that minimizes absolute error? In which cases would each error metric be appropriate?
Typically, if your dataset has outliers, or if you are worried about individual observations, you will want to use MSE, since squaring the errors weights large deviations more heavily. On the other hand, if you are not concerned with outliers or singular observations, MAE can be used to suppress the influence of those errors a bit more, since it takes the absolute value of the residuals instead of squaring them.

Classification techniques

Precision

Precision is the number of true positives over the number of true positives plus false positives. It can be interpreted as the percentage of positive predictions that were actually correct, and it is linked to the rate of type I errors.

Recall

Recall is the number of true positives over the number of true positives plus false negatives and is linked to the rate of type II errors.
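In terms of true positives (TP), false positives (FP), and false negatives (FN), the standard formulas are:

\mathrm{Precision} = \frac{TP}{TP + FP}
\qquad
\mathrm{Recall} = \frac{TP}{TP + FN}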

Confusion matrix

Interviewers may ask you to choose a metric based on the context of the problem. Using the confusion matrix that we discussed earlier in the course, we can easily see where our model’s weaknesses are: whether it makes false positives, known as type I errors, or false negatives, known as type II errors. This also ties in nicely with precision and recall.

For example, if you are building a spam detector, you probably don’t want to make any type I errors, so you should optimize for precision. On the other hand, if you are trying to classify a rare disease, you normally want to avoid type II errors, so recall is the priority. In Python, use the precision_score and recall_score functions.

Exercise: Regression evaluation

Let’s revisit the linear regression model that you created with LinearRegression() and then trained with the fit() function a few exercises ago. Evaluate the performance of your model, imported here as lm for you to call.

The weather data has been imported for you with the X and y variables as well, just like before. Let’s get to calculating the R-squared, mean squared error, and mean absolute error values for the model.

# R-squared score
r2 = lm.score(X, y)
print(r2)
0.44662006353076383

# Mean squared error
from sklearn.metrics import mean_squared_error
preds = lm.predict(X)
mse = mean_squared_error(y, preds)
print(mse)
226.12721831681654

# Mean absolute error
from sklearn.metrics import mean_absolute_error
preds = lm.predict(X)
mae = mean_absolute_error(y, preds)
print(mae)
11.522404665934568

Note that our R-squared value tells us the percentage of the variance of y that X is responsible for. Which error metric would you recommend for this dataset? If you remember from when you plotted your model fit, there aren’t too many outliers, so mean squared error would be a good choice to go with!

Exercise: Classification evaluation

Moving forward with evaluation metrics, this time you’ll evaluate our logistic regression model from before with the goal of predicting the binary RainTomorrow feature using humidity.

We have gone ahead and imported the model as clf and the same test sets assigned to the X_test and y_test variables.

# Generate and output the confusion matrix
from sklearn.metrics import confusion_matrix
preds = clf.predict(X_test)
matrix = confusion_matrix(y_test, preds)
print(matrix)
[[185   0]
 [ 51   4]]

# Compute and print the precision
from sklearn.metrics import precision_score
preds = clf.predict(X_test)
precision = precision_score(y_test, preds)
print(precision)
1.0

# Compute and print the recall
from sklearn.metrics import recall_score
preds = clf.predict(X_test)
recall = recall_score(y_test, preds)
print(recall)
0.07272727272727272

You can see here that the precision of our rain prediction model was quite high, meaning that we didn’t make many type I errors. However, there were plenty of type II errors, shown in the bottom-left quadrant of the confusion matrix and reflected in the low recall score: there were plenty of rainy days that we missed. Think a little about the context and which metric you would choose to optimize for!

Missing data and outliers

Handling missing data

We’ll look at how to handle missing data and then how to deal with outliers. How do we identify and correct for null values in our dataset? There are two common approaches: dropping rows that contain missing values, and imputing the missing values with some other value.

Dropping the whole row

Dropping the whole row when you detect a missing value is likely the simplest way of correcting for null values, as it can be done in one line of code: df.dropna(inplace=True). However, there are trade-offs: by dropping any row with a null value, you could lose a significant portion of your dataset and exclude information that could strengthen your model or produce insights.

In the example, that line of code would drop rows 4 and 6, since they each contain at least one null value.

Impute missing values

The second approach is imputing values for the nulls. This takes a little more thought, but it allows you to preserve the information contained in the rows that have some null values. There are a few popular ways to impute: use a constant value (like 0), insert a randomly selected record from another observation, use the mean, median, or mode, or use another model to predict the missing value and impute that.

A few useful functions

    • isnull() is a pandas function that identifies null values in a DataFrame; combined with any(axis=1), it flags the rows that contain at least one null. You can also take it a step further and specify which fields must be null.
    • dropna() is what to use if you want to drop the rows outright, either all of them or a specified subset.
    • fillna() is useful for imputation: you specify a value or technique, and pandas fills in the nulls. A minimal sketch of all three follows this list.
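A minimal sketch of the three functions on a small, hypothetical DataFrame:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan]})

print(df.isnull())                  # boolean mask marking the null values
print(df[df.isnull().any(axis=1)])  # rows that contain at least one null
print(df.dropna())                  # drop any row with a null value
print(df.fillna(df.mean()))         # impute each column's mean for its nulls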

Dealing with outliers

There are a few different ways to use statistics to identify outliers. These include standard deviations (or z-scores), as well as the interquartile range (IQR).

Standard deviations

Using standard deviations is a popular, straightforward method: any observation that falls more than 3 standard deviations from the mean is deemed an outlier. On a normal curve, the tails beyond that threshold make up only about 0.3% of the population, so anything past it is considered an outlier.

Interquartile range (IQR)

Using the IQR is another way to determine whether or not a value is an outlier. Compare with box plots: a box plot summarizes the data effectively in one plot using the median, quartiles, and range. The IQR is computed by subtracting the first quartile from the third quartile. Using this value, you can set the outlier thresholds with the formulas (first quartile) - (1.5 x IQR) and (third quartile) + (1.5 x IQR). In a box plot, these outliers are represented as dots beyond the whiskers.
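A minimal sketch of the IQR rule in pandas (the df DataFrame and 'value' column are hypothetical):

# Quartiles and IQR for a hypothetical numeric column
q1 = df['value'].quantile(0.25)
q3 = df['value'].quantile(0.75)
iqr = q3 - q1

# Thresholds from the formulas above
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag the outliers
outliers = df[(df['value'] < lower) | (df['value'] > upper)]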

Exercise: Handling null values

# Identify and print the rows with null values
nulls = laptops[laptops.isnull().any(axis=1)]
print(nulls)

     Company                               Product  Price
1       Asus                       ZenBook UX430UN    NaN
2       Acer                               Swift 3    NaN
5       Acer                              Aspire 3    NaN
10      Asus  X751NV-TY001T (N4200/4GB/1TB/GeForce    NaN
86      Asus     GL553VE-FY082T (i7-7700HQ/8GB/1TB    NaN
115  Toshiba                     Portege Z30-C-16P    NaN
165     Asus                            ROG G701VI    NaN
207     Acer                         Chromebook 11    NaN

# Impute constant value 0 and print the head
laptops.fillna(0, inplace=True)
print(laptops.head())

  Company                               Product  Price
0    Acer                              Aspire 3  400.0
1    Asus                       ZenBook UX430UN    0.0
2    Acer                               Swift 3    0.0
3    Asus                       Vivobook E200HA  191.9
4    Asus  E402WA-GA010T (E2-6110/2GB/32GB/W10)  199.0

# Impute median price and print the head
laptops.fillna(laptops.median(numeric_only=True), inplace=True)
print(laptops.head(5))
  Company                               Product  Price
0    Acer                              Aspire 3  400.0
1    Asus                       ZenBook UX430UN  812.0
2    Acer                               Swift 3  812.0
3    Asus                       Vivobook E200HA  191.9
4    Asus  E402WA-GA010T (E2-6110/2GB/32GB/W10)  199.0

# Drop each row with a null value and print the head
laptops.dropna(inplace=True)
print(laptops.head())
  Company                                    Product   Price
0    Acer                                   Aspire 3  400.00
3    Asus                            Vivobook E200HA  191.90
4    Asus       E402WA-GA010T (E2-6110/2GB/32GB/W10)  199.00
6    Asus  X540UA-DM186 (i3-6006U/4GB/1TB/FHD/Linux)  389.00
7    Asus     X542UQ-GO005 (i5-7200U/8GB/1TB/GeForce  522.99


Notice that the observations at index 1 and 2 are gone now. (Each snippet above starts from a fresh copy of the data, which is why the nulls reappear at each step.) The technique that you decide on should depend entirely on the context of the situation.

Exercise: Identifying outliers

# Calculate the mean and std
mean, std = laptops['Price'].mean(), laptops['Price'].std()

# Compute and print the upper and lower threshold
cut_off = std * 3
lower, upper = mean - cut_off, mean + cut_off
print(lower, 'to', upper)
-1756.329166612584 to 3727.2904290710558

# Identify and print rows with outliers
outliers = laptops[(laptops['Price'] > upper) | 
                   (laptops['Price'] < lower)]
print(outliers)
    Company             Product    Price
65     Asus   ROG G703VI-E5062T   3890.0
127    Asus         ZenBook Pro  -3004.0
224    Asus  Rog GL753VD-GC082T  11369.9
262    Asus          ROG G701VO   3975.0

# Drop the rows from the dataset
laptops = laptops[(laptops['Price'] <= upper) & 
                  (laptops['Price'] >= lower)]

In this scenario, dropping the outliers was likely the right move, since the values were unthinkable for laptop prices (one was even negative). This implies that there was some mistake in data entry or data collection. That said, this won’t always be the best path forward. It’s important to understand why you got the outliers that you did, and whether they provide valuable information, before you throw them out.

Bias variance trade off

Types of Error

There are three types of error: bias, variance, and irreducible error. Irreducible error stems from multiple sources (the framing of the problem, the algorithm used, …) and, as the name suggests, cannot be reduced no matter how good our model is.

Bias error

Bias consists of the simplifying assumptions made by a model to make the target function easier to learn. In general, high bias makes algorithms faster to learn and easier to understand, but less flexible. Too much bias leads to underfitting: the model makes too many assumptions and fails to fit the training data well (picture a straight line drawn across curved data points). Examples of high-bias machine learning algorithms include Linear Regression, Linear Discriminant Analysis, and Logistic Regression.

Variance error

On the other hand, variance is the amount that the estimate of the target function would change if different training data were used. Some variance will always exist, but ideally results would not change too much from one training dataset to the next. Too much variance leads to overfitting: the model is too flexible and fits itself too closely to the training data, so it fails to generalize to unseen data. Picture a curve winding through every data point: it gives strong training performance at first, but it will fail to generalize and will likely struggle once you apply the model to your test set. Examples of high-variance machine learning algorithms include Decision Trees, k-Nearest Neighbors, and Support Vector Machines.

Bias Variance trade off

The goal is to minimize total error, achieving low bias and low variance, which ultimately leads to good prediction performance. This is easier said than done, due to the inherent trade-off between bias and variance: increasing the bias decreases the variance, and increasing the variance decreases the bias. Plotting error on the y-axis against model complexity on the x-axis shows this phenomenon at work, with the optimal model complexity falling somewhere in the middle. Keep this in mind when choosing an algorithm for your problem and data.
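One way to see the trade-off concretely is to fit a low-complexity and a high-complexity model to the same noisy data. This is a minimal sketch; the synthetic data and the polynomial degrees are illustrative choices, not from the course:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Synthetic noisy observations around a sine curve
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=30)

# Degree 1 underfits (high bias); degree 15 overfits (high variance)
for degree in (1, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    plt.plot(X, model.predict(X), label=f'degree {degree}')

plt.scatter(X, y)
plt.legend()
plt.show()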

 

Exercise: Visualizing the tradeoff

In this exercise, you’ll revisit our weather dataset one last time by visualizing the difference between high-bias and high-variance models, using the already imported preds and preds2 variables.

As a reminder, we are using the Temp9am feature to predict our dependent variable, the Temp3pm feature. The usual packages have been imported.

# Use X and y to create a scatterplot
plt.scatter(X, y)

# Add your model predictions to the scatter plot 
plt.plot(np.sort(X), preds)

# Add the higher-complexity model predictions as well
plt.plot(np.sort(X), preds2)
plt.show()