Regression and Classification
Regression is a technique used to model and analyze the relationships between variables that contribute to producing a particular outcome. More concretely, it’s a way to determine which variables have an impact, which don’t, which factors interact, and how certain we are about this. The most common techniques are linear and logistic regression.
To use regression models effectively, we need the true relationship between the variables to be linear, the errors to be normally distributed and homoscedastic (uniform variance), and each observation to be independent.
Simple linear regression involves one independent and one dependent variable with a linear relationship. This results in a fit that will look similar to the plot.
We are solving for the Y-value or the dependent variable, which is our output. This is calculated by taking the y-intercept plus our population slope coefficient, times the independent variable, X, plus our random irreducible error term. More variables can be included by simply adding a beta coefficient for each additional factor. Note that sometimes you will only see the linear component of our intercept and slope, without the random error component.
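Written out, the description above corresponds to the equations below. The symbols (beta for the coefficients, epsilon for the error term, p for the number of independent variables) are conventional choices assumed here, not taken from the notes.

```latex
% Simple linear regression: intercept + slope * X + irreducible error
y = \beta_0 + \beta_1 x + \varepsilon

% Multiple regression: one additional beta coefficient per extra factor
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon

% Fitted line, i.e. the linear component only (no error term)
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x
```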
Example: linear regression
To implement linear regression in python, we’ll call on the scikit-learn package.
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
After creating the linear regression object and changing any default parameters, simply call the fit function to create your model.
lm.fit(X, y)
coef = lm.coef_
Now look at the coefficient. Since we have only one independent variable in this example, there is only one coefficient. It is essentially the slope of the line and tells us that for every 1-unit increase in the independent variable, we expect about 0.8 units more of the dependent variable.
Another regression technique is logistic regression, one of the most common machine learning techniques for two-class classification. While linear regression gives us a continuous output, logistic regression is used to produce a discrete output, the predicted class. It does this by computing the probability that each observation belongs to a class, thanks to the sigmoid function.
The sigmoid function is also called the logistic function. It gives us an S-shaped curve that takes any real number and maps or converts it between 0 and 1.
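As a rough illustration of how the sigmoid turns a linear score into a probability and then a class, here is a minimal sketch in plain Python. The intercept, coefficient and the helper names predict_proba and predict_class are hypothetical values and names chosen for illustration; they are not taken from a fitted model.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function: maps any real number to a value between 0 and 1."""
    # Numerically stable form that avoids overflow in exp() for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


# Hypothetical intercept and coefficient (assumptions, not fitted values).
intercept, coef = -4.0, 0.05


def predict_proba(x: float) -> float:
    """Probability of the positive class for a single feature value x."""
    return sigmoid(intercept + coef * x)


def predict_class(x: float, threshold: float = 0.5) -> int:
    """Discrete output: 1 if the probability reaches the threshold, else 0."""
    return int(predict_proba(x) >= threshold)


print(sigmoid(0))          # midpoint of the S-shaped curve
print(predict_proba(100))  # probability for a large feature value
print(predict_class(40))   # class for a small feature value
```

The 0.5 threshold is a common default, but it is a choice; moving it trades false positives against false negatives.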
Example : logistic regression
Similar to linear regression, we can implement logistic regression and then fit the model to our data.
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X_train, y_train)
coefs = clf.coef_
Since we use two independent variables, we get two coefficients back. Note that their magnitudes are only comparable when you normalize your data first; otherwise you can’t draw any conclusions from them.
We can also print accuracy to see how our model performed :
accuracy = clf.score(X_test, y_test)
Here it accurately identified around 85% of the observations in the test set.
Other noteworthy functions include predict, for generating predictions, and ravel, for flattening arrays during data preparation.
Exercise : Linear Regression
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# scikit-learn expects a 2-D feature array, hence reshape(-1, 1)
X = np.array(weather['Humidity9am']).reshape(-1,1)
y = weather['Humidity3pm']
# Create and fit your linear regression model
lm = LinearRegression()
lm.fit(X, y)
# Assign and print predictions
preds = lm.predict(X)
print(preds)
[62.90599123 54.20645768 43.92519074 52.6247243 51.04299093 61.32425786
 64.48772461 60.53339117 47.87952418 65.27859129 59.74252449 37.59825725
 70.8146581 74.76899154 27.31699032 38.38912393 70.8146581 48.67039087
 47.08865749 68.44205804 36.80739056 70.02379142 58.16079111 59.74252449
 62.11512455 45.50692412 28.89872369 69.23292473 59.74252449 48.67039087
 71.60552479 57.36992442 47.08865749 51.83385761 57.36992442 26.52612363
 66.86032467 32.06219044 58.16079111 47.87952418 50.25212424 ... ]
# Plot your fit to visualize your model
plt.scatter(X, y)
plt.plot(X, preds, color='red')
plt.show()
# Assign and print coefficient
coef = lm.coef_
print(coef)
[0.79086669]
Despite some noise in the plot, we have a decent-looking fit here using Humidity9am to predict the dependent variable Humidity3pm with a linear model. Now take another look at the coefficient: for every additional unit of humidity in the morning, we can expect about 0.79 additional units of humidity in the afternoon. Note that the slope alone does not tell us that humidity drops about 20% from morning to afternoon, because the predicted afternoon value also depends on the intercept (lm.intercept_).
Exercise : Logistic Regression
from sklearn.linear_model import LogisticRegression
# Create and fit your model
clf = LogisticRegression()
clf.fit(X_train, y_train)
# Compute and print the accuracy
acc = clf.score(X_test, y_test)
print(acc)
0.7875
# Assign and print the coefficients
coefs = clf.coef_
print(coefs)
[[0.20747557 3.17409056]]
Since our features were normalized beforehand, we can look at the magnitude of our coefficients to tell us the importance of each independent variable. Here you can see that the second variable, Humidity3pm, was much more important to our outcome than the humidity from that morning. This is intuitive since we are trying to predict the rain for tomorrow!
Regression evaluation metrics: R-squared (coefficient of determination), the mean absolute error (MAE) and the mean squared error (MSE).
R-squared was also discussed when analyzing relationships between 2 or more variables. R-squared tells us the proportion of the variance of the dependent variable that is explained by the regression model. Here the residuals are plotted and show us how good a fit our model is. This is often the first metric data scientists go to when evaluating their model. In Python, we use the score function.
MAE vs. MSE
MAE is the sum of the absolute residuals divided by the number of points, and MSE is the sum of the squared residuals divided by the number of points. The resulting penalty functions look like this, with absolute error scaling linearly and squared error scaling quadratically. As a result, different scenarios call for different metrics. In the exercises, you can leverage the mean_absolute_error function in Python.
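To make the difference between the two penalties concrete, here is a small sketch in plain Python. The helper names mae and mse and the toy numbers are assumptions made for illustration; in practice you would likely use the scikit-learn metric functions instead.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average of |residual|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error: average of residual squared."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data (made up): two sets of predictions with the same total absolute error.
y_true = [10, 12, 14, 16]
preds_spread = [11, 13, 15, 17]   # every prediction is off by 1
preds_single = [10, 12, 14, 20]   # one prediction is off by 4

print(mae(y_true, preds_spread), mse(y_true, preds_spread))
print(mae(y_true, preds_single), mse(y_true, preds_single))
```

Both sets of predictions have the same MAE, but the MSE is four times larger when the error is concentrated in a single observation, which is the quadratic penalty at work.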
MAE vs. MSE question
What are some differences you would expect in a model that minimizes squared error, versus a model that minimizes absolute error ? In which cases would each error metric be appropriate ?
Typically, if you want large errors on outliers or individual observations to be penalized heavily, you will want to use MSE, since squaring the errors weights the large ones more heavily. On the other hand, if you are not concerned with outliers or singular observations, MAE can be used to suppress those errors a bit more, since it takes the absolute value instead of squaring the errors.
Precision is the number of true positives over the number of true positives plus false positives. It can be interpreted as the percentage of your positive predictions that were correct, and it falls as the number of type I errors (false positives) grows.
Recall is the number of true positives over the number of true positives plus false negatives and is linked to the rate of type II errors.
Interviewers may ask you to choose a metric based on the context of the problem. Using the confusion matrix that we discussed earlier in the course, we can easily see where our model’s weaknesses are: whether it produces false positives, known as type I errors, or false negatives, known as type II errors. This also ties in nicely with precision and recall.
For example, if you are building a spam detector, you probably don’t want to make any type I errors, so you optimize for precision. On the other hand, if you are trying to classify a rare disease, you normally want to avoid type II errors, so recall is the priority. In Python, scikit-learn’s confusion_matrix, precision_score and recall_score functions compute these.
Exercise : Regression evaluation
Let’s revisit the linear regression model that you created with LinearRegression() and then trained with the fit() function a few exercises ago. Evaluate the performance of your model, imported here as lm for you to call.
The weather data has been imported for you with the X and y variables as well, just like before. Let’s get to calculating the R-squared, mean squared error, and mean absolute error values for the model.
# R-squared score
r2 = lm.score(X, y)
print(r2)
0.44662006353076383
# Mean squared error
from sklearn.metrics import mean_squared_error
preds = lm.predict(X)
mse = mean_squared_error(y, preds)
print(mse)
226.12721831681654
# Mean absolute error
from sklearn.metrics import mean_absolute_error
preds = lm.predict(X)
mae = mean_absolute_error(y, preds)
print(mae)
11.522404665934568
Note that our R-squared value tells us the proportion of the variance of y that X explains, here about 45%. Which error metric would you recommend for this dataset? If you remember from when you plotted your model fit, there aren’t too many outliers, so mean squared error would be a good choice to go with!
Exercise : Classification evaluation
Moving forward with evaluation metrics, this time you’ll evaluate our logistic regression model from before with the goal of predicting the binary RainTomorrow feature using humidity.
We have gone ahead and imported the model as clf and the same test sets assigned to the X_test and y_test variables.
# Generate and output the confusion matrix
from sklearn.metrics import confusion_matrix
preds = clf.predict(X_test)
matrix = confusion_matrix(y_test, preds)
print(matrix)
[[185 0]
 [ 51 4]]
# Compute and print the precision
from sklearn.metrics import precision_score
preds = clf.predict(X_test)
precision = precision_score(y_test, preds)
print(precision)
1.0
# Compute and print the recall
from sklearn.metrics import recall_score
preds = clf.predict(X_test)
recall = recall_score(y_test, preds)
print(recall)
0.07272727272727272
You can see here that the precision of our rain prediction model was quite high, meaning that we didn’t make too many Type I errors. However, there were plenty of Type II errors, shown in the bottom-left quadrant of the confusion matrix. This is indicated further by the low recall score, meaning that there were plenty of rainy days that we missed out on. Think a little about the context and which metric you would choose to optimize for!
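The scores in this exercise can be reproduced by hand from the confusion matrix, which may help connect the definitions to the numbers. The sketch below hard-codes the matrix printed in the exercise and assumes the scikit-learn layout (rows are true classes, columns are predicted classes); the variable names are illustrative.

```python
# Confusion matrix copied from the exercise output.
# Layout assumed: rows = true class (0, 1), columns = predicted class (0, 1).
matrix = [[185, 0],
          [51, 4]]

tn, fp = matrix[0]  # true negatives, false positives (type I errors)
fn, tp = matrix[1]  # false negatives (type II errors), true positives

precision = tp / (tp + fp)                   # correct share of positive predictions
recall = tp / (tp + fn)                      # share of actual positives that were found
accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that were correct

print(precision)
print(recall)
print(accuracy)
```

The accuracy works out to 0.7875 even though only 4 of the 55 rainy days were caught, which illustrates why accuracy alone can be misleading when one class dominates.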
Missing data and outliers
Handling missing data
We’ll look at how to handle missing data and then deal with outliers. How do we identify and correct for null values in our dataset? There are two common approaches: the first is dropping the whole row when you detect a missing value, and the second is imputing the missing values with some other value.
Dropping the whole row
Dropping the whole row is likely the simplest approach to correcting null values, as it can be done in one line of code:
df.dropna(inplace=True)
However, there are some trade-offs: by dropping any rows with a null value, you could potentially lose a significant portion of your dataset and exclude information that could strengthen your model or produce insights.
In the example, this line of code would drop rows 4 and 6, since they contain at least one null value.
Impute missing values
The second approach is to impute values for the nulls. This approach takes a little more thought but allows you to preserve the information contained in the rows with some null values. There are a few popular ways to impute values: use a constant value (like 0), use a value taken from another, randomly selected observation, use the mean, median or mode, or use another model to predict the value and impute that.
A few useful functions
isnull() is a pandas function that flags null values; combined with any(axis=1), it identifies the rows that have at least one null value. You can also take it a step further and specify which fields must be null.
dropna() if you want to drop the rows outright; either all of them or a specified subset.
fillna() is useful for imputation: you specify a value or technique and the nulls in the DataFrame are filled in.
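The three functions above can be tried out on a tiny DataFrame. This is a minimal sketch that assumes pandas and NumPy are installed; the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing prices.
df = pd.DataFrame({
    "product": ["A", "B", "C", "D"],
    "price": [400.0, np.nan, 199.0, np.nan],
})

# isnull() returns a boolean mask; any(axis=1) flags rows with at least one null.
rows_with_nulls = df[df.isnull().any(axis=1)]

# dropna() removes rows with nulls; subset limits the check to specific columns.
dropped = df.dropna(subset=["price"])

# fillna() imputes values: here, the median of the non-null prices.
median_price = df["price"].median()
imputed = df.fillna({"price": median_price})

print(rows_with_nulls)
print(dropped)
print(imputed)
```

Each call returns a new DataFrame and leaves df untouched, which makes it easy to compare the strategies side by side before committing to one.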
Dealing with outliers
There are a few different ways to use statistics to identify outliers. These include standard deviations (z-scores) and the interquartile range (IQR).
Using standard deviations is a popular, straightforward method. Any observation that falls more than 3 standard deviations from the mean is deemed an outlier. On the normal curve shown, each tail beyond this threshold makes up around 0.1% of the population (about 0.3% for both tails combined); anything past this threshold is considered an outlier.
Interquartile range (IQR)
Using the IQR is another way to determine whether or not a value is an outlier. Compare box plots: a box plot summarizes the data effectively in one plot using the median, quartiles and range. The IQR is computed by subtracting the first quartile from the third quartile. Using this value, you can set the outlier thresholds with the formulas (First quartile) - (1.5 x IQR) and (Third quartile) + (1.5 x IQR). In the box plot, these outliers are represented as dots outside of the end points in the plot.
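The IQR thresholds can be sketched with the Python standard library alone. The prices below are made up, and the quartiles use the "inclusive" method of statistics.quantiles, which is one of several quartile conventions; other tools may give slightly different cut points.

```python
import statistics

# Hypothetical prices with one suspiciously large value.
prices = [200, 300, 400, 500, 600, 700, 800, 900, 5000]

# quantiles(..., n=4) returns the three cut points: Q1, median, Q3.
q1, median, q3 = statistics.quantiles(prices, n=4, method="inclusive")
iqr = q3 - q1

# Outlier thresholds: 1.5 x IQR below Q1 and above Q3.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = [p for p in prices if p < lower or p > upper]

print(q1, q3, iqr)
print(lower, upper)
print(outliers)
```

With a pandas Series, the same thresholds can be built from the 0.25 and 0.75 quantiles of the column.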
Exercise : Handling null values
# Identify and print the rows with null values
nulls = laptops[laptops.isnull().any(axis=1)]
print(nulls)
    Company                                Product  Price
1      Asus                        ZenBook UX430UN    NaN
2      Acer                                Swift 3    NaN
5      Acer                               Aspire 3    NaN
10     Asus  X751NV-TY001T (N4200/4GB/1TB/GeForce    NaN
86     Asus     GL553VE-FY082T (i7-7700HQ/8GB/1TB    NaN
115 Toshiba                     Portege Z30-C-16P    NaN
165    Asus                            ROG G701VI    NaN
207    Acer                         Chromebook 11    NaN
# Impute constant value 0 and print the head
laptops.fillna(0, inplace=True)
print(laptops.head())
  Company                                    Product  Price
0    Acer                                   Aspire 3  400.0
1    Asus                            ZenBook UX430UN    0.0
2    Acer                                    Swift 3    0.0
3    Asus                          Vivobook E200HA  191.9
4    Asus  E402WA-GA010T (E2-6110/2GB/32GB/W10)  199.0
# Impute median price and print the head
# (the output shows this step run on the original data, not after the step above)
# numeric_only=True skips the text columns when computing the median
laptops.fillna(laptops.median(numeric_only=True), inplace=True)
print(laptops.head(5))
  Company                                    Product  Price
0    Acer                                   Aspire 3  400.0
1    Asus                            ZenBook UX430UN  812.0
2    Acer                                    Swift 3  812.0
3    Asus                          Vivobook E200HA  191.9
4    Asus  E402WA-GA010T (E2-6110/2GB/32GB/W10)  199.0
# Drop each row with a null value and print the head
# (again run on the original data)
laptops.dropna(inplace=True)
print(laptops.head())
  Company                                         Product   Price
0    Acer                                        Aspire 3  400.00
3    Asus                               Vivobook E200HA  191.90
4    Asus       E402WA-GA010T (E2-6110/2GB/32GB/W10)  199.00
6    Asus  X540UA-DM186 (i3-6006U/4GB/1TB/FHD/Linux)  389.00
7    Asus       X542UQ-GO005 (i5-7200U/8GB/1TB/GeForce  522.99
Notice that the observations at index 1 and 2 are gone now. The technique that you decide on should be entirely dependent on the context of the situation.
Exercise : Identifying outliers
# Calculate the mean and std
mean, std = laptops['Price'].mean(), laptops['Price'].std()
# Compute and print the upper and lower threshold
cut_off = std * 3
lower, upper = mean - cut_off, mean + cut_off
print(lower, 'to', upper)
-1756.329166612584 to 3727.2904290710558
# Identify and print rows with outliers
outliers = laptops[(laptops['Price'] > upper) | (laptops['Price'] < lower)]
print(outliers)
    Company             Product    Price
65     Asus   ROG G703VI-E5062T   3890.0
127    Asus         ZenBook Pro  -3004.0
224    Asus  Rog GL753VD-GC082T  11369.9
262    Asus          ROG G701VO   3975.0
# Drop the rows from the dataset: keep only prices inside both thresholds
# (& rather than |, since | would keep every row)
laptops = laptops[(laptops['Price'] <= upper) & (laptops['Price'] >= lower)]
In this scenario, dropping the outliers was likely the right move since the values were unthinkable for laptop prices. This implies that there was some mistake in data entry or data collection. With that being said, this won’t always be the best path forward. It’s important to understand why you got the outliers that you did and if they provide valuable information before you throw them out.
Bias variance trade off
Types of Error
There are three types: bias, variance and irreducible error. The irreducible error stems from multiple sources (framing the problem, algorithm used, …).
Bias is the set of simplifying assumptions made by a model to make the target function easier to learn. In general, high bias makes algorithms faster to learn and easier to understand but less flexible. Too much bias can lead to a problem with under-fitting. This means that our model is making too many assumptions and not fitting the training data well (a straight line across the data points). Examples of high-bias machine learning algorithms include Linear Regression, Linear Discriminant Analysis and Logistic Regression.
On the other hand, variance is the amount that the estimate of the target function would change if different training data were used. Some variance will exist, but ideally results would not change too much from one training dataset to the next. Too much variance in the model will lead to the problem of over-fitting. This means your model is too flexible and will fit itself too closely to your training data, so it does not generalize to unseen data. This is seen in the curve bending around all the data points to ensure strong training performance, which will not hold up later on. Examples of high-variance machine learning algorithms include Decision Trees, k-Nearest Neighbors and Support Vector Machines. You’ll see strong performance at first, until you apply your model to your test set, where it will fail to generalize and likely struggle.
Bias Variance trade off
The goal is to minimize the error, achieving low bias and low variance, which ultimately leads to good prediction performance. This is easier said than done, due to the inherent trade-off between bias and variance. Increasing the bias will decrease the variance and increasing the variance will decrease the bias. We can see this phenomenon at work in the plot with error on the y-axis and model complexity on the x-axis. The optimum model complexity falls somewhere in the middle. Keep this in mind when choosing an algorithm for the problem and data.
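For squared-error loss, the three types of error are usually tied together by the standard decomposition below. It assumes the data are generated as y = f(x) + epsilon with zero-mean noise of variance sigma squared; the symbols are conventional and not taken from the notes.

```latex
% Expected squared prediction error at a point x, averaged over training sets
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

Only the first two terms depend on the model, which is why changing model complexity moves error between bias and variance while the last term stays fixed.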
Exercise: Visualizing the tradeoff
In this exercise, you’ll revisit our weather dataset one last time by visualizing the difference between high bias and high variance models using the already imported
As a reminder, we are using the Temp9am feature to predict our dependent variable, the Temp3pm feature. The usual packages have been imported.
# Use X and y to create a scatterplot
plt.scatter(X, y)
# Add your model predictions to the scatter plot
plt.plot(np.sort(X), preds)
# Add the higher-complexity model predictions as well
plt.plot(np.sort(X), preds2)
plt.show()