Python – Machine Learning Code Snippets

Feature Engineering

Skewed Distribution

You’ve been fine-tuning a machine learning model you created for predicting mortgage defaults on a mortgages DataFrame. You’re aiming to eliminate skewness from your numerical variables by applying a log transformation. Despite the transformation, one of your columns is still skewed – why might this be, and what is the solution? (A quick skew check is sketched after the options.)

  • The presence of an outlier on the right is responsible for skewness, and the solution is applying another round of transformation.
  • The presence of an outlier on the right is responsible for skewness, and the solution is removing the outlier.
  • The concentration of values on the left-hand side is responsible for skewness, and the solution is applying another round of transformation.
  • The mean is responsible for skewness, and the solution is removing outliers from the column.
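
A minimal sketch of how to verify this in practice, assuming the mortgages DataFrame is loaded; the column name below is hypothetical:

import numpy as np

# Hypothetical column name; substitute the column that stayed skewed.
col = 'loan_amount'

# Skewness before and after the log transform (log1p also handles zeros).
print(mortgages[col].skew())
print(np.log1p(mortgages[col]).skew())

# If a lone extreme value on the right keeps the skew, inspect and drop it.
upper = mortgages[col].quantile(0.99)
mortgages_trimmed = mortgages[mortgages[col] <= upper]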

Exploratory Data Analysis – Data Visualisation

As part of a project, you’ve been assigned to create a machine learning model that predicts whether customers are going to churn. You want to investigate the relationship between the Age column and the target variable Attrition in your churn DataFrame.

Create a boxplot to visually uncover the distribution of Age by different values of Attrition.

import seaborn as sns
import matplotlib.pyplot as plt

# One box of Age per Attrition value, side by side.
sns.boxplot(x='Attrition', y='Age', data=churn)
plt.show()

Exploratory Data Analysis – Observation

Consider the histogram below. Which of the following statements is true?

  • The mean value for Group 1 is the lowest of the three groups
  • Group 2 clearly has fewer observations than either of the other groups
  • Group 3 has a skewed distribution, making its mean higher
  • All of the groups have very different distributions of the data

Skewed Data

You’ve been tasked with creating a classification system that predicts whether employees in your organization are leaving their job or staying. The employee_churn DataFrame contains numerous numeric columns which are skewed, and you want a statistical measure of their spread in order to identify which columns could be transformed.

What is the most appropriate measure of spread? (A per-column sketch follows the options.)

  • Range because it covers the distance between the minimum and maximum of the distribution.
  • Variance because it measures how far values are spread out around their mean.
  • Standard Deviation because it measures how narrowly or widely spread values are around their mean.
  • Interquartile range because it is based on the median and is less sensitive to outliers.
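
A minimal sketch of computing a robust spread measure per column, assuming employee_churn is loaded:

# Interquartile range per numeric column: Q3 minus Q1, insensitive to outliers.
numeric = employee_churn.select_dtypes(include='number')
iqr = numeric.quantile(0.75) - numeric.quantile(0.25)
print(iqr.sort_values(ascending=False))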

Scaling Data

Before we fit a model to our data we should consider centering and scaling the data so that:

  • It’s easier to interpret the model output because the variables are on the same scale
  • A feature does not have more influence on the model because of larger or smaller values
  • We can remove outliers from the data because they will all be on the same range
  • We can use both the original and scaled version of the variables in our model

Scaling or Standardization

As part of an interview for a Data Scientist role, you’ve been asked about key differences between Min-Max scaling and Standardization. Choose the correct answer (a comparison sketch follows the options):

  • There are no differences between them, they’re just two techniques to get data in the same range.
  • Min-Max scaling always produces columns between 0 and 1 whereas the range of standardized data is not predetermined.
  • Min-Max scaling is a preprocessing technique, whereas Standardization is a statistical measure.
  • Min-Max scaling is a preprocessing technique, whereas Standardization is a machine learning algorithm.
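
A toy comparison, showing the range each scaler produces:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0], [100.0]])

# Min-Max scaling is bounded to [0, 1] by construction.
print(MinMaxScaler().fit_transform(X).ravel())

# Standardization yields mean 0 and standard deviation 1;
# the resulting range depends on the data.
print(StandardScaler().fit_transform(X).ravel())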

Transform Data

Consider the variable x in the Pandas DataFrame df shown in the plot below. Apply a Box-Cox Transformation to the variable x.

from sklearn.preprocessing import PowerTransformer

# Box-Cox power transform to reduce skewness in x.
boxcox = PowerTransformer(method='box-cox')
df['bc_x'] = boxcox.fit_transform(df[['x']])
df['bc_x'].head()

> 0  -0.191372
  1  1.681201
  2  0.748264
  3  0.388754
  4  -0.922371
Name: bc_x, dtype: float64
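
One caveat: Box-Cox requires strictly positive input. If x contained zeros or negative values, the Yeo-Johnson variant is a drop-in alternative (the column name below is illustrative):

from sklearn.preprocessing import PowerTransformer

# Yeo-Johnson accepts any real-valued input, unlike Box-Cox.
yj = PowerTransformer(method='yeo-johnson')
df['yj_x'] = yj.fit_transform(df[['x']])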

Missing data

Consider the DataFrame df below, which shows the total number of observations per month. Fit a suitable imputer to fill the missing values.

Date        Ozone  Solar Wind
1976-05-31   26     27    31
1976-06-30   9      30    30
1976-07-31   26     31    31
1976-08-31   26     28    31
1976-09-30   29     30    31

Code

from sklearn.impute import SimpleImputer

# Fill each column's missing values with that column's median,
# which is robust to outlying months.
imputer = SimpleImputer(strategy='median')
imputer.fit(df)

> SimpleImputer(add_indicator=False, copy=True, fill_value=None, missing_values=nan, strategy='median', verbose=0)
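
Note that fit() only learns the column medians; a transform() call is needed to actually fill the gaps. A short follow-up, assuming Date is the index so df is all-numeric:

import pandas as pd

# Apply the learned medians and keep the original labels.
df_filled = pd.DataFrame(imputer.transform(df), columns=df.columns, index=df.index)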

 


Model Selection & Validation

Bias-Variance

The bias-variance trade-off describes:

  • Technical details underpinning model fitting that we don’t need to worry about.
  • Incorporating expert knowledge of a process at the expense of model simplicity.
  • Balancing how well a model fits with how well it generalizes to new data.
  • Choosing to focus on only one of a model’s abilities to explain or predict a process.
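
One way to see the trade-off empirically is a validation curve: the training score keeps improving as model complexity grows, while the cross-validated score eventually falls. A sketch, assuming feature and target arrays X and y are loaded:

import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Score a decision tree over increasing depths with 5-fold cross-validation.
depths = np.arange(1, 11)
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=1), X, y,
    param_name='max_depth', param_range=depths, cv=5)

# A widening gap between the two curves signals overfitting (high variance).
print(train_scores.mean(axis=1))
print(val_scores.mean(axis=1))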

Assessing Performance

Which of the following metrics would not be used when assessing the performance of a regression model?

  • Accuracy.
  • Root Mean Square Error.
  • Median Absolute Deviation.
  • Akaike Information Criterion.
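
For reference, the prediction-based regression metrics above are available in scikit-learn (AIC is usually read off a fitted statsmodels result instead, and sklearn's median_absolute_error is the closest built-in to the deviation measure named above). A sketch, assuming arrays y_test and y_pred from a fitted regressor:

import numpy as np
from sklearn.metrics import mean_squared_error, median_absolute_error

# Root mean squared error and median absolute error on held-out data.
print(np.sqrt(mean_squared_error(y_test, y_pred)))
print(median_absolute_error(y_test, y_pred))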

Grid Search

Determine whether 50, 150, or 250 is the best value for the n_estimators hyperparameter of a random forest classifier.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

model_params = {'n_estimators': [50, 150, 250]}
rf = RandomForestClassifier(random_state=42)

# 5-fold cross-validation over each candidate value.
clf = GridSearchCV(rf, model_params, cv=5)
clf.fit(X_train, y_train)
clf.best_params_

> {'n_estimators': 150}
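
A natural follow-up is to inspect the cross-validated accuracy of the winning value and grab the refit estimator:

# Mean cross-validated score of the best parameter setting,
# and the model refit on all of X_train with that setting.
print(clf.best_score_)
best_rf = clf.best_estimator_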

 


Unsupervised Learning

PCA

Available in your working session is the dataset scaled_samples. Instantiate a principal component analysis model object with 2 components, and fit the model to the scaled_samples object.

from sklearn.decomposition import PCA

# Project the data onto its first two principal components.
pca = PCA(n_components=2)
pca.fit(scaled_samples)
pca_features = pca.transform(scaled_samples)
print(pca_features.shape)
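
To judge whether two components are enough, the fitted object reports the share of total variance each component captures:

# Fraction of the variance explained by each of the two components.
print(pca.explained_variance_ratio_)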

Dendrogram

Consider the dendrogram below. Which height yields 4 clusters?

  • 10
  • 8
  • 6
  • 4

Clustering – hierarchical clustering

Using the scipy package and the linkage matrix ward_linkage_matrix, return a tree cut at a height of 3.

from scipy.cluster.hierarchy import cut_tree

cutree = cut_tree(ward_linkage_matrix, height = 3)
cutree[:5]

> array([[0],[1],[2],[3],[4]])

Clustering – linkage

The Pandas DataFrame df is loaded in your working session. Using the scipy package, compute the cluster distances (with linkage method complete) between the observations in the columns x_scaled and y_scaled.

from scipy.cluster.hierarchy import linkage

# Complete linkage: the distance between two clusters is their maximum pairwise distance.
distances = linkage(df[['x_scaled', 'y_scaled']], method='complete', metric='euclidean')
distances[:5]

> array([[0.00000000e+00, 9.00000000e+00, 0.00000000e+00, 2.00000000e+00],
         [3.00000000e+00, 6.00000000e+00, 2.25024613e-02, 2.00000000e+00],
         [1.60000000e+01, 2.10000000e+01, 2.99617089e-02, 2.00000000e+00],
         [2.00000000e+01, 2.30000000e+01, 2.99617089e-02, 2.00000000e+00],
         [1.40000000e+01, 2.50000000e+01, 3.74708522e-02, 2.00000000e+00]])
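
Each row of the linkage matrix records the two merged clusters, the merge distance, and the new cluster's size. To turn the hierarchy into flat labels, scipy's fcluster can cut at a chosen distance (the threshold below is purely illustrative):

from scipy.cluster.hierarchy import fcluster

# Assign a cluster label to every observation by cutting at distance 0.5.
labels = fcluster(distances, t=0.5, criterion='distance')
print(labels[:10])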

 


Supervised Learning

Regression – Logistic Regression

A LogisticRegression model is fitted on the training data X_train and y_train, and stored in model. Use model and the X_test feature data to predict values for the response variable, and store it in y_pred. Then get the model score using X_test and y_test.

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(random_state=1)
model.fit(X_train, y_train)

# Predicted classes for the test features; score() returns mean test accuracy.
y_pred = model.predict(X_test)
model.score(X_test, y_test)

Regression – Lasso Regression

Available in this session are the training (X_train, y_train) and test (X_test, y_test) sets.

Implement a Lasso Regression model with alpha equal to 0.01. Determine the RMSE achieved by the model.

import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Lasso

# L1-regularized linear regression; alpha sets the penalty strength.
lasso_model = Lasso(alpha=0.01)
lasso_model.fit(X_train, y_train)
lasso_predictions = lasso_model.predict(X_test)
print("RMSE: ", np.sqrt(mean_squared_error(y_test, lasso_predictions)))

> RMSE: 4.7128077313531485

 

Regression – interpretation results

Consider the summary below from a linear model, fitting the continuous target chol (cholesterol) with the feature sex (0 if female, 1 if male). What does this model suggest regarding the relationship between the variables chol and sex? (The arithmetic is worked through after the options.)

Intercept coefficient: 261.3020833333333
Sex variable coefficient: [-22.01222826]

  • All else equal, females are likely to have higher cholesterol than males, by 261.30.
  • All else equal, females are likely to have higher cholesterol than males, by 22.01.
  • All else equal, males are likely to have higher cholesterol than females, by 22.01.
  • All else equal, males are likely to have higher cholesterol than females, by 261.30.
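
To make the interpretation concrete, plug both sex values into the fitted equation:

intercept = 261.3020833333333
sex_coef = -22.01222826

# Predicted cholesterol for sex = 0 (female) and sex = 1 (male).
print(intercept + sex_coef * 0)  # 261.30
print(intercept + sex_coef * 1)  # 239.29, i.e. 22.01 lower for males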

 

Regression – Generalized LM or simple LM

In which of the following situations would you recommend the use of a generalized linear model with a non-Gaussian distribution rather than a simple linear model?

  • Where the response variable is continuous, and the feature variable is continuous.
  • Where the response variable is continuous, and the feature variable is categorical.
  • Where the response variable is binary, and the feature variable is continuous.
  • Where the response variable is continuous, and the feature variable is binary.
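
For a binary response, the usual choice is a binomial-family GLM, i.e. logistic regression. A minimal statsmodels sketch in the style of the Poisson example further below, with hypothetical column names y and x in a DataFrame df:

import statsmodels.api as sm
from statsmodels.formula.api import glm

# Binomial family links a binary response to a continuous feature.
logit_model = glm('y ~ x', data=df, family=sm.families.Binomial()).fit()
print(logit_model.summary())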

 

Regression – Linear Regression with Polynomial features – interaction terms

An array X with two features has been loaded for you along with the target variable array y. Fit a multiple linear regression model with interaction terms on the target variable y as a function of the feature variables contained in X.

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# interaction_only=True adds the x0*x1 cross term but no squared terms.
interaction_term = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_inter = interaction_term.fit_transform(X)
model = LinearRegression()
model.fit(X_inter, y)

print("Regression coefficients: {}".format(model.coef_))
> Regression coefficients: [-0.33715159 0.08155747 0.80662 ]
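
In recent scikit-learn versions the transformer can report which column each coefficient belongs to:

# Expanded feature names: the two originals plus their interaction.
print(interaction_term.get_feature_names_out())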

 

Generalized Linear Model

A (Poisson) generalized linear model is fitted on the dataset score and stored in this session as model. Using model, apply the prediction method to the test dataset test.

import statsmodels.api as sm
from statsmodels.formula.api import glm

model = glm('goal ~ player', data=score, family=sm.families.Poisson()).fit()

model.predict(test)

> 0    2.0
  1    2.0
  2    0.2
  3    0.2

 

Classifier – Decision Tree Classifier

Available in this session are the training data X_train and y_train for feature and target variables, respectively; as well as the testing data X_test and y_test for feature and target variables, respectively.

Using sklearn, fit a classification decision tree model on X_train and y_train.

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

model = DecisionTreeClassifier(max_depth=4, random_state=1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)

> 0.9833333333333333

 

Classifier – RandomForest Classifier

Available in this session are the training data X_train and y_train for feature and target variables, respectively; as well as the testing data X_test and y_test for feature and target variables, respectively.

Using sklearn, fit a classification random forest model on X_train and y_train.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

model = RandomForestClassifier(n_estimators=10, random_state=1)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)

> 0.9833333333333333

# Second Example

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Depth-1 trees are stumps: many weak learners combined by the forest.
model = RandomForestClassifier(n_estimators=300, max_depth=1, random_state=1)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)

> 0.9333333333333333

 

Classifier – Gradient Boosting Classifier

Available in this session are the training data X_train and y_train for feature and target variables, respectively; as well as the testing data X_test and y_test.

Using sklearn, fit a classification gradient boosting model on X_train and y_train.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

model = GradientBoostingClassifier(n_estimators=300, max_depth=1, random_state=1)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)

> 0.95