Feature Engineering
Skewed Distribution
You’ve been fine-tuning a machine learning model you created to predict mortgage defaults on a mortgages DataFrame. You’re aiming to eliminate skewness from your numerical variables by applying a log transformation. Despite the transformation, one of your columns is still skewed – why might this be, and what is the solution?
The presence of an outlier on the right is responsible for skewness, and the solution is applying another round of transformation.
- The presence of an outlier on the right is responsible for skewness, and the solution is removing the outlier.
The concentration of values on the left-hand side is responsible for skewness, and the solution is applying another round of transformation.
The mean is responsible for skewness, and the solution is removing outliers from the column.
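As a rough illustration of why a lone right-tail outlier survives a log transform, here is a minimal sketch; the mortgages data and its loan_amount column are hypothetical stand-ins:

import numpy as np
import pandas as pd

# Hypothetical loan amounts with one extreme value on the right
mortgages = pd.DataFrame({'loan_amount': [1e5, 1.2e5, 1.5e5, 2e5, 2.5e5, 5e7]})

# The log transform compresses the tail, but the outlier still dominates
mortgages['log_loan'] = np.log(mortgages['loan_amount'])
print(mortgages['log_loan'].skew())

# Dropping values beyond 1.5 * IQR above the third quartile removes it
q1, q3 = mortgages['log_loan'].quantile([0.25, 0.75])
trimmed = mortgages[mortgages['log_loan'] <= q3 + 1.5 * (q3 - q1)]
print(trimmed['log_loan'].skew())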
Exploratory Data Analysis – Data Visualisation
As part of a project, you’ve been assigned to create a machine learning model that predicts whether customers are going to churn. You want to investigate the relationship between the Age column and the target variable Attrition in your churn DataFrame.
Create a boxplot to visually uncover the distribution of Age by different values of Attrition.
import seaborn as sns
import matplotlib.pyplot as plt

sns.boxplot(x='Attrition', y='Age', data=churn)
plt.show()
Exploratory Data Analysis – Observation
Consider the histogram below. Which of the following statements is true?
- The mean value for Group 1 is the lowest of the three groups
Group 2 clearly has fewer observations than either of the other groups
Group 3 has a skewed distribution, making its mean higher
All of the groups have very different distributions of the data
Skewed Data
You’ve been tasked with creating a classification system that predicts whether employees in your organization are leaving their jobs or staying. The employee_churn DataFrame contains numerous numeric columns which are skewed, and you want a statistical measure of their spread in order to identify which columns could be transformed.
What is the most appropriate measure of spread?
Range, because it covers the distance from the minimum to the maximum of the distribution.
Variance, because it measures how far values are spread out around their mean.
Standard deviation, because it measures how narrowly or widely spread values are around their mean.
- Interquartile range, because it is based on the median and is less sensitive to outliers.
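Since the answer hinges on the IQR being robust to outliers, here is a minimal sketch of computing it per column, assuming employee_churn is loaded as a DataFrame:

# Interquartile range (Q3 - Q1) for every numeric column; unlike range,
# variance, or standard deviation, it is not inflated by extreme values
numeric = employee_churn.select_dtypes(include='number')
iqr = numeric.quantile(0.75) - numeric.quantile(0.25)
print(iqr.sort_values(ascending=False))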
Scaling Data
Before we fit a model to our data we should consider centering and scaling the data so that:
It’s easier to interpret the model output because the variables are on the same scale.
- A feature does not have more influence on the model because of larger or smaller values.
We can remove outliers from the data because they will all be in the same range.
We can use both the original and scaled versions of the variables in our model.
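A minimal sketch of why scaling matters, using scikit-learn’s StandardScaler on two hypothetical features with very different magnitudes:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Salary (tens of thousands) vs. years of experience (single digits)
X = np.array([[30000.0, 1.0], [50000.0, 3.0], [90000.0, 10.0]])

# After centering and scaling, both columns have mean 0 and std 1,
# so neither feature dominates the model purely because of its units
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))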
Scaling or Standardization
As part of an interview for a Data Scientist role, you’ve been asked about key differences between Min-Max scaling and Standardization. Choose the correct answer:
There are no differences between them; they’re just two techniques to get data into the same range.
- Min-Max scaling always produces columns between 0 and 1, whereas the range of standardized data is not predetermined.
Min-Max scaling is a preprocessing technique, whereas standardization is a statistical measure.
Min-Max scaling is a preprocessing technique, whereas standardization is a machine learning algorithm.
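A quick sketch contrasting the two on the same hypothetical column, with an outlier included to make the difference in output ranges visible:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [100.0]])

# Min-Max scaling: output always lands in [0, 1]
print(MinMaxScaler().fit_transform(x).ravel())

# Standardization: mean 0, std 1, but no fixed minimum or maximum
print(StandardScaler().fit_transform(x).ravel())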
Transform Data
Consider the variable x in the Pandas DataFrame df shown in the plot below. Apply a Box-Cox Transformation to the variable x.
from sklearn.preprocessing import PowerTransformer

log = PowerTransformer(method='box-cox')
df['log_x'] = log.fit_transform(df[['x']])
df['log_x'].head()

> 0   -0.191372
  1    1.681201
  2    0.748264
  3    0.388754
  4   -0.922371
  Name: log_x, dtype: float64
Missing data
Consider the data frame df below that shows the total number of observations per month. Fit a suitable imputer to fill the missing values.
Date        Ozone  Solar  Wind
1976-05-31     26     27    31
1976-06-30      9     30    30
1976-07-31     26     31    31
1976-08-31     26     28    31
1976-09-30     29     30    31
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')
imputer.fit(df)

> SimpleImputer(add_indicator=False, copy=True, fill_value=None,
                missing_values=nan, strategy='median', verbose=0)
Model Selection & Validation
Bias-Variance
The bias-variance trade-off describes:
Technical details underpinning model fitting that we don’t need to worry about.
Incorporating expert knowledge of a process at the expense of model simplicity.
- Balancing how well a model fits the data with how well it generalizes to new data.
Choosing to focus on only one of a model’s abilities: to explain or to predict a process.
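One way to see the trade-off in code is a validation curve: the training score keeps improving as model complexity grows, while the cross-validated score eventually drops. A minimal sketch, assuming X_train and y_train are available as elsewhere in this section:

from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Shallow trees underfit (high bias); very deep trees memorize the
# training data and generalize worse (high variance)
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=1), X_train, y_train,
    param_name='max_depth', param_range=range(1, 11), cv=5)

print(train_scores.mean(axis=1))
print(val_scores.mean(axis=1))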
Assessing Performance
Which of the following metrics would not be used when assessing the performance of a regression model?
- Accuracy.
Root Mean Square Error.
Median Absolute Deviation.
Akaike Information Criterion.
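For contrast, a minimal sketch with hypothetical true and predicted values: the regression metrics above apply to continuous predictions, whereas accuracy only makes sense for class labels. Note that sklearn’s median_absolute_error is the closest analogue to the Median Absolute Deviation named above:

import numpy as np
from sklearn.metrics import mean_squared_error, median_absolute_error

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))
print("Median absolute error:", median_absolute_error(y_true, y_pred))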
Grid Search
Determine whether 50, 150, or 250 is the best value for the n_estimators hyperparameter of a random forest classifier.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

model_params = {'n_estimators': [50, 150, 250]}
rf = RandomForestClassifier(random_state=42)
clf = GridSearchCV(rf, model_params, cv=5)
clf.fit(X_train, y_train)
clf.best_params_

> {'n_estimators': 150}
Unsupervised Learning
PCA
Available in your working session is the dataset scaled_samples. Instantiate a principal component analysis model object with 2 components, and fit the model to the scaled_samples object.
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(scaled_samples)
pca_features = pca.transform(scaled_samples)
print(pca_features.shape)
Dendrogram
Consider the dendrogram below. Which height yields 4 clusters?
10
8
- 6
4
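To check a cut height programmatically, scipy’s fcluster can cut a linkage matrix at a given distance. A minimal sketch on hypothetical data:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 2-D points; Z is the resulting linkage matrix
rng = np.random.default_rng(1)
Z = linkage(rng.normal(size=(20, 2)), method='ward')

# Cutting the dendrogram at a height returns one label per observation;
# the number of distinct labels is the number of clusters at that height
labels = fcluster(Z, t=6, criterion='distance')
print(len(set(labels)))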
Clustering – hierarchical clustering
Using the scipy package and the linkage matrix ward_linkage_matrix, return a cut tree with the height set to 3.
from scipy.cluster.hierarchy import cut_tree

cutree = cut_tree(ward_linkage_matrix, height=3)
cutree[:5]

> array([[0],
        [1],
        [2],
        [3],
        [4]])
Clustering – linkage
The Pandas DataFrame df is loaded in your working session. Using the scipy package, compute the cluster distances (with linkage method complete) between the observations in the columns x_scaled and y_scaled.
from scipy.cluster.hierarchy import linkage

distances = linkage(df[['x_scaled', 'y_scaled']], method='complete', metric='euclidean')
distances[:5]

> array([[0.00000000e+00, 9.00000000e+00, 0.00000000e+00, 2.00000000e+00],
        [3.00000000e+00, 6.00000000e+00, 2.25024613e-02, 2.00000000e+00],
        [1.60000000e+01, 2.10000000e+01, 2.99617089e-02, 2.00000000e+00],
        [2.00000000e+01, 2.30000000e+01, 2.99617089e-02, 2.00000000e+00],
        [1.40000000e+01, 2.50000000e+01, 3.74708522e-02, 2.00000000e+00]])
Supervised Learning
Regression – Logistic Regression
A LogisticRegression model is fitted on the training data X_train and y_train, and stored in model. Use model and the X_test feature data to predict values for the response variable, and store them in y_pred. Then get the model score using X_test and y_test.
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(random_state=1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
model.score(X_test, y_test)
Regression – Lasso Regression
Available in this session are the training (X_train, y_train) and test (X_test, y_test) sets.
Implement a Lasso Regression model with alpha equal to 0.01. Determine the RMSE achieved by the model.
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Lasso

lasso_model = Lasso(alpha=0.01)
lasso_model.fit(X_train, y_train)
lasso_predictions = lasso_model.predict(X_test)
print("RMSE: ", np.sqrt(mean_squared_error(y_test, lasso_predictions)))

> RMSE:  4.7128077313531485
Regression – interpretation results
Consider the summary below from a linear model, fitting the continuous target chol (cholesterol) with the feature sex (0 if female, 1 if male). What does this model suggest regarding the relationship between the variables chol and sex?
Intercept coefficient: 261.3020833333333
Sex variable coefficient: [-22.01222826]
All else equal, females are likely to have higher cholesterol than males, by 261.30.
- All else equal, females are likely to have higher cholesterol than males, by 22.01.
All else equal, males are likely to have higher cholesterol than females, by 22.01.
All else equal, males are likely to have higher cholesterol than females, by 261.30.
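The arithmetic behind the correct option, using the coefficients above (the fitted line is chol = 261.30 − 22.01 · sex):

intercept, sex_coef = 261.3020833333333, -22.01222826

# sex = 0 (female) and sex = 1 (male) give the predicted group means
female_mean = intercept + sex_coef * 0   # 261.30
male_mean = intercept + sex_coef * 1     # 239.29
print(female_mean - male_mean)           # females higher by 22.01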
Regression – Generalized LM or simple LM
In which of the following situations would you recommend the use of a generalized linear model with a non-Gaussian distribution rather than a simple linear model?
Where the response variable is continuous, and the feature variable is continuous.
Where the response variable is continuous, and the feature variable is categorical.
- Where the response variable is binary, and the feature variable is continuous.
Where the response variable is continuous, and the feature variable is binary.
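A minimal sketch of the binary case, in the same statsmodels style used later in this section; the DataFrame df and its columns default (0/1) and income are hypothetical:

import statsmodels.api as sm
from statsmodels.formula.api import glm

# A binomial GLM (logistic regression) links a 0/1 response to a
# continuous feature, which a simple linear model cannot do properly
model = glm('default ~ income', data=df, family=sm.families.Binomial()).fit()
print(model.summary())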
Regression – Linear Regression with Polynomial features – interaction terms
An array X with two features has been loaded for you along with the target variable array y. Fit a multiple linear regression model with interaction terms on the target variable y as a function of the feature variables contained in X.
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

interaction_term = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_inter = interaction_term.fit_transform(X)
model = LinearRegression()
model.fit(X_inter, y)
print("Regression coefficients: {}".format(model.coef_))

> Regression coefficients: [-0.33715159  0.08155747  0.80662   ]
Generalized Linear Model
A (Poisson) generalized linear model is fitted on the dataset score and stored in this session as model. Using model, apply the prediction method on the test dataset.
import statsmodels.api as sm
from statsmodels.formula.api import glm

model = glm('goal ~ player', data=score, family=sm.families.Poisson()).fit()
model.predict(test)

> 0    2.0
  1    2.0
  2    0.2
  3    0.2
Classifier – Decision Tree Classifier
Available in this session are the training data X_train and y_train for feature and target variables, respectively; as well as the testing data X_test and y_test for feature and target variables, respectively.
Using sklearn, fit a classification decision tree model on X_train and y_train.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

model = DecisionTreeClassifier(max_depth=4, random_state=1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)

> 0.9833333333333333
Classifier – RandomForest Classifier
Available in this session are the training data X_train and y_train for feature and target variables, respectively; as well as the testing data X_test and y_test for feature and target variables, respectively.
Using sklearn, fit a classification random forest model on X_train and y_train.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

model = RandomForestClassifier(n_estimators=10, random_state=1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)

> 0.9833333333333333

# Second example: more, but shallower, trees
model = RandomForestClassifier(n_estimators=300, max_depth=1, random_state=1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)

> 0.9333333333333333
Classifier – Gradient Boosting Classifier
Available in this session are the training data X_train and y_train for feature and target variables, respectively; as well as the testing data X_test and y_test.
Using sklearn, fit a classification gradient boosting model on X_train and y_train.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

model = GradientBoostingClassifier(n_estimators=300, max_depth=1, random_state=1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)

> 0.95