# How to Interpret Results – Statistics and ML Results

## Statistics

### Statistics – Categorical variable

Which measures of center can be used for a variable that records the type of housing a person has? The variable has three possible values: for free, renting, own.

The mode is the appropriate measure. The values are unordered categories, so a mean or median is not defined for them.

### Statistics – IQR

Consider a set of data: 2, 3, 4, 5, 6, 6, 7, 8, 8, 8, 9. A summary of this set of data is:

• Minimum = 2
• First quartile = 4.5 (with linear interpolation; 4 under the exclusive-median convention)
• Median = 6
• Third quartile = 8
• Maximum = 9

What would happen to the values of the range, standard deviation and interquartile range (IQR) if the last value in the data were changed to 50?

The range and standard deviation would increase and the IQR would remain the same, because the IQR depends only on the first and third quartiles, which the change leaves untouched.
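The effect can be checked numerically. The sketch below uses Python's standard `statistics` module with the inclusive quartile method, which is only one of several quartile conventions and is an assumption here; the helper name `spread_summary` is hypothetical.

```python
import statistics

def spread_summary(data):
    """Return (range, sample standard deviation, IQR) for a list of numbers."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return max(data) - min(data), statistics.stdev(data), q3 - q1

original = [2, 3, 4, 5, 6, 6, 7, 8, 8, 8, 9]
modified = original[:-1] + [50]  # last value changed from 9 to 50

range_before, sd_before, iqr_before = spread_summary(original)
range_after, sd_after, iqr_after = spread_summary(modified)

print(range_before, range_after)  # range grows from 7 to 48
print(sd_before < sd_after)       # True: standard deviation grows
print(iqr_before == iqr_after)    # True: IQR is unchanged
```

The range and standard deviation react to the single extreme value, while the IQR ignores everything outside the middle half of the data.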

### Statistics – tests

Which type of statistical test should be used to compare the average commute distance traveled by residents of Osaka and Yokohama, using 2000 randomly selected residents of each city?

An independent (two-sample) t-test, because the two groups of residents are separate, unrelated samples.

### Statistics – Hypothesis Test

A researcher runs 10,000 separate hypothesis tests. Each test has a 5% probability of producing a false positive, so about 500 significant results are expected by chance alone even if no real effect exists. Most of the significant results will therefore be false alarms.

What is this type of problem called?

This is the multiple testing problem (also called the multiple comparisons problem).
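The figure of 500 can be reproduced with a small simulation. This is a minimal sketch that assumes every null hypothesis is true, in which case each p-value is uniformly distributed between 0 and 1; the seed and variable names are arbitrary choices.

```python
import random

ALPHA = 0.05
N_TESTS = 10_000

# Expected number of false positives when every null hypothesis is true
expected_false_positives = ALPHA * N_TESTS
print(expected_false_positives)  # 500.0

# Under a true null hypothesis, a p-value is uniformly distributed on [0, 1],
# so each simulated test is "significant" with probability ALPHA.
random.seed(0)  # arbitrary seed, only for reproducibility
p_values = [random.random() for _ in range(N_TESTS)]
false_positives = sum(p < ALPHA for p in p_values)
print(false_positives)  # close to 500; the exact count depends on the seed
```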

### Statistics – Hypothesis Test

A hypothesis test is conducted to determine if the means of two independent samples, data1 and data2, are significantly different.

Null Hypothesis: the means of the samples are equal
Alternative Hypothesis: the means of the samples are unequal

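```python
from scipy.stats import ttest_ind

# Two-sided independent t-test; data1 and data2 are the two samples
stat, p = ttest_ind(data1, data2)

if p > 0.05:
    print('No significant difference was found')
else:
    print('The samples probably have different mean values')
```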

### Statistics – Hypothesis Test

A small p-value (typically < 0.05) indicates:

• strong evidence against the alternative hypothesis so the null hypothesis can be accepted.
• weak evidence against the null hypothesis so we fail to reject the null hypothesis.
• strong evidence against the null hypothesis so we can accept the null hypothesis.
• strong evidence against the null hypothesis so the null hypothesis can be rejected.

The last option is correct: a small p-value is strong evidence against the null hypothesis, so the null hypothesis can be rejected.

### Statistics – Hypothesis Test

Which statement about the null and alternative hypotheses is correct?

• The null hypothesis is a statement about the population parameter.
• The null and alternative hypotheses are stated in terms of statistics obtained from the random sample.
• The hypothesis that the estimate is based solely on chance is called the alternative hypothesis.
• The null and alternative hypotheses are not complementary and therefore it is sufficient to define the null hypothesis.

The first statement is correct: hypotheses are statements about population parameters, not about statistics computed from the sample.

### Statistics – Distribution Data

You’re a data scientist at a prominent logistics company. Your team has estimated that on average the company experiences 0.1 equipment failures per day. What is the most suitable distribution that you could use to model equipment failures for 365 days?

The Poisson distribution, which models the number of independent events that occur at a constant average rate over a fixed interval.
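As a rough illustration, the Poisson probability mass function can be evaluated directly. The helper `poisson_pmf` is a hypothetical name; the rate of 0.1 failures per day comes from the question, and scaling it to 365 days assumes failures occur independently at a constant rate.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when the expected count is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

daily_rate = 0.1                # average failures per day (from the question)
yearly_rate = daily_rate * 365  # expected failures over 365 days

print(yearly_rate)                           # 36.5
print(round(poisson_pmf(0, daily_rate), 4))  # 0.9048: chance of a failure-free day
print(round(poisson_pmf(1, daily_rate), 4))  # 0.0905: chance of exactly one failure in a day
```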

### Statistics – Distribution Data

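You have data from a population that is exponentially distributed.

Which distribution does the sampling distribution of the sample mean increasingly approximate as the sample size increases to infinity?

• Exponential
• Normal
• Poisson
• Gamma

The normal distribution: by the central limit theorem, the distribution of the sample mean approaches a normal distribution as the sample size grows, whatever the shape of the population (provided its variance is finite).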
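The central limit theorem can be illustrated by simulation. The sketch below draws repeated samples from an exponential population and tracks the skewness of the sample means, which should shrink toward 0 (the skewness of a normal distribution) as the sample size grows. The rate of 1, the number of repetitions, the seed and the helper names are all assumptions made for illustration.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: third standardized moment (0 for a symmetric distribution)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

def sample_means(sample_size, n_samples=2000):
    """Means of n_samples samples drawn from an exponential population with rate 1."""
    return [
        statistics.fmean(random.expovariate(1.0) for _ in range(sample_size))
        for _ in range(n_samples)
    ]

random.seed(42)  # arbitrary seed, only for reproducibility
for size in (2, 20, 200):
    means = sample_means(size)
    # Skewness shrinks toward 0 as the sample size grows
    print(size, round(skewness(means), 2))
```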

### Statistics – GLM & Distribution

In which of the following situations would you recommend a generalized linear model with a non-Gaussian distribution rather than a simple linear model?

• Where the response variable is continuous, and the feature variable is continuous.
• Where the response variable is continuous, and the feature variable is categorical.
• Where the response variable is binary, and the feature variable is continuous.
• Where the response variable is continuous, and the feature variable is binary.

The third situation: a binary response is not normally distributed, so a GLM with a binomial family (logistic regression) is the appropriate choice.

### Statistics – LM

What is the main limitation of a linear model?
• Linear models are prone to over-fitting.
• Linear models assume variables are linearly separable.
• Linear models are sensitive to outliers.
• Linear models are prone to multicollinearity.

### Statistics – Regression

A simple linear regression model is fitted to data on a company’s advertising budget and its sales. What relationship between the advertising budget and sales does the regression output below describe?

```
Regression coefficients: [[0.1]]
Regression intercept: [7]
Regression score: 0.9
```

• If the company increases its advertising budget by \$1000, then there will be no change in sales.
• If the company increases its advertising budget by \$1000, its sales will most likely decrease by \$100.
• If the company increases its advertising budget by \$1000, its sales will most likely increase by \$100.
• If the company increases its advertising budget by \$1000, its sales will most likely increase by \$70.

The third option is correct: the coefficient of 0.1 means each additional dollar of advertising budget is associated with a predicted 0.1 dollar increase in sales, so \$1000 more budget predicts about \$100 more sales.
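The arithmetic behind the interpretation can be sketched in a few lines. The function name `predict_sales` and the starting budget are hypothetical; the coefficient and intercept are the values from the regression output.

```python
# Coefficient and intercept taken from the regression output above
coefficient = 0.1
intercept = 7

def predict_sales(budget):
    """Predicted sales for a given advertising budget."""
    return intercept + coefficient * budget

budget_increase = 1000
predicted_change = coefficient * budget_increase
print(predicted_change)  # 100.0

# The intercept cancels out, so the change is the same from any starting budget
# (5000 is an arbitrary, hypothetical starting value)
print(round(predict_sales(6000) - predict_sales(5000), 2))  # 100.0
```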

### Statistics – GLM

A Poisson generalized linear model is fitted on the dataset score and stored in this session as model. Using model, apply the prediction method to the test dataset.

```python
import statsmodels.api as sm
from statsmodels.formula.api import glm

# Fit a Poisson GLM of goal on player, then predict on the test dataset
model = glm('goal ~ player', data=score, family=sm.families.Poisson()).fit()
model.predict(test)
```

### Statistics – Data Analysis

Determine the correlation coefficient between the citric acid and pH of wine. acid and pH are NumPy arrays. Select the code that returns the expected output.

```python
import numpy as np

# corrcoef returns the 2x2 correlation matrix; [0, 1] is the acid-pH correlation
np.corrcoef(acid, pH)[0, 1]
```

Expected output: -0.5419041447395097

### Statistics – Data Analysis

Consider the distribution of the number of vehicles owned by a sample of 30 small businesses. What percentage of small businesses own two vehicles or fewer?

20%: (1 + 2 + 3) / 30 = 6 / 30 = 0.20.

### Statistics – Sampling Distributions

Write code that ensures the same sample of random numbers is generated each time it is executed. Use a seed value of 42.

```python
import numpy as np

np.random.seed(42)           # fix the seed so results are reproducible
np.random.randint(0, 11, 5)  # five integers from 0 to 10 (11 is excluded)
```
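Why seeding works can be checked by drawing twice after resetting the seed. The sketch below uses the standard library `random` module rather than NumPy, an assumption made only to keep the example dependency-free; the principle is the same. The helper name `draw` is hypothetical.

```python
import random

def draw(seed, count=5):
    """Seed the generator, then draw `count` integers from 0 to 10 inclusive."""
    random.seed(seed)
    return [random.randint(0, 10) for _ in range(count)]

first = draw(42)
second = draw(42)

print(first == second)  # True: the same seed reproduces the same sample
```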

### Statistics – Sampling Distributions

Your company makes coffee machines. Which example represents a Poisson process?

• The amount of time a store clerk spends with each customer.
• The results of employee performance reviews.
• The number of customers that visit a specific store within an interval.
• The results of the final inspection of coffee machines. 90% pass inspection and 10% fail.

The third example: a Poisson process counts events that occur within a fixed interval.

### Statistics – Sampling Distributions

You have a population that is divided into three categories:

• Category 1: 30% of the population
• Category 2: 36% of the population
• Category 3: 34% of the population

What is the process of randomly sampling from the population so that the final sample has 30% of participants in Category 1, 36% in Category 2 and 34% in Category 3 known as?

• Systematic sampling
• Random sampling
• Stratified sampling
• Clustered sampling

Stratified sampling: the population is split into strata (the categories) and each stratum is sampled at random in proportion to its share of the population.
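Proportionate stratified sampling can be sketched as follows. The population of 1000 numbered members, the sample size of 100 and the function name `stratified_sample` are hypothetical; only the 30% / 36% / 34% split comes from the question.

```python
import random

def stratified_sample(strata, sample_size, seed=None):
    """Draw a proportionate stratified sample.

    strata maps a category name to the list of its population members.
    """
    rng = random.Random(seed)
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Each stratum contributes in proportion to its population share
        quota = round(sample_size * len(members) / population_size)
        sample[name] = rng.sample(members, quota)
    return sample

# Hypothetical population of 1000 people split 30% / 36% / 34%
population = {
    "Category 1": list(range(0, 300)),
    "Category 2": list(range(300, 660)),
    "Category 3": list(range(660, 1000)),
}

sample = stratified_sample(population, sample_size=100, seed=42)
print({name: len(members) for name, members in sample.items()})
# {'Category 1': 30, 'Category 2': 36, 'Category 3': 34}
```

With shares that do not divide evenly, the rounded quotas may not add up exactly to the requested sample size, so a real implementation would need a rule for distributing the remainder.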

### Statistics – Sampling Distributions

Create an array of 100 numbers sampled from a Poisson distribution where the expected number of events per interval is 5. The random seed has been set for you. Complete the code to return the output.

```python
import numpy as np

# lam=5 is the expected number of events per interval; size=100 draws
np.random.poisson(5, 100)
```

Expected output:

```
array([ 5, 4, 4, 5, 5, 3, 5, 4, 6, 7, 2, 5, 5, 6, 4, 6, 6,
1, 7, 2, 11, 4, 3, 8, 3, 3, 5, 8, 3, 2, 5, 3, 8, 10,
3, 2, 5, 7, 6, 6, 2, 4, 9, 7, 11, 8, 3, 2, 3, 4, 5,
5, 4, 3, 8, 2, 1, 4, 5, 4, 3, 4, 8, 2, 2, 2, 7, 6,
7, 6, 7, 6, 4, 2, 3, 4, 7, 4, 3, 2, 6, 3, 5, 9, 6,
6, 9, 4, 9, 2, 10, 6, 9, 4, 1, 6, 8, 6, 2, 3])
```

### Statistics – Probability Theory

Create an array of 100 numbers sampled from a binomial distribution with 10 trials and a probability of success of 0.5. The random seed has been set for you.

```python
import numpy as np

# n=10 trials, p=0.5 probability of success, size=100 draws
np.random.binomial(10, 0.5, 100)
```

### Statistics – Correlation

Autocorrelation is:

• the correlation of a time series with another time series.
• the automatic detection of a relationship between two time series.
• the correlation of a time series with a lagged version of itself.
• the upward and downward movement of a time series.

The third definition is correct: autocorrelation is the correlation of a time series with a lagged version of itself.
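A lagged correlation can be computed by hand to make the definition concrete. This is a minimal sketch of one common estimator (it normalizes by the total sum of squared deviations); the function name `autocorrelation` and the example series are hypothetical.

```python
import statistics

def autocorrelation(series, lag):
    """Correlation of a series with a copy of itself shifted by `lag` steps."""
    mean = statistics.fmean(series)
    deviations = [x - mean for x in series]
    numerator = sum(a * b for a, b in zip(deviations, deviations[lag:]))
    denominator = sum(d * d for d in deviations)
    return numerator / denominator

# Hypothetical series that repeats every 4 steps
series = [0, 1, 0, -1] * 25

print(round(autocorrelation(series, 4), 2))  # 0.96: one full cycle later, values line up
print(round(autocorrelation(series, 2), 2))  # -0.98: half a cycle later, values are opposite
```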

## Machine Learning

### Machine Learning – Evaluation

What is this table called?

This is a confusion matrix. It is useful for evaluating the performance of a classification algorithm.
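A confusion matrix for a binary classifier can be built by counting pairs of actual and predicted labels. The labels below are made up for illustration, and the layout (rows for actual classes, columns for predicted classes) is one common convention rather than the only one.

```python
from collections import Counter

# Hypothetical true labels and model predictions (1 = positive, 0 = negative)
actual    = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

counts = Counter(zip(actual, predicted))
tp = counts[(1, 1)]  # true positives
fn = counts[(1, 0)]  # false negatives
fp = counts[(0, 1)]  # false positives
tn = counts[(0, 0)]  # true negatives

# Rows are actual classes, columns are predicted classes
print("            pred 1  pred 0")
print(f"actual 1    {tp:6d}  {fn:6d}")
print(f"actual 0    {fp:6d}  {tn:6d}")

accuracy = (tp + tn) / len(actual)
print(accuracy)  # 0.8
```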

### Machine learning – Features – PCA

You are working with a data set that contains a large number of features. Why might you use principal component analysis with this data?

To reduce dimensionality by combining features into a smaller set of components that best represent the underlying processes.

### Machine Learning techniques

Overfitting means we have built a model that performs well on the training data but generalizes poorly to new, unseen data.

### Machine Learning techniques

Select the most suitable description of classification in the context of machine learning.

Classification involves predicting to which category an observation belongs.

### Machine Learning techniques

Which is not a technique for improving the performance of a machine learning model?

Locally Linear Embedding

### Machine Learning techniques

A supermarket chain would like to determine whether there are interesting groups of items that their customers tend to buy together using their sales logs. This will allow for special offerings to create value for customers or simply placing the items together to make it easier to find. An example would be placing the sticky tape with the gift wrapping paper.

This requires an unsupervised ML system, since the sales logs contain no labeled outcome to predict.

### Machine Learning – Linear Model Interpretation

You have built a simple linear model to predict the total daily revenue from sales of books in your store. The estimates of the model coefficients are shown below. Based on these estimates what can you tell about total daily revenue from books?

```
Intercept: 1473.8
Sunny day indicator: -537.2
Number of bestsellers: 18.9
```

The daily revenue is predicted to increase when there are more bestsellers in stock: all else equal, each additional bestseller adds about 18.9 to the predicted daily revenue.
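The coefficients can be turned into predictions to see this effect. This is a sketch of how such a model would compute its output; the function name `predict_revenue` and the input values (a non-sunny day with 10 or 11 bestsellers) are hypothetical.

```python
# Coefficient estimates from the model summary
INTERCEPT = 1473.8
SUNNY_DAY = -537.2
BESTSELLERS = 18.9

def predict_revenue(sunny, n_bestsellers):
    """Predicted total daily revenue; sunny is 1 on a sunny day, else 0."""
    return INTERCEPT + SUNNY_DAY * sunny + BESTSELLERS * n_bestsellers

# Hypothetical inputs: a non-sunny day with 10 and then 11 bestsellers in stock
with_10 = predict_revenue(sunny=0, n_bestsellers=10)
with_11 = predict_revenue(sunny=0, n_bestsellers=11)

print(round(with_10, 1))            # 1662.8
print(round(with_11 - with_10, 1))  # 18.9: gain from one extra bestseller
```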

### Machine Learning – Linear Model Interpretation

Consider the summary below from a linear model, fitting the continuous target chol (cholesterol) with the feature sex (0 if female, 1 if male). What does this model suggest regarding the relationship between the variables chol and sex?

• Intercept coefficient: 261.30
• Sex variable coefficient: [-22.01]

Relationship: all else equal, females are predicted to have higher cholesterol than males, by `22.01` on average.
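Plugging both values of sex into the fitted equation shows where the difference comes from. The function name `predict_chol` is hypothetical; the coefficients are the ones listed in the summary.

```python
# Coefficient estimates from the model summary
INTERCEPT = 261.30
SEX_COEFFICIENT = -22.01

def predict_chol(sex):
    """Predicted cholesterol; sex is 0 for female, 1 for male."""
    return INTERCEPT + SEX_COEFFICIENT * sex

female = predict_chol(0)
male = predict_chol(1)

print(round(female, 2))         # 261.3
print(round(male, 2))           # 239.29
print(round(female - male, 2))  # 22.01
```

The intercept is the prediction for the reference group (females, coded 0), and the coefficient is the shift applied for males.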