# Regression
*In the second week of class, we take a closer look at regression models.*
## Linear Regression in depth
Last week, we were exposed to the Linear Regression model. Now, let's take it apart and look at it carefully.
### One Predictor
### Many Predictors
### Assumptions and Techniques
- Removing pairwise correlated predictors
- Identifying quadratic, cubic, or interaction terms in residual plots
- Identifying and removing outliers
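As a small illustration of the first technique, here is one way to flag pairwise correlated predictors with pandas. This is a minimal sketch on synthetic data; the column names and the 0.9 cutoff are arbitrary choices for the example:

```python
import numpy as np
import pandas as pd

# Toy data: x2 is nearly a copy of x1; x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.01, size=200),
    "x3": rng.normal(size=200),
})

# Keep only the upper triangle of the absolute correlation matrix,
# then flag any column whose correlation with an earlier column exceeds 0.9
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)
```

Dropping the flagged columns (here, `x2`) leaves one representative of each correlated pair.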
## Overfitting
## Population and Sample
The way we formulate a machine learning model is based on some fundamental concepts in inferential statistics. We will quickly review them in the context of our problem. Recall the following definitions:
**Population:** The entire collection of individual units that a researcher is interested in studying. For NHANES, this could be the entire US population.
**Sample:** A smaller collection of individual units that the researcher has selected to study. For NHANES, this could be a random sampling of the US population.
In Machine Learning problems, we often take two non-overlapping samples from the population: the **Training Set** and the **Test Set**. We **train** our model using the Training Set, which gives us a function $f()$ that relates the predictors to the outcome. The trained model then serves two main use cases:
1. **Classification and Prediction:** We use the trained model to classify or predict the outcome using predictors from the Test Set. We look at the Testing Error to evaluate the model.
2. **Inference**: We examine the function $f()$'s trained values, which are called **parameters**. Because these parameters are derived from the **Training Set**, they are an *estimated* quantity from a sample, similar to other summary statistics like the mean of a sample. Therefore, to say anything about the true population, we have to use statistical tools such as p-values and confidence intervals.
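As a minimal sketch of this workflow on synthetic data (using scikit-learn's `LinearRegression` for brevity here, rather than the statsmodels setup used later in this chapter):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Synthetic "population-like" data: one predictor, linear outcome plus noise
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

# Two non-overlapping samples: the Training Set and the Test Set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

# Train f() on the Training Set only
f = LinearRegression().fit(X_train, y_train)

# Use case 1: predict on the Test Set and compute the Testing Error
test_err = np.mean((f.predict(X_test) - y_test) ** 2)

# Use case 2: inspect the estimated parameters
print(f.coef_, f.intercept_, test_err)
```

The held-out `test_err` is the Testing Error from use case 1, and `f.coef_` / `f.intercept_` are the estimated parameters examined in use case 2.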
If the concepts of population, sample, estimation, p-value, and confidence interval are new to you, we recommend doing a bit of reading here: <https://www.nature.com/collections/qghhqm/pointsofsignificance>.
## Generalizability of a model
When we look at how the model performs in Classification and Prediction on the training and test data, we are asking how **generalizable** the model is. Usually we want our model to perform well on unseen testing data that also represents the population.
We saw in the first exercise that the Training and Testing Error looked different - why?
- There is always going to be some variation, because the Training and Test Sets are samples from the population, as we discussed above.
- The model was trained on the Training Set, so it will naturally predict better on data it has already seen than on the unseen Testing Set.
- How do we interpret the difference between Training Error and Testing Error?
To see how generalizable a model is, let's look at a problem in **prediction**, not classification. The underlying principles of **model generalizability** apply to both prediction and classification, but it is generally easier to visualize this concept in prediction models.
### Underfitting and Overfitting a Prediction Model
In a prediction model, the outcome is continuous. Instead of a yes/no $Hypertension$, we will use the clinical metric of Mean Arterial Pressure, stored as $BloodPressure$.
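Mean Arterial Pressure is derived from the systolic and diastolic blood pressure readings; the formula below matches how the $BloodPressure$ column is computed in the code that follows:

$$
BloodPressure = DBP + \frac{SBP - DBP}{3}
$$

where $SBP$ is the average systolic reading and $DBP$ the average diastolic reading.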
The prediction model we will use is called a **Linear Model**, which in its basic form draws a straight line of "best fit" through the data.
To evaluate the model, we will let it make predictions given the input variable. The prediction is on the scale of $BloodPressure$, so there is no need to draw a Decision Boundary in prediction problems. To examine the accuracy, we will look at the mean squared error between the true $BloodPressure$ and the predicted $BloodPressure$. Let's take a look.
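Written out, the mean squared error over $n$ observations is:

$$
MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2
$$

where $y_i$ is the true $BloodPressure$ and $\hat{y}_i$ is the model's prediction.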
```{python}
import pandas as pd
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import (confusion_matrix, accuracy_score)
import matplotlib.pyplot as plt
from formulaic import model_matrix
import statsmodels.api as sm
nhanes = pd.read_csv("classroom_data/NHANES.csv")
nhanes = nhanes.drop_duplicates()
nhanes['BloodPressure'] = nhanes['BPDiaAve'] + (nhanes['BPSysAve'] - nhanes['BPDiaAve']) / 3
nhanes['Hypertension'] = (nhanes['BPDiaAve'] > 80) | (nhanes['BPSysAve'] > 130)
nhanes_tiny = nhanes.sample(n=300, random_state=1)
y, X = model_matrix("BloodPressure ~ BMI", nhanes_tiny)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
model = sm.OLS(y_train, X_train)
linear_model = model.fit()
plt.clf()
fig, (ax1, ax2) = plt.subplots(2, layout='constrained')
ax1.plot(X_train.BMI, linear_model.predict(X_train), label="fitted line")
ax1.scatter(X_train.BMI, y_train, alpha=.5, color="brown", label="Training set")
ax1.set(xlabel='BMI', ylabel='Blood Pressure')
ax1.set_xlim(np.min(nhanes_tiny.BMI), np.max(nhanes_tiny.BMI))
ax1.set_ylim(np.min(nhanes_tiny.BloodPressure), np.max(nhanes_tiny.BloodPressure))
train_err = np.mean((linear_model.predict(X_train) - y_train.BloodPressure) ** 2)
ax1.set_title('Training Error: ' + str(round(train_err, 2)))
ax2.plot(X_test.BMI, linear_model.predict(X_test), label="fitted line")
ax2.scatter(X_test.BMI, y_test, alpha=.5, color="brown", label="Testing set")
ax2.set(xlabel='BMI', ylabel='Blood Pressure')
ax2.set_xlim(np.min(nhanes_tiny.BMI), np.max(nhanes_tiny.BMI))
ax2.set_ylim(np.min(nhanes_tiny.BloodPressure), np.max(nhanes_tiny.BloodPressure))
test_err = np.mean((linear_model.predict(X_test) - y_test.BloodPressure) ** 2)
ax2.set_title('Testing Error: ' + str(round(test_err, 2)))
plt.show()
```
We see that Training Error \< Testing Error.
It turns out that the relationship between Training Error and Testing Error is connected to the **Model Complexity.**
A model is more complex relative to another model if:
- It contains more predictors
- The relationship between a predictor and the response is fit via a higher order polynomial or smooth function
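Concretely, fitting "via a higher order polynomial" means expanding one predictor into several columns of the design matrix. A quick sketch with NumPy (note that the `poly()` transform from `formulaic` used below parameterizes the polynomial differently, but the idea is the same):

```python
import numpy as np

# Expand a single predictor x into the columns 1, x, x^2, x^3,
# so ordinary linear regression can fit a degree-3 curve
x = np.array([1.0, 2.0, 3.0])
design = np.vander(x, N=4, increasing=True)
print(design)
```

Each extra column adds one parameter to estimate, which is exactly what "more complex" means here.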
Let's look at what happens if we increase the complexity of the model by fitting it with a smoother function. We will use a polynomial of degree 2.
```{python}
p_degree = 2
y, X = model_matrix("BloodPressure ~ BMI + poly(BMI, degree=" + str(p_degree) + ")", nhanes_tiny)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
X_train_BMI = X_train.BMI
X_test_BMI = X_test.BMI
X_train.drop('BMI', axis=1, inplace=True)
X_test.drop('BMI', axis=1, inplace=True)
model = sm.OLS(y_train, X_train)
linear_model = model.fit()
plt.clf()
fig, (ax1, ax2) = plt.subplots(2, layout='constrained')
ax1.scatter(X_train_BMI, y_train, alpha=.5, color="brown", label="Training set")
ax1.scatter(X_train_BMI, linear_model.predict(), label="fitted line")
ax1.set(xlabel='BMI', ylabel='Blood Pressure')
train_err = np.mean((linear_model.predict(X_train) - y_train.BloodPressure) ** 2)
ax1.set_title('Training Error: ' + str(round(train_err, 2)))
ax1.set_xlim(np.min(nhanes_tiny.BMI), np.max(nhanes_tiny.BMI))
ax1.set_ylim(np.min(nhanes_tiny.BloodPressure), np.max(nhanes_tiny.BloodPressure))
ax2.scatter(X_test_BMI, y_test, alpha=.5, color="brown", label="Testing set")
ax2.scatter(X_test_BMI, linear_model.predict(X_test), label="fitted line")
ax2.set(xlabel='BMI', ylabel='Blood Pressure')
test_err = np.mean((linear_model.predict(X_test) - y_test.BloodPressure) ** 2)
ax2.set_title('Testing Error: ' + str(round(test_err, 2)))
ax2.set_xlim(np.min(nhanes_tiny.BMI), np.max(nhanes_tiny.BMI))
ax2.set_ylim(np.min(nhanes_tiny.BloodPressure), np.max(nhanes_tiny.BloodPressure))
fig.suptitle('Polynomial Degree: ' + str(p_degree))
plt.show()
```
We see that both the Training and Testing Error decreased slightly!
What happens if we keep increasing the model complexity?
```{python}
for p_degree in [4, 10]:
    y, X = model_matrix("BloodPressure ~ BMI + poly(BMI, degree=" + str(p_degree) + ")", nhanes_tiny)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
    X_train_BMI = X_train.BMI
    X_test_BMI = X_test.BMI
    X_train.drop('BMI', axis=1, inplace=True)
    X_test.drop('BMI', axis=1, inplace=True)
    model = sm.OLS(y_train, X_train)
    linear_model = model.fit()
    plt.clf()
    fig, (ax1, ax2) = plt.subplots(2, layout='constrained')
    ax1.scatter(X_train_BMI, y_train, alpha=.5, color="brown", label="Training set")
    ax1.scatter(X_train_BMI, linear_model.predict(), label="fitted line")
    ax1.set(xlabel='BMI', ylabel='Blood Pressure')
    train_err = np.mean((linear_model.predict(X_train) - y_train.BloodPressure) ** 2)
    ax1.set_title('Training Error: ' + str(round(train_err, 2)))
    ax1.set_xlim(np.min(nhanes_tiny.BMI), np.max(nhanes_tiny.BMI))
    ax1.set_ylim(np.min(nhanes_tiny.BloodPressure), np.max(nhanes_tiny.BloodPressure))
    ax2.scatter(X_test_BMI, y_test, alpha=.5, color="brown", label="Testing set")
    ax2.scatter(X_test_BMI, linear_model.predict(X_test), label="fitted line")
    ax2.set(xlabel='BMI', ylabel='Blood Pressure')
    test_err = np.mean((linear_model.predict(X_test) - y_test.BloodPressure) ** 2)
    ax2.set_title('Testing Error: ' + str(round(test_err, 2)))
    ax2.set_xlim(np.min(nhanes_tiny.BMI), np.max(nhanes_tiny.BMI))
    ax2.set_ylim(np.min(nhanes_tiny.BloodPressure), np.max(nhanes_tiny.BloodPressure))
    fig.suptitle('Polynomial Degree: ' + str(p_degree))
    plt.show()
```
Let's summarize it:
```{python}
train_err = []
test_err = []
polynomials = list(range(1, 10))
for p_degree in polynomials:
    if p_degree == 1:
        y, X = model_matrix("BloodPressure ~ BMI", nhanes_tiny)
    else:
        y, X = model_matrix("BloodPressure ~ poly(BMI, degree=" + str(p_degree) + ")", nhanes_tiny)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
    model = sm.OLS(y_train, X_train)
    linear_model = model.fit()
    train_err.append(np.mean((linear_model.predict(X_train) - y_train.BloodPressure) ** 2))
    test_err.append(np.mean((linear_model.predict(X_test) - y_test.BloodPressure) ** 2))
plt.clf()
plt.plot(polynomials, train_err, color="blue", label="Training Error")
plt.plot(polynomials, test_err, color="red", label="Testing Error")
plt.xlabel('Polynomial Degree')
plt.ylabel('Error')
plt.legend()
plt.show()
```
As our Polynomial Degree increased, the following happened:
- In the linear model, we see that the Training Error is fairly high, and the Testing Error is even higher. This is an example of **Underfitting**, where our model failed to capture the complexity of the data in both the Training and Testing Set.
- After degree 6, we see that the Training Error is low, but the Testing Error is huge! This is an example of **Overfitting**, in which our model fitted the shape of the training set so well that it fails to generalize to the testing set.
We want to find a model that is "just right", one that neither underfits nor overfits the data. Usually, as the model becomes more flexible, the Training Error keeps decreasing, while the Testing Error decreases a bit before increasing again. It seems that our ideal prediction model is around a polynomial of degree 5 or 6, where the Testing Error is minimal.
Here is another illustration of this phenomenon, using synthetic, controlled data:
The left panel shows the Training Data as black dots, along with three fitted models: a linear regression (orange line) and two models of increasing complexity (blue and green).
The right panel shows the training error (grey curve) and testing error (red curve), with a point marking where each of the three models lands, in matching colors.

Lastly, we recommend this interactive tutorial: [https://mlu-explain.github.io/bias-variance/](https://mlu-explain.github.io/bias-variance/)
Hopefully you are starting to see the importance of examining the Testing Error rather than the Training Error when evaluating a model. A highly flexible model will overfit the training data, making the Training Error look small, but it will not generalize to the Testing data, resulting in a large Testing Error.
### Underfitting and Overfitting a Classification Model
Let's do the same exercise for a Classification model. We won't be able to visualize each individual model fit clearly, but we can look at the classification Training and Testing Error.
```{python}
train_err = []
test_err = []
polynomials = list(range(1, 8))
for p_degree in polynomials:
    if p_degree == 1:
        y, X = model_matrix("Hypertension ~ BMI", nhanes_tiny)
    else:
        y, X = model_matrix("Hypertension ~ poly(BMI, degree=" + str(p_degree) + ")", nhanes_tiny)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
    logit_model = sm.Logit(y_train, X_train).fit()
    train_pred_cut = [1 if x >= .5 else 0 for x in logit_model.predict(X_train)]
    train_err.append(1 - accuracy_score(y_train, train_pred_cut))
    test_pred_cut = [1 if x >= .5 else 0 for x in logit_model.predict(X_test)]
    test_err.append(1 - accuracy_score(y_test, test_pred_cut))
plt.clf()
plt.plot(polynomials, train_err, color="blue", label="Training Error")
plt.plot(polynomials, test_err, color="red", label="Testing Error")
plt.xlabel('Polynomial Degree')
plt.ylabel('Error')
plt.legend()
plt.show()
```
## Inference
Now let's examine the function $f()$'s trained values, which are called **parameters**.
### Linear Model
For the Linear Model, the model we first fitted was of the following form:
$$
BloodPressure=\beta_0 + \beta_1 \cdot BMI
$$
which is an equation of a line.
$\beta_0$ is a parameter describing the intercept of the line, and $\beta_1$ is a parameter describing the slope of the line.
Suppose that from fitting the model on the Training Set, $\beta_1=2$. That means increasing $BMI$ by 1 is associated with an increase of 2 in $BloodPressure$. This measures the strength of association between a predictor and the outcome.
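For instance, pairing that slope with a hypothetical intercept of $\beta_0 = 70$, two adjacent BMI values give:

$$
\widehat{BloodPressure} = 70 + 2 \cdot 25 = 120, \qquad \widehat{BloodPressure} = 70 + 2 \cdot 26 = 122
$$

so a one-unit increase in $BMI$ moves the prediction up by exactly $\beta_1 = 2$.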
Let's see this in practice:
```{python}
y, X = model_matrix("BloodPressure ~ BMI", nhanes_tiny)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
linear_model = sm.OLS(y_train, X_train).fit()
linear_model.summary()
```
Based on the output, $\beta_0=69$ and $\beta_1=.55$. We also see the associated standard errors, p-values, and confidence intervals. It is necessary to report and interpret these because the parameters are derived from a sample of the data (the Training Set), so there is statistical uncertainty associated with them. For instance, the 95% confidence interval for the true population slope is (.22, .87).
### Logistic Regression
Let's do the same for our Logistic Regression Classifier model, which has an equation of:
$$
\frac{p(Hypertension)}{1-p(Hypertension)}=e^{\beta_0 + \beta_1 \cdot BMI}
$$
On the left-hand side of the equation is the **Odds** of having Hypertension.
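To build intuition for odds, here is a tiny sketch (the helper function `odds` is ours, not part of any library):

```python
# Odds of an event from its probability: p / (1 - p)
def odds(p: float) -> float:
    return p / (1 - p)

# A probability of 0.5 gives odds of 1 (a 50/50 event);
# higher probabilities give odds greater than 1
print(odds(0.5), odds(0.8))
```

The logistic model above expresses these odds as $e^{\beta_0 + \beta_1 \cdot BMI}$.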
$\beta_0$ is a parameter describing \_\_, and $\beta_1$ is a parameter describing \_\_\_
```{python}
y, X = model_matrix("Hypertension ~ BMI", nhanes_tiny)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
logit_model = sm.Logit(y_train, X_train).fit()
logit_model.summary()
```
## Appendix
Flexibility vs. Interpretability of models
{width="500"}