Unlike most implementations of linear models (e.g. Stata, R), Python packages don’t usually drop perfectly collinear variables.
Here’s Statsmodels as a first example (see https://github.com/statsmodels/statsmodels/issues/3824):
import numpy as np
import statsmodels.formula.api as smf
import pandas as pd

# random noise for the outcome
e = np.random.normal(size = 30)

# create two variables, x1 and x2,
# where x2 is just 2 times x1 (perfectly collinear)
x1 = np.arange(30)
x2 = 2 * x1
y = 2 * x1 + e

data = pd.DataFrame({"y": y, "x1": x1, "x2": x2})
model = smf.ols("y ~ x1 + x2", data = data)
res = model.fit()
res.summary()
| Dep. Variable: | y | R-squared: | 0.998 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.998 |
| Method: | Least Squares | F-statistic: | 1.354e+04 |
| Date: | Tue, 04 Jun 2019 | Prob (F-statistic): | 3.80e-39 |
| Time: | 20:23:08 | Log-Likelihood: | -35.185 |
| No. Observations: | 30 | AIC: | 74.37 |
| Df Residuals: | 28 | BIC: | 77.17 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 0.3681 | 0.288 | 1.277 | 0.212 | -0.222 | 0.959 |
| x1 | 0.3973 | 0.003 | 116.364 | 0.000 | 0.390 | 0.404 |
| x2 | 0.7946 | 0.007 | 116.364 | 0.000 | 0.781 | 0.809 |

| Omnibus: | 3.352 | Durbin-Watson: | 2.071 |
|---|---|---|---|
| Prob(Omnibus): | 0.187 | Jarque-Bera (JB): | 2.993 |
| Skew: | -0.715 | Prob(JB): | 0.224 |
| Kurtosis: | 2.407 | Cond. No. | 7.31e+16 |

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 8.01e-30. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
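
The singular design matrix flagged in warning [2] can be confirmed directly. When x2 is exactly 2 * x1, infinitely many coefficient pairs fit equally well, and the pair reported in the summary looks like the minimum-norm least-squares solution, which is what a pseudoinverse-based fit returns. Below is a minimal sketch of that reasoning, assuming a noise-free version of the example so the numbers come out exact. It uses NumPy's `lstsq` as a stand-in for the fitting routine; it is not Statsmodels' actual code path.

```python
import numpy as np

x1 = np.arange(30, dtype=float)
x2 = 2 * x1
y = 2 * x1  # assumption: noise-free outcome, so the result is exact

# Design matrix with an intercept column, as "y ~ x1 + x2" would build it
X = np.column_stack([np.ones_like(x1), x1, x2])

# lstsq returns the minimum-norm solution and the numerical rank of X
beta, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)

print(rank)  # 2: three columns, but only two are linearly independent
print(beta)  # approximately [0, 0.4, 0.8]
```

The total effect of 2 is split so that `beta[1] + 2 * beta[2]` equals 2 with the smallest possible coefficient vector, which gives 0.4 and 0.8. That matches the roughly 0.397 and 0.795 in the summary, up to noise.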
The popular machine learning package Scikit-Learn doesn't drop the collinear column either:
from sklearn.linear_model import LinearRegression

lm = LinearRegression()
lm.fit(X = data[["x1", "x2"]], y = data.y)
lm.coef_
array([0.39728922, 0.79457845])
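
If you want behaviour closer to R or Stata, one option is to drop redundant columns yourself before fitting. The sketch below is one possible approach, not something either library provides: `drop_collinear` is a hypothetical helper that walks the columns left to right and keeps a column only if it raises the rank of the columns kept so far. The seeded data is made up to mirror the example above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def drop_collinear(X: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep a column only if it adds to the rank."""
    kept = []
    rank = 0
    for col in X.columns:
        candidate = X[kept + [col]].to_numpy(dtype=float)
        new_rank = np.linalg.matrix_rank(candidate)
        if new_rank > rank:  # column is not a linear combination of kept ones
            kept.append(col)
            rank = new_rank
    return X[kept]


# Assumption: seeded noise so the example is reproducible
rng = np.random.default_rng(0)
x1 = np.arange(30)
data = pd.DataFrame({"x1": x1, "x2": 2 * x1, "y": 2 * x1 + rng.normal(size=30)})

X_reduced = drop_collinear(data[["x1", "x2"]])
lm = LinearRegression().fit(X_reduced, data["y"])

print(list(X_reduced.columns))  # ['x1']
print(lm.coef_)                 # a single slope close to 2
```

The helper ignores the intercept, so a constant column would not be caught. It also relies on `matrix_rank`'s default tolerance, which handles exact collinearity like this but may not flag columns that are only nearly collinear.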