I have run into a problem with the code provided for calculating multivariate p-values in the Machine Learning course. I downloaded the code from the linked course and copied and pasted it directly. Everything runs, and I get an output for all the p-values.
However, when I did the Multiple Linear Regression – Exercise (using sklearn), I get very different p-values than what is suggested in the solutions. In fact, I also get very different answers than what is suggested using the same data set in statsmodels.api or in R.
I think there may be a problem with the calculation of the standard error or the SSE in the downloadable Python code (sklearn – How to properly include p-values.ipynb). I don’t know a whole lot about Python (or linear algebra, honestly), but when I do the same analysis in R I get very different standard errors for the coefficients.
Maybe I am doing something wrong, but I don’t have any idea what it might be.
Any help would be appreciated. If I didn’t explain well enough, please let me know!
Minimal reproducible example:
The Python code is all downloadable from your course. Looking at the SE, t-values, and p-values, I got:
[[ 18.44863049 526.67425796]]
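The same analysis in R:
library(readr)  # provides read_csv
data <- read_csv('real_estate_price_size_year.csv')
y <- data$price
x1 <- data$size
x2 <- data$year
model <- lm(y ~ x1 + x2)
summary(model)  # prints the coefficient table below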
             Estimate    Std. Error  t value  Pr(>|t|)
(Intercept)  -5.772e+06  1.583e+06   -3.647   0.000429
x1            2.277e+02  1.247e+01   18.254   < 2e-16
x2            2.917e+03  7.859e+02    3.711   0.000344
All of these values are also the same using statsmodels.api.
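For reference, here is a minimal sketch of the textbook OLS calculation of coefficient standard errors, t-values, and p-values. It should be mathematically equivalent to what R's lm and statsmodels report. The function name ols_summary is hypothetical, and the data is synthetic (it only imitates the price, size, and year columns), because the CSV is not reproduced here. It is not the course notebook's code. The two details that most often cause a mismatch are leaving the intercept column out of the design matrix and dividing the SSE by the wrong degrees of freedom.

```python
import numpy as np
from scipy import stats


def ols_summary(X, y):
    """Return (coefficients, standard errors, t-values, p-values) for OLS with an intercept.

    Hypothetical helper for illustration; the first entry of each array is the intercept.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape

    # Include the intercept column in the design matrix, so that
    # (X'X)^-1 accounts for the intercept as well as the features.
    Xd = np.column_stack([np.ones(n), X])

    # Least-squares coefficients
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)

    # Residual variance: SSE divided by n - k - 1 (not by n)
    resid = y - Xd @ beta
    dof = n - k - 1
    sigma2 = resid @ resid / dof

    # Standard errors are the square roots of the diagonal of sigma^2 (X'X)^-1
    cov = sigma2 * np.linalg.inv(Xd.T @ Xd)
    se = np.sqrt(np.diag(cov))

    # Two-sided p-values from the t distribution with n - k - 1 degrees of freedom
    t = beta / se
    p = 2 * stats.t.sf(np.abs(t), dof)
    return beta, se, t, p


# Synthetic stand-in data (assumption: NOT the real course data set)
rng = np.random.default_rng(0)
size = rng.uniform(400, 1500, 100)
year = rng.integers(2006, 2019, 100).astype(float)
price = -5.0e6 + 230 * size + 2500 * year + rng.normal(0, 50000, 100)

beta, se, t, p = ols_summary(np.column_stack([size, year]), price)
for name, b, s, tv, pv in zip(["intercept", "size", "year"], beta, se, t, p):
    print(f"{name:>9}: coef={b:.4g}  se={s:.4g}  t={tv:.3f}  p={pv:.3g}")
```

Running the same function on the real CSV (features size and year, target price) and comparing the se column with the R output above would show whether the difference comes from the standard error step.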