Posted on: 14 Dec 2020


Code for Multivariate P-values

I have run into a problem with the code provided in the Machine Learning course for calculating multivariate p-values. I downloaded the code from the linked course and copied and pasted it directly. Everything runs, and I get output for all the p-values.

However, when I did the Multiple Linear Regression - Exercise (using sklearn), I got p-values very different from those in the solutions. They also differ from what statsmodels.api and R give for the same data set.

I think there may be a problem with the calculation of the standard error or the SSE in the downloadable Python code (sklearn - How to properly include p-values.ipynb). I don't know a whole lot about Python (or linear algebra, honestly), but when I do the same analysis in R I get very different standard errors for the coefficients.
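One way to test this suspicion is to compute the coefficient standard errors by hand with NumPy. The sketch below is only an illustration: it uses synthetic data rather than the course CSV, and `coef_std_errors` is a hypothetical helper name. It compares the standard errors from sigma^2 * (X'X)^-1 when the design matrix includes an intercept column with those obtained when that column is left out. Whether the course notebook actually omits the intercept column is an assumption, not something confirmed here.

```python
import numpy as np

# Synthetic stand-in data (NOT the course CSV): size, year, and a noisy price.
rng = np.random.default_rng(0)
n = 100
size = rng.uniform(500, 1500, n)
year = rng.integers(2006, 2019, n).astype(float)
price = -5.7e6 + 228.0 * size + 2900.0 * year + rng.normal(0, 50_000, n)

X = np.column_stack([size, year])            # predictors only, as passed to sklearn
X_full = np.column_stack([np.ones(n), X])    # same predictors plus an intercept column

# Ordinary least squares with an intercept.
beta, *_ = np.linalg.lstsq(X_full, price, rcond=None)
residuals = price - X_full @ beta
dof = n - X_full.shape[1]                    # n - (predictors + intercept)
sigma2 = residuals @ residuals / dof         # residual variance estimate


def coef_std_errors(design, sigma2):
    """Hypothetical helper: sqrt of the diagonal of sigma2 * (design' design)^-1."""
    return np.sqrt(np.diag(sigma2 * np.linalg.inv(design.T @ design)))


se_full = coef_std_errors(X_full, sigma2)    # order: [intercept, size, year]
se_naive = coef_std_errors(X, sigma2)        # order: [size, year], no intercept column

print("with intercept column:   ", se_full[1:])
print("without intercept column:", se_naive)
```

On data shaped like this, dropping the intercept column barely changes the standard error for size but makes the one for year far too small, because the year values sit far from zero and vary little.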

Maybe I am doing something wrong, but I don't have any idea what it might be.

Any help would be appreciated. If I didn't explain well enough, please let me know!


Minimal reproducible example:

The Python code is all downloadable from your course. Looking at the SE, t-values, and p-values (in that order), I got:


[[12.34242586  5.53812016]]


[[ 18.44863049 526.67425796]]


[0. 0.]


In R:

library(readr)  # provides read_csv
data <- read_csv('real_estate_price_size_year.csv')
y <- data$price
x1 <- data$size
x2 <- data$year
model <- lm(y~x1+x2)
summary(model)  # prints the coefficient table


             Estimate    Std. Error  t value  Pr(>|t|)
(Intercept)  -5.772e+06  1.583e+06   -3.647   0.000429
x1            2.277e+02  1.247e+01   18.254   < 2e-16
x2            2.917e+03  7.859e+02    3.711   0.000344

All of these values are also the same using statsmodels.api.
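A quick arithmetic check on the numbers quoted above helps narrow down where the two results diverge. Since t = estimate / standard error, multiplying each Python t-value by its Python SE recovers the coefficient the Python code must have used. The sketch assumes the Python arrays are ordered (size, year), matching x1 and x2 in R, and that the first array is the SE and the second the t-values.

```python
# Values copied from the outputs quoted above.
# Assumption: the Python arrays are ordered (size, year).
r_estimate = {"size": 2.277e+02, "year": 2.917e+03}
r_std_error = {"size": 1.247e+01, "year": 7.859e+02}
py_std_error = {"size": 12.34242586, "year": 5.53812016}
py_t_value = {"size": 18.44863049, "year": 526.67425796}

implied_coef = {}   # coefficient implied by the Python output (t * SE)
se_ratio = {}       # R standard error divided by Python standard error

for name in ("size", "year"):
    implied_coef[name] = py_t_value[name] * py_std_error[name]
    se_ratio[name] = r_std_error[name] / py_std_error[name]
    print(
        f"{name}: implied coef = {implied_coef[name]:.2f}, "
        f"R estimate = {r_estimate[name]:.2f}, "
        f"SE ratio (R / Python) = {se_ratio[name]:.2f}"
    )
```

The implied coefficients agree with R's estimates to within rounding, so the fitted model itself appears to be the same. The discrepancy lies in the standard errors, and almost entirely in the one for year.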

