I am just starting on linear regression and I’m not too sure about this part.
y = data['GPA']
x1 = data[['SAT', 'Attendance']]
which aggregates the two independent variables into x1.
But why do we need this line?
x = sm.add_constant(x1)
We’re doing it because we assume in our model that:
GPA = c + SAT * a_1 + Attendance * a_2 + e
In this case, we have a constant term, SAT score, Attendance and a residual. Hence, before we regress GPA on SAT and Attendance, we need to make sure we include a constant factor among the exogenous variables. Then, the regression will calculate the values for the constant term (c) as well as the coefficients for SAT and Attendance (a_1 and a_2).
The idea is that there is often some baseline GPA value that can’t be explained by changes in SAT or Attendance, which is why we add the constant term. We say often rather than always because the constant can turn out to be non-significant, that is, not statistically distinguishable from 0. That being said, we prefer to assume a constant exists and find out it’s not significant, rather than assume there is no constant and have the regression attribute the shift in values to the other coefficients (a_1 and a_2).
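The last point can be illustrated with a small experiment. This is only a sketch on synthetic data (the true constant of 0.5 and the other coefficients are assumptions made up for the demo), and it uses plain NumPy least squares as a stand-in for sm.OLS so that the column of ones is visible explicitly.

```python
import numpy as np

# Synthetic data, invented for illustration; the assumed true constant is 0.5.
rng = np.random.default_rng(1)
n = 500
sat = rng.uniform(1400, 2100, n)
attendance = rng.integers(0, 2, n).astype(float)
gpa = 0.5 + 0.0015 * sat + 0.2 * attendance + rng.normal(0, 0.1, n)

# Model without a constant: GPA = SAT * a_1 + Attendance * a_2 + e
X_no_const = np.column_stack([sat, attendance])
# Model with a constant: a leading column of ones, like add_constant builds
X_const = np.column_stack([np.ones(n), sat, attendance])

coef_no_const, *_ = np.linalg.lstsq(X_no_const, gpa, rcond=None)
coef_const, *_ = np.linalg.lstsq(X_const, gpa, rcond=None)

print("without constant (a_1, a_2):", coef_no_const)
print("with constant (c, a_1, a_2):", coef_const)
```

In the fit without a constant, the SAT coefficient comes out noticeably larger than the 0.0015 used to generate the data, because it has to absorb the missing constant. With the column of ones included, a_1 lands close to 0.0015 and the constant is estimated separately.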
Hope this helps!