When I finished this chapter, R-squared was pretty clear to me. However, I have now moved on to learning the principles of data science, where R-squared is defined as 1 – SSR/SST.
Could you please explain?
As far as I understand, the concept is clear to you, but in our lecture we define R-squared as SSR / SST, while according to another source it is 1 – SSR / SST. Is that correct?
That’s a valid question.
In both cases, what is meant is that R-squared = Variability explained / Total variability.
1. Now, according to our framework, SST = Sum of Squares Total; SSR = Sum of Squares Regression; SSE = Sum of Squares Error
In that case, R-squared = SSR / SST, or R-squared = 1 – SSE/SST
2. Unfortunately, there is a different notation that you may come across in some books. Some sources use these abbreviations:
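TSS (SST) = Total Sum of Squares; RSS (SSR) = Residual Sum of Squares; ESS (SSE) = Explained Sum of Squares (sometimes read as Estimation Sum of Squares)
You can see how this is a problem: residuals (conceptually) mean error, so in this notation SSR denotes what our lectures call SSE. And the explained, or estimated, variability (at least in the case of regression analysis) is the regression part, so in this notation SSE denotes what our lectures call SSR.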
Using this notation, you can state: R-squared = Variability explained / Total variability = ESS / TSS or, as you saw it, 1 – RSS / TSS (written in your source as 1 – SSR / SST).
3. Some people even define SSR = Sum of Squares Regression Error, stating that SSR stands for the sum of errors. This third abbreviation in my opinion is the most misleading of them all. Do you even need to say the word ‘regression’ here? That to me is basically saying: ‘You can’t make me use your notation. I prefer my own.’
Conceptually, the three notations describe the same quantities. Unfortunately, the same abbreviation can mean opposite things depending on which notation the author uses.
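If it helps to see this with numbers, here is a minimal sketch in plain Python. The data points are made up purely for illustration (they are not from the lecture), and the variable names `total`, `explained`, and `residual` are my own, chosen to avoid the ambiguous abbreviations. It assumes a simple linear regression with an intercept, fitted by ordinary least squares, which is the setting in which the two formulas agree.

```python
# Made-up illustrative data (an assumption, not taken from the course).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Closed-form ordinary least squares fit with an intercept.
slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sum(
    (xi - x_mean) ** 2 for xi in x
)
intercept = y_mean - slope * x_mean
y_hat = [slope * xi + intercept for xi in x]

# Total variability: SST in the lectures, TSS in the second notation.
total = sum((yi - y_mean) ** 2 for yi in y)
# Explained variability: SSR in the lectures, ESS in the second notation.
explained = sum((yh - y_mean) ** 2 for yh in y_hat)
# Unexplained variability: SSE in the lectures, RSS (or "SSR") elsewhere.
residual = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

r2_from_explained = explained / total      # "SSR / SST" in the lectures
r2_from_residual = 1 - residual / total    # "1 - SSR / SST" in the other source

print(r2_from_explained, r2_from_residual)
```

Both printed values should match up to floating-point rounding, because total = explained + residual for this kind of fit. Only the labels attached to the two pieces differ between sources.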
I have seen the first notation (the one from our lectures) used much more often than the others. When creating the R-squared lecture, I put in the extra effort to research how often each of them is used, as I anticipated some confusion. Most sources were using the first notation, so I stuck with it. I like the second one, but only when the author uses TSS, RSS, and ESS as abbreviations. That makes it clear which framework the author is using.
In any case, now you know about this ridiculous confusion in statistics. In the material you are using, just assume that SSR and SSE have switched places. Everything else should be the same.
The 365 Team