You may be wondering what all of those sums of squares are about. Maybe that’s what got you here in the first place. Well, they are what determines how good a linear regression is. This tutorial is based on the ANOVA framework, which you may have heard of before.
Before reading it, though, make sure you are not mistaking regression for correlation. If you’ve got this checked, we can get straight into the action.
A quick side-note: Want to learn more about linear regression? Check out our explainer videos The Linear Regression Model. Geometrical Representation and The Simple Linear Regression Model.
The 3 Sums of Squares
There are three terms we must define: the sum of squares total, the sum of squares regression, and the sum of squares error.
What is the SST?
The sum of squares total, denoted SST, is the sum of the squared differences between the observed values of the dependent variable and their mean. You can think of this as the dispersion of the observed values around the mean, much like the variance in descriptive statistics.
It is a measure of the total variability of the dataset.
Side note: There is another notation for the SST. It is TSS or total sum of squares.
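As a quick illustration, here is a minimal Python sketch that computes the SST by hand and checks its link to the sample variance. The data values and variable names are made up purely for this example.

```python
from statistics import mean, variance

# Hypothetical observations of the dependent variable (illustrative only).
y = [2.0, 4.0, 5.0, 4.0, 5.0]

y_bar = mean(y)

# SST: sum of the squared deviations of each observation from the mean.
sst = sum((yi - y_bar) ** 2 for yi in y)
print(sst)  # 6.0

# The sample variance is the SST divided by (n - 1),
# so multiplying it back by (n - 1) recovers the SST.
sample_var = variance(y)
print(sample_var * (len(y) - 1))
```

The only difference between the two quantities is the scaling by n - 1, which is why the SST behaves like an unscaled variance.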
What is the SSR?
The second term is the sum of squares due to regression, or SSR. It is the sum of the squared differences between the predicted values and the mean of the dependent variable. Think of it as a measure that describes how well our line fits the data.
If this value of SSR is equal to the sum of squares total, it means our regression model captures all the observed variability and is perfect. Once again, we have to mention that another common notation is ESS or explained sum of squares.
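To see the "perfect model" case in practice, the sketch below fits a least-squares line to made-up data that lie exactly on a straight line and compares the two sums. The helper `fit_line` is a hypothetical name written for this example, not part of any library.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Made-up data lying exactly on the line y = 1 + 2x (no noise).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0 + 2.0 * xi for xi in x]

intercept, slope = fit_line(x, y)
y_hat = [intercept + slope * xi for xi in x]
y_bar = sum(y) / len(y)

# SSR: squared distances of the predictions from the mean.
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
# SST: squared distances of the observations from the mean.
sst = sum((yi - y_bar) ** 2 for yi in y)

print(ssr, sst)  # the two values coincide (up to rounding) for noise-free data
```

With noisy data the predictions no longer match the observations, and the SSR falls below the SST.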
What is the SSE?
The last term is the sum of squares error, or SSE. The error is the difference between the observed value and the predicted value, and the SSE is the sum of the squares of these errors.
We usually want to minimize the error. The smaller the error, the better the estimation power of the regression. Finally, I should add that it is also known as RSS or residual sum of squares. Residual as in: remaining or unexplained.
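The idea of minimizing the error can be sketched in a few lines of Python. The example below computes the SSE for the least-squares line and for an arbitrary alternative line on the same made-up data; the function name `sse` and the alternative coefficients are assumptions chosen for illustration.

```python
# Made-up data (illustrative only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

def sse(intercept, slope):
    """Sum of squared errors of the line y = intercept + slope * x."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# Ordinary least-squares estimates, computed by hand.
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

sse_ols = sse(intercept, slope)   # error of the least-squares line
sse_other = sse(2.5, 0.5)         # error of an arbitrary alternative line

print(sse_ols, sse_other)  # the least-squares line has the smaller SSE
```

Any other choice of intercept and slope gives an SSE at least as large as the least-squares one, which is exactly what "least squares" means.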
The Confusion between the Different Abbreviations
It becomes really confusing because some people denote the residual sum of squares as SSR. This makes it unclear whether we are talking about the sum of squares due to regression or the sum of squared residuals.
In any case, neither notation is universally adopted, so the confusion remains and we’ll have to live with it.
Simply remember that the two sets of notation are SST, SSR, SSE and TSS, ESS, RSS.
There’s a conflict regarding the abbreviations, but not about the concept and its application. So, let’s focus on that.
How Are They Related?
Mathematically, SST = SSR + SSE. This holds for a least-squares regression that includes an intercept.
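For readers who want to see where this identity comes from, here is a short derivation sketch. The symbols are assumptions introduced for this note: y_i is an observed value, \hat{y}_i the corresponding predicted value, and \bar{y} the mean of the observed values. The last step relies on the line being fitted by ordinary least squares with an intercept.

```latex
\begin{align*}
y_i - \bar{y} &= (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i) \\[4pt]
\underbrace{\sum_{i=1}^{n} (y_i - \bar{y})^2}_{\text{SST}}
  &= \underbrace{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}_{\text{SSR}}
   + \underbrace{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}_{\text{SSE}}
   + 2 \sum_{i=1}^{n} (\hat{y}_i - \bar{y})(y_i - \hat{y}_i)
\end{align*}
```

For a least-squares fit with an intercept, the residuals sum to zero and are uncorrelated with the predicted values, so the cross term vanishes and only SSR + SSE remains.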
The rationale is the following: the total variability of the data set is equal to the variability explained by the regression line plus the unexplained variability, known as error.
Given a constant total variability, a lower error means a better regression. Conversely, a higher error means a less powerful regression. And that’s what you must remember, no matter the notation.
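The relationship can be checked numerically. The following sketch uses made-up data and a hypothetical helper, `sums_of_squares`, to compute all three terms for a least-squares line and, for contrast, for a line that was not fitted by least squares.

```python
# Made-up data (illustrative only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

def sums_of_squares(intercept, slope):
    """Return (SST, SSR, SSE) for the line y = intercept + slope * x."""
    y_hat = [intercept + slope * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    return sst, ssr, sse

# Ordinary least-squares estimates, computed by hand.
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

sst, ssr, sse = sums_of_squares(intercept, slope)
print(sst, ssr + sse)      # equal (up to rounding) for the least-squares line

# An arbitrary line that was NOT fitted by least squares.
sst2, ssr2, sse2 = sums_of_squares(2.5, 0.5)
print(sst2, ssr2 + sse2)   # the two sides no longer match
```

The second case is a reminder that the decomposition is a property of the least-squares fit with an intercept, not of any line drawn through the data.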
Well, if you are not sure why we need all those sums of squares, we have just the right tool for you: the R-squared. Care to learn more? Just dive into the linked tutorial, where you will see how it measures the explanatory power of a linear regression!
Next Tutorial: Measuring Variability with the R-squared