Resolved: Is WCSS an average of the sum of squares of the points within a cluster?
Good morning. It's not very clear how WCSS is calculated: it stands for "within-cluster sum of squares", so if we have, for instance, 3 clusters containing respectively 13 points, 18 points and 40 points, we first calculate the sum s_1 of the squares of the distances between every possible pair of the 13 points in the first cluster (so we have to sum binom(13,2) = 78 squares), then we do the same with the other 2 clusters, obtaining s_2 and s_3. Finally we calculate the average of s_1, s_2 and s_3 to obtain WCSS. Am I right?
Another thing: since the distances are always positive, why do we need to sum the squares? Is it maybe to make comparisons easier?
Thank you for reaching out!
In my explanation below, I will use the numbers you've provided as well as the notation used in sklearn's documentation.
1. The total number of samples is N = 13+18+40 = 71. The set of all samples is denoted by X, while each sample is denoted by x_1, x_2, ..., x_71.
2. The samples are distributed (for example) into K = 3 clusters. The set of all clusters is denoted by C, while each individual cluster is denoted by c_1, c_2, c_3.
3. The number of samples in the first cluster is 13, in the second 18, and in the third one 40.
4. Each cluster is described by a centroid, namely mu_1, mu_2, mu_3.
5. We find the squared distances between each point in a cluster and the respective centroids.
That is, the squared distance between mu_1 and each of the 13 points in c_1 is calculated. All 13 squared distances are summed up (let's call this sum S_1). Then, the squared distance between mu_2 and each of the 18 points in c_2 is calculated. All 18 squared distances are summed up (let's call this sum S_2). Finally, the squared distance between mu_3 and each of the 40 points in c_3 is calculated. All 40 squared distances are summed up (let's call this sum S_3).
6. The WCSS is then represented by the sum S_1+S_2+S_3.
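The steps above can be sketched numerically. Here is a minimal illustration using numpy; the toy data, the cluster locations and the random seed are my own, chosen only to match the 13/18/40 example:

```python
import numpy as np

# Toy data: three clusters of 13, 18 and 40 two-dimensional samples (N = 71),
# generated around three well-separated centres (my own choice of centres).
rng = np.random.default_rng(0)
clusters = [
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(13, 2)),
    rng.normal(loc=[5.0, 0.0], scale=0.5, size=(18, 2)),
    rng.normal(loc=[0.0, 5.0], scale=0.5, size=(40, 2)),
]

wcss = 0.0
for c_k in clusters:
    mu_k = c_k.mean(axis=0)            # centroid mu_k of cluster c_k
    S_k = ((c_k - mu_k) ** 2).sum()    # S_k: sum of squared distances to mu_k
    wcss += S_k                        # WCSS = S_1 + S_2 + S_3

print(wcss)
```

This sum (not an average over clusters, and not a sum over pairs of points) is what sklearn's KMeans reports as the `inertia_` attribute after fitting.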
Working with squared distances is standard practice, and you're right that positivity alone doesn't explain it, since plain distances are already non-negative.
The real reason is that k-means is built around the squared Euclidean distance: with squared distances, the point that minimizes a cluster's sum is exactly the arithmetic mean of its samples, which is what makes the centroid-update step of the algorithm a simple average. If we summed plain (unsquared) distances instead, the optimal centre would be the geometric median, a different point with no closed-form expression. Skipping the final square root also saves computation.
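To make the role of the square concrete, here is a small 1-D numpy sketch (the data points are my own toy example). It scans candidate centres and shows that the sum of squared distances is minimized by the mean (the k-means centroid), while the sum of plain distances is minimized by the median, so the two objectives genuinely pick different centres:

```python
import numpy as np

x = np.array([0.0, 1.0, 10.0])            # three 1-D points (toy example)
candidates = np.linspace(-5, 15, 20001)   # grid of candidate centres, step 0.001

# Objective values for each candidate centre under the two criteria.
sq_obj  = [((x - c) ** 2).sum() for c in candidates]  # sum of squared distances
abs_obj = [np.abs(x - c).sum()  for c in candidates]  # sum of plain distances

best_sq  = candidates[np.argmin(sq_obj)]   # ≈ mean(x)   = 3.666...
best_abs = candidates[np.argmin(abs_obj)]  # ≈ median(x) = 1.0

print(best_sq, best_abs)
```

The outlier at 10 pulls the squared-distance centre towards it much more strongly, which is another practical consequence of the choice of objective.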
Hope this helps!
Thanks, very clear.
Just one thing: I said that we WOULDN'T need to sum the squares of the distances, because the distances are already positive quantities! So I asked why we don't simply sum the distances instead of the squares of the distances. I had supposed that we sum the squares just to obtain bigger numbers that are easier to compare with other numbers, but I don't think that's a very good reason.