29 Sept 2022

Posted on:

28 Sept 2022

# Couldn't we choose the number of clusters that minimizes the product WCSS*number_of_clusters?

Good morning. I was thinking that a more scientific way to find the "elbow", instead of looking for it on the plot, could be to list the values of WCSS(i)*i, where i is the number of clusters (from 1 up to some maximum). The number of clusters i we choose is the one that minimizes that product. Since WCSS is usually much bigger than i, I would actually try to minimize WCSS(i)*i^3, or maybe WCSS(i)*i^10, or something similar, depending on how much we care about keeping the number of clusters low.
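A minimal sketch of the idea, in case it helps the discussion. It assumes the WCSS values for i = 1, 2, ... have already been computed (for example, in scikit-learn they are available as the `inertia_` attribute of a fitted `KMeans`). The function names `penalized_scores` and `choose_k` and the numbers in `example_wcss` are made up purely for illustration; they are not results from any real dataset.

```python
def penalized_scores(wcss, n=1.0):
    """Return WCSS(i) * i**n for i = 1..len(wcss).

    `wcss` is a list where wcss[0] is the WCSS for 1 cluster,
    wcss[1] the WCSS for 2 clusters, and so on.
    """
    return [w * (i ** n) for i, w in enumerate(wcss, start=1)]


def choose_k(wcss, n=1.0):
    """Return the number of clusters i that minimizes WCSS(i) * i**n."""
    scores = penalized_scores(wcss, n)
    best_index = min(range(len(scores)), key=scores.__getitem__)
    return best_index + 1  # list index 0 corresponds to i = 1


# Hypothetical WCSS values for i = 1..6 (illustrative only, not real data).
example_wcss = [1000.0, 400.0, 150.0, 120.0, 100.0, 90.0]

print(choose_k(example_wcss, n=1))  # exponent n = 1
print(choose_k(example_wcss, n=3))  # exponent n = 3
```

With these made-up numbers, n = 1 selects 3 clusters while n = 3 selects 1 cluster, so the choice of the exponent matters a lot for the answer.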

Instructor
Posted on:

29 Sept 2022

Hi Alessandro,
thanks for reaching out! Your proposal is an interesting alternative to the elbow method, thank you for sharing it.
Could you elaborate a bit more on your proposed solution? The following two questions will be important:
What exactly is the subject of your minimization?
Also, what is the complexity of your approach?
Thanks!

Best,
365 Eli

Posted on:

29 Sept 2022

If by "the subject of my minimization" you mean what I want to minimize, the answer is the product WCSS(i)*i^n, with n an appropriate positive number. The value of i that minimizes this product is what we take as number_of_clusters. By default n is 1, but since WCSS usually takes very large values, it could be useful to increase n if we don't want to risk ending up with too many clusters. In other words, the more we care about keeping number_of_clusters low, the larger n should be. If WCSS(i) were a continuous, differentiable function, my approach would consist of taking the derivative of WCSS(i)*i^n with respect to i, setting it equal to 0 and solving for i. The reason for minimizing the product of those two quantities is simply that we want both WCSS and i to be as low as possible.
What do you mean by "what is the complexity of my approach"? If you're asking whether this approach is too complicated to implement, the answer is no: it's just a list of values. The only "complicated" part is choosing the right n.
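To spell out the continuous case, here is a sketch of the derivation. It assumes WCSS can be treated as a positive, differentiable function $W(i)$ of a real variable $i > 0$; the symbol $W$ is just shorthand introduced here.

$$
\frac{d}{di}\left[ W(i)\, i^{n} \right] = W'(i)\, i^{n} + n\, W(i)\, i^{n-1} = 0
\quad\Longrightarrow\quad
i\, W'(i) = -n\, W(i)
\quad\Longrightarrow\quad
\frac{d \ln W}{d \ln i} = -n
$$

Under those assumptions, a stationary point of the product is where the slope of WCSS on a log-log plot equals $-n$. This is only a necessary condition; whether that point is actually a minimum still has to be checked.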
Thanks