PCA Categorical Variables
My understanding was that K-means and PCA can be applied only to continuous variables, but in this course they seem to be applied to a dataset with categorical variables such as marital status, education and occupation. Can you please help me understand to what extent using categorical variables would still give meaningful results with K-means and PCA? Thanks!
Hi gowa,
thanks for reaching out! You're correct that K-means and PCA are best used with numerical variables: K-means works with Euclidean distances and cluster means, and PCA with variances and covariances, none of which is naturally defined for categories. There is no theoretical guarantee that K-means will deliver good results in the case of categorical variables. Nonetheless, in practice K-means and PCA can give good results with categorical variables as well. For the algorithms to run at all, the categories have to be encoded as numbers, typically integers. In our case specifically, though we have categorical variables, some of them also have a natural ordering, like the settlement size, so their integer codes carry some meaning. Sometimes the algorithms are able to separate the observations along the categorical variables. When it comes to clustering in practice, you can choose which variables to put into the algorithm and decide which combination makes the most sense based on your results. Some people would say never use categorical variables, but sometimes purely numerical data is not available, so we can only try to solve the problem with the resources we have.
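To make the encoding point a bit more concrete, here is a minimal sketch in plain Python. It is not the course notebook: the category labels and the helper names (`ordinal_encode`, `one_hot_encode`, `euclidean`) are made up for illustration. It shows how the choice of encoding changes the Euclidean distances that K-means relies on.

```python
import math

# Hypothetical category levels, for illustration only (not the course's actual labels).
SETTLEMENT_LEVELS = ["small", "medium", "large"]    # ordered: small < medium < large
MARITAL_LEVELS = ["single", "married", "divorced"]  # no natural order


def ordinal_encode(value, ordered_levels):
    """Map an ordered category to its rank (0, 1, 2, ...)."""
    return float(ordered_levels.index(value))


def one_hot_encode(value, levels):
    """Map a category to a 0/1 indicator vector with one slot per level."""
    return [1.0 if value == level else 0.0 for level in levels]


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Integer codes on an ordered variable: the distances follow the ordering.
small = [ordinal_encode("small", SETTLEMENT_LEVELS)]
medium = [ordinal_encode("medium", SETTLEMENT_LEVELS)]
large = [ordinal_encode("large", SETTLEMENT_LEVELS)]
d_small_medium = euclidean(small, medium)
d_small_large = euclidean(small, large)

# Integer codes on an unordered variable: "divorced" ends up twice as far
# from "single" as "married" is, which is an artifact of the coding.
d_int_single_married = euclidean([0.0], [1.0])
d_int_single_divorced = euclidean([0.0], [2.0])

# One-hot codes on the same unordered variable: all pairs are equally far apart.
single = one_hot_encode("single", MARITAL_LEVELS)
married = one_hot_encode("married", MARITAL_LEVELS)
divorced = one_hot_encode("divorced", MARITAL_LEVELS)
d_hot_single_married = euclidean(single, married)
d_hot_single_divorced = euclidean(single, divorced)

print("ordinal codes:", d_small_medium, d_small_large)
print("integer codes on unordered variable:", d_int_single_married, d_int_single_divorced)
print("one-hot codes on unordered variable:", d_hot_single_married, d_hot_single_divorced)
```

The takeaway is that integer codes are most defensible when the categories have an order, as with settlement size. For unordered categories, indicator (one-hot) columns avoid inventing an order, at the cost of adding dimensions.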
So, it rather depends on the case and you can decide how and when to apply K-means and PCA.
Best,
365 Eli