Resolved: Customer Analytics in Python - Lesson 2.Segmentation Data - Standardizing Segmentation Data
I have a question about the course Customer Analytics in Python - Lesson 2.Segmentation Data - Standardizing Segmentation Data.
Regarding standardization, should we also standardize the categorical variables (columns 'Sex', 'Marital status', 'Education', 'Occupation', 'Settlement size')?
I have researched this online and found that opinions are quite divided, so I would like to know your opinion as well.
Thanks for reaching out! That's a great question.
There is indeed some dispute about whether categorical variables need to be standardized. One argument against it is that coding a two-category variable as 0 and 1 does not mean that 1 has a higher value; there is simply one unit of distance between the two categories. It's true that for most models it won't make any difference whether those variables have been standardized or not.
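To make the point about 0 and 1 concrete, here is a minimal sketch of what standardizing a two-category column does. The values in `sex` are invented for illustration and are not taken from the course data set. The column still has only two distinct values afterwards; what changes is where they sit and how far apart they are (1 divided by the column's standard deviation instead of 1).

```python
from statistics import mean, pstdev

# Hypothetical 0/1 values for a binary column such as 'Sex' (made up for this example)
sex = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]

mu = mean(sex)       # column mean
sigma = pstdev(sex)  # population standard deviation

# z-score: subtract the mean, divide by the standard deviation
standardized = [(x - mu) / sigma for x in sex]

# The column still takes exactly two distinct values after standardization
distinct = sorted(set(round(z, 4) for z in standardized))
gap = distinct[1] - distinct[0]

print("distinct values:", distinct)
print("distance between the two categories:", round(gap, 4))
print("1 / sigma:", round(1 / sigma, 4))
```

So standardization does not turn the variable into something more "numeric"; it only rescales the distance between the two categories.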
Whether standardization is really necessary depends on the model you'll be using on your data. For PCA in particular, skipping it might be a problem: PCA looks for the directions of greatest variance, so I believe all predictors need to be on the same scale for it to perform well; otherwise the variables with the largest ranges dominate the components.
So, in this instance, standardizing the whole data set is needed. But it's not a definite rule for every machine learning algorithm; it usually depends on the particular use case.
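As a rough illustration of why scale matters for PCA, the sketch below compares how much of the total variance each column contributes before and after standardization. It is not PCA itself, only the variance bookkeeping that PCA is driven by, and the numbers in `data` are invented for the example. The helper names `standardize` and `variance_share` are hypothetical. In practice, scikit-learn's `StandardScaler` applies the same z-score transformation (also using the population standard deviation).

```python
from statistics import mean, pstdev, pvariance

# Hypothetical sample; values are invented for illustration only
data = {
    "Sex": [0, 1, 1, 0, 1, 0],
    "Age": [25, 47, 33, 52, 61, 38],
    "Income": [89000, 150000, 121000, 171000, 149000, 110000],
}

def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def variance_share(columns):
    """Fraction of the summed column variances contributed by each column."""
    variances = {name: pvariance(vals) for name, vals in columns.items()}
    total = sum(variances.values())
    return {name: var / total for name, var in variances.items()}

raw_share = variance_share(data)
scaled = {name: standardize(vals) for name, vals in data.items()}
scaled_share = variance_share(scaled)

for name in data:
    print(f"{name:>6}: raw {raw_share[name]:.6f}  standardized {scaled_share[name]:.3f}")
```

On the raw data almost all of the variance comes from 'Income', so a variance-based method would mostly see that one column. After standardization every column, including the 0/1 one, contributes equally.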