Importing and Exploring Segmentation Data
In this section, the correlations between the features are calculated for exploration (df.corr()).
Just wondering if there is any rule of thumb indicating that we can drop one of the features if it is highly correlated with another one?
In other words, if two features have a correlation as large as, say, 0.9, would it be redundant to use both of them for clustering (since the two features may indicate the same trend)?
Thanks for reaching out!
You're absolutely right: when two features are highly correlated, they are likely to contain very similar information. In such cases we might want to remove one of the features, but the question then becomes which one to keep. If we have prior knowledge of the dataset, we can decide that one of the features makes more sense for our model and leave the other one out. Otherwise, to avoid the correlation issue, we can rely on the following method:
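Using dimensionality reduction such as PCA helps us avoid collinearity, because the principal components are orthogonal to each other and therefore uncorrelated. Note that PCA does not pick a subset of the original features: each component is a linear combination of all of them, and the components are ordered by how much of the data's variance they explain. Keeping the first few components preserves most of that variance, so we are less likely to throw away an important feature by mistake than if we dropped one by hand. Since PCA is sensitive to scale, the features are usually standardized first.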
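To make this concrete, here is a minimal sketch of both steps: flagging highly correlated pairs from df.corr(), then applying PCA with scikit-learn. The data is synthetic and the column names (income, spending, age) are made up for illustration, not taken from the course dataset. The 0.9 threshold is the value from the question, not a fixed rule.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the segmentation data (hypothetical columns).
rng = np.random.default_rng(0)
income = rng.normal(50_000, 10_000, size=500)
df = pd.DataFrame({
    "income": income,
    # Built to track income closely, so this pair is highly correlated.
    "spending": 0.5 * income + rng.normal(0, 1_500, size=500),
    "age": rng.normal(40, 12, size=500),
})

# Flag feature pairs whose absolute correlation exceeds the threshold.
threshold = 0.9  # illustrative cut-off, not a fixed rule
corr = df.corr().abs()
# Keep only the upper triangle so each pair is listed once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs > threshold]
print(high_pairs)

# Standardize the features, then project onto two principal components.
scaled = StandardScaler().fit_transform(df)
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

# Share of the total variance captured by each kept component.
print(pca.explained_variance_ratio_)
# Off-diagonal entries are ~0: the component scores are uncorrelated.
print(np.corrcoef(components, rowvar=False).round(6))
```

The components array can then be passed to the clustering algorithm in place of the original columns. How many components to keep is a judgment call, and the cumulative sum of explained_variance_ratio_ is a common guide.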