
Customer Analytics: Any rule of thumb to ignore some features based on correlation?

Importing and Exploring Segmentation Data

In this section, the correlations between the features are calculated for exploration (df.corr()).
Is there any rule of thumb indicating that we can drop one of two features if it is highly correlated with the other?
In other words, if two features have a correlation as large as, say, 0.9, is it redundant to use both of them for clustering (since the two features may indicate the same trend)?
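
For concreteness, here is a minimal sketch of how such highly correlated pairs could be listed from the output of df.corr(). The helper name highly_correlated_pairs, the column names x1, x2, x3, the synthetic data, and the 0.9 cutoff are all assumptions made for illustration; none of them come from the course material.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(df, threshold=0.9):
    """Return (feature_a, feature_b, r) for every pair with |r| >= threshold.

    Hypothetical helper for illustration; the threshold is a judgment call,
    not a fixed rule.
    """
    corr = df.corr()  # Pearson correlation matrix
    cols = corr.columns
    pairs = []
    # Walk the upper triangle only, so each pair is reported once
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs


# Synthetic data (made up): x2 is almost a rescaled copy of x1, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(40, 10, 500)
demo = pd.DataFrame({
    "x1": x1,
    "x2": x1 * 1000 + rng.normal(0, 500, 500),
    "x3": rng.normal(0, 1, 500),
})

pairs = highly_correlated_pairs(demo, threshold=0.9)
for a, b, r in pairs:
    print(f"{a} and {b}: r = {r:.3f}")
```

On this synthetic data only the x1/x2 pair should be reported, since x3 is generated independently of the other two.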

Many thanks!!!

1 Answer

365 Team

Hi MinliYu, 
thanks for reaching out! 
You’re absolutely right: when two features are highly correlated, they are likely to contain very similar information. In such cases we might want to remove one of the features, but the question then becomes which one to keep. If we have prior knowledge of the dataset, we can decide that one of the features makes more sense for our model and leave the other one out. Otherwise, to avoid the correlation issue, we can rely on the following method:
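Using dimensionality reduction such as PCA helps us avoid collinearity, as the PCA components are orthogonal to each other and therefore uncorrelated. In addition, PCA does not drop individual features: each component is a linear combination of the original features, and the components we keep are the directions with the most variance. This makes it less likely that we discard useful information by mistake than if we removed a feature by hand.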
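
As a rough illustration of the PCA approach, the sketch below standardizes three features (two of them nearly identical) and projects them onto two principal components using scikit-learn. The synthetic data, the choice of two components, and the variable names are assumptions made for this example, not values taken from the course.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (made up): the second column is nearly a copy of the first
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
X = np.column_stack([
    x1,
    x1 + rng.normal(scale=0.1, size=500),
    rng.normal(size=500),
])

# Standardize first so that no feature dominates purely because of its scale
X_std = StandardScaler().fit_transform(X)

# Keep two components; in practice this number is chosen from the explained variance
pca = PCA(n_components=2)
scores = pca.fit_transform(X_std)

# Correlation between the component scores, and variance retained by each component
components_corr = np.corrcoef(scores, rowvar=False)
explained = pca.explained_variance_ratio_

print("Correlation between components:", round(float(components_corr[0, 1]), 6))
print("Explained variance ratio:", explained)
```

The component scores are uncorrelated by construction, so the redundancy between the first two features no longer carries over into the clustering input. The scores array can then be passed to the clustering algorithm in place of the original features.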
