Resolved: Cons: handling datasets with class imbalance
Is it true that a Decision Tree cannot handle a dataset with imbalanced classes? Say I have a dataset where "default" is my outcome variable, with yes=1 and no=0. However, 95% of the cases are 0 and only 5% are 1. I tried to build a DT, but got nothing out of it. How do I solve this problem? Many thanks.
Hi,
Unfortunately, yes.
You have run into the age-old problem of ML classification - imbalanced datasets. This problem is not specific to decision trees; pretty much every classification algorithm struggles with it. There are some tricks that can help you build a better model, but there is no one-size-fits-all technique - it all depends on your particular dataset.
As a starting point, you can adjust the weights of your classes in order to put more emphasis on the smaller one. This is done through the class_weight parameter. You can either set the weights manually, e.g. {0: 1, 1: N} (so class '1' will be weighted N times more heavily), or pass "balanced" and let sklearn compute the weights automatically from the class frequencies. You can (and should) read more about this parameter here: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
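To make this concrete, here is a minimal sketch on simulated data (the 95/5 split mimics your "default" variable; the dataset, weight value of 19, and all other settings are illustrative assumptions, not taken from your case):

```python
# Sketch: class_weight on a simulated 95%/5% imbalanced binary target.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Hypothetical data: ~5% of cases are class 1, like the "default" outcome.
rng = np.random.RandomState(42)
X = rng.randn(2000, 4)
y = (rng.rand(2000) < 0.05).astype(int)
X[y == 1] += 1.5  # shift class 1 so the tree has some signal to find

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Manual weights: class 1 counted 19x as heavily (roughly the 95/5 ratio).
# Alternatively, pass class_weight="balanced" to let sklearn compute this.
clf = DecisionTreeClassifier(
    class_weight={0: 1, 1: 19}, max_depth=4, random_state=42
)
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test)))
```

Compare the minority-class recall with and without class_weight - without it, a tree on data like this often predicts almost everything as class 0.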
As a side note, this Stack Exchange answer is very well written and gives you more ideas about the different approaches you can take: https://stats.stackexchange.com/a/28054
Hope this helps!
Best,
Nikola, 365 team
Thanks!