Last answered:

16 Jun 2022

Posted on:

15 Jun 2022

0

Resolved: Cons: handling a dataset with imbalanced classes

Is it true that a Decision Tree cannot handle a dataset with imbalanced classes? Say I have a dataset where "default" is my outcome variable, with yes = 1 and no = 0. However, 95% of the cases are 0 and only 5% are 1. I tried to build a DT, but got nothing out of it. How do I solve this problem? Many thanks.

2 answers ( 1 marked as helpful)
Instructor
Posted on:

16 Jun 2022

0

Hi,

Unfortunately, yes.
You have arrived at the age-old problem in ML classification - imbalanced datasets. This problem is not specific to decision trees; pretty much every classification algorithm struggles with it. There are some tricks that can help you build a better model, but there is no one-size-fits-all technique - it all depends on your particular dataset.

As a starting point, you can adjust the weights of your classes in order to put more emphasis on the smaller one.
This can be done through the class_weight parameter. You can either set the weights manually, like this: {0: 1, 1: N} (so the class '1' will be N times more important), or leave that to be done automatically by sklearn by passing "balanced". You can (and should) read more about this parameter here: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
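To illustrate, here is a minimal sketch of both options on a synthetic 95/5 dataset (the data, variable names, and the manual weight of 20 are just placeholders for the example, not values from your problem):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Synthetic imbalanced data standing in for your "default" dataset (95% class 0, 5% class 1)
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Option 1: let sklearn set weights inversely proportional to class frequencies
tree_balanced = DecisionTreeClassifier(class_weight="balanced", random_state=42)

# Option 2: set the weights manually, e.g. make class '1' twenty times more important
tree_manual = DecisionTreeClassifier(class_weight={0: 1, 1: 20}, random_state=42)

for model in (tree_balanced, tree_manual):
    model.fit(X_train, y_train)
    print(classification_report(y_test, model.predict(X_test)))

Looking at precision and recall for class '1' (rather than overall accuracy) will show you whether the weighting actually helps the minority class.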

As a side note, this Stack Exchange answer is well written and gives you more ideas about the different approaches you can take: https://stats.stackexchange.com/a/28054

Hope this helps!

Best,
Nikola, 365 team

Posted on:

16 Jun 2022

0

Thanks!
