Class imbalance is a real problem in many real-life applications, for example the occurrence of fraud in financial transactions or in healthcare claims. Similarly, in a hospital setting, the mortality rate for some conditions is small, but it still needs to be predicted.
What other approaches would you take in situations where the minority class is really tiny, say, 1-2% of the samples?
There are tools like SMOTE that I know of. What are your thoughts on their success in real applications?
Dealing with imbalanced datasets is generally a hard task and is an active area of research.
There are simple ways to approach it, but their simplicity also limits them.
For instance, you can oversample, i.e. repeat samples from the minority class, or undersample, i.e. throw away points from the majority class(es), which is what we used in our lectures here.
Oversampling is obviously quite naive and not really recommended. I’ve never seen it used in practice.
Undersampling is better in principle, but it only works if you have large amounts of data, so that cutting maybe >50% of it still leaves you with a big enough dataset.
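To make the two resampling ideas concrete, here is a minimal sketch in plain Python. The function names, the toy data and the choice to balance the classes exactly 1:1 are my own assumptions for illustration; in practice you may want a milder ratio.

```python
import random

def random_undersample(X, y, minority_label, seed=0):
    """Drop majority samples at random until both groups have equal size."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == minority_label]
    majority = [i for i, label in enumerate(y) if label != minority_label]
    kept = rng.sample(majority, min(len(minority), len(majority)))
    idx = sorted(minority + kept)
    return [X[i] for i in idx], [y[i] for i in idx]

def random_oversample(X, y, minority_label, seed=0):
    """Repeat minority samples (drawn with replacement) until both groups have equal size."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == minority_label]
    if not minority:
        raise ValueError("no minority samples to repeat")
    n_majority = len(y) - len(minority)
    extra = [rng.choice(minority) for _ in range(max(0, n_majority - len(minority)))]
    idx = list(range(len(y))) + extra
    return [X[i] for i in idx], [y[i] for i in idx]

# Toy data (hypothetical): 100 samples, 2% of them in the minority class.
X = [[float(i)] for i in range(100)]
y = [1 if i < 2 else 0 for i in range(100)]

X_under, y_under = random_undersample(X, y, minority_label=1)
X_over, y_over = random_oversample(X, y, minority_label=1)
print(len(y_under), len(y_over))
```

Note how undersampling shrinks the toy dataset to a handful of points, which is exactly the limitation mentioned above, while oversampling only adds exact copies of the same two minority points.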
An improvement on oversampling is using SMOTE and SMOTE-like techniques, as you mentioned.
I have seen a paper that dealt with NNs for better portfolio diversification that used this method extremely successfully but unfortunately I can’t find the reference now. Still, the point is that it is definitely a viable approach to the problem.
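For intuition, the core idea behind SMOTE is to create new minority points by interpolating between a minority sample and one of its nearest minority neighbours, instead of copying samples verbatim. The following is only a rough sketch of that idea in plain Python, not a reference implementation; the function name, parameters and toy data are assumptions of mine.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the other minority points, sorted by distance to the base point.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        # Pick one of the k nearest neighbours at random.
        neighbour = minority[rng.choice(others[:k])]
        # Random point on the segment between the base point and that neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Toy minority class (hypothetical): four points on the corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority, n_new=10, k=2)
print(new_points[:3])
```

For real work you would normally reach for an existing, tested implementation rather than a hand-rolled one like this.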
Another way to look at it is to tweak the training process instead of the dataset.
A more or less standard topic we did not go into in this course is regularization.
With it you can impose certain restrictions on what you’re learning by introducing extra penalty terms to your cost function (similar in spirit to Lagrange multipliers).
There is a lot of information on regularization online, but most of it focuses on the typical applications such as L1, L2, etc. and not on “engineering your own restriction”.
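As a small illustration of what an engineered extra term could look like, the sketch below adds a hand-made penalty to an ordinary binary cross-entropy cost. The specific penalty (the average shortfall of the predicted probability on minority samples), the weight `lam` and all names are hypothetical choices of mine, meant only to show the mechanics of "cost = data term + weighted restriction".

```python
import math

def log_loss(y_true, p_pred, eps=1e-12):
    """Plain average binary cross-entropy."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)

def minority_penalty(y_true, p_pred):
    """Hypothetical restriction: mean of (1 - p) over the minority (label 1) samples."""
    shortfalls = [1.0 - p for y, p in zip(y_true, p_pred) if y == 1]
    return sum(shortfalls) / len(shortfalls) if shortfalls else 0.0

def penalised_cost(y_true, p_pred, lam=1.0):
    """Data term plus lam times the engineered penalty term."""
    return log_loss(y_true, p_pred) + lam * minority_penalty(y_true, p_pred)

# Toy comparison (hypothetical numbers): 2% minority class.
y_true = [0] * 98 + [1] * 2
always_majority = [0.01] * 100                # ignores the minority class
catches_minority = [0.01] * 98 + [0.9] * 2    # scores the minority samples highly

print(log_loss(y_true, always_majority), penalised_cost(y_true, always_majority))
print(log_loss(y_true, catches_minority), penalised_cost(y_true, catches_minority))
```

The plain loss of the "always majority" predictor looks deceptively small because the minority samples barely contribute to the average; the extra term makes ignoring them visibly expensive.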
The two paragraphs so far describe methods to 1) tweak the dataset or 2) tweak the training process.
The third option is to change the learning algorithm entirely.
Non-neural-network approaches such as decision trees tend to handle imbalanced datasets better.
Taking this idea further, I believe a popular approach in recent literature (2017) is to use variations of boosting and bagging.
There even seem to be things like SMOTEBoost, which I only discovered just now.
You can read further here: SMOTEBoost and later arXiv:1712.06658.
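To give a flavour of what a bagging variation for imbalanced data can look like, here is a sketch of balanced bootstrap bags: each bag resamples the minority class and draws an equally sized sample from the majority class, and you would then train one base learner (for example a decision tree) per bag and average their votes. This is my own simplified illustration, not the method from SMOTEBoost or from the arXiv paper above, and all names are hypothetical.

```python
import random

def balanced_bags(y, minority_label, n_bags=10, seed=0):
    """Return n_bags lists of sample indices, each with a 1:1 class ratio."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == minority_label]
    majority = [i for i, label in enumerate(y) if label != minority_label]
    if not minority or not majority:
        raise ValueError("need at least one sample from each group")
    bags = []
    for _ in range(n_bags):
        # Bootstrap the minority class (sampling with replacement) ...
        bag = [rng.choice(minority) for _ in minority]
        # ... and draw the same number of majority samples, also with replacement.
        bag += [rng.choice(majority) for _ in minority]
        bags.append(bag)
    return bags

# Toy labels (hypothetical): 2% minority class.
y = [1] * 2 + [0] * 98
bags = balanced_bags(y, minority_label=1, n_bags=5)
for bag in bags:
    print(bag)
```

Each individual bag is tiny, but across many bags most of the majority data still gets used, which is the advantage over a single round of undersampling.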
In short, how to truly deal with imbalanced datasets is still an open problem.
You can try basic techniques such as undersampling or more advanced ones like some sort of smartly regularized random forest.
Which method will be successful depends heavily on the given dataset, so you generally can’t know before you have tried a few of the aforementioned approaches.