I have taken the Customer Analytics course.
In the deep learning lesson, the larger class in the dataset is reduced by removing some of its training data, so that both classes end up with an equal number of samples.
What if one class makes up 80% of the dataset and the other 20%? Following the above method would then shrink the dataset by 60%. How should the imbalanced class problem be tackled in such scenarios?
Dealing with imbalanced datasets is generally a hard task and is an active area of research.
There are simple ways to approach it, but their simplicity also limits them.
For instance, you can oversample, i.e. repeat samples from the minority class, or undersample, i.e. throw away points from the majority class(es), which is what we did in our lectures.
Oversampling is quite naive and not really recommended, since repeated points add no new information and make it easy to overfit to them. I’ve never seen it used in practice.
Undersampling is better in principle, but it only works if you have a large amount of data, so that cutting maybe more than 50% of it still leaves you with a big enough dataset.
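To make the two resampling options concrete, here is a minimal sketch using only the Python standard library. The function names `undersample` and `oversample` and the toy 80/20 dataset are made up for illustration and are not the code from the lectures.

```python
import random
from collections import Counter


def _group_by_class(samples, labels):
    """Collect the samples of each class in a dict: label -> list of samples."""
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    return by_class


def undersample(samples, labels, seed=0):
    """Randomly drop points until every class is as small as the rarest one."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = min(len(xs) for xs in by_class.values())
    pairs = [(x, y) for y, xs in by_class.items() for x in rng.sample(xs, target)]
    rng.shuffle(pairs)
    return [x for x, _ in pairs], [y for _, y in pairs]


def oversample(samples, labels, seed=0):
    """Repeat randomly chosen points until every class is as large as the biggest one."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = max(len(xs) for xs in by_class.values())
    pairs = []
    for y, xs in by_class.items():
        repeats = [rng.choice(xs) for _ in range(target - len(xs))]
        pairs.extend((x, y) for x in xs + repeats)
    rng.shuffle(pairs)
    return [x for x, _ in pairs], [y for _, y in pairs]


# Toy dataset (an assumption for illustration): 80 points of class 0, 20 of class 1.
samples = list(range(100))
labels = [0] * 80 + [1] * 20

X_u, y_u = undersample(samples, labels)
X_o, y_o = oversample(samples, labels)
print(Counter(y_u), len(y_u))  # 20 per class, 40 points left (60% of the data removed)
print(Counter(y_o), len(y_o))  # 80 per class, 160 points (60 of them repeats)
```

In the 80/20 case, undersampling keeps only 40 of the 100 points, which is exactly the 60% loss from the question.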
An improvement on oversampling is to use SMOTE (Synthetic Minority Over-sampling Technique) and SMOTE-like techniques, as you mentioned. Instead of repeating minority samples, they create new synthetic ones by interpolating between existing minority samples.
I have seen a paper on NNs for better portfolio diversification that used this method extremely successfully, but unfortunately I can’t find the reference now. Still, the point is that it is definitely a viable approach to the problem.
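The interpolation idea behind SMOTE can be sketched in a few lines. This is a simplified illustration, not the reference algorithm or any library's implementation. The function name `smote_like`, the parameter `k` and the toy points are assumptions. In practice you would more likely reach for an existing implementation.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each placed on the line segment between
    a random minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class with two features (made-up numbers).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.7)]
new_points = smote_like(minority, n_new=10)
print(new_points[:3])
```

Each synthetic point lies between two real minority points, so the new samples stay in the region the minority class already occupies instead of being exact copies.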
Another way to look at it is to tweak the training process instead of the dataset.
A more or less standard topic we did not go into in this course is regularization.
With it, you can impose certain restrictions on what the model learns by introducing extra penalty terms (basically Lagrange multipliers) into your cost function.
There is a ton of information on regularization online, but most of it focuses on the typical applications such as L1, L2, etc., and not on “engineering your own restriction”.
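As a rough sketch of what "an extra term in the cost function" means, the snippet below adds a penalty, scaled by a strength `lam`, to an ordinary log loss. All names here (`regularized_cost`, `l2_penalty`, `l1_penalty`, `lam`) and the numbers are illustrative assumptions. The penalty is passed in as a function so that you could swap in a restriction of your own design.

```python
import math


def log_loss(y_true, y_prob):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    eps = 1e-12  # avoids log(0)
    total = sum(
        y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
        for y, p in zip(y_true, y_prob)
    )
    return -total / len(y_true)


def l2_penalty(weights):
    """Sum of squared weights (the usual L2 term)."""
    return sum(w * w for w in weights)


def l1_penalty(weights):
    """Sum of absolute weights (the usual L1 term)."""
    return sum(abs(w) for w in weights)


def regularized_cost(y_true, y_prob, weights, lam, penalty=l2_penalty):
    """Data term plus lam times an extra penalty term on the weights."""
    return log_loss(y_true, y_prob) + lam * penalty(weights)


# Made-up labels, predictions and model weights.
y_true = [1, 0, 1]
y_prob = [0.9, 0.2, 0.7]
weights = [1.0, -2.0]

plain = log_loss(y_true, y_prob)
with_l2 = regularized_cost(y_true, y_prob, weights, lam=0.1)
with_l1 = regularized_cost(y_true, y_prob, weights, lam=0.1, penalty=l1_penalty)
print(plain, with_l2, with_l1)
```

The larger `lam` is, the more the optimizer is pushed to satisfy the restriction rather than fit the training data.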
The methods described so far either 1) tweak the dataset or 2) tweak the training process.
The third option is to change the learning algorithm entirely.
Non-neural network approaches such as decision trees tend to handle imbalanced datasets better.
Taking this idea further, I believe variations of boosting and bagging are a popular approach in recent literature (2017).
There even seem to be things like SMOTEBoost, which I have just discovered.
You can read further here: SMOTEBoost and, later, arXiv:1712.06658.
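To give a feel for how bagging can be adapted to imbalance, here is a toy sketch of one such variation: every model in the ensemble is trained on a bootstrap sample that draws the same number of points from each class. This is not SMOTEBoost and not the method of the paper linked above. The one-feature decision stump, the function names and the data are all assumptions made for illustration.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-feature decision stump: pick the threshold t and direction
    with the fewest training errors. It predicts 1 when sign * (x - t) > 0."""
    best = None
    for t in sorted(set(xs)):
        for sign in (1, -1):
            errors = sum(
                (1 if sign * (x - t) > 0 else 0) != y for x, y in zip(xs, ys)
            )
            if best is None or errors < best[0]:
                best = (errors, t, sign)
    return best[1], best[2]


def predict_stump(stump, x):
    t, sign = stump
    return 1 if sign * (x - t) > 0 else 0


def balanced_bagging(xs, ys, n_models=25, seed=0):
    """Train n_models stumps, each on a bootstrap sample with equal class counts."""
    rng = random.Random(seed)
    by_class = {
        0: [x for x, y in zip(xs, ys) if y == 0],
        1: [x for x, y in zip(xs, ys) if y == 1],
    }
    n = min(len(by_class[0]), len(by_class[1]))
    models = []
    for _ in range(n_models):
        bx, by = [], []
        for label, points in by_class.items():
            # sample with replacement, n points per class
            bx += [rng.choice(points) for _ in range(n)]
            by += [label] * n
        models.append(fit_stump(bx, by))
    return models


def predict(models, x):
    """Majority vote over all stumps."""
    votes = sum(predict_stump(m, x) for m in models)
    return 1 if 2 * votes > len(models) else 0


# Toy 80/20 data on a single feature (made-up numbers).
xs = [i / 10 for i in range(80)] + [8 + i / 10 for i in range(20)]
ys = [0] * 80 + [1] * 20

models = balanced_bagging(xs, ys)
print(predict(models, 0.0), predict(models, 9.9))
```

Unlike plain undersampling, no majority point is discarded for good: each model sees a different balanced subset, and the vote combines them.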
In short, how to truly deal with imbalanced datasets is still an open problem.
You can try basic techniques such as undersampling, or more advanced ones like some sort of smartly regularized random forest.
Which method will be successful depends highly on the given dataset, so you generally can’t know before you have tried a few of the aforementioned approaches.
The 365 Team