Techniques and intuition for binning continuous variables (fine classing)
Hello,
I was wondering how the number of bins affects the model's performance. At the beginning of the video it states, "you can visualize everything with a hundred or fewer categories and still make sense of it without getting lost..." Then 50 is chosen as the number of bins when using pandas' cut method. Let's say I used Sturges' rule, which is a rule for choosing the number of bins when creating a histogram. I would end up with around 43 bins rather than the chosen 50. I was wondering:
- How was it decided to use 50 bins for cutting?
- What sort of impact would 43 bins have compared to 50? I assume it doesn't matter much, but I am trying to get an intuition for how the number of bins affects the accuracy of the model.
1 answer (0 marked as helpful)
Hi John,
Thanks for reaching out! Choosing the appropriate number of bins for a histogram is a complex topic. Even though there are formal rules of thumb, such as Sturges' rule, they rarely work well in practice, because real data is discrete and usually noisy.
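To make the comparison concrete, here is a minimal sketch in Python. It is not the code from the course: the data is synthetic, and the names sturges_bins, values and summary are hypothetical, chosen only for illustration. It computes the bin count Sturges' rule suggests for a sample and then cuts the same variable into 43 and 50 equal-width bins with pandas' cut, so you can see how similar the two results are.

```python
import math

import numpy as np
import pandas as pd


def sturges_bins(n: int) -> int:
    """Sturges' rule: k = ceil(log2(n)) + 1 bins for n observations."""
    return math.ceil(math.log2(n)) + 1


# Synthetic, right-skewed data standing in for a continuous variable
# (an assumption; replace with your own column).
rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(mean=10, sigma=0.5, size=10_000))

print("Sturges' rule suggests", sturges_bins(len(values)), "bins")

summary = {}
for n_bins in (43, 50):
    # pd.cut with an integer splits the observed range into equal-width bins.
    binned = pd.cut(values, bins=n_bins)
    counts = binned.value_counts(sort=False)
    summary[n_bins] = {
        "n_categories": len(binned.cat.categories),
        "total": int(counts.sum()),
        "empty_bins": int((counts == 0).sum()),
        "largest_bin": int(counts.max()),
    }
    print(n_bins, summary[n_bins])
```

With equal-width bins, going from 43 to 50 only makes each bin slightly narrower, so the coarse shape of the binned variable stays much the same. What tends to matter more is how many bins end up empty or nearly empty, which the summary above lets you check on your own data.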
If you're looking to develop a bit more intuition on the matter, you can check out the chapter on histograms in the Data Visualization Course.
There is a lecture specifically dedicated to choosing the appropriate number of bins. Hope you'll find it instructive: https://learn.365datascience.com/courses/the-complete-data-visualization-course-with-python-r-tableau-and-excel/histogram-how-to-choose-the-right-number-of-bins
Best,
365 Eli