Accuracy difference when changing random_state
Hi there,
First of all congratulations for the course, everything is very well explained.
I have 'played' with the random_state parameter in train_test_split. The variation in the overall accuracy of the resulting models is huge: I have seen final accuracies ranging from around 0.67 all the way up to 0.93. My assumption is that this is caused by the small size of the dataset combined with the considerable imbalance between the different classes. Regardless of whether that is the right explanation for the accuracy differences, how valid can we consider a model that performs so differently just from modifying the train/test split?
Thanks for your attention,
Dan
Hi Dan,
Thank you for the kind words.
Yes, you are most probably correct - the huge difference in test accuracies is largely dictated by the small dataset size. In fact, that is one of the main problems of machine learning in many businesses - not having enough data is the most common pitfall.
However, it is worth exploring which areas are impacted by the insufficient data. In our case, it is likely that the test dataset is too small, while the train dataset could be sufficient. Thus, the final test accuracies are of limited use due to the huge variations between splits. However, the model itself might still be adequate; we are simply unable to evaluate its performance reliably from a single split.
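One way to see this effect, and to get a more stable estimate despite the small test set, is to average performance over several splits with cross-validation. Below is a minimal sketch of that idea; it uses scikit-learn's built-in wine dataset and a plain logistic regression as placeholders, not the actual dataset and model from the course:

```python
# Sketch: how much does test accuracy move when only random_state changes,
# and how does that compare with a cross-validated estimate?
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_wine(return_X_y=True)  # placeholder small dataset

# Accuracy for several train/test splits that differ only in random_state
accuracies = []
for seed in range(10):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    accuracies.append(model.score(X_test, y_test))
print("Single-split accuracies:", np.round(accuracies, 3))

# Cross-validation averages over several splits, giving a more stable estimate
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("5-fold CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))
```

The spread of the single-split accuracies shows how sensitive the evaluation is to the split itself, while the cross-validated mean and standard deviation give a more honest picture of how well the model generalizes on a small dataset.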
Of course, in a real production environment, we would need to source more data, or choose a different method.
Best regards,
Nikola Pulev, 365 Team