Resolved: On normalization and priors
Hey there. I was wondering if it's better to split the data into train-validation-test first, calculate the mean and the variance of the train data, and then normalize all three using these values. If all the data is normalized before splitting, won't that imply the train data carries some kind of "hidden information" from validation and test sets? Or does this not make a difference significant enough to worry about in large datasets?
Also, in this particular case, nearly 85% of the labels are 0. By keeping the same number of 0s and 1s, we'll be throwing almost 70% of our data away. Isn't there a better way to handle this? Maybe use some sort of weighted cost function for different labels?
You are absolutely right. The mean and variance should be computed on the train set only and then applied to the validation and test sets; otherwise we have a data leak.
Another point here is that shuffling should happen before splitting the dataset. Common sense says you shuffle before you split....
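To make the order concrete (shuffle, split, then standardize with train statistics only), here is a minimal sketch. The toy data, the 70/15/15 split and all variable names are my own assumptions for illustration, not anything from the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data, made up for illustration: 1000 rows, 3 features, roughly 15% positive labels.
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))
y = (rng.random(1000) < 0.15).astype(int)

# 1. Shuffle first, so each split is a random sample of the whole dataset.
perm = rng.permutation(len(X))
X, y = X[perm], y[perm]

# 2. Split 70/15/15 (arbitrary proportions chosen for this sketch).
n_train, n_val = 700, 150
X_train, X_val, X_test = np.split(X, [n_train, n_train + n_val])
y_train, y_val, y_test = np.split(y, [n_train, n_train + n_val])

# 3. Compute the statistics on the train split only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# 4. Apply those same train statistics to all three splits.
X_train_n = (X_train - mean) / std
X_val_n = (X_val - mean) / std
X_test_n = (X_test - mean) / std
```

After this the train split has zero mean and unit variance per feature, while validation and test will be close to that but not exact, which is what you want: they are treated like unseen data.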
Balancing the dataset by dropping indices is one way to do it. It is a simple approach that is useful at a low level of imbalance. In this case you lose most of your data, so, as you pointed out, it is not a wise solution.
Another way is to boost up the smaller group by oversampling it. There is a class for that, imblearn.over_sampling.SMOTE; you will most probably need to pip install imbalanced-learn first (that is the package that provides imblearn).
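To show what boosting the smaller group means mechanically, here is a rough sketch of plain random oversampling with NumPy. Note this is a simpler stand-in and not SMOTE itself: it duplicates existing minority rows instead of synthesizing new ones. The helper name random_oversample and the toy 85/15 data are assumptions for illustration.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate random rows of the smaller classes until every class matches the largest one."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx_parts = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(y == cls)
        if count < target:
            # Sample with replacement only the extra rows needed to reach the majority count.
            extra = rng.choice(cls_idx, size=target - count, replace=True)
            cls_idx = np.concatenate([cls_idx, extra])
        idx_parts.append(cls_idx)
    # Shuffle so the duplicated rows are not grouped together.
    idx = rng.permutation(np.concatenate(idx_parts))
    return X[idx], y[idx]

# Toy train set: 85 zeros and 15 ones, mirroring the imbalance in the question.
X_train = np.arange(200, dtype=float).reshape(100, 2)
y_train = np.array([0] * 85 + [1] * 15)

X_bal, y_bal = random_oversample(X_train, y_train)
print(np.bincount(y_bal))  # [85 85]
```

With imbalanced-learn installed, SMOTE(random_state=0).fit_resample(X_train, y_train) fills the same slot, but it creates new minority points by interpolating between neighbours rather than copying rows. Either way, resample the train split only and leave validation and test untouched, for the same leakage reason as with normalization.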
There are other methods as well, but for a start that's good enough, I suppose.
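One of those other methods is the weighted cost function suggested in the question. A minimal sketch, assuming binary labels and inverse-frequency weights (the same n_samples / (n_classes * count) formula scikit-learn uses for class_weight="balanced"); the function names and toy numbers here are hypothetical.

```python
import numpy as np

def class_weights(y):
    """Inverse-frequency weights: n_samples / (n_classes * count_per_class)."""
    counts = np.bincount(y, minlength=2)
    return len(y) / (2 * counts)

def weighted_bce(y_true, p_pred, weights, eps=1e-12):
    """Binary cross-entropy where each sample's loss is scaled by its class weight."""
    p = np.clip(p_pred, eps, 1 - eps)  # avoid log(0)
    per_sample = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return np.mean(weights[y_true] * per_sample)

# Toy labels with the 85/15 imbalance from the question.
y = np.array([0] * 85 + [1] * 15)
w = class_weights(y)        # w[0] is about 0.59, w[1] is about 3.33
p = np.full(len(y), 0.15)   # constant prediction equal to the base rate
loss = weighted_bce(y, p, w)
```

With these weights both classes contribute the same total weight to the loss (85 * w[0] equals 15 * w[1]), so no data is thrown away and no rows are duplicated.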