Why did you scale before train test split?

Question

Why did you use StandardScaler before train test split? Isn't doing this a form of data leakage because the train data was fit and transformed using enough on the test (unseen) data?

Answer 1

Hi Garrett!

Thanks for reaching out.

Can you please clarify what you mean by saying "using enough"?
In general, we want to normalize or standardize (which of the two we'll use depends on whether the data contains a lot or not too many outliers, respectively) the data before proceeding with further statistical/analytical steps.
At the same time, please keep in mind that we use .fit() to compute the mean and variance of each feature. Only then can we use .transform(), which is the method we use to standardize the data.

Hope this helps.
Kind regards,
Martin

Answer 2

Hii, why just using 2 sets and not three to get a more accurate model, like train, validation, and a test set. I understand you use the function random_state, but it makes a model more accurate with three sets?

Answer 3

Hi Rosaline!

Thanks for reaching out!

You are right that three sets are generally better. However, a validation set can be omitted/not used for relatively small datasets as the one shown in the course. In addition, we wanted to use standard splits to compare the model's validity. So, we chose to not use a validation set in thie exercise.

Hope this helps.
Best,
Martin

Why did you scale before train test split?

Submit an answer

related questions