Last answered:

22 May 2024

Posted on:

15 Oct 2022

0

Why did you scale before train test split?

Why did you use StandardScaler before train test split? Isn't doing this a form of data leakage because the train data was fit and transformed using enough on the test (unseen) data?

3 answers ( 0 marked as helpful)
Instructor
Posted on:

20 Oct 2022

0

Hi Garrett!

Thanks for reaching out.

Can you please clarify what you mean by saying "using enough"?
In general, we want to normalize or standardize (which of the two we'll use depends on whether the data contains a lot or not too many outliers, respectively) the data before proceeding with further statistical/analytical steps.
At the same time, please keep in mind that we use .fit() to compute the mean and variance of each feature. Only then can we use .transform(), which is the method we use to standardize the data.

Hope this helps.
Kind regards,
Martin

Posted on:

30 Jan 2023

0

Hii, why just using 2 sets and not three to get a more accurate model, like train, validation, and a test set. I understand you use the function random_state, but it makes a model more accurate with three sets?

Instructor
Posted on:

22 May 2024

0

Hi Rosaline!

Thanks for reaching out!


You are right that three sets are generally better. However, a validation set can be omitted/not used for relatively small datasets as the one shown in the course. In addition, we wanted to use standard splits to compare the model's validity. So, we chose to not use a validation set in thie exercise.


Hope this helps.
Best,
Martin

Submit an answer