Resolved: can't understand the order of train_test_split, standard scaler and count vectorizer
Hello Hristina,
I'm confused by what you said about the order of these three preprocessing steps:
1. train_test_split
2. standard scaler
3. count vectorizer
I understand that train_test_split should be applied to the data before CountVectorizer, but I can't work out the order of these steps in general.
Can you explain what the order should be and why?
Thank you!
Hey Mohsen,
Thank you for reaching out and for engaging with the Machine Learning with Naïve Bayes course!
The purpose of a test dataset is to have a set of records that the model hasn't encountered before. In our case, this means that the model shouldn't know what words to expect during the testing process.
Every preprocessing tool in sklearn, be it CountVectorizer, StandardScaler, or any other, has a method called fit(). This method is used to learn from whatever data you feed as an argument. In the example presented in the lecture, the vectorizer object is learning the words that are present in the training data and the frequency at which they appear in the comments. It is therefore important to not include the test dataset as an argument to the fit() method.
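To make this concrete, here is a minimal sketch of what fit() learns. The comments below are hypothetical toy data, not the dataset from the lecture, and the variable names are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy comments (assumption: not the course dataset).
train_comments = ["great video", "great tutorial thanks"]
test_comments = ["awesome video"]

vectorizer = CountVectorizer()
# fit() sees only the training comments, so the vocabulary comes from them alone.
vectorizer.fit(train_comments)

# The learned words, in the same (alphabetical) order as the output columns.
vocabulary = sorted(vectorizer.vocabulary_)

# transform() reuses the learned vocabulary. A word that was never seen
# during fit() (here "awesome") has no column and is simply ignored.
test_counts = vectorizer.transform(test_comments).toarray()

print(vocabulary)
print(test_counts)
```

Because "awesome" appears only in the test comment, it never enters the vocabulary. That is the behaviour we want: the test set stays unseen.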
To answer your question, the general order is the following:
1. Split the data into training and testing sets.
2. If preprocessing is needed, apply the fit_transform() method to the training set, and then transform() to the test set.
3. Fit your ML model to the training dataset and then test it on the test dataset.
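The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the comments and labels are made-up toy data, and MultinomialNB is used as the model simply because the course is about Naïve Bayes.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data (assumption): 1 = spam, 0 = not spam.
comments = [
    "check out my channel", "subscribe to my channel now",
    "free gift click here", "click here to win a prize",
    "great song love it", "this video is amazing",
    "love this song so much", "amazing performance great video",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# 1. Split the raw data first, before any preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    comments, labels, test_size=0.25, random_state=0, stratify=labels
)

# 2. Learn the vocabulary from the training set only,
#    then reuse that same vocabulary on the test set.
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

# 3. Fit the model on the training data and evaluate it on the test data.
model = MultinomialNB()
model.fit(X_train_counts, y_train)
accuracy = model.score(X_test_counts, y_test)
print(accuracy)
```

StandardScaler follows the same pattern when you have numeric features: fit_transform() on the training set, transform() on the test set.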
Hope this helps! Let me know if something has remained unclear and enjoy the course!
Kind regards,
365 Hristina