Encoding features with one-hot encoding vs. ordinal encoding in Data Preprocessing
Why choose the ordinal encoder rather than the one-hot encoder when preprocessing our data? I'm guessing the choice affects the performance of the model. In practice, do we have to try both and select the encoding that performs better?
Hi Nehita,
Thanks for reaching out! In my experience, the two types of encoders lead to very similar results, though I have not yet tried the one-hot encoder on this particular dataset. The reason for choosing the ordinal encoder here is that we cover the one-hot encoder in another course, and I wanted to show our students other possibilities. If you do find a significant difference between the two, I'd be happy if you shared your results here in the hub.
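If you'd like to try both yourself, a quick way is to run the same model with each encoder and compare the scores. Here is a minimal sketch, assuming a categorical feature matrix X and labels y (both placeholders) and an arbitrary classifier:

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Compare the two encoders with the same downstream model (5-fold CV)
for name, encoder in [('one-hot', OneHotEncoder(handle_unknown='ignore')),
                      ('ordinal', OrdinalEncoder())]:
    pipe = make_pipeline(encoder, LogisticRegression(max_iter=1000))
    scores = cross_val_score(pipe, X, y, cv=5)
    print(f'{name}: mean accuracy = {scores.mean():.3f}')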
Hope this helps!
Best,
365 Eli
Hi Eli, could you point me to the one-hot encoder course you're referring to?
Regarding the differences between one-hot and ordinal encoding, I see extremely different results.
from sklearn import svm
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.metrics import ConfusionMatrixDisplay

# One-hot encode the features; label-encode the target
enc_i = OneHotEncoder()
enc_t = LabelEncoder()
x_train_transf = enc_i.fit_transform(x_train)
x_test_transf = enc_i.transform(x_test)
y_train_transf = enc_t.fit_transform(y_train)
y_test_transf = enc_t.transform(y_test)

# Linear SVM trained on the one-hot encoded features
C = 1.0
clf = svm.SVC(C=C, kernel='linear').fit(x_train_transf, y_train_transf)
y_test_pred = clf.predict(x_test_transf)
ConfusionMatrixDisplay.from_predictions(y_test_transf, y_test_pred, display_labels=enc_t.classes_.tolist())
The result of the one-hot encoder is 100% accuracy in the mushroom case.
Please check my code; perhaps I did something wrong?
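For reference, the ordinal side of the comparison could look like the sketch below (assuming the same x_train/x_test splits and the label-encoded targets from above; the unknown-value handling is my own choice):

from sklearn import svm
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import accuracy_score

# Ordinal-encode the same features and train the same linear SVM
enc_o = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
x_train_ord = enc_o.fit_transform(x_train)
x_test_ord = enc_o.transform(x_test)
clf_ord = svm.SVC(C=1.0, kernel='linear').fit(x_train_ord, y_train_transf)
print(accuracy_score(y_test_transf, clf_ord.predict(x_test_ord)))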