Resolved: performance test

Question

Random forest has its own built-in cross-validation, using the remaining 37% of the data that is left out. So why do we need to perform another test?

Answer 1

Hi,

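You are correct that random forests have a built-in validation mechanism: each tree is trained on a bootstrap sample, and the roughly 37% of rows it never saw (the out-of-bag samples) can be used to score it, much like cross-validation. However, this built-in validation and the test dataset serve different purposes.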
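As a side note on where the 37% comes from, here is a minimal sketch using only the standard library. It assumes each tree draws a bootstrap sample of the same size as the dataset, with replacement; the function names and default values are made up for illustration.

```python
import math
import random


def expected_oob_fraction(n_samples):
    # Chance that one given row is never picked in n_samples draws
    # with replacement: (1 - 1/n)^n, which approaches 1/e for large n.
    return (1 - 1 / n_samples) ** n_samples


def simulated_oob_fraction(n_samples=10_000, seed=0):
    # Draw one bootstrap sample (hypothetical size and seed) and
    # measure the share of rows that were never drawn.
    rng = random.Random(seed)
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1 - len(drawn) / n_samples


if __name__ == "__main__":
    print("analytic :", expected_oob_fraction(10_000))
    print("limit 1/e:", math.exp(-1))
    print("simulated:", simulated_oob_fraction())
```

All three numbers should land close to 0.37, which is the share of data each tree leaves out.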

Cross-validation is used during training to guard against overfitting and during hyperparameter optimization to compare the different configurations. Because of this, it can't be used as the final testing benchmark: even if the model was never trained on the validation data, our modeling choices were made by looking at the scores on that data. You could say that we, as the data scientists, are the ones overfitting on it.

That's why the test set exists - to have a final dataset that was truly never seen, neither by the model nor by us while tuning it. We use this dataset to get a final benchmark of the accuracy (or any other metric) of our model.
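To make the argument above concrete, here is a small toy simulation, not a real random forest. It assumes every "configuration" is a pure coin-flip guesser with no real signal, so its true accuracy is 50%. The function names and the sizes (200 candidates, 100 validation rows, 5,000 test rows) are hypothetical choices for illustration.

```python
import random


def accuracy(preds, labels):
    # Share of predictions that match the labels.
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def selection_bias_demo(n_candidates=200, n_val=100, n_test=5_000, seed=0):
    rng = random.Random(seed)
    # Random binary labels: there is nothing to learn here.
    val_labels = [rng.randint(0, 1) for _ in range(n_val)]
    test_labels = [rng.randint(0, 1) for _ in range(n_test)]

    best_val_acc, test_acc_of_best = -1.0, None
    for _ in range(n_candidates):
        # Each candidate just guesses; it stands in for one
        # hyperparameter configuration we might try.
        val_preds = [rng.randint(0, 1) for _ in range(n_val)]
        test_preds = [rng.randint(0, 1) for _ in range(n_test)]
        val_acc = accuracy(val_preds, val_labels)
        # We pick the winner by validation score, as in tuning.
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            test_acc_of_best = accuracy(test_preds, test_labels)
    return best_val_acc, test_acc_of_best


if __name__ == "__main__":
    val_acc, test_acc = selection_bias_demo()
    print("validation accuracy of the chosen candidate:", val_acc)
    print("test accuracy of the same candidate:        ", test_acc)
```

The chosen candidate looks clearly better than chance on the validation data only because we selected it for that score. On the untouched test data it falls back to about 50%, which is the honest number.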

Hope this answers your question.

Best,
Nikola, 365 Team
