Last answered:

13 Apr 2022

Posted on:

13 Apr 2022


Resolved: Question 5 of Linear Regression Practice Exam: 2


In Question 5 of this practice exam we are required to calculate the R-squared value. Could you please explain why we calculate it on the training data instead of the test data?

In a lot of articles online, as well as in the sklearn documentation, I've seen the R-squared calculated on the test data to check the predictive power of the model. In the exam and in the video materials, however, it is calculated on the training data, and I do not understand why that is the case.

Thank you in advance!

Kind Regards,
Desislava Hristova

1 answer (1 marked as helpful)
Posted on:

13 Apr 2022


Dear Desislava,

Thank you for your question!

The reason for evaluating the R-squared value on the training dataset is to check whether we need to make improvements to our model. If the result doesn't satisfy us, we can go back, change some of the parameters of the model, and then run the code again to calculate the new, hopefully improved, R-squared value. This is referred to as "fine-tuning the hyperparameters".

In principle, it is not best practice to fine-tune on the training data, as this typically leads to overfitting. The data is therefore usually split into three parts instead of two: training, validation, and testing. The training data is used to fit the model, the validation dataset to fine-tune it, and the test dataset to assess the model's performance on completely unseen data.
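To make the three-way split more concrete, here is a minimal sketch using sklearn. It is only an illustration under assumptions of my own: the data is synthetic (not the dataset from the practice exam), and the 60/20/20 proportions and variable names are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only (an assumption, not the exam dataset)
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=300)

# First split: 60% training, 40% held out
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=42
)
# Second split: divide the held-out part equally into validation and test
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42
)

# Fit on the training data only
model = LinearRegression().fit(X_train, y_train)

# For a regressor, .score() returns the R-squared value
r2_train = model.score(X_train, y_train)  # how well the model fits the data it saw
r2_val = model.score(X_val, y_val)        # the number to compare while fine-tuning
r2_test = model.score(X_test, y_test)     # computed once, at the very end

print(f"R-squared (train):      {r2_train:.3f}")
print(f"R-squared (validation): {r2_val:.3f}")
print(f"R-squared (test):       {r2_test:.3f}")
```

If the training R-squared is much higher than the validation or test R-squared, that gap is usually a sign of overfitting.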

In any case, I agree that we should make it explicit which dataset we expect you to calculate the R-squared value on. We will make the appropriate adjustments to the question.

Thank you again for your input and your engagement with our Machine Learning course!

Kind regards,
365 Hristina
