Considering Audi as a dummy variable
When Audi is not given its own dummy and instead works as a "benchmark" reference for the other dummy variables, for the reasons explained in the lesson, I see a problem: the intercept takes on Audi's value plus the error, and we lose the real effect of the Audi category per se in the equation because it is kept "hidden" in the intercept. Is my view wrong? If I am right, I would like to include Audi as a dummy too, because the equation would have more meaning. What is the real problem with considering Audi as a dummy?
Hey Luiz,
Thank you for your question!
As outlined in an earlier video, one of the OLS assumptions is that there should be no multicollinearity between the features in a dataset. That is to say, no feature should be determined by the others. Had we introduced a dummy variable for Audi, we would have also introduced multicollinearity into our dataset. The reason is that if all the other brand dummies are 0, then we know for sure that the Audi dummy must be 1. If any of the other brand dummies is 1, then we know for sure that the Audi dummy must be 0. In other words, the Audi dummy always equals 1 minus the sum of the other brand dummies. Therefore, the Audi dummy variable is completely determined by the rest of the values and it should not be present in our dataset.
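To make the dependence concrete, here is a minimal sketch using NumPy. The six cars and the brand list are made up for illustration and are not the course dataset. It builds a design matrix with an intercept column plus one dummy per brand, then compares its rank to its number of columns.

```python
import numpy as np

# Hypothetical toy data: six cars across three brands (illustrative only).
brands = ["Audi", "BMW", "Toyota", "Audi", "BMW", "Toyota"]
levels = ["Audi", "BMW", "Toyota"]

# One 0/1 column per brand, including Audi.
dummies = np.array([[1.0 if b == lvl else 0.0 for lvl in levels] for b in brands])

# Design matrix with an intercept column of ones plus all three dummies.
intercept = np.ones((len(brands), 1))
X_full = np.hstack([intercept, dummies])

# Same design matrix with the Audi column dropped (Audi is the benchmark).
X_reduced = np.hstack([intercept, dummies[:, 1:]])

# The Audi column is fully determined by the other two brand columns.
audi_is_determined = np.array_equal(dummies[:, 0], 1 - dummies[:, 1] - dummies[:, 2])

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print("full:   ", X_full.shape[1], "columns, rank", rank_full)
print("reduced:", X_reduced.shape[1], "columns, rank", rank_reduced)
```

When the rank is lower than the number of columns, as it is for the full matrix here, the columns are linearly dependent and OLS has no unique set of coefficients. Dropping one dummy restores full column rank.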
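Of course, there is nothing special about Audi. We could have just as well dropped any of the other brands and picked it as the benchmark. The important thing is to have N-1 dummies for a feature with N categories. If we keep all N dummies together with the intercept, the coefficients are no longer uniquely determined, so the estimates can be unstable and misleading. Note also that the Audi effect is not lost: the intercept represents the Audi baseline (it does not absorb the error term), and each of the other brand coefficients is the difference relative to Audi.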
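As a rough illustration of the benchmark idea, the sketch below fits the same toy regression twice, once with Audi as the dropped category and once with BMW. The prices are invented, `fit_with_baseline` is a hypothetical helper written for this example, and the model contains only the brand dummies, which is a simplification of the course model.

```python
import numpy as np

# Hypothetical toy data: two cars per brand with made-up prices.
brands = np.array(["Audi", "Audi", "BMW", "BMW", "Toyota", "Toyota"])
price = np.array([30.0, 34.0, 40.0, 44.0, 20.0, 22.0])
levels = ["Audi", "BMW", "Toyota"]

def fit_with_baseline(baseline):
    """Fit OLS with an intercept and N-1 dummies, dropping `baseline`."""
    others = [b for b in levels if b != baseline]
    columns = [np.ones(len(brands))] + [(brands == b).astype(float) for b in others]
    X = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(X, price, rcond=None)
    return dict(zip(["intercept"] + others, coef)), X @ coef

coef_audi, pred_audi = fit_with_baseline("Audi")
coef_bmw, pred_bmw = fit_with_baseline("BMW")

print(coef_audi)  # intercept is the Audi mean; other entries are gaps vs. Audi
print(coef_bmw)   # intercept is the BMW mean; other entries are gaps vs. BMW
print(np.allclose(pred_audi, pred_bmw))  # predictions do not depend on the benchmark
```

In this brand-only setup the intercept equals the average price of the benchmark brand, so the benchmark's effect is visible rather than hidden. With additional features in the model, the intercept would instead be the prediction for a benchmark-brand car with all other features at zero. In pandas, `pd.get_dummies(..., drop_first=True)` is one way to get N-1 dummies, assuming the first category is an acceptable benchmark.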
Hope this helps!
Kind regards,
365 Hristina