Regarding creating dummy variables
Why is only the Audi brand dropped when creating dummy variables, and not any of the other brands?
Hey Siddhant,
Thank you for reaching out!
In this course, we've discussed that one of the assumptions of Ordinary Least Squares (OLS) regression is the absence of multicollinearity among features. Essentially, the features in our dataset should not be linear combinations of one another, so no feature should be predictable from the others.
Consider the scenario where we introduce a dummy variable for each car brand, including Audi. Doing so inadvertently introduces multicollinearity into our dataset. Here's why: if all other brand variables are 0, the Audi variable must be 1, indicating the car is an Audi. Conversely, if any other brand variable is 1, the Audi variable must be 0. This interdependency means the Audi variable is redundant: it's determined entirely by the other brand variables.
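To make the redundancy concrete, here is a minimal sketch with pandas. The toy data and the brand names other than Audi are made up for illustration and are not the course dataset.

```python
import pandas as pd

# Hypothetical toy data; only "Audi" comes from the question above.
cars = pd.DataFrame({"Brand": ["Audi", "BMW", "Toyota", "BMW", "Audi"]})

# One dummy column per brand, with nothing dropped.
full = pd.get_dummies(cars["Brand"], dtype=int)

# Each car has exactly one brand, so every row sums to 1.
row_sums = full.sum(axis=1)

# The Audi column can be rebuilt from the other brand columns alone.
audi_rebuilt = 1 - full.drop(columns="Audi").sum(axis=1)
audi_is_redundant = bool((audi_rebuilt == full["Audi"]).all())
print(audi_is_redundant)
```

Since the Audi column carries no information beyond what the other columns already hold, keeping it adds nothing to the model.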
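However, the choice of Audi isn't special; the same logic applies to any brand, and dropping a different one would work just as well. The key is to ensure that for N categories, we use only N-1 dummy variables. The dropped brand then acts as the baseline against which the other brands' coefficients are interpreted. This approach avoids perfect multicollinearity among the dummies. If we kept all N of them alongside the intercept, the coefficients could not be uniquely determined, which would lead to unstable estimates and unreliable interpretations.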
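As for why it ends up being Audi in particular: assuming the dummies are created with pandas' get_dummies and drop_first=True (an assumption about the notebook, so please check your own code), pandas sorts the categories and drops the first one, which is Audi alphabetically. The sketch below uses made-up toy data and a hypothetical helper, with_intercept, to compare N and N-1 dummies.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; brand names other than Audi are illustrative.
cars = pd.DataFrame({"Brand": ["Toyota", "BMW", "Audi", "BMW", "Audi"]})

# N dummies (all brands) versus N-1 dummies (first sorted brand dropped).
all_dummies = pd.get_dummies(cars["Brand"], dtype=int)
n_minus_1 = pd.get_dummies(cars["Brand"], drop_first=True, dtype=int)
print(list(n_minus_1.columns))  # Audi is the category that is missing


def with_intercept(dummies):
    """Prepend a column of ones, as an OLS model with a constant would."""
    X = dummies.to_numpy(dtype=float)
    return np.column_stack([np.ones(len(X)), X])


X_all = with_intercept(all_dummies)    # intercept + 3 dummies = 4 columns
X_reduced = with_intercept(n_minus_1)  # intercept + 2 dummies = 3 columns

# A rank below the number of columns means perfect multicollinearity.
rank_all = np.linalg.matrix_rank(X_all)
rank_reduced = np.linalg.matrix_rank(X_reduced)
print(rank_all, X_all.shape[1])
print(rank_reduced, X_reduced.shape[1])
```

With all brands kept, the design matrix has more columns than its rank, so OLS has no unique solution; with one brand dropped, the two match.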
Hope this helps!
Kind regards,
365 Hristina