Thanks for sharing this great resource: When can we safely ignore multicollinearity
A couple of questions popped up while I was reading it:
1. How do you generally define a control variable? And what would be the control variables in our practical example (car sales)?
2. Under what use case would we specify a regression model with both x and x^2?
1. A control variable is a variable which in scientific experimentation is kept constant.
Now, when you are creating models that are not aiming at best predictive power, but rather at evaluating different factors and their relationship with the dependent variable, you use control variables.
For instance, you’ve got Income as a function of education, age and experience (tenure). Now you want to check whether marital status, number of children, etc. make a difference.
In this case, you know that education, age and experience are good predictors; in fact you don’t even question it. They can be your control variables. So from here on, whatever model you create would be:
Income <- education + age + experience + whatever_new_we_want_to_test
You could have 2 models:
Income <- marital status
Income <- number of children
and you can compare the two to see which one is better, right? Yes, but not really. It doesn’t make sense, because these two features on their own explain little to nothing of income.
In fact, a better idea is to build these 2 models:
Income <- education + age + experience + marital status
Income <- education + age + experience + number of children
Now, these two models are both very good models on their own, because the control variables are there – education, age and experience. We know they are unquestionably the biggest determinants of income (hypothetically). Now we test if marital status or number of children is more important.
Comparing the two new models gives us an insight that strictly compares the two features in question.
In fact, you can have a situation where ‘number of children’ is insignificant. This would tell you that it adds nothing once the control variables are in the model.
Alternatively, if you created the model: Income <- number of children, it could come out significant because, underneath, ‘Age’ governs the number of children 🙂 So the number of children on its own doesn’t affect income; age does!
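To make the age / number of children point concrete, here is a rough sketch on simulated data. Everything in it is an assumption for illustration: the numbers are made up, income is generated from age only, and `ols` is a small hypothetical helper written just for this post, not a library function.

```python
import numpy as np

def ols(y, *columns):
    """Least-squares fit with an intercept; returns coefficients and t-statistics."""
    X = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = len(y) - X.shape[1]
    sigma2 = resid @ resid / dof              # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)     # covariance of the estimates
    return beta, beta / np.sqrt(np.diag(cov))

rng = np.random.default_rng(0)
n = 5000
age = rng.uniform(20, 60, n)
children = rng.poisson(0.05 * age)              # assumed: older people have more children
income = 1000 * age + rng.normal(0, 5000, n)    # assumed: income depends on age only

# Income <- number of children (no control variable)
_, t_alone = ols(income, children)
# Income <- age + number of children (age as control variable)
_, t_controlled = ols(income, age, children)

print("t-stat of children, alone:       ", t_alone[1])
print("t-stat of children, age included:", t_controlled[2])
```

Without the control, number of children looks strongly significant; with age in the model, its t-statistic should drop to noise level, because in this simulated data it never had an effect of its own.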
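2. When the relationship between x and the dependent variable is quadratic (curved rather than a straight line), you can use both x and x^2 in the regression: x captures the linear part and x^2 the curvature, so the model fits better. If the relationship is not quadratic, x^2 would most likely be insignificant.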
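A quick way to see this is to fit y <- x + x^2 on two simulated datasets, one that really is quadratic and one that is a straight line. This is only a sketch: the coefficients and noise level are made-up assumptions, and `t_stats` is a hypothetical helper, not part of any library.

```python
import numpy as np

def t_stats(y, *columns):
    """t-statistics of a least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    return beta / np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

rng = np.random.default_rng(1)
n = 2000
x = rng.uniform(0, 10, n)
noise = rng.normal(0, 1, n)

y_quadratic = 2 + 3 * x - 0.5 * x**2 + noise   # assumed curved relationship
y_linear = 2 + 3 * x + noise                   # assumed straight-line relationship

# Same specification for both: y <- x + x^2
t_quad = t_stats(y_quadratic, x, x**2)
t_lin = t_stats(y_linear, x, x**2)

print("t-stat of x^2, quadratic data:", t_quad[2])
print("t-stat of x^2, linear data:   ", t_lin[2])
```

On the quadratic data the x^2 term should come out clearly significant; on the linear data it should hover around zero, which is the signal that you can drop it.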