How do dummy variables meet the linearity assumption?

Super Learner

This question is based on the exercise from ‘Dealing with Categorical Data – Dummy Variables Exercise’. 
I decided to test the linearity assumption by plotting the dependent variable (price) vs the independent variables (size, year, view), as seen below. The price vs size graph clearly shows that a straight line can be drawn through the observations, but this doesn’t appear to be the case for the price vs year and the price vs view graphs. Since both price vs year and price vs view look like piecewise functions, are we only looking for linearity within each respective piece? In other words, does it pass the linearity assumption because, for each year and each view value, the points form a vertical line?

1 Answer

365 Team

Hi Varun,

Thanks for reaching out.

First of all, price-size is obviously alright, so I won’t comment on it.

Second, you can use a parameter called ‘alpha’ when plotting and set it to a number between 0 and 1, e.g. alpha = 0.5 to see the actual density of the points.
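A minimal sketch of the alpha tip, assuming the plots are made with matplotlib (the thread does not say which library is used). The data here is randomly generated for illustration and is not the exercise dataset.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)

# Made-up data (an assumption): 'view' is a 0/1 dummy, 'price' is synthetic
view = rng.integers(0, 2, size=200)
price = 200_000 + 60_000 * view + rng.normal(0, 30_000, size=200)

fig, ax = plt.subplots()
# alpha=0.5 makes each marker half transparent, so regions where many
# points overlap appear darker than regions with only a few points
points = ax.scatter(view, price, alpha=0.5)
ax.set_xlabel("view")
ax.set_ylabel("price")
```

Call `plt.show()` or `fig.savefig(...)` afterwards to look at the result. With many observations stacked on the same x value, a lower alpha such as 0.2 may show the density even better.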

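Third, consider a relationship like price and view. Since view is a dummy variable, the linearity assumption does not need to be checked for it. A dummy takes only the values 0 and 1, so the regression simply estimates a shift in the expected price between the two groups, and a straight line can always be drawn through two group averages. That is why it is called a ‘dummy’ or indicator: it is not a real numerical variable and does not need to be treated like one. So linearity is fine.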
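To see why a dummy cannot break linearity, here is a small sketch using NumPy on made-up data (the numbers are assumptions, not the exercise dataset). When a 0/1 dummy is the only regressor, the fitted line passes exactly through the two group means, so the intercept is the mean price without a view and the slope is the difference between the two means.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data (an assumption): view is a 0/1 dummy, price is synthetic
view = rng.integers(0, 2, size=200)
price = 200_000 + 60_000 * view + rng.normal(0, 30_000, size=200)

# Ordinary least squares fit of price = b0 + b1 * view
X = np.column_stack([np.ones(len(view)), view])
(b0, b1), *_ = np.linalg.lstsq(X, price, rcond=None)

mean_no_view = price[view == 0].mean()
mean_view = price[view == 1].mean()

# The intercept matches the mean of the view == 0 group
print(b0, mean_no_view)
# The slope matches the difference between the two group means
print(b1, mean_view - mean_no_view)
```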

Finally, we’ve got year. You have correctly identified that it behaves strangely. It behaves more like a dummy than a continuous variable, right? And that is precisely the case.

If the data covered the whole interval from 1900 to 2010, the plot would have looked linear. However, if you have just a couple of distinct years, like in this example, then you could treat year as a categorical variable (and create several dummies). This is not an uncommon practice!
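One way to create those dummies, sketched with pandas on a few made-up rows (the prices, sizes and year values are hypothetical, not taken from the exercise file). One year is dropped to serve as the baseline, so that the dummies are not perfectly collinear with the intercept.

```python
import pandas as pd

# Made-up rows (an assumption); only the column names follow the question
df = pd.DataFrame({
    "price": [230_000, 310_000, 280_000, 295_000, 215_000, 340_000],
    "size": [640, 820, 710, 760, 600, 900],
    "year": [2006, 2009, 2012, 2009, 2006, 2012],
})

# Treat year as categorical: one 0/1 column per year,
# with the first year (2006) dropped as the baseline category
year_dummies = pd.get_dummies(df["year"], prefix="year", drop_first=True, dtype=int)

# Regressors for the model: size plus the year dummies
X = pd.concat([df[["size"]], year_dummies], axis=1)
print(X.columns.tolist())
```

Each year coefficient would then be read as the expected price difference relative to the baseline year, holding size fixed.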

Hope this helps!