Including a Dummy Variable Into a Regression

Learning how to include dummy variables in a regression is a good way to round off your introduction to linear regressions. Another useful concept to learn is Ordinary Least Squares. But now, on to dummy variables. Apart from its offensive use, the word “dummy” has another meaning: an imitation or a copy that stands in as a substitute.

What Is a Dummy Variable?

In regression analysis, a dummy is a variable used to include categorical data in a regression model. In previous tutorials, we only used numerical data. We did that when we first introduced linear regressions and again when we explored the adjusted R-squared. However, numbers sit naturally on a scale, while categories like gender or season do not. It’s time to find out how to include such variables in a regression.

How to Include Categorical Data Into a Regression

First, make sure you have read the article where we took our first steps into the world of linear regressions. We will be using the SAT-GPA example from there. If you don’t have time to read it, here is a brief explanation: based on a student’s SAT score, we can predict their GPA. Now, we can improve our prediction by adding another regressor: attendance.

In the picture below, you can see a dataset that includes a variable that measures if a student attended more than 75% of their university lectures.

Keep in mind that this is categorical data, so we cannot simply put it in the regression.

We will start off by going through the process of using a dummy and explain it later.

Using a Dummy Variable

The time has come to write some code. We can begin by importing the relevant libraries by writing:

import numpy as np

import pandas as pd

import statsmodels.api as sm

import matplotlib.pyplot as plt

import seaborn as sns

sns.set()

After that, let’s load the file '1.03. Dummies.csv' into the variable raw_data. You can download the file from here. If you don’t know how to load it, here’s what you need to type:

raw_data = pd.read_csv('1.03. Dummies.csv')

Now, let’s simply write

raw_data

and see what happens.

As you can tell from the picture above, there is a third column named 'Attendance'. It records whether a student attended more than 75% of the lessons, with two possible values: Yes and No.

Mapping Values

What we usually do in such cases is map the Yes/No values to 1s and 0s. In this way, if the student attended more than 75% of the lessons, the dummy will be equal to 1. Otherwise, it will be 0.

So, we will have transformed our yes/no question into 0s and 1s. That’s what the dummy name stands for – we are imitating the categories with numbers.

How to Do it

In pandas, that’s done quite intuitively.

Let’s create a new variable data equal to raw_data. This is what we need to run:

data = raw_data.copy()

Then, we have to overwrite the series 'Attendance' in the data frame. This is what the code should look like:

data['Attendance'] = data['Attendance'].map({'Yes': 1, 'No': 0})

This is the proper syntax to map Yes to 1 and No to 0.

We can write

data

and find out if we have done our job.

As you can see in the picture above, we have successfully created a dummy variable! The categorical data in the series was replaced or mapped to numerical.
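One thing worth checking after a mapping like this is whether every label was actually covered by the dictionary. The sketch below uses a small made-up column (not the tutorial's dataset) with one deliberately misspelled label, and the names toy and unmapped are hypothetical. In pandas, values that are missing from the mapping dictionary come out as NaN.

```python
import pandas as pd

# Hypothetical toy column; the real data comes from '1.03. Dummies.csv'.
toy = pd.DataFrame({'Attendance': ['Yes', 'No', 'yes', 'No']})

# Labels missing from the dictionary (here the lowercase 'yes') become NaN.
toy['Dummy'] = toy['Attendance'].map({'Yes': 1, 'No': 0})

# Count the rows that failed to map before using the dummy in a regression.
unmapped = int(toy['Dummy'].isna().sum())
print(toy)
print('Unmapped rows:', unmapped)
```

If the count is not zero, the labels probably need cleaning (for example, consistent capitalization) before the dummy goes into a model.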

The Descriptive Statistics

Let’s take a look at the descriptive statistics of the variables. We can do that by writing:

data.describe()

The mean of 'Attendance' is 0.46, as shown below.

Because the mean is less than 0.5, there are more 0s than 1s. The mean of a variable that only takes the values 0 and 1 equals the share of 1s, so 46% of the students attended more than 75% of the lessons.
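To see why the mean of a dummy can be read as a percentage, here is a minimal sketch with a made-up series of five values (the numbers are illustrative and not taken from the dataset).

```python
import pandas as pd

# Hypothetical dummy column with three 1s and two 0s.
dummy = pd.Series([1, 0, 1, 1, 0])

# Summing a 0/1 column counts the 1s; dividing by the length gives their share.
count_of_ones = int(dummy.sum())
share_of_ones = dummy.mean()
print(count_of_ones, share_of_ones)
```

Three of the five values are 1, so the mean is 3/5. The same reasoning applies to the 0.46 reported by describe().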

In any case, now we can create a regression that explains GPA taking both SAT scores and attendance into consideration.

Creating the Regression

We can load GPA into the variable y, and both SAT and 'Attendance' into the variable x1. This is the code we need to run:

y = data['GPA']

x1 = data[['SAT', 'Attendance']]

We must use the statsmodels method for adding a constant. Then we can fit the regression and get the summary as before.

x = sm.add_constant(x1)

results = sm.OLS(y,x).fit()

results.summary()

The Results

As you can see in the picture below, our overall model is significant, the SAT score is significant, and the dummy variable is significant.

The adjusted R-squared of this model is 0.555, which is a great improvement over what we would get without attendance.

A model without the dummy variable would be:

GPA = 0.275 + 0.0017 * the SAT score of a student.

The model that includes the dummy variable is:

GPA = 0.6439 + 0.0014 * the SAT score of a student + 0.2226 * the dummy variable.

Explaining the Equation

Now, we said that the dummy is 0 or 1, so actually we can represent this equation with two others.

If the student did not attend, the dummy would be 0. So, 0.2226 * 0 is 0. The model becomes GPA = 0.6439 + 0.0014 * SAT.

If the student attended, the dummy variable would be 1, so the model becomes:

GPA = 0.6439 + 0.0014 * SAT + 0.2226.

Let’s add the intercept and the dummy together.

As you can see in the picture above, we got GPA = 0.8665 + 0.0014 * SAT.

Plotting the Data

There will be two equations, which we can call yhat_no and yhat_yes. They correspond to the two cases we just discussed: students who did not attend and students who did. Certainly, we could parametrize these equations, but there is no need in such a simple example.
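A minimal sketch of how the two lines could be computed and drawn is shown here. It hard-codes the coefficients from the equations above, uses a handful of made-up SAT scores in place of data['SAT'], and picks green and red for the two lines; the variable sat and the styling are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical SAT scores standing in for data['SAT'].
sat = np.array([1650, 1750, 1850, 1950, 2050])

# Coefficients taken from the regression equations above.
yhat_no = 0.6439 + 0.0014 * sat    # dummy = 0, did not attend
yhat_yes = 0.8665 + 0.0014 * sat   # dummy = 1, attended (0.6439 + 0.2226)

fig, ax = plt.subplots()
ax.plot(sat, yhat_no, lw=2, c='green', label='Did not attend')
ax.plot(sat, yhat_yes, lw=2, c='red', label='Attended')
ax.set_xlabel('SAT')
ax.set_ylabel('GPA')
ax.legend()
# In a notebook the figure renders on its own; in a script, call plt.show().
```

Both lines share the slope 0.0014, so they are parallel and differ only by the 0.2226 shift in the intercept.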

The 2 Equations

So, what we observe above are two equations that have the same slope but a different intercept. The students who attended are spread around the upper line.

On average, their GPA is 0.2226 higher than the GPA of students who did not attend.

We can even think of these as two separate regressions. We can color the points to match: students who attended classes go with the red line, and students who did not attend go with the green line.

You can clearly see the difference now.

Finally, we will put the original regression line on the graph.

As you can see, it is steeper and runs between the two lines of the dummies.

To use this model for prediction purposes, we need two pieces of information: an SAT score and whether a person attended more than 75% of their lectures.
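As a rough illustration of such a prediction, the sketch below plugs the rounded coefficients from the equation above into a small helper. The function name predict_gpa and the SAT score of 1700 are hypothetical.

```python
# Rounded coefficients as reported in the regression above.
INTERCEPT = 0.6439
SAT_COEF = 0.0014
ATTENDANCE_COEF = 0.2226


def predict_gpa(sat, attended):
    """Predict GPA from an SAT score and a True/False attendance flag."""
    dummy = 1 if attended else 0
    return INTERCEPT + SAT_COEF * sat + ATTENDANCE_COEF * dummy


# Two hypothetical students with the same SAT score (1700 is made up).
gpa_no = predict_gpa(1700, attended=False)
gpa_yes = predict_gpa(1700, attended=True)
print(round(gpa_no, 4), round(gpa_yes, 4))
```

The two predictions differ by exactly the dummy coefficient, 0.2226, whatever SAT score is used. With a fitted statsmodels model, the same result could be obtained from the model itself rather than from hand-copied coefficients.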