How To Perform A Linear Regression In Python (With Examples!)

Statistics Tutorials 15 min read
The linear regression model

How To Perform A Linear Regression In Python (With Examples!)

15 min read
Blog / Statistics Tutorials / How To Perform A Linear Regression In Python (With Examples!)

If you want to become a better statistician, a data scientist, or a machine learning engineer, going over several linear regression examples is inevitable. They will help you to wrap your head around the whole subject of regressions analysis.

Regression Analysis is coming

Regression analysis is one of the most widely used methods for prediction. It is applied whenever we have a causal relationship between variables.

Cause and effect

A large portion of the predictive modelling that occurs in practice is carried out through regression analysis. There are also many academic papers based on it. And it becomes extremely powerful when combined with techniques like factor analysis. Moreover, fundamentals of regression analysis are used in machine learning.

Machine learning

Therefore, it is easy to see why regressions are a must for data science. The general point is the following.

 “The amount of money you spend depends on the amount of money you earn.”

The amount of money you spend depends on the amount of money you earn

In the same way, the amount of time you spend reading our tutorials is affected by your motivation to learn additional statistical methods.

Statistical methods

You can quantify these relationships and many others using regression analysis.

What You Will Learn

We will use our typical step-by-step approach. We’ll start with the simple linear regression model, and not long after, we’ll be dealing with the multiple regression model. Along the way, we will learn how to build a regression, how to interpret it and how to compare different models.

How to build a regression

We will also develop a deep understanding of the fundamentals by going over some linear regression examples.

A quick side note: You can learn more about the geometrical representation of the simple linear regression model in the linked tutorial

What is a Linear Regression

Let’s start with some dry theory. A linear regression is a linear approximation of a causal relationship between two or more variables.

A linear regression is a linear approximation of a causal relationship between two or more variables.

Regression models are highly valuable, as they are one of the most common ways to make inferences and predictions.

The Process of Creating a Linear Regression

The process goes like this.

  1. First, you get sample data;
  2. Then, you can design a model that explains the data;
  3. Finally, you use the model you’ve developed to make a prediction for the whole population.

The process

There is a dependent variable, labeled Y, being predicted, and independent variables, labeled x1, x2, and so forth. These are the predictors. Y is a function of the X variables, and the regression model is a linear approximation of this function.

The regression model is a linear approximation of this function.

The Simple Linear Regression

The easiest regression model is the simple linear regression:

Y = β0 + β1 * x1 + ε.

Let’s see what these values mean. Y is the variable we are trying to predict and is called the dependent variable. X is an independent variable.

Simple linear regression model

When using regression analysis, we want to predict the value of Y, provided we have the value of X.

But to have a regression, Y must depend on X in some way. Whenever there is a change in X, such change must translate to a change in Y.

Providing a Linear Regression Example

Think about the following equation: the income a person receives depends on the number of years of education that person has received. The dependent variable is income, while the independent variable is years of education.

The dependent variable is income, while the independent variable is years of education

There is a causal relationship between the two. The more education you get, the higher the income you are likely to receive. This relationship is so trivial that it is probably the reason you are reading this tutorial, right now. You want to get a higher income, so you are increasing your education.

More education translates into a higher income

Is the Reverse Relationship Possible?

Now, let’s pause for a second and think about the reverse relationship. What if education depends on income.

What if education depends on income

This would mean the higher your income, the more years you spend educating yourself.

The higher your income, the more years you spend educating yourself

Putting high tuition fees aside, wealthier individuals don’t spend more years in school. Moreover, high school and college take the same number of years, no matter your tax bracket. Therefore, a causal relationship like this one is faulty, if not plain wrong. Hence, it is unfit for regression analysis.

It is unfit for regression analysis

Let’s go back to the original linear regression example. Income is a function of education. The more years you study, the higher the income you will receive. This sounds about right.

The more years you study, the higher income you will receive.

The Coefficients

What we haven’t mentioned, so far, is that, in our model, there are coefficients. β1is the coefficient that stands before the independent variable. It quantifies the effect of education on income.

It quantifies the effect of education on income

If β1 is 50, then for each additional year of education, your income would grow by $50. In the USA, the number is much bigger, somewhere around 3 to 5 thousand dollars.

For each additional year of education, your income will increase from $3000 to $5000

The Constant

The other two components are the constant β0 and the error – epsilon(ε).

In this linear regression example, you can think of the constant β0 as the minimum wage. No matter your education, if you have a job, you will get the minimum wage. This is a guaranteed amount.

So, if you never went to school and plug an education value of 0 years in the formula, what could possibly happen? Logically, the regression will predict that your income will be the minimum wage.

The regression will predict that your income will be the minimum wage

Epsilon

The last term is the epsilon(ε). This represents the error of estimation. The error is the actual difference between the observed income and the income the regression predicted. On average, across all observations, the error is 0.

On average the error is 0

If you earn more than what the regression has predicted, then someone earns less than what the regression predicted. Everything evens out.

The Linear Regression Equation

The original formula was written with Greek letters. This tells us that it was the population formula. But don’t forget that statistics (and data science) is all about sample data. In practice, we tend to use the linear regression equation.

It is simply ŷ = β0+ β1* x.

 ŷ = β0+ β1* x

The here is referred to as y hat. Whenever we have a hat symbol, it is an estimated or predicted value.

B is the estimate of the regression constant β0. Whereas, bis the estimate of β1, and x is the sample data for the independent variable.

The Regression Line

You may have heard about the regression line, too. When we plot the data points on an x-y plane, the regression line is the best-fitting line through the data points.

Regression line

You can take a look at a plot with some data points in the picture above. We plot the line based on the regression equation.

The grey points that are scattered are the observed values. B0, as we said earlier, is a constant and is the intercept of the regression line with the y-axis.

b0 is a constant and is the intercept of the regression line with the y axis

B1 is the slope of the regression line. It shows how much y changes for each unit change of x.

B1 is the slope of the regression line

The Estimator of the Error

The distance between the observed values and the regression line is the estimator of the error term epsilon. Its point estimate is called residual.

 Its point estimate is called residual

Now, suppose we draw a perpendicular from an observed point to the regression line. The intercept between that perpendicular and the regression line will be a point with a y value equal to .

The intercept between that perpendicular and the regression line will be a point with a y value equal to ŷ

As we said earlier, given an x, is the value predicted by the regression line.

ŷ is the value predicted by the regression line

First Linear Regression Example in Python

We believe it is high time that we actually got down to it and wrote some code! So, let’s get our hands dirty with our first linear regression example in Python. If this is your first time hearing about Python, don’t worry. We have plenty of tutorials that will give you the base you need to use it for data science and machine learning.

Now, how about we write some code? First off, we will need to use a few libraries.

data-science-training

Importing the Relevant Libraries

Let’s import the following libraries:

Simple linear regression

The first three are pretty conventional. We won’t even need numpy, but it’s always good to have it there – ready to lend a helping hand for some operations. In addition, the machine learning library we will employ for this linear regression example is: statsmodels. So, we can basically write the following code: import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import statsmodels.api as sm

Loading the Data

The data which we will be using for our linear regression example is in a .csv file called: ‘1.01. Simple linear regression.csv’. You can download it from here. Make sure that you save it in the folder of the user.

Now, let’s load it in a new variable called: data using the pandas method: ‘read_csv’.  We can write the following code:

data = pd.read_csv(‘1.01. Simple linear regression.csv’)

After running it, the data from the .csv file will be loaded in the data variable. As we are using pandas, the data variable will be automatically converted into a data frame.

The data variable will be automatically converted into a data frame.

Visualizing the Data Frame

Let’s see if that’s true. We can write data and run the line. As you can see below, we have indeed displayed the data frame.

The data frame

There are two columns – SAT and GPA. And that’s what our linear regression example will be all about. Let’s further check

data.describe()

This is a pandas method which will give us the most useful descriptive statistics for each column in the data frame – number of observations, mean, standard deviation, and so on.

Descriptive statistics for each column in the data frame

In this linear regression example we won’t put that to work just yet. However, it’s good practice to use it.

The Problem

Let’s explore the problem with our linear regression example.

So, we have a sample of 84 students, who have studied in college.

We have a sample of 84 students, who have studied in college

Their total SAT scores include critical reading, mathematics, and writing. Whereas, the GPA is their Grade Point Average they had at graduation.

SAT and GPA

That’s a very famous relationship. We will create a linear regression which predicts the GPA of a student based on their SAT score.

When you think about it, it totally makes sense.

  1. You sit the SAT and get a score.
  2. With this score, you apply to college.
  3. The next 4 years, you attend college and graduate receiving many grades, forming your GPA.

You can see the timeline below.

The timeline

Meaningful Regressions

Before we finish this introduction, we want to get this out of the way. Each time we create a regression, it should be meaningful. Why would we predict GPA with SAT? Well, the SAT is considered one of the best estimators of intellectual capacity and capability.

On average, if you did well on your SAT, you will do well in college and at the workplace. Furthermore, almost all colleges across the USA are using the SAT as a proxy for admission.

And last but not least, the SAT stood the test of time and established itself as the leading exam for college admission.

Why would I predict GPA with SAT?

It is safe to say our regression makes sense.

Creating our First Regression in Python

After we’ve cleared things up, we can start creating our first regression in Python. We will go through the code and in subsequent tutorials, we will clarify each point.

Important: Remember, the equation is:

Our dependent variable is GPA, so let’s create a variable called y which will contain GPA.

Just a reminder – the pandas’ syntax is quite simple.

This is all we need to code:

y = data[‘GPA’]

x1 = data[‘SAT’]

It goes like this:

  1. First, we write the name of the data frame, in this case data
  2. Then, we add in square brackets the relevant column name, which is GPA in our case.

Similarly, our independent variable is SAT, and we can load it in a variable x1.

Create your first regression

Exploring the Data

It’s always useful to plot our data in order to understand it better and see if there is a relationship to be found. We will use some conventional matplotlib code.

plt.scatter(x1,y)

plt.xlabel(‘SAT’, fontsize = 20)

plt.ylabel(‘GPA’, fontsize = 20)

plt.show()      

You can see the result we receive after running it, in the picture below.

The result

Each point on the graph represents a different student. For instance, the highlighted point below is a student who scored around 1900 on the SAT and graduated with a 3.4 GPA.

The highlighted point below is a student who scored around 1900 on the SAT and graduated with a 3.4 GPA

Observing all data points, we can see that there is a strong relationship between SAT and GPA. In general, the higher the SAT of a student, the higher their GPA.

the higher the SAT of a student, the higher their GPA

Adding a Constant

Next, we need to create a new variable, which we’ll call x.

We have our x1, but we don’t have an x0. In fact, in the regression equation there is no explicit x0. The coefficient b0 is alone.

The coefficient b0 is alone

That can be represented as: b0 * 1. So, if there was an x0, it would always be 1.

If there was an x0, it would always be 1

It is really practical for computational purposes to incorporate this notion into the equation. And that’s how we estimate the intercept b0.

In terms of code, statsmodels uses the method: .add_constant(). So, let’s declare a new variable:

x = sm.add_constant(x1)

The Results Variable

Right after we do that, we will create another variable named results. It will contain the output of the ordinary least squares regression, or OLS. As arguments, we must add the dependent variable y and the newly defined x. At the end, we will need the .fit() method. It is a method that applies a specific estimation technique to obtain the fit of the model.

The .fit() method

That itself is enough to perform the regression.

Displaying the Regression Results

In any case, results.summary() will display the regression results and organize them into three tables.

So, this is all the code we need to run:

x = sm.add_constant(x1)

results = sm.OLS(y,x).fit()

results.summary()

And this is what we get after running it:

Regression results

As you can see, we have a lot of statistics in front of us! And we will examine it in more detail in subsequent tutorials.

Plotting the Regression line

Let’s plot the regression line on the same scatter plot.  We can achieve that by writing the following:

plt.scatter(x1,y)

yhat = 0.0017*x1 + 0.275

fig = plt.plot(x1,yhat, lw=4, c=’orange’, label = ‘regression line’)

plt.xlabel(‘SAT’, fontsize = 20)

plt.ylabel(‘GPA’, fontsize = 20)

plt.show()

As you can see below, that is the best fitting line, or in other words – the line which is closest to all observations simultaneously.

The best fitting line

So that’s how you create a simple linear regression in Python!

How to Interpret the Regression Table

Now, let’s figure out how to interpret the regression table we saw earlier in our linear regression example.

While the graphs we have seen so far are nice and easy to understand. When you perform regression analysis, you’ll find something different than a scatter plot with a regression line. The graph is a visual representation, and what we really want is the equation of the model, and a measure of its significance and explanatory power. This is why the regression summary consists of a few tables, instead of a graph.

The regression summary consists of a few tables, instead of a graph

Let’s find out how to read and understand these tables.

The 3 Main Tables

Typically, when using statsmodels, we’ll have three main tables – a model summary

A model summary

a coefficients table

A coefficients table

and some additional tests.

Additional tests

Certainly, these tables contain a lot of information, but we will focus on the most important parts.

We will start with the coefficients table.

The Coefficients Table

We can see the coefficient of the intercept, or the constant as they’ve named it in our case.

Coefficient of the intercept

Both terms are used interchangeably. In any case, it is 0.275, which means b0 is 0.275.

b0 is 0.275

Looking below it, we notice the other coefficient is 0.0017. This is our b1. These are the only two numbers we need to define the regression equation.

These are the only two numbers we need to define the regression equation

Therefore,

ŷ= 0.275 + 0.0017 * x1.

Or GPA equals 0.275 plus 0.0017 times SAT score.

GPA equals 0.275 plus 0.0017 times SAT score

So, this is how we obtain the regression equation.

A Quick Recap

Let’s take a step back and look at the code where we plotted the regression line. We have plotted the scatter plot of SAT and GPA. That’s clear. After that, we created a variable called: y hat(ŷ). Moreover, we imported the seaborn library as a ‘skin’ for matplotlib. We did that in order to display the regression in a prettier way.

The seaborn library as a ‘skin’ for matplotlib

That’s the regression line – the predicted variables based on the data.

The regression line

Finally, we plot that line using the plot method.

Plot that line using the plot method

Naturally, we picked the coefficients from the coefficients table – we didn’t make them up.

The Predictive Power of Linear Regressions

You might be wondering if that prediction is useful.

Well, knowing that a person has scored 1700 on the SAT, we can substitute in the equation and obtain the following:

0.275 + 0.0017 * 1700, which equals 3.165. So, the expected GPA for this student, according to our model is 3.165.

The expected GPA for this student is 3.165

And that’s the predictive power of linear regressions in a nutshell!

The Standard Errors

What about the other cells in the table?

The standard errors show the accuracy of prediction for each variable. The lower the standard error, the better the estimate!

The lower the standard error, the better the estimate

The T-Statistic

The next two values are a T-statistic and its P-value.

T-statistic and its P-value

If you have gone over our other tutorials, you may know that there is a hypothesis involved here. The null hypothesis of this test is: β = 0. In other words, is the coefficient equal to zero?

Is the coefficient equal to zero?

The Null Hypothesis

If a coefficient is zero for the intercept(b0), then the line crosses the y-axis at the origin. You can get a better understanding of what we are talking about, from the picture below.

If a coefficient is zero for the intercept(b0), then the line crosses the y-axis at the origin

If β1is zero, then 0 * x will always be 0 for any x, so this variable will not be considered for the model. Graphically, that would mean that the regression line is horizontal – always going through the intercept value.

The regression line is horizontal

The P-Value

Let’s paraphrase this test. Essentially, it asks, is this a useful variable? Does it help us explain the variability we have in this case? The answer is contained in the P-value column.

The answer is contained in the P-value column

As you may know, a P-value below 0.05 means that the variable is significant. Therefore, the coefficient is most probably different from 0. Moreover, we are longing to see those three zeroes.

The coefficient is most probably different from 0

What does this mean for our linear regression example?

Well, it simply tells us that SAT score is a significant variable when predicting college GPA.

What you may notice is that the intercept p-value is not zero.

The intercept p-value is not zero

Let’s think about this. Does it matter that much? This test is asking the question: Graphically, that would mean that the regression line passes through the origin of the graph.

The regression line passes through the origin of the graph

Usually, this is not essential, as it is causal relationship of the Xs we are interested in.

The F-statistic

The last measure we will discuss is the F-statistic. We will explain its essence and see how it can be useful to us.

F-statistic

Much like the Z-statistic which follows a normal distribution and the T-statistic that follows a Student’s T distribution, the F-statistic follows an F distribution.

F distribution

We are calling it a statistic, which means that it is used for tests. The test is known as the test for overall significance of the model.

The Null Hypothesis and the Alternative Hypothesis

The null hypothesis is: all the βs are equal to zero simultaneously.

The alternative hypothesis is: at least one β differs from zero.

At least one β differs from zero

This is the interpretation: if all βs are zero, then none of the independent variables matter. Therefore, our model has no merit.

In our case, the F-statistic is 56.05.

The F-statistic is 56.05

The cell below is its P-value.

Its P-value

As you can see, the number is really low – it is virtually 0.000. We say the overall model is significant.

Important: Notice how the P-value is a universal measure for all tests. There is an F-table used for the F-statistic, but we don’t need it, because the P-value notion is so powerful.

The F-test is important for regressions, as it gives us some important insights. Remember, the lower the F-statistic, the closer to a non-significant model.

Moreover, don’t forget to look for the three zeroes after the dot!

What We Learned

Well, that was a long journey, wasn’t it? We embarked on it by first learning about what a linear regression is. Then, we went over the process of creating one. We also went over a linear regression example. Afterwards, we talked about the simple linear regression where we introduced the linear regression equation. By then, we were done with the theory and got our hands on the keyboard and explored another linear regression example in Python! We imported the relevant libraries and loaded the data. We cleared up when exactly we need to create regressions and started creating our own. The process consisted of several steps which, now, you should be able to perform with ease. Afterwards, we began interpreting the regression table. We mainly discussed the coefficients table. Lastly, we explained why the F-statistic is so important for regressions.

Next Steps

You thought that was all you need to know about regressions? Well, seeing a few linear regression examples is not enough. There are many more skills you need to acquire in order to truly understand how to work with linear regressions. The first thing which you can clear up is the misconception that regression and correlation are referring to the same concept.

***

Interested in learning more? You can take your skills from good to great with our statistics tutorials!

 

Next Tutorial: The Differences between Correlation and Regression

 

Leave a Reply

Your email address will not be published.

A Free Data Science Career Guide?

Check your email shortly!