Calculating and Using Covariance and Linear Correlation Coefficient


If you’ve always had a hard time figuring out how relationships work, covariance and the linear correlation coefficient will help you out. Well, not in the real world, of course. But in the field of statistics. We are talking about representing the relationship between 2 variables. See, there are many univariate measures which we can use when working with one variable. For instance, we have measures of central tendency, asymmetry and variability.

However, when we have 2 variables, it is a whole new ball game. That is why learning how to work with covariance and the linear correlation coefficient will be truly beneficial to your progress in studying statistics.

Understanding the Relationship between 2 Variables

Let’s zoom out a bit and think of an example that is very easy to understand. It will help us grasp the nature of the relationship between two variables a bit better.

Think about real estate. What is one of the main factors that determine house prices?

Their size.

Typically, larger houses are more expensive, as people like having extra space.

The table that you can see in the picture below shows us data about several houses.

On the left side, we can see the size of each house. On the right, we have the price at which it’s been listed in a local newspaper.

Creating a Scatter Plot

We can present these data points in a scatter plot. The X-axis will show a house’s size and the Y-axis will provide information about its price.

We can certainly notice a pattern. There is a clear relationship between these variables.

We say that the two variables are correlated and the main statistic to measure this correlation is called covariance.

Unlike variance, covariance may be positive, equal to zero, or negative.

The Formulas for Covariance

To understand the concept better, let’s take a look at a few formulas. They will allow us to calculate the covariance between two variables. We say formulas, plural, because there is a sample formula and a population formula, just as there is for variance.
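For reference, the two formulas are usually written as below. The notation is a conventional choice rather than something taken from the original figure: x̄ and ȳ are the sample means, μ_x and μ_y the population means, n the sample size and N the population size.

```latex
% Sample covariance (divide by n - 1)
s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}

% Population covariance (divide by N)
\sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}
```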

Since this is obviously sample data, we should use the sample covariance formula.

Let’s apply it in practice for the example that we saw earlier. X is the house size and Y stands for house price.

Applying the Sample Formula

First, we need to calculate the mean size and the mean price.

We can also compute the sample standard deviations, in case we need them later on.
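As a minimal sketch, these two steps could look like this in Python. The housing table is not reproduced in the text, so the five size and price values below are assumed for illustration, and the variable names are hypothetical.

```python
from statistics import mean, stdev

# Assumed illustrative data: the article's table is not reproduced in the text.
sizes = [650, 785, 1200, 720, 975]                          # house size
prices = [772_000, 998_000, 1_200_000, 800_000, 895_000]    # listed price

# Mean size and mean price.
mean_size = mean(sizes)
mean_price = mean(prices)

# statistics.stdev divides by n - 1, so it is the sample standard deviation.
std_size = stdev(sizes)
std_price = stdev(prices)

print(mean_size, mean_price)
print(round(std_size, 2), round(std_price, 2))
```

Note that `statistics.pstdev` would be the population counterpart, which divides by n instead.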

Now, let’s calculate the numerator of the covariance formula.

Starting with the first house, we can take the difference between its size and the average house size. Then, we will multiply that by the difference between the price of the same house and the average house price. You can see the result that we obtained in the picture below.

Finding the Sum

Once we’re ready, we have to perform this calculation for all the houses that we have in the table. Then, we can proceed by summing the numbers we’ve obtained.

Our sample size is 5. Now we have to divide the sum by the sample size minus 1.
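The whole calculation (deviation products, their sum, and the division by n - 1) can be sketched in a few lines of Python. The data are assumed for illustration because the table is not reproduced in the text, and `sample_covariance` is a hypothetical helper name.

```python
from statistics import mean

# Assumed illustrative data: the article's table is not reproduced in the text.
sizes = [650, 785, 1200, 720, 975]
prices = [772_000, 998_000, 1_200_000, 800_000, 895_000]

def sample_covariance(xs, ys):
    """Sum of the products of deviations from the means, divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    x_bar, y_bar = mean(xs), mean(ys)
    products = [(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# The term for the first house only: (size - mean size) * (price - mean price).
first_product = (sizes[0] - mean(sizes)) * (prices[0] - mean(prices))

cov = sample_covariance(sizes, prices)
print(first_product)
print(cov)
```

With these assumed values the covariance comes out above 30 million, which is the order of magnitude discussed in the text.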

The Sign of the Covariance

The result you see above is the covariance.

It gives us a sense of the direction in which the two variables are moving.

  • If they go in the same direction, the covariance will have a positive sign.
  • If they move in opposite directions, the covariance will have a negative sign.
  • Finally, if their movements are independent, the covariance between the house size and its price will be equal to zero.
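The three cases can be checked on small made-up series. This is only a sketch: the numbers are invented to produce each sign, and `sample_covariance` is a hypothetical helper.

```python
from statistics import mean

def sample_covariance(xs, ys):
    """Sample covariance: sum of deviation products divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)

x = [1, 2, 3, 4, 5]
same_direction = [2, 4, 5, 8, 9]       # rises as x rises
opposite_direction = [9, 8, 5, 4, 2]   # falls as x rises
no_linear_trend = [2, 1, 3, 1, 2]      # neither rises nor falls with x

cov_pos = sample_covariance(x, same_direction)
cov_neg = sample_covariance(x, opposite_direction)
cov_zero = sample_covariance(x, no_linear_trend)

print(cov_pos, cov_neg, cov_zero)
```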

The Problem

There is just one tiny problem with covariance, though. It could be a number like 5 or 50. But it can also be something like 0.0023456 or even over 30 million, just like our example!

Values of a completely different scale! How could we interpret such numbers?
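One way to see the scale problem is to change nothing but the units. The sketch below uses assumed data (the table is not reproduced in the text) and a hypothetical `sample_covariance` helper. Expressing the same prices in thousands shrinks the covariance by a factor of 1,000, although the relationship itself is unchanged.

```python
from statistics import mean

def sample_covariance(xs, ys):
    """Sample covariance: sum of deviation products divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)

# Assumed illustrative data.
sizes = [650, 785, 1200, 720, 975]
prices = [772_000, 998_000, 1_200_000, 800_000, 895_000]
prices_in_thousands = [p / 1000 for p in prices]   # same prices, different unit

cov_full = sample_covariance(sizes, prices)
cov_thousands = sample_covariance(sizes, prices_in_thousands)

print(cov_full, cov_thousands)
```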

Why We Need Correlation

This is where correlation comes into play.

It adjusts covariance so that the relationship between the two variables becomes easy and intuitive to interpret.

The correlation coefficient is the covariance divided by the product of the standard deviations of the two variables. There is a sample version and a population version, and you use whichever matches the data you are working with.
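In symbols, and using a conventional notation that is assumed here rather than taken from the original figure (s for sample quantities, σ for population quantities):

```latex
% Sample correlation coefficient
r = \frac{s_{xy}}{s_x \, s_y}

% Population correlation coefficient
\rho = \frac{\sigma_{xy}}{\sigma_x \, \sigma_y}
```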

We already have the standard deviations of the two data sets.

Now, we’ll use the formula in order to find the sample correlation coefficient.

Putting the Formula to Use

Mathematically, there is no way to obtain a correlation value greater than 1 or less than -1.
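For readers who want the reason, the bound follows from the Cauchy-Schwarz inequality applied to the deviations from the means. The sketch below uses assumed notation: s_xy is the sample covariance, and s_x and s_y are the sample standard deviations.

```latex
\left( \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \right)^{2}
\le
\sum_{i=1}^{n} (x_i - \bar{x})^{2} \, \sum_{i=1}^{n} (y_i - \bar{y})^{2}
\quad\Longrightarrow\quad
s_{xy}^{2} \le s_x^{2} \, s_y^{2}
\quad\Longrightarrow\quad
-1 \le \frac{s_{xy}}{s_x \, s_y} \le 1
```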

The idea is simple. We rescaled the strange covariance value by the two standard deviations in order to get something intuitive.

Let’s examine it for a bit.

As shown in the picture below, by calculating the formula, we got a sample correlation coefficient of 0.87.
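A short Python sketch of the same calculation is below. The data are assumed for illustration, since the table is not reproduced in the text, and `sample_correlation` is a hypothetical helper. The last lines also check that a change of units, which moved the covariance a lot, leaves the correlation untouched.

```python
from statistics import mean, stdev

def sample_correlation(xs, ys):
    """Sample covariance divided by the product of the sample standard deviations."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

# Assumed illustrative data.
sizes = [650, 785, 1200, 720, 975]
prices = [772_000, 998_000, 1_200_000, 800_000, 895_000]

r = sample_correlation(sizes, prices)
print(round(r, 2))

# Same prices expressed in thousands: the correlation does not change.
r_rescaled = sample_correlation(sizes, [p / 1000 for p in prices])
print(round(r_rescaled, 2))
```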

So, there is a strong positive relationship between the two variables.

A Correlation of 1

A correlation of 1 is also known as a perfect positive correlation. This means that the entire variability of one variable is explained by the other.
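A perfect positive correlation appears whenever one variable is an exact increasing linear function of the other. A minimal sketch with made-up numbers, where the helper name and the line y = 3x + 10 are arbitrary choices:

```python
from math import isclose
from statistics import mean, stdev

def sample_correlation(xs, ys):
    """Sample covariance divided by the product of the sample standard deviations."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

x = [1, 2, 3, 4, 5]
y = [3 * value + 10 for value in x]   # every point lies exactly on a rising line

r_perfect = sample_correlation(x, y)
print(isclose(r_perfect, 1.0))
```

The comparison uses `isclose` because floating-point rounding can leave the result a hair away from exactly 1.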

However, logically, we know that size determines the price. On average, the bigger the house you build, the more expensive it will be.

The relationship goes only this way.

Once a house is built, if for some reason it becomes more expensive, its size doesn’t increase, although there is a positive correlation.

A Correlation of 0

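A correlation of 0 between two variables means that there is no linear relationship between them. Independent variables have a correlation of 0, although a correlation of 0 on its own does not guarantee independence. For instance, we would expect a correlation of 0 between the price of coffee in Brazil and the price of houses in London.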
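One caveat is worth a quick illustration: the coefficient only measures linear association. In the made-up example below, y is completely determined by x (it is x squared), yet the correlation is 0 because the relationship is symmetric rather than linear. The helper name is hypothetical.

```python
from statistics import mean, stdev

def sample_correlation(xs, ys):
    """Sample covariance divided by the product of the sample standard deviations."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

x = [-2, -1, 0, 1, 2]
y = [value ** 2 for value in x]   # y depends entirely on x, but not linearly

r_zero = sample_correlation(x, y)
print(r_zero)
```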

Obviously, the two variables don’t have anything in common!

Negative Correlation

Finally, we can have a negative correlation coefficient. It can be a perfect negative correlation of -1 or much more likely an imperfect negative correlation of a value between -1 and 0.

Think of the following businesses: a company producing ice cream and a company selling umbrellas. Ice cream tends to sell more when the weather is very good, and people buy umbrellas when it’s rainy. Obviously, there is a negative correlation between the two. Hence, when one of the companies makes more money, the other tends to make less.
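To make this concrete, here is a sketch with invented daily sales figures for the two companies. The numbers and names are purely hypothetical. The result is an imperfect negative correlation, somewhere between -1 and 0.

```python
from statistics import mean, stdev

def sample_correlation(xs, ys):
    """Sample covariance divided by the product of the sample standard deviations."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

# Invented sales over five days, ordered from the sunniest day to the rainiest.
ice_cream_sales = [120, 95, 60, 30, 20]
umbrella_sales = [5, 10, 30, 55, 70]

r_negative = sample_correlation(ice_cream_sales, umbrella_sales)
print(round(r_negative, 2))
```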

Important: Before we continue, we must note that the correlation between two variables x and y is the same as the correlation between y and x.

The formula is completely symmetrical with respect to both variables.

Therefore, the correlation between price and size is the same as the one of size and price!
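The symmetry is easy to confirm numerically. A small sketch, with assumed data and a hypothetical helper name, where swapping the arguments gives the same coefficient:

```python
from math import isclose
from statistics import mean, stdev

def sample_correlation(xs, ys):
    """Sample covariance divided by the product of the sample standard deviations."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

# Assumed illustrative data.
sizes = [650, 785, 1200, 720, 975]
prices = [772_000, 998_000, 1_200_000, 800_000, 895_000]

r_size_price = sample_correlation(sizes, prices)
r_price_size = sample_correlation(prices, sizes)

print(isclose(r_size_price, r_price_size))
```

This is also why the coefficient alone cannot tell us which variable drives the other.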

Causality

This leads us to causality. It is very important for any analyst or researcher to understand the direction of causal relationships. In the housing business, size determines the price and not vice versa.

Important: Correlation does not imply causation!

Representing the Relationship between 2 Variables

To sum up, using covariance and correlation is not rocket science. Based on the sign of the covariance, we can tell whether or not the 2 variables are moving in the same direction. However, its magnitude is hard to interpret, because it depends on the scale of the data and can take values of an entirely different order. Therefore, we turn to correlation. It rescales the covariance to a value between -1 and 1 that is easy to interpret.
