If you’ve always had a hard time figuring out how relationships work, **covariance** and the **linear correlation coefficient** will help you out. Well, not in the real world, of course, but in the field of statistics. We are talking about representing the relationship between 2 variables. There are many univariate measures that we can use when working with one variable. For instance, we have measures of central tendency, asymmetry and variability.

However, when we have 2 variables, it is a whole new ball game. Because of that, learning how to work with **covariance** and the **linear correlation coefficient** will be truly beneficial to your progress in studying statistics.

**Understanding the Relationship between 2 Variables**

Let’s zoom out a bit and think of an example that is very easy to understand. It will help us grasp the nature of the relationship between two variables a bit better.

Think about real estate. What is one of the main factors that determine house prices?

Their size.

Typically, larger houses are more expensive, as people like having extra space.

The table that you can see in the picture below shows us data about several houses.

On the left side, we can see the size of each house. On the right, we have the price at which it’s been listed in a local newspaper.

**Creating a Scatter Plot**

We can present these data points in a **scatter plot**. The *X-axis* will show a house’s size and the *Y-axis* will provide information about its price.

We can certainly notice a pattern. There is a clear relationship between these variables.

We say that the two variables are correlated and the main statistic to measure this **correlation** is called **covariance**.

Unlike **variance**, which can never be negative, **covariance** may be positive, equal to zero, or negative.

**The Formulas for Covariance**

To understand the concept better, let’s take a look at a few formulas. They will allow us to calculate the **covariance** between two variables. Note the plural: there is a sample formula and a population formula, just as there is for **variance**.
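For reference, these are the standard textbook forms of the two formulas. The notation is a choice made here for illustration: $x_i$ and $y_i$ are the paired observations, $\bar{x}$ and $\bar{y}$ are the sample means, $\mu_x$ and $\mu_y$ are the population means, $n$ is the sample size and $N$ is the population size.

$$
s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}
\qquad \text{(sample covariance)}
$$

$$
\sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}
\qquad \text{(population covariance)}
$$

The only structural difference is the denominator: $n - 1$ for a sample, $N$ for a population.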

Since this is obviously sample data, we should use the **sample** **covariance** formula.

Let’s apply it in practice for the example that we saw earlier. **X** is the house size and **Y** stands for house price.

**Applying the Sample Formula**

First, we need to calculate the **mean** size and the **mean** price.

We can also compute the sample **standard deviations**, in case we need them later on.

Now, let’s calculate the numerator of the **covariance** formula.

Starting with the first house, we can take the difference between its size and the average house size. Then, we will multiply that by the difference between the price of the same house and the average house price. You can see the result that we obtained in the picture below.

**Finding the Sum**

Once we’re ready, we have to perform this calculation for all the houses that we have in the table. Then, we can proceed by summing the numbers we’ve obtained.

Our sample size is 5. Now we have to divide the sum by the sample size minus 1.
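The steps described so far (compute the means, multiply the paired deviations, sum them, divide by the sample size minus 1) can be sketched in a few lines of Python. The house sizes and prices below are hypothetical values invented for this sketch, not the figures from the table, and `sample_covariance` is just an illustrative helper name.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")

    # Step 1: the mean of each variable.
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Step 2: for each house, (size - mean size) * (price - mean price).
    products = [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]

    # Step 3: sum the products and divide by the sample size minus 1.
    return sum(products) / (n - 1)


# Hypothetical data for five houses (assumed values, for illustration only).
sizes = [600, 800, 1000, 1200, 1400]                     # house size
prices = [700_000, 820_000, 900_000, 1_150_000, 1_180_000]  # listed price

print(sample_covariance(sizes, prices))
```

With prices in the hundreds of thousands, the result is a very large positive number, which is typical for data on this scale.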

**The Sign of the Covariance**

The result you see above is the **covariance**.

It gives us a sense of the direction in which the two variables are moving.

- If they go in the same direction, the **covariance** will have a positive sign.
- If they move in opposite directions, the **covariance** will have a negative sign.
- Finally, if their movements are independent, the **covariance** between the house size and its price will be equal to zero.

**The Problem**

There is just one tiny problem with **covariance**, though. It could be a number like 5 or 50. But it can also be something like 0.0023456 or even over 30 million, just like our example!

Values of a completely different scale! How could we interpret such numbers?
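To see why the magnitude is hard to read, here is a minimal sketch showing that the covariance changes when we merely change the unit of measurement. The data are hypothetical, and expressing price in thousands instead of single currency units is an assumption made only for this illustration.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical data for five houses (assumed values, for illustration only).
sizes = [600, 800, 1000, 1200, 1400]
prices = [700_000, 820_000, 900_000, 1_150_000, 1_180_000]

# The same prices, expressed in thousands.
prices_in_thousands = [p / 1000 for p in prices]

cov_units = sample_covariance(sizes, prices)
cov_thousands = sample_covariance(sizes, prices_in_thousands)

# Same houses, same relationship, but the two numbers differ by a factor of 1000.
print(cov_units, cov_thousands)
```

The sign is the same in both cases; only the size of the number changes, and that size tells us nothing about how strong the relationship is.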

**Why We Need Correlation**

This is where **correlation** comes into play.

It adjusts **covariance** so that the relationship between the two variables becomes easy and intuitive to interpret.

The formula for the **correlation coefficient** is the **covariance** divided by the product of the **standard deviations** of the two variables. We use either the sample or the population version, depending on the data we are working with.
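Written out in standard notation (the symbols are a choice made here: $s_{xy}$ is the sample covariance, $s_x$ and $s_y$ are the sample standard deviations, and $\sigma_{xy}$, $\sigma_x$, $\sigma_y$ are their population counterparts), the two versions look like this:

$$
r = \frac{s_{xy}}{s_x \, s_y}
\qquad \text{(sample correlation coefficient)}
$$

$$
\rho = \frac{\sigma_{xy}}{\sigma_x \, \sigma_y}
\qquad \text{(population correlation coefficient)}
$$

Because the units of the numerator and the denominator cancel out, the result is a unit-free number.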

We already have the **standard deviations** of the two data sets.

Now, we’ll use the formula in order to find the **sample correlation coefficient**.
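As a minimal sketch, the sample correlation coefficient can be computed in Python by dividing the sample covariance by the product of the two sample standard deviations. The data are hypothetical values invented for this example (not the houses from the table), and `sample_correlation` is an illustrative helper name; `statistics.stdev` is the standard library's sample standard deviation.

```python
from statistics import stdev


def sample_correlation(xs, ys):
    """Sample correlation: sample covariance / (stdev of xs * stdev of ys)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    # stdev() uses the n - 1 denominator, matching the sample covariance.
    # If either variable is constant, its stdev is 0 and the division fails.
    return cov / (stdev(xs) * stdev(ys))


# Hypothetical data for five houses (assumed values, for illustration only).
sizes = [600, 800, 1000, 1200, 1400]
prices = [700_000, 820_000, 900_000, 1_150_000, 1_180_000]

r = sample_correlation(sizes, prices)
print(round(r, 2))
```

Whatever units the inputs use, the returned value stays between -1 and 1.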

**Putting the Formula to Use**

Mathematically, there is no way to obtain a **correlation** value greater than 1 or less than -1.

The idea is simple: we rescaled the strange **covariance** value in order to get something intuitive.

Let’s examine it for a bit.

As shown in the picture below, by calculating the formula, we got a **sample correlation coefficient** of 0.87.

So, there is a strong positive relationship between the two variables.

**A Correlation of 1**

A **correlation** of 1 is also known as a **perfect positive correlation**. This means that the entire variability of one variable is explained by the other.

However, logically, we know that size determines the price. On average, the bigger the house you build, the more expensive it will be.

The relationship goes only this way.

Once a house is built, if for some reason it becomes more expensive, its size doesn’t increase, although there is a positive **correlation**.

**A Correlation of 0**

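A **correlation** of 0 between two variables means that there is no linear relationship between them. Variables that are completely independent of each other always have a **correlation** of 0, although a **correlation** of 0 does not by itself prove independence. For instance, we would expect a **correlation** of 0 between the price of coffee in Brazil and the price of houses in London.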

Obviously, the two variables don’t have anything in common!

**Negative Correlation**

Finally, we can have a **negative correlation coefficient**. It can be **a perfect negative correlation** of -1 or much more likely an **imperfect negative correlation** of a value between -1 and 0.

Think of the following businesses: a company producing ice cream and a company selling umbrellas. Ice cream tends to be sold more when the weather is very good, and people buy umbrellas when it’s rainy. Obviously, there is a **negative correlation** between the two. Hence, when one of the companies makes more money, the other tends to make less.

**Important:** Before we continue, we must note that the **correlation** between two variables x and y is the same as the **correlation** between y and x.

The formula is completely symmetrical with respect to both variables.

Therefore, the **correlation** between price and size is the same as the one of size and price!
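The symmetry is easy to verify from the standard sample formulas (the notation is assumed here: $s_{xy}$ is the sample covariance, and $s_x$, $s_y$ are the sample standard deviations). Swapping the two variables only changes the order of the factors in each product:

$$
s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}
       = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}{n - 1}
       = s_{yx}
$$

$$
r_{xy} = \frac{s_{xy}}{s_x \, s_y} = \frac{s_{yx}}{s_y \, s_x} = r_{yx}
$$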

**Causality**

This leads us to **causality**. It is very important for any analyst or researcher to understand the direction of **causal** **relationships**. In the housing business, size determines the price and not vice versa.

**Important:** **Correlation** does not imply **causation**!

**Representing the Relationship between 2 Variables**

To sum up, using **covariance** and **correlation** is not rocket science. Based on the sign of the **covariance**, we can tell whether or not the 2 variables are moving in the same direction. However, its magnitude is hard to interpret, because we can obtain values of an entirely different scale. Therefore, we turn to **correlation**. It rescales the **covariance** into a value between -1 and 1 that is easy to interpret.

