Learning How to Use the SQL SELF JOIN

Join over 2 million students who advanced their careers with 365 Data Science. Learn from instructors who have worked at Meta, Spotify, Google, IKEA, Netflix, and Coca-Cola and master Python, SQL, Excel, machine learning, data analysis, AI fundamentals, and more.

Start for Free
Martin Ganchev 25 Apr 2023 9 min read

If you ever need to make a table join itself, the SQL SELF JOIN will do the job. Often, you’ll want to combine certain rows of a table with other rows of the same table. And as not to feel puzzled when we work with nested queries, make sure you check out our tutorial on subqueries.

Side note: We are going to be working with the ‘employees’ database. So, if you don’t have it on your computer, download it here.

Understanding the Concept

To better understand the mechanics of this SQL instrument, look at the picture below. related column, SQL self join

Think of a case where you are joining two tables. You can check how this is done with different tables in the linked tutorial. In a self-join statement, you will have to comply with the same logical and syntactic structure. However, the two tables will be identical to the table you’ll be using in the self-join. You can think of them as virtual projections of the underlying, base table.

2 tables will be identical to the table you'll be using, sql self join

How it Works

So, the SQL SELF JOIN will reference both implied tables and will treat them as two separate tables in its operations. Furthermore, the data used will not come from two sources, but from a single source, which is the underlying table that stores data physically.

the underlying table that stores data physically, sql self join

Usually, when joining separate tables, using aliases is at our discretion. It is done mainly for convenience. Here, using aliases is obligatory. Only they can help us distinguish the two virtual tables.

Important: These references to the original table let you use different blocks of the available data. You can either filter both in the join, or you can filter one of them in the WHERE clause, and the other one – in the join.

e1 e2, sql self join

If this sounds strange, please don’t worry. We will provide examples of both cases later in this tutorial.

So, our task is: from the “emp_manager” table created in the linked exercise, extract the record data only of the employees who are managers as well.  

Solving the Task

First, let’s run the following query:

SELECT
    *
FROM
    emp_manager
ORDER BY emp_manager.emp_no;

only two rows are affected, sql self join

As you can see in the picture above, only two rows were affected by our query.

Working with the SQL SELF JOIN

Now, let’s observe the following code.

SELECT
    e1.*
FROM
    emp_manager e1
        JOIN
    emp_manager e2 ON e1.emp_no = e2.manager_no;

We are using the structure of the join syntax. Also, we are joining the same table to itself, providing different aliases for each virtual table.

join e1 and e2

However, the connection is made with different columns of the base table. And, we would like to extract everything from the first table, provided that its employee number equals the manager number of the other table.

Let’s run this query and observe the output.

42 rows returned, sql self join

An Issue

You can see that 42 rows were returned.

The query connected the employee numbers of the first table coinciding with manager numbers, with the records from the “manage_no” column of the second table.

numbers match from e1 and e2, sql self join

So, we have two records from the left side, 110022 and 110039, connected to 42 records on the right side having the same values.

So, why do we see 22 against 39, and 39 against 22, and not something else?

What actually happened

Here’s the explanation. Every time a connection between two implied tables is established, all data of the corresponding row of the left table is displayed. For instance, whenever we have 110022 in the second table, only column data from the first table is being recorded in the new table. “manager_no” column information is not taken from E2, sql self join

This means the “manager_no” column information is not taken from E2!

Then the same process is repeated for the other manager number – 110039.

110039, sql self join

Obtaining Different Results

What will happen if we change the table in the SELECT statement to E2? Logically, we will retrieve 42 connections again, but this time the data displayed will refer solely to table E2, and not E1.

e2, sql self join

What if we would like to obtain a result containing “emp_no” and “dept_no” from E1, and “manager_no” from E2? To do that, we will have to adjust the selection retrieved by the execution of our query: let’s set the first two columns to be extracted from the left virtual table, and the third one from the right. We can achieve that by running the following:

SELECT
    e1.emp_no, e1.dept_no, e2.manager_no
FROM
    emp_manager e1
        JOIN
    emp_manager e2 ON e1.emp_no = e2.manager_no;

e1 e2

As shown in the picture above, it works.

How to Obtain Two Rows of Data

There are two ways to accomplish that.

The first is simply by using SELECT DISTINCT instead of SELECT, then applying the same code used at the beginning. Let’s use this code:

SELECT DISTINCT
    e1.*
FROM
    emp_manager e1
        JOIN
    emp_manager e2 ON e1.emp_no = e2.manager_no;

Thus, all repeating rows from the first query will be stacked and we will see only the two desired records. 110022 110039

Another Way to Do it

The second way is more sophisticated. We will use the same base query that would lead us to the 42-row output, but this time a WHERE clause will be added, and its condition will shorten the length of E2 to two rows.

Why would we do that? Because this creates a connection of the “emp_no” from E1 that isn’t with all 42 rows from the “manager_no” column of E2, but with two rows only!

Let’s see what will happen if we run the following query:

SELECT
    manager_no
FROM
    emp_manager;

sql self join

How to Do it

By using the IN operator to state we want the “emp_no” column from the second virtual table to equal any of the manager numbers from the base table. In other words, the WHERE-subquery regards this small two-row table. the WHERE-subquery regards this small two-row table

We can write:

SELECT
    e1.*
FROM
    emp_manager e1
        JOIN
    emp_manager e2 ON e1.emp_no = e2.manager_no
WHERE
    e2.emp_no IN (SELECT
        manager_no
    FROM
        emp_manager);

110039 110022

As you can see in the picture above, the desired output was obtained!

The Importance of the SQL SELF JOIN

In conclusion, whenever you need to join a table with itself, you’ll need to apply an SQL Self join. Keeping in mind some little differences that could alter your final output, you may well use it freely.

If you hate typing the same code structure over and over again every time you need to execute a specific type of query, there is a perfect tool for you in SQL. It is called ‘view’ and you can dive into the tutorial that will teach you how to retrieve your output more quickly.

***

Eager to hone your SQL skills? Check put our SQL course.

Next Tutorial: Intro to SQL Views

Martin Ganchev

Instructor at 365 Data Science

Martin holds an MSc degree in Economic and Social Sciences from Bocconi University. His diverse academic and research experience combined with his friendly and explanatory approach to teaching have made him one of the most beloved instructors on our team. Some of the courses he has authored include: SQL, SQL + Tableau, SQL+Tableau+Python, Introduction to Python, Introduction to Jupyter, to name a few.

Top