If you ever need to make a table join itself, the SQL SELF JOIN will do the job. Often, you’ll want to combine certain rows of a table with other rows of the same table. And as not to feel puzzled when we work with nested queries, make sure you check out our tutorial on subqueries.
Side note: We are going to be working with the ‘employees’ database. So, if you don’t have it on your computer, download it here.
Understanding the Concept
Think of a case where you are joining two tables. You can check how this is done with different tables in the linked tutorial. In a self-join statement, you will have to comply with the same logical and syntactic structure. However, the two tables will be identical to the table you’ll be using in the self-join. You can think of them as virtual projections of the underlying, base table.
How it Works
So, the SQL SELF JOIN will reference both implied tables and will treat them as two separate tables in its operations. Furthermore, the data used will not come from two sources, but from a single source, which is the underlying table that stores data physically.
Usually, when joining separate tables, using aliases is at our discretion. It is done mainly for convenience. Here, using aliases is obligatory. Only they can help us distinguish the two virtual tables.
Important: These references to the original table let you use different blocks of the available data. You can either filter both in the join, or you can filter one of them in the WHERE clause, and the other one – in the join.
If this sounds strange, please don’t worry. We will provide examples of both cases later in this tutorial.
So, our task is: from the “emp_manager” table created in the linked exercise, extract the record data only of the employees who are managers as well.
Solving the Task
First, let’s run the following query:
SELECT * FROM emp_manager ORDER BY emp_manager.emp_no;
As you can see in the picture above, only two rows were affected by our query.
Working with the SQL SELF JOIN
Now, let’s observe the following code.
SELECT e1.* FROM emp_manager e1 JOIN emp_manager e2 ON e1.emp_no = e2.manager_no;
We are using the structure of the join syntax. Also, we are joining the same table to itself, providing different aliases for each virtual table.
However, the connection is made with different columns of the base table. And, we would like to extract everything from the first table, provided that its employee number equals the manager number of the other table.
Let’s run this query and observe the output.
You can see that 42 rows were returned.
The query connected the employee numbers of the first table coinciding with manager numbers, with the records from the “manage_no” column of the second table.
So, we have two records from the left side, 110022 and 110039, connected to 42 records on the right side having the same values.
So, why do we see 22 against 39, and 39 against 22, and not something else?
What actually happened
Here’s the explanation. Every time a connection between two implied tables is established, all data of the corresponding row of the left table is displayed. For instance, whenever we have 110022 in the second table, only column data from the first table is being recorded in the new table.
This means the “manager_no” column information is not taken from E2!
Then the same process is repeated for the other manager number – 110039.
Obtaining Different Results
What will happen if we change the table in the SELECT statement to E2? Logically, we will retrieve 42 connections again, but this time the data displayed will refer solely to table E2, and not E1.
What if we would like to obtain a result containing “emp_no” and “dept_no” from E1, and “manager_no” from E2? To do that, we will have to adjust the selection retrieved by the execution of our query: let’s set the first two columns to be extracted from the left virtual table, and the third one from the right. We can achieve that by running the following:
SELECT e1.emp_no, e1.dept_no, e2.manager_no FROM emp_manager e1 JOIN emp_manager e2 ON e1.emp_no = e2.manager_no;
As shown in the picture above, it works.
How to Obtain Two Rows of Data
There are two ways to accomplish that.
The first is simply by using SELECT DISTINCT instead of SELECT, then applying the same code used at the beginning. Let’s use this code:
SELECT DISTINCT e1.* FROM emp_manager e1 JOIN emp_manager e2 ON e1.emp_no = e2.manager_no;
Another Way to Do it
The second way is more sophisticated. We will use the same base query that would lead us to the 42-row output, but this time a WHERE clause will be added, and its condition will shorten the length of E2 to two rows.
Why would we do that? Because this creates a connection of the “emp_no” from E1 that isn’t with all 42 rows from the “manager_no” column of E2, but with two rows only!
Let’s see what will happen if we run the following query:
SELECT manager_no FROM emp_manager;
How to Do it
By using the IN operator to state we want the “emp_no” column from the second virtual table to equal any of the manager numbers from the base table. In other words, the WHERE-subquery regards this small two-row table.
We can write:
SELECT e1.* FROM emp_manager e1 JOIN emp_manager e2 ON e1.emp_no = e2.manager_no WHERE e2.emp_no IN (SELECT manager_no FROM emp_manager);
As you can see in the picture above, the desired output was obtained!
The Importance of the SQL SELF JOIN
In conclusion, whenever you need to join a table with itself, you’ll need to apply an SQL Self join. Keeping in mind some little differences that could alter your final output, you may well use it freely.
If you hate typing the same code structure over and over again every time you need to execute a specific type of query, there is a perfect tool for you in SQL. It is called ‘view’ and you can dive into the tutorial that will teach you how to retrieve your output more quickly.
Eager to hone your SQL skills? Learn how to put theory into practice with our hands-on tutorials!
Next Tutorial: Intro to SQL Views