Duplicate records in SQL, also known as duplicate rows, are identical rows in an SQL table. This means, for a pair of duplicate records, the values in each column coincide. Usually, we will retrieve duplicate data, when we are joining tables. So, if you want to learn a thing or two about the syntax of the JOINS, take a look at the linked tutorial.
Side note: We will be working with the ‘employees’ database, so if you don’t have it on your computer, feel free to check out how to download it in the following article.
Generally, duplicate rows are not always allowed in a database or a data table. The regular practice of many companies is to clean their data from such occurrences. However, they are sometimes encountered, especially in new, raw, or uncontrolled data. In this tutorial, we will show you a way to handle them in your queries.
How to Create Duplicate Records in SQL?
As a matter of fact, our data is clean; it does not have any duplicate rows. Therefore, we can insert one row in the “dept_manager_dup” table that is identical to one of the others:
INSERT INTO dept_manager_dup VALUES('110228', 'd003', '1992-03-21', '9999-01-01');
We will also create a duplicate row in the other table. Let it be the ninth department, “Customer Service”.
The code should be:
INSERT INTO departments_dup VALUES('d009', 'Customer Service');
Results from the Code
Let’s check what happened with the two tables first. We can achieve that by writing:
SELECT * FROM dept_manager_dup ORDER BY dept_no ASC;
The table with the data about the managers looks like it hasn’t changed. However, the employee numbered 110228 appears there twice, as shown in the picture above.
Fine, the situation is clear. Now, look at the picture below – it is a snipped photo of the output of 20 rows, when the initial state of the two tables was used, without duplicates.
How do you expect the new output to change regarding the one from the picture? Do you think the number of presented rows will be lower, equal, or higher?
We obtained 25 rows. Where do you think these additional 5 rows came from?
Why We Have More Rows
First, we don’t have just one, but two employees with number 110228; hence, two times the department name “Human Resources” is appended against them. This gives us a single additional record compared to what we had earlier.
The remaining 4 extra records are at the end of this result set. The “departments_dup” table contains a duplicate row for department number 9, “Customer Service”. That’s why every time department number 9 from the first table is matched with department number 9 from the second table, two records are displayed.
In the picture above, you can see each of these employees twice: 111784, 111877, 111939, and 111692!
How to Handle Duplicate Records in SQL?
Whether having duplicate values is a mistake in the database or you just don’t want to see duplicate rows in query results, here is how to handle duplicates:
GROUP BY the field that differs most among records. In our case, this will be the “emp_no” column. This will stack all rows that have the same “emp_no”. By writing the following code, we should retrieve the initial output, with no duplicate values:
SELECT m.dept_no, m.emp_no, d.dept_name FROM dept_manager_dup m JOIN departments_dup d ON m.dept_no = d.dept_no GROUP BY emp_no ORDER BY dept_no;
Keeping an Eye on Duplicate Data
This is an important tool in your arsenal. You cannot allow yourself to assume there are no duplicate rows in your data, especially if it contains millions of rows! It would be good to get used to grouping the joins by the field that differs most among records.
So, if you were thinking that there was only one type of a JOIN in SQL, let me prove you wrong. Dive into the next tutorial to see what a LEFT JOIN is.
Eager to hone your SQL skills? Learn how to put theory into practice with our hands-on tutorials!
Next Tutorial: SQL LEFT JOIN