What are Duplicate Records in SQL

SQL Tutorials 5 min read
duplicate records in sql
Blog / SQL Tutorials / What are Duplicate Records in SQL

Duplicate records in SQL, also known as duplicate rows, are identical rows in an SQL table. This means, for a pair of duplicate records, the values in each column coincide. Usually, we will retrieve duplicate data, when we are joining tables.  So, if you want to learn a thing or two about the syntax of the JOINS, take a look at the linked tutorial.

Side note: We will be working with the ‘employees’ database, so if you don’t have it on your computer, feel free to check out how to download it in the following article.

Man looking out of a window with his laptop, duplicate records in sql

Generally, duplicate rows are not always allowed in a database or a data table. The regular practice of many companies is to clean their data from such occurrences. However, they are sometimes encountered, especially in new, raw, or uncontrolled data. In this tutorial, we will show you a way to handle them in your queries.

How to Create Duplicate Records in SQL?

As a matter of fact, our data is clean; it does not have any duplicate rows. Therefore, we can insert one row in the “dept_manager_dup” table that is identical to one of the others:

INSERT INTO dept_manager_dup

VALUES('110228', 'd003', '1992-03-21', '9999-01-01');

The last value is ‘9999-01-01’ because we want the contract to be with an indefinite end.

The last value is ‘9999-01-01’ because we want the contract to be with an indefinite end, duplicate records in sql

We will also create a duplicate row in the other table. Let it be the ninth department, “Customer Service”. The code should be:

INSERT INTO departments_dup

VALUES('d009', 'Customer Service');

Results from the Code

Let’s check what happened with the two tables first. We can achieve that by writing:

SELECT

    *

FROM

    dept_manager_dup

ORDER BY dept_no ASC;

The table with the data about the managers looks like it hasn’t changed, duplicate records in sql

The table with the data about the managers looks like it hasn’t changed. However, the employee numbered 110228 appears there twice, as shown in the picture above.

data science training

And how about the “departments_dup” table? We find it in the same way we did before; the only difference is that the Customer Service department is shown twice, as you can see in the picture below.

Customer Service department is shown twice, duplicate records in sql

Different Output

Fine, the situation is clear. Now, look at the picture below – it is a snipped photo of the output of 20 rows, when the initial state of the two tables was used, without duplicates.

the output of 20 rows, duplicate records in sql

How do you expect the new output to change regarding the one from the picture? Do you think the number of presented rows will be lower, equal, or higher?

Let’s check the output of the same query, this time with the two tables containing duplicate records.

25 rows returned

We obtained 25 rows. Where do you think these additional 5 rows came from?

Why We Have More Rows

First, we don’t have just one, but two employees with number 110228; hence, two times the department name “Human Resources” is appended against them. This gives us a single additional record compared to what we had earlier.

two employees with number 110228;

The remaining 4 extra records are at the end of this result set. The “departments_dup” table contains a duplicate row for department number 9, “Customer Service”. That’s why every time department number 9 from the first table is matched with department number 9 from the second table, two records are displayed.

two records are displayed.

In the picture above, you can see each of these employees twice: 111784, 111877, 111939, and 111692!

How to Handle Duplicate Records in SQL?

Whether having duplicate values is a mistake in the database or you just don’t want to see duplicate rows in query results, here is how to handle duplicates:

GROUP BY the field that differs most among records. In our case, this will be the “emp_no” column. This will stack all rows that have the same “emp_no”. By writing the following code, we should retrieve the initial output, with no duplicate values:

SELECT

    m.dept_no, m.emp_no, d.dept_name

FROM

    dept_manager_dup m

        JOIN

    departments_dup d ON m.dept_no = d.dept_no

GROUP BY emp_no

ORDER BY dept_no;

20 rows returned

Keeping an Eye on Duplicate Data

This is an important tool in your arsenal. You cannot allow yourself to assume there are no duplicate rows in your data, especially if it contains millions of rows! It would be good to get used to grouping the joins by the field that differs most among records.

So, if you were thinking that there was only one type of a JOIN in SQL, let me prove you wrong. Dive into the next tutorial to see what a LEFT JOIN is.

***

Eager to hone your SQL skills? Learn how to put theory into practice with our hands-on tutorials!

Next Tutorial: SQL LEFT JOIN

Earn your Data Science Degree

Expert instructions, unmatched support and a verified certificate upon completion!

Leave a Reply

Your email address will not be published.

You have Successfully Subscribed!