Delete Duplicate Data in SQL Server

Jignesh Trivedi
12y
9.7k
0
2

Article

Problem Statement

We are sometimes in the situation where our database table neither has a primary key nor is unique and the table contains duplicate data. Now we want to delete this duplicate data from this database table.

Example

Suppose I have a table named #DeleteData. The definition of the table and a test data script is given below.

CREATE TABLE #DeleteData
(
Id INT,
Name VARCHAR(20)
)

INSERT INTO #DeleteData VALUES (1,'Jignesh')
INSERT INTO #DeleteData VALUES (1,'Jignesh')
INSERT INTO #DeleteData VALUES (1,'Jignesh')
INSERT INTO #DeleteData VALUES (2,'Tejas')
INSERT INTO #DeleteData VALUES (2,'Tejas')
INSERT INTO #DeleteData VALUES (3,'Rakesh')

This table contains duplicate data.

Now I want to delete duplicate data from the database table.

Solution

There are many ways to delete the duplicate data from the database table. In this article I will explain it one by one.

1. Using CTE (Common Table Expression)

Using a Common Table Expression, we can remove the duplicate record. We can generate CTE with one additional column that says the Row Number is nothing but a serial number that is partitioned by the id. After creating the CTE we can fire a DELETE statement with a row number greater than 1.

;WITH duplicateData AS
(
SELECT id,name, ROW_NUMBER() OVER(PARTITION BY id ORDER BY id) AS rowno FROM #DeleteData
)
DELETE duplicateData WHERE rowno >1

2. Using Temporary Table

This is the very simplest way to delete a duplicate record from the table. In this method, I am inserting distinct data into the temporary table, truncating the original table and re-inserting the data into the original table from the temporary table.

SELECT * INTO #temp FROM #DeleteData
TRUNCATE TABLE #DeleteData
INSERT INTO #DeleteData
SELECT ID, NAME FROM #temp
GROUP BY ID, NAME

3. By Adding Identity Column

In this method, I temporarily added an identity column in the database table. This will generate a sequential number for each row of the database table. Now I keep the lowest row number for each duplicate record and delete all other records.

ALTER TABLE #DeleteData ADD rowNo int identity(1,1)
DELETE #DeleteData
WHERE rowno not in(select min(rowno) from #DeleteData group by id,Name)
ALTER TABLE #DeleteData DROP COLUMN rowNo

4. Using Cursor

The cursor is used to manipulate data in a set on a row-by-row basis. Here the basic idea is to take a unique record from the table, keep it and delete all other records. Using the Group by clause, we can find a unique row of the table and we can perform the loop on this unique record using a cursor. Within the loop we can keep the first and delete all other records.

DECLARE @id int, @Name VARCHAR(20), @Count INT
DECLARE DuplicateCursor CURSOR
FOR SELECT id, Name, Count(1) as c FROM #DeleteData GROUP BY id, Name

OPEN DuplicateCursor
FETCH NEXT FROM DuplicateCursor INTO @id, @Name, @Count
WHILE @@FETCH_STATUS = 0
BEGIN
    DELETE TOP (@Count-1) FROM #DeleteData
    WHERE id = @id AND Name = @Name
    FETCH NEXT FROM DuplicateCursor INTO @id, @Name, @Count
END

5. Using %%physloc%%

%%physloc%% is an undocumented and unsupported feature in SQL Server 2008. This is the same as
%%lockres%% in SQL Server 2005. This is a mechanism for identifying the physical address of a row.

Here the basic idea is to keep the smallest physical address row and delete all other rows.

DELETE #DeleteData
WHERE #DeleteData.%%physloc%%
      NOT IN (SELECT MIN(b.%%physloc%%)
              FROM   #DeleteData b
              GROUP BY b.Id, b.Name);

Final output

Conclusion

We can remove duplicate data (rows) from SQL Server using the above described methods.