Systems Engineering and RDBMS

Deleting Duplicate Data

Posted by decipherinfosys on September 12, 2007

Ran into an issue yesterday at a client site where some of the tables in the schema did not have any primary key, any alternate key enforced. As a result, the same order for the same vendor was entered into the system multiple times by different operators. The only thing distinguishing them was the datetime field. This was creating issues during invoicing since the company would end up placing multiple duplicate orders. So, the deal was to first delete the duplicate data and to preserve the first order for the given vendor and the given quantity i.e. the vendor_name + qty was their alternate key in the system. This delete was achieved quite simply by using a CTE (Common Table Expressions) and once the delete was done, the alternate key was enforced in the system to prevent such an occurence in the future.

Here is a sample code snippet to show how in SQL Server 2005, using CTE and the ranking functions, one can get rid of the duplicate data. There are many other ways of doing it as well. This is just one of the ways.

set nocount on

declare @test_tbl TABLE (Vendor_Name varchar(10), QTY INT, Order_Date datetime)

INSERT INTO @test_tbl (Vendor_Name, QTY, Order_Date) Values (‘ABCD’, 1000, ’08/19/2007 10:00:05′)
INSERT INTO @test_tbl (Vendor_Name, QTY, Order_Date) Values (‘ABCD’, 1000, ’08/20/2007 11:00:10′)
INSERT INTO @test_tbl (Vendor_Name, QTY, Order_Date) Values (‘ABCD’, 1000, ’08/20/2007 12:15:10′)
INSERT INTO @test_tbl (Vendor_Name, QTY, Order_Date) Values (‘XYZ’, 2000, ’09/03/2007 10:00:10′)
INSERT INTO @test_tbl (Vendor_Name, QTY, Order_Date) Values (‘XYZ’, 2000, ’09/03/2007 11:00:00′)
INSERT INTO @test_tbl (Vendor_Name, QTY, Order_Date) Values (‘XYZ’, 2000, ’09/03/2007 15:10:30′)

/*before delete*/
select * from @test_tbl

Vendor_Name QTY Order_Date
———– ———– ———————–
ABCD 1000 2007-08-19 10:00:05.000
ABCD 1000 2007-08-20 11:00:10.000
ABCD 1000 2007-08-20 12:15:10.000
XYZ 2000 2007-09-03 10:00:10.000
XYZ 2000 2007-09-03 11:00:00.000
XYZ 2000 2007-09-03 15:10:30.000

;with test (row_num, Vendor_Name, QTY) as
(
select row_number() over (partition by Vendor_Name, QTY order by Order_Date asc) as row_num, Vendor_Name, QTY
from @test_tbl
)
delete test where row_num > 1;

select * from @test_tbl

/*after delete*/

Vendor_Name QTY Order_Date
———– ———– ———————–
ABCD 1000 2007-08-19 10:00:05.000
XYZ 2000 2007-09-03 10:00:10.000

The same logic can be applied in Oracle as well since the same functionality exists in Oracle also.

One Response to “Deleting Duplicate Data”

  1. […] by decipherinfosys on January 5, 2008 In response to one of the previous blog posts on how to delete duplicate data one of our readers asked how to identify the count of the duplications. It is pretty simple and can […]

Sorry, the comment form is closed at this time.

 
%d bloggers like this: