Systems Engineering and RDBMS

Identifying Duplicate Data in the table

Posted by decipherinfosys on January 5, 2008

In response to one of the previous blog posts on how to delete duplicate data one of our readers asked how to identify the count of the duplications. It is pretty simple and can be done by simply grouping on the columns that you need to find the duplicate occurences of. We will use the same example as we did for the deletion of the duplicate data (SQL Server Syntax):

set nocount on

declare @test_tbl TABLE (Vendor_Name varchar(10), QTY INT, Order_Date datetime)

INSERT INTO @test_tbl (Vendor_Name, QTY, Order_Date) Values (‘ABCD’, 1000, ’08/19/2007 10:00:05′)
INSERT INTO @test_tbl (Vendor_Name, QTY, Order_Date) Values (‘ABCD’, 1000, ’08/20/2007 11:00:10′)
INSERT INTO @test_tbl (Vendor_Name, QTY, Order_Date) Values (‘ABCD’, 1000, ’08/20/2007 12:15:10′)
INSERT INTO @test_tbl (Vendor_Name, QTY, Order_Date) Values (‘XYZ’, 2000, ’09/03/2007 10:00:10′)
INSERT INTO @test_tbl (Vendor_Name, QTY, Order_Date) Values (‘XYZ’, 2000, ’09/03/2007 11:00:00′)

select vendor_name, qty, count(*) as occurence
from @test_tbl
group by vendor_name, qty
having count(*) > 1

vendor_name qty         occurence
 ----------- ----------- -----------
 ABCD        1000        3
 XYZ         2000        2

And if you now want to delete the duplicate enteries, you can take a look at the previous post that explains how easy it is to delete the duplicate
data using a CTE (Common Table Expression) – we used CTE to demonstrate the functionality since that is available in SQL Server 2005, Oracle and DB2. If you
need to do this in SQL Server 2000, you can also do this:

Assuming that there was a primary key in the table (all good designs should always have primary keys/alternate keys defined):

declare @test_tbl TABLE (ID_VAL INT IDENTITY PRIMARY KEY, Vendor_Name varchar(10), QTY INT, Order_Date datetime)

INSERT INTO @test_tbl (Vendor_Name, QTY, Order_Date) Values (‘ABCD’, 1000, ’08/19/2007 10:00:05′)
INSERT INTO @test_tbl (Vendor_Name, QTY, Order_Date) Values (‘ABCD’, 1000, ’08/20/2007 11:00:10′)
INSERT INTO @test_tbl (Vendor_Name, QTY, Order_Date) Values (‘ABCD’, 1000, ’08/20/2007 12:15:10′)
INSERT INTO @test_tbl (Vendor_Name, QTY, Order_Date) Values (‘XYZ’, 2000, ’09/03/2007 10:00:10′)
INSERT INTO @test_tbl (Vendor_Name, QTY, Order_Date) Values (‘XYZ’, 2000, ’09/03/2007 11:00:00′)

select * from @test_tbl

/**********************************
Using the NOT IN clause
**********************************/
delete from @test_tbl
where ID_VAL NOT IN (select MAX(ID_VAL)
from @test_tbl
group by vendor_name, qty)

select * from @test_tbl

ID_VAL      Vendor_Name QTY         Order_Date
 ----------- ----------- ----------- -----------------------
 3           ABCD        1000        2007-08-20 12:15:10.000
 5           XYZ         2000        2007-09-03 11:00:00.000

or

/**********************************
Using the EXISTS clause
One can change it to be NOT EXISTS
as well and reverse the comparison
operator
**********************************/

delete @test_tbl
from @test_tbl as outer_tbl
where exists (select 1
from @test_tbl as inner_tbl
where inner_tbl.vendor_name = outer_tbl.vendor_name
and inner_tbl.qty = outer_tbl.qty
and inner_tbl.id_val > outer_tbl.id_val)

select * from @test_tbl

ID_VAL      Vendor_Name QTY         Order_Date
 ----------- ----------- ----------- -----------------------
 3           ABCD        1000        2007-08-20 12:15:10.000
 5           XYZ         2000        2007-09-03 11:00:00.000

or

/**********************************
Dumping the data in a temp table
after cleaning it up,
deleting the data from old table
and inserting it back
**********************************/
select vendor_name, qty, max(order_date) as order_date
into #temp
from @test_tbl
group by vendor_name, qty

delete from @test_tbl
insert into @test_tbl (vendor_name, qty, order_date) select vendor_name, qty, order_date from #temp

select * from @test_tbl

ID_VAL      Vendor_Name QTY         Order_Date
 ----------- ----------- ----------- -----------------------
 6           ABCD        1000        2007-08-20 12:15:10.000
 7           XYZ         2000        2007-09-03 11:00:00.000

Assuming that there was no primary key in this table:

set nocount on

declare @test_tbl TABLE (Vendor_Name varchar(10), QTY INT, Order_Date datetime)

INSERT INTO @test_tbl (Vendor_Name, QTY, Order_Date) Values (‘ABCD’, 1000, ’08/19/2007 10:00:05′)
INSERT INTO @test_tbl (Vendor_Name, QTY, Order_Date) Values (‘ABCD’, 1000, ’08/20/2007 11:00:10′)
INSERT INTO @test_tbl (Vendor_Name, QTY, Order_Date) Values (‘ABCD’, 1000, ’08/20/2007 12:15:10′)
INSERT INTO @test_tbl (Vendor_Name, QTY, Order_Date) Values (‘XYZ’, 2000, ’09/03/2007 10:00:10′)
INSERT INTO @test_tbl (Vendor_Name, QTY, Order_Date) Values (‘XYZ’, 2000, ’09/03/2007 11:00:00′)

select identity(int, 1, 1) as ID_VAL, vendor_name, qty, order_date into #temp
from @test_tbl
order by vendor_name, qty, order_date

delete #temp
from #temp as outer_tbl
where exists (select 1
from #temp as inner_tbl
where inner_tbl.vendor_name = outer_tbl.vendor_name
and inner_tbl.qty = outer_tbl.qty
and inner_tbl.id_val > outer_tbl.id_val)

delete from @test_tbl
insert into @test_tbl (vendor_name, qty, order_date) select vendor_name, qty, order_date from #temp

select * from @test_tbl

Vendor_Name QTY         Order_Date
----------- ----------- -----------------------
ABCD        1000        2007-08-20 12:15:10.000
XYZ         2000        2007-09-03 11:00:00.000

And please add the Primary Key column then to the main table.

Sorry, the comment form is closed at this time.

 
%d bloggers like this: