Systems Engineering and RDBMS

Interesting delete issue

Posted by decipherinfosys on April 6, 2007

At one of the client sites, we were presented with this issue yesterday: they had a table whose primary key was composed of two string columns (let's not get into the natural vs. surrogate key debate here; we have blogged about that in the past and you can search for it on our blog).  The business requirement was that the combination of the two columns be unique regardless of their position. So, if COL1 and COL2 are the two columns that make up the primary key of the table, then:

COL1    COL2

ABC      XYZ

XYZ      ABC

To the business, these two rows were one and the same, i.e. regardless of position, if the columns held the same pair of values, it was a duplicate.  Of course, since no rule had been set up to prevent this, either in the UI layer or in the schema, the table had accumulated a lot of "duplicate" data: as far as the database engine is concerned, the two records above are unique, so the primary key does nothing to prevent this so-called duplication.

Our aim was to first delete the duplicates and then enforce a rule so that future dupes are not allowed, keeping the above requirement in mind.  We will assume for the sake of this blog post that the foreign keys are defined as ON DELETE CASCADE and that there are no circular references among the foreign keys.  Let's create some dummy data to demonstrate how this can be done.  We are using SQL Server syntax here; the same issue applies to Oracle/DB2 as well, though the SQL to get rid of the dupes can differ since Oracle has the least() and greatest() functions to make the comparison easier:

CREATE TABLE DEMO_DUPES
(
COL1                VARCHAR(10)    NOT NULL,
COL2                VARCHAR(10)    NOT NULL,
constraint pk_demo_dupes primary key (col1, col2)
)
GO

INSERT INTO DEMO_DUPES VALUES ('ABC', 'XYZ')
INSERT INTO DEMO_DUPES VALUES ('ABC', 'CDE')
INSERT INTO DEMO_DUPES VALUES ('ABC', 'FGH')
INSERT INTO DEMO_DUPES VALUES ('XYZ', 'ABC')
INSERT INTO DEMO_DUPES VALUES ('CDE', 'TEF')
INSERT INTO DEMO_DUPES VALUES ('CDE', 'ABC')
GO

SELECT * FROM DEMO_DUPES

COL1       COL2
---------- ----------
ABC        CDE
ABC        FGH
ABC        XYZ
CDE        ABC      --> DUPE (as per above definition)
CDE        TEF
XYZ        ABC      --> DUPE (as per above definition)

Using SQL Server 2005’s row_number() function (in SQL 2000, you would need to use a table variable/temp table):

delete demo_dupes
from demo_dupes,
     (
       select row_number() over (partition by (case when col1 < col2 then col1 else col2 end),
                                              (case when col1 > col2 then col1 else col2 end)
                                 order by col1, col2) as ROWNUM,
              col1, col2
         from demo_dupes
     ) as IV
where demo_dupes.col1 = IV.col1
  and demo_dupes.col2 = IV.col2
  and IV.rownum <> 1

What this does is rank the rows within each unordered pair: the partition is on the least and the greatest of the two column values, ordered by the column values.  Any duplicate record within a pair therefore gets a ROWNUM value greater than 1 and can be deleted easily.
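To see the ranking before deleting anything, the derived table from the delete above can be run on its own (this is just that inner query, shown standalone for illustration); the rows flagged as dupes in the listing above are the ones that get a ROWNUM greater than 1:

```sql
-- Standalone version of the derived table: each unordered pair of
-- (col1, col2) values forms one partition, and only the first row
-- in each partition gets ROWNUM = 1.
select col1, col2,
       row_number() over (partition by (case when col1 < col2 then col1 else col2 end),
                                       (case when col1 > col2 then col1 else col2 end)
                          order by col1, col2) as ROWNUM
  from demo_dupes
 order by col1, col2
```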

The result is that those 2 dupes get removed and here is the final data set after the delete is executed:

COL1       COL2
---------- ----------
ABC        CDE
ABC        FGH
ABC        XYZ
CDE        TEF

Please note that Oracle already has a least() and a greatest() function, so you can make use of those instead of the CASE expressions shown above.
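As a hedged sketch (assuming the same DEMO_DUPES table exists in an Oracle schema), the equivalent Oracle delete could look like this, with least()/greatest() replacing the CASE expressions:

```sql
-- Oracle sketch: partition each unordered pair using least()/greatest()
-- and delete every row in a pair except the first-ranked one.
DELETE FROM demo_dupes
 WHERE (col1, col2) IN
       (
         SELECT col1, col2
           FROM (
                  SELECT col1, col2,
                         ROW_NUMBER() OVER (PARTITION BY LEAST(col1, col2),
                                                         GREATEST(col1, col2)
                                            ORDER BY col1, col2) AS rn
                    FROM demo_dupes
                )
          WHERE rn > 1
       );
```

Oracle supports the multi-column IN predicate used above, which keeps the correlation between the outer delete and the ranked inner query straightforward.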

Once the de-duplication was done, the next aim was to make sure that this does not occur again, which meant putting a rule in place to dis-allow such inserts.  The key thing is to arrange the two values in a least/greatest order and then put a unique index on that combination; that way the database itself rejects any insert that forms an already-existing pair, regardless of column position.  Here is how that can be done:

Since function-based indexes are not allowed in SQL Server, we can instead add a computed column and enforce uniqueness on it.  Note the '|' separator in the concatenation below: without it, pairs such as ('A', 'BC') and ('AB', 'C') would both compute to 'ABC' and be treated as false duplicates.  Let's go ahead and create such a column and enforce the uniqueness:

alter table demo_dupes add check_val as
(
case when col1 < col2 then col1 else col2 end + '|' +
case when col1 > col2 then col1 else col2 end
)

create unique index check_val_ind_1 on demo_dupes (check_val)
go

Now, let’s try to add a dupe:

INSERT INTO DEMO_DUPES (col1, col2) VALUES ('XYZ', 'ABC');

We will now get an error:

Msg 2601, Level 14, State 1, Line 1
Cannot insert duplicate key row in object 'dbo.DEMO_DUPES' with unique index 'check_val_ind_1'.
The statement has been terminated.

Note that in Oracle this can be done very easily by using the least() and greatest() functions in a FBI (Function Based Index; you can search our blog site for more on FBIs).  The same constraint can be enforced that way, with no need for a computed column as in the case of SQL Server.
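Again as a sketch under the same assumptions (an Oracle schema containing DEMO_DUPES), the function-based index could look like this:

```sql
-- Oracle sketch: a function-based unique index enforces uniqueness of
-- the unordered pair directly; no computed column or separator trick
-- is needed because the two expressions are separate index columns.
CREATE UNIQUE INDEX demo_dupes_fbi_1 ON demo_dupes
  (LEAST(col1, col2), GREATEST(col1, col2));
```

Because least() and greatest() are indexed as two separate key columns rather than one concatenated string, this version also avoids the concatenation-collision issue mentioned for the SQL Server computed column.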
