Another use of the checksum functions
Posted by decipherinfosys on February 11, 2009
We have discussed the checksum functions before in our blog posts – you can access those posts here and here. Yet another use of the checksum functions is to quickly compare two records to see whether anything has changed – a typical requirement when loading data from a source into a destination system. One has to match up the records based on the primary key or alternate key(s) and then ascertain which records have changed so that they can be updated. If you are lucky, the source data might already carry a flag (or status code value) indicating whether a record has changed, in which case you can simply fire off the update for that record. At times, however, no such flag is available, and you need to determine which records are updates and which are inserts. Rather than checking one column at a time to ascertain which attributes have changed, folks typically fire off an update of all the non-primary, non-alternate key attributes for every matching record, which is both a waste of resources and far from optimal. The alternative is not pretty either – checking each column to see whether it has changed and firing off the update only for that subset of records.
Beginning with SQL Server 2008, one can make use of the MERGE command to do this work. Prior to SQL Server 2008, one can make use of the checksum functions instead, and improve this process by just doing a checksum on the source and the destination rows. So, join the source and the destination rows based on the primary key/alternate key(s) and then compare the checksum values – the rows where the checksums differ are the ones that need to be updated. Do remember that the regular CHECKSUM function is case insensitive while the BINARY_CHECKSUM function is case sensitive. There is an outside chance of the checksum functions returning the same value even when the source and the destination rows do not match, but the chance of that happening for matching primary key records is negligible.
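As a quick illustration of that case-sensitivity difference (a minimal sketch – note that CHECKSUM's case insensitivity actually depends on the database collation, whereas BINARY_CHECKSUM always works off the underlying bytes):

```sql
-- On a default, case-insensitive collation, CHECKSUM treats these as equal...
select checksum('Row 1 is same') as CI_Checksum_1,
       checksum('ROW 1 IS SAME') as CI_Checksum_2;        -- same value

-- ...while BINARY_CHECKSUM compares the raw bytes and sees them as different.
select binary_checksum('Row 1 is same') as CS_Checksum_1,
       binary_checksum('ROW 1 IS SAME') as CS_Checksum_2; -- different values
```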
Here is an example:
create table tblSource (col1 int primary key, col2 int, col3 nvarchar(100));
create table tblDestination (col4 int primary key, col5 int, col6 nvarchar(100));
insert into tblSource values (1, 10, 'Row 1 is same');
insert into tblDestination values (1, 10, 'Row 1 is same');
insert into tblSource values (2, 1, 'Row 2 is different');
insert into tblDestination values (2, 10, 'Row 2');
And here is the SQL to do the comparison:
select Source.col1, Source.col2, Source.col3,
       Destination.col4, Destination.col5, Destination.col6,
       checksum(Source.col1, Source.col2, Source.col3) as Source_Checksum,
       checksum(Destination.col4, Destination.col5, Destination.col6) as Destination_Checksum
from dbo.tblSource as Source
inner join dbo.tblDestination as Destination
on Source.col1 = Destination.col4
where checksum(Source.col1, Source.col2, Source.col3) <> checksum(Destination.col4, Destination.col5, Destination.col6);
And you will get this output:
col1  col2  col3                col4  col5  col6   Source_Checksum  Destination_Checksum
----  ----  ------------------  ----  ----  -----  ---------------  --------------------
2     1     Row 2 is different  2     10    Row 2  1950188260       -1661133429
As you can see from the above, the source and the destination checksums are different for the second row. You can use this technique to quickly determine whether there is a difference between the source and the destination row. In addition, you can do this in SSIS as well – you can download the checksum transformation from sqlis.com.
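To round out the example, the same checksum join can drive the update itself so that only the changed rows are touched – a sketch against the tblSource/tblDestination tables from above:

```sql
update Destination
set Destination.col5 = Source.col2,
    Destination.col6 = Source.col3
from dbo.tblDestination as Destination
inner join dbo.tblSource as Source
on Source.col1 = Destination.col4
where checksum(Source.col1, Source.col2, Source.col3) <> checksum(Destination.col4, Destination.col5, Destination.col6);
```

After this runs, only the second row is rewritten; the unchanged first row is never touched, which is precisely the resource saving the technique is after.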