Systems Engineering and RDBMS

Bulk Data Loading for the interfaces

Posted by decipherinfosys on April 2, 2007

A key design for interfacing with another application where the integration is not real time and the volume of data involved is high is that the data should get into the system as fast as possible.  Whether that data is in the form of XML of just data records doesn’t matter.  We had a similar situation at a client site over the weekend.  The client was integrating with another application and they get their data loads in the form of an XML document.  This data is provided to them every 3 hours and they are supposed to receive the XML, consume it and process it as well as send back an XML document detailing any errors that occured in the load so that the host system could correct it on their end.  It was also required to store the error messages within their system and to provide the supervisor the ability to correct any issues with the data that was provided so that they do not have to wait for another 3 hours to get the data again.  The host system in this case was a mainframe system which was capable of providing the data only ever 3 hours since it required a lot of processing to be done.

This client application was being benchmarked over the weekend and they started running into performance issues as soon as the XML document size increased.  It started taking them nearly 6 hours to process the document.  And the reason for that was that the application was loading the document into memory and then firing off the inserts one record at a time after doing the transformations and look-ups against it.  I got a call on Saturday night to help identify and resolve this issue since this week they are supposed to go live.  It did not take much time to identify the performance bottleneck – after it was identified, we decided to move to SQLXML (this client was using SQL Server 2005).  Once we switched to SQLXML and used bulk data load into a set of interface tables, the data upload was finished in a matter of minutes.  Once the data was loaded into these interface tables, all that remained was to process this data from the interface tables to the actual transactional tables.  We wrote database stored procedure to do that and the entire processing from start to finish was done in less than 10 minutes.  Any errors that occured were logged into a message_log table and was cross-referenced with the data in the interface tables.  We then prepared a GUI for their supervisor to help see the failed records and correct them, if needed.  This will be converted to an alert today and via a dashboard, the supervisor will be able to see the summary as well as the failed records.

The aim in such situations should be to get the data into the system as soon as possible since bulk data load will reduce the loading time.  The transformation, look-up and subsequent updates to the transacational tables can be then done through database code in a SET based fashion.  The stored procedure code can first weed out the bad records via validation routine and then process the remaining good records in a SET based fashion to do the insert/update/delete logic.

Sorry, the comment form is closed at this time.

%d bloggers like this: