SSIS – Data Flow Transformations
Posted by decipherinfosys on October 13, 2009
We have written before about different aspects of SQL Server Integration Services – you can read more on those blog posts here. Until our previous articles we have been looking at mainly the Control Flow Tasks in the Sql Server Integration Services. Control Flow Tasks maintain the Flow of the logic in the SSIS. The main transformation logic in performed by the data flow task. One of the main uses of the Sql Server Integration Services (SSIS) is to perform the Extraction, Transformation, Data Cleansing and Loading (ETL or sometimes called as ETLM) of the data from various sources. As data from various sources might come in a different format, there is generally a requirement of cleaning up of the data before we load the data into our system. Data cleansing or rather transformation in Sql Server Integration Services (SSIS) is done using the Data Transformation Tasks. Today, in our post, we are going to look at a few of such SSIS Data Transformation Tasks.
One of the data flow transformations which is used for the data cleansing purposes is called the Lookup Transform. We are going to explore the Lookup Transform in Sql Server 2005. There have been some enhancements in SQL Server 2008 which we are going to look into in future posts. The look up transform is very similar to that of the lookup tab in the data pump task in SQL Server 2000 DTS. This task is used to derive new columns based on the existing set of columns. This transform performs look up on an existing table based on the columns which are existing in the dataflow (incoming source). Let us explore more into this transform with the help of an example.
We will pick up a business use case scenario which is very common and we have faced it as well in our projects. Let us consider a business scenario where we are getting an inbound file from a client and the file contains the health application records created on that particular day. Furthermore the file contains the City and the State Code but not the zip code. Our task is to look up the zip code by using the City and State Code Columns and populate our destination table. If the look up fails, then, we need to insert a -1 in the destination table.
For that we should have an already existing table in the database which stores the state code, city and the zip code. The following code block creates the table and populates it with the required data.
IF EXISTS(SELECT 1 FROM SYS.TABLES WHERE NAME ='lkpZipCode')
DROP TABLE lkpZipCode
CREATE TABLE lkpZipCode
ZipCodeID INT IDENTITY(1,1) PRIMARY KEY
, CityCode VARCHAR(20)
, StateCode CHAR(2)
, ZipCode VARCHAR(20)
/*Using SQL Server 2008 functionality to do the inserts in one single statement*/
INSERT INTO lkpZipCode (CityCode, StateCode, ZipCode)
Let us now create a table to hold the incoming values. The following piece of code creates the destination table.
IF EXISTS(SELECT 1 FROM SYS.TABLES WHERE NAME ='Customers')
DROP TABLE Customers
CREATE TABLE dbo.Customers
CustomerID INT IDENTITY(1,1) PRIMARY KEY
, CustomerFirstName Varchar(20) NOT NULL
, CustomerLastName Varchar(20) NOT NULL
, City Varchar(20) NOT NULL
, [State] Varchar(2) NOT NULL
, ZipCode Varchar(20) NOT NULL
Create a text file (Customers.txt) and populate it with some sample data.
Open up the Business Intelligence Development Studio (BIDS) and create a new SSIS project. If you are new to SSIS, we would recommended to read our previous articles about creating and running the SSIS Packages using BIDS.
Since we are dealing with the flow of data in this scenario, we should use the data flow tasks to transform the data according to our needs. Drag and drop the Data Flow task from the Control Flow task onto the Control Flow Pane of the BIDS. Double click on the Data Flow Task which takes us to the data flow pane. Since our source is a text file, drag and drop the Flat File Source from the data flow task on to the data flow pane. Configure the Flat File Source to use the flat file we created earlier.
Configuration of the Lookup Transform:
Since we have to derive the zip code based on the State and City columns in the text file, by looking up the zip code in our look up table, we have to use the look up transform. Drag and drop the lookup transform from the Data Flow Task on to the Data Flow pane. Double click on the Transform to edit it. The following image shows the Reference Table Tab of the Transform:
This shows the reference table tab of the lookup transform. In this tab, we can configure the table which we will use to look up our zip code based on the City and the State columns from the incoming file. Click on the NEW button beside the OLEDB Connection Manager to create a connection to the database in which our table is located. Provide the server name and the database name in the relevant fields and click OK. The image below shows the creation of the connection to the database:
We can either specify a table or a sql query to generate the result set against which we will do a lookup using the City and State columns from the incoming file. In our case we are using the table to do a look up. If needed, we can specify a sql query by clicking on the radio button against the ‘Use Results of an Sql Query’ and specify the sql query. For our scenario, drop down the ‘Use Table or View’ and select the table lkpZipCode. The image shown below depicts the same:
In the columns tab, we configure the columns on which we do a look up on the source columns and the destination columns. The image below shows the configuration of columns tab in the Lookup Table:
Drag the City column from the Source Available columns and drop it on the CityCode column of the destination. Repeat the same with the State column in the source and the StateCode in the destination. Since we are looking up the ZipCode column in the lookup table, mention the same in the Lookup Column of the editor. Since this is a new column added to the data flow, select the ‘Add as new column’ in the lookup operation drop down. Give the output alias as ZipCode. This finishes the configuration of the Lookup Transformation.
Lookup Transform Outputs
The lookup Transform has two outputs: Lookup Match output and the Error Output. If the Lookup is successful then those rows are sent out to the Lookup Match output. If the lookup fails to find any records, then those rows are sent to the Error Output.
Derived Column Transform:
As per our business scenario, if the lookup fails, then we have to insert a -1 into the look up table. For that we have to make use of the Derived Column Transform. A Derived Column Transform derives a new column and adds it to the data flow based on either the existing columns in the data flow or based on the expressions. Drag and drop the Derived Column Transform on to the data flow from the data flow task and double click on it to edit the properties. The image below shows the Transformation editor:
In the derived column name provide the name of the column we intend to derive. In this case, since we are intending to insert a -1 zip code if the look up fails, type the name of the derived column as ZipCode and the Expression as -1. In the derived column, select the ‘add as new column’ since we are adding a new column to the dataflow. Drag the error output (red) arrow and join the Derived column Transform. This opens up the editor as shown below:
Even if the lookup fails, since we should not error the rows out, select the Redirect Row option in the Error column of the Lookup Output as shown in the image above and click OK. This will redirect the rows on which the lookup failed to the Derived Column Transform, which adds the column ZipCode with a value of -1 to the data flow for the failed lookup records.
UNION ALL Transform
Since we have to combine the result set of both the successful lookups and un-successful lookup, we will use the UNION ALL Transform. Drag and drop the union all transform from the data flow task on the data flow. Drag the green arrow (look up match) arrow from the lookup transform and connect it to one of the input of the union all transform.
As a final step, we have to load the cleansed records into our destination table. To do that drag and drop the OLEDB Destination from the Data Flow tasks on to the data flow. Configure the data flow to insert the data set from the UNION ALL transform into the destination table. This transform has been dealt in detail in our previous posts. The image below shows the screen shot of the entire SSIS package:
To summarize, in this blog post, we discussed the use of SSIS to cleanse/transform the data from various sources. We understood the importance of the lookup transform and dealt with the configuration of the lookup transform in detail. We also explored other transforms like Derived Column Transforms and UNION ALL transform which are also very handy transforms in the data flow. In some of the upcoming posts, we will look at some of the enhancements in SQL Server 2008 and will also cover other business use case scenarios.
- Lookup Transform – MSDN article – here.
- MVP Stacia Misner’s post on Lookup Transform and Caching – here.
- MVP Jorg Klein’s post on the case sensitive nature of the lookup transform – here.
- Arshad’s article on sqlserverperformance.com – it discusses the SSIS enhancements in SQL Server 2008 and has a series of posts on the topic – here.
- Brian Knight’s books on SSIS – these are the best resource for anyone new or experienced in SSIS. There are other authors as well – too many books & authors to list here – you can see all of them from this link at amazon – here.