Systems Engineering and RDBMS

SSIS – Data Flow Transformations

Posted by decipherinfosys on October 13, 2009

We have written before about different aspects of SQL Server Integration Services – you can read more on those blog posts here.   Until our previous articles we have been looking at mainly the Control Flow Tasks in the Sql Server Integration Services. Control Flow Tasks maintain the Flow of the logic in the SSIS. The main transformation logic in performed by the data flow task. One of the main uses of the Sql Server Integration Services (SSIS) is to perform the Extraction, Transformation, Data Cleansing and Loading (ETL or sometimes called as ETLM) of the data from various sources. As data from various sources might come in a different format, there is generally a requirement of cleaning up of the data before we load the data into our system. Data cleansing or rather transformation in Sql Server Integration Services (SSIS) is done using the Data Transformation Tasks. Today, in our post, we are going to look at a few of such SSIS Data Transformation Tasks.

One of the data flow transformations which is used for the data cleansing purposes is called the  Lookup Transform. We are going to explore the Lookup Transform in Sql Server 2005. There have been some enhancements in SQL Server 2008 which we are going to look into in future posts. The look up transform is very similar to that of the lookup tab in the data pump task in SQL Server 2000 DTS. This task is used to derive new columns based on the existing set of columns. This transform performs look up on an existing table based on the columns which are existing in the dataflow (incoming source). Let us explore more into this transform with the help of an example.

We will pick up a business use case scenario which is very common and we have faced it as well in our projects.  Let us consider a business scenario where we are getting an inbound file from a client and the file contains the health application records created on that particular day. Furthermore the file contains the City and the State Code but not the zip code. Our task is to look up the zip code by using the City and State Code Columns and populate our destination table. If the look up fails, then, we need to insert a -1 in the destination table.

For that we should have an already existing table in the database which stores the state code, city and the zip code. The following code block creates the table and populates it with the required data.

IF EXISTS(SELECT 1 FROM SYS.TABLES WHERE NAME ='lkpZipCode')
DROP TABLE lkpZipCode
GO
CREATE TABLE lkpZipCode
(
ZipCodeID INT IDENTITY(1,1) PRIMARY KEY
, CityCode VARCHAR(20)
, StateCode CHAR(2)
, ZipCode VARCHAR(20)
)

/*Using SQL Server 2008 functionality to do the inserts in one single statement*/

INSERT INTO lkpZipCode (CityCode, StateCode, ZipCode)
VALUES (‘Bellevue’,’WA’,98007),
(‘Seattle’,’WA’,98004),
(‘Atlanta’,’GA’,30339),
(‘Alpharetta’,’GA’,30005),
(‘Irvington’,’AL’,36544),
(‘Jacksonville’,’FL’,36265),
(‘Austin’,’TX’,73301),
(‘Houston’,’TX’,77001)
GO

Let us now create a table to hold the incoming values. The following piece of code creates the destination table.

IF EXISTS(SELECT 1 FROM SYS.TABLES WHERE NAME ='Customers')
DROP TABLE Customers
GO

CREATE TABLE dbo.Customers
(
CustomerID     INT IDENTITY(1,1) PRIMARY KEY
, CustomerFirstName Varchar(20) NOT NULL
, CustomerLastName  Varchar(20) NOT NULL
, City        Varchar(20)   NOT NULL
, [State]        Varchar(2)   NOT NULL
, ZipCode        Varchar(20)          NOT NULL
)

Create a text file (Customers.txt) and populate it with some sample data.

Open up the Business Intelligence Development Studio (BIDS) and create a new SSIS project.   If you are new to SSIS, we would recommended to read our previous articles about creating and running the SSIS Packages using BIDS.

Package Creation:

Since we are dealing with the flow of data in this scenario, we should use the data flow tasks to transform the data according to our needs.  Drag and drop the Data Flow task from the Control Flow task onto the Control Flow Pane of the BIDS.   Double click on the Data Flow Task which takes us to the data flow pane.   Since our source is a text file, drag and drop the Flat File Source from the data flow task on to the data flow pane.   Configure the Flat File Source to use the flat file we created earlier.

Configuration of the Lookup Transform:

Since we have to derive the zip code based on the State and City columns in the text file, by looking up the zip code in our look up table, we have to use the look up transform.   Drag and drop the lookup transform from the Data Flow Task on to the Data Flow pane.   Double click on the Transform to edit it.   The following image shows the Reference Table Tab of the Transform:

SSIS_DFT_1

This shows the reference table tab of the lookup transform.   In this tab, we can configure the table which we will use to look up our zip code based on the City and the State columns from the incoming file.   Click on the NEW button beside the OLEDB Connection Manager to create a connection to the database in which our table is located. Provide the server name and the database name in the relevant fields and click OK.   The image below shows the creation of the connection to the database:

SSIS_DFT_2

We can either specify a table or a sql query to generate the result set against which we will do a lookup using the City and State columns from the incoming file.   In our case we are using the table to do a look up.   If needed, we can specify a sql query by clicking on the radio button against the ‘Use Results of an Sql Query’ and specify the sql query.   For our scenario, drop down the ‘Use Table or View’ and select the table lkpZipCode.   The image shown below depicts the same:

SSIS_DFT_3

In the columns tab, we configure the columns on which we do a look up on the source columns and the destination columns.   The image below shows the configuration of columns tab in the Lookup Table:

SSIS_DFT_4

Drag the City column from the Source Available columns and drop it on the CityCode column of the destination.   Repeat the same with the State column in the source and the StateCode in the destination.   Since we are looking up the ZipCode column in the lookup table, mention the same in the Lookup Column of the editor.   Since this is a new column added to the data flow, select the ‘Add as new column’ in the lookup operation drop down.   Give the output alias as ZipCode.   This finishes the configuration of the Lookup Transformation.

Lookup Transform Outputs

The lookup Transform has two outputs:  Lookup Match output and the Error Output.   If the Lookup is successful then those rows are sent out to the Lookup Match output.  If the lookup fails to find any records, then those rows are sent to the Error Output.

Derived Column Transform:

As per our business scenario, if the lookup fails, then we have to insert a -1 into the look up table.   For that we have to make use of the Derived Column Transform.   A Derived Column Transform derives a new column and adds it to the data flow based on either the existing columns in the data flow or based on the expressions.   Drag and drop the Derived Column Transform on to the data flow from the data flow task and double click on it to edit the properties.   The image below shows the Transformation editor:

SSIS_DFT_5

In the derived column name provide the name of the column we intend to derive. In this case, since we are intending to insert a -1 zip code if the look up fails, type the name of the derived column as ZipCode and the Expression as -1.   In the derived column, select the ‘add as new column’ since we are adding a new column to the dataflow.  Drag the error output (red) arrow and join the Derived column Transform.   This opens up the editor as shown below:

SSIS_DFT_6

Even if the lookup fails, since we should not error the rows out, select the Redirect Row option in the Error column of the Lookup Output as shown in the image above and click OK.   This will redirect the rows on which the lookup failed to the Derived Column Transform, which adds the column ZipCode with a value of -1 to the data flow for the failed lookup records.

UNION ALL Transform

Since we have to combine the result set of both the successful lookups and un-successful lookup, we will use the UNION ALL Transform.   Drag and drop the union all transform from the data flow task on the data flow.   Drag the green arrow (look up match) arrow from the lookup transform and connect it to one of the input of the union all transform.

OLEDB Destination

As a final step, we have to load the cleansed records into our destination table.   To do that drag and drop the OLEDB Destination from the Data Flow tasks on to the data flow.   Configure the data flow to insert the data set from the UNION ALL transform into the destination table.   This transform has been dealt in detail in our previous posts. The image below shows the screen shot of the entire SSIS package:

SSIS_DFT_7

To summarize, in this blog post, we discussed the use of SSIS to cleanse/transform the data from various sources.  We understood the importance of the lookup transform and dealt with the configuration of the lookup transform in detail.   We also explored other transforms like Derived Column Transforms and UNION ALL transform which are also very handy transforms in the data flow.   In some of the upcoming posts, we will look at some of the enhancements in SQL Server 2008 and will also cover other business use case scenarios.

Resources:

  • Lookup Transform – MSDN article – here.
  • MVP Stacia Misner’s post on Lookup Transform and Caching – here.
  • MVP Jorg Klein’s post on the case sensitive nature of the lookup transform – here.
  • Arshad’s article on sqlserverperformance.com – it discusses the SSIS enhancements in SQL Server 2008 and has a series of posts on the topic – here.
  • Brian Knight’s books on SSIS – these are the best resource for anyone new or experienced in SSIS.  There are other authors as well – too many books & authors to list here – you can see all of them from this link at amazon – here.

6 Responses to “SSIS – Data Flow Transformations”

  1. Ralph Wilson said

    Other than the fact that you are storing the Zip Code as an integer value, this is a really good article.

    However, I strongly disagree with the implied “Best Practice” of storing the Zip COde as an Integer. Zip COdes are not numeric values even though they are comprised of numeric characters. They are still strings (VarChars or Chars). If you store a Zip Code as a numeric value, then you will constantly be converting between strings and integers or integers and strings as you make use of them. You will also lose any leading zeros and, in order to properly display the Zip Code with the leading zeros, you will have to manipulate the string (which you hwill have already had to convert from the integer value) so that you can left-zero-pad the string.

    Perhaps it would have been better to have stuffed the Zip Code column with ‘00000’.😉

    • You are absolutely right Ralph. It should be variable length string … besides the points mentioned by you, for international zip codes, it is alphanumeric a majority of the times. Sorry for the oversight and thanks for the note in this regard🙂

  2. Piyurkumar said

    Instead of having “Union All”, set look up to “Ignore Failure”, take “Derived Column Task” and set the null value columns( i.e. ISNULL([LKPColumn])? DefaultValue :[LKPColumn]) and Performance will be much better (40%). Generate 2M data and load to some destination using both methods and see the differences…🙂

  3. Alok Raj said

    Hello….

    I want to know how can we change the source table name dynamically in the SSIS. Each day a new table is generated and the SSIS package fetches data from the generated Table.

    Regards,
    Alok Raj

    • Piyurkumar said

      Yes, you can do that.
      But to achieve that :

      1. You need to create one variable in SSIS, let’s say User::vSQL.
      2. Prepare SQL statement in expression of User::vSQL variable, “SELECT col1, col2 FROM tblName”
      3. Select “Data Access Mode” property in “OLEDB Source” as “SQL command from Variable” and select variable name, User::vSQL.

      Make sure you need to set up some mechanism to get correct newly generated table name at runtime using some ParameterTable or PackageVariableTable and assign to User::vSQL in SSIS.

  4. Ganesh said

    This article is good but has many missing steps. I know it’s hard to write a blog.

    Piyur Kumar, you are right about setting the Lookup transformation. This really worked very well for me. The original steps described in the article leads to data loss. This Union all control omits rows. I had a hard time figuring out, but thankfully I looked at your comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
%d bloggers like this: