Systems Engineering and RDBMS

Sparse Columns in SQL Server 2008

Posted by decipherinfosys on July 13, 2008

Sparse columns are a new feature in SQL Server 2008. These columns are normal columns in a table that have an optimized storage format for NULL values. So, they basically reduce the space requirements when they contain NULL values (they do not take up any space when they have NULL values) but there is an overhead associated with the retrieval of those columns for not null values. BOL states that one should consider them when the savings in terms of space are at least 20-40%. First let’s see an example and then we can go into the advantages and the limitations of sparse columns in SQL Server 2008. Both Create Table and Alter Table commands support the creation/alteration of sparse columns/column sets (we will discuss column sets shortly). In the example below, we are creating a de-normalized table with the employee information and their address in the same table:

CREATE TABLE dbo.SPARSE_TEST
(
ID            INT        IDENTITY(1,1)    NOT NULL,
FIRST_NAME    NVARCHAR(50)            NOT NULL,
LAST_NAME        NVARCHAR(50)            NOT NULL,
ADDR_LINE_1    NVARCHAR(20)            SPARSE NULL,
ADDR_LINE_2    NVARCHAR(20)            SPARSE NULL,
CITY            NVARCHAR(20)            SPARSE NULL,
STATE_NAME    NVARCHAR(2)            SPARSE NULL,
COUNTRY        NVARCHAR(10)            SPARSE NULL,
ZIP_CODE        NVARCHAR(20)            SPARSE NULL,
ADDRESS_SET   XML COLUMN_SET FOR ALL_SPARSE_COLUMNS,
CONSTRAINT PK_SPARSE_TEST PRIMARY KEY (ID)
)
GO
CREATE UNIQUE INDEX SPARSE_TEST_IND_1 ON DBO.SPARSE_TEST (FIRST_NAME, LAST_NAME)
GO

For examples sake, we have put all address fields as sparse fields since an address or part of those attributes might not be known. And then you can see in the end that we have an ADDRESS_SET which is of data type XML and is defined as a column set for all the sparse columns in the table – per table, there can be only one column set. A column set is essentially an untyped XML representation that combines all the sparse columns in a table into a more structured output. You can read more on when to use column sets and when not to use them in the BOL online version here. There are quite a few major benefits of this – like using it for a content management system like MS Sharepoint or even for product catalogs where every column does not apply to each and every product. Having the column set in an XML format also means that the application can even select/update/insert this data set. One thing to caution for is that if you are using this column for updating the values, be specific in giving values to all the sparse columns else you will end up putting the rest of the sparse columns to be NULL.

Let’s populate it with some data now:

set nocount on
go
insert into dbo.sparse_test (first_name, last_name, addr_line_1, addr_line_2, city, state_name, country, zip_code)
values ('Jack', 'Black', '123 hill street', null, 'Seattle', 'WA', 'USA', NULL);
insert into dbo.sparse_test (first_name, last_name, addr_line_1, addr_line_2, city, state_name, country, zip_code)
values ('Steve', 'Smith', '2300 one point place', 'more address info.', 'Los Angeles', 'CA', 'USA', NULL);
insert into dbo.sparse_test (first_name, last_name, addr_line_1, addr_line_2, city, state_name, country, zip_code)
values ('Dennis', 'Baitis', '15 Main Street', null, 'Atlanta', 'GA', 'USA', '30339');

And now, let’s look at the data:

If we just do a “Select * from dbo.sparse_test”, we will not even get the sparse columns in the output – infact, we will get only the column_set column. Output is too big to post here – so, here is the output for the first record’s column set:

<ADDR_LINE_1>123 hill street</ADDR_LINE_1><CITY>Seattle</CITY><STATE_NAME>WA</STATE_NAME><COUNTRY>USA</COUNTRY>

As you can see, you only see those columns in here that had values provided for them. Sparse columns also work very well with filtered indexes. So, to sum up, the advantages are that:

a) when the sparse columns contain NULL values, they do not take up any space at all. We will do some benchmarks in the coming days to see how much savings that translates into.

b) column behavior remains the same as other non-sparse columns so application wise, there is no change.

c) column set is present which behaves like a normal column and can be very useful in scenarios where not all the attributes apply to all the items in the tables like a product catalog (though through a normalized design, it is less of a problem) – these are also very useful though for any content management system like MS Sharepoint since it will help in reducing the storage needs in places where all the attributes do not apply to all the items.  Same is true for any contact management system where a lot of information gets applied to a single individual.  Since a lot of that information applies to only a sub-set of the users, that wastes up a lot of space in current systems.

d) CDC (Change Data Capture) and transactional replication work with sparse columns (column sets does not work).

The limitations are also quite a few:

a) Such columns cannot be a part of the clustered index.

b) Data Compression and Merge replication do not work with these columns.

c) Even though BOL states that there is an overhead to retrieve the values from such columns (the not null values), we haven’t done any benchmark to give you any numbers in that regard…but would assume that
is true and hope that hit is pretty minimal. Believe the hit is because for the not null values, it takes 4 more bytes than the normal column. So, for an int, it will take 8 bytes when there is a value in the column. That is why one has to evaluate when “exactly” is this feature useful – read the BOL link in the start that we had given which mentions that use it only when the savings in terms of space are atleast 20-40%.

d) There is a restriction on the data types i.e. not all columns with every data type can be declared as SPARSE. Text, ntext, image, filestream, other spatial data types cannot be used.

So, bottom line is that before jumping on making use of this new feature, read and understand it’s usage scenarios as well as it’s limitations and then only use it in your applications. This is not for every application but in some types of applications, it can be really very advantageous. This will hopefully also help avoid the design pattern that is called EAV (Entity Attribute Value).

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
%d bloggers like this: