Systems Engineering and RDBMS

Archive for the ‘Data Model’ Category

Foreign Key Issue and resolutions

Posted by decipherinfosys on August 8, 2008

In one of the applications at a client site, the design allowed bad data to be entered into the system. The issue was addressed by putting a check in the application tier that writes to the database, but no such checks were present at the database tier itself, so data coming in through the feeds still ended up corrupting the data.

The situation was like this: there were two tables, COMPANY and COMPANY_SECURITY. Both had a surrogate key defined as an auto-incrementing ID value. The COMPANY table had an alternate key defined on CUSIP; in Wall Street terms, a CUSIP is unique to every company, so this was a NOT NULL UNIQUE column. The child table (COMPANY_SECURITY) had the foreign key established on the COMPANY_ID surrogate key, and its alternate key was SECURITY_CUSIP. A company whose stock is traded on exchanges like NASDAQ, NYSE etc. can have more than one security offering, and the securities are unique. COMPANY.CUSIP is 6 characters long and COMPANY_SECURITY.SECURITY_CUSIP is 9 characters long, with the first 6 characters being the same as the company CUSIP and the last 3 characters identifying the security (2 identifying the issue and the last being a check digit).

So, based on that understanding, if the COMPANY.CUSIP for a company like Microsoft is 123456, then its securities (COMPANY_SECURITY.SECURITY_CUSIP) should be 123456xyz, 123456abc, etc. But there are no checks at the database level to make sure that the first 6 characters of SECURITY_CUSIP match the parent table's CUSIP, since the FK is established on the surrogate key.

There are several ways to fix this at the database tier:

1) Creation of a new object: Use a trigger to reject bad data being entered into the child table. This can create performance issues in a highly transactional system and is a poor way of enforcing this kind of constraint.

2) Change the table structure design: Change the primary key of the parent table to COMPANY.CUSIP, i.e., make use of the natural key. CUSIP does have business meaning: it never changes once it has been established unless there are M&A (Merger and Acquisition) activities, in which case the older record is marked for deletion and a new one is created to represent the merger/acquisition. A lot of searches are also done on this column, and it is always of the same length. The child table then inherits the CUSIP column, since we would use it for the foreign key enforcement, and stores only the three characters of the security CUSIP instead of all 9. So the child table ends up with CUSIP (6 characters) and SECURITY_CUSIP (3 characters), and that combination makes the row unique. For the reads, there can be either of two strategies:

a) Concatenate the two values from the same table at read time. Do understand, though, that when operations are applied to indexed columns, index seeks do not happen. One can, however, use an indexed view over the concatenation and do the reads off of that.

b) Split the input parameter into two, one of 6 characters and the other of 3, and then, joining the parent and the child table, put each piece into the right WHERE clause condition. This approach has the benefit of using the indexes, since no operations are being done on the indexed columns.
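
A minimal sketch of strategy (b), assuming the redesigned schema from option #2 (COMPANY keyed on CUSIP, and COMPANY_SECURITY carrying a 6-character cusip plus a 3-character security_cusip); the names here are illustrative and are not the DDL shown later in this post:

declare @input_cusip nvarchar(9), @cusip nvarchar(6), @security_suffix nvarchar(3);
set @input_cusip = N'123456123';
set @cusip = substring(@input_cusip, 1, 6);           -- parent portion of the CUSIP
set @security_suffix = substring(@input_cusip, 7, 3); -- security portion

select c.cusip, c.cusip + cs.security_cusip as full_security_cusip
from dbo.company as c
inner join dbo.company_security as cs
on cs.cusip = c.cusip
where c.cusip = @cusip                      -- index seek on the parent key
and cs.security_cusip = @security_suffix;   -- index seek on the child's 3-character column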

3) Introduce a new column: Same as #2, but keep the primary key as is, introduce redundancy by putting the CUSIP column in the child table as well, and establish the FK between that column and the parent's CUSIP column. A FK can be defined against a UNIQUE NOT NULL column as well; it does not necessarily have to be a primary key.
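
A minimal sketch of option #3 against the surrogate-key schema (constraint and column names are illustrative; on a populated table the new column would need a backfill before the NOT NULL could be applied, and a unique index on the parent's cusip serves the same purpose as the UNIQUE constraint shown here):

alter table dbo.company add constraint uq_company_cusip unique (cusip);

-- redundant 6-character cusip on the child, FK'd back to the parent's unique cusip
alter table dbo.company_security add cusip nvarchar(6) not null;

alter table dbo.company_security add constraint fk_company_security_cusip
foreign key (cusip) references dbo.company (cusip);

-- a row-level check ties the redundant column to the first 6 characters of security_cusip
alter table dbo.company_security add constraint ck_security_cusip_prefix
check (cusip = substring(security_cusip, 1, 6));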

4) Add a new check constraint with a UDF: Keep the same schema and use a CHECK constraint to enforce the rule across the two tables. Please do note that the ANSI SQL standard does allow the use of sub-queries in check constraints, but not all RDBMS implement it. What you can do, though, is define a UDF (User Defined Function) that performs an existence check against the parent table and returns a 1 or a 0, and then define the check constraint based on that. For example:

/*********************

Table Definitions

**********************/

create table company
(
company_id int identity(1,1) not null,
company_name nvarchar(20) not null,
ticker nvarchar(10) not null,
cusip nvarchar(6) not null
, constraint pk_company primary key (company_id));

create unique index company_ind_1 on dbo.company (cusip);

create table company_security
(
company_security_id int identity(1,1) not null,
company_id int not null,
security_cusip nvarchar(9) not null,
company_cusip as substring(security_cusip, 1, 6)
, constraint pk_company_security primary key (company_security_id)
, constraint fk_company foreign key (company_id) references dbo.company(company_id));

create unique index company_security_ind_1 on dbo.company_security (security_cusip);

/********************

Define the UDF

*********************/
CREATE FUNCTION dbo.udf_check_cusip
(@company_id int, @security_cusip nvarchar(9))
RETURNS bit
AS
BEGIN
IF EXISTS(SELECT 1
FROM dbo.company
WHERE company_id = @company_id and cusip = substring(@security_cusip, 1, 6))
RETURN 1
RETURN 0
END
/************************************************************

Now, add the CHECK constraint using the above UDF

*************************************************************/

alter table dbo.company_security with check add constraint ck_cusip check (dbo.udf_check_cusip (company_id, security_cusip) = 1);

Let’s test it out now:

declare @i int
insert into dbo.company (company_name, ticker, cusip) values (N'Microsoft', N'MSFT', N'123456');
select @i = SCOPE_IDENTITY()
select @i
insert into dbo.company_security (company_id, security_cusip) values (@i, N'123456123');

This will go through fine.

Now, let’s enter data that violates the check constraint:

insert into dbo.company_security (company_id, security_cusip) values (1, N'100006123');

And we will get this error:

Msg 547, Level 16, State 0, Line 1
The INSERT statement conflicted with the CHECK constraint “ck_cusip”. The conflict occurred in database “AdventureWorks”, table “dbo.company_security”.
The statement has been terminated.

5) Change the table structure: Same as option #2, but in this case we do not change the PK. We keep the surrogate keys, and in the child table we store only the three-character security CUSIP suffix. The unique index is then the combination of the company_id and those three characters. This is akin to having the fixed-length 6-character CUSIP + 3-character SECURITY_CUSIP as the primary key (i.e., a natural key selection like option #2 above). This is fine for the writes. For the reads, even the ones that only need information from the company_security table (though such scenarios will be rare), you will need to join the company and company_security tables and present the security CUSIPs by concatenating the company's CUSIP and the security suffix. The same read strategies as mentioned in #2 above apply.
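
A hedged read sketch for option #5, assuming the child keeps company_id plus only the 3-character suffix (the column name security_cusip_suffix is hypothetical):

select c.cusip + cs.security_cusip_suffix as security_cusip
from dbo.company as c
inner join dbo.company_security as cs
on cs.company_id = c.company_id
where c.cusip = N'123456';   -- seek on the parent's indexed cusip; concatenation happens only in the select list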

So, as you can see from above, there are a lot of ways of addressing this issue, and depending upon the state of the project and the impact to the application, you can choose your solution. Options #2 and #5 are the right design choices.

Note: The client also tried another option: a computed column. The computed column shown above cannot be used to enforce the FK, i.e., you cannot just introduce a computed column based on substring(security_cusip, 1, 6) as defined in the DDL and then try to establish a FK between it and the parent table's CUSIP column.

Posted in Data Model, DB2 LUW, Oracle, SQL Server | Leave a Comment »

Designing Physical tables for a M:N recursive relationship within an Entity

Posted by decipherinfosys on April 25, 2008

This requirement is pretty common in certain industries: many times, the same entity has a recursive M:N (many-to-many) relationship with itself.  For example:

In the pharmaceutical industry, a drug can be composed of many drugs and vice-versa.  Likewise, in a manufacturing industry, a part can be composed of many parts and vice-versa.  The way to model this logical design into its corresponding physical implementation is to split the entity out into two physical tables with two 1:N (one-to-many) relationships between the main table and the child table.  The child table is used only to capture the recursive relationship and any other attributes that are specific to that relationship.

Example:

/*********************************************************************************
We will only consider 2 attributes for the sake of the explanation
**********************************************************************************/
CREATE TABLE dbo.DRUG_MASTER (DRUG_MASTER_ID INT NOT NULL IDENTITY, DRUG_NAME NVARCHAR(100), CONSTRAINT PK_DRUG_MASTER PRIMARY KEY (DRUG_MASTER_ID));
CREATE TABLE dbo.DRUG_RLTN (PARENT_DRUG_ID INT NOT NULL, CHILD_DRUG_ID INT NOT NULL, CONSTRAINT PK_DRUG_RLTN PRIMARY KEY (PARENT_DRUG_ID, CHILD_DRUG_ID));

And we will now have two Foreign Keys from the child to the parent table to help support the two 1:N relations that we have put into place:

ALTER TABLE dbo.DRUG_RLTN ADD CONSTRAINT FK_DRUG_RLTN_PARENT_TO_DRUG_MASTER FOREIGN KEY (PARENT_DRUG_ID) REFERENCES dbo.DRUG_MASTER (DRUG_MASTER_ID),
CONSTRAINT FK_DRUG_RLTN_CHILD_TO_DRUG_MASTER FOREIGN KEY (CHILD_DRUG_ID) REFERENCES dbo.DRUG_MASTER (DRUG_MASTER_ID)
GO

Let us insert some data now to show how this all ties up:

SET NOCOUNT ON
GO
SET IDENTITY_INSERT DRUG_MASTER ON
INSERT INTO dbo.DRUG_MASTER (DRUG_MASTER_ID, DRUG_NAME) VALUES (1, 'DRUG A');
INSERT INTO dbo.DRUG_MASTER (DRUG_MASTER_ID, DRUG_NAME) VALUES (2, 'DRUG B');
INSERT INTO dbo.DRUG_MASTER (DRUG_MASTER_ID, DRUG_NAME) VALUES (3, 'DRUG C');
INSERT INTO dbo.DRUG_MASTER (DRUG_MASTER_ID, DRUG_NAME) VALUES (4, 'DRUG D');
INSERT INTO dbo.DRUG_MASTER (DRUG_MASTER_ID, DRUG_NAME) VALUES (5, 'DRUG E');
SET IDENTITY_INSERT DRUG_MASTER OFF

INSERT INTO dbo.DRUG_RLTN (PARENT_DRUG_ID, CHILD_DRUG_ID) VALUES (1, 2);
INSERT INTO dbo.DRUG_RLTN (PARENT_DRUG_ID, CHILD_DRUG_ID) VALUES (1, 3);
INSERT INTO dbo.DRUG_RLTN (PARENT_DRUG_ID, CHILD_DRUG_ID) VALUES (2, 5);
INSERT INTO dbo.DRUG_RLTN (PARENT_DRUG_ID, CHILD_DRUG_ID) VALUES (5, 4);

As you can see from above, the M:N relationship is captured in the relationship table DRUG_RLTN. Per the sample data (and the verification query below):

  • Drug A is composed of Drug B and Drug C,
  • Drug B is composed of Drug E, and
  • Drug E is composed of Drug D
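
A quick join against the sample data (a minimal verification sketch) returns exactly those pairs:

SELECT P.DRUG_NAME AS PARENT_DRUG, C.DRUG_NAME AS CHILD_DRUG
FROM dbo.DRUG_RLTN AS R
INNER JOIN dbo.DRUG_MASTER AS P ON P.DRUG_MASTER_ID = R.PARENT_DRUG_ID
INNER JOIN dbo.DRUG_MASTER AS C ON C.DRUG_MASTER_ID = R.CHILD_DRUG_ID
ORDER BY P.DRUG_NAME, C.DRUG_NAME;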

In the next post, we will consider how you go about traversing this hierarchy using SQL and will use this as an example.  We will also use the classic employee structure in an organization as an example to show how to traverse hierarchies in SQL Server, Oracle and DB2.

Posted in Data Model | 1 Comment »

Using GUID as the Primary Key

Posted by decipherinfosys on April 1, 2008

We have blogged before about how to choose between a natural key and a surrogate key when doing data modeling work.  One of the recent questions raised was about using a GUID as the primary key column.  The need arose from a requirement that the values be unique throughout the enterprise, not just within a particular table, which is what you get if you go with an identity scheme (SQL Server or DB2) or a sequence (Oracle or DB2).  One can come up with a number-range design to implement this, but that is fraught with danger and very cumbersome to maintain.  So the solution in those situations is to use a GUID, but not as the primary key: instead, create another column as a GUID and use it as an alternate key in the table.  Choosing a GUID as the primary key column has a few disadvantages:

  • The values are long and obscure. They are random and do not follow any pattern that could make them more meaningful to users.  They also take more storage.
  • There is no way to determine the sequence in which uniqueidentifier values were generated, so they are not suited to existing applications that depend on serially incrementing key values.
  • At 16 bytes, the uniqueidentifier data type (in SQL Server) is relatively large compared to other data types such as 4-byte integers. Indexes built on uniqueidentifier keys may therefore be slower than indexes built on an int key: fewer values fit on each page, so more pages have to be traversed for the joins, and the joins themselves are slower.

In such a scenario, where enterprise-wide uniqueness is needed, one solution is to use the INT (or, if needed, BIGINT) data type for the primary key column and implement an alternate key column using the GUID to satisfy the need for a globally unique identifier value.  That way, you have the best of both worlds and will still be able to meet the requirements.
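
A minimal sketch of that arrangement in SQL Server (the table and column names are hypothetical, and NEWSEQUENTIALID() is just one convenient way to default the alternate key):

CREATE TABLE dbo.ORDER_MASTER
(
ORDER_MASTER_ID INT IDENTITY(1,1) NOT NULL,   -- narrow surrogate primary key used for joins/FKs
ORDER_GUID UNIQUEIDENTIFIER NOT NULL
CONSTRAINT DF_ORDER_MASTER_GUID DEFAULT (NEWSEQUENTIALID()),   -- enterprise-wide unique value
ORDER_DT DATETIME NOT NULL,
CONSTRAINT PK_ORDER_MASTER PRIMARY KEY (ORDER_MASTER_ID),
CONSTRAINT AK_ORDER_MASTER_GUID UNIQUE (ORDER_GUID)   -- alternate key on the GUID
);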

Posted in Data Model, DB2 LUW, Oracle, SQL Server | Leave a Comment »

Dimensional Modeling

Posted by decipherinfosys on March 31, 2008

Anyone who has designed and implemented a business intelligence system knows that one of the most important things in dimensional modeling is the declaration of the grain of the fact table.  The grain is a precise statement of exactly what a single record in the fact table represents.  If this is not done properly, you will not be able to roll up a lot of information and your analysis will suffer due to the bad design.  Let’s take an example to illustrate this:

We will pick an example from the medical industry since that was one of our most recent projects.  When implementing a BI solution for medical practices and small hospitals, one of the items of interest to the billing/coding departments was the line item on the bill for the services rendered by the practice.  These were the potential dimensions for such a line item:

  1. Provider (the particular doctor or nurse practitioner who did the work),
  2. Location (since the practice can have multiple locations),
  3. Date (of examination),
  4. Patient,
  5. Diagnosis,
  6. Procedure,
  7. Billing Organization (the practice),
  8. Responsible Party (either parent/guardian or the patients themselves),
  9. Payer (though this can be further divided into Primary Payer and Secondary Payer)

One key thing to note is that the grain of this fact table, for which we listed the potential dimensions above, is a line item on the bill received from the medical practice.  The grain always needs to be very atomic: the smaller and more atomic the measures in the fact table, the more you can infer from the design, since it gives rise to many dimensions across which you can do your analysis.  Now suppose that instead of choosing the grain to be the line item on the bill, we choose a coarser grain, i.e., an aggregation of the atomic data.  An example would be “the total monetary amount charged for a given procedure based on a certain diagnosis, per month”.  If we choose such a grain, we end up with far fewer dimensions.  In the example above, we would potentially have only these dimensions:

  1. Provider (the doctor or the nurse practitioner who did the work),
  2. Procedure,
  3. Diagnosis,
  4. Month

We have taken an extreme case to make the point of choosing the grain judiciously.  As you can see from above, the number of dimensions has shrunk and a lot of the atomic information (and thus those dimensions) is lost.  It would have made no sense to include all nine of the dimensions listed before; even if we had, the fact table’s grain is not atomic enough for the dimensions that represent the atomic data to be meaningful.

Understand that aggregation shrinks the dimensions since it carries less dimensional detail.  As long as one designs the fact tables with a grain that is very atomic (cannot be divided into anything smaller), it is always possible to roll up and present the aggregated information, with a lot more control over how to slice and dice the data to surface other interesting facts.  In the example above, one such measured fact would be the amount billed for that one particular procedural diagnosis.  One can then analyze this data by rolling it up by patient, by procedural diagnosis, by provider, or by combinations like amount billed per month by procedure by provider by location.  As long as one has the atomic data, none of the information gets lost.
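
Purely as an illustration (all table and column names below are hypothetical), an atomic-grain fact table for the line-item example would carry one surrogate key per dimension plus the measured fact:

CREATE TABLE dbo.FACT_BILL_LINE_ITEM
(
PROVIDER_KEY INT NOT NULL,            -- FK to DIM_PROVIDER
LOCATION_KEY INT NOT NULL,            -- FK to DIM_LOCATION
DATE_KEY INT NOT NULL,                -- FK to DIM_DATE (date of examination)
PATIENT_KEY INT NOT NULL,             -- FK to DIM_PATIENT
DIAGNOSIS_KEY INT NOT NULL,           -- FK to DIM_DIAGNOSIS
PROCEDURE_KEY INT NOT NULL,           -- FK to DIM_PROCEDURE
BILLING_ORG_KEY INT NOT NULL,         -- FK to DIM_BILLING_ORG
RESPONSIBLE_PARTY_KEY INT NOT NULL,   -- FK to DIM_RESPONSIBLE_PARTY
PAYER_KEY INT NOT NULL,               -- FK to DIM_PAYER
BILLED_AMT DECIMAL(12,2) NOT NULL     -- the measured fact for this line item
);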

Posted in Business Intelligence, Data Model | 1 Comment »

BitMask Fields

Posted by decipherinfosys on January 4, 2008

This blog post is in response to one of the questions that we got from our reader. Here is the question:

“I’d like to hear your opinions on bitmask fields. I have found a lot of flames on using bitmask fields into RDBMS and most of them look a lot like the cons of EAV models, which I do agree (I just learned it the hard way, after using this design in a system which later became real hell), but what are the cons in bitmask fields when we just need an “open” set of boolean flags? Apart from indexing (which I believe would be impossible to be done) would it be better to add a new bit field for every new flag to your existing databases than using a bitmask field?”

This is a very good question and something that we have seen being used in many designs that we have encountered. There are two scenarios that we will point out in this regard where folks typically make use of this kind of a scheme.

Scenario # 1:

Using a bitmask field and thus de-normalizing and getting rid of an intersection table. Let’s first formulate this scenario using an example.

Suppose that we have a USER logical entity and an ACCESS logical entity in our system, and that they have an M:N (many-to-many) relationship between them. A user can have one or more access permissions, and an access permission can belong to one or more users. The table structures given below are abbreviated just to provide an explanation for this blog post.

Let us first see what this looks like in a properly normalized scheme when we move from the logical model to the physical implementation. We will have three tables: USER_MASTER, ACCESS_MASTER and USER_ACCESS_XREF, with the third acting as the cross-reference (intersection) table. Here are the definitions and some sample data:

USE TEMPDB
GO
SET NOCOUNT ON
GO

CREATE TABLE USER_MASTER
(
USER_MASTER_ID INT IDENTITY NOT NULL,
FIRST_NAME NVARCHAR(30) NOT NULL,
MIDDLE_NAME NVARCHAR(30) NULL,
LAST_NAME NVARCHAR(30) NOT NULL,
CONSTRAINT PK_USER_MASTER PRIMARY KEY (USER_MASTER_ID)
)
GO

CREATE TABLE ACCESS_MASTER
(
ACCESS_MASTER_ID INT IDENTITY NOT NULL,
ACCESS_NAME NVARCHAR(500) NOT NULL,
CONSTRAINT PK_ACCESS_MASTER PRIMARY KEY (ACCESS_MASTER_ID)
)
GO

CREATE TABLE USER_ACCESS_XREF
(
USER_MASTER_ID INT,
ACCESS_MASTER_ID INT,
CONSTRAINT PK_USER_ACCESS_XREF PRIMARY KEY (USER_MASTER_ID, ACCESS_MASTER_ID)
)
GO

INSERT INTO USER_MASTER (FIRST_NAME, LAST_NAME) VALUES ('JOE', 'SMITH');
INSERT INTO USER_MASTER (FIRST_NAME, LAST_NAME) VALUES ('DAN', 'ROGERS');
INSERT INTO USER_MASTER (FIRST_NAME, LAST_NAME) VALUES ('ROBERT', 'BAITIS');

INSERT INTO ACCESS_MASTER (ACCESS_NAME) VALUES ('connect');
INSERT INTO ACCESS_MASTER (ACCESS_NAME) VALUES ('read');
INSERT INTO ACCESS_MASTER (ACCESS_NAME) VALUES ('write');
INSERT INTO ACCESS_MASTER (ACCESS_NAME) VALUES ('alter');
INSERT INTO ACCESS_MASTER (ACCESS_NAME) VALUES ('view definition');

INSERT INTO USER_ACCESS_XREF VALUES (1, 1);
INSERT INTO USER_ACCESS_XREF VALUES (2, 1);
INSERT INTO USER_ACCESS_XREF VALUES (2, 2);
INSERT INTO USER_ACCESS_XREF VALUES (3, 1);
INSERT INTO USER_ACCESS_XREF VALUES (3, 2);
INSERT INTO USER_ACCESS_XREF VALUES (3, 3);
INSERT INTO USER_ACCESS_XREF VALUES (3, 4);
INSERT INTO USER_ACCESS_XREF VALUES (3, 5);

Now, in order to see the users and their access rights, we can simply write up a SQL statement to do the joins between those 3 tables. Example:

SELECT UM.FIRST_NAME + ' ' + COALESCE(UM.MIDDLE_NAME, '') + ' ' + UM.LAST_NAME AS USERNAME, AM.ACCESS_NAME
FROM dbo.USER_MASTER AS UM
INNER JOIN dbo.USER_ACCESS_XREF AS UAX
ON UM.USER_MASTER_ID = UAX.USER_MASTER_ID
INNER JOIN dbo.ACCESS_MASTER AS AM
ON UAX.ACCESS_MASTER_ID = AM.ACCESS_MASTER_ID
ORDER BY UM.USER_MASTER_ID
GO

USERNAME         ACCESS_NAME
---------------- ----------------
JOE  SMITH       connect
DAN  ROGERS      connect
DAN  ROGERS      read
ROBERT  BAITIS   connect
ROBERT  BAITIS   read
ROBERT  BAITIS   write
ROBERT  BAITIS   alter
ROBERT  BAITIS   view definition

When one wants to remove a permission for a given user, one can simply delete the row from the cross-reference table, and to add a permission, one simply inserts another record into the XREF table. If one wants to flatten the above structure out so that there is only a single row per user, that pivoting can be done pretty easily, as the sketch below shows. You can read more on pivoting on our blog site: search for pivot.
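
A minimal flattening sketch against the normalized tables above, using MAX(CASE ...) so that it is not tied to any vendor-specific PIVOT syntax (only the first three access rights are shown):

SELECT UM.FIRST_NAME + ' ' + UM.LAST_NAME AS USERNAME,
MAX(CASE WHEN AM.ACCESS_NAME = 'connect' THEN 'YES' ELSE 'NO' END) AS CONNECT_ACCESS,
MAX(CASE WHEN AM.ACCESS_NAME = 'read' THEN 'YES' ELSE 'NO' END) AS READ_ACCESS,
MAX(CASE WHEN AM.ACCESS_NAME = 'write' THEN 'YES' ELSE 'NO' END) AS WRITE_ACCESS
FROM dbo.USER_MASTER AS UM
LEFT JOIN dbo.USER_ACCESS_XREF AS UAX ON UM.USER_MASTER_ID = UAX.USER_MASTER_ID
LEFT JOIN dbo.ACCESS_MASTER AS AM ON UAX.ACCESS_MASTER_ID = AM.ACCESS_MASTER_ID
GROUP BY UM.USER_MASTER_ID, UM.FIRST_NAME, UM.LAST_NAME
ORDER BY UM.USER_MASTER_ID;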

In a de-normalized scheme, we have seen people add a column to the ACCESS_MASTER table (something like ACCESS_BIT), then store an ACCESS_BIT_MASK column in the USER_MASTER table and get rid of the USER_ACCESS_XREF table altogether. Here is the code:

USE TEMPDB
GO
SET NOCOUNT ON
GO

CREATE TABLE ACCESS_MASTER
(
ACCESS_MASTER_ID INT IDENTITY NOT NULL,
ACCESS_NAME NVARCHAR(500) NOT NULL,
ACCESS_BIT INT NOT NULL,
CONSTRAINT PK_ACCESS_MASTER PRIMARY KEY (ACCESS_MASTER_ID)
)
GO

CREATE TABLE USER_MASTER
(
USER_MASTER_ID INT IDENTITY NOT NULL,
FIRST_NAME NVARCHAR(30) NOT NULL,
MIDDLE_NAME NVARCHAR(30) NULL,
LAST_NAME NVARCHAR(30) NOT NULL,
ACCESS_BIT_MASK INT NOT NULL,
CONSTRAINT PK_USER_MASTER PRIMARY KEY (USER_MASTER_ID)
)
GO

INSERT INTO ACCESS_MASTER (ACCESS_NAME, ACCESS_BIT) VALUES ('connect', 1);
INSERT INTO ACCESS_MASTER (ACCESS_NAME, ACCESS_BIT) VALUES ('read', 2);
INSERT INTO ACCESS_MASTER (ACCESS_NAME, ACCESS_BIT) VALUES ('write', 4);
INSERT INTO ACCESS_MASTER (ACCESS_NAME, ACCESS_BIT) VALUES ('alter', 8);
INSERT INTO ACCESS_MASTER (ACCESS_NAME, ACCESS_BIT) VALUES ('view definition', 16);

INSERT INTO USER_MASTER (FIRST_NAME, LAST_NAME, ACCESS_BIT_MASK) VALUES ('JOE', 'SMITH', 1);
INSERT INTO USER_MASTER (FIRST_NAME, LAST_NAME, ACCESS_BIT_MASK) VALUES ('DAN', 'ROGERS', 3);
INSERT INTO USER_MASTER (FIRST_NAME, LAST_NAME, ACCESS_BIT_MASK) VALUES ('ROBERT', 'BAITIS', 31);

So, what do those ACCESS_BIT_MASK values mean? Since Joe Smith had only the connect privilege, he has a bit mask value of 1; since Dan Rogers had both the connect and read privileges, he has a bit mask value of 3 (1 + 2); and since Robert Baitis had all the privileges, he has a bit mask value of 31 (1 + 2 + 4 + 8 + 16).

Now, in order to find out the rights of the users one does not need to join anywhere and only needs to use the bitwise AND comparison. Example:

SELECT
UM.FIRST_NAME + ' ' + COALESCE(UM.MIDDLE_NAME, '') + ' ' + UM.LAST_NAME AS USERNAME,
CASE WHEN UM.ACCESS_BIT_MASK & 1 <> 0 THEN 'YES' ELSE 'NO' END AS CONNECT_ACCESS,
CASE WHEN UM.ACCESS_BIT_MASK & 2 <> 0 THEN 'YES' ELSE 'NO' END AS READ_ACCESS,
CASE WHEN UM.ACCESS_BIT_MASK & 4 <> 0 THEN 'YES' ELSE 'NO' END AS WRITE_ACCESS,
CASE WHEN UM.ACCESS_BIT_MASK & 8 <> 0 THEN 'YES' ELSE 'NO' END AS ALTER_ACCESS,
CASE WHEN UM.ACCESS_BIT_MASK & 16 <> 0 THEN 'YES' ELSE 'NO' END AS VIEW_DEF_ACCESS
FROM dbo.USER_MASTER AS UM
ORDER BY UM.USER_MASTER_ID
GO

USERNAME             CONNECT_ACCESS READ_ACCESS WRITE_ACCESS ALTER_ACCESS VIEW_DEF_ACCESS
-------------------- -------------- ----------- ------------ ------------ ---------------
JOE  SMITH           YES            NO          NO           NO           NO
DAN  ROGERS          YES            YES         NO           NO           NO
ROBERT  BAITIS       YES            YES         YES          YES          YES

The argument made by proponents of this design is that those access rights will never change and that their bit values will never change either. Doing this lets them get rid of a potentially very large intersection table and avoid joins at read time. One would just add or subtract the ACCESS_BIT value to or from the ACCESS_BIT_MASK as access rights get added or removed.
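
For completeness, a hedged sketch of how rights typically get toggled under that design; bitwise OR and AND NOT are used instead of raw addition/subtraction so that re-granting an existing right does not corrupt the mask:

UPDATE dbo.USER_MASTER
SET ACCESS_BIT_MASK = ACCESS_BIT_MASK | 4      -- grant 'write' (ACCESS_BIT value 4)
WHERE USER_MASTER_ID = 2;

UPDATE dbo.USER_MASTER
SET ACCESS_BIT_MASK = ACCESS_BIT_MASK & ~4     -- revoke 'write'
WHERE USER_MASTER_ID = 2;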

Well, here are the disadvantages of this approach, besides the fact that it is a de-normalized design:

a) Such a design does not allow for referential integrity. To enforce it, one would need non-standard mechanisms, such as a trigger or application code doing the enforcement instead of the database. Both are highly undesirable approaches, for both performance and maintenance reasons.

b) Since there is no FK constraint, DRI (Declarative Referential Integrity, e.g., ON DELETE CASCADE) is out the door.

c) If one of the bit values ever has to change, a DDL/data change needs to be made and every stored bit mask value has to be revisited.

d) Indexes cannot be used for searches: to see which users have, say, read access, one needs to apply the bitwise AND operator in the WHERE clause, and applying an operation to the column means the index will not get used (unless you use an FBI in Oracle or a computed column in SQL Server; a minimal computed-column sketch follows this list).

e) If you have too many bit values, the scheme limits how many records the lookup table can hold and eventually forces a design change. Example: the bit values go 1, 2, 4, 8, 16, 32, 64, … so record 25 already needs a value of 16,777,216, and a 4-byte integer runs out after 31 bits. Suppose one had, say, 1,000 records: the values would be far too large, and you would end up storing the numbers in a string data-type column and splitting (or dividing) and casting/converting them to integer/bigint data types in order to do the bitwise operations, or you would need to create more columns; both are bad design choices.
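
A minimal sketch of the computed-column workaround mentioned in point (d), exposing one bit as its own indexable column (the column and index names are illustrative):

ALTER TABLE dbo.USER_MASTER
ADD HAS_READ_ACCESS AS CAST(CASE WHEN ACCESS_BIT_MASK & 2 <> 0 THEN 1 ELSE 0 END AS BIT) PERSISTED;

CREATE INDEX USER_MASTER_IND_READ ON dbo.USER_MASTER (HAS_READ_ACCESS);

-- the predicate now references the computed column directly, so the index can be used
SELECT USER_MASTER_ID, FIRST_NAME, LAST_NAME
FROM dbo.USER_MASTER
WHERE HAS_READ_ACCESS = 1;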

There is a way to handle very large bitmasks (by making use of an auxiliary table of numbers), but we will not go into those details here.

Scenario #2:

When there are a lot of boolean flag options for a given entity, we have also seen end users make use of such a scheme. Say we have a table called WHSE_MASTER, which represents a warehouse in the schema. Depending upon the configuration for a given implementation, each warehouse can have several configuration options, and each of those options is essentially just a flag, i.e., a toggle switch that indicates a Yes or a No. For the sake of simplicity, let us assume that we have SUPPORT_INBOUND_FLAG, SUPPORT_OUTBOUND_FLAG, SUPPORT_CROSS_DOCK_FLAG, etc.

If one follows a properly designed scheme, one would create a bit flag column for every option. The advantage of this is having properly named columns that actually signify something to the end user, rather than storing that meaning elsewhere in a look-up table.

In addition, if one needs to index these columns, it is pretty easy to do so. One can also index a single bitmask column that contains all these values; however, one would then need to apply functions on it in the filter condition, and as we know, applying functions to an indexed column negates the usage of the index (unless you are using FBIs, i.e., function-based indexes, in Oracle or computed columns in the case of SQL Server).

Another problem is the same issue as point (e) mentioned in Scenario #1, though having that many bit flag columns is unlikely for a single entity. One would also need to maintain a look-up table with the bit values in order to do the addition and subtraction of values.

In such a scheme, we have also seen folks decide to create a single column with a string data type and use values like 100110 (or YNNYYN) to represent the flag options. The index issue and the extra look-up table issue still remain, and there is now an added complexity: one needs to do substring operations to add and remove options (positional updates).

And another approach we have seen people take is to use the VARBINARY(N) data type to store bitmask values like “0x0E0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000”

and then do operations on the value to ascertain the different flags.  It suffers from the same sets of issues as discussed for the other approaches.

Posted in Data Model, DB2 LUW, Oracle, SQL Server | Leave a Comment »

Some Data Modeling pattern books

Posted by decipherinfosys on December 9, 2007

A colleague of mine recently asked about good data modeling pattern books available for different industry verticals. His company is building business intelligence solutions for different verticals, and hence there is a need to develop some generic baseline data models for those verticals; these books can help give an understanding of those domains. They are written by industry experts Len Silverston, David Hay and Martin Fowler:

David Hay’s books on data model patterns:

http://www.amazon.com/exec/obidos/ASIN/0932633293

http://www.amazon.com/gp/product/0120887983/ref=s9_asin_image_2?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=center-1&pf_rd_r=1R6JW5YJWZQZ826S3T53&pf_rd_t=101&pf_rd_p=278240701&pf_rd_i=507846

And here are the books by Len Silverston on the Universal Data Models (an excellent read); links to both volumes:

http://www.amazon.com/gp/product/0471380237/ref=s9_asin_image_3?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=center-1&pf_rd_r=1R6JW5YJWZQZ826S3T53&pf_rd_t=101&pf_rd_p=278240701&pf_rd_i=507846

http://www.amazon.com/Data-Model-Resource-Book-Vol/dp/0471353485/ref=pd_bxgy_b_img_b

Martin Fowler Articles: http://martinfowler.com/articles.html

Look into Analysis Patterns, and you can also look at some of his enterprise application development books, which go into generic patterns for data models.

Posted in Data Model | Leave a Comment »

OTLT (One True Lookup Table)

Posted by decipherinfosys on February 1, 2007

I have seen many organizations using a One True Lookup Table for look-ups. What OTLT means is that there is a single table for all the code and system-related look-ups in the schema, rather than n separate look-up tables. This is an OO design pushed into the database tier, and it poses many problems, including:

a) Not being able to enforce foreign keys (so your data integrity is compromised right there),

b) Not being able to use the right data types, since everything has to be a string in order to accommodate all data-type values (this itself can give rise to bad data issues, since, for example, a numeric(4,2) constraint cannot be enforced), so domain integrity is out the door,

c) Because of (b), the length of the string column has to be huge,

d) One then has to write complex SQL queries to retrieve the data and also account for the type conversions that occur when you join this table with the other tables; implicit conversion is not allowed in all RDBMS and is not a good thing anyway.

When you end up with tons of look-up data, this really becomes very cumbersome, though in smaller systems the issue gets masked. One of the reasons people tend to shy away from a bunch of look-up tables is maintenance, but if you look at it, that is really not an issue, and separate tables give you a lot of benefits, first and foremost that they preserve data integrity. If you need a screen for modifying the data in those look-up tables, you can create a view that covers them and make that view updateable using SQL code. That way, the end users have one place to go to make the configuration/lookup data changes, while in the back end you are preserving data integrity and the issues mentioned above do not arise.
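
As a minimal sketch of that approach (the two look-up tables here are hypothetical), the view gives the maintenance screen one place to read from and, via an INSTEAD OF trigger, one place to write to, while each base table keeps its own proper keys, data types and FKs:

CREATE TABLE dbo.order_status_lkup (order_status_code nvarchar(10) NOT NULL PRIMARY KEY, description nvarchar(100) NOT NULL);
CREATE TABLE dbo.ship_method_lkup (ship_method_code nvarchar(10) NOT NULL PRIMARY KEY, description nvarchar(100) NOT NULL);
GO

CREATE VIEW dbo.vw_lookup_values
AS
SELECT 'ORDER_STATUS' AS lookup_type, order_status_code AS lookup_code, description FROM dbo.order_status_lkup
UNION ALL
SELECT 'SHIP_METHOD', ship_method_code, description FROM dbo.ship_method_lkup;
GO

CREATE TRIGGER dbo.trg_vw_lookup_values_upd ON dbo.vw_lookup_values
INSTEAD OF UPDATE
AS
BEGIN
SET NOCOUNT ON;
-- route each updated row back to its own base table based on lookup_type
UPDATE ls SET description = i.description
FROM dbo.order_status_lkup AS ls
INNER JOIN inserted AS i ON i.lookup_code = ls.order_status_code
WHERE i.lookup_type = 'ORDER_STATUS';

UPDATE sm SET description = i.description
FROM dbo.ship_method_lkup AS sm
INNER JOIN inserted AS i ON i.lookup_code = sm.ship_method_code
WHERE i.lookup_type = 'SHIP_METHOD';
END;
GO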

Here is an excellent article on OTLT by database guru Joe Celko:

http://www.dbazine.com/ofinterest/oi-articles/celko22

Posted in Data Model | 2 Comments »

Surrogate Keys vs Natural Keys for Primary Key?

Posted by decipherinfosys on February 1, 2007

This topic is probably one of those on which you cannot get any two database developers/DBAs to agree.  Everyone has their own opinion about it, and it is also one of the most discussed topics on the web when it comes to data modeling.  Rather than taking any side :-), we are just listing our experiences when it comes to choosing between a surrogate key and a natural key for a table.

Surrogate Key:

Surrogate keys are keys that have no “business” meaning and are used solely to identify a record in the table.  Such keys are either database generated (for example, Identity in SQL Server, Sequence in Oracle, Sequence/Identity in DB2 UDB) or system generated (for example, values generated via a next-up table in the schema).

Natural Key: 

A key is natural if the attribute it represents is used for identification independently of the database schema.  What this basically means is that a key is natural if people use it: for example, invoice numbers, tax IDs, SSNs, etc.

Design considerations for choosing the Primary Key:

Primary Key should meet the following requirements:

  1. It should be NOT NULL, unique, and apply to all rows.
  2. It should be minimal (i.e., as few columns as possible; ideally it should be one, and if composite keys are used, make sure they are surrogates of integer-family data types).
  3. It should be stable over time (it should not change, i.e., updates to the PK columns should not happen).

Keeping these in mind, here are the pros and cons of Surrogate vs. Natural keys:

Surrogate Key

I prefer surrogate keys to be DB controlled rather than controlled via a next-up table in the schema, since that is a more scalable approach.

Pros:

  1. Business logic is not in the keys.
  2. Small 4-byte key (the surrogate key will most likely be an integer; SQL Server, for example, requires only 4 bytes to store it, or 8 bytes for a bigint).
  3. Joins are very fast.
  4. No locking contention on the unique constraint (the waits that develop when two sessions try to insert the same unique business key), since the surrogates are generated by the DB and are cached: very scalable.

Cons:

  1. An additional index is needed.  In SQL Server, the PK constraint always creates a unique index; in Oracle, if an index already exists, PK creation will use that index to enforce uniqueness (so this is not a con in Oracle).
  2. Cannot be used as a search key.
  3. If it is database controlled, then for products that support multiple databases, different implementations are needed, for example: identity in SS2k, before triggers and sequences in Oracle, identity/sequence in DB2 UDB.
  4. Always requires a join when browsing the child table(s).

Natural Key

 Pros:

  1. No additional Index.
  2. Can be used as a search key.

 Cons:

  1. If not chosen wisely (business meaning in the key(s)), then over a period of time additions to the PK may be required and modifications to the PK can occur.
  2. If using strings, joins are a bit slower than int joins and storage is higher.  Since storage is higher, fewer data values get stored per index page.  Also, reading strings is a two-step process in some RDBMS: one step to get the actual length of the string and a second to perform the read operation to get the value.
  3. Locking contention can arise if an application-driven generation mechanism is used for the key.
  4. A record can’t be entered until the key value is known, since the value has business meaning.

 Choosing Surrogate vs. Natural Keys:

There is no rule of thumb in this case. It has to be evaluated table by table:

  1. If we can identify an appropriate natural key that meets the three criteria for a PK column, we should use it.  Look-up tables and configuration tables are typically fine candidates.
  2. Data type for the PK: the smaller the better; choose an integer or a short character data type.  That also ensures that the joins will be faster.  This becomes even more important if you make the PK the clustered index, since non-clustered indexes carry the clustered key.  An RDBMS also processes integer values faster than character values, because character comparisons have to apply collation rules byte by byte, which is extra work compared to a single integer comparison.
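
Purely by way of illustration (hypothetical tables), the guidance above typically plays out as a natural key on a small, stable look-up table and a surrogate key plus a unique business key on a transactional table:

-- look-up/configuration table: a short, stable natural key works well as the PK
CREATE TABLE dbo.COUNTRY
(
COUNTRY_CODE CHAR(2) NOT NULL,        -- ISO code: stable, minimal, meaningful
COUNTRY_NAME NVARCHAR(100) NOT NULL,
CONSTRAINT PK_COUNTRY PRIMARY KEY (COUNTRY_CODE)
);

-- transactional table: surrogate PK for joins/FKs, business key still enforced as unique
CREATE TABLE dbo.INVOICE
(
INVOICE_ID INT IDENTITY(1,1) NOT NULL,
INVOICE_NBR NVARCHAR(20) NOT NULL,
COUNTRY_CODE CHAR(2) NOT NULL,
CONSTRAINT PK_INVOICE PRIMARY KEY (INVOICE_ID),
CONSTRAINT AK_INVOICE_NBR UNIQUE (INVOICE_NBR),
CONSTRAINT FK_INVOICE_COUNTRY FOREIGN KEY (COUNTRY_CODE) REFERENCES dbo.COUNTRY (COUNTRY_CODE)
);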

Posted in Data Model | 19 Comments »

Name Value Pair Design (EAV)

Posted by decipherinfosys on January 29, 2007

This is one of the design anomalies in data modeling.  In fact, it even has a name: EAV (Entity-Attribute-Value).  Database guru Joe Celko has written a lot about it in newsgroups as well as in his presentations.  The reason I am blogging about it today is that I have seen this approach time and again, done either by novice DBAs/database developers or by application programmers who try to carry their programming context into the relational world.  Here are the disadvantages of this approach:

1)  It is next to impossible to enforce business-rule constraints such as uniqueness, check constraints per attribute, defaults per attribute, or PK/FK relationships based on attributes.  Since the attributes are modeled as data elements, the design is very error-prone: we have opened up our attribute design to the users, all data integrity is destroyed (a violation of 1NF itself), and any typo becomes a new attribute.

2)  Attributes of different data types all get modeled using a string data type for the value column.  This causes performance issues in comparisons, since the index sorts become a problem; an effective index design strategy cannot be formulated, and the index size increases unnecessarily.  In addition, it requires unnecessary cast operations, for example for the date/timestamp data types or numeric data types with precision and a non-zero scale.

3)  It leads to cursor-based (and dynamic SQL) approaches for reads, since the number of attributes is virtually unknown.  In the case of complex reports or other read operations, this is a nightmare: unnecessary CASE statements, unnecessary sub-queries with self-references, and unnecessary pivoting/un-pivoting of data ensure that the reports are very slow, and as your data grows, performance gets exponentially worse.

There is one situation where the approach can be considered: when you are dealing with an entity that has thousands of attributes (I have seen requirements like these) and all of them are static, i.e., once they are defined and populated they never change.  In that scenario, in order not to exceed the row length limit (SQL Server 2000, for example, limits the row size to 8060 bytes), one can either choose to create hard 1:1 relationship tables or choose this design to accommodate such requirements.  Since the list of attributes is static and the data in them is static as well, a simple pivot view can be used to flatten it out.  But that is the only scenario where this approach should be used/considered.

If one really needs to model a system where the attributes can be generated by the end user (which is itself questionable, but can happen in certain verticals, especially those where the end user uses the software for planning and forecasting), then a better approach is to use a static list of attributes (for the different data types) and have another table for their name look-up, so that storage is done in a relational manner while, for display, the static list lets users call an attribute whatever they want; a minimal sketch of this layout follows.  This is the most reasonable compromise between data integrity, performance, and the needs of the application.  We will be writing a “How-To” article or a whitepaper on this, taking an example from the real world and demonstrating how you can solve this issue in a reasonable fashion using the above approach rather than an EAV design.
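
A minimal sketch of that compromise (all names here are hypothetical): a fixed set of typed attribute columns on the entity, plus a look-up table that maps each physical column to the label the end user chose for it:

CREATE TABLE dbo.ITEM_ATTRIBUTE
(
ITEM_ID INT NOT NULL,
ATTR_STR_01 NVARCHAR(100) NULL,    -- static, typed attribute slots
ATTR_STR_02 NVARCHAR(100) NULL,
ATTR_NUM_01 DECIMAL(12,4) NULL,
ATTR_NUM_02 DECIMAL(12,4) NULL,
ATTR_DATE_01 DATETIME NULL,
CONSTRAINT PK_ITEM_ATTRIBUTE PRIMARY KEY (ITEM_ID)
);

CREATE TABLE dbo.ITEM_ATTRIBUTE_LABEL
(
COLUMN_NAME SYSNAME NOT NULL,          -- e.g. 'ATTR_NUM_01'
DISPLAY_LABEL NVARCHAR(100) NOT NULL,  -- what the end user calls that attribute
CONSTRAINT PK_ITEM_ATTRIBUTE_LABEL PRIMARY KEY (COLUMN_NAME)
);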

Posted in Data Model | 4 Comments »