Systems Engineering and RDBMS

Archive for the ‘DB2 LUW’ Category

Another Self Join scenario

Posted by decipherinfosys on September 22, 2008

In one of our earlier posts, we had covered self joins. Yesterday, while helping a client with their data processing logic a similar situation arose which was resolved by making use of a self join. The scenario is not an uncommon one in the healthcare industry. This private practice had prescriptions for their patients stored in the database (SQL Server lingo) and had the start and end dates for those prescriptions. They are required to track this and also furnish reports to the insurance companies as well as to the auditors and one of the things that they look for is whether the prescriptions overlapped or was their a gap in the prescription for the same medication etc.. One of the reasons why they want to see such reports also is to track malpractice – an example: We were told that some doctors were caught under the malpractice insurance law when they had un-necessarily prescribed pain killers to the patient and the patient had infact taken those and sold them in order to make a profit…

So, how does all this fit into a self join? :-) Well, the information is stored in the database tables and data is power, isn’t it? Especially when used correctly. So, let’s create some dummy data first:

create table patient_info (pat_nbr int identity not null primary key, pat_first_name nvarchar(10) not null, pat_ssn nvarchar(9) not null)
create table patient_presc (pat_presc_nbr int identity not null, pat_nbr int not null, presc_id int not null, start_date_time datetime not null, end_date_time datetime not null,
constraint pk_pat_presc primary key (pat_presc_nbr),
constraint fk_to_pat_info foreign key (pat_nbr) references patient_info (pat_nbr))
create unique index patient_presc_ind_1 on patient_presc (pat_nbr, presc_id, start_date_time);

go
set nocount on
go
insert into patient_info (pat_first_name, pat_ssn) values (‘tom’, ‘123456789′);
insert into patient_info (pat_first_name, pat_ssn) values (‘harry’, ‘012345678′);
insert into patient_presc (pat_nbr, presc_id, start_date_time, end_date_time) values (1, 1, ‘03/01/2008′, ‘03/31/2008′);
insert into patient_presc (pat_nbr, presc_id, start_date_time, end_date_time) values (1, 1, ‘03/20/2008′, ‘04/15/2008′);
insert into patient_presc (pat_nbr, presc_id, start_date_time, end_date_time) values (1, 2, ‘03/22/2008′, ‘04/20/2008′);
insert into patient_presc (pat_nbr, presc_id, start_date_time, end_date_time) values (1, 1, ‘05/01/2008′, ‘05/30/2008′);
insert into patient_presc (pat_nbr, presc_id, start_date_time, end_date_time) values (1, 1, ‘07/01/2008′, ‘07/31/2008′);
insert into patient_presc (pat_nbr, presc_id, start_date_time, end_date_time) values (1, 1, ‘07/25/2008′, ‘08/25/2008′);

If you look into the data, you will see that there are two scenarios where there is an overlap of the prescriptions – for the same prescription (same medication), before the previous prescription’s date ended, a new prescription was issued. This is a cause for an audit which will then be followed up by the auditor with the doctors and nurses in the practice – there always is a reason code that is also stored in the tables as well as the doctor/nurse information but for the sake of this example, we have not shown that piece of information. We can get this information by doing a simple self join – here is the SQL:

select ‘Overlapped Prescriptions’,
pi.pat_first_name, pp1.end_date_time, min(pp2.start_date_time) as next_presc_start_date
from dbo.patient_info as pi
inner join dbo.patient_presc as pp1
on pi.pat_nbr = pp1.pat_nbr
inner join dbo.patient_presc as pp2
on pi.pat_nbr = pp2.pat_nbr
where pi.pat_ssn = ‘123456789′
and pp1.presc_id = pp2.presc_id
and pp1.start_date_time < pp2.start_date_time
and pp1.end_date_time < pp2.end_date_time
group by pi.pat_first_name, pp1.end_date_time
having datediff (dd, pp1.end_date_time, min(pp2.start_date_time)) < 1

                         pat_first_name end_date_time           next_presc_start_date
------------------------ -------------- ----------------------- -----------------------
Overlapped Prescriptions tom            2008-03-31 00:00:00.000 2008-03-20 00:00:00.000
Overlapped Prescriptions tom            2008-07-31 00:00:00.000 2008-07-25 00:00:00.000

Another thing that you would note is that there is 1 scenario where there is a gap of more than 30 days between the same type of prescription to the same patient. This could be patient negligence or could even be a valid scenario where the same disease (maybe a viral infection) occurred at different times leading to the same prescription being given again within 45 days but after a gap of 30 days. We can get this information by using the same SQL as above but changing the having criteria a bit:

select ‘Prescription Gap of more than 30 days and less than 45 days’,
pi.pat_first_name, pp1.end_date_time, min(pp2.start_date_time) as next_presc_start_date
from dbo.patient_info as pi
inner join dbo.patient_presc as pp1
on pi.pat_nbr = pp1.pat_nbr
inner join dbo.patient_presc as pp2
on pi.pat_nbr = pp2.pat_nbr
where pi.pat_ssn = ‘123456789′
and pp1.presc_id = pp2.presc_id
and pp1.start_date_time < pp2.start_date_time
and pp1.end_date_time < pp2.end_date_time
group by pi.pat_first_name, pp1.end_date_time
having datediff (dd, pp1.end_date_time, min(pp2.start_date_time)) > 30
and datediff (dd, pp1.end_date_time, min(pp2.start_date_time)) < 45

                                                            pat_first_name end_date_time           next_presc_start_date
----------------------------------------------------------- -------------- ----------------------- -----------------------
Prescription Gap of more than 30 days and less than 45 days tom            2008-05-30 00:00:00.000 2008-07-01 00:00:00.000

Posted in DB2 LUW, Oracle, SQL Server | Leave a Comment »

Back to the basics: NULL values and Aggregate functions

Posted by decipherinfosys on August 18, 2008

One common mistake for people new to SQL is remembering how the NULL values get treated when using aggregate functions.  Here is an example that helps illustrate the point:

declare @t table (col1 int)
insert into @t values (null)
insert into @t values (2)
insert into @t values (4)
select
AVG(col1) as Col_Avg,
COUNT(col1) as Col_Count,
COUNT(*) as Total_Count,
MAX(col1) as MAX_value,
MIN (col1) as MIN_value,
SUM(col1) as Total
from @t

Col_Avg     Col_Count   Total_Count MAX_value   MIN_value   Total
----------- ----------- ----------- ----------- ----------- -----------
3           2           3           4           2           6

Warning: Null value is eliminated by an aggregate or other SET operation.

As you can see from the Warning, the engine immediately tells you that the Null value was eliminated by an aggregate operation.  If you see the first column, you will see that the average is 3 and not 2 since it eliminated the row with the NULL value in it.  Likewise, when we did a COUNT(col1) vs a COUNT(*), there is a difference in the count – this again is because the NULL value has been eliminated by the aggregate function.

These are very important things to remember else you can end up writing code which will give out wrong results.  One of the ways to not fall into such issues is to address these issues right at the design time and if you have a numeric column, then define that as a not null column with a default value of 0.  That way, in the absence of the data, you can treat that as a 0 value and then the average would return the correct/intended data set – likewise for the count(col1) when the col1 is not null.

Posted in DB2 LUW, Oracle, SQL Server | Leave a Comment »

Back to the Basics: A Simple Join Issue

Posted by decipherinfosys on August 11, 2008

We have covered join basics before and one of the questions that all beginners face is that if they are joining TableA with TableB in an outer join condition BUT they need to join TableC with TableB in an inner join condition, then they lose the outer join effect since for records in TableB that do not match TableA, it will result into NULL records after the join and since NULL != NULL in SQL (not considering the ANSI_NULL setting in SQL Server for this post), then the inner join with TableC will get rid of those records. Let’s take this up with an example (using SQL Server Syntax though the same applies to Oracle of DB2 LUW as well):

create table TableA (col1 int primary key, col2 nvarchar(10));
create table TableB (col3 int primary key, col1 int, col4 nvarchar(10), constraint fk_b FOREIGN KEY (col1) references TableA(col1));
create table TableC (col5 int primary key, col3 int, col6 nvarchar(10), constraint fk_c FOREIGN KEY (col3) references TableB(col3));

set nocount on
go

insert into TableA values (1, ‘A-1 value’);
insert into TableA values (2, ‘A-2 value’);

insert into TableB values (1, 1, ‘B-1 value’);

insert into TableC values (1, 1, ‘C-1 value’);
insert into TableC values (2, 1, ‘C-2 value’);
go

Now, if we want to get all records in A but only those records in B where there is a match, the query is a simple left outer join between A and B:

select *
from dbo.TableA as A
Left Outer Join dbo.TableB as B
on A.col1 = B.col1

And the result will be:

col1        col2       col3        col1        col4
----------- ---------- ----------- ----------- ----------
1           A-1 value  1           1           B-1 value
2           A-2 value  NULL        NULL        NULL

And now, if we want to add an inner join condition between TableB and TableC to restrict getting only those records from TableB where there is a match in TableC, one would be tempted to do just this:

select *
from dbo.TableA as A
Left Outer Join dbo.TableB as B
on A.col1 = B.col1
Inner join dbo.TableC as C
on B.col3 = C.col3

But as you can see from the result:

col1        col2       col3        col1        col4       col5        col3        col6
----------- ---------- ----------- ----------- ---------- ----------- ----------- ----------
1           A-1 value  1           1           B-1 value  1           1           C-1 value
1           A-1 value  1           1           B-1 value  2           1           C-2 value

The second record from TableA is not present in the result anymore since the inner join negated that Null value. In order to get the desired result, one can make use of what are called as derived tables (SQL Server lingo) or inline views (Oracle lingo) and re-write the query as:

select *
from dbo.TableA as A
Left Outer Join (dbo.TableB as B
Inner join dbo.TableC as C
on B.col3 = C.col3)
on A.col1 = B.col1

or

select *
from dbo.TableA as A
Left Outer Join (select B.col1, B.col3, B.col4, C.col5, C.col6 from dbo.TableB as B
Inner join dbo.TableC as C
on B.col3 = C.col3) as X
on A.col1 = X.col1
and the result will be:

col1        col2       col1        col3        col4       col5        col6
----------- ---------- ----------- ----------- ---------- ----------- ----------
1           A-1 value  1           1           B-1 value  1           C-1 value
1           A-1 value  1           1           B-1 value  2           C-2 value
2           A-2 value  NULL        NULL        NULL       NULL        NULL

And in this particular case, one can even do this:

select *
from dbo.TableB as b
inner join dbo.tableC as c
on b.col3 = c.col3
right outer join dbo.tableA as a
on b.col1 = a.col1

Posted in DB2 LUW, Oracle, SQL Server | Leave a Comment »

Foreign Key Issue and resolutions

Posted by decipherinfosys on August 8, 2008

In one of the applications at the client site, the application had a design which allowed for bad data being entered into the system. The issue was fixed by fixing and putting a check in the application which writes to the database but at the database tier itself no such checks were present and as a result the data that was coming through the feeds also ended up corrupting the data in the end.

The situation was like this: There were two tables: COMPANY and COMPANY_SECURITY. Both had a surrogate key defined as an auto-incremental ID value. COMPANY table had an alternate key defined on CUSIP – in the Wall Street terms, a CUSIP is unique to every company so this was a NOT NULL UNIQUE column. The child table (COMPANY_SECURITY) had the foreign key established using the COMPANY_ID surrogate key. The alternate key of this table was SECURITY_CUSIP. A company whose stock is traded at the different exchanges like NASDAQ, NYSE etc. can have more than one security offering and the securities are unique. The COMPANY.CUSIP is 6 characters long and the COMPANY_SECURITY.SECURITY_CUSIP is 9 characters long with the first 6 digits being the same as the company cusip and the last 3 characters defining the security (2 defining it and the last being a checksum).

So, based on that understanding, say if the COMPANY.CUSIP for a company like Microsoft is 123456, then it’s securities (COMPANY_SECURITY.SECURITY_CUSIP) should be 123456xyz, 123456abc etc.. But there are no checks at the database level to make sure that the first 6 digits match between the security_cusip and the parent table’s cusip since the FK is established based on the surrogate key.

There are a couple of ways to fix this at the database tier:

1) Creation of a new object: Use a trigger to prevent the data entry into the child table in the event of bad data getting entered. This can create performance issues in the event of a highly transactional system and is a poor way of enforcing this kind of a constraint.

2) Change the table structure design: Change the Primary Key of the parent table to be on COMPANY.CUSIP i.e. this would be a scenario of making use of the natural key since CUSIP does have a business meaning – it never changes once it has been established unless there are M&A (Merger and Acquisition) activities in which case the older record is marked for deletion and a new one gets created to represent the merger/acquisition. And a lot of searches are done based on this column itself which is always of the same length. That way, the child table will also inherit the CUSIP column since we would be using it for the foreign key enforcement and then store only the three characters for the security_cusip instead of storing all 9. So, the child table will end up with CUSIP (6 characters) and SECURITY_CUSIP (3 characters) and that combination will make it unique. For the reads, there can be either of the two strategies:

a) Based on the concatenation of the values from the same table – do understand though that when operations are done on indexed columns, the index seek operations do not happen. One can use an indexed view though to do that and do the reads off of that.

b) Split the input parameter into two – one of 6 characters and the other of 3 and then by joining the parent and the child table, put the right parameter in the right where clause condition. This approach has the benefit of using the indexes since no operations are being done on the indexed columns.

3) Introduce new column: Same as #2 but keep the Primary key as such and introduce redundancy and put the CUSIP column in the child table as well and introduce the FK between that column and the parent’s CUSIP column. A FK can defined on a UNIQUE NOT NULL column as well – does not necessarily have to to be a Primary Key.

4) Add a new Check Constraint with a UDF: Keep the same schema and use a CHECK constraint to define the constraint across the two tables. Please do note that the ANSI SQL Standards do allow the user of sub-queries in check constraints but not all RDBMS follow it. What you can do though is that you can define a UDF (User Defined Function) that performs an existence check against the parent table and returns a 1 or a 0 and define the check constraint based on that. For example:

/*********************

Table Definitions

**********************/

create table company
(
company_id int identity(1,1) not null,
company_name nvarchar(20) not null,
ticker nvarchar(10) not null,
cusip nvarchar(6) not null
, constraint pk_company primary key (company_id));

create unique index company_ind_1 on dbo.company (cusip);

create table company_security
(
company_security_id int identity(1,1) not null,
company_id int not null,
security_cusip nvarchar(9) not null,
company_cusip as substring(security_cusip, 1, 6)
, constraint pk_company_security primary key (company_security_id)
, constraint fk_company foreign key (company_id) references dbo.company(company_id));

create unique index company_security_ind_1 on dbo.company_security (security_cusip);

/********************

Define the UDF

*********************/
CREATE FUNCTION dbo.udf_check_cusip
(@company_id int, @security_cusip nvarchar(9))
RETURNS bit
AS
BEGIN
IF EXISTS(SELECT 1
FROM dbo.company
WHERE company_id = @company_id and cusip = substring(@security_cusip, 1, 6))
RETURN 1
RETURN 0
END
/************************************************************

Now, add the CHECK constraint using the above UDF

*************************************************************/

alter table dbo.company_security with check add constraint ck_cusip check (dbo.udf_check_cusip (company_id, security_cusip) = 1);

Let’s test it out now:

declare @i int
insert into dbo.company (company_name, ticker, cusip) values (N’Microsoft’, N’MSFT’, N’123456′);
select @i = SCOPE_IDENTITY()
select @i
insert into dbo.company_security (company_id, security_cusip) values (@i, N’123456123′);

This will go through fine.

Now, let’s enter data that violates the check constraint:

insert into dbo.company_security (company_id, security_cusip) values (1, N’100006123′);

And we will get this error:

Msg 547, Level 16, State 0, Line 1
The INSERT statement conflicted with the CHECK constraint “ck_cusip”. The conflict occurred in database “AdventureWorks”, table “dbo.company_security”.
The statement has been terminated.

5) Change the table structure: Same as option #2 but in this case we do not change the PK. We keep the surrogate keys and in the child table, only use the three digit security_cusip. The unique index will then be the combination of the company_id and the three digits will make it unique. Kind of like having the fixed length field 6 digit CUSIP + 3 digit SECURITY_CUSIP as the Primary Key (i.e. a natural key selection like option #2 above). This would be fine for writes … when you do the reads – even for the ones which only need information from the company_security table (though such scenarios will be rare), you will need to join the company and the company_security table and present the data for security cusips by concatenating the company’s cusip and the security cusip. Same read scenarios as mentioned in #2 above apply.

So, as you can see from above, there are a lot of ways of addressing this issue and depending upon the state of the project and the impact to the application, you can chose your solution. Option #2 or 5 are the right design choices.

Note: This option was also tried by the client: Computed column. You cannot use a computed column to enforce the FK since computed columns cannot be used for defining FKs i.e. you cannot just introduce a computed column based on substring(security_cusip, 1, 6) and then try to establish a FK between it and the parent table’s CUSIP column.

Posted in DB2 LUW, Data Model, Oracle, SQL Server | Leave a Comment »

Self Joins

Posted by decipherinfosys on August 7, 2008

One of the recent questions that was asked by a reader for her particular problem needed to include a self join. So, what is a self join? A self join is a join that is done between the same table by using aliases. Such joins are useful for displaying data when there is a relationship between different columns of the data in the same table OR when you wish to retrieve data in a single row based on comparisons between different rows in the same table. They are pretty useful in certain scenarios. We will cover some of those scenarios in this post to give you an idea on where and how you can make use of that information.

A classic example that is always given whenever one talks about self joins is that of an employee table in which there is a recursive 1:N relationship between the employee and their manager. Here is an example:

set nocount on
go
DECLARE @EMP_MASTER TABLE
(
EMP_NBR INT NOT NULL PRIMARY KEY,
EMP_NAME NVARCHAR(20),
MGR_NBR INT NULL
);

INSERT INTO @EMP_MASTER VALUES (1, ‘DON’, 5);
INSERT INTO @EMP_MASTER VALUES (2, ‘HARI’, 3);
INSERT INTO @EMP_MASTER VALUES (3, ‘RAMESH’, 5);
INSERT INTO @EMP_MASTER VALUES (4, ‘JOE’, 5);
INSERT INTO @EMP_MASTER VALUES (5, ‘DENNIS’, NULL);
INSERT INTO @EMP_MASTER VALUES (6, ‘NIMISH’, 7);
INSERT INTO @EMP_MASTER VALUES (7, ‘JESSIE’, 4);
INSERT INTO @EMP_MASTER VALUES (8, ‘KEN’, 2);
INSERT INTO @EMP_MASTER VALUES (9, ‘AMBER’, 5);
INSERT INTO @EMP_MASTER VALUES (10, ‘JIM’, 5);

select empl.EMP_NAME as employee_name, mgr.EMP_NAME as manager_name
from @EMP_MASTER as empl
inner join @EMP_MASTER as mgr
on empl.MGR_NBR = mgr.EMP_NBR

employee_name        manager_name
-------------------- --------------------
DON                  DENNIS
HARI                 RAMESH
RAMESH               DENNIS
JOE                  DENNIS
NIMISH               JESSIE
JESSIE               JOE
KEN                  HARI
AMBER                DENNIS
JIM                  DENNIS

Another example is that of a product line where you might want to build a report to see which products have the same name but differ in their packaging or aisle location where they are placed in the warehouse or a shop. Example:

declare @product table (product_name nvarchar(30), aisle nvarchar(10));
insert into @product values (‘Starbucks-Coffee’, ‘A’);
insert into @product values (‘Starbucks-Coffee’, ‘B’);
insert into @product values (‘Earl Grey’, ‘A’);

select a.product_name, a.aisle as first_aisle, b.aisle as second_aisle
from @product as a
inner join @product as b
on a.product_name = b.product_name
where a.aisle < b.aisle

product_name                   first_aisle second_aisle
------------------------------ ----------- ------------
Starbucks-Coffee               A           B

The same logic can be applied in a multitude of situations – some of the situations that come to mind (besides the ones mentioned above are):

a) Schedule log – like a train station or an airline departure/arrival running log comparisons,

b) A Web server hit stats,

c) Geological data,

d) Stock values for NASDAQ, NYSE etc.

In all these, relative comparisons based on time are typically made when one does a self join.

Posted in DB2 LUW, Oracle, SQL Server | 1 Comment »

Ordering of the data

Posted by decipherinfosys on August 4, 2008

This might seem like an obvious thing to a majority of our readers but since this question has come up many times in e-mails from some of our readers as well as at client sites, we thought it needs to be mentioned in here.  At a client site, one of the end users of the system had put in a bug report that when configuring data for an application, the same data was sometimes getting repeated on different pages.  For the application in question, the query for the page was made up of a single configuration table which had close to 1000 rows and using a paging mechanism, 25 records were being displayed per page from that table.

The query that was being used for that page did not have an order by clause.  Also, data was updated as well as new data was inserted into this table as more stores were added to this retail application.  As a result of this, when paging was done and since the SQL did not have an order by clause, at times, the same data appeared across the pages.  This was a web based stateless application.  When paging was done, each page fired off the same SQL but showed different sets of the data (first 25 records on the first page, 26-50 on the next and so on).  Since there was no ORDER BY clause in the SQL, you are not guaranteed to get the records in “exactly” the same order each and every time.  The developer of the page had made an assumption that since the query is based off a single table, the data will be presented in the same order as that of the primary key index.  That is not always true.  As updates are made to the data records or as deletes happen, the internal rowids can change and one cannot rely on the data being returned in the order of the primary key index.  Also, if there is a filter condition that is used by the end user, depending upon the SQL that gets formed, the execution plan generated could be by-passing the PK index and returning all the data from a separate index itself (a covered index).

Moreover, if tomorrow, there is a need to add more tables in this SQL query, then also this assumption will fail.  So, the bottom line is that if you want the records to be returned in a particular order, always use an ORDER BY clause.  In the case of the above application since the application was using an ORM (Object Relational Mapper) to build up the SQL dynamically and was using a configuration table to specify the default order by condition (in case none was specified by the end user), fixing it was a simple data entry change by putting the primary key column in the list.

Posted in DB2 LUW, Oracle, SQL Server | 1 Comment »

Having vs Where Clause

Posted by decipherinfosys on July 29, 2008

Got another question from one of the readers asking about: “whether it is ok to substitute the WHERE clause with the HAVING clause since in the absence of the WHERE clause and a grouping condition, HAVING essentially acts as a WHERE clause onlY? What is the difference in using these two clauses?”

One thing to understand first is the basic function of the WHERE and the HAVING clauses – WHERE is used to apply filter conditions on the table columns and HAVING is used to apply the filters typically after the aggregations are done.  It is also important to understand the progression of the evaluation of the execution plan.  In the presence of both the WHERE and the HAVING clause, the WHERE condition will get evaluated prior to the HAVING condition.  Let’s take an example (using AdventureWorks sample user database in SQL Server 2005):

select Color, SUM(StandardCost) as SUM_STD_COST

from Production.Product

where Color IN (‘Blue’, ‘Black’)

group by Color

And one can also re-write this to use HAVING instead of the WHERE clause:

select Color, SUM(StandardCost) as SUM_STD_COST

from Production.Product

group by Color

having Color IN (‘Blue’, ‘Black’)

The second one evaluates the filter after the grouping and the first one applies the filter and then does the grouping and is more effective.  For this simple SQL statement, if you look at the execution plan, you will see that the optimizer is smart to change the second SQL’s execution plan to do a filter before the grouping since in the absence of the where clause,

From the first SQL:

————————————————————————-
|–Stream Aggregate(GROUP BY:([AdventureWorks].[Production].[Product].[Color]) DEFINE:([Expr1003]=SUM([AdventureWorks].[Production].[Product].[StandardCost])))
|–Sort(ORDER BY:([AdventureWorks].[Production].[Product].[Color] ASC))
|–Clustered Index Scan(OBJECT:([AdventureWorks].[Production].[Product].[PK_Product_ProductID]), WHERE:([AdventureWorks].[Production].[Product].[Color]=N’Black’ OR [AdventureWorks].[Production].[Product].[Color]=N’Blue’))

And for the second SQL (after flushing the cache):

————————————————————————————-
|–Stream Aggregate(GROUP BY:([AdventureWorks].[Production].[Product].[Color]) DEFINE:([Expr1003]=SUM([AdventureWorks].[Production].[Product].[StandardCost])))
|–Sort(ORDER BY:([AdventureWorks].[Production].[Product].[Color] ASC))
|–Clustered Index Scan(OBJECT:([AdventureWorks].[Production].[Product].[PK_Product_ProductID]), WHERE:([AdventureWorks].[Production].[Product].[Color]=N’Black’ OR [AdventureWorks].[Production].[Product].[Color]=N’Blue’))

So, the optimizer does the right thing in this case…however, in the case of complex queries, it would be an issue.  So, use the WHERE clause to filter off the records using indexed columns for good performance and use Having only for applying filters at the top of the aggregations.

Posted in DB2 LUW, Oracle, SQL Server | Leave a Comment »

More on Index design

Posted by decipherinfosys on July 14, 2008

We have blogged a couple of times over the factors that should be considered when designing indexes. In one of such posts, we had also covered indexes over multiple columns. In all of those posts, you will see one common thing that we try to emphasize – do not try to come up with a “rules of thumb“.  Here is a simple example (from one of the e-mails to a question to a reader);

Question: I am trying to see how to tune/optimize a query and I saw that an index on the where condition columns is missing.  Should I just go ahead and create an index and add all the columns that are used in the where clause into that index?

The answer is of course: NO.  First, we requested the reader to go ahead and read the post from above where we have discussed the column order in covered indexes and then also gave a simple example to help illustrate the point:

Say, you have this query:

select col1, col2, col3

from dbo.big_table

where col1 >= @x and col1 <= @y

and col3 = @z

For this, we might want to create a covered index (in this order): col3, col2, col1.  This is so that we can first of all process the data based on the equality operator on col3, then scan through and narrow down the data based on the >= and <= conditions on col1 and then get the value for col2 from the index itself rather than going through the table.  Now, as we have always stated, understanding your data and it’s characteristics is very very important.  Suppose that col3 is unique in this table!  In that case, we would just create a single index on col3 and be done with it.  Why?  It will qualify for a single record and thus we will just do the row ID lookup on the table for that one single record.

So, bottom line is that please spend some time to understand the data characteristics – understand how indexes work, understand how the query is accessing the data and the data distribution – density and selectivity of the column data values as well as how often that particular index is going to be used – if an index is being created to support a report that runs only once every 6 months, is it worth to have it in the schema all year round or should we just create it prior to generating that report?  Once you have that understanding, then only decide how you want to approach the design.

Posted in DB2 LUW, Oracle, SQL Server | Leave a Comment »

Getting the first and the last record together

Posted by decipherinfosys on June 27, 2008

While fixing a performance issue with a customer report yesterday, there was a unique requirement that came up. As the items in the warehouse move through the aisles, their datetime timestamp values get changed. In one of the sub-reports, the requirement was to display both the latest record as well as the oldest record in the system i.e. both the first and the last record based on the timestamp column. Also, another thing to add to this was that if the location for the item was not decided yet, then the those items needed to be ranked lower then those items which have a location assigned to them.

There are a couple of ways to resolve this kind of a scenario using SQL. We will present the solution in this post using T-SQL syntax though the same applies (with brief changes to Oracle and DB2 LUW as well – another thing to note also is that Oracle already has analytic functions: first_value() and last_value() that can be used with window functions to easily get these results).

Even though the actual query had a lot of table joins, for the sake of simplicity and to help explain the concept and protect the IP of the client, we will take up an example of a table variable:
set nocount on
go
declare @inventory table (item_id int not null, location_id int null, create_date_time datetime not null)
insert into @inventory values (1, null, cast(‘06/27/2007 12:00:00′ as datetime))
insert into @inventory values (2, 10, cast(‘06/27/2007 12:15:00′ as datetime))
insert into @inventory values (3, 20, cast(‘06/26/2007′ as datetime))
insert into @inventory values (4, 30, cast(‘06/21/2007′ as datetime))
insert into @inventory values (5, null, cast(‘06/27/2007 1:00:00′ as datetime))
insert into @inventory values (6, 50, cast(‘06/23/2007′ as datetime))
insert into @inventory values (7, 70, cast(‘06/24/2007′ as datetime))
insert into @inventory values (8, 80, cast(‘06/25/2007′ as datetime))
insert into @inventory values (9, 40, cast(‘06/27/2007 3:00:00′ as datetime))
insert into @inventory values (10, null, cast(‘06/27/2007 12:30:00′ as datetime))

select ‘***ALL Records***’, * from @inventory

                  item_id     location_id create_date_time
----------------- ----------- ----------- -----------------------
***ALL Records*** 1           NULL        2007-06-27 12:00:00.000
***ALL Records*** 2           10          2007-06-27 12:15:00.000
***ALL Records*** 3           20          2007-06-26 00:00:00.000
***ALL Records*** 4           30          2007-06-21 00:00:00.000
***ALL Records*** 5           NULL        2007-06-27 01:00:00.000
***ALL Records*** 6           50          2007-06-23 00:00:00.000
***ALL Records*** 7           70          2007-06-24 00:00:00.000
***ALL Records*** 8           80          2007-06-25 00:00:00.000
***ALL Records*** 9           40          2007-06-27 03:00:00.000
***ALL Records*** 10          NULL        2007-06-27 12:30:00.000

select top 1 ‘***LAST RECORD***’ as Last_Record, * from @inventory as I order by
case when location_id is null then 1 else 0 end ASC
,create_date_time desc

Last_Record       item_id     location_id create_date_time
----------------- ----------- ----------- -----------------------
***LAST RECORD*** 2           10          2007-06-27 12:15:00.000

select top 1 ‘***FIRST RECORD***’ as First_Record, * from @inventory as I order by
case when location_id is null then 1 else 0 end ASC
,create_date_time asc

First_Record       item_id     location_id create_date_time
------------------ ----------- ----------- -----------------------
***FIRST RECORD*** 4           30          2007-06-21 00:00:00.000

If you look at the output from above, you can see that the last record that is picked up by the logic is Item #2 even though Item #10 has the latest create_date_time timestamp value. The reason for this is that we had the requirement that if an item has not been assigned a location yet, that should rank lower. So, the above 2 SQL statements are just to show how we can get the first and the last records in a SQL statement. Now, in the report, we needed to get these two together and as you know, if one uses the UNION or UNION ALL clause between two statements, one can use only one ORDER BY clause which applies to the whole set. So, it is not as straight forward as just doing a UNION/UNION ALL operation between the two sets and applying an ORDER BY.

One of the solutions is to use derived tables:

select LR.* from
(select top 1 ‘***LAST RECORD***’ as Last_Record, * from @inventory as I order by
case when location_id is null then 1 else 0 end ASC
,create_date_time desc) as LR
UNION ALL
select FR.* from
(select top 1 ‘***FIRST RECORD***’ as First_Record, * from @inventory as I order by
case when location_id is null then 1 else 0 end ASC
,create_date_time asc) as FR

Last_Record        item_id     location_id create_date_time
------------------ ----------- ----------- -----------------------
***LAST RECORD***  2           10          2007-06-27 12:15:00.000
***FIRST RECORD*** 4           30          2007-06-21 00:00:00.000

Another solution is to use CTE (Common Table Expressions):

with results as
(select top 1 ‘***LAST RECORD***’ as Last_Record, * from @inventory as I order by
case when location_id is null then 1 else 0 end ASC
,create_date_time desc
UNION ALL
select top 1 ‘***FIRST RECORD***’ as First_Record, * from @inventory as I order by
case when location_id is null then 1 else 0 end ASC
,create_date_time asc
)
select * from results;

Last_Record        item_id     location_id create_date_time
------------------ ----------- ----------- -----------------------
***LAST RECORD***  2           10          2007-06-27 12:15:00.000
***FIRST RECORD*** 4           30          2007-06-21 00:00:00.000

Yet another solution can be the usage of the row_number() analytic function, for example:

select item_id, location_id, create_date_time
from
(
select
*
, row_number() over (order by case when location_id is null then 1 else 0 end ASC
,create_date_time desc) as RN_1
, row_number() over (order by case when location_id is null then 1 else 0 end ASC
,create_date_time asc) as RN_2
from @inventory) as IV
where RN_1 = 1 or RN_2 = 1

item_id     location_id create_date_time
----------- ----------- -----------------------
4           30          2007-06-21 00:00:00.000
2           10          2007-06-27 12:15:00.000

There are other solutions as well to this problem. One needs to test these out in one’s scenario and make sure that you have meaningful filter criterias and good execution plans. The next challenge in this report was that the project manager wanted these records listed side by side on the same record. That is pretty simple to do as well – using either pivoting or even simple join between derived tables with a match on the row_number() analytic function value. We leave that as an exercise to the reader and if you have questions, we will be more than happy to post the answer to that as well.

Posted in DB2 LUW, Oracle, SQL Server | 1 Comment »

Common Problem – getting the first in a Group

Posted by decipherinfosys on June 18, 2008

This is one of the very common problems in the SQL world and also one that has many different solutions available. The problem description is this:

You have a table with say this structure (using SQL Server Syntax):

set nocount on
go
declare @test table (store_nbr int, product_name nvarchar(20), product_date datetime, aisle_nbr int)
insert into @test values (1, ‘A’, ‘8-12-2004′, 6)
insert into @test values (1, ‘B’, ‘7-10-2005′, 2)
insert into @test values (1, ‘C’, ‘6-11-2006′, 3)
insert into @test values (1, ‘D’, ‘11-10-2007′, 4)
insert into @test values (2, ‘E’, ‘10-12-2004′, 2)
insert into @test values (2, ‘F’, ‘5-9-2005′, 4)

And what you want to find out is that record in each store which was introduced the last so essentially in each group (i.e. each store), you want to find out the product that was introduced most recently. There are a lot of ways to solve this problem and the solutions vary among RDBMS based on syntax only. One can use analytic functions to resolve this, use a derived table or use a co-related sub-query. Here is one way to solve this issue:

select * from @test A
where A.product_date = (select top 1 B.product_date
from @test as B
where A.store_nbr = B.store_nbr
order by product_date desc)

store_nbr   product_name         product_date            aisle_nbr
----------- -------------------- ----------------------- -----------
1           D                    2007-11-10 00:00:00.000 4
2           F                    2005-05-09 00:00:00.000 4

In this code, we are sorting based on the product date in a descending order in the co-related sub-query and getting the equality match based on the TOP 1 record. We could have also used the ROW_NUMBER() function as well, example:

select store_nbr, product_name, product_date, aisle_nbr
from
(select ROW_NUMBER() over (partition by store_nbr order by product_date desc) as rn, *
from @test A
) as IV
where IV.rn = 1

One can also choose to do a MAX operation in the co-related sub-query, example:

select * from @test as A
where A.product_date = (select MAX(B.product_date)
from @test as B
where A.store_nbr = B.store_nbr)

And in SQL Server, one can also use the CROSS APPLY function in SQL Server 2005.  We will look at the execution plans and the performance characteristics of these different options in an upcoming post.

Posted in DB2 LUW, Oracle, SQL Server | 1 Comment »