Systems Engineering and RDBMS

Using DISTINCT … just because

Posted by decipherinfosys on November 21, 2008

We have blogged about the DISTINCT clause in the past (common mis-conceptions about the DISTINCT clause) and what it does and how it gets mis-used at times. This is one of the common issues that we have seen in some of the queries written by junior folks or even at times by senior folks not well versed with SQL.  They add the DISTINCT clause to the Select list just to make sure that “in case” there are duplicates, it will all be taken care of by this wonderful clause.  Most of the time, the reason why they get duplicates is because of join conditions or because they did not write the query in different ways like for example: If you have a 3 table join and say tableA : tableB :: 1:N (1 to many relationship) and tableB:tableC::1:N (again a 1 to many relationship) and the 3 joins are together but one is only selecting the data from tableA and tableB and that set of columns provides unique set of data, however the developer also adds tableC in the join because there is a filter condition (where clause) on it.  That would result into duplicates when there are many records being returned from tableC for a record in tableB because of the 1:N relationship.  In that case, the developer ends up adding a DISTINCT clause in the select list to get the distinct set of data.  The developer instead could have chosen to just use an EXISTS clause like this:

select …<columns from tableA and tableB>…

from tableA

inner join table B

on tableA.col1 = tableB.col2

where …<criteria on tableA and tableB – if being used based on the business logic>…

and exists (select 1 from tableC where col3 = tableB.col4)

and this will negate the usage of the DISTINCT clause.  There are many other such scenarios as well.  The problem with using DISTINCT un-necessarily is that the code will then incur sorting costs un-necessarily since the DISTINCT clause requires that the resultset is sorted and the duplicates are eliminated from the result set.

We have seen even bigger issues when people start using distinct aggregates even when they are not needed or when there are better ways to write the query…by distinct aggregates, we mean, clauses like:

count(distinct invoice_nbr) or sum(distinct sales_qty) etc.

The problem exacerbates when there are a lot of mixing of such distinct aggregates with non-distinct aggregates, example:

select
count(distinct invoice_nbr),
sum(distinct sales_qty),
count(ship_date),
Avg(sales_price),

…etc.

Mixing of these distinct aggregates and non distinct aggregates in the same SQL statement’s select list leads to spooling and involves several re-reads of the intermediate results which is very expensive – even more expensive than computing those separately in separate queries.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
%d bloggers like this: