Systems Engineering and RDBMS

Archive for December 22nd, 2007

Some common mis-conceptions about DISTINCT

Posted by decipherinfosys on December 22, 2007

We have seen this so many times at client sites that we decided to write up about it. One common mistake all the time is when people try to treat DISTINCT clause as a function. They try something like:

SELECT DISTINCT (COL1), COL2
FROM DISTINCT_TEST;

and expect that they will get only distinct values from COL1 and all the values from COL2. Let’s take a step back and think about it for a second. How is that even possible? The clause like DISTINCT operate on an entire row – the entire record and not only on a column. So, in the above query, even though it will not give you an error, it will give you all the results. Let’s follow that up with an example before we delve into this further:

CREATE TABLE DISTINCT_TEST (COL1 NVARCHAR(10), COL2 INT);
INSERT INTO DISTINCT_TEST VALUES (N’ABC’, 10);
INSERT INTO DISTINCT_TEST VALUES (N’ABC’, 20);
INSERT INTO DISTINCT_TEST VALUES (N’XYZ’, 30);
INSERT INTO DISTINCT_TEST VALUES (N’XYZ’, 40);
INSERT INTO DISTINCT_TEST VALUES (N’ABC’, 50);
GO

SELECT DISTINCT (COL1), COL2
FROM DISTINCT_TEST;

COL1 COL2
———- ———–
ABC 10
ABC 20
ABC 50
XYZ 30
XYZ 40

So, as you can see from above, DISTINCT is NOT a function, it is a clause that applies to the entire record. So:

SELECT DISTINCT (COL1), COL2
FROM DISTINCT_TEST;

SELECT DISTINCT (COL1), (COL2)
FROM DISTINCT_TEST;

SELECT COL1, COL2
FROM DISTINCT_TEST;

are all one and the same thing. We have even seen people try:

SELECT DISTINCT(COL1), MAX(COL2)
FROM DISTINCT_TEST;

thinking that they would be getting only one UNIQUE value from COL1 and for that value, get the MAX value from COL2. Since DISTINCT is not a function, the above SQL will give an error. You need to re-write the above SQL with a GROUP by CLAUSE and GROUP BY on COL1:

SELECT COL1, MAX(COL2) AS MAX_VAL
FROM DISTINCT_TEST
GROUP BY COL1;

COL1 MAX_VAL
———- ———–
ABC 50
XYZ 40

Another area of confusion that we have seen is when people try to order things by a column that is not included in the “SELECT DISTINCT …” list. Example:

SELECT DISTINCT COL1
FROM DISTINCT_TEST
ORDER BY COL2;

You will get the error:

Msg 145, Level 15, State 1, Line 1
ORDER BY items must appear in the select list if SELECT DISTINCT is specified.

The above SQL seems to be perfectly fine since even if we order by COL2, we should be getting 2 records: ABC and XYZ as the output. So, why does SQL Server force us to include the ordered columns in the SELECT list as well? If you think about the data for a second, the second column has all distinct values. Since we are ordering by that column and trying to get a distinct value from COL1 only, which one record would the optimizer pick for ABC? The one with a value of 10, 20 or 50 from COL2? What if there was say another column added to this table:

ALTER TABLE DISTINCT_TEST ADD COL3 NVARCHAR(30);

UPDATE DISTINCT_TEST SET COL3 = ‘TEST DATA 1′ WHERE COL1 = ‘ABC’ AND COL2 = 10;
UPDATE DISTINCT_TEST SET COL3 = ‘TEST DATA 2′ WHERE COL1 = ‘ABC’ AND COL2 = 20;
UPDATE DISTINCT_TEST SET COL3 = ‘TEST DATA 2′ WHERE COL1 = ‘ABC’ AND COL2 = 50;
UPDATE DISTINCT_TEST SET COL3 = ‘TEST DATA 3′ WHERE COL1 = ‘XYZ’ AND COL2 = 30;

So, the data now looks like:

COL1 COL2 COL3
———- ———– ——————————
ABC 10 TEST DATA 1
ABC 20 TEST DATA 2
XYZ 30 TEST DATA 3
XYZ 40 NULL
ABC 50 TEST DATA 2

And the query could be:

SELECT DISTINCT COL1, COL3
FROM DISTINCT_TEST
ORDER BY COL2;

In this case, which order the optimizer would have to pick to give a DISTINCT set of COL1 and COL2? There is no way for it to ascertain that and provide the record with COL2 = 20 first instead of the one that has COL2 = 50. So, the bottom-line is that the column that we want to order by should also be in the selection list so that we can determine the DISTINCT set of values. So, we will change the above SQL to be:

SELECT DISTINCT COL1, COL3
FROM DISTINCT_TEST
ORDER BY COL1;

COL1 COL3
———- ——————————
ABC TEST DATA 1
ABC TEST DATA 2
XYZ NULL
XYZ TEST DATA 3

Posted in DB2 LUW, Oracle, SQL Server | 3 Comments »

 
Follow

Get every new post delivered to your Inbox.

Join 80 other followers