Systems Engineering and RDBMS

A Grouping and Ordinal Ranking problem

Posted by decipherinfosys on April 15, 2009

Got a question from a reader yesterday:

Hi! I have an employee labor activity table and the table structure and some sample data for my problem is like this:

CREATE TABLE dbo.EMP_ACTIVITY
(
EMP_ACTIVITY_ID        INT                NOT NULL    IDENTITY(1,1)
,ACTV_DATE                DATETIME        NOT NULL
,ACTV_TYPE                NVARCHAR(10)    NOT NULL
,EMP_MASTER_ID            INT                NOT NULL
)
GO
SET NOCOUNT ON
GO

INSERT INTO dbo.EMP_ACTIVITY (ACTV_DATE, ACTV_TYPE, EMP_MASTER_ID) VALUES (’04/13/2009 10:00:00′, ‘JOB1′, 10);
INSERT INTO dbo.EMP_ACTIVITY (ACTV_DATE, ACTV_TYPE, EMP_MASTER_ID) VALUES (’04/13/2009 10:30:00′, ‘JOB1′, 10);
INSERT INTO dbo.EMP_ACTIVITY (ACTV_DATE, ACTV_TYPE, EMP_MASTER_ID) VALUES (’04/13/2009 11:30:00′, ‘JOB1′, 10);
INSERT INTO dbo.EMP_ACTIVITY (ACTV_DATE, ACTV_TYPE, EMP_MASTER_ID) VALUES (’04/13/2009 11:35:00′, ‘JOB2′, 10);
INSERT INTO dbo.EMP_ACTIVITY (ACTV_DATE, ACTV_TYPE, EMP_MASTER_ID) VALUES (’04/13/2009 12:30:00′, ‘JOB2′, 10);
INSERT INTO dbo.EMP_ACTIVITY (ACTV_DATE, ACTV_TYPE, EMP_MASTER_ID) VALUES (’04/13/2009 14:30:00′, ‘JOB3′, 10);
INSERT INTO dbo.EMP_ACTIVITY (ACTV_DATE, ACTV_TYPE, EMP_MASTER_ID) VALUES (’04/13/2009 15:30:00′, ‘JOB3′, 10);
INSERT INTO dbo.EMP_ACTIVITY (ACTV_DATE, ACTV_TYPE, EMP_MASTER_ID) VALUES (’04/13/2009 16:30:00′, ‘JOB1′, 10);
INSERT INTO dbo.EMP_ACTIVITY (ACTV_DATE, ACTV_TYPE, EMP_MASTER_ID) VALUES (’04/13/2009 10:00:00′, ‘JOB1′, 20);
INSERT INTO dbo.EMP_ACTIVITY (ACTV_DATE, ACTV_TYPE, EMP_MASTER_ID) VALUES (’04/13/2009 10:30:00′, ‘JOB1′, 20);
INSERT INTO dbo.EMP_ACTIVITY (ACTV_DATE, ACTV_TYPE, EMP_MASTER_ID) VALUES (’04/13/2009 11:30:00′, ‘JOB2′, 20);
INSERT INTO dbo.EMP_ACTIVITY (ACTV_DATE, ACTV_TYPE, EMP_MASTER_ID) VALUES (’04/13/2009 11:35:00′, ‘JOB3′, 20);
INSERT INTO dbo.EMP_ACTIVITY (ACTV_DATE, ACTV_TYPE, EMP_MASTER_ID) VALUES (’04/13/2009 12:30:00′, ‘JOB2′, 30);
INSERT INTO dbo.EMP_ACTIVITY (ACTV_DATE, ACTV_TYPE, EMP_MASTER_ID) VALUES (’04/13/2009 14:30:00′, ‘JOB3′, 30);
INSERT INTO dbo.EMP_ACTIVITY (ACTV_DATE, ACTV_TYPE, EMP_MASTER_ID) VALUES (’04/13/2009 15:30:00′, ‘JOB3′, 30);
INSERT INTO dbo.EMP_ACTIVITY (ACTV_DATE, ACTV_TYPE, EMP_MASTER_ID) VALUES (’04/13/2009 16:30:00′, ‘JOB1′, 30);

What I need to do in a SQL statement is that I need to sequence these records such that the sequence number should change on a change in the Activity Type and it should re-set to 1 again on a change in the employee ID value…the activity type could be the same as one of the previous activities done by the user.  So, using the data set from above as an example, what I need the output from above to be is like this:

actv_date actv_type emp_master_id Final_Activity_Seq
2009-04-13 10:00:00.000 JOB1 10 1
2009-04-13 10:30:00.000 JOB1 10 1
2009-04-13 11:30:00.000 JOB1 10 1
2009-04-13 11:35:00.000 JOB2 10 2
2009-04-13 12:30:00.000 JOB2 10 2
2009-04-13 14:30:00.000 JOB3 10 3
2009-04-13 15:30:00.000 JOB3 10 3
2009-04-13 16:30:00.000 JOB1 10 4
2009-04-13 10:00:00.000 JOB1 20 1
2009-04-13 10:30:00.000 JOB1 20 1
2009-04-13 11:30:00.000 JOB2 20 2
2009-04-13 11:35:00.000 JOB3 20 3
2009-04-13 12:30:00.000 JOB2 30 1
2009-04-13 14:30:00.000 JOB3 30 2
2009-04-13 15:30:00.000 JOB3 30 2
2009-04-13 16:30:00.000 JOB1 30 3

How can I go about it?

There are a couple of ways to do that … the key thing to understand is that first we need to mark each record at which the job changes and we need to ensure that it is done in the right order since a user can switch back to an activity called JOB1 sometime during the later part of the day.  So, if somehow we can mark the record at which the change occurs, that is how we can proceed then and provide a rank.  So, keeping that in mind, let’s use a CTE in SQL Server and simulate a lag() function since SQL Server does not have an inbuilt lag() function like Oracle does.  Later on, we will show how it can be done in Oracle.

with actv_change as (
select
EMP_ACTIVITY_ID
, ACTV_DATE
, ACTV_TYPE
, EMP_MASTER_ID
, ROW_NUMBER() over (partition by EMP_MASTER_ID ORDER by ACTV_DATE) as RN
from dbo.EMP_ACTIVITY
)
select
AC1.EMP_ACTIVITY_ID
, AC1.ACTV_DATE
, AC1.ACTV_TYPE
, AC1.EMP_MASTER_ID
, (CASE COALESCE(AC1.ACTV_TYPE, AC2.ACTV_TYPE)
WHEN AC2.ACTV_TYPE THEN 0
ELSE 1
END) as SEQ
from actv_change as AC1
left outer join actv_change as AC2
on AC1.EMP_MASTER_ID = AC2.EMP_MASTER_ID
AND AC1.RN = AC2.RN + 1

The output is like this:

EMP_ACTIVITY_ID    ACTV_DATE    ACTV_TYPE    EMP_MASTER_ID    SEQ
17    2009-04-13 10:00:00.000    JOB1    10    1
18    2009-04-13 10:30:00.000    JOB1    10    0
19    2009-04-13 11:30:00.000    JOB1    10    0
20    2009-04-13 11:35:00.000    JOB2    10    1
21    2009-04-13 12:30:00.000    JOB2    10    0
22    2009-04-13 14:30:00.000    JOB3    10    1
23    2009-04-13 15:30:00.000    JOB3    10    0
24    2009-04-13 16:30:00.000    JOB1    10    1
25    2009-04-13 10:00:00.000    JOB1    20    1
26    2009-04-13 10:30:00.000    JOB1    20    0
27    2009-04-13 11:30:00.000    JOB2    20    1
28    2009-04-13 11:35:00.000    JOB3    20    1
29    2009-04-13 12:30:00.000    JOB2    30    1
30    2009-04-13 14:30:00.000    JOB3    30    1
31    2009-04-13 15:30:00.000    JOB3    30    0
32    2009-04-13 16:30:00.000    JOB1    30    1

The value of 1 in the Sequence column represents when the change occurs.   We first got the row numbers by partitioning the data based on the employee ID value and ordering the data based on the activity date.  And then we used a left join with the CTE data set to lag the record and compute the 0 or 1 value for the Sequence.  So, now if we take a running total over the sequence column, we will be able to get the record set that the reader has asked.

It would have been nice if SQL Server allowed the usage of aggregate functions in the OVER clause and use the ORDER BY along with it.  Because if it did, we could have simply done:  SUM(SEQ) over (partition by emp_master_id order by actv_date) to get our running total.  However, this can easily be done bydoing this:

declare @temp table (actv_date datetime, actv_type nvarchar(10), emp_master_id int, seq int);

with actv_change as (
select
EMP_ACTIVITY_ID
, ACTV_DATE
, ACTV_TYPE
, EMP_MASTER_ID
, ROW_NUMBER() over (partition by EMP_MASTER_ID ORDER by ACTV_DATE) as RN
from dbo.EMP_ACTIVITY
)
insert into @temp
select
AC1.ACTV_DATE
, AC1.ACTV_TYPE
, AC1.EMP_MASTER_ID
, (CASE COALESCE(AC1.ACTV_TYPE, AC2.ACTV_TYPE)
WHEN AC2.ACTV_TYPE THEN 0
ELSE 1
END) as SEQ
from actv_change as AC1
left outer join actv_change as AC2
on AC1.EMP_MASTER_ID = AC2.EMP_MASTER_ID
AND AC1.RN = AC2.RN + 1

select
t1.actv_date
,t1.actv_type
,t1.emp_master_id
,t1.seq
,(select sum(t2.seq) from @temp as t2 where t2.emp_master_id = t1.emp_master_id and t2.actv_date <= t1.actv_date) as Final_Activity_SEQ
from @temp as t1

So, all we have done here is to get the data into a table variable and then get a running total using a sub-query.  The output is shown below:

actv_date actv_type emp_master_id seq Final_Activity_Seq
2009-04-13 10:00:00.000 JOB1 10 1 1
2009-04-13 10:30:00.000 JOB1 10 0 1
2009-04-13 11:30:00.000 JOB1 10 0 1
2009-04-13 11:35:00.000 JOB2 10 1 2
2009-04-13 12:30:00.000 JOB2 10 0 2
2009-04-13 14:30:00.000 JOB3 10 1 3
2009-04-13 15:30:00.000 JOB3 10 0 3
2009-04-13 16:30:00.000 JOB1 10 1 4
2009-04-13 10:00:00.000 JOB1 20 1 1
2009-04-13 10:30:00.000 JOB1 20 0 1
2009-04-13 11:30:00.000 JOB2 20 1 2
2009-04-13 11:35:00.000 JOB3 20 1 3
2009-04-13 12:30:00.000 JOB2 30 1 1
2009-04-13 14:30:00.000 JOB3 30 1 2
2009-04-13 15:30:00.000 JOB3 30 0 2
2009-04-13 16:30:00.000 JOB1 30 1 3

In the case of Oracle, this query simply becomes:

with actv_change as (
select
EMP_ACTIVITY_ID
, ACTV_DATE
, ACTV_TYPE
, EMP_MASTER_ID
, case when lag(ACTV_TYPE,1, ACTV_TYPE) over (partition by EMP_MASTER_ID order by ACTV_DATE) = ACTV_TYPE
then 0
else 1
end as RN
from EMP_ACTIVITY
)
select
AC1.ACTV_DATE
, AC1.ACTV_TYPE
, AC1.EMP_MASTER_ID
, sum(RN) over(partition by EMP_MASTER_ID order by ACTV_DATE) + 1 as  SEQ
from actv_change AC1

So, the solution is essentially the same – the only difference is that in the case of Oracle, we do not need to use the table variable and then get the running total…we can use the analytic function to get that output.

Resources:

  • CTE – BOL entry here.
  • Row_Number() analytic function – here.
  • Running total posts in SQL Server – there are a lot of posts on this topic – here is the google output.  The key thing to see is the performance differences between the difference approaches – here and here.
About these ads

2 Responses to “A Grouping and Ordinal Ranking problem”

  1. […] of Analytics…the question was from the same reader who had asked us the question on the employee labor activity tracking.  This time, the question was around the grouping and ranking of the receiving time and the […]

  2. […] before in order to resolve grouping and ranking problems in couple of previous blog posts – here and here. Today, we are going to cover another analytic function ratio_to_report to calculate the […]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
Follow

Get every new post delivered to your Inbox.

Join 85 other followers

%d bloggers like this: