SQL: Finding Differences Between Rows in the Same Group or Partition

SQL Practical Cases: learning by example

Problem

You want to return the DEPTNO, ENAME, and SAL of each employee along with the difference in SAL between employees in the same department (i.e., having the same value for DEPTNO). The difference should be between each current employee and the employee hired immediately afterwards (you want to see if there is a correlation between seniority and salary on a “per department” basis). For each employee hired last in his department, return “N/A” for the difference. The result set should look like this:

	DEPTNO ENAME             SAL HIREDATE    DIFF
	------ ---------- ---------- ----------- ----------
	    10 CLARK            2450 09-JUN-1981      -2550
	    10 KING             5000 17-NOV-1981       3700
	    10 MILLER           1300 23-JAN-1982        N/A
	    20 SMITH             800 17-DEC-1980      -2175
	    20 JONES            2975 02-APR-1981        -25
	    20 FORD             3000 03-DEC-1981          0
	    20 SCOTT            3000 09-DEC-1982       1900
	    20 ADAMS            1100 12-JAN-1983        N/A
	    30 ALLEN            1600 20-FEB-1981        350
	    30 WARD             1250 22-FEB-1981      -1600
	    30 BLAKE            2850 01-MAY-1981       1350
	    30 TURNER           1500 08-SEP-1981        250
	    30 MARTIN           1250 28-SEP-1981        300
	    30 JAMES             950 03-DEC-1981        N/A

Solution

The is another example of where the window functions LEAD OVER and LAG OVER come in handy. You can easily access next and prior rows without additional joins. Alternative methods, such as subqueries or self-joins are possible but awkward.

What databases is applicable to? All Implementations

Use the window function LEAD OVER to access the “next” employee’s salary relative to the current row:

	1  with next_sal_tab (deptno,ename,sal,hiredate,next_sal)
	2  AS
	3  (select deptno, ename, sal, hiredate,
	4        lead(sal)over(partition by deptno
	5                          order by hiredate) as next_sal
	6   from emp )
	7
	8     select deptno, ename, sal, hiredate
	9  ,    coalesce(cast(sal-next_sal as char), 'N/A') as diff
	10    from next_sal_tab

 

In this case, for the sake of variety, we have used a common table expression rather than a subquery - both will work across most RDBMS’s these days, with the preference usually relating to readability.

Discussion

The first step is to use the LEAD OVER window function to find the “next” salary for each employee within her department. The employees hired last in each department will have a NULL value for NEXT_SAL:

	select deptno,ename,sal,hiredate,
	       lead(sal)over(partition by deptno order by hiredate) as next_sal
	  from emp

	DEPTNO ENAME             SAL HIREDATE      NEXT_SAL
	------ ---------- ---------- ----------- ----------
	    10 CLARK            2450 09-JUN-1981       5000
	    10 KING             5000 17-NOV-1981       1300
	    10 MILLER           1300 23-JAN-1982
	    20 SMITH             800 17-DEC-1980       2975
	    20 JONES            2975 02-APR-1981       3000
	    20 FORD             3000 03-DEC-1981       3000
	    20 SCOTT            3000 09-DEC-1982       1100
	    20 ADAMS            1100 12-JAN-1983
	    30 ALLEN            1600 20-FEB-1981       1250
	    30 WARD             1250 22-FEB-1981       2850
	    30 BLAKE            2850 01-MAY-1981       1500
	    30 TURNER           1500 08-SEP-1981       1250
	    30 MARTIN           1250 28-SEP-1981        950
	    30 JAMES             950 03-DEC-1981

 The next step is to take the difference between each employee’s salary and the salary of the employee hired immediately after her in the same department:

	select deptno,ename,sal,hiredate, sal-next_sal diff
	  from (
	select deptno,ename,sal,hiredate,
	       lead(sal)over(partition by deptno order by hiredate) next_sal
	  from emp
	       )

	DEPTNO ENAME             SAL HIREDATE          DIFF
	------ ---------- ---------- ----------- ----------
	    10 CLARK            2450 09-JUN-1981      -2550
	    10 KING             5000 17-NOV-1981       3700
	    10 MILLER           1300 23-JAN-1982
	    20 SMITH             800 17-DEC-1980      -2175
	    20 JONES            2975 02-APR-1981        -25
	    20 FORD             3000 03-DEC-1981          0
	    20 SCOTT            3000 09-DEC-1982       1900
	    20 ADAMS            1100 12-JAN-1983
	    30 ALLEN            1600 20-FEB-1981        350
	    30 WARD             1250 22-FEB-1981      -1600
	    30 BLAKE            2850 01-MAY-1981       1350
	    30 TURNER           1500 08-SEP-1981        250
	    30 MARTIN           1250 28-SEP-1981        300
	    30 JAMES             950 03-DEC-1981

The next step is to use the coalesce function to insert “N/A” when there is no next salary. To be able to return “N/A” you must cast the value of DIFF to a string:

	select deptno,ename,sal,hiredate,
	       nvl(to_char(sal-next_sal),'N/A') diff
	  from (
	select deptno,ename,sal,hiredate,
	       lead(sal)over(partition by deptno order by hiredate) next_sal
	  from emp
	       )

	DEPTNO ENAME             SAL HIREDATE    DIFF
	------ ---------- ---------- ----------- ---------------
	    10 CLARK            2450 09-JUN-1981 -2550
	    10 KING             5000 17-NOV-1981 3700
	    10 MILLER           1300 23-JAN-1982 N/A
	    20 SMITH             800 17-DEC-1980 -2175
	    20 JONES            2975 02-APR-1981 -25
	    20 FORD             3000 03-DEC-1981 0
	    20 SCOTT            3000 09-DEC-1982 1900
	    20 ADAMS            1100 12-JAN-1983 N/A
	    30 ALLEN            1600 20-FEB-1981 350
	    30 WARD             1250 22-FEB-1981 -1600
	    30 BLAKE            2850 01-MAY-1981 1350
	    30 TURNER           1500 08-SEP-1981 250
	    30 MARTIN           1250 28-SEP-1981 300
	    30 JAMES             950 03-DEC-1981 N/A

While the majority of the solutions provided in this book do not deal with “what if” scenarios (for the sake of readability and the author’s sanity), the scenario involving duplicates when using the LEAD OVER function in this manner must be discussed. In the simple sample data in table EMP, no employees have duplicate HIREDATEs, yet this is a very likely situation. Normally, I would not discuss a “what if” situation such as duplicates (since there aren’t any in table EMP), but the workaround involving LEAD may not be immediately obvious. Consider the following query, which returns the difference in SAL between the employees in DEPTNO 10 (the difference is performed in the order in which they were hired):

	select deptno,ename,sal,hiredate,
	       lpad(nvl(to_char(sal-next_sal),'N/A'),10) diff
	  from (
	select deptno,ename,sal,hiredate,
	       lead(sal)over(partition by deptno
	                         order by hiredate) next_sal
	  from emp
	 where deptno=10 and empno > 10
	       )

	DEPTNO ENAME    SAL HIREDATE    DIFF
	------ ------ ----- ----------- ----------
	    10 CLARK   2450 09-JUN-1981      -2550
	    10 KING    5000 17-NOV-1981       3700
	    10 MILLER  1300 23-JAN-1982        N/A

This solution is correct considering the data in table EMP but, if there were duplicate rows, the solution would fail. Consider the example below, showing four more employees hired on the same day as KING:

	insert into emp (empno,ename,deptno,sal,hiredate)
	values (1,'ant',10,1000,to_date('17-NOV-1981'))

	insert into emp (empno,ename,deptno,sal,hiredate)
	values (2,'joe',10,1500,to_date('17-NOV-1981'))

	insert into emp (empno,ename,deptno,sal,hiredate)
	values (3,'jim',10,1600,to_date('17-NOV-1981'))

	insert into emp (empno,ename,deptno,sal,hiredate)
	values (4,'jon',10,1700,to_date('17-NOV-1981'))

	select deptno,ename,sal,hiredate,
	       lpad(nvl(to_char(sal-next_sal),'N/A'),10) diff
	  from (
	select deptno,ename,sal,hiredate,
	       lead(sal)over(partition by deptno
	                         order by hiredate) next_sal
	  from emp
	 where deptno=10
	       )

	DEPTNO ENAME    SAL HIREDATE    DIFF
	------ ------ ----- ----------- ----------
	    10 CLARK   2450 09-JUN-1981       1450
	    10 ant     1000 17-NOV-1981       -500
	    10 joe     1500 17-NOV-1981      -3500
	    10 KING    5000 17-NOV-1981       3400
	    10 jim     1600 17-NOV-1981       -100
	    10 jon     1700 17-NOV-1981        400
	    10 MILLER  1300 23-JAN-1982        N/A

You’ll notice that with the exception of employee JON, all employees hired on the same date (November 17) evaluate their salary against another employee hired on the same date! This is incorrect. All employees hired on November 17 should have the difference of salary computed against MILLER’s salary, not another employee hired on November 17. Take, for example, employee ANT. The value for DIFF for ANT is–500 because ANT’s SAL is compared with JOE’s SAL and is 500 less than JOE’s SAL, hence the value of–500. The correct value for DIFF for employee ANT should be–300 because ANT makes 300 less than MILLER, who is the next employee hired by HIREDATE. The reason the solution seems to not work is due to the default behavior of Oracle’s LEAD OVER function. By default, LEAD OVER only looks ahead one row. So, for employee ANT, the next SAL based on HIREDATE is JOE’s SAL, because LEAD OVER simply looks one row ahead and doesn’t skip duplicates. Fortunately, Oracle planned for such a situation and allows you to pass an additional parameter to LEAD OVER to determine how far ahead it should look. In the example above, the solution is simply a matter of counting: find the distance from each employee hired on November 17 to January 23 (MILLER’s HIREDATE). The solution below shows how to accomplish this:

	select deptno,ename,sal,hiredate,
	       lpad(nvl(to_char(sal-next_sal),'N/A'),10) diff
	  from (
	select deptno,ename,sal,hiredate,
	       lead(sal,cnt-rn+1)over(partition by deptno
	                         order by hiredate) next_sal
	  from (
	select deptno,ename,sal,hiredate,
	       count(*)over(partition by deptno,hiredate) cnt,
	       row_number()over(partition by deptno,hiredate order by sal) rn
	  from emp
	 where deptno=10
	       )
	       )

	DEPTNO ENAME     SAL HIREDATE    DIFF
	------ ------  ----- ----------- ----------
	    10 CLARK    2450 09-JUN-1981       1450
	    10 ant      1000 17-NOV-1981       -300
	    10 joe      1500 17-NOV-1981        200
	    10 jim      1600 17-NOV-1981        300
	    10 jon      1700 17-NOV-1981        400
	    10 KING     5000 17-NOV-1981       3700
	    10 MILLER   1300 23-JAN-1982        N/A

Now the solution is correct. As you can see, all the employees hired on November 17 now have their salaries compared with MILLER’s salary. Inspecting the results, employee ANT now has a value of–300 for DIFF, which is what we were hoping for. If it isn’t immediately obvious, the expression passed to LEAD OVER; CNT-RN+1 is simply the distance from each employee hired on November 17 to MILLER. Consider the inline view below, which shows the values for CNT and RN:

	select deptno,ename,sal,hiredate,
	       count(*)over(partition by deptno,hiredate) cnt,
	       row_number()over(partition by deptno,hiredate order by sal) rn
	  from emp
	 where deptno=10

	DEPTNO ENAME    SAL HIREDATE           CNT         RN
	------ ------ ----- ----------- ---------- ----------
	    10 CLARK   2450 09-JUN-1981          1          1
	    10 ant     1000 17-NOV-1981          5          1
	    10 joe     1500 17-NOV-1981          5          2
	    10 jim     1600 17-NOV-1981          5          3
	    10 jon     1700 17-NOV-1981          5          4
	    10 KING    5000 17-NOV-1981          5          5
	    10 MILLER  1300 23-JAN-1982          1          1

The value for CNT represents, for each employee with a duplicate HIREDATE, how many duplicates there are in total for their HIREDATE. The value for RN represents a ranking for the employees in DEPTNO 10. The rank is partitioned by DEPTNO and HIREDATE so only employees with a HIREDATE that another employee has will have a value greater than one. The ranking is sorted by SAL (this is arbitrary; SAL is convenient, but we could have just as easily chosen EMPNO). Now that you know how many total duplicates there are and you have a ranking of each duplicate, the distance to MILLER is simply the total number of duplicates minus the current rank plus one (CNT-RN+1). The results of the distance calculation and its effect on LEAD OVER are shown below:

	select deptno,ename,sal,hiredate,
	       lead(sal)over(partition by deptno
	                         order by hiredate) incorrect,
	       cnt-rn+1 distance,
	       lead(sal,cnt-rn+1)over(partition by deptno
	                         order by hiredate) correct
	  from (
	select deptno,ename,sal,hiredate,
	       count(*)over(partition by deptno,hiredate) cnt,
	       row_number()over(partition by deptno,hiredate
	                            order by sal) rn
	  from emp
	 where deptno=10
	       )

	DEPTNO ENAME    SAL HIREDATE    INCORRECT    DISTANCE    CORRECT
	------ ------ ----- ----------- ---------- ---------- ----------
	    10 CLARK   2450 09-JUN-1981       1000          1       1000
	    10 ant     1000 17-NOV-1981       1500          5       1300
	    10 joe     1500 17-NOV-1981       1600          4       1300
	    10 jim     1600 17-NOV-1981       1700          3       1300
	    10 jon     1700 17-NOV-1981       5000          2       1300
	    10 KING    5000 17-NOV-1981       1300          1       1300
	    10 MILLER  1300 23-JAN-1982                     1

Now you can clearly see the effect that you have when you pass the correct distance to LEAD OVER. The rows for INCORRECT represent the values returned by LEAD OVER using a default distance of one. The rows for CORRECT represent the values returned by LEAD OVER using the proper distance for each employee with a duplicate HIREDATE to MILLER. At this point, all that is left is to find the difference between CORRECT and SAL for each row, which has already been shown.

 

Вас заинтересует / Intresting for you:

A Brief History of SQL
A Brief History of SQL 5401 views Александров Попков Wed, 17 Oct 2018, 15:04:29
SQL: how to determine which ro...
SQL: how to determine which ro... 1399 views Денис Tue, 06 Jul 2021, 18:47:53
How to Retrieving a Subset of ...
How to Retrieving a Subset of ... 1700 views Денис Wed, 14 Jul 2021, 04:45:37
Referencing an Aliased Column ...
Referencing an Aliased Column ... 6683 views Денис Wed, 14 Jul 2021, 12:59:03
Comments (0)
There are no comments posted here yet
Leave your comments
Posting as Guest
×
Suggested Locations