Problem
You want to return the DEPTNO, ENAME, and SAL of each employee along with the difference in SAL between employees in the same department (i.e., having the same value for DEPTNO). The difference should be between each current employee and the employee hired immediately afterwards (you want to see if there is a correlation between seniority and salary on a “per department” basis). For each employee hired last in his department, return “N/A” for the difference. The result set should look like this:
DEPTNO ENAME SAL HIREDATE DIFF
------ ---------- ---------- ----------- ----------
10 CLARK 2450 09-JUN-1981 -2550
10 KING 5000 17-NOV-1981 3700
10 MILLER 1300 23-JAN-1982 N/A
20 SMITH 800 17-DEC-1980 -2175
20 JONES 2975 02-APR-1981 -25
20 FORD 3000 03-DEC-1981 0
20 SCOTT 3000 09-DEC-1982 1900
20 ADAMS 1100 12-JAN-1983 N/A
30 ALLEN 1600 20-FEB-1981 350
30 WARD 1250 22-FEB-1981 -1600
30 BLAKE 2850 01-MAY-1981 1350
30 TURNER 1500 08-SEP-1981 250
30 MARTIN 1250 28-SEP-1981 300
30 JAMES 950 03-DEC-1981 N/A
Solution
The is another example of where the window functions LEAD OVER and LAG OVER come in handy. You can easily access next and prior rows without additional joins. Alternative methods, such as subqueries or self-joins are possible but awkward.
What databases is applicable to? All Implementations
Use the window function LEAD OVER to access the “next” employee’s salary relative to the current row:
1 with next_sal_tab (deptno,ename,sal,hiredate,next_sal)
2 AS
3 (select deptno, ename, sal, hiredate,
4 lead(sal)over(partition by deptno
5 order by hiredate) as next_sal
6 from emp )
7
8 select deptno, ename, sal, hiredate
9 , coalesce(cast(sal-next_sal as char), 'N/A') as diff
10 from next_sal_tab
In this case, for the sake of variety, we have used a common table expression rather than a subquery - both will work across most RDBMS’s these days, with the preference usually relating to readability.
Discussion
The first step is to use the LEAD OVER window function to find the “next” salary for each employee within her department. The employees hired last in each department will have a NULL value for NEXT_SAL:
select deptno,ename,sal,hiredate,
lead(sal)over(partition by deptno order by hiredate) as next_sal
from emp
DEPTNO ENAME SAL HIREDATE NEXT_SAL
------ ---------- ---------- ----------- ----------
10 CLARK 2450 09-JUN-1981 5000
10 KING 5000 17-NOV-1981 1300
10 MILLER 1300 23-JAN-1982
20 SMITH 800 17-DEC-1980 2975
20 JONES 2975 02-APR-1981 3000
20 FORD 3000 03-DEC-1981 3000
20 SCOTT 3000 09-DEC-1982 1100
20 ADAMS 1100 12-JAN-1983
30 ALLEN 1600 20-FEB-1981 1250
30 WARD 1250 22-FEB-1981 2850
30 BLAKE 2850 01-MAY-1981 1500
30 TURNER 1500 08-SEP-1981 1250
30 MARTIN 1250 28-SEP-1981 950
30 JAMES 950 03-DEC-1981
The next step is to take the difference between each employee’s salary and the salary of the employee hired immediately after her in the same department:
select deptno,ename,sal,hiredate, sal-next_sal diff
from (
select deptno,ename,sal,hiredate,
lead(sal)over(partition by deptno order by hiredate) next_sal
from emp
)
DEPTNO ENAME SAL HIREDATE DIFF
------ ---------- ---------- ----------- ----------
10 CLARK 2450 09-JUN-1981 -2550
10 KING 5000 17-NOV-1981 3700
10 MILLER 1300 23-JAN-1982
20 SMITH 800 17-DEC-1980 -2175
20 JONES 2975 02-APR-1981 -25
20 FORD 3000 03-DEC-1981 0
20 SCOTT 3000 09-DEC-1982 1900
20 ADAMS 1100 12-JAN-1983
30 ALLEN 1600 20-FEB-1981 350
30 WARD 1250 22-FEB-1981 -1600
30 BLAKE 2850 01-MAY-1981 1350
30 TURNER 1500 08-SEP-1981 250
30 MARTIN 1250 28-SEP-1981 300
30 JAMES 950 03-DEC-1981
The next step is to use the coalesce function to insert “N/A” when there is no next salary. To be able to return “N/A” you must cast the value of DIFF to a string:
select deptno,ename,sal,hiredate,
nvl(to_char(sal-next_sal),'N/A') diff
from (
select deptno,ename,sal,hiredate,
lead(sal)over(partition by deptno order by hiredate) next_sal
from emp
)
DEPTNO ENAME SAL HIREDATE DIFF
------ ---------- ---------- ----------- ---------------
10 CLARK 2450 09-JUN-1981 -2550
10 KING 5000 17-NOV-1981 3700
10 MILLER 1300 23-JAN-1982 N/A
20 SMITH 800 17-DEC-1980 -2175
20 JONES 2975 02-APR-1981 -25
20 FORD 3000 03-DEC-1981 0
20 SCOTT 3000 09-DEC-1982 1900
20 ADAMS 1100 12-JAN-1983 N/A
30 ALLEN 1600 20-FEB-1981 350
30 WARD 1250 22-FEB-1981 -1600
30 BLAKE 2850 01-MAY-1981 1350
30 TURNER 1500 08-SEP-1981 250
30 MARTIN 1250 28-SEP-1981 300
30 JAMES 950 03-DEC-1981 N/A
While the majority of the solutions provided in this book do not deal with “what if” scenarios (for the sake of readability and the author’s sanity), the scenario involving duplicates when using the LEAD OVER function in this manner must be discussed. In the simple sample data in table EMP, no employees have duplicate HIREDATEs, yet this is a very likely situation. Normally, I would not discuss a “what if” situation such as duplicates (since there aren’t any in table EMP), but the workaround involving LEAD may not be immediately obvious. Consider the following query, which returns the difference in SAL between the employees in DEPTNO 10 (the difference is performed in the order in which they were hired):
select deptno,ename,sal,hiredate,
lpad(nvl(to_char(sal-next_sal),'N/A'),10) diff
from (
select deptno,ename,sal,hiredate,
lead(sal)over(partition by deptno
order by hiredate) next_sal
from emp
where deptno=10 and empno > 10
)
DEPTNO ENAME SAL HIREDATE DIFF
------ ------ ----- ----------- ----------
10 CLARK 2450 09-JUN-1981 -2550
10 KING 5000 17-NOV-1981 3700
10 MILLER 1300 23-JAN-1982 N/A
This solution is correct considering the data in table EMP but, if there were duplicate rows, the solution would fail. Consider the example below, showing four more employees hired on the same day as KING:
insert into emp (empno,ename,deptno,sal,hiredate)
values (1,'ant',10,1000,to_date('17-NOV-1981'))
insert into emp (empno,ename,deptno,sal,hiredate)
values (2,'joe',10,1500,to_date('17-NOV-1981'))
insert into emp (empno,ename,deptno,sal,hiredate)
values (3,'jim',10,1600,to_date('17-NOV-1981'))
insert into emp (empno,ename,deptno,sal,hiredate)
values (4,'jon',10,1700,to_date('17-NOV-1981'))
select deptno,ename,sal,hiredate,
lpad(nvl(to_char(sal-next_sal),'N/A'),10) diff
from (
select deptno,ename,sal,hiredate,
lead(sal)over(partition by deptno
order by hiredate) next_sal
from emp
where deptno=10
)
DEPTNO ENAME SAL HIREDATE DIFF
------ ------ ----- ----------- ----------
10 CLARK 2450 09-JUN-1981 1450
10 ant 1000 17-NOV-1981 -500
10 joe 1500 17-NOV-1981 -3500
10 KING 5000 17-NOV-1981 3400
10 jim 1600 17-NOV-1981 -100
10 jon 1700 17-NOV-1981 400
10 MILLER 1300 23-JAN-1982 N/A
You’ll notice that with the exception of employee JON, all employees hired on the same date (November 17) evaluate their salary against another employee hired on the same date! This is incorrect. All employees hired on November 17 should have the difference of salary computed against MILLER’s salary, not another employee hired on November 17. Take, for example, employee ANT. The value for DIFF for ANT is–500 because ANT’s SAL is compared with JOE’s SAL and is 500 less than JOE’s SAL, hence the value of–500. The correct value for DIFF for employee ANT should be–300 because ANT makes 300 less than MILLER, who is the next employee hired by HIREDATE. The reason the solution seems to not work is due to the default behavior of Oracle’s LEAD OVER function. By default, LEAD OVER only looks ahead one row. So, for employee ANT, the next SAL based on HIREDATE is JOE’s SAL, because LEAD OVER simply looks one row ahead and doesn’t skip duplicates. Fortunately, Oracle planned for such a situation and allows you to pass an additional parameter to LEAD OVER to determine how far ahead it should look. In the example above, the solution is simply a matter of counting: find the distance from each employee hired on November 17 to January 23 (MILLER’s HIREDATE). The solution below shows how to accomplish this:
select deptno,ename,sal,hiredate,
lpad(nvl(to_char(sal-next_sal),'N/A'),10) diff
from (
select deptno,ename,sal,hiredate,
lead(sal,cnt-rn+1)over(partition by deptno
order by hiredate) next_sal
from (
select deptno,ename,sal,hiredate,
count(*)over(partition by deptno,hiredate) cnt,
row_number()over(partition by deptno,hiredate order by sal) rn
from emp
where deptno=10
)
)
DEPTNO ENAME SAL HIREDATE DIFF
------ ------ ----- ----------- ----------
10 CLARK 2450 09-JUN-1981 1450
10 ant 1000 17-NOV-1981 -300
10 joe 1500 17-NOV-1981 200
10 jim 1600 17-NOV-1981 300
10 jon 1700 17-NOV-1981 400
10 KING 5000 17-NOV-1981 3700
10 MILLER 1300 23-JAN-1982 N/A
Now the solution is correct. As you can see, all the employees hired on November 17 now have their salaries compared with MILLER’s salary. Inspecting the results, employee ANT now has a value of–300 for DIFF, which is what we were hoping for. If it isn’t immediately obvious, the expression passed to LEAD OVER; CNT-RN+1 is simply the distance from each employee hired on November 17 to MILLER. Consider the inline view below, which shows the values for CNT and RN:
select deptno,ename,sal,hiredate,
count(*)over(partition by deptno,hiredate) cnt,
row_number()over(partition by deptno,hiredate order by sal) rn
from emp
where deptno=10
DEPTNO ENAME SAL HIREDATE CNT RN
------ ------ ----- ----------- ---------- ----------
10 CLARK 2450 09-JUN-1981 1 1
10 ant 1000 17-NOV-1981 5 1
10 joe 1500 17-NOV-1981 5 2
10 jim 1600 17-NOV-1981 5 3
10 jon 1700 17-NOV-1981 5 4
10 KING 5000 17-NOV-1981 5 5
10 MILLER 1300 23-JAN-1982 1 1
The value for CNT represents, for each employee with a duplicate HIREDATE, how many duplicates there are in total for their HIREDATE. The value for RN represents a ranking for the employees in DEPTNO 10. The rank is partitioned by DEPTNO and HIREDATE so only employees with a HIREDATE that another employee has will have a value greater than one. The ranking is sorted by SAL (this is arbitrary; SAL is convenient, but we could have just as easily chosen EMPNO). Now that you know how many total duplicates there are and you have a ranking of each duplicate, the distance to MILLER is simply the total number of duplicates minus the current rank plus one (CNT-RN+1). The results of the distance calculation and its effect on LEAD OVER are shown below:
select deptno,ename,sal,hiredate,
lead(sal)over(partition by deptno
order by hiredate) incorrect,
cnt-rn+1 distance,
lead(sal,cnt-rn+1)over(partition by deptno
order by hiredate) correct
from (
select deptno,ename,sal,hiredate,
count(*)over(partition by deptno,hiredate) cnt,
row_number()over(partition by deptno,hiredate
order by sal) rn
from emp
where deptno=10
)
DEPTNO ENAME SAL HIREDATE INCORRECT DISTANCE CORRECT
------ ------ ----- ----------- ---------- ---------- ----------
10 CLARK 2450 09-JUN-1981 1000 1 1000
10 ant 1000 17-NOV-1981 1500 5 1300
10 joe 1500 17-NOV-1981 1600 4 1300
10 jim 1600 17-NOV-1981 1700 3 1300
10 jon 1700 17-NOV-1981 5000 2 1300
10 KING 5000 17-NOV-1981 1300 1 1300
10 MILLER 1300 23-JAN-1982 1
Now you can clearly see the effect that you have when you pass the correct distance to LEAD OVER. The rows for INCORRECT represent the values returned by LEAD OVER using a default distance of one. The rows for CORRECT represent the values returned by LEAD OVER using the proper distance for each employee with a duplicate HIREDATE to MILLER. At this point, all that is left is to find the difference between CORRECT and SAL for each row, which has already been shown.