SQL: how to determine which rows represent a range of consecutive values?

Problem

You want to determine which rows represent a range of consecutive projects. Consider the following result set from view V, which contains data about a project and its start and end dates:

``````	select *
from V

PROJ_ID PROJ_START  PROJ_END
------- ----------- -----------
1 01-JAN-2005 02-JAN-2005
2 02-JAN-2005 03-JAN-2005
3 03-JAN-2005 04-JAN-2005
4 04-JAN-2005 05-JAN-2005
5 06-JAN-2005 07-JAN-2005
6 16-JAN-2005 17-JAN-2005
7 17-JAN-2005 18-JAN-2005
8 18-JAN-2005 19-JAN-2005
9 19-JAN-2005 20-JAN-2005
10 21-JAN-2005 22-JAN-2005
11 26-JAN-2005 27-JAN-2005
12 27-JAN-2005 28-JAN-2005
13 28-JAN-2005 29-JAN-2005
14 29-JAN-2005 30-JAN-2005``````

Excluding the first row, each row’s PROJ_START should equal the PROJ_END of the row before it (“before” is defined as PROJ_ID–1 for the current row). Examining the first five rows from view V, PROJ_IDs 1 through 3 are part of the same “group” as each PROJ_END equals the PROJ_START of the row after it. Because you want to find the range of dates for consecutive projects, you would like to return all rows where the current PROJ_END equals the next row’s PROJ_START. If the first five rows comprised the entire result set, you would like to return only the first three rows. The final result set (using all 14 rows from view V) should be:

``````	PROJ_ID PROJ_START  PROJ_END
------- ----------- -----------
1  01-JAN-2005 02-JAN-2005
2  02-JAN-2005 03-JAN-2005
3  03-JAN-2005 04-JAN-2005
6  16-JAN-2005 17-JAN-2005
7  17-JAN-2005 18-JAN-2005
8  18-JAN-2005 19-JAN-2005
11  26-JAN-2005 27-JAN-2005
12  27-JAN-2005 28-JAN-2005
13  28-JAN-2005 29-JAN-2005``````

The rows with PROJ_IDs 4,5,9,10, and 14 are excluded from this result set because the PROJ_END of each of these rows does not match the PROJ_START of the row following it.

Solution

This solution takes best advantage of the window function LEAD OVER to look at the “next” row’s BEGIN_DATE, thus avoiding the need to self join, which was necessary before window functions were widely introduced:

``````	1 select proj_id, proj_start, proj_end
2   from (
3 select proj_id, proj_start, proj_end,
5   from V
6        ) alias
7 where next_proj_start = proj_end``````

Discussion

Applies to: `DB2`, `MySQL`, `PostgreSQL`, `SQL Server` and `Oracle`

Although it is possible to develop a solution using a self-join, the window function LEAD OVER is perfect for this type of problem, and more intuitive. The function LEAD OVER allows you to examine other rows without performing a self join (though the function must impose order on the result set to do so). Consider the results of the inline view (lines 3–5) for IDs 1 and 4:

``````	select *
from (
select proj_id, proj_start, proj_end,
from v
)
where proj_id in ( 1, 4 )

PROJ_ID PROJ_START  PROJ_END    NEXT_PROJ_START
------- ----------- ----------- ---------------
1 01-JAN-2005 02-JAN-2005 02-JAN-2005
4 04-JAN-2005 05-JAN-2005 06-JAN-2005``````

Examining the above snippet of code and its result set, it is particularly easy to see why PROJ_ID 4 is excluded from the final result set of the complete solution. It’s excluded because its PROJ_END date of 05-JAN-2005 does not match the “next” project’s start date of 06-JAN-2005.

The function LEAD OVER is extremely handy when it comes to problems such as this one, particularly when examining partial results. When working with window functions, keep in mind that they are evaluated after the FROM and WHERE clauses, so the LEAD OVER function in the preceding query must be embedded within an inline view. Otherwise the LEAD OVER function is applied to the result set after the WHERE clause has filtered out all rows except for PROJ_ID’s 1 and 4.

Now, depending on how you view the data, you may very well want to include PROJ_ID 4 in the final result set. Consider the first five rows from view V:

``````	select *
from V
where proj_id <= 5

PROJ_ID PROJ_START  PROJ_END
------- ----------- -----------
1 01-JAN-2005 02-JAN-2005
2 02-JAN-2005 03-JAN-2005
3 03-JAN-2005 04-JAN-2005
4 04-JAN-2005 05-JAN-2005
5 06-JAN-2005 07-JAN-2005``````

If your requirement is such that PROJ_ID 4 is in fact contiguous (because PROJ_ START for PROJ_ID 4 matches PROJ_END for PROJ_ID 3), and that only PROJ_ ID 5 should be discarded, the proposed solution for this recipe is incorrect (!), or at the very least, incomplete:

``````	select proj_id, proj_start, proj_end
from (
select proj_id, proj_start, proj_end,
from V
where proj_id <= 5
)
where proj_end = next_start

PROJ_ID PROJ_START  PROJ_END
------- ----------- -----------
1 01-JAN-2005 02-JAN-2005
2 02-JAN-2005 03-JAN-2005
3 03-JAN-2005 04-JAN-2005``````

If you believe PROJ_ID 4 should be included, simply add LAG OVER to the query and use an additional filter in the `WHERE` clause:

``````	select proj_id, proj_start, proj_end
from (
select proj_id, proj_start, proj_end,
lag(proj_end)over(order by proj_id) last_end
from V
where proj_id <= 5
)
where proj_end = next_start
or proj_start = last_end

PROJ_ID PROJ_START  PROJ_END
------- ----------- -----------
1 01-JAN-2005 02-JAN-2005
2 02-JAN-2005 03-JAN-2005
3 03-JAN-2005 04-JAN-2005
4 04-JAN-2005 05-JAN-2005``````

Now PROJ_ID 4 is included in the final result set, and only the evil PROJ_ID 5 is excluded. Please consider your exact requirements when applying these recipes to your code.

Вас заинтересует / Intresting for you:

5040 views Александров Попков Wed, 17 Oct 2018, 15:04:29
6316 views Денис Wed, 14 Jul 2021, 12:59:03
1577 views Денис Wed, 14 Jul 2021, 04:45:37
1778 views Денис Sat, 10 Jul 2021, 18:02:52