Some time ago Philipp Salvisberg has posted several thoughts about joining and merging temporal intervals. Recently I was looking for some examples of using the new MATCH_RECOGNIZE clause introduced in 12c and found that these queries can be significantly simplified using the new clause.
I would urge you must read the original articles to understand the problem and solution:
Joining Temporal Intervals
Merging Temporal Intervals with Gaps
Joining Temporal Intervals Part 2
Briefly summarized: we have temporal versions of EMP, DEPT and JOB tables named EMPV, DEPTV and JOBV with validity columns VALID_FROM and VALID_TO. We want to join data with respect of temporal validity, possible gaps and also merge the records with identical data extending their validity periods. The script to create and populate the tables can be found here.
Here is the original query:
WITH
joined AS (
-- gap-aware temporal join
-- produces result_cols to calculate new_group in the subsequent query
SELECT e.empno,
g.valid_from,
LEAST(
e.valid_to,
d.valid_to,
j.valid_to,
NVL(m.valid_to, e.valid_to),
LEAD(g.valid_from - 1, 1, e.valid_to) OVER(
PARTITION BY e.empno ORDER BY g.valid_from
)
) AS valid_to,
(
e.ename
|| ',' || j.job
|| ',' || e.mgr
|| ',' || m.ename
|| ',' || TO_CHAR(e.hiredate,'YYYY-MM-DD')
|| ',' || e.sal
|| ',' || e.comm
|| ',' || e.deptno
|| ',' || d.dname
) AS result_cols,
e.ename,
j.job,
e.mgr,
m.ename AS mgr_ename,
e.hiredate,
e.sal,
e.comm,
e.deptno,
d.dname
FROM empv e
INNER JOIN (SELECT valid_from FROM empv
UNION
SELECT valid_from FROM deptv
UNION
SELECT valid_from FROM jobv
UNION
SELECT valid_to + 1 FROM empv
WHERE valid_to != DATE '9999-12-31'
UNION
SELECT valid_to + 1 FROM deptv
WHERE valid_to != DATE '9999-12-31'
UNION
SELECT valid_to + 1 FROM jobv
WHERE valid_to != DATE '9999-12-31') g
ON g.valid_from BETWEEN e.valid_from AND e.valid_to
INNER JOIN deptv d
ON d.deptno = e.deptno
AND g.valid_from BETWEEN d.valid_from AND d.valid_to
INNER JOIN jobv j
ON j.jobno = e.jobno
AND g.valid_from BETWEEN j.valid_from AND j.valid_to
LEFT JOIN empv m
ON m.empno = e.mgr
AND g.valid_from BETWEEN m.valid_from AND m.valid_to
),
calc_various AS (
-- produces columns has_gap, new_group
SELECT empno,
valid_from,
valid_to,
result_cols,
ename,
job,
mgr,
mgr_ename,
hiredate,
sal,
comm,
deptno,
dname,
CASE
WHEN LAG(valid_to, 1, valid_from - 1) OVER(
PARTITION BY empno ORDER BY valid_from
) = valid_from - 1 THEN
0
ELSE
1
END AS has_gap,
CASE
WHEN LAG(result_cols, 1, result_cols) OVER (
PARTITION BY empno ORDER BY valid_from
) = result_cols THEN
0
ELSE
1
END AS new_group
FROM joined
),
calc_group AS (
-- produces column group_no
SELECT empno,
valid_from,
valid_to,
ename,
job,
mgr,
mgr_ename,
hiredate,
sal,
comm,
deptno,
dname,
SUM(has_gap + new_group) OVER(
PARTITION BY empno ORDER BY valid_from
) AS group_no
FROM calc_various
),
merged AS (
-- produces the final merged result
-- grouping by group_no ensures that gaps are honored
SELECT empno,
MIN(valid_from) AS valid_from,
MAX(valid_to) AS valid_to,
ename,
job,
mgr,
mgr_ename,
hiredate,
sal,
comm,
deptno,
dname
FROM calc_group
GROUP BY empno,
group_no,
ename,
job,
mgr,
mgr_ename,
hiredate,
sal,
comm,
deptno,
dname
ORDER BY empno,
valid_from
)
-- main
select * FROM merged WHERE empno = 7788;
Let’s see what is going on. In the named query joined we are joining all possible time intervals. Next, we are using “start of the group” method. First, we identify records where we have a gap or another set of values (HAS_GAP and NEW_GROUP) – named query calc_various. Then we calculate the unique group number to be able to group records upon it – GROUP_NO in the named query calc_group. Finally we can group merging records and extending our time intervals – named query merged.
Looks complicated enough, isn’t it? Let’s try to simplify it using new feature – the MATCH_RECOGNIZE clause.
WITH joined AS (
-- gap-aware temporal join
-- produces result_cols to calculate new_group in the subsequent query
SELECT e.empno,
g.valid_from,
LEAST(
e.valid_to,
d.valid_to,
j.valid_to,
NVL(m.valid_to, e.valid_to),
LEAD(g.valid_from - 1, 1, e.valid_to) OVER(
PARTITION BY e.empno ORDER BY g.valid_from
)
) AS valid_to,
e.ename,
j.job,
e.mgr,
m.ename AS mgr_ename,
e.hiredate,
e.sal,
e.comm,
e.deptno,
d.dname
FROM empv e
INNER JOIN (SELECT valid_from FROM empv
UNION
SELECT valid_from FROM deptv
UNION
SELECT valid_from FROM jobv
UNION
SELECT valid_to + 1 FROM empv
WHERE valid_to != DATE '9999-12-31'
UNION
SELECT valid_to + 1 FROM deptv
WHERE valid_to != DATE '9999-12-31'
UNION
SELECT valid_to + 1 FROM jobv
WHERE valid_to != DATE '9999-12-31') g
ON g.valid_from BETWEEN e.valid_from AND e.valid_to
INNER JOIN deptv d
ON d.deptno = e.deptno AND g.valid_from BETWEEN d.valid_from AND d.valid_to
INNER JOIN jobv j
ON j.jobno = e.jobno AND g.valid_from BETWEEN j.valid_from AND j.valid_to
LEFT JOIN empv m
ON m.empno = e.mgr AND g.valid_from BETWEEN m.valid_from AND m.valid_to
)
SELECT empno
, valid_from
, valid_to
, ename
, job
, mgr
, mgr_ename
, hiredate
, sal
, comm
, deptno
, dname
FROM joined
MATCH_RECOGNIZE (
PARTITION BY empno
, ename
, job
, mgr
, mgr_ename
, hiredate
, sal
, comm
, deptno
, dname
ORDER BY valid_from
MEASURES FIRST(valid_from) valid_from, LAST(valid_to) valid_to
ONE ROW PER MATCH
PATTERN ( strt nxt* )
DEFINE nxt as valid_from = prev(valid_to) + 1
)
WHERE empno = 7788;
What we are doing here? We want to find a pattern where time intervals for the same values of content columns are connected and merge them.
- Lines 61-70: we partition our data by all content columns. Thus we are searching for a pattern only within the same values
- Line 71 : goes without saying, the data must be ordered by VALID_FROM within a partition
- Lines 74,75: next we define our pattern and pattern variables. Let’s look at the pattern. The pattern is expressed using regular expression syntax over logical pattern variables. We start with the variable strt, which is a special one since we don’t define a condition for it. This means it is always true and therefore the first row of a partition will always match. Next we have a variable nxt which might have 0 to N repetitions. The condition for this variable is that its VALID_FROM must be “connected” to the VALID_TO of the previous row (one day later). If this is the case, those further rows will belong to the same match. If no, then we have a gap and those rows become a start of the next pattern match.
- Lines 72,73: note that we are saying ONE ROW PER MATCH (which is default). Hence, we get only one row even if matching several connected time intervals. We can then “extend” the time interval of this result row using the VALID_FROM of the first matched row and VALID_TO of the last matched row: FIRST(valid_from) and LAST(valid_to)
Quite simple, isn’t it? Note how close is this code to plain English explanation.
