Some time ago Philipp Salvisberg has posted several thoughts about joining and merging temporal intervals. Recently I was looking for some examples of using the new MATCH_RECOGNIZE clause introduced in 12c and found that these queries can be significantly simplified using the new clause.
I would urge you must read the original articles to understand the problem and solution:
Joining Temporal Intervals
Merging Temporal Intervals with Gaps
Joining Temporal Intervals Part 2
Briefly summarized: we have temporal versions of EMP, DEPT and JOB tables named EMPV, DEPTV and JOBV with validity columns VALID_FROM and VALID_TO. We want to join data with respect of temporal validity, possible gaps and also merge the records with identical data extending their validity periods. The script to create and populate the tables can be found here.
Here is the original query:
WITH joined AS ( -- gap-aware temporal join -- produces result_cols to calculate new_group in the subsequent query SELECT e.empno, g.valid_from, LEAST( e.valid_to, d.valid_to, j.valid_to, NVL(m.valid_to, e.valid_to), LEAD(g.valid_from - 1, 1, e.valid_to) OVER( PARTITION BY e.empno ORDER BY g.valid_from ) ) AS valid_to, ( e.ename || ',' || j.job || ',' || e.mgr || ',' || m.ename || ',' || TO_CHAR(e.hiredate,'YYYY-MM-DD') || ',' || e.sal || ',' || e.comm || ',' || e.deptno || ',' || d.dname ) AS result_cols, e.ename, j.job, e.mgr, m.ename AS mgr_ename, e.hiredate, e.sal, e.comm, e.deptno, d.dname FROM empv e INNER JOIN (SELECT valid_from FROM empv UNION SELECT valid_from FROM deptv UNION SELECT valid_from FROM jobv UNION SELECT valid_to + 1 FROM empv WHERE valid_to != DATE '9999-12-31' UNION SELECT valid_to + 1 FROM deptv WHERE valid_to != DATE '9999-12-31' UNION SELECT valid_to + 1 FROM jobv WHERE valid_to != DATE '9999-12-31') g ON g.valid_from BETWEEN e.valid_from AND e.valid_to INNER JOIN deptv d ON d.deptno = e.deptno AND g.valid_from BETWEEN d.valid_from AND d.valid_to INNER JOIN jobv j ON j.jobno = e.jobno AND g.valid_from BETWEEN j.valid_from AND j.valid_to LEFT JOIN empv m ON m.empno = e.mgr AND g.valid_from BETWEEN m.valid_from AND m.valid_to ), calc_various AS ( -- produces columns has_gap, new_group SELECT empno, valid_from, valid_to, result_cols, ename, job, mgr, mgr_ename, hiredate, sal, comm, deptno, dname, CASE WHEN LAG(valid_to, 1, valid_from - 1) OVER( PARTITION BY empno ORDER BY valid_from ) = valid_from - 1 THEN 0 ELSE 1 END AS has_gap, CASE WHEN LAG(result_cols, 1, result_cols) OVER ( PARTITION BY empno ORDER BY valid_from ) = result_cols THEN 0 ELSE 1 END AS new_group FROM joined ), calc_group AS ( -- produces column group_no SELECT empno, valid_from, valid_to, ename, job, mgr, mgr_ename, hiredate, sal, comm, deptno, dname, SUM(has_gap + new_group) OVER( PARTITION BY empno ORDER BY valid_from ) AS group_no FROM calc_various ), merged AS ( -- produces the final merged result -- grouping by group_no ensures that gaps are honored SELECT empno, MIN(valid_from) AS valid_from, MAX(valid_to) AS valid_to, ename, job, mgr, mgr_ename, hiredate, sal, comm, deptno, dname FROM calc_group GROUP BY empno, group_no, ename, job, mgr, mgr_ename, hiredate, sal, comm, deptno, dname ORDER BY empno, valid_from ) -- main select * FROM merged WHERE empno = 7788;
Let’s see what is going on. In the named query joined we are joining all possible time intervals. Next, we are using “start of the group” method. First, we identify records where we have a gap or another set of values (HAS_GAP and NEW_GROUP) – named query calc_various. Then we calculate the unique group number to be able to group records upon it – GROUP_NO in the named query calc_group. Finally we can group merging records and extending our time intervals – named query merged.
Looks complicated enough, isn’t it? Let’s try to simplify it using new feature – the MATCH_RECOGNIZE clause.
WITH joined AS ( -- gap-aware temporal join -- produces result_cols to calculate new_group in the subsequent query SELECT e.empno, g.valid_from, LEAST( e.valid_to, d.valid_to, j.valid_to, NVL(m.valid_to, e.valid_to), LEAD(g.valid_from - 1, 1, e.valid_to) OVER( PARTITION BY e.empno ORDER BY g.valid_from ) ) AS valid_to, e.ename, j.job, e.mgr, m.ename AS mgr_ename, e.hiredate, e.sal, e.comm, e.deptno, d.dname FROM empv e INNER JOIN (SELECT valid_from FROM empv UNION SELECT valid_from FROM deptv UNION SELECT valid_from FROM jobv UNION SELECT valid_to + 1 FROM empv WHERE valid_to != DATE '9999-12-31' UNION SELECT valid_to + 1 FROM deptv WHERE valid_to != DATE '9999-12-31' UNION SELECT valid_to + 1 FROM jobv WHERE valid_to != DATE '9999-12-31') g ON g.valid_from BETWEEN e.valid_from AND e.valid_to INNER JOIN deptv d ON d.deptno = e.deptno AND g.valid_from BETWEEN d.valid_from AND d.valid_to INNER JOIN jobv j ON j.jobno = e.jobno AND g.valid_from BETWEEN j.valid_from AND j.valid_to LEFT JOIN empv m ON m.empno = e.mgr AND g.valid_from BETWEEN m.valid_from AND m.valid_to ) SELECT empno , valid_from , valid_to , ename , job , mgr , mgr_ename , hiredate , sal , comm , deptno , dname FROM joined MATCH_RECOGNIZE ( PARTITION BY empno , ename , job , mgr , mgr_ename , hiredate , sal , comm , deptno , dname ORDER BY valid_from MEASURES FIRST(valid_from) valid_from, LAST(valid_to) valid_to ONE ROW PER MATCH PATTERN ( strt nxt* ) DEFINE nxt as valid_from = prev(valid_to) + 1 ) WHERE empno = 7788;
What we are doing here? We want to find a pattern where time intervals for the same values of content columns are connected and merge them.
- Lines 61-70: we partition our data by all content columns. Thus we are searching for a pattern only within the same values
- Line 71 : goes without saying, the data must be ordered by VALID_FROM within a partition
- Lines 74,75: next we define our pattern and pattern variables. Let’s look at the pattern. The pattern is expressed using regular expression syntax over logical pattern variables. We start with the variable strt, which is a special one since we don’t define a condition for it. This means it is always true and therefore the first row of a partition will always match. Next we have a variable nxt which might have 0 to N repetitions. The condition for this variable is that its VALID_FROM must be “connected” to the VALID_TO of the previous row (one day later). If this is the case, those further rows will belong to the same match. If no, then we have a gap and those rows become a start of the next pattern match.
- Lines 72,73: note that we are saying ONE ROW PER MATCH (which is default). Hence, we get only one row even if matching several connected time intervals. We can then “extend” the time interval of this result row using the VALID_FROM of the first matched row and VALID_TO of the last matched row: FIRST(valid_from) and LAST(valid_to)
Quite simple, isn’t it? Note how close is this code to plain English explanation.