Merging temporal intervals using MATCH_RECOGNIZE

Some time ago Philipp Salvisberg has posted several thoughts about joining and merging temporal intervals. Recently I was looking for some examples of using the new MATCH_RECOGNIZE clause introduced in 12c and found that these queries can be significantly simplified using the new clause.

I would urge you must read the original articles to understand the problem and solution:

Joining Temporal Intervals
Merging Temporal Intervals with Gaps
Joining Temporal Intervals Part 2

Briefly summarized: we have temporal versions of EMP, DEPT and JOB tables named EMPV, DEPTV and JOBV with validity columns VALID_FROM and VALID_TO. We want to join data with respect of temporal validity, possible gaps and also merge the records with identical data extending their validity periods. The script to create and populate the tables can be found here.

Here is the original query:

WITH
   joined AS (
      -- gap-aware temporal join
      -- produces result_cols to calculate new_group in the subsequent query
      SELECT e.empno,
             g.valid_from,
             LEAST(
                e.valid_to,
                d.valid_to,
                j.valid_to,
                NVL(m.valid_to, e.valid_to),
                LEAD(g.valid_from - 1, 1, e.valid_to) OVER(
                   PARTITION BY e.empno ORDER BY g.valid_from
                )
             ) AS valid_to,
             (
                e.ename
                || ',' || j.job
                || ',' || e.mgr
                || ',' || m.ename
                || ',' || TO_CHAR(e.hiredate,'YYYY-MM-DD')
                || ',' || e.sal
                || ',' || e.comm
                || ',' || e.deptno
                || ',' || d.dname
             ) AS result_cols,
             e.ename,
             j.job,
             e.mgr,
             m.ename AS mgr_ename,
             e.hiredate,
             e.sal,
             e.comm,
             e.deptno,
             d.dname            
        FROM empv e
       INNER JOIN (SELECT valid_from FROM empv
                   UNION
                   SELECT valid_from FROM deptv
                   UNION
                   SELECT valid_from FROM jobv
                   UNION
                   SELECT valid_to + 1 FROM empv
                    WHERE valid_to != DATE '9999-12-31'
                   UNION
                   SELECT valid_to + 1 FROM deptv
                    WHERE valid_to != DATE '9999-12-31'
                   UNION
                   SELECT valid_to + 1 FROM jobv
                    WHERE valid_to != DATE '9999-12-31') g
          ON g.valid_from BETWEEN e.valid_from AND e.valid_to
       INNER JOIN deptv d
          ON d.deptno = e.deptno
             AND g.valid_from BETWEEN d.valid_from AND d.valid_to
       INNER JOIN jobv j
          ON j.jobno = e.jobno
             AND g.valid_from BETWEEN j.valid_from AND j.valid_to
        LEFT JOIN empv m
          ON m.empno = e.mgr
             AND g.valid_from BETWEEN m.valid_from AND m.valid_to
   ),
   calc_various AS (
      -- produces columns has_gap, new_group
      SELECT empno,
             valid_from,
             valid_to,
             result_cols,
             ename,
             job,
             mgr,
             mgr_ename,
             hiredate,
             sal,
             comm,
             deptno,
             dname,
             CASE
                WHEN LAG(valid_to, 1, valid_from - 1) OVER(
                        PARTITION BY empno ORDER BY valid_from
                     ) = valid_from - 1 THEN
                   0
                ELSE
                   1
             END AS has_gap,
             CASE
                WHEN LAG(result_cols, 1, result_cols) OVER (
                        PARTITION BY empno ORDER BY valid_from
                     ) = result_cols THEN
                   0
                ELSE
                   1
             END AS new_group
        FROM joined
   ),
   calc_group AS (
      -- produces column group_no
      SELECT empno,
             valid_from,
             valid_to,
             ename,
             job,
             mgr,
             mgr_ename,
             hiredate,
             sal,
             comm,
             deptno,
             dname,
              SUM(has_gap + new_group) OVER(
                PARTITION BY empno ORDER BY valid_from
             ) AS group_no
        FROM calc_various
   ),
   merged AS (
      -- produces the final merged result
      -- grouping by group_no ensures that gaps are honored
      SELECT empno,
             MIN(valid_from) AS valid_from,
             MAX(valid_to) AS valid_to,
             ename,
             job,
             mgr,
             mgr_ename,
             hiredate,
             sal,
             comm,
             deptno,
             dname
        FROM calc_group
       GROUP BY empno,
                group_no,
                ename,
                job,
                mgr,
                mgr_ename,
                hiredate,
                sal,
                comm,
                deptno,
                dname
       ORDER BY empno,
                valid_from
   )  
-- main
select * FROM merged WHERE empno = 7788;

Let’s see what is going on. In the named query joined we are joining all possible time intervals. Next, we are using “start of the group” method. First, we identify records where we have a gap or another set of values (HAS_GAP and NEW_GROUP) – named query calc_various. Then we calculate the unique group number to be able to group records upon it – GROUP_NO in the named query calc_group. Finally we can group merging records and extending our time intervals – named query merged.

Looks complicated enough, isn’t it? Let’s try to simplify it using new feature – the MATCH_RECOGNIZE clause.

WITH  joined AS (
      -- gap-aware temporal join
      -- produces result_cols to calculate new_group in the subsequent query
      SELECT e.empno,
             g.valid_from,
             LEAST(
                e.valid_to,
                d.valid_to,
                j.valid_to,
                NVL(m.valid_to, e.valid_to),
                LEAD(g.valid_from - 1, 1, e.valid_to) OVER(
                   PARTITION BY e.empno ORDER BY g.valid_from
                )
             ) AS valid_to,
             e.ename,
             j.job,
             e.mgr,
             m.ename AS mgr_ename,
             e.hiredate,
             e.sal,
             e.comm,
             e.deptno,
             d.dname            
        FROM empv e
       INNER JOIN (SELECT valid_from FROM empv
                   UNION
                   SELECT valid_from FROM deptv
                   UNION
                   SELECT valid_from FROM jobv
                   UNION
                   SELECT valid_to + 1 FROM empv
                    WHERE valid_to != DATE '9999-12-31'
                   UNION
                   SELECT valid_to + 1 FROM deptv
                    WHERE valid_to != DATE '9999-12-31'
                   UNION
                   SELECT valid_to + 1 FROM jobv
                    WHERE valid_to != DATE '9999-12-31') g
          ON g.valid_from BETWEEN e.valid_from AND e.valid_to
       INNER JOIN deptv d
          ON d.deptno = e.deptno AND g.valid_from BETWEEN d.valid_from AND d.valid_to
       INNER JOIN jobv j
          ON j.jobno = e.jobno AND g.valid_from BETWEEN j.valid_from AND j.valid_to
        LEFT JOIN empv m
          ON m.empno = e.mgr AND g.valid_from BETWEEN m.valid_from AND m.valid_to
   )
SELECT empno
,      valid_from
,      valid_to
,      ename
,      job
,      mgr
,      mgr_ename
,      hiredate
,      sal
,      comm
,      deptno
,      dname
FROM joined
      MATCH_RECOGNIZE (
        PARTITION BY empno
                   , ename
                   , job
                   , mgr
                   , mgr_ename
                   , hiredate
                   , sal
                   , comm
                   , deptno
                   , dname            
        ORDER BY valid_from
        MEASURES FIRST(valid_from) valid_from, LAST(valid_to) valid_to
        ONE ROW PER MATCH
        PATTERN ( strt nxt* )
        DEFINE nxt as valid_from = prev(valid_to) + 1
        )
WHERE empno = 7788; 

What we are doing here? We want to find a pattern where time intervals for the same values of content columns are connected and merge them.

  • Lines 61-70: we partition our data by all content columns. Thus we are searching for a pattern only within the same values
  • Line 71 : goes without saying, the data must be ordered by VALID_FROM within a partition
  • Lines 74,75: next we define our pattern and pattern variables. Let’s look at the pattern. The pattern is expressed using regular expression syntax over logical pattern variables. We start with the variable strt, which is a special one since we don’t define a condition for it. This means it is always true and therefore the first row of a partition will always match. Next we have a variable nxt which might have 0 to N repetitions. The condition for this variable is that its VALID_FROM must be “connected” to the VALID_TO of the previous row (one day later). If this is the case, those further rows will belong to the same match. If no, then we have a gap and those rows become a start of the next pattern match.
  • Lines 72,73: note that we are saying ONE ROW PER MATCH (which is default). Hence, we get only one row even if matching several connected time intervals. We can then “extend” the time interval of this result row using the VALID_FROM of the first matched row and VALID_TO of the last matched row: FIRST(valid_from) and LAST(valid_to)

Quite simple, isn’t it? Note how close is this code to plain English explanation.

Leave a Reply

Your email address will not be published. Required fields are marked *