Debugging SCD2

This post is again about the Slowly Changing Dimensions Type 2, but focusing on another problem. Once you have a need to validate the versioning mechanism, how you can do this? Or, in other words, having several versions of the same data (identified by the natural key), how to check what fields have been changed from version to version? Working with systems like Siebel CRM, which have some tables with 500+ columns, this possibility was really useful.
Of course you can write some PL/SQL code and iterate through the columns to compare their values. But I’m a friend of “pure SQL” solutions – let’s see how this can be done.

My idea is to prepare the data set with a row per “pair of rows to compare”, column name and value to compare and then to group on first two, while HAVING COUNT(DISTINCT value) will show us the changes. Sounds difficult? Let’s do it step by step.

We start with the table, which was created and loaded in my previous post about SCD2:

SQL> select * from t_target;

   DWH_KEY VALID_FROM VALID_TO  C ETL BUS_K FIRST_NAME SECOND_NAM LAST_NAME  HIRE_DATE FIRE_DATE     SALARY
---------- ---------- --------- - --- ----- ---------- ---------- ---------- --------- --------- ----------
         1 01-DEC-16  02-DEC-16 N UPD 123   Roger                 Federer    01-JAN-10               900000
         2 01-DEC-16  02-DEC-16 N UPD 456   Rafael                Nadal      01-MAY-09               720000
         3 01-DEC-16  01-DEC-16 N UPD 789   Serena                Williams   01-JUN-08               650000
         5 02-DEC-16  03-DEC-16 N UPD 789   Serena     Jameka     Williams   01-JUN-08               650000
         6 02-DEC-16  02-DEC-16 N UPD 345   Venus                 Williams   01-NOV-16               500000
        10 03-DEC-16  31-DEC-99 Y INS 345   Venus                 Williams   01-NOV-16 01-DEC-16     500000
        11 03-DEC-16  31-DEC-99 Y INS 123   Roger                 Federer    01-JAN-10               920000
        12 04-DEC-16  31-DEC-99 Y INS 789   Serena     Jameka     Williams   01-JUN-08               700000

 8 rows selected

We should compare the three highlighted rows for the business key 789 an report the field name where the change took place and both the new and old values. In general, I would like to build a query which will be able to compare more than one pair of rows at a time. Therefore, the first thing we do, is identifying those pairs:

select a.dwh_key id_a
,      lead(a.dwh_key) over (order by a.valid_from) id_b
from   t_target a 
where  a.bus_key = 789 

      ID_A       ID_B
---------- ----------
         3          5
         5         12
        12

We are using the surrogate key DWH_KEY to identify the rows. And we must examine the changes within two pairs of rows: 3 <-> 5 and 5 <-> 12. The third row in the result is not of interest and well be filtered out in the next step.

Now, each pair consists of two rows. We duplicate the previous result set by joining it to the row source generator with exactly two rows (this small subquery FROM DUAL with CONNECT BY), providing the dwh_key identifying each part (highlighted line 9) and generating a unique number to identify each pair (line 10)

with p as ( 
  select a.dwh_key id_a
  ,      lead(a.dwh_key) over (order by a.valid_from) id_b
  from   t_target a 
  where  a.bus_key = 789  )
,  pairs as (
  select id_a
  ,      id_b 
  ,      case when lvl = 1 then id_a else id_b end id
  ,      dense_rank() over (order by id_a, id_b) rn 
  from   p
  ,     (select level lvl from dual connect by level<=2) row_source
  where id_b is not null )
select * from pairs;

      ID_A       ID_B         ID         RN
---------- ---------- ---------- ----------
         3          5          5          1
         3          5          3          1
         5         12         12          2
         5         12          5          2

Now we are ready to collect column names and values. But how? We need to transpose the columns to be the rows of the result set – classical task for UNPIVOT operator. Don’t forget to use INCLUDE NULLS, because we are interested in those rows too.

SQL> with prep as ( 
  select a.dwh_key id_a
  ,      lead(a.dwh_key) over (order by a.valid_from) id_b
  from   t_target a 
  where  a.bus_key = 789  )
,  pairs as (
  select id_a
  ,      id_b 
  ,      case when lvl = 1 then id_a else id_b end id
  ,      dense_rank() over (order by id_a, id_b) rn 
  from   prep
  ,     (select level lvl from dual connect by level<=2) row_source
  where id_b is not null )
select * 
from   ( select p.rn, a.DWH_KEY, a.FIRST_NAME, a.SECOND_NAMES, a.LAST_NAME
         from   t_target a, pairs p
         where  a.dwh_key = p.id )
unpivot include nulls (column_value for column_name in (FIRST_NAME, SECOND_NAMES, LAST_NAME )  );

        RN    DWH_KEY COLUMN_NAME                    COLUMN_VALUE                 
---------- ---------- ------------------------------ ------------------------------
         1          3 FIRST_NAME                     Serena                        
         1          3 SECOND_NAMES                                                 
         1          3 LAST_NAME                      Williams                      
         1          5 FIRST_NAME                     Serena                        
         1          5 SECOND_NAMES                   Jameka                        
         1          5 LAST_NAME                      Williams                      
         2          5 FIRST_NAME                     Serena                        
         2          5 SECOND_NAMES                   Jameka                        
         2          5 LAST_NAME                      Williams                      
         2         12 FIRST_NAME                     Serena                        
         2         12 SECOND_NAMES                   Jameka                        
         2         12 LAST_NAME                      Williams                      

 12 rows selected

Looks promising. Now we can try to group by RN and COLUMN_NAME, and see the changes. But we need some more steps… First you cannot combine UNPIVOT and GROUP BY in one query, so we need to nest the query once more (lines 14-20). Another point to keep in mind is a special handling of NULL values. MIN, MAX and COUNT(distinct … ) are ignoring NULLs. But there are simple workarounds there: we can use KEEP clause with MIN and MAX (lines 22,23) and we can COUNT(distinct dump(column_value)) to count NULLS as well (line 26) (see more at Nulls and Aggregate Functions):

SQL> with prep as ( 
  select a.dwh_key id_a
  ,      lead(a.dwh_key) over (order by a.valid_from) id_b
  from   t_target a 
  where  a.bus_key = 789  )
,  pairs as (
  select id_a
  ,      id_b 
  ,      case when lvl = 1 then id_a else id_b end id
  ,      dense_rank() over (order by id_a, id_b) rn 
  from   prep
  ,     (select level lvl from dual connect by level<=2) row_source
  where id_b is not null )
,  unpivot_query as (  
  select * 
  from   ( select p.rn, a.DWH_KEY, a.FIRST_NAME, a.SECOND_NAMES, a.LAST_NAME
           from   t_target a, pairs p
           where  a.dwh_key = p.id )
  unpivot include nulls (column_value for column_name in (FIRST_NAME, SECOND_NAMES, LAST_NAME )  )
  )
select rn
,      min (column_value) keep (dense_rank first order by dwh_key) old_value
,      max (column_value) keep (dense_rank last order by dwh_key) new_value
from   unpivot_query
group by rn, column_name
having count(distinct dump(column_value)) > 1
order by rn;

        RN OLD_VALUE                      NEW_VALUE                    
---------- ------------------------------ ------------------------------
         1                                Jameka

Ready? Not just yet. We’ve compared only three VARCHAR2 columns, but there are also some date and number fields in the table. Is it a problem? Yes, it is! Trying to UNPIVOT columns of different datatypes
results in ORA-01790:

...
unpivot include nulls (column_value for column_name in (FIRST_NAME, SECOND_NAMES, LAST_NAME, SALARY )  )
...
ORA-01790: expression must have same datatype as corresponding expression

But it is not a kind of problem, we cannot solve! Just do three UNPIVOT subqueries each dealing with its own datatype family (VARCHAR2, DATE, NUMBER), convert the results to strings and then combine them with UNION ALL:

SQL> with prep as ( 
  select a.dwh_key id_a
  ,      lead(a.dwh_key) over (order by a.valid_from) id_b
  from   t_target a 
  where  a.bus_key = 789  )
,  pairs as (
  select id_a
  ,      id_b 
  ,      case when lvl = 1 then id_a else id_b end id
  ,      dense_rank() over (order by id_a, id_b) rn 
  from   prep
  ,     (select level lvl from dual connect by level<=2) row_source
  where id_b is not null )
,  unpivot_query as (  
  -- CHAR fields
  select * from (
  select * 
  from   ( select p.rn, a.DWH_KEY, a.FIRST_NAME, a.SECOND_NAMES, a.LAST_NAME
           from   t_target a, pairs p
           where  a.dwh_key = p.id )
  unpivot include nulls (column_value for column_name in (FIRST_NAME, SECOND_NAMES, LAST_NAME )  )
  )
  union all
  -- DATE fields
  select rn, DWH_KEY, column_name, to_char(column_value,'dd.mm.yyyy hh24:mi:ss')  
  from (select * 
        from   ( select p.rn, a.DWH_KEY, a.HIRE_DATE, FIRE_DATE
                 from   t_target a, pairs p
                 where  a.dwh_key = p.id )
        unpivot include nulls (column_value for column_name in (HIRE_DATE, FIRE_DATE )  )
        )
  union all
  --NUMBER fields
  select rn, DWH_KEY, column_name, to_char(column_value) 
  from (select * 
        from   ( select p.rn, a.DWH_KEY, a.SALARY
                 from   t_target a, pairs p
                 where  a.dwh_key = p.id )
        unpivot include nulls (column_value for column_name in (SALARY )  )
       )
  )
select rn
,      listagg(dwh_key,'-') within group (order by dwh_key) dwh_keys
,      column_name 
,      min (column_value) keep (dense_rank first order by dwh_key) old_value
,      max (column_value) keep (dense_rank last order by dwh_key) new_value
from   unpivot_query
group by rn, column_name
having count(distinct dump(column_value)) > 1
order by rn;

        RN DWH_KEYS   COLUMN_NAME                    OLD_VALUE                      NEW_VALUE                    
---------- ---------- ------------------------------ ------------------------------ ------------------------------
         1 3-5        SECOND_NAMES                                                  Jameka                        
         2 5-12       SALARY                         650000                         700000

I added listagg as a pair of DWH_KEYs to show which rows were compared.
Okay, the resulting query is more complicated as initially expected. It was a challenging SQL-exercise and can be useful for ad hoc queries. If I had to implement it as a generic solution for regular use, I would rather tend to implement it using PL/SQL, getting columns ad their data types out of the data dictionary.

SQLORA

Just some more Oracle stuff

Debugging SCD2

Related Posts

Leave a Reply Cancel reply