This post is again about the Slowly Changing Dimensions Type 2, but focusing on another problem. Once you have a need to validate the versioning mechanism, how you can do this? Or, in other words, having several versions of the same data (identified by the natural key), how to check what fields have been changed from version to version? Working with systems like Siebel CRM, which have some tables with 500+ columns, this possibility was really useful.
Of course you can write some PL/SQL code and iterate through the columns to compare their values. But I’m a friend of “pure SQL” solutions – let’s see how this can be done.
My idea is to prepare the data set with a row per “pair of rows to compare”, column name and value to compare and then to group on first two, while HAVING COUNT(DISTINCT value) will show us the changes. Sounds difficult? Let’s do it step by step.
We start with the table, which was created and loaded in my previous post about SCD2:
SQL> select * from t_target; DWH_KEY VALID_FROM VALID_TO C ETL BUS_K FIRST_NAME SECOND_NAM LAST_NAME HIRE_DATE FIRE_DATE SALARY ---------- ---------- --------- - --- ----- ---------- ---------- ---------- --------- --------- ---------- 1 01-DEC-16 02-DEC-16 N UPD 123 Roger Federer 01-JAN-10 900000 2 01-DEC-16 02-DEC-16 N UPD 456 Rafael Nadal 01-MAY-09 720000 3 01-DEC-16 01-DEC-16 N UPD 789 Serena Williams 01-JUN-08 650000 5 02-DEC-16 03-DEC-16 N UPD 789 Serena Jameka Williams 01-JUN-08 650000 6 02-DEC-16 02-DEC-16 N UPD 345 Venus Williams 01-NOV-16 500000 10 03-DEC-16 31-DEC-99 Y INS 345 Venus Williams 01-NOV-16 01-DEC-16 500000 11 03-DEC-16 31-DEC-99 Y INS 123 Roger Federer 01-JAN-10 920000 12 04-DEC-16 31-DEC-99 Y INS 789 Serena Jameka Williams 01-JUN-08 700000 8 rows selected
We should compare the three highlighted rows for the business key 789 an report the field name where the change took place and both the new and old values. In general, I would like to build a query which will be able to compare more than one pair of rows at a time. Therefore, the first thing we do, is identifying those pairs:
select a.dwh_key id_a , lead(a.dwh_key) over (order by a.valid_from) id_b from t_target a where a.bus_key = 789 ID_A ID_B ---------- ---------- 3 5 5 12 12
We are using the surrogate key DWH_KEY to identify the rows. And we must examine the changes within two pairs of rows: 3 <-> 5 and 5 <-> 12. The third row in the result is not of interest and well be filtered out in the next step.
Now, each pair consists of two rows. We duplicate the previous result set by joining it to the row source generator with exactly two rows (this small subquery FROM DUAL with CONNECT BY), providing the dwh_key identifying each part (highlighted line 9) and generating a unique number to identify each pair (line 10)
with p as ( select a.dwh_key id_a , lead(a.dwh_key) over (order by a.valid_from) id_b from t_target a where a.bus_key = 789 ) , pairs as ( select id_a , id_b , case when lvl = 1 then id_a else id_b end id , dense_rank() over (order by id_a, id_b) rn from p , (select level lvl from dual connect by level<=2) row_source where id_b is not null ) select * from pairs; ID_A ID_B ID RN ---------- ---------- ---------- ---------- 3 5 5 1 3 5 3 1 5 12 12 2 5 12 5 2
Now we are ready to collect column names and values. But how? We need to transpose the columns to be the rows of the result set – classical task for UNPIVOT operator. Don’t forget to use INCLUDE NULLS, because we are interested in those rows too.
SQL> with prep as ( select a.dwh_key id_a , lead(a.dwh_key) over (order by a.valid_from) id_b from t_target a where a.bus_key = 789 ) , pairs as ( select id_a , id_b , case when lvl = 1 then id_a else id_b end id , dense_rank() over (order by id_a, id_b) rn from prep , (select level lvl from dual connect by level<=2) row_source where id_b is not null ) select * from ( select p.rn, a.DWH_KEY, a.FIRST_NAME, a.SECOND_NAMES, a.LAST_NAME from t_target a, pairs p where a.dwh_key = p.id ) unpivot include nulls (column_value for column_name in (FIRST_NAME, SECOND_NAMES, LAST_NAME ) ); RN DWH_KEY COLUMN_NAME COLUMN_VALUE ---------- ---------- ------------------------------ ------------------------------ 1 3 FIRST_NAME Serena 1 3 SECOND_NAMES 1 3 LAST_NAME Williams 1 5 FIRST_NAME Serena 1 5 SECOND_NAMES Jameka 1 5 LAST_NAME Williams 2 5 FIRST_NAME Serena 2 5 SECOND_NAMES Jameka 2 5 LAST_NAME Williams 2 12 FIRST_NAME Serena 2 12 SECOND_NAMES Jameka 2 12 LAST_NAME Williams 12 rows selected
Looks promising. Now we can try to group by RN and COLUMN_NAME, and see the changes. But we need some more steps… First you cannot combine UNPIVOT and GROUP BY in one query, so we need to nest the query once more (lines 14-20). Another point to keep in mind is a special handling of NULL values. MIN, MAX and COUNT(distinct … ) are ignoring NULLs. But there are simple workarounds there: we can use KEEP clause with MIN and MAX (lines 22,23) and we can COUNT(distinct dump(column_value)) to count NULLS as well (line 26) (see more at Nulls and Aggregate Functions):
SQL> with prep as ( select a.dwh_key id_a , lead(a.dwh_key) over (order by a.valid_from) id_b from t_target a where a.bus_key = 789 ) , pairs as ( select id_a , id_b , case when lvl = 1 then id_a else id_b end id , dense_rank() over (order by id_a, id_b) rn from prep , (select level lvl from dual connect by level<=2) row_source where id_b is not null ) , unpivot_query as ( select * from ( select p.rn, a.DWH_KEY, a.FIRST_NAME, a.SECOND_NAMES, a.LAST_NAME from t_target a, pairs p where a.dwh_key = p.id ) unpivot include nulls (column_value for column_name in (FIRST_NAME, SECOND_NAMES, LAST_NAME ) ) ) select rn , min (column_value) keep (dense_rank first order by dwh_key) old_value , max (column_value) keep (dense_rank last order by dwh_key) new_value from unpivot_query group by rn, column_name having count(distinct dump(column_value)) > 1 order by rn; RN OLD_VALUE NEW_VALUE ---------- ------------------------------ ------------------------------ 1 Jameka
Ready? Not just yet. We’ve compared only three VARCHAR2 columns, but there are also some date and number fields in the table. Is it a problem? Yes, it is! Trying to UNPIVOT columns of different datatypes
results in ORA-01790:
... unpivot include nulls (column_value for column_name in (FIRST_NAME, SECOND_NAMES, LAST_NAME, SALARY ) ) ... ORA-01790: expression must have same datatype as corresponding expression
But it is not a kind of problem, we cannot solve! Just do three UNPIVOT subqueries each dealing with its own datatype family (VARCHAR2, DATE, NUMBER), convert the results to strings and then combine them with UNION ALL:
SQL> with prep as ( select a.dwh_key id_a , lead(a.dwh_key) over (order by a.valid_from) id_b from t_target a where a.bus_key = 789 ) , pairs as ( select id_a , id_b , case when lvl = 1 then id_a else id_b end id , dense_rank() over (order by id_a, id_b) rn from prep , (select level lvl from dual connect by level<=2) row_source where id_b is not null ) , unpivot_query as ( -- CHAR fields select * from ( select * from ( select p.rn, a.DWH_KEY, a.FIRST_NAME, a.SECOND_NAMES, a.LAST_NAME from t_target a, pairs p where a.dwh_key = p.id ) unpivot include nulls (column_value for column_name in (FIRST_NAME, SECOND_NAMES, LAST_NAME ) ) ) union all -- DATE fields select rn, DWH_KEY, column_name, to_char(column_value,'dd.mm.yyyy hh24:mi:ss') from (select * from ( select p.rn, a.DWH_KEY, a.HIRE_DATE, FIRE_DATE from t_target a, pairs p where a.dwh_key = p.id ) unpivot include nulls (column_value for column_name in (HIRE_DATE, FIRE_DATE ) ) ) union all --NUMBER fields select rn, DWH_KEY, column_name, to_char(column_value) from (select * from ( select p.rn, a.DWH_KEY, a.SALARY from t_target a, pairs p where a.dwh_key = p.id ) unpivot include nulls (column_value for column_name in (SALARY ) ) ) ) select rn , listagg(dwh_key,'-') within group (order by dwh_key) dwh_keys , column_name , min (column_value) keep (dense_rank first order by dwh_key) old_value , max (column_value) keep (dense_rank last order by dwh_key) new_value from unpivot_query group by rn, column_name having count(distinct dump(column_value)) > 1 order by rn; RN DWH_KEYS COLUMN_NAME OLD_VALUE NEW_VALUE ---------- ---------- ------------------------------ ------------------------------ ------------------------------ 1 3-5 SECOND_NAMES Jameka 2 5-12 SALARY 650000 700000
I added listagg as a pair of DWH_KEYs to show which rows were compared.
Okay, the resulting query is more complicated as initially expected. It was a challenging SQL-exercise and can be useful for ad hoc queries. If I had to implement it as a generic solution for regular use, I would rather tend to implement it using PL/SQL, getting columns ad their data types out of the data dictionary.