Building Hash Keys using SQL Macros in Oracle 20c

In this next post about SQL macros in Oracle 20c we look at how they can be useful for building hash keys. If you are familiar with data warehousing and the Data Vault modelling approach, you probably know why it can be a good idea to build hash keys or hash diffs. We will not discuss whether or not you should use them, but rather how you can do this in Oracle in a consistent and performant way.

There are several ways to generate a hash key in Oracle. The fastest one is the SQL function STANDARD_HASH, which was introduced in Oracle 12.1. For more details see the blog post by my colleague Dani Schnider.
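
As a quick illustration (a minimal example of mine, not one of the original listings): the optional second argument of STANDARD_HASH selects the algorithm, and the result is returned as a RAW value.

SELECT STANDARD_HASH('Oracle')        AS sha1_hash   -- SHA1 is the default algorithm
,      STANDARD_HASH('Oracle','MD5')  AS md5_hash
FROM   dual;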

On the other hand, it is very important to establish a set of rules to follow while building a hash key, especially in the case of multi-column keys or when hashing all columns of a data row to produce a hash diff. For example, you have to cast all data types to strings using the same format mask, concatenate all strings using the same delimiter, and so on.
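
A small example of mine to illustrate why the fixed delimiter belongs to these rules: without it, different column values can concatenate to the same string and therefore produce the same hash.

SELECT STANDARD_HASH('AB'||'C','MD5')        AS no_delim_1
,      STANDARD_HASH('A'||'BC','MD5')        AS no_delim_2   -- same hash as no_delim_1
,      STANDARD_HASH('AB'||'#'||'C','MD5')   AS with_delim   -- different hash
FROM   dual;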

A proven idea is to encapsulate these rules in a PL/SQL function that should be used by everyone in your project to build hash keys. In the past, it was not so easy to make such a function generic, though. Since Oracle 18c we could try to do it in a flexible way using polymorphic table functions (PTF). But there is still a problem with this approach.

STANDARD_HASH is only available in SQL; you cannot call it directly in PL/SQL, and this is still true in 20c. The only way is to do a SELECT FROM DUAL. But issuing that SELECT from the PL/SQL code of a PTF causes the infamous context switch!
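
This is what the workaround looks like in PL/SQL (a sketch of the SELECT FROM DUAL approach, just to show the context switch we want to avoid):

DECLARE
  l_hash RAW(16);   -- MD5 produces 16 bytes
BEGIN
  -- every call switches from the PL/SQL engine to the SQL engine and back
  SELECT STANDARD_HASH(TO_CHAR(7369),'MD5')
  INTO   l_hash
  FROM   dual;
  DBMS_OUTPUT.put_line(RAWTOHEX(l_hash));
END;
/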

So we need to encapsulate the rule set in PL/SQL while staying in the SQL engine. This is where SQL macros, introduced in 20c, can help.

Let’s take the EMP table as an example and look at what we want to achieve and where the problem is. We use the function STANDARD_HASH to build an MD5 hash of the primary key and of the whole row, i.e. all fields converted to strings and concatenated with a defined delimiter. Note that I formatted the hash values so that you only see a narrower part of them.

SQL> SELECT STANDARD_HASH(TO_CHAR(empno),'MD5') as hash_key
,      STANDARD_HASH(TO_CHAR(empno)||'#'||ename||'#'||job||'#'||
                     TO_CHAR(mgr)||'#'||TO_CHAR(hiredate,'YYYYMMDD')
                     ||'#'||TO_CHAR(sal)||'#'||TO_CHAR(comm)
                     ||'#'||TO_CHAR(deptno), 'MD5') as hash_diff
,      e.*
FROM   emp e
FETCH FIRST 3 ROWS ONLY

HASH_KEY   HASH_DIFF       EMPNO ENAME      JOB              MGR HIREDATE         SAL       COMM     DEPTNO
---------- ---------- ---------- ---------- --------- ---------- --------- ---------- ---------- ----------
0D7F9017FB 42CB6932B4       7369 SMITH      CLERK           7902 17-DEC-80        800                    20
7A2B33C672 AA63299F72       7499 ALLEN      SALESMAN        7698 20-FEB-81       1600        300         30
E1E1F667CE 27332DD16B       7521 WARD       SALESMAN        7698 22-FEB-81       1250        500         30

3 rows selected.

Listing 1: Manually build a hash key and a hash diff

It is obvious that with a concatenated key, or in the case of hash diffs, the expressions to be hashed become so complex that they can hardly be handled manually. Imagine what it looks like with tables of a few hundred columns! You need a generator (a data warehouse automation tool like BiGENiUS, for example) to generate this code. If you don’t have one, you can still generate the hardest part of such a SQL statement – with a SQL macro!

Implementation with SQL macro

Now, what kind of macro should we try: scalar or table? At first sight, we have to return scalar values – hash_key and hash_diff – so why not use a scalar macro? But note that we would then have to pass all the columns as parameters to build a hash diff. Again, think of a table with a few hundred columns 😉 Such a macro would be table-specific, meaning you would need yet another macro with different parameters for every other table. And last but not least, to be able to apply the string conversion rules, we would have to obtain the data type information for every column, most likely by querying the data dictionary. A sketch of such a table-specific scalar macro follows below.
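
Just to make the drawback visible, here is a hypothetical sketch of such a table-specific scalar macro (not something we will actually use; the function name and parameter are made up for illustration). The returned text is an expression, and the parameter name inside it is replaced by the actual argument at the call site:

CREATE OR REPLACE FUNCTION emp_hash_key(p_empno NUMBER)
RETURN VARCHAR2 SQL_MACRO(SCALAR)
AS
BEGIN
  -- hash key expression for exactly one table (EMP) and one key column
  RETURN q'[STANDARD_HASH(TO_CHAR(p_empno), 'MD5')]';
END;
/

SELECT emp_hash_key(empno) AS hash_key
,      e.*
FROM   emp e;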

Table macros, on the other hand, don’t have those disadvantages and can be made really generic.

Table is the default type for SQL macros, so you don’t have to specify the type in the function annotation. Table macros can only appear in a FROM clause. So what we want to implement is a SQL macro function that can be queried as follows and passes all source columns through while adding the two new columns HASH_KEY and HASH_DIFF:

SELECT e.hash_key
,      e.hash_diff
,      e.*
FROM   add_hash_columns (emp) e;
   

Listing 2: The desired way to use a SQL macro table function

The key difference is that table macros can accept table parameters. You may already know what table parameters are if you are familiar with polymorphic table functions (PTF), which were introduced in Oracle 18c. I have also blogged a lot about PTF. A PTF must have exactly one parameter of type TABLE. For use in the DESCRIBE method this is implicitly converted into a record type defined in the DBMS_TF package: DBMS_TF.TABLE_T. With SQL macros you cannot define arguments of type TABLE, but you can use DBMS_TF.TABLE_T directly instead. Passing a table name to such a macro function leads to an implicit conversion to that record type. A minimal sketch follows below.
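
To make that concrete, here is a minimal pass-through sketch of mine (not the final solution; the function name is made up): a table macro with a single DBMS_TF.TABLE_T parameter can already be queried with any table passed in by name.

CREATE OR REPLACE FUNCTION pass_through(t DBMS_TF.TABLE_T)
RETURN VARCHAR2 SQL_MACRO(TABLE)
AS
BEGIN
  -- accept any table and simply pass all its columns through
  RETURN 'SELECT * FROM ' || t.table_name;
END;
/

SELECT * FROM pass_through(emp);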

Let’s start with a macro that builds a hash diff only. For now, all we need is a single table parameter of type DBMS_TF.TABLE_T. The DBMS_TF.TABLE_T record contains all the information we need: all columns with their data types, as well as the table name. How exactly do we get at it? Please note that the documentation of DBMS_TF.TABLE_T is not correct as of now. You can refer to the package specification in the database itself, where you’ll find table_name as an attribute:

  TYPE TABLE_T          IS RECORD
  (
    column                TABLE_COLUMNS_T,     /* Column information */
    schema_name           dbms_quoted_id,  /* the schema name OF ptf */
    package_name          dbms_quoted_id, /* the package name OF ptf */
    ptf_name              dbms_quoted_id,    /* the ptf name invoked */
    table_schema_name     dbms_quoted_id,       /* schema name table */
    table_name            dbms_quoted_id               /* table name */
  );

Listing 3: the definition of the record DBMS_TF.TABLE_T

The field column is of type TABLE_COLUMNS_T, which is a table of the type COLUMN_T, which in turn is defined as:

TYPE COLUMN_T          IS RECORD
  (
    description            COLUMN_METADATA_T, /* Column metadata */
    pass_through           BOOLEAN,      /* Pass-through column? */
    for_read               BOOLEAN  /* Column data will be read? */
  );

Listing 4: record type COLUMN_T

Next we look at the field description and at its data type COLUMN_METADATA_T (Listing 5). Here we find the information we need to cast the actual data types to strings.

TYPE COLUMN_METADATA_T IS RECORD
  (
    type               PLS_INTEGER DEFAULT TYPE_VARCHAR2,
    max_len            PLS_integer DEFAULT -1,
    name               VARCHAR2(32767),
    name_len           PLS_INTEGER,
    /* following two attributes are used for numerical data */
    precision          PLS_INTEGER,
    scale              PLS_INTEGER,
    /* following three attributes are used for character data */
    charsetid          PLS_INTEGER,
    charsetform        PLS_INTEGER,
    collation          PLS_INTEGER,
    /* following attributes may be used in future */
    schema_name        DBMS_QUOTED_ID,
    schema_name_len    PLS_INTEGER,
    type_name          DBMS_QUOTED_ID,
    type_name_len      PLS_INTEGER
  );

Listing 5: record type COLUMN_METADATA_T

Now we know all the ingredients and can implement the function (Listing 6). Let’s inspect the code:

  • we’re building the hash diff of the whole row, so we iterate through all columns (FOR i IN 1..t.column.count)
  • the name of the current column is found in t.column(i).description.name
  • t.column(i).description.type gives us the data type of the column, which we compare against the type constants defined in DBMS_TF (TYPE_VARCHAR2, TYPE_NUMBER, TYPE_DATE)
  • depending on the data type, we keep extending the string to be hashed, applying the conversion rules and the delimiter
  • finally, we build and return the SELECT with the STANDARD_HASH expression along with all other columns of the table (the table name is available in t.table_name)

UPDATE 22.03.2020: We don’t actually have to concatenate the table name in the RETURN statement at the end of Listing 6. See the update at the end of this post.

CREATE OR REPLACE FUNCTION add_hash_columns(t DBMS_TF.TABLE_T) 
RETURN VARCHAR2 SQL_MACRO(TABLE)
AS
 v_hdiff clob ;
 v_str   varchar2(200);
 v_delimiter varchar2(9):= '||''#''||';
 v_name dbms_id;
BEGIN
  FOR I IN 1..t.column.count LOOP
    v_name := t.column(i).description.name;
    IF t.column(i).description.type = dbms_tf.type_varchar2 THEN  
      v_str := v_name;
    ELSIF t.column(i).description.type = dbms_tf.type_number THEN
      v_str := 'to_char('||v_name||')';
    ELSIF t.column(i).description.type = dbms_tf.type_date THEN
      v_str := 'to_char('||v_name||',''YYYYMMDD'')';
    END IF;
    v_hdiff := v_hdiff || v_delimiter || v_str;
  END LOOP;
  v_hdiff := LTRIM(v_hdiff,'|''#');
  RETURN 'SELECT  STANDARD_HASH('||v_hdiff||',''MD5'') hash_diff, '||
         ' e.* FROM '||t.table_name ||' e';
END;
/

Listing 6: Implementation for hash diff

That’s it! And it works! Even better, the EXPAND_SQL_TEXT procedure shows us the real SQL statement that actually runs:

SQL> SELECT e.*
FROM   add_hash_columns (emp) e
FETCH FIRST 3 ROWS ONLY

HASH_DIFF       EMPNO ENAME      JOB              MGR HIREDATE         SAL       COMM     DEPTNO
---------- ---------- ---------- --------- ---------- --------- ---------- ---------- ----------
42CB6932B4       7369 SMITH      CLERK           7902 17-DEC-80        800                    20
AA63299F72       7499 ALLEN      SALESMAN        7698 20-FEB-81       1600        300         30
27332DD16B       7521 WARD       SALESMAN        7698 22-FEB-81       1250        500         30

3 rows selected.

DECLARE
  l_clob CLOB;
BEGIN
  DBMS_UTILITY.expand_sql_text (
    input_sql_text  => q'!SELECT e.* FROM   add_hash_columns (emp) e!',
    output_sql_text => l_clob  );
  DBMS_OUTPUT.put_line(l_clob);
END;
/

SELECT "A1"."HASH_DIFF" "HASH_DIFF","A1"."EMPNO" "EMPNO","A1"."ENAME" "ENAME","A1"."JOB" "JOB","A1"."MGR" "MGR","A1"."HIREDATE" "HIREDATE","A1"."SAL" "SAL","A1"."COMM" "COMM","A1"."DEPTNO" "DEPTNO" 
FROM  (
SELECT "A2"."HASH_DIFF" "HASH_DIFF","A2"."EMPNO" "EMPNO","A2"."ENAME" "ENAME","A2"."JOB" "JOB","A2"."MGR" "MGR","A2"."HIREDATE" "HIREDATE","A2"."SAL" "SAL","A2"."COMM" "COMM","A2"."DEPTNO" "DEPTNO" 
FROM  (
SELECT STANDARD_HASH(TO_CHAR("A3"."EMPNO")||'#'||"A3"."ENAME"||'#'||"A3"."JOB"||
'#'||TO_CHAR("A3"."MGR")||'#'||TO_CHAR("A3"."HIREDATE",'YYYYMMDD')||
'#'||TO_CHAR("A3"."SAL")||'#'||TO_CHAR("A3"."COMM")||
'#'||TO_CHAR("A3"."DEPTNO"),'MD5') "HASH_DIFF",
"A3"."EMPNO" "EMPNO","A3"."ENAME" "ENAME","A3"."JOB" "JOB","A3"."MGR" "MGR","A3"."HIREDATE" "HIREDATE","A3"."SAL" "SAL","A3"."COMM" "COMM","A3"."DEPTNO" "DEPTNO" 
FROM "ONFTEST"."EMP" "A3") "A2") "A1"


Listing 7: Run the SQL with macro

Passing column lists

We started with building a hash diff only because it is the easier part. Why? Because we need ALL columns and can simply iterate through them. For a hash key, in contrast, we build the hash out of the business key. Often this is the primary key of the table, but not always. The safest way is to explicitly name the column(s) to be hashed. But how do we pass the column list?

With a PTF we can use the COLUMNS operator for that. Fortunately, we can use it with SQL macros too! Unfortunately, this is not mentioned in the documentation. Simply pass a list of columns inside this operator; the database will implicitly convert it to the type DBMS_TF.COLUMNS_T, which is a table of DBMS_QUOTED_ID.

Now we are ready to extend the implementation (Listing 8). As we iterate through the columns, we simply check whether the current column is a member of the key_cols collection (the MEMBER OF condition) and, if so, build an additional string to be hashed for the key. This works because both description.name and the identifiers passed via COLUMNS arrive in quoted form (e.g. "EMPNO").

CREATE OR REPLACE FUNCTION add_hash_columns(t DBMS_TF.TABLE_T
                                          , key_cols DBMS_TF.COLUMNS_T) 
RETURN VARCHAR2 SQL_MACRO(TABLE)
AS
 v_hdiff clob ;
 v_hkey  clob ;
 v_str   varchar2(200);
 v_delimiter varchar2(9):= '||''#''||';
 v_name dbms_id;
BEGIN
  FOR I IN 1..t.column.count LOOP
    v_name := t.column(i).description.name;
    IF t.column(i).description.type = dbms_tf.type_varchar2 THEN  
      v_str := v_name;
    ELSIF t.column(i).description.type = dbms_tf.type_number THEN
      v_str := 'to_char('||v_name||')';
    ELSIF t.column(i).description.type = dbms_tf.type_date THEN
      v_str := 'to_char('||v_name||',''YYYYMMDD'')';
    END IF;
    v_hdiff := v_hdiff || v_delimiter || v_str;
    IF v_name MEMBER OF key_cols THEN
      v_hkey := v_hkey || v_delimiter || v_str;
    END IF; 
  END LOOP;
  v_hdiff := LTRIM(v_hdiff,'|''#');
  v_hkey := LTRIM(v_hkey,'|''#');
  RETURN 'SELECT STANDARD_HASH('||v_hkey||',''MD5'') hash_key, '||
         '       STANDARD_HASH('||v_hdiff||',''MD5'') hash_diff, '||
         ' e.* FROM '||t.table_name ||' e';
END;
/

SQL> SELECT e.*
FROM   add_hash_columns (emp, COLUMNS(empno)) e
FETCH FIRST 3 ROWS ONLY

HASH_KEY   HASH_DIFF       EMPNO ENAME      JOB              MGR HIREDATE         SAL       COMM     DEPTNO
---------- ---------- ---------- ---------- --------- ---------- --------- ---------- ---------- ----------
0D7F9017FB 42CB6932B4       7369 SMITH      CLERK           7902 17-DEC-80        800                    20
7A2B33C672 AA63299F72       7499 ALLEN      SALESMAN        7698 20-FEB-81       1600        300         30
E1E1F667CE 27332DD16B       7521 WARD       SALESMAN        7698 22-FEB-81       1250        500         30

3 rows selected.

Listing 8: the final implementation

Conclusion

I think this “marriage” of PTF infrastructure and SQL macros is a great thing indeed. Look at Listing 8: we were able to produce the same output as in Listing 1, we were able to hide the complex logic and provide one generic function to be used across the whole project, and we have not compromised performance in any way! The real SQL statement that runs is actually the same as in Listing 1, so everything happens in the SQL engine with no context switches at all. This use case alone is enough for me to become a fan of SQL macros. But that is not all, stay tuned!

UPDATE

We don’t actually have to concatenate the table name as shown in the RETURN statement of the function in Listings 6 and 8. Parameter substitution works with table parameters as well: we can simply use the parameter name in the returned string, and the actual table name (or even a subquery, if you want) will be substituted and incorporated into the target SQL statement:

CREATE OR REPLACE FUNCTION add_hash_columns(t DBMS_TF.TABLE_T
                                          , key_cols DBMS_TF.COLUMNS_T) 
RETURN VARCHAR2 SQL_MACRO(TABLE)
AS
 v_hdiff clob ;
 v_hkey  clob ;
 v_str   varchar2(200);
 v_delimiter varchar2(9):= '||''#''||';
 v_name dbms_id;
BEGIN
  FOR I IN 1..t.column.count LOOP
    v_name := t.column(i).description.name;
    IF t.column(i).description.type = dbms_tf.type_varchar2 THEN  
      v_str := v_name;
    ELSIF t.column(i).description.type = dbms_tf.type_number THEN
      v_str := 'to_char('||v_name||')';
    ELSIF t.column(i).description.type = dbms_tf.type_date THEN
      v_str := 'to_char('||v_name||',''YYYYMMDD'')';
    END IF;
    v_hdiff := v_hdiff || v_delimiter || v_str;
    IF v_name MEMBER OF key_cols THEN
      v_hkey := v_hkey || v_delimiter || v_str;
    END IF; 
  END LOOP;
  v_hdiff := LTRIM(v_hdiff,'|''#');
  v_hkey := LTRIM(v_hkey,'|''#');
  RETURN 'SELECT STANDARD_HASH('||v_hkey||',''MD5'') hash_key, '||
         '       STANDARD_HASH('||v_hdiff||',''MD5'') hash_diff, '||
         ' t.* FROM t';
END;
/
SQL> SELECT e.*
FROM   add_hash_columns (emp, COLUMNS(empno)) e
FETCH FIRST 3 ROWS ONLY

HASH_KEY   HASH_DIFF       EMPNO ENAME      JOB              MGR HIREDATE         SAL       COMM     DEPTNO
---------- ---------- ---------- ---------- --------- ---------- --------- ---------- ---------- ----------
0D7F9017FB 42CB6932B4       7369 SMITH      CLERK           7902 17-DEC-80        800                    20
7A2B33C672 AA63299F72       7499 ALLEN      SALESMAN        7698 20-FEB-81       1600        300         30
E1E1F667CE 27332DD16B       7521 WARD       SALESMAN        7698 22-FEB-81       1250        500         30

3 rows selected.


5 thoughts on “Building Hash Keys using SQL Macros in Oracle 20c”

  1. Stew Ashton

    Thanks for this series on SQL macros! Very important new feature and trickier than I thought.

    You picked a good example to show everything that can be done with a table macro, however I would consider in real life a simpler solution:

    select e.*, standard_hash(json_object(*)) hash from emp e;

    (I don’t see the point of a “hash” key if we already have a primary key.)

    The problem with creating your own string is that you have to explicitly convert every datatype and explicitly choose the NLS parameters. The nice thing about JSON is that NLS parameters are ignored: the JSON generation functions will produce the same output in every database in the world and there is nothing any DBA or developer can do to change that.

    Thanks again for these posts…

    Best regards, Stew

    I suspect this solution will run just as fast too.

    1. admin (post author)

      Thank you for your comment and for the nice idea to use json_object, Stew!
      I will give it a try!
      Exactly, the problem with datatype conversion and NLS parameters is there, and that’s why you just need ONE function to handle it instead of every developer doing it her/his way.

      Best Regards, Andrej

  2. Stew Ashton

    I am rereading all your SQL macro posts, and I just realized that STANDARD_HASH doesn’t require you to convert data types to strings! To confirm that there is no implicit conversion to strings, here is an example with a number:

    SQL> create table t as select 4.2  n from dual;
    
    Table T created.
    
    SQL> alter session set nls_numeric_characters='.,';
    
    Session altered.
    
    SQL> select to_char(n) converted_number, standard_hash(n) from t;
    
    CONVERTED_NUMBER                         STANDARD_HASH(N)                        
    ---------------------------------------- ----------------------------------------
    4.2                                      552DD49F0244C49CCCE6054C11A1147E2F6FD44A
    
    SQL> alter session set nls_numeric_characters=', ';
    
    Session altered.
    
    SQL> select to_char(n) converted_number, standard_hash(n) from t;
    
    CONVERTED_NUMBER                         STANDARD_HASH(N)                        
    ---------------------------------------- ----------------------------------------
    4,2                                      552DD49F0244C49CCCE6054C11A1147E2F6FD44A
    

    The hash is identical even though the implicit conversion returns different strings.

    Best regards, Stew

    1. admin (post author)

      Hello Stew,
      yes, you can use STANDARD_HASH without first converting to string.
      But it is a kind of established practice in this particular application field (comparing rows via hash diff) to serialize a whole row as a string first.
      It is not only the need to hash every column separately as you mentioned in your second comment, but also interoperability between different platforms, technologies, programming languages and so on.
      If I don’t convert, I would have issues with the internal representation of NUMBER, FLOAT, DOUBLE, etc. Today I don’t know what will happen with my data tomorrow. Maybe I have to offload it together with the hash values to compare elsewhere, or maybe I will receive data to load with already generated hash diff values; the rules and results must be the same everywhere. That’s why – strings!
      I just checked it in Amazon Redshift, and the MD5 for the number 4.2 was different, whereas for the string ‘4.2’ it was of course the same.
      That said, STANDARD_HASH in Oracle has two major problems:
      1) not available in PL/SQL – can be addressed with SQL macro
      2) can’t handle CLOB – can also be handled in a macro, hashing smaller chunks, concatenating, hashing again, but then the same rules across different platforms will be difficult

      Best regards, Andrej

  3. Stew Ashton

    Sorry, I didn’t mention that this requires hashing every column separately, then perhaps concatenating the results before hashing them again. This probably requires more, even much more, CPU than your solution.

    There is also the issue of rows containing so much data that concatenating all the values would result in a string too long for a VARCHAR2…

