Fix Optimizer Estimate Issues from Implicit Conversions #JoelKallmanDay

This is not just another post about why correct data types matter. Most of you know that using the wrong data type in WHERE or JOIN conditions can trigger implicit conversions, prevent an index access path, and cause performance problems. But what if you can’t change the SQL statements, and you don’t want to redesign your data model? This post is about possible solutions when an implicit data type conversion leads to a wrong cardinality estimate.

Problem description

In Data Vault–based Data Warehouses, surrogate keys are often generated with MD5 hashes and stored as RAW(16) in Oracle. This design is efficient and follows best practices: it uses less space than VARCHAR2(32), avoids character set conversion, and aligns with the output of the STANDARD_HASH function used to generate the keys. As Dani Schnider shows on his blog, joins on RAW(16) are typically faster than on VARCHAR2(32).

Many teams keep the hash keys as surrogate keys in the downstream star schema to avoid remapping or regenerating keys. Whether this makes sense, and what the pros and cons are compared to numeric keys, is a topic for a separate discussion. I just want to mention: if you are already doing this or are planning to, make sure your BI tools can handle RAW/hex properly. Otherwise, here is what can happen.

Listing 1 shows the structure of the table ORDERS_HK. It has more than 900 thousand rows. The column status_hk has just two distinct values representing “processed” and “canceled”. And because business is going particularly well, the status “canceled” occurs in only 0.1% of cases. Because of this skew, a histogram has been created on this column.

create table orders_hk
( prod_hk        raw(16)      not null,
  cust_hk        raw(16)      not null,
  time_id        date         not null,
  channel_hk     raw(16)      not null,
  promo_hk       raw(16)      not null,
  status_hk      raw(16)      not null,
  quantity_sold  number(3)    not null,
  amount_sold    number(10,2) not null
);

SQL> select num_rows 
from user_tables 
where table_name = 'ORDERS_HK'

  NUM_ROWS
----------
    918843
1 row selected.

SQL> select column_name, num_distinct, histogram 
from sys.dba_tab_col_statistics s
where s.table_name = 'ORDERS_HK'
and   s.column_name = 'STATUS_HK'

COLUMN_NAM NUM_DISTINCT HISTOGRAM      
---------- ------------ ---------------
STATUS_HK             2 FREQUENCY      
1 row selected.

SQL> select status_hk, count(*)
from   orders_hk 
group by status_hk

STATUS_HK                          COUNT(*)
-------------------------------- ----------
9D090A4005926E9E2A32E05CBD8C2FC6     917925
38881E0A24039DC2621E1D6F86CB71F7        918

2 rows selected.

Listing 1: Setup

I will use a very simple query to show the misbehavior. The BI tool (I don’t want to name it, but I suspect more than one tool has this problem) lets you pick readable values like “canceled” for filtering, but actually uses their keys in the query it sends to the database. And here it is crucial how it does that. Listing 2 shows two alternatives: the first simply passes the hexadecimal value as a string; the second converts the string to the RAW data type using the function HEXTORAW().

select *
from   orders_hk 
where  status_hk = '38881E0A24039DC2621E1D6F86CB71F7'; 

-------------------------------------------------------------------------------
| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |           |   459K|    43M|  3721   (1)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| ORDERS_HK |   459K|    43M|  3721   (1)| 00:00:01 |
-------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter(RAWTOHEX("STATUS_HK")='38881E0A24039DC2621E1D6F86CB71F7')

select *
from   orders_hk 
where  status_hk = hextoraw('38881E0A24039DC2621E1D6F86CB71F7'); 

-------------------------------------------------------------------------------
| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |           |   331 | 33100 |  3715   (1)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| ORDERS_HK |   331 | 33100 |  3715   (1)| 00:00:01 |
-------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter("STATUS_HK"=HEXTORAW('38881E0A24039DC2621E1D6F86CB71F7'))

Listing 2: Passing the filter as VARCHAR2 vs. RAW

You can spot the difference, right? The first one causes an implicit conversion in the database: the function RAWTOHEX() is applied to the column value, as you can see in the filter predicate of the first plan. You may wonder why Oracle doesn’t instead apply HEXTORAW() to the literal. Well, Oracle has documented a data type precedence, and implicit conversion follows it. According to this, character data types have higher precedence than RAW, so the RAW column is converted to match the data type of the passed VARCHAR2 value, and not vice versa. This prevents the use of the histogram, and that leads to a huge mistake in the optimizer’s cardinality estimate: it assumes uniform distribution and estimates half the row count of the table (459K), since the column has only two distinct values – an overestimation by a factor of 500! As you can see in the second plan, with explicit conversion the histogram is used and we get an accurate estimate.

Since there are no indexes, the execution plans end up the same: a full table scan. So who cares, you may ask? Real-world queries in a data warehouse are often much more complex than this “one table, one filter” example. Just assume this predicate is part of a complex query involving 20–30 joins. The join order chosen by the optimizer is crucial for such a query, and an overestimation by a factor of 500 may well lead it to the wrong order. That’s the problem, and we need a solution. Moreover, we need a solution that doesn’t require changing existing code, since the SQL comes from the BI tool itself.

Solution Using Statistics Extension

First, we can try to create extended statistics for this table. They can be created not only for column groups but also for expressions, which fits our use case. After creating them with the function dbms_stats.create_extended_stats and regathering table statistics, we see the new hidden virtual column and the histogram created on it (Listing 3). Our problem query now produces a nearly perfect estimate of 918 rows!

SQL> select sys.dbms_stats.create_extended_stats(user,'ORDERS_HK','(RAWTOHEX(STATUS_HK))')
from dual

SYS.DBMS_STATS.CREATE_EXTENDED_STATS(USER,'ORDERS_HK','(RAWTOHEX(STATUS_HK))')  
--------------------------------------------------------------------------------
SYS_STUJKKESX5P05C1HFG41MNPS8I                                                  
1 row selected.

begin
  dbms_stats.gather_table_stats(user, 'ORDERS_HK');
end;
/

SQL> select column_name, num_distinct, histogram 
from sys.dba_tab_col_statistics s
where s.table_name = 'ORDERS_HK'

COLUMN_NAME                    NUM_DISTINCT HISTOGRAM      
------------------------------ ------------ ---------------
PROD_HK                                  72 NONE           
CUST_HK                                7059 NONE           
TIME_ID                                1459 NONE           
CHANNEL_HK                                4 NONE           
PROMO_HK                                  4 NONE           
STATUS_HK                                 2 FREQUENCY      
QUANTITY_SOLD                             1 NONE           
AMOUNT_SOLD                            3586 NONE           
SYS_STUJKKESX5P05C1HFG41MNPS8I            2 FREQUENCY      

9 rows selected.

select *
from   orders_hk 
where  status_hk = '38881E0A24039DC2621E1D6F86CB71F7'; 

-------------------------------------------------------------------------------
| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |           |   918 |   119K|  3720   (1)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| ORDERS_HK |   918 |   119K|  3720   (1)| 00:00:01 |
-------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter(RAWTOHEX("STATUS_HK")='38881E0A24039DC2621E1D6F86CB71F7')

Listing 3: Creating extended statistics to fix the problem

Looks good, but in the real world you may hit an unexpected problem: ORA-20008: Number of extensions in table <…> already reaches the upper limit (20). This limit of 20 can be higher for wider tables, as Jonathan Lewis writes. And if the preference AUTO_STAT_EXTENSIONS is set to ON in your environment, the optimizer may already have added many extended statistics to your table in the background. In that case you first need to identify automatically created column groups that you don’t really need and drop them before you can add this new expression as extended statistics.
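As a sketch of that cleanup, you can query the *_STAT_EXTENSIONS views, where the CREATOR column distinguishes automatically created extensions from user-created ones. The column group passed to drop_extended_stats below is just a hypothetical example – use the EXTENSION value reported for your table.

```sql
-- List all extensions on the table; CREATOR = 'SYSTEM' marks
-- column groups that were created automatically.
select extension_name, extension, creator, droppable
from   user_stat_extensions
where  table_name = 'ORDERS_HK';

-- Drop an automatically created column group you don't need
-- (the expression below is only an illustrative example):
begin
  dbms_stats.drop_extended_stats(
    ownname   => user,
    tabname   => 'ORDERS_HK',
    extension => '("PROD_HK","CUST_HK")');
end;
/
```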

Solution Using Virtual Column

The second approach is to define a virtual column that does the explicit data type conversion. User-created virtual columns don’t count toward the mentioned limit (at least once bug fix 35156634 is in place – it should be included in 19.16). Listing 4 illustrates this: after creating the new column STATUS_HKV and regathering statistics, the optimizer’s cardinality estimate is accurate. Notice how the filter predicate in the plan shows that the new virtual column is used even though I haven’t referenced it in the query!

ALTER TABLE ORDERS_HK ADD (STATUS_HKV  GENERATED ALWAYS AS 
	       (RAWTOHEX("STATUS_HK")) VIRTUAL ); 

Table altered.	       

begin
  dbms_stats.gather_table_stats(user, 'ORDERS_HK'
   , method_opt => 'FOR COLUMNS STATUS_HKV SIZE 2'
    , options => 'GATHER AUTO');
end;
/

SQL> select column_name, num_distinct, histogram 
from sys.dba_tab_col_statistics s
where s.table_name = 'ORDERS_HK'

COLUMN_NAME                    NUM_DISTINCT HISTOGRAM      
------------------------------ ------------ ---------------
PROD_HK                                  72 NONE           
CUST_HK                                7059 NONE           
TIME_ID                                1459 NONE           
CHANNEL_HK                                4 NONE           
PROMO_HK                                  4 NONE           
STATUS_HK                                 2 FREQUENCY      
QUANTITY_SOLD                             1 NONE           
AMOUNT_SOLD                            3586 NONE           
STATUS_HKV                                2 FREQUENCY      

9 rows selected.

select *
from   orders_hk 
where  status_hk = '38881E0A24039DC2621E1D6F86CB71F7'; 

-------------------------------------------------------------------------------
| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |           |   918 |   119K|  3720   (1)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| ORDERS_HK |   918 |   119K|  3720   (1)| 00:00:01 |
-------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter("ORDERS_HK"."STATUS_HKV"='38881E0A24039DC2621E1D6F86CB71F7')	 

Listing 4: Define a virtual column doing explicit conversion to VARCHAR2

One argument against this approach may be that you don’t want to introduce schema changes that could break existing code around this table, for example ETL processes. Properly written code should not break when a new column is added, but to be safe, you can define the new virtual column as an invisible column. Listing 5 verifies that this solves the original problem as well.

ALTER TABLE ORDERS_HK DROP COLUMN STATUS_HKV;
Table altered.
ALTER TABLE ORDERS_HK ADD (STATUS_HKV INVISIBLE GENERATED ALWAYS AS 
	       (RAWTOHEX("STATUS_HK")) VIRTUAL ); 
Table altered.	   
	       
begin
  dbms_stats.gather_table_stats(user, 'ORDERS_HK'
   , method_opt => 'FOR COLUMNS STATUS_HKV SIZE 2'
    , options => 'GATHER AUTO');
end;
/

SQL> select c.column_name, c.hidden_column, c.user_generated, c.num_distinct, c.histogram 
from  sys.dba_tab_cols c 
where c.table_name = 'ORDERS_HK'

COLUMN_NAME          HIDDEN_COLUMN USER_GENERATED NUM_DISTINCT HISTOGRAM      
-------------------- ------------- -------------- ------------ ---------------
PROD_HK              NO            YES                      72 NONE           
CUST_HK              NO            YES                    7059 NONE           
TIME_ID              NO            YES                    1459 NONE           
CHANNEL_HK           NO            YES                       4 NONE           
PROMO_HK             NO            YES                       4 NONE           
STATUS_HK            NO            YES                       2 FREQUENCY      
QUANTITY_SOLD        NO            YES                       1 NONE           
AMOUNT_SOLD          NO            YES                    3586 NONE           
STATUS_HKV           YES           YES                       2 FREQUENCY      

9 rows selected.

select *
from   orders_hk 
where  status_hk = '38881E0A24039DC2621E1D6F86CB71F7'; 

-------------------------------------------------------------------------------
| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |           |   918 |   119K|  3720   (1)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| ORDERS_HK |   918 |   119K|  3720   (1)| 00:00:01 |
-------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter("ORDERS_HK"."STATUS_HKV"='38881E0A24039DC2621E1D6F86CB71F7')

Listing 5: Create a virtual column as invisible

Conclusion

Implicit data type conversions can silently cripple query performance in Oracle by preventing the optimizer from using histograms. When comparing RAW(16) columns (such as MD5 hash keys) with string literals, Oracle’s data type precedence rules force conversion of the column, not the literal — breaking cardinality estimates and execution plans.

If you can’t change the SQL coming from BI tools, two safe options exist:

  • Extended statistics on RAWTOHEX(column) to guide the optimizer.
  • Virtual columns (visible or invisible) that expose the column as VARCHAR2, so statistics and histograms can be applied.

A third option is to create a function-based index on RAWTOHEX(status_hk). This can help if you specifically need index access, but it introduces more overhead: it creates a hidden virtual column (which counts toward the 20 extended statistics/hidden column limit), consumes extra space for the index itself, and increases maintenance cost. For most cases where only cardinality estimation needs to be corrected, extended statistics or user-defined virtual columns remain the cleaner and more flexible solutions.
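For completeness, a sketch of that third option is shown below. The index name is hypothetical, and gathering statistics afterwards lets the optimizer build a histogram on the hidden virtual column the index creates.

```sql
-- Function-based index on the converted expression
-- (index name is an example, not from the post above):
create index orders_hk_status_fbi
  on orders_hk (rawtohex(status_hk));

-- Gather statistics so the hidden virtual column behind the
-- index gets a histogram as well:
begin
  dbms_stats.gather_table_stats(user, 'ORDERS_HK',
    method_opt => 'FOR ALL HIDDEN COLUMNS SIZE AUTO');
end;
/
```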
