{"id":2446,"date":"2025-10-14T22:17:15","date_gmt":"2025-10-14T20:17:15","guid":{"rendered":"https:\/\/blog.sqlora.com\/en\/?p=2446"},"modified":"2025-10-14T22:17:16","modified_gmt":"2025-10-14T20:17:16","slug":"fix-optimizer-estimate-issues-from-implicit-conversions-joelkallmanday","status":"publish","type":"post","link":"https:\/\/blog.sqlora.com\/en\/fix-optimizer-estimate-issues-from-implicit-conversions-joelkallmanday\/","title":{"rendered":"Fix Optimizer Estimate Issues from Implicit Conversions #JoelKallmanDay"},"content":{"rendered":"\n<p>This is not just another post about why correct data types matter. Most of you know that using the wrong data type in WHERE or JOIN conditions can trigger implicit conversions, prevent an index access path, and cause performance problems. But what if you can\u2019t change the SQL statements, and you don\u2019t want to redesign your data model? This post is about possible solutions in case of a wrong cardinality estimation due to an implicit data type conversion.<\/p>\n\n\n\n<!--more-->\n\n\n\n<h2 class=\"wp-block-heading\">Problem description<\/h2>\n\n\n\n<p>In Data Vault\u2013based Data Warehouses, surrogate keys are often generated with MD5 hashes and stored as RAW(16) in Oracle. This design is efficient and follows best practices: it uses less space than VARCHAR2(32), avoids character set conversion, and aligns with the output of the STANDARD_HASH function used to generate the keys. As Dani Schnider shows on his blog, <a href=\"https:\/\/danischnider.wordpress.com\/2018\/12\/11\/data-types-of-join-columns\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">joins on RAW(16) are typically faster than on VARCHAR2(32)<\/a>.<\/p>\n\n\n\n<p>Many teams keep the hash keys as surrogate keys in the downstream star schema to avoid remapping or regenerating keys. Whether it makes sense and what are the pros and cons over using numeric keys &#8211; this can be a topic for a separate discussion. 
If you are already doing this or are planning to, just make sure your BI tools can handle RAW\/hex properly. Otherwise, here is what can happen.<\/p>\n\n\n\n<p>Listing 1 shows the structure of the table ORDERS_HK. It has more than 900,000 rows. The column <strong>status_hk<\/strong> has just two distinct values, representing &#8220;processed&#8221; and &#8220;canceled&#8221;. And because business is going particularly well, the status &#8220;canceled&#8221; occurs in only 0.1% of cases. Because of this skew, a histogram has been created on this column.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: sql; title: ; notranslate\" title=\"\">\ncreate table orders_hk\n( prod_hk        raw(16)      not null,\n  cust_hk        raw(16)      not null,\n  time_id        date         not null,\n  channel_hk     raw(16)      not null,\n  promo_hk       raw(16)      not null,\n  status_hk      raw(16)      not null,\n  quantity_sold  number(3)    not null,\n  amount_sold    number(10,2) not null\n);\n\nSQL&gt; select num_rows \nfrom user_tables \nwhere table_name = &#039;ORDERS_HK&#039;\n\n  NUM_ROWS\n----------\n    918843\n1 row selected.\n\nSQL&gt; select column_name, num_distinct, histogram \nfrom sys.dba_tab_col_statistics s\nwhere s.table_name = &#039;ORDERS_HK&#039;\nand   s.column_name = &#039;STATUS_HK&#039;\n\nCOLUMN_NAM NUM_DISTINCT HISTOGRAM      \n---------- ------------ ---------------\nSTATUS_HK             2 FREQUENCY      \n1 row selected.\n\nSQL&gt; select status_hk, count(*)\nfrom   orders_hk \ngroup by status_hk\n\nSTATUS_HK                          COUNT(*)\n-------------------------------- ----------\n9D090A4005926E9E2A32E05CBD8C2FC6     917925\n38881E0A24039DC2621E1D6F86CB71F7        918\n\n2 rows selected.\n<\/pre><\/div>\n\n\n<p><strong><em>Listing 1: Setup<\/em><\/strong><\/p>\n\n\n\n<p>I will use a very simple query to demonstrate the misbehavior. 
The BI tool (I don&#8217;t want to name it, but I suspect more than one tool has this problem) lets you pick readable values like &#8220;canceled&#8221; for filtering, but actually uses their keys in the query it sends to the database. And here it is crucial how this is done. Listing 2 shows two alternatives: the first simply passes the hexadecimal value as a string, while the second converts the string to the RAW datatype using the function <strong>HEXTORAW()<\/strong>.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: sql; highlight: [9,15,25]; title: ; notranslate\" title=\"\">\nselect *\nfrom   orders_hk \nwhere  status_hk = &#039;38881E0A24039DC2621E1D6F86CB71F7&#039;; \n\n-------------------------------------------------------------------------------\n| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |\n-------------------------------------------------------------------------------\n|   0 | SELECT STATEMENT  |           |   459K|    43M|  3721   (1)| 00:00:01 |\n|*  1 |  TABLE ACCESS FULL| ORDERS_HK |   459K|    43M|  3721   (1)| 00:00:01 |\n-------------------------------------------------------------------------------\n \nPredicate Information (identified by operation id):\n---------------------------------------------------\n \n   1 - filter(RAWTOHEX(&quot;STATUS_HK&quot;)=&#039;38881E0A24039DC2621E1D6F86CB71F7&#039;)\n\nselect *\nfrom   orders_hk \nwhere  status_hk = hextoraw(&#039;38881E0A24039DC2621E1D6F86CB71F7&#039;); \n\n-------------------------------------------------------------------------------\n| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |\n-------------------------------------------------------------------------------\n|   0 | SELECT STATEMENT  |           |   331 | 33100 |  3715   (1)| 00:00:01 |\n|*  1 |  TABLE ACCESS FULL| ORDERS_HK |   331 | 33100 |  3715   (1)| 00:00:01 
|\n-------------------------------------------------------------------------------\n \nPredicate Information (identified by operation id):\n---------------------------------------------------\n \n   1 - filter(&quot;STATUS_HK&quot;=HEXTORAW(&#039;38881E0A24039DC2621E1D6F86CB71F7&#039;))\n<\/pre><\/div>\n\n\n<p><strong><em>Listing 2: Passing the filter as VARCHAR2 vs. RAW<\/em><\/strong><\/p>\n\n\n\n<p>You can spot the difference, right? The first one causes an implicit conversion in the database: the function <strong>RAWTOHEX()<\/strong> is applied to the column value, as the filter condition in line 15 shows. You may wonder why Oracle doesn\u2019t instead apply <strong>HEXTORAW()<\/strong> to the literal. Well, Oracle has <a href=\"https:\/\/docs.oracle.com\/en\/database\/oracle\/oracle-database\/19\/sqlrf\/Data-Type-Comparison-Rules.html#GUID-82C9B166-5850-48F5-B3ED-BB82B3631407\" target=\"_blank\" rel=\"noreferrer noopener\">documented a datatype precedence<\/a>, and implicit conversion follows it. Character data types have higher precedence than RAW, so the RAW column is converted to match the data type of the passed VARCHAR2 value and not vice versa. This prevents the use of the histogram and leads to a huge mistake in the optimizer\u2019s cardinality estimation: it estimates half of the table\u2019s row count, since the column has only two distinct values (line 9) &#8211; an overestimation by a factor of 500! With the second approach, using explicit conversion, the histogram is used and we get an accurate estimate.<\/p>\n\n\n\n<p>Since there are no indexes, both execution plans end up the same after all: a full table scan. So, who cares, you may ask? Real-world queries in a data warehouse are often much more complex than this &#8220;one table, one filter&#8221; example. Just assume this is part of a complex query involving 20\u201330 joins. 
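<\/p>\n\n\n\n<p>A quick way to verify such a misestimate on your own system is to compare estimated and actual row counts for the query. A minimal sketch (the literal is the &#8220;canceled&#8221; key from Listing 1):<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: sql; title: ; notranslate\" title=\"\">\n-- run the query once with rowsource statistics enabled\nselect \/*+ gather_plan_statistics *\/ count(*)\nfrom   orders_hk\nwhere  status_hk = &#039;38881E0A24039DC2621E1D6F86CB71F7&#039;;\n\n-- E-Rows (estimated) vs. A-Rows (actual) for the last executed cursor\nselect *\nfrom   table(dbms_xplan.display_cursor(format =&gt; &#039;ALLSTATS LAST&#039;));\n<\/pre><\/div>\n\n\n\n<p>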
The join order generated by the optimizer will be crucial for such a query. And this overestimation (by a factor of 500!) may well lead it to a wrong join order. That&#8217;s the problem, and we need a solution &#8211; one that doesn&#8217;t require changing the existing code, since it comes from the BI tool itself.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Solution Using Statistics Extension<\/h2>\n\n\n\n<p>First, we can try to create extended statistics for this table. They can be created not only for column groups but also for expressions, which fits our use case. After creating them with the function <strong>dbms_stats.create_extended_stats<\/strong> and regathering table stats, we&#8217;ll see the new hidden virtual column and the histogram created on it (Listing 3). Our problem query now produces a nearly perfect estimate of 918 rows!<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: sql; title: ; notranslate\" title=\"\">\nSQL&gt; select sys.dbms_stats.create_extended_stats(user,&#039;ORDERS_HK&#039;,&#039;(RAWTOHEX(STATUS_HK))&#039;)\nfrom dual\n\nSYS.DBMS_STATS.CREATE_EXTENDED_STATS(USER,&#039;ORDERS_HK&#039;,&#039;(RAWTOHEX(STATUS_HK))&#039;)  \n--------------------------------------------------------------------------------\nSYS_STUJKKESX5P05C1HFG41MNPS8I                                                  \n1 row selected.\n\nbegin\n  dbms_stats.gather_table_stats(user, &#039;ORDERS_HK&#039;);\nend;\n\/\n\nSQL&gt; select column_name, num_distinct, histogram \nfrom sys.dba_tab_col_statistics s\nwhere s.table_name = &#039;ORDERS_HK&#039;\n\nCOLUMN_NAME                    NUM_DISTINCT HISTOGRAM      \n------------------------------ ------------ ---------------\nPROD_HK                                  72 NONE           \nCUST_HK                                7059 NONE           \nTIME_ID                                1459 NONE           \nCHANNEL_HK                                4 NONE           
\nPROMO_HK                                  4 NONE           \nSTATUS_HK                                 2 FREQUENCY      \nQUANTITY_SOLD                             1 NONE           \nAMOUNT_SOLD                            3586 NONE           \nSYS_STUJKKESX5P05C1HFG41MNPS8I            2 FREQUENCY      \n\n9 rows selected.\n\nselect *\nfrom   orders_hk \nwhere  status_hk = &#039;38881E0A24039DC2621E1D6F86CB71F7&#039;; \n\n-------------------------------------------------------------------------------\n| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |\n-------------------------------------------------------------------------------\n|   0 | SELECT STATEMENT  |           |   918 |   119K|  3720   (1)| 00:00:01 |\n|*  1 |  TABLE ACCESS FULL| ORDERS_HK |   918 |   119K|  3720   (1)| 00:00:01 |\n-------------------------------------------------------------------------------\n \nPredicate Information (identified by operation id):\n---------------------------------------------------\n \n   1 - filter(RAWTOHEX(&quot;STATUS_HK&quot;)=&#039;38881E0A24039DC2621E1D6F86CB71F7&#039;)\n<\/pre><\/div>\n\n\n<p><strong><em>Listing 3: Creating extended statistics to fix the problem<\/em><\/strong><\/p>\n\n\n\n<p>Looks good, but in the real world you may hit an unexpected problem: <strong><em>ORA-20008: Number of extensions in table &lt;&#8230;&gt; already reaches the upper limit (20)<\/em><\/strong>. This limit of 20 can be higher for wider tables, as <a href=\"https:\/\/jonathanlewis.wordpress.com\/2016\/12\/07\/extended-stats-4\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">Jonathan Lewis writes<\/a>. And if the preference AUTO_STAT_EXTENSIONS is set to true in your case, the optimizer may have already added a lot of extended statistics to your table in the background. 
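<\/p>\n\n\n\n<p>You can check which extensions already exist before adding a new one, and drop those you no longer need. A sketch (the column group in the drop call is a made-up example; CREATOR = SYSTEM marks automatically added extensions):<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: sql; title: ; notranslate\" title=\"\">\nselect extension_name, extension, creator, droppable\nfrom   user_stat_extensions\nwhere  table_name = &#039;ORDERS_HK&#039;;\n\n-- drop an extension you no longer need; pass the expression, not the name\nbegin\n  dbms_stats.drop_extended_stats(user, &#039;ORDERS_HK&#039;, &#039;(&quot;CUST_HK&quot;,&quot;PROD_HK&quot;)&#039;);\nend;\n\/\n<\/pre><\/div>\n\n\n\n<p>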
In that case, you first need to identify automatically added column groups that you don&#8217;t really need and drop them before you can add this new expression as extended statistics.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Solution Using Virtual Column<\/h2>\n\n\n\n<p>The second approach is to define a virtual column that performs the explicit datatype conversion. User-created virtual columns don&#8217;t count toward the mentioned limit (at least if you have the fix for bug <strong>35156634<\/strong> &#8211; it should be included in 19.16). Listing 4 illustrates this: after creating the new column <strong>STATUS_HKV<\/strong> and re-gathering statistics, the optimizer&#8217;s cardinality estimate is now accurate. Notice how the filter condition in line 45 shows that the new virtual column is used even though I haven&#8217;t referenced it in the query!<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: sql; highlight: [45]; title: ; notranslate\" title=\"\">\nALTER TABLE ORDERS_HK ADD (STATUS_HKV  GENERATED ALWAYS AS \n\t       (RAWTOHEX(&quot;STATUS_HK&quot;)) VIRTUAL ); \n\nTable altered.\t       \n\nbegin\n  dbms_stats.gather_table_stats(user, &#039;ORDERS_HK&#039;\n   , method_opt =&gt; &#039;FOR COLUMNS STATUS_HKV SIZE 2&#039;\n    , options =&gt; &#039;GATHER AUTO&#039;);\nend;\n\/\n\nSQL&gt; select column_name, num_distinct, histogram \nfrom sys.dba_tab_col_statistics s\nwhere s.table_name = &#039;ORDERS_HK&#039;\n\nCOLUMN_NAME                    NUM_DISTINCT HISTOGRAM      \n------------------------------ ------------ ---------------\nPROD_HK                                  72 NONE           \nCUST_HK                                7059 NONE           \nTIME_ID                                1459 NONE           \nCHANNEL_HK                                4 NONE           \nPROMO_HK                                  4 NONE           \nSTATUS_HK                                 2 FREQUENCY      \nQUANTITY_SOLD                             1 NONE          
 \nAMOUNT_SOLD                            3586 NONE           \nSTATUS_HKV                                2 FREQUENCY      \n\n9 rows selected.\n\nselect *\nfrom   orders_hk \nwhere  status_hk = &#039;38881E0A24039DC2621E1D6F86CB71F7&#039;; \n\n-------------------------------------------------------------------------------\n| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |\n-------------------------------------------------------------------------------\n|   0 | SELECT STATEMENT  |           |   918 |   119K|  3720   (1)| 00:00:01 |\n|*  1 |  TABLE ACCESS FULL| ORDERS_HK |   918 |   119K|  3720   (1)| 00:00:01 |\n-------------------------------------------------------------------------------\n \nPredicate Information (identified by operation id):\n---------------------------------------------------\n \n   1 - filter(&quot;ORDERS_HK&quot;.&quot;STATUS_HKV&quot;=&#039;38881E0A24039DC2621E1D6F86CB71F7&#039;)\t \n<\/pre><\/div>\n\n\n<p><strong><em>Listing 4: Define a virtual column with explicit conversion to VARCHAR2<\/em><\/strong><\/p>\n\n\n\n<p>One argument against this approach may be that you don&#8217;t want to introduce schema changes that could break existing code around this table, for example ETL processes. Properly written code should not break when a new column is added, but to be safe, you can define the new virtual column as an invisible column. 
Listing 5 verifies that in this case, the original problem is solved as well.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: sql; title: ; notranslate\" title=\"\">\nALTER TABLE ORDERS_HK DROP COLUMN STATUS_HKV;\nTable altered.\nALTER TABLE ORDERS_HK ADD (STATUS_HKV INVISIBLE GENERATED ALWAYS AS \n\t       (RAWTOHEX(&quot;STATUS_HK&quot;)) VIRTUAL ); \nTable altered.\t   \n\t       \nbegin\n  dbms_stats.gather_table_stats(user, &#039;ORDERS_HK&#039;\n   , method_opt =&gt; &#039;FOR COLUMNS STATUS_HKV SIZE 2&#039;\n    , options =&gt; &#039;GATHER AUTO&#039;);\nend;\n\/\n\nSQL&gt; select c.column_name, c.hidden_column, c.user_generated, c.num_distinct, c.histogram \nfrom  sys.dba_tab_cols c \nwhere c.table_name = &#039;ORDERS_HK&#039;\n\nCOLUMN_NAME          HIDDEN_COLUMN USER_GENERATED NUM_DISTINCT HISTOGRAM      \n-------------------- ------------- -------------- ------------ ---------------\nPROD_HK              NO            YES                      72 NONE           \nCUST_HK              NO            YES                    7059 NONE           \nTIME_ID              NO            YES                    1459 NONE           \nCHANNEL_HK           NO            YES                       4 NONE           \nPROMO_HK             NO            YES                       4 NONE           \nSTATUS_HK            NO            YES                       2 FREQUENCY      \nQUANTITY_SOLD        NO            YES                       1 NONE           \nAMOUNT_SOLD          NO            YES                    3586 NONE           \nSTATUS_HKV           YES           YES                       2 FREQUENCY      \n\n9 rows selected.\n\nselect *\nfrom   orders_hk \nwhere  status_hk = &#039;38881E0A24039DC2621E1D6F86CB71F7&#039;; \n\n-------------------------------------------------------------------------------\n| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     
|\n-------------------------------------------------------------------------------\n|   0 | SELECT STATEMENT  |           |   918 |   119K|  3720   (1)| 00:00:01 |\n|*  1 |  TABLE ACCESS FULL| ORDERS_HK |   918 |   119K|  3720   (1)| 00:00:01 |\n-------------------------------------------------------------------------------\n \nPredicate Information (identified by operation id):\n---------------------------------------------------\n \n   1 - filter(&quot;ORDERS_HK&quot;.&quot;STATUS_HKV&quot;=&#039;38881E0A24039DC2621E1D6F86CB71F7&#039;)\n<\/pre><\/div>\n\n\n<p><strong><em>Listing 5: Create a virtual column as invisible<\/em><\/strong><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>Implicit data type conversions can silently cripple query performance in Oracle by preventing the optimizer from using histograms. When comparing RAW(16) columns (such as MD5 hash keys) with string literals, Oracle\u2019s data type precedence rules force conversion of the column, not the literal \u2014 breaking cardinality estimates and execution plans.<\/p>\n\n\n\n<p>If you can\u2019t change the SQL coming from BI tools, two safe options exist:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Extended statistics on <strong>RAWTOHEX(column)<\/strong> to guide the optimizer.<\/li>\n\n\n\n<li>Virtual columns (visible or invisible) that expose the column as VARCHAR2, so statistics and histograms can be applied.<\/li>\n<\/ul>\n\n\n\n<p>A third option is to create a function-based index on <strong>RAWTOHEX(status_hk)<\/strong>. This can help if you specifically need index access, but it introduces more overhead: it creates a hidden virtual column (which counts toward the 20 extended statistics\/hidden column limit), consumes extra space for the index itself, and increases maintenance cost. 
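<\/p>\n\n\n\n<p>For completeness, the function-based index variant would look like this (a sketch; the index name is a made-up example):<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: sql; title: ; notranslate\" title=\"\">\ncreate index orders_hk_status_fbi\n    on orders_hk (rawtohex(status_hk));\n\n-- regather statistics so the hidden virtual column behind the index\n-- gets column statistics as well\nbegin\n  dbms_stats.gather_table_stats(user, &#039;ORDERS_HK&#039;);\nend;\n\/\n<\/pre><\/div>\n\n\n\n<p>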
For most cases where only cardinality estimation needs to be corrected, extended statistics or user-defined virtual columns remain the cleaner and more flexible solutions.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This is not just another post about why correct data types matter. Most of you know that using the wrong data type in WHERE or JOIN conditions can trigger implicit conversions, prevent an index access path, and cause performance problems. But what if you can\u2019t change the SQL statements, and you don\u2019t want to redesign 
[&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":2589,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[10,50,2],"tags":[157,51,80],"class_list":["post-2446","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-cbo","category-data-warehouse","category-oracle","tag-performance","tag-sql","tag-statistics"],"_links":{"self":[{"href":"https:\/\/blog.sqlora.com\/en\/wp-json\/wp\/v2\/posts\/2446","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.sqlora.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.sqlora.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.sqlora.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.sqlora.com\/en\/wp-json\/wp\/v2\/comments?post=2446"}],"version-history":[{"count":46,"href":"https:\/\/blog.sqlora.com\/en\/wp-json\/wp\/v2\/posts\/2446\/revisions"}],"predecessor-version":[{"id":2602,"href":"https:\/\/blog.sqlora.com\/en\/wp-json\/wp\/v2\/posts\/2446\/revisions\/2602"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.sqlora.com\/en\/wp-json\/wp\/v2\/media\/2589"}],"wp:attachment":[{"href":"https:\/\/blog.sqlora.com\/en\/wp-json\/wp\/v2\/media?parent=2446"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.sqlora.com\/en\/wp-json\/wp\/v2\/categories?post=2446"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.sqlora.com\/en\/wp-json\/wp\/v2\/tags?post=2446"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}