{"id":2184,"date":"2024-05-14T00:14:47","date_gmt":"2024-05-13T22:14:47","guid":{"rendered":"https:\/\/blog.sqlora.com\/en\/?p=2184"},"modified":"2024-05-14T09:13:17","modified_gmt":"2024-05-14T07:13:17","slug":"object-statistics-in-your-data-pipelines-part-2","status":"publish","type":"post","link":"https:\/\/blog.sqlora.com\/en\/object-statistics-in-your-data-pipelines-part-2\/","title":{"rendered":"Object Statistics in Your Data Pipelines (Part 2)"},"content":{"rendered":"\n<p>For me, building efficient data pipelines means thinking about performance from day one. And that means a conscious and thoughtful approach to optimizer statistics. It is not just this Autotask, which always runs in the database in the evenings or at weekends and is the responsibility of the DBAs. Every data engineer knows the data to be processed better, knows when and how it is loaded and processed. They should also be better able to recognize when and whether the statistics are critical at certain process steps and therefore when and how the statistics should be collected in the best possible way. That&#8217;s their job too! This is my point of view, which I would like to explain and emphasize with the posts in this series<\/p>\n\n\n\n<!--more-->\n\n\n\n<p>In the <a href=\"https:\/\/blog.sqlora.com\/en\/object-statistics-in-your-data-pipelines\/\" target=\"_blank\" rel=\"noreferrer noopener\">first post<\/a> I&#8217;ve shown how the statistics collection autotask can occasionally process intermediate data states while your data pipeline is still running. This will potentially lead to problems and you don&#8217;t want this to happen. 
The example in this post deals with the case where you need up-to-date statistics immediately after loading, otherwise there is a risk of falling into an &#8220;out-of-range&#8221; cardinality misestimate.<\/p>\n\n\n\n<p>Let&#8217;s assume we are loading the data once a month into a newly created monthly partition of the table SALES and using the following query afterwards for reporting. The execution plan and statistics for one of the past months look like this:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: sql; title: ; notranslate\" title=\"\">\nSQL&gt; select s.cust_id, s.time_id, sum (s.amount_sold)\n  2  from    sales s\n  3     join sh.customers c on c.cust_id = s.cust_id\n  4  where s.time_id = date &#039;2024-03-01&#039;\n  5  group by s.cust_id, s.time_id   ;\n\n--------------------------------------------------------------------------------------\n| Id  | Operation                | Name      | Starts | E-Rows | A-Rows |   A-Time   |\n--------------------------------------------------------------------------------------\n|   0 | SELECT STATEMENT         |           |      1 |        |   1394 |00:00:00.12 |\n|   1 |  HASH GROUP BY           |           |      1 |    990 |   1394 |00:00:00.12 |\n|*  2 |   HASH JOIN              |           |      1 |    204K|    204K|00:00:00.08 |\n|   3 |    TABLE ACCESS FULL     | CUSTOMERS |      1 |  55500 |  55500 |00:00:00.05 |\n|   4 |    PARTITION RANGE SINGLE|           |      1 |    204K|    204K|00:00:00.03 |\n|*  5 |     TABLE ACCESS FULL    | SALES     |      1 |    204K|    204K|00:00:00.03 |\n--------------------------------------------------------------------------------------\n\nSQL&gt; WITH    FUNCTION raw_to_date(p_val in raw) return date  as\n  2          date_val date;\n  3          begin\n  4              dbms_stats.convert_raw_value(p_val,date_val);\n  5              return date_val;\n  6          end;\n  7  select p.partition_name, p.num_rows\n  8  ,     
c.column_name, c.num_distinct\n  9  ,     raw_to_date(c.low_value) low_value\n 10  ,     raw_to_date(c.high_value) high_value\n 11  from  user_part_col_statistics c \n 12        join user_tab_partitions p on p.table_name = c.table_name \n 13                            and p.partition_name= c.partition_name\n 14  where p.table_name = &#039;SALES&#039; \n 15  and   json_value(p.high_value_json, &#039;$.high_value&#039; returning date) = date&#039;2024-04-01&#039;\n 16  and   c.column_name = &#039;TIME_ID&#039;\n 17  ;\n\nPARTITION_   NUM_ROWS COLUMN_NAM NUM_DISTINCT LOW_VALUE  HIGH_VALUE\n---------- ---------- ---------- ------------ ---------- ----------\nSYS_P3760      204370 TIME_ID               1 2024-03-01 2024-03-01\n\n<\/pre><\/div>\n\n\n<p><strong><em>Listing 1: good execution plan in presence of statistics<\/em><\/strong><\/p>\n\n\n\n<p>As you can see, the optimizer expects 204K rows to be returned from the fact table <strong>SALES<\/strong> after filtering, and this estimate is very good &#8211; compare estimated (E-Rows) and actual (A-Rows) rows: both are 204K. That is not really surprising, taking into account the existing partition-level statistics (NUM_ROWS=204K) and column statistics (NUM_DISTINCT=1 and Low-\/High-Value correspond to the filter). The optimizer therefore decides to use a hash join between <strong>SALES<\/strong> and <strong>CUSTOMERS<\/strong> and to build the hash table on the smaller result set expected to come from the table <strong>CUSTOMERS<\/strong> (55K). Everything fine so far.<\/p>\n\n\n\n<p>Now let&#8217;s simulate loading the new month&#8217;s data. 
I will just copy the data of March&#8217;s partition for that:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: sql; title: ; notranslate\" title=\"\">\nSQL&gt; insert into  sales (TIME_ID,\n  2  QUANTITY_SOLD,\n  3  PROMO_ID,\n  4  PROD_ID,\n  5  CUST_ID,\n  6  CHANNEL_ID,\n  7  AMOUNT_SOLD)\n  8  select date &#039;2024-05-01&#039; TIME_ID\n  9  ,QUANTITY_SOLD\n 10  ,PROMO_ID\n 11  ,PROD_ID\n 12  ,CUST_ID\n 13  ,CHANNEL_ID\n 14  ,AMOUNT_SOLD\n 15  from sales partition for (date &#039;2024-03-01&#039;) ;\n\n204.370 rows inserted.\n<\/pre><\/div>\n\n\n<p><strong><em>Listing 2: simulate new month&#8217;s data load<\/em><\/strong><\/p>\n\n\n\n<p>And then look at the same query, but for this new month (Listing 3). Partition-level stats are missing for the new partition. The global table stats tell us that the high value for the column TIME_ID is 2024-04-01, so our filter predicate is &#8220;out-of-range&#8221;. The optimizer has to prorate the density calculation, and so it comes up with an estimate of just 622 rows! The order of the tables in the join has changed. 
It doesn&#8217;t make a huge difference for such small tables, but in real life, a misestimate of 300+ times can lead to really significant performance losses!<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: sql; title: ; notranslate\" title=\"\">\nselect  s.cust_id, s.time_id, sum (s.amount_sold) \nfrom    sales s    join sh.customers c on  c.cust_id = s.cust_id \nwhere   s.time_id = date &#039;2024-05-01&#039; \ngroup by s.cust_id, s.time_id\n \n--------------------------------------------------------------------------------------\n| Id  | Operation                | Name      | Starts | E-Rows | A-Rows |   A-Time   |\n--------------------------------------------------------------------------------------\n|   0 | SELECT STATEMENT         |           |      1 |        |   1394 |00:00:00.30 |\n|   1 |  HASH GROUP BY           |           |      1 |    133 |   1394 |00:00:00.30 |\n|*  2 |   HASH JOIN              |           |      1 |    622 |    204K|00:00:00.13 |\n|   3 |    PARTITION RANGE SINGLE|           |      1 |    622 |    204K|00:00:00.02 |\n|*  4 |     TABLE ACCESS FULL    | SALES     |      1 |    622 |    204K|00:00:00.02 |\n|   5 |    TABLE ACCESS FULL     | CUSTOMERS |      1 |  55500 |  55500 |00:00:01.98 |\n--------------------------------------------------------------------------------------\n\nSQL&gt; WITH    FUNCTION raw_to_date(p_val in raw) return date  as\n  2          date_val date;\n  3          begin\n  4              dbms_stats.convert_raw_value(p_val,date_val);\n  5              return date_val;\n  6          end;\n  7  select p.partition_name, p.num_rows, p.last_analyzed\n  8  ,     c.column_name, c.num_distinct, c.last_analyzed\n  9  ,     raw_to_date(c.low_value) low_value\n 10  ,     raw_to_date(c.high_value) high_value\n 11  from  user_part_col_statistics c \n 12        join user_tab_partitions p on p.table_name = c.table_name \n 13                            and p.partition_name= c.partition_name\n 14  
where p.table_name = &#039;SALES&#039; \n 15  and   json_value(p.high_value_json, &#039;$.high_value&#039; returning date) = date&#039;2024-06-01&#039;\n 16  and   c.column_name = &#039;TIME_ID&#039;\n 17  ;\n\nPARTITION_   NUM_ROWS LAST_ANALY COLUMN_NAM NUM_DISTINCT LAST_ANALY LOW_VALUE  HIGH_VALUE\n---------- ---------- ---------- ---------- ------------ ---------- ---------- ----------\nSYS_P3952                        TIME_ID       \n\n-- Global table stats\nSQL&gt; WITH    FUNCTION raw_to_date(p_val in raw) return date  as\n  2          date_val date;\n  3          begin\n  4              dbms_stats.convert_raw_value(p_val,date_val);\n  5              return date_val;\n  6          end;\n  7  select  c.column_name, c.num_distinct, c.last_analyzed\n  8  ,     raw_to_date(c.low_value) low_value\n  9  ,     raw_to_date(c.high_value) high_value\n 10  from   user_tab_cols c \n 11  where  c.table_name = &#039;SALES&#039; \n 12  and   c.column_name = &#039;TIME_ID&#039;;\n\nCOLUMN_NAM NUM_DISTINCT LAST_ANALY LOW_VALUE  HIGH_VALUE\n---------- ------------ ---------- ---------- ----------\nTIME_ID            1458 2024-05-12 2019-01-01 2024-04-01                                          \n\n<\/pre><\/div>\n\n\n<p><strong><em>Listing 3: poor execution plan because of missing stats<\/em><\/strong><\/p>\n\n\n\n<p>An experienced Oracle data engineer will use a direct path load (INSERT with the APPEND hint) into a newly created partition and thus benefit from <a href=\"https:\/\/blog.sqlora.com\/en\/tag\/osg\/\" target=\"_blank\" rel=\"noreferrer noopener\">online statistics gathering<\/a>, you may say. Of course, they can do that, but then they have to be aware of whether the stats actually get collected, what limits and restrictions this feature has, how to make sure it always works, and so on. 
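<\/p>\n\n\n\n<p>As a sketch &#8211; not a one-size-fits-all recipe &#8211; the load above could be rewritten as a direct path load, so that online statistics gathering (provided its prerequisites are met, such as loading into an empty segment) collects the statistics during the load itself:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: sql; title: ; notranslate\" title=\"\">\n-- direct path load into the new, still empty partition;\n-- online statistics gathering can then collect stats during the load\ninsert \/*+ append *\/ into sales (TIME_ID, QUANTITY_SOLD, PROMO_ID,\n       PROD_ID, CUST_ID, CHANNEL_ID, AMOUNT_SOLD)\nselect date &#039;2024-05-01&#039;, QUANTITY_SOLD, PROMO_ID,\n       PROD_ID, CUST_ID, CHANNEL_ID, AMOUNT_SOLD\nfrom   sales partition for (date &#039;2024-03-01&#039;);\ncommit;\n<\/pre><\/div>\n\n\n\n<p>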
In other words, they should think about gathering stats in their ETL and make a conscious decision whether to rely on this feature where applicable.<\/p>\n\n\n\n<p>You may also note that, with optimizer adaptive features enabled, this sub-optimal execution plan will not remain permanent. Seeing this huge cardinality misestimate on the first execution, the optimizer will probably create a SQL plan directive to use dynamic sampling for the next execution. The statement will be re-optimized and a new execution plan based on better estimates will be generated.<\/p>\n\n\n\n<p>This is also true, but I have two points on this: 1) I have often seen DBAs disable the adaptive features; 2) this still means that you must have had that first sub-optimal execution! If it&#8217;s your query that&#8217;s affected and your data pipeline didn&#8217;t finish overnight, or if it&#8217;s you getting calls from angry users whose report took 30 minutes instead of 10 seconds, how do you look at it?<\/p>\n\n\n\n<p>My opinion: better make sure the optimizer has enough information in time to prevent sub-optimal executions &#8211; think about statistics as part of your data pipelines. This doesn&#8217;t mean just collecting them &#8211; it might be better to lock, to copy or even to set statistics in your particular case, but that&#8217;s another topic. 
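<\/p>\n\n\n\n<p>Just to illustrate two of these options &#8211; a sketch using the system-generated partition names from the listings above (SYS_P3760 for March, SYS_P3952 for May); in a real pipeline you would derive the names or use explicitly named partitions:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: sql; title: ; notranslate\" title=\"\">\nBEGIN\n   -- option 1: gather fresh stats for the new partition right after the load\n   dbms_stats.gather_table_stats(\n       ownname     =&gt; user,\n       tabname     =&gt; &#039;SALES&#039;,\n       partname    =&gt; &#039;SYS_P3952&#039;,\n       granularity =&gt; &#039;PARTITION&#039;);\n\n   -- option 2: copy the stats of the previous partition instead;\n   -- dbms_stats also adjusts the low\/high values of the partitioning key\n   dbms_stats.copy_table_stats(\n       ownname     =&gt; user,\n       tabname     =&gt; &#039;SALES&#039;,\n       srcpartname =&gt; &#039;SYS_P3760&#039;,\n       dstpartname =&gt; &#039;SYS_P3952&#039;);\nEND;\n\/\n<\/pre><\/div>\n\n\n\n<p>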
It&#8217;s important to have a plan!<\/p>","protected":false},"excerpt":{"rendered":"<p>For me, building efficient data pipelines means thinking about performance from day one. And that means a conscious and thoughtful approach to optimizer statistics. It is not just this Autotask, which always runs in the database in the evenings or at weekends and is the responsibility of the DBAs. 
Every data engineer knows the data [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":2210,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[10,71,2],"tags":[145,98,80],"class_list":["post-2184","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-cbo","category-etl","category-oracle","tag-cbo","tag-etl","tag-statistics"],"_links":{"self":[{"href":"https:\/\/blog.sqlora.com\/en\/wp-json\/wp\/v2\/posts\/2184","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.sqlora.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.sqlora.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.sqlora.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.sqlora.com\/en\/wp-json\/wp\/v2\/comments?post=2184"}],"version-history":[{"count":24,"href":"https:\/\/blog.sqlora.com\/en\/wp-json\/wp\/v2\/posts\/2184\/revisions"}],"predecessor-version":[{"id":2212,"href":"https:\/\/blog.sqlora.com\/en\/wp-json\/wp\/v2\/posts\/2184\/revisions\/2212"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.sqlora.com\/en\/wp-json\/wp\/v2\/media\/2210"}],"wp:attachment":[{"href":"https:\/\/blog.sqlora.com\/en\/wp-json\/wp\/v2\/media?parent=2184"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.sqlora.com\/en\/wp-json\/wp\/v2\/categories?post=2184"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.sqlora.com\/en\/wp-json\/wp\/v2\/tags?post=2184"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}