Object Statistics in Your Data Pipelines (Part 2)

For me, building efficient data pipelines means thinking about performance from day one. And that means a conscious and thoughtful approach to optimizer statistics. Statistics are not just the autotask that runs in the database in the evenings or at weekends and is the responsibility of the DBAs. Data engineers know the data being processed better than anyone; they know when and how it is loaded and transformed. They are therefore in the best position to recognize whether statistics become critical at a particular step of the process, and when and how they should be gathered in the best possible way. That’s their job too! This is my point of view, and I would like to explain and underline it with the posts in this series.

In the first post I showed how the statistics collection autotask can occasionally process intermediate data states while your data pipeline is still running, which can lead to problems you definitely don’t want. The example in this post deals with the case where you need up-to-date statistics immediately after loading, because otherwise you risk running into an “out-of-range” cardinality misestimate.
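All the examples below run against a monthly interval-partitioned SALES table with the same columns as SH.SALES. If you want to follow along, a setup roughly like this will do (a sketch, not the exact DDL behind the listings; note also that the dictionary queries use USER_TAB_PARTITIONS.HIGH_VALUE_JSON, which only exists in newer database releases):

create table sales (
  prod_id        number        not null,
  cust_id        number        not null,
  time_id        date          not null,
  channel_id     number        not null,
  promo_id       number        not null,
  quantity_sold  number(10,2)  not null,
  amount_sold    number(10,2)  not null
)
partition by range (time_id)
interval (numtoyminterval(1, 'MONTH'))
( partition p_start values less than (date '2019-01-01') );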

Let’s assume we load data once a month into a newly created monthly partition of the table SALES and afterwards use the following query for reporting. The execution plan and statistics for one of the past months look like this:

SQL> select s.cust_id, s.time_id, sum (s.amount_sold)
  2  from    sales s
  3     join sh.customers c on c.cust_id = s.cust_id
  4  where s.time_id = date '2024-03-01'
  5  group by s.cust_id, s.time_id   ;

--------------------------------------------------------------------------------------
| Id  | Operation                | Name      | Starts | E-Rows | A-Rows |   A-Time   |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |           |      1 |        |   1394 |00:00:00.12 |
|   1 |  HASH GROUP BY           |           |      1 |    990 |   1394 |00:00:00.12 |
|*  2 |   HASH JOIN              |           |      1 |    204K|    204K|00:00:00.08 |
|   3 |    TABLE ACCESS FULL     | CUSTOMERS |      1 |  55500 |  55500 |00:00:00.05 |
|   4 |    PARTITION RANGE SINGLE|           |      1 |    204K|    204K|00:00:00.03 |
|*  5 |     TABLE ACCESS FULL    | SALES     |      1 |    204K|    204K|00:00:00.03 |
--------------------------------------------------------------------------------------

SQL> WITH    FUNCTION raw_to_date(p_val in raw) return date  as
  2          date_val date;
  3          begin
  4              dbms_stats.convert_raw_value(p_val,date_val);
  5              return date_val;
  6          end;
  7  select p.partition_name, p.num_rows
  8  ,     c.column_name, c.num_distinct
  9  ,     raw_to_date(c.low_value) low_value
 10  ,     raw_to_date(c.high_value) high_value
 11  from  user_part_col_statistics c 
 12        join user_tab_partitions p on p.table_name = c.table_name 
 13                            and p.partition_name= c.partition_name
 14  where p.table_name = 'SALES' 
 15  and   json_value(p.high_value_json, '$.high_value' returning date) = date'2024-04-01'
 16  and   c.column_name = 'TIME_ID'
 17  ;

PARTITION_   NUM_ROWS COLUMN_NAM NUM_DISTINCT LOW_VALUE  HIGH_VALUE
---------- ---------- ---------- ------------ ---------- ----------
SYS_P3760      204370 TIME_ID               1 2024-03-01 2024-03-01

Listing 1: good execution plan in presence of statistics
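A quick word on how such plans can be obtained: the row-source statistics (Starts, E-Rows, A-Rows, A-Time) are typically captured by running the statement with the GATHER_PLAN_STATISTICS hint (or with STATISTICS_LEVEL=ALL) and then formatting the cursor, for example:

select /*+ gather_plan_statistics */
       s.cust_id, s.time_id, sum(s.amount_sold)
from   sales s
       join sh.customers c on c.cust_id = s.cust_id
where  s.time_id = date '2024-03-01'
group  by s.cust_id, s.time_id;

select * from dbms_xplan.display_cursor(format => 'ALLSTATS LAST');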

As you can see, the optimizer expects 204K rows to be returned from the fact table SALES after filtering, and this estimate is very good – compare estimated (E-Rows) and actual (A-Rows) rows: both are 204K. That is not really surprising, given the existing partition-level statistics (NUM_ROWS=204K) and column statistics (NUM_DISTINCT=1, with low/high value matching the filter). The optimizer therefore decides to use a hash join between SALES and CUSTOMERS and builds the hash table on the smaller row source, expected to come from the table CUSTOMERS (55K rows). Everything is fine so far.
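Just to make the arithmetic explicit: for an equality predicate inside the known value range, the estimate for the pruned partition essentially boils down to (ignoring nulls and histograms)

    E-Rows ≈ NUM_ROWS / NUM_DISTINCT = 204,370 / 1 ≈ 204K

which is exactly the figure shown for the SALES row source in the plan.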

Now let’s simulate loading of the new month’s data. I will simply copy the data of the March partition for that:

SQL> insert into  sales (TIME_ID,
  2  QUANTITY_SOLD,
  3  PROMO_ID,
  4  PROD_ID,
  5  CUST_ID,
  6  CHANNEL_ID,
  7  AMOUNT_SOLD)
  8  select date '2024-05-01' TIME_ID
  9  ,QUANTITY_SOLD
 10  ,PROMO_ID
 11  ,PROD_ID
 12  ,CUST_ID
 13  ,CHANNEL_ID
 14  ,AMOUNT_SOLD
 15  from sales partition for (date '2024-03-01') ;

204.370 rows inserted.

Listing 2: simulate the new month’s data load

And then let’s look at the same query, but for this new month (Listing 3). Partition-level stats are missing for the new partition. The global table stats tell us that the high value for the column TIME_ID is 2024-04-01, so our filter predicate is “out-of-range”. The optimizer has to prorate the density accordingly, and so it comes up with an estimate of just 622 rows! As a consequence, the order of the tables in the join has changed. It doesn’t make a huge difference for such small tables, but in real life a misestimate by a factor of 300+ can lead to really significant performance losses!

select  s.cust_id, s.time_id, sum (s.amount_sold) 
from    sales s    join sh.customers c on  c.cust_id = s.cust_id 
where   s.time_id = date '2024-05-01' 
group by s.cust_id, s.time_id
 
--------------------------------------------------------------------------------------
| Id  | Operation                | Name      | Starts | E-Rows | A-Rows |   A-Time   |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |           |      1 |        |   1394 |00:00:00.30 |
|   1 |  HASH GROUP BY           |           |      1 |    133 |   1394 |00:00:00.30 |
|*  2 |   HASH JOIN              |           |      1 |    622 |    204K|00:00:00.13 |
|   3 |    PARTITION RANGE SINGLE|           |      1 |    622 |    204K|00:00:00.02 |
|*  4 |     TABLE ACCESS FULL    | SALES     |      1 |    622 |    204K|00:00:00.02 |
|   5 |    TABLE ACCESS FULL     | CUSTOMERS |      1 |  55500 |  55500 |00:00:01.98 |
--------------------------------------------------------------------------------------

SQL> WITH    FUNCTION raw_to_date(p_val in raw) return date  as
  2          date_val date;
  3          begin
  4              dbms_stats.convert_raw_value(p_val,date_val);
  5              return date_val;
  6          end;
  7  select p.partition_name, p.num_rows, p.last_analyzed
  8  ,     c.column_name, c.num_distinct, c.last_analyzed
  9  ,     raw_to_date(c.low_value) low_value
 10  ,     raw_to_date(c.high_value) high_value
 11  from  user_part_col_statistics c 
 12        join user_tab_partitions p on p.table_name = c.table_name 
 13                            and p.partition_name= c.partition_name
 14  where p.table_name = 'SALES' 
 15  and   json_value(p.high_value_json, '$.high_value' returning date) = date'2024-06-01'
 16  and   c.column_name = 'TIME_ID'
 17  ;

PARTITION_   NUM_ROWS LAST_ANALY COLUMN_NAM NUM_DISTINCT LAST_ANALY LOW_VALUE  HIGH_VALUE
---------- ---------- ---------- ---------- ------------ ---------- ---------- ----------
SYS_P3952                        TIME_ID       

-- Global table stats
SQL> WITH    FUNCTION raw_to_date(p_val in raw) return date  as
  2          date_val date;
  3          begin
  4              dbms_stats.convert_raw_value(p_val,date_val);
  5              return date_val;
  6          end;
  7  select  c.column_name, c.num_distinct, c.last_analyzed
  8  ,     raw_to_date(c.low_value) low_value
  9  ,     raw_to_date(c.high_value) high_value
 10  from   user_tab_cols c 
 11  where  c.table_name = 'SALES' 
 12  and   c.column_name = 'TIME_ID';

COLUMN_NAM NUM_DISTINCT LAST_ANALY LOW_VALUE  HIGH_VALUE
---------- ------------ ---------- ---------- ----------
TIME_ID            1458 2024-05-12 2019-01-01 2024-04-01                                          

Listing 3: poor execution plan because of missing stats

An experienced Oracle data engineer will use a direct path load (INSERT with the APPEND hint) into the newly created partition and thus benefit from online statistics gathering, you may say. Of course they can do that, but then they have to be aware that the statistics are being collected this way, what limitations and restrictions this feature has, how to make sure it always works, and so on. In other words, they should think about gathering statistics in their ETL and make a conscious decision to rely on this feature where applicable.
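Just to illustrate that thought, a sketch (my simplified example, not the original test case): a direct-path load followed by a check whether the bulk-load statistics have actually arrived. Whether online statistics gathering really kicks in depends on several restrictions (the segment must be empty, the insert must be direct-path, and for partitioned tables additional conditions apply), so verify instead of assuming:

insert /*+ append */ into sales
       (time_id, quantity_sold, promo_id, prod_id, cust_id, channel_id, amount_sold)
select date '2024-05-01', quantity_sold, promo_id, prod_id,
       cust_id, channel_id, amount_sold
from   sales partition for (date '2024-03-01');

commit;

-- The plan of the INSERT should contain an OPTIMIZER STATISTICS GATHERING step,
-- and the new partition should now have NUM_ROWS / LAST_ANALYZED populated:
select partition_name, num_rows, last_analyzed
from   user_tab_statistics
where  table_name = 'SALES'
and    partition_name is not null
order  by partition_position;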

You may also note that, with the optimizer’s adaptive features enabled, this sub-optimal execution plan will not remain permanent. Seeing the huge cardinality misestimate on the first execution, the optimizer will probably create a SQL plan directive to use dynamic sampling; the statement will then be re-optimized and a new execution plan based on better estimates will be generated.
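If you want to check whether this has happened on your system, the directives can be inspected like this (a sketch; directives are persisted asynchronously, so you may need to flush them first):

exec dbms_spd.flush_sql_plan_directive

select o.object_name, o.subobject_name, d.type, d.state, d.reason
from   dba_sql_plan_directives d
       join dba_sql_plan_dir_objects o on o.directive_id = d.directive_id
where  o.owner = user
and    o.object_name = 'SALES';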

This is also true, but I have two points on this: 1) I have often seen DBAs disable the adaptive features; 2) it still means that the first sub-optimal execution has to happen! If it’s your query that’s affected and your data pipeline didn’t finish overnight, or if you are the one getting calls from angry users whose report took 30 minutes instead of 10 seconds, how do you look at it then?

My opinion: better make sure the optimizer has enough information in time to prevent sub-optimal executions – think about statistics as part of your data pipelines. This doesn’t mean just collecting them – it might be better to lock, to copy or even to set statistics in your particular case, but that’s another topic. It’s important to have a plan!
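For the scenario in this post, the most straightforward variant is an explicit gathering step right after the load, before the reporting queries run – a sketch, assuming the load step knows (or looks up) which partition it has just filled:

begin
  dbms_stats.gather_table_stats(
    ownname     => user,
    tabname     => 'SALES',
    partname    => 'SYS_P3952',      -- the freshly loaded partition
    granularity => 'PARTITION');
end;
/

In a real pipeline you would of course derive the partition name from the data dictionary (for example from USER_TAB_PARTITIONS) instead of hard-coding it.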
