For me, building efficient data pipelines means thinking about performance from day one, and that in turn means a conscious, thoughtful approach to optimizer statistics. Statistics are not just that autotask which runs in the database in the evenings or at weekends and is the DBAs' responsibility. Data engineers know the data being processed better than anyone; they know when and how it is loaded and transformed. They are therefore in the best position to recognize at which process steps statistics are critical and, consequently, when and how they should be gathered. That's their job too! This is my point of view, and it is what I want to explain and illustrate with the posts in this series.
In the first post I showed how the statistics collection autotask can occasionally process intermediate data states while your data pipeline is still running, which can lead to problems you definitely don't want. The example in this post deals with the case where you need up-to-date statistics immediately after loading; otherwise you risk running into an "out-of-range" cardinality misestimate.
Let's assume we load the data once a month into a newly created monthly partition of the table SALES and afterwards use the following query for reporting. The execution plan and statistics for one of the past months look like this:
SQL> select s.cust_id, s.time_id, sum (s.amount_sold)
2 from sales s
3 join sh.customers c on c.cust_id = s.cust_id
4 where s.time_id = date '2024-03-01'
5 group by s.cust_id, s.time_id ;
--------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time |
--------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1394 |00:00:00.12 |
| 1 | HASH GROUP BY | | 1 | 990 | 1394 |00:00:00.12 |
|* 2 | HASH JOIN | | 1 | 204K| 204K|00:00:00.08 |
| 3 | TABLE ACCESS FULL | CUSTOMERS | 1 | 55500 | 55500 |00:00:00.05 |
| 4 | PARTITION RANGE SINGLE| | 1 | 204K| 204K|00:00:00.03 |
|* 5 | TABLE ACCESS FULL | SALES | 1 | 204K| 204K|00:00:00.03 |
--------------------------------------------------------------------------------------
SQL> WITH FUNCTION raw_to_date(p_val in raw) return date as
2 date_val date;
3 begin
4 dbms_stats.convert_raw_value(p_val,date_val);
5 return date_val;
6 end;
7 select p.partition_name, p.num_rows
8 , c.column_name, c.num_distinct
9 , raw_to_date(c.low_value) low_value
10 , raw_to_date(c.high_value) high_value
11 from user_part_col_statistics c
12 join user_tab_partitions p on p.table_name = c.table_name
13 and p.partition_name= c.partition_name
14 where p.table_name = 'SALES'
15 and json_value(p.high_value_json, '$.high_value' returning date) = date'2024-04-01'
16 and c.column_name = 'TIME_ID'
17 ;
PARTITION_ NUM_ROWS COLUMN_NAM NUM_DISTINCT LOW_VALUE HIGH_VALUE
---------- ---------- ---------- ------------ ---------- ----------
SYS_P3760 204370 TIME_ID 1 2024-03-01 2024-03-01
Listing 1: good execution plan in the presence of statistics
As you can see, the optimizer expects 204K rows to be returned from the fact table SALES after filtering, and this estimate is very good: compare estimated (E-Rows) and actual (A-Rows) values, both are 204K. That is not really surprising given the existing partition-level statistics (NUM_ROWS=204K) and column statistics (NUM_DISTINCT=1, with the low/high value matching the filter): 204K rows divided by one distinct value yields exactly the 204K estimate. The optimizer therefore decides to use a hash join between SALES and CUSTOMERS and to build the hash table on the smaller row source, the 55K rows expected from the table CUSTOMERS. Everything is fine so far.
Now let's simulate loading the new month's data. I will simply copy the data of the March partition for that:
SQL> insert into sales (TIME_ID,
2 QUANTITY_SOLD,
3 PROMO_ID,
4 PROD_ID,
5 CUST_ID,
6 CHANNEL_ID,
7 AMOUNT_SOLD)
8 select date '2024-05-01' TIME_ID
9 ,QUANTITY_SOLD
10 ,PROMO_ID
11 ,PROD_ID
12 ,CUST_ID
13 ,CHANNEL_ID
14 ,AMOUNT_SOLD
 15 from sales partition for (date '2024-03-01') ;
204,370 rows inserted.
Listing 2: simulating the new month's data load
And then let's look at the same query, but for this new month (Listing 3). Partition-level stats are missing for the new partition, and the global table stats tell the optimizer that the high value for the column TIME_ID is 2024-04-01, so our filter predicate is "out-of-range". The optimizer has to prorate the density (the equality selectivity) for the out-of-range value and so comes up with an estimate of just 622 rows! As a consequence, the order of the tables in the join has changed. That doesn't make a huge difference for such small tables, but in real life a misestimate of more than 300x (622 estimated vs. 204K actual rows) can lead to really significant performance losses! A rough sanity check of this proration is sketched after Listing 3.
select s.cust_id, s.time_id, sum (s.amount_sold)
from sales s join sh.customers c on c.cust_id = s.cust_id
where s.time_id = date '2024-05-01'
group by s.cust_id, s.time_id
--------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time |
--------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1394 |00:00:00.30 |
| 1 | HASH GROUP BY | | 1 | 133 | 1394 |00:00:00.30 |
|* 2 | HASH JOIN | | 1 | 622 | 204K|00:00:00.13 |
| 3 | PARTITION RANGE SINGLE| | 1 | 622 | 204K|00:00:00.02 |
|* 4 | TABLE ACCESS FULL | SALES | 1 | 622 | 204K|00:00:00.02 |
| 5 | TABLE ACCESS FULL | CUSTOMERS | 1 | 55500 | 55500 |00:00:01.98 |
--------------------------------------------------------------------------------------
SQL> WITH FUNCTION raw_to_date(p_val in raw) return date as
2 date_val date;
3 begin
4 dbms_stats.convert_raw_value(p_val,date_val);
5 return date_val;
6 end;
7 select p.partition_name, p.num_rows, p.last_analyzed
8 , c.column_name, c.num_distinct, c.last_analyzed
9 , raw_to_date(c.low_value) low_value
10 , raw_to_date(c.high_value) high_value
11 from user_part_col_statistics c
12 join user_tab_partitions p on p.table_name = c.table_name
13 and p.partition_name= c.partition_name
14 where p.table_name = 'SALES'
15 and json_value(p.high_value_json, '$.high_value' returning date) = date'2024-06-01'
16 and c.column_name = 'TIME_ID'
17 ;
PARTITION_ NUM_ROWS LAST_ANALY COLUMN_NAM NUM_DISTINCT LAST_ANALY LOW_VALUE HIGH_VALUE
---------- ---------- ---------- ---------- ------------ ---------- ---------- ----------
SYS_P3952 TIME_ID
-- Global table stats
SQL> WITH FUNCTION raw_to_date(p_val in raw) return date as
2 date_val date;
3 begin
4 dbms_stats.convert_raw_value(p_val,date_val);
5 return date_val;
6 end;
7 select c.column_name, c.num_distinct, c.last_analyzed
8 , raw_to_date(c.low_value) low_value
9 , raw_to_date(c.high_value) high_value
10 from user_tab_cols c
 11 where c.table_name = 'SALES'
12 and c.column_name = 'TIME_ID';
COLUMN_NAM NUM_DISTINCT LAST_ANALY LOW_VALUE HIGH_VALUE
---------- ------------ ---------- ---------- ----------
TIME_ID 1458 2024-05-12 2019-01-01 2024-04-01
Listing 3: poor execution plan because of missing stats
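For the curious: the shape of this 622-row estimate can be roughly reproduced from the global statistics. A commonly described approximation (an assumption about optimizer internals, which vary between versions) is that the equality selectivity 1/NUM_DISTINCT decays linearly once the filter value lies outside the low/high range. A hypothetical sanity check against the dictionary, reusing the raw_to_date helper from the listings above, could look like this:

-- assumption: linear decay model for an out-of-range equality predicate
--   est_rows ~ (num_rows / num_distinct)
--              * greatest(0, 1 - (value - high_value) / (high_value - low_value))
WITH FUNCTION raw_to_date(p_val in raw) return date as
  date_val date;
begin
  dbms_stats.convert_raw_value(p_val, date_val);
  return date_val;
end;
select round( (t.num_rows / c.num_distinct)
              * greatest(0, 1 - (date '2024-05-01' - raw_to_date(c.high_value))
                              / (raw_to_date(c.high_value) - raw_to_date(c.low_value)))
            ) estimated_rows
from user_tab_cols c
join user_tables   t on t.table_name = c.table_name
where c.table_name  = 'SALES'
and   c.column_name = 'TIME_ID'
;

Don't expect this to match the optimizer's number exactly, but it shows why the estimate collapses to a few hundred rows instead of 204K.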
You may say that an experienced Oracle data engineer would use a direct path load (INSERT with the APPEND hint) into the newly created partition and thus benefit from online statistics gathering. Of course they can do that, but then they have to be aware of whether the stats are really being collected, which limits and restrictions this feature has, how to make sure it always works, and so on. In other words, they should think about gathering stats in their ETL and make a conscious decision to rely on this feature where it is applicable.
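As a rough sketch (the exact behaviour depends on your version and on the documented restrictions of online statistics gathering, so verify it on your system), such a load and a subsequent check could look like this:

-- direct path load; partition-extended syntax into an empty partition is
-- commonly cited as a prerequisite for getting partition-level stats on load
-- (check the restrictions for your version)
insert /*+ append */ into sales partition for (date '2024-05-01')
       (time_id, quantity_sold, promo_id, prod_id, cust_id, channel_id, amount_sold)
select date '2024-05-01', quantity_sold, promo_id, prod_id, cust_id, channel_id, amount_sold
from   sales partition for (date '2024-03-01');

commit;

-- check whether statistics were gathered on load:
-- NOTES = STATS_ON_LOAD indicates online statistics gathering
select partition_name, num_rows, notes
from   user_tab_statistics
where  table_name  = 'SALES'
and    object_type = 'PARTITION'
order  by partition_position;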
You may also note that, with the optimizer adaptive features enabled, this sub-optimal execution plan will not remain permanent. Seeing the huge cardinality misestimate on the first execution, the optimizer will probably create a SQL plan directive to use dynamic sampling for the next execution. The statement will then be re-optimized and a new execution plan based on better estimates will be generated.
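If you want to check whether such a directive has appeared for your table, you can look into the dictionary. A sketch (it assumes access to the DBA views and uses DBMS_SPD to flush the directives from memory to disk first):

exec dbms_spd.flush_sql_plan_directive

-- directives referencing the SALES table in the current schema
select d.directive_id, d.type, d.state, d.reason
from   dba_sql_plan_directives  d
join   dba_sql_plan_dir_objects o on o.directive_id = d.directive_id
where  o.owner       = user
and    o.object_name = 'SALES';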
This is also true, but I have two points on it: 1) I have often seen DBAs disable the adaptive features; 2) it still means that you must have had that first sub-optimal execution! If it's your query that is affected and your data pipeline didn't finish overnight, or if it's you getting the calls from angry users whose report took 30 minutes instead of 10 seconds, how do you look at it then?
My opinion: better make sure the optimizer has enough information, in time, to prevent sub-optimal executions. Think about statistics as part of your data pipelines. This doesn't necessarily mean just gathering them; in your particular case it might be better to lock, to copy or even to set statistics, but that's another topic. The important thing is to have a plan!
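To make this a bit more concrete, here is one possible (hypothetical) shape of such a step at the end of the load, using the partition names from the listings above: either gather statistics for just the freshly loaded partition, or, if that is too expensive at this point in the pipeline, copy the statistics of a representative older partition to the new one.

-- option 1: gather statistics for the new partition only
begin
  dbms_stats.gather_table_stats(
    ownname     => user,
    tabname     => 'SALES',
    partname    => 'SYS_P3952',     -- the new partition from Listing 3
    granularity => 'PARTITION');
end;
/

-- option 2: copy statistics from a representative older partition;
-- copy_table_stats adjusts the low/high values of the partition key
-- to the boundaries of the target partition
begin
  dbms_stats.copy_table_stats(
    ownname     => user,
    tabname     => 'SALES',
    srcpartname => 'SYS_P3760',     -- the March partition from Listing 1
    dstpartname => 'SYS_P3952');
end;
/

Which of these (or locking, or setting stats manually) is appropriate depends on your data and your pipeline; the point is that it is an explicit, planned step.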