Object Statistics in Your Data Pipelines

As my electrical engineering lecturer at university used to joke, there are only two faults in electricity: 1) no contact where it is needed, and 2) contact where it is not needed. You can often think the same way when looking for the cause of a poor execution plan: either no statistics were gathered when the optimizer needed them, or someone gathered statistics at the wrong moment.

As many performance puzzles begin: one day a report suddenly became extremely slow. And as you might guess, the reason was a changed execution plan. The SQL statement behind this report joined several larger tables with a reference table containing a couple of thousand rows. The change was that the optimizer had chosen to do a Cartesian join of this reference table with one of the larger tables. How on earth did it come up with this idea? Well, the main driver of such optimizer decisions is the cardinality estimate, which in turn is based on object statistics.

I will try to illustrate the problem using Oracle's well-known Sales History (SH) sample schema instead of showing complex real-life SQL queries.

SQL> select s.cust_id, s.time_id, sum (s.amount_sold)
  2  from    sh.sales s
  3     join sh.customers c on c.cust_id = s.cust_id
  4     join sh.times t on t.time_id = s.time_id
  5  where t.calendar_year = 2020 
  6  group by s.cust_id, s.time_id   ;

36.870 rows selected. 

Elapsed: 00:00:00.299

SQL> select * from dbms_xplan.display_cursor(format=>'iostats last');
 
------------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |  36870 |00:00:00.15 |    3021 |   2816 |
|   1 |  HASH GROUP BY              |           |      1 |    229K|  36870 |00:00:00.15 |    3021 |   2816 |
|*  2 |   HASH JOIN                 |           |      1 |    229K|    247K|00:00:00.11 |    3021 |   2816 |
|   3 |    TABLE ACCESS FULL        | CUSTOMERS |      1 |  55500 |  55500 |00:00:00.07 |    1543 |   1536 |
|*  4 |    HASH JOIN                |           |      1 |    229K|    247K|00:00:00.04 |    1478 |   1280 |
|*  5 |     TABLE ACCESS FULL       | TIMES     |      1 |    365 |    366 |00:00:00.01 |      60 |      0 |
|   6 |     PARTITION RANGE SUBQUERY|           |      1 |    918K|    247K|00:00:00.02 |    1418 |   1280 |
|   7 |      TABLE ACCESS FULL      | SALES     |      4 |    918K|    247K|00:00:00.02 |    1358 |   1280 |
------------------------------------------------------------------------------------------------------------

Let this be a query that ran fast enough and without any problems for a long time, until one day, without any change or deployment, and without any significant increase in data volume, the runtime became absolutely unacceptable.

The new execution plan looks like this:

--------------------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  | Writes |
--------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |           |      1 |        |  36870 |00:00:27.23 |    3032 |  58027 |  55200 |
|   1 |  HASH GROUP BY             |           |      1 |     33 |  36870 |00:00:27.23 |    3032 |  58027 |  55200 |
|*  2 |   HASH JOIN                |           |      1 |     33 |    247K|00:00:26.97 |    3032 |  58027 |  55200 |
|   3 |    MERGE JOIN CARTESIAN    |           |      1 |    365 |     20M|00:00:08.62 |    1603 |   1536 |      0 |
|   4 |     TABLE ACCESS FULL      | CUSTOMERS |      1 |      1 |  55500 |00:00:00.17 |    1543 |   1536 |      0 |
|   5 |     BUFFER SORT            |           |  55500 |    365 |     20M|00:00:04.41 |      60 |      0 |      0 |
|*  6 |      TABLE ACCESS FULL     | TIMES     |      1 |    365 |    366 |00:00:00.01 |      60 |      0 |      0 |
|   7 |    PARTITION RANGE SUBQUERY|           |      1 |    918K|    247K|00:00:00.02 |    1418 |   1280 |      0 |
|   8 |     TABLE ACCESS FULL      | SALES     |      4 |    918K|    247K|00:00:00.03 |    1358 |   1280 |      0 |
--------------------------------------------------------------------------------------------------------------------

You can see a fatal cardinality misestimate for the table CUSTOMERS. The optimizer assumes the table contains just one row (look at E-Rows, whereas it actually has 55,500 rows according to A-Rows) and has therefore decided to use a Cartesian product. It thus produces 20 million rows as an intermediate result set, which then has to be joined with the fact table. No wonder the query took 27 seconds instead of 150 milliseconds.

What we should also know about this problematic table: it is completely reloaded every day at 10 pm (TRUNCATE/INSERT).
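
A minimal sketch of such a reload procedure, just to illustrate the pattern (the procedure and staging table names here are made up):

-- Hypothetical reload procedure; reload_customers and customers_stage
-- are illustrative names, not from the actual system.
CREATE OR REPLACE PROCEDURE reload_customers AS
BEGIN
  -- TRUNCATE is DDL and commits immediately: from this moment until
  -- the COMMIT below, other sessions see an empty CUSTOMERS table.
  EXECUTE IMMEDIATE 'TRUNCATE TABLE customers';

  -- Conventional-path INSERT: no APPEND hint, so no online statistics gathering.
  INSERT INTO customers
  SELECT * FROM customers_stage;

  COMMIT;
END;
/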

A closer look at the statistics and the data in the table only led to further confusion.

SQL>  select num_rows, to_char(last_analyzed, 'YYYY-MM-DD HH24:MI:SS')
  2   from user_tables 
  3   where table_name = 'CUSTOMERS';

  NUM_ROWS TO_CHAR(LAST_ANALYZ
---------- -------------------
         0 2024-05-06 22:05:23

SQL>  select count(*) 
  2   from  customers;

  COUNT(*)
----------
     55500

The new data had been loaded correctly, and statistics for the table were in place; however, they claimed the table was empty. The timestamp of the last statistics collection also matched the timing of the last load.

The first assumption was that something had gone wrong in the loading procedure, so we started to investigate the code. However, we found that the code did not care about statistics at all: neither an explicit DBMS_STATS call nor online statistics gathering (the INSERT ran without the APPEND hint) was in place. And yet the stats had been gathered around that time. It was then clear that we had to look in another direction!
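
For reference, online statistics gathering (available since Oracle 12c) would only kick in for a direct-path load into an empty segment, for example (customers_stage is again a hypothetical staging table):

-- Direct-path load: with the APPEND hint, Oracle 12c+ gathers basic
-- table and column statistics online as part of the INSERT itself.
INSERT /*+ APPEND */ INTO customers
SELECT * FROM customers_stage;
COMMIT;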

Oracle ships a number of automated maintenance tasks (Auto-Tasks); one of them is the automatic collection of optimizer statistics. Best practice is not to disable it, and there is a good chance that the default scheduler windows for these tasks have not been changed in your system either. The default weekday maintenance windows start at 10 pm.

SQL> select client_name, status from dba_autotask_client where client_name='auto optimizer stats collection';

CLIENT_NAME                                                      STATUS  
---------------------------------------------------------------- --------
auto optimizer stats collection                                  ENABLED 

1 row selected. 

Elapsed: 00:00:00.048
SQL> 
SQL> select window_name, repeat_interval from dba_scheduler_windows;

WINDOW_NAME          REPEAT_INTERVAL                                                       
-------------------- ----------------------------------------------------------------------
MONDAY_WINDOW        freq=daily;byday=MON;byhour=22;byminute=0; bysecond=0                 
TUESDAY_WINDOW       freq=daily;byday=TUE;byhour=22;byminute=0; bysecond=0                 
WEDNESDAY_WINDOW     freq=daily;byday=WED;byhour=22;byminute=0; bysecond=0                 
THURSDAY_WINDOW      freq=daily;byday=THU;byhour=22;byminute=0; bysecond=0                 
FRIDAY_WINDOW        freq=daily;byday=FRI;byhour=22;byminute=0; bysecond=0                 
SATURDAY_WINDOW      freq=daily;byday=SAT;byhour=6;byminute=0; bysecond=0                  
SUNDAY_WINDOW        freq=daily;byday=SUN;byhour=6;byminute=0; bysecond=0                  
WEEKNIGHT_WINDOW     freq=daily;byday=MON,TUE,WED,THU,FRI;byhour=22;byminute=0; bysecond=0 
WEEKEND_WINDOW       freq=daily;byday=SAT;byhour=0;byminute=0;bysecond=0                   

9 rows selected. 

Elapsed: 00:00:00.032

In the end it was simply bad luck: the automatic statistics gathering task happened to process our table in question and gathered statistics on it exactly after the TRUNCATE but before the INSERT was committed, and therefore saw, and recorded, an empty table.
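
You don't have to take such a conclusion on faith; the statistics history can confirm it. A couple of queries I find useful here (the exact OPERATION text of the auto task varies by version):

-- When were the statistics on CUSTOMERS last modified?
SELECT table_name, stats_update_time
FROM   dba_tab_stats_history
WHERE  table_name = 'CUSTOMERS';

-- Which statistics operations ran around that time? For the auto task
-- the operation is typically 'gather_database_stats (auto)'.
SELECT operation, start_time, end_time
FROM   dba_optstat_operations
WHERE  start_time > SYSTIMESTAMP - INTERVAL '1' DAY
ORDER  BY start_time;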

So what is my point here? There are several possible solutions for this case:

  • explicitly collect statistics after loading (see the sketch after this list)
  • ensure online statistics gathering can kick in
  • consider locking the statistics on that table if its change rate is not high (also sketched below)
  • maybe don't truncate the table at all, consider an incremental load instead
  • don't run the automatic statistics gathering task and ETL processes simultaneously
  • and so on …
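
As a sketch, the first and third options boil down to a couple of DBMS_STATS calls (the SH schema name is used here purely for illustration):

-- Explicitly gather statistics right after the load completes ...
BEGIN
  dbms_stats.gather_table_stats(ownname => 'SH', tabname => 'CUSTOMERS');
END;
/

-- ... or lock the statistics so that neither the auto task nor an
-- accidental manual call can overwrite them.
BEGIN
  dbms_stats.lock_table_stats(ownname => 'SH', tabname => 'CUSTOMERS');
END;
/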

But my point is not about the solution for this particular case. As data engineers we should always think about optimizer statistics and consider them every time we load, move, or transform data. Know how they work and know your options. Even if you decide to do nothing, it must be a conscious and well-justified decision!
