Online Statistics Gathering: Update 2024

Online Statistics Gathering was introduced in Oracle 12c. I think it is a very important building block for establishing a consistent approach to dealing with optimizer statistics in your ETL jobs. I have already written some posts about it a while ago (first, second and third). When I was preparing a talk about optimizer statistics for data engineers recently, I took a closer look at it again. Oracle has continuously improved the feature over several versions and has lifted many restrictions. Let’s see what I mean.

TL;DR

There are a couple of changes in the behavior and/or documentation since the introduction of the feature (Oracle 19c to 23ai):

  • Index statistics are gathered too. This is only valid for unique indexes in 19c, but also for non-unique in 23ai. There are no stats for bitmap indexes though.
  • Omitting columns without default values is no restriction anymore.
  • If using MERGE and writing into an empty segment using direct path writes, online statistics will be gathered too. But this behavior is not documented.
  • If loading into a partitioned table with its incremental statistics preference set to TRUE, but without using the extended syntax to explicitly specify a partition, still no statistics are gathered. This was a known and documented restriction, but it has now disappeared from the documentation.

Recap

The idea behind this feature is that if you load data using a direct path write into an empty segment, optimizer statistics are gathered on the fly as part of your loading SQL statement. Which statistics exactly are gathered, which statements are eligible, some limitations – this is what has changed.
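As a minimal sketch of the pattern (the table name is illustrative): an APPEND-hinted insert into an empty segment triggers online statistics gathering, and the per-statement hint NO_GATHER_OPTIMIZER_STATISTICS can suppress it if needed.

```sql
-- Minimal sketch: direct path load into an empty segment.
CREATE TABLE t_demo (id NUMBER, s VARCHAR2(20));

INSERT /*+ append */ INTO t_demo (id, s)
SELECT level, LPAD(level, 20, '0')
FROM   dual
CONNECT BY level <= 1000;
COMMIT;

-- Online statistics gathering can be suppressed per statement:
-- INSERT /*+ append no_gather_optimizer_statistics */ INTO t_demo ...
```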

Index Statistics

Initially, index statistics were not gathered at all. That was last mentioned in the documentation of Oracle 18c:

While gathering online statistics, the database does not gather index statistics or create histograms.

Optimizer Statistics Concepts 18c

Looking at the same paragraph in the 19c docs, only histograms are mentioned.

So, what does it mean? Are index statistics gathered as well in 19c? Let’s test it.

SQL> CREATE TABLE t_with_ind (id number, s varchar2(20));

Table T_WITH_IND created.

SQL> CREATE UNIQUE INDEX i1 on t_with_ind(id);

INDEX I1 created.

SQL> CREATE INDEX i2 on t_with_ind(s);

Index I2 created.

SQL> SELECT table_name, last_analyzed, num_rows
  2  FROM   user_tables 
  3  WHERE  table_name IN ('T_WITH_IND');

TABLE_NAME LAST_ANA   NUM_ROWS
---------- -------- ----------
T_WITH_IND                    


SQL> SELECT table_name, index_name, last_analyzed, num_rows,  blevel, leaf_blocks, distinct_keys, clustering_factor
  2  FROM   user_indexes 
  3  WHERE  table_name IN ('T_WITH_IND');

TABLE_NAME INDEX_NAME LAST_ANA   NUM_ROWS     BLEVEL LEAF_BLOCKS DISTINCT_KEYS CLUSTERING_FACTOR
---------- ---------- -------- ---------- ---------- ----------- ------------- -----------------
T_WITH_IND I1         17/08/24          0          0           0             0                 0
T_WITH_IND I2         17/08/24          0          0           0             0                 0

 
SQL> INSERT /*+ append */ INTO   T_WITH_IND (ID,S)
  2  SELECT level AS id, LPAD(level,20,'0') 
  3  FROM   dual
  4  CONNECT BY level <= 10000;

10.000 rows inserted.

SQL> commit;

Commit complete.

SQL> SELECT table_name, last_analyzed, num_rows
  2  FROM   user_tables 
  3  WHERE  table_name IN ('T_WITH_IND');

TABLE_NAME LAST_ANA   NUM_ROWS
---------- -------- ----------
T_WITH_IND 17/08/24      10000

SQL> SELECT table_name, index_name, last_analyzed, num_rows, blevel, leaf_blocks, distinct_keys, clustering_factor
  2  FROM   user_indexes i
  3  WHERE  table_name IN ('T_WITH_IND');

TABLE_NAME INDEX_NAME LAST_ANA   NUM_ROWS     BLEVEL LEAF_BLOCKS DISTINCT_KEYS CLUSTERING_FACTOR
---------- ---------- -------- ---------- ---------- ----------- ------------- -----------------
T_WITH_IND I1         17/08/24      10000          1          22         10000              4127
T_WITH_IND I2         17/08/24          0          0           0             0                 0

Listing 1: Only unique indexes in 19c

And indeed, we can see that statistics for the first (unique) index were gathered, but not for the non-unique one. I haven’t found a bug for this behavior, but I have found a post by Donatello Settembrino describing exactly this observation. As Christian Antognini commented there, in Oracle 20c (which was only available as a preview) statistics for non-unique indexes are gathered too. I will skip 20c and 21c altogether, but we can check this in Oracle 23ai:

SQL> CREATE TABLE t_with_ind (id number, s varchar2(20))
Table created.
SQL> CREATE INDEX i1 on t_with_ind(id)
Index created.
SQL> CREATE INDEX i2 on t_with_ind(s)
Index created.
SQL> CREATE INDEX i3 on t_with_ind(id,s)
Index created.
SQL> CREATE BITMAP INDEX bi1 on t_with_ind(id) invisible
Index created.
SQL> SELECT table_name, last_analyzed, num_rows
FROM   user_tables 
WHERE  table_name IN ('T_WITH_IND')

TABLE_NAME LAST_ANALYZED           NUM_ROWS
---------- --------------------- ----------
T_WITH_IND                                 
1 row selected.
SQL> SELECT table_name, index_name, last_analyzed, num_rows,  blevel, leaf_blocks, distinct_keys, clustering_factor
FROM   user_indexes 
WHERE  table_name IN ('T_WITH_IND')

TABLE_NAME INDEX_NAME LAST_ANALYZED           NUM_ROWS     BLEVEL LEAF_BLOCKS DISTINCT_KEYS CLUSTERING_FACTOR
---------- ---------- --------------------- ---------- ---------- ----------- ------------- -----------------
T_WITH_IND I1         18.08.2024 09:39:31            0          0           0             0                 0
T_WITH_IND I2         18.08.2024 09:39:31            0          0           0             0                 0
T_WITH_IND I3         18.08.2024 09:39:31            0          0           0             0                 0
T_WITH_IND BI1        18.08.2024 09:39:31            0          0           0             0                 0

4 rows selected.

SQL> INSERT /*+ append */  INTO   T_WITH_IND (ID,S)
SELECT level AS id , LPAD(level,20,'0') 
FROM   dual
CONNECT BY level <= 100000
100000 rows created.
SQL> commit
Commit complete.

SQL> SELECT table_name, last_analyzed, num_rows
FROM   user_tables 
WHERE  table_name IN ('T_WITH_IND')

TABLE_NAME LAST_ANALYZED           NUM_ROWS
---------- --------------------- ----------
T_WITH_IND 18.08.2024 09:39:39       100000
1 row selected.
SQL> SELECT table_name, index_name, last_analyzed, num_rows, blevel, leaf_blocks, distinct_keys, clustering_factor
FROM   user_indexes i
WHERE  table_name IN ('T_WITH_IND')

TABLE_NAME INDEX_NAME LAST_ANALYZED           NUM_ROWS     BLEVEL LEAF_BLOCKS DISTINCT_KEYS CLUSTERING_FACTOR
---------- ---------- --------------------- ---------- ---------- ----------- ------------- -----------------
T_WITH_IND I1         18.08.2024 09:39:39       100000          1         231        100000             41508
T_WITH_IND I2         18.08.2024 09:39:39       100000          2         448        100000             41508
T_WITH_IND I3         18.08.2024 09:39:39       100000          2         543        100000             41508
T_WITH_IND BI1        18.08.2024 09:39:31            0          0           0             0                 0

Listing 2: Unique and non-unique B-tree indexes got statistics but not a bitmap index

As we can see, statistics for B-tree indexes are collected online for both unique and non-unique indexes, but we still get no statistics for bitmap indexes. Having seen that, I still believe the best practice for bulk loading large amounts of data is to disable/drop the indexes first, load the data, and rebuild/recreate the indexes afterwards.
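A sketch of that pattern, using the table and index names from Listing 2 (note that a unique index backing an enabled constraint cannot simply be set unusable, so only the non-unique indexes are handled here):

```sql
-- Set non-unique indexes unusable, load via direct path, then rebuild.
ALTER INDEX i2 UNUSABLE;
ALTER INDEX i3 UNUSABLE;
-- skip_unusable_indexes defaults to TRUE in recent releases anyway.
ALTER SESSION SET skip_unusable_indexes = TRUE;

INSERT /*+ append */ INTO t_with_ind (id, s)
SELECT level, LPAD(level, 20, '0')
FROM   dual
CONNECT BY level <= 100000;
COMMIT;

-- Rebuilding brings the indexes back and computes fresh index statistics.
ALTER INDEX i2 REBUILD;
ALTER INDEX i3 REBUILD;
```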

Not all columns specified

Up to Oracle 18c we could find the following restriction:

It is loaded using an INSERT INTO ... SELECT, and neither of the following conditions is true: all columns of the target table are specified, or a subset of the target columns are specified and the unspecified columns have default values.

Put differently, the database only gathered statistics automatically for bulk loads when either all columns of the target table were specified, or a subset of the target columns was specified and the unspecified columns had default values.

If you omit the column col_no_def with no default value in the INSERT SQL, online statistics gathering does not take place in Oracle 12.1 to 18c as shown in Listing 3.

SQL> CREATE TABLE t_not_all_columns (id number, col_no_def varchar2(20))
Table created.
 
SQL> INSERT /*+ append */ INTO   t_not_all_columns (ID)
SELECT level AS id 
FROM   dual
CONNECT BY level <= 10000
10000 rows created.
 
SQL> SELECT table_name, last_analyzed, num_rows
FROM   user_tables 
WHERE  table_name IN ('T_NOT_ALL_COLUMNS')
 
TABLE_NAME                     LAST_ANALYZED   NUM_ROWS
------------------------------ ------------- ----------
T_NOT_ALL_COLUMNS                                      
1 row selected.

Listing 3: The behavior up to 18c

Is the restriction now gone? Yes, the same code works from 19c:

SQL> CREATE TABLE t_not_all_columns (id number, col_no_def varchar2(20));

Table T_NOT_ALL_COLUMNS created.

SQL> INSERT /*+ append */ INTO   t_not_all_columns (ID)
  2  SELECT level AS id 
  3  FROM   dual
  4  CONNECT BY level <= 10000;

10.000 rows inserted.

SQL> commit;

Commit complete.

SQL> SELECT table_name, last_analyzed, num_rows
  2  FROM   user_tables 
  3  WHERE  table_name IN ('T_NOT_ALL_COLUMNS');

TABLE_NAME                                 LAST_ANA   NUM_ROWS
------------------------------------------ -------- ----------
T_NOT_ALL_COLUMNS                          18/08/24      10000

Listing 4: Omitting columns with no default values is no problem from 19c onwards

Online Statistics also using MERGE

The MERGE statement was never mentioned in the documentation as eligible for online statistics gathering. In fact, if you are loading data into an empty segment and specify the APPEND hint, it does the same as a plain INSERT, right? Listing 5 shows the check in Oracle 19c. ORA-12838 indicates that a direct path write has happened, but nevertheless, we got no statistics in this case.

SQL> CREATE TABLE t_merge (id number, s varchar2(20));

Table T_MERGE created.

SQL> MERGE /*+ append */
  2  INTO   t_merge t
  3  USING (
  4         SELECT level AS id , lpad (level, 20, '0') s
  5         FROM   dual
  6         CONNECT BY level <= 10000) q
  7  ON   (q.id = t.id)
  8  WHEN NOT MATCHED THEN INSERT VALUES (q.id, q.s);

10.000 rows merged.

SQL> SELECT count(*) FROM t_merge;

SQL Error: ORA-12838: cannot read/modify an object after modifying it in parallel
12838. 00000 -  "cannot read/modify an object after modifying it in parallel"

SQL> COMMIT;

Commit complete.

SQL> SELECT table_name, last_analyzed, num_rows
  2  FROM   user_tables 
  3  WHERE  table_name IN ('T_MERGE');

TABLE_NAME                    LAST_ANA   NUM_ROWS
------------------------------ -------- ----------
T_MERGE                                                                                                                                             

Listing 5: MERGE still not gathering stats in 19c

Let’s check it in Oracle 23ai! First – and this is new in Oracle 23ai – we get no error when trying to read a segment immediately after it was written via direct path. No ORA-12838 – nice! And the statistics were gathered online. I’ve changed the check query a bit to show the column NOTES from user_tab_col_statistics. The text STATS_ON_LOAD indicates that Online Statistics Gathering has kicked in.

SQL> CREATE TABLE t_merge (id number, s varchar2(20));

Table T_MERGE created.

SQL> MERGE /*+ append */
  2  INTO   t_merge t
  3  USING (
  4         SELECT level AS id , lpad (level, 20, '0') s
  5         FROM   dual
  6         CONNECT BY level <= 10000) q
  7  ON   (q.id = t.id)
  8  WHEN NOT MATCHED THEN INSERT VALUES (q.id, q.s);

10.000 rows merged.

SQL> SELECT count(*) FROM t_merge;

  COUNT(*)
----------
     10000

SQL> COMMIT;

Commit complete.

SQL> SELECT table_name, column_name, last_analyzed, notes
  2  FROM   user_tab_col_statistics
  3  WHERE  table_name IN ('T_MERGE');

TABLE_NAME           COLUMN_NAM LAST_ANA NOTES                                                                                              
-------------------- ---------- -------- ---------------------
T_MERGE              ID         18/08/24 STATS_ON_LOAD                                                                                      
T_MERGE              S          18/08/24 STATS_ON_LOAD 

Listing 6: MERGE in Oracle 23ai – statistics are gathered

Well, it’s good to know that it works now, but I would also like to see the documentation updated on this point. Otherwise, if we want to benefit from it, we’ll be relying on undocumented behavior, right?

Partitioning

When bulk loading data into partitioned tables, there are some points to consider: what statistics will be gathered depends on the syntax you are using and on whether the incremental statistics preference is set for the table. I wrote about it in my post in 2018 and I will repeat the matrix here again. The interesting point is that if you set the incremental preference to TRUE and don’t use the extended syntax (i.e. explicitly specifying the target partition), then you’ll get no statistics at all.

                             Incremental Preference TRUE        Incremental Preference FALSE
Partition name specified     partition-level stats + synopsis   partition-level stats
No partition name specified  NO STATS AT ALL                    global stats

That is explicitly mentioned under restrictions in the documentation of Oracle 18c:

It is partitioned, INCREMENTAL is set to true, and partition-extended syntax is not used. For example, assume that you execute DBMS_STATS.SET_TABLE_PREFS(null, 'sales', 'incremental', 'true'). In this case, the database does not gather statistics for INSERT INTO sales SELECT, even when sales is empty. However, the database does gather statistics automatically for INSERT INTO sales PARTITION (sales_q4_2000) SELECT.

And again, from 19c onwards the restriction is no longer on the list. What does that mean? What statistics will be gathered online in this case? I have tested it in Oracle 23ai:

SQL> CREATE TABLE t_part (pkey number, id number, s varchar2(20))
  2  partition by range (pkey)
  3  (partition P0 values less than (0),
  4  partition P1 values less than (1),
  5  partition P2 values less than (2),
  6  partition P3 values less than (3));

Table T_PART created.

SQL> exec dbms_stats.set_table_prefs(null,'T_PART','INCREMENTAL','TRUE')

PL/SQL procedure successfully completed.

SQL> INSERT /*+ append */ INTO   t_part --PARTITION (P2)
  2  SELECT 1 pkey, level AS id , lpad (level, 20, '0') s
  3  FROM   dual
  4  CONNECT BY level <= 10000;

10.000 rows inserted.

SQL> COMMIT;

Commit complete.

SQL> select table_name, partition_name, object_type, last_analyzed, num_rows
  2  from sys.user_tab_statistics cs
  3  where table_name IN ('T_PART');

TABLE_NAME           PARTI OBJECT_TYPE     LAST_ANA   NUM_ROWS
-------------------- ----- --------------- -------- ----------
T_PART                     TABLE                              
T_PART               P0    PARTITION                          
T_PART               P1    PARTITION                          
T_PART               P2    PARTITION                          
T_PART               P3    PARTITION                          

Listing 7: Oracle 23ai – still no stats for incremental=true and no extended syntax used

As we can see, no statistics are collected online when I don’t specify the partition name (or the key using PARTITION FOR) explicitly. So we have either a bug in the documentation, or a bug in the implementation if it is really supposed to work now.
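For comparison, the partition-extended variant (the clause commented out in Listing 7) is the one that, according to the matrix above, should trigger online statistics gathering:

```sql
-- Partition-extended variant of the load from Listing 7
-- (pkey = 1 falls into partition P2, "values less than (2)"):
INSERT /*+ append */ INTO t_part PARTITION (P2)
SELECT 1 pkey, level AS id, LPAD(level, 20, '0') s
FROM   dual
CONNECT BY level <= 10000;
COMMIT;

-- With INCREMENTAL=TRUE this should produce partition-level
-- statistics plus a synopsis for P2.
```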

Conclusion

I think it’s great that Oracle continues to improve this valuable feature over several major releases.

Although I don’t think you should always rely on the online gathering of index statistics during bulk loads, it can be useful in some load scenarios, mainly for one-off loads. For loading large amounts of data on a regular basis it can be better to set the indexes unusable or drop them, and then rebuild/recreate them after loading the data.

I think it is very important that the restriction about incomplete column lists has been lifted, because it could lead to hard-to-find bugs. It is a best practice to always provide an explicit column list when doing an INSERT, both in the INSERT and in its SELECT part. So if you extended the table but not yet the INSERT statements, everything would still work, but the statistics would no longer be collected online. This has now been resolved.

I was pleasantly surprised that MERGE with the APPEND hint also gathers statistics online when inserting into an empty segment. It is somehow consistent and what one would expect, I think. Nevertheless, you can only benefit from it once – only while the segment is still empty. After that, no online statistics gathering will take place. On the other hand, if you ensure yourself that the target table is emptied each time beforehand, then you don’t really need MERGE: WHEN MATCHED THEN UPDATE makes no sense on an empty table, and a simple INSERT will do the job.

To be honest, I think that the lifting of the restriction concerning loading data into partitioned tables with the incremental preference set to TRUE was just a removal from the docs. The restriction itself seems to still be there.
