Online Statistics Gathering was introduced in Oracle 12c. I think it is a very important building block for establishing a consistent approach to dealing with optimizer statistics in your ETL jobs. I have already written some posts about it a while ago (first, second and third). When I was recently preparing a talk about optimizer statistics for data engineers, I took a closer look at it again. Oracle has continuously improved the feature over several versions and has lifted many restrictions. Let’s see what I mean.
TL;DR
There are a couple of changes in the behavior and/or documentation since the introduction of the feature (Oracle 19c to 23ai):
- Index statistics are gathered too. In 19c this is only valid for unique indexes, but in 23ai it also applies to non-unique indexes. There are still no stats for bitmap indexes, though.
- Omitting columns without default values is no longer a restriction.
- When using MERGE and writing into an empty segment using direct path writes, online statistics are gathered too. This behavior is not documented, however.
- When loading into a partitioned table with its incremental statistics preference set to TRUE but without using the extended syntax to explicitly specify a partition, still no statistics are gathered. This was a known and documented restriction, but it has now disappeared from the documentation.
Recap
The idea behind this feature is that if you are loading data using direct path writes into an empty segment, optimizer statistics are gathered on the fly as part of your loading SQL statement. What statistics exactly are gathered, which statements are eligible, and some limitations – this is what has changed.
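As a minimal sketch of what that looks like (the table names t and src are just placeholders for this illustration, not from the examples below): a direct path INSERT into an empty table is eligible, and you can check afterwards that the column statistics were flagged as gathered on load.
-- Hypothetical tables t (empty) and src, for illustration only
INSERT /*+ APPEND */ INTO t
SELECT * FROM src;

COMMIT;

-- Column statistics gathered by online statistics gathering
-- are flagged with STATS_ON_LOAD in the NOTES column
SELECT column_name, last_analyzed, notes
FROM user_tab_col_statistics
WHERE table_name = 'T';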
Index Statistics
Initially, index statistics were not gathered at all. That was last mentioned in the documentation of Oracle 18c:
While gathering online statistics, the database does not gather index statistics or create histograms.
Optimizer Statistics Concepts 18c
Looking at the same paragraph in the 19c docs, only histograms are mentioned.
So what does that mean? Are index statistics now gathered in 19c as well? Let’s test it.
SQL> CREATE TABLE t_with_ind (id number, s varchar2(20));
Table T_WITH_IND created.
SQL> CREATE UNIQUE INDEX i1 on t_with_ind(id);
INDEX I1 created.
SQL> CREATE INDEX i2 on t_with_ind(s);
Index I2 created.
SQL> SELECT table_name, last_analyzed, num_rows
2 FROM user_tables
3 WHERE table_name IN ('T_WITH_IND');
TABLE_NAME LAST_ANA NUM_ROWS
---------- -------- ----------
T_WITH_IND
SQL> SELECT table_name, index_name, last_analyzed, num_rows, blevel, leaf_blocks, distinct_keys, clustering_factor
2 FROM user_indexes
3 WHERE table_name IN ('T_WITH_IND');
TABLE_NAME INDEX_NAME LAST_ANA NUM_ROWS BLEVEL LEAF_BLOCKS DISTINCT_KEYS CLUSTERING_FACTOR
---------- ---------- -------- ---------- ---------- ----------- ------------- -----------------
T_WITH_IND I1 17/08/24 0 0 0 0 0
T_WITH_IND I2 17/08/24 0 0 0 0 0
SQL> INSERT /*+ append */ INTO T_WITH_IND (ID,S)
2 SELECT level AS id, LPAD(level,20,'0')
3 FROM dual
4 CONNECT BY level <= 10000;
10.000 rows inserted.
SQL> commit;
Commit complete.
SQL> SELECT table_name, last_analyzed, num_rows
2 FROM user_tables
3 WHERE table_name IN ('T_WITH_IND');
TABLE_NAME LAST_ANA NUM_ROWS
---------- -------- ----------
T_WITH_IND 17/08/24 10000
SQL> SELECT table_name, index_name, last_analyzed, num_rows, blevel, leaf_blocks, distinct_keys, clustering_factor
2 FROM user_indexes i
3 WHERE table_name IN ('T_WITH_IND');
TABLE_NAME INDEX_NAME LAST_ANA NUM_ROWS BLEVEL LEAF_BLOCKS DISTINCT_KEYS CLUSTERING_FACTOR
---------- ---------- -------- ---------- ---------- ----------- ------------- -----------------
T_WITH_IND I1 17/08/24 10000 1 22 10000 4127
T_WITH_IND I2 17/08/24 0 0 0 0 0
Listing 1: Only unique indexes get statistics in 19c
And indeed, we can see that statistics for the first (unique) index were gathered, but not for the non-unique one. I haven’t found a bug entry for this behavior, but I have found a post by Donatello Settembrino describing exactly this observation. As Christian Antognini commented, in Oracle 20c (which was available only as a preview) the statistics for non-unique indexes are gathered too. I will skip 20c and 21c altogether, but we can check this in Oracle 23ai:
SQL> CREATE TABLE t_with_ind (id number, s varchar2(20))
Table created.
SQL> CREATE INDEX i1 on t_with_ind(id)
Index created.
SQL> CREATE INDEX i2 on t_with_ind(s)
Index created.
SQL> CREATE INDEX i3 on t_with_ind(id,s)
Index created.
SQL> CREATE BITMAP INDEX bi1 on t_with_ind(id) invisible
Index created.
SQL> SELECT table_name, last_analyzed, num_rows
FROM user_tables
WHERE table_name IN ('T_WITH_IND')
TABLE_NAME LAST_ANALYZED NUM_ROWS
---------- --------------------- ----------
T_WITH_IND
1 row selected.
SQL> SELECT table_name, index_name, last_analyzed, num_rows, blevel, leaf_blocks, distinct_keys, clustering_factor
FROM user_indexes
WHERE table_name IN ('T_WITH_IND')
TABLE_NAME INDEX_NAME LAST_ANALYZED NUM_ROWS BLEVEL LEAF_BLOCKS DISTINCT_KEYS CLUSTERING_FACTOR
---------- ---------- --------------------- ---------- ---------- ----------- ------------- -----------------
T_WITH_IND I1 18.08.2024 09:39:31 0 0 0 0 0
T_WITH_IND I2 18.08.2024 09:39:31 0 0 0 0 0
T_WITH_IND I3 18.08.2024 09:39:31 0 0 0 0 0
T_WITH_IND BI1 18.08.2024 09:39:31 0 0 0 0 0
4 rows selected.
SQL> INSERT /*+ append */ INTO T_WITH_IND (ID,S)
SELECT level AS id , LPAD(level,20,'0')
FROM dual
CONNECT BY level <= 100000
100000 rows created.
SQL> commit
Commit complete.
SQL> SELECT table_name, last_analyzed, num_rows
FROM user_tables
WHERE table_name IN ('T_WITH_IND')
TABLE_NAME LAST_ANALYZED NUM_ROWS
---------- --------------------- ----------
T_WITH_IND 18.08.2024 09:39:39 100000
1 row selected.
SQL> SELECT table_name, index_name, last_analyzed, num_rows, blevel, leaf_blocks, distinct_keys, clustering_factor
FROM user_indexes i
WHERE table_name IN ('T_WITH_IND')
TABLE_NAME INDEX_NAME LAST_ANALYZED NUM_ROWS BLEVEL LEAF_BLOCKS DISTINCT_KEYS CLUSTERING_FACTOR
---------- ---------- --------------------- ---------- ---------- ----------- ------------- -----------------
T_WITH_IND I1 18.08.2024 09:39:39 100000 1 231 100000 41508
T_WITH_IND I2 18.08.2024 09:39:39 100000 2 448 100000 41508
T_WITH_IND I3 18.08.2024 09:39:39 100000 2 543 100000 41508
T_WITH_IND BI1 18.08.2024 09:39:31 0 0 0 0 0
Listing 2: Unique and non-unique B-tree indexes got statistics but not a bitmap index
As we can see, statistics for B-tree indexes are collected online for both unique and non-unique indexes, but we still get no statistics for bitmap indexes. Having seen that, I still believe the best practice for bulk loading large amounts of data is to disable/drop the indexes first, then load the data and rebuild/recreate the indexes afterwards.
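A minimal sketch of that pattern, reusing the non-unique index i2 from the example above (note that simply marking an index unusable only works for non-unique indexes that don’t enforce constraints):
-- Mark the non-unique index unusable before the bulk load;
-- with SKIP_UNUSABLE_INDEXES = TRUE (the default) the load ignores it
ALTER INDEX i2 UNUSABLE;

INSERT /*+ APPEND */ INTO t_with_ind (id, s)
SELECT level, LPAD(level, 20, '0')
FROM dual
CONNECT BY level <= 100000;

COMMIT;

-- Rebuild afterwards; index statistics are computed as part of the rebuild
ALTER INDEX i2 REBUILD;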
Not all columns specified
Up to Oracle 18c we could find the following restriction:
It is loaded using an INSERT INTO ... SELECT, and neither of the following conditions is true: all columns of the target table are specified, or a subset of the target columns are specified and the unspecified columns have default values. Put differently, the database only gathers statistics automatically for bulk loads when either all columns of the target table are specified, or a subset of the target columns are specified and the unspecified columns have default values.
If you omit the column col_no_def, which has no default value, from the INSERT statement, online statistics gathering does not take place in Oracle 12.1 to 18c, as shown in Listing 3.
SQL> CREATE TABLE t_not_all_columns (id number, col_no_def varchar2(20))
Table created.
SQL> INSERT /*+ append */ INTO t_not_all_columns (ID)
SELECT level AS id
FROM dual
CONNECT BY level <= 10000
10000 rows created.
SQL> SELECT table_name, last_analyzed, num_rows
FROM user_tables
WHERE table_name IN ('T_NOT_ALL_COLUMNS')
TABLE_NAME LAST_ANALYZED NUM_ROWS
------------------------------ ------------- ----------
T_NOT_ALL_COLUMNS
1 row selected.
Listing 3: The behavior up to 18c
Is the restriction now gone? Yes, the same code works from 19c onwards:
SQL> CREATE TABLE t_not_all_columns (id number, col_no_def varchar2(20));
Table T_NOT_ALL_COLUMNS created.
SQL> INSERT /*+ append */ INTO t_not_all_columns (ID)
2 SELECT level AS id
3 FROM dual
4 CONNECT BY level <= 10000;
10.000 rows inserted.
SQL> commit;
Commit complete.
SQL> SELECT table_name, last_analyzed, num_rows
2 FROM user_tables
3 WHERE table_name IN ('T_NOT_ALL_COLUMNS');
TABLE_NAME LAST_ANA NUM_ROWS
------------------------------------------ -------- ----------
T_NOT_ALL_COLUMNS 18/08/24 10000
Listing 4: Omitting columns with no default values is no problem from 19c
Online Statistics also using MERGE
The MERGE statement was never mentioned in the documentation as eligible for online statistics gathering. In fact, if you are loading data into an empty segment and specify the APPEND hint, it is effectively doing the same as a plain INSERT, right? Listing 5 shows the check in Oracle 19c. ORA-12838 indicates that a direct path write has happened, but nevertheless we got no statistics in this case.
SQL> CREATE TABLE t_merge (id number, s varchar2(20));
Table T_MERGE created.
SQL> MERGE /*+ append */
2 INTO t_merge t
3 USING (
4 SELECT level AS id , lpad (level, 20, '0') s
5 FROM dual
6 CONNECT BY level <= 10000) q
7 ON (q.id = t.id)
8 WHEN NOT MATCHED THEN INSERT VALUES (q.id, q.s);
10.000 rows merged.
SQL> SELECT count(*) FROM t_merge;
SQL Error: ORA-12838: cannot read/modify an object after modifying it in parallel
12838. 00000 - "cannot read/modify an object after modifying it in parallel"
SQL> COMMIT;
Commit complete.
SQL> SELECT table_name, last_analyzed, num_rows
2 FROM user_tables
3 WHERE table_name IN ('T_MERGE');
TABLE_NAME LAST_ANA NUM_ROWS
------------------------------ -------- ----------
T_MERGE
Listing 5: MERGE still not gathering stats in 19c
Let’s check it in Oracle 23ai! First, and this is new in Oracle 23ai, we get no error when trying to read a segment immediately after it was written via direct path. No ORA-12838 – nice! And the statistics were gathered online. I’ve changed the check query a bit to show the column NOTES from user_tab_col_statistics. The text STATS_ON_LOAD indicates that Online Statistics Gathering has kicked in.
SQL> CREATE TABLE t_merge (id number, s varchar2(20));
Table T_MERGE created.
SQL> MERGE /*+ append */
2 INTO t_merge t
3 USING (
4 SELECT level AS id , lpad (level, 20, '0') s
5 FROM dual
6 CONNECT BY level <= 10000) q
7 ON (q.id = t.id)
8 WHEN NOT MATCHED THEN INSERT VALUES (q.id, q.s);
10.000 rows merged.
SQL> SELECT count(*) FROM t_merge;
COUNT(*)
----------
10000
SQL> COMMIT;
Commit complete.
SQL> SELECT table_name, column_name, last_analyzed, notes
2 FROM user_tab_col_statistics
3 WHERE table_name IN ('T_MERGE');
TABLE_NAME COLUMN_NAM LAST_ANA NOTES
-------------------- ---------- -------- ---------------------
T_MERGE ID 18/08/24 STATS_ON_LOAD
T_MERGE S 18/08/24 STATS_ON_LOAD
Listing 6: MERGE in Oracle 23ai – statistics are gathered
Well, good to know that it works now, but I would also like to see the documentation updated on this point. Otherwise, if we want to benefit from it, we’ll be relying on undocumented behavior, right?
Partitioning
When bulk loading data into partitioned tables, there are some points to consider: what statistics will be gathered depends on the syntax you are using and on whether the incremental statistics preference is set for the table. I wrote about this in a post in 2018 and will repeat the matrix here. The interesting point is that if you set the incremental preference to TRUE and don’t use the extended syntax (i.e. explicitly specifying the target partition), you’ll get no statistics at all.
| | Incremental Preference TRUE | Incremental Preference FALSE |
|---|---|---|
| Partition name specified | partition-level stats + synopsis | partition-level stats |
| No partition name specified | NO STATS AT ALL | global stats |
That is explicitly mentioned under restrictions in the documentation of Oracle 18c:
It is partitioned, INCREMENTAL is set to true, and partition-extended syntax is not used. For example, assume that you execute DBMS_STATS.SET_TABLE_PREFS(null, 'sales', 'incremental', 'true'). In this case, the database does not gather statistics for INSERT INTO sales SELECT, even when sales is empty. However, the database does gather statistics automatically for INSERT INTO sales PARTITION (sales_q4_2000) SELECT.
And again, from 19c onwards this restriction is no longer on the list. What does that mean? What statistics will be gathered online in this case? I have tested it in Oracle 23ai:
SQL> CREATE TABLE t_part (pkey number, id number, s varchar2(20))
2 partition by range (pkey)
3 (partition P0 values less than (0),
4 partition P1 values less than (1),
5 partition P2 values less than (2),
6 partition P3 values less than (3));
Table T_PART created.
SQL> exec dbms_stats.set_table_prefs(null,'T_PART','INCREMENTAL','TRUE')
PL/SQL procedure successfully completed.
SQL> INSERT /*+ append */ INTO t_part --PARTITION (P2)
2 SELECT 1 pkey, level AS id , lpad (level, 20, '0') s
3 FROM dual
4 CONNECT BY level <= 10000;
10.000 rows inserted.
SQL> COMMIT;
Commit complete.
SQL> select table_name, partition_name, object_type, last_analyzed, num_rows
2 from sys.user_tab_statistics cs
3 where table_name IN ('T_PART');
TABLE_NAME PARTI OBJECT_TYPE LAST_ANA NUM_ROWS
-------------------- ----- --------------- -------- ----------
T_PART TABLE
T_PART P0 PARTITION
T_PART P1 PARTITION
T_PART P2 PARTITION
T_PART P3 PARTITION
Listing 7: Oracle 23ai – still no stats for incremental=true and no extended syntax used
As we can see, no statistics are collected online if I don’t specify the partition name (or the partition key using PARTITION FOR) explicitly. So we either have a bug in the documentation or a bug in the implementation, if it is really supposed to work now.
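For comparison, here is a sketch of the variant that should gather statistics according to the matrix above – the same load as in Listing 7, just with the partition clause uncommented (the expectation of partition-level statistics plus a synopsis is taken from the matrix; I haven’t reproduced the output here):
-- Same load as in Listing 7, but with partition-extended syntax
INSERT /*+ append */ INTO t_part PARTITION (P2)
SELECT 1 pkey, level AS id, LPAD(level, 20, '0') s
FROM dual
CONNECT BY level <= 10000;

COMMIT;

-- Expected per the matrix: partition-level statistics (and a synopsis)
-- for partition P2, gathered online
SELECT table_name, partition_name, object_type, last_analyzed, num_rows
FROM user_tab_statistics
WHERE table_name = 'T_PART';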
Conclusion
I think it’s great that Oracle continues to improve this valuable feature over several major releases.
Although I don’t think you should always rely on the online gathering of index statistics during bulk loads, it could be useful in some load scenarios, mainly for one-off loads. For loading large amounts of data on a regular basis, it can be better to set the indexes unusable or drop them, and then rebuild/recreate them after loading the data.
I think it is very important that the restriction about incomplete column lists has been lifted, because it could lead to hard-to-find bugs. It is a best practice to always provide an explicit column list when doing an INSERT, both in the INSERT and in the SELECT part of it. So, if you were to extend the table but not yet the INSERT statements, everything would still work, but the statistics would no longer be collected online. This has now been resolved.
I was pleasantly surprised that MERGE with an APPEND hint will also gather statistics online when inserting into an empty segment. It is somehow consistent and what one would expect, I think. Nevertheless, you can only benefit from it once – only as long as the segment is still empty. After that, no online statistics gathering will take place. On the other hand, if you ensure yourself that the target table is emptied each time beforehand, then you don’t really need MERGE, since WHEN MATCHED THEN UPDATE doesn’t make sense on an empty table and a simple INSERT will do the job.
To be honest, I think that lifting the restriction concerning loading data into partitions with the incremental preference set to TRUE was just a matter of removing it from the docs. The restriction itself still seems to be there.