Online Statistics Gathering was introduced in 12c and is a very handy feature for ETL and batch job developers. However, the devil is in the details, and there are some points to remember. Let’s take a closer look.
Introduction
If you are not familiar with the concept of this feature yet, here are some useful links:
But I’ll try to give a short introduction too. The official name of the feature is Online Statistics Gathering for Bulk Loads. So when exactly will table statistics be gathered automatically, and when not?
It’s quite clear with Create Table As Select (CTAS): they will be gathered (if no restrictions apply). The other supported bulk load scenario is INSERT INTO ... SELECT into an empty table using a direct path insert. That means your INSERT must either use the APPEND hint or be done in parallel. Don’t forget to enable parallel DML at the session level or with the ENABLE_PARALLEL_DML hint – the PARALLEL hint alone is not enough.
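For illustration, here is a minimal sketch of both variants (t_target and t_source are placeholder names, not from the listings below):

-- variant 1: serial direct path insert via the APPEND hint
INSERT /*+ APPEND */ INTO t_target SELECT * FROM t_source;
COMMIT;

-- variant 2: parallel insert; ENABLE_PARALLEL_DML is needed in addition
-- to PARALLEL (alternatively: ALTER SESSION ENABLE PARALLEL DML)
INSERT /*+ ENABLE_PARALLEL_DML PARALLEL(4) */ INTO t_target SELECT * FROM t_source;
COMMIT;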
Parameter, Hints and Execution Plans
There is an underscore parameter _optimizer_gather_stats_on_load, which is TRUE by default. Should you want to disable the feature, set this parameter to FALSE. There are also two optimizer hints: GATHER_OPTIMIZER_STATISTICS and NO_GATHER_OPTIMIZER_STATISTICS. Don’t misunderstand them! You cannot tell the optimizer to gather (piggyback) statistics online if the main condition (a bulk insert into an empty segment) is not met. These hints are only useful in conjunction with the global setting via _optimizer_gather_stats_on_load: if it is TRUE, you can disable the feature for individual statements with NO_GATHER_OPTIMIZER_STATISTICS; and vice versa, if it is globally FALSE, you can selectively enable piggybacking with the GATHER_OPTIMIZER_STATISTICS hint.
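A minimal sketch of both hint variants (again with placeholder table names):

-- _optimizer_gather_stats_on_load = TRUE (default): suppress the feature for one statement
INSERT /*+ APPEND NO_GATHER_OPTIMIZER_STATISTICS */ INTO t_target SELECT * FROM t_source;

-- _optimizer_gather_stats_on_load = FALSE: request the feature for one statement
INSERT /*+ APPEND GATHER_OPTIMIZER_STATISTICS */ INTO t_target SELECT * FROM t_source;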
How do you know whether online statistics will be gathered? When you explain the plan, you will see an extra step, OPTIMIZER STATISTICS GATHERING.
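For instance, explaining a direct path insert like this (a sketch assuming a source table T and a target table T2, matching the plan below):

EXPLAIN PLAN FOR
INSERT /*+ APPEND */ INTO t2 SELECT * FROM t;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY(format => 'BASIC'));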
-------------------------------------------------
| Id  | Operation                        | Name |
-------------------------------------------------
|   0 | INSERT STATEMENT                 |      |
|   1 |  LOAD AS SELECT                  | T2   |
|   2 |   OPTIMIZER STATISTICS GATHERING |      |
|   3 |    TABLE ACCESS FULL             | T    |
-------------------------------------------------
But in fact you cannot know for sure. This step only means that online statistics collection will be considered at runtime, because the optimizer doesn’t check at parse time whether the segment is really empty.
What about MERGE?
This question comes to mind because the INSERT part of a MERGE will also be done using a direct path write if we use an APPEND hint or run it in parallel. In an ETL scenario it is not uncommon to load only records that do not already exist in the target. And since you don’t need to provide both WHEN MATCHED and WHEN NOT MATCHED (since 10g), one could construct a MERGE doing only an insert, which, from my point of view, is semantically the same as an INSERT INTO … SELECT with a NOT IN or NOT EXISTS subquery. The latter will gather statistics online. But what can we expect from MERGE? MERGE is not mentioned in the documentation, but in my opinion it’s worth checking.
SQL> CREATE TABLE t_merge (id number, s varchar2(20))

Table created.

SQL> --- MERGE
SQL> MERGE /*+ append */ INTO t_merge t
USING ( SELECT level AS id
             , lpad (level, 20, '0') s
        FROM dual
        CONNECT BY level <= 10000) q
ON (q.id = t.id)
WHEN NOT MATCHED THEN INSERT VALUES (q.id, q.s)

10000 rows merged.

-- This error message means we were doing a bulk insert (direct path write) into T_MERGE
SQL> SELECT count(*) FROM t_merge

SELECT count(*) FROM t_merge
*
Error at line 1
ORA-12838: cannot read/modify an object after modifying it in parallel

SQL> COMMIT

Commit complete.

SQL> -- but we don't have any statistics
SQL> SELECT table_name, last_analyzed, num_rows
FROM user_tables
WHERE table_name IN ('T_MERGE')

TABLE_NAME                     LAST_ANALYZED NUM_ROWS
------------------------------ ------------- ----------
T_MERGE

1 row selected.
Listing 1 – No Online Statistics Gathering with MERGE
Unfortunately no. Despite doing a direct path write into an empty table, MERGE doesn’t piggyback optimizer statistics. Of course, it’s not a big problem; you just have to be aware of it. In one of my recent projects I was wondering why my dimensions didn’t have any statistics after the initial load. These dimensions were quite stable: once initially loaded, they rarely received new records. So I decided not to gather optimizer statistics after each load, but to rely on the standard Oracle statistics gathering job instead. And for the initial load, Online Statistics Gathering will do the job, right? NO! Because the load process was implemented using MERGE. So I had two options: re-implement the mappings using an INSERT, or explicitly gather optimizer statistics after the initial load (or still after each load). I opted for the latter, as sketched below.
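A minimal sketch of the explicit gathering step (using the table from Listing 1; the parameters shown are just an illustration):

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => USER,
    tabname => 'T_MERGE');
END;
/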
What exactly does "empty" mean?
An interesting question, and quite relevant for designing ETL processes. If we plan to implement some intermediate tables and rely on online statistics gathering, how can we ensure this works each time the load job is repeated? Let’s test it.
SQL> CREATE TABLE t_delete (id NUMBER, s VARCHAR2(20))

Table created.

SQL> CREATE TABLE t_truncate_reuse (id NUMBER, s VARCHAR2(20))

Table created.

SQL> CREATE TABLE t_truncate_drop (id NUMBER, s VARCHAR2(20))

Table created.

SQL> CREATE TABLE t_rollback (id NUMBER, s VARCHAR2(20))

Table created.

SQL> INSERT /*+ append */ INTO t_delete
SELECT level AS id
     , lpad (level, 20, '0') s
FROM dual
CONNECT BY level <= 10000

10000 rows created.

SQL> INSERT /*+ append */ INTO t_truncate_drop
SELECT level AS id
     , lpad (level, 20, '0') s
FROM dual
CONNECT BY level <= 10000

10000 rows created.

SQL> INSERT /*+ append */ INTO t_truncate_reuse
SELECT level AS id
     , lpad (level, 20, '0') s
FROM dual
CONNECT BY level <= 10000

10000 rows created.

SQL> COMMIT

Commit complete.

SQL> INSERT /*+ append */ INTO t_rollback
SELECT level AS id
     , lpad (level, 20, '0') s
FROM dual
CONNECT BY level <= 10000

10000 rows created.

SQL> -- Statistics are published without COMMIT
SQL> SELECT table_name, last_analyzed, num_rows
FROM user_tables
WHERE table_name = 'T_ROLLBACK'

TABLE_NAME                     LAST_ANALYZED NUM_ROWS
------------------------------ ------------- ----------
T_ROLLBACK                     14.04.18      10000

1 row selected.

SQL> ROLLBACK

Rollback complete.

SQL> -- but after ROLLBACK they are gone
SQL> SELECT table_name, last_analyzed, num_rows
FROM user_tables
WHERE table_name IN ('T_DELETE', 'T_ROLLBACK', 'T_TRUNCATE_DROP', 'T_TRUNCATE_REUSE')

TABLE_NAME                     LAST_ANALYZED NUM_ROWS
------------------------------ ------------- ----------
T_TRUNCATE_REUSE               14.04.18      10000
T_TRUNCATE_DROP                14.04.18      10000
T_ROLLBACK
T_DELETE                       14.04.18      10000

4 rows selected.

SQL> -- now delete rows and truncate tables
SQL> DELETE FROM t_delete

10000 rows deleted.

SQL> TRUNCATE TABLE t_truncate_drop DROP STORAGE

Table truncated.

SQL> TRUNCATE TABLE t_truncate_reuse REUSE STORAGE

Table truncated.

SQL> -- INSERT more data and check if statistics had been updated
SQL> INSERT /*+ append */ INTO t_delete
SELECT level AS id
     , lpad (level, 20, '0') s
FROM dual
CONNECT BY level <= 20000

20000 rows created.

SQL> INSERT /*+ append */ INTO t_truncate_drop
SELECT level AS id
     , lpad (level, 20, '0') s
FROM dual
CONNECT BY level <= 20000

20000 rows created.

SQL> INSERT /*+ append */ INTO t_truncate_reuse
SELECT level AS id
     , lpad (level, 20, '0') s
FROM dual
CONNECT BY level <= 20000

20000 rows created.

SQL> INSERT /*+ append */ INTO t_rollback
SELECT level AS id
     , lpad (level, 20, '0') s
FROM dual
CONNECT BY level <= 20000

20000 rows created.

SQL> COMMIT

Commit complete.

SQL> -- do we have the new statistics?
SQL> SELECT table_name, last_analyzed, num_rows
FROM user_tables
WHERE table_name IN ('T_DELETE', 'T_ROLLBACK', 'T_TRUNCATE_DROP', 'T_TRUNCATE_REUSE')

TABLE_NAME                     LAST_ANALYZED NUM_ROWS
------------------------------ ------------- ----------
T_ROLLBACK
T_DELETE                       14.04.18      10000
T_TRUNCATE_REUSE               14.04.18      10000
T_TRUNCATE_DROP                14.04.18      20000

4 rows selected.
Listing 2 – when will the segment be considered empty?
This example provides some quite interesting findings:
- After deleting all rows, the segment is not considered empty. The next INSERT /*+ append */ will not gather statistics.
- A rollback of a direct path INSERT leaves the segment in a state that is not considered empty. The next INSERT /*+ append */ will not gather statistics.
- After truncating the table with the DROP STORAGE option (the default), the next INSERT /*+ append */ will gather statistics. But using TRUNCATE TABLE … REUSE STORAGE prevents online statistics gathering afterwards.
- INSERT /*+ append */ publishes the optimizer statistics even before committing. Other sessions can see them. A ROLLBACK takes them away, though.
The only way to rely on online statistics gathering in a repeatable process without dropping the table is to use TRUNCATE TABLE without any option, or with the explicit DROP STORAGE option, before inserting new data – as in the sketch below.
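A minimal sketch of such a repeatable load step (t_stage is a placeholder name; the row generator mirrors the listings above):

TRUNCATE TABLE t_stage DROP STORAGE;  -- DROP STORAGE is the default anyway

INSERT /*+ APPEND */ INTO t_stage
SELECT level AS id
     , lpad(level, 20, '0') AS s
FROM dual
CONNECT BY level <= 10000;

COMMIT;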
The fact that the optimizer statistics are immediately available after the INSERT statement is indeed documented. But the placement of this sentence at the end of "10.3.3.2 Global Statistics During Inserts into Empty Partitioned Tables" is IMHO wrong, or at least understated.
Cursor invalidation
Initially I wanted to investigate the behavior concerning cursor invalidation, but I didn’t manage to get consistent test results. If you are gathering object statistics via DBMS_STATS, you have control over cursor invalidation using the parameter NO_INVALIDATE. With Online Statistics Gathering you have no such control. With the following script I believed I had proven that the cursor was invalidated immediately after the INSERT which gathered the stats, even without committing. But running it a few more times, I saw that it was not invalidated every time. On the other hand, if I turned the INSERT into a conventional load, the cursor was sometimes invalidated too, and sometimes not.
SQL> set linesize 300 pages 10000
SQL> column table_name format a30
SQL> DROP table t_rollback purge

Table dropped.

SQL> CREATE TABLE t_rollback (id NUMBER, s VARCHAR2(20))

Table created.

SQL> -- we will check if this cursor will be invalidated
SQL> SELECT /* TEST_CURSOR_INVALIDATION */ * FROM t_rollback

no rows selected.

SQL> COLUMN sql_id new_value sql_id
SQL> -- find the cursor in the shared pool
SQL> SELECT sql_id, LOADED_VERSIONS, EXECUTIONS, INVALIDATIONS
FROM v$sql s
WHERE sql_text LIKE '%TEST_CURSOR_INVALIDATION%'
AND UPPER(sql_text) NOT LIKE '%V$SQL%'

SQL_ID        LOADED_VERSIONS EXECUTIONS INVALIDATIONS
------------- --------------- ---------- -------------
027kac8gfsud8               1          1             0

1 row selected.

SQL> -- Insert data but not commit
SQL> INSERT /*+ append */ INTO t_rollback
SELECT level AS id
     , lpad (level, 20, '0') s
FROM dual
CONNECT BY level <= 10000

10000 rows created.

SQL> -- the statistics are already published
SQL> SELECT table_name, last_analyzed, num_rows
FROM user_tables
WHERE table_name = 'T_ROLLBACK'

TABLE_NAME                     LAST_ANALYZED NUM_ROWS
------------------------------ ------------- ----------
T_ROLLBACK                     14.04.18      10000

1 row selected.

SQL> -- and the cursor has been invalidated
SQL> SELECT sql_id, LOADED_VERSIONS, EXECUTIONS, INVALIDATIONS
FROM v$sql s
WHERE sql_id = '027kac8gfsud8'

SQL_ID        LOADED_VERSIONS EXECUTIONS INVALIDATIONS
------------- --------------- ---------- -------------
027kac8gfsud8               1          1             1

1 row selected.

SQL> ROLLBACK

Rollback complete.

SQL> -- statistics are gone after rollback
SQL> SELECT table_name, last_analyzed, num_rows
FROM user_tables
WHERE table_name = 'T_ROLLBACK'

TABLE_NAME                     LAST_ANALYZED NUM_ROWS
------------------------------ ------------- ----------
T_ROLLBACK

1 row selected.
Listing 3 – Cursor invalidation, the results are not consistent though!
At this point I had strong doubts whether it was worth further investigation. I still believe the cursor will be invalidated, but is that a problem within an ETL process? In my opinion, not really: online statistics gathering only kicks in when loading into an empty segment.
The following cases are possible:
- With CTAS there can be no cursors using THIS new segment to invalidate;
- With a direct path INSERT, there could be stats in place that were gathered on the empty segment, so it is highly preferable to invalidate all cursors after loading data;
- With a direct path INSERT, there are "old" stats in place that were gathered before the truncate. Well, in that case your cursors have to be re-optimized, but parse time is generally not a problem in data warehouse environments. If you are loading data which can differ significantly from load to load, then again, the invalidation will be desirable. If the data stays pretty much the same and cursor invalidation really is a problem for you, you could prevent re-gathering of already representative statistics by locking them or by using the NO_GATHER_OPTIMIZER_STATISTICS hint (see the sketch after this list).
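A minimal sketch of both options (t_stage is again a placeholder name):

-- option 1: lock the representative statistics
BEGIN
  DBMS_STATS.LOCK_TABLE_STATS(ownname => USER, tabname => 'T_STAGE');
END;
/

-- option 2: suppress online statistics gathering per statement
INSERT /*+ APPEND NO_GATHER_OPTIMIZER_STATISTICS */ INTO t_stage
SELECT level AS id, lpad(level, 20, '0') AS s
FROM dual
CONNECT BY level <= 10000;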
Some pitfalls and restrictions
There are some documented restrictions, but should you have doubts, remember to re-read that section of the documentation (and the relevant MOS notes) for your version. The explanations are better in 12.2. Some restrictions have been lifted, some of them even in 12.1, but this does not seem to be reflected in the documentation.
What kind of statistics?
You only get table and basic column statistics: no histograms, and no index statistics will be gathered. Recently I observed a nasty side effect of this. There was an index on a big table which occasionally was empty at the time the standard statistics maintenance job was active. After that, both the table and index statistics reflected that fact (number of distinct keys = 0). And while the table stats were re-gathered online, the index stats were not. And guess what? The optimizer chose to use the index statistics for its cardinality estimates, resulting in a nested loop. So you should cater for index statistics yourself, for example as sketched below.
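A minimal sketch of gathering index statistics explicitly after the load (the index name is a placeholder):

BEGIN
  DBMS_STATS.GATHER_INDEX_STATS(ownname => USER, indname => 'T_BIG_IX');
END;
/

Alternatively, DBMS_STATS.GATHER_TABLE_STATS with cascade => TRUE covers the table's indexes as well.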
Columns
Such a simple thing as not specifying all columns in the INSERT column list prevents online statistics gathering:
SQL> CREATE TABLE t_not_all_columns (id number, s varchar2(20))

Table created.

SQL> INSERT /*+ append */ INTO t_not_all_columns (ID)
SELECT level AS id
FROM dual
CONNECT BY level <= 10000

10000 rows created.

SQL> SELECT table_name, last_analyzed, num_rows
FROM user_tables
WHERE table_name IN ('T_NOT_ALL_COLUMNS')

TABLE_NAME                     LAST_ANALYZED NUM_ROWS
------------------------------ ------------- ----------
T_NOT_ALL_COLUMNS

1 row selected.
Listing 4 – not all columns specified – no statistics gathering
This was not mentioned at all in the 12.1 documentation, but 12.2 has a very good explanation: if all missing columns have default values, online statistics gathering will kick in. Now I asked myself: if this works, why should virtual columns be a problem, as the 12.1 documentation says? I tested it, and indeed it works with virtual columns even in 12.1 (Listing 5).
> SELECT * FROM v$version

BANNER                                                                               CON_ID
-------------------------------------------------------------------------------- ----------
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production              0
PL/SQL Release 12.1.0.2.0 - Production                                                    0
CORE 12.1.0.2.0 Production                                                                0
TNS for Linux: Version 12.1.0.2.0 - Production                                            0
NLSRTL Version 12.1.0.2.0 - Production                                                    0

> ----------------
> --not all columns
> DROP table t_not_all_columns purge

table T_NOT_ALL_COLUMNS dropped.

> CREATE TABLE t_not_all_columns
  ( id number
  , s varchar2(20))

table T_NOT_ALL_COLUMNS created.

> INSERT /*+ append */ INTO t_not_all_columns (ID)
  SELECT level AS id
  FROM dual
  CONNECT BY level <= 10000

10,000 rows inserted.

> SELECT table_name, last_analyzed, num_rows
  FROM user_tables
  WHERE table_name IN ('T_NOT_ALL_COLUMNS')

TABLE_NAME                     LAST_ANALYZED NUM_ROWS
------------------------------ ------------- ----------
T_NOT_ALL_COLUMNS

> -----------
> --default
> DROP table t_not_all_columns_default purge

table T_NOT_ALL_COLUMNS_DEFAULT dropped.

> CREATE TABLE t_not_all_columns_default
  ( id number
  , s varchar2(20) default 'ABC')

table T_NOT_ALL_COLUMNS_DEFAULT created.

> INSERT /*+ append */ INTO t_not_all_columns_default (ID)
  SELECT level AS id
  FROM dual
  CONNECT BY level <= 10000

10,000 rows inserted.

> SELECT table_name, last_analyzed, num_rows
  FROM user_tables
  WHERE table_name IN ('T_NOT_ALL_COLUMNS_DEFAULT')

TABLE_NAME                     LAST_ANALYZED NUM_ROWS
------------------------------ ------------- ----------
T_NOT_ALL_COLUMNS_DEFAULT      29-MAY-18     10000

> ----------
> --virtual
> DROP table t_not_all_columns_virtual purge

table T_NOT_ALL_COLUMNS_VIRTUAL dropped.

> CREATE TABLE t_not_all_columns_virtual
  ( id number
  , s varchar2(20) generated always as ('ABC') virtual)

table T_NOT_ALL_COLUMNS_VIRTUAL created.

> INSERT /*+ append */ INTO t_not_all_columns_virtual (ID)
  SELECT level AS id
  FROM dual
  CONNECT BY level <= 10000

10,000 rows inserted.

> SELECT table_name, last_analyzed, num_rows
  FROM user_tables
  WHERE table_name IN ('T_NOT_ALL_COLUMNS_VIRTUAL')

TABLE_NAME                     LAST_ANALYZED NUM_ROWS
------------------------------ ------------- ----------
T_NOT_ALL_COLUMNS_VIRTUAL      29-MAY-18     10000
Listing 5 – columns with default values and virtual columns
I should say I don’t really understand why not specifying a column that has no default value should be a problem. After all, if I gather statistics afterwards and some columns are completely empty, I’ll get statistics on them reflecting exactly that fact.
Extended Stats
There is a bug, 18425876: if you have created extended statistics before filling the table, this prevents online statistics gathering. I could reproduce this bug only in 12.1.0.1, though. MOS states it was fixed in 12.2, but it seems the fix has been backported to 12.1.0.2 too.
SQL> SELECT * FROM v$version;

BANNER                                                                               CON_ID
-------------------------------------------------------------------------------- ----------
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production              0
...

SQL> DROP table t_ext_stats purge;

Table dropped.

SQL> CREATE TABLE t_ext_stats
     ( id number
     , s varchar2(20));

Table created.

SQL> variable x varchar2(50)
SQL> exec :x:= DBMS_STATS.CREATE_EXTENDED_STATS(user, 'T_EXT_STATS','(ID,S)')

PL/SQL procedure successfully completed.

SQL> INSERT /*+ append */ INTO t_ext_stats (ID, S)
     SELECT level AS id, 'ABC'
     FROM dual
     CONNECT BY level <= 10000;

10000 rows created.

SQL> SELECT table_name, last_analyzed, num_rows
     FROM user_tables
     WHERE table_name IN ('T_EXT_STATS');

TABLE_NAME                     LAST_ANAL NUM_ROWS
------------------------------ --------- ----------
T_EXT_STATS
Listing 6 – extended statistics prevent online statistics gathering
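If you are on an affected version, one possible workaround (a sketch based on the listing above; I have not verified it on every patch level) is to create the extension only after the initial bulk load and gather the statistics explicitly at that point:

-- load the empty table first, without the extension in place
INSERT /*+ append */ INTO t_ext_stats (id, s)
SELECT level, 'ABC' FROM dual CONNECT BY level <= 10000;
COMMIT;

-- now create the extension and gather the statistics explicitly
variable x varchar2(50)
exec :x := DBMS_STATS.CREATE_EXTENDED_STATS(user, 'T_EXT_STATS', '(ID,S)')
exec DBMS_STATS.GATHER_TABLE_STATS(user, 'T_EXT_STATS')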
In the second part we’ll take a look at partitioning.