Online Statistics Gathering for ETL – Part 1

Online Statistics Gathering was introduced in 12c and is a very handy feature for ETL and batch job developers. However, the devil is in the detail, and there are some points to remember. Let’s take a closer look.

Introduction

If you are not familiar with the concept of this feature yet, the official documentation is a good starting point, but I’ll try to give a short introduction here too. The official name of the feature is Online Statistics Gathering for Bulk Loads. So when exactly will table statistics be gathered automatically, and when not?

It’s quite clear with Create Table As Select (CTAS): they will be (if no restrictions apply). The other supported bulk-load scenario is INSERT INTO ... SELECT into an empty table using a direct path insert. That means your INSERT must either use the APPEND hint or be done in parallel. Don’t forget to enable parallel DML at the session level or with the ENABLE_PARALLEL_DML hint – the PARALLEL hint alone is not enough. Both variants are sketched below.
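To make this concrete, here is a minimal sketch of both bulk-load variants, using the same row generator as the listings below (the table names T_CTAS, T_TARGET and T_TARGET_PX are mine, purely illustrative):

-- CTAS: statistics are gathered as part of the CREATE
CREATE TABLE t_ctas AS
SELECT level AS id, lpad (level, 20, '0') s
FROM   dual
CONNECT BY level <= 10000;

-- direct path insert into an empty table via the APPEND hint
CREATE TABLE t_target (id NUMBER, s VARCHAR2(20));

INSERT /*+ append */ INTO t_target
SELECT level, lpad (level, 20, '0')
FROM   dual
CONNECT BY level <= 10000;
COMMIT;

-- parallel insert: enable parallel DML at the session level first
-- (or use the ENABLE_PARALLEL_DML hint); the PARALLEL hint alone
-- does not enable parallel DML
CREATE TABLE t_target_px (id NUMBER, s VARCHAR2(20));

ALTER SESSION ENABLE PARALLEL DML;

INSERT /*+ parallel(4) */ INTO t_target_px
SELECT level, lpad (level, 20, '0')
FROM   dual
CONNECT BY level <= 10000;
COMMIT;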

Parameter, Hints and Execution Plans

There is an underscore parameter, _optimizer_gather_stats_on_load, which is TRUE by default. Should you want to disable the feature, this parameter can be set to FALSE. There are also two optimizer hints, GATHER_OPTIMIZER_STATISTICS and NO_GATHER_OPTIMIZER_STATISTICS, but don’t misunderstand them! You cannot tell the optimizer to gather (piggyback) statistics online if the main condition (a bulk insert into an empty segment) is not met. These hints are only useful in conjunction with the global setting via _optimizer_gather_stats_on_load: if it is set to TRUE, you can disable the feature for individual statements via NO_GATHER_OPTIMIZER_STATISTICS, and vice versa, if it is globally set to FALSE, you can selectively enable piggybacking using the GATHER_OPTIMIZER_STATISTICS hint.
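For example (a sketch, reusing the illustrative table T_TARGET from above):

-- parameter at its default TRUE: suppress piggybacking per statement
INSERT /*+ append no_gather_optimizer_statistics */ INTO t_target
SELECT level, lpad (level, 20, '0')
FROM   dual
CONNECT BY level <= 10000;

-- parameter set to FALSE: request piggybacking per statement
INSERT /*+ append gather_optimizer_statistics */ INTO t_target
SELECT level, lpad (level, 20, '0')
FROM   dual
CONNECT BY level <= 10000;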

How do you know whether online statistics will be gathered? Running EXPLAIN PLAN, you will see an extra step, OPTIMIZER STATISTICS GATHERING.
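A statement like the following could produce such a plan (a sketch – the exact statement behind the plan below is not shown here; T and T2 are the tables appearing in it):

EXPLAIN PLAN FOR
INSERT /*+ append */ INTO t2
SELECT * FROM t;

SELECT * FROM table (dbms_xplan.display);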

-------------------------------------------------
| Id  | Operation                        | Name |
-------------------------------------------------
|   0 | INSERT STATEMENT                 |      |
|   1 |  LOAD AS SELECT                  | T2   |
|   2 |   OPTIMIZER STATISTICS GATHERING |      |
|   3 |    TABLE ACCESS FULL             | T    |
-------------------------------------------------

But in fact you cannot know for sure: this step only means that online statistics gathering will be considered at run time, because the optimizer doesn’t check whether the segment is actually empty at parse time.

What about MERGE?

This question comes to mind because the INSERT part of a MERGE will also be done using direct path writes if we use an APPEND hint or run it in parallel. In ETL scenarios it is not uncommon to load only those records that do not already exist in the target. And since you don’t have to provide both WHEN MATCHED and WHEN NOT MATCHED clauses (since 10g), one can construct a MERGE doing only an insert, which, from my point of view, is semantically the same as an INSERT INTO … SELECT with a NOT IN or NOT EXISTS subquery. The latter will gather statistics online. But what can we expect from MERGE? It is not mentioned in the documentation, but in my opinion it’s worth checking.


SQL> CREATE TABLE t_merge (id number, s varchar2(20))
Table created.

SQL> --- MERGE
SQL> MERGE /*+ append */ 
INTO   t_merge t
USING (
       SELECT level AS id , lpad (level, 20, '0') s
       FROM   dual
       CONNECT BY level <= 10000) q
ON   (q.id = t.id)
WHEN NOT MATCHED THEN INSERT VALUES (q.id, q.s)
10000 rows merged.

-- This error message means we were doing a bulk insert (direct path write) into T_MERGE
SQL> SELECT count(*) FROM t_merge
>> SELECT count(*) FROM t_merge
                     *
Error at line 1
ORA-12838: cannot read/modify an object after modifying it in parallel 

SQL> COMMIT
Commit complete.

SQL> -- but we don't have any statistics
SQL> SELECT table_name, last_analyzed, num_rows
FROM   user_tables 
WHERE  table_name IN ('T_MERGE')

TABLE_NAME                     LAST_ANALYZED   NUM_ROWS
------------------------------ ------------- ----------
T_MERGE                                                
1 row selected.

Listing 1 – No Online Statistics Gathering with MERGE

Unfortunately, no. Despite doing a direct path write into an empty table, MERGE doesn’t piggyback optimizer statistics. Of course it’s not a big problem; you just have to be aware of it. In one of my recent projects I was wondering why my dimensions didn’t have any statistics after the initial load. These dimensions were quite stable: once initially loaded, they only rarely received a few new records. So I had decided not to gather optimizer statistics after each load, but to rely on the standard Oracle statistics gathering job instead. And for the initial load, Online Statistics Gathering would do the job, right? No! Because the load process was implemented using MERGE. So I had two options: re-implement the mappings using an INSERT, or explicitly gather optimizer statistics after the initial load (or still after each load). I opted for the latter.
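Gathering the statistics explicitly after such a MERGE load is straightforward (a minimal sketch, using the demo table from Listing 1):

BEGIN
   dbms_stats.gather_table_stats (
      ownname => user,
      tabname => 'T_MERGE');
END;
/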

What exactly does “empty” mean?

An interesting question, and quite relevant for designing ETL processes. If we plan to implement some intermediate tables and rely on online statistics gathering, how do we ensure this works each time the load job is repeated? Let’s test it.

SQL> CREATE TABLE t_delete (id NUMBER, s VARCHAR2(20))
Table created.
SQL> CREATE TABLE t_truncate_reuse (id NUMBER, s VARCHAR2(20))
Table created.
SQL> CREATE TABLE t_truncate_drop (id NUMBER, s VARCHAR2(20))
Table created.
SQL> CREATE TABLE t_rollback (id NUMBER, s VARCHAR2(20))
Table created.
SQL> INSERT /*+ append */ INTO   t_delete
SELECT level AS id , lpad (level, 20, '0') s
FROM   dual
CONNECT BY level <= 10000
10000 rows created.

SQL> INSERT /*+ append */ INTO   t_truncate_drop 
SELECT level AS id , lpad (level, 20, '0') s
FROM   dual
CONNECT BY level <= 10000
10000 rows created.

SQL> INSERT /*+ append */ INTO   t_truncate_reuse
SELECT level AS id , lpad (level, 20, '0') s
FROM   dual
CONNECT BY level <= 10000
10000 rows created.
SQL> COMMIT
Commit complete.

SQL> INSERT /*+ append */ INTO   t_rollback
SELECT level AS id , lpad (level, 20, '0') s
FROM   dual
CONNECT BY level <= 10000
10000 rows created.

SQL> -- Statistics are published without COMMIT
SQL> SELECT table_name, last_analyzed, num_rows
FROM   user_tables 
WHERE  table_name = 'T_ROLLBACK'

TABLE_NAME                     LAST_ANALYZED   NUM_ROWS
------------------------------ ------------- ----------
T_ROLLBACK                     14.04.18           10000
1 row selected.

SQL> ROLLBACK
Rollback complete.

SQL> -- but after ROLLBACK they are gone
SQL> SELECT table_name, last_analyzed, num_rows
FROM   user_tables 
WHERE  table_name IN ('T_DELETE', 'T_ROLLBACK', 
                      'T_TRUNCATE_DROP', 'T_TRUNCATE_REUSE')

TABLE_NAME                     LAST_ANALYZED   NUM_ROWS
------------------------------ ------------- ----------
T_TRUNCATE_REUSE               14.04.18           10000
T_TRUNCATE_DROP                14.04.18           10000
T_ROLLBACK                                             
T_DELETE                       14.04.18           10000

4 rows selected.

SQL> -- now delete rows and truncate tables
SQL> DELETE FROM t_delete
10000 rows deleted.
SQL> TRUNCATE TABLE t_truncate_drop DROP STORAGE
Table truncated.
SQL> TRUNCATE TABLE t_truncate_reuse REUSE STORAGE
Table truncated.

SQL> -- INSERT more data and check if statistics had been updated
SQL> INSERT /*+ append */ INTO   t_delete
SELECT level AS id , lpad (level, 20, '0') s
FROM   dual
CONNECT BY level <= 20000
20000 rows created.

SQL> INSERT /*+ append */ INTO   t_truncate_drop 
SELECT level AS id , lpad (level, 20, '0') s
FROM   dual
CONNECT BY level <= 20000
20000 rows created.

SQL> INSERT /*+ append */ INTO   t_truncate_reuse
SELECT level AS id , lpad (level, 20, '0') s
FROM   dual
CONNECT BY level <= 20000
20000 rows created.

SQL> INSERT /*+ append */ INTO   t_rollback
SELECT level AS id , lpad (level, 20, '0') s
FROM   dual
CONNECT BY level <= 20000
20000 rows created.

SQL> COMMIT
Commit complete.

SQL> -- do we have the new statistics?
SQL> SELECT table_name, last_analyzed, num_rows
FROM   user_tables 
WHERE  table_name in ('T_DELETE', 'T_ROLLBACK', 
                      'T_TRUNCATE_DROP', 'T_TRUNCATE_REUSE')

TABLE_NAME                     LAST_ANALYZED   NUM_ROWS
------------------------------ ------------- ----------
T_ROLLBACK                                             
T_DELETE                       14.04.18           10000
T_TRUNCATE_REUSE               14.04.18           10000
T_TRUNCATE_DROP                14.04.18           20000

4 rows selected.

Listing 2 – when will the segment be considered empty?

This example provides some quite interesting findings:

  • After deleting all rows, the segment is not considered empty. The next INSERT /*+ append */ will not gather statistics.
  • A rollback of a direct path INSERT leaves the segment in a state that is not considered empty. The next INSERT /*+ append */ will not gather statistics.
  • After truncating the table with the DROP STORAGE option (the default), the next INSERT /*+ append */ will gather statistics. But again, using TRUNCATE TABLE … REUSE STORAGE prevents online statistics gathering afterwards.
  • INSERT /*+ append */ publishes the optimizer statistics even before committing, and other sessions can see them. A ROLLBACK takes them away, though.

The only way to rely on online statistics gathering in a repeatable process without dropping the table is to use TRUNCATE TABLE without any option, or with the explicit DROP STORAGE option, before inserting new data.
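A repeatable load pattern could thus look like this (a sketch; T_STAGE is an illustrative name):

TRUNCATE TABLE t_stage DROP STORAGE;  -- DROP STORAGE is the default anyway

INSERT /*+ append */ INTO t_stage
SELECT level, lpad (level, 20, '0')
FROM   dual
CONNECT BY level <= 10000;
COMMIT;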

The fact that the optimizer statistics are immediately available after the INSERT statement is indeed documented. But the placement of this sentence at the end of section 10.3.3.2, Global Statistics During Inserts into Empty Partitioned Tables, is IMHO wrong, or at least an understatement.

Cursor invalidation

Initially I wanted to investigate the behavior concerning cursor invalidation, but I didn’t manage to get consistent test results. If you gather object statistics via DBMS_STATS, you have control over cursor invalidation using the NO_INVALIDATE parameter. With Online Statistics Gathering you have no such control. With the following script I believed I could prove that the cursor was invalidated immediately after the INSERT that gathered the stats, even without committing. But running it a few more times, I saw that it was not invalidated every time. On the other hand, when I turned the INSERT into a conventional load, the cursor was sometimes invalidated too, and sometimes not.


SQL> set linesize 300 pages 10000
SQL> column table_name format a30
SQL> DROP table t_rollback purge
Table dropped.
SQL> CREATE TABLE t_rollback (id NUMBER, s VARCHAR2(20))
Table created.

SQL> -- we will check if this cursor will be invalidated
SQL> SELECT /* TEST_CURSOR_INVALIDATION */* FROM t_rollback
no rows selected.

SQL> COLUMN sql_id new_value sql_id

SQL> -- find the cursor in the shared pool
SQL> SELECT sql_id, LOADED_VERSIONS,EXECUTIONS,INVALIDATIONS
FROM   v$sql s
WHERE  sql_text LIKE '%TEST_CURSOR_INVALIDATION%' 
AND    UPPER(sql_text) NOT LIKE '%V$SQL%'

SQL_ID        LOADED_VERSIONS EXECUTIONS INVALIDATIONS
------------- --------------- ---------- -------------
027kac8gfsud8               1          1             0
1 row selected.

SQL> -- Insert data but not commit
SQL> INSERT /*+ append */ INTO   t_rollback
SELECT level AS id , lpad (level, 20, '0') s
FROM   dual
CONNECT BY level <= 10000
10000 rows created.

SQL> -- the statistics are already published
SQL> SELECT table_name, last_analyzed, num_rows
FROM   user_tables 
WHERE  table_name = 'T_ROLLBACK'

TABLE_NAME                     LAST_ANALYZED   NUM_ROWS
------------------------------ ------------- ----------
T_ROLLBACK                     14.04.18           10000
1 row selected.

SQL> -- and the Cursor has been invalidated
SQL> SELECT sql_id, LOADED_VERSIONS,EXECUTIONS,INVALIDATIONS
FROM   v$sql s
WHERE  sql_id = '027kac8gfsud8'

SQL_ID        LOADED_VERSIONS EXECUTIONS INVALIDATIONS
------------- --------------- ---------- -------------
027kac8gfsud8               1          1             1
1 row selected.

SQL> ROLLBACK
Rollback complete.

SQL> -- statistics are gone after rollback
SQL> SELECT table_name, last_analyzed, num_rows
FROM   user_tables 
WHERE  table_name = 'T_ROLLBACK'

TABLE_NAME                     LAST_ANALYZED   NUM_ROWS
------------------------------ ------------- ----------
T_ROLLBACK                                             
1 row selected.

Listing 3 – Cursor invalidation, the results are not consistent though!

At this point I had strong doubts whether it was worth further investigation. I still believe the cursor will be invalidated, but is that a problem within an ETL process? In my opinion, not really. Online statistics gathering only kicks in when loading into an empty segment.

The following cases are possible:

  • With CTAS there can be no cursors using THIS new segment, so there is nothing to invalidate;
  • With a direct path INSERT there could be stats in place that were gathered on the empty segment, so it is highly preferable to invalidate all cursors after loading the data;
  • With a direct path INSERT there could be “old” stats in place that were gathered before the truncate. Well, in that case your cursors have to be re-optimized, but parse time is generally not a problem in data warehouse environments. If you are loading data that can differ significantly from load to load, then again the invalidation is desirable. If the data is pretty much the same every time and cursor invalidation really is a problem for you, you can prevent re-gathering of already representative statistics by locking them or by using the NO_GATHER_OPTIMIZER_STATISTICS hint (see the sketch after this list).
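Both options are simple to implement (a sketch, again with the illustrative table T_TARGET):

-- lock the representative statistics; a subsequent direct path load
-- will then not overwrite them
BEGIN
   dbms_stats.lock_table_stats (user, 'T_TARGET');
END;
/

-- or suppress piggybacking for an individual load
INSERT /*+ append no_gather_optimizer_statistics */ INTO t_target
SELECT level, lpad (level, 20, '0')
FROM   dual
CONNECT BY level <= 10000;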

Some pitfalls and restrictions

There are some documented restrictions, but should you have doubts, remember to re-read that section (and the relevant MOS notes) for your version. The explanations in the 12.2 documentation are better: some restrictions were lifted, and some of them were lifted even in 12.1, but this does not seem to be reflected in the 12.1 documentation.

What kind of statistics?

You only get table and basic column statistics: no histograms, and no index statistics will be gathered. Recently I observed a nasty side effect of this. There was an index on a big table which occasionally happened to be empty at the time the standard statistics maintenance job was active. After that, both the table and the index statistics reflected that fact (number of distinct keys = 0). And while the table stats were re-gathered online with the next load, the index stats were not. And guess what? The optimizer chose the index statistics for its cardinality estimates, resulting in a nested loop. So you have to take care of index statistics yourself, for example as sketched below.
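A minimal sketch of catering for index statistics explicitly after a load (BIG_TABLE and BIG_TABLE_IDX are illustrative names):

-- just the index
BEGIN
   dbms_stats.gather_index_stats (user, 'BIG_TABLE_IDX');
END;
/

-- or the table together with all of its indexes
BEGIN
   dbms_stats.gather_table_stats (user, 'BIG_TABLE', cascade => TRUE);
END;
/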

Columns

Such a simple thing as not specifying all columns in the INSERT column list prevents online statistics gathering:


SQL> CREATE TABLE t_not_all_columns (id number, s varchar2(20))
Table created.

SQL> INSERT /*+ append */ INTO   t_not_all_columns (ID)
SELECT level AS id 
FROM   dual
CONNECT BY level <= 10000
10000 rows created.

SQL> SELECT table_name, last_analyzed, num_rows
FROM   user_tables 
WHERE  table_name IN ('T_NOT_ALL_COLUMNS')

TABLE_NAME                     LAST_ANALYZED   NUM_ROWS
------------------------------ ------------- ----------
T_NOT_ALL_COLUMNS                                      
1 row selected.

Listing 4 – not all columns specified – no statistics gathering

This was not mentioned at all in the 12.1 documentation, but 12.2 provides a very good explanation: if all missing columns have default values, online statistics gathering will kick in. Then I asked myself: if this works, why should virtual columns be a problem, as the 12.1 documentation claims? I tested it, and indeed it works with virtual columns even in 12.1 (Listing 5).


> SELECT * FROM v$version
BANNER                                                                               CON_ID
-------------------------------------------------------------------------------- ----------
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production              0
PL/SQL Release 12.1.0.2.0 - Production                                                    0
CORE	12.1.0.2.0	Production                                                                  0
TNS for Linux: Version 12.1.0.2.0 - Production                                            0
NLSRTL Version 12.1.0.2.0 - Production                                                    0

> ----------------
> --not all columns
> DROP table t_not_all_columns purge

table T_NOT_ALL_COLUMNS dropped.

> CREATE TABLE t_not_all_columns 
( id number
, s varchar2(20))

table T_NOT_ALL_COLUMNS created.

> INSERT /*+ append */ INTO   t_not_all_columns (ID)
SELECT level AS id 
FROM   dual
CONNECT BY level <= 10000
10,000 rows inserted.

> SELECT table_name, last_analyzed, num_rows
FROM   user_tables 
WHERE  table_name IN ('T_NOT_ALL_COLUMNS')
TABLE_NAME                     LAST_ANALYZED   NUM_ROWS
------------------------------ ------------- ----------
T_NOT_ALL_COLUMNS                                      

> -----------
> --default
> DROP table t_not_all_columns_default purge

table T_NOT_ALL_COLUMNS_DEFAULT dropped.

> CREATE TABLE t_not_all_columns_default 
( id number
, s varchar2(20) default 'ABC')

table T_NOT_ALL_COLUMNS_DEFAULT created.

> INSERT /*+ append */ INTO   t_not_all_columns_default (ID)
SELECT level AS id 
FROM   dual
CONNECT BY level <= 10000
10,000 rows inserted.

> SELECT table_name, last_analyzed, num_rows
FROM   user_tables 
WHERE  table_name IN ('T_NOT_ALL_COLUMNS_DEFAULT')
TABLE_NAME                     LAST_ANALYZED   NUM_ROWS
------------------------------ ------------- ----------
T_NOT_ALL_COLUMNS_DEFAULT      29-MAY-18          10000

> ----------
> --virtual
> DROP table t_not_all_columns_virtual purge

table T_NOT_ALL_COLUMNS_VIRTUAL dropped.

> CREATE TABLE t_not_all_columns_virtual 
( id number
, s varchar2(20) generated always as ('ABC') virtual)

table T_NOT_ALL_COLUMNS_VIRTUAL created.

> INSERT /*+ append */ INTO   t_not_all_columns_virtual (ID)
SELECT level AS id 
FROM   dual
CONNECT BY level <= 10000
10,000 rows inserted.

> SELECT table_name, last_analyzed, num_rows
FROM   user_tables 
WHERE  table_name IN ('T_NOT_ALL_COLUMNS_VIRTUAL')
TABLE_NAME                     LAST_ANALYZED   NUM_ROWS
------------------------------ ------------- ----------
T_NOT_ALL_COLUMNS_VIRTUAL      29-MAY-18          10000
 

Listing 5 – columns with default values and virtual columns

I should say, I don’t really understand why not specifying a column that has no default value should be a problem. After all, if I gather statistics afterwards and some columns are completely empty, I’ll get statistics on them reflecting exactly that fact.

Extended Stats

There is a bug, 18425876: if you have created extended statistics before filling the table, this prevents online statistics gathering. However, I could only reproduce the bug in 12.1.0.1. MOS states it was fixed in 12.2, but the fix seems to have been backported to 12.1.0.2 too.

SQL> SELECT * FROM v$version;

BANNER                                                                               CON_ID                                                                                                                                                                                                                 
-------------------------------------------------------------------------------- ----------                                                                                                                                                                                                                 
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production              0                                                                                                                                                                                                                 
...
SQL> 
SQL> 
SQL> DROP table t_ext_stats purge;

Table dropped.

SQL> 
SQL> CREATE TABLE t_ext_stats
  2  ( id number
  3  , s varchar2(20));

Table created.

SQL> 
SQL> variable x varchar2(50)
SQL> 
SQL> exec :x:= DBMS_STATS.CREATE_EXTENDED_STATS(user, 'T_EXT_STATS','(ID,S)')

PL/SQL procedure successfully completed.

SQL> 
SQL> INSERT /*+ append */ INTO	 t_ext_stats (ID, S)
  2  SELECT level AS id, 'ABC'
  3  FROM   dual
  4  CONNECT BY level <= 10000;

10000 rows created.

SQL> 
SQL> 
SQL> SELECT table_name, last_analyzed, num_rows
  2  FROM   user_tables
  3  WHERE  table_name IN ('T_EXT_STATS');

TABLE_NAME                     LAST_ANAL   NUM_ROWS                                                                                                                                                                                                                                                         
------------------------------ --------- ----------                                                                                                                                                                                                                                                         
T_EXT_STATS                                                                                                                                                                                                                                                                                                  

Listing 6 – extended statistics prevent online statistics gathering

In the second part we’ll take a look at partitioning.
