Actually, the CBO usually does a good job optimizing “NOT IN” subqueries. Unless something goes wrong. Recently it took a long time on a 10.2.0.5 database to search for the reason. At last looking in the CBO trace files (event 10053) has given the right hint.
It was an “INSERT AS SELECT” DML-Statement, that suddenly took hours instead of a couple of minutes.
Let’s show it using following example:
CREATE TABLE t1 ( c1 NUMBER , c2 NUMBER , c3 NUMBER , other_infos VARCHAR2(100) ); CREATE TABLE t2 (c1 NUMBER, c2 NUMBER, c3 NUMBER); INSERT INTO t1 (c1, c2, c3, other_infos) SELECT c1, c2, c3, min(other_infos) FROM ( SELECT TRUNC(SYS.DBMS_RANDOM.VALUE(1,25000)) c1 , TRUNC(SYS.DBMS_RANDOM.VALUE(1,1000)) c2 , TRUNC(SYS.DBMS_RANDOM.VALUE(1,10)) c3 , TRUNC(SYS.DBMS_RANDOM.VALUE(0,100)) other_infos FROM DUAL CONNECT BY LEVEL <= 6000000 ) GROUP BY c1, c2, c3; INSERT INTO t2 (c1, c2, c3) SELECT TRUNC(SYS.DBMS_RANDOM.VALUE(1,25000)) , TRUNC(SYS.DBMS_RANDOM.VALUE(1,1000)) , TRUNC(SYS.DBMS_RANDOM.VALUE(1,10)) FROM DUAL CONNECT BY LEVEL <= 500000; BEGIN DBMS_STATS.GATHER_TABLE_STATS('SQLORA','T1'); DBMS_STATS.GATHER_TABLE_STATS('SQLORA','T2'); END; /
Listing 1
Please note that the INSERT-Statement on line 9 ensures the combination of C1, C2 und C3 to be unique. The table structure, data volume and data distribution are close to the reality now. But our example is not yet complete. I’ve kept back teh primary key to show how the query in question behaves without it. Our qeury have to select all records from T1 which are not in T2. This will be done with the help of NOT IN Subquery.
SQL> EXPLAIN PLAN FOR 2 SELECT t1.* 3 FROM t1 4 WHERE (t1.c1, t1.c2, t1.c3) 5 NOT IN (SELECT t2.c1, t2.c2, t2.c3 6 FROM t2); Explained. SQL> SQL> SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY(null, null, 'BASIC')); PLAN_TABLE_OUTPUT ------------------------------------------------------------------- Plan hash value: 895956251 ----------------------------------- | Id | Operation | Name | ----------------------------------- | 0 | SELECT STATEMENT | | | 1 | FILTER | | | 2 | TABLE ACCESS FULL| T1 | | 3 | TABLE ACCESS FULL| T2 | ----------------------------------- 10 rows selected.
Listing 2
There is no subquery unnesting possible yet. The FILTER operation, which is used by the Optimizer, acts as Nested Loops. That is, the subquery will be executed once for every row in the T1. Some kind of optimization takes place, so that the subquery runs once for unique value of join criteria, that is in turn the combination of C1, C2 und C3. That comes to nothing as we have ourselves made this combination of C1, C2, C3 unique.
In order for subquery unnesting in case of “NOT IN” to work, one have to ensure that there are no Nulls on both sides. Then the optimizer is able to fulfill an anti-join. This was true for the versions prior to 11g. Since 11g there are so called Null-Aware Anti-Joins there and there is no need to take care of nulls. This kind of Anti-Joins is well described here, but we don’t consider them.
We create a primary key on T1 to ensure that there are no Nulls in C1, C2 and C3. For T2 we include the corresponding condition in the Where-Clause of the query.
ALTER TABLE t1 ADD CONSTRAINT t1_pk PRIMARY KEY (c1, c2, c3); SQL> EXPLAIN PLAN FOR 2 SELECT t1.* 3 FROM t1 4 WHERE (t1.c1, t1.c2, t1.c3) 5 NOT IN (SELECT t2.c1, t2.c2, t2.c3 6 FROM t2 7 WHERE t2.c1 IS NOT NULL 8 AND t2.c2 IS NOT NULL 9 AND t2.c3 IS NOT NULL); Explained. SQL> SQL> SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY(null, null, 'BASIC')); PLAN_TABLE_OUTPUT ------------------------------------------------------------------- Plan hash value: 629543484 ------------------------------------- | Id | Operation | Name | ------------------------------------- | 0 | SELECT STATEMENT | | | 1 | HASH JOIN RIGHT ANTI| | | 2 | TABLE ACCESS FULL | T2 | | 3 | TABLE ACCESS FULL | T1 | ------------------------------------- 10 rows selected.
Listing 3
Our example is complete now. That was the state before performance problems began.
Loss of Performance
One of these days there was FILTER-Operation as in Listing 2 instead of Hash-Anti-Join in the execution plan. The query didn’t comeback after hours and had to be cancelled manually. I didn’t search for the root cause at the beginning, because it was planned to migrate the database to the new version in few weeks and because the problem should be fixed quickly. I was able to modify the source code if needed. So it was an easy task, wasn’t it?
My first idea was to use stored outlines, because the execution plan on the development and test database was still the old (good) one. But without success. The production database was ignoring them.
So then we need to change the source code? The prerequisites for an Anti-Join are still there. So if we provide a hint for the CBO, it should follow it?
SELECT t1.* FROM t1 WHERE (t1.c1, t1.c2, t1.c3) NOT IN (SELECT /*+ HASH_AJ */ t2.c1, t2.c2, t2.c3 FROM t2 WHERE t2.c1 IS NOT NULL AND t2.c2 IS NOT NULL AND t2.c3 IS NOT NULL);
Listing 4
Was not the case. On the test database I could also use MERGE_AJ or NL_AJ as Hints and the CBO was following them changing the execution plans respectively. Only the production database kept showing FILTER in the execution plan, no matter what I did!
To understand what was going on, we can let the CBO itself explain it – activating the trace event 10053.
SQL> SELECT VALUE FROM V$PARAMETER WHERE UPPER(NAME)='USER_DUMP_DEST'; VALUE ---------------------------------------------------------------------- /u00/app/oracle/admin/LLBE10/udump SQL> SQL> ALTER SESSION SET TRACEFILE_IDENTIFIER='MY_NOT_IN_TRACE'; Session altered. SQL> SQL> ALTER SESSION SET EVENTS '10053 trace name context forever, level 2'; Session altered. SQL> SQL> EXPLAIN PLAN FOR 2 SELECT t1.* 3 FROM t1 4 WHERE (t1.c1, t1.c2, t1.c3) 5 NOT IN (SELECT t2.c1, t2.c2, t2.c3 6 FROM t2 7 WHERE t2.c1 IS NOT NULL 8 AND t2.c2 IS NOT NULL 9 AND t2.c3 IS NOT NULL); Explained. SQL> SQL> ALTER SESSION SET EVENTS '10053 trace name context off'; Session altered.
Listing 5
We then search for the trace file with our identifier “MY_NOT_IN_TRACE” in the file name in the directory USER_DUMP_DEST (see line 5). I did it in the test and in production databases and then compared these two files.
***************************** Cost-Based Subquery Unnesting ***************************** SU: No subqueries to consider in query block SEL$2 (#2). SU: Considering subquery unnesting in query block SEL$1 (#1) SU: Performing unnesting that does not require costing. SU: Considering subquery unnest on SEL$1 (#1). SU: Checking validity of unnesting subquery SEL$2 (#2) SU: Passed validity checks. SU: Transform an NOT IN subquery to an anti-join.
Listing 6: The test system – everything fine
***************************** Cost-Based Subquery Unnesting ***************************** SU: No subqueries to consider in query block SEL$2 (#2). SU: Considering subquery unnesting in query block SEL$1 (#1) SU: Performing unnesting that does not require costing. SU: Considering subquery unnest on SEL$1 (#1). SU: Checking validity of unnesting subquery SEL$2 (#2) SU: SU bypassed: No correlation to immediate outer subquery. SU: Validity checks failed. *******************************
Listing 7: productive database – subquery unnesting not taking place
You can see that CBO could rewrite the NOT IN subquery into an anti-join in the first case (test) and in the second case (production) has considered this impossible. Why? I again checked the table structure, etc., the primary key on T1 (enabled). All the same … Well at least that was clear why the hints and the outlines were ignored: the CBO considered the transformation in which they could make a difference at all impossible.
It sounds like a bug after all. It would also perhaps be the right time to contact Oracle Support… I decided against it in my project situation because the database should be migrated in a few weeks to 11.2. and the workaround has to be implemented asap. The CBO is not doing its job? Then we’ll do it ourselves! We wanted that it transforms our NOT IN subquery into a hash anti-join. Then we write the hash anti-join ourselves:
SQL> EXPLAIN PLAN FOR 2 SELECT t1.* 3 FROM t1 LEFT JOIN t2 ON (t1.c1 = t2.c1 AND t1.c2 = t2.c2 AND t1.c3 = t2.c3) 4 WHERE t2.c1 IS NULL; Explained. SQL> SQL> SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY(null, null, 'BASIC')); PLAN_TABLE_OUTPUT --------------------------------------------------------------------- Plan hash value: 196153611 --------------------------------------- | Id | Operation | Name | --------------------------------------- | 0 | SELECT STATEMENT | | | 1 | FILTER | | | 2 | HASH JOIN RIGHT OUTER| | | 3 | TABLE ACCESS FULL | T2 | | 4 | TABLE ACCESS FULL | T1 | ---------------------------------------
Listing 8
The operation FILTER here should not be confused with that from Listing 2. It is to filter the results from outer join, which has already taken place, to make it an anti-join. Listing 9 shows that the overhead is quite acceptable.
SQL> INSERT INTO t3 2 SELECT t1.* 3 FROM t1 4 WHERE (t1.c1, t1.c2, t1.c3) 5 NOT IN (SELECT t2.c1, t2.c2, t2.c3 6 FROM t2 7 WHERE t2.c1 IS NOT NULL 8 AND t2.c2 IS NOT NULL 9 AND t2.c3 IS NOT NULL); 5907679 rows created. SQL> SQL> SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(null, null, 'ALLSTATS LAST')); SQL_ID 13fhw1c4kbkcc, child number 0 ------------------------------------- INSERT INTO t3 SELECT t1.* FROM t1 WHERE (t1.c1, t1.c2, t1.c3) NOT IN (SELECT t2.c1, t2.c2, t2.c3 FROM t2 WHERE t2.c1 IS NOT NULL AND t2.c2 IS NOT NULL AND t2.c3 IS NOT NULL) Plan hash value: 629543484 --------------------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads | OMem | 1Mem | Used-Mem | --------------------------------------------------------------------------------------------------------------------------- |* 1 | HASH JOIN RIGHT ANTI| | 1 | 5925K| 5907K|00:01:20.13 | 17154 | 17135 | 14M| 2366K| 20M (0)| |* 2 | TABLE ACCESS FULL | T2 | 1 | 498K| 500K|00:00:00.50 | 1182 | 1174 | | | | | 3 | TABLE ACCESS FULL | T1 | 1 | 5925K| 5920K|00:00:23.13 | 15972 | 15961 | | | | --------------------------------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - access("T1"."C1"="T2"."C1" AND "T1"."C2"="T2"."C2" AND "T1"."C3"="T2"."C3") 2 - filter(("T2"."C1" IS NOT NULL AND "T2"."C2" IS NOT NULL AND "T2"."C3" IS NOT NULL)) 22 rows selected. SQL> truncate table t3 reuse storage; Table truncated. SQL> SQL> INSERT INTO t3 2 SELECT t1.* 3 FROM t1 LEFT JOIN t2 ON (t1.c1 = t2.c1 AND t1.c2=t2.c2 AND t1.c3=t2.c3) 4 WHERE t2.c1 IS NULL; 5907679 rows created. SQL> SQL> SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(null, null, 'ALLSTATS LAST')); SQL_ID a0m9d3nr8d75d, child number 0 ------------------------------------- INSERT INTO t3 SELECT t1.* FROM t1 LEFT JOIN t2 ON (t1.c1 = t2.c1 AND t1.c2=t2.c2 AND t1.c3=t2.c3) WHERE t2.c1 IS NULL Plan hash value: 196153611 ----------------------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads | OMem | 1Mem | Used-Mem | ----------------------------------------------------------------------------------------------------------------------------- |* 1 | FILTER | | 1 | | 5907K|00:02:01.28 | 17154 | 17135 | | | | |* 2 | HASH JOIN RIGHT OUTER| | 1 | 1 | 5920K|00:01:35.56 | 17154 | 17135 | 14M| 2366K| 20M (0)| | 3 | TABLE ACCESS FULL | T2 | 1 | 498K| 500K|00:00:00.50 | 1182 | 1174 | | | | | 4 | TABLE ACCESS FULL | T1 | 1 | 5925K| 5920K|00:00:23.30 | 15972 | 15961 | | | | ----------------------------------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - filter("T2"."C1" IS NULL) 2 - access("T1"."C3"="T2"."C3" AND "T1"."C2"="T2"."C2" AND "T1"."C1"="T2"."C1") 22 rows selected.
Listing 9