NOT IN vs. NOT EXISTS: The Negation Battle #JoelKallmanDay

A year ago I wrote about the pitfalls of NULL values in NOT IN subqueries. This time I wanted to keep promises I made back then, dig a little deeper and also look at NOT EXISTS as another type of subquery used to filter out records based on negative matches. Is there any semantic difference and which is faster? Unfortunately, there are many myths and misconceptions around the usage of (NOT) IN and (NOT) EXISTS subqueries.

TL;DR

If a subquery may return NULL’s then using it with NOT IN is not semantically equivalent to NOT EXISTS. NULL’s in the NOT IN – subquery’s result will lead to empty result of the main query. If you don’t want this behavior:
- define NOT NULL constraints on corresponding column(s) if suitable
- use explicit IS NOT NULL conditions in your subquery
- use NVL() in the SELECT list
- rewrite the subquery as NOT EXISTS
As a rule, Oracle performs a query transformation called Subquery Unnesting making it possible to perform one of the ANTI JOIN operations: Nested Loops, Hash or Merge Anti join
There are Null-Aware version of these three join types making their use with NOT IN subqueries possible if NULL’s are expected
Even without Subquery Unnesting Oracle transforms NOT IN into NOT EXISTS using LNNVL() if NULL’s are expected

At least once a week I come across the next “complete list of tips for better SQL performance” thread on X or Linkedin. And almost always one of the bullet points is something like “always use EXISTS instead of IN” or similar. From my point of view, they are not very useful.

SQL is a fourth generation language. We describe what result we want, not how to achieve this result. If two syntactic alternatives are semantically interchangeable, then it shouldn’t matter which one you use. If there are significant performance differences, then it is due to the specific SQL engine. Ideally, there should be a query transformation that rewrites the query and ensures equally good performance. If this is not (yet) the case in a specific SQL engine, then recommendations such as “Always use this instead of that” should not be generalized, but with specific reference to the engine and version, and also clearly explain why, so that such statements can be easily verified again later.

This blog post relates to NOT IN and NOT EXISTS subqueries in the Oracle Database. For the examples in this post I’ll use Oracle’s sample schema Sales History (SH), specifically the table CUSTOMERS. I’ve created a non-partitioned copy of this table to make the execution plans more readable.

Are NOT IN and NOT EXISTS interchangeable?

In the CUSTOMERS table, all customers are classified into income groups. Let’s try to list all income groups that occur among customers from Cologne but are not represented among customers from Hamburg? We can use both NOT IN and NOT EXISTS for this task:

SELECT  DISTINCT c.cust_income_level
FROM    customers c
WHERE   c.cust_city = 'Koeln'
AND     c.cust_income_level NOT IN (SELECT c2.cust_income_level
                                    FROM   customers c2
                                    WHERE  c2.cust_city = 'Hamburg');

CUST_INCOME_LEVEL             
------------------------------
K: 250,000 - 299,999          
1 row selected.

SELECT  DISTINCT c.cust_income_level
FROM    customers c
WHERE   c.cust_city = 'Koeln'
AND     NOT EXISTS (SELECT 1
                    FROM   customers c2
                    WHERE  c2.cust_city = 'Hamburg'
                    AND    c2.cust_income_level=c.cust_income_level);

CUST_INCOME_LEVEL             
------------------------------
K: 250,000 - 299,999          
1 row selected.

Listing 1: We can use both NOT IN and NOT EXISTS to answer the question

At first sight, it looks good, they produce the same result. But you may remember from the previous post, that with NOT IN subquery you should pay attention to possible NULL values. Let’s temporary update records for income level group A in Hamburg to NULL (the column is nullable, so it is allowed)

SQL> UPDATE  customers c
SET     c.cust_income_level = NULL
WHERE   c.cust_city = 'Hamburg'
AND     c.cust_income_level LIKE 'A:%'

2 rows updated.

SQL> SELECT  DISTINCT c.cust_income_level
FROM    customers c
WHERE   c.cust_city = 'Koeln'
AND     c.cust_income_level NOT IN (SELECT c2.cust_income_level
                                    FROM   customers c2
                                    WHERE  c2.cust_city = 'Hamburg')

no rows selected.

SQL> SELECT  DISTINCT c.cust_income_level
FROM    customers c
WHERE   c.cust_city = 'Koeln'
AND     NOT EXISTS (SELECT 1
                    FROM   customers c2
                    WHERE  c2.cust_city = 'Hamburg'
                    AND    c2.cust_income_level=c.cust_income_level)

CUST_INCOME_LEVEL             
------------------------------
K: 250,000 - 299,999          
A: Below 30,000               

2 rows selected.

SQL> ROLLBACK
Rollback complete.

Listing 2: two queries are not semantically equal if NULL values are present

As expected, the presence of the NULL values in the result of the NOT IN subquery leads to empty result of the main query. But with NOT EXISTS we even get one more row for the income group A now. So if NULL values are to expect, the two queries are definitely semantically different and not interchangeable.

Why NULL’s are treated differently with NOT EXISTS? Of course, they are not. The comparison with NULL is now in the WHERE clause of the subquery. Comparing NULL with some value without using NVL() means the WHERE condition is not satisfied and thus, the subquery returns no rows. Which, in turn, leads to satisfying the NOT EXISTS condition in the main query!

By the way, it really doesn’t matter what you are selecting in EXISTS/NOT EXISTS subquery. This will not be evaluated. You can select 1, 5, NULL or even * – no difference. You can even SELECT 1/0 and you’ll get no “division by zero” error.

How are they executed?

A common misconception often leads discussions about the performance of sub-queries in the wrong direction: the idea that the way queries are written determines how they are executed. That is, for every row in the table of the main query the whole uncorrelated NOT IN subquery will be executed. And the same for NOT EXISTS but this time this will be a correlated subquery and for every execution it will have an additional filter coming from the main query. Wrong!

In some rare situations the execution can be more like this, but in general Oracle performs a query transformation called Subquery Unnesting to unnest the subquery putting the referenced tables in the FROM clause of the main query, thus making correlated or uncorrelated subqueries to be a join – a special kind of join: a Semi-Join for IN and EXISTS and an Anti-Join for NOT IN and NOT EXISTS. After that, the optimizer can choose from all available join methods: nested loop, hash or sort-merge join (more precisely, the semi or anti-join variants of them).

Let’s first show how it works without subquery unnesting. To see this, I have to disable the unnesting using the hint NO_UNNEST:

SQL> SELECT  DISTINCT c.cust_income_level
FROM    customers c
WHERE   c.cust_city = 'Koeln'
AND     NOT EXISTS (SELECT /*+ no_unnest */ 1
                    FROM   customers c2
                    WHERE  c2.cust_city = 'Hamburg'
                    AND    c2.cust_income_level=c.cust_income_level)

CUST_INCOME_LEVEL             
------------------------------
K: 250,000 - 299,999          
1 row selected.

---------------------------------------------------------------------------------
| Id  | Operation           | Name      | Starts | E-Rows | A-Rows |   A-Time   |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |           |      1 |        |      1 |00:00:00.02 |
|   1 |  HASH UNIQUE        |           |      1 |     12 |      1 |00:00:00.02 |
|*  2 |   FILTER            |           |      1 |        |     10 |00:00:00.02 |
|*  3 |    TABLE ACCESS FULL| CUSTOMERS |      1 |    474 |    532 |00:00:00.01 |
|*  4 |    TABLE ACCESS FULL| CUSTOMERS |     12 |      5 |     11 |00:00:00.02 |
---------------------------------------------------------------------------------
                                                                                 
Predicate Information (identified by operation id):                              
---------------------------------------------------                              
                                                                                 
   2 - filter( NOT EXISTS (SELECT /*+ NO_UNNEST */ 0 FROM "CUSTOMERS" 
              "C2" WHERE "C2"."CUST_CITY"='Hamburg' AND "C2"."CUST_INCOME_LEVEL"=:B1))                                                        
   3 - filter("C"."CUST_CITY"='Koeln')                                           
   4 - filter(("C2"."CUST_CITY"='Hamburg' AND "C2"."CUST_INCOME_LEVEL"=:B1)) 

SQL> SELECT  COUNT(DISTINCT c.cust_income_level)
FROM    customers c
WHERE   c.cust_city = 'Koeln'

COUNT(DISTINCTC.CUST_INCOME_LEVEL)
----------------------------------
                                12
1 row selected.

Listing 3: NOT EXISTS subquery without unnesting

The FILTER-operation works much like nested loops. For each of 532 rows from CUSTOMER filtered in step 3 of the execution plan, the table CUSTOMERS will be queried again in step 4 using the filter “C2”.”CUST_CITY”=’Hamburg’ AND “C2”.”CUST_INCOME_LEVEL”=:B1. But is it really so? No! We can see a kind of optimization here. There are 532 customers in Cologne (A-Rows of step 3) and they are classified into 12 different groups (see the additional query at the end). With that, the step 4 is only being started 12 times, once per unique value of the join column. You can see it in the column “Starts” for step 4.

And what’s with NOT IN? The execution plan looks very much the same, except for one small detail: the filter in the steps 4 and 2 contains LNNVL(“C2”.”CUST_INCOME_LEVEL”<>:B1).

SELECT  DISTINCT c.cust_income_level     
FROM    customers c     
WHERE   c.cust_city = 'Koeln'     
AND     c.cust_income_level NOT IN (SELECT /*+ no_unnest */ c2.cust_income_level   
                                    FROM   customers c2  
                                    WHERE  c2.cust_city = 'Hamburg') 

---------------------------------------------------------------------------------
| Id  | Operation           | Name      | Starts | E-Rows | A-Rows |   A-Time   |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |           |      1 |        |      1 |00:00:00.01 |
|   1 |  HASH UNIQUE        |           |      1 |     12 |      1 |00:00:00.01 |
|*  2 |   FILTER            |           |      1 |        |     10 |00:00:00.01 |
|*  3 |    TABLE ACCESS FULL| CUSTOMERS |      1 |    474 |    532 |00:00:00.01 |
|*  4 |    TABLE ACCESS FULL| CUSTOMERS |     12 |      5 |     11 |00:00:00.01 |
---------------------------------------------------------------------------------
                                                                                 
Predicate Information (identified by operation id):                              
---------------------------------------------------                              
                                                                                 
   2 - filter( NOT EXISTS (SELECT /*+ NO_UNNEST */ 0 FROM "CUSTOMERS" 
              "C2" WHERE "C2"."CUST_CITY"='Hamburg' AND 
              LNNVL("C2"."CUST_INCOME_LEVEL"<>:B1)))                                                       
   3 - filter("C"."CUST_CITY"='Koeln')                                           
   4 - filter(("C2"."CUST_CITY"='Hamburg' AND                                    
              LNNVL("C2"."CUST_INCOME_LEVEL"<>:B1)))        

-- This query will be semantically the same
SELECT  DISTINCT c.cust_income_level
FROM    customers c
WHERE   c.cust_city = 'Koeln'
AND     NOT EXISTS (SELECT /*+ no_unnest */ 1
                    FROM   customers c2
                    WHERE  c2.cust_city = 'Hamburg'
                    AND    LNNVL(c2.cust_income_level<>c.cust_income_level));

Listing 4: NOT IN subquery is rewritten as NOT EXISTS without unnesting

This function LNNVL (documented since 11g) is used to take care of possible NULL’s and rewrite the NOT IN into NOT EXISTS subquery. Again, event without subquery unnesting, a NOT IN was rewritten as NOT EXISTS. From my point of view, it doesn’t make any sense to do it yourself, unless you’re not happy with NULL handling using NOT IN.

Now let’s look at the execution plans in case subquery unnesting takes place, staring with NOT IN. In Oracle versions prior to 11g subquery unnesting has only worked if NULL’s were excluded on both sides, like shown in Listing 5. Both queries are restricting optimizer features to the old 10g version. The first query shows the same plan with FILTER-operation like we have already seen as we disabled the unnesting with the hint NO_UNNEST. The second query has two additional IS NOT NULL conditions on lines 23 and 27 (of course, NOT NULL constraints defined for these columns would also work), which enable subquery unnesting and thus the much faster and scalable HASH JOIN instead of FILTER.

SQL> SELECT /*+ optimizer_features_enable('10.2.0.4') */ 
           DISTINCT c.cust_income_level
    FROM    customers c
    WHERE   c.cust_city = 'Koeln'
    AND     c.cust_income_level NOT IN (SELECT c2.cust_income_level
                                        FROM   customers c2
                                        WHERE  c2.cust_city = 'Hamburg')

--------------------------------------------------------------------
| Id  | Operation           | Name      | Starts | E-Rows | A-Rows |
--------------------------------------------------------------------
|   0 | SELECT STATEMENT    |           |      1 |        |      1 |
|   1 |  HASH UNIQUE        |           |      1 |     12 |      1 |
|*  2 |   FILTER            |           |      1 |        |     10 |
|*  3 |    TABLE ACCESS FULL| CUSTOMERS |      1 |    474 |    532 |
|*  4 |    TABLE ACCESS FULL| CUSTOMERS |     12 |      5 |     11 |
--------------------------------------------------------------------

SQL> SELECT /*+ optimizer_features_enable('10.2.0.4') */ 
           DISTINCT c.cust_income_level
    FROM    customers c
    WHERE   c.cust_city = 'Koeln'
    AND     c.cust_income_level IS NOT NULL
    AND     c.cust_income_level NOT IN (SELECT c2.cust_income_level
                                        FROM   customers c2
                                        WHERE  c2.cust_city = 'Hamburg'
                                        AND    c2.cust_income_level IS NOT NULL);

----------------------------------------------------------------------
| Id  | Operation             | Name      | Starts | E-Rows | A-Rows |
----------------------------------------------------------------------
|   0 | SELECT STATEMENT      |           |      1 |        |      1 |
|   1 |  HASH UNIQUE          |           |      1 |      1 |      1 |
|*  2 |   HASH JOIN RIGHT ANTI|           |      1 |      1 |     10 |
|*  3 |    TABLE ACCESS FULL  | CUSTOMERS |      1 |     60 |     44 |
|*  4 |    TABLE ACCESS FULL  | CUSTOMERS |      1 |    474 |    532 |
----------------------------------------------------------------------

Listing 5: no subquery unnesting before 11g if NULL’s are expected

I remember those times around 2009 when the customer was quite astonished seeing the query’s runtime became orders of magnitude faster just by adding those NOT NULL conditions. But that has long been a thing of the past. Since Oracle 11g the database was able to do subquery unnesting and perform a HASH ANTI JOIN even if NULL’s are to expect. This operation is called Null-Aware Anti Join, see Listing 6.

SQL> SELECT  DISTINCT c.cust_income_level
    FROM    customers c
    WHERE   c.cust_city = 'Koeln'
    AND     c.cust_income_level NOT IN (SELECT c2.cust_income_level
                                        FROM   customers c2
                                        WHERE  c2.cust_city = 'Hamburg')

CUST_INCOME_LEVEL             
------------------------------
K: 250,000 - 299,999          
1 row selected.

-------------------------------------------------------------------------
| Id  | Operation                | Name      | Starts | E-Rows | A-Rows |
-------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |           |      1 |        |      1 |
|   1 |  HASH UNIQUE             |           |      1 |      5 |      1 |
|*  2 |   HASH JOIN RIGHT ANTI NA|           |      1 |      5 |     10 |
|*  3 |    TABLE ACCESS FULL     | CUSTOMERS |      1 |     60 |     44 |
|*  4 |    TABLE ACCESS FULL     | CUSTOMERS |      1 |    474 |    532 |
-------------------------------------------------------------------------

SQL> SELECT  DISTINCT c.cust_income_level
    FROM    customers c
    WHERE   c.cust_city = 'Koeln'
    AND     c.cust_income_level NOT IN (SELECT c2.cust_income_level
                                        FROM   customers c2
                                        WHERE  c2.cust_city = 'Hamburg'
                                        AND    c2.cust_income_level IS NOT NULL)

CUST_INCOME_LEVEL             
------------------------------
K: 250,000 - 299,999          
1 row selected.

--------------------------------------------------------------------------
| Id  | Operation                 | Name      | Starts | E-Rows | A-Rows |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |           |      1 |        |      1 |
|   1 |  HASH UNIQUE              |           |      1 |      5 |      1 |
|*  2 |   HASH JOIN RIGHT ANTI SNA|           |      1 |      5 |     10 |
|*  3 |    TABLE ACCESS FULL      | CUSTOMERS |      1 |     60 |     44 |
|*  4 |    TABLE ACCESS FULL      | CUSTOMERS |      1 |    474 |    532 |
--------------------------------------------------------------------------

Listing 6: Null-aware anti joins from 11g

You can see it as HASH JOIN ANTI NA or HASH JOIN RIGHT ANTI NA in the execution plans, depending on what table the optimizer chooses to drive the join. If NULL is only to expect on the one side, you will see a Single-Null-Aware-Join as HASH JOIN ANTI SNA or HASH JOIN RIGHT ANTI SNA. Both NA and SNA Null-Aware options also exist for Nested Loop and Merge Anti Joins.

Does “null-aware” mean the NOT IN query will behave differently if there are NULL’s? No! You still get empty result if the subquery returns some NULL’s. It just means Oracle is able to unnest the subquery and perform one of the null-aware anti join operations.

Conclusion

Whenever you are in doubt as to which option you should use, it would be nice to have a fixed neuronal connection in your brain: NOT IN – NOT EXISTS – NULL. First, check if NULL’s are possible. If you are working with constrained NOT NULL columns, you’re fine. Otherwise you will get empty result if the NOT IN subquery selects any NULL’s. This is technically correct, but rarely the expected result. You can add explicit IS NOT NULL predicates to the NOT IN subquery or some NVL() in the SELECT list or rewrite it to NOT EXISTS. NOT EXISTS with equality predicate will treat NULL’s as “not equal”.

From a performance point of view, there should be no reason to switch between these options.

Null Values? NOT IN my Subquery! #JoelKallmanDay

SQLORA

Just some more Oracle stuff

NOT IN vs. NOT EXISTS: The Negation Battle #JoelKallmanDay

TL;DR

Are NOT IN and NOT EXISTS interchangeable?

How are they executed?

Conclusion

Related Posts

Leave a Reply Cancel reply