A few days ago Sven Weller has published an excellent post about writing conditional logic in SQL and it reminded me of an example I used in one of my presentations. It is also about conditional logic in SQL but maybe some less obvious use cases in context of function calls and sorting.
Assume we have a table ORDER_ITEMS.
SQL> SELECT * FROM order_items; ORDER_ID LINEITEM_ID STATUS SHIP_DT --------- ----------- -------- --------- 1 1 shipped 01-FEB-16 1 2 shipped 01-JAN-16 1 3 open 1 4 returned 01-DEC-15
Listing 1
The slightly artificial requirement is to report items „as is“, without grouping, along with the number of shipped and returned items and the first (oldest) shipping date of an item in „shipped“ state. Per order, of course. So the result should be as follows:
ORDER_ID LINEITEM_ID STATUS SHIP_DT SHIP_CNT RET_CNT FIRST_SHI -------- ----------- -------- --------- ---------- -------- --------- 1 2 shipped 01-JAN-16 2 1 01-JAN-16 1 1 shipped 01-FEB-16 2 1 01-JAN-16 1 4 returned 01-DEC-15 2 1 01-JAN-16 1 3 open 2 1 01-JAN-16
Listing 2
Well, since we don’t allowed to group but need to count, the first idea is to use analytic functions. But we need a kind of conditional count or two of them respectively: one for shipped and one for returned items. That is, we have to count rows only if the state has the particular value and skip all other rows. How we can do this?
Since the table we query can be huge, I assume we should avoid nesting and multiple scans of it. One of the possible solutions – and there are always a few – would be to use the described „conditional logic“ when calling analytic functions as shown in Listing 3.
What we are counting is the expression returned by the CASE-operator. We should return any NOT NULL column if the condition is met or NULL otherwise. Why NULL? Because of a special handling of NULL values by the COUNT aggregate function. COUNT(expr) returns the number of rows where expr is not null. See also Nulls and aggregate functions for more information. You can achieve the same effect using SUM and summarizing 1 and 0 values, if you find it to be more intuitive.
SQL> SELECT oi.order_id, oi.lineitem_id, oi.status, oi.ship_dt 2 , COUNT(CASE WHEN status='shipped' 3 THEN lineitem_id 4 ELSE NULL 5 END) 6 OVER (PARTITION BY order_id ) ship_cnt 7 , SUM(CASE WHEN status='shipped' 8 THEN 1 9 ELSE 0 10 END) 11 OVER (PARTITION BY order_id ) ship_cnt2 12 FROM order_items oi; ORDER_ID LINEITEM_ID STATUS SHIP_DT SHIP_CNT SHIP_CNT2 ---------- ----------- -------- --------- ---------- ---------- 1 1 shipped 01-FEB-16 2 2 1 4 returned 01-DEC-15 2 2 1 3 open 2 2 1 2 shipped 01-JAN-16 2 2
Listing 3
Well, this was the first part of the question. How to find the first shipping date of the items in shipped state? Let us first see how to do it wrong (Listing 4). 😉
Simply using FIRST_VALUE analytical function doesn’t return the correct date: we have found the oldest date, but the lineitem it belongs to is not in shipped state. Somehow we have to take the state into account. How about using the same approach as with counting rows (lines 6-11)? This returns null – why? What happens? We are ordering by SHIP_DT, the oldest date is ’01-DEC-15′, so we found the row in question, that is the line_item 4. Now what should the function FIRST_VALUE return from this row? We have said, the CASE expression should be returned and this CASE expression just evaluates to NULL, since the state of the row is „returned“, not „shipped“.
SQL> SELECT oi.order_id, oi.lineitem_id, oi.status, oi.ship_dt 2 , FIRST_VALUE(ship_dt) 3 OVER (PARTITION BY order_id 4 ORDER BY ship_dt) first_shipment_wrong 5 6 , FIRST_VALUE(CASE WHEN status='shipped' 7 THEN ship_dt 8 ELSE NULL 9 END) 10 OVER (PARTITION BY order_id 11 ORDER BY ship_dt) also_wrong 12 FROM order_items oi 13 ORDER BY order_id, lineitem_id; ORDER_ID LINEITEM_ID STATUS SHIP_DT FIRST_SHI ALSO_WRON ---------- ----------- -------- --------- --------- --------- 1 1 shipped 01-FEB-16 01-DEC-15 1 2 shipped 01-JAN-16 01-DEC-15 1 3 open 01-DEC-15 1 4 returned 01-DEC-15 01-DEC-15
Listing 4
We can correct this at least in two ways. First, there is an option to find non-null values: IGNORE NULLS clause, which has been introduced in 11.2. But be aware, with FIRST_VALUE we will also get wrong results because of Oracle’s default windowing clause "RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW"
(Listing 5 lines 2-6). We simply don’t allow the function to look ahead and find non-null values. Provide the correct windowing clause to fix this (lines 8-15). Another way is to use conditional expression while ordering. Doing so we only consider shipping dates with correct state (lines 17-23).
SQL> SELECT oi.order_id, oi.lineitem_id, oi.status, oi.ship_dt 2 , FIRST_VALUE(CASE WHEN status='shipped' 3 THEN ship_dt 4 ELSE NULL 5 END ) IGNORE NULLS 6 OVER (PARTITION BY order_id ORDER BY ship_dt) still_wrong 7 8 , FIRST_VALUE(CASE WHEN status='shipped' 9 THEN ship_dt 10 ELSE NULL 11 END ) IGNORE NULLS 12 OVER (PARTITION BY order_id 13 ORDER BY ship_dt 14 ROWS BETWEEN UNBOUNDED PRECEDING 15 AND UNBOUNDED FOLLOWING) now_correct 16 17 , FIRST_VALUE(ship_dt) 18 OVER (PARTITION BY order_id 19 ORDER BY CASE WHEN status='shipped' 20 THEN ship_dt 21 ELSE NULL 22 END 23 NULLS LAST) also_correct 24 FROM order_items oi 25 ORDER BY order_id, lineitem_id; ORDER_ID LINEITEM_ID STATUS SHIP_DT STILL_WRO NOW_CORRE ALSO_CORR ---------- ----------- -------- --------- --------- --------- --------- 1 1 shipped 01-FEB-16 01-JAN-16 01-JAN-16 01-JAN-16 1 2 shipped 01-JAN-16 01-JAN-16 01-JAN-16 01-JAN-16 1 3 open 01-JAN-16 01-JAN-16 01-JAN-16 1 4 returned 01-DEC-15 01-JAN-16 01-JAN-16
Listing 5
Putting it all together, we get the required result set (Listing 6). Putting the logic next to the function calls instead of the WHERE clause allows us to combine calculations in one query which otherwise would be spread across multiple queries with their own WHERE clauses.
SQL> SELECT oi.order_id, oi.lineitem_id, oi.status, oi.ship_dt 2 , COUNT(CASE WHEN status='shipped' 3 THEN lineitem_id 4 ELSE NULL 5 END) 6 OVER (PARTITION BY order_id ) ship_cnt 7 8 , COUNT(CASE WHEN status='returned' 9 THEN lineitem_id 10 ELSE NULL 11 END) 12 OVER (PARTITION BY order_id ) ret_cnt 13 14 , FIRST_VALUE(ship_dt) 15 OVER (PARTITION BY order_id 16 ORDER BY CASE WHEN status='shipped' 17 THEN ship_dt 18 ELSE NULL 19 END 20 NULLS LAST) first_shipment 21 FROM o12test.order_items oi 22 ; ORDER_ID LINEITEM_ID STATUS SHIP_DT SHIP_CNT RET_CNT FIRST_SHI -------- ----------- -------- --------- ---------- ---------- --------- 1 2 shipped 01-JAN-16 2 1 01-JAN-16 1 1 shipped 01-FEB-16 2 1 01-JAN-16 1 4 returned 01-DEC-15 2 1 01-JAN-16 1 3 open 2 1 01-JAN-16
Listing 6
Well, I admit, SQL statements of this kind are not so easy to understand and are difficult to maintain. I don’t want to judge whether they should be used. Now and then the performance gain can clearly outweigh the maintenance headaches, in other situation you can have no performance gain at all. As always, it depends. And I think one should always critically question the requirements, whether the complex things are really need to be done. I just wanted to show the approach as it clearly has to do with conditional logic in SQL. If you nevertheless end up with such SQL, remember it should be extensively commented and documented from day one. Believe me, next day you will not grasp any more what exactly your SQL should do.