When Compression Expands: The Hidden Pitfalls of HCC

Hybrid Columnar Compression (HCC) in Oracle Exadata is a prime example of how physical infrastructure can significantly enhance the efficiency of data management and analysis, provided it is used correctly. It is also an area that developers and data engineers need to understand: what it is, how it works, and what to expect from it. If HCC is used in your project, everyone on the team should be aware of it, including the specifics of what is compressed, where, and how. Otherwise there may be surprises, as in the example I encountered recently: instead of the expected storage savings and improved performance, there was a huge increase in space consumption, along with negative performance effects. Read on to find out what happened.
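
If you are not sure whether and where HCC is already in play, the data dictionary is a quick first check. The following query is only a minimal sketch against the USER_* views (the DBA_* views give the database-wide picture, and partitions carry their own settings in the *_TAB_PARTITIONS views); it simply lists the tables whose current setting is one of the four HCC levels.

-- Which of my tables are defined with an HCC level?
select table_name, compression, compress_for
from   user_tables
where  compress_for in ('QUERY LOW', 'QUERY HIGH', 'ARCHIVE LOW', 'ARCHIVE HIGH')
order by table_name;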

There are a couple of good sources for getting started with Hybrid Columnar Compression, the Oracle documentation being the obvious first stop.

I assume you already have a basic understanding of how HCC works. To follow along, you should at least know that multiple consecutive data blocks make up a Compression Unit (CU). Multiple rows are stored compressed in a CU, and the data is organized by column (column-major), in contrast to the otherwise typical row-major storage in an Oracle database. All columns belonging to one row are always completely contained in a single CU.

Let’s start by creating an uncompressed table from a sample data set (Listing 1).

SQL> create table hcc_test_raw nocompress as
select lpad(rownum,10, '0') id
,      lpad(dbms_random.value,1000, 'x') text
from    dual connect by level <= 100000

Table created.

SQL> select  t.num_rows
,       t.avg_row_len
,       t.blocks
,       round(SUM(bytes)/1024/1024) AS actual_size_mb
,       sum(s.extents)
from    user_tables t
join    user_segments s on s.segment_name = t.table_name
where   table_name = 'HCC_TEST_RAW'
group by t.num_rows
,        t.avg_row_len
,        t.compress_for
,        t.blocks

  NUM_ROWS AVG_ROW_LEN     BLOCKS  ACTUAL_SIZE_MB SUM(S.EXTENTS)
---------- ----------- ----------  -------------- --------------
    100000        1012      14436             116             29
1 row selected.

Listing 1: Introducing the test dataset and an uncompressed table for comparison

Now let’s take the same data set but create the table using COMPRESS FOR QUERY HIGH this time (Listing 2).

SQL> create table hcc_test compress for query high as
select lpad(rownum,10, '0') id
,      lpad(dbms_random.value,1000, 'x') text
from    dual connect by level <= 100000

Table created.

SQL> select  t.num_rows
,       t.avg_row_len
,       t.blocks
,       t.compress_for
,       round(SUM(bytes)/1024/1024) AS actual_size_mb
,       sum(s.extents)
from user_tables t
join    user_segments s on s.segment_name = t.table_name
where table_name = 'HCC_TEST'
group by t.num_rows
,       t.avg_row_len
,       t.compress_for
,       t.blocks

  NUM_ROWS AVG_ROW_LEN     BLOCKS COMPRESS_FOR ACTUAL_SIZE_MB SUM(S.EXTENTS)
---------- ----------- ---------- ------------ -------------- --------------
    100000        1012        344 QUERY HIGH                4              1
1 row selected.

Listing 2: The same data set in a table which is compressed FOR QUERY HIGH

Very good space saving, from 29 extents down to just one, right? Using a script provided by Jonathan Lewis we can show that we have 84 Compression Units (CUs), mostly consisting of four consecutive blocks and most of them containing 1200+ rows. Keep in mind that dbms_rowid.rowid_block_number reports the header block of the CU in the case of HCC, and the gap to the next BLOCK_NO corresponds to the blocks belonging to the same CU.

break on report
compute sum of count(*) on report
compute count of block_no on report

select
   dbms_rowid.rowid_relative_fno(rowid) file_no,
   dbms_rowid.rowid_block_number(rowid) block_no,
count(*)
from hcc_test
group by
   dbms_rowid.rowid_relative_fno(rowid),
   dbms_rowid.rowid_block_number(rowid)
order by 1,2;

   FILE_NO   BLOCK_NO   COUNT(*)
---------- ---------- ----------
       120       1034       1035
       120       1037       1210
       120       1041       1209
       120       1045       1208
       120       1049       1208
       120       1053       1207

   ...

       120       1360       1208
       120       1364        793
           ----------           
count              84           
sum                       100000

Listing 3: Distribution of the rows across the Compression Units

HCC using conventional INSERT

When HCC was introduced with Oracle 11.2 and Exadata V2, we learned that rows only get compressed when they are written via direct path writes. Later, however, there was a change: starting with Oracle Database 12.2, a conventional INSERT ... SELECT without the APPEND hint will also use HCC. This sounds great if the use case is to add some data later to a segment that is compressed using HCC, as shown in Listing 4. Using the function get_compression_type from the Oracle-supplied package DBMS_COMPRESSION, we can prove that all rows in the table are compressed with the QUERY HIGH setting even though we have not used a direct path insert.

SQL> insert into hcc_test (id, text)    
select lpad(rownum,10, '0') id
,      lpad(dbms_random.value,1000, 'x') text
from    dual connect by level <= 10000
10000 rows created.
SQL> commit
Commit complete.
SQL> select  count(*),
        CASE DBMS_COMPRESSION.get_compression_type (user, 'HCC_TEST', rowid)
            WHEN 1  THEN 'COMP_NOCOMPRESS'
            WHEN 2  THEN 'COMP_FOR_OLTP'
            WHEN 4  THEN 'COMP_FOR_QUERY_HIGH'
            WHEN 8  THEN 'COMP_FOR_QUERY_LOW'
            WHEN 16 THEN 'COMP_FOR_ARCHIVE_HIGH'
            WHEN 32 THEN 'COMP_FOR_ARCHIVE_LOW'
            WHEN 64 THEN 'COMP_BLOCK'
       END AS compression_type
  from hcc_test 
  group by CASE DBMS_COMPRESSION.get_compression_type (user, 'HCC_TEST', rowid)
            WHEN 1  THEN 'COMP_NOCOMPRESS'
            WHEN 2  THEN 'COMP_FOR_OLTP'
            WHEN 4  THEN 'COMP_FOR_QUERY_HIGH'
            WHEN 8  THEN 'COMP_FOR_QUERY_LOW'
            WHEN 16 THEN 'COMP_FOR_ARCHIVE_HIGH'
            WHEN 32 THEN 'COMP_FOR_ARCHIVE_LOW'
            WHEN 64 THEN 'COMP_BLOCK'
       END

  COUNT(*) COMPRESSION_TYPE     
---------- ---------------------
    110000 COMP_FOR_QUERY_HIGH  
1 row selected.

Listing 4: Adding more data via conventional INSERT: the new data is compressed as well

This looks fine so far, but what if our use case involves not just adding new data, but completely replacing the table’s data? And what if, for some reason, we use a simple DELETE instead of TRUNCATE (DROP STORAGE)? Would HCC reuse the existing, now empty blocks? Let’s test.

-- Recreate the table from scratch first as in Listing 2
SQL> delete from hcc_test
100000 rows deleted.
SQL> insert into hcc_test (id, text)    
select lpad(rownum,10, '0') id
,      lpad(dbms_random.value,1000, 'x') text
from    dual connect by level <= 100000
100000 rows created.
SQL> commit
Commit complete.
SQL> begin
  dbms_stats.gather_table_stats(user,'HCC_TEST');
end;
 PL/SQL procedure successfully completed.

SQL> select  t.num_rows
,       t.avg_row_len
,       t.blocks
,       t.compress_for
,       round(SUM(bytes)/1024/1024) AS actual_size_mb
,       sum(s.extents)
from user_tables t
join    user_segments s on s.segment_name = t.table_name
where table_name = 'HCC_TEST'
group by t.num_rows
,       t.avg_row_len
,       t.compress_for
,       t.blocks

  NUM_ROWS AVG_ROW_LEN     BLOCKS COMPRESS_FOR  ACTUAL_SIZE_MB SUM(S.EXTENTS)
---------- ----------- ---------- ------------- -------------- --------------
    100000        1012       1016 QUERY HIGH                 8              2
1 row selected.

-- Repeat and check the segment size:
...
  NUM_ROWS AVG_ROW_LEN     BLOCKS COMPRESS_FOR  ACTUAL_SIZE_MB SUM(S.EXTENTS)
---------- ----------- ---------- ------------- -------------- --------------
    100000        1012       1520 QUERY HIGH                12              3

-- Repeat and check the segment size:
...
  NUM_ROWS AVG_ROW_LEN     BLOCKS COMPRESS_FOR  ACTUAL_SIZE_MB SUM(S.EXTENTS)
---------- ----------- ---------- ------------- -------------- --------------
    100000        1012       2024 QUERY HIGH                16              4

Listing 5: Conventional INSERT populates new extents each time

Each time we repeated the DELETE and INSERT ... SELECT pair, the segment grew by one extent (4 MB). Repeating the exercise from Listing 3, we now see a different picture:

   FILE_NO   BLOCK_NO   COUNT(*)
---------- ---------- ----------
       610      37120          1
       610      37121       1208
       610      37125       1086
       610      37127          1
       610      37129        123
       610      37130       1208
       610      37135          1
       610      37136        862
       610      37159       1208
       610      37164       1208
       610      37167          1
       610      37168       1086
      
...

       625      38839          1
       625      38847          1
       649      21865       1208
       649      21871          1
       649      21924       1084
       649      21950        125
           ----------           
count             165           
sum                       100000

165 rows selected.

Listing 6: The distribution of rows across CUs is much less uniform after conventional INSERT

The way CUs are created by a conventional INSERT ... SELECT is slightly different: we get significantly more CUs (about twice as many), some of them consist of only one block, and some contain far fewer rows (a few even just one). But that is not the main problem here. After each repetition we see completely different block numbers. The old blocks are not reused!
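
To quantify how much of the grown segment is dead weight, you can look at how the blocks below the high water mark are classified. The following PL/SQL block is only a sketch using DBMS_SPACE.SPACE_USAGE (it requires an ASSM tablespace, which is the norm here); the OUT parameters are passed positionally in the documented order: unformatted blocks/bytes, the four free-space buckets (FS1 = 0-25% free up to FS4 = 75-100% free), and finally full blocks/bytes.

set serveroutput on
declare
  l_unf_blocks  number; l_unf_bytes  number;
  l_fs1_blocks  number; l_fs1_bytes  number;  -- 0-25% free
  l_fs2_blocks  number; l_fs2_bytes  number;  -- 25-50% free
  l_fs3_blocks  number; l_fs3_bytes  number;  -- 50-75% free
  l_fs4_blocks  number; l_fs4_bytes  number;  -- 75-100% free
  l_full_blocks number; l_full_bytes number;
begin
  dbms_space.space_usage(user, 'HCC_TEST', 'TABLE',
                         l_unf_blocks,  l_unf_bytes,
                         l_fs1_blocks,  l_fs1_bytes,
                         l_fs2_blocks,  l_fs2_bytes,
                         l_fs3_blocks,  l_fs3_bytes,
                         l_fs4_blocks,  l_fs4_bytes,
                         l_full_blocks, l_full_bytes);
  dbms_output.put_line('full blocks: ' || l_full_blocks ||
                       ', mostly empty (FS4): ' || l_fs4_blocks ||
                       ', unformatted: ' || l_unf_blocks);
end;
/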

Okay, you may of course ask why anyone would implement an approach like this. Using DELETE instead of TRUNCATE before the INSERT can be inefficient even without the HCC aspect, but it is not the biggest sin; it may even be desirable from a data availability point of view, because the old rows stay visible to readers until the new load is committed. Only through the interaction with HCC does it become a disaster. And what if this is done by two people: one (a data engineer) implements the ETL this way, and the other (a DBA) enables compression later? That’s why it’s important to talk to each other.
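
For completeness, here is the variant of a full reload that plays well with HCC, using the same test data as above. TRUNCATE resets the high water mark, so no dead CUs are left behind, and the APPEND hint turns the reload into a direct path insert, which gives the best compression results. The trade-off: the table appears empty to readers between the TRUNCATE and the commit of the load.

-- HCC-friendly full reload of the test table (contrast with Listing 5)
truncate table hcc_test drop storage;

insert /*+ append */ into hcc_test (id, text)
select lpad(rownum, 10, '0') id
,      lpad(dbms_random.value, 1000, 'x') text
from   dual connect by level <= 100000;

commit;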

Materialized View Trap

And now to the real-world example I mentioned at the beginning. Another place where this behavior can easily be overlooked is the refresh of materialized views. If the parameter atomic_refresh=true is used (this is the default), this is exactly the case: the materialized view is refreshed by a combination of DELETE and INSERT in the same transaction. In my experience, this parameter is often left at the default even when there is no requirement for the data to stay visible during the refresh. Consider the following example:

SQL> create materialized view mv_hcc_test as 
select * from hcc_test
Materialized View created.

SQL> select  t.num_rows, t.avg_row_len, t.compress_for
,       round(SUM(bytes)/1024/1024) AS actual_size_mb
,       sum(s.extents)
from    user_tables t
join    user_segments s on s.segment_name = t.table_name
where table_name = 'MV_HCC_TEST'
group by t.num_rows, t.avg_row_len, t.compress_for

  NUM_ROWS AVG_ROW_LEN COMPRESS_FOR  ACTUAL_SIZE_MB SUM(S.EXTENTS)
---------- ----------- ------------- -------------- --------------
    100000        1012 QUERY HIGH                 4              1
1 row selected.

SQL> begin
  dbms_mview.refresh('MV_HCC_TEST',method=>'C');
end;
 PL/SQL procedure successfully completed.

SQL> select  t.num_rows, t.avg_row_len, t.compress_for
,       round(SUM(bytes)/1024/1024) AS actual_size_mb
,       sum(s.extents)
from    user_tables t
join    user_segments s on s.segment_name = t.table_name
where table_name = 'MV_HCC_TEST'
group by t.num_rows, t.avg_row_len, t.compress_for

  NUM_ROWS AVG_ROW_LEN COMPRESS_FOR  ACTUAL_SIZE_MB SUM(S.EXTENTS)
---------- ----------- ------------- -------------- --------------
    100000        1012 QUERY HIGH                 8              2
1 row selected.

Listing 7: After refresh with the same data the segment size has doubled

Consider the following scenario: as a data engineer, you have created a new materialized view with atomic refresh that is refreshed on a daily basis. You did not even think about compression at first. But what you did not know: your DBA colleague has set COMPRESS FOR QUERY HIGH as the default compression for the entire tablespace. After a year of daily refreshes, your materialized view will consume roughly 365 times the space it actually needs! And this also has an impact on performance: every full table scan has to deal with a segment 365 times larger than necessary.
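
A simple way to notice this kind of bloat early is to compare the segment size with a rough estimate based on the optimizer statistics. The query below is only a sketch; it assumes the statistics are current and the segment is not partitioned. For an HCC table the segment should be considerably smaller than the raw data estimate, so a segment that keeps growing beyond it points to a problematic refresh or load pattern.

select s.segment_name
,      round(s.bytes/1024/1024)                  as segment_mb
,      round(t.num_rows*t.avg_row_len/1024/1024) as approx_raw_data_mb
from   user_segments s
join   user_tables   t on t.table_name = s.segment_name
where  s.segment_name = 'MV_HCC_TEST';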

What can you do? Check whether atomic refresh is actually required, consider other methods such as out-of-place MV refresh, schedule a regular re-compression of the segment, and so on. Know the implications and talk to each other!
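
As a sketch of these mitigations, assuming the objects created in the listings above: a non-atomic complete refresh truncates the materialized view and reloads it via direct path, so the segment stays compact, at the price of the data not being visible during the refresh; and an occasional ALTER TABLE ... MOVE rebuilds a segment that has already grown too large, re-compressing it and resetting the high water mark.

-- Option 1: complete refresh without the atomic DELETE/INSERT combination
begin
  dbms_mview.refresh('MV_HCC_TEST', method => 'C', atomic_refresh => false);
end;
/

-- Option 2: rebuild a bloated segment (shown for the test table from Listing 5);
-- remember that indexes become unusable after a MOVE and must be rebuilt
alter table hcc_test move compress for query high;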
