Introduction

In my last article, I have discussed about the processing of Oracle text indexes. Now that we have understood how a Oracle text index is created and how text search is performed, let's talk about the Near Real Time Indexes feature which is introduced with Oracle Text 12c version.

In the previous versions of Oracle Text, there was always a trade-off between the latency (time taken to sync index) and fragmentation of text indexes. Oracle text indexes differ significantly from the traditional B-Tree or Bitmap indexes in terms of the index structure and processing. With Oracle Text, indexes are created by decomposing the text document (record) into set of tokens and those tokens are indexed and then looked up for query processing. In Oracle text, the indexes are not updated in real time, rather the token indexes are synchronized periodically or immediately (on commit) based on the index configuration. However, since the size of $I "chunks" (tokens) written to the indexes depend on the number of records synced at a time, this leads to excessive fragmentation of the $I table and that leads to the need to run index optimization more often, and for longer duration.

In 12c, Oracle introduced the concept of two-level indexes under the feature named Near Real Time Indexes. This feature allows us to have a small, fragmented index (intermediate) containing all the recent updates (tokens) to the table, without having to modify the large standard index. The aim of this implementation is that the small index should be small enough to comfortably fit into SGA memory, so the fragmentation of this small index (containing recent updates) is not really an issue.

This feature is turned on with a STORAGE preference STAGE_ITAB set to TRUE. Once we turn on this feature, a new table called DR${index_name}$G is created to hold the recent updates (tokens) to the table data. This tables acts as a intermediary staging table for the recent updates to text indexes and is structurally similar to the $I table (standard index table) and the tokens in this table are indexed by means of a B-Tree index with name having the format DR${index_name}$H. All the tokens from recent updates goes to the new intermediary table and the tokens from this table are periodically merged with the tokens of $I table (standard index table) to synchronize the text index with text data.

Text Index with manual sync

Before jumping into the Near Real Time Indexes feature of Oracle text index, let's examine and collect some facts about the existing methods of text index maintenance. In the first example, I am going to walk through the manual method of text index synchronization and analyze it's pros and cons.

In the following example, I am creating a text index with name BLOG_DOC_TXT_IDX on the field DOC_DATA of table BLOG_DOCS.

---//
---// create a table for demonstration (will index the CLOB data) //---
---//
SQL> create table blog_docs
  2  (
  3  doc_id     number not null,
  4  doc_title  varchar2(25) not null,
  5  doc_data   clob not null
  6  );

Table created.

We have created a table to hold some documents. Now, we will create a Oracle text index of type CONTEXT on the DOC_DATA field and examine the structure of the text index. I am using the simplistic syntax to let Oracle use the default preferences, which will also require to manually sync the index periodically to have the index intact with text data.

---//
---// create CONTEXT index on doc_data //---
---//
SQL> create index blog_doc_txt_idx on blog_docs (doc_data)
  2  indextype is ctxsys.context;

Index created.

Let's populate the table with few records.

---//
---// populate the table with few records //---
---//
SQL> insert into blog_docs (doc_id, doc_title, doc_data)
  2  values (1, 'Sample Doc', 'This is the first sample document')
  3  ;

1 row created.

SQL> insert into blog_docs (doc_id, doc_title, doc_data)
  2  values (2, 'Doc 2', 'Yet another sample document');

1 row created.

SQL> insert into blog_docs (doc_id, doc_title, doc_data)
  2  values (3, 'Doc 3', 'Third sample document');

1 row created.

SQL> commit;

Commit complete.

Now, since text indexes are not updated at real time and the fact that we did not configure our index to synchronize on commit, we could see there are no tokens present in the $I table, even though we have inserted few records.

---//
---// no tokens are indexed //---
---//
SQL> select * from DR$BLOG_DOC_TXT_IDX$I;

no rows selected

Since the tokens (from the new records) are not yet populated in the $I table, any attempt to perform a text search against the new records will not return the result-set even though the record exists in the tables as shown below.

---//
---// Text search not showing any records //---
---//
SQL> select doc_title,doc_data from blog_docs where contains(doc_data,'sample') > 0;

no rows selected

---//
---// normal search returns the records //---
---//
SQL> select doc_title,doc_data from blog_docs where doc_data like '%sample%';

DOC_TITLE                 DOC_DATA
------------------------- ------------------------------------------------------------
Sample Doc                This is the first sample document
Doc 2                     Yet another sample document
Doc 3                     Third sample document

We must sync the text index to be able to leverage the text search functionality. As we know, we can do that manually using the CTX_DDL.SYNC_INDEX procedure or configure the text index to sync on every commit. Let's sync our index as shown below.

---//
---// syncing text index with the base table //---
---//
SQL> exec ctx_ddl.sync_index('BLOG_DOC_TXT_IDX');

PL/SQL procedure successfully completed.

Now, if we query the $I table (DR$BLOG_DOC_TXT_IDX$I), we could see it has all the tokens (words) which are indexed within the document.

---//
---// list of indexed tokens //---
---//
SQL> select * from DR$BLOG_DOC_TXT_IDX$I;

TOKEN_TEXT                TOKEN_TYPE TOKEN_FIRST TOKEN_LAST TOKEN_COUNT TOKEN_INFO
------------------------- ---------- ----------- ---------- ----------- -------------------------
ANOTHER                            0           2          2           1 008802
DOCUMENT                           0           1          3           3 008806018804018803
FIRST                              0           1          1           1 008804
IS                                 0           1          1           1 008802
SAMPLE                             0           1          3           3 008805018803018802
THE                                0           1          1           1 008803
THIRD                              0           3          3           1 008801
THIS                               0           1          1           1 008801
YET                                0           2          2           1 008801

9 rows selected.

The moment we sync the index, we could see all the tokens are populated to the $I table. We could also notice that there are no duplicate tokens as the tokens were optimized during the synchronization process. If we perform a text search, we could see that Oracle is able to leverage the text search functionality as shown below.

---//
---// text search is working after index synchronization //---
---//
SQL> select doc_title,doc_data from blog_docs where contains(doc_data,'sample') > 0;

DOC_TITLE                 DOC_DATA
------------------------- ------------------------------------------------------------
Sample Doc                This is the first sample document
Doc 2                     Yet another sample document
Doc 3                     Third sample document

---//
---// query execution plan confirming text search //---
---//
Plan hash value: 145656609

----------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name             | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                  |      1 |        |      3 |00:00:00.01 |       9 |
|   1 |  TABLE ACCESS BY INDEX ROWID| BLOG_DOCS        |      1 |      1 |      3 |00:00:00.01 |       9 |
|*  2 |   DOMAIN INDEX              | BLOG_DOC_TXT_IDX |      1 |        |      3 |00:00:00.01 |       6 |
----------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("CTXSYS"."CONTAINS"("DOC_DATA",'sample')>0)

Let's add few more documents to our table and explore few more details.

---//
---// loading few more documents //---
---//
SQL> insert into blog_docs (doc_id, doc_title, doc_data)
  2  values (4,'Doc 4', 'Fourth sample document');

1 row created.

SQL> insert into blog_docs (doc_id, doc_title, doc_data)
  2  values (5, 'Doc 5', 'Fifth sample document');

1 row created.

SQL> insert into blog_docs (doc_id, doc_title, doc_data)
  2  values (6, 'Doc 6', 'Sixth sample document');

1 row created.

SQL> commit;

Commit complete.

Since our index is not configured for sync on commit, let's sync the index to have the tokens populated to the $I table.

---//
---// syncing text index with the base table //---
---//
SQL> exec ctx_ddl.sync_index('BLOG_DOC_TXT_IDX');

PL/SQL procedure successfully completed.

Let's query the $I table and see how the new tokens are populated.

---//
---// tokens populated in $I table //---
---//
SQL> select * from DR$BLOG_DOC_TXT_IDX$I;

TOKEN_TEXT                TOKEN_TYPE TOKEN_FIRST TOKEN_LAST TOKEN_COUNT TOKEN_INFO
------------------------- ---------- ----------- ---------- ----------- -------------------------
ANOTHER                            0           2          2           1 008802
DOCUMENT                           0           1          3           3 008806018804018803
FIRST                              0           1          1           1 008804
IS                                 0           1          1           1 008802
SAMPLE                             0           1          3           3 008805018803018802
THE                                0           1          1           1 008803
THIRD                              0           3          3           1 008801
THIS                               0           1          1           1 008801
YET                                0           2          2           1 008801
DOCUMENT                           0           4          6           3 008803018803018803
FIFTH                              0           5          5           1 008801
FOURTH                             0           4          4           1 008801
SAMPLE                             0           4          6           3 008802018802018802
SIXTH                              0           6          6           1 008801

14 rows selected.

As we can observe, there are duplicate tokens in our $I table which leads to index fragmentation. This is due to the fact that the synchronization operation doesn't optimize the existing tokens in $I table rather it optimizes the incoming tokens during synchronization and merges those tokens with existing tokens in $I table. If we generate INDEX_STATS, we could see that the duplicate tokens resulted into index fragmentation as shown below.

---//
---// text index statistics report //---
---//
---------------------------------------------------------------------------
                         FRAGMENTATION STATISTICS
---------------------------------------------------------------------------

total size of $I data:                                                 66

$I rows:                                                               14
estimated $I rows if optimal:                                          12
estimated row fragmentation:                                         14 %

garbage docids:                                                         0
estimated garbage size:                                                 0

most fragmented tokens:
  SAMPLE (0:TEXT)                                                    50 %
  DOCUMENT (0:TEXT)                                                  50 %
  YET (0:TEXT)                                                        0 %
  THIS (0:TEXT)                                                       0 %
  THIRD (0:TEXT)                                                      0 %
  THE (0:TEXT)                                                        0 %
  SIXTH (0:TEXT)                                                      0 %
  IS (0:TEXT)                                                         0 %
  FOURTH (0:TEXT)                                                     0 %
  FIRST (0:TEXT)                                                      0 %
  FIFTH (0:TEXT)                                                      0 %
  ANOTHER (0:TEXT)                                                    0 %

As per the index stats, there is a 14% fragmentation resulted from the duplicate tokens. This is the reason that we need to periodically optimize the entire index using CTX_DDL.OPTIMIZE_INDEX procedure as shown below.

---//
---// optimize text index //---
---//
SQL> exec CTX_DDL.OPTIMIZE_INDEX('BLOG_DOC_TXT_IDX', 'FULL');

PL/SQL procedure successfully completed.

Once we optimize the index, we could see that the duplicate tokens are optimized

---//
---// duplicate tokens are optimized //---
---//
SQL> select token_text from DR$BLOG_DOC_TXT_IDX$I
  2  having count(token_text) > 1 group by token_text;

no rows selected

Text index with sync on commit

Now, if we opt for on commit synchronization for text indexes, we could end up with higher fragmentation as demonstrated below. In the following example, I am using the same table, same index and same records as that of the first example except the fact that the index is configured to sync on commit.

---//
---// drop and recreate table for demonstration //---
---//
SQL> drop table blog_docs purge;

Table dropped.

SQL> create table blog_docs
  2  (
  3  doc_id     number not null,
  4  doc_title  varchar2(25) not null,
  5  doc_data   clob not null
  6  );

Table created.

---//
---// create text index with ON COMMIT synchronization //---
---//
SQL> create index blog_doc_txt_idx on blog_docs (doc_data)
  2  indextype is ctxsys.context
  3  parameters ('sync (on commit)')
  4  ;

Index created.

Let's add a record and see how the on commit synchronization works.

---//
---// populate a record //---
---//
SQL> insert into blog_docs (doc_id, doc_title, doc_data)
  2  values (1, 'Sample Doc', 'This is the first sample document')
  3  ;

1 row created.

SQL> commit;

Commit complete.

---//
---// tokens are immediately populated into $I table //---
---//
SQL> select * from DR$BLOG_DOC_TXT_IDX$I;

TOKEN_TEXT           TOKEN_TYPE TOKEN_FIRST TOKEN_LAST TOKEN_COUNT TOKEN_INFO
-------------------- ---------- ----------- ---------- ----------- ----------------------------------------
DOCUMENT                      0           1          1           1 008806
FIRST                         0           1          1           1 008804
IS                            0           1          1           1 008802
SAMPLE                        0           1          1           1 008805
THE                           0           1          1           1 008803
THIS                          0           1          1           1 008801

6 rows selected.

As we could see the tokens are immediately populated in the $I table, which means text search can be immediately performed for the new record unlike the manual method where the text search doesn't work until the tokens are synchronized with CTX_DDL.SYNC_INDEX procedure.

Let's populate few more records and explore few more details.

---//
---// populate few more records //---
---//
SQL> insert into blog_docs (doc_id, doc_title, doc_data)
  2  values (2, 'Doc 2', 'Yet another sample document');

1 row created.

SQL> commit;

Commit complete.

SQL> insert into blog_docs (doc_id, doc_title, doc_data)
  2  values (3, 'Doc 3', 'Third sample document');

1 row created.

SQL> commit;

Commit complete.

SQL> insert into blog_docs (doc_id, doc_title, doc_data)
  2  values (4,'Doc 4', 'Fourth sample document');

1 row created.

SQL> commit;

Commit complete.

SQL> insert into blog_docs (doc_id, doc_title, doc_data)
  2  values (5, 'Doc 5', 'Fifth sample document');

1 row created.

SQL> commit;

Commit complete.

SQL> insert into blog_docs (doc_id, doc_title, doc_data)
  2  values (6, 'Doc 6', 'Sixth sample document');

1 row created.

SQL> commit;

Commit complete.

We have inserted 5 more documents into our table. If we look into the $I table, we could see that tokens are immediately populated there. However, there was no optimization performed for the duplicate tokens as found below.

---//
---// tokens populated into $I table immediately without optimization //---
---//
SQL> select token_text, count(*) from DR$BLOG_DOC_TXT_IDX$I
  2  having count(token_text) > 1 group by token_text;

TOKEN_TEXT             COUNT(*)
-------------------- ----------
DOCUMENT                      6
SAMPLE                        6

As we can observe, each record resulted into a new token entry for the repetitive words. This behavior of on commit synchronization leads to more fragmentation (as shown in the following report) when compared to manual synchronization. However, it does provide less latency as the tokens are immediately populated when compared to manual synchronization where tokens are populated on periodic basis resulting into latency in index synchronization and in turn impacting the text search.

---//
---// text index statistics report //---
---//
---------------------------------------------------------------------------
                         FRAGMENTATION STATISTICS
---------------------------------------------------------------------------

total size of $I data:                                                 66

$I rows:                                                               22
estimated $I rows if optimal:                                          12
estimated row fragmentation:                                         46 %

garbage docids:                                                         0
estimated garbage size:                                                 0

most fragmented tokens:
  SAMPLE (0:TEXT)                                                    83 %
  DOCUMENT (0:TEXT)                                                  83 %
  YET (0:TEXT)                                                        0 %
  THIS (0:TEXT)                                                       0 %
  THIRD (0:TEXT)                                                      0 %
  THE (0:TEXT)                                                        0 %
  SIXTH (0:TEXT)                                                      0 %
  IS (0:TEXT)                                                         0 %
  FOURTH (0:TEXT)                                                     0 %
  FIRST (0:TEXT)                                                      0 %
  FIFTH (0:TEXT)                                                      0 %
  ANOTHER (0:TEXT)                                                    0 %

Near Real Time Text Indexes

As we can observe from the above examples, we need to always make a trade off between latency and index fragmentation when dealing with Oracle text indexes. To address this bottleneck of text indexes, Oracle has introduced the Near Real Time Indexes in 12c, where we use a intermediate $G table for hosting the recent token entries resulting from the DML on base table. The intermediate table is merged with the standard $I index tables periodically to keep the index synchronized with the base table.

Let's enable the Near Real Time Index feature for our index BLOG_DOC_TXT_IDX by setting the STAGE_ITAB attribute to TRUE for our index storage preference. Again, I am going to use the same table and records from the previous examples. However, I will create a custom preference to set the storage attributes STAGE_ITAB in order to enable the new feature as shown below.

---//
---// create a storage preference with STAGE_ITAB attribute set to TRUE //---
---//
SQL> exec ctx_ddl.create_preference( 'my_storage', 'BASIC_STORAGE' )

PL/SQL procedure successfully completed.

SQL> exec ctx_ddl.set_attribute ( 'my_storage', 'STAGE_ITAB', 'true' )

PL/SQL procedure successfully completed.

Now, we can use this custom storage preference to create a text index with the Near Real Time Indexes feature enabled as shown below.

---//
---// create text index with custom storage preference //---
---//
SQL> create index blog_doc_txt_idx on blog_docs (doc_data)
  2  indextype is ctxsys.context
  3  parameters ('storage my_storage sync (on commit)');

Index created.

Notice that we have enabled the on commitsynchronization for our text index. This will allow the tokens to be updated to the intermediate $G table without any manual intervention. The intention here is to have a small fragmented intermediary index to hold the recent tokens and later merge those tokens with the standard index table.

Once we enable the new feature, we could observe that a new table with suffix $G is created for our text index as shown below which has a structure similar to the $I table as it is an intermediary table for $I.

---//
---// A new index table with suffix $G is being created //---
---//
SQL> select owner,table_name,iot_type from dba_tables
  2  where owner='MYAPP' and table_name like '%BLOG_DOC%' order by table_name;

OWNER           TABLE_NAME                IOT_TYPE
--------------- ------------------------- ------------
MYAPP           BLOG_DOCS
MYAPP           DR$BLOG_DOC_TXT_IDX$G
MYAPP           DR$BLOG_DOC_TXT_IDX$I
MYAPP           DR$BLOG_DOC_TXT_IDX$K     IOT
MYAPP           DR$BLOG_DOC_TXT_IDX$N     IOT
MYAPP           DR$BLOG_DOC_TXT_IDX$R

6 rows selected.

---//
---// $G table structure is similar to $I table //---
---//
SQL> desc DR$BLOG_DOC_TXT_IDX$G
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 TOKEN_TEXT                                NOT NULL VARCHAR2(64)
 TOKEN_TYPE                                NOT NULL NUMBER(10)
 TOKEN_FIRST                               NOT NULL NUMBER(10)
 TOKEN_LAST                                NOT NULL NUMBER(10)
 TOKEN_COUNT                               NOT NULL NUMBER(10)
 TOKEN_INFO                                         BLOB

Like the $I table, the entries in the intermediary $G table are indexed by means of a B-Tree index with name having format DR${index_name}$H as shown below.

---//
---// intermediary $G table is indexed by means of a B-Tree index //---
---//
SQL> select owner,index_name,table_name,index_type from dba_indexes
  2  where index_name like '%BLOG_DOC_TXT_IDX%';

OWNER           INDEX_NAME                TABLE_NAME                INDEX_TYPE
--------------- ------------------------- ------------------------- ---------------------------
MYAPP           BLOG_DOC_TXT_IDX          BLOG_DOCS                 DOMAIN
MYAPP           DR$BLOG_DOC_TXT_IDX$H     DR$BLOG_DOC_TXT_IDX$G     NORMAL
MYAPP           DR$BLOG_DOC_TXT_IDX$X     DR$BLOG_DOC_TXT_IDX$I     NORMAL
MYAPP           DRC$BLOG_DOC_TXT_IDX$R    DR$BLOG_DOC_TXT_IDX$R     NORMAL


SQL> select INDEX_OWNER,INDEX_NAME,TABLE_NAME,COLUMN_NAME,COLUMN_POSITION from dba_ind_columns
  2  where INDEX_NAME='DR$BLOG_DOC_TXT_IDX$H' order by 5;

INDEX_OWNER     INDEX_NAME                TABLE_NAME                COLUMN_NAME     COLUMN_POSITION
--------------- ------------------------- ------------------------- --------------- ---------------
MYAPP           DR$BLOG_DOC_TXT_IDX$H     DR$BLOG_DOC_TXT_IDX$G     TOKEN_TEXT                    1
                                                                    TOKEN_TYPE                    2
                                                                    TOKEN_FIRST                   3
                                                                    TOKEN_LAST                    4
                                                                    TOKEN_COUNT                   5

Now, since we have enabled the new Near Real Time Index feature, let's add few records and observe the behavior of the text index.

---//
---// adding a new record in the base table //---
---//
SQL> insert into blog_docs (doc_id, doc_title, doc_data)
  2  values (1, 'Sample Doc', 'This is the first sample document');

1 row created.

SQL> commit;

Commit complete.

Once we insert a new record into our table, we could see the tokens resulting from the insert are populated into the intermediary $G table rather than populating the tokens into the standard $I table as shown below.

---//
---// new tokens are added into the $G table //--
---//
SQL> select * from DR$BLOG_DOC_TXT_IDX$G;

TOKEN_TEXT           TOKEN_TYPE TOKEN_FIRST TOKEN_LAST TOKEN_COUNT TOKEN_INFO
-------------------- ---------- ----------- ---------- ----------- ----------------------------------------
DOCUMENT                      0           1          1           1 008806
FIRST                         0           1          1           1 008804
IS                            0           1          1           1 008802
SAMPLE                        0           1          1           1 008805
THE                           0           1          1           1 008803
THIS                          0           1          1           1 008801

6 rows selected.

SQL>  select * from DR$BLOG_DOC_TXT_IDX$I;

no rows selected

With the tokens populated into the intermediary $G table, Oracle can still take advantage of the text search even though the tokens are not propagated to the $I table as shown below.

---//
---// text search works with intermediary token table //---
---//
SQL> select doc_title,doc_data from blog_docs where contains(doc_data,'sample') > 0;

DOC_TITLE                 DOC_DATA
------------------------- ------------------------------------------------------------
Sample Doc                This is the first sample document


---//
---// query execution plan confirming text search //---
---//
Plan hash value: 145656609

----------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name             | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                  |      1 |        |      1 |00:00:00.01 |      12 |
|   1 |  TABLE ACCESS BY INDEX ROWID| BLOG_DOCS        |      1 |      1 |      1 |00:00:00.01 |      12 |
|*  2 |   DOMAIN INDEX              | BLOG_DOC_TXT_IDX |      1 |        |      1 |00:00:00.01 |      11 |
----------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("CTXSYS"."CONTAINS"("DOC_DATA",'sample')>0)

Let's insert few more records and explore few more details about the intermediary table.

---//
---// insert a new records with repetitive words //---
---//
SQL> insert into blog_docs (doc_id, doc_title, doc_data)
  2  values (2, 'Doc 2', 'Yet another sample document');

1 row created.

SQL> commit;

Commit complete.

SQL> insert into blog_docs (doc_id, doc_title, doc_data)
  2  values (3, 'Doc 3', 'Third sample document');

1 row created.

SQL> commit;

Commit complete.

SQL> insert into blog_docs (doc_id, doc_title, doc_data)
  2  values (4,'Doc 4', 'Fourth sample document');

1 row created.

SQL> commit;

Commit complete.

SQL> insert into blog_docs (doc_id, doc_title, doc_data)
  2  values (5, 'Doc 5', 'Fifth sample document');

1 row created.

SQL> commit;

Commit complete.

SQL> insert into blog_docs (doc_id, doc_title, doc_data)
  2  values (6, 'Doc 6', 'Sixth sample document');

1 row created.

SQL> commit;

Commit complete.

---//
---// intermediary token table is populated with recent tokens and is fragmented //---
---//
SQL> select * from DR$BLOG_DOC_TXT_IDX$G;

TOKEN_TEXT           TOKEN_TYPE TOKEN_FIRST TOKEN_LAST TOKEN_COUNT TOKEN_INFO
-------------------- ---------- ----------- ---------- ----------- ----------------------------------------
DOCUMENT                      0           1          1           1 008806
FIRST                         0           1          1           1 008804
IS                            0           1          1           1 008802
SAMPLE                        0           1          1           1 008805
THE                           0           1          1           1 008803
THIS                          0           1          1           1 008801
ANOTHER                       0           2          2           1 008802
DOCUMENT                      0           2          2           1 008804
SAMPLE                        0           2          2           1 008803
YET                           0           2          2           1 008801
DOCUMENT                      0           3          3           1 008803
SAMPLE                        0           3          3           1 008802
THIRD                         0           3          3           1 008801
DOCUMENT                      0           4          4           1 008803
FOURTH                        0           4          4           1 008801
SAMPLE                        0           4          4           1 008802
DOCUMENT                      0           5          5           1 008803
FIFTH                         0           5          5           1 008801
SAMPLE                        0           5          5           1 008802
DOCUMENT                      0           6          6           1 008803
SAMPLE                        0           6          6           1 008802
SIXTH                         0           6          6           1 008801

22 rows selected.

---//
---// standard token table is not yet populated //---
---//
SQL> select * from DR$BLOG_DOC_TXT_IDX$I;

no rows selected

We have inserted another 5 records which are populated into the intermediary table and resulted into a fragmented intermediary table due to repetitive tokens. This fragmentation in the intermediary table shouldn't be a concern as we have the option of keeping this table in memory (more details in the upcoming section). However, if we take a report of INDEX_STATS for the text index, we could observe that there is no fragmentation reported as the standard index table is not fragmented.

---//
---// text index statistics showing no fragmentation //---
---//
===========================================================================
                 STATISTICS FOR "MYAPP"."BLOG_DOC_TXT_IDX"
===========================================================================

indexed documents:                                                      6
allocated docids:                                                       6
$I rows:                                                                0

We can also notice that at this point none of the tokens are populated into the standard index table $I table. The tokens from the intermediary $G table needs to be periodically merged with the standard $I table to keep the text index intact. This can be done by calling the CTX_DDL.OPTIMIZE_INDEX procedure with the MERGE option as shown below.

---//
---// merge tokens from $G to $I table //---
---//
SQL> exec ctx_ddl.optimize_index('BLOG_DOC_TXT_IDX', 'MERGE');

PL/SQL procedure successfully completed.

Once we merge (move) the tokens from the intermediary $G table to the standard $I table, we could see the entries in the intermediary are cleaned up and populated into the $I table as shown below.

---//
---// tokens from $G table are populated into $I table //---
---//
SQL> select * from DR$BLOG_DOC_TXT_IDX$I;

TOKEN_TEXT           TOKEN_TYPE TOKEN_FIRST TOKEN_LAST TOKEN_COUNT TOKEN_INFO
-------------------- ---------- ----------- ---------- ----------- ----------------------------------------
ANOTHER                       0           2          2           1 008802
DOCUMENT                      0           1          6           6 008806018804018803018803018803018803
FIFTH                         0           5          5           1 008801
FIRST                         0           1          1           1 008804
FOURTH                        0           4          4           1 008801
IS                            0           1          1           1 008802
SAMPLE                        0           1          6           6 008805018803018802018802018802018802
SIXTH                         0           6          6           1 008801
THE                           0           1          1           1 008803
THIRD                         0           3          3           1 008801
THIS                          0           1          1           1 008801
YET                           0           2          2           1 008801

12 rows selected.

---//
---// tokens from intermediary $G table are cleaned up //---
---//
SQL> select * from DR$BLOG_DOC_TXT_IDX$G;

no rows selected

If we notice closely, we could observe that during the merge operation, the tokens from the intermediary $G table were optimized (duplicates eliminated) before merging into the $I table. This will help in reducing the fragmentation in the standard $I table when compared to the existing methods of index synchronization. However, we can't completely eliminate fragmentation in the standard $I table as the merge operation will not optimize the tokens from existing records, which means when we merge tokens from $G table to $I table, if a token already exists in the $I table, Oracle will not optimize the existing token rather it will add a new records for the token coming from the $G table as shown below.

---//
---// populate a new record with repetitive token //---
---//
SQL> insert into blog_docs (doc_id, doc_title, doc_data)
  2  values (7, 'Doc 7', 'Seventh sample document');

1 row created.

SQL> commit;

Commit complete.

---//
---// intermediary token table is populated with new tokens //---
---//
SQL> select * from DR$BLOG_DOC_TXT_IDX$G;

TOKEN_TEXT           TOKEN_TYPE TOKEN_FIRST TOKEN_LAST TOKEN_COUNT TOKEN_INFO
-------------------- ---------- ----------- ---------- ----------- ----------------------------------------
DOCUMENT                      0           7          7           1 008803
SAMPLE                        0           7          7           1 008802
SEVENTH                       0           7          7           1 008801

---//
---// merge the intermediary table with standard token table //---
---//
SQL> exec ctx_ddl.optimize_index('BLOG_DOC_TXT_IDX', 'MERGE');

PL/SQL procedure successfully completed.

SQL> select * from DR$BLOG_DOC_TXT_IDX$G;

no rows selected

---//
---// existing tokens are not optimized in the standard token table //---
---//
SQL> select token_text, count(*) from DR$BLOG_DOC_TXT_IDX$I
  2  having count(token_text) > 1 group by token_text;

TOKEN_TEXT             COUNT(*)
-------------------- ----------
DOCUMENT                      2
SAMPLE                        2

As demonstrated earlier, to optimize the standard $I table, we can call the CTX_DDL.OPTIMIZE_INDEX with FULL option, which will optimize the token entries in $I table and remove any duplicate tokens. However, in case of the Near Real Time Indexes, we need not frequently optimize the standard text index as the percentage of fragmentation would be much less when compared to the other methods and this would also help in reducing the duration of index optimization.

We mentioned earlier that we wanted to keep the $G table in memory. How do we ensure that happens? We could of course just rely on normal SGA caching - if there are regular updates to the index then we can really expect that it will remain in memory. Alternatively, if you have a keep pool, which is sized to ensure nothing gets flushed to disk, we can make use of this memory pool using appropriate STORAGE attributes as shown below.

---//
---// keeping $G table and its associated index in buffer cache //---
---//
SQL> exec ctx_ddl.set_attribute ( 'my_storage', 'G_TABLE_CLAUSE', 'storage (buffer_pool keep)');

PL/SQL procedure successfully completed.

SQL> exec ctx_ddl.set_attribute ( 'my_storage', 'G_INDEX_CLAUSE', 'storage (buffer_pool keep)' );

PL/SQL procedure successfully completed.

---//
---// validate the intermediary table is using buffer cache //---
---//
SQL>  select * from CTX_PREFERENCE_VALUES where PRV_PREFERENCE='MY_STORAGE';

PRV_OWNER       PRV_PREFERENCE  PRV_ATTRIBUTE   PRV_VALUE
--------------- --------------- --------------- ------------------------------
MYAPP           MY_STORAGE      R_TABLE_CLAUSE  lob (data) store as (cache)
MYAPP           MY_STORAGE      I_INDEX_CLAUSE  compress 2
MYAPP           MY_STORAGE      STAGE_ITAB      YES
MYAPP           MY_STORAGE      G_TABLE_CLAUSE  storage (buffer_pool keep)
MYAPP           MY_STORAGE      G_INDEX_CLAUSE  storage (buffer_pool keep)

Auto Optimize Near Real Time Indexes

In earlier section, we have seen how we can merge the intermediary token table with the standard token table using the CTX_DDL.OPTIMIZE_INDEX procedure with MERGE option. However, this needs manual intervention. We have the option of letting Oracle decide when to merge the intermediary token table with the standard token table by choosing auto optimization for the intermediary token table. We can use the CTX_DDL.ADD_AUTO_OPTIMIZE procedure to enable auto optimization of the intermediary token table as shown below.

---//
---// setting AUTO optimization for intermediary $G table //---
---//
SQL> exec ctx_ddl.add_auto_optimize('BLOG_DOC_TXT_IDX');

PL/SQL procedure successfully completed.

We can validate if the auto optimization is enabled by querying the CTX_USER_AUTO_OPTIMIZE_INDEXES view. This view will contain an entry for each indexes enable for auto optimization as shown below.

---//
---// validate AUTO operation is enabled //---
---//
SQL> select aoi_index_name, aoi_partition_name from
  2  ctx_user_auto_optimize_indexes;

AOI_INDEX_NAME                 AOI_PARTITION_NAME
------------------------------ ------------------------------
BLOG_DOC_TXT_IDX

Once the intermediary token table is enabled for auto optimization, Oracle will merge the tokens from the intermediary $G table to the standard $I table using a background job named DR$BGOPTJOB which is scheduled through the DBMS scheduler as shown below.

---//
---// database job performing auto operation of $G table //---
---//
SQL> select owner, job_name, program_name,last_start_date,state from dba_scheduler_jobs where owner='CTXSYS';

OWNER           JOB_NAME             PROGRAM_NAME         LAST_START_DATE                               STATE
--------------- -------------------- -------------------- --------------------------------------------- ---------------
CTXSYS          DR$BGOPTJOB          DR$BGOPTPRG          05-AUG-16 06.14.03.737613 PM ASIA/CALCUTTA    SUCCEEDED

Conclusion

Near Real Time Indexes feature of the Oracle text index is a significant improvement in the way text indexes are synchronized and maintained in Oracle database. This new and a much awaited feature will certainly help in reducing the text index fragmentation, latency as well as will help in improving the text query performance (provided we leverage the in memory copy of the intermediary token table).