Concept

The Fast-Start Failover (FSFO) feature automates failover by using an observer, so no DBA intervention is required during a disaster recovery scenario. The observer is a component of the Data Guard broker, typically running outside the network of both the primary and standby databases, that continuously monitors the primary database. When the observer detects that the primary database is unavailable, it initiates failover after waiting for the number of seconds defined in the property 'FastStartFailoverThreshold'. When broker connectivity to the former primary database is re-established, the broker reinstates that database as a standby, provided flashback is configured.
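
For reference, a minimal DGMGRL sketch of enabling FSFO (the threshold value is illustrative, and the observer is normally started from a separate third host):

DGMGRL> EDIT CONFIGURATION SET PROPERTY FastStartFailoverThreshold = 60;
DGMGRL> ENABLE FAST_START FAILOVER;
DGMGRL> START OBSERVER;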

One of the major enhancements in 11g and later releases is that Fast-Start Failover can be implemented when the protection mode is Maximum Performance; in this mode we must set the property 'FastStartFailoverLagLimit' to the number of seconds of data that can be lost. Earlier releases required the protection mode to be Maximum Availability.
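
A sketch of the Maximum Performance setup (the lag limit value is illustrative):

DGMGRL> EDIT CONFIGURATION SET PROTECTION MODE AS MaxPerformance;
DGMGRL> EDIT CONFIGURATION SET PROPERTY FastStartFailoverLagLimit = 30;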

From 11g onward we can also configure user-configurable failover conditions; upon detecting one of these conditions the broker ignores the property 'FastStartFailoverThreshold' and fails over immediately. The user-configurable failover conditions are listed below; a sketch of enabling them through DGMGRL follows the list.

  • Datafile Offline - Enabled by default; initiates failover if a datafile on the primary database experiences an I/O error that results in the datafile being taken offline.
  • Corrupted Dictionary - Enabled by default; initiates failover if corruption of a critical database object is found.
  • Corrupted Controlfile - Enabled by default; initiates failover if corruption of the controlfile is detected.
  • Inaccessible Log File - Disabled by default; initiates failover if LGWR is unable to write to a member of a log group.
  • Stuck Archiver - Disabled by default; initiates failover if the archiver on the primary database is hung.
  • Application Induced Failover - Failover can be initiated by calling the dbms_dg.initiate_fs_failover function, which gives the application the capability to request a failover (see the PL/SQL sketch below).
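
These conditions are enabled and disabled through DGMGRL; a minimal sketch, using the condition names as printed by SHOW FAST_START FAILOVER:

DGMGRL> ENABLE FAST_START FAILOVER CONDITION "Inaccessible Logfile";
DGMGRL> DISABLE FAST_START FAILOVER CONDITION "Datafile Offline";
DGMGRL> SHOW FAST_START FAILOVER;

For application-induced failover, a hedged PL/SQL sketch (the condition string is free text recorded by the broker; a return value of 0 indicates the request was accepted):

SET SERVEROUTPUT ON
DECLARE
  status BINARY_INTEGER;
BEGIN
  status := DBMS_DG.INITIATE_FS_FAILOVER('Application detected fatal condition');
  DBMS_OUTPUT.PUT_LINE('initiate_fs_failover returned: ' || status);
END;
/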

When one of these user-configurable conditions is detected, the broker does not bring up the failed primary database; it leaves it in a down state. The broker also does not attempt automatic reinstatement of the database.

In RAC, the two user-configurable failover conditions 'Inaccessible Log File' and 'Stuck Archiver' are instance specific: if any one instance in the cluster detects either condition, failover occurs regardless of the availability of the other instances. These conditions therefore have to be considered carefully before enabling them in a RAC cluster.

We can also register ORA errors so that failover takes place whenever one of the defined errors is detected. According to the Oracle manuals the only supported error is ORA-00240, but in practice you can register any ORA error condition as long as you reasonably believe its occurrence impedes the business availability of your primary database. There is a diagnostic event, 16616, which can be set at different levels to simulate the various error conditions, but on 12c not all levels appear to work. Level 284 attempts to simulate ORA-00240; however, the method it uses does not work on 12c. In 12c, major changes were made to the code that reads and updates controlfile records, reducing the need to take an exclusive enqueue on the controlfile, which is why this particular simulation no longer works.
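
Registering an ORA error as a failover condition uses the same DGMGRL condition syntax, with the numeric error code; a sketch:

DGMGRL> ENABLE FAST_START FAILOVER CONDITION 240;
DGMGRL> SHOW FAST_START FAILOVER;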

The levels provided by broker diagnostic event 16616 for simulating the various error conditions are:

251 /* healthchk: datafile offline */ 
252 /* healthchk: corrupted controlfile */ 
253 /* healthchk: corrupted dictionary */ 
254 /* healthchk: inaccessible logfile */ 
255 /* healthchk: stuck archiver */ 
256 /* healthchk: lost write */ 
280 /* oracle error 27102 */ 
281 /* oracle error 16506 */ 
282 /* oracle error 16526 */ 
283 /* oracle error 1578 */ 
284 /* oracle error 240 */ 

Any of these error conditions can be simulated by setting event 16616 at the corresponding level, as shown below:

alter session set events '16616 trace name context forever, level <level_id>'; 

Environment

In this article we will go through one of the user-configurable failover conditions, 'Inaccessible Log File', by artificially simulating it and confirming that the behavior is as expected. This demo was conducted in an environment with a two-node primary RAC 12c database primdb (prim-node1/prim-node2), a two-node standby RAC database stdbydb (stdby-node1/stdby-node2), and an observer (obsrvr-node) placed in a different network from the primary and standby databases. The protection mode is Maximum Availability. The current broker configuration status is shown below.

prim-node1 {/home/oracle}: dgmgrl sys/*****@primdb "show configuration verbose"
DGMGRL for Linux: Version 12.1.0.2.0 - 64bit Production

Copyright (c) 2000, 2013, Oracle. All rights reserved.

Welcome to DGMGRL, type "help" for information.
Connected as SYSDBA.

Configuration - fsfo-demo

  Protection Mode: MaxAvailability
  Members:
  primdb - Primary database
    stdbydb - (*) Physical standby database

  (*) Fast-Start Failover target

  Properties:
    FastStartFailoverThreshold      = '60'
    OperationTimeout                = '30'
    TraceLevel                      = 'USER'
    FastStartFailoverLagLimit       = '60'
    CommunicationTimeout            = '180'
    ObserverReconnect               = '0'
    FastStartFailoverAutoReinstate  = 'TRUE'
    FastStartFailoverPmyShutdown    = 'TRUE'
    BystandersFollowRoleChange      = 'ALL'
    ObserverOverride                = 'FALSE'
    ExternalDestination1            = ''
    ExternalDestination2            = ''
    PrimaryLostWriteAction          = 'CONTINUE'

Fast-Start Failover: ENABLED

  Threshold:          60 seconds
  Target:             stdbydb
  Observer:           obsrvr-node
  Lag Limit:          60 seconds (not in use)
  Shutdown Primary:   TRUE
  Auto-reinstate:     TRUE
  Observer Reconnect: (none)
  Observer Override:  FALSE

Configuration Status:
SUCCESS
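
For completeness, the observer on obsrvr-node was started as a background DGMGRL session; a typical invocation looks like the following sketch (the -logfile path and background invocation are illustrative):

obsrvr-node {/home/oracle}: dgmgrl -logfile /home/oracle/observer.log sys/*****@primdb "start observer" &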

Fast-Start Failover has been configured, and the user-configurable condition 'Inaccessible Log File' has been enabled, as shown below.

prim-node1 {/home/oracle}: dgmgrl sys/*****@primdb "show fast_start failover"
DGMGRL for Linux: Version 12.1.0.2.0 - 64bit Production

Copyright (c) 2000, 2013, Oracle. All rights reserved.

Welcome to DGMGRL, type "help" for information.
Connected as SYSDBA.

Fast-Start Failover: ENABLED

  Threshold:          30 seconds
  Target:             stdbydb
  Observer:           obsrvr-node
  Lag Limit:          30 seconds (not in use)
  Shutdown Primary:   TRUE
  Auto-reinstate:     TRUE
  Observer Reconnect: (none)
  Observer Override:  FALSE

Configurable Failover Conditions
  Health Conditions:
    Corrupted Controlfile          NO
    Corrupted Dictionary           NO
    Inaccessible Logfile           YES
    Stuck Archiver                 NO
    Datafile Offline               NO

  Oracle Error Conditions:

The property 'Shutdown Primary' is set to TRUE: when failover happens, the old primary is shut down immediately to prevent further data modifications from already-connected sessions. The property 'Auto-reinstate' is set to TRUE: when failover happens, the old primary is automatically reinstated as a standby database and resynchronized with the new primary. However, neither property applies to user-configurable conditions: when failover is triggered by a user-configurable condition, the old primary is shut down and the broker does not attempt reinstatement automatically.
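
Both properties are set through DGMGRL; a minimal sketch:

DGMGRL> EDIT CONFIGURATION SET PROPERTY FastStartFailoverPmyShutdown = 'TRUE';
DGMGRL> EDIT CONFIGURATION SET PROPERTY FastStartFailoverAutoReinstate = 'TRUE';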

Simulation

To simulate the 'Inaccessible Logfile' condition, we will delete all the online redo log files belonging to thread 1 (the first RAC instance). Before proceeding, let's gather the status of each redo log group.

SQL> select thread#,group#, sequence#, members, ARCHIVED, status from v$log;

   THREAD#     GROUP#  SEQUENCE#    MEMBERS ARC STATUS
---------- ---------- ---------- ---------- --- ----------------
         1          1         70          2 YES ACTIVE
         1          2         71          2 NO  CURRENT
         2          3         69          2 YES ACTIVE
         2          4         70          2 NO  CURRENT

The status of redo log group 1 of thread 1 is ACTIVE, which means it contains data that has not yet been written to the data files by DBWR. Because the protection mode is MaxAvailability with synchronous log transport, we can conclude that there will be no data loss when failover occurs.
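
Before deleting anything, we can confirm the synchronous transport and identify the on-disk member files; a quick sketch (assuming log_archive_dest_2 points to the standby):

SQL> select dest_id, status, synchronization_status from v$archive_dest_status where dest_id = 2;
SQL> select l.group#, lf.member from v$log l, v$logfile lf where l.group# = lf.group# and l.thread# = 1;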

Now delete all the online log file members belonging to thread 1:

SQL> !rm /oradata1/PRIMDB/onlinelog/o1_mf_1__18637110189630_.log
SQL> !rm /oradata2/PRIMDB/onlinelog/o1_mf_1__18637117200858_.log
SQL> !rm /oradata1/PRIMDB/onlinelog/o1_mf_2__18637124324657_.log
SQL> !rm /oradata2/PRIMDB/onlinelog/o1_mf_2__18637131414786_.log

Immediately after the online log files are removed, the broker detects the 'Inaccessible Logfile' condition due to the I/O errors LGWR encounters while trying to write to the online log files placed on NFS storage. All of this is visible in the alert log of the primary database.

Thu Apr 28 02:45:44 2016
Direct NFS: NFSERR 304 error encountered. Server lab-demo path lab-demo
Thu Apr 28 02:45:44 2016
Errors in file /u01/app/oracle/diag/rdbms/PRIMDB/primdb1/trace/primdb1_lgwr_16386.trc:
ORA-00345: redo log write error block 53 count 1
ORA-00312: online log 2 thread 1: '/oradata1/PRIMDB/onlinelog/o1_mf_2__18637124324657_.log'
ORA-17500: ODM err:KGNFS WRITE FAIL:Stale File handle
ORA-17500: ODM err:KGNFS WRITE FAIL:Stale File handle
Thu Apr 28 02:45:44 2016
Errors in file /u01/app/oracle/diag/rdbms/PRIMDB/primdb1/trace/primdb1_lgwr_16386.trc:
ORA-00346: log member marked as STALE and closed
ORA-00312: online log 2 thread 1: '/oradata1/PRIMDB/onlinelog/o1_mf_2__18637124324657_.log'
Thu Apr 28 02:45:49 2016
Direct NFS: write FAILED 70
Thu Apr 28 02:45:49 2016
Direct NFS: NFSERR 304 error encountered. Server lab-demo path lab-demo
Thu Apr 28 02:45:51 2016
A user-configurable Fast-Start Failover condition was detected. The primary is shutting down due to Inaccessible Logfile.
Database primdb will not be automatically reinstated.

So the old primary database has been shut down, and the broker will not try to reinstate it because a user-configurable condition was detected. The observer log shows that the failover took place and completed in approximately two minutes.

07:45:51.54  Thursday, April 28, 2016
Initiating Fast-Start Failover to database "stdbydb"...
Performing failover NOW, please wait...
Failover succeeded, new primary is "stdbydb"
07:47:30.82  Thursday, April 28, 2016

The status of the broker configuration after failover is shown below.

prim-node1 {/home/oracle}: dgmgrl sys/****@primdb "show configuration verbose"
DGMGRL for Linux: Version 12.1.0.2.0 - 64bit Production

Copyright (c) 2000, 2013, Oracle. All rights reserved.

Welcome to DGMGRL, type "help" for information.
Connected as SYSDBA.

Configuration - fsfo-demo

  Protection Mode: MaxAvailability
  Members:
  stdbydb - Primary database
    Warning: ORA-16817: unsynchronized fast-start failover configuration

    primdb - (*) Physical standby database (disabled)
      ORA-16661: the standby database needs to be reinstated

  (*) Fast-Start Failover target

  Properties:
    FastStartFailoverThreshold      = '60'
    OperationTimeout                = '30'
    TraceLevel                      = 'USER'
    FastStartFailoverLagLimit       = '60'
    CommunicationTimeout            = '180'
    ObserverReconnect               = '0'
    FastStartFailoverAutoReinstate  = 'TRUE'
    FastStartFailoverPmyShutdown    = 'TRUE'
    BystandersFollowRoleChange      = 'ALL'
    ObserverOverride                = 'FALSE'
    ExternalDestination1            = ''
    ExternalDestination2            = ''
    PrimaryLostWriteAction          = 'CONTINUE'

Fast-Start Failover: ENABLED

  Threshold:          60 seconds
  Target:             primdb
  Observer:           obsrvr-node
  Lag Limit:          60 seconds (not in use)
  Shutdown Primary:   TRUE
  Auto-reinstate:     TRUE
  Observer Reconnect: (none)
  Observer Override:  FALSE

Configuration Status:
WARNING

As expected, FSFO is in an unsynchronized configuration due to the unavailability of the standby database. The broker configuration status is WARNING, and it states that the standby database has to be reinstated. Since flashback was enabled before simulating this scenario, let's reinstate the old primary as a standby manually through the broker.
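
As a quick sanity check before reinstating, the flashback status can be verified on the database to be reinstated (a sketch):

SQL> select flashback_on from v$database;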

prim-node1 {/home/oracle}: dgmgrl sys/****@stdbydb "reinstate database primdb"
DGMGRL for Linux: Version 12.1.0.2.0 - 64bit Production

Copyright (c) 2000, 2013, Oracle. All rights reserved.

Welcome to DGMGRL, type "help" for information.
Connected as SYSDBA.
Reinstating database "primdb", please wait...
Error: ORA-16653: failed to reinstate database

Failed.
Reinstatement of database "primdb" failed

So the reinstate of the standby failed; per the alert log, FLASHBACK DATABASE failed because of the unavailability of the online log files we deleted previously.

Starting background process NSV0
Thu Apr 28 02:55:56 2016
NSV0 started with pid=53, OS id=12844
FLASHBACK DATABASE TO SCN 102068803
Thu Apr 28 02:56:01 2016
Errors in file /u01/app/oracle/diag/rdbms/primdb/primdb1/trace/primdb1_rsm0_23313.trc:
ORA-00313: open failed for members of log group 1 of thread 1
ORA-00312: online log 1 thread 1: '/oradata2/PRIMDB/onlinelog/o1_mf_1__18637117200858_.log'
ORA-17503: ksfdopn:4 Failed to open file /oradata2/PRIMDB/onlinelog/o1_mf_1__18637117200858_.log
ORA-17500: ODM err:File does not exist
ORA-00312: online log 1 thread 1: '/oradata1/PRIMDB/onlinelog/o1_mf_1__18637110189630_.log'
ORA-17503: ksfdopn:4 Failed to open file /oradata1/PRIMDB/onlinelog/o1_mf_1__18637110189630_.log
ORA-17500: ODM err:File does not exist
Thu Apr 28 02:56:01 2016
Errors in file /u01/app/oracle/diag/rdbms/primdb/primdb1/trace/primdb1_rsm0_23313.trc:
ORA-00313: open failed for members of log group 2 of thread 1
ORA-00312: online log 2 thread 1: '/oradata2/PRIMDB/onlinelog/o1_mf_2__18637131414786_.log'
ORA-17503: ksfdopn:4 Failed to open file /oradata2/PRIMDB/onlinelog/o1_mf_2__18637131414786_.log
ORA-17500: ODM err:File does not exist
ORA-38754 signalled during: FLASHBACK DATABASE TO SCN 102068803...
Thu Apr 28 02:56:04 2016

Since these online log files were deleted, we can try to clear online log file groups 1 and 2.

SQL> alter database clear logfile group 1;
alter database clear logfile group 1
*
ERROR at line 1:
ORA-01624: log 1 needed for crash recovery of instance poc054a1 (thread 1)
ORA-00312: online log 1 thread 1:
'/oradata1/PRIMDB/onlinelog/o1_mf_1__18637110189630_.log'
ORA-00312: online log 1 thread 1:
'/oradata2/PRIMDB/onlinelog/o1_mf_1__18637117200858_.log'

SQL> alter database clear logfile group 2;
alter database clear logfile group 2
*
ERROR at line 1:
ORA-01624: log 2 needed for crash recovery of instance poc054a1 (thread 1)
ORA-00312: online log 2 thread 1:
'/oradata1/PRIMDB/onlinelog/o1_mf_2__18637124324657_.log'
ORA-00312: online log 2 thread 1:
'/oradata2/PRIMDB/onlinelog/o1_mf_2__18637131414786_.log'

Clearing the log file groups is not allowed because the controlfile records that these logs are still required for crash recovery of instance poc054a1; for the same reason, the flashback operation would simply fail again. Without this redo we cannot proceed with the standby reinstatement.

Resolution

Since this test was done in Maximum Availability mode with SYNC log transport, no data was lost when FSFO failed over to the standby database. So, to reinstate the standby, we can recover the redo contained in the deleted online log files from our new primary. Let's first query the old primary to find which redo we need for thread 1.

SQL> select group#,first_change# from v$log where thread#=1;

GROUP# FIRST_CHANGE#
------ -------------
     1     102068615
     2     102068695	 

We need redo from SCN 102068695 onward, so query the new primary to find the required log files.

SQL> select name from v$archived_log where FIRST_CHANGE#>=102068695;

NAME
--------------------------------------------------------------------------------
/oraarch/stdbydb/2_1_910320360.dbf
/oraarch/stdbydb/1_71_910144956.dbf
/oraarch/stdbydb/2_70_910144956.dbf
/oraarch/stdbydb/1_1_910320360.dbf
/oraarch/stdbydb/2_2_910320360.dbf

Transfer these log files from the new primary to the standby and register them in the standby controlfile using RMAN's CATALOG command. Note that a few of the log files belong to a different incarnation.

scp /oraarch/stdbydb/2_1_910320360.dbf  oracle@prim-node1:/oraarch/primdb/
scp /oraarch/stdbydb/1_71_910144956.dbf oracle@prim-node1:/oraarch/primdb/
scp /oraarch/stdbydb/2_70_910144956.dbf oracle@prim-node1:/oraarch/primdb/
scp /oraarch/stdbydb/1_1_910320360.dbf  oracle@prim-node1:/oraarch/primdb/
scp /oraarch/stdbydb/2_2_910320360.dbf  oracle@prim-node1:/oraarch/primdb/

prim-node1 {/home/oracle}: rman target /
Recovery Manager: Release 12.1.0.2.0 - Production on Thu Apr 28 03:23:02 2016
Copyright (c) 1982, 2014, Oracle and/or its affiliates.  All rights reserved.
connected to target database: POC054A (DBID=1278101452, not open)

RMAN> catalog archivelog '/oraarch/primdb/2_1_910320360.dbf';
cataloged archived log
archived log file name=/oraarch/primdb/2_1_910320360.dbf RECID=7646 STAMP=910322773

RMAN> catalog archivelog '/oraarch/primdb/1_71_910144956.dbf';
cataloged archived log
archived log file name=/oraarch/primdb/1_71_910144956.dbf RECID=7647 STAMP=910322787

RMAN> catalog archivelog '/oraarch/primdb/2_70_910144956.dbf';
cataloged archived log
archived log file name=/oraarch/primdb/2_70_910144956.dbf RECID=7648 STAMP=910322800

RMAN> catalog archivelog '/oraarch/primdb/1_1_910320360.dbf';
cataloged archived log
archived log file name=/oraarch/primdb/1_1_910320360.dbf RECID=7649 STAMP=910322811

RMAN> catalog archivelog '/oraarch/primdb/2_2_910320360.dbf';
cataloged archived log
archived log file name=/oraarch/primdb/2_2_910320360.dbf RECID=7650 STAMP=910322822
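
Optionally, the newly cataloged files can be verified before retrying the reinstate (a sketch):

RMAN> list archivelog all;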

Now try again to reinstate the old primary database primdb through the broker.

prim-node1 {/home/oracle}: dgmgrl sys/****@stdbydb "reinstate database primdb"
DGMGRL for Linux: Version 12.1.0.2.0 - 64bit Production

Copyright (c) 2000, 2013, Oracle. All rights reserved.

Welcome to DGMGRL, type "help" for information.
Connected as SYSDBA.
Reinstating database "primdb", please wait...
Reinstatement of database "primdb" succeeded

In the alert log of the new standby primdb, the reinstatement appears successful, but MRP errors out because the recovery target incarnation changed, and it cannot proceed with applying the archived log files of the new incarnation.

MRP0: Background Media Recovery applied all available redo. Recovery will be restarted once new redo branch is registered
Thu Apr 28 03:35:27 2016
Errors in file /u01/app/oracle/diag/rdbms/primdb/primdb1/trace/primdb1_pr00_13044.trc:
ORA-19906: recovery target incarnation changed during recovery

Incarnations in new primary database stdbydb:

RMAN> list incarnation

using target database control file instead of recovery catalog

List of Database Incarnations
DB Key  Inc Key DB Name  DB ID            STATUS  Reset SCN  Reset Time
------- ------- -------- ---------------- --- ---------- ----------
44      44      POC054A  1278101452       PARENT  98432256   18-APR-16
45      45      POC054A  1278101452       PARENT  101420992  26-APR-16
46      46      POC054A  1278101452       CURRENT 102068825  28-APR-16

Incarnations in old primary database primdb:

RMAN> list incarnation

using target database control file instead of recovery catalog

List of Database Incarnations
DB Key  Inc Key DB Name  DB ID            STATUS  Reset SCN  Reset Time
------- ------- -------- ---------------- --- ---------- ----------
44      44      POC054A  1278101452       PARENT  98432256   18-APR-16
45      45      POC054A  1278101452       CURRENT 101420992  26-APR-16

Since the standby is not aware of this new incarnation, we can swap in a new standby controlfile created on the new primary, making the standby aware of both incarnations.

Create a standby controlfile on the new primary database stdbydb:

SQL> alter database create standby controlfile as '/tmp/newstdbycntrl.ctl';
Database altered.

SQL> !scp /tmp/newstdbycntrl.ctl oracle@prim-node1:/home/oracle

Stop the standby database and replace its controlfiles with the standby controlfile just generated from the new primary, which is aware of both incarnations.

prim-node1 {/home/oracle}: srvctl stop database -d primdb
prim-node1 {/home/oracle}: cp /home/oracle/newstdbycntrl.ctl /oradata1/primdb/PRIMDB/controlfile/o1_mf__102262659671_.ctl
prim-node1 {/home/oracle}: cp /home/oracle/newstdbycntrl.ctl /oradata2/primdb/PRIMDB/controlfile/o1_mf__102262723422_.ctl
prim-node1 {/home/oracle}: srvctl start database -d primdb

Now the standby detects the new incarnation from the controlfile and continues recovery to catch up with the primary. We were able to reinstate the standby with minimal effort and avoid a complete rebuild.
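
Apply progress can be confirmed from the new standby; a quick sketch:

SQL> select process, thread#, sequence#, status from v$managed_standby where process like 'MRP%';
SQL> select name, value from v$dataguard_stats where name in ('transport lag','apply lag');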

Conclusion

Fast-Start Failover provides a minimal RTO (Recovery Time Objective), but there are many other aspects to consider before implementing it in production. One of them is the time required to bring back the new standby, so as to avoid a prolonged single-point-of-failure window for the new primary after failover. By leveraging the flashback feature on both the primary and standby databases we can automate standby reinstatement through the broker, but the success of reinstatement depends on many other factors. In this article we took the user-configurable condition 'Inaccessible Logfile' and simulated the scenario to check whether we could bring back the new standby quickly; we had to perform a few manual steps to recover the lost redo from the new primary because it had been lost on the old primary. RTO is therefore directly proportional to the kind of failure that caused the failover, and it is highly recommended to build a proof of concept measuring RTO for each condition that can initiate a failover before moving ahead in a production environment.