Dataguard
Using the feature recover standby database from service Dataguard on 19c
Category: Database Author: Andre Luiz Dutra Ontalba (Board Member) Date: 4 years ago Comments: 0

Using the feature recover standby database from service Dataguard on 19c

The Oracle Dataguard, part from MAA (Max Availability Architecture), is a prouct vastly utilized all over the entire globe by several companies whom willing to keep their data safe, preventing lost of data in case of human erros, hardware failures and other kind of disasters. The main function of this product is provide a copy from your primary database, synchronized by archivelogs, usually maintained into a different geographical location, in most of cases. With the Active Dataguard introduction, on Oracle 11g, an option, licensed, leverage this product to another level, even better, in my humble opinion. This feature allows the users to run some reports on standby database with the state read only with apply, this can deviate a lot of heavy workloads from principal database and reducing a lot of resources on it.
With Oracle database 19c, Oracle introduced a feature that allows some DML done on standby database be redirected to the primary database and the shipped by archivelogs again to standby. Note, this is not recomended for usage with intense workloads, because of performance matters, use more, cpu, network and etc. Just mention this features for you follow up some improvements on the product on each version. Today we will talk about one more enhancement that make it easier DBA’s life time to time.
Sometimes, because of some hardware failures, human failures, bugs or even some reconfiguration, we need to recreate the standby database, and on Oracle 18c and onwards, Oracle has introduced a single RMAN command to performance the whole resynchronization from standby databases over network. I don’t need to tell you that is no suitable perform multi-terabyte database restoration over network, most of time (you can have a dedicated network to play with).
 
 
Benefits
 
  • Simplicity to create or recreate the standby database
  • Best control of the process
 
Scenario
 
  • Primary database: orclcdb
  • Standby database: orclstb
 
The primary database and also the standby are managed by Dataguard broker (DGMGRL), I like to use this command line utility to manage the Dataguard environments. It helps to perform all dataguard administration like switchover, configure and a lot more of tasks in a Dataguard environment, if you want to have more information about this incredible tool, please refer the official documentation on the link https://docs.oracle.com/en/database/oracle/oracle-database/19/dgbkr/index.html
 
 
Verification
 
As I mentioned before, using the command line interface from Dataguard Broker, we can use the command show configuration to display the configuration made by us, previously, in our case, on this Dataguard configuration we have the primary database orclcdb and also the standby database orclstb, we can have multiple standby databases and also ZDLRA registered on broker. We will cause some damage on standby database just to show how to solve it on this article, by removing a datafile, let’s say system01.dbf, just for fun.
              We can see on the imagem below, there are no problems there, like a gap or any other kind of issues, in this case, if we lose the primary database, the standby will be able to become the primary database with just a manual switchover. It is also possible to automate this task, by configuring Fast-Start Failover.

Causing some damage on standby

 

              On the follwoing image, we can verify all belonging datafiles from our standby database, where we will cause some damage and recover it just using recover standby database from service. The list below, shows all datafiles from our standby (orclstb)
Now, we will remove the main datafile on the standby, /u01/app/oracle/oradata/ORCLSTB/system01.dbf
As we expected, after remove, one of the datafiles on purpose, our environment crashed, look:

Disabling the apply on standby database

 

            Before to initiate with the recover process, from this standby database, we will neneed  stop the archivelog synchronization, to do that, we can use the commands below, to edit the state of the database within broker. If you don’t do that, we will face this error:

 

starting media recovery

media recovery failed

RMAN-00571: ===========================================================

RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============

RMAN-00571: ===========================================================

RMAN-03002: failure of recover command at 01/19/2020 11:16:26

RMAN-03015: error occurred in stored script Memory Script

RMAN-11003: failure during parse/execution of SQL statement: alter database recover

 if needed standby start

ORA-01153: an incompatible media recovery is active


        As we can see, the command to stop apply on DGMGRL was issued, and now, no more error will happen.
            After stopping the recover, we just need to login on RMAN and issue the command recover standby database from service, as we can follow up on upcomming lines of this article:

 

[oracle@ora19c ~]$ rman target=sys/oracle@orclstb




Recovery Manager: Release 19.0.0.0.0 - Production on Sun Jan 19 11:27:59 2020

Version 19.3.0.0.0




Copyright (c) 1982, 2019, Oracle and/or its affiliates.  All rights reserved.




connected to target database: ORCLCDB (DBID=2780785463, not open)




RMAN> recover standby database from service orclcdb;




Starting recover at 19-JAN-20

using target database control file instead of recovery catalog

Oracle instance started




Total System Global Area    1895823376 bytes




Fixed Size                     9136144 bytes

Variable Size                436207616 bytes

Database Buffers            1442840576 bytes

Redo Buffers                   7639040 bytes




contents of Memory Script:

{

   restore standby controlfile from service  'orclcdb';

   alter database mount standby database;

}

executing Memory Script




Starting restore at 19-JAN-20

allocated channel: ORA_DISK_1

channel ORA_DISK_1: SID=39 device type=DISK




channel ORA_DISK_1: starting datafile backup set restore

channel ORA_DISK_1: using network backup set from service orclcdb

channel ORA_DISK_1: restoring control file

channel ORA_DISK_1: restore complete, elapsed time: 00:00:02

output file name=/u01/app/oracle/oradata/ORCLSTB/control01.ctl

output file name=/u01/app/oracle/fast_recovery_area/ORCLSTB/control02.ctl

Finished restore at 19-JAN-20




released channel: ORA_DISK_1

Statement processed




contents of Memory Script:

{

set newname for datafile  1 to

 "/u01/app/oracle/oradata/ORCLSTB/system01.dbf";

   restore from service  'orclcdb' datafile

    1;

   catalog datafilecopy  "/u01/app/oracle/oradata/ORCLSTB/system01.dbf";

   switch datafile all;

}

executing Memory Script




executing command: SET NEWNAME




Starting restore at 19-JAN-20

Starting implicit crosscheck backup at 19-JAN-20

allocated channel: ORA_DISK_1

channel ORA_DISK_1: SID=49 device type=DISK

Crosschecked 3 objects

Finished implicit crosscheck backup at 19-JAN-20




Starting implicit crosscheck copy at 19-JAN-20

using channel ORA_DISK_1

Crosschecked 2 objects

Finished implicit crosscheck copy at 19-JAN-20




searching for all files in the recovery area

cataloging files...

cataloging done




List of Cataloged Files

=======================

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_07/o1_mf_1_25_gyq4g4hz_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_07/o1_mf_1_26_gyq5km3v_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_07/o1_mf_1_27_gyq64bom_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_07/o1_mf_1_28_gyq64n7f_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_07/o1_mf_1_29_gyq64zy8_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_07/o1_mf_1_30_gyq6vxg5_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_07/o1_mf_1_32_gyq6vzrv_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_07/o1_mf_1_31_gyq6vzs6_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_07/o1_mf_1_33_gyq84b59_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_07/o1_mf_1_34_gyq88dll_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_07/o1_mf_1_35_gyq9rvx9_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_07/o1_mf_1_36_gyqcf807_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_07/o1_mf_1_37_gyqcfsk1_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_07/o1_mf_1_38_gyqcg4l3_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_07/o1_mf_1_39_gyqckh4o_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_07/o1_mf_1_41_gyqckkfk_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_07/o1_mf_1_40_gyqckkg6_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_07/o1_mf_1_42_gyqcpsko_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_07/o1_mf_1_43_gyqdchj7_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_08/o1_mf_1_44_gyso1jwz_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_08/o1_mf_1_45_gytopnjo_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2019_12_08/o1_mf_1_46_gytowb7h_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2020_01_09/o1_mf_1_47_h1gstntb_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2020_01_09/o1_mf_1_48_h1gtcgd0_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2020_01_09/o1_mf_1_49_h1gv3wqn_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2020_01_09/o1_mf_1_50_h1gv8rbs_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2020_01_09/o1_mf_1_51_h1gvo2x8_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2020_01_16/o1_mf_1_52_h21lqfd2_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2020_01_16/o1_mf_1_53_h21m7fwt_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/archivelog/2020_01_19/o1_mf_1_54_h28w8y78_.arc

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/autobackup/2019_12_07/o1_mf_s_1026372241_gyq6h263_.bkp

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/autobackup/2019_12_07/o1_mf_s_1026373077_gyq7r09k_.bkp

File Name: /u01/app/oracle/fast_recovery_area/ORCLSTB/autobackup/2019_12_07/o1_mf_s_1026378166_gyqd9fxy_.bkp




using channel ORA_DISK_1




channel ORA_DISK_1: starting datafile backup set restore

channel ORA_DISK_1: using network backup set from service orclcdb

channel ORA_DISK_1: specifying datafile(s) to restore from backup set

channel ORA_DISK_1: restoring datafile 00001 to /u01/app/oracle/oradata/ORCLSTB/system01.dbf

channel ORA_DISK_1: restore complete, elapsed time: 00:00:07

Finished restore at 19-JAN-20




cataloged datafile copy

datafile copy file name=/u01/app/oracle/oradata/ORCLSTB/system01.dbf RECID=5 STAMP=1030102136




datafile 1 switched to datafile copy

input datafile copy RECID=5 STAMP=1030102136 file name=/u01/app/oracle/oradata/ORCLSTB/system01.dbf




contents of Memory Script:

{

  recover database from service  'orclcdb';

}

executing Memory Script




Starting recover at 19-JAN-20

using channel ORA_DISK_1

skipping datafile 1; already restored to SCN 4595092

skipping datafile 3; already restored to SCN 4594570

skipping datafile 5; already restored to SCN 2163739

skipping datafile 6; already restored to SCN 2163739

skipping datafile 7; already restored to SCN 4594577

skipping datafile 8; already restored to SCN 2163739

skipping datafile 9; already restored to SCN 4594580

skipping datafile 10; already restored to SCN 4594582

skipping datafile 12; already restored to SCN 4594588

skipping datafile 13; already restored to SCN 4594593

skipping datafile 14; already restored to SCN 4594596

skipping datafile 15; already restored to SCN 4594598

skipping datafile 19; already restored to SCN 4594600

skipping datafile 20; already restored to SCN 4594604

skipping datafile 21; already restored to SCN 4594611




starting media recovery




media recovery complete, elapsed time: 00:00:00

Finished recover at 19-JAN-20

Finished recover at 19-JAN-20




RMAN>


 

      Thus, we are able to see,  with just one simple commad line on RMAN, how can be easy recover/rebuild my entire dataguard database and stay ready for the next disaster recover, we hope not face that, but problems happens. The final result, our standby database was restored and can be fully syncrhonized again with the production database.
 
So, via DGMGRL interface, we can enable the synchronization again as we can see on the image:
 
 
For validation purposes, we will issue a switchover to change the roles of our databases, making the primary database becoming the standby and standby becoming the primary:
 
Now, our database orclstb, originally standby database is the primary, and the database orclcdb was changed to physical standby role:
Alls are opened and ready to use, the instainswith db_unique_name = orclstb now is with PRIMARY ROLE.
Conclusion: Oracle is always inovating and make our lives easier. With every launched version, new features are added. We can agree on this, this funcionality saves a lot of work, but, be concerned that you restore is fully using network, if your database is huge, you may kill the performance of your entire network.

 

 

André Ontalba / Rodrigo Mufalani / Y V RaviKumar 

 

Disclaimer“The postings on this site are my own and don’t necessarily represent my actual employer positions, strategies or opinions. The information here was edited to be useful for general purpose, specific data and identifications were removed to allow reach the generic audience and to be useful for the community.”

 


OBSERVERS, MORE THAN ONE
Category: Database Author: Fernando Simon (Board Member) Date: 5 years ago Comments: 0

OBSERVERS, MORE THAN ONE

Recently I made a post about a little issue that I got with Oracle Dataguard. In that scenario, because outage in the standby datacenter, healthy primary database shutdown with error “ORA-16830: primary isolated…”. Just to remember that the database was running with Maximum Availability, Fail-start Failover enabled and (the most important detail) the Observer was running in the standby datacenter too.
The point from my post (that you can read here) tried to show that does not exists one doc that provide full details about “pros” and “cons” where put your observer. Whatever place, on primary datacenter or in standby, have little details to check. Even the best (ideal) scenario with a third datacenter can be tough to sustain.
Here I will try to show one option that can help you and improve the reliability of your environment. At least, you will have more options to decide how to protect your database. Bellow I show some details about how to configure and use multiple observers, but if you want to see a little concern about this you can directly to the end of the post.

 

 

More than one

Basically, to do that, you can add more than one observer to protect your DG environment. It is simple to configure, and you can use this since 12.2 and have at least three of them. To configure you just need to do (in the simplest way):
  1. Install the default Oracle Client infrastructure.
  2. Add TNS entry for/to both sides.
  3. Open the DGMGRL.
  4. Call “start observer” command.
Check how easy it is:

 

[oracle@dbobss ~]$ dgmgrl sys/oracle@orcls

DGMGRL for Linux: Release 12.2.0.1.0 - Production on Sun May 5 16:30:58 2019




Copyright (c) 1982, 2017, Oracle and/or its affiliates.  All rights reserved.




Welcome to DGMGRL, type "help" for information.

Connected to "orcls"

Connected as SYSDBA.

DGMGRL> start observer

[W000 05/05 16:31:40.34] FSFO target standby is orcls

[W000 05/05 16:31:42.53] Observer trace level is set to USER

[W000 05/05 16:31:42.53] Try to connect to the primary.

[W000 05/05 16:31:42.53] Try to connect to the primary orcl.

[W000 05/05 16:31:42.54] The standby orcls is ready to be a FSFO target

[W000 05/05 16:31:42.54] Reconnect interval expired, create new connection to primary database.

[W000 05/05 16:31:42.54] Try to connect to the primary.

[W000 05/05 16:31:43.68] Connection to the primary restored!

[W000 05/05 16:31:44.68] Disconnecting from database orcl.

 

When using multiple observers you can have at least 3 observers at same time. Exists only one master observer and it is responsible for fast-start failover and protect the system. If you lost the master observer the Broker/Primary/Standby decide which one will be the next master observer. Until the 19c version they not work in quorum (or something like this using a voting system to decide the role switch) to protect the DG.
The interesting part about multiple observer it is that provide to you another way to customize your environment. Remember in my first post I reported the complexity (bases in pros and con) to choose the better place to put the observer. Now with multiple observers, you can put one in each data center and switch between then when you want to protect one side or another.
Now, my example environment it is two databases, three observers:

 

Check that I have one in each datacenter and one in external place. And inside of broker you can see:

 

DGMGRL> show observer




Configuration - dgconfig




  Primary:            orcl

  Target:             orcls




Observer "dbobss" - Master




  Host Name:                    dbobss

  Last Ping to Primary:         1 second ago

  Last Ping to Target:          1 second ago




Observer "dbobsp" - Backup




  Host Name:                    dbobsp

  Last Ping to Primary:         1 second ago

  Last Ping to Target:          0 seconds ago




Observer "dbobst" - Backup




  Host Name:                    dbobst

  Last Ping to Primary:         2 seconds ago

  Last Ping to Target:          2 seconds ago


 

In case of failure Broker/Primary/Standby decides which one will be the next master observer. The time to decides that occurs after 30 seconds and need to be coordinated/communicated and the agreement from both, primary and standby. Unfortunately, there is no way to reduce this time/check from 30 seconds.
In my environment, I shutdown the machine running the master observer (dbobss) and the log from broker (in primary):

 

05/05/2019 17:15:34

FSFP: FSFO SetState(st=43 "SET SWOB INPRG", fl=0x0 "", ob=0x0, tgt=0, v=0)

Data Guard Broker initiated a master observer switch since the current master observer cannot reach the primary database

FSFP: FSFO SetState(st=12 "SET OBID", fl=0x0 "", ob=0x32cc2ad6, tgt=0, v=0)

Succeeded in switching master observer from observer 'dbobss' to 'dbobsp'

FSFP: FSFO SetState(st=44 "CLR SWOB INPRG", fl=0x0 "", ob=0x0, tgt=0, v=0)

FSFP: FSFO SetState(st=16 "UNOBSERVED", fl=0x0 "", ob=0x0, tgt=0, v=0)

Master observer begins pinging this instance

Fore: FSFO SetState(st=15 "OBSERVED", fl=0x0 "", ob=0x0, tgt=0, v=0)

 

And in the broker log for standby:

 

05/05/2019 17:15:34

drcx: FSFO SetState(st=16 "UNOBSERVED", fl=0x0 "", ob=0x0, tgt=0, v=0)

drcx: FSFO SetState(st=43 "SET SWOB INPRG", fl=0x0 "", ob=0x0, tgt=0, v=0)

05/05/2019 17:15:37

drcx: FSFO SetState(st=15 "OBSERVED", fl=0x0 "", ob=0x0, tgt=0, v=0)

drcx: FSFO SetState(st=44 "CLR SWOB INPRG", fl=0x0 "", ob=0x0, tgt=0, v=0)

drcx: FSFO SetState(st=12 "SET OBID", fl=0x0 "", ob=0x32cc2ad6, tgt=0, v=0)

05/05/2019 17:15:39

Master observer begins pinging this instance

 

Look in the logs that both (primary and standby) agreed with the change. After the failure you saw the events SET SWOB INPRG (switch observer in progress) and SET OBID (set observer ID) and CLR SWOB INPRG (clear switch observer in progress) to confirm that was detect UNOBSERVED state. You can see here the output when you use the trace level for broker as support. Interesting note that inside broker the faulty observer does not disappears after the failure:

 

DGMGRL> show observer




Configuration - dgconfig




  Primary:            orcl

  Target:             orcls




Observer "dbobsp" - Master




  Host Name:                    dbobsp

  Last Ping to Primary:         1 second ago

  Last Ping to Target:          3 seconds ago




Observer "dbobss" - Backup




  Host Name:                    dbobss

  Last Ping to Primary:         256 seconds ago

  Last Ping to Target:          221 seconds ago




Observer "dbobst" - Backup




  Host Name:                    dbobst

  Last Ping to Primary:         1 second ago

  Last Ping to Target:          5 seconds ago




DGMGRL>

 

After you reinstate you observer and it go back, you can simple set the master observer to the desired one:

 

DGMGRL> set masterobserver to dbobss;

Sent the proposed master observer to the data guard broker configuration. Please run SHOW OBSERVER to see if master observer switch actually happens.

DGMGRL>

 

Hierarchy

When you use multiple observers you can have more control how to protect your DG, you can have one observer in each site and choose the side that you want to protect. You can write one script to check the database role in the observer side and change the master to protect the desired database role.
Remember my previous post. If you choose to protect the primary (with observer in the same datacenter), if your entire datacenter fails, FSFO not occurs because standby does not decide alone. If you choose to protect the standby (with observer in the same datacenter), a datacenter/network failure in standby side, this can lead you a complete shutdown from a healthy primary database because it become “isolated”.
Since multiple observers continues to use hierarchy decision, the decision remains over only one observer. Even if you have a multiple observers, 3 as example and one in each side, if you put the master observer in the same site than standby and they become isolated, they still decide alone and because FSFO the primary continues to shutdown because it thinks that it is isolated. Even if it continues to receive connections from other two observers.
Because the actual design, even if you put the “FastStartFailoverThreshold” as 240, the automatic switch from Master observer does not occurs because the standby side cannot be reach to confirm the change. Maybe for the next versions (20, 21…) we can see a change in this design and when you use multiple observers voting/quorum method are used to decide role change for FSFO. Of course that even a quorum approach can lead a problem if you put two in the same datacenter, but it can mitigate problems in some cases.
In my next post I will dig more about this, with some examples and logs/traces analyses. You will see some details when the standby is isolated and you use multiple observers.

 

 

Disclaimer: “The postings on this site are my own and don’t necessarily represent my actual employer positions, strategies or opinions. The information here was edited to be useful for general purpose, specific data and identifications were removed to allow reach the generic audience and to be useful for the community.”


Observer, Where?
Category: Database Author: Fernando Simon (Board Member) Date: 5 years ago Comments: 0

Observer, Where?

Some months ago I got one error with Oracle Data Guard and now I had time to review again and time write this article. Just to be clear since the begin, the discussion here it is not about the error itself, but about the circumstances that generated it.
The environment described here follow the most, and common, best practices for DG by Oracle. Have 1 dedicated server for each one: Primary Database, Physical Standby Database and Observer. The primary and standby resides in different datacenters in different cities, dedicated network for interconnect between sites, protection mode was Maximum Availability and runs with fast-start failover enabled (with 30 seconds for threshold). The version here is 12.2, but will be the same for 19c. So, nothing so bad in the environment, basic DG configuration trying to follow the best practices.
But, one day, application servers running, primary linux db server running, but database itself down. Looking for the cause, found in the broker log:

 

 

03/10/2019 12:48:05

LGWR: FSFO SetState(st=2 "UNSYNC", fl=0x2 "WAIT", ob=0x0, tgt=0, v=0)

LGWR: FSFO SetState("UNSYNC", 0x2) operation requires an ack

        Primary database will shutdown within 30 seconds if permission

         is not granted from Observer or FSFO target standby to proceed

LGWR:   current in-memory FSFO state flags=0x40001, version=46

03/10/2019 12:48:19

A Fast-Start Failover target switch is necessary because the primary cannot reach the Fast-Start Failover target standby database

A target switch was not attempted because the observer has not pinging primary recently.

FSFP network call timeout. Killing process FSFP.

03/10/2019 12:48:36

Notifying Oracle Clusterware to disable services and monitoring because primary will be shutdown

Primary has heard from neither observer nor target standby

        within FastStartFailoverThreshold seconds. It is

        likely an automatic failover has already occurred.

        The primary is shutting down.

LGWR: FSFO SetState(st=8 "FO PENDING", fl=0x0 "", ob=0x0, tgt=0, v=0)

LGWR: Shutdown primary instance 1 now because the primary has been isolated




And in the alertlog:

2019-03-10T12:48:35.860142+01:00

Starting background process FSFP

2019-03-10T12:48:35.870802+01:00

FSFP started with pid=30, OS id=6055

2019-03-10T12:48:36.875828+01:00

Primary has heard from neither observer nor target standby within FastStartFailoverThreshold seconds.

It is likely an automatic failover has already occurred. Primary is shutting down.

2019-03-10T12:48:36.876434+01:00

Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_lgwr_2472.trc:

ORA-16830: primary isolated from fast-start failover partners longer than FastStartFailoverThreshold seconds: shutting down

LGWR (ospid: 2472): terminating the instance due to error 16830

 

But this is (and was) not a DG problem, the DG made what was design to do. Primary lost the communication with the Standby and Observer and after the Fast-Start Failover threshold, FSFP killed the primary because it don’t know if was evicted and want avoid split brain (or something similar). Worked as designed!
As I wrote before the main question here is not the ORA-XXXX error, but the circumstances. In this case, by design definition (and probably based in the docs), chosen to put the observer in the same site than standby. But, because one failure in the standby datacenter (just in the enclosure that runs blades for Oracle stuffs, application continued to runs), the entire database was unavailable. One outage in standby datacenter, shutdown the primary database even DG running in Maximum Availability mode.

 

 

As you can imagine, because the design decision “where to put the observer”, everything was down. If the observer was running in the primary datacenter, nothing supposed to occurs. But here it is the point for this post: “Where you put the Observer?”, “Primary site? Standby site?”, and “Why?”, “How you based this decision?”. Appears to be a simple question to answer, but there are a lot of pros and cons, and there is not much information about that.
If you search in the Oracle docs they recommend to put the observer in a third datacenter, isolated from the others, or in the in the same network than application, or in the standby datacenter. Here: https://www.oracle.com/technetwork/database/availability/maa-roletransitionbp-2621582.pdf (page 9). But, where are pros and cons for the decision? And how many clients that have a third datacenter? And if you search about where put the observer over the google, the 99% spread the same information (that go in the opposite of docs) “primary site”. But again, “Why?”.

 

 

Observer in primary site:
  • Pros: Protect for most failures (primary db crash failure, db logical crash as example), low impact network issues for observer (usually same LAN than primary).
  • Cons: Not protect (not switch) in case of whole primary datacenter failure.
  • When to use: When you want to avoid “false positive” switches in case of network problems against standby datacenter or you “not trust” in standby datacenter. Or because all transactions are important (will explain later).

Observer in standby site:

  • Pros: Protect against whole primary datacenter failure.
  • Cons: Heavy dependency from network between sites (need even multiple paths), can suffer “false positives” switches since standby site decides even if primary is running correctly (similar than related here). When database is more important than transactions, maybe Maximum Availability is not suitable (explain later).
  • When to use: When you want to protect the database and when your system and network infrastructure between datacenter are reliable.
Observer Third datacenter:
  • Pro: Cover all scenarios of failures.
  • Cons: Very heavily dependent of good network to avoid “false positives” or will suffer from fast-start failover disabled. Will be more expensive.
  • When to use: when you want to protect for most of possible scenarios.
Bellow just one idea about a good design for high availability for database and applications. This came directly from the Oracle docs about  “Recovering from Unscheduled Outages – https://docs.oracle.com/database/121/HABPT/E40019-02.pdf

 

 

 
Above I talked about transactions and Maximum Availability that it is deeply related here. Remember that primary database shutdown only after the fast start failover threshold? This means that primary database received transactions during 30 seconds before shutting down. If you put observer in standby site you can suffer from data loss (of course that failover, by design, means that) because standby side decides everything.
 
If you do that, maybe you need to operate in Maximum Protection to have zero data loss. This is more clear, or critical, if your database receive connections from applications that are not in the same datacenter and can connect in both at same time. In Maximum Protection you avoid that primary commit data in moments of possible failures, but you will put some overhead in every transaction operating in sync mode (https://www.oracle.com/technetwork/database/availability/sync-2437177.pdf). So, you need to decide if database running is more important than transactions.
 
Of course that every strategy have pro and cons. Observer in primary is more easy, maybe can allow you to use less strict protection and continue to have a high transaction protection, but not handle full datacenter failure. Observer in standby can protect from datacenter failure, but is possible that you need to handle with more “false positive” switches of primary db or even complete primary shutdown (as related here) in the case of standby datacenter failure. Adding the fact that maybe your application needs to allow some data loss, or if not, you may need to operate in maximum protection. Observer in third site can be best option for data protection, but will be more expensive and heavily network and operational dependent.
 
As you can see, there is not easy design. Even a simple choose, like observer location, will left plenty of decisions to be made. With Oracle 12.2 and beyond (including 19c) you have better options to handle it, I will pass over this in the next post about multiple observers, but there is no (yet) 100% solution to cover all possible scenarios.
Martin Bracher, Dr. Martin Wunderli and Torsten Rosenwald made a good cover of this in the past. You can check the presentations here:
 
https://www.doag.org/formes/pubfiles/303232/2008-K-IT-Bracher-Dataguard_Observer_ohne_Rechenzentrum.pdf
https://www.trivadis.com/sites/default/files/downloads/fsfo_understood_decus07.pdf
https://www.doag.org/formes/pubfiles/218046/FSFO.pdf
I recommend read the docs too:
https://docs.oracle.com/en/database/oracle/oracle-database/12.2/dgbkr/data-guard-broker.pdf
https://www.oracle.com/technetwork/database/availability/maa-roletransitionbp-2621582.pdf
https://docs.oracle.com/en/database/oracle/oracle-database/12.2/sbydb/data-guard-concepts-and-administration.pdf
https://docs.oracle.com/database/121/HABPT/E40019-02.pdf
 

Disclaimer: “The postings on this site are my own and don’t necessarily represent my actual employer positions, strategies or opinions. The information here was edited to be useful for general purpose, specific data and identifications were removed to allow reach the generic audience and to be useful for the community.”