Observer, Where?

Some months ago I got one error with Oracle Data Guard and now I had time to review again and time write this article. Just to be clear since the begin, the discussion here it is not about the error itself, but about the circumstances that generated it.

The environment described here follow the most, and common, best practices for DG by Oracle. Have 1 dedicated server for each one: Primary Database, Physical Standby Database and Observer. The primary and standby resides in different datacenters in different cities, dedicated network for interconnect between sites, protection mode was Maximum Availability and runs with fast-start failover enabled (with 30 seconds for threshold). The version here is 12.2, but will be the same for 19c. So, nothing so bad in the environment, basic DG configuration trying to follow the best practices.

But, one day, application servers running, primary linux db server running, but database itself down. Looking for the cause, found in the broker log:

03/10/2019 12:48:05

LGWR: FSFO SetState(st=2 "UNSYNC", fl=0x2 "WAIT", ob=0x0, tgt=0, v=0)

LGWR: FSFO SetState("UNSYNC", 0x2) operation requires an ack

        Primary database will shutdown within 30 seconds if permission

         is not granted from Observer or FSFO target standby to proceed

LGWR:   current in-memory FSFO state flags=0x40001, version=46

03/10/2019 12:48:19

A Fast-Start Failover target switch is necessary because the primary cannot reach the Fast-Start Failover target standby database

A target switch was not attempted because the observer has not pinging primary recently.

FSFP network call timeout. Killing process FSFP.

03/10/2019 12:48:36

Notifying Oracle Clusterware to disable services and monitoring because primary will be shutdown

Primary has heard from neither observer nor target standby

        within FastStartFailoverThreshold seconds. It is

        likely an automatic failover has already occurred.

        The primary is shutting down.

LGWR: FSFO SetState(st=8 "FO PENDING", fl=0x0 "", ob=0x0, tgt=0, v=0)

LGWR: Shutdown primary instance 1 now because the primary has been isolated




And in the alertlog:

2019-03-10T12:48:35.860142+01:00

Starting background process FSFP

2019-03-10T12:48:35.870802+01:00

FSFP started with pid=30, OS id=6055

2019-03-10T12:48:36.875828+01:00

Primary has heard from neither observer nor target standby within FastStartFailoverThreshold seconds.

It is likely an automatic failover has already occurred. Primary is shutting down.

2019-03-10T12:48:36.876434+01:00

Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_lgwr_2472.trc:

ORA-16830: primary isolated from fast-start failover partners longer than FastStartFailoverThreshold seconds: shutting down

LGWR (ospid: 2472): terminating the instance due to error 16830

But this is (and was) not a DG problem, the DG made what was design to do. Primary lost the communication with the Standby and Observer and after the Fast-Start Failover threshold, FSFP killed the primary because it don’t know if was evicted and want avoid split brain (or something similar). Worked as designed!

As I wrote before the main question here is not the ORA-XXXX error, but the circumstances. In this case, by design definition (and probably based in the docs), chosen to put the observer in the same site than standby. But, because one failure in the standby datacenter (just in the enclosure that runs blades for Oracle stuffs, application continued to runs), the entire database was unavailable. One outage in standby datacenter, shutdown the primary database even DG running in Maximum Availability mode.

As you can imagine, because the design decision “where to put the observer”, everything was down. If the observer was running in the primary datacenter, nothing supposed to occurs. But here it is the point for this post: “Where you put the Observer?”, “Primary site? Standby site?”, and “Why?”, “How you based this decision?”. Appears to be a simple question to answer, but there are a lot of pros and cons, and there is not much information about that.

If you search in the Oracle docs they recommend to put the observer in a third datacenter, isolated from the others, or in the in the same network than application, or in the standby datacenter. Here: https://www.oracle.com/technetwork/database/availability/maa-roletransitionbp-2621582.pdf (page 9). But, where are pros and cons for the decision? And how many clients that have a third datacenter? And if you search about where put the observer over the google, the 99% spread the same information (that go in the opposite of docs) “primary site”. But again, “Why?”.

Observer in primary site:

Pros: Protect for most failures (primary db crash failure, db logical crash as example), low impact network issues for observer (usually same LAN than primary).
Cons: Not protect (not switch) in case of whole primary datacenter failure.
When to use: When you want to avoid “false positive” switches in case of network problems against standby datacenter or you “not trust” in standby datacenter. Or because all transactions are important (will explain later).

Observer in standby site:

Pros: Protect against whole primary datacenter failure.
Cons: Heavy dependency from network between sites (need even multiple paths), can suffer “false positives” switches since standby site decides even if primary is running correctly (similar than related here). When database is more important than transactions, maybe Maximum Availability is not suitable (explain later).
When to use: When you want to protect the database and when your system and network infrastructure between datacenter are reliable.

Observer Third datacenter:

Pro: Cover all scenarios of failures.
Cons: Very heavily dependent of good network to avoid “false positives” or will suffer from fast-start failover disabled. Will be more expensive.
When to use: when you want to protect for most of possible scenarios.

Bellow just one idea about a good design for high availability for database and applications. This came directly from the Oracle docs about “Recovering from Unscheduled Outages – https://docs.oracle.com/database/121/HABPT/E40019-02.pdf”

Above I talked about transactions and Maximum Availability that it is deeply related here. Remember that primary database shutdown only after the fast start failover threshold? This means that primary database received transactions during 30 seconds before shutting down. If you put observer in standby site you can suffer from data loss (of course that failover, by design, means that) because standby side decides everything.

If you do that, maybe you need to operate in Maximum Protection to have zero data loss. This is more clear, or critical, if your database receive connections from applications that are not in the same datacenter and can connect in both at same time. In Maximum Protection you avoid that primary commit data in moments of possible failures, but you will put some overhead in every transaction operating in sync mode (https://www.oracle.com/technetwork/database/availability/sync-2437177.pdf). So, you need to decide if database running is more important than transactions.

Of course that every strategy have pro and cons. Observer in primary is more easy, maybe can allow you to use less strict protection and continue to have a high transaction protection, but not handle full datacenter failure. Observer in standby can protect from datacenter failure, but is possible that you need to handle with more “false positive” switches of primary db or even complete primary shutdown (as related here) in the case of standby datacenter failure. Adding the fact that maybe your application needs to allow some data loss, or if not, you may need to operate in maximum protection. Observer in third site can be best option for data protection, but will be more expensive and heavily network and operational dependent.

As you can see, there is not easy design. Even a simple choose, like observer location, will left plenty of decisions to be made. With Oracle 12.2 and beyond (including 19c) you have better options to handle it, I will pass over this in the next post about multiple observers, but there is no (yet) 100% solution to cover all possible scenarios.

Martin Bracher, Dr. Martin Wunderli and Torsten Rosenwald made a good cover of this in the past. You can check the presentations here:

https://www.doag.org/formes/pubfiles/303232/2008-K-IT-Bracher-Dataguard_Observer_ohne_Rechenzentrum.pdf

https://www.trivadis.com/sites/default/files/downloads/fsfo_understood_decus07.pdf

https://www.doag.org/formes/pubfiles/218046/FSFO.pdf

I recommend read the docs too:

https://docs.oracle.com/en/database/oracle/oracle-database/12.2/dgbkr/data-guard-broker.pdf

https://www.oracle.com/technetwork/database/availability/maa-roletransitionbp-2621582.pdf

https://docs.oracle.com/en/database/oracle/oracle-database/12.2/sbydb/data-guard-concepts-and-administration.pdf

https://docs.oracle.com/database/121/HABPT/E40019-02.pdf

Disclaimer: “The postings on this site are my own and don’t necessarily represent my actual employer positions, strategies or opinions. The information here was edited to be useful for general purpose, specific data and identifications were removed to allow reach the generic audience and to be useful for the community.”

Observer, Where?

Some months ago I got one error with Oracle Data Guard and now I had time to review again and time write this article. Just to be clear since the begin, the discussion here it is not about the error itself, but about the circumstances that generated it.

But, one day, application servers running, primary linux db server running, but database itself down. Looking for the cause, found in the broker log:

Observer in primary site:

Pros: Protect for most failures (primary db crash failure, db logical crash as example), low impact network issues for observer (usually same LAN than primary).

Cons: Not protect (not switch) in case of whole primary datacenter failure.

When to use: When you want to avoid “false positive” switches in case of network problems against standby datacenter or you “not trust” in standby datacenter. Or because all transactions are important (will explain later).

Pros: Protect against whole primary datacenter failure.

When to use: When you want to protect the database and when your system and network infrastructure between datacenter are reliable.