Blog – Page 10 – LUXOUG – LUXEMBOURG ORACLE USERS GROUP

Oracle E-Business Suite 12i Architecture (Part 2)

Category: E-Business Suite Author: Andre Luiz Dutra Ontalba (Board Member) Date: 6 years ago Comments: 0

Shared Application System

A traditional multi-node installation of EBS 11i required each application tier node to maintain its own file system, consisting of the APPL_TOP file system (APPL_TOP, COMMON_TOP, and a few related directories) and the application tier technology stack file system (the 8.0.6 ORACLE_HOME and the iAS ORACLE_HOME).

Later, this was changed to allow APPL_TOP to be shared between different machines, and later still to allow sharing of the entire application tier file system.

Continuing this strategy, in Release 12 Rapid Install creates a system that shares not only the APPL_TOP and COMMON_TOP file systems, but also the application tier technology stack file system.

Rapid Install sets up this configuration as the default for nodes running the same operating system.

These files make up the application tier file system, and can be shared across multiple application tier nodes (as long as they are running the same operating system).

Note: Shared file system configuration is not currently supported on application tier server nodes running Windows.

With a shared application tier file system, all application tier files are installed on a single shared disk that is mounted from each application tier node.

Any application tier node can be used to provide standard services, such as Forms, Web, or Concurrent Processing.

Shared application tier file system – Example

As well as reducing the required disk space, a shared application tier configuration has several other benefits:

Most administration, patching, and maintenance tasks need to be performed only once, on a single application tier node.

Changes made to the shared file system are immediately accessible on all application tier nodes.

Processing tasks can be distributed to run in parallel on multiple nodes (Distributed AD).

Overall disk requirements are reduced.

Adding application tier nodes becomes easier.

Sharing the Application Tier File System Between Instances

The ability to share the application tier file system was extended further in Release 12.0.4, which introduced the option to share the file system of an Oracle E-Business Suite Release 12 instance with another database instance.

An application tier file system installed and configured in this way can be used to access two (or more) database instances.

The restrictions are:

All database instances must be patched to the same level.

Only the application tier file system can be shared; the database tier file system cannot.

Note: For more information on features, options, and implementation steps, see document 384248.1, Sharing the Application Tier File System in Oracle E-Business Suite Release 12.

Environment Setting

Rapid Install creates environment files that set up the Oracle database, Oracle technology stack, Oracle HTTP Server, and Oracle E-Business Suite environments.

The location of these environment files is shown in the following table:

Filename	Location	Path	Environment
<CONTEXT_NAME>.env or <CONTEXT_NAME>.cmd	10.2.0.2 ORACLE_HOME	db/tech_st/10.2.0	Oracle Server Enterprise Edition
<CONTEXT_NAME>.env or <CONTEXT_NAME>.cmd	OracleAS 10.1.2 ORACLE_HOME	inst/apps/<context>/ora/10.1.2	Oracle tools technology stack
<CONTEXT_NAME>.env or <CONTEXT_NAME>.cmd	OracleAS 10.1.3 ORACLE_HOME	inst/apps/<context>/ora/10.1.3	Java technology stack
<CONTEXT_NAME>.env or <CONTEXT_NAME>.cmd	APPL_TOP	apps/apps_st/appl	Applications
APPS<CONTEXT_NAME>.env or APPS<CONTEXT_NAME>.cmd	APPL_TOP	apps/apps_st/appl	Consolidated setup file

On UNIX, Oracle E-Business Suite includes a consolidated environment file called APPS<CONTEXT_NAME>.env, which sets up both the Oracle E-Business Suite and the Oracle technology stack environments.

When you install Oracle E-Business Suite, Rapid Install creates this script in the APPL_TOP directory. Many of the parameters are specified during the installation process.

On Windows, the equivalent consolidated environment file is called %APPL_TOP%\envshell<CONTEXT_NAME>.cmd.

Running it creates a command window with the necessary environment settings for Oracle E-Business Suite. All subsequent operations on APPL_TOP (for example, running adadmin or adpatch) must be performed from this window.

The following table lists the key environment settings in APPS<CONTEXT_NAME>.env.

Parameter	Description
APPLFENV	The name of the environment file, <CONTEXT_NAME>.env. If you rename the environment file, this parameter must be updated.
PLATFORM	The operating system in use. The value (for example, LINUX) must match the value in the APPL_TOP/admin/adpltfrm.txt file.
APPL_TOP	The main directory for this Oracle E-Business Suite installation.
ADMIN_SCRIPTS_HOME	The directory under $INST_TOP that holds scripts such as adautocfg.sh, adpreclone.sh, adstrtal.sh, and adstpall.sh.
FNDNAM	The name of the ORACLE schema to which the System Administration responsibility connects. The default is APPS.
GWYUID	The public ORACLE username and password that gives access to the initial Oracle E-Business Suite sign-on form. The default is APPLSYSPUB/PUB.
FND_TOP	The path to the Application Object Library directory. For example, apps/apps_st/appl/fnd/12.0.0.
AU_TOP	The path to the Applications Utilities directory. For example, apps/apps_st/appl/au/12.0.0.
<PROD>_TOP	The path to a product’s top directory. There is an entry for each Oracle E-Business Suite product.
PATH	Sets the directory search path, for example, to FND_TOP and AD_TOP.
APPLDCP	Specifies whether distributed concurrent processing is in use. Distributed concurrent processing spreads the processing load across multiple concurrent processing nodes.
APPCPNAM	Indicates whether the names of the Concurrent Manager log and output files follow the 8.3 file name convention (maximum 8 characters to the left of the dot and 3 to the right, for example, alogfile.log). If this parameter is set to “REQID” (required), Concurrent Manager uses file names that meet 8.3 naming requirements.
APPLCSF	Identifies the top-level directory for Concurrent Manager log and output files. They are consolidated into a single directory for all products. For example, /inst/apps/<context>/logs/appl/conc.
APPLLOG	The subdirectory for Concurrent Manager log files. The default is log.
APPLOUT	The subdirectory for Concurrent Manager output files. The default is out.
APPLTMP	Identifies the directory for Oracle E-Business Suite temporary files. The default is $INST_TOP/tmp on UNIX.
APPLPTMP	Identifies the directory for temporary PL/SQL output files. The possible directory options must be listed in the init.ora parameter utl_file_dir.
INST_TOP	Identifies the top level directory for this instance. For example, inst/apps/<context>. Introduced with Release 12.
NLS_LANG	The language, territory, and character set installed in the database. The default for a new installation is “AMERICAN_AMERICA.US7ASCII”.
NLS_DATE_FORMAT	The National Language Support date format. The default is “DD-MON-RR”, for example, 14-JUL-19.
NLS_NUMERIC_CHARACTERS	The National Language Support numeric separators. The default is “.,” (period and comma).
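
To make the 8.3 file name convention mentioned for APPCPNAM concrete, here is a minimal Python sketch. The helper name is_8_3_name is hypothetical (it is not part of Oracle E-Business Suite), and the rule it checks is just the one stated in the table: at most 8 characters before the dot and at most 3 after it.

```python
import re

# Hypothetical helper, not an Oracle utility: 1 to 8 characters before
# the dot, 1 to 3 after it, with no extra dots or whitespace.
_EIGHT_DOT_THREE = re.compile(r"[^.\s]{1,8}\.[^.\s]{1,3}")


def is_8_3_name(filename: str) -> bool:
    """Return True if filename fits the 8.3 convention described above."""
    return _EIGHT_DOT_THREE.fullmatch(filename) is not None


print(is_8_3_name("alogfile.log"))       # True
print(is_8_3_name("averylongname.log"))  # False
```

The example name alogfile.log from the table passes because "alogfile" is exactly 8 characters long.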

Most temporary files are written to the location specified by the APPLTMP environment setting, which is set by Rapid Install.

Oracle E-Business Suite products also create temporary PL/SQL output files used in concurrent processing. These files are written to a location on the database server node specified by the APPLPTMP environment setting.

The APPLPTMP directory must be the same directory specified by the utl_file_dir parameter in your database initialization file.

Rapid Install sets both APPLPTMP and the utl_file_dir parameter to the same default directory.
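
As a rough illustration of the rule above, the following Python sketch checks whether an APPLPTMP value appears among the directories listed in utl_file_dir. The function name and the sample paths are assumptions made for the example, and it assumes utl_file_dir is given as a comma-separated list of UNIX paths.

```python
import posixpath


def applptmp_in_utl_file_dir(applptmp: str, utl_file_dir: str) -> bool:
    """Return True if applptmp is one of the comma-separated utl_file_dir entries.

    Paths are normalized so that a trailing slash does not cause a mismatch.
    """
    wanted = posixpath.normpath(applptmp.strip())
    entries = [
        posixpath.normpath(entry.strip())
        for entry in utl_file_dir.split(",")
        if entry.strip()
    ]
    return wanted in entries


# Sample values are illustrative only.
print(applptmp_in_utl_file_dir("/usr/tmp", "/usr/tmp, /u01/app/temp"))  # True
print(applptmp_in_utl_file_dir("/var/tmp", "/usr/tmp"))                 # False
```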

Some Oracle E-Business Suite utilities use your operating system's default temporary directory even if you configure the environment settings listed in the previous paragraphs. You must therefore ensure that there is sufficient free disk space in this directory, as well as in those indicated by APPLTMP and APPLPTMP.
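
One possible way to keep an eye on this is sketched below in Python. The function free_space_report and its threshold parameter are hypothetical, not an Oracle tool. It looks at the operating system's default temporary directory plus APPLTMP and APPLPTMP when those variables are set, and it reports a missing directory rather than failing, since APPLPTMP may not exist on every node.

```python
import os
import shutil
import tempfile


def free_space_report(min_free_bytes: int, environ=None) -> dict:
    """Map each temp location to (path, free_bytes or None, has_enough_space)."""
    environ = os.environ if environ is None else environ
    candidates = {"OS temp": tempfile.gettempdir()}
    for var in ("APPLTMP", "APPLPTMP"):
        if environ.get(var):
            candidates[var] = environ[var]

    report = {}
    for label, path in candidates.items():
        if not os.path.isdir(path):
            # The directory is not present on this node.
            report[label] = (path, None, False)
            continue
        free = shutil.disk_usage(path).free
        report[label] = (path, free, free >= min_free_bytes)
    return report


# Example: require at least 1 GiB free in each location.
for label, (path, free, ok) in free_space_report(1024 ** 3).items():
    print(label, path, free, "OK" if ok else "CHECK")
```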

In a multi-node system, the directory defined by APPLPTMP does not need to exist on application layer servers.

Note: Temporary files placed in the utl_file_dir directory can be protected from unauthorized access by ensuring that this directory has read and write access for the Oracle database account only.

Other Environment Files

Several other key environment files are used in an Oracle E-Business Suite system.

The adovars.env file

The adovars.env file, located in $APPL_TOP/admin, specifies the location of several types of files, such as Java files, HTML files, and the JRE (Java Runtime Environment) files.

It is called from the main application environment file, <CONTEXT_NAME>.env. The adovars.env file includes comments on the purpose and recommended configuration of each variable. In a Release 12 environment, adovars.env is maintained by AutoConfig, and should not be edited manually.

The adovars.env file includes the following parameters:

Parameter	Description
AF_JLIB	Indicates the directory to which all Java archive files are copied. For example, apps/apps_st/comn/java/lib. Introduced with Release 12.
JAVA_BASE	Indicates the top level of the Java directory. For example, apps/apps_st/comn/java. Introduced with Release 12.
JAVA_TOP	Indicates the directory to which all Java class files are copied. For example, apps/apps_st/comn/java/classes. Definition changed with Release 12.
OA_JAVA	Indicates the directory to which all Java archive files are copied. For example, apps/apps_st/comn/java/classes.
OA_JRE_TOP	Indicates the location where the JRE is installed. For example, /local/java/jdk1.5.0_08.
OAH_TOP	Sets the location to which HTML files are copied. For example, apps/apps_st/comn/webapps/oacore.
OAD_TOP	Defines the location to which context-sensitive documentation files are copied. For example, apps/apps_st/comn.
LD_LIBRARY_PATH	Path used on many UNIX platforms to list the directories being scanned for dynamic library files needed at run time.
CLASSPATH	Lists scanned directories and zip files for Java class files needed at run time.

The adconfig.txt file

AD utility programs perform a variety of database and file management tasks. These utilities need certain configuration information to run successfully. This configuration information is specified when Oracle E-Business Suite is installed, and is subsequently stored in the adconfig.txt file in <APPL_TOP>/admin. Once created, this file is used by other Oracle E-Business Suite utilities.

Note: adconfig.txt is created with the APPL_TOP file system, and shows the tiers that have been configured on a particular node. It is distinct from the config.txt file created by Rapid Install.

The fndenv.env file

This file defines environment variables used by the Application Object Library. For example, it defines APPLBIN as the name of the subdirectory where product executable programs and shell scripts are stored (bin). This file must not be modified: the default values are applicable for all customers. The file is located in the FND_TOP directory.

The devenv.env file

This file defines the variables that allow you to link third-party software and your own custom-developed applications with Oracle E-Business Suite.

In Release 12, this script is located in $FND_TOP/usrxit, and is automatically called by fndenv.env. This allows you to compile and link custom Oracle Forms user exits and concurrent programs with Oracle E-Business Suite.

In the next articles, we will continue to explore the structure of the E-Business Suite and cover best practices for installing the product efficiently.

References

Oracle E-Business Suite Release 12 Technology Stack Documentation Roadmap [ID 380482.1]

I hope this helps you!

André Ontalba

Disclaimer: “The postings on this site are my own and don’t necessarily represent my actual employer positions, strategies or opinions. The information here was edited to be useful for general purpose, specific data and identifications were removed to allow reach generic audience and to be useful.”

ASM, REPLACE DISK Command

Category: Database Author: Fernando Simon (Board Member) Date: 6 years ago Comments: 0

The REPLACE DISK command was released with 12.1 and allows an online replacement of a failed disk. This command is important because it reduces the rebalance time by running just the SYNC phase. Compared with a normal disk replacement (DROP and ADD in the same command), REPLACE just does a mirror resync.

Basically, when the REPLACE command is called, the rebalance just copies/syncs the data from the surviving disk (the partner disk from the mirror). It is faster than the previous way, because drop/add executes a complete rebalance of all AUs (allocation units) of the diskgroup, running both the REBALANCE and SYNC phases.

The REPLACE DISK command is important for the disk SWAP process on Exadata (where you add the new 14TB disks), since it makes the rebalance of the diskgroup faster.

Below is one example of this behavior. Note that the AUs from DISK01 were SYNCED with the new disk:

And compare with the previous DROP/ADD disk, where all AUs from all disks were rebalanced:

Current Environment and Simulating the Failure

In this post, to simulate and show how the replace disk works, I have the DATA diskgroup with 6 disks (DISK01-06). DISK07 is not in use.

SQL> select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;




NAME                           FAILGROUP                      LABEL      PATH

------------------------------ ------------------------------ ---------- -----------

DISK01                         FAIL01                         DISK01     ORCL:DISK01

DISK02                         FAIL01                         DISK02     ORCL:DISK02

DISK03                         FAIL02                         DISK03     ORCL:DISK03

DISK04                         FAIL02                         DISK04     ORCL:DISK04

DISK05                         FAIL03                         DISK05     ORCL:DISK05

DISK06                         FAIL03                         DISK06     ORCL:DISK06

RECI01                         RECI01                         RECI01     ORCL:RECI01

SYSTEMIDG01                    SYSTEMIDG01                    SYSI01     ORCL:SYSI01

                                                              DISK07     ORCL:DISK07




9 rows selected.




SQL>

And to simulate the error, I disconnected the disk from the operating system (since I used iSCSI, I just logged out of the target for DISK02):

[root@asmrec ~]# iscsiadm -m node -T iqn.2006-01.com.openfiler:tsn.eff4683320e8 -p 172.16.0.3:3260 -u

Logging out of session [sid: 41, target: iqn.2006-01.com.openfiler:tsn.eff4683320e8, portal: 172.16.0.3,3260]

Logout of [sid: 41, target: iqn.2006-01.com.openfiler:tsn.eff4683320e8, portal: 172.16.0.3,3260] successful.

[root@asmrec ~]#

At the same moment, the ASM alert log detected the error and reported that the mirror copy was found on another disk (DISK06):

2020-03-29T00:42:11.160695+01:00

WARNING: Read Failed. group:3 disk:1 AU:29 offset:0 size:4096

path:ORCL:DISK02

         incarnation:0xf0f0c113 synchronous result:'I/O error'

         subsys:/opt/oracle/extapi/64/asm/orcl/1/libasm.so krq:0x7f3df8db35b8 bufp:0x7f3df8c9c000 osderr1:0x3 osderr2:0x2e

         IO elapsed time: 0 usec Time waited on I/O: 0 usec

WARNING: cache failed reading from group=3(DATA) fn=8 blk=0 count=1 from disk=1 (DISK02) mirror=0 kfkist=0x20 status=0x02 osderr=0x3 file=kfc.c line=13317

WARNING: cache succeeded reading from group=3(DATA) fn=8 blk=0 count=1 from disk=5 (DISK06) mirror=1 kfkist=0x20 status=0x01 osderr=0x0 file=kfc.c line=13366

So, at this moment DISK02 will not be removed instantly, but only after disk_repair_time expires:

WARNING: Started Drop Disk Timeout for Disk 1 (DISK02) in group 3 with a value 43200

WARNING: Disk 1 (DISK02) in group 3 will be dropped in: (43200) secs on ASM inst 1

cluster guid (e4db41a22bd95fc6bf79d2e2c93360c7) generated for PST Hbeat for instance 1
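
The timeout in these messages is expressed in seconds. As a small sketch (the helper drop_timeout_hours is hypothetical, and the regular expression assumes the message wording shown above), the value can be converted to hours like this:

```python
import re


def drop_timeout_hours(alert_line: str):
    """Extract the drop timeout from an ASM alert log line, in hours.

    Returns None when the line does not contain the expected message.
    """
    match = re.search(r"will be dropped in: \((\d+)\) secs", alert_line)
    if match is None:
        return None
    return int(match.group(1)) / 3600


line = "WARNING: Disk 1 (DISK02) in group 3 will be dropped in: (43200) secs on ASM inst 1"
print(drop_timeout_hours(line))  # 12.0
```

So the 43200 seconds in the log correspond to 12 hours, which is the window you have to bring the disk back (or replace it) before ASM drops it.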

If you want to check the full output from the ASM alert log, you can access it here: ASM-ALERTLOG-Output-Online-Disk-Error.txt

So, the current state of the diskgroup is:

SQL> select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;




NAME                           FAILGROUP                      LABEL      PATH

------------------------------ ------------------------------ ---------- -----------

DISK01                         FAIL01                         DISK01     ORCL:DISK01

DISK02                         FAIL01

DISK03                         FAIL02                         DISK03     ORCL:DISK03

DISK04                         FAIL02                         DISK04     ORCL:DISK04

DISK05                         FAIL03                         DISK05     ORCL:DISK05

DISK06                         FAIL03                         DISK06     ORCL:DISK06

RECI01                         RECI01                         RECI01     ORCL:RECI01

SYSTEMIDG01                    SYSTEMIDG01                    SYSI01     ORCL:SYSI01

                                                              DISK07     ORCL:DISK07




9 rows selected.




SQL>

REPLACE DISK

Since the old disk was lost (by a hardware failure or something similar), it is impossible to put it online again. A new disk was attached to the server (DISK07 in this example), and this is the one that will be added to the diskgroup.

So, we just need to execute the REPLACE DISK command:

SQL> alter diskgroup DATA

  2  REPLACE DISK DISK02 with 'ORCL:DISK07'

  3  power 2;




Diskgroup altered.




SQL>

The command is simple: we replace the failed disk with the new disk path. It is also possible to replace more than one disk at the same time, and to specify the power of the rebalance.
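
As a minimal sketch of how such a statement could be assembled for one or more disks, the Python helper below builds the SQL text. The function name build_replace_disk is hypothetical, the second disk and path in the example (DISK04, ORCL:DISK08) are invented for illustration, and the comma-separated form for several disks is an assumption based on the SQL reference syntax; check the documentation for your version before using it. The helper only concatenates strings and does no validation.

```python
def build_replace_disk(diskgroup, replacements, power=None):
    """Build an ALTER DISKGROUP ... REPLACE DISK statement.

    replacements is a list of (asm_disk_name, new_disk_path) pairs.
    """
    if not replacements:
        raise ValueError("at least one disk replacement is required")
    pairs = ", ".join(f"{name} WITH '{path}'" for name, path in replacements)
    statement = f"ALTER DISKGROUP {diskgroup} REPLACE DISK {pairs}"
    if power is not None:
        statement += f" POWER {int(power)}"
    return statement + ";"


# The single-disk case used in this post.
print(build_replace_disk("DATA", [("DISK02", "ORCL:DISK07")], power=2))

# Two disks at once; DISK04 / ORCL:DISK08 are made-up names.
print(build_replace_disk("DATA", [("DISK02", "ORCL:DISK07"), ("DISK04", "ORCL:DISK08")], power=2))
```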

In the ASM alert log we can see a lot of messages about this replacement, but note the resync of the disk. The full output can be found here: ASM-ALERTLOG-Output-Replace-Disk.txt

Some points here:

2020-03-29T00:44:31.602826+01:00

SQL> alter diskgroup DATA

replace disk DISK02 with 'ORCL:DISK07'

power 2

2020-03-29T00:44:31.741335+01:00

NOTE: cache closing disk 1 of grp 3: (not open) DISK02

2020-03-29T00:44:31.742068+01:00

NOTE: GroupBlock outside rolling migration privileged region

2020-03-29T00:44:31.742968+01:00

NOTE: client +ASM1:+ASM:asmrec no longer has group 3 (DATA) mounted

2020-03-29T00:44:31.746444+01:00

NOTE: Found ORCL:DISK07 for disk DISK02

NOTE: initiating resync of disk group 3 disks

DISK02 (1)




NOTE: process _user20831_+asm1 (20831) initiating offline of disk 1.4042309907 (DISK02) with mask 0x7e in group 3 (DATA) without client assisting

2020-03-29T00:44:31.747191+01:00

NOTE: sending set offline flag message (2044364809) to 1 disk(s) in group 3

…

…

2020-03-29T00:44:34.558097+01:00

NOTE: PST update grp = 3 completed successfully

2020-03-29T00:44:34.559806+01:00

SUCCESS: alter diskgroup DATA

replace disk DISK02 with 'ORCL:DISK07'

power 2

2020-03-29T00:44:36.805979+01:00

NOTE: Attempting voting file refresh on diskgroup DATA

NOTE: Refresh completed on diskgroup DATA. No voting file found.

2020-03-29T00:44:36.820900+01:00

NOTE: starting rebalance of group 3/0xf99030d7 (DATA) at power 2

After that, we can see that the rebalance only runs the SYNC phase (the RESYNC pass below):

SQL> select * from gv$asm_operation;




   INST_ID GROUP_NUMBER OPERA PASS      STAT      POWER     ACTUAL      SOFAR   EST_WORK   EST_RATE EST_MINUTES ERROR_CODE     CON_ID

---------- ------------ ----- --------- ---- ---------- ---------- ---------- ---------- ---------- ----------- ---------- ----------

         1            3 REBAL COMPACT   WAIT          2          2          0          0          0           0                     0

         1            3 REBAL REBALANCE WAIT          2          2          0          0          0           0                     0

         1            3 REBAL REBUILD   WAIT          2          2          0          0          0           0                     0

         1            3 REBAL RESYNC    RUN           2          2        231       1350        513           2                     0




SQL>

SQL> /




   INST_ID GROUP_NUMBER OPERA PASS      STAT      POWER     ACTUAL      SOFAR   EST_WORK   EST_RATE EST_MINUTES ERROR_CODE     CON_ID

---------- ------------ ----- --------- ---- ---------- ---------- ---------- ---------- ---------- ----------- ---------- ----------

         1            3 REBAL COMPACT   WAIT          2          2          0          0          0           0                     0

         1            3 REBAL REBALANCE WAIT          2          2          0          0          0           0                     0

         1            3 REBAL REBUILD   WAIT          2          2          0          0          0           0                     0

         1            3 REBAL RESYNC    RUN           2          2        373       1350        822           1                     0




SQL>

SQL> /




   INST_ID GROUP_NUMBER OPERA PASS      STAT      POWER     ACTUAL      SOFAR   EST_WORK   EST_RATE EST_MINUTES ERROR_CODE     CON_ID

---------- ------------ ----- --------- ---- ---------- ---------- ---------- ---------- ---------- ----------- ---------- ----------

         1            3 REBAL COMPACT   REAP          2          2          0          0          0           0                     0

         1            3 REBAL REBALANCE DONE          2          2          0          0          0           0                     0

         1            3 REBAL REBUILD   DONE          2          2          0          0          0           0                     0

         1            3 REBAL RESYNC    DONE          2          2       1376       1350          0           0                     0




SQL> /




no rows selected




SQL>
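
To read these numbers more easily, here is a small Python sketch that turns SOFAR, EST_WORK, and EST_RATE into a percentage and a remaining-time estimate. The function name is hypothetical, and the formula (remaining AUs divided by AUs per minute) is my reading of the columns; it is consistent with the EST_MINUTES values in the samples above, but treat it as an approximation.

```python
def resync_progress(sofar: int, est_work: int, est_rate: int):
    """Return (percent_done, estimated_minutes_left) for one gv$asm_operation row.

    estimated_minutes_left is None when the rate is zero.
    """
    percent = 100.0 * sofar / est_work if est_work else 0.0
    remaining = max(est_work - sofar, 0)
    minutes_left = remaining / est_rate if est_rate else None
    return round(percent, 1), minutes_left


# Values from the first two RESYNC rows shown above.
print(resync_progress(231, 1350, 513))
print(resync_progress(373, 1350, 822))
```

For the first sample this gives about 17% done with a little over 2 minutes left, and for the second about 28% done with a little over 1 minute left, in line with the EST_MINUTES column (2 and 1).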

In the end, after the rebalance we have:

SQL> select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;




NAME                           FAILGROUP                      LABEL      PATH

------------------------------ ------------------------------ ---------- -----------

DISK01                         FAIL01                         DISK01     ORCL:DISK01

DISK02                         FAIL01                         DISK07     ORCL:DISK07

DISK03                         FAIL02                         DISK03     ORCL:DISK03

DISK04                         FAIL02                         DISK04     ORCL:DISK04

DISK05                         FAIL03                         DISK05     ORCL:DISK05

DISK06                         FAIL03                         DISK06     ORCL:DISK06

RECI01                         RECI01                         RECI01     ORCL:RECI01

SYSTEMIDG01                    SYSTEMIDG01                    SYSI01     ORCL:SYSI01




8 rows selected.




SQL>

An important detail is that the NAME of the disk does not change; it is impossible to change it using the REPLACE DISK command. As you can see above, the disk named DISK02 now has the label DISK07 (here this came from the ASMLib disk).

Known Issues

There is a known issue with REPLACE DISK in GI 18c and higher where the rebalance can take AGES to finish. This occurs because, when replacing more than one disk at a time, it executes the SYNC disk by disk. As an example, on one Exadata the replacement of a complete cell took more than 48 hours, while a DROP/ADD took just 12 hours for the same disks.

So, it is recommended to have the fixes for Bug 30582481 and Bug 31062010 applied. The detail is that patch 30582481 (Patch 30582481: ASM REPLACE DISK COMMAND EXECUTED ON ALL CELLDISKS OF A FAILGROUP, ASM RUNNING RESYNC ONE DISK AT A TIME) was withdrawn and replaced by bug/patch 31062010, which is not available yet (at the moment that I write this post, March 2020).

So, be careful when doing this on an engineered system or when you need to replace a lot of disks at the same time.

Some references for reading:

Test Case:: 12C ASM New feature (Doc ID 1571975.1)

Disclaimer: “The postings on this site are my own and don’t necessarily represent my actual employer positions, strategies or opinions. The information here was edited to be useful for general purpose, specific data and identifications were removed to allow reach the generic audience and to be useful for the community.”

ASM, Mount restricted force for recovery

Category: Database Author: Fernando Simon (Board Member) Date: 6 years ago Comments: 0

Surviving disk failures is crucial to avoid data corruption, but sometimes, even with redundancy at ASM, multiple failures can happen. Check in this post how to use the undocumented feature “mount restricted force for recovery” to resurrect a diskgroup and lose less data when multiple failures occur.

Diskgroup redundancy is a key factor for ASM resilience: it lets you survive disk failures and continue to run databases. I will not go deep into ASM disk redundancy here, but usually you can configure your diskgroup without redundancy (EXTERNAL), with double redundancy (NORMAL), with triple redundancy (HIGH), or even with quadruple redundancy (EXTENDED, for stretch clusters).

If you want to understand more about redundancy, there are a lot of articles at MOS and on the internet that provide useful information. One good one is this. The idea is simple: spread multiple copies across different disks. It gets even better if you group disks into failgroups, so that your data has multiple copies in separate places.

As an example, this is key for Exadata, where every storage cell is one independent failgroup and you can survive one entire cell failure (or two, depending on the redundancy of your diskgroup) without data loss. The same idea can be applied in a “normal” environment, where you can create one failgroup for the disks attached to controller A and another for the disks attached to controller B (so the failure of one storage controller does not affect all failgroups). In ASM, if you do not create failgroups, each disk is its own failgroup in diskgroups that have redundancy enabled.

The representation below is for Exadata, but it works as a general illustration. Basically, your data will be in at least two different failgroups:

Environment

In the example that I use here, I have one diskgroup called DATA, which has 7 (seven) disks, each one in its own failgroup. The redundancy for this diskgroup is NORMAL, which means that each block is copied to two failgroups. If two failures occur, I will probably have data loss/corruption. Look:

SQL> select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;




NAME                           FAILGROUP                      LABEL                           PATH

------------------------------ ------------------------------ ------------------------------- ------------------------------------------------------------

CELLI01                        CELLI01                        CELLI01                         ORCL:CELLI01

CELLI02                        CELLI02                        CELLI02                         ORCL:CELLI02

CELLI03                        CELLI03                        CELLI03                         ORCL:CELLI03

CELLI04                        CELLI04                        CELLI04                         ORCL:CELLI04

CELLI05                        CELLI05                        CELLI05                         ORCL:CELLI05

CELLI06                        CELLI06                        CELLI06                         ORCL:CELLI06

CELLI07                        CELLI07                        CELLI07                         ORCL:CELLI07

RECI01                         RECI01                         RECI01                          ORCL:RECI01

SYSTEMIDG01                    SYSTEMIDG01                    SYSI01                          ORCL:SYSI01




9 rows selected.




SQL>

The version of my GI is 19.6.0.0, but this can be used with 12.1.0.2 and newer versions (and works for 11.2.0.4 in some versions). On this server, I have three databases running: DBA19, DBB19, and DBC19.

So, with everything running correctly, the data from my databases will be spread across two failgroups (this is just an illustration, not an exact representation of where the blocks of my databases are):

Remember that NORMAL redundancy just needs two copies. So, some blocks from datafile 1 of DBA19, as an example, can be stored at CELLI01 and CELLI04. And if your database is small (and your failgroups are big), and you are lucky too, the entire database can be stored in just these two places. In case of a failure that involves just the CELLI02 and CELLI03 failgroups, your data (from DBA19) can be intact.

Understanding the failure

Unfortunately, failures (will) happen and can be multiple at the same time. In the diskgroup DATA above, after the second failure, your diskgroup will be dismounted instantly. Usually when this occurs, if you can’t recover the hardware error, you need to restore and recover a backup of your databases after recreating the diskgroup.

If you are lucky and the failures occur at the same time, you can (most of the time) bring back the failed disks and try to mount the diskgroup, because there is no difference between the failed disks/failgroups. But the problem occurs if you have one failure (like the CELLI03 failgroup disappearing) and after some time another failgroup fails (like CELLI07). The detail is that between the failures, the databases continued to run and to change data on disk. When this occurs, and your failgroup returns, there are differences.

Another point that is very important to understand is the time available to recover from the failure. If one disk/failgroup fails in ASM, the attributes disk_repair_time and failgroup_repair_time define the time that you have to repair the failure before the rebalance of data takes place. The first (disk_repair_time) is the time that you have to repair a disk in case of failure; if your failgroup has more than one disk, just the failed one is rebalanced. The second (failgroup_repair_time) is the time that you have to repair the failed failgroup (when it fails completely).

The interesting point here is that from the moment of failure until the end of this clock, you are susceptible to another failure. If it occurs (more failures than your mirror protection can absorb), you will lose the diskgroup. Another fact is that between the failures your databases continue to run, so if you bring back the first failed disk/failgroup, you need to sync it.

These “repair times” exist to give you time to fix/recover the failure and avoid the rebalance. Think about the architecture: usually the diskgroups with redundancy are big and protect big environments (think of one Exadata, as an example, where each disk can have 14TB and one cell can have up to 12 of them), and rebalancing this amount of data takes a lot of time. To avoid this, if your failed disk is replaced before this time, just a sync of the changed blocks is needed.

A “default configuration” has these values:

SQL> select dg.name,a.value,a.name

  2  from v$asm_diskgroup dg, v$asm_attribute a

  3  where dg.group_number=a.group_number

  4  and a.name like '%time'

  5  /




NAME                                     VALUE           NAME

---------------------------------------- --------------- ----------------------------------------

DATA                                     12.0h           disk_repair_time

DATA                                     24.0h           failgroup_repair_time

RECO                                     24.0h           failgroup_repair_time

RECO                                     12.0h           disk_repair_time

SYSTEMDG                                 24.0h           failgroup_repair_time

SYSTEMDG                                 12.0h           disk_repair_time




6 rows selected.




SQL>
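To make the clock concrete, here is a minimal sketch of the arithmetic behind these attributes. It assumes the value format shown in the output above (a number followed by h, or m for minutes) and a hypothetical failure time; the function names are illustrative and not part of any Oracle tool. It also ignores everything ASM does around the timer, so treat it as plain date arithmetic only.

```python
from datetime import datetime, timedelta


def parse_repair_time(value: str) -> timedelta:
    """Convert a repair-time attribute value such as '12.0h' or '30m' to a timedelta."""
    value = value.strip().lower()
    number, unit = float(value[:-1]), value[-1]
    if unit == "h":
        return timedelta(hours=number)
    if unit == "m":
        return timedelta(minutes=number)
    raise ValueError(f"unexpected unit in {value!r}")


def repair_deadline(failed_at: datetime, attribute_value: str) -> datetime:
    """Return the moment at which the repair window would run out."""
    return failed_at + parse_repair_time(attribute_value)


# Hypothetical failure time, chosen only for illustration.
failure = datetime(2020, 3, 22, 8, 0)
print(repair_deadline(failure, "12.0h"))  # 2020-03-22 20:00:00
print(repair_deadline(failure, "24.0h"))  # 2020-03-23 08:00:00
```

With the default values above, a disk that fails at 08:00 would have to be back before 20:00 of the same day to avoid the rebalance.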

But think of a scenario where more than one failure occurs: the first in CELLI01 at 08:00 am and the second in CELLI06 at 10:00 am. By then, the surviving disks hold two hours of newer block versions. If you fix the first failure (CELLI01), there is no guarantee that it has the last version of everything, and the normal mount will not work.

And it is here that mount restricted force for recovery comes in. It allows you to resurrect the diskgroup and helps you to restore fewer things. Think of the previous example: if the failures occur at CELLI01 and CELLI06, but your datafiles are in CELLI02 and CELLI07, you lose nothing. Or you restore just some tablespaces and not the whole database. So, there is more to gain than to lose.
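The idea that only part of the data is at risk can be sketched with a toy model. The extent map below is entirely hypothetical (real ASM extent placement is not exposed this way), and it assumes normal redundancy with two copies of each extent in different failgroups. An extent is considered at risk only when every copy lives in a failed failgroup.

```python
def extents_at_risk(extent_map, failed_failgroups):
    """Return the extents whose every mirror copy sits in a failed failgroup."""
    failed = set(failed_failgroups)
    return sorted(
        extent for extent, copies in extent_map.items() if set(copies) <= failed
    )


# Hypothetical layout: (file, extent) -> failgroups holding the two copies.
extent_map = {
    ("users", 0): ("CELLI01", "CELLI06"),
    ("users", 1): ("CELLI02", "CELLI07"),
    ("undo", 0): ("CELLI01", "CELLI03"),
    ("system", 0): ("CELLI06", "CELLI04"),
}

print(extents_at_risk(extent_map, {"CELLI01", "CELLI06"}))  # [('users', 0)]
```

In this toy layout, losing CELLI01 and CELLI06 puts only one extent at risk; every other extent still has a good copy in a surviving failgroup.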

Mount restricted force for recovery

Here, I will simulate failures of more than one disk and show how you can use mount restricted force for recovery. Please be careful and follow all the steps correctly to avoid mistakes and to understand what to do and what is happening.

So, here I have the DATA diskgroup, with normal redundancy and 7 (seven) failgroups. The DBA19, DBB19, and DBC19 databases are running.

So, as the first step, I will simulate a complete failure of the CELLI03 failgroup. In my environment, to allow more control, I have one iSCSI target for each failgroup (this allows me to disconnect them one by one if needed). The CELLI03 died:

[root@asmrec ~]# iscsiadm -m session

tcp: [11] 172.16.0.3:3260,1 iqn.2006-01.com.openfiler:tsn.d65b214fca9a (non-flash) CELLI04

tcp: [14] 172.16.0.3:3260,1 iqn.2006-01.com.openfiler:tsn.637b3bbfa86d (non-flash) CELLI07

tcp: [17] 172.16.0.3:3260,1 iqn.2006-01.com.openfiler:tsn.2f4cdb93107c (non-flash) CELLI05

tcp: [2] 172.16.0.3:3260,1 iqn.2006-01.com.openfiler:tsn.bb66b92348a7 (non-flash)  CELLI03

tcp: [20] 172.16.0.3:3260,1 iqn.2006-01.com.openfiler:tsn.57c0a000e316 (non-flash) (SYS)

tcp: [23] 172.16.0.3:3260,1 iqn.2006-01.com.openfiler:tsn.89ef4420ea4d (non-flash) CELLI06

tcp: [5] 172.16.0.3:3260,1 iqn.2006-01.com.openfiler:tsn.eff4683320e8 (non-flash)  CELLI01

tcp: [8] 172.16.0.3:3260,1 iqn.2006-01.com.openfiler:tsn.7d8f4c8f5012 (non-flash)  CELLI02

[root@asmrec ~]#

[root@asmrec ~]# iscsiadm -m node -T iqn.2006-01.com.openfiler:tsn.bb66b92348a7 -p 172.16.0.3:3260 -u

Logging out of session [sid: 2, target: iqn.2006-01.com.openfiler:tsn.bb66b92348a7, portal: 172.16.0.3,3260]

Logout of [sid: 2, target: iqn.2006-01.com.openfiler:tsn.bb66b92348a7, portal: 172.16.0.3,3260] successful.

[root@asmrec ~]#

And in the ASM alertlog we can see:

2020-03-22T17:14:11.589115+01:00

NOTE: process _user8100_+asm1 (8100) initiating offline of disk 9.4042310133 (CELLI03) with mask 0x7e in group 1 (DATA) with client assisting

NOTE: checking PST: grp = 1

2020-03-22T17:14:11.589394+01:00

GMON checking disk modes for group 1 at 127 for pid 40, osid 8100

2020-03-22T17:14:11.589584+01:00

NOTE: checking PST for grp 1 done.

NOTE: initiating PST update: grp 1 (DATA), dsk = 9/0xf0f0c1f5, mask = 0x6a, op = clear mandatory

2020-03-22T17:14:11.589746+01:00

GMON updating disk modes for group 1 at 128 for pid 40, osid 8100

cluster guid (e4db41a22bd95fc6bf79d2e2c93360c7) generated for PST Hbeat for instance 1

WARNING: Write Failed. group:1 disk:9 AU:1 offset:4190208 size:4096

path:ORCL:CELLI03

         incarnation:0xf0f0c1f5 synchronous result:'I/O error'

         subsys:/opt/oracle/extapi/64/asm/orcl/1/libasm.so krq:0x7f9182f72210 bufp:0x7f9182f78000 osderr1:0x3 osderr2:0x2e

         IO elapsed time: 0 usec Time waited on I/O: 0 usec

WARNING: found another non-responsive disk 9.4042310133 (CELLI03) that will be offlined

So, the failure occurred at 17:14. The full output can be found in ASM-ALERTLOG-Output-Failure-CELLI03.txt

And we can see that the label and path of the disk disappeared from ASM (but the disk was not deleted or dropped):

SQL> select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;




NAME                                     FAILGROUP                      LABEL                           PATH

---------------------------------------- ------------------------------ ------------------------------- ------------------------------------------------------------

CELLI01                                  CELLI01                        CELLI01                         ORCL:CELLI01

CELLI02                                  CELLI02                        CELLI02                         ORCL:CELLI02

CELLI03                                  CELLI03

CELLI04                                  CELLI04                        CELLI04                         ORCL:CELLI04

CELLI05                                  CELLI05                        CELLI05                         ORCL:CELLI05

CELLI06                                  CELLI06                        CELLI06                         ORCL:CELLI06

CELLI07                                  CELLI07                        CELLI07                         ORCL:CELLI07

RECI01                                   RECI01                         RECI01                          ORCL:RECI01

SYSTEMIDG01                              SYSTEMIDG01                    SYSI01                          ORCL:SYSI01




9 rows selected.




SQL>

At this point, ASM starts counting the 12-hour clock (as defined in my repair attributes). The failgroup was not dropped and the rebalance did not start because ASM is optimistic that you will fix the issue within this period.
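If you want to find the moment the clock started without reading the whole alertlog, a small script can pull out the offline messages. This is a minimal sketch that assumes the message layout seen in the alertlog excerpt from this environment (a timestamp line followed by NOTE lines); other versions may word the message differently, and the function name is only illustrative.

```python
import re

# Pattern based on the "initiating offline of disk" message from this environment.
OFFLINE = re.compile(
    r"initiating offline of disk (\d+)\.\d+ \((\w+)\) with mask \S+ in group (\d+) \((\w+)\)"
)
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\S+")


def offline_events(alert_log_text):
    """Return (timestamp, disk_name, group_name) for each offline message."""
    timestamp = None
    events = []
    for line in alert_log_text.splitlines():
        line = line.strip()
        if TIMESTAMP.fullmatch(line):
            timestamp = line  # remember the most recent timestamp line
            continue
        match = OFFLINE.search(line)
        if match:
            events.append((timestamp, match.group(2), match.group(4)))
    return events


sample = """2020-03-22T17:14:11.589115+01:00
NOTE: process _user8100_+asm1 (8100) initiating offline of disk 9.4042310133 (CELLI03) with mask 0x7e in group 1 (DATA) with client assisting
NOTE: checking PST: grp = 1
"""

print(offline_events(sample))
# [('2020-03-22T17:14:11.589115+01:00', 'CELLI03', 'DATA')]
```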

But after some time I had a second failure in the diskgroup:

Now in the ASM alertlog you can see that the diskgroup was dismounted (along with several other messages). Below is an excerpt from the alertlog. The full output (which I think deserves a look) is in ASM-ALERTLOG-Output-Failure-CELLI03-and-CELL01.txt

2020-03-22T17:18:39.699555+01:00

WARNING: Write Failed. group:1 disk:1 AU:1 offset:4190208 size:4096

path:ORCL:CELLI01

         incarnation:0xf0f0c1f3 asynchronous result:'I/O error'

         subsys:/opt/oracle/extapi/64/asm/orcl/1/libasm.so krq:0x7f9182f833d0 bufp:0x7f91836ef000 osderr1:0x3 osderr2:0x2e

         IO elapsed time: 0 usec Time waited on I/O: 0 usec

WARNING: Hbeat write to PST disk 1.4042310131 in group 1 failed. [2]

2020-03-22T17:18:39.704035+01:00

...

...

2020-03-22T17:18:39.746945+01:00

NOTE: cache closing disk 9 of grp 1: (not open) CELLI03

ERROR: disk 1 (CELLI01) in group 1 (DATA) cannot be offlined because all disks [1(CELLI01), 9(CELLI03)] with mirrored data would be offline.

2020-03-22T17:18:39.747462+01:00

ERROR: too many offline disks in PST (grp 1)

2020-03-22T17:18:39.759171+01:00

NOTE: cache dismounting (not clean) group 1/0xB48031B9 (DATA)

NOTE: messaging CKPT to quiesce pins Unix process pid: 12050, image: [email protected] (B001)

2020-03-22T17:18:39.761807+01:00

NOTE: halting all I/Os to diskgroup 1 (DATA)

2020-03-22T17:18:39.766289+01:00

NOTE: LGWR doing non-clean dismount of group 1 (DATA) thread 1

NOTE: LGWR sync ABA=23.3751 last written ABA 23.3751

...

...

2020-03-22T17:18:40.207406+01:00

SQL> alter diskgroup DATA dismount force /* ASM SERVER:3028300217 */

...

...

2020-03-22T17:18:40.841979+01:00

Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_8756.trc:

ORA-15130: diskgroup "DATA" is being dismounted

2020-03-22T17:18:40.853738+01:00

...

...

ERROR: disk 1 (CELLI01) in group 1 (DATA) cannot be offlined because all disks [1(CELLI01), 9(CELLI03)] with mirrored data would be offline.

2020-03-22T17:18:40.861939+01:00

ERROR: too many offline disks in PST (grp 1)

...

...

2020-03-22T17:18:43.214368+01:00

Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_8756.trc:

ORA-15130: diskgroup "DATA" is being dismounted

2020-03-22T17:18:43.214885+01:00

NOTE: client DBC19:DBC19:asmrec no longer has group 1 (DATA) mounted

2020-03-22T17:18:43.215492+01:00

NOTE: client DBB19:DBB19:asmrec no longer has group 1 (DATA) mounted

NOTE: cache deleting context for group DATA 1/0xb48031b9

...

...

2020-03-22T17:18:43.298551+01:00

SUCCESS: alter diskgroup DATA dismount force /* ASM SERVER:3028300217 */

SUCCESS: ASM-initiated MANDATORY DISMOUNT of group DATA

2020-03-22T17:18:43.352003+01:00

SQL> ALTER DISKGROUP DATA MOUNT  /* asm agent *//* {0:1:9} */

2020-03-22T17:18:43.372816+01:00

NOTE: cache registered group DATA 1/0xB44031BF

NOTE: cache began mount (first) of group DATA 1/0xB44031BF

NOTE: Assigning number (1,8) to disk (ORCL:CELLI02)

NOTE: Assigning number (1,0) to disk (ORCL:CELLI04)

NOTE: Assigning number (1,11) to disk (ORCL:CELLI05)

NOTE: Assigning number (1,3) to disk (ORCL:CELLI06)

NOTE: Assigning number (1,2) to disk (ORCL:CELLI07)

2020-03-22T17:18:43.514642+01:00

cluster guid (e4db41a22bd95fc6bf79d2e2c93360c7) generated for PST Hbeat for instance 1

2020-03-22T17:18:46.089517+01:00

NOTE: detected and added orphaned client id 0x10010

NOTE: detected and added orphaned client id 0x1000e

So, the second failure occurred at 17:18 and led to a forced dismount of the diskgroup. You can see messages like “NOTE: cache dismounting (not clean)”, “ERROR: too many offline disks in PST (grp 1)”, and even “ERROR: disk 1 (CELLI01) in group 1 (DATA) cannot be offlined because all disks [1(CELLI01), 9(CELLI03)] with mirrored data would be offline”.

So, probably some data was lost. And if you consider that data was changed in the databases during these 4 minutes, the mess is big. If you want to see the alertlog from the databases, check ASM-ALERTLOG-Output-From-Databases-Alertlog-at-Failure.txt
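The size of the gap between the two failures can be computed directly from the alertlog timestamps. This short sketch uses the two timestamps of the "Write Failed" and offline messages shown in the excerpts; the helper name is an assumption made for the example.

```python
from datetime import datetime, timedelta


def failure_gap(first_timestamp: str, second_timestamp: str) -> timedelta:
    """Return the time elapsed between two alertlog timestamps (ISO 8601 format)."""
    first = datetime.fromisoformat(first_timestamp)
    second = datetime.fromisoformat(second_timestamp)
    return second - first


gap = failure_gap(
    "2020-03-22T17:14:11.589115+01:00",  # CELLI03 taken offline
    "2020-03-22T17:18:39.699555+01:00",  # write failure on CELLI01
)
print(gap)  # 0:04:28.110440
```

This is the window during which the surviving copies received changes that the CELLI03 failgroup never saw.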

And now we have this at ASM:

SQL> select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;




NAME                                     FAILGROUP                      LABEL                           PATH

---------------------------------------- ------------------------------ ------------------------------- ------------------------------------------------------------

RECI01                                   RECI01                         RECI01                          ORCL:RECI01

SYSTEMIDG01                              SYSTEMIDG01                    SYSI01                          ORCL:SYSI01

                                                                        CELLI02                         ORCL:CELLI02

                                                                        CELLI04                         ORCL:CELLI04

                                                                        CELLI05                         ORCL:CELLI05

                                                                        CELLI06                         ORCL:CELLI06

                                                                        CELLI07                         ORCL:CELLI07




7 rows selected.




SQL>

And if we try to mount it, we receive an error due to the offline disks:

SQL> alter diskgroup data mount;

alter diskgroup data mount

*

ERROR at line 1:

ORA-15032: not all alterations performed

ORA-15040: diskgroup is incomplete

ORA-15042: ASM disk "9" is missing from group number "1"

ORA-15042: ASM disk "1" is missing from group number "1"


SQL>

Now comes the key decision. If you have important data that is worth the effort of trying to recover, you can continue. It is your decision, based on several details. Since the diskgroup is dismounted, the repair time is not counting, and you have days to work on the recovery. Sometimes one day stopped is better than several days recovering all databases from the last backup.

Imagine that you can bring back online the first failed failgroup (CELLI03), which is 4 minutes behind the rest of the data:

[root@asmrec ~]# iscsiadm -m node -T iqn.2006-01.com.openfiler:tsn.bb66b92348a7 -p 172.16.0.3:3260 -l

Logging in to [iface: default, target: iqn.2006-01.com.openfiler:tsn.bb66b92348a7, portal: 172.16.0.3,3260] (multiple)

Login to [iface: default, target: iqn.2006-01.com.openfiler:tsn.bb66b92348a7, portal: 172.16.0.3,3260] successful.

[root@asmrec ~]#

And if you try to mount it normally you will receive an error (the alertlog output for this attempt can be seen in ASM-ALERTLOG-Output-Mout-With-One-Disk-Online):

SQL> alter diskgroup data mount;

alter diskgroup data mount

*

ERROR at line 1:

ORA-15032: not all alterations performed

ORA-15017: diskgroup "DATA" cannot be mounted

ORA-15066: offlining disk "1" in group "DATA" may result in a data loss

SQL>

So, now we can try the mount restricted force for recovery:

SQL> alter diskgroup data mount restricted force for recovery;




Diskgroup altered.




SQL>

The ASM alertlog (full output in ASM-ALERTLOG-Output-Mout-Restricted-Force-For-Recovery.txt) reports messages related to the diskgroup cache and to disks that need to be checked. And now we are like this:

SQL> select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;




NAME                                     FAILGROUP                      LABEL                           PATH

---------------------------------------- ------------------------------ ------------------------------- ------------------------------------------------------------

CELLI01                                  CELLI01

CELLI02                                  CELLI02                        CELLI02                         ORCL:CELLI02

CELLI03                                  CELLI03

CELLI04                                  CELLI04                        CELLI04                         ORCL:CELLI04

CELLI05                                  CELLI05                        CELLI05                         ORCL:CELLI05

CELLI06                                  CELLI06                        CELLI06                         ORCL:CELLI06

CELLI07                                  CELLI07                        CELLI07                         ORCL:CELLI07

RECI01                                   RECI01                         RECI01                          ORCL:RECI01

SYSTEMIDG01                              SYSTEMIDG01                    SYSI01                          ORCL:SYSI01

                                                                        CELLI03                         ORCL:CELLI03




10 rows selected.




SQL>

The next step is to bring online the failgroup that came back:

SQL> alter diskgroup data online disks in failgroup CELLI03;




Diskgroup altered.




SQL>

Doing this, ASM will resync this failgroup (using its blocks as the last version) and bring the cache of this disk online. In the ASM alertlog you can see messages like these (full output in ASM-ALERTLOG-Output-Online-Restored-Failgroup):

2020-03-22T17:27:47.729003+01:00

SQL> alter diskgroup data online disks in failgroup CELLI03

2020-03-22T17:27:47.729551+01:00

NOTE: cache closing disk 1 of grp 1: (not open) CELLI01

2020-03-22T17:27:47.729640+01:00

NOTE: cache closing disk 9 of grp 1: (not open) CELLI03

2020-03-22T17:27:47.730398+01:00

NOTE: GroupBlock outside rolling migration privileged region

NOTE: initiating resync of disk group 1 disks

CELLI03 (9)




NOTE: process _user6891_+asm1 (6891) initiating offline of disk 9.4042310248 (CELLI03) with mask 0x7e in group 1 (DATA) without client assisting

2020-03-22T17:27:47.737580+01:00

...

...

2020-03-22T17:27:47.796524+01:00

NOTE: disk validation pending for 1 disk in group 1/0x1d7031d4 (DATA)

NOTE: Found ORCL:CELLI03 for disk CELLI03

NOTE: completed disk validation for 1/0x1d7031d4 (DATA)

2020-03-22T17:27:47.935467+01:00

...

...

2020-03-22T17:27:48.116572+01:00

NOTE: cache closing disk 1 of grp 1: (not open) CELLI01

NOTE: cache opening disk 9 of grp 1: CELLI03 label:CELLI03

2020-03-22T17:27:48.117158+01:00

SUCCESS: refreshed membership for 1/0x1d7031d4 (DATA)

2020-03-22T17:27:48.123545+01:00

NOTE: initiating PST update: grp 1 (DATA), dsk = 9/0x0, mask = 0x5d, op = assign mandatory

...

...

2020-03-22T17:27:48.142068+01:00

NOTE: PST update grp = 1 completed successfully

2020-03-22T17:27:48.143197+01:00

SUCCESS: alter diskgroup data online disks in failgroup CELLI03

2020-03-22T17:27:48.577277+01:00

NOTE: Attempting voting file refresh on diskgroup DATA

NOTE: Refresh completed on diskgroup DATA. No voting file found.

...

...

2020-03-22T17:27:48.643277+01:00

NOTE: Starting resync using Staleness Registry and ATE scan for group 1

2020-03-22T17:27:48.696075+01:00

NOTE: Starting resync using Staleness Registry and ATE scan for group 1

NOTE: header on disk 9 advanced to format #2 using fcn 0.0

2020-03-22T17:27:49.725837+01:00

WARNING: Started Drop Disk Timeout for Disk 1 (CELLI01) in group 1 with a value 43200

2020-03-22T17:27:57.301042+01:00

...

2020-03-22T17:27:59.687480+01:00

NOTE: cache closing disk 1 of grp 1: (not open) CELLI01

NOTE: reset timers for disk: 9

NOTE: completed online of disk group 1 disks

CELLI03 (9)




2020-03-22T17:27:59.714674+01:00

ERROR: ORA-15421 thrown in ARBA for group number 1

2020-03-22T17:27:59.714805+01:00

Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_arba_8786.trc:

ORA-15421: Rebalance is not supported when the disk group is mounted for recovery.

2020-03-22T17:27:59.715047+01:00

NOTE: stopping process ARB0

NOTE: stopping process ARBA

2020-03-22T17:28:00.652115+01:00

NOTE: rebalance interrupted for group 1/0x1d7031d4 (DATA)

And now we have at ASM:

SQL> select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;




NAME                                     FAILGROUP                      LABEL                           PATH

---------------------------------------- ------------------------------ ------------------------------- ------------------------------------------------------------

CELLI01                                  CELLI01

CELLI02                                  CELLI02                        CELLI02                         ORCL:CELLI02

CELLI03                                  CELLI03                        CELLI03                         ORCL:CELLI03

CELLI04                                  CELLI04                        CELLI04                         ORCL:CELLI04

CELLI05                                  CELLI05                        CELLI05                         ORCL:CELLI05

CELLI06                                  CELLI06                        CELLI06                         ORCL:CELLI06

CELLI07                                  CELLI07                        CELLI07                         ORCL:CELLI07

RECI01                                   RECI01                         RECI01                          ORCL:RECI01

SYSTEMIDG01                              SYSTEMIDG01                    SYSI01                          ORCL:SYSI01




9 rows selected.




SQL>

And the rebalance does not continue because it is not allowed while the diskgroup is mounted in restricted mode for recovery:

SQL> select * from gv$asm_operation;




   INST_ID GROUP_NUMBER OPERA PASS      STAT      POWER     ACTUAL      SOFAR   EST_WORK   EST_RATE EST_MINUTES ERROR_CODE                                       CON_ID

---------- ------------ ----- --------- ---- ---------- ---------- ---------- ---------- ---------- ----------- -------------------------------------------- ----------

         1            1 REBAL COMPACT   WAIT          1                                                                                                               0

         1            1 REBAL REBALANCE ERRS          1                                                         ORA-15421                                             0

         1            1 REBAL REBUILD   WAIT          1                                                                                                               0

         1            1 REBAL RESYNC    WAIT          1                                                                                                               0




SQL>

But since the failgroup came online “in a forced way”, the old cache (from CELLI01) needs to be cleaned. And since it does not hold the last version, maybe some files were corrupted. To check this, you can look at the ARB process trace files in the ASM trace directory:

[root@asmrec trace]# ls -lFhtr *arb*

...

...

-rw-r----- 1 grid oinstall 6.4K Mar 22 17:10 +ASM1_arb0_3210.trm

-rw-r----- 1 grid oinstall  44K Mar 22 17:10 +ASM1_arb0_3210.trc

-rw-r----- 1 grid oinstall  984 Mar 22 17:27 +ASM1_arb0_8788.trm

-rw-r----- 1 grid oinstall 2.1K Mar 22 17:27 +ASM1_arb0_8788.trc

-rw-r----- 1 grid oinstall  882 Mar 22 17:27 +ASM1_arba_8786.trm

-rw-r----- 1 grid oinstall 1.2K Mar 22 17:27 +ASM1_arba_8786.trc

[root@asmrec trace]#

And looking at one of the most recent, we can see that some extents (those that do not exist in the recovered failgroup, or whose cached copy is not the last one) were filled with dummy (BADFDA7A) data:

[root@asmrec trace]# cat +ASM1_arb0_8788.trc

Trace file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_arb0_8788.trc

Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production

Version 19.6.0.0.0

Build label:    RDBMS_19.3.0.0.0DBRU_LINUX.X64_190417

ORACLE_HOME:    /u01/app/19.0.0.0/grid

System name:    Linux

Node name:      asmrec.oralocal

Release:        4.14.35-1902.10.8.el7uek.x86_64

Version:        #2 SMP Thu Feb 6 11:02:28 PST 2020

Machine:        x86_64

Instance name: +ASM1

Redo thread mounted by this instance: 0 <none>

Oracle process number: 40

Unix process pid: 8788, image: [email protected] (ARB0)







*** 2020-03-22T17:27:59.044949+01:00

*** SESSION ID:(402.55837) 2020-03-22T17:27:59.044969+01:00

*** CLIENT ID:() 2020-03-22T17:27:59.044975+01:00

*** SERVICE NAME:() 2020-03-22T17:27:59.044980+01:00

*** MODULE NAME:() 2020-03-22T17:27:59.044985+01:00

*** ACTION NAME:() 2020-03-22T17:27:59.044989+01:00

*** CLIENT DRIVER:() 2020-03-22T17:27:59.044994+01:00




 WARNING: group 1, file 266, extent 22: filling extent with BADFDA7A during recovery

 WARNING: group 1, file 266, extent 22: filling extent with BADFDA7A during recovery

 WARNING: group 1, file 266, extent 22: filling extent with BADFDA7A during recovery

 WARNING: group 1, file 266, extent 22: filling extent with BADFDA7A during recovery

 WARNING: group 1, file 258, extent 7: filling extent with BADFDA7A during recovery

 WARNING: group 1, file 258, extent 7: filling extent with BADFDA7A during recovery

 WARNING: group 1, file 258, extent 7: filling extent with BADFDA7A during recovery

 WARNING: group 1, file 258, extent 7: filling extent with BADFDA7A during recovery




*** 2020-03-22T17:27:59.680119+01:00

NOTE: initiating PST update: grp 1 (DATA), dsk = 9/0x0, mask = 0x7f, op = assign mandatory

kfdp_updateDsk(): callcnt 195 grp 1

PST verChk -0: req, id=266369333, grp=1, requested=91 at 03/22/2020 17:27:59

NOTE: PST update grp = 1 completed successfully

NOTE: kfdsFilter_freeDskSrSlice for Filter 0x7fbaf6238d38

NOTE: kfdsFilter_clearDskSlice for Filter 0x7fbaf6238d38 (all:TRUE)

NOTE: completed online of disk group 1 disks

CELLI03 (9)

[root@asmrec trace]#

And as you can imagine, this will lead to files that need to be restored from backup. But note that it is just some data, not everything. Remember from the beginning of the post that this depends on how your data is distributed across the ASM failgroups. If you are lucky, only a little data is impacted. This depends on a lot of factors, such as the time that the failgroup was offline, the size of the failgroup, the activity of your databases, and many others. But the gains can be good and make it worth the effort.
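To get a quick list of what was touched, the warnings in the ARB trace can be summarized per ASM file. This is a minimal sketch that assumes the warning text looks exactly like the lines in the trace above; the function name is illustrative, and the sample reuses lines from that trace.

```python
import re
from collections import defaultdict

# Pattern based on the warning lines seen in the ARB0 trace file.
FILLED = re.compile(
    r"group (\d+), file (\d+), extent (\d+): filling extent with BADFDA7A"
)


def filled_extents(trace_text):
    """Map (group, file) to the sorted list of distinct extents filled with dummy data."""
    found = defaultdict(set)
    for match in FILLED.finditer(trace_text):
        group, file_number, extent = map(int, match.groups())
        found[(group, file_number)].add(extent)
    return {key: sorted(extents) for key, extents in found.items()}


sample = """
 WARNING: group 1, file 266, extent 22: filling extent with BADFDA7A during recovery
 WARNING: group 1, file 266, extent 22: filling extent with BADFDA7A during recovery
 WARNING: group 1, file 258, extent 7: filling extent with BADFDA7A during recovery
"""

print(filled_extents(sample))  # {(1, 266): [22], (1, 258): [7]}
```

Repeated warnings for the same extent collapse into one entry, so the result shows two affected ASM files in group 1 with one extent each.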

After that, we can normally dismount the diskgroup:

SQL> alter diskgroup data dismount;




Diskgroup altered.




SQL>

And mount it again:

SQL> alter diskgroup data mount;




Diskgroup altered.




SQL>

Since the diskgroup is now mounted in a clean way, you can continue with the rebalance:

SQL> select * from gv$asm_operation;




   INST_ID GROUP_NUMBER OPERA PASS      STAT      POWER     ACTUAL      SOFAR   EST_WORK   EST_RATE EST_MINUTES ERROR_CODE                                       CON_ID

---------- ------------ ----- --------- ---- ---------- ---------- ---------- ---------- ---------- ----------- -------------------------------------------- ----------

         1            1 REBAL COMPACT   WAIT          1                                                                                                               0

         1            1 REBAL REBALANCE ERRS          1                                                         ORA-15421                                             0

         1            1 REBAL REBUILD   WAIT          1                                                                                                               0

         1            1 REBAL RESYNC    WAIT          1                                                                                                               0




SQL> alter diskgroup DATA rebalance;




Diskgroup altered.




SQL>

The state on the ASM side is:

SQL> select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;




NAME                                     FAILGROUP                      LABEL                           PATH

---------------------------------------- ------------------------------ ------------------------------- ------------------------------------------------------------

CELLI01                                  CELLI01

CELLI02                                  CELLI02                        CELLI02                         ORCL:CELLI02

CELLI03                                  CELLI03                        CELLI03                         ORCL:CELLI03

CELLI04                                  CELLI04                        CELLI04                         ORCL:CELLI04

CELLI05                                  CELLI05                        CELLI05                         ORCL:CELLI05

CELLI06                                  CELLI06                        CELLI06                         ORCL:CELLI06

CELLI07                                  CELLI07                        CELLI07                         ORCL:CELLI07

RECI01                                   RECI01                         RECI01                          ORCL:RECI01

SYSTEMIDG01                              SYSTEMIDG01                    SYSI01                          ORCL:SYSI01




9 rows selected.




SQL>

As you can see, CELLI01 was not removed yet (I will talk about it later). But the activities can continue and the databases can be checked.

Database side

On the database side we need to check what we lost and what we need to recover. Since I am using a cluster, GI tried to start the databases (and as you can see, two came up):

[oracle@asmrec ~]$ ps -ef |grep smon

root      8254     1  2 13:53 ?        00:04:40 /u01/app/19.0.0.0/grid/bin/osysmond.bin

grid      8750     1  0 13:54 ?        00:00:00 asm_smon_+ASM1

oracle   11589     1  0 17:31 ?        00:00:00 ora_smon_DBB19

oracle   11751     1  0 17:31 ?        00:00:00 ora_smon_DBA19

oracle   18817 29146  0 17:44 pts/9    00:00:00 grep --color=auto smon

[oracle@asmrec ~]$

DBA19

The first that I checked was DBA19. I used RMAN to VALIDATE DATABASE:

[oracle@asmrec ~]$ rman target /




Recovery Manager: Release 19.0.0.0.0 - Production on Sun Mar 22 17:45:21 2020

Version 19.6.0.0.0




Copyright (c) 1982, 2019, Oracle and/or its affiliates.  All rights reserved.




connected to target database: DBA19 (DBID=828667324)




RMAN> validate database;




Starting validate at 22-MAR-20

using target database control file instead of recovery catalog

allocated channel: ORA_DISK_1

channel ORA_DISK_1: SID=260 device type=DISK

channel ORA_DISK_1: starting validation of datafile

channel ORA_DISK_1: specifying datafile(s) for validation

input datafile file number=00001 name=+DATA/DBA19/DATAFILE/system.256.1035153873

input datafile file number=00004 name=+DATA/DBA19/DATAFILE/undotbs1.258.1035153973

input datafile file number=00003 name=+DATA/DBA19/DATAFILE/sysaux.257.1035153927

input datafile file number=00007 name=+DATA/DBA19/DATAFILE/users.259.1035153975

channel ORA_DISK_1: validation complete, elapsed time: 00:03:45

List of Datafiles

=================

File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

1    OK     0              17722        117766          5042446

  File Name: +DATA/DBA19/DATAFILE/system.256.1035153873

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              79105

  Index      0              13210

  Other      0              7723




File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

3    OK     0              19445        67862           5042695

  File Name: +DATA/DBA19/DATAFILE/sysaux.257.1035153927

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              7988

  Index      0              5531

  Other      0              34876




File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

4    FAILED 1              49           83247           5042695

  File Name: +DATA/DBA19/DATAFILE/undotbs1.258.1035153973

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              0

  Index      0              0

  Other      511            83151




File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

7    OK     0              93           641             4941613

  File Name: +DATA/DBA19/DATAFILE/users.259.1035153975

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              65

  Index      0              15

  Other      0              467




validate found one or more corrupt blocks

See trace file /u01/app/oracle/diag/rdbms/dba19/DBA19/trace/DBA19_ora_19219.trc for details

channel ORA_DISK_1: starting validation of datafile

channel ORA_DISK_1: specifying datafile(s) for validation

including current control file for validation

including current SPFILE in backup set

channel ORA_DISK_1: validation complete, elapsed time: 00:00:01

List of Control File and SPFILE

===============================

File Type    Status Blocks Failing Blocks Examined

------------ ------ -------------- ---------------

SPFILE       OK     0              2

Control File OK     0              646

Finished validate at 22-MAR-20




RMAN> shutdown abort;




Oracle instance shut down




RMAN> startup mount;




connected to target database (not started)

Oracle instance started

database mounted




Total System Global Area    1610610776 bytes




Fixed Size                     8910936 bytes

Variable Size                859832320 bytes

Database Buffers             734003200 bytes

Redo Buffers                   7864320 bytes




RMAN> run{

2> restore datafile 4;

3> recover datafile 4;

4> }




Starting restore at 22-MAR-20

allocated channel: ORA_DISK_1

channel ORA_DISK_1: SID=249 device type=DISK




channel ORA_DISK_1: starting datafile backup set restore

channel ORA_DISK_1: specifying datafile(s) to restore from backup set

channel ORA_DISK_1: restoring datafile 00004 to +DATA/DBA19/DATAFILE/undotbs1.258.1035153973

channel ORA_DISK_1: reading from backup piece /tmp/9puro5qr_1_1

channel ORA_DISK_1: piece handle=/tmp/9puro5qr_1_1 tag=BKP-DB-INC0

channel ORA_DISK_1: restored backup piece 1

channel ORA_DISK_1: restore complete, elapsed time: 00:00:45

Finished restore at 22-MAR-20




Starting recover at 22-MAR-20

using channel ORA_DISK_1




starting media recovery

media recovery complete, elapsed time: 00:00:02




Finished recover at 22-MAR-20




RMAN> alter database open;




Statement processed




RMAN> exit







Recovery Manager complete.

[oracle@asmrec ~]$

[oracle@asmrec ~]$

As you can see, datafile 4 was reported as FAILED and needed to be recovered. Luckily, the redo was not affected, so the open was OK. Since the failed file belonged to UNDO, I used shutdown abort: a shutdown immediate can take an eternity, and with undo down nothing was happening inside the database anyway.

As you saw, just one datafile was corrupted. Of course, with big databases and big failgroups, more files will be corrupted. But it is a shot that can be worth taking.
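When several datafiles fail, typing the run block by hand gets tedious. Below is a minimal Python sketch that only builds the text of a run block like the one used above; the function name rman_restore_block is hypothetical, and the script does not connect to RMAN or check that backups exist.

```python
from typing import List


def rman_restore_block(file_numbers: List[int]) -> str:
    """Build an RMAN run block that restores all listed datafiles, then recovers them."""
    if not file_numbers:
        raise ValueError("at least one datafile number is required")
    lines = ["run {"]
    for number in file_numbers:
        lines.append(f"  restore datafile {int(number)};")
    for number in file_numbers:
        lines.append(f"  recover datafile {int(number)};")
    lines.append("}")
    return "\n".join(lines)


# Datafile 4 is the one that failed for DBA19.
print(rman_restore_block([4]))
```

The generated text can be reviewed and then pasted into an RMAN session while the database is mounted.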

DBB19

The second database was DBB19, and I used the same approach, VALIDATE DATABASE:

[oracle@asmrec ~]$ export ORACLE_SID=DBB19

[oracle@asmrec ~]$

[oracle@asmrec ~]$ rman target /




Recovery Manager: Release 19.0.0.0.0 - Production on Sun Mar 22 17:55:20 2020

Version 19.6.0.0.0




Copyright (c) 1982, 2019, Oracle and/or its affiliates.  All rights reserved.




PL/SQL package SYS.DBMS_BACKUP_RESTORE version 19.03.00.00 in TARGET database is not current

PL/SQL package SYS.DBMS_RCVMAN version 19.03.00.00 in TARGET database is not current

connected to target database: DBB19 (DBID=1336872427)




RMAN> validate database;




Starting validate at 22-MAR-20

using target database control file instead of recovery catalog

allocated channel: ORA_DISK_1

channel ORA_DISK_1: SID=374 device type=DISK

channel ORA_DISK_1: starting validation of datafile

channel ORA_DISK_1: specifying datafile(s) for validation

input datafile file number=00001 name=+DATA/DBB19/DATAFILE/system.261.1035154051

input datafile file number=00003 name=+DATA/DBB19/DATAFILE/sysaux.265.1035154177

input datafile file number=00004 name=+DATA/DBB19/DATAFILE/undotbs1.267.1035154235

input datafile file number=00007 name=+DATA/DBB19/DATAFILE/users.268.1035154241

channel ORA_DISK_1: validation complete, elapsed time: 00:00:35

List of Datafiles

=================

File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

1    OK     0              16763        116487          3861452

  File Name: +DATA/DBB19/DATAFILE/system.261.1035154051

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              78871

  Index      0              13010

  Other      0              7836




File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

3    OK     0              19307        62758           3861452

  File Name: +DATA/DBB19/DATAFILE/sysaux.265.1035154177

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              7459

  Index      0              5158

  Other      0              30796




File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

4    OK     0              1            35847           3652497

  File Name: +DATA/DBB19/DATAFILE/undotbs1.267.1035154235

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              0

  Index      0              0

  Other      0              35839




File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

7    OK     0              85           641             3759202

  File Name: +DATA/DBB19/DATAFILE/users.268.1035154241

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              70

  Index      0              15

  Other      0              470




channel ORA_DISK_1: starting validation of datafile

channel ORA_DISK_1: specifying datafile(s) for validation

including current control file for validation

including current SPFILE in backup set

channel ORA_DISK_1: validation complete, elapsed time: 00:00:01

List of Control File and SPFILE

===============================

File Type    Status Blocks Failing Blocks Examined

------------ ------ -------------- ---------------

SPFILE       OK     0              2

Control File OK     0              646

Finished validate at 22-MAR-20




RMAN> VALIDATE CHECK LOGICAL DATABASE;




Starting validate at 22-MAR-20

using channel ORA_DISK_1

channel ORA_DISK_1: starting validation of datafile

channel ORA_DISK_1: specifying datafile(s) for validation

input datafile file number=00001 name=+DATA/DBB19/DATAFILE/system.261.1035154051

input datafile file number=00003 name=+DATA/DBB19/DATAFILE/sysaux.265.1035154177

input datafile file number=00004 name=+DATA/DBB19/DATAFILE/undotbs1.267.1035154235

input datafile file number=00007 name=+DATA/DBB19/DATAFILE/users.268.1035154241

channel ORA_DISK_1: validation complete, elapsed time: 00:00:35

List of Datafiles

=================

File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

1    OK     0              16763        116487          3861452

  File Name: +DATA/DBB19/DATAFILE/system.261.1035154051

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              78871

  Index      0              13010

  Other      0              7836




File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

3    OK     0              19307        62758           3861452

  File Name: +DATA/DBB19/DATAFILE/sysaux.265.1035154177

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              7459

  Index      0              5158

  Other      0              30796




File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

4    OK     0              1            35847           3652497

  File Name: +DATA/DBB19/DATAFILE/undotbs1.267.1035154235

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              0

  Index      0              0

  Other      0              35839




File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

7    OK     0              85           641             3759202

  File Name: +DATA/DBB19/DATAFILE/users.268.1035154241

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              70

  Index      0              15

  Other      0              470




channel ORA_DISK_1: starting validation of datafile

channel ORA_DISK_1: specifying datafile(s) for validation

including current control file for validation

including current SPFILE in backup set

channel ORA_DISK_1: validation complete, elapsed time: 00:00:01

List of Control File and SPFILE

===============================

File Type    Status Blocks Failing Blocks Examined

------------ ------ -------------- ---------------

SPFILE       OK     0              2

Control File OK     0              646

Finished validate at 22-MAR-20




RMAN> exit







Recovery Manager complete.

[oracle@asmrec ~]$

[oracle@asmrec ~]$

[oracle@asmrec ~]$

As you saw, there were no failures for DBB19. I still ran VALIDATE CHECK LOGICAL DATABASE because, since the plain validate returned no failed files, I also wanted to check the blocks logically.

DBC19

The same was done for the last database, but this time datafile 3 failed:

[oracle@asmrec ~]$ export ORACLE_SID=DBC19

[oracle@asmrec ~]$ rman target /




Recovery Manager: Release 19.0.0.0.0 - Production on Sun Mar 22 18:01:33 2020

Version 19.6.0.0.0




Copyright (c) 1982, 2019, Oracle and/or its affiliates.  All rights reserved.




connected to target database (not started)




RMAN> startup mount;




Oracle instance started

database mounted




Total System Global Area    1610610776 bytes




Fixed Size                     8910936 bytes

Variable Size                864026624 bytes

Database Buffers             729808896 bytes

Redo Buffers                   7864320 bytes




RMAN> validate database;




Starting validate at 22-MAR-20

using target database control file instead of recovery catalog

allocated channel: ORA_DISK_1

channel ORA_DISK_1: SID=134 device type=DISK

channel ORA_DISK_1: starting validation of datafile

channel ORA_DISK_1: specifying datafile(s) for validation

input datafile file number=00001 name=+DATA/DBC19/DATAFILE/system.262.1035154053

input datafile file number=00004 name=+DATA/DBC19/DATAFILE/undotbs1.270.1035154249

input datafile file number=00003 name=+DATA/DBC19/DATAFILE/sysaux.266.1035154181

input datafile file number=00007 name=+DATA/DBC19/DATAFILE/users.271.1035154253

channel ORA_DISK_1: validation complete, elapsed time: 00:03:15

List of Datafiles

=================

File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

1    OK     0              17777        117764          4188744

  File Name: +DATA/DBC19/DATAFILE/system.262.1035154053

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              79161

  Index      0              13182

  Other      0              7640




File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

3    FAILED 1              19272        66585           4289434

  File Name: +DATA/DBC19/DATAFILE/sysaux.266.1035154181

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              7311

  Index      0              4878

  Other      511            35099




File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

4    OK     0              1            84522           4188748

  File Name: +DATA/DBC19/DATAFILE/undotbs1.270.1035154249

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              0

  Index      0              0

  Other      0              84479




File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

7    OK     0              93           641             3717377

  File Name: +DATA/DBC19/DATAFILE/users.271.1035154253

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              65

  Index      0              15

  Other      0              467




validate found one or more corrupt blocks

See trace file /u01/app/oracle/diag/rdbms/dbc19/DBC19/trace/DBC19_ora_22091.trc for details

channel ORA_DISK_1: starting validation of datafile

channel ORA_DISK_1: specifying datafile(s) for validation

including current control file for validation

including current SPFILE in backup set

channel ORA_DISK_1: validation complete, elapsed time: 00:00:01

List of Control File and SPFILE

===============================

File Type    Status Blocks Failing Blocks Examined

------------ ------ -------------- ---------------

SPFILE       OK     0              2

Control File OK     0              646

Finished validate at 22-MAR-20




RMAN> run{

2> restore datafile 3;

3> recover datafile 3;

4> }




Starting restore at 22-MAR-20

using channel ORA_DISK_1




channel ORA_DISK_1: starting datafile backup set restore

channel ORA_DISK_1: specifying datafile(s) to restore from backup set

channel ORA_DISK_1: restoring datafile 00003 to +DATA/DBC19/DATAFILE/sysaux.266.1035154181

channel ORA_DISK_1: reading from backup piece /tmp/0buro5rh_1_1

channel ORA_DISK_1: piece handle=/tmp/0buro5rh_1_1 tag=BKP-DB-INC0

channel ORA_DISK_1: restored backup piece 1

channel ORA_DISK_1: restore complete, elapsed time: 00:00:45

Finished restore at 22-MAR-20




Starting recover at 22-MAR-20

using channel ORA_DISK_1




starting media recovery




archived log for thread 1 with sequence 25 is already on disk as file +RECO/DBC19/ARCHIVELOG/2020_03_22/thread_1_seq_25.323.1035737103

archived log for thread 1 with sequence 26 is already on disk as file +RECO/DBC19/ARCHIVELOG/2020_03_22/thread_1_seq_26.329.1035739907

archived log for thread 1 with sequence 27 is already on disk as file +RECO/DBC19/ARCHIVELOG/2020_03_22/thread_1_seq_27.332.1035741283

archived log file name=+RECO/DBC19/ARCHIVELOG/2020_03_22/thread_1_seq_25.323.1035737103 thread=1 sequence=25

media recovery complete, elapsed time: 00:00:03

Finished recover at 22-MAR-20




RMAN> alter database open;




Statement processed




RMAN> exit







Recovery Manager complete.

[oracle@asmrec ~]$
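With many datafiles, scanning the validate report by eye is error-prone. The following is a rough Python sketch that picks the FAILED rows out of the "List of Datafiles" section; it assumes the report was saved as text in the same column layout shown above, and the names ROW and failed_datafiles are made up for this illustration.

```python
import re
from typing import List

# A datafile summary row looks like:
# "3    FAILED 1              19272        66585           4289434"
ROW = re.compile(r"^\s*(\d+)\s+(OK|FAILED)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def failed_datafiles(report: str) -> List[int]:
    """Return the file numbers whose Status column is FAILED."""
    failed = []
    for line in report.splitlines():
        match = ROW.match(line)
        if match and match.group(2) == "FAILED":
            failed.append(int(match.group(1)))
    return failed


# Rows copied from the DBC19 report above.
sample = """\
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
1    OK     0              17777        117764          4188744
3    FAILED 1              19272        66585           4289434
4    OK     0              1            84522           4188748
"""
print(failed_datafiles(sample))
```

The per-block-type rows (Data, Index, Other) and the SPFILE/control file rows do not start with a file number, so the pattern skips them.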

Dropping failgroup

If the fix for the failed failgroup takes too long, the failgroup will be dropped automatically. But we can also do this manually with FORCE (note that without FORCE it fails):

SQL> ALTER DISKGROUP data DROP DISKS IN FAILGROUP CELLI01;

ALTER DISKGROUP data DROP DISKS IN FAILGROUP CELLI01

*

ERROR at line 1:

ORA-15032: not all alterations performed

ORA-15084: ASM disk "CELLI01" is offline and cannot be dropped.







SQL>

SQL> ALTER DISKGROUP data DROP DISKS IN FAILGROUP CELLI01 FORCE;




Diskgroup altered.




SQL>

After the rebalance finishes, the dropped disk no longer appears in the disk list:

SQL> select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;




NAME                                     FAILGROUP                      LABEL                           PATH

---------------------------------------- ------------------------------ ------------------------------- ------------------------------------------------------------

_DROPPED_0001_DATA                       CELLI01

CELLI02                                  CELLI02                        CELLI02                         ORCL:CELLI02

CELLI03                                  CELLI03                        CELLI03                         ORCL:CELLI03

CELLI04                                  CELLI04                        CELLI04                         ORCL:CELLI04

CELLI05                                  CELLI05                        CELLI05                         ORCL:CELLI05

CELLI06                                  CELLI06                        CELLI06                         ORCL:CELLI06

CELLI07                                  CELLI07                        CELLI07                         ORCL:CELLI07

RECI01                                   RECI01                         RECI01                          ORCL:RECI01

SYSTEMIDG01                              SYSTEMIDG01                    SYSI01                          ORCL:SYSI01




9 rows selected.




SQL> select * from gv$asm_operation;




   INST_ID GROUP_NUMBER OPERA PASS      STAT      POWER     ACTUAL      SOFAR   EST_WORK   EST_RATE EST_MINUTES ERROR_CODE                                       CON_ID

---------- ------------ ----- --------- ---- ---------- ---------- ---------- ---------- ---------- ----------- -------------------------------------------- ----------

         1            1 REBAL COMPACT   WAIT          1          1          0          0          0           0                                                       0

         1            1 REBAL REBALANCE WAIT          1          1          0          0          0           0                                                       0

         1            1 REBAL REBUILD   RUN           1          1        292        642        666           0                                                       0

         1            1 REBAL RESYNC    DONE          1          1          0          0          0           0                                                       0




SQL> select * from gv$asm_operation;




no rows selected




SQL> select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;




NAME                                     FAILGROUP                      LABEL                           PATH

---------------------------------------- ------------------------------ ------------------------------- ------------------------------------------------------------

CELLI02                                  CELLI02                        CELLI02                         ORCL:CELLI02

CELLI03                                  CELLI03                        CELLI03                         ORCL:CELLI03

CELLI04                                  CELLI04                        CELLI04                         ORCL:CELLI04

CELLI05                                  CELLI05                        CELLI05                         ORCL:CELLI05

CELLI06                                  CELLI06                        CELLI06                         ORCL:CELLI06

CELLI07                                  CELLI07                        CELLI07                         ORCL:CELLI07

RECI01                                   RECI01                         RECI01                          ORCL:RECI01

SYSTEMIDG01                              SYSTEMIDG01                    SYSI01                          ORCL:SYSI01




8 rows selected.




SQL>
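To follow the rebalance without rereading the whole gv$asm_operation output, the progress can be derived from the SOFAR and EST_WORK columns. This is a small Python sketch under the assumption that the rows were already fetched into dictionaries keyed by column name; the helper names are hypothetical and no database connection is made here.

```python
from typing import Dict, List, Optional


def running_pass_progress(rows: List[Dict]) -> Optional[float]:
    """Percent complete of the pass in RUN state, or None if nothing is running."""
    for row in rows:
        if row["STAT"] == "RUN" and row["EST_WORK"] > 0:
            return 100.0 * row["SOFAR"] / row["EST_WORK"]
    return None


def rebalance_finished(rows: List[Dict]) -> bool:
    """The operation is over when the view returns no rows."""
    return len(rows) == 0


# Values taken from the first gv$asm_operation query above.
rows = [
    {"PASS": "COMPACT", "STAT": "WAIT", "SOFAR": 0, "EST_WORK": 0},
    {"PASS": "REBALANCE", "STAT": "WAIT", "SOFAR": 0, "EST_WORK": 0},
    {"PASS": "REBUILD", "STAT": "RUN", "SOFAR": 292, "EST_WORK": 642},
    {"PASS": "RESYNC", "STAT": "DONE", "SOFAR": 0, "EST_WORK": 0},
]
print(running_pass_progress(rows), rebalance_finished(rows))
```

The figure only covers the pass currently running, so it is a rough indicator and not an estimate for the whole operation.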

The steps for MOUNT RESTRICTED FORCE FOR RECOVERY

To summarize, the steps needed are (in order):

Put the failed disk/failgroup back online
Execute alter diskgroup <DG> mount restricted force for recovery
Bring the failgroup online with alter diskgroup <DG> online disks in failgroup <FG>
Cleanly dismount the diskgroup with alter diskgroup <DG> dismount
Cleanly mount the diskgroup with alter diskgroup <DG> mount
Check the databases for failures and recover them
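The four ASM statements in the steps above can be sketched as a small Python helper that fills in the diskgroup and failgroup names. The function name recovery_statements and the name check are my own additions for illustration; the first and last steps (fixing the disk and checking the databases) remain manual, and the script only produces text, it does not run anything against ASM.

```python
import re
from typing import List

# Conservative check so that only plain identifiers end up in the statements.
_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_$#]*")


def recovery_statements(diskgroup: str, failgroup: str) -> List[str]:
    """Return the ASM statements of the procedure, in the order they must run."""
    for name in (diskgroup, failgroup):
        if not _NAME.fullmatch(name):
            raise ValueError(f"unexpected name: {name!r}")
    return [
        f"alter diskgroup {diskgroup} mount restricted force for recovery;",
        f"alter diskgroup {diskgroup} online disks in failgroup {failgroup};",
        f"alter diskgroup {diskgroup} dismount;",
        f"alter diskgroup {diskgroup} mount;",
    ]


for statement in recovery_statements("DATA", "CELLI01"):
    print(statement)
```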

Undocumented feature

So, the question is: why is it undocumented? I don’t have the answer, but I can point out some reasons. For me, the most important one is that it is not a full, clean return: you still need to restore and recover from backup, and you may lose a lot of data.

Of course, this example is a controlled scenario: I have just a few databases, and my failgroup has just one disk. In real life, the problem will be worse. More diskgroups can be affected, such as RECO/REDO/FRA. You will probably lose some redo logs and archived logs too, so a clean recovery may not be possible. You may even need to recover the OCR and voting disks of the cluster.

This is where correct architecture design matters: if you need more protection on the ASM side, you can use HIGH redundancy to survive at least two failures without interruption. This is the reason why SYSTEMDG (or the OCR/voting disk diskgroup) is a high redundancy diskgroup on Exadata.
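As a rule of thumb, the redundancy level tells how many failgroups can be lost with the diskgroup still guaranteed to stay available. The Python sketch below encodes that simplified rule; it assumes enough failgroups exist for the chosen level and ignores quorum and voting file requirements, and the names used are hypothetical.

```python
# Failgroup losses each ASM redundancy level is designed to tolerate
# (simplified: EXTERNAL has no ASM mirroring, NORMAL is two-way, HIGH is three-way).
TOLERATED_FAILGROUP_LOSSES = {"EXTERNAL": 0, "NORMAL": 1, "HIGH": 2}


def guaranteed_to_survive(redundancy: str, failed_failgroups: int) -> bool:
    """True when the number of lost failgroups is within the level's tolerance."""
    return failed_failgroups <= TOLERATED_FAILGROUP_LOSSES[redundancy.upper()]


print(guaranteed_to_survive("NORMAL", 2))
print(guaranteed_to_survive("HIGH", 2))
```

Losing more failgroups than the tolerance does not always destroy every file, which is exactly the gap the procedure in this post tries to exploit.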

Outages and failures can occur in different layers of your environment. But storage/disk failures are catastrophic for databases because they can lead to data corruption, and you then need backups to recover. They can occur in any environment, from ordinary storage to Exadata. In 2016 I had one on an old Exadata V2, used just for DEV databases, where two storage cells crashed (one hour apart), and I needed this procedure to save some files and reduce the downtime by avoiding a restore of everything (more than 10TB).

So it is good to know this kind of procedure because it can save time. But it is your decision whether to use it; check whether it is worth it in your case.


Oracle E-Business Suite 12i Architecture

Category: E-Business Suite Author: Andre Luiz Dutra Ontalba (Board Member) Date: 6 years ago Comments: 0


Introduction

An Oracle E-Business Suite Release 12i system uses components from many Oracle products.

The files for these products are stored in a directory structure; below we will see some of the top-level directories on the database and application servers.

Depending on how you choose to install Oracle E-Business Suite, these product directories can be located on a single machine (one node) or on multiple machines (multiple nodes).

Oracle E-Business Suite directory structure

The db/apps_st/data directory is located on the database node machine, and contains the system tablespaces, redo log files, data tablespaces, index tablespaces, and database files
The db/tech_st/11.1.0 directory is located on the database node machine, and contains the ORACLE_HOME for the Oracle 11g database
The apps/apps_st/appl (APPL_TOP) directory contains the product directories and files for Oracle E-Business Suite
The apps/apps_st/comn (COMMON_TOP) directory contains Java classes, HTML pages, and other files and directories used by multiple products
The apps/tech_st/10.1.2 directory contains the ORACLE_HOME used for the Oracle E-Business Suite technology stack tools components
The apps/tech_st/10.1.3 directory contains the ORACLE_HOME used for the Oracle E-Business Suite technology stack Java components
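To make the layout easier to picture, here is a small Python sketch that resolves the directories listed above against a base path. The base /u01/oracle/VIS is only an example value, and LAYOUT and resolve_layout are names invented for this illustration, not something shipped with Oracle E-Business Suite.

```python
from pathlib import PurePosixPath
from typing import Dict

# Relative directory -> role, as described in the list above.
LAYOUT = {
    "db/apps_st/data": "database files",
    "db/tech_st/11.1.0": "database ORACLE_HOME",
    "apps/apps_st/appl": "APPL_TOP",
    "apps/apps_st/comn": "COMMON_TOP",
    "apps/tech_st/10.1.2": "tools ORACLE_HOME",
    "apps/tech_st/10.1.3": "Java ORACLE_HOME",
}


def resolve_layout(base: str) -> Dict[str, str]:
    """Map each role to its full path under the given base directory."""
    root = PurePosixPath(base)
    return {role: str(root / relative) for relative, role in LAYOUT.items()}


for role, path in resolve_layout("/u01/oracle/VIS").items():
    print(f"{role:22} {path}")
```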

Oracle E-Business Suite Environment

Oracle E-Business Suite makes extensive use of environment settings to find executable programs and other files essential to Oracle E-Business Suite.

These environment settings are set when you install Oracle E-Business Suite.

Many of the settings are set by information that you provide when running Rapid Install, while others have the same values across all installations.

Environment settings and their associated values are stored in environment files, which have a .env suffix on UNIX or a .cmd suffix on Windows.

The environment files and settings will be covered in the continuation of this blog series.

Instance Home ($INST_TOP)

Oracle E-Business Suite Release 12i introduces the concept of a top-level directory for an Oracle E-Business Suite instance. This directory is referred to as the Instance Home and is denoted by the environment variable $INST_TOP.

Using an Instance Home also provides the ability to share application and technology stack code between multiple instances, for example a development instance and a test instance.

The basic structure of the Instance Home is:

<APPS_BASE>/inst/apps/<context_name>, where APPS_BASE (which does not have or need a corresponding environment variable) is the top level of the Oracle E-Business Suite installation and <context_name> is the highest level at which the applications context exists.

For example, the $INST_TOP setting can be /applmgr/inst/apps/test, where test is the context name.
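The pattern behind that example can be written as a tiny Python sketch: the installation base, then inst/apps, then the context name. The function name inst_top is hypothetical and the values are the ones from the example above.

```python
from pathlib import PurePosixPath


def inst_top(apps_base: str, context_name: str) -> str:
    """Compose the Instance Home path from the installation base and context name."""
    return str(PurePosixPath(apps_base) / "inst" / "apps" / context_name)


print(inst_top("/applmgr", "test"))
```

Two instances sharing the same code would differ only in the context name, for example a dev context next to the test one.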

All configuration files created by AutoConfig are stored under the Instance Home. This makes it easy to use a shared application filesystem.

Instance Home structure

Read-Only File Systems

One of the main benefits of moving to the new Instance Home model is that, because AutoConfig no longer writes to the APPL_TOP or ORACLE_HOME directories, both can be placed on read-only file systems if necessary.

In earlier versions of Oracle E-Business Suite, the adpatch utility wrote to $APPL_TOP/admin when applying patches. Under the new model, $APPL_CONFIG_HOME/admin is used instead.

$APPL_CONFIG_HOME will equate to a value such as /u01/oracle/VIS/apps/apps_st/appl.

Important: In a shared file system environment, Oracle recommends that $INST_TOP be located on a local disk and not on a shared resource such as NFS, because of potential issues with storing log files on shared resources.

Log Files

The advantage of employing the concept of an Instance Home is that log files can be stored centrally for an instance, and are therefore easier to manage.

The following diagram shows the directory structure used for log files in Release 12i, with some of the subdirectories used to sort the log files:

Log Files Structure

DATA Directory

The db/apps_st/data directory stores the different file types used by the Oracle database. Rapid Install places the system, data, and index files in directories under multiple file system mount points on the database machine. You can specify these mount points during installation.

COMN Directory

The apps/apps_st/comn (COMMON_TOP) directory contains files used by different Oracle E-Business Suite products, which can also be used with third-party products.

COMMON_TOP structure

ADMIN Directory

The admin directory, under the COMMON_TOP directory, is the default location for the concurrent manager log and output directories. When the concurrent managers run Oracle E-Business Suite reports, they write the log files and temporary files to the log subdirectory of the admin directory, and the output files to the out subdirectory of the admin directory.

You can change the location the concurrent managers write these files to, so that, for example, the log and output files are written to directories in each <PROD>_TOP directory.

This may be more desirable in terms of disk space management, or the need to avoid a possible performance bottleneck on a system that has a high concurrent processing throughput.

The install subdirectory of the admin directory contains scripts and log files used by Rapid Install. The scripts subdirectory of admin contains scripts used to start and stop services such as listeners and concurrent managers.

HTML Directory

The OA_HTML environment setting points to the html directory. The Oracle E-Business Suite HTML-based sign-on screen and Oracle HTML-based Applications HTML files are installed here.

The html directory also contains other files used by the HTML-based products, such as JavaServer Page (JSP) files, Java scripts, XML files, and style sheets.

Typically, the path will look like: <diskresource>/applmgr/apps/apps_st/comn/webapps/oacore/html.

Important: The META-INF and WEB-INF subdirectories were introduced in Release 12 to meet J2EE specifications.

JAVA Directory

Release 12 introduces some significant changes to the locations in which the various types of Java files are stored. Rapid Install installs all Oracle E-Business Suite class files in the COMMON_TOP/classes directory, pointed to by the $JAVA_TOP environment variable.

Zip and jar files are installed in the $COMMON_TOP/java/lib directory, pointed to by the $AF_JLIB environment variable (introduced with Release 12).

The top-level Java directory, $COMMON_TOP/java, is pointed to by the $JAVA_BASE environment variable.
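The relationship between the three Java-related variables can be sketched in Python as follows. The COMMON_TOP value used here is just an example path and java_locations is a made-up helper name; in a real installation these values come from the environment files.

```python
from pathlib import PurePosixPath
from typing import Dict


def java_locations(common_top: str) -> Dict[str, str]:
    """Derive the Java-related locations described above from COMMON_TOP."""
    root = PurePosixPath(common_top)
    return {
        "JAVA_TOP": str(root / "classes"),        # class files
        "JAVA_BASE": str(root / "java"),          # top-level Java directory
        "AF_JLIB": str(root / "java" / "lib"),    # zip and jar files
    }


for name, path in java_locations("/u01/oracle/VIS/apps/apps_st/comn").items():
    print(f"{name}={path}")
```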

UTIL Directory

The util directory contains the third-party utilities licensed for use with Oracle E-Business Suite. These include, for example, the Java Runtime Environment (JRE), the Java Development Kit (JDK), and the Zip utility.

The APPL directory

Oracle E-Business Suite files are stored in the <dbname>APPL directory, which is generally known as the APPL_TOP directory.

APPL_TOP Directory Structure

The APPL_TOP directory contains:

The core technology files and directories.
The product files and directories (for all products).
The main Oracle E-Business Suite environment file, called <CONTEXT_NAME>.env on UNIX, and <CONTEXT_NAME>.cmd on Windows.
The consolidated environment file, called APPS<CONTEXT_NAME>.env on UNIX, and APPS<CONTEXT_NAME>.cmd on Windows.
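The naming rule for the two environment files can be sketched in Python like this. The context name VIS_host1 is a hypothetical value chosen for the example, and environment_files is not an Oracle utility, just an illustration of the rule.

```python
from typing import Dict

SUFFIX = {"unix": ".env", "windows": ".cmd"}


def environment_files(context_name: str, platform: str = "unix") -> Dict[str, str]:
    """Return the main and consolidated environment file names for a context."""
    suffix = SUFFIX[platform]
    return {
        "main": f"{context_name}{suffix}",
        "consolidated": f"APPS{context_name}{suffix}",
    }


print(environment_files("VIS_host1"))
print(environment_files("VIS_host1", platform="windows"))
```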

Warning: Regardless of registration status, all Oracle E-Business Suite products are installed in the database and the file system. Do not attempt to remove files for any unregistered products.

Rapid Install installs a new APPL_TOP directory when you upgrade. Rapid Install does not delete any existing product files from earlier releases, but unloads new product files into a new apps/apps_st/appl directory tree.

Each APPL_TOP directory is associated with a single Oracle E-Business Suite database. If you install both a Vision Demo system and a test system, Rapid Install will lay down two file systems, one for each of these Oracle E-Business Suite systems.

Product Directory

Each product has its own subdirectory under APPL_TOP. Subdirectories are named according to the product’s default abbreviation, such as gl for Oracle General Ledger.

Within each product directory is a subdirectory named after the base Oracle E-Business Suite release number, such as 12.0.0 for the initial release of 12. This directory contains the subdirectories for the product files.

<PROD>_TOP Directory

The <APPL_TOP>/<prod>/<version> path is known as the product top directory (<PROD>_TOP), and its value is stored in the <PROD>_TOP environment variable.

For example, if APPL_TOP=/u01/prodapps, then the value contained in the AD_TOP environment variable is /u01/prodapps/ad/12.0.0, and the AD_TOP environment variable points to the <APPL_TOP>/ad/12.0.0 directory.

For the same APPL_TOP, the value of AU_TOP is /u01/prodapps/au/12.0.0, and the AU_TOP environment variable points to the <APPL_TOP>/au/12.0.0 directory. The same principle applies to all directories, apart from the admin directory.
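The rule for the product top directories is simple enough to express in a few lines of Python. This is only a sketch of the naming convention with a hypothetical function name; the real values are set by the environment files, and the admin directory does not follow this pattern.

```python
from pathlib import PurePosixPath


def prod_top(appl_top: str, product: str, version: str = "12.0.0") -> str:
    """Compose <APPL_TOP>/<prod>/<version>, the value of <PROD>_TOP."""
    return str(PurePosixPath(appl_top) / product.lower() / version)


for product in ("ad", "au", "gl"):
    print(f"{product.upper()}_TOP={prod_top('/u01/prodapps', product)}")
```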

Product Files

Each <PROD>_TOP directory, such as <APPL_TOP>/gl/12.0.0, contains subdirectories for product files. Product files include forms, report files, and files used to update the database.

To display the data entry forms for Oracle General Ledger, for example, Oracle E-Business Suite accesses files in the forms subdirectory under the 12.0.0 directory.

APPL_TOP Directory Structure

Within each <PROD>_TOP directory, the product’s files are grouped into subdirectories according to file type and function. The next figure expands the inset to show the full directory structure for gl.

GL Structure Detail

The following table summarizes product subdirectories and the types of files each one may contain.

Subdirectory Name	Description
admin	The <PROD>_TOP/admin directory contains product-specific files used to upgrade each product. This is in distinction to the <APPL_TOP>/admin directory, which contains upgrade-related files for all products.
driver	Contains driver files (.drv files) used in upgrading.
import	Contains DataMerge files used to upgrade seed data.
odf	Contains object description files (.odf files) used to create tables and other database objects.
sql	Contains SQL*Plus scripts used to upgrade data, and .pkh, .pkb, and .pls scripts to create PL/SQL stored procedures.
bin	Contains concurrent programs, other C language programs and shell scripts for each product.
forms	Contains Oracle Forms generated runtime (.fmx) files (Oracle Forms form files).
help	Contains the online help source files. Within this directory are subdirectories for each language installed.
html	Contains HTML, JavaScript, and JavaServer Page (JSP) files, primarily for HTML-based Applications products.
include	Contains C language header (.h) files that may be linked with files in the lib directory. Not all products require this directory.
java	Contains JAR files (Java Archive files) and Java dependency files. Copies of JAR files are also located in the $AF_JLIB directory.
lib	Contains files used to relink concurrent programs with the Oracle server libraries. These files include: · object files (.o on UNIX, .OBJ on Windows), with compiled code specific to one of the product’s programs. · library files (.a on UNIX, various including .DLL on Windows), with compiled code common to the product’s programs. · make files (.mk) that specify how to create executables from object files and library files.
log and out	Contains output files for concurrent programs: · .mgr (master log file for concurrent manager) · .req (log file for a concurrent process) Note that log and out subdirectories under a product directory are not used if you choose to set up a common directory for log and output files (FND_TOP is the only exception to this).
media	Contains .gif files used in the display of text and graphics on the desktop client.
mesg	Concurrent programs also print messages in the log and output files. This directory contains the .msb files (binary message files used at runtime), and language-specific message files (such as a US.msb file for American English and a D.msb file for German). The files contain the forms messages that are displayed at the bottom of the screen or in popup windows.
patch	Updates to the data or data model utilize this directory to store the patch files.
reports	Contains Oracle Reports platform-specific rdf binary report files for each product. Reports for each language are stored in subdirectories of the reports directory.
resource	Contains .pll files (PL/SQL library files for Oracle Forms), which, like the plsql directory files, are later copied to AU_TOP.
sql	Contains .sql files (SQL*Plus scripts) for concurrent processing.

Language Files

When you install Oracle E-Business Suite in a language other than American English, each product tree includes directories that use the relevant NLS language code. These directories hold translated data, forms, and message files.

For example, the language directory named D designates German. The data loader files in the D subdirectory of the admin directory contain the German translation of the product seed data.

The US subdirectory in the forms directory contains Oracle Forms forms in American English. The D directory in the forms directory contains the same forms, translated into German. However, the mesg directory contains message files in both American English and German.
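To see which languages are installed for a given product, you can list the language subdirectories under its forms directory. The sketch below assumes the layout described above (one subdirectory per NLS language code, such as US or D); list_form_languages is a hypothetical helper, and the product top directory is passed in by the caller.

```shell
#!/bin/sh
# Hypothetical helper: list the language subdirectories under a
# product's forms directory.
# $1 = product top directory (for example, the value of a <PROD>_TOP variable)
list_form_languages() {
    forms_dir=$1/forms
    for d in "$forms_dir"/*/; do
        # Skip the unexpanded pattern when no subdirectory exists.
        [ -d "$d" ] || continue
        basename "$d"
    done
}
```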

Core Technology Directory

The admin, ad, au, and fnd directories are the core technology directories.

The admin directory

This directory and its subdirectories contain files and scripts used by AD during the upgrade and maintenance processes.

These files and scripts include:

The adovars.env environment file, which defines certain file and directory locations
Scripts run during the upgrade
<SID>/log and <SID>/out directories for upgrade log and output files, respectively
A <SID>/restart directory where AD programs create restart files

The ad (Applications DBA) directory

This directory and its subdirectories contain installation and maintenance utilities, including:

AD Administration (adadmin)
AutoConfig (adconfig.sh)

The au (Applications Utilities) directory

This directory and its subdirectories contain product files that are consolidated in one place for optimal processing. These files include:

PL/SQL libraries used by Oracle Forms, in the resource subdirectory
Oracle Forms source files, in the forms subdirectory
A copy of all Java files used when regenerating the desktop client JAR files, in the java subdirectory
Certain reports needed by products such as Discoverer, in the reports subdirectory

The fnd (Application Object Library) directory

This directory and its subdirectories contain the scripts and programs that are used as the foundation for all Oracle E-Business Suite products to build data dictionaries, forms and C object libraries.

Conclusions

Release 12 of E-Business Suite introduced several changes relative to 11i, including support for an SOA architecture, which aligns the product with newer integration technologies and process workflows.

In the next articles we will continue exploring the structure of E-Business Suite and cover best practices for installing the product efficiently.

References

Oracle E-Business Suite Release 12 Technology Stack Documentation Roadmap [ID 380482.1]

See you next time!

Andre Ontalba


Oracle E-Business Suite 12i Architecture (Part 2)

Shared Application System

This was later modified to allow APPL_TOP to be shared between different machines, and then to allow sharing of the entire application layer file system.

Continuing this strategy, Rapid Install for version 12 creates a system that shares not only the APPL_TOP and COMMON_TOP file systems, but also the application tier technology stack.

Rapid Install uses this configuration as the default for nodes running the same operating system.

These files form the application layer of the file system, and can be shared between application nodes in multiple layers (as long as they are running the same operating system).

Note: A shared file system configuration is not currently supported on application tier server nodes running Windows.

With a shared application layer file system, all files in this application layer are installed on a single shared disk that is mounted from each application layer node.

Any application layer node can be used to provide standard services, such as Forms, Web, or Concurrent Processing services.

Shared application layer – Example

As well as reducing the required disk space, a shared application layer has several other benefits:

Most administration, patching, and maintenance tasks need to be performed only once, rather than on every application layer node.

Changes made to the shared file system are immediately accessible on all nodes in the application layer.

Distributes task processing to run in parallel on multiple nodes (Distributed AD).

Reduces general disk requirements.

Add application nodes more easily.

Sharing the Application Layer File System Between Instances

The ability to share the application layer file system was further extended in version 12.0.4, which introduced the option to share an Oracle E-Business Suite Release 12 installation with another database instance.

An application file system layer installed and configured in this way can be used to access two (or more) database instances.

The restrictions to this are:

All database instances must be at the same patch level.

Only the application layer can be shared; the database cannot be shared.

Note: For more information on features, options and implementation steps, see document 384248.1, Sharing the Application Tier File System in Oracle E-Business Suite Release 12.

Environment Setting

Rapid Install creates environment files to configure the Oracle database, Oracle’s technology suite, Oracle HTTP Server, and Oracle E-Business Suite environments.

The location of these environment files is shown in the following table:

Filename

Location

Path

Environment

On UNIX, Oracle E-Business Suite includes a consolidated file called APPS<CONTEXT_NAME>.env, which establishes both the Oracle E-Business Suite and Oracle technology stack environments.

When you install Oracle E-Business Suite, Rapid Install creates this script in the APPL_TOP directory. Many of the parameters are specified during the installation process.
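As a rough illustration of how the consolidated file is typically used on UNIX, the sketch below builds the file path from APPL_TOP and the context name and sources it into the current shell. The helper name load_apps_env is hypothetical (it is not an Oracle-supplied utility), and the only assumption about the file itself is the naming pattern and location described above.

```shell
#!/bin/sh
# Hypothetical helper: load the consolidated EBS environment file on UNIX.
# $1 = APPL_TOP directory, $2 = context name.
# Assumes the file is named APPS<CONTEXT_NAME>.env and lives in APPL_TOP.
load_apps_env() {
    appl_top=$1
    context_name=$2
    env_file="$appl_top/APPS${context_name}.env"

    if [ ! -r "$env_file" ]; then
        echo "environment file not found: $env_file" >&2
        return 1
    fi

    # Source (rather than execute) the file so that its settings
    # persist in the current shell session.
    . "$env_file"
}
```

The file is sourced rather than executed because a child process could not change the environment of the calling shell.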

On Windows, the equivalent consolidated environment file is called %APPL_TOP%\envshell<CONTEXT_NAME>.cmd.

Running it creates a command window with the necessary environment settings for Oracle E-Business Suite. All subsequent operations on APPL_TOP (for example, running adadmin or adpatch) must be performed from this window.

The following table lists the key environment settings in APPS<CONTEXT_NAME>.env.

Parameter

Description

Most temporary files are written to the location specified by the APPLTMP environment setting, which is set by Rapid Install.

Oracle E-Business Suite products also create temporary PL/SQL output files used in concurrent processing. These files are written to a location on the database server node specified by the APPLPTMP environment setting.

The APPLPTMP directory must be the same directory specified by the utl_file_dir parameter in your database initialization file.

Rapid Install sets both APPLPTMP and the utl_file_dir parameter to the same default directory.
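The requirement that APPLPTMP match utl_file_dir can be checked with a small script. The sketch below is only an illustration: is_in_utl_file_dir is a hypothetical helper, and it assumes you have already obtained the utl_file_dir value as a comma-separated string (for example, copied from the database initialization file). It does not query the database itself.

```shell
#!/bin/sh
# Hypothetical helper: check whether a directory appears in a
# comma-separated utl_file_dir value.
# $1 = directory to look for (for example, the value of APPLPTMP)
# $2 = utl_file_dir value, assumed to be a comma-separated list
is_in_utl_file_dir() {
    target=$1
    list=$2
    old_ifs=$IFS
    IFS=,
    for entry in $list; do
        # Strip leading and trailing spaces around each entry.
        entry=$(printf '%s' "$entry" | sed 's/^ *//; s/ *$//')
        if [ "$entry" = "$target" ]; then
            IFS=$old_ifs
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}
```

A call such as is_in_utl_file_dir "$APPLPTMP" "/usr/tmp, /u01/app/temp" returns success only when the APPLPTMP directory is one of the listed entries; the paths shown are placeholders.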

In a multi-node system, the directory defined by APPLPTMP does not need to exist on application layer servers.

Note: Temporary files placed in the utl_file_dir directory can be protected from unauthorized access by ensuring that only the Oracle database account has read and write access to this directory.
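One simple way to apply this restriction on UNIX is to remove all group and other permissions from the directory. The sketch below assumes the directory is already owned by the Oracle database account and is run by that account; restrict_to_owner is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: restrict a directory to its owner only.
# Assumes the directory is already owned by the Oracle database account.
restrict_to_owner() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "not a directory: $dir" >&2
        return 1
    fi
    # Mode 700: the owner keeps read, write, and search access
    # (search is needed to open files inside a directory);
    # group and others get no access at all.
    chmod 700 "$dir"
}
```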

Other environment files

Several other key environment files are used in an Oracle E-Business Suite system.

The adovars.env file

The adovars.env file, located in $APPL_TOP/admin, specifies the location of several files, such as Java files, HTML files and the JRE (Java Runtime Environment) files.

It is called from the main application environment file, <CONTEXT_NAME>.env. The adovars.env file includes comments on the purpose and recommended configuration of each variable. In a Release 12 environment, adovars.env is maintained by AutoConfig, and should not be edited manually.

The adovars.env file includes the following parameters:

Parameter

Description

The adconfig.txt file

Note: adconfig.txt is created with the APPL_TOP file system, and shows the layers that have been configured on a particular node. It is distinct from the config.txt file configured by Rapid Install.

The fndenv.env file

The devenv.env file

This file defines the variables that allow you to link third-party software and your own custom-developed applications with Oracle E-Business Suite.

In version 12, this script is located in $FND_TOP/usrxit, and is automatically called by fndenv.env. This allows you to compile and link custom Oracle Forms user exits and concurrent programs with Oracle E-Business Suite.

In the next articles, we will continue exploring the structure of E-Business Suite and cover best practices for installing the product efficiently.

References

Oracle E-Business Suite Release 12 Technology Stack Documentation Roadmap [ID 380482.1]

I hope this helps you!

André Ontalba

ASM, REPLACE DISK Command

Basically, when the REPLACE command is used, the rebalance just copies/syncs the data from the surviving disk (the partner disk in the mirror). It is faster than the previous drop/add approach, which executes a complete rebalance of all AUs in the diskgroup, going through both the REBALANCE and SYNC phases.

The REPLACE DISK command is important for the disk swap process on Exadata (where you add the new 14TB disks), since it makes the rebalance of the diskgroup faster.

Below is an example of this behavior. Note that the AUs from DISK01 were SYNCED with the new disk:

Compare this with the previous DROP/ADD disk approach, where all AUs from all disks were rebalanced:

Current Environment and Simulating the Failure

In this post, to simulate and show how replace disk works, I have the DATA diskgroup with 6 disks (DISK01-06). DISK07 is not in use.

To simulate the error, I disconnected the disk from the operating system (since I used iSCSI, I just logged off the target for DISK02):

At the same moment, the ASM alert log reported the error and showed that the mirror was found on another disk (DISK06):

So, at this moment DISK02 will not be removed instantly, but only after the disk_repair_time expires:

If you want to check the full output from the ASM alert log, it is available in ASM-ALERTLOG-Output-Online-Disk-Error.txt

So, the current state of the diskgroup is:

REPLACE DISK

Since the old disk was lost (through a hardware failure or something similar), it is impossible to bring it online again. A new disk was attached to the server (DISK07 in this example), and it will be added to the diskgroup.

So, we just need to execute the REPLACE DISK command:

The command is simple: we replace the failed disk with the new disk path. It is also possible to replace more than one disk at the same time and to specify the power of the rebalance.
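For readers who want to see the general shape of such a statement, the sketch below assembles an ALTER DISKGROUP ... REPLACE DISK statement from its parts and prints it for review; it does not connect to ASM or run anything. This is an illustration only, not the exact command used in this post: build_replace_disk_sql is a hypothetical helper, and the disk path and power value in the usage note are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: print an ASM REPLACE DISK statement for review.
# $1 = diskgroup name, $2 = name of the failed disk,
# $3 = path of the new disk, $4 = rebalance power.
# Nothing is executed against ASM; the statement is only printed.
build_replace_disk_sql() {
    diskgroup=$1
    failed_disk=$2
    new_path=$3
    power=$4
    printf "ALTER DISKGROUP %s REPLACE DISK %s WITH '%s' POWER %s;\n" \
        "$diskgroup" "$failed_disk" "$new_path" "$power"
}
```

For example, build_replace_disk_sql DATA DISK02 /dev/example/DISK07 2 prints a statement that replaces DISK02 in the DATA diskgroup with the device at the given (placeholder) path, using rebalance power 2. The printed statement would then be run from a SQL*Plus session connected to the ASM instance.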

In the ASM alert log we can see a lot of messages about this replacement; note in particular the resync of the disk. The full output can be found in ASM-ALERTLOG-Output-Replace-Disk.txt

Some points here: