In the ASM alert log you can now see that the diskgroup was dismounted (along with several other messages). Below is a cropped excerpt from the alert log. The full output (which I think deserves a look) is available at ASM-ALERTLOG-Output-Failure-CELLI03-and-CELL01.txt
 
2020-03-22T17:18:39.699555+01:00
WARNING: Write Failed. group:1 disk:1 AU:1 offset:4190208 size:4096
path:ORCL:CELLI01
         incarnation:0xf0f0c1f3 asynchronous result:'I/O error'
         subsys:/opt/oracle/extapi/64/asm/orcl/1/libasm.so krq:0x7f9182f833d0 bufp:0x7f91836ef000 osderr1:0x3 osderr2:0x2e
         IO elapsed time: 0 usec Time waited on I/O: 0 usec
WARNING: Hbeat write to PST disk 1.4042310131 in group 1 failed. [2]
2020-03-22T17:18:39.704035+01:00
...
...
2020-03-22T17:18:39.746945+01:00
NOTE: cache closing disk 9 of grp 1: (not open) CELLI03
ERROR: disk 1 (CELLI01) in group 1 (DATA) cannot be offlined because all disks [1(CELLI01), 9(CELLI03)] with mirrored data would be offline.
2020-03-22T17:18:39.747462+01:00
ERROR: too many offline disks in PST (grp 1)
2020-03-22T17:18:39.759171+01:00
NOTE: cache dismounting (not clean) group 1/0xB48031B9 (DATA)
NOTE: messaging CKPT to quiesce pins Unix process pid: 12050, image: oracle@asmrec.oralocal (B001)
2020-03-22T17:18:39.761807+01:00
NOTE: halting all I/Os to diskgroup 1 (DATA)
2020-03-22T17:18:39.766289+01:00
NOTE: LGWR doing non-clean dismount of group 1 (DATA) thread 1
NOTE: LGWR sync ABA=23.3751 last written ABA 23.3751
...
...
2020-03-22T17:18:40.207406+01:00
SQL> alter diskgroup DATA dismount force /* ASM SERVER:3028300217 */
...
...
2020-03-22T17:18:40.841979+01:00
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_8756.trc:
ORA-15130: diskgroup "DATA" is being dismounted
2020-03-22T17:18:40.853738+01:00
...
...
ERROR: disk 1 (CELLI01) in group 1 (DATA) cannot be offlined because all disks [1(CELLI01), 9(CELLI03)] with mirrored data would be offline.
2020-03-22T17:18:40.861939+01:00
ERROR: too many offline disks in PST (grp 1)
...
...
2020-03-22T17:18:43.214368+01:00
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_8756.trc:
ORA-15130: diskgroup "DATA" is being dismounted
2020-03-22T17:18:43.214885+01:00
NOTE: client DBC19:DBC19:asmrec no longer has group 1 (DATA) mounted
2020-03-22T17:18:43.215492+01:00
NOTE: client DBB19:DBB19:asmrec no longer has group 1 (DATA) mounted
NOTE: cache deleting context for group DATA 1/0xb48031b9
...
...
2020-03-22T17:18:43.298551+01:00
SUCCESS: alter diskgroup DATA dismount force /* ASM SERVER:3028300217 */
SUCCESS: ASM-initiated MANDATORY DISMOUNT of group DATA
2020-03-22T17:18:43.352003+01:00
SQL> ALTER DISKGROUP DATA MOUNT  /* asm agent *//* {0:1:9} */
2020-03-22T17:18:43.372816+01:00
NOTE: cache registered group DATA 1/0xB44031BF
NOTE: cache began mount (first) of group DATA 1/0xB44031BF
NOTE: Assigning number (1,8) to disk (ORCL:CELLI02)
NOTE: Assigning number (1,0) to disk (ORCL:CELLI04)
NOTE: Assigning number (1,11) to disk (ORCL:CELLI05)
NOTE: Assigning number (1,3) to disk (ORCL:CELLI06)
NOTE: Assigning number (1,2) to disk (ORCL:CELLI07)
2020-03-22T17:18:43.514642+01:00
cluster guid (e4db41a22bd95fc6bf79d2e2c93360c7) generated for PST Hbeat for instance 1
2020-03-22T17:18:46.089517+01:00
NOTE: detected and added orphaned client id 0x10010
NOTE: detected and added orphaned client id 0x1000e
 
So, the second failure occurred at 17:18 and led to a forced dismount of the diskgroup. You can see messages like “NOTE: cache dismounting (not clean)”, “ERROR: too many offline disks in PST (grp 1)”, and even “ERROR: disk 1 (CELLI01) in group 1 (DATA) cannot be offlined because all disks [1(CELLI01), 9(CELLI03)] with mirrored data would be offline”.
So, some data was probably lost. And if you consider that data was changed in the databases during these 4 minutes, the mess is big. If you want to see the alert logs from the databases, check ASM-ALERTLOG-Output-From-Databases-Alertlog-at-Failure.txt
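When the alert log is long, it can help to pull out only the lines that matter together with their timestamps. The following is a minimal sketch in Python, not part of the original procedure: the function name `critical_events` and the list of prefixes are assumptions made for illustration, and the timestamp pattern is taken from the excerpt above.

```python
import re

# Timestamp lines as they appear in the excerpt above,
# e.g. 2020-03-22T17:18:39.747462+01:00
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+[+-]\d{2}:\d{2}$")

# Assumption: these prefixes are the ones worth flagging; extend as needed.
PREFIXES = ("ERROR:", "WARNING:", "ORA-")


def critical_events(lines):
    """Return (timestamp, message) pairs for lines starting with PREFIXES."""
    events = []
    current_ts = None
    for raw in lines:
        line = raw.strip()
        if TIMESTAMP.match(line):
            # Every following message belongs to this timestamp.
            current_ts = line
        elif line.startswith(PREFIXES):
            events.append((current_ts, line))
    return events


excerpt = """\
2020-03-22T17:18:39.746945+01:00
NOTE: cache closing disk 9 of grp 1: (not open) CELLI03
ERROR: disk 1 (CELLI01) in group 1 (DATA) cannot be offlined because all disks [1(CELLI01), 9(CELLI03)] with mirrored data would be offline.
2020-03-22T17:18:39.747462+01:00
ERROR: too many offline disks in PST (grp 1)
"""

for ts, msg in critical_events(excerpt.splitlines()):
    print(ts, msg)
```

To run it against a real alert log, pass the lines of the file instead of the inline excerpt. Indented continuation lines (such as the `path:` and `incarnation:` details of a failed write) are skipped by this sketch.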
And now we have this at ASM:
 
SQL> select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;
NAME                                     FAILGROUP                      LABEL                           PATH
---------------------------------------- ------------------------------ ------------------------------- ------------------------------------------------------------
RECI01                                   RECI01                         RECI01                          ORCL:RECI01
SYSTEMIDG01                              SYSTEMIDG01                    SYSI01                          ORCL:SYSI01
                                                                        CELLI02                         ORCL:CELLI02
                                                                        CELLI04                         ORCL:CELLI04
                                                                        CELLI05                         ORCL:CELLI05
                                                                        CELLI06                         ORCL:CELLI06
                                                                        CELLI07                         ORCL:CELLI07
7 rows selected.
SQL>
 
And if we try to mount, we receive an error because of the offline disks:
 
SQL> alter diskgroup data mount;
alter diskgroup data mount
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "9" is missing from group number "1"
ORA-15042: ASM disk "1" is missing from group number "1"
SQL>
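The ORA-15042 lines tell you exactly which disk numbers ASM is missing, and these match the disk numbers in the alert log (1 for CELLI01, 9 for CELLI03). As a small illustration, assuming the SQL*Plus output was captured as text, a sketch like the one below can extract them. The function name `missing_disks` is hypothetical.

```python
import re

# Matches lines such as: ORA-15042: ASM disk "9" is missing from group number "1"
MISSING = re.compile(r'ORA-15042: ASM disk "(\d+)" is missing from group number "(\d+)"')


def missing_disks(sqlplus_output):
    """Return {group_number: [missing disk numbers]} from SQL*Plus error text."""
    result = {}
    for disk, group in MISSING.findall(sqlplus_output):
        result.setdefault(int(group), []).append(int(disk))
    # Sort the disk numbers so the result is stable regardless of message order.
    return {group: sorted(disks) for group, disks in result.items()}


output = """\
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "9" is missing from group number "1"
ORA-15042: ASM disk "1" is missing from group number "1"
"""
print(missing_disks(output))  # {1: [1, 9]}
```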
 
Now comes the key decision. If you have important data that is worth the effort of trying to recover, you can continue. It is your decision, and it depends on several details. Since the diskgroup is dismounted, the repair timer is not counting, so you have days to work on the recovery. Sometimes one day of downtime is better than several days spent restoring all databases from the last backup.
Imagine that you can bring back online the first failgroup that failed (CELLI03), which is 4 minutes behind on data:
 
[root@asmrec ~]# iscsiadm -m node -T iqn.2006-01.com.openfiler:tsn.bb66b92348a7 -p 172.16.0.3:3260 -l
Logging in to [iface: default, target: iqn.2006-01.com.openfiler:tsn.bb66b92348a7, portal: 172.16.0.3,3260] (multiple)
Login to [iface: default, target: iqn.2006-01.com.openfiler:tsn.bb66b92348a7, portal: 172.16.0.3,3260] successful.
[root@asmrec ~]#
 
And if you try to mount it normally, you will receive an error (the alert log output from this attempt can be seen at ASM-ALERTLOG-Output-Mout-With-One-Disk-Online):
 
SQL> alter diskgroup data mount;
alter diskgroup data mount
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15017: diskgroup "DATA" cannot be mounted
ORA-15066: offlining disk "1" in group "DATA" may result in a data loss
SQL>
 
So, now we can try the mount restricted force for recovery:
 
SQL> alter diskgroup data mount restricted force for recovery;
Diskgroup altered.
SQL>
 
The ASM alert log (the full output is at ASM-ALERTLOG-Output-Mout-Restricted-Force-For-Recovery.txt) reports messages related to the diskgroup cache and to disks that need to be checked. And now we are like this:
 
SQL> select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;
NAME                                     FAILGROUP                      LABEL                           PATH
---------------------------------------- ------------------------------ ------------------------------- ------------------------------------------------------------
CELLI01                                  CELLI01
CELLI02                                  CELLI02                        CELLI02                         ORCL:CELLI02
CELLI03                                  CELLI03
CELLI04                                  CELLI04                        CELLI04                         ORCL:CELLI04
CELLI05                                  CELLI05                        CELLI05                         ORCL:CELLI05
CELLI06                                  CELLI06                        CELLI06                         ORCL:CELLI06
CELLI07                                  CELLI07                        CELLI07                         ORCL:CELLI07
RECI01                                   RECI01                         RECI01                          ORCL:RECI01
SYSTEMIDG01                              SYSTEMIDG01                    SYSI01                          ORCL:SYSI01
                                                                        CELLI03                         ORCL:CELLI03
10 rows selected.
SQL>
 
The next step is to bring online the failgroup that came back:
 
SQL> alter diskgroup data online disks in failgroup CELLI03;
Diskgroup altered.
SQL>
 
By doing this, ASM will resync this failgroup (using its blocks as the last version) and bring the cache of this disk online. In the ASM alert log you can see messages like these (full output at ASM-ALERTLOG-Output-Online-Restored-Failgroup):
 
2020-03-22T17:27:47.729003+01:00
SQL> alter diskgroup data online disks in failgroup CELLI03
2020-03-22T17:27:47.729551+01:00
NOTE: cache closing disk 1 of grp 1: (not open) CELLI01
2020-03-22T17:27:47.729640+01:00
NOTE: cache closing disk 9 of grp 1: (not open) CELLI03
2020-03-22T17:27:47.730398+01:00
NOTE: GroupBlock outside rolling migration privileged region
NOTE: initiating resync of disk group 1 disks
CELLI03 (9)
NOTE: process _user6891_+asm1 (6891) initiating offline of disk 9.4042310248 (CELLI03) with mask 0x7e in group 1 (DATA) without client assisting
2020-03-22T17:27:47.737580+01:00
...
...
2020-03-22T17:27:47.796524+01:00
NOTE: disk validation pending for 1 disk in group 1/0x1d7031d4 (DATA)
NOTE: Found ORCL:CELLI03 for disk CELLI03
NOTE: completed disk validation for 1/0x1d7031d4 (DATA)
2020-03-22T17:27:47.935467+01:00
...
...
2020-03-22T17:27:48.116572+01:00
NOTE: cache closing disk 1 of grp 1: (not open) CELLI01
NOTE: cache opening disk 9 of grp 1: CELLI03 label:CELLI03
2020-03-22T17:27:48.117158+01:00
SUCCESS: refreshed membership for 1/0x1d7031d4 (DATA)
2020-03-22T17:27:48.123545+01:00
NOTE: initiating PST update: grp 1 (DATA), dsk = 9/0x0, mask = 0x5d, op = assign mandatory
...
...
2020-03-22T17:27:48.142068+01:00
NOTE: PST update grp = 1 completed successfully
2020-03-22T17:27:48.143197+01:00
SUCCESS: alter diskgroup data online disks in failgroup CELLI03
2020-03-22T17:27:48.577277+01:00
NOTE: Attempting voting file refresh on diskgroup DATA
NOTE: Refresh completed on diskgroup DATA. No voting file found.
...
...
2020-03-22T17:27:48.643277+01:00
NOTE: Starting resync using Staleness Registry and ATE scan for group 1
2020-03-22T17:27:48.696075+01:00
NOTE: Starting resync using Staleness Registry and ATE scan for group 1
NOTE: header on disk 9 advanced to format #2 using fcn 0.0
2020-03-22T17:27:49.725837+01:00
WARNING: Started Drop Disk Timeout for Disk 1 (CELLI01) in group 1 with a value 43200
2020-03-22T17:27:57.301042+01:00
...
2020-03-22T17:27:59.687480+01:00
NOTE: cache closing disk 1 of grp 1: (not open) CELLI01
NOTE: reset timers for disk: 9
NOTE: completed online of disk group 1 disks
CELLI03 (9)
2020-03-22T17:27:59.714674+01:00
ERROR: ORA-15421 thrown in ARBA for group number 1
2020-03-22T17:27:59.714805+01:00
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_arba_8786.trc:
ORA-15421: Rebalance is not supported when the disk group is mounted for recovery.
2020-03-22T17:27:59.715047+01:00
NOTE: stopping process ARB0
NOTE: stopping process ARBA
2020-03-22T17:28:00.652115+01:00
NOTE: rebalance interrupted for group 1/0x1d7031d4 (DATA)
 
And now we have at ASM:
 
SQL> select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;
NAME                                     FAILGROUP                      LABEL                           PATH
---------------------------------------- ------------------------------ ------------------------------- ------------------------------------------------------------
CELLI01                                  CELLI01
CELLI02                                  CELLI02                        CELLI02                         ORCL:CELLI02
CELLI03                                  CELLI03                        CELLI03                         ORCL:CELLI03
CELLI04                                  CELLI04                        CELLI04                         ORCL:CELLI04
CELLI05                                  CELLI05                        CELLI05                         ORCL:CELLI05
CELLI06                                  CELLI06                        CELLI06                         ORCL:CELLI06
CELLI07                                  CELLI07                        CELLI07                         ORCL:CELLI07
RECI01                                   RECI01                         RECI01                          ORCL:RECI01
SYSTEMIDG01                              SYSTEMIDG01                    SYSI01                          ORCL:SYSI01
9 rows selected.
SQL>
 
And the rebalance does not continue, because it is not allowed while the diskgroup is mounted in restricted mode for recovery:
 
SQL> select * from gv$asm_operation;
   INST_ID GROUP_NUMBER OPERA PASS      STAT      POWER     ACTUAL      SOFAR   EST_WORK   EST_RATE EST_MINUTES ERROR_CODE                                       CON_ID
---------- ------------ ----- --------- ---- ---------- ---------- ---------- ---------- ---------- ----------- -------------------------------------------- ----------
         1            1 REBAL COMPACT   WAIT          1                                                                                                               0
         1            1 REBAL REBALANCE ERRS          1                                                         ORA-15421                                             0
         1            1 REBAL REBUILD   WAIT          1                                                                                                               0
         1            1 REBAL RESYNC    WAIT          1                                                                                                               0
SQL>
 
But since the failgroup came online “in a forced way”, the old cache (from CELLI01) needs to be cleaned. And since the data in this failgroup is not the last version, some files may have been corrupted. To check this, you can look at the *arb* process trace files in the ASM trace directory:
 
[root@asmrec trace]# ls -lFhtr *arb*
...
...
-rw-r----- 1 grid oinstall 6.4K Mar 22 17:10 +ASM1_arb0_3210.trm
-rw-r----- 1 grid oinstall  44K Mar 22 17:10 +ASM1_arb0_3210.trc
-rw-r----- 1 grid oinstall  984 Mar 22 17:27 +ASM1_arb0_8788.trm
-rw-r----- 1 grid oinstall 2.1K Mar 22 17:27 +ASM1_arb0_8788.trc
-rw-r----- 1 grid oinstall  882 Mar 22 17:27 +ASM1_arba_8786.trm
-rw-r----- 1 grid oinstall 1.2K Mar 22 17:27 +ASM1_arba_8786.trc
[root@asmrec trace]#
 
And looking at one of the last ones, we can see that some extents (that do not exist in the recovered failgroup, or whose cached version is not the last one) were filled with dummy (BADFDA7A) data:
 
[root@asmrec trace]# cat +ASM1_arb0_8788.trc
Trace file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_arb0_8788.trc
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.6.0.0.0
Build label:    RDBMS_19.3.0.0.0DBRU_LINUX.X64_190417
ORACLE_HOME:    /u01/app/19.0.0.0/grid
System name:    Linux
Node name:      asmrec.oralocal
Release:        4.14.35-1902.10.8.el7uek.x86_64
Version:        #2 SMP Thu Feb 6 11:02:28 PST 2020
Machine:        x86_64
Instance name: +ASM1
Redo thread mounted by this instance: 0 <none>
Oracle process number: 40
Unix process pid: 8788, image: oracle@asmrec.oralocal (ARB0)
*** 2020-03-22T17:27:59.044949+01:00
*** SESSION ID:(402.55837) 2020-03-22T17:27:59.044969+01:00
*** CLIENT ID:() 2020-03-22T17:27:59.044975+01:00
*** SERVICE NAME:() 2020-03-22T17:27:59.044980+01:00
*** MODULE NAME:() 2020-03-22T17:27:59.044985+01:00
*** ACTION NAME:() 2020-03-22T17:27:59.044989+01:00
*** CLIENT DRIVER:() 2020-03-22T17:27:59.044994+01:00
 WARNING: group 1, file 266, extent 22: filling extent with BADFDA7A during recovery
 WARNING: group 1, file 266, extent 22: filling extent with BADFDA7A during recovery
 WARNING: group 1, file 266, extent 22: filling extent with BADFDA7A during recovery
 WARNING: group 1, file 266, extent 22: filling extent with BADFDA7A during recovery
 WARNING: group 1, file 258, extent 7: filling extent with BADFDA7A during recovery
 WARNING: group 1, file 258, extent 7: filling extent with BADFDA7A during recovery
 WARNING: group 1, file 258, extent 7: filling extent with BADFDA7A during recovery
 WARNING: group 1, file 258, extent 7: filling extent with BADFDA7A during recovery
*** 2020-03-22T17:27:59.680119+01:00
NOTE: initiating PST update: grp 1 (DATA), dsk = 9/0x0, mask = 0x7f, op = assign mandatory
kfdp_updateDsk(): callcnt 195 grp 1
PST verChk -0: req, id=266369333, grp=1, requested=91 at 03/22/2020 17:27:59
NOTE: PST update grp = 1 completed successfully
NOTE: kfdsFilter_freeDskSrSlice for Filter 0x7fbaf6238d38
NOTE: kfdsFilter_clearDskSlice for Filter 0x7fbaf6238d38 (all:TRUE)
NOTE: completed online of disk group 1 disks
CELLI03 (9)
[root@asmrec trace]#
 
And as you can imagine, this will lead to files that need to be restored from backup. But note that it is just some data, not everything. Remember from the beginning of the post that this depends on how your data is distributed across the ASM failgroups. If you are lucky, only a little data is impacted. This depends on a lot of factors, such as how long the failgroup was offline, the size of the failgroup, the activity of your databases, and many others. But the gains can be good and make it worth the effort.
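The ARB0 trace repeats the same warning several times per extent, so it is easier to reduce it to a unique list of ASM files and extents. This is a minimal sketch, assuming the warning text has exactly the format shown in the trace above. The function name `filled_extents` is hypothetical.

```python
import re
from collections import defaultdict

# Matches: WARNING: group 1, file 266, extent 22: filling extent with BADFDA7A during recovery
FILLED = re.compile(
    r"WARNING: group (\d+), file (\d+), extent (\d+): filling extent with BADFDA7A"
)


def filled_extents(trace_text):
    """Return {(group, asm_file_number): sorted unique extents} found in a trace."""
    found = defaultdict(set)
    for group, file_no, extent in FILLED.findall(trace_text):
        # A set removes the repeated warnings for the same extent.
        found[(int(group), int(file_no))].add(int(extent))
    return {key: sorted(extents) for key, extents in found.items()}


trace = """\
 WARNING: group 1, file 266, extent 22: filling extent with BADFDA7A during recovery
 WARNING: group 1, file 266, extent 22: filling extent with BADFDA7A during recovery
 WARNING: group 1, file 258, extent 7: filling extent with BADFDA7A during recovery
 WARNING: group 1, file 258, extent 7: filling extent with BADFDA7A during recovery
"""

for (group, file_no), extents in filled_extents(trace).items():
    print(f"group {group}, ASM file {file_no}: extents {extents}")
```

For the trace shown above, this reduces the eight warnings to two ASM files (266 and 258) with one extent each.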
After that, we can normally dismount the diskgroup:
 
SQL> alter diskgroup data dismount;
Diskgroup altered.
SQL>
 
And mount it again:
 
SQL> alter diskgroup data mount;
Diskgroup altered.
SQL>
 
Since now the diskgroup is mounted in a clean way, you can continue with the rebalance:
 
SQL> select * from gv$asm_operation;
   INST_ID GROUP_NUMBER OPERA PASS      STAT      POWER     ACTUAL      SOFAR   EST_WORK   EST_RATE EST_MINUTES ERROR_CODE                                       CON_ID
---------- ------------ ----- --------- ---- ---------- ---------- ---------- ---------- ---------- ----------- -------------------------------------------- ----------
         1            1 REBAL COMPACT   WAIT          1                                                                                                               0
         1            1 REBAL REBALANCE ERRS          1                                                         ORA-15421                                             0
         1            1 REBAL REBUILD   WAIT          1                                                                                                               0
         1            1 REBAL RESYNC    WAIT          1                                                                                                               0
SQL> alter diskgroup DATA rebalance;
Diskgroup altered.
SQL>
 
The state on the ASM side is:
 
SQL> select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;
NAME                                     FAILGROUP                      LABEL                           PATH
---------------------------------------- ------------------------------ ------------------------------- ------------------------------------------------------------
CELLI01                                  CELLI01
CELLI02                                  CELLI02                        CELLI02                         ORCL:CELLI02
CELLI03                                  CELLI03                        CELLI03                         ORCL:CELLI03
CELLI04                                  CELLI04                        CELLI04                         ORCL:CELLI04
CELLI05                                  CELLI05                        CELLI05                         ORCL:CELLI05
CELLI06                                  CELLI06                        CELLI06                         ORCL:CELLI06
CELLI07                                  CELLI07                        CELLI07                         ORCL:CELLI07
RECI01                                   RECI01                         RECI01                          ORCL:RECI01
SYSTEMIDG01                              SYSTEMIDG01                    SYSI01                          ORCL:SYSI01
9 rows selected.
SQL>
 
As you can see, CELLI01 was not removed yet (I will talk about it later). But the activities can continue, and the databases can be checked.
 
Database side
 
On the database side we need to check what we lost and what needs to be recovered. Since I am using a cluster, GI tried to start the databases (and as you can see, two came up):
 
[oracle@asmrec ~]$ ps -ef |grep smon
root      8254     1  2 13:53 ?        00:04:40 /u01/app/19.0.0.0/grid/bin/osysmond.bin
grid      8750     1  0 13:54 ?        00:00:00 asm_smon_+ASM1
oracle   11589     1  0 17:31 ?        00:00:00 ora_smon_DBB19
oracle   11751     1  0 17:31 ?        00:00:00 ora_smon_DBA19
oracle   18817 29146  0 17:44 pts/9    00:00:00 grep --color=auto smon
[oracle@asmrec ~]$
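To see quickly which databases did not come back, the `ps` output can be compared with the list of expected instances. The sketch below is only an illustration: the function name `running_instances` is hypothetical, and the expected list (DBA19, DBB19, DBC19) is taken from the databases used in this post.

```python
import re

# Database SMON processes are named ora_smon_<SID>; asm_smon_+ASM1 is ignored on purpose.
SMON = re.compile(r"\bora_smon_(\S+)\s*$")


def running_instances(ps_output):
    """Return the set of database SIDs that have an SMON process running."""
    sids = set()
    for line in ps_output.splitlines():
        match = SMON.search(line)
        if match:
            sids.add(match.group(1))
    return sids


ps_output = """\
root      8254     1  2 13:53 ?        00:04:40 /u01/app/19.0.0.0/grid/bin/osysmond.bin
grid      8750     1  0 13:54 ?        00:00:00 asm_smon_+ASM1
oracle   11589     1  0 17:31 ?        00:00:00 ora_smon_DBB19
oracle   11751     1  0 17:31 ?        00:00:00 ora_smon_DBA19
oracle   18817 29146  0 17:44 pts/9    00:00:00 grep --color=auto smon
"""

expected = {"DBA19", "DBB19", "DBC19"}
print("not running:", sorted(expected - running_instances(ps_output)))
```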
 
DBA19
The first one that I checked was DBA19. I used RMAN to VALIDATE DATABASE:
 
[oracle@asmrec ~]$ rman target /
Recovery Manager: Release 19.0.0.0.0 - Production on Sun Mar 22 17:45:21 2020
Version 19.6.0.0.0
Copyright (c) 1982, 2019, Oracle and/or its affiliates.  All rights reserved.
connected to target database: DBA19 (DBID=828667324)
RMAN> validate database;
Starting validate at 22-MAR-20
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=260 device type=DISK
channel ORA_DISK_1: starting validation of datafile
channel ORA_DISK_1: specifying datafile(s) for validation
input datafile file number=00001 name=+DATA/DBA19/DATAFILE/system.256.1035153873
input datafile file number=00004 name=+DATA/DBA19/DATAFILE/undotbs1.258.1035153973
input datafile file number=00003 name=+DATA/DBA19/DATAFILE/sysaux.257.1035153927
input datafile file number=00007 name=+DATA/DBA19/DATAFILE/users.259.1035153975
channel ORA_DISK_1: validation complete, elapsed time: 00:03:45
List of Datafiles
=================
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
1    OK     0              17722        117766          5042446
  File Name: +DATA/DBA19/DATAFILE/system.256.1035153873
  Block Type Blocks Failing Blocks Processed
  ---------- -------------- ----------------
  Data       0              79105
  Index      0              13210
  Other      0              7723
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
3    OK     0              19445        67862           5042695
  File Name: +DATA/DBA19/DATAFILE/sysaux.257.1035153927
  Block Type Blocks Failing Blocks Processed
  ---------- -------------- ----------------
  Data       0              7988
  Index      0              5531
  Other      0              34876
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
4    FAILED 1              49           83247           5042695
  File Name: +DATA/DBA19/DATAFILE/undotbs1.258.1035153973
  Block Type Blocks Failing Blocks Processed
  ---------- -------------- ----------------
  Data       0              0
  Index      0              0
  Other      511            83151
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
7    OK     0              93           641             4941613
  File Name: +DATA/DBA19/DATAFILE/users.259.1035153975
  Block Type Blocks Failing Blocks Processed
  ---------- -------------- ----------------
  Data       0              65
  Index      0              15
  Other      0              467
validate found one or more corrupt blocks
See trace file /u01/app/oracle/diag/rdbms/dba19/DBA19/trace/DBA19_ora_19219.trc for details
channel ORA_DISK_1: starting validation of datafile
channel ORA_DISK_1: specifying datafile(s) for validation
including current control file for validation
including current SPFILE in backup set
channel ORA_DISK_1: validation complete, elapsed time: 00:00:01
List of Control File and SPFILE
===============================
File Type    Status Blocks Failing Blocks Examined
------------ ------ -------------- ---------------
SPFILE       OK     0              2
Control File OK     0              646
Finished validate at 22-MAR-20
RMAN> shutdown abort;
Oracle instance shut down
RMAN> startup mount;
connected to target database (not started)
Oracle instance started
database mounted
Total System Global Area    1610610776 bytes
Fixed Size                     8910936 bytes
Variable Size                859832320 bytes
Database Buffers             734003200 bytes
Redo Buffers                   7864320 bytes
RMAN> run{
2> restore datafile 4;
3> recover datafile 4;
4> }
Starting restore at 22-MAR-20
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=249 device type=DISK
channel ORA_DISK_1: starting datafile backup set restore
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
channel ORA_DISK_1: restoring datafile 00004 to +DATA/DBA19/DATAFILE/undotbs1.258.1035153973
channel ORA_DISK_1: reading from backup piece /tmp/9puro5qr_1_1
channel ORA_DISK_1: piece handle=/tmp/9puro5qr_1_1 tag=BKP-DB-INC0
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: restore complete, elapsed time: 00:00:45
Finished restore at 22-MAR-20
Starting recover at 22-MAR-20
using channel ORA_DISK_1
starting media recovery
media recovery complete, elapsed time: 00:00:02
Finished recover at 22-MAR-20
RMAN> alter database open;
Statement processed
RMAN> exit
Recovery Manager complete.
[oracle@asmrec ~]$
[oracle@asmrec ~]$
 
As you can see, datafile 4 FAILED and needed to be recovered. Luckily, the redo was not affected as well, and the open was OK. Since it was the UNDO, I did a shutdown abort (because a shutdown immediate can take an eternity, and with the undo unavailable nothing was happening inside the database anyway).
But as you saw, just one datafile was corrupted. Of course, with big databases and big failgroups, more files will be corrupted. But it is a shot that can be worth it.
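The failed datafile here is undotbs1.258.1035153973, and 258 is one of the ASM file numbers reported in the ARB0 trace. In a fully qualified ASM file name, the number before the last one is the ASM file number, so the trace can be used to predict which datafiles are suspect before running the validation. The sketch below shows the idea. The function names are hypothetical, and the set of damaged file numbers is typed in by hand from the trace.

```python
def asm_file_number(path):
    """Return the ASM file number from a name like +DATA/DB/DATAFILE/tag.258.1035153973."""
    # The name ends in <tag>.<file_number>.<incarnation>; a name without
    # two dots raises ValueError, which is acceptable for this sketch.
    _tag, file_no, _incarnation = path.rsplit(".", 2)
    return int(file_no)


def affected_datafiles(datafiles, damaged_file_numbers):
    """Return the datafiles whose ASM file number is in damaged_file_numbers."""
    return [p for p in datafiles if asm_file_number(p) in damaged_file_numbers]


# Datafile names from the RMAN output of DBA19.
dba19 = [
    "+DATA/DBA19/DATAFILE/system.256.1035153873",
    "+DATA/DBA19/DATAFILE/undotbs1.258.1035153973",
    "+DATA/DBA19/DATAFILE/sysaux.257.1035153927",
    "+DATA/DBA19/DATAFILE/users.259.1035153975",
]

# ASM file numbers seen in the ARB0 trace (assumed to be copied by hand).
damaged = {258, 266}

print(affected_datafiles(dba19, damaged))
```

This only narrows down the candidates. RMAN VALIDATE remains the real check, since other trace files may list more extents.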
 
DBB19
The second was DBB19 and I used the same approach, VALIDATE DATABASE:
 
[oracle@asmrec ~]$ export ORACLE_SID=DBB19
[oracle@asmrec ~]$
[oracle@asmrec ~]$ rman target /
Recovery Manager: Release 19.0.0.0.0 - Production on Sun Mar 22 17:55:20 2020
Version 19.6.0.0.0
Copyright (c) 1982, 2019, Oracle and/or its affiliates.  All rights reserved.
PL/SQL package SYS.DBMS_BACKUP_RESTORE version 19.03.00.00 in TARGET database is not current
PL/SQL package SYS.DBMS_RCVMAN version 19.03.00.00 in TARGET database is not current
connected to target database: DBB19 (DBID=1336872427)
RMAN> validate database;
Starting validate at 22-MAR-20
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=374 device type=DISK
channel ORA_DISK_1: starting validation of datafile
channel ORA_DISK_1: specifying datafile(s) for validation
input datafile file number=00001 name=+DATA/DBB19/DATAFILE/system.261.1035154051
input datafile file number=00003 name=+DATA/DBB19/DATAFILE/sysaux.265.1035154177
input datafile file number=00004 name=+DATA/DBB19/DATAFILE/undotbs1.267.1035154235
input datafile file number=00007 name=+DATA/DBB19/DATAFILE/users.268.1035154241
channel ORA_DISK_1: validation complete, elapsed time: 00:00:35
List of Datafiles
=================
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
1    OK     0              16763        116487          3861452
  File Name: +DATA/DBB19/DATAFILE/system.261.1035154051
  Block Type Blocks Failing Blocks Processed
  ---------- -------------- ----------------
  Data       0              78871
  Index      0              13010
  Other      0              7836
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
3    OK     0              19307        62758           3861452
  File Name: +DATA/DBB19/DATAFILE/sysaux.265.1035154177
  Block Type Blocks Failing Blocks Processed
  ---------- -------------- ----------------
  Data       0              7459
  Index      0              5158
  Other      0              30796
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
4    OK     0              1            35847           3652497
  File Name: +DATA/DBB19/DATAFILE/undotbs1.267.1035154235
  Block Type Blocks Failing Blocks Processed
  ---------- -------------- ----------------
  Data       0              0
  Index      0              0
  Other      0              35839
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
7    OK     0              85           641             3759202
  File Name: +DATA/DBB19/DATAFILE/users.268.1035154241
  Block Type Blocks Failing Blocks Processed
  ---------- -------------- ----------------
  Data       0              70
  Index      0              15
  Other      0              470
channel ORA_DISK_1: starting validation of datafile
channel ORA_DISK_1: specifying datafile(s) for validation
including current control file for validation
including current SPFILE in backup set
channel ORA_DISK_1: validation complete, elapsed time: 00:00:01
List of Control File and SPFILE
===============================
File Type    Status Blocks Failing Blocks Examined
------------ ------ -------------- ---------------
SPFILE       OK     0              2
Control File OK     0              646
Finished validate at 22-MAR-20
RMAN> VALIDATE CHECK LOGICAL DATABASE;
Starting validate at 22-MAR-20
using channel ORA_DISK_1
channel ORA_DISK_1: starting validation of datafile
channel ORA_DISK_1: specifying datafile(s) for validation
input datafile file number=00001 name=+DATA/DBB19/DATAFILE/system.261.1035154051
input datafile file number=00003 name=+DATA/DBB19/DATAFILE/sysaux.265.1035154177
input datafile file number=00004 name=+DATA/DBB19/DATAFILE/undotbs1.267.1035154235
input datafile file number=00007 name=+DATA/DBB19/DATAFILE/users.268.1035154241
channel ORA_DISK_1: validation complete, elapsed time: 00:00:35
List of Datafiles
=================
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
1    OK     0              16763        116487          3861452
  File Name: +DATA/DBB19/DATAFILE/system.261.1035154051
  Block Type Blocks Failing Blocks Processed
  ---------- -------------- ----------------
  Data       0              78871
  Index      0              13010
  Other      0              7836
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
3    OK     0              19307        62758           3861452
  File Name: +DATA/DBB19/DATAFILE/sysaux.265.1035154177
  Block Type Blocks Failing Blocks Processed
  ---------- -------------- ----------------
  Data       0              7459
  Index      0              5158
  Other      0              30796
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
4    OK     0              1            35847           3652497
  File Name: +DATA/DBB19/DATAFILE/undotbs1.267.1035154235
  Block Type Blocks Failing Blocks Processed
  ---------- -------------- ----------------
  Data       0              0
  Index      0              0
  Other      0              35839
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
7    OK     0              85           641             3759202
  File Name: +DATA/DBB19/DATAFILE/users.268.1035154241
  Block Type Blocks Failing Blocks Processed
  ---------- -------------- ----------------
  Data       0              70
  Index      0              15
  Other      0              470
channel ORA_DISK_1: starting validation of datafile
channel ORA_DISK_1: specifying datafile(s) for validation
including current control file for validation
including current SPFILE in backup set
channel ORA_DISK_1: validation complete, elapsed time: 00:00:01
List of Control File and SPFILE
===============================
File Type    Status Blocks Failing Blocks Examined
------------ ------ -------------- ---------------
SPFILE       OK     0              2
Control File OK     0              646
Finished validate at 22-MAR-20
RMAN> exit
Recovery Manager complete.
[oracle@asmrec ~]$
[oracle@asmrec ~]$
[oracle@asmrec ~]$
 
As you can see, there were no failures for DBB19. Since the validate returned no failed files, I also ran VALIDATE CHECK LOGICAL DATABASE to check the blocks for logical corruption as well.
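As a complement to the validate, the result can also be checked from the dictionary side. This is a minimal sketch, not part of the original session: it assumes the database is at least mounted and that the RMAN validate has already run, since RMAN records the corrupt blocks it finds in V$DATABASE_BLOCK_CORRUPTION.

```sql
-- Lists the blocks flagged as corrupt by the most recent RMAN validate/backup.
-- "no rows selected" means nothing was recorded for this database.
select file#, block#, blocks, corruption_change#, corruption_type
  from v$database_block_corruption
 order by file#, block#;
```

For DBB19 the expectation would be an empty result, matching the "OK" status shown in the validate output above.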
 
DBC19
The same procedure for the last database, but this time datafile 3 failed:
 
[oracle@asmrec ~]$ export ORACLE_SID=DBC19
[oracle@asmrec ~]$ rman target /
Recovery Manager: Release 19.0.0.0.0 - Production on Sun Mar 22 18:01:33 2020
Version 19.6.0.0.0
Copyright (c) 1982, 2019, Oracle and/or its affiliates.  All rights reserved.
connected to target database (not started)
RMAN> startup mount;
Oracle instance started
database mounted
Total System Global Area    1610610776 bytes
Fixed Size                     8910936 bytes
Variable Size                864026624 bytes
Database Buffers             729808896 bytes
Redo Buffers                   7864320 bytes
RMAN> validate database;
Starting validate at 22-MAR-20
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=134 device type=DISK
channel ORA_DISK_1: starting validation of datafile
channel ORA_DISK_1: specifying datafile(s) for validation
input datafile file number=00001 name=+DATA/DBC19/DATAFILE/system.262.1035154053
input datafile file number=00004 name=+DATA/DBC19/DATAFILE/undotbs1.270.1035154249
input datafile file number=00003 name=+DATA/DBC19/DATAFILE/sysaux.266.1035154181
input datafile file number=00007 name=+DATA/DBC19/DATAFILE/users.271.1035154253
channel ORA_DISK_1: validation complete, elapsed time: 00:03:15
List of Datafiles
=================
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
1    OK     0              17777        117764          4188744
  File Name: +DATA/DBC19/DATAFILE/system.262.1035154053
  Block Type Blocks Failing Blocks Processed
  ---------- -------------- ----------------
  Data       0              79161
  Index      0              13182
  Other      0              7640
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
3    FAILED 1              19272        66585           4289434
  File Name: +DATA/DBC19/DATAFILE/sysaux.266.1035154181
  Block Type Blocks Failing Blocks Processed
  ---------- -------------- ----------------
  Data       0              7311
  Index      0              4878
  Other      511            35099
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
4    OK     0              1            84522           4188748
  File Name: +DATA/DBC19/DATAFILE/undotbs1.270.1035154249
  Block Type Blocks Failing Blocks Processed
  ---------- -------------- ----------------
  Data       0              0
  Index      0              0
  Other      0              84479
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
7    OK     0              93           641             3717377
  File Name: +DATA/DBC19/DATAFILE/users.271.1035154253
  Block Type Blocks Failing Blocks Processed
  ---------- -------------- ----------------
  Data       0              65
  Index      0              15
  Other      0              467
validate found one or more corrupt blocks
See trace file /u01/app/oracle/diag/rdbms/dbc19/DBC19/trace/DBC19_ora_22091.trc for details
channel ORA_DISK_1: starting validation of datafile
channel ORA_DISK_1: specifying datafile(s) for validation
including current control file for validation
including current SPFILE in backup set
channel ORA_DISK_1: validation complete, elapsed time: 00:00:01
List of Control File and SPFILE
===============================
File Type    Status Blocks Failing Blocks Examined
------------ ------ -------------- ---------------
SPFILE       OK     0              2
Control File OK     0              646
Finished validate at 22-MAR-20
RMAN> run{
2> restore datafile 3;
3> recover datafile 3;
4> }
Starting restore at 22-MAR-20
using channel ORA_DISK_1
channel ORA_DISK_1: starting datafile backup set restore
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
channel ORA_DISK_1: restoring datafile 00003 to +DATA/DBC19/DATAFILE/sysaux.266.1035154181
channel ORA_DISK_1: reading from backup piece /tmp/0buro5rh_1_1
channel ORA_DISK_1: piece handle=/tmp/0buro5rh_1_1 tag=BKP-DB-INC0
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: restore complete, elapsed time: 00:00:45
Finished restore at 22-MAR-20
Starting recover at 22-MAR-20
using channel ORA_DISK_1
starting media recovery
archived log for thread 1 with sequence 25 is already on disk as file +RECO/DBC19/ARCHIVELOG/2020_03_22/thread_1_seq_25.323.1035737103
archived log for thread 1 with sequence 26 is already on disk as file +RECO/DBC19/ARCHIVELOG/2020_03_22/thread_1_seq_26.329.1035739907
archived log for thread 1 with sequence 27 is already on disk as file +RECO/DBC19/ARCHIVELOG/2020_03_22/thread_1_seq_27.332.1035741283
archived log file name=+RECO/DBC19/ARCHIVELOG/2020_03_22/thread_1_seq_25.323.1035737103 thread=1 sequence=25
media recovery complete, elapsed time: 00:00:03
Finished recover at 22-MAR-20
RMAN> alter database open;
Statement processed
RMAN> exit
Recovery Manager complete.
[oracle@asmrec ~]$
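If you want an extra sanity check on the recovered file, the datafile header can be queried. This is only a sketch that I am adding for illustration (it was not part of the session above), and it assumes file number 3 as in this example:

```sql
-- Header status of the restored datafile.
-- After a complete recovery, ERROR and RECOVER should be empty or NO.
select file#, status, error, recover, fuzzy, checkpoint_change#
  from v$datafile_header
 where file# = 3;
```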
 
Dropping failgroup
If fixing the remaining failgroup takes too long, its disks will be dropped automatically. But we can also drop them manually with FORCE (note that without FORCE the command fails):
 
SQL> ALTER DISKGROUP data DROP DISKS IN FAILGROUP CELLI01;
ALTER DISKGROUP data DROP DISKS IN FAILGROUP CELLI01
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15084: ASM disk "CELLI01" is offline and cannot be dropped.
SQL>
SQL> ALTER DISKGROUP data DROP DISKS IN FAILGROUP CELLI01 FORCE;
Diskgroup altered.
SQL>
 
And after the rebalance finishes, all disks of the failgroup are removed:
 
SQL> select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;
NAME                                     FAILGROUP                      LABEL                           PATH
---------------------------------------- ------------------------------ ------------------------------- ------------------------------------------------------------
_DROPPED_0001_DATA                       CELLI01
CELLI02                                  CELLI02                        CELLI02                         ORCL:CELLI02
CELLI03                                  CELLI03                        CELLI03                         ORCL:CELLI03
CELLI04                                  CELLI04                        CELLI04                         ORCL:CELLI04
CELLI05                                  CELLI05                        CELLI05                         ORCL:CELLI05
CELLI06                                  CELLI06                        CELLI06                         ORCL:CELLI06
CELLI07                                  CELLI07                        CELLI07                         ORCL:CELLI07
RECI01                                   RECI01                         RECI01                          ORCL:RECI01
SYSTEMIDG01                              SYSTEMIDG01                    SYSI01                          ORCL:SYSI01
9 rows selected.
SQL> select * from gv$asm_operation;
   INST_ID GROUP_NUMBER OPERA PASS      STAT      POWER     ACTUAL      SOFAR   EST_WORK   EST_RATE EST_MINUTES ERROR_CODE                                       CON_ID
---------- ------------ ----- --------- ---- ---------- ---------- ---------- ---------- ---------- ----------- -------------------------------------------- ----------
         1            1 REBAL COMPACT   WAIT          1          1          0          0          0           0                                                       0
         1            1 REBAL REBALANCE WAIT          1          1          0          0          0           0                                                       0
         1            1 REBAL REBUILD   RUN           1          1        292        642        666           0                                                       0
         1            1 REBAL RESYNC    DONE          1          1          0          0          0           0                                                       0
SQL> select * from gv$asm_operation;
no rows selected
SQL> select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;
NAME                                     FAILGROUP                      LABEL                           PATH
---------------------------------------- ------------------------------ ------------------------------- ------------------------------------------------------------
CELLI02                                  CELLI02                        CELLI02                         ORCL:CELLI02
CELLI03                                  CELLI03                        CELLI03                         ORCL:CELLI03
CELLI04                                  CELLI04                        CELLI04                         ORCL:CELLI04
CELLI05                                  CELLI05                        CELLI05                         ORCL:CELLI05
CELLI06                                  CELLI06                        CELLI06                         ORCL:CELLI06
CELLI07                                  CELLI07                        CELLI07                         ORCL:CELLI07
RECI01                                   RECI01                         RECI01                          ORCL:RECI01
SYSTEMIDG01                              SYSTEMIDG01                    SYSI01                          ORCL:SYSI01
8 rows selected.
SQL>
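About the automatic drop mentioned at the beginning of this section: how long ASM waits before dropping an offline disk is controlled by diskgroup attributes. The queries below are a sketch that I am adding (they were not executed in the session above), and they assume a connection to the ASM instance and a diskgroup compatibility level that exposes these attributes:

```sql
-- Repair-time attributes configured for each diskgroup.
select g.name as diskgroup, a.name as attribute, a.value
  from v$asm_attribute a
  join v$asm_diskgroup g on g.group_number = a.group_number
 where a.name in ('disk_repair_time', 'failgroup_repair_time');

-- Offline disks and the seconds left (REPAIR_TIMER) before ASM drops them.
select group_number, name, failgroup, mode_status, repair_timer
  from v$asm_disk
 where mode_status = 'OFFLINE';
```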
 
The steps for MOUNT RESTRICTED FORCE FOR RECOVERY
To summarize, the steps needed are (in order):
- Bring the failed disk/failgroup back (so that ASM can access it again)
- Execute alter diskgroup <DG> mount restricted force for recovery
- Bring the failgroup online with alter diskgroup <DG> online disks in failgroup <FG>
- Cleanly dismount the DG with alter diskgroup <DG> dismount
- Cleanly mount the DG with alter diskgroup <DG> mount
- Check the databases for failures and recover them
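The ASM part of the steps above can be sketched as one sequence. This is only an illustration under some assumptions: it runs as SYSASM on the ASM instance, DATA is the diskgroup of this post, and CELLI03 stands for whichever failgroup came back (replace both with your own names). Remember that the FOR RECOVERY clause is undocumented, so treat it with care.

```sql
-- Step 1 happens outside ASM: the failed disk/failgroup must be visible again.

-- Step 2: mount the diskgroup in restricted mode, forcing it for recovery.
alter diskgroup DATA mount restricted force for recovery;

-- Step 3: bring the returned failgroup online (CELLI03 is an example name).
alter diskgroup DATA online disks in failgroup CELLI03;

-- Optional check: wait until this returns no rows before moving on.
select * from gv$asm_operation;

-- Step 4: clean dismount.
alter diskgroup DATA dismount;

-- Step 5: clean (normal) mount.
alter diskgroup DATA mount;
```

After that, each database is checked and recovered with RMAN, as shown for DBB19 and DBC19.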
Undocumented feature
 
So, the question is: why is it undocumented? I don’t have the answer, but I can think of some reasons. For me, the most important one is that it is not a full, clean return. You still need to restore and recover from backup, and you may lose a lot of data.
Of course, this example is a controlled scenario: I have just a few databases and my failgroup has just one disk. In real life, the problem will be worse. More diskgroups can be affected, such as RECO/REDO/FRA. You will probably lose some redo logs and archived logs too, so you can’t do a clean recovery. You may even need to recover the OCR and voting disk of the cluster.
This is where correct architecture design matters: if you need more protection on the ASM side, you can use HIGH redundancy to survive the loss of two failgroups without interruption. This is why SYSTEMDG (or OCR/voting disk) is placed in a high redundancy diskgroup on Exadata.
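To see how your own diskgroups are protected, something like the sketch below can be used. It is my addition (not from the original session) and assumes a connection to the ASM instance:

```sql
-- Redundancy type (EXTERN, NORMAL, HIGH, ...) and state of each diskgroup,
-- and whether it holds voting files.
select name, type, state, offline_disks, voting_files
  from v$asm_diskgroup;

-- Number of failgroups per diskgroup.
select group_number, count(distinct failgroup) as failgroups
  from v$asm_disk
 where group_number > 0
 group by group_number;
```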
Outages and failures can occur in different layers of your environment. But storage/disk failures are catastrophic for databases because they can lead to data corruption, and you need to use backups to recover from it. They can occur in any environment, from traditional storage to Exadata. I had one on an old Exadata V2 in 2016, used just for DEV databases, where two storage cells crashed (one hour apart) and I needed to use this procedure to save some files and reduce the downtime by avoiding a restore of everything (more than 10TB).
So, it is good to know this kind of procedure because it can save time. But it is your decision whether to use it or not; check whether it is worth it.
Some references that you can check:
Disclaimer: “The postings on this site are my own and don’t necessarily represent my actual employer positions, strategies or opinions. The information here was edited to be useful for general purpose, specific data and identifications were removed to allow reach the generic audience and to be useful for the community.”