Fernando Simon (Board Member) – Page 2 – LUXOUG

Fernando Simon (Board Member)

ASM, REPLACE DISK Command

Category: Database Author: Fernando Simon (Board Member) Date: 5 years ago Comments: 0

ASM, REPLACE DISK Command

The REPLACE DISK command was released with 12.1 and allow to do an online replacement for a failed disk. This command is important because it reduces the rebalance time doing just the SYNC phase. Comparing with normal disk replacement (DROP and ADD in the same command), the REPLACE just do mirror resync.

Basically, when the REPLACE command is called, the rebalance just copy/sync the data from the survivor disk (the partner disk from the mirror). It is faster since the previous way with drop/add execute a complete rebalance from all AU of the diskgroup, doing REBALANCE and SYNC phase.

The replace disk command is important for the SWAP disk process for Exadata (where you add the new 14TB disks) since it is faster to do the rebalance of the diskgroup.

Below one example from this behavior. Look that AU from DISK01 was SYNCED with the new disk:

And compare with the previous DROP/ADD disk, where all AU from all disks was rebalanced:

Actual Environment And Simulate The failure

In this post, to simulate and show how the replace disk works I have the DATA diskgroup with 6 disks (DISK01-06). The DISK07 it is not in use.

SQL> select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;




NAME                           FAILGROUP                      LABEL      PATH

------------------------------ ------------------------------ ---------- -----------

DISK01                         FAIL01                         DISK01     ORCL:DISK01

DISK02                         FAIL01                         DISK02     ORCL:DISK02

DISK03                         FAIL02                         DISK03     ORCL:DISK03

DISK04                         FAIL02                         DISK04     ORCL:DISK04

DISK05                         FAIL03                         DISK05     ORCL:DISK05

DISK06                         FAIL03                         DISK06     ORCL:DISK06

RECI01                         RECI01                         RECI01     ORCL:RECI01

SYSTEMIDG01                    SYSTEMIDG01                    SYSI01     ORCL:SYSI01

                                                              DISK07     ORCL:DISK07




9 rows selected.




SQL>

And to simulate the error I disconnected the disk from Operational system (since I used iSCSI, I just log off the target for DISK02:

[root@asmrec ~]# iscsiadm -m node -T iqn.2006-01.com.openfiler:tsn.eff4683320e8 -p 172.16.0.3:3260 -u

Logging out of session [sid: 41, target: iqn.2006-01.com.openfiler:tsn.eff4683320e8, portal: 172.16.0.3,3260]

Logout of [sid: 41, target: iqn.2006-01.com.openfiler:tsn.eff4683320e8, portal: 172.16.0.3,3260] successful.

[root@asmrec ~]#

At the same moment, the alertlog from ASM detected the error and informed that the mirror was found in another disk (DISK06):

2020-03-29T00:42:11.160695+01:00

WARNING: Read Failed. group:3 disk:1 AU:29 offset:0 size:4096

path:ORCL:DISK02

         incarnation:0xf0f0c113 synchronous result:'I/O error'

         subsys:/opt/oracle/extapi/64/asm/orcl/1/libasm.so krq:0x7f3df8db35b8 bufp:0x7f3df8c9c000 osderr1:0x3 osderr2:0x2e

         IO elapsed time: 0 usec Time waited on I/O: 0 usec

WARNING: cache failed reading from group=3(DATA) fn=8 blk=0 count=1 from disk=1 (DISK02) mirror=0 kfkist=0x20 status=0x02 osderr=0x3 file=kfc.c line=13317

WARNING: cache succeeded reading from group=3(DATA) fn=8 blk=0 count=1 from disk=5 (DISK06) mirror=1 kfkist=0x20 status=0x01 osderr=0x0 file=kfc.c line=13366

So, at this moment the DQISK02 will not be removed instantly, but after the disk_repair_time finish:

WARNING: Started Drop Disk Timeout for Disk 1 (DISK02) in group 3 with a value 43200

WARNING: Disk 1 (DISK02) in group 3 will be dropped in: (43200) secs on ASM inst 1

cluster guid (e4db41a22bd95fc6bf79d2e2c93360c7) generated for PST Hbeat for instance 1

If you want to check the full output from ASM alertlog you can access here at ASM-ALERTLOG-Output-Online-Disk-Error.txt

So, the actual diskgroup is

SQL> select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;




NAME                           FAILGROUP                      LABEL      PATH

------------------------------ ------------------------------ ---------- -----------

DISK01                         FAIL01                         DISK01     ORCL:DISK01

DISK02                         FAIL01

DISK03                         FAIL02                         DISK03     ORCL:DISK03

DISK04                         FAIL02                         DISK04     ORCL:DISK04

DISK05                         FAIL03                         DISK05     ORCL:DISK05

DISK06                         FAIL03                         DISK06     ORCL:DISK06

RECI01                         RECI01                         RECI01     ORCL:RECI01

SYSTEMIDG01                    SYSTEMIDG01                    SYSI01     ORCL:SYSI01

                                                              DISK07     ORCL:DISK07




9 rows selected.




SQL>

REPLACE DISK

Since the old disk was lost (by HW or something similar), it is impossible to put it again online. A new disk was attached to the server (DISK07 in this example) and this is added in the diskgroup.

So, we just need to execute the REPLACE DISK command:

SQL> alter diskgroup DATA

  2  REPLACE DISK DISK02 with 'ORCL:DISK07'

  3  power 2;




Diskgroup altered.




SQL>

The command is easy, we replace disk failed disk with the new disk path. And it is possible to replace more than one at the same time and specify the power of the rebalance too.

At ASM alertlog we can see a lot of messages about this replacement, but look that resync of the disk. The full output can be found here at ASM-ALERTLOG-Output-Replace-Disk.txt

Some points here:

2020-03-29T00:44:31.602826+01:00

SQL> alter diskgroup DATA

replace disk DISK02 with 'ORCL:DISK07'

power 2

2020-03-29T00:44:31.741335+01:00

NOTE: cache closing disk 1 of grp 3: (not open) DISK02

2020-03-29T00:44:31.742068+01:00

NOTE: GroupBlock outside rolling migration privileged region

2020-03-29T00:44:31.742968+01:00

NOTE: client +ASM1:+ASM:asmrec no longer has group 3 (DATA) mounted

2020-03-29T00:44:31.746444+01:00

NOTE: Found ORCL:DISK07 for disk DISK02

NOTE: initiating resync of disk group 3 disks

DISK02 (1)




NOTE: process _user20831_+asm1 (20831) initiating offline of disk 1.4042309907 (DISK02) with mask 0x7e in group 3 (DATA) without client assisting

2020-03-29T00:44:31.747191+01:00

NOTE: sending set offline flag message (2044364809) to 1 disk(s) in group 3

…

…

2020-03-29T00:44:34.558097+01:00

NOTE: PST update grp = 3 completed successfully

2020-03-29T00:44:34.559806+01:00

SUCCESS: alter diskgroup DATA

replace disk DISK02 with 'ORCL:DISK07'

power 2

2020-03-29T00:44:36.805979+01:00

NOTE: Attempting voting file refresh on diskgroup DATA

NOTE: Refresh completed on diskgroup DATA. No voting file found.

2020-03-29T00:44:36.820900+01:00

NOTE: starting rebalance of group 3/0xf99030d7 (DATA) at power 2

After that, we can see the rebalance just take the SYNC phase:

SQL> select * from gv$asm_operation;




   INST_ID GROUP_NUMBER OPERA PASS      STAT      POWER     ACTUAL      SOFAR   EST_WORK   EST_RATE EST_MINUTES ERROR_CODE     CON_ID

---------- ------------ ----- --------- ---- ---------- ---------- ---------- ---------- ---------- ----------- ---------- ----------

         1            3 REBAL COMPACT   WAIT          2          2          0          0          0           0                     0

         1            3 REBAL REBALANCE WAIT          2          2          0          0          0           0                     0

         1            3 REBAL REBUILD   WAIT          2          2          0          0          0           0                     0

         1            3 REBAL RESYNC    RUN           2          2        231       1350        513           2                     0




SQL>

SQL> /




   INST_ID GROUP_NUMBER OPERA PASS      STAT      POWER     ACTUAL      SOFAR   EST_WORK   EST_RATE EST_MINUTES ERROR_CODE     CON_ID

---------- ------------ ----- --------- ---- ---------- ---------- ---------- ---------- ---------- ----------- ---------- ----------

         1            3 REBAL COMPACT   WAIT          2          2          0          0          0           0                     0

         1            3 REBAL REBALANCE WAIT          2          2          0          0          0           0                     0

         1            3 REBAL REBUILD   WAIT          2          2          0          0          0           0                     0

         1            3 REBAL RESYNC    RUN           2          2        373       1350        822           1                     0




SQL>

SQL> /




   INST_ID GROUP_NUMBER OPERA PASS      STAT      POWER     ACTUAL      SOFAR   EST_WORK   EST_RATE EST_MINUTES ERROR_CODE     CON_ID

---------- ------------ ----- --------- ---- ---------- ---------- ---------- ---------- ---------- ----------- ---------- ----------

         1            3 REBAL COMPACT   REAP          2          2          0          0          0           0                     0

         1            3 REBAL REBALANCE DONE          2          2          0          0          0           0                     0

         1            3 REBAL REBUILD   DONE          2          2          0          0          0           0                     0

         1            3 REBAL RESYNC    DONE          2          2       1376       1350          0           0                     0




SQL> /




no rows selected




SQL>

In the end, after the rebalance we have:

SQL> select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;




NAME                           FAILGROUP                      LABEL      PATH

------------------------------ ------------------------------ ---------- -----------

DISK01                         FAIL01                         DISK01     ORCL:DISK01

DISK02                         FAIL01                         DISK07     ORCL:DISK07

DISK03                         FAIL02                         DISK03     ORCL:DISK03

DISK04                         FAIL02                         DISK04     ORCL:DISK04

DISK05                         FAIL03                         DISK05     ORCL:DISK05

DISK06                         FAIL03                         DISK06     ORCL:DISK06

RECI01                         RECI01                         RECI01     ORCL:RECI01

SYSTEMIDG01                    SYSTEMIDG01                    SYSI01     ORCL:SYSI01




8 rows selected.




SQL>

An important detail is that the NAME for the disk will not change, it is impossible to change using REPLACE DISK command. As you can see above, the disk named DISK02 has the label DISK07 (here this came from asmlib disk).

Know Issues

There is a known issue for REPLACE DISK for 18c and higher for GI where the rebalance can take AGES to finish. This occurs because (when replacing more than one disk per time), it executes the SYNC disk by disk. One example, for one Exadata the replace for a complete cell took more than 48 hours, while a DROP/ADD took just 12 hours for the same disks.

So, it is recommended to have the fix for Bug 30582481 and Bug 31062010 applied. The detail it is that patch 30582481 (Patch 30582481: ASM REPLACE DISK COMMAND EXECUTED ON ALL CELLDISKS OF A FAILGROUP, ASM RUNNING RESYNC ONE DISK AT A TIME) was withdraw and replaced by bug/patch 31062010 that it is not available (at the moment that I write this port – March 2020).

So, be careful to do this in one engineering system or when you need to replace a lot of disks at the same time.

Some reference for reading:

Test Case:: 12C ASM New feature (Doc ID 1571975.1)

Disclaimer: “The postings on this site are my own and don’t necessarily represent my actual employer positions, strategies or opinions. The information here was edited to be useful for general purpose, specific data and identifications were removed to allow reach the generic audience and to be useful for the community.”

ASM, Mount restricted force for recovery

Category: Database Author: Fernando Simon (Board Member) Date: 5 years ago Comments: 0

ASM, Mount restricted force for recovery

Survive to disk failures it is crucial to avoid data corruption, but sometimes, even with redundancy at ASM, multiple failures can happen. Check in this post how to use the undocumented feature “mount restricted force for recovery” to resurrect diskgroup and lose less data when multiple failures occur.

Diskgroup redundancy is a key factor for ASM resilience, where you can survive to disk failures and still continue to run databases. I will not extend about ASM disk redundancy here, but usually, you can configure your diskgroup without redundancy (EXTERNAL), double redundancy (NORMAL), triple redundancy (HIGH), and even fourth redundancy (EXTEND for stretch clusters).

If you want to understand more about redundancy you have a lot of articles at MOS and on the internet that provide useful information. One good is this. The idea is simple, spread multiple copies in different disks. And can even be better if you group disks in the same failgroups, so, your data will have multiple copies in separate places.

As an example, this a key for Exadata, where every storage cell is one independent failgroup and you can survive to one entire cell failure (or double full, depending on the redundancy of your diskgroup) without data loss. The same idea can be applied at a “normal” environment, where you can create failgroup to disks attached to controller A, and another attached to controller B (so the failure of one storage controller does not affect all failgroups). At ASM, if you do not create failgroup, each disk is a different one in diskgroups that have redundancy enabled.

This represents for Exadata, but it is safe for representation. Basically your data will be in at least two different failgroups:

Environment

In the example that I use here, I have one diskgroup called DATA, which has 7 (seven) disks and each one is on failgroup. The redundancy for this diskgroup is NORMAL, this means that the block is copied in two failgroups. If two failures occur, probably, I will have data loss/corruption. Look:

SQL> select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;




NAME                           FAILGROUP                      LABEL                           PATH

------------------------------ ------------------------------ ------------------------------- ------------------------------------------------------------

CELLI01                        CELLI01                        CELLI01                         ORCL:CELLI01

CELLI02                        CELLI02                        CELLI02                         ORCL:CELLI02

CELLI03                        CELLI03                        CELLI03                         ORCL:CELLI03

CELLI04                        CELLI04                        CELLI04                         ORCL:CELLI04

CELLI05                        CELLI05                        CELLI05                         ORCL:CELLI05

CELLI06                        CELLI06                        CELLI06                         ORCL:CELLI06

CELLI07                        CELLI07                        CELLI07                         ORCL:CELLI07

RECI01                         RECI01                         RECI01                          ORCL:RECI01

SYSTEMIDG01                    SYSTEMIDG01                    SYSI01                          ORCL:SYSI01




9 rows selected.




SQL>

The version for my GI is 19.6.0.0, but this can be used from 12.1.0.2 and newer versions (works for 11.2.0.4 in some versions). And In this server, I have three databases running, DBA19, DBB19, and DBC19.

So, with everything running correctly, the data from my databases will be spread two failgroups (this is just a representation and not correct representation where the blocks from my database are):

Remember that a NORMAL redundancy just needs two copies. So, some blocks from datafile 1 from DBA19, as an example, can be stored at CELLI01 and CELLI04. And if your database is small (and your failgroups are big), and you are lucky too, the entire database can be stored in just these two places. In case of failure that just involves CELLI02 and CELLI03 failgroups, your data (from DBA19c) can be intact.

Understanding the failure

Unfortunately, failures (will) happen and can be multiple at the same time. In the diskgroup DATA above, after the second failure, your diskgroup will be dismounted instantly. Usually when this occurs, if you can’t recover the hardware error, you need to restore and recover a backup of your databases after recreating the diskgroup.

If you have lucky and the failures occur at the same time, you can (most of the time) return the failed disks and try to mount the diskgroup because there is no difference between the failed disks/failgroups. But the problem occurs if you have one failure (like CELLI03 diskgroup disappears) and after some time another failgroup fails (like CELLI07). The detail is that between the failures, the databases continued to run and change data in the disk. And when this occurs, and when your failgroup returns, there are differences.

Another point that is very important to understand is the time to recover the failure. If you have one disk/failgroup at ASM, the attributes disk_repair_time and failgroup_repair_time define the time that you have to repair your failure before the rebalance of data takes place. The first (disk_repair_time) is the time that you have to repair the disk in case of failure if your failgroup have more than one disk, just the failed is rebalanced. The second (failgroup_repair_time) is the time that you have to repair the failed failgroup (when it fails completely).

The interesting here is that between the moment of failure until the end of this clock you are susceptible to another failure. If it occurs (more failures that your mirror protection) you will lose the diskgroup. And another fact here it is that between the failures, your databases continue to run, so, if your return the first failed disk/failgroup, you need to sync it.

These “repair times” serve to provide to you time to fix/recover the failure and avoid the rebalance. Think about the architecture, usually the diskgroups with redundancy are big and protect big environments think in one Exadata, as an example, where each disk can have 14TB – and one cell can have until 12 of them), and do rebalance of this amount of data takes a lot of time. To avoid this, if your failed disk is replaced before this time, just sync with the block changed is needed.

A “default configuration” have these values:

SQL> select dg.name,a.value,a.name

  2  from v$asm_diskgroup dg, v$asm_attribute a

  3  where dg.group_number=a.group_number

  4  and a.name like '%time'

  5  /




NAME                                     VALUE           NAME

---------------------------------------- --------------- ----------------------------------------

DATA                                     12.0h           disk_repair_time

DATA                                     24.0h           failgroup_repair_time

RECO                                     24.0h           failgroup_repair_time

RECO                                     12.0h           disk_repair_time

SYSTEMDG                                 24.0h           failgroup_repair_time

SYSTEMDG                                 12.0h           disk_repair_time




6 rows selected.




SQL>

But think in one scenario where more than one failure occurs, the first in CELLI01 at 08:00 am and the second in CELL0I6 at 10:00 am, now, from two hours, you have the new version of blocks. If you fix the failure (for CELL01) you don’t guarantee that you have everything in the last version and the normal mount will not work.

And it is here that mount restricted force for recovery enters. It allows you to resurrect the diskgroup and help you to restore fewer things. Think in the example before, if the failures occur at CELLI01 and CELL06, but your datafiles are in CELLI02 and CELLI07, you lose nothing. Or restore just some tablespaces and not all database. So, it is more gain than lose.

Mount restricted force for recovery

Here, I will simulate multiple failures for the disks (more than one) and show how you can use mount restricted force for recovery. Please be careful and follow all the steps correctly to avoid mistakes and to understand how to do and what is happening.

So, here I have DATA diskgroup, with normal redundancy and 7 (seven) failgroups. DBA19, DBB19, and DBC19 databases running.

So, at the first step, I will simulate a complete failure of CELLI03 failgroup. In my environment, to allow more control, I have one iSCSI target for each failgroup (this allows me to disconnect one by one if needed). The CELLI03 died:

[root@asmrec ~]# iscsiadm -m session

tcp: [11] 172.16.0.3:3260,1 iqn.2006-01.com.openfiler:tsn.d65b214fca9a (non-flash) CELLI04

tcp: [14] 172.16.0.3:3260,1 iqn.2006-01.com.openfiler:tsn.637b3bbfa86d (non-flash) CELLI07

tcp: [17] 172.16.0.3:3260,1 iqn.2006-01.com.openfiler:tsn.2f4cdb93107c (non-flash) CELLI05

tcp: [2] 172.16.0.3:3260,1 iqn.2006-01.com.openfiler:tsn.bb66b92348a7 (non-flash)  CELLI03

tcp: [20] 172.16.0.3:3260,1 iqn.2006-01.com.openfiler:tsn.57c0a000e316 (non-flash) (SYS)

tcp: [23] 172.16.0.3:3260,1 iqn.2006-01.com.openfiler:tsn.89ef4420ea4d (non-flash) CELLI06

tcp: [5] 172.16.0.3:3260,1 iqn.2006-01.com.openfiler:tsn.eff4683320e8 (non-flash)  CELLI01

tcp: [8] 172.16.0.3:3260,1 iqn.2006-01.com.openfiler:tsn.7d8f4c8f5012 (non-flash)  CELLI02

[root@asmrec ~]#

[root@asmrec ~]# iscsiadm -m node -T iqn.2006-01.com.openfiler:tsn.bb66b92348a7 -p 172.16.0.3:3260 -u

Logging out of session [sid: 2, target: iqn.2006-01.com.openfiler:tsn.bb66b92348a7, portal: 172.16.0.3,3260]

Logout of [sid: 2, target: iqn.2006-01.com.openfiler:tsn.bb66b92348a7, portal: 172.16.0.3,3260] successful.

[root@asmrec ~]#

And at ASM alertlog we can see:

2020-03-22T17:14:11.589115+01:00

NOTE: process _user8100_+asm1 (8100) initiating offline of disk 9.4042310133 (CELLI03) with mask 0x7e in group 1 (DATA) with client assisting

NOTE: checking PST: grp = 1

2020-03-22T17:14:11.589394+01:00

GMON checking disk modes for group 1 at 127 for pid 40, osid 8100

2020-03-22T17:14:11.589584+01:00

NOTE: checking PST for grp 1 done.

NOTE: initiating PST update: grp 1 (DATA), dsk = 9/0xf0f0c1f5, mask = 0x6a, op = clear mandatory

2020-03-22T17:14:11.589746+01:00

GMON updating disk modes for group 1 at 128 for pid 40, osid 8100

cluster guid (e4db41a22bd95fc6bf79d2e2c93360c7) generated for PST Hbeat for instance 1

WARNING: Write Failed. group:1 disk:9 AU:1 offset:4190208 size:4096

path:ORCL:CELLI03

         incarnation:0xf0f0c1f5 synchronous result:'I/O error'

         subsys:/opt/oracle/extapi/64/asm/orcl/1/libasm.so krq:0x7f9182f72210 bufp:0x7f9182f78000 osderr1:0x3 osderr2:0x2e

         IO elapsed time: 0 usec Time waited on I/O: 0 usec

WARNING: found another non-responsive disk 9.4042310133 (CELLI03) that will be offlined

So, the failure occurred at 17:14. The full output can be found here at ASM-ALERTLOG-Output-Failure-CELLI03.txt

And we can see that disappeared (but not deleted or dropped) from ASM:

SQL> select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;




NAME                                     FAILGROUP                      LABEL                           PATH

---------------------------------------- ------------------------------ ------------------------------- ------------------------------------------------------------

CELLI01                                  CELLI01                        CELLI01                         ORCL:CELLI01

CELLI02                                  CELLI02                        CELLI02                         ORCL:CELLI02

CELLI03                                  CELLI03

CELLI04                                  CELLI04                        CELLI04                         ORCL:CELLI04

CELLI05                                  CELLI05                        CELLI05                         ORCL:CELLI05

CELLI06                                  CELLI06                        CELLI06                         ORCL:CELLI06

CELLI07                                  CELLI07                        CELLI07                         ORCL:CELLI07

RECI01                                   RECI01                         RECI01                          ORCL:RECI01

SYSTEMIDG01                              SYSTEMIDG01                    SYSI01                          ORCL:SYSI01




9 rows selected.




SQL>

At this point, ASM is starting to count the clock of 12hours (as defined in my repair attributes). The failgroup was not dropped and rebalance was not going on because ASM is optimistic that you will fix the issue in this period.

But after some time I had a second failure in the diskgroup:

Now at ASM alertlog you can see that diskgroup was dismounted (and several other messages). Bellow a cropped from the alertlog. The full output (and I think that deserve a look) it is here at ASM-ALERTLOG-Output-Failure-CELLI03-and-CELL01.txt

2020-03-22T17:18:39.699555+01:00

WARNING: Write Failed. group:1 disk:1 AU:1 offset:4190208 size:4096

path:ORCL:CELLI01

         incarnation:0xf0f0c1f3 asynchronous result:'I/O error'

         subsys:/opt/oracle/extapi/64/asm/orcl/1/libasm.so krq:0x7f9182f833d0 bufp:0x7f91836ef000 osderr1:0x3 osderr2:0x2e

         IO elapsed time: 0 usec Time waited on I/O: 0 usec

WARNING: Hbeat write to PST disk 1.4042310131 in group 1 failed. [2]

2020-03-22T17:18:39.704035+01:00

...

...

2020-03-22T17:18:39.746945+01:00

NOTE: cache closing disk 9 of grp 1: (not open) CELLI03

ERROR: disk 1 (CELLI01) in group 1 (DATA) cannot be offlined because all disks [1(CELLI01), 9(CELLI03)] with mirrored data would be offline.

2020-03-22T17:18:39.747462+01:00

ERROR: too many offline disks in PST (grp 1)

2020-03-22T17:18:39.759171+01:00

NOTE: cache dismounting (not clean) group 1/0xB48031B9 (DATA)

NOTE: messaging CKPT to quiesce pins Unix process pid: 12050, image: [email protected] (B001)

2020-03-22T17:18:39.761807+01:00

NOTE: halting all I/Os to diskgroup 1 (DATA)

2020-03-22T17:18:39.766289+01:00

NOTE: LGWR doing non-clean dismount of group 1 (DATA) thread 1

NOTE: LGWR sync ABA=23.3751 last written ABA 23.3751

...

...

2020-03-22T17:18:40.207406+01:00

SQL> alter diskgroup DATA dismount force /* ASM SERVER:3028300217 */

...

...

2020-03-22T17:18:40.841979+01:00

Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_8756.trc:

ORA-15130: diskgroup "DATA" is being dismounted

2020-03-22T17:18:40.853738+01:00

...

...

ERROR: disk 1 (CELLI01) in group 1 (DATA) cannot be offlined because all disks [1(CELLI01), 9(CELLI03)] with mirrored data would be offline.

2020-03-22T17:18:40.861939+01:00

ERROR: too many offline disks in PST (grp 1)

...

...

2020-03-22T17:18:43.214368+01:00

Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_8756.trc:

ORA-15130: diskgroup "DATA" is being dismounted

2020-03-22T17:18:43.214885+01:00

NOTE: client DBC19:DBC19:asmrec no longer has group 1 (DATA) mounted

2020-03-22T17:18:43.215492+01:00

NOTE: client DBB19:DBB19:asmrec no longer has group 1 (DATA) mounted

NOTE: cache deleting context for group DATA 1/0xb48031b9

...

...

2020-03-22T17:18:43.298551+01:00

SUCCESS: alter diskgroup DATA dismount force /* ASM SERVER:3028300217 */

SUCCESS: ASM-initiated MANDATORY DISMOUNT of group DATA

2020-03-22T17:18:43.352003+01:00

SQL> ALTER DISKGROUP DATA MOUNT  /* asm agent *//* {0:1:9} */

2020-03-22T17:18:43.372816+01:00

NOTE: cache registered group DATA 1/0xB44031BF

NOTE: cache began mount (first) of group DATA 1/0xB44031BF

NOTE: Assigning number (1,8) to disk (ORCL:CELLI02)

NOTE: Assigning number (1,0) to disk (ORCL:CELLI04)

NOTE: Assigning number (1,11) to disk (ORCL:CELLI05)

NOTE: Assigning number (1,3) to disk (ORCL:CELLI06)

NOTE: Assigning number (1,2) to disk (ORCL:CELLI07)

2020-03-22T17:18:43.514642+01:00

cluster guid (e4db41a22bd95fc6bf79d2e2c93360c7) generated for PST Hbeat for instance 1

2020-03-22T17:18:46.089517+01:00

NOTE: detected and added orphaned client id 0x10010

NOTE: detected and added orphaned client id 0x1000e

So, the second failure occurred at 17:18 and lead to diskgroup force dismount. And you can see messages like “NOTE: cache dismounting (not clean)”, “ERROR: too many offline disks in PST (grp 1)”, and even “ERROR: disk 1 (CELLI01) in group 1 (DATA) cannot be offlined because all disks [1(CELLI01), 9(CELLI03)] with mirrored data would be offline”.

So, probably some data was lost. And even if you consider that between these 4 minutes data was changed in the databases, the mess is Big. If you want to see the alertlog from databases, check here at ASM-ALERTLOG-Output-From-Databases-Alertlog-at-Failure.txt

And now we have this at ASM:

SQL> select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;




NAME                                     FAILGROUP                      LABEL                           PATH

---------------------------------------- ------------------------------ ------------------------------- ------------------------------------------------------------

RECI01                                   RECI01                         RECI01                          ORCL:RECI01

SYSTEMIDG01                              SYSTEMIDG01                    SYSI01                          ORCL:SYSI01

                                                                        CELLI02                         ORCL:CELLI02

                                                                        CELLI04                         ORCL:CELLI04

                                                                        CELLI05                         ORCL:CELLI05

                                                                        CELLI06                         ORCL:CELLI06

                                                                        CELLI07                         ORCL:CELLI07




7 rows selected.




SQL>

And if we try to mount we receive an error due to disk offline:

SQL> alter diskgroup data mount;

alter diskgroup data mount

*

ERROR at line 1:

ORA-15032: not all alterations performed

ORA-15040: diskgroup is incomplete

ORA-15042: ASM disk "9" is missing from group number "1"

ORA-15042: ASM disk "1" is missing from group number "1"


SQL>

Now is the key decision. If you have important data that worth the effort to try to recover you can continue. It is your decision and based on several details. Since the diskgroup is dismounted, the repair time is not counting, and you have days until recovery. Sometimes one day stopped is better than several days to recover all databases from the last backup.

Imagine that you can bring online the first failed failgroup (CELL03) that have 4 minutes of the difference of data:

[root@asmrec ~]# iscsiadm -m node -T iqn.2006-01.com.openfiler:tsn.bb66b92348a7 -p 172.16.0.3:3260 -l

Logging in to [iface: default, target: iqn.2006-01.com.openfiler:tsn.bb66b92348a7, portal: 172.16.0.3,3260] (multiple)

Login to [iface: default, target: iqn.2006-01.com.openfiler:tsn.bb66b92348a7, portal: 172.16.0.3,3260] successful.

[root@asmrec ~]#

And if you try to mount it normally you will receive an error (output from alertlog at this try can be seen here at ASM-ALERTLOG-Output-Mout-With-One-Disk-Online):

SQL> alter diskgroup data mount;

alter diskgroup data mount

*

ERROR at line 1:

ORA-15032: not all alterations performed

ORA-15017: diskgroup "DATA" cannot be mounted

ORA-15066: offlining disk "1" in group "DATA" may result in a data loss

SQL>

So, now we can try the mount restricted force for recovery:

SQL> alter diskgroup data mount restricted force for recovery;




Diskgroup altered.




SQL>

The alertlog from ASM (that you can full here at ASM-ALERTLOG-Output-Mout-Restricted-Force-For-Recovery.txt) report messages related with cache from diskgropup and disk that need to be checked. And now we are like this:

SQL> select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;




NAME                                     FAILGROUP                      LABEL                           PATH

---------------------------------------- ------------------------------ ------------------------------- ------------------------------------------------------------

CELLI01                                  CELLI01

CELLI02                                  CELLI02                        CELLI02                         ORCL:CELLI02

CELLI03                                  CELLI03

CELLI04                                  CELLI04                        CELLI04                         ORCL:CELLI04

CELLI05                                  CELLI05                        CELLI05                         ORCL:CELLI05

CELLI06                                  CELLI06                        CELLI06                         ORCL:CELLI06

CELLI07                                  CELLI07                        CELLI07                         ORCL:CELLI07

RECI01                                   RECI01                         RECI01                          ORCL:RECI01

SYSTEMIDG01                              SYSTEMIDG01                    SYSI01                          ORCL:SYSI01

                                                                        CELLI03                         ORCL:CELLI03




10 rows selected.




SQL>

The next step is to bring online the failgroup that came back:

SQL> alter diskgroup data online disks in failgroup CELLI03;




Diskgroup altered.




SQL>

Doing this ASM will resync this failgroup (using this block as the last version) and bring the cache of this disk online. At ASM alertlog you can see messages like (full output here at ASM-ALERTLOG-Output-Online-Restored-Failgroup):

2020-03-22T17:27:47.729003+01:00

SQL> alter diskgroup data online disks in failgroup CELLI03

2020-03-22T17:27:47.729551+01:00

NOTE: cache closing disk 1 of grp 1: (not open) CELLI01

2020-03-22T17:27:47.729640+01:00

NOTE: cache closing disk 9 of grp 1: (not open) CELLI03

2020-03-22T17:27:47.730398+01:00

NOTE: GroupBlock outside rolling migration privileged region

NOTE: initiating resync of disk group 1 disks

CELLI03 (9)




NOTE: process _user6891_+asm1 (6891) initiating offline of disk 9.4042310248 (CELLI03) with mask 0x7e in group 1 (DATA) without client assisting

2020-03-22T17:27:47.737580+01:00

...

...

2020-03-22T17:27:47.796524+01:00

NOTE: disk validation pending for 1 disk in group 1/0x1d7031d4 (DATA)

NOTE: Found ORCL:CELLI03 for disk CELLI03

NOTE: completed disk validation for 1/0x1d7031d4 (DATA)

2020-03-22T17:27:47.935467+01:00

...

...

2020-03-22T17:27:48.116572+01:00

NOTE: cache closing disk 1 of grp 1: (not open) CELLI01

NOTE: cache opening disk 9 of grp 1: CELLI03 label:CELLI03

2020-03-22T17:27:48.117158+01:00

SUCCESS: refreshed membership for 1/0x1d7031d4 (DATA)

2020-03-22T17:27:48.123545+01:00

NOTE: initiating PST update: grp 1 (DATA), dsk = 9/0x0, mask = 0x5d, op = assign mandatory

...

...

2020-03-22T17:27:48.142068+01:00

NOTE: PST update grp = 1 completed successfully

2020-03-22T17:27:48.143197+01:00

SUCCESS: alter diskgroup data online disks in failgroup CELLI03

2020-03-22T17:27:48.577277+01:00

NOTE: Attempting voting file refresh on diskgroup DATA

NOTE: Refresh completed on diskgroup DATA. No voting file found.

...

...

2020-03-22T17:27:48.643277+01:00

NOTE: Starting resync using Staleness Registry and ATE scan for group 1

2020-03-22T17:27:48.696075+01:00

NOTE: Starting resync using Staleness Registry and ATE scan for group 1

NOTE: header on disk 9 advanced to format #2 using fcn 0.0

2020-03-22T17:27:49.725837+01:00

WARNING: Started Drop Disk Timeout for Disk 1 (CELLI01) in group 1 with a value 43200

2020-03-22T17:27:57.301042+01:00

...

2020-03-22T17:27:59.687480+01:00

NOTE: cache closing disk 1 of grp 1: (not open) CELLI01

NOTE: reset timers for disk: 9

NOTE: completed online of disk group 1 disks

CELLI03 (9)




2020-03-22T17:27:59.714674+01:00

ERROR: ORA-15421 thrown in ARBA for group number 1

2020-03-22T17:27:59.714805+01:00

Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_arba_8786.trc:

ORA-15421: Rebalance is not supported when the disk group is mounted for recovery.

2020-03-22T17:27:59.715047+01:00

NOTE: stopping process ARB0

NOTE: stopping process ARBA

2020-03-22T17:28:00.652115+01:00

NOTE: rebalance interrupted for group 1/0x1d7031d4 (DATA)

And not we have at ASM:

SQL> select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;




NAME                                     FAILGROUP                      LABEL                           PATH

---------------------------------------- ------------------------------ ------------------------------- ------------------------------------------------------------

CELLI01                                  CELLI01

CELLI02                                  CELLI02                        CELLI02                         ORCL:CELLI02

CELLI03                                  CELLI03                        CELLI03                         ORCL:CELLI03

CELLI04                                  CELLI04                        CELLI04                         ORCL:CELLI04

CELLI05                                  CELLI05                        CELLI05                         ORCL:CELLI05

CELLI06                                  CELLI06                        CELLI06                         ORCL:CELLI06

CELLI07                                  CELLI07                        CELLI07                         ORCL:CELLI07

RECI01                                   RECI01                         RECI01                          ORCL:RECI01

SYSTEMIDG01                              SYSTEMIDG01                    SYSI01                          ORCL:SYSI01




9 rows selected.




SQL>

And rebalance not continue because is not allowed when diskgroup is in restrict mode:

SQL> select * from gv$asm_operation;




   INST_ID GROUP_NUMBER OPERA PASS      STAT      POWER     ACTUAL      SOFAR   EST_WORK   EST_RATE EST_MINUTES ERROR_CODE                                       CON_ID

---------- ------------ ----- --------- ---- ---------- ---------- ---------- ---------- ---------- ----------- -------------------------------------------- ----------

         1            1 REBAL COMPACT   WAIT          1                                                                                                               0

         1            1 REBAL REBALANCE ERRS          1                                                         ORA-15421                                             0

         1            1 REBAL REBUILD   WAIT          1                                                                                                               0

         1            1 REBAL RESYNC    WAIT          1                                                                                                               0




SQL>

But since the failgroup become online “in force way”, the old cache (from CELL01) need to be clean. And since it is not the last version, maybe some files were corrupted. To check this, you can look the arb process trace files at ASM trace directory:

[root@asmrec trace]# ls -lFhtr *arb*

...

...

-rw-r----- 1 grid oinstall 6.4K Mar 22 17:10 +ASM1_arb0_3210.trm

-rw-r----- 1 grid oinstall  44K Mar 22 17:10 +ASM1_arb0_3210.trc

-rw-r----- 1 grid oinstall  984 Mar 22 17:27 +ASM1_arb0_8788.trm

-rw-r----- 1 grid oinstall 2.1K Mar 22 17:27 +ASM1_arb0_8788.trc

-rw-r----- 1 grid oinstall  882 Mar 22 17:27 +ASM1_arba_8786.trm

-rw-r----- 1 grid oinstall 1.2K Mar 22 17:27 +ASM1_arba_8786.trc

[root@asmrec trace]#

And looking from one of the last, we can see that some extend (that does not exist, the recovered failgroup, or the cache is not the last one) was filled with dummy (BADFDA7A) data:

[root@asmrec trace]# cat +ASM1_arb0_8788.trc

Trace file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_arb0_8788.trc

Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production

Version 19.6.0.0.0

Build label:    RDBMS_19.3.0.0.0DBRU_LINUX.X64_190417

ORACLE_HOME:    /u01/app/19.0.0.0/grid

System name:    Linux

Node name:      asmrec.oralocal

Release:        4.14.35-1902.10.8.el7uek.x86_64

Version:        #2 SMP Thu Feb 6 11:02:28 PST 2020

Machine:        x86_64

Instance name: +ASM1

Redo thread mounted by this instance: 0 <none>

Oracle process number: 40

Unix process pid: 8788, image: [email protected] (ARB0)







*** 2020-03-22T17:27:59.044949+01:00

*** SESSION ID:(402.55837) 2020-03-22T17:27:59.044969+01:00

*** CLIENT ID:() 2020-03-22T17:27:59.044975+01:00

*** SERVICE NAME:() 2020-03-22T17:27:59.044980+01:00

*** MODULE NAME:() 2020-03-22T17:27:59.044985+01:00

*** ACTION NAME:() 2020-03-22T17:27:59.044989+01:00

*** CLIENT DRIVER:() 2020-03-22T17:27:59.044994+01:00




 WARNING: group 1, file 266, extent 22: filling extent with BADFDA7A during recovery

 WARNING: group 1, file 266, extent 22: filling extent with BADFDA7A during recovery

 WARNING: group 1, file 266, extent 22: filling extent with BADFDA7A during recovery

 WARNING: group 1, file 266, extent 22: filling extent with BADFDA7A during recovery

 WARNING: group 1, file 258, extent 7: filling extent with BADFDA7A during recovery

 WARNING: group 1, file 258, extent 7: filling extent with BADFDA7A during recovery

 WARNING: group 1, file 258, extent 7: filling extent with BADFDA7A during recovery

 WARNING: group 1, file 258, extent 7: filling extent with BADFDA7A during recovery




*** 2020-03-22T17:27:59.680119+01:00

NOTE: initiating PST update: grp 1 (DATA), dsk = 9/0x0, mask = 0x7f, op = assign mandatory

kfdp_updateDsk(): callcnt 195 grp 1

PST verChk -0: req, id=266369333, grp=1, requested=91 at 03/22/2020 17:27:59

NOTE: PST update grp = 1 completed successfully

NOTE: kfdsFilter_freeDskSrSlice for Filter 0x7fbaf6238d38

NOTE: kfdsFilter_clearDskSlice for Filter 0x7fbaf6238d38 (all:TRUE)

NOTE: completed online of disk group 1 disks

CELLI03 (9)

[root@asmrec trace]#

And as you can imagine, this will lead to files that need to be restored from backup. But look that just some data, not everything. Remember at the beginning of the post that this depends on how your data is distributed inside of ASM failgroups. If you have luck, you have just a few data impacted. This depends on a lot of factors, as the time that was offline, the size of the failgroup, the activity of your databases, and many others. But, the gains can be good and mad it worth the effort.

After that, we can normally dismount the diskgroup:

SQL> alter diskgroup data dismount;




Diskgroup altered.




SQL>

And mount it again:

SQL> alter diskgroup data mount;




Diskgroup altered.




SQL>

Since now the diskgroup is mounted in a clean way, you can continue with the rebalance:

SQL> select * from gv$asm_operation;




   INST_ID GROUP_NUMBER OPERA PASS      STAT      POWER     ACTUAL      SOFAR   EST_WORK   EST_RATE EST_MINUTES ERROR_CODE                                       CON_ID

---------- ------------ ----- --------- ---- ---------- ---------- ---------- ---------- ---------- ----------- -------------------------------------------- ----------

         1            1 REBAL COMPACT   WAIT          1                                                                                                               0

         1            1 REBAL REBALANCE ERRS          1                                                         ORA-15421                                             0

         1            1 REBAL REBUILD   WAIT          1                                                                                                               0

         1            1 REBAL RESYNC    WAIT          1                                                                                                               0




SQL> alter diskgroup DATA rebalance;




Diskgroup altered.




SQL>

The state at ASM side it is:

SQL> select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;




NAME                                     FAILGROUP                      LABEL                           PATH

---------------------------------------- ------------------------------ ------------------------------- ------------------------------------------------------------

CELLI01                                  CELLI01

CELLI02                                  CELLI02                        CELLI02                         ORCL:CELLI02

CELLI03                                  CELLI03                        CELLI03                         ORCL:CELLI03

CELLI04                                  CELLI04                        CELLI04                         ORCL:CELLI04

CELLI05                                  CELLI05                        CELLI05                         ORCL:CELLI05

CELLI06                                  CELLI06                        CELLI06                         ORCL:CELLI06

CELLI07                                  CELLI07                        CELLI07                         ORCL:CELLI07

RECI01                                   RECI01                         RECI01                          ORCL:RECI01

SYSTEMIDG01                              SYSTEMIDG01                    SYSI01                          ORCL:SYSI01




9 rows selected.




SQL>

As you can see, the CELL01 was not removed yet (I will talk about it later). But the activities can continue, databases can be checked.

Database side

At database side we need to check what we lost and need to recover. Since I am using cluster the GI tried to start it (and as you can see two became up):

[oracle@asmrec ~]$ ps -ef |grep smon

root      8254     1  2 13:53 ?        00:04:40 /u01/app/19.0.0.0/grid/bin/osysmond.bin

grid      8750     1  0 13:54 ?        00:00:00 asm_smon_+ASM1

oracle   11589     1  0 17:31 ?        00:00:00 ora_smon_DBB19

oracle   11751     1  0 17:31 ?        00:00:00 ora_smon_DBA19

oracle   18817 29146  0 17:44 pts/9    00:00:00 grep --color=auto smon

[oracle@asmrec ~]$

DBA19

The firs that I checked was DBA19C, I used rman to VALIDATE DATABASE:

[oracle@asmrec ~]$ rman target /




Recovery Manager: Release 19.0.0.0.0 - Production on Sun Mar 22 17:45:21 2020

Version 19.6.0.0.0




Copyright (c) 1982, 2019, Oracle and/or its affiliates.  All rights reserved.




connected to target database: DBA19 (DBID=828667324)




RMAN> validate database;




Starting validate at 22-MAR-20

using target database control file instead of recovery catalog

allocated channel: ORA_DISK_1

channel ORA_DISK_1: SID=260 device type=DISK

channel ORA_DISK_1: starting validation of datafile

channel ORA_DISK_1: specifying datafile(s) for validation

input datafile file number=00001 name=+DATA/DBA19/DATAFILE/system.256.1035153873

input datafile file number=00004 name=+DATA/DBA19/DATAFILE/undotbs1.258.1035153973

input datafile file number=00003 name=+DATA/DBA19/DATAFILE/sysaux.257.1035153927

input datafile file number=00007 name=+DATA/DBA19/DATAFILE/users.259.1035153975

channel ORA_DISK_1: validation complete, elapsed time: 00:03:45

List of Datafiles

=================

File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

1    OK     0              17722        117766          5042446

  File Name: +DATA/DBA19/DATAFILE/system.256.1035153873

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              79105

  Index      0              13210

  Other      0              7723




File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

3    OK     0              19445        67862           5042695

  File Name: +DATA/DBA19/DATAFILE/sysaux.257.1035153927

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              7988

  Index      0              5531

  Other      0              34876




File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

4    FAILED 1              49           83247           5042695

  File Name: +DATA/DBA19/DATAFILE/undotbs1.258.1035153973

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              0

  Index      0              0

  Other      511            83151




File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

7    OK     0              93           641             4941613

  File Name: +DATA/DBA19/DATAFILE/users.259.1035153975

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              65

  Index      0              15

  Other      0              467




validate found one or more corrupt blocks

See trace file /u01/app/oracle/diag/rdbms/dba19/DBA19/trace/DBA19_ora_19219.trc for details

channel ORA_DISK_1: starting validation of datafile

channel ORA_DISK_1: specifying datafile(s) for validation

including current control file for validation

including current SPFILE in backup set

channel ORA_DISK_1: validation complete, elapsed time: 00:00:01

List of Control File and SPFILE

===============================

File Type    Status Blocks Failing Blocks Examined

------------ ------ -------------- ---------------

SPFILE       OK     0              2

Control File OK     0              646

Finished validate at 22-MAR-20




RMAN> shutdown abort;




Oracle instance shut down




RMAN> startup mount;




connected to target database (not started)

Oracle instance started

database mounted




Total System Global Area    1610610776 bytes




Fixed Size                     8910936 bytes

Variable Size                859832320 bytes

Database Buffers             734003200 bytes

Redo Buffers                   7864320 bytes




RMAN> run{

2> restore datafile 4;

3> recover datafile 4;

4> }




Starting restore at 22-MAR-20

allocated channel: ORA_DISK_1

channel ORA_DISK_1: SID=249 device type=DISK




channel ORA_DISK_1: starting datafile backup set restore

channel ORA_DISK_1: specifying datafile(s) to restore from backup set

channel ORA_DISK_1: restoring datafile 00004 to +DATA/DBA19/DATAFILE/undotbs1.258.1035153973

channel ORA_DISK_1: reading from backup piece /tmp/9puro5qr_1_1

channel ORA_DISK_1: piece handle=/tmp/9puro5qr_1_1 tag=BKP-DB-INC0

channel ORA_DISK_1: restored backup piece 1

channel ORA_DISK_1: restore complete, elapsed time: 00:00:45

Finished restore at 22-MAR-20




Starting recover at 22-MAR-20

using channel ORA_DISK_1




starting media recovery

media recovery complete, elapsed time: 00:00:02




Finished recover at 22-MAR-20




RMAN> alter database open;




Statement processed




RMAN> exit







Recovery Manager complete.

[oracle@asmrec ~]$

[oracle@asmrec ~]$

As you can see, the datafile 4 FAILED and needs to be recovered. Luckily, the redo was not affected too and the open was OK. Since it was the UNDO, I made abort (because the immediate can take an eternity, and even since undo was down, nothing was happening inside of the database).

But as you saw, just one datafile was corrupted. Of course that with big databases and big failgroup, more files will be corrupted. But it is a shot that can worth it.

DBB19

The second was DBB19 and I used the same approach, VALIDATE DATABASE:

[oracle@asmrec ~]$ export ORACLE_SID=DBB19

[oracle@asmrec ~]$

[oracle@asmrec ~]$ rman target /




Recovery Manager: Release 19.0.0.0.0 - Production on Sun Mar 22 17:55:20 2020

Version 19.6.0.0.0




Copyright (c) 1982, 2019, Oracle and/or its affiliates.  All rights reserved.




PL/SQL package SYS.DBMS_BACKUP_RESTORE version 19.03.00.00 in TARGET database is not current

PL/SQL package SYS.DBMS_RCVMAN version 19.03.00.00 in TARGET database is not current

connected to target database: DBB19 (DBID=1336872427)




RMAN> validate database;




Starting validate at 22-MAR-20

using target database control file instead of recovery catalog

allocated channel: ORA_DISK_1

channel ORA_DISK_1: SID=374 device type=DISK

channel ORA_DISK_1: starting validation of datafile

channel ORA_DISK_1: specifying datafile(s) for validation

input datafile file number=00001 name=+DATA/DBB19/DATAFILE/system.261.1035154051

input datafile file number=00003 name=+DATA/DBB19/DATAFILE/sysaux.265.1035154177

input datafile file number=00004 name=+DATA/DBB19/DATAFILE/undotbs1.267.1035154235

input datafile file number=00007 name=+DATA/DBB19/DATAFILE/users.268.1035154241

channel ORA_DISK_1: validation complete, elapsed time: 00:00:35

List of Datafiles

=================

File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

1    OK     0              16763        116487          3861452

  File Name: +DATA/DBB19/DATAFILE/system.261.1035154051

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              78871

  Index      0              13010

  Other      0              7836




File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

3    OK     0              19307        62758           3861452

  File Name: +DATA/DBB19/DATAFILE/sysaux.265.1035154177

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              7459

  Index      0              5158

  Other      0              30796




File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

4    OK     0              1            35847           3652497

  File Name: +DATA/DBB19/DATAFILE/undotbs1.267.1035154235

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              0

  Index      0              0

  Other      0              35839




File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

7    OK     0              85           641             3759202

  File Name: +DATA/DBB19/DATAFILE/users.268.1035154241

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              70

  Index      0              15

  Other      0              470




channel ORA_DISK_1: starting validation of datafile

channel ORA_DISK_1: specifying datafile(s) for validation

including current control file for validation

including current SPFILE in backup set

channel ORA_DISK_1: validation complete, elapsed time: 00:00:01

List of Control File and SPFILE

===============================

File Type    Status Blocks Failing Blocks Examined

------------ ------ -------------- ---------------

SPFILE       OK     0              2

Control File OK     0              646

Finished validate at 22-MAR-20




RMAN> VALIDATE CHECK LOGICAL DATABASE;




Starting validate at 22-MAR-20

using channel ORA_DISK_1

channel ORA_DISK_1: starting validation of datafile

channel ORA_DISK_1: specifying datafile(s) for validation

input datafile file number=00001 name=+DATA/DBB19/DATAFILE/system.261.1035154051

input datafile file number=00003 name=+DATA/DBB19/DATAFILE/sysaux.265.1035154177

input datafile file number=00004 name=+DATA/DBB19/DATAFILE/undotbs1.267.1035154235

input datafile file number=00007 name=+DATA/DBB19/DATAFILE/users.268.1035154241

channel ORA_DISK_1: validation complete, elapsed time: 00:00:35

List of Datafiles

=================

File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

1    OK     0              16763        116487          3861452

  File Name: +DATA/DBB19/DATAFILE/system.261.1035154051

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              78871

  Index      0              13010

  Other      0              7836




File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

3    OK     0              19307        62758           3861452

  File Name: +DATA/DBB19/DATAFILE/sysaux.265.1035154177

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              7459

  Index      0              5158

  Other      0              30796




File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

4    OK     0              1            35847           3652497

  File Name: +DATA/DBB19/DATAFILE/undotbs1.267.1035154235

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              0

  Index      0              0

  Other      0              35839




File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

7    OK     0              85           641             3759202

  File Name: +DATA/DBB19/DATAFILE/users.268.1035154241

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              70

  Index      0              15

  Other      0              470




channel ORA_DISK_1: starting validation of datafile

channel ORA_DISK_1: specifying datafile(s) for validation

including current control file for validation

including current SPFILE in backup set

channel ORA_DISK_1: validation complete, elapsed time: 00:00:01

List of Control File and SPFILE

===============================

File Type    Status Blocks Failing Blocks Examined

------------ ------ -------------- ---------------

SPFILE       OK     0              2

Control File OK     0              646

Finished validate at 22-MAR-20




RMAN> exit







Recovery Manager complete.

[oracle@asmrec ~]$

[oracle@asmrec ~]$

[oracle@asmrec ~]$

As you saw, no failures for DBB19. I still checked logically the database with VALIDATE CHECK LOGICAL DATABASE because since the validate returned no failed files, I wanted to check logically the blocks.

DBC19

Same for the last database, but now, datafile 3 failed:

[oracle@asmrec ~]$ export ORACLE_SID=DBC19

[oracle@asmrec ~]$ rman target /




Recovery Manager: Release 19.0.0.0.0 - Production on Sun Mar 22 18:01:33 2020

Version 19.6.0.0.0




Copyright (c) 1982, 2019, Oracle and/or its affiliates.  All rights reserved.




connected to target database (not started)




RMAN> startup mount;




Oracle instance started

database mounted




Total System Global Area    1610610776 bytes




Fixed Size                     8910936 bytes

Variable Size                864026624 bytes

Database Buffers             729808896 bytes

Redo Buffers                   7864320 bytes




RMAN> validate database;




Starting validate at 22-MAR-20

using target database control file instead of recovery catalog

allocated channel: ORA_DISK_1

channel ORA_DISK_1: SID=134 device type=DISK

channel ORA_DISK_1: starting validation of datafile

channel ORA_DISK_1: specifying datafile(s) for validation

input datafile file number=00001 name=+DATA/DBC19/DATAFILE/system.262.1035154053

input datafile file number=00004 name=+DATA/DBC19/DATAFILE/undotbs1.270.1035154249

input datafile file number=00003 name=+DATA/DBC19/DATAFILE/sysaux.266.1035154181

input datafile file number=00007 name=+DATA/DBC19/DATAFILE/users.271.1035154253

channel ORA_DISK_1: validation complete, elapsed time: 00:03:15

List of Datafiles

=================

File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

1    OK     0              17777        117764          4188744

  File Name: +DATA/DBC19/DATAFILE/system.262.1035154053

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              79161

  Index      0              13182

  Other      0              7640




File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

3    FAILED 1              19272        66585           4289434

  File Name: +DATA/DBC19/DATAFILE/sysaux.266.1035154181

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              7311

  Index      0              4878

  Other      511            35099




File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

4    OK     0              1            84522           4188748

  File Name: +DATA/DBC19/DATAFILE/undotbs1.270.1035154249

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              0

  Index      0              0

  Other      0              84479




File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

7    OK     0              93           641             3717377

  File Name: +DATA/DBC19/DATAFILE/users.271.1035154253

  Block Type Blocks Failing Blocks Processed

  ---------- -------------- ----------------

  Data       0              65

  Index      0              15

  Other      0              467




validate found one or more corrupt blocks

See trace file /u01/app/oracle/diag/rdbms/dbc19/DBC19/trace/DBC19_ora_22091.trc for details

channel ORA_DISK_1: starting validation of datafile

channel ORA_DISK_1: specifying datafile(s) for validation

including current control file for validation

including current SPFILE in backup set

channel ORA_DISK_1: validation complete, elapsed time: 00:00:01

List of Control File and SPFILE

===============================

File Type    Status Blocks Failing Blocks Examined

------------ ------ -------------- ---------------

SPFILE       OK     0              2

Control File OK     0              646

Finished validate at 22-MAR-20




RMAN> run{

2> restore datafile 3;

3> recover datafile 3;

4> }




Starting restore at 22-MAR-20

using channel ORA_DISK_1




channel ORA_DISK_1: starting datafile backup set restore

channel ORA_DISK_1: specifying datafile(s) to restore from backup set

channel ORA_DISK_1: restoring datafile 00003 to +DATA/DBC19/DATAFILE/sysaux.266.1035154181

channel ORA_DISK_1: reading from backup piece /tmp/0buro5rh_1_1

channel ORA_DISK_1: piece handle=/tmp/0buro5rh_1_1 tag=BKP-DB-INC0

channel ORA_DISK_1: restored backup piece 1

channel ORA_DISK_1: restore complete, elapsed time: 00:00:45

Finished restore at 22-MAR-20




Starting recover at 22-MAR-20

using channel ORA_DISK_1




starting media recovery




archived log for thread 1 with sequence 25 is already on disk as file +RECO/DBC19/ARCHIVELOG/2020_03_22/thread_1_seq_25.323.1035737103

archived log for thread 1 with sequence 26 is already on disk as file +RECO/DBC19/ARCHIVELOG/2020_03_22/thread_1_seq_26.329.1035739907

archived log for thread 1 with sequence 27 is already on disk as file +RECO/DBC19/ARCHIVELOG/2020_03_22/thread_1_seq_27.332.1035741283

archived log file name=+RECO/DBC19/ARCHIVELOG/2020_03_22/thread_1_seq_25.323.1035737103 thread=1 sequence=25

media recovery complete, elapsed time: 00:00:03

Finished recover at 22-MAR-20




RMAN> alter database open;




Statement processed




RMAN> exit







Recovery Manager complete.

[oracle@asmrec ~]$

Dropping failgroup

If the fix for the remaining failgroup took a lot, it will be dropped automatically. But we can do this manually with force (look that without force it fails):

SQL> ALTER DISKGROUP data DROP DISKS IN FAILGROUP CELLI01;

ALTER DISKGROUP data DROP DISKS IN FAILGROUP CELLI01

*

ERROR at line 1:

ORA-15032: not all alterations performed

ORA-15084: ASM disk "CELLI01" is offline and cannot be dropped.







SQL>

SQL> ALTER DISKGROUP data DROP DISKS IN FAILGROUP CELLI01 FORCE;




Diskgroup altered.




SQL>

And after the rebalance finish, all disk will be removed:

SQL> select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;




NAME                                     FAILGROUP                      LABEL                           PATH

---------------------------------------- ------------------------------ ------------------------------- ------------------------------------------------------------

_DROPPED_0001_DATA                       CELLI01

CELLI02                                  CELLI02                        CELLI02                         ORCL:CELLI02

CELLI03                                  CELLI03                        CELLI03                         ORCL:CELLI03

CELLI04                                  CELLI04                        CELLI04                         ORCL:CELLI04

CELLI05                                  CELLI05                        CELLI05                         ORCL:CELLI05

CELLI06                                  CELLI06                        CELLI06                         ORCL:CELLI06

CELLI07                                  CELLI07                        CELLI07                         ORCL:CELLI07

RECI01                                   RECI01                         RECI01                          ORCL:RECI01

SYSTEMIDG01                              SYSTEMIDG01                    SYSI01                          ORCL:SYSI01




9 rows selected.




SQL> select * from gv$asm_operation;




   INST_ID GROUP_NUMBER OPERA PASS      STAT      POWER     ACTUAL      SOFAR   EST_WORK   EST_RATE EST_MINUTES ERROR_CODE                                       CON_ID

---------- ------------ ----- --------- ---- ---------- ---------- ---------- ---------- ---------- ----------- -------------------------------------------- ----------

         1            1 REBAL COMPACT   WAIT          1          1          0          0          0           0                                                       0

         1            1 REBAL REBALANCE WAIT          1          1          0          0          0           0                                                       0

         1            1 REBAL REBUILD   RUN           1          1        292        642        666           0                                                       0

         1            1 REBAL RESYNC    DONE          1          1          0          0          0           0                                                       0




SQL> select * from gv$asm_operation;




no rows selected




SQL> select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;




NAME                                     FAILGROUP                      LABEL                           PATH

---------------------------------------- ------------------------------ ------------------------------- ------------------------------------------------------------

CELLI02                                  CELLI02                        CELLI02                         ORCL:CELLI02

CELLI03                                  CELLI03                        CELLI03                         ORCL:CELLI03

CELLI04                                  CELLI04                        CELLI04                         ORCL:CELLI04

CELLI05                                  CELLI05                        CELLI05                         ORCL:CELLI05

CELLI06                                  CELLI06                        CELLI06                         ORCL:CELLI06

CELLI07                                  CELLI07                        CELLI07                         ORCL:CELLI07

RECI01                                   RECI01                         RECI01                          ORCL:RECI01

SYSTEMIDG01                              SYSTEMIDG01                    SYSI01                          ORCL:SYSI01




8 rows selected.




SQL>

The steps for MOUNT RESTRICTED FORCE FOR RECOVERY

To resume, the steps needed are (in order):

Put online the failed disk/failgroup
Execute alter diskgroup <DG> mount restricted force for recovery
Brink online the failgroup with alter diskgroup data online disks in failgroup <FG>
Clean dismount DG alter diskgroup <DG> dismount
Clean mount alter diskgroup <DG> mount
Check databases for failures and recover it

Undocumented feature

So, the question is, why it is undocumented? I don’t have the answer but can figure out some points. For me, the most important is that is not a full, clean return. You need to restore and recover from the backup. Maybe you will lose a lot of data.

Of course that here in this example is a controlled scenario, I have just a few databases and my failgroup have just one disk inside. In real life, the problem will be worst. More diskgroups can be affected, as RECO/REDO/FRA. And probably you lost some redologs and archivelogs too and you can’t do a clean recovery. Or even need to recover OCR and Votedisk from the cluster.

This is the point for correct architecture design, if you need more protection at ASM side, you can use HIGH redundancy to survive at least two failures without interruption. This is the reason that SYSTEMDG (or OCR/Vote disk) is put high redundancy diskgroup at Exadata.

Outages and failures can occur in different layers of your environment. But storage/disk failures are catastrophic for databases because they can lead data corruption and you need to use backups to recover it. They can occur in any environment, from Storage until Exadata. I had one in an old Exadata V2 in 2016, used just for DEV databases, that crashed two storage cells (with one hour of difference) and needed to use this procedure to save some files and reduce the downtime avoiding to restore everything (more than 10TB).

So, it is good to know this kind of a procedure because can save time. But it is your decision to use it or no, check if worth or no.

Some references that you can check:

EXADATA X8M, Workshop

Category: Engineer System Author: Fernando Simon (Board Member) Date: 5 years ago Comments: 0

EXADATA X8M, Workshop

When Exadata X8M was released during the last Open World I made one post about the technical details about it. You can check it here: Exadata X8M (this post received some good shares and reviews). If you read it, you can see that I focused in more internal details (like torn blocks, one side path, two sides read/writes, and others), that differ from the normal analyses for Exadata X8M.

But recently I was invited by Oracle to participate exclusive workshop about Exadata X8M and I needed to share some details that picked me up. The workshop was done directly from Oracle Solution Center in Santa Clara Campus (it is an amazing place that I had an opportunity to visit in 2015, and have a rich history – if you have the opportunity, visit), and cover some technical details and with the hands-on part.

Unfortunately, I can’t share everything (I even don’t know if I can share something), but see the info below.

Exadata X8M

Little briefing before. If you don’t know, the big change for Exadata X8M was the add of RDMA memory directly at the storage server. Basically, added Intel Optane Memory (that it is nonvolatile by the way), in front of everything (including flashcache). Called PMEM (Persistent memory) as Exadata.

You can see more detail about Exadata at “Exadata Technical Deep Dive: Architecture and Internals” available at Oracle Web Site.

And just to add, the database server can access directly it, without passing operational system caches in the middle since the RDMA is ZDP (Zero-loss Zero-Copy Datagram Protocol). The data goes directly from database server memory to storage server memory. And if you think that now it uses RCoE (100GB/s network), it is fast.

Workshop

As told the workshop was more than a simple marketing presentation and had hands-on part. The focus was to show the PMEM gains over a “traditional” Exadata. Gains over log writes (PMEMLog) and gains over cache for reads (PMEMCache).

The test executed some loads (to test the log writes) and reads (to test the cache) for some defined period. Two runs were made, one with features disabled and other with enabled. And after that, we can compare the results.

Just to be clear that the focus of the tests was not to stress the machine and force to reach the limits. The focus was to see the difference between executions with and without PMEMCache and PMEMLog enabled. So, the numbers that you see here are below the limits for the Exadata.

Environment

Just to be more precise, the Workshop runs over Exadata X8M Extreme Flash Edition. This means that the disk was the flash driver. So, here we are comparing Flash against Memory. You will see the numbers below, but when read it remember that is flash, and think that they will be even slower in “normal” environment with disks instead of flash. Check the datasheet for more info.

For PMEM, the Exadata had 1.5TB of Intel Optane Memory. Twelve modules attached in each storage:

PMEMLog

The PMEMlog works in the same way that flashlog. It will speed up the redo writes that came from LGWR. For the workshop was first executed one run without PMEMLog, and the numbers were:

Execution per second: 26566
Log file sync average time: 310.00 μs
Log file parallel write average time: 129.00 μs

At CloudControl (CC):

And the same run was made with PMEMLog enabled. The results were:

Execution per second: 73915
Log file sync average time: 56.00 μs
Log file parallel write average time: 14.00 μs

And for CC:

As you can see, the difference was:

7x more executions.
2x faster to do log switch.
2x faster to do log writes.

As you can see, good numbers and good savings. Note that the actual performance measured may vary based on your workload. As expected, the use of PMEM speed up the redologs writes executed over the same workload (the database was restarted between the runs).

PMEM Cache

PMEM Cache works as a cache in front of the flashcache for Exadata Software, in the same way that flashcache but using just the PMEM memory modules. There nothing much to explain, everything is controlled by Exadata Software (the blocks that will be at PMEMCache), but the results are pretty amazing.

During the workshop one run was executed without PMEMCache enabled, the result was:

Number of read IOPS: 125.688
Average single block read latency: 233.00 μs
Average MB/s: 500MB/s

And form CC view:

And after, the run with PMEMCache enabled:

Number of read IOPS: 020.731
Average single block read latency (us): 16.00 μs
Average MB/s: 4000 MB/s (4GB/s)

For CC you can see how huge the difference:

The test was the same over the database. Both runs were the same. And if you compare the results, with PMEMCache enable was:

8x time more IOPS.
14x time faster/less latency.
8x more throughput at MB/s.

And again, remember that the Exadata was an EF edition. So, we are comparing flash against memory. Think how huge the difference will be if we compare “normal” hard disk against memory.

Under the Hood

But under the hood, how the PMEM appear? It is simple (and tricky at the same time). The modules appear as normal celldisk for the cell (yes, dimm memory as celldisk):

CellCLI> list celldisk

          FD_00_exa8cel01 normal

          FD_01_exa8cel01 normal

          FD_02_exa8cel01 normal

          FD_03_exa8cel01 normal

          FD_04_exa8cel01 normal

          FD_05_exa8cel01 normal

          FD_06_exa8cel01 normal

          FD_07_exa8cel01 normal

          PM_00_exa8cel01 normal

          PM_01_exa8cel01 normal

          PM_02_exa8cel01 normal

          PM_03_exa8cel01 normal

          PM_04_exa8cel01 normal

          PM_05_exa8cel01 normal

          PM_06_exa8cel01 normal

          PM_07_exa8cel01 normal

          PM_08_exa8cel01 normal

          PM_09_exa8cel01 normal

          PM_10_exa8cel01 normal

          PM_11_exa8cel01 normal




CellCLI>

And the PMEMLog in the same way as flashlog:

CellCLI> list pmemlog detail

          name:                   exa8cel01_PMEMLOG

          cellDisk:               PM_11_exa8cel01,PM_06_exa8cel01,PM_01_exa8cel01,PM_10_exa8cel01,PM_03_exa8cel01,PM_04_exa8cel01,PM_07_exa8cel01,PM_08_exa8cel01,PM_09_exa8cel01,PM_05_exa8cel01,PM_00_exa8cel01,PM_02_exa8cel01

          creationTime:           2020-01-31T21:01:06-08:00

          degradedCelldisks:     

          effectiveSize:          960M

          efficiency:             100.0

          id:                     9ba61418-e8cc-43c6-ba55-c74e4b5bdec8

          size:                   960M

          status:                 normal




CellCLI>

And PMEMCache too:

CellCLI> list pmemcache detail

          name:                   exa8cel01_PMEMCACHE

          cellDisk:               PM_04_exa8cel01,PM_01_exa8cel01,PM_09_exa8cel01,PM_03_exa8cel01,PM_06_exa8cel01,PM_08_exa8cel01,PM_05_exa8cel01,PM_10_exa8cel01,PM_11_exa8cel01,PM_00_exa8cel01,PM_02_exa8cel01,PM_07_exa8cel01

          creationTime:           2020-01-31T21:23:09-08:00

          degradedCelldisks:     

          effectiveCacheSize:     1.474365234375T

          id:                     d3c71ce8-9e4e-4777-9015-771a4e1a2376

          size:                   1.474365234375T

          status:                 normal




CellCLI>

So, the administration at the cell side will be easy. We (as DMA) don’t need to concern about special administration details. It was not the point of the workshop, but I think that when the DIMM needs to be replaced, it will not be hot-swapped and after the change, some “warm” process needs to occur to load the cache.

Conclusion

For me, the workshop was a really nice surprise. I work with Exadata since 2010, started with Exadata V2 and passed over X2, X4 X5 EF and X7, and saw a lot of new features over these 10 years, but the addition of PMEM was a good adding. I was not expecting much difference for the numbers, but it was possible to see the real difference that PMEM can deliver. The notable point was the latency reduction.

Think about one environment that you need to be fast, these μs make the difference. Think in one DB that used Data Guard with Real-time apply for redo, where the primary waits for standby apply to release the commit. Think that these μs that you gain, you deduct for the total time.

Of course, that not everyone maybe needs the gains for PMEM, but for those that need it really make the difference.

ZDLRA, Manual clone for Tape

Category: Engineer System Author: Fernando Simon (Board Member) Date: 5 years ago Comments: 0

ZDLRA, Manual clone for Tape

In my previous post, I showed how the clone to tape occurs for ZDLRA. But as explained, the clones occur through the scheduler and follow some rules. For full backup, as an example, it clones the last available.

But sometimes, it is needed to call the clone for some specific backup, maybe to do long-term storage to follow some regimentation/law. And if we leverage this for the clone, jobs can maybe take to long, or clone more that you need.

Manual Clone

It is possible to clone one backup piece directly using CLI at ZDLRA. But we need to specify backup piece per backup piece since the parameter is the bp_key. But before clone, I recommend identifying some important things.

The first is the database key inside the rman catalog. A simple list db_unique_name show this:

RMAN> list db_unique_name of database ORCL18C;







List of Databases

DB Key  DB Name  DB ID            Database Role    Db_unique_name

------- ------- ----------------- ---------------  ------------------

15993   ORCL18C  558466555        PRIMARY          ORCL18C




RMAN>

The second is to identify the backupset and the backup piece that you need to clone. In this case, I picked up one from SYSTEM tablespace:

RMAN> list backupset 16512;







List of Backup Sets

===================







BS Key  Type LV Size       Device Type Elapsed Time Completion Time

------- ---- -- ---------- ----------- ------------ -------------------

16512   Incr 0  330.35M    SBT_TAPE    00:03:03     02/02/2020 16:12:04

        BP Key: 16513   Status: AVAILABLE  Compressed: YES  Tag: BKP-DB-INC0

        Handle: VB$_1891149551_16507I   Media:

  List of Datafiles in backup set 16512

  File LV Type Ckp SCN    Ckp Time            Abs Fuz SCN Sparse Name

  ---- -- ---- ---------- ------------------- ----------- ------ ----

  1    0  Incr 1909028    02/02/2020 16:09:01              NO    /u01/app/oracle/oradata/ORCL18C/system01.dbf




RMAN>

Other are related inside ZDLRA database and are the protection policy linked with the database doing select over ra_database table:

SQL> select policy_name, db_key, db_unique_name from ra_database where db_unique_name = 'ORCL18C';




POLICY_NAME                                            DB_KEY DB_UNIQUE_NAME

-------------------------------------------------- ---------- --------------------------------

ZDLRA_ONLY_TAPE                                         15993 ORCL18C




SQL>

And to finish it needed to discover the template it is associated with the protection policy:

SQL> set linesize 250

SQL> col LIB_NAME format a30

SQL> col TEMPLATE_NAME format a50

SQL> col LAST_SCHEDULE_TIME format a35

SQL> col POLICY_NAME format a50

SQL> select TEMPLATE_NAME, BACKUP_TYPE, LAST_SCHEDULE_TIME, LIB_NAME, DB_UNIQUE_NAME, POLICY_NAME FROM RA_SBT_JOB WHERE LIB_NAME = 'OSBSBT' and POLICY_NAME = 'ZDLRA_ONLY_TAPE' ;




TEMPLATE_NAME                                      BACKUP_TYPE      LAST_SCHEDULE_TIME                  LIB_NAME                       DB_UNIQUE_NAME                 POLICY_NAME

-------------------------------------------------- ---------------- ----------------------------------- ------------------------------ ------------------------------ --------------------------------------------------

ONLYTAPE_FULL_ARCH                                 FULL, ARCH       02-FEB-20 04.40.37.077769 PM +01:00 OSBSBT                                                        ZDLRA_ONLY_TAPE

ONLYTAPE_ARCH                                      ARCH                                                 OSBSBT                                                        ZDLRA_ONLY_TAPE




SQL>

COPY_BACKUP_PIECE

The information collected before it is important because it is needed to use the DBMS_RA.COPY_BACKUP_PIECE procedure.

The parameter for the usage:

BP_KEY: Backup piece key that will be cloned.
FORMAT: Format to create the name for the backup. It follows the same format as RMAN (including the same wildcards).
TEMPLATE_NAME: The template that will be used to clone to tape. This is how ZDLRA discover the library and attributes for the clone.

So, to clone we just need to execute at ZDLRA:

SQL> BEGIN

  2      DBMS_RA.COPY_BACKUP_PIECE(

  3          BP_KEY => 16513,

  4          FORMAT => 'RA_SBT_ORCL18C_16512_16513_%U',

  5          TEMPLATE_NAME => 'ONLYTAPE_FULL_ARCH'

  6      );

  7  END;

  8  /




PL/SQL procedure successfully completed.




SQL>

Important there is the FORMAT. Look that that I used RA_SBT in the beginning to follow the same used by the normal clone to the tape of ZDLRA, after I added the DB_UNIQUE_NAME, with <BS_KEY>_<BP_KEY>, and %U to generate a unique name.

If we look internally, the backup was scheduled:

SQL> SELECT TASK_ID, STATE, DB_UNIQUE_NAME, ERROR_TEXT, BS_KEY, PIECE# FROM RA_SBT_TASK WHERE LIB_NAME = 'OSBSBT' AND  DB_UNIQUE_NAME = 'ORCL18C' and BS_KEY = 16512;




   TASK_ID STATE                DB_UNIQUE_NAME                 ERROR_TEXT                                             BS_KEY     PIECE#

---------- -------------------- ------------------------------ -------------------------------------------------- ---------- ----------

     33857 EXECUTABLE           ORCL18C                                                                                16512          1




SQL>

And OSB executed it:

[root@zdlras1n1 ~]# obtool lsjob

Job ID           Sched time  Contents                       State

---------------- ----------- ------------------------------ ---------------------------------------

oracle/42        none        database ORCL18C (dbid=558466555) processed; Oracle job(s) scheduled

oracle/42.1      none        datafile backup                running since 2020/02/02.20:48

[root@zdlras1n1 ~]#

[root@zdlras1n1 ~]# cat /usr/etc/ob/xcr/[email protected]

[4100000001]2020/02/02.20:48:35 ______________________________________________________________________

[4100000001]2020/02/02.20:48:35

[4100000002]2020/02/02.20:48:35         Transcript for job oracle/42.1 running on zdlras1n1

[4100000002]2020/02/02.20:48:35

[1100000003]2020/02/02.20:48:35 (amh)  qdv__automount_in_mh entered

[1100000004]2020/02/02.20:48:35 (amh)  qdv__automount_in_mh tape03 at 2020/02/02.20:48:35, flags 0x100

[1100000005]2020/02/02.20:48:35 (amh)  mount volume options list contains:

[1100000006]2020/02/02.20:48:35 (amh)     vtype 3 (app), vid (null), vs_create 0, family zdlras1-osbmf, retain (null), size 0, mediainfo 2, scratch 0

[1100000007]2020/02/02.20:48:36 (amh)  don't preserve previous mh automount state

[1100000008]2020/02/02.20:48:36 (gep)  getting reservation for element 0x3 (dte)

[1100000009]2020/02/02.20:48:36 (una)  unload_anywhere entered

[110000000A]2020/02/02.20:48:36 (amh) attempting automount with lm based volume selection

[110000000B]2020/02/02.20:48:36 (ulg)  1,(lm_amh)  beginning pass 1, mediainfo pass 1 at 2020/02/02.20:48:36

[110000000C]2020/02/02.20:48:36 (ulg)  1,(lm_amh)    1 no vid

[110000000D]2020/02/02.20:48:36 (ulg)  1,(lm_amh)    2 no vid

[110000000E]2020/02/02.20:48:36 (ulg)  1,(lm_amh)    3 no vid

[110000000F]2020/02/02.20:48:36 (ulg)  1,(lm_amh)    4 no vid

[1100000010]2020/02/02.20:48:36 (ulg)  1,(lm_mmr)  need next volume after oid 146 (vid zdlras1-osbmf-000001, tag E01005L4, create_time 1577658912, mediainfo 0) for append; its oid 155

[1100000011]2020/02/02.20:48:36 (ulg)  1,(lm_amh)    5 oid 146 doesn't meet mount requirements - the next volume of this set is needed (OB device mgr)

[1100000012]2020/02/02.20:48:36 (ulg)  1,(lm_amh) oid 146 is added to state

[1100000013]2020/02/02.20:48:36 (ulg)  1,(lm_mmr)  need next volume after oid 148 (vid zdlras1-osbmf-000002, tag E01006L4, create_time 1577658918, mediainfo 4) for append; its oid 157

[1100000014]2020/02/02.20:48:36 (ulg)  1,(lm_amh)    6 oid 148 doesn't meet mount requirements - the next volume of this set is needed (OB device mgr)

[1100000015]2020/02/02.20:48:36 (ulg)  1,(lm_amh) oid 148 is added to state

[1100000016]2020/02/02.20:48:36 (ulg)  1,(lm_mmr)  need next volume after oid 150 (vid zdlras1-osbmf-000003, tag E01007L4, create_time 1577658926, mediainfo 4) for append; its oid 153

[1100000017]2020/02/02.20:48:36 (ulg)  1,(lm_amh)    7 oid 150 doesn't meet mount requirements - the next volume of this set is needed (OB device mgr)

[1100000018]2020/02/02.20:48:36 (ulg)  1,(lm_amh) oid 150 is added to state

[1100000019]2020/02/02.20:48:36 (ulg)  1,(lm_mmr)  family zdlras1-osbmf != RMAN-DEFAULT

[110000001A]2020/02/02.20:48:36 (ulg)  1,(lm_mmr)  oid 151 (vid RMAN-DEFAULT-000001, tag E01008L4, create_time 1577034094, mediainfo 0) fails mount criteria

[110000001B]2020/02/02.20:48:36 (ulg)  1,(lm_amh)    8 oid 151 doesn't meet mount requirements - volume doesn't meet mount criteria (OB device mgr)

[110000001C]2020/02/02.20:48:36 (ulg)  1,(lm_amh) oid 151 is added to state

[110000001D]2020/02/02.20:48:36 (ulg)  1,(lm_mmr)  oid 153 (vid zdlras1-osbmf-000004, tag E01009L4, create_time 1577658926, mediainfo 4) passes criteria

[110000001E]2020/02/02.20:48:36 (ulg)  1,(lm_amh)    9 appendable volume found

[110000001F]2020/02/02.20:48:36 (ulg)  1,(lm_mmr)  oid 155 (vid zdlras1-osbmf-000005, tag E01010L4, create_time 1577658912, mediainfo 0) passes criteria

[1100000020]2020/02/02.20:48:36 (ulg)  1,(lm_amh)   10 better appendable volume found

[1100000021]2020/02/02.20:48:36 (ulg)  1,(lm_amh)   10 available space 159744,   9                                               available space 233472

[1100000022]2020/02/02.20:48:36 (ulg)  1,(lm_mmr)  oid 157 (vid zdlras1-osbmf-000006, tag E01011L4, create_time 1577658918, mediainfo 0) passes criteria

[1100000023]2020/02/02.20:48:36 (ulg)  1,(lm_amh)   11 better appendable volume found

[1100000024]2020/02/02.20:48:36 (ulg)  1,(lm_amh)   11 available space 18432,  10                                               available space 159744

[1100000025]2020/02/02.20:48:36 (ulg)  1,(lm_amh)   12 no vid

[1100000026]2020/02/02.20:48:36 (ulg)  1,(lm_amh)   13 no vid

[1100000027]2020/02/02.20:48:36 (ulg)  1,(lm_amh)   14 no vid

[1100000028]2020/02/02.20:48:36 (ulg)  1,(lm_amh)   15 no vid

[1100000029]2020/02/02.20:48:36 (ulg)  1,(lm_amh)   16 no vid

[110000002A]2020/02/02.20:48:36 (ulg)  1,(lm_amh)   17 no vid

[110000002B]2020/02/02.20:48:36 (ulg)  1,(lm_amh)   18 no vid

[110000002C]2020/02/02.20:48:36 (ulg)  1,(lm_amh)   19 no vid

[110000002D]2020/02/02.20:48:36 (ulg)  1,(lm_amh)   20 no vid

[110000002E]2020/02/02.20:48:36 (ulg)  1,(lm_amh)   21 vacant

[110000002F]2020/02/02.20:48:36 (ulg)  1,(lm_amh)   22 no vid

[1100000030]2020/02/02.20:48:36 (ulg)  1,(lm_amh)   23 no vid

[1100000031]2020/02/02.20:48:36 (ulg)  1,(lm_amh)   24 vacant

[1100000032]2020/02/02.20:48:36 (ulg)  1,(lm_amh)   25 vacant

[1100000033]2020/02/02.20:48:36 (ulg)  1,(lm_amh)   26 vacant

[1100000034]2020/02/02.20:48:36 (ulg)  1,(lm_amh)   27 vacant

[1100000035]2020/02/02.20:48:36 (ulg)  1,(lm_amh)   28 vacant

[1100000036]2020/02/02.20:48:36 (ulg)  1,(lm_amh)   29 vacant

[1100000037]2020/02/02.20:48:36 (ulg)  1,(lm_amh)   30 no vid

[1100000038]2020/02/02.20:48:36 (ulg)  1,(lm_amh)   31 no vid

[1100000039]2020/02/02.20:48:36 (ulg)  1,(lm_amh)   32 no vid

[110000003A]2020/02/02.20:48:36 (ulg)  1,(lm_amh)   33 no vid

[110000003B]2020/02/02.20:48:36 (ulg)  1,(lm_amh)   34 no vid

[110000003C]2020/02/02.20:48:36 (ulg)  1,(lm_amh)   35 no vid

[110000003D]2020/02/02.20:48:36 (ulg)  1,(lm_amh)   36 no vid

[110000003E]2020/02/02.20:48:36 (ulg)  1,(lm_amh)   37 no vid

[110000003F]2020/02/02.20:48:36 (ulg)  1,(lm_amh)   38 no vid

[1100000040]2020/02/02.20:48:36 (ulg)  1,(lm_amh)   39 no vid

[1100000041]2020/02/02.20:48:36 (ulg)  1,(lm_amh)  end of pass 1 at 2020/02/02.20:48:362020/02/02.20:48:36

[1100000042]2020/02/02.20:48:36 (amh)  qlm__select_volume returned se state for vid = zdlras1-osbmf-000006 last_se = 0

[1100000043]2020/02/02.20:48:36 (amh)  state.pass = 1

[1100000044]2020/02/02.20:48:36 (amh)  state.last_se_checked = 11

[1100000045]2020/02/02.20:48:36 (amh)  state.mediainfo_pass = 1

[1100000046]2020/02/02.20:48:36 (amh)  state.mediainfo_loops = 1

[1100000047]2020/02/02.20:48:36 (amh)  state.rls_eltype = se

[1100000048]2020/02/02.20:48:36 (amh)  state.rls_elnum = 11

[1100000049]2020/02/02.20:48:36 (amh)   11 loading

[110000004A]2020/02/02.20:48:38 (atv)  qdv__automount_this_vol entered

[110000004B]2020/02/02.20:48:38 (atv)  calling qdv__mount

[110000004C]2020/02/02.20:48:38 (mt)   qdv__read_mount_db() succeeded, found vol_oid 0

[110000004D]2020/02/02.20:48:38 (mt)   qdv__read_label() succeeded; read 65536 bytes

[110000004E]2020/02/02.20:48:38 (mt)   exp time obtained from label

[110000004F]2020/02/02.20:48:38 (mt)   qdb__label_event() returned vol_oid 157

[1100000050]2020/02/02.20:48:38 (mt)   setting vol_oid in mount_info to 157

[1100000051]2020/02/02.20:48:38 (mt)   updated volume close time from db

[1100000052]2020/02/02.20:48:38 (atv)  qdv__mount succeeded

[1100000053]2020/02/02.20:48:38 (atv)  automount worked

[1100000054]2020/02/02.20:48:38 (atv)  qdv__automount_this_vol exited

[1100000055]2020/02/02.20:48:38 (gep)  getting reservation for element 0x3 (dte)

[1100000056]2020/02/02.20:48:38 (amh)   11 automount worked - returning

[1100000057]2020/02/02.20:48:38 (amh)  end of automount at 2020/02/02.20:48:38 (0x0)

[1100000058]2020/02/02.20:48:38 (amh)  returning from qdv__automount_in_mh

[3100000059]2020/02/02.20:48:38 Info: volume in tape03 is usable for this operation.

[110000005A]2020/02/02.20:48:38 (pvfw) at BOT

[110000005B]2020/02/02.20:48:38 (pvfw) previous state is invalid

[110000005C]2020/02/02.20:48:38 (alv)  backup image label is valid, file 19, section 2

[110000005D]2020/02/02.20:48:38 (pvfw) invalidating tape position in mount db

[110000005E]2020/02/02.20:48:38 (ial)  invalidate backup image label (was valid)

[110000005F]2020/02/02.20:48:38 (pvfw) space to EOD

[1100000060]2020/02/02.20:48:38 (pvfw) inspect_recs BSR: rtypes [0] = filemark

[1100000061]2020/02/02.20:48:38 (pvfw) inspect_recs BSR: rtypes [1] = filemark

[1100000062]2020/02/02.20:48:38 (pvfw) inspect_recs BSR: rtypes [2] = data

[1100000063]2020/02/02.20:48:38 (pvfw) inspect_recs BSR: rtypes [3] = filemark

[1100000064]2020/02/02.20:48:38 (pvfw) inspect_recs FSF

[1100000065]2020/02/02.20:48:38 (pvfw) inspect_recs ready to mount

[1100000066]2020/02/02.20:48:38 (pvfw) mounting at inspect_rec's request in rw_mode 2

[1100000067]2020/02/02.20:48:38 (pvfw) mounted ok

[1100000068]2020/02/02.20:48:38 (pvfw) at OB EOD, returning (2)

[1100000069]2020/02/02.20:48:38 (pvfw) pos_vol_cleanup not returning pstate

[010000006A]2020/02/02.20:48:38 (dmap) tape03 success

[410000006B]Volume label:

[410000006C]    Volume tag:             E01011L4

[410000006D]    Volume UUID:            a1f8db28-109b-1038-847b-080027b2a93d

[410000006E]    Volume ID:              zdlras1-osbmf-000006

[410000006F]    Volume set ID:          zdlras1-osbmf-000002

[4100000070]    Previous volume:        zdlras1-osbmf-000002

[4100000071]    Volume sequence:        2

[4100000072]    Volume set owner:       root

[4100000073]    Volume set created:     Sun Dec 29 23:35:18 2019

[4100000074]    Volume set closes:      Wed Oct 19 00:35:18 2022 (no writes after this time)

[4000000075]    Media family:           zdlras1-osbmf

[4100000076]

[4100000077]    Volume set expires: never; content manages reuse

[4100000078]

[4100000079]Archive label:

[410000007A]    File number:            24

[410000007B]    File section:           1

[410000007C]    Owner:                  root

[410000007D]    Client host:            zdlras1n1

[410000007E]    Backup level:           0

[410000007F]    S/w compression:        no

[4100000080]    Archive created:        Sun Feb 02 20:48:38 2020

[4100000081]    Archive owner:          oracle (UUID bbad0004-01a9-1038-abb2-080027b2a93d)

[4100000082]    Owner class:            oracle (UUID a43aa638-01a9-1038-8809-080027b2a93d)

[4100000083]    Backup piece name:      RA_SBT_ORCL18C_16512_16513_2tunhkot_1_2_0_qEdAn

[4100000084]    Backup db name:         ORCL18C

[4100000085]    Backup db id:           558466555

[4100000086]    Backup copy number:     not applicable

[4100000087]    Backup content:         full

[4100000088]    Encryption:             off

[4100000089]    Catalog data:           no

[410000008A]    Backup image UUID:      8cb63f94-281a-1038-86d5-080027b2a93d

[410000008B]    Backup instance UUID:   8cb63f9e-281a-1038-86d5-080027b2a93d

[410000008C]    Backup instance created:Sun Feb 02 20:48:38 2020

[410000008D]

[410000008E]

[410000008E]End of tape has been reached.  Please wait while I rewind and unload the tape.

[410000008F]The Volume ID of the next tape to be written is zdlras1-osbmf-000007.

[4100000090]The tape has been unloaded.

[4100000090]

[4100000091]Volume label:

[4100000092]    Volume tag:             E01012L4

[4100000093]    Volume UUID:            93cb6548-281a-1038-86d5-080027b2a93d

[4100000094]    Volume ID:              zdlras1-osbmf-000007

[4100000095]    Volume set ID:          zdlras1-osbmf-000002

[4100000096]    Previous volume:        zdlras1-osbmf-000006

[4100000097]    Volume sequence:        3

[4100000098]    Volume set owner:       root

[4100000099]    Volume set created:     Sun Dec 29 23:35:18 2019

[410000009A]    Volume set closes:      Wed Oct 19 00:35:18 2022 (no writes after this time)

[400000009B]    Media family:           zdlras1-osbmf

[410000009C]

[410000009D]    Volume set expires: never; content manages reuse

[410000009E]

[410000009F]Archive label:

[41000000A0]    File number:            24

[41000000A1]    File section:           2

[41000000A2]    Owner:                  root

[41000000A3]    Client host:            zdlras1n1

[41000000A4]    Backup level:           0

[41000000A5]    S/w compression:        no

[41000000A6]    Archive created:        Sun Feb 02 20:48:38 2020

[41000000A7]    Archive owner:          oracle (UUID bbad0004-01a9-1038-abb2-080027b2a93d)

[41000000A8]    Owner class:            oracle (UUID a43aa638-01a9-1038-8809-080027b2a93d)

[41000000A9]    Backup piece name:      RA_SBT_ORCL18C_16512_16513_2tunhkot_1_2_0_qEdAn

[41000000AA]    Backup db name:         ORCL18C

[41000000AB]    Backup db id:           558466555

[41000000AC]    Backup copy number:     not applicable

[41000000AD]    Backup content:         full

[41000000AE]    Encryption:             off

[41000000AF]    Catalog data:           no

[41000000B0]    Backup image UUID:      8cb63f94-281a-1038-86d5-080027b2a93d

[41000000B1]    Backup instance UUID:   8cb63f9e-281a-1038-86d5-080027b2a93d

[41000000B2]    Backup instance created:Sun Feb 02 20:48:38 2020

[41000000B3]

[41000000B4]

[41000000B4]Backup statistics:

[41000000B5]status 0

[40000000B6]devices

[40000000B7] tape03

[41000000B8]

[40000000B9]volumes

[40000000BA] zdlras1-osbmf-000006

[41000000BB]

[40000000BC]voltags

[40000000BD] E01011L4

[41000000BE]

[41000000BF]file 24

[41000000C0]host zdlras1n1

[41000000C1]encryption off

[41000000C2]compression no

[41000000C3]start_time  Sun Feb 02 2020 at 20:48:38 (1580672918)

[41000000C4]end_time    Sun Feb 02 2020 at 20:49:36 (1580672976)

[41000000C5]backup_time Sun Feb 02 2020 at 20:48:38 (1580672918)

[41000000C6]dev_kbytes 769938

[41000000C7]dev_iosecs 53

[41000000C8]dev_iorate 14.9 MB/S

[41000000C9]wrt_iosecs 58

[41000000CA]wrt_iorate 13.6 MB/S

[41000000CB]physical_blks_written 376

[41000000CC]write_errors 0

[41000000CD]physical_blks_read 0

[41000000CE]read_errors 0

[41000000CF]error_rate 0%

[41000000D0]piece RA_SBT_ORCL18C_16512_16513_2tunhkot_1_2_0_qEdAn completed, status 0

[root@zdlras1n1 ~]#

Look that above you can see that one tape became full and OSB used another to finish.

After the Clone

Returning to rman to check the same list we have:

RMAN> list backupset 16512;







List of Backup Sets

===================







BS Key  Type LV Size

------- ---- -- ----------

16512   Incr 0  330.35M

  List of Datafiles in backup set 16512

  File LV Type Ckp SCN    Ckp Time            Abs Fuz SCN Sparse Name

  ---- -- ---- ---------- ------------------- ----------- ------ ----

  1    0  Incr 1909028    02/02/2020 16:09:01              NO    /u01/app/oracle/oradata/ORCL18C/system01.dbf




  Backup Set Copy #1 of backup set 16512

  Device Type Elapsed Time Completion Time     Compressed Tag

  ----------- ------------ ------------------- ---------- ---

  SBT_TAPE    03:40:36     02/02/2020 16:12:04 YES        BKP-DB-INC0




    List of Backup Pieces for backup set 16512 Copy #1

    BP Key  Pc# Status      Media                   Piece Name

    ------- --- ----------- ----------------------- ----------

    16513   1   AVAILABLE                           VB$_1891149551_16507I




  Backup Set Copy #2 of backup set 16512

  Device Type Elapsed Time Completion Time     Compressed Tag

  ----------- ------------ ------------------- ---------- ---

  SBT_TAPE    03:40:36     02/02/2020 19:49:37 NO         BKP-DB-INC0




    List of Backup Pieces for backup set 16512 Copy #2

    BP Key  Pc# Status      Media                   Piece Name

    ------- --- ----------- ----------------------- ----------

    16921   1   AVAILABLE   zdlras1-osbmf-000006,zdlras1-osbmf-000007 RA_SBT_ORCL18C_16512_16513_2tunhkot_1_2_0_qEdAn




RMAN>

As you can see, the backupset 16512 (of backup piece 16513) has two copies. One in ZDLRA database and another at OSB (look the media because of lack of space in one).

QUEUE_SBT_BACKUP_TASK

Another way to call manually the clone is to use the DBMS_RA.QUEUE_SBT_BACKUP_TASK. And in this case, everything that is linked with the template is cloned to tape. This a good way to force one clone one database entirely.

Look below that I have some archivedlogs that was not cloned yet:

RMAN> list copy of archivelog all;




List of Archived Log Copies for database with db_unique_name ORCL18C

=====================================================================




Key     Thrd Seq     S Low Time

------- ---- ------- - -------------------

16943   1    22      A 02/02/2020 16:28:28

        Name: /u01/app/oracle/oradata/ORCL18C/archivelog/2020_02_02/o1_mf_1_22_h3gbs5fk_.arc




16965   1    23      A 02/02/2020 21:03:49

        Name: /u01/app/oracle/oradata/ORCL18C/archivelog/2020_02_02/o1_mf_1_23_h3gcfwc8_.arc







RMAN> BACKUP DEVICE TYPE SBT ARCHIVELOG ALL DELETE ALL INPUT TAG 'BKP-ARCH';




Starting backup at 02/02/2020 21:15:09

current log archived

using channel ORA_SBT_TAPE_1

channel ORA_SBT_TAPE_1: starting archived log backup set

channel ORA_SBT_TAPE_1: specifying archived log(s) in backup set

input archived log thread=1 sequence=22 RECID=16 STAMP=1031346234

input archived log thread=1 sequence=23 RECID=17 STAMP=1031346892

input archived log thread=1 sequence=24 RECID=18 STAMP=1031346910

channel ORA_SBT_TAPE_1: starting piece 1 at 02/02/2020 21:15:11

channel ORA_SBT_TAPE_1: finished piece 1 at 02/02/2020 21:15:36

piece handle=ORCL18C_6nuni6mv_1_1 tag=BKP-ARCH comment=API Version 2.0,MMS Version 12.2.0.2

channel ORA_SBT_TAPE_1: backup set complete, elapsed time: 00:00:25

channel ORA_SBT_TAPE_1: deleting archived log(s)

archived log file name=/u01/app/oracle/oradata/ORCL18C/archivelog/2020_02_02/o1_mf_1_22_h3gbs5fk_.arc RECID=16 STAMP=1031346234

archived log file name=/u01/app/oracle/oradata/ORCL18C/archivelog/2020_02_02/o1_mf_1_23_h3gcfwc8_.arc RECID=17 STAMP=1031346892

archived log file name=/u01/app/oracle/oradata/ORCL18C/archivelog/2020_02_02/o1_mf_1_24_h3gcgg15_.arc RECID=18 STAMP=1031346910

Finished backup at 02/02/2020 21:15:36




Starting Control File and SPFILE Autobackup at 02/02/2020 21:15:37

piece handle=c-558466555-20200202-03 comment=API Version 2.0,MMS Version 12.2.0.2

Finished Control File and SPFILE Autobackup at 02/02/2020 21:15:54




RMAN> list backup of archivelog sequence between 22 and 24;







List of Backup Sets

===================







BS Key  Size       Device Type Elapsed Time Completion Time

------- ---------- ----------- ------------ -------------------

16991   25.25M     SBT_TAPE    00:00:20     02/02/2020 21:15:31

        BP Key: 16992   Status: AVAILABLE  Compressed: NO  Tag: BKP-ARCH

        Handle: ORCL18C_6nuni6mv_1_1   Media: Recovery Appliance (ZDLRAS1)




  List of Archived Logs in backup set 16991

  Thrd Seq     Low SCN    Low Time            Next SCN   Next Time

  ---- ------- ---------- ------------------- ---------- ---------

  1    22      1910746    02/02/2020 16:28:28 1923481    02/02/2020 21:03:49

  1    23      1923481    02/02/2020 21:03:49 1924479    02/02/2020 21:14:52

  1    24      1924479    02/02/2020 21:14:52 1924521    02/02/2020 21:15:09




RMAN>

And I called the template backup that I defined just for archivelogs (see my previous post) manually:

SQL> BEGIN

  2      DBMS_RA.QUEUE_SBT_BACKUP_TASK('ONLYTAPE_ARCH');

  3  END;

  4  /




PL/SQL procedure successfully completed.




SQL>

After that we can see the list backup again:

RMAN> list backup of archivelog sequence between 22 and 24;







List of Backup Sets

===================







BS Key  Size

------- ----------

16991   25.25M




  List of Archived Logs in backup set 16991

  Thrd Seq     Low SCN    Low Time            Next SCN   Next Time

  ---- ------- ---------- ------------------- ---------- ---------

  1    22      1910746    02/02/2020 16:28:28 1923481    02/02/2020 21:03:49

  1    23      1923481    02/02/2020 21:03:49 1924479    02/02/2020 21:14:52

  1    24      1924479    02/02/2020 21:14:52 1924521    02/02/2020 21:15:09




  Backup Set Copy #1 of backup set 16991

  Device Type Elapsed Time Completion Time     Compressed Tag

  ----------- ------------ ------------------- ---------- ---

  SBT_TAPE    00:29:22     02/02/2020 21:15:35 NO         BKP-ARCH




    List of Backup Pieces for backup set 16991 Copy #1

    BP Key  Pc# Status      Media                   Piece Name

    ------- --- ----------- ----------------------- ----------

    16992   1   AVAILABLE   Recovery Appliance (ZDLRAS1) ORCL18C_6nuni6mv_1_1




  Backup Set Copy #2 of backup set 16991

  Device Type Elapsed Time Completion Time     Compressed Tag

  ----------- ------------ ------------------- ---------- ---

  SBT_TAPE    00:29:22     02/02/2020 20:45:49 NO         BKP-ARCH




    List of Backup Pieces for backup set 16991 Copy #2

    BP Key  Pc# Status      Media                   Piece Name

    ------- --- ----------- ----------------------- ----------

    17065   1   AVAILABLE   zdlras1-osbmf-000005    RA_SBT_ORCL18C_558466555_16850_6nuni6mv_1_2_16991




RMAN>

And now we can see that we have a second copy for the backupset 16991.

Conclusion

Sometimes it is needed to manually clone the backups for tape. Remember that for full backups the ZDLRA only clones the last full backup. If you need, for some reason, clone other backups for long-term retention you can call manually the clone to tape. The options allow one to one clone, or the entire template.

Disclaimer: “The postings on this site are my own and don’t necessarily represent my actual employer positions, strategies or opinions. The information here was edited to be useful for general purpose, specific data and identifications were removed to allow reach the generic audience and to be useful for the community.”

ZDLRA, OSB and Clone to Tape

Category: Engineer System Author: Fernando Simon (Board Member) Date: 5 years ago Comments: 0

ZDLRA, OSB and Clone to Tape

As you saw in my last post, the configuration to enable clone to tape for ZDLRA it is not complicated, but you need to take care of some details to avoid errors. Besides that, ZDLRA relies on OSB to do that (when configured with native tape support) and this has some details that you need to be aware of.

In this post, I will show how the clone to tape works for ZDLRA. And how you can check some details about OSB.

Basic Config

Below you can see how I configured the clone to tape templates:

SQL> BEGIN

  2      DBMS_RA.CREATE_SBT_JOB_TEMPLATE (

  3          template_name => 'ONLYTAPE_FULL_ARCH',

  4          protection_policy_name => 'ZDLRA_ONLY_TAPE',

  5          attribute_set_name => 'bronzerepfull',

  6          backup_type => 'FULL,ARCH',

  7          priority => DBMS_RA.SBT_PRIORITY_LOW,

  8          window => INTERVAL '4' HOUR

  9      );

 10  END;

 11  /




PL/SQL procedure successfully completed.




SQL> BEGIN

  2      DBMS_RA.CREATE_SBT_JOB_TEMPLATE (

  3          template_name => 'ONLYTAPE_ARCH',

  4          protection_policy_name => 'ZDLRA_ONLY_TAPE',

  5          attribute_set_name => 'bronzerepfull',

  6          backup_type => 'ARCH',

  7          priority => DBMS_RA.SBT_PRIORITY_LOW,

  8          window => INTERVAL '4' HOUR

  9      );

 10  END;

 11  /




PL/SQL procedure successfully completed.




SQL>

Basic configuration for the protection policy ZDLRA_ONLY_TAPE and with FULL and ARCHIVELOGS clones. For more details you can check my previous post.

You can check the table RASYS.RA_SBT_JOB to check all the templates already configured:

SQL> set linesize 250

SQL> col LIB_NAME format a30

SQL> col TEMPLATE_NAME format a50

SQL> col LAST_SCHEDULE_TIME format a35

SQL> col POLICY_NAME format a50

SQL> select TEMPLATE_NAME, BACKUP_TYPE, LAST_SCHEDULE_TIME, LIB_NAME, DB_UNIQUE_NAME, POLICY_NAME FROM RA_SBT_JOB WHERE LIB_NAME = 'OSBSBT' ;




TEMPLATE_NAME                                      BACKUP_TYPE      LAST_SCHEDULE_TIME                  LIB_NAME                       DB_UNIQUE_NAME                 POLICY_NAME

-------------------------------------------------- ---------------- ----------------------------------- ------------------------------ ------------------------------ --------------------------------------------------

BRONZE_ARCH                                        ARCH             29-DEC-19 11.59.59.513917 PM +01:00 OSBSBT                                                        ZDLRA_BRONZE

BRONZE_FULL_ARCH                                   FULL, ARCH       03-JAN-20 11.10.27.411320 PM +01:00 OSBSBT                                                        ZDLRA_BRONZE

GOLDEN_ARCH                                        ARCH                                                 OSBSBT                                                        ZDLRA_GOLDEN

GOLDEN_FULL_ARCH                                   FULL, ARCH       01-JAN-20 11.10.47.987397 PM +01:00 OSBSBT                                                        ZDLRA_GOLDEN

ONLYTAPE_FULL_ARCH                                 FULL, ARCH                                           OSBSBT                                                        ZDLRA_ONLY_TAPE

ONLYTAPE_ARCH                                      ARCH                                                 OSBSBT                                                        ZDLRA_ONLY_TAPE




6 rows selected.




SQL>

Check that if the clone for this template was called you can see the information in the column LAST_SCHEDULE_TIME. And in this case, since I have not created the scheduler for this, it is empty.

To do that, just created the scheduler for that:

SQL> BEGIN

  2      DBMS_SCHEDULER.CREATE_JOB(

  3          job_name => 'SBTJOB_ONLYTAPE_FULL_ARCH',

  4          job_type => 'PLSQL_BLOCK',

  5          job_action => 'dbms_ra.queue_sbt_backup_task(''ONLYTAPE_FULL_ARCH'');',

  6          start_date => SYSDATE+(1/1440),

  7          enabled => TRUE,

  8          auto_drop => TRUE,

  9          repeat_interval => 'freq=WEEKLY; BYDAY=SUN; BYHOUR=20'

 10      );

 11  END;

 12  /




PL/SQL procedure successfully completed.




SQL> BEGIN

  2      DBMS_SCHEDULER.CREATE_JOB(

  3          job_name => 'SBTJOB_ONLYTAPE_ARCH',

  4          job_type => 'PLSQL_BLOCK',

  5          job_action => 'dbms_ra.queue_sbt_backup_task(''ONLYTAPE_ARCH'');',

  6          start_date => SYSDATE+(1/1440),

  7          enabled => TRUE,

  8          auto_drop => TRUE,

  9          repeat_interval => 'freq=HOURLY; BYMINUTE=40'

 10      );

 11  END;

 12  /




PL/SQL procedure successfully completed.




SQL>

Database Backup

For ZDLRA I have the database ORCL18C linked with this protection policy and I have some backups for this database. You can see below the backups for datafile 1:

RMAN> list backup of datafile 1;







List of Backup Sets

===================







BS Key  Type LV Size       Device Type Elapsed Time Completion Time

------- ---- -- ---------- ----------- ------------ -------------------

16512   Incr 0  330.35M    SBT_TAPE    00:03:03     02/02/2020 16:12:04

        BP Key: 16513   Status: AVAILABLE  Compressed: YES  Tag: BKP-DB-INC0

        Handle: VB$_1891149551_16507I   Media:

  List of Datafiles in backup set 16512

  File LV Type Ckp SCN    Ckp Time            Abs Fuz SCN Sparse Name

  ---- -- ---- ---------- ------------------- ----------- ------ ----

  1    0  Incr 1909028    02/02/2020 16:09:01              NO    /u01/app/oracle/oradata/ORCL18C/system01.dbf




BS Key  Type LV Size       Device Type Elapsed Time Completion Time

------- ---- -- ---------- ----------- ------------ -------------------

16673   Incr 1  96.00K     SBT_TAPE    00:00:04     02/02/2020 16:27:01

        BP Key: 16674   Status: AVAILABLE  Compressed: YES  Tag: BKP-DB

        Handle: VB$_1891149551_16669I   Media:

  List of Datafiles in backup set 16673

  File LV Type Ckp SCN    Ckp Time            Abs Fuz SCN Sparse Name

  ---- -- ---- ---------- ------------------- ----------- ------ ----

  1    1  Incr 1910623    02/02/2020 16:26:57              NO    /u01/app/oracle/oradata/ORCL18C/system01.dbf




BS Key  Type LV Size       Device Type Elapsed Time Completion Time

------- ---- -- ---------- ----------- ------------ -------------------

16701   Incr 0  329.22M    SBT_TAPE    00:00:04     02/02/2020 16:27:01

        BP Key: 16702   Status: AVAILABLE  Compressed: YES  Tag: BKP-DB

        Handle: VB$_1891149551_16669_1   Media:

  List of Datafiles in backup set 16701

  File LV Type Ckp SCN    Ckp Time            Abs Fuz SCN Sparse Name

  ---- -- ---- ---------- ------------------- ----------- ------ ----

  1    0  Incr 1910623    02/02/2020 16:26:57              NO    /u01/app/oracle/oradata/ORCL18C/system01.dbf




RMAN>

Check that I have the backupsets 16512 and 16701 that are the full backup. And the same for archivelogs.

During the Clone

When the scheduler trigger the clones, we can follow it inside of ZDLRA in several ways. But the point is that (as usual), for ZDLRA, everything is a task. So, we can see information in the tables RA_TASK and in RA_SBT_TASK.

You can create your sql’s to check what it is running, but look this example:

SQL> col state format a20

SQL> col ERROR_TEXT format a50

SQL> SELECT TASK_ID, STATE, DB_UNIQUE_NAME, ERROR_TEXT, BS_KEY, PIECE# FROM RA_SBT_TASK WHERE LIB_NAME = 'OSBSBT' AND  DB_UNIQUE_NAME = 'ORCL18C' ORDER BY DB_UNIQUE_NAME, BS_KEY, PIECE#;




   TASK_ID STATE                DB_UNIQUE_NAME                 ERROR_TEXT                                             BS_KEY     PIECE#

---------- -------------------- ------------------------------ -------------------------------------------------- ---------- ----------

     33698 EXECUTABLE           ORCL18C                                                                                16542          1

     33700 EXECUTABLE           ORCL18C                                                                                16565          1

     33702 EXECUTABLE           ORCL18C                                                                                16582          1

     33721 EXECUTABLE           ORCL18C                                                                                16618          1

     33704 EXECUTABLE           ORCL18C                                                                                16701          1

     33722 EXECUTABLE           ORCL18C                                                                                16706          1

     33706 EXECUTABLE           ORCL18C                                                                                16725          1

     33708 EXECUTABLE           ORCL18C                                                                                16734          1

     33710 EXECUTABLE           ORCL18C                                                                                16743          1

     33712 EXECUTABLE           ORCL18C                                                                                16752          1

     33714 EXECUTABLE           ORCL18C                                                                                16761          1




   TASK_ID STATE                DB_UNIQUE_NAME                 ERROR_TEXT                                             BS_KEY     PIECE#

---------- -------------------- ------------------------------ -------------------------------------------------- ---------- ----------

     33716 EXECUTABLE           ORCL18C                                                                                16782          1

     33718 EXECUTABLE           ORCL18C                                                                                16791          1

     33723 EXECUTABLE           ORCL18C                                                                                16795          1

     33720 EXECUTABLE           ORCL18C                                                                                16825          1




15 rows selected.




SQL>

The output above I picked up exactly when the job SBTJOB_ONLYTAPE_FULL_ARCH was running. You can see that ZDLRA put all the backupsets that was not cloned yet in the queue. At this moment, ZDLRA is creating the backupsets (since they exist jus in RA.VBDF table) and sending them (one to one) to OSB.

You can do the same for RA_TASK:

SQL> select TASK_ID, TASK_TYPE, STATE, WAITING_ON, DB_KEY, DB_UNIQUE_NAME, CREATION_TIME, ERROR_COUNT, INTERRUPT_COUNT, BP_KEY,BS_KEY,DF_KEY,VB_KEY from rasys.ra_task where DB_UNIQUE_NAME = 'ORCL18C' and task_type LIKE '%SBT' order by 5,2,7,10,11,12,13;




   TASK_ID TASK_TYPE                      STATE                     WAITING_ON     DB_KEY DB_UNIQUE_NAME                 CREATION_TIME                       ERROR_COUNT INTERRUPT_COUNT     BP_KEY     BS_KEY     DF_KEY     VB_KEY

---------- ------------------------------ ------------------------- ---------- ---------- ------------------------------ ----------------------------------- ----------- --------------- ---------- ---------- ---------- ----------

     33698 BACKUP_SBT                     RUNNING                                   15993 ORCL18C                        02-FEB-20 04.40.36.631549 PM +01:00           0               0                 16542

     33700 BACKUP_SBT                     RUNNING                                   15993 ORCL18C                        02-FEB-20 04.40.36.833354 PM +01:00           0               0                 16565

     33702 BACKUP_SBT                     RUNNING                                   15993 ORCL18C                        02-FEB-20 04.40.36.844599 PM +01:00           0               0                 16582

     33704 BACKUP_SBT                     EXECUTABLE                                15993 ORCL18C                        02-FEB-20 04.40.36.861902 PM +01:00           0               0                 16701

     33706 BACKUP_SBT                     EXECUTABLE                                15993 ORCL18C                        02-FEB-20 04.40.36.883175 PM +01:00           0               0                 16725

     33708 BACKUP_SBT                     EXECUTABLE                                15993 ORCL18C                        02-FEB-20 04.40.36.898718 PM +01:00           0               0                 16734

     33710 BACKUP_SBT                     EXECUTABLE                                15993 ORCL18C                        02-FEB-20 04.40.36.905336 PM +01:00           0               0                 16743

     33712 BACKUP_SBT                     EXECUTABLE                                15993 ORCL18C                        02-FEB-20 04.40.36.924090 PM +01:00           0               0                 16752

     33714 BACKUP_SBT                     EXECUTABLE                                15993 ORCL18C                        02-FEB-20 04.40.36.939241 PM +01:00           0               0                 16761

     33716 BACKUP_SBT                     EXECUTABLE                                15993 ORCL18C                        02-FEB-20 04.40.36.961104 PM +01:00           0               0                 16782

     33718 BACKUP_SBT                     EXECUTABLE                                15993 ORCL18C                        02-FEB-20 04.40.36.980517 PM +01:00           0               0                 16791




   TASK_ID TASK_TYPE                      STATE                     WAITING_ON     DB_KEY DB_UNIQUE_NAME                 CREATION_TIME                       ERROR_COUNT INTERRUPT_COUNT     BP_KEY     BS_KEY     DF_KEY     VB_KEY

---------- ------------------------------ ------------------------- ---------- ---------- ------------------------------ ----------------------------------- ----------- --------------- ---------- ---------- ---------- ----------

     33720 BACKUP_SBT                     EXECUTABLE                                15993 ORCL18C                        02-FEB-20 04.40.36.992346 PM +01:00           0               0                 16825

     33721 BACKUP_SBT                     EXECUTABLE                                15993 ORCL18C                        02-FEB-20 04.40.37.059282 PM +01:00           0               0                 16618

     33722 BACKUP_SBT                     EXECUTABLE                                15993 ORCL18C                        02-FEB-20 04.40.37.061375 PM +01:00           0               0                 16706

     33723 BACKUP_SBT                     EXECUTABLE                                15993 ORCL18C                        02-FEB-20 04.40.37.063679 PM +01:00           0               0                 16795

     33699 PLAN_SBT                       COMPLETED                                 15993 ORCL18C                        02-FEB-20 04.40.36.783661 PM +01:00           0               0                 16542      16014      16538

     33701 PLAN_SBT                       COMPLETED                                 15993 ORCL18C                        02-FEB-20 04.40.36.834255 PM +01:00           0               0                 16565      16012      16561

     33703 PLAN_SBT                       COMPLETED                                 15993 ORCL18C                        02-FEB-20 04.40.36.845368 PM +01:00           0               0                 16582      16018      16577

     33705 PLAN_SBT                       COMPLETED                                 15993 ORCL18C                        02-FEB-20 04.40.36.862820 PM +01:00           0               0                 16701      16006      16669

     33707 PLAN_SBT                       COMPLETED                                 15993 ORCL18C                        02-FEB-20 04.40.36.883803 PM +01:00           0               0                 16725      16008      16710

     33709 PLAN_SBT                       COMPLETED                                 15993 ORCL18C                        02-FEB-20 04.40.36.899324 PM +01:00           0               0                 16734      16022      16729

     33711 PLAN_SBT                       COMPLETED                                 15993 ORCL18C                        02-FEB-20 04.40.36.906037 PM +01:00           0               0                 16743      16010      16738




   TASK_ID TASK_TYPE                      STATE                     WAITING_ON     DB_KEY DB_UNIQUE_NAME                 CREATION_TIME                       ERROR_COUNT INTERRUPT_COUNT     BP_KEY     BS_KEY     DF_KEY     VB_KEY

---------- ------------------------------ ------------------------- ---------- ---------- ------------------------------ ----------------------------------- ----------- --------------- ---------- ---------- ---------- ----------

     33713 PLAN_SBT                       EXECUTABLE                                15993 ORCL18C                        02-FEB-20 04.40.36.924844 PM +01:00           0               0                 16752      16020      16747

     33715 PLAN_SBT                       EXECUTABLE                                15993 ORCL18C                        02-FEB-20 04.40.36.939997 PM +01:00           0               0                 16761      16024      16756

     33717 PLAN_SBT                       EXECUTABLE                                15993 ORCL18C                        02-FEB-20 04.40.36.961701 PM +01:00           0               0                 16782      16016      16777

     33719 PLAN_SBT                       EXECUTABLE                                15993 ORCL18C                        02-FEB-20 04.40.36.981277 PM +01:00           0               0                 16791      16026      16786




26 rows selected.




SQL>

Inside OSB

At OSB side we can check some information about how it is processing the clones. For OSB, everything is one job, and we can check in with obtool.

To check the jobs we execute obtool lsjob:

[root@zdlras1n1 ~]# obtool lsjob

Job ID           Sched time  Contents                       State

---------------- ----------- ------------------------------ ---------------------------------------

oracle/38        none        database ORCL18C (dbid=558466555) processed; Oracle job(s) scheduled

oracle/38.2      none        datafile backup                running since 2020/02/02.16:43

oracle/38.4      none        datafile backup                running since 2020/02/02.16:44

[root@zdlras1n1 ~]#

Some important information here. For every scheduler that was triggered by ZDLRA, we can see the job for each database, in this case, oracle/38. This is the master job, just to control what is happening. If you have more than one database in the same job being cloned, you will see more.

For each backupset that it is needed to be cloned, you will see child jobs (derived from the master). And for this, we can see logs from what is happening. All the logs for OSB are in the folder /usr/etc/ob/xcr/. Doing a cat from one finished job we can see useful information:

[root@zdlras1n1 ~]# cat /usr/etc/ob/xcr/[email protected]

[4100000001]2020/02/02.16:43:15 ______________________________________________________________________

[4100000001]2020/02/02.16:43:15

[4100000002]2020/02/02.16:43:15         Transcript for job oracle/38.2 running on zdlras1n1

[4100000002]2020/02/02.16:43:15

[3100000003]2020/02/02.16:43:15 Info: mount data verified.

[3100000004]2020/02/02.16:43:15 Info: volume in tape04 is usable for this operation.

[1100000005]2020/02/02.16:43:15 (pvfw) not at BOT

[1100000006]2020/02/02.16:43:15 (pvfw) previous state is invalid

[1100000007]2020/02/02.16:43:15 (alv)  backup image label is valid, file 21, section 1

[1100000008]2020/02/02.16:43:15 (pvfw) at OB EOD, returning (1)

[1100000009]2020/02/02.16:43:15 (pvfw) pos_vol_cleanup not returning pstate

[010000000A]2020/02/02.16:43:15 (dmap) tape04 success

[410000000B]Volume label:

[410000000C]    Volume tag:             E01011L4

[410000000D]    Volume UUID:            a1f8db28-109b-1038-847b-080027b2a93d

[410000000E]    Volume ID:              zdlras1-osbmf-000006

[410000000F]    Volume set ID:          zdlras1-osbmf-000002

[4100000010]    Previous volume:        zdlras1-osbmf-000002

[4100000011]    Volume sequence:        2

[4100000012]    Volume set owner:       root

[4100000013]    Volume set created:     Sun Dec 29 23:35:18 2019

[4100000014]    Volume set closes:      Wed Oct 19 00:35:18 2022 (no writes after this time)

[4000000015]    Media family:           zdlras1-osbmf

[4100000016]

[4100000017]    Volume set expires: never; content manages reuse

[4100000018]

[4100000019]Archive label:

[410000001A]    File number:            21

[410000001B]    File section:           1

[410000001C]    Owner:                  root

[410000001D]    Client host:            zdlras1n1

[410000001E]    Backup level:           0

[410000001F]    S/w compression:        no

[4100000020]    Archive created:        Sun Feb 02 16:43:15 2020

[4100000021]    Archive owner:          oracle (UUID bbad0004-01a9-1038-abb2-080027b2a93d)

[4100000022]    Owner class:            oracle (UUID a43aa638-01a9-1038-8809-080027b2a93d)

[4100000023]    Backup piece name:      RA_SBT_ORCL18C_558466555_16849_rpunhlqh_1_2_16701

[4100000024]    Backup db name:         ORCL18C

[4100000025]    Backup db id:           558466555

[4100000026]    Backup copy number:     not applicable

[4100000027]    Backup content:         full

[4100000028]    Encryption:             off

[4100000029]    Catalog data:           no

[410000002A]    Backup image UUID:      4599459c-27f8-1038-adc6-080027b2a93d

[410000002B]    Backup instance UUID:   459945b0-27f8-1038-adc6-080027b2a93d

[410000002C]    Backup instance created:Sun Feb 02 16:43:15 2020

[410000002D]

[410000002E]

[410000002E]Backup statistics:

[410000002F]status 0

[4000000030]devices

[4000000031] tape04

[4100000032]

[4000000033]volumes

[4000000034] zdlras1-osbmf-000006

[4100000035]

[4000000036]voltags

[4000000037] E01011L4

[4100000038]

[4100000039]file 21

[410000003A]host zdlras1n1

[410000003B]encryption off

[410000003C]compression no

[410000003D]start_time  Sun Feb 02 2020 at 16:43:15 (1580658195)

[410000003E]end_time    Sun Feb 02 2020 at 16:44:20 (1580658260)

[410000003F]backup_time Sun Feb 02 2020 at 16:43:15 (1580658195)

[4100000040]dev_kbytes 753036

[4100000041]dev_iosecs 60

[4100000042]dev_iorate 12.9 MB/S

[4100000043]wrt_iosecs 65

[4100000044]wrt_iorate 11.9 MB/S

[4100000045]physical_blks_written 368

[4100000046]write_errors 0

[4100000047]physical_blks_read 0

[4100000048]read_errors 0

[4100000049]error_rate 0%

[410000004A]piece RA_SBT_ORCL18C_558466555_16849_rpunhlqh_1_2_16701 completed, status 0

[root@zdlras1n1 ~]#

Some important details from there:

Volume tag: This is the tag for your tape. This can from the barcode read by OSB
Backup piece name: The name that was created and identify the file inside OSB
wrt_iorate: This is the speed that OSB wrote into the tape
write_errors: Errors that occurred while writing.

If that you can create some scripts to verify details from your OSBD clones:

[root@zdlras1n1 ~]# cat /usr/etc/ob/xcr/oracle@38.* |grep "Volume tag"

[410000000C]    Volume tag:             E01010L4

[410000000C]    Volume tag:             E01011L4

[410000000C]    Volume tag:             E01009L4

[410000000C]    Volume tag:             E01010L4

[410000000C]    Volume tag:             E01010L4

[410000000C]    Volume tag:             E01011L4

[410000000C]    Volume tag:             E01009L4

[root@zdlras1n1 ~]#

As told before (in my previous post), the integration between OSB and ZDLRA is not the best (to be honest). For ZDLRA OSB is one external tool. As an example, from ZDLRA library (cli or in cloud control) you can’t check OSB jobs and other details (like if the library has free tapes). If the clone fails, ZDLRA does not provide a good report and you need to go to CLI and use obtool. Look below that OSB is a separate box (image took from ZDLRA Administrator’s Guide).

Useful OSB commands

You can see more information about the devices using the obtool lsdev. Here I have 4 drivers:

[root@zdlras1n1 ~]# obtool lsdev

library             lib01            in service

  drive 1           tape01           in service

  drive 2           tape02           in service

  drive 3           tape03           in service

  drive 4           tape04           in service

[root@zdlras1n1 ~]#

And about the library (listing tapes) with obtool lsvol. To verify if you have free tapes inside of OSB library, as an example:

[root@zdlras1n1 ~]# obtool lsvol -L lib01

Inventory of library lib01:

    in    1:             unlabeled, barcode E01001L4, readonly

    in    2:             unlabeled, barcode E01002L4, readonly

    in    3:             unlabeled, barcode E01003L4, readonly

    in    4:             unlabeled, barcode E01004L4, readonly

    in    5:             volume zdlras1-osbmf-000001, barcode E01005L4, full, content manages reuse

    in    6:             volume zdlras1-osbmf-000002, barcode E01006L4, full, content manages reuse, mediainfo hw encryptable

    in    7:             volume zdlras1-osbmf-000003, barcode E01007L4, full, content manages reuse, mediainfo hw encryptable

    in    8:             volume RMAN-DEFAULT-000001, barcode E01008L4, content manages reuse

    in    12:            barcode E01012L4

    in    13:            barcode E01013L4

    in    14:            barcode E01014L4

    in    15:            barcode E01015L4

    in    16:            barcode E01016L4

    in    17:            barcode E01017L4

    in    18:            barcode E01018L4

    in    19:            barcode E01019L4

    in    20:            barcode E01020L4

    in    22:            barcode CLN101L4

    in    23:            barcode CLN102L5

    in    30:            barcode F01030L5

    in    31:            barcode F01031L5

    in    32:            barcode F01032L5

    in    33:            barcode F01033L5

    in    34:            barcode F01034L5

    in    35:            barcode F01035L5

    in    36:            barcode F01036L5

    in    37:            barcode F01037L5

    in    38:            barcode F01038L5

    in    39:            barcode F01039L5

    in    dte1:          volume zdlras1-osbmf-000004, barcode E01009L4, 233472 kb remaining, content manages reuse, lastse 9, mediainfo hw encryptable

    in    dte3:          volume zdlras1-osbmf-000005, barcode E01010L4, 159744 kb remaining, content manages reuse, lastse 10

    in    dte4:          volume zdlras1-osbmf-000006, barcode E01011L4, 18432 kb remaining, content manages reuse, lastse 11

[root@zdlras1n1 ~]#

And if you want to check for some volume obtool lsvol -b can be used. This is important to see everything that is inside one tape in specific:

[root@zdlras1n1 ~]# obtool lsvol -b E01011L4

    VOID    OOID Seq Volume ID          Barcode     Family      Created     Attributes

     157     157   2 zdlras1-osbmf-000006 E01011L4    zdlras1-osbmf 12/29.23:35 open; closes 2022/10/18; content manages reuse

[root@zdlras1n1 ~]#

And for the OSB volume that it inside of this tape we can use obtool lsbkup:

[root@zdlras1n1 ~]# obtool lsbkup -l -c zdlras1-osbmf-000006

Backup image name:    zdlras1n1-20200103-221130

    Type:                   Oracle database

    Client:                 zdlras1n1

    Backup piece name:      RA_SBT_ORCL19_13806_13805_nrul3953_1_2_0_MMNYx

    Database:               ORCL19

    Content:                full

    Size:                   768.2 MB

    Backup owner:           oracle

    Owner class:            oracle

    Backup date and time:   2020/01/03.23:11

    Created by job:         oracle/36.1

    UUID:                   89a815e8-109b-1038-847b-080027b2a93d

Backup image name:    zdlras1n1-20200202-154208

    Type:                   Oracle database

    Client:                 zdlras1n1

    Backup piece name:      RA_SBT_ORCL18C_558466555_16849_33unhlfr_1_2_16565

    Database:               ORCL18C

    Content:                full

    Size:                   240.8 MB

    Backup owner:           oracle

    Owner class:            oracle

    Backup date and time:   2020/02/02.16:42

    Created by job:         oracle/37.2

    UUID:                   1ef6b2ee-27f8-1038-995e-080027b2a93d

Backup image name:    zdlras1n1-20200202-154315

    Type:                   Oracle database

    Client:                 zdlras1n1

    Backup piece name:      RA_SBT_ORCL18C_558466555_16849_rpunhlqh_1_2_16701

    Database:               ORCL18C

    Content:                full

    Size:                   735.4 MB

    Backup owner:           oracle

    Owner class:            oracle

    Backup date and time:   2020/02/02.16:43

    Created by job:         oracle/38.2

    UUID:                   4599459c-27f8-1038-adc6-080027b2a93d

Backup image name:    zdlras1n1-20200202-154431

    Type:                   Oracle database

    Client:                 zdlras1n1

    Backup piece name:      RA_SBT_ORCL18C_558466555_16849_f2unhlqu_1_2_16782

    Database:               ORCL18C

    Content:                full

    Size:                   1.2 MB

    Backup owner:           oracle

    Owner class:            oracle

    Backup date and time:   2020/02/02.16:44

    Created by job:         oracle/38.6

    UUID:                   72577360-27f8-1038-8987-080027b2a93d

Backup image name:    zdlras1n1-20200202-154504

    Type:                   Oracle database

    Client:                 zdlras1n1

    Backup piece name:      RA_SBT_c-558466555-20200202-00

    Database:               ORCL18C

    Content:                full

    Size:                   18.1 MB

    Backup owner:           oracle

    Owner class:            oracle

    Backup date and time:   2020/02/02.16:45

    Created by job:         oracle/39.2

    UUID:                   85c228c8-27f8-1038-bcf9-080027b2a93d

[root@zdlras1n1 ~]#

As you can see we can use these commands to check what a lot of details about the volumes and tapes. In the command above you can see that inside of the same volume store backups from different databases. This occurs because of the share the same attribute inside ZDLRA, and there, I specified only one volume.

If you want to see some details about one database you can use obtool lspiece:

[root@zdlras1n1 ~]# obtool lspiece -d ORCL18C

    POID Database   Content    Copy Created      Host             Piece name

     190 ORCL18C    full          0 02/02.16:42  zdlras1n1        RA_SBT_ORCL18C_558466555_16849_34unhlje_1_2_16582

     191 ORCL18C    full          0 02/02.16:42  zdlras1n1        RA_SBT_ORCL18C_558466555_16849_33unhlfr_1_2_16565

     192 ORCL18C    full          0 02/02.16:42  zdlras1n1        RA_SBT_ORCL18C_558466555_16849_30unhl75_1_2_16542

     193 ORCL18C    full          0 02/02.16:43  zdlras1n1        RA_SBT_ORCL18C_558466555_16849_cqunhlqk_1_2_16725

     194 ORCL18C    full          0 02/02.16:43  zdlras1n1        RA_SBT_ORCL18C_558466555_16849_obunhlqn_1_2_16734

     195 ORCL18C    full          0 02/02.16:44  zdlras1n1        RA_SBT_ORCL18C_558466555_16849_5dunhlqo_1_2_16743

     196 ORCL18C    full          0 02/02.16:43  zdlras1n1        RA_SBT_ORCL18C_558466555_16849_rpunhlqh_1_2_16701

     197 ORCL18C    full          0 02/02.16:44  zdlras1n1        RA_SBT_ORCL18C_558466555_16849_vuunhlqp_1_2_16752

     198 ORCL18C    full          0 02/02.16:44  zdlras1n1        RA_SBT_ORCL18C_558466555_16849_f2unhlqu_1_2_16782

     199 ORCL18C    full          0 02/02.16:44  zdlras1n1        RA_SBT_ORCL18C_558466555_16849_h1unhlqt_1_2_16761

     200 ORCL18C    full          0 02/02.16:44  zdlras1n1        RA_SBT_ORCL18C_558466555_16849_9junhlqv_1_2_16791

     201 ORCL18C    full          0 02/02.16:45  zdlras1n1        RA_SBT_c-558466555-20200202-00

     202 ORCL18C    full          0 02/02.16:45  zdlras1n1        RA_SBT_c-558466555-20200202-02

     203 ORCL18C    full          0 02/02.16:45  zdlras1n1        RA_SBT_c-558466555-20200202-01

     204 ORCL18C    full          0 02/02.16:45  zdlras1n1        RA_SBT_ORCL18C_558466555_16849_6lunhltm_1_2_16795

[root@zdlras1n1 ~]#

For job list you can use the obtool lsjob -A:

[root@zdlras1n1 ~]# obtool lsjob -A

Job ID           Sched time  Contents                       State

---------------- ----------- ------------------------------ ---------------------------------------

oracle/36        none        database ORCL19 (dbid=323177095) completed successfully at 2020/01/03.23:12

oracle/36.1      none        datafile backup                completed successfully at 2020/01/03.23:12

oracle/37        none        database ORCL18C (dbid=558466555) completed successfully at 2020/02/02.16:42

oracle/37.1      none        datafile backup                completed successfully at 2020/02/02.16:42

oracle/37.2      none        datafile backup                completed successfully at 2020/02/02.16:42

oracle/37.3      none        datafile backup                completed successfully at 2020/02/02.16:42

oracle/38        none        database ORCL18C (dbid=558466555) completed successfully at 2020/02/02.16:44

oracle/38.1      none        datafile backup                completed successfully at 2020/02/02.16:43

oracle/38.2      none        datafile backup                completed successfully at 2020/02/02.16:44

oracle/38.3      none        datafile backup                completed successfully at 2020/02/02.16:43

oracle/38.4      none        datafile backup                completed successfully at 2020/02/02.16:44

oracle/38.5      none        datafile backup                completed successfully at 2020/02/02.16:44

oracle/38.6      none        datafile backup                completed successfully at 2020/02/02.16:44

oracle/38.7      none        datafile backup                completed successfully at 2020/02/02.16:44

oracle/39        none        database ORCL18C (dbid=558466555) completed successfully at 2020/02/02.16:45

oracle/39.1      none        datafile backup                completed successfully at 2020/02/02.16:45

oracle/39.2      none        datafile backup                completed successfully at 2020/02/02.16:45

oracle/39.3      none        datafile backup                completed successfully at 2020/02/02.16:45

oracle/40        none        database ORCL18C (dbid=558466555) completed successfully at 2020/02/02.16:45

oracle/40.1      none        datafile backup                completed successfully at 2020/02/02.16:45

oracle/41        none        database ORCL18C (dbid=558466555) completed successfully at 2020/02/02.16:45

oracle/41.1      none        datafile backup                completed successfully at 2020/02/02.16:45

[root@zdlras1n1 ~]#

As told, with these commands you can verify allot of useful info about the health of your clone to tapes at OSB side. And unfortunately, it is the only way to do that since it can be done using ZDLRA CLI.

After clone

After the clone job finish you can see that a new copy for the backupset appears in the rman catalog:

RMAN> list backup of datafile 1;







List of Backup Sets

===================







BS Key  Type LV Size       Device Type Elapsed Time Completion Time

------- ---- -- ---------- ----------- ------------ -------------------

16512   Incr 0  330.35M    SBT_TAPE    00:03:03     02/02/2020 16:12:04

        BP Key: 16513   Status: AVAILABLE  Compressed: YES  Tag: BKP-DB-INC0

        Handle: VB$_1891149551_16507I   Media:

  List of Datafiles in backup set 16512

  File LV Type Ckp SCN    Ckp Time            Abs Fuz SCN Sparse Name

  ---- -- ---- ---------- ------------------- ----------- ------ ----

  1    0  Incr 1909028    02/02/2020 16:09:01              NO    /u01/app/oracle/oradata/ORCL18C/system01.dbf




BS Key  Type LV Size       Device Type Elapsed Time Completion Time

------- ---- -- ---------- ----------- ------------ -------------------

16673   Incr 1  96.00K     SBT_TAPE    00:00:04     02/02/2020 16:27:01

        BP Key: 16674   Status: AVAILABLE  Compressed: YES  Tag: BKP-DB

        Handle: VB$_1891149551_16669I   Media:

  List of Datafiles in backup set 16673

  File LV Type Ckp SCN    Ckp Time            Abs Fuz SCN Sparse Name

  ---- -- ---- ---------- ------------------- ----------- ------ ----

  1    1  Incr 1910623    02/02/2020 16:26:57              NO    /u01/app/oracle/oradata/ORCL18C/system01.dbf




BS Key  Type LV Size

------- ---- -- ----------

16701   Incr 0  329.22M

  List of Datafiles in backup set 16701

  File LV Type Ckp SCN    Ckp Time            Abs Fuz SCN Sparse Name

  ---- -- ---- ---------- ------------------- ----------- ------ ----

  1    0  Incr 1910623    02/02/2020 16:26:57              NO    /u01/app/oracle/oradata/ORCL18C/system01.dbf




  Backup Set Copy #1 of backup set 16701

  Device Type Elapsed Time Completion Time     Compressed Tag

  ----------- ------------ ------------------- ---------- ---

  SBT_TAPE    00:42:37     02/02/2020 16:27:01 YES        BKP-DB




    List of Backup Pieces for backup set 16701 Copy #1

    BP Key  Pc# Status      Media                   Piece Name

    ------- --- ----------- ----------------------- ----------

    16702   1   AVAILABLE                           VB$_1891149551_16669_1




  Backup Set Copy #2 of backup set 16701

  Device Type Elapsed Time Completion Time     Compressed Tag

  ----------- ------------ ------------------- ---------- ---

  SBT_TAPE    00:42:37     02/02/2020 15:44:20 NO         BKP-DB




    List of Backup Pieces for backup set 16701 Copy #2

    BP Key  Pc# Status      Media                   Piece Name

    ------- --- ----------- ----------------------- ----------

    16876   1   AVAILABLE   zdlras1-osbmf-000006    RA_SBT_ORCL18C_558466555_16849_rpunhlqh_1_2_16701




RMAN>

Look that now the backupset 16701 has two copies. One inside ZDLRA database and others in the tape. And as you can see above, just the last full was cloned to tape. This is the default behavior for ZDLRA, just the last full is cloned to tape. I explained more in the previous post.

For archivelogs the process is a little different, but basically clones everything that was not yet. And if you have real-time redo you will have a lot of clones (since each one has a different backupset).

The management of the rman catalog is done automatically by ZDLRA. If your backup passes over the retention for recovery_window_goal (that manages just what is inside ZDLRA database), the copy will disappear from the list. The same occurs for cloned backups that follow the recovery_window_sbt parameter for your policy. The same is done when the clone is made, the new copy appears automatically.

ZDLRA, Configuring Tape Library

Category: Engineer System Author: Fernando Simon (Board Member) Date: 5 years ago Comments: 0

ZDLRA, Configuring Tape Library

With ZDLRA you can clone your backups to tape using two ways. The first is using third-party software and the second is using Oracle Secure Backup (OSB). This integration from ZDLRA and OSB is the native way to do that.

Cloning backups to tape will help to offload backups from ZDLRA (reducing the space usage if you need to sustain long recovery windows), and add another layer of protection (since you can put tapes in a third site).

Here I will show how easy is to configure the OSB backup and how to integrate it into your backup policy.

TAPE and ZDLRA

Tape backups are recommended for long-term storage, going to years and decades of storage. For ZDLRA itself, when using OSB, it turns itself as Media Manager too. This means that all the operations are realized by ZDLRA, tape load, rewind, cleans. Everything.

Using OSB as a tape library for ZDLRA, the communication with tape is done trough SAN, with direct access. This means that every database node of ZDLRA have SAN card, and you can connect to tape directly, or using SAN switch. All the cards have two links (4 as total) and you can configure as failover access (in each node).

All the tasks for tape are done with OSB (load, unload, overwrite), so, we can use commands like obtool lsdev, obtool lsvol and others to check the library and tapes.

Since OSB is external and exists outside the ZDLRA world there are some details that need be explained. As explained in previous posts (ZDLRA Internals, Virtual Full Backup), ZDLRA deconstructs the ingested backup and this database internal tables are not compatible with OSB. OSB not connect inside of the ZDLRA database and copy the backups.

So (copied directly from documentation), when Recovery Appliance executes a copy-to-tape job for a virtual full backup, *it constructs the physical backup sets, and copies them to tape, and then writes the metadata to the recovery catalog. If desired, the Recovery Appliance can also copy successive incremental backups and archived redo log file backups to tape. Whereas the backup on the Recovery Appliance is virtual, the backup on tape is a non-virtual**, full physical backup. The Recovery Appliance automatically handles requests to restore backups from tape, with no need for administrator intervention*.

The info above is clear, but basically this means that ZDLRA constructs a new backupset and sends it to OSB to copy to tape. If you are sending a full backup to tape (and your backup has 14TB as an example) be aware of the time consumption to do that. Another point is that ZDLRA will automatically update the rman catalog for your database when the copy is finished at the tape, the backupset will appears inside the rman catalog.

One important point here it is there OSB manages totally the tape communication. If you receive one backupset to clone, you can’t go to ZDLRA and pause the copy for this backupset (you can pause the job that triggered the copy inside ZDLRA, but no the copy that already started). Consider that OSB as external software, where you send something to copy and the result will be OK or ERROR.

We can use some commands using obtool, but since it is integrated/managed by ZDLRA, it is not allowed to change the configuration. Commands like lsjob, lsdev, lsinventory, unlabelvol can be used.

Be careful when managing tapes because their communication is only from ZDLRA to OSB. If you clean one tape at OSB side, ZDLRA will not be aware of that and (if you have some not expired backupset there) you can lose some data.

Third-party Libraries

If you are not using OSB you can use external/third-party libraries to offload to tape. In this case, it needs to be compatible with RMAN. Basically, when you create the library inside ZDLRA, you point to the operational system client library (installed in each database node) and follow the normal procedure as OSB.

Be aware that you need to allow in your third-party software, that the client can send the backups to do the server. The start of the clone always will start at the ZDLRA side (and not called by third-party software).

CREATE_SBT_LIBRARY

In this post I will use OSB as a clone library, so, some details will be relevant just for OSB. But the process for creating the library will be similar (just adjust some parameters).

The first step is to create the library using the DBMS_RA.CREATE_SBT_LIBRARY:

SQL> BEGIN

  2      DBMS_RA.CREATE_SBT_LIBRARY(

  3          lib_name => 'osbsbt',

  4          drives => 4,

  5          restore_drives => 1,

  6          parms => 'SBT_LIBRARY=libobk.so'

  7      );

  8  END;

  9  /




PL/SQL procedure successfully completed.




SQL>

The parameters are:

lib_name: Name that identify the library inside ZDRLA. It is unique name
drivers: Number of drivers that are available for backup usage
restore_drivers: Number of the driver used for restore. Will be always a subset of the drivers If not specified, ZDLRA can use all drivers to do the restore.
parms: here you can specify some parameters that are passed when to call the library.

Remember that the tape library needs to be compatible with RMAN? So, the parms and send attributes for the procedure are used as parameters for ALLOCATE CHANNEL when ZDLRA calls the tape library. If you need to define some specific parameters (like tapes groups, client names, or whatever), you need to define it here. This follows the same format as PARMS in rman ALLOCATE CHANNEL.

After that, you can create attributes for your tape using the CREATE_SBT_ATTRIBUTE_SET. These attributes are important if you want to segregate some backups. As an example, you can create one attribute to send some backups for specific media families. So, if you have tapes to store backups after some application (or company milestones) you can create specific attributes.

SQL> BEGIN

  2      DBMS_RA.CREATE_SBT_ATTRIBUTE_SET(

  3          lib_name => 'osbsbt',

  4          attribute_set_name => 'bronzerepfull',

  5          streams => 4,

  6          parms => 'ENV=(OB_MEDIA_FAMILY=zdlras1-osbmf)'

  7      );

  8  END;

  9  /




PL/SQL procedure successfully completed.




SQL>

Here, I defined that the attribute bronzerepfull for my library osbsbt. That means (for OSB) that will use the media family zdlras1-osbmf to store the backups. Again, the parms and send use the format for rman PARMS at ALLOCATE CHANNEL in rman.

One detail important it is that streams attribute needs to be less or equal than drivers value that you specified during CREATE_SBT_LIBRARY commands.

If you check the RASYS.RA_SBT_LIBRARY table the library is there:

SQL> col LIB_NAME format a30

SQL> col LAST_ERROR_TEXT format a30

SQL> SELECT LIB_NAME, LAST_ERROR_TEXT, STATUS FROM RA_SBT_LIBRARY;




LIB_NAME                       LAST_ERROR_TEXT                STATU

------------------------------ ------------------------------ -----

OSBSBT                                                        READY

REP$LIB_9043_ZDLRAS2_REP                                      READY




SQL>

If you are starting the ZDLRA and it already will come with OSB integration the steps above are not needed. Will be configured directly during the deployment phase of ZDLRA.

Clone to Tape Jobs

The clone to tape jobs inside of ZDLRA is managed entirely with DBMS_SCHEDULER and for protection policies or db_unique_name as a parameter. This means that unfortunately, you need to take care of the overlaps of the clone to tape jobs

If, as an example, you defined the clone for BRONZE policies to start from 10:00 until 11:00 and the SILVER to start at 11:00. And your backup for BRONZE took more time to execute and pass over 11:00, you will see some contentions.

DBMS_RA.CREATE_SBT_JOB_TEMPLATE

The first step to clone to tape is to define what and how to do that. For ZDLRA we can clone full, incremental, and archivelogs backups. Besides that, we can define the templates for protection policies or database unique names.

SQL> BEGIN

  2      DBMS_RA.CREATE_SBT_JOB_TEMPLATE (

  3          template_name => 'BRONZE_FULL_ARCH',

  4          protection_policy_name => 'ZDLRA_BRONZE',

  5          attribute_set_name => 'bronzerepfull',

  6          backup_type => 'FULL,ARCH',

  7          priority => DBMS_RA.SBT_PRIORITY_LOW,

  8          window => INTERVAL '4' HOUR

  9      );

 10  END;

 11  /




PL/SQL procedure successfully completed.




SQL> BEGIN

  2      DBMS_RA.CREATE_SBT_JOB_TEMPLATE (

  3          template_name => 'BRONZE_ARCH',

  4          protection_policy_name => 'ZDLRA_BRONZE',

  5          attribute_set_name => 'bronzerepfull',

  6          backup_type => 'ARCH',

  7          priority => DBMS_RA.SBT_PRIORITY_LOW,

  8          window => INTERVAL '4' HOUR

  9      );

 10  END;

 11  /




PL/SQL procedure successfully completed.




SQL>

Basically, with CREATE_SBT_JOB_TEMPLATE, we define what we want to backup (database or policy) or which type of backup we want (full, archive).

Another important definition it is the priority between the templates. So, in case of your backup took more time than expected, the priority helps ZDLRA to automatically handle the overlaps of clones times. You can use the window to specify the maximum time that clone can runs, after that the backupset will put in the queue to be copy in the next run.

In the example above you can see that:

BRONZE_FULL_ARCH: Template that defines that FULL and ARCHIVELOGS backups will be copied at the same time to tape. We will use the SBT attribute bronzerepfull that we defined previously. And have 4 hours to maximum runs and have LOW priority.
BRONZE_ARCH: Template that defines that just ARCHIVELOGS backups will be copied at the same time to tape. We will use the SBT attribute bronzerepfull that we defined previously. And have 4 hours to maximum runs and have LOW priority.

Important to remember that ZDLRA always copies the last backupset available. This means that for a full backup, it will be always the last that was not copied since the last execution of your job. For archivelogs, it will be from the last clone to tape (even if the last copied just the archivelogs). One example is that you execute every day one copy per day for your database at 22:00, but you do two incremental backups (one 11:00 and other 19:00), just the full backup generated because the incremental backup at 19:00 will be cloned to tape.

For archivelogs, the rule is copy all since the last clone to tape. This guarantees that you always have the archivelogs needed to recover the database. One point that needs attention is when using real-time redo, in this case, each backup of archivelog is unique (different than we execute filesperset 1000 – which will have one backupset with 1000 archivelogs inside). This means that if you generate 2000 archivelogs per day, you will need to clone these files one to one, and for tape (depending the tape type), can took long time because will be (inside tape) 2000 different files (think about the process for tape, where each file need a mark where start and end – this took a lot of time consumption for tape process).

It is possible to create a template with TAG’s too. In this case, just backups with specific TAGS (from rman) will be cloned to tape. This is useful when you need to clone specific backups for long retention schemas.

DBMS_SCHEDULER.CREATE_JOB

After creating the templates it is possible to schedule it using the database scheduler. Is used the normal scheduler, setting the desired template and the desired time that will be called.

As an example:

SQL> BEGIN

  2      DBMS_SCHEDULER.CREATE_JOB(

  3          job_name => 'SBTJOB_BRONZE_FULL_ARCH',

  4          job_type => 'PLSQL_BLOCK',

  5          job_action => 'dbms_ra.queue_sbt_backup_task(''BRONZE_FULL_ARCH'');',

  6          start_date => SYSDATE+(1/1440),

  7          enabled => TRUE,

  8          auto_drop => TRUE,

  9          repeat_interval => 'freq=WEEKLY; BYDAY=SUN; BYHOUR=20'

 10      );

 11  END;

 12  /




PL/SQL procedure successfully completed.




SQL>

We can see that it is a normal schedule that calls one PLSQL_BLOCK, and the block it is dbms_ra.queue_sbt_backup_task that will inform the internal ZDLRA library to generate the backupset and send to OSB. As you can see, the parameter for the function is the template name created before.

The start_date and repeat_internal are defined as is for the normal scheduler. In this particular example, the backup will be every week, Sunday at 20:00.

And translating everything this means that I will have, in tape, one full backup (plus archivelog) for each database that it is part of BRONZE protection policy. And this will occurs every Sunday at 20:00. The library will be OSB and the media family to store the backups will be zdlras1-osbmf.

Another example:

SQL> BEGIN

  2      DBMS_SCHEDULER.CREATE_JOB(

  3          job_name => 'SBTJOB_BRONZE_ARCH',

  4          job_type => 'PLSQL_BLOCK',

  5          job_action => 'dbms_ra.queue_sbt_backup_task(''BRONZE_ARCH'');',

  6          start_date => SYSDATE+(1/1440),

  7          enabled => TRUE,

  8          auto_drop => TRUE,

  9          repeat_interval => 'freq=HOURLY; BYMINUTE=40'

 10      );

 11  END;

 12  /




PL/SQL procedure successfully completed.




SQL>

And some examples of repeat intervals:

repeat_interval => ‘freq=HOURLY; BYMINUTE=0’: Every hour as *:00
repeat_interval => ‘FREQ=DAILY; INTERNAL=1; BYHOUR=20; BYMINUTE=15’: Every day at 20:15
repeat_interval => ‘FREQ=DAILY; BYDAY=MON,TUE,WED,THU,FRI; BYHOUR=7,8,9,10,11,12,13,14,15,16,17,18; BYMINUTE=05’: Monday to Friday, from 08-18 at 05 minute.
repeat_interval => ‘freq=DAILY; BYHOUR=22; BYDAY=MON,TUE,WED,THU,FRI,SAT,SUN’: Another way to specify every day at 22:00
repeat_interval => ‘freq=WEEKLY; BYDAY=SAT; BYHOUR=23’: Every week at 20:00

Cloned Backups

After configure the scheduler, and it is called we can see this in rman:

RMAN> list backup of archivelog sequence 92;







List of Backup Sets

===================







BS Key  Size

------- ----------

10082   2.80M




  List of Archived Logs in backup set 10082

  Thrd Seq     Low SCN    Low Time            Next SCN   Next Time

  ---- ------- ---------- ------------------- ---------- ---------

  1    92      2520099    01/01/2020 22:31:51 2530992    01/01/2020 23:29:08




  Backup Set Copy #1 of backup set 10082

  Device Type Elapsed Time Completion Time     Compressed Tag

  ----------- ------------ ------------------- ---------- ---

  SBT_TAPE    00:43:14     01/01/2020 23:29:24 YES        TAG20200101T232923




    List of Backup Pieces for backup set 10082 Copy #1

    BP Key  Pc# Status      Media                   Piece Name

    ------- --- ----------- ----------------------- ----------

    10083   1   AVAILABLE                           $RSCN_1920977_RTIM_1028557385_THRD_1_SEQ_92_CTKEY_10033_BACKUP




  Backup Set Copy #2 of backup set 10082

  Device Type Elapsed Time Completion Time     Compressed Tag

  ----------- ------------ ------------------- ---------- ---

  SBT_TAPE    00:43:14     01/01/2020 22:46:09 YES        TAG20200101T232923




    List of Backup Pieces for backup set 10082 Copy #2

    BP Key  Pc# Status      Media                   Piece Name

    ------- --- ----------- ----------------------- ----------

    10338   1   AVAILABLE   zdlras2-osbmf-000003    RA_SBT_OR19DG_41954437_10086_66uku2ij_1_2_10082




RMAN>

And for datafile

RMAN> list backup of datafile 1;







List of Backup Sets

===================







BS Key  Type LV Size       Device Type Elapsed Time Completion Time

------- ---- -- ---------- ----------- ------------ -------------------

9067    Incr 0  337.52M    SBT_TAPE    00:00:57     01/01/2020 19:07:12

        BP Key: 9068   Status: AVAILABLE  Compressed: YES  Tag: BKP-DB-INC0

        Handle: VB$_2127575003_9062I   Media:

  List of Datafiles in backup set 9067

  File LV Type Ckp SCN    Ckp Time            Abs Fuz SCN Sparse Name

  ---- -- ---- ---------- ------------------- ----------- ------ ----

  1    0  Incr 2438840    01/01/2020 19:03:19 2440207      NO    +DATA/OR19DGS/DATAFILE/system.265.1028562437




BS Key  Type LV Size       Device Type Elapsed Time Completion Time

------- ---- -- ---------- ----------- ------------ -------------------

9210    Incr 1  136.00K    SBT_TAPE    00:00:10     01/01/2020 19:16:43

        BP Key: 9211   Status: AVAILABLE  Compressed: YES  Tag: BKP-DB

        Handle: VB$_2127575003_9206I   Media:

  List of Datafiles in backup set 9210

  File LV Type Ckp SCN    Ckp Time            Abs Fuz SCN Sparse Name

  ---- -- ---- ---------- ------------------- ----------- ------ ----

  1    1  Incr 2441418    01/01/2020 19:15:20              NO    +DATA/OR19DGS/DATAFILE/system.265.1028562437




BS Key  Type LV Size       Device Type Elapsed Time Completion Time

------- ---- -- ---------- ----------- ------------ -------------------

9218    Incr 0  336.36M    SBT_TAPE    00:00:10     01/01/2020 19:16:43

        BP Key: 9219   Status: AVAILABLE  Compressed: YES  Tag: BKP-DB

        Handle: VB$_2127575003_9206_1   Media:

  List of Datafiles in backup set 9218

  File LV Type Ckp SCN    Ckp Time            Abs Fuz SCN Sparse Name

  ---- -- ---- ---------- ------------------- ----------- ------ ----

  1    0  Incr 2441418    01/01/2020 19:15:20              NO    +DATA/OR19DGS/DATAFILE/system.265.1028562437




BS Key  Type LV Size       Device Type Elapsed Time Completion Time

------- ---- -- ---------- ----------- ------------ -------------------

9496    Incr 1  168.00K    SBT_TAPE    00:00:10     01/01/2020 20:26:41

        BP Key: 9497   Status: AVAILABLE  Compressed: YES  Tag: BKP-DB

        Handle: VB$_2127575003_9492I   Media:

  List of Datafiles in backup set 9496

  File LV Type Ckp SCN    Ckp Time            Abs Fuz SCN Sparse Name

  ---- -- ---- ---------- ------------------- ----------- ------ ----

  1    1  Incr 2452089    01/01/2020 20:24:22              NO    +DATA/OR19DGS/DATAFILE/system.265.1028562437




BS Key  Type LV Size

------- ---- -- ----------

9500    Incr 0  336.36M

  List of Datafiles in backup set 9500

  File LV Type Ckp SCN    Ckp Time            Abs Fuz SCN Sparse Name

  ---- -- ---- ---------- ------------------- ----------- ------ ----

  1    0  Incr 2452089    01/01/2020 20:24:22              NO    +DATA/OR19DGS/DATAFILE/system.265.1028562437




  Backup Set Copy #1 of backup set 9500

  Device Type Elapsed Time Completion Time     Compressed Tag

  ----------- ------------ ------------------- ---------- ---

  SBT_TAPE    02:14:00     01/01/2020 20:26:41 YES        BKP-DB




    List of Backup Pieces for backup set 9500 Copy #1

    BP Key  Pc# Status      Media                   Piece Name

    ------- --- ----------- ----------------------- ----------

    9501    1   AVAILABLE                           VB$_2127575003_9492_1




  Backup Set Copy #2 of backup set 9500

  Device Type Elapsed Time Completion Time     Compressed Tag

  ----------- ------------ ------------------- ---------- ---

  SBT_TAPE    02:14:00     01/01/2020 22:40:31 NO         BKP-DB




    List of Backup Pieces for backup set 9500 Copy #2

    BP Key  Pc# Status      Media                   Piece Name

    ------- --- ----------- ----------------------- ----------

    10125   1   AVAILABLE   zdlras2-osbmf-000002,zdlras2-osbmf-000005 RA_SBT_OR19DG_41954437_10086_qjuktnrn_1_2_9500




RMAN>

For both cases look that we have the Backup Set Copy #2 in tape. And specifically for datafile, look that just the last full backup was cloned to tape.

These are just examples, in other posts, I will provide more examples about clone to tape.

Conclusion

Configurate ZDLRA to do clone to tapes and offload the backup is not complicated. But will require some attention because of the details. The most important it is related with the job scheduler. It is DBA duties correctly schedule it and avoid some overlap between the jobs.

Unfortunately, there is no easy way to do that. And can be more complicate because you can add more database to your protection policies and this will increate the clone time.

Beside that we have the fact that OSB (or third-part library) are handle as “external” by ZDLRA. This means that are not much control beside start and stop coming from ZDLRA. No easy error identification, report and solution besides “try again”.

Clone to tape backup is one of activity that demand more attention for ZDLRA, and need to take care a lot of details to have a good clone to tape plan that will lead less errors in the future.

Solving MGMTDB errors during 18c GI RU apply

Category: Engineer System Author: Fernando Simon (Board Member) Date: 5 years ago Comments: 0

Solving MGMTDB errors during 18c GI RU apply

Recently I executed the upgrade of Oracle GI to 19c version, from 18.6.0.0 to 19.5.0.0 version. But one step that was not showed there was that, because of requirements, the GI was upgraded from 18.2.0.0 to 18.6.0.0. This upgrade is a just Release Update (RU) apply and opatchauto command.

But during this upgrade, from 18.2 to 18.6, I faced (more than one time – 5 to be precise) errors during the update because of the MGMTDB errors. I got these errors:

ORA-12514, TNS: Listener does not currently know of service requested in connect descriptor
ORA-01017: invalid username/password; logon denied
MGTCA-1005 : Could not connect to the GIMR.
CRS-10407: (:CLSCRED1079:)Credential domain does not exist.

Here I will show how to solve these errors, how to identify if everything was fine and if you can continue. Be careful that it is an example, always open a support SR to identify the source of the error.

RU and Patch Process

To apply 18c RU over GI it is simple, basically, it is needed to call opatchauto for every node. One example of correct execution is:

[root@exa01vm01 ~]# opatchauto apply /u01/patches/grid/29301682 -oh /u01/app/18.0.0/grid




OPatchauto session is initiated at Mon Jan 20 12:48:25 2020




System initialization log file is /u01/app/18.0.0/grid/cfgtoollogs/opatchautodb/systemconfig2020-01-20_12-48-48PM.log.




Session log file is /u01/app/18.0.0/grid/cfgtoollogs/opatchauto/opatchauto2020-01-20_12-49-44PM.log

The id for this session is RQ3F




Executing OPatch prereq operations to verify patch applicability on home /u01/app/18.0.0/grid

Patch applicability verified successfully on home /u01/app/18.0.0/grid







Bringing down CRS service on home /u01/app/18.0.0/grid

CRS service brought down successfully on home /u01/app/18.0.0/grid







Start applying binary patch on home /u01/app/18.0.0/grid

Binary patch applied successfully on home /u01/app/18.0.0/grid







Starting CRS service on home /u01/app/18.0.0/grid

CRS service started successfully on home /u01/app/18.0.0/grid




OPatchAuto successful.




--------------------------------Summary--------------------------------




Patching is completed successfully. Please find the summary as follows:




Host:exa01vm01

CRS Home:/u01/app/18.0.0/grid

Version:18.0.0.0.0

Summary:




==Following patches were SUCCESSFULLY applied:




Patch: /u01/patches/grid/29301682/28435192

Log: /u01/app/18.0.0/grid/cfgtoollogs/opatchauto/core/opatch/opatch2020-01-20_12-51-42PM_1.log




Patch: /u01/patches/grid/29301682/28547619

Log: /u01/app/18.0.0/grid/cfgtoollogs/opatchauto/core/opatch/opatch2020-01-20_12-51-42PM_1.log




Patch: /u01/patches/grid/29301682/29301631

Log: /u01/app/18.0.0/grid/cfgtoollogs/opatchauto/core/opatch/opatch2020-01-20_12-51-42PM_1.log




Patch: /u01/patches/grid/29301682/29301643

Log: /u01/app/18.0.0/grid/cfgtoollogs/opatchauto/core/opatch/opatch2020-01-20_12-51-42PM_1.log




Patch: /u01/patches/grid/29301682/29302264

Log: /u01/app/18.0.0/grid/cfgtoollogs/opatchauto/core/opatch/opatch2020-01-20_12-51-42PM_1.log










OPatchauto session completed at Mon Jan 20 13:00:01 2020

Time taken to complete the session 11 minutes, 36 seconds

[root@exa01vm01 ~]#










####################################################

#Execute in node 2

####################################################







[root@exa02vm01 ~]# opatchauto apply /u01/patches/grid/29301682 -oh /u01/app/18.0.0/grid




OPatchauto session is initiated at Tue Jan 21 13:12:37 2020




System initialization log file is /u01/app/18.0.0/grid/cfgtoollogs/opatchautodb/systemconfig2020-01-21_01-12-51PM.log.




Session log file is /u01/app/18.0.0/grid/cfgtoollogs/opatchauto/opatchauto2020-01-21_01-13-37PM.log

The id for this session is NFXL




Executing OPatch prereq operations to verify patch applicability on home /u01/app/18.0.0/grid

Patch applicability verified successfully on home /u01/app/18.0.0/grid







Bringing down CRS service on home /u01/app/18.0.0/grid

CRS service brought down successfully on home /u01/app/18.0.0/grid







Start applying binary patch on home /u01/app/18.0.0/grid

Binary patch applied successfully on home /u01/app/18.0.0/grid







Starting CRS service on home /u01/app/18.0.0/grid

CRS service started successfully on home /u01/app/18.0.0/grid




OPatchAuto successful.




--------------------------------Summary--------------------------------




Patching is completed successfully. Please find the summary as follows:




Host:exa02vm01

CRS Home:/u01/app/18.0.0/grid

Version:18.0.0.0.0

Summary:




==Following patches were SUCCESSFULLY applied:




Patch: /u01/patches/grid/29301682/28435192

Log: /u01/app/18.0.0/grid/cfgtoollogs/opatchauto/core/opatch/opatch2020-01-21_13-15-58PM_1.log




Patch: /u01/patches/grid/29301682/28547619

Log: /u01/app/18.0.0/grid/cfgtoollogs/opatchauto/core/opatch/opatch2020-01-21_13-15-58PM_1.log




Patch: /u01/patches/grid/29301682/29301631

Log: /u01/app/18.0.0/grid/cfgtoollogs/opatchauto/core/opatch/opatch2020-01-21_13-15-58PM_1.log




Patch: /u01/patches/grid/29301682/29301643

Log: /u01/app/18.0.0/grid/cfgtoollogs/opatchauto/core/opatch/opatch2020-01-21_13-15-58PM_1.log




Patch: /u01/patches/grid/29301682/29302264

Log: /u01/app/18.0.0/grid/cfgtoollogs/opatchauto/core/opatch/opatch2020-01-21_13-15-58PM_1.log










OPatchauto session completed at Tue Jan 21 13:35:21 2020

Time taken to complete the session 22 minutes, 44 seconds

[root@exa02vm01 ~]#

As you can see, call in one node first, and after the second node. Basically during this process, the opatch will apply the binary updates in GI home, restart GI (to open with the new version), and apply needed SQL patches over MGMTDB and ASM.

MGMTDB and errors

But some errors can occur during the apply, mainly for MGMTDB. These errors are usually related to the communication between MGMTDB and CRS resources (listener and others). Usually, these errors occur when applying the patch in the last node of the cluster because the MGMTDB is swapped from other nodes (until the last one) during the patch application.

Some errors examples that I faced:

ORA-12514, TNS:listener does not currently know of service requested in connect descriptor

...

CRS-4123: Oracle High Availability Services has been started.

Oracle Clusterware active version on the cluster is [18.0.0.0.0]. The cluster upgrade state is [NORMAL]. The cluster active patch level is [4127601284].

SQL Patching tool version 18.0.0.0.0 Production on Wed Jan 15 13:57:08 2020

Copyright (c) 2012, 2019, Oracle.  All rights reserved.




Log file for this invocation: /u01/app/grid/cfgtoollogs/sqlpatch/sqlpatch_360667_2020_01_15_13_57_08/sqlpatch_invocation.log




Connecting to database...OK

Gathering database info...done




Note:  Datapatch will only apply or rollback SQL fixes for PDBs

       that are in an open state, no patches will be applied to closed PDBs.

       Please refer to Note: Datapatch: Database 12c Post Patch SQL Automation

       (Doc ID 1585822.1)




Bootstrapping registry and package to current versions...done

Determining current state...done




Current state of interim SQL patches:

  No interim patches found




Current state of release update SQL patches:

  Binary registry:

    18.6.0.0.0 Release_Update 1903190102: Installed

  PDB CDB$ROOT:

    Applied 18.2.0.0.0 Release_Update 1804041635 successfully on 02-OCT-18 03.11.47.950696 PM

  PDB PDB$SEED:

    Applied 18.2.0.0.0 Release_Update 1804041635 successfully on 02-OCT-18 03.16.34.680672 PM




Adding patches to installation queue and performing prereq checks...done

Installation queue:

  For the following PDBs: CDB$ROOT PDB$SEED

    No interim patches need to be rolled back

    Patch 29301631 (Database Release Update : 18.6.0.0.190416 (29301631)):

      Apply from 18.2.0.0.0 Release_Update 1804041635 to 18.6.0.0.0 Release_Update 1903190102

    No interim patches need to be applied




Installing patches...

Patch installation complete.  Total patches installed: 2




Validating logfiles...done

Patch 29301631 apply (pdb CDB$ROOT): SUCCESS

  logfile: /u01/app/grid/cfgtoollogs/sqlpatch/29301631/22832106/29301631_apply__MGMTDB_CDBROOT_2020Jan15_13_58_14.log (no errors)

Patch 29301631 apply (pdb PDB$SEED): SUCCESS

  logfile: /u01/app/grid/cfgtoollogs/sqlpatch/29301631/22832106/29301631_apply__MGMTDB_PDBSEED_2020Jan15_14_01_22.log (no errors)

SQL Patching tool complete on Wed Jan 15 14:02:41 2020

MGTCA-1005 : Could not connect to the GIMR.




Listener refused the connection with the following error:




ORA-12514, TNS:listener does not currently know of service requested in connect descriptor










2020/01/15 14:02:49 CLSRSC-180: An error occurred while executing the command '/u01/app/18.0.0/grid/bin/mgmtca applysql'




After fixing the cause of failure Run opatchauto resume




]

OPATCHAUTO-68061: The orchestration engine failed.

OPATCHAUTO-68061: The orchestration engine failed with return code 1

OPATCHAUTO-68061: Check the log for more details.

OPatchAuto failed.




OPatchauto session completed at Wed Jan 15 14:02:50 2020

Time taken to complete the session 28 minutes, 34 seconds




 opatchauto failed with error code 42

[root@exa03vm02 ~]#

ORA-01017: invalid username/password; logon denied

CRS-4123: Oracle High Availability Services has been started.

Oracle Clusterware active version on the cluster is [18.0.0.0.0]. The cluster upgrade state is [NORMAL]. The cluster active patch level is [4127601284].

SQL Patching tool version 18.0.0.0.0 Production on Fri Nov 29 10:32:41 2019

Copyright (c) 2012, 2019, Oracle.  All rights reserved.




Log file for this invocation: /u01/app/grid/cfgtoollogs/sqlpatch/sqlpatch_128123_2019_11_29_10_32_41/sqlpatch_invocation.log




Connecting to database...OK

Gathering database info...done




Note:  Datapatch will only apply or rollback SQL fixes for PDBs

       that are in an open state, no patches will be applied to closed PDBs.

       Please refer to Note: Datapatch: Database 12c Post Patch SQL Automation

       (Doc ID 1585822.1)




Bootstrapping registry and package to current versions...done

Determining current state...done




Current state of interim SQL patches:

  No interim patches found




Current state of release update SQL patches:

  Binary registry:

    18.6.0.0.0 Release_Update 1903190102: Installed

  PDB CDB$ROOT:

    Applied 18.2.0.0.0 Release_Update 1804041635 successfully on 02-OCT-18 01.32.26.561113 PM

  PDB GIMR_DSCREP_10:

    Applied 18.2.0.0.0 Release_Update 1804041635 successfully on 02-OCT-18 01.37.12.290147 PM

  PDB PDB$SEED:

    Applied 18.2.0.0.0 Release_Update 1804041635 successfully on 02-OCT-18 01.37.12.290147 PM




Adding patches to installation queue and performing prereq checks...done

Installation queue:

  For the following PDBs: CDB$ROOT PDB$SEED GIMR_DSCREP_10

    No interim patches need to be rolled back

    Patch 29301631 (Database Release Update : 18.6.0.0.190416 (29301631)):

      Apply from 18.2.0.0.0 Release_Update 1804041635 to 18.6.0.0.0 Release_Update 1903190102

    No interim patches need to be applied




Installing patches...

Patch installation complete.  Total patches installed: 3




Validating logfiles...done

Patch 29301631 apply (pdb CDB$ROOT): SUCCESS

  logfile: /u01/app/grid/cfgtoollogs/sqlpatch/29301631/22832106/29301631_apply__MGMTDB_CDBROOT_2019Nov29_10_33_45.log (no errors)

Patch 29301631 apply (pdb PDB$SEED): SUCCESS

  logfile: /u01/app/grid/cfgtoollogs/sqlpatch/29301631/22832106/29301631_apply__MGMTDB_PDBSEED_2019Nov29_10_36_02.log (no errors)

Patch 29301631 apply (pdb GIMR_DSCREP_10): SUCCESS

  logfile: /u01/app/grid/cfgtoollogs/sqlpatch/29301631/22832106/29301631_apply__MGMTDB_GIMR_DSCREP_10_2019Nov29_10_36_02.log (no errors)

SQL Patching tool complete on Fri Nov 29 10:38:04 2019

MGTCA-1005 : Could not connect to the GIMR.




ORA-01017: invalid username/password; logon denied










2019/11/29 10:38:13 CLSRSC-180: An error occurred while executing the command '/u01/app/18.0.0/grid/bin/mgmtca applysql'




After fixing the cause of failure Run opatchauto resume




]

OPATCHAUTO-68061: The orchestration engine failed.

OPATCHAUTO-68061: The orchestration engine failed with return code 1

OPATCHAUTO-68061: Check the log for more details.

OPatchAuto failed.




OPatchauto session completed at Fri Nov 29 10:38:15 2019

Time taken to complete the session 26 minutes, 48 seconds




 opatchauto failed with error code 42

[root@exa01vm02 ~]#

And inside of the opatch log you can reach/see something like:

SQL Patching tool complete on Fri Nov 29 10:44:09 2019

MGTCA-1005 : Could not connect to the GIMR.




ORA-01017: invalid username/password; logon denied










2019/11/29 10:44:18 CLSRSC-180: An error occurred while executing the command '/u01/app/18.0.0/grid/bin/mgmtca applysql'




After fixing the cause of failure Run opatchauto resume




]]. Failures:

OPATCHAUTO-68067: Check the details to determine the cause of the failure.

        at com.oracle.glcm.patch.auto.action.PatchActionExecutor.execute(PatchActionExecutor.java:172)

        at com.oracle.glcm.patch.auto.wizard.silent.tasks.PatchActionTask.execute(PatchActionTask.java:102)

        ... 2 more

2019-11-29 10:44:19,660 INFO  [1] com.oracle.glcm.patch.auto.db.integration.model.productsupport.DBBaseProductSupport - Space available after session: 29898 MB

2019-11-29 10:44:19,728 SEVERE [1] com.oracle.glcm.patch.auto.OPatchAuto - OPatchAuto failed.

com.oracle.glcm.patch.auto.OPatchAutoException: OPATCHAUTO-68061: The orchestration engine failed.

OPATCHAUTO-68061: The orchestration engine failed with return code 1

OPATCHAUTO-68061: Check the log for more details.

        at com.oracle.glcm.patch.auto.OrchestrationEngineImpl.orchestrate(OrchestrationEngineImpl.java:40)

        at com.oracle.glcm.patch.auto.OPatchAuto.orchestrate(OPatchAuto.java:858)

        at com.oracle.glcm.patch.auto.OPatchAuto.orchestrate(OPatchAuto.java:398)

        at com.oracle.glcm.patch.auto.OPatchAuto.orchestrate(OPatchAuto.java:344)

        at com.oracle.glcm.patch.auto.OPatchAuto.main(OPatchAuto.java:212)

2019-11-29 10:44:19,729 INFO  [1] com.oracle.cie.common.util.reporting.CommonReporter - Reporting console output : Message{id='null', message='OPATCHAUTO-68061: The orchestration engine failed.

OPATCHAUTO-68061: The orchestration engine failed with return code 1

OPATCHAUTO-68061: Check the log for more details.'}

2019-11-29 10:44:19,729 INFO  [1] com.oracle.cie.common.util.reporting.CommonReporter - Reporting console output : Message{id='null', message='OPatchAuto failed.'}

^C

[root@exa01vm02 ~]#

But the common in all cases occurred during the “/u01/app/18.0.0/grid/bin/mgmtca applysql” call because the opatch detected that is need to apply the patch over MGMTDB, and tried to apply (and reached the error).

Solving the issue

After some dig in the web about the error and MOS/Metalink search, nothing was found. No error, hint or something related. Even after one SR opened the report was always the same: inconclusive and not relevant to build a fix. But for all one solution/workaround can be applied and worked every time.

Before, just to show that if you look inside the MGMTDB itself you can see that the patch was applied correctly:

[grid@exa01vm02 -MGMTDB]$ sqlplus / as sysdba




SQL*Plus: Release 18.0.0.0.0 - Production on Thu Nov 14 09:19:05 2019

Version 18.6.0.0.0




Copyright (c) 1982, 2018, Oracle.  All rights reserved.







Connected to:

Oracle Database 18c Enterprise Edition Release 18.0.0.0.0 - Production

Version 18.6.0.0.0




SQL> show pdbs




    CON_ID CON_NAME                       OPEN MODE  RESTRICTED

---------- ------------------------------ ---------- ----------

         2 PDB$SEED                       READ ONLY  NO

SQL> select instance_name from v$instance;




INSTANCE_NAME

----------------

-MGMTDB




SQL> exit

Disconnected from Oracle Database 18c Enterprise Edition Release 18.0.0.0.0 - Production

Version 18.6.0.0.0

[grid@exa01vm02 -MGMTDB]$

[grid@exa01vm02 -MGMTDB]$

[grid@exa01vm02 -MGMTDB]$ cd $ORACLE_HOME/OPatch

[grid@exa01vm02 -MGMTDB]$ ./datapatch -verbose

SQL Patching tool version 18.0.0.0.0 Production on Thu Nov 14 09:21:33 2019

Copyright (c) 2012, 2019, Oracle.  All rights reserved.




Log file for this invocation: /u01/app/grid/cfgtoollogs/sqlpatch/sqlpatch_242627_2019_11_14_09_21_33/sqlpatch_invocation.log




Connecting to database...OK

Gathering database info...done




Note:  Datapatch will only apply or rollback SQL fixes for PDBs

       that are in an open state, no patches will be applied to closed PDBs.

       Please refer to Note: Datapatch: Database 12c Post Patch SQL Automation

       (Doc ID 1585822.1)




Bootstrapping registry and package to current versions...done

Determining current state...done




Current state of interim SQL patches:

  No interim patches found




Current state of release update SQL patches:

  Binary registry:

    18.6.0.0.0 Release_Update 1903190102: Installed

  PDB CDB$ROOT:

    Applied 18.6.0.0.0 Release_Update 1903190102 successfully on 13-NOV-19 05.17.48.183101 PM

  PDB PDB$SEED:

    Applied 18.6.0.0.0 Release_Update 1903190102 successfully on 13-NOV-19 05.17.50.152110 PM




Adding patches to installation queue and performing prereq checks...done

Installation queue:

  For the following PDBs: CDB$ROOT PDB$SEED

    No interim patches need to be rolled back

    No release update patches need to be installed

    No interim patches need to be applied




SQL Patching tool complete on Thu Nov 14 09:21:48 2019

[grid@exa01vm02 -MGMTDB]$

[grid@exa01vm02 -MGMTDB]$

[grid@exa01vm02 -MGMTDB]$

[grid@exa01vm02 -MGMTDB]$ sqlplus / as sysdba




SQL*Plus: Release 18.0.0.0.0 - Production on Thu Nov 14 09:23:35 2019

Version 18.6.0.0.0




Copyright (c) 1982, 2018, Oracle.  All rights reserved.







Connected to:

Oracle Database 18c Enterprise Edition Release 18.0.0.0.0 - Production

Version 18.6.0.0.0




SQL> set linesize 250

SQL> col comments format a50

SQL> select ACTION_TIME,VERSION,COMMENTS from dba_registry_history;




ACTION_TIME                                                                 VERSION                        COMMENTS

--------------------------------------------------------------------------- ------------------------------ --------------------------------------------------

                                                                            18                             RDBMS_18.6.0.0.0DBRU_LINUX.X64_190313.1

19-JUN-18 01.51.48.687187 PM                                                18.0.0.0.0                     Patch applied from 18.1.0.0.0 to 18.2.0.0.0

13-NOV-19 05.16.14.040821 PM                                                18.0.0.0.0                     Patch applied from 18.2.0.0.0 to 18.6.0.0.0




SQL> exit

Disconnected from Oracle Database 18c Enterprise Edition Release 18.0.0.0.0 - Production

Version 18.6.0.0.0

[grid@exa01vm02 -MGMTDB]$

If you look above even the datapatch and database itself report that patch was applied correctly:

[grid@exa01vm02 -MGMTDB]$ sqlplus / as sysdba




SQL*Plus: Release 18.0.0.0.0 - Production on Thu Nov 14 09:19:05 2019

Version 18.6.0.0.0




Copyright (c) 1982, 2018, Oracle.  All rights reserved.







Connected to:

Oracle Database 18c Enterprise Edition Release 18.0.0.0.0 - Production

Version 18.6.0.0.0




SQL> show pdbs




    CON_ID CON_NAME                       OPEN MODE  RESTRICTED

---------- ------------------------------ ---------- ----------

         2 PDB$SEED                       READ ONLY  NO

SQL> select instance_name from v$instance;




INSTANCE_NAME

----------------

-MGMTDB




SQL> exit

Disconnected from Oracle Database 18c Enterprise Edition Release 18.0.0.0.0 - Production

Version 18.6.0.0.0

[grid@exa01vm02 -MGMTDB]$

[grid@exa01vm02 -MGMTDB]$

[grid@exa01vm02 -MGMTDB]$ cd $ORACLE_HOME/OPatch

[grid@exa01vm02 -MGMTDB]$ ./datapatch -verbose

SQL Patching tool version 18.0.0.0.0 Production on Thu Nov 14 09:21:33 2019

Copyright (c) 2012, 2019, Oracle.  All rights reserved.




Log file for this invocation: /u01/app/grid/cfgtoollogs/sqlpatch/sqlpatch_242627_2019_11_14_09_21_33/sqlpatch_invocation.log




Connecting to database...OK

Gathering database info...done




Note:  Datapatch will only apply or rollback SQL fixes for PDBs

       that are in an open state, no patches will be applied to closed PDBs.

       Please refer to Note: Datapatch: Database 12c Post Patch SQL Automation

       (Doc ID 1585822.1)




Bootstrapping registry and package to current versions...done

Determining current state...done




Current state of interim SQL patches:

  No interim patches found




Current state of release update SQL patches:

  Binary registry:

    18.6.0.0.0 Release_Update 1903190102: Installed

  PDB CDB$ROOT:

    Applied 18.6.0.0.0 Release_Update 1903190102 successfully on 13-NOV-19 05.17.48.183101 PM

  PDB PDB$SEED:

    Applied 18.6.0.0.0 Release_Update 1903190102 successfully on 13-NOV-19 05.17.50.152110 PM




Adding patches to installation queue and performing prereq checks...done

Installation queue:

  For the following PDBs: CDB$ROOT PDB$SEED

    No interim patches need to be rolled back

    No release update patches need to be installed

    No interim patches need to be applied




SQL Patching tool complete on Thu Nov 14 09:21:48 2019

[grid@exa01vm02 -MGMTDB]$

[grid@exa01vm02 -MGMTDB]$

[grid@exa01vm02 -MGMTDB]$

[grid@exa01vm02 -MGMTDB]$ sqlplus / as sysdba




SQL*Plus: Release 18.0.0.0.0 - Production on Thu Nov 14 09:23:35 2019

Version 18.6.0.0.0




Copyright (c) 1982, 2018, Oracle.  All rights reserved.







Connected to:

Oracle Database 18c Enterprise Edition Release 18.0.0.0.0 - Production

Version 18.6.0.0.0




SQL> set linesize 250

SQL> col comments format a50

SQL> select ACTION_TIME,VERSION,COMMENTS from dba_registry_history;




ACTION_TIME                                                                 VERSION                        COMMENTS

--------------------------------------------------------------------------- ------------------------------ --------------------------------------------------

                                                                            18                             RDBMS_18.6.0.0.0DBRU_LINUX.X64_190313.1

19-JUN-18 01.51.48.687187 PM                                                18.0.0.0.0                     Patch applied from 18.1.0.0.0 to 18.2.0.0.0

13-NOV-19 05.16.14.040821 PM                                                18.0.0.0.0                     Patch applied from 18.2.0.0.0 to 18.6.0.0.0




SQL> exit

Disconnected from Oracle Database 18c Enterprise Edition Release 18.0.0.0.0 - Production

Version 18.6.0.0.0

[grid@exa01vm02 -MGMTDB]$

The Workaround

The workaround tries to bypass some checks executed to opatch when checking MGMTDB. If you look at the errors above, they were caused because CRS can’t communicate correctly with the database. Maybe because of some credential failure inside of CRS or listener registration.

The first step is to remove completely the MGMTDB. To do that you can use the dbca (from GI home and only one node):

[grid@exa01vm02 +ASM2]$ $ORACLE_HOME/bin/dbca -silent -deleteDatabase -sourceDB -MGMTDB

[WARNING] [DBT-19202] The Database Configuration Assistant will delete the Oracle instances and datafiles for your database. All information in the database will be destroyed.

Prepare for db operation

32% complete

Connecting to database

35% complete

39% complete

42% complete

45% complete

48% complete

52% complete

65% complete

Updating network configuration files

68% complete

Deleting instance and datafiles

84% complete

100% complete

Database deletion completed.

Look at the log file "/u01/app/grid/cfgtoollogs/dbca/_mgmtdb/_mgmtdb.log" for further details.

[grid@exa01vm02 +ASM2]$

After remove completely, you just need to recreate the MGMTDB. This can be done manually or using the script from MOS. One good source for information is the note How to Move/Recreate GI Management Repository (GIMR / MGMTDB) to Different Shared Storage (Diskgroup, CFS or NFS etc) (Doc ID 1589394.1) where you can find how to create manually or the script (the script just call the dbca).

Here I used the script, you can download it from the note MDBUtil: GI Management Repository configuration tool (Doc ID 2065175.1). The script call DBCA and you just need to specify the diskgroup that you want to use to store the database files.

In this case, I made:

[grid@exa01vm02 +ASM2]$ /tmp/mdbutil.pl --addmdb --target=+DATAC2

mdbutil.pl version : 1.98

2019-11-14 09:54:58: I Starting To Configure MGMTDB at +DATAC2...

2019-11-14 09:55:19: I Container database creation in progress... for GI 18.0.0.0.0

2019-11-14 10:12:50: I Plugable database creation in progress...

2019-11-14 10:16:59: I Executing "/tmp/mdbutil.pl --addchm" on exa01vm02 as root to configure CHM.

root@exa01vm02's password:

2019-11-14 10:17:26: I Executing "/tmp/mdbutil.pl --addchm" on exa01vm02 as root to configure CHM.

root@exa01vm02's password:

2019-11-14 10:17:34: I MGMTDB & CHM configuration done!

[grid@exa01vm02 +ASM2]$

You can even call with debug mode:

[grid@exa02vm03 -MGMTDB]$ /tmp/mdbutil.pl --addmdb --target=+DATAC3 --debug

mdbutil.pl version : 1.98

2020-01-15 15:30:58: D Executing: /u01/app/18.0.0/grid/bin/srvctl status diskgroup -g DATAC3

2020-01-15 15:30:59: D Exit code: 0

2020-01-15 15:30:59: D Output of last command execution:

Disk Group DATAC3 is running on exa02vm03,exa01vm03

2020-01-15 15:30:59: I Starting To Configure MGMTDB at +DATAC3...

2020-01-15 15:30:59: D Executing: /u01/app/18.0.0/grid/bin/srvctl status mgmtlsnr

2020-01-15 15:30:59: D Exit code: 0

2020-01-15 15:30:59: D Output of last command execution:

Listener MGMTLSNR is enabled

2020-01-15 15:30:59: D Executing: /u01/app/18.0.0/grid/bin/srvctl status mgmtdb

2020-01-15 15:31:00: D Exit code: 1

2020-01-15 15:31:00: D Output of last command execution:

PRCD-1120 : The resource for database _mgmtdb could not be found.

2020-01-15 15:31:00: D Executing: /u01/app/18.0.0/grid/bin/srvctl status mgmtdb

2020-01-15 15:31:01: D Exit code: 1

2020-01-15 15:31:01: D Output of last command execution:

PRCD-1120 : The resource for database _mgmtdb could not be found.

2020-01-15 15:31:01: D Executing: /u01/app/18.0.0/grid/bin/srvctl stop mgmtlsnr

2020-01-15 15:31:05: D Exit code: 0

2020-01-15 15:31:05: D Output of last command execution:

2020-01-15 15:31:05: D Executing: /u01/app/18.0.0/grid/bin/crsctl query crs activeversion

2020-01-15 15:31:05: D Exit code: 0

2020-01-15 15:31:05: D Output of last command execution:

Oracle Clusterware active version on the cluster is [18.0.0.0.0]

2020-01-15 15:31:05: D Executing: /u01/app/18.0.0/grid/bin/srvctl enable qosmserver

2020-01-15 15:31:06: D Exit code: 2

2020-01-15 15:31:06: D Output of last command execution:

PRKF-1321 : QoS Management Server is already enabled.

2020-01-15 15:31:06: D Executing: /u01/app/18.0.0/grid/bin/srvctl start qosmserver

2020-01-15 15:31:06: D Exit code: 2

2020-01-15 15:31:06: D Output of last command execution:

PRCC-1014 : qosmserver was already running

2020-01-15 15:31:06: I Container database creation in progress... for GI 18.0.0.0.0

2020-01-15 15:31:06: D Executing: /u01/app/18.0.0/grid/bin/dbca  -silent -createDatabase -createAsContainerDatabase true -templateName MGMTSeed_Database.dbc -sid -MGMTDB -gdbName _mgmtdb -storageType ASM -diskGroupName DATAC3 -datafileJarLocation /u01/app/18.0.0/grid/assistants/dbca/templates -characterset AL32UTF8 -autoGeneratePasswords -skipUserTemplateCheck

2020-01-15 15:47:19: D Exit code: 0

2020-01-15 15:47:19: D Output of last command execution:

Prepare for db operation

2020-01-15 15:47:19: I Plugable database creation in progress...

2020-01-15 15:47:19: D Executing: /u01/app/18.0.0/grid/bin/mgmtca -local

2020-01-15 15:49:28: D Exit code: 0

2020-01-15 15:49:28: D Output of last command execution:

2020-01-15 15:49:28: D Executing: scp /tmp/mdbutil.pl exa01vm03:/tmp/

2020-01-15 15:49:28: D Exit code: 0

2020-01-15 15:49:28: D Output of last command execution:

2020-01-15 15:49:28: I Executing "/tmp/mdbutil.pl --addchm" on exa01vm03 as root to configure CHM.

2020-01-15 15:49:28: D Executing: ssh root@exa01vm03 "/tmp/mdbutil.pl --addchm"

root@exa01vm03's password:

2020-01-15 15:49:39: D Exit code: 0

2020-01-15 15:49:39: D Output of last command execution:

mdbutil.pl version : 1.98

2020-01-15 15:49:39: D Executing: scp /tmp/mdbutil.pl exa02vm03:/tmp/

2020-01-15 15:49:39: D Exit code: 0

2020-01-15 15:49:39: D Output of last command execution:

2020-01-15 15:49:39: I Executing "/tmp/mdbutil.pl --addchm" on exa02vm03 as root to configure CHM.

2020-01-15 15:49:39: D Executing: ssh root@exa02vm03 "/tmp/mdbutil.pl --addchm"

root@exa02vm03's password:

2020-01-15 15:49:46: D Exit code: 0

2020-01-15 15:49:46: D Output of last command execution:

mdbutil.pl version : 1.98

2020-01-15 15:49:46: I MGMTDB & CHM configuration done!

[grid@exa02vm03 -MGMTDB]$

Resume and finish

So, after recreating the MGMTDB we can resume the opatch:

[root@exa02vm03 ~]# opatchauto resume




OPatchauto session is initiated at Wed Jan 15 15:50:15 2020

Session log file is /u01/app/18.0.0/grid/cfgtoollogs/opatchauto/opatchauto2020-01-15_03-50-16PM.log

Resuming existing session with id NRZ1




Starting CRS service on home /u01/app/18.0.0/grid

CRS service started successfully on home /u01/app/18.0.0/grid




OPatchAuto successful.




--------------------------------Summary--------------------------------




Patching is completed successfully. Please find the summary as follows:




Host:exa02vm03

CRS Home:/u01/app/18.0.0/grid

Version:18.0.0.0.0

Summary:




==Following patches were SUCCESSFULLY applied:




Patch: /u01/patches/grid/29301682/28435192

Log: /u01/app/18.0.0/grid/cfgtoollogs/opatchauto/core/opatch/opatch2020-01-15_13-40-42PM_1.log




Patch: /u01/patches/grid/29301682/28547619

Log: /u01/app/18.0.0/grid/cfgtoollogs/opatchauto/core/opatch/opatch2020-01-15_13-40-42PM_1.log




Patch: /u01/patches/grid/29301682/29301631

Log: /u01/app/18.0.0/grid/cfgtoollogs/opatchauto/core/opatch/opatch2020-01-15_13-40-42PM_1.log




Patch: /u01/patches/grid/29301682/29301643

Log: /u01/app/18.0.0/grid/cfgtoollogs/opatchauto/core/opatch/opatch2020-01-15_13-40-42PM_1.log




Patch: /u01/patches/grid/29301682/29302264

Log: /u01/app/18.0.0/grid/cfgtoollogs/opatchauto/core/opatch/opatch2020-01-15_13-40-42PM_1.log










OPatchauto session completed at Wed Jan 15 15:57:24 2020

Time taken to complete the session 7 minutes, 9 seconds

[root@exa02vm03 ~]#

But why this work? So, the idea is that when the opatch tries to check if it is needed to apply the patch over the MGMTDB, it will reach it and verify that everything was already there and the call for mgmtca applysql will be successful. And since all the credentials between GI and MGMTDB are fine, the check can be done correctly. Another point is that you are recreating the database with one binary that already has the RU running.

Again, this solution can be applied for these errors when the communication between MGMTDB and GI fails because of credential errors or listener errors. And I include the complete check for opatch error to identify exactly the error (it is just because mgmtca applysql), and if the SQL’s was applied by datpatch and are registered inside of the database. To finish, always open SR.

« 1 2 3 4 … 7 »

ASM, REPLACE DISK Command

Basically, when the REPLACE command is called, the rebalance just copy/sync the data from the survivor disk (the partner disk from the mirror). It is faster since the previous way with drop/add execute a complete rebalance from all AU of the diskgroup, doing REBALANCE and SYNC phase.

The replace disk command is important for the SWAP disk process for Exadata (where you add the new 14TB disks) since it is faster to do the rebalance of the diskgroup.

Below one example from this behavior. Look that AU from DISK01 was SYNCED with the new disk:

And compare with the previous DROP/ADD disk, where all AU from all disks was rebalanced:

Actual Environment And Simulate The failure

In this post, to simulate and show how the replace disk works I have the DATA diskgroup with 6 disks (DISK01-06). The DISK07 it is not in use.

And to simulate the error I disconnected the disk from Operational system (since I used iSCSI, I just log off the target for DISK02:

At the same moment, the alertlog from ASM detected the error and informed that the mirror was found in another disk (DISK06):

So, at this moment the DQISK02 will not be removed instantly, but after the disk_repair_time finish:

If you want to check the full output from ASM alertlog you can access here at ASM-ALERTLOG-Output-Online-Disk-Error.txt

So, the actual diskgroup is

REPLACE DISK

Since the old disk was lost (by HW or something similar), it is impossible to put it again online. A new disk was attached to the server (DISK07 in this example) and this is added in the diskgroup.

So, we just need to execute the REPLACE DISK command:

The command is easy, we replace disk failed disk with the new disk path. And it is possible to replace more than one at the same time and specify the power of the rebalance too.

At ASM alertlog we can see a lot of messages about this replacement, but look that resync of the disk. The full output can be found here at ASM-ALERTLOG-Output-Replace-Disk.txt

Some points here:

After that, we can see the rebalance just take the SYNC phase:

In the end, after the rebalance we have:

An important detail is that the NAME for the disk will not change, it is impossible to change using REPLACE DISK command. As you can see above, the disk named DISK02 has the label DISK07 (here this came from asmlib disk).

Know Issues

So, be careful to do this in one engineering system or when you need to replace a lot of disks at the same time.

Some reference for reading:

Test Case:: 12C ASM New feature (Doc ID 1571975.1)

ASM, Mount restricted force for recovery

This represents for Exadata, but it is safe for representation. Basically your data will be in at least two different failgroups:

Environment

In the example that I use here, I have one diskgroup called DATA, which has 7 (seven) disks and each one is on failgroup. The redundancy for this diskgroup is NORMAL, this means that the block is copied in two failgroups. If two failures occur, probably, I will have data loss/corruption. Look:

The version for my GI is 19.6.0.0, but this can be used from 12.1.0.2 and newer versions (works for 11.2.0.4 in some versions). And In this server, I have three databases running, DBA19, DBB19, and DBC19.

So, with everything running correctly, the data from my databases will be spread two failgroups (this is just a representation and not correct representation where the blocks from my database are):

Understanding the failure

A “default configuration” have these values:

Mount restricted force for recovery

Here, I will simulate multiple failures for the disks (more than one) and show how you can use mount restricted force for recovery. Please be careful and follow all the steps correctly to avoid mistakes and to understand how to do and what is happening.

So, here I have DATA diskgroup, with normal redundancy and 7 (seven) failgroups. DBA19, DBB19, and DBC19 databases running.

So, at the first step, I will simulate a complete failure of CELLI03 failgroup. In my environment, to allow more control, I have one iSCSI target for each failgroup (this allows me to disconnect one by one if needed). The CELLI03 died:

And at ASM alertlog we can see:

So, the failure occurred at 17:14. The full output can be found here at ASM-ALERTLOG-Output-Failure-CELLI03.txt

And we can see that disappeared (but not deleted or dropped) from ASM:

At this point, ASM is starting to count the clock of 12hours (as defined in my repair attributes). The failgroup was not dropped and rebalance was not going on because ASM is optimistic that you will fix the issue in this period.

But after some time I had a second failure in the diskgroup:

Now at ASM alertlog you can see that diskgroup was dismounted (and several other messages). Bellow a cropped from the alertlog. The full output (and I think that deserve a look) it is here at ASM-ALERTLOG-Output-Failure-CELLI03-and-CELL01.txt

So, probably some data was lost. And even if you consider that between these 4 minutes data was changed in the databases, the mess is Big. If you want to see the alertlog from databases, check here at ASM-ALERTLOG-Output-From-Databases-Alertlog-at-Failure.txt

And now we have this at ASM:

And if we try to mount we receive an error due to disk offline:

Imagine that you can bring online the first failed failgroup (CELL03) that have 4 minutes of the difference of data:

And if you try to mount it normally you will receive an error (output from alertlog at this try can be seen here at ASM-ALERTLOG-Output-Mout-With-One-Disk-Online):

So, now we can try the mount restricted force for recovery:

The alertlog from ASM (that you can full here at ASM-ALERTLOG-Output-Mout-Restricted-Force-For-Recovery.txt) report messages related with cache from diskgropup and disk that need to be checked. And now we are like this:

The next step is to bring online the failgroup that came back:

Doing this ASM will resync this failgroup (using this block as the last version) and bring the cache of this disk online. At ASM alertlog you can see messages like (full output here at ASM-ALERTLOG-Output-Online-Restored-Failgroup):

And not we have at ASM:

And rebalance not continue because is not allowed when diskgroup is in restrict mode:

But since the failgroup become online “in force way”, the old cache (from CELL01) need to be clean. And since it is not the last version, maybe some files were corrupted. To check this, you can look the *arb* process trace files at ASM trace directory:

And looking from one of the last, we can see that some extend (that does not exist, the recovered failgroup, or the cache is not the last one) was filled with dummy (BADFDA7A) data:

After that, we can normally dismount the diskgroup:

And mount it again:

Since now the diskgroup is mounted in a clean way, you can continue with the rebalance:

The state at ASM side it is:

As you can see, the CELL01 was not removed yet (I will talk about it later). But the activities can continue, databases can be checked.

Database side

At database side we need to check what we lost and need to recover. Since I am using cluster the GI tried to start it (and as you can see two became up):

DBA19

The firs that I checked was DBA19C, I used rman to VALIDATE DATABASE:

As you can see, the datafile 4 FAILED and needs to be recovered. Luckily, the redo was not affected too and the open was OK. Since it was the UNDO, I made abort (because the immediate can take an eternity, and even since undo was down, nothing was happening inside of the database).

But as you saw, just one datafile was corrupted. Of course that with big databases and big failgroup, more files will be corrupted. But it is a shot that can worth it.

DBB19

The second was DBB19 and I used the same approach, VALIDATE DATABASE:

As you saw, no failures for DBB19. I still checked logically the database with VALIDATE CHECK LOGICAL DATABASE because since the validate returned no failed files, I wanted to check logically the blocks.

DBC19

Same for the last database, but now, datafile 3 failed:

Dropping failgroup

If the fix for the remaining failgroup took a lot, it will be dropped automatically. But we can do this manually with force (look that without force it fails):

And after the rebalance finish, all disk will be removed:

The steps for MOUNT RESTRICTED FORCE FOR RECOVERY

To resume, the steps needed are (in order):

Put online the failed disk/failgroup

Execute alter diskgroup <DG> mount restricted force for recovery

Brink online the failgroup with alter diskgroup data online disks in failgroup <FG>

But since the failgroup become online “in force way”, the old cache (from CELL01) need to be clean. And since it is not the last version, maybe some files were corrupted. To check this, you can look the arb process trace files at ASM trace directory: