Start CRS even getting “ORA-15040: diskgroup is incomplete” on voting file/OCR diskgroup

Problem:

CRS was down on both nodes, during startup cluster encountered the following error when it was trying to mount diskgroup containing voting files and OCR:

WARNING: Disk Group VOTE containing configured OCR is not mounted
WARNING: Disk Group VOTE containing voting files is not mounted
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "0" is missing from group number "1" 

The diskgroup, where OCR and voting files were located was not able to mount because one disk was missing. As a result CRS is down:

# crsctl status res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.

We know that NORMAL redundancy diskgroup can tolerate one mirror problem at a time.

Solution:

1. Start HAS and check status of the local resoureces

# crsctl start has

# crsctl status res -t -init

---------------------------------------------------------------------------
Name          Target      State        Server      State details       
---------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------- 
ora.asm
       1       ONLINE      ONLINE       rac2         STABLE
 ora.cluster_interconnect.haip
       1        ONLINE     ONLINE       rac2         STABLE
 ora.crf
       1        OFFLINE    OFFLINE                   STABLE
 ora.crsd
       1        ONLINE      OFFLINE                  STABLE
 ora.cssd
       1        ONLINE      ONLINE       rac2        STABLE
 ora.cssdmonitor
       1        ONLINE      ONLINE       rac2        STABLE
 ora.ctssd
       1        ONLINE      ONLINE       rac2        OBSERVER,STABLE
 ora.diskmon
       1        OFFLINE      OFFLINE                 STABLE
 ora.drivers.acfs
       1        ONLINE      ONLINE       rac2        STABLE
 ora.evmd
       1        ONLINE      INTERMEDIATE rac2        STABLE
 ora.gipcd
       1        ONLINE      ONLINE       rac2        STABLE
 ora.gpnpd
       1        ONLINE      ONLINE       rac2        STABLE
 ora.mdnsd
       1        ONLINE      ONLINE       rac2        STABLE
 ora.storage
       1        ONLINE      OFFLINE      rac2        STABLE 

2. Connect to the ASM instance and mount diskgroup using force option.

ASM instance will be in nomount state, because diskgroup having voting files and OCR cannot be mounted.

Force option is mandatory, otherwise you will get the same ORA-15040 error.

# su - grid

$ sqlplus / as sysasm

SQL*Plus: Release 12.2.0.1.0 Production on Tue May 28 16:14:14 2019
Copyright (c) 1982, 2016, Oracle.  All rights reserved.

Connected to:
 Oracle Database 12c Enterprise Edition Release 12.2.0.1.0 - 64bit Production

SQL> alter diskgroup VOTE mount force;
Diskgroup altered.

This operation sometimes takes ~6min to complete because of the following notification in alert_ASM?.log

"WARNING: Background operations delayed until 05/28/19 16:19:47 because ASM was not stopped cleanly and there could be disconnected client(s)"

The error message is self explanatory.

3. The diskgroup online operation on the 2nd step should trigger clusterware autostart, if not start it using the following command:

# crsctl start cluster

4. Check CRS status:

# crsctl status res -t 

---------------------------------------------------------------------------
Name           Target  State        Server       State details       
--------------------------------------------------------------------------- 
Local Resources

ora.ASMNET1LSNR_ASM.lsnr
                ONLINE  ONLINE       rac2        STABLE
ora.DATA.dg
                ONLINE  OFFLINE      rac2        STABLE
ora.FRA.dg
                ONLINE  OFFLINE      rac2        STABLE
ora.LISTENER.lsnr
                ONLINE  ONLINE       rac2        STABLE
ora.MGMT.dg
                ONLINE  OFFLINE      rac2        STABLE
ora.VOTE.dg
                ONLINE  ONLINE       rac2        STABLE
ora.chad
                ONLINE  OFFLINE      rac2        STABLE
ora.net1.network
                ONLINE  ONLINE       rac2        STABLE
ora.ons
                ONLINE  ONLINE       rac2        STABLE
ora.proxy_advm
                OFFLINE OFFLINE      rac2        STABLE
---------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------- 
ora.LISTENER_SCAN1.lsnr
       1        ONLINE  ONLINE       rac2        STABLE
 ora.LISTENER_SCAN2.lsnr
       1        ONLINE  ONLINE       rac2        STABLE
 ora.LISTENER_SCAN3.lsnr
       1        ONLINE  ONLINE       rac2        STABLE
 ora.MGMTLSNR
       1        OFFLINE OFFLINE                  STABLE
 ora.asm
       1        ONLINE  OFFLINE                  STABLE
       2        ONLINE  ONLINE       rac2        Started,STABLE
 ora.cvu
       1        ONLINE  ONLINE       rac2        STABLE
 ora.mgmtdb
       1        OFFLINE OFFLINE                  STABLE
 ora.qosmserver
       1        ONLINE  ONLINE       rac2        STABLE 
 ora.rac1.vip
       1        ONLINE  INTERMEDIATE rac2        FAILED OVER,STABLE
 ora.rac2.vip
       1        ONLINE  ONLINE       rac2        STABLE
 ora.scan1.vip
       1        ONLINE  ONLINE       rac2        STABLE
 ora.scan2.vip
       1        ONLINE  ONLINE       rac2        STABLE
 ora.scan3.vip
       1        ONLINE  ONLINE       rac2        STABLE

Recommendation:

Change corrupted disks as soon as possible and make it online.

Backup best practices for Oracle Clusterware

I recommend you to backup clusterware related files after initial setup and at any change. The backup files can save you from OCR, OLR corruption during GI patch. If any of the files become corrupted you will be able to recover it in several minutes (or seconds). Depends on the failure, you may lose several hours to recover your cluster to the state it was before something happened.

Here are the steps to protect your cluster:

1. Backup ASM spfile initially and at any change.

There are several ways to backup ASM spfile using spcopy, spbackup or create pfile=<backup location> from spfile.

To locate the Oracle ASM SPFILE, use the ASMCMD spget command:

ASMCMD> spget
+GRID/myrac/ASMPARAMETERFILE/registry.253.974466047

Copy the Oracle ASM SPFILE to the backup location:

ASMCMD> spbackup +GRID/myrac/ASMPARAMETERFILE/registry.253.974466047 /backup/spfileasm.ora

2. Backing up ASM password file once should be enough. If you change password for pwfile users or add another user into the list, then make a new backup.

Locate the password file using the ASMCMD pwget command.

ASMCMD> pwget --asm
+GRID/orapwASM

Back up the password file to another location with the pwcopy command.

ASMCMD> pwcopy +GRID/orapwASM  /backup/orapwASM 
copying +GRID/orapwASM -> /backup/orapwASM

3. Use md_backup command to create a backup file containing metadata for one or more disk groups.

To backup metadata for all disk groups, do the following:

ASMCMD> md_backup /tmp/dgmetabackup

Disk group metadata to be backed up: DATA
Disk group metadata to be backed up: FRA
Disk group metadata to be backed up: GRID

In case you need to backup metadata only for a specific disk group, use -G option.

4. Backup OLR on each node.

If OLR is missing or corrupted, clusterware can’t be started on that node. So make manual backup initially and after any change:

Do the following on each node:

# ocrconfig -local -manualbackup

Copy generated file to the backup location:

# cp /u01/app/12.2.0/grid/cdata/rac1/backup_20180510_230359.olr /backup/

Or change default backup location to /backup before making the actual backup:

# ocrconfig -local -backuploc /backup

# ocrconfig -local -manualbackup

5. Mirror and Backup OCR.

You should configure OCR in two independent disk groups. Typically, this is the work area and the recovery area. At least two OCR locations should be configured.

# ocrconfig -add +FRA

There are automatic OCR backups that are taken in the past 4 hours, 8 hours, 12 hours, and in the last day and week.

You can also manually backup OCR before applying any patch or upgrade GI home:

# ocrconfig -manualbackup

Regularly save taken backup to another location using the following way:

Identify the latest backup (manual or automatic):

[grid@rac1 ~]$ ocrconfig -showbackup
rac1 2018/05/10 13:06:18 +GRID:/myrac/OCRBACKUP/backup00.ocr.289.975762375 830990544
..

Copy it to the backup location:

$ ocrconfig -copy +GRID:/myrac/OCRBACKUP/backup00.ocr.289.975762375 /backup/backup00.ocr

Or change default backup locations to another diskgroup other than GRID:

# ocrconfig -backuploc +FRA