Forceful startup of CRS when a minority of VMs are down

If a minority of database nodes are down because of cloud maintenance, those nodes may not be startable. If CRS is also down on the remaining working nodes, manual intervention is required.

Before proceeding, confirm that the cluster still has majority quorum.

Majority formula = TRUNC((number of database nodes + number of quorum nodes) / 2) + 1

The cluster can be started only when a majority of voting members (database nodes plus quorum nodes) is available. If fewer members than the majority are available, the steps below will not work.
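
As a quick way to apply the majority formula above, the following is a minimal shell sketch. The function names majority and has_majority are hypothetical helpers introduced here for illustration; they are not part of any Oracle or FlashGrid tooling. Shell integer division truncates, which matches TRUNC for non-negative inputs.

```shell
# Majority = TRUNC((database nodes + quorum nodes) / 2) + 1
# Hypothetical helper: prints the majority for the given node counts.
majority() {
    local db_nodes=$1 quorum_nodes=$2
    echo $(( (db_nodes + quorum_nodes) / 2 + 1 ))
}

# Hypothetical helper: succeeds (exit 0) when the number of available
# voting members reaches the majority.
has_majority() {
    local db_nodes=$1 quorum_nodes=$2 available=$3
    [ "$available" -ge "$(majority "$db_nodes" "$quorum_nodes")" ]
}

# Example: 2 database nodes + 1 quorum node, with 1 database node down,
# leaves 2 voting members available out of 3.
if has_majority 2 1 2; then
    echo "majority available"
else
    echo "no majority"
fi
```

In this example the majority is 2, so the two remaining members are enough and the script prints "majority available".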

Use the following procedure on each database node where CRS fails to start.



Procedure 1: Restart CRS cleanly

1. Temporarily disable CRS autostart

crsctl disable crs

2. Stop any running CRS processes

crsctl stop crs -f

It is normal to see errors such as CRS-4639 or CRS-4000 when running this command. You can continue with the next steps.

3. Kill any remaining ohasd.bin reboot processes

ps -ef | grep "ohasd.bin reboot" | grep -v grep | awk '{print $2}' | xargs kill -9 > /dev/null 2>&1

4. [Only if using FlashGrid cluster] Stop flashgrid_wait service

flashgrid-node stop-waiting

Expected output may look similar to this:

pkill -USR1 -f flashgrid_wait ... OK

5. Restart the ohasd services

systemctl restart ohasd
systemctl restart oracle-ohasd

6. Monitor CRS startup

First, check whether the Clusterware daemons are running:

crsctl status res -t -init

If the Clusterware daemons started successfully, check the cluster resources:

crsctl status res -t

If CRS does not start automatically, start it manually:

crsctl start crs -wait

If startup hangs on ora.storage, check the ASM alert log (alert_+ASM?.log).

Look for errors such as: ORA-15042, ORA-15040

If these errors are present, cancel the CRS startup, skip step 7, and continue with Procedure 2 below.

7. Re-enable CRS autostart

crsctl enable crs
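
The alert log check in step 6 can be sketched as a small shell helper. This is only an illustration: the function name asm_log_needs_force_mount is hypothetical, the path to alert_+ASM?.log depends on the installation and must be supplied by the caller, and the check matches any occurrence in the file, including old entries, so in practice only the most recent entries should be considered.

```shell
# Hypothetical helper: succeeds (exit 0) if the given ASM alert log
# contains ORA-15040 or ORA-15042, the errors named in step 6.
asm_log_needs_force_mount() {
    grep -Eq 'ORA-1504[02]' "$1"
}

# Demo with a throwaway sample log. On a real node, pass the path to
# alert_+ASM?.log instead (its location is installation-specific).
sample_log=$(mktemp)
printf '%s\n' \
    'ORA-15040: diskgroup is incomplete' \
    'ORA-15042: ASM disk "0" is missing from group number "1"' > "$sample_log"

if asm_log_needs_force_mount "$sample_log"; then
    echo "ORA-15040/ORA-15042 found: continue with Procedure 2"
else
    echo "no diskgroup errors found: continue with step 7"
fi
rm -f "$sample_log"
```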

Procedure 2: If CRS still does not start

Use this procedure if CRS did not start successfully and some CRS resources remain failed.

Repeat the following steps on each database node where CRS still fails to start.

1. Stop any running CRS processes

crsctl stop crs -f

2. Kill any remaining ohasd.bin reboot processes

ps -ef | grep "ohasd.bin reboot" | grep -v grep | awk '{print $2}' | xargs kill -9 > /dev/null 2>&1

3. Restart the ohasd services

systemctl restart ohasd
systemctl restart oracle-ohasd

4. Start only HAS

crsctl start has

5. Start ASM in nomount mode

Connect as the Grid Infrastructure owner, for example grid:

su - grid
sqlplus / as sysasm

Then start ASM in nomount mode:

startup nomount;

6. Try to mount all ASM diskgroups

alter diskgroup all mount;

7. If mounting all diskgroups fails, mount them one by one using force

For example:

alter diskgroup GRID mount force;
alter diskgroup DATA mount force;

Sometimes ASM delays background operations after an unclean shutdown. In that case, you may see a message similar to this in alert_+ASM?.log:

WARNING: Background operations delayed until 08/08/23 21:22:21 because ASM was not stopped cleanly and there could be disconnected client(s)

Do not cancel the running command. Wait until the time shown in the message. The diskgroup should mount after that delay.

8. Re-enable CRS autostart

crsctl enable crs

9. Check cluster status

crsctl status res -t
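
To see how long the forced mount in step 7 may wait, the "delayed until" timestamp can be pulled out of the alert log. The following is a minimal sketch, assuming the warning has the same wording as the example above; the function name asm_delay_until is hypothetical and the log path must be supplied by the caller.

```shell
# Hypothetical helper: print the "delayed until" timestamp from the most
# recent matching warning in the given log, or nothing if none is present.
asm_delay_until() {
    sed -n 's/.*Background operations delayed until \([0-9/]* [0-9:]*\) because.*/\1/p' "$1" | tail -n 1
}

# Demo with a throwaway sample log containing the warning quoted above.
sample_log=$(mktemp)
echo 'WARNING: Background operations delayed until 08/08/23 21:22:21 because ASM was not stopped cleanly and there could be disconnected client(s)' > "$sample_log"

echo "Wait until: $(asm_delay_until "$sample_log")"
rm -f "$sample_log"
```

For the sample line this prints "Wait until: 08/08/23 21:22:21".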

Starting CRS despite “ORA-15040: diskgroup is incomplete” on the voting file/OCR diskgroup

Problem:

CRS was down on both nodes. During startup, the cluster encountered the following errors while trying to mount the diskgroup containing the voting files and OCR:

WARNING: Disk Group VOTE containing configured OCR is not mounted
WARNING: Disk Group VOTE containing voting files is not mounted
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "0" is missing from group number "1" 

The diskgroup where the OCR and voting files are located could not be mounted because one disk was missing. As a result, CRS is down:

# crsctl status res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.

A NORMAL redundancy diskgroup can tolerate a problem with one mirror at a time, so the diskgroup can still be mounted while one disk is missing.

Solution:

1. Start HAS and check the status of the local resources

# crsctl start has

# crsctl status res -t -init

---------------------------------------------------------------------------
Name           Target  State        Server       State details
---------------------------------------------------------------------------
Cluster Resources
---------------------------------------------------------------------------
ora.asm
      1        ONLINE  ONLINE       rac2         STABLE
ora.cluster_interconnect.haip
      1        ONLINE  ONLINE       rac2         STABLE
ora.crf
      1        OFFLINE OFFLINE                   STABLE
ora.crsd
      1        ONLINE  OFFLINE                   STABLE
ora.cssd
      1        ONLINE  ONLINE       rac2         STABLE
ora.cssdmonitor
      1        ONLINE  ONLINE       rac2         STABLE
ora.ctssd
      1        ONLINE  ONLINE       rac2         OBSERVER,STABLE
ora.diskmon
      1        OFFLINE OFFLINE                   STABLE
ora.drivers.acfs
      1        ONLINE  ONLINE       rac2         STABLE
ora.evmd
      1        ONLINE  INTERMEDIATE rac2         STABLE
ora.gipcd
      1        ONLINE  ONLINE       rac2         STABLE
ora.gpnpd
      1        ONLINE  ONLINE       rac2         STABLE
ora.mdnsd
      1        ONLINE  ONLINE       rac2         STABLE
ora.storage
      1        ONLINE  OFFLINE      rac2         STABLE

2. Connect to the ASM instance and mount the diskgroup using the force option.

The ASM instance will be in nomount state because the diskgroup containing the voting files and OCR cannot be mounted.

The force option is mandatory; without it you will get the same ORA-15040 error.

# su - grid

$ sqlplus / as sysasm

SQL*Plus: Release 12.2.0.1.0 Production on Tue May 28 16:14:14 2019
Copyright (c) 1982, 2016, Oracle.  All rights reserved.

Connected to:
 Oracle Database 12c Enterprise Edition Release 12.2.0.1.0 - 64bit Production

SQL> alter diskgroup VOTE mount force;
Diskgroup altered.

This operation can take about 6 minutes to complete because of the following warning in alert_+ASM?.log:

"WARNING: Background operations delayed until 05/28/19 16:19:47 because ASM was not stopped cleanly and there could be disconnected client(s)"

The message is self-explanatory: the mount completes once the time shown in the warning has passed.

3. Bringing the diskgroup online in step 2 should trigger Clusterware autostart. If it does not, start the cluster with the following command:

# crsctl start cluster

4. Check CRS status:

# crsctl status res -t 

---------------------------------------------------------------------------
Name           Target  State        Server       State details
---------------------------------------------------------------------------
Local Resources
---------------------------------------------------------------------------
ora.ASMNET1LSNR_ASM.lsnr
               ONLINE  ONLINE       rac2         STABLE
ora.DATA.dg
               ONLINE  OFFLINE      rac2         STABLE
ora.FRA.dg
               ONLINE  OFFLINE      rac2         STABLE
ora.LISTENER.lsnr
               ONLINE  ONLINE       rac2         STABLE
ora.MGMT.dg
               ONLINE  OFFLINE      rac2         STABLE
ora.VOTE.dg
               ONLINE  ONLINE       rac2         STABLE
ora.chad
               ONLINE  OFFLINE      rac2         STABLE
ora.net1.network
               ONLINE  ONLINE       rac2         STABLE
ora.ons
               ONLINE  ONLINE       rac2         STABLE
ora.proxy_advm
               OFFLINE OFFLINE      rac2         STABLE
---------------------------------------------------------------------------
Cluster Resources
---------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       rac2         STABLE
ora.LISTENER_SCAN2.lsnr
      1        ONLINE  ONLINE       rac2         STABLE
ora.LISTENER_SCAN3.lsnr
      1        ONLINE  ONLINE       rac2         STABLE
ora.MGMTLSNR
      1        OFFLINE OFFLINE                   STABLE
ora.asm
      1        ONLINE  OFFLINE                   STABLE
      2        ONLINE  ONLINE       rac2         Started,STABLE
ora.cvu
      1        ONLINE  ONLINE       rac2         STABLE
ora.mgmtdb
      1        OFFLINE OFFLINE                   STABLE
ora.qosmserver
      1        ONLINE  ONLINE       rac2         STABLE
ora.rac1.vip
      1        ONLINE  INTERMEDIATE rac2         FAILED OVER,STABLE
ora.rac2.vip
      1        ONLINE  ONLINE       rac2         STABLE
ora.scan1.vip
      1        ONLINE  ONLINE       rac2         STABLE
ora.scan2.vip
      1        ONLINE  ONLINE       rac2         STABLE
ora.scan3.vip
      1        ONLINE  ONLINE       rac2         STABLE
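
Long status listings like the one above are easier to review when only the problem entries are shown. The following is a minimal sketch, assuming the output layout shown in this article (a resource name line followed by rows that optionally start with an instance number). The function name list_not_online is a hypothetical helper, not a crsctl feature.

```shell
# Hypothetical helper: read `crsctl status res -t` output on stdin and print
# "<resource> <state>" for every row whose Target is ONLINE but whose State
# is something other than ONLINE.
list_not_online() {
    awk '
        # A line whose first field starts with "ora." names the resource.
        $1 ~ /^ora\./ { name = $1; next }
        # Cluster resource rows start with an instance number.
        $1 ~ /^[0-9]+$/  { target = $2; state = $3 }
        # Local resource rows start directly with the target.
        $1 !~ /^[0-9]+$/ { target = $1; state = $2 }
        name != "" && target == "ONLINE" && state != "ONLINE" { print name, state }
    '
}

# Demo using a few rows taken from the listing above.
printf '%s\n' \
    'ora.DATA.dg' \
    '               ONLINE  OFFLINE      rac2         STABLE' \
    'ora.VOTE.dg' \
    '               ONLINE  ONLINE       rac2         STABLE' \
    'ora.mgmtdb' \
    '      1        OFFLINE OFFLINE                   STABLE' \
    'ora.rac1.vip' \
    '      1        ONLINE  INTERMEDIATE rac2         FAILED OVER,STABLE' | list_not_online
```

On a cluster node the same function could be fed directly, for example with crsctl status res -t | list_not_online. Resources whose target is OFFLINE (such as ora.mgmtdb in the demo) are deliberately not reported.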

Recommendation:

Replace the corrupted disks as soon as possible and bring them online.