Forceful startup of CRS when a minority of VMs are down

If a minority of database nodes are down because of cloud maintenance, those nodes may not be startable. If CRS is also down on the remaining working nodes, manual intervention is required.

Before proceeding, confirm that the cluster still has majority quorum.

Majority formula = TRUNC((number of database nodes + number of quorum nodes) / 2) + 1

The cluster can be started only when a majority of voting members (database nodes plus quorum nodes) is available. If the available members do not reach that majority, the steps below will not work.
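As a worked example of the majority formula, the following is a minimal shell sketch. The function names majority_needed and has_majority are hypothetical helpers written for this illustration; they are not part of crsctl or any FlashGrid tool. Node counts are passed in by hand, nothing is queried from the cluster.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# majority_needed DB_NODES QUORUM_NODES
# Prints TRUNC((db + quorum) / 2) + 1. Shell integer division truncates.
majority_needed() {
    echo $(( ($1 + $2) / 2 + 1 ))
}

# has_majority DB_NODES QUORUM_NODES AVAILABLE_MEMBERS
# Succeeds (exit 0) when the available voting members reach the majority.
has_majority() {
    [ "$3" -ge "$(majority_needed "$1" "$2")" ]
}

# Example: 2 database nodes + 1 quorum node, with one database node down
# (2 voting members still available).
if has_majority 2 1 2; then
    echo "majority present: $(majority_needed 2 1) of 3 members required"
else
    echo "no majority: the procedures below will not work"
fi
```

With 2 database nodes and 1 quorum node the formula gives TRUNC(3 / 2) + 1 = 2, so losing one database node still leaves a majority, while losing two members does not.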

Use the following procedure on each database node where CRS fails to start.



Procedure 1: Restart CRS cleanly

1. Temporarily disable CRS autostart

crsctl disable crs

2. Stop any running CRS processes

crsctl stop crs -f

It is normal to see errors such as CRS-4639 or CRS-4000 when running this command. You can continue with the next steps.

3. Kill any remaining ohasd.bin reboot processes

ps -ef | grep "ohasd.bin reboot" | grep -v grep | awk '{print $2}' | xargs kill -9 > /dev/null 2>&1

4. [FlashGrid clusters only] Stop the flashgrid_wait service

flashgrid-node stop-waiting

Expected output may look similar to this:

pkill -USR1 -f flashgrid_wait ... OK

5. Restart the ohasd services

systemctl restart ohasd
systemctl restart oracle-ohasd

6. Monitor CRS startup

First, check whether the Clusterware daemons are running:

crsctl status res -t -init

If the Clusterware daemons started successfully, check the cluster resources:

crsctl status res -t

If CRS does not start automatically, start it manually:

crsctl start crs -wait

If startup hangs on ora.storage, check the ASM alert log (alert_+ASM?.log).

Look for errors such as ORA-15042 or ORA-15040.

If these errors are present, cancel the CRS startup, skip step 7, and continue with Procedure 2 below.

7. Re-enable CRS autostart

crsctl enable crs
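The alert log check in step 6 can be sketched as a small shell helper. This is only an illustration: the function name asm_log_has_mount_errors is hypothetical, and the log path must be supplied by you, since the location of alert_+ASM?.log depends on the installation.

```shell
#!/bin/sh
# Hypothetical helper for illustration only.

# asm_log_has_mount_errors LOGFILE
# Prints up to the last 5 lines that mention ORA-15042 or ORA-15040.
# Succeeds (exit 0) only if at least one such line was found.
asm_log_has_mount_errors() {
    matches=$(grep -E 'ORA-15042|ORA-15040' "$1" | tail -n 5)
    [ -n "$matches" ] && printf '%s\n' "$matches"
}

# Usage (the path is a placeholder; locate the ASM alert log on your node first):
#   if asm_log_has_mount_errors /path/to/alert_+ASM1.log; then
#       echo "errors found: cancel CRS startup and continue with Procedure 2"
#   fi
```

The helper only reads the log. The decision to cancel the startup and move on to Procedure 2 remains a manual one.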

Procedure 2: If CRS still does not start

Use this procedure if CRS did not start successfully and some CRS resources remain failed.

Repeat the following steps on each database node where CRS still fails to start.

1. Stop any running CRS processes

crsctl stop crs -f

2. Kill any remaining ohasd.bin reboot processes

ps -ef | grep "ohasd.bin reboot" | grep -v grep | awk '{print $2}' | xargs kill -9 > /dev/null 2>&1

3. Restart the ohasd services

systemctl restart ohasd
systemctl restart oracle-ohasd

4. Start only HAS

crsctl start has

5. Start ASM in nomount mode

Connect as the Grid Infrastructure owner, for example grid:

su - grid
sqlplus / as sysasm

Then start ASM in nomount mode:

startup nomount;

6. Try to mount all ASM diskgroups

alter diskgroup all mount;

7. If mounting all diskgroups fails, mount them one by one using force

For example:

alter diskgroup GRID mount force;
alter diskgroup DATA mount force;

Sometimes ASM delays background operations after an unclean shutdown. In that case, you may see a message similar to this in alert_+ASM?.log:

WARNING: Background operations delayed until 08/08/23 21:22:21 because ASM was not stopped cleanly and there could be disconnected client(s)

Do not cancel the running command. Wait until the time shown in the message. The diskgroup should mount after that delay.

8. Re-enable CRS autostart

crsctl enable crs

9. Check cluster status

crsctl status res -t
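When a forced mount in step 7 appears to hang, it helps to know how long the wait will be. A minimal sketch, assuming the warning has the same wording as the sample message in step 7; the function name asm_delay_until is hypothetical and the log path is supplied by you. The timestamp is printed exactly as it appears in the log and is not interpreted.

```shell
#!/bin/sh
# Hypothetical helper for illustration only.

# asm_delay_until LOGFILE
# Prints the timestamp from the most recent
# "Background operations delayed until ..." line, or nothing if none exists.
asm_delay_until() {
    sed -n 's/.*Background operations delayed until \([0-9/]* [0-9:]*\) because.*/\1/p' "$1" | tail -n 1
}

# Usage (the path is a placeholder for your ASM alert log):
#   until_ts=$(asm_delay_until /path/to/alert_+ASM1.log)
#   [ -n "$until_ts" ] && echo "diskgroup mount is delayed until $until_ts"
```

An empty result means no delay message was found, so the cause of the hang should be looked for elsewhere in the log.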

Azure: Get email when VM instance state changes

To receive an email notification when an Azure VM changes state, do the following:

1. In the Azure portal, select Service Health

2. In the left-side panel, choose Resource health -> click Create Resource Health alert rule

Fill in the necessary fields.

In the Actions section, specify an action group. If you have not created one yet, click Add action groups -> Create action group.

Fill in the required fields for the new action group.

Click Review + create -> Create.

After the action group is created, it is selected automatically. Fill in the fields under Alert rule details.

Click Create alert rule.

3. Go to the resource group where you created the action group (marirac2 in this example)

Resource groups -> marirac2 -> in the left-side panel choose Alerts -> Action groups -> select the action group (mariactgrp in this example) -> in the Notifications section choose Email/SMS message/Push/Voice -> in the right-side panel select the Email checkbox -> enter the email address of the person responsible for receiving and handling these alerts -> click OK -> enter a name for the notification in the Notifications section -> click Save changes.

4. Test the alert by stopping and starting the VM (only in a test environment)

After the VM changes state, you will receive an email notification.

Note that the notification is sent as soon as the VM changes state, but the email may arrive 2 to 3 minutes later.