CRS-2549: Resource ‘ora.asmgroup’ cannot be placed on ‘rac1’ as it is not a valid candidate as per the placement policy

Problem:

After a failed JDK patching attempt on the 1st node, while troubleshooting we saw that ASM was not able to start:

# su - grid
$ sqlplus / as sysasm
SQL> startup nomount;
ORA-32004: obsolete or deprecated parameter(s) specified for ASM instance
ORA-39511: Start of CRS resource for instance '223' failed with error:[CRS-2549: Resource 'ora.asmgroup' cannot be placed on 'rac1' as it is not a valid candidate as per the placement policy
CRS-0223: Resource 'ora.asm' has placement error.
clsr_start_resource:260 status:223
clsrapi_start_asm:start_asmdbs status:223

Reason:

The prepatch step had set RESOURCE_USE_ENABLED=0 for the rac1 node:

[grid@rac1 ~]$ crsctl stat server -f

NAME=rac1
MEMORY_SIZE=63465
CPU_COUNT=8
CPU_CLOCK_RATE=2499
CPU_HYPERTHREADING=1
CPU_EQUIVALENCY=1000
DEPLOYMENT=other
CONFIGURED_CSS_ROLE=hub
RESOURCE_USE_ENABLED=0
SERVER_LABEL=
PHYSICAL_HOSTNAME=
CSS_CRITICAL=no
CSS_CRITICAL_TOTAL=0
RESOURCE_TOTAL=0
SITE_NAME=stsfilive
STATE=ONLINE
ACTIVE_POOLS=Free
STATE_DETAILS=
ACTIVE_CSS_ROLE=hub

NAME=rac2
MEMORY_SIZE=63465
CPU_COUNT=8
CPU_CLOCK_RATE=2499
CPU_HYPERTHREADING=1
CPU_EQUIVALENCY=1000
DEPLOYMENT=other
CONFIGURED_CSS_ROLE=hub
RESOURCE_USE_ENABLED=1
….

Solution:

Connect to the failing node and run:

[root@rac1 ~]# crsctl set resource use 1

Start ASM.
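After setting the flag, you can verify it (it should now show RESOURCE_USE_ENABLED=1) and bring ASM up, for example with the same commands used elsewhere in this post:

[root@rac1 ~]# crsctl stat server -f | grep RESOURCE_USE_ENABLED

# su - grid
$ sqlplus / as sysasm
SQL> startup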

Monitoring ASM disk performance using IOSTAT

iostat in asmcmd displays I/O statistics for Oracle ASM disks in mounted disk groups.

Connect to the database node as the GI owner:

# su - grid

Run iostat with the following options (Reads & Writes are in bytes):

# asmcmd
ASMCMD> iostat -t -G FRA 5
Group_Name  Dsk_Name   Reads      Writes    Read_Time  Write_Time
FRA         RAC1$LUN3  585083392  98942464  94.659862  4.03044
FRA         RAC2$LUN3  1847296    98942464  .054822    4.134049
FRA         RACQ$LUN4  57344      24576     .035944    .018594

Group_Name  Dsk_Name   Reads      Writes  Read_Time  Write_Time
FRA         RAC1$LUN3  368640.00  0.00    0.01       0.00
FRA         RAC2$LUN3  0.00       0.00    0.00       0.00
FRA         RACQ$LUN4  0.00       0.00    0.00       0.00

Where:
-t displays time statistics (Read_Time, Write_Time).
-G FRA displays statistics for the FRA diskgroup; change the diskgroup name as needed.
5 is the refresh interval in seconds. When an interval is specified, the value displayed (bytes or I/Os) is the difference between the previous and current samples, not the total value. If no interval is specified, the number displayed represents the cumulative total of bytes or I/Os.
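asmcmd iostat pulls its numbers from the V$ASM_DISK_STAT view (as the help output below also notes), so the cumulative figures can be queried directly as well. A rough equivalent of the first sample above, as a sketch (adjust the diskgroup name):

SQL> select g.name group_name, d.name disk_name,
            d.bytes_read, d.bytes_written, d.read_time, d.write_time
       from v$asm_disk_stat d, v$asm_diskgroup_stat g
      where d.group_number = g.group_number
        and g.name = 'FRA';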

For a synopsis and description of all available iostat options, run help:

ASMCMD> help iostat
iostat
        Displays I/O statistics for Oracle ASM disks in mounted disk groups.

Synopsis
        iostat [-et][--io] [--suppressheader] [--region] [-G <diskgroup>] [<interval>]

Description
        iostat lists disk group statistics using the V$ASM_DISK_STAT view.
        The options for the iostat command are described below.
        -e		- Displays error statistics (Read_Err, Write_Err).
        -G diskgroup	- Displays statistics for the disk group name.
        --suppressheader	- Suppresses column headings.
        --io		- Displays information in number of I/Os, instead
                          of bytes.
        -t		- Displays time statistics (Read_Time, Write_Time).
        --region	- Displays information for cold and hot disk regions
                          (Cold_Reads, Cold_Writes, Hot_Reads, Hot_Writes).
        interval	- Refreshes the statistics display based on the
                          interval value (seconds).
        The attribute descriptions for iostat command output are described
	below. To view the complete set of statistics for a disk group,
	use the V$ASM_DISK_STAT view.
        Group_Name	        Name of the disk group.
        Dsk_Name	        Name of the disk.
        Reads	        	Total number of bytes read from the disk.
				If the --io option is entered, then the value
				is displayed as number of I/Os.
        Writes	        	Total number of bytes written to the disk.
				If the --io option is entered, then the value
				is displayed as number of I/Os.
        Cold_Reads	        Total number of bytes read from the cold disk
				region. If the --io option is entered, then
				the value is displayed as number of I/Os.
        Cold_Writes	        Total number of bytes written to the cold
				disk region. If the --io option is entered,
				then the value is displayed as number of I/Os.
        Hot_Reads	        Total number of bytes read from the hot
				disk region. If the --io option is entered,
				then the value is displayed as number of I/Os.
        Hot_Writes	        Total number of bytes written to the hot disk
				region. If the --io option is entered, then the
				value is displayed as number of I/Os.
        Read_Err	        Total number of failed I/O read requests for
				the disk.
        Write_Err	        Total number of failed I/O write requests for
				the disk.
        Read_Time	        Total I/O time (in seconds) for
				read requests for the disk if the
				TIMED_STATISTICS initialization parameter is
				set to TRUE (0 if set to FALSE).
        Write_Time	        Total I/O time (in seconds) for
				write requests for the disk if the
				TIMED_STATISTICS initialization parameter is
				set to TRUE (0 if set to FALSE).
        If a refresh interval is not specified, the number displayed represents
        the total number of bytes or I/Os.  If a refresh interval is specified,
        then the value displayed (bytes or I/Os) is the difference between the
        previous and current values, not the total value.

Examples
        The following are examples of the iostat command. The first example
        displays disk I/O statistics for the data disk group in total number
        of bytes. The second example displays disk I/O statistics for the data
        disk group in total number of I/O operations.
        ASMCMD [+] > iostat -G data
        Group_Name  Dsk_Name   Reads       Writes
        DATA        DATA_0000  180488192   473707520
        DATA        DATA_0001  1089585152  469538816
        DATA        DATA_0002  191648256   489570304
        DATA        DATA_0003  175724032   424845824
        DATA        DATA_0004  183421952   781429248
        DATA        DATA_0005  1102540800  855269888
        DATA        DATA_0006  171290624   447662592
        DATA        DATA_0007  172281856   361337344
        DATA        DATA_0008  173225472   390840320
        DATA        DATA_0009  288497152   838680576
        DATA        DATA_0010  196657152   375764480
        DATA        DATA_0011  436420096   356003840
        ASMCMD [+] > iostat --io -G data
        Group_Name  Dsk_Name   Reads  Writes
        DATA        DATA_0000  2801   34918
        DATA        DATA_0001  58301  35700
        DATA        DATA_0002  3320   36345
        DATA        DATA_0003  2816   10629
        DATA        DATA_0004  2883   34850
        DATA        DATA_0005  59306  38097
        DATA        DATA_0006  2151   10129
        DATA        DATA_0007  2686   10376
        DATA        DATA_0008  2105   8955
        DATA        DATA_0009  9121   36713
        DATA        DATA_0010  3557   8596
        DATA        DATA_0011  17458  9269

ora.storage fails, Error 4 querying length of attr ASM_DISCOVERY_ADDRESS, ORA-01017

Problem:

CRS on the 1st node is able to start, but not on the 2nd node.

CRS on the 2nd node hangs and later fails:

CRS-2672: Attempting to start 'ora.storage' on 'rac2'
ORA-01017: invalid username/password; logon denied
CRS-5055: unable to connect to an ASM instance because no ASM instance is running in the cluster

During that time, the CRS alert.log shows:

2022-03-15 20:15:23.722 [ORAROOTAGENT(63477)]CRS-5019: All OCR locations are on ASM disk groups [GRID], and none of these disk groups are mounted. Details are at "(:CLSN00140:)" in "/u01/app/grid/diag/crs/rac2/crs/trace/ohasd_orarootagent_root.trc".

ohasd_orarootagent_root.trc shows:

2022-03-15 20:23:35.108 : USRTHRD:1769867008: [     INFO] {0:5:3} [ora.storage] 9788 Error 4 querying length of attr ASM_DISCOVERY_ADDRESS

2022-03-15 20:23:35.110 : USRTHRD:1769867008: [     INFO] {0:5:3} [ora.storage] 9788 Error 4 querying length of attr ASM_STATIC_DISCOVERY_ADDRESS

2022-03-15 20:23:35.136 : USRTHRD:1769867008: [     INFO] {0:5:3} [ora.storage] 9506 Error 4 opening dom root in 0x7fa3100013a0

Reason:

Either the password file is corrupted or it does not exist. In our case, the GRID diskgroup had been recreated after clearing the disk headers, and we had forgotten to copy the ASM password file back.

Solution:

1. If you have an ASM password file backup, you can simply place it back into the ASM diskgroup:
$ asmcmd pwcopy --asm /tmp/asm_passwordfile +GRID/orapwASM -f

and stop/start CRS.
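For future use, such a backup can be taken ahead of time by copying the password file out of ASM; a minimal sketch, with the target path being just an example:

$ asmcmd pwcopy +GRID/orapwASM /tmp/asm_passwordfile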

2. If you don't have a password file backup, you need to create a new one and add the necessary users to it:

[grid@rac1 ~]$ asmcmd pwcreate --asm +GRID/orapwasm -f
Enter password: **********

Check existing users:

[grid@rac1 ~]$ asmcmd lspwusr
Username sysdba sysoper sysasm
SYS TRUE TRUE FALSE

Add necessary users and grant permissions:

$ asmcmd orapwusr --grant sysasm SYS
$ asmcmd orapwusr --add ASMSNMP
Enter password: *********
$ asmcmd orapwusr --grant sysdba ASMSNMP

Check permissions again:

$ asmcmd lspwusr
Username sysdba sysoper sysasm
     SYS   TRUE    TRUE   TRUE
 ASMSNMP   TRUE   FALSE  FALSE

Find out the user name and password that CRSD uses to connect. GI uses the internal user CRSUSER__ASM_001 with an internally generated password to access ASM during startup.

Find the string SYSTEM.ASM.CREDENTIALS.USERS.CRSUSER__ASM_001 in the following output and save the ORATEXT value:

# ocrdump -stdout | less
...
[SYSTEM.ASM.CREDENTIALS.USERS.CRSUSER__ASM_001]
ORATEXT : d68aec9585136fa8ff8f79f483e4ae64:grid
SECURITY : {USER_PERMISSION : PROCR_ALL_ACCESS, GROUP_PERMISSION : PROCR_READ, OTHER_PERMISSION : PROCR_NONE, USER_NAME : grid, GROUP_NAME : oinstall}

Query the password for the GUID user. The GUID will be different in your case; retrieve the value from your own output:

# crsctl get credmaint -path /ASM/Self/d68aec9585136fa8ff8f79f483e4ae64 -credtype userpass -id 0 -attr passwd -local
mB28wSM4AVFAVEYamUIvrMjEo2Nfa

Add this user to ASM password file:

$ asmcmd orapwusr --add CRSUSER__ASM_001
>>>>> provide the <password> you retrieved earlier

Grant the necessary privileges to this user:

$ asmcmd orapwusr --grant sysdba CRSUSER__ASM_001
$ asmcmd orapwusr --grant sysasm CRSUSER__ASM_001

Check the list again:

$ asmcmd lspwusr
        Username sysdba sysoper sysasm
             SYS   TRUE    TRUE   TRUE
         ASMSNMP   TRUE   FALSE  FALSE
CRSUSER__ASM_001   TRUE   FALSE   TRUE

Stop/Start CRS on the remaining node.
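For example, as root on that node (the same commands are used elsewhere in this post):

[root@rac2 ~]# crsctl stop crs -f
[root@rac2 ~]# crsctl start crs -wait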

Apply GI patch 19.14 (33509923) on GI 19.3

Current environment:

$ $ORACLE_HOME/OPatch/opatch lspatches
29585399;OCW RELEASE UPDATE 19.3.0.0.0 (29585399)
29517247;ACFS RELEASE UPDATE 19.3.0.0.0 (29517247)
29517242;Database Release Update : 19.3.0.0.190416 (29517242)
29401763;TOMCAT RELEASE UPDATE 19.0.0.0.0 (29401763)

1. You must use OPatch version 12.2.0.1.28 or later to apply this patch.

Download p6880880_190000_Linux-x86-64.zip – OPatch 12.2.0.1.28 for DB 19.0.0.0.0 (Nov 2021), or later.

Replace the existing OPatch with the new one:

# export ORACLE_HOME=/u01/app/19c/grid
# rm -rf $ORACLE_HOME/OPatch
# su - grid 
$ export ORACLE_HOME=/u01/app/19c/grid
$ unzip -o /u01/swtmp/p6880880_190000_Linux-x86-64.zip -d $ORACLE_HOME

$ $ORACLE_HOME/OPatch/opatch version
OPatch Version: 12.2.0.1.28

2. Unzip the patch 33509923 as the Grid home owner:

# su - grid
$ cd /u01/swtmp
$ unzip p33509923_190000_Linux-x86-64.zip

3. Determine whether any currently installed one-off patches conflict with patch 33509923, as follows:

# su - grid

$ $ORACLE_HOME/OPatch/opatch prereq CheckConflictAgainstOHWithDetail -phBaseDir /u01/swtmp/33509923/33515361

Oracle Home       : /u01/app/19c/grid
Central Inventory : /u01/app/oraInventory
   from           : /u01/app/19c/grid/oraInst.loc
OPatch version    : 12.2.0.1.28
OUI version       : 12.2.0.7.0
Log file location : /u01/app/19c/grid/cfgtoollogs/opatch/opatch2022-01-31_11-58-55AM_1.log

Invoking prereq "checkconflictagainstohwithdetail"
Prereq "checkConflictAgainstOHWithDetail" passed.
OPatch succeeded.

$ $ORACLE_HOME/OPatch/opatch prereq CheckConflictAgainstOHWithDetail -phBaseDir /u01/swtmp/33509923/33529556

$ $ORACLE_HOME/OPatch/opatch prereq CheckConflictAgainstOHWithDetail -phBaseDir /u01/swtmp/33509923/33534448

$ $ORACLE_HOME/OPatch/opatch prereq CheckConflictAgainstOHWithDetail -phBaseDir /u01/swtmp/33509923/33239955

$ $ORACLE_HOME/OPatch/opatch prereq CheckConflictAgainstOHWithDetail -phBaseDir /u01/swtmp/33509923/33575402

If a conflict is detected, stop the patch installation and contact Oracle Support Services; otherwise, continue.

4. Stop the database instances running on that server.
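For example, with srvctl from the database home; the database and instance names below are placeholders, adjust them to your environment:

$ srvctl stop instance -d <db_unique_name> -i <instance_name>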

5. Apply patch using opatchauto:

# cd /u01/swtmp/33509923/
# export ORACLE_HOME=/u01/app/19c/grid/
# export PATH=$ORACLE_HOME/bin:$PATH
# $ORACLE_HOME/OPatch/opatchauto apply
...
==Following patches were SUCCESSFULLY applied:

Patch: /u01/swtmp/33509923/33239955
Log: /u01/app/19c/grid/cfgtoollogs/opatchauto/core/opatch/opatch2022-02-03_11-33-44AM_1.log

Patch: /u01/swtmp/33509923/33515361
Log: /u01/app/19c/grid/cfgtoollogs/opatchauto/core/opatch/opatch2022-02-03_11-33-44AM_1.log

Patch: /u01/swtmp/33509923/33529556
Log: /u01/app/19c/grid/cfgtoollogs/opatchauto/core/opatch/opatch2022-02-03_11-33-44AM_1.log

Patch: /u01/swtmp/33509923/33534448
Log: /u01/app/19c/grid/cfgtoollogs/opatchauto/core/opatch/opatch2022-02-03_11-33-44AM_1.log

Patch: /u01/swtmp/33509923/33575402
Log: /u01/app/19c/grid/cfgtoollogs/opatchauto/core/opatch/opatch2022-02-03_11-33-44AM_1.log

Check the current patch level:

$ $ORACLE_HOME/OPatch/opatch lspatches
33575402;DBWLM RELEASE UPDATE 19.0.0.0.0 (33575402)
33534448;ACFS RELEASE UPDATE 19.14.0.0.0 (33534448)
33529556;OCW RELEASE UPDATE 19.14.0.0.0 (33529556)
33515361;Database Release Update : 19.14.0.0.220118 (33515361)
33239955;TOMCAT RELEASE UPDATE 19.0.0.0.0 (33239955)

6. After finishing the patch, it is recommended to back up the clusterware components.
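For example, a manual OCR backup can be taken as root (the same command is shown in the OCR backup section later in this post):

[root@rac1 ~]# ocrconfig -manualbackup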

Make Oracle ASM voting file online

Problem:

After changing the quorum node instance type, one of my cluster's voting files went offline:

[root@rac1 ~]# crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   49400dd2b39a4f12bf3c5fa677c056fe (/dev/flashgrid/rac2.xvdba) [GRID]
 2. ONLINE   4a6d94d206104fe6bfbe5435ac7f4586 (/dev/flashgrid/rac1.xvdba) [GRID]
 3. OFFLINE  faf99f5fd78f4f35bfe833bdd1d22b9a (/dev/flashgrid/racq.xvdba) [GRID]
Located 3 voting disk(s).

Solution:

Find the ASM disk that contains the mentioned voting file, then offline and online it:

SQL> select NAME from v$ASM_DISK where PATH='/dev/flashgrid/racq.xvdba';

NAME
------------------------------
RACQ$XVDBA

Offline the disk:

SQL> alter diskgroup GRID offline quorum disk "RACQ$XVDBA";

Diskgroup altered.

Online again:

SQL> alter diskgroup GRID online quorum disk "RACQ$XVDBA";

Diskgroup altered.

Check the status again:

SQL> !crsctl query css votedisk

##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   49400dd2b39a4f12bf3c5fa677c056fe (/dev/flashgrid/rac2.xvdba) [GRID]
 2. ONLINE   4a6d94d206104fe6bfbe5435ac7f4586 (/dev/flashgrid/rac1.xvdba) [GRID]
 3. ONLINE   784f924d23c94f3fbf4287c5c6ef572c (/dev/flashgrid/racq.xvdba) [GRID]
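Optionally, you can also confirm from ASM that the disk itself is back online; MODE_STATUS should show ONLINE:

SQL> select name, mode_status from v$asm_disk where name='RACQ$XVDBA';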

Moving GRID disk group files to another disk group

To migrate all content from the +GRID diskgroup to another, newly created one, we first need to know which files are located on it:

  • ASM password file
  • ASM Spfile
  • OCR
  • Voting files
  • OCR backups (if configured on the same diskgroup)
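Before starting, you can check what is actually stored in the source diskgroup, for example with a listing of its contents (paths will differ in your environment):

[grid@rac1 ~]$ asmcmd ls -l +GRID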

Let’s migrate all of them one by one:

Migrate ASM password file

1. Locate the Oracle ASM password file:

[grid@rac1 ~]$ asmcmd pwget --asm
+GRID/orapwASM

2. Migrate the password file:

[grid@rac1 ~]$ asmcmd pwmove --asm -f +GRID/orapwASM +GRID2/orapwASM
moving +GRID/orapwASM -> +GRID2/orapwASM

3. Verify that the file has a new path:

[grid@rac1 ~]$ asmcmd pwget --asm
+GRID2/orapwASM

Migrate ASM Spfile

1. Locate the Oracle ASM SPFILE:

[grid@rac1 ~]$ asmcmd spget
+GRID/marirac/ASMPARAMETERFILE/registry.253.1088678891

2. Migrate the spfile:

[grid@rac1 ~]$ asmcmd spmove +GRID/marirac/ASMPARAMETERFILE/registry.253.1088678891 +GRID2/marirac/ASMPARAMETERFILE/spfileASM
ORA-15032: not all alterations performed
ORA-15028: ASM file '+GRID/marirac/ASMPARAMETERFILE/registry.253.1088678891' not dropped; currently being accessed (DBD ERROR: OCIStmtExecute)

The error message can be ignored; the new location will be used after we restart CRS.

3. Verify:

[grid@rac1 ~]$ asmcmd spget
+GRID2/marirac/ASMPARAMETERFILE/spfileASM

Migrate OCR

1. Get the current OCR location:

[grid@rac1 ~]$ ocrcheck -config
Oracle Cluster Registry configuration is :
	 Device/File Name         :      +GRID

2. Move OCR:

[grid@rac1 ~]$  ocrconfig -add +GRID2
PROT-20: Insufficient permission to proceed. Require privileged user

[grid@rac1 ~]$ exit
logout

[root@rac1 ~]# ocrconfig -add +GRID2
[root@rac1 ~]# ocrconfig -delete +GRID

3. Verify:

[root@rac1 ~]# ocrcheck -config
Oracle Cluster Registry configuration is :
	 Device/File Name         :     +GRID2
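You can additionally run ocrcheck without arguments as root to perform a full OCR integrity check against the new location:

[root@rac1 ~]# ocrcheck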

Migrate voting files

1. Get the current location:

[root@rac1 ~]# crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
1. ONLINE   544b7b2dc9f14f8dbf8f5c560a32a95f (/dev/flashgrid/rac2.xvdba) [GRID]
2. ONLINE   c4035c7009be4f26bffd663651e4d520 (/dev/flashgrid/rac1.xvdba) [GRID]
3. ONLINE   5737c31731574fa8bf2acc107fbbd364 (/dev/flashgrid/racq.xvdba) [GRID]
Located 3 voting disk(s).

2. Move:

[root@rac1 ~]# crsctl replace votedisk +GRID2
Successful addition of voting disk 26221fd4d7334fa8bfc98be1908ee3ef.
Successful addition of voting disk 093f9c21b9864f87bfc4853547f05a16.
Successful addition of voting disk 9c2a9fd2fc334f7ebfb44c04bdb0cf57.
Successful deletion of voting disk 544b7b2dc9f14f8dbf8f5c560a32a95f.
Successful deletion of voting disk c4035c7009be4f26bffd663651e4d520.
Successful deletion of voting disk 5737c31731574fa8bf2acc107fbbd364.
Successfully replaced voting disk group with +GRID2.
CRS-4266: Voting file(s) successfully replaced

3. Verify:

[root@rac1 ~]# crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
1. ONLINE   26221fd4d7334fa8bfc98be1908ee3ef (/dev/flashgrid/rac1.xvdbc) [GRID2]
2. ONLINE   093f9c21b9864f87bfc4853547f05a16 (/dev/flashgrid/rac2.xvdbc) [GRID2]
3. ONLINE   9c2a9fd2fc334f7ebfb44c04bdb0cf57 (/dev/flashgrid/racq.xvdbz) [GRID2]
Located 3 voting disk(s).

Moving OCR backup

Please note that keeping the OCR backup in the same diskgroup where the OCR itself is located is not good practice; you should have a separate diskgroup for that. So let's assume we have a separate diskgroup for it.

1. Check the current location:

[root@rac1 ~]# ocrconfig -showbackup

rac2     2021/11/29 17:07:02     +GRID:/marirac/OCRBACKUP/backup00.ocr.276.1089911215     1443639413

rac2     2021/11/25 16:52:08     +GRID:/marirac/OCRBACKUP/backup01.ocr.275.1089564721     1443639413

rac2     2021/11/21 14:13:23     +GRID:/marirac/OCRBACKUP/backup02.ocr.277.1089209597     1443639413

rac2     2021/11/29 17:07:02     +GRID:/marirac/OCRBACKUP/day.ocr.272.1089911223     1443639413

rac1     2021/11/15 15:05:26     +GRID:/marirac/OCRBACKUP/week.ocr.273.1088694327     1443639413
PROT-25: Manual backups for the Oracle Cluster Registry are not available

2. Reconfigure:

[root@rac1 ~]# ocrconfig -backuploc +FRA

Automatic OCR backups are kept for the past 4 hours, 8 hours, 12 hours, and for the last day and week. Until new automatic backups are written to the new location, we can run a manual backup for safety:

[root@rac1 ~]# ocrconfig -manualbackup

rac2     2021/11/30 12:20:15     +FRA:/marirac/OCRBACKUP/backup_20211130_122015.ocr.257.1089980415     1443639413

3. Verify:

[root@rac1 ~]# ocrconfig -showbackup

rac2     2021/11/29 17:07:02     +GRID:/marirac/OCRBACKUP/backup00.ocr.276.1089911215     1443639413

rac2     2021/11/25 16:52:08     +GRID:/marirac/OCRBACKUP/backup01.ocr.275.1089564721     1443639413

rac2     2021/11/21 14:13:23     +GRID:/marirac/OCRBACKUP/backup02.ocr.277.1089209597     1443639413

rac2     2021/11/29 17:07:02     +GRID:/marirac/OCRBACKUP/day.ocr.272.1089911223     1443639413

rac1     2021/11/15 15:05:26     +GRID:/marirac/OCRBACKUP/week.ocr.273.1088694327     1443639413

rac2     2021/11/30 12:20:15     +FRA:/marirac/OCRBACKUP/backup_20211130_122015.ocr.257.1089980415     1443639413

Display ASM disk attributes while ASM is not running, using KFOD

$GRID_HOME/bin/kfod has many uses (kfod -help); one of them is printing disk attributes without connecting to an ASM instance. Moreover, you can display these attributes while ASM is not running, which is very useful when troubleshooting ASM startup issues.

Let’s display: disk size, header, path, diskgroup name, owner user, owner group, physical sector size, logical sector size.

[root@rac1~]# kfod op=disks status=true disks=all dscvgroup=true diskattr=all

Let’s see if ASM is running during that time:

[root@rac1~]# ps -ef|grep smon

root 3716 1     4 12:36 ?      00:00:01 /u01/app/19.3.0/grid/bin/osysmond.bin
root 5178 5083  0 12:37 pts/0  00:00:00 grep --color=auto smon

There is no asm_smon_+ASM1, which means ASM is down.

Print the content of multiple differently named files in Linux

When the number of files you are working with is large, you need automation as soon as possible.
This post describes the find -o option, which helps you work with differently named files when there are many of them.

For example, if you want to output the content of files physical_block_size and logical_block_size located under /sys/block/*/queue, run the following:

# find /sys/block/*/queue -name physical_block_size -o -name logical_block_size | while read f ; do echo "$f $(cat $f)" ; done

..
/sys/block/dm-0/queue/physical_block_size 4096
/sys/block/dm-0/queue/logical_block_size 512
/sys/block/dm-1/queue/physical_block_size 512
...

Where -o means OR.

Useful when working on ASM disks.
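Note that -o binds less tightly than the implicit AND between find expressions, so if you combine the name tests with additional criteria or actions, group them with escaped parentheses. A minimal sketch:

# find /sys/block/*/queue \( -name physical_block_size -o -name logical_block_size \) -type f | while read f ; do echo "$f $(cat $f)" ; done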

ora.evmd and ora.mdnsd fail to start when http_proxy is set to https://

Problem:

After setting http_proxy to an https:// value (export http_proxy=https://test) and then stopping and starting CRS, we got the following error:

CRS-2883: Resource 'ora.evmd' failed during Clusterware stack start.
CRS-4406: Oracle High Availability Services synchronous start failed.
CRS-41053: checking Oracle Grid Infrastructure for file permission issues
PRVG-2031 : Owner of file "/u01/app/19.3.0/grid/bin/CommonSetup.pm" did not match the expected value on node "rac1". [Expected = "root(0)" ; Found = "grid(3002)"]
....
PRVG-2031 : Owner of file "/u01/app/19.3.0/grid/lib/libnl19.a" did not match the expected value on node "rac1". [Expected = "root(0)" ; Found = "grid(3002)"]
CRS-4000: Command Start failed, or completed with errors.

Even after unsetting http_proxy, attempts to start and then stop CRS produced the following:

[root@rac1 ~]# crsctl start crs -wait
CRS-4640: Oracle High Availability Services is already active
CRS-4000: Command Start failed, or completed with errors.

[root@rac1 ~]# crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'rac1'
CRS-2679: Attempting to clean 'ora.mdnsd' on 'rac1'
CRS-2679: Attempting to clean 'ora.gpnpd' on 'rac1'
CRS-2679: Attempting to clean 'ora.evmd' on 'rac1'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'rac1'
CRS-2677: Stop of 'ora.drivers.acfs' on 'rac1' succeeded
CRS-2680: Clean of 'ora.evmd' on 'rac1' failed
CRS-2680: Clean of 'ora.gpnpd' on 'rac1' failed
CRS-2680: Clean of 'ora.mdnsd' on 'rac1' failed
CRS-2799: Failed to shut down resource 'ora.evmd' on 'rac1'
CRS-2799: Failed to shut down resource 'ora.gpnpd' on 'rac1'
CRS-2799: Failed to shut down resource 'ora.mdnsd' on 'rac1'
CRS-2795: Shutdown of Oracle High Availability Services-managed resources on 'rac1' has failed
CRS-4687: Shutdown command has completed with errors.
CRS-4000: Command Stop failed, or completed with errors

So the https:// entry in the http_proxy variable left CRS unable even to stop.

Solution:

The solution is simple: find the processes that were started during the previous attempt and kill them (be careful not to kill anything that was not started from the GI home):

[root@rac1 ~]# ps -ef|grep d.bin
root      1817     1  0 05:12 ?        00:00:01 /opt/flashgrid/bin/flashgrid_aio_srv
root      1821     1  0 05:12 ?        00:00:06 /opt/flashgrid/bin/flashgrid_target_srv
root      1824     1  0 05:12 ?        00:00:13 /opt/flashgrid/bin/flashgrid_initiator_srv
grid      1832     1  0 05:12 ?        00:00:04 /opt/flashgrid/bin/flashgrid_asm_srv
root      1845     1  0 05:12 ?        00:00:06 /opt/flashgrid/bin/flashgrid_cluster_srv
root      1879     1  0 05:12 ?        00:00:02 /opt/flashgrid/bin/flashgrid_iamback
root      1881     1  0 05:12 ?        00:00:00 /opt/flashgrid/bin/flashgrid_diskwatch
root      1884     1  0 05:12 ?        00:00:00 /opt/flashgrid/bin/flashgrid_reconstruct
root     10228 13775  0 05:43 pts/0    00:00:00 grep --color=auto d.bin
root     20305     1  2 05:16 ?        00:00:33 /u01/app/19.3.0/grid/bin/ohasd.bin reboot _ORA_BLOCKING_STACK_LOCALE=AMERICAN_AMERICA.US7ASCII
root     20631     1  0 05:16 ?        00:00:05 /u01/app/19.3.0/grid/bin/orarootagent.bin

[root@rac1 ~]# kill -9 20305 20631

[root@rac1 ~]# ps -ef|grep d.bin
root      1817     1  0 05:12 ?        00:00:01 /opt/flashgrid/bin/flashgrid_aio_srv
root      1821     1  0 05:12 ?        00:00:06 /opt/flashgrid/bin/flashgrid_target_srv
root      1824     1  0 05:12 ?        00:00:13 /opt/flashgrid/bin/flashgrid_initiator_srv
grid      1832     1  0 05:12 ?        00:00:04 /opt/flashgrid/bin/flashgrid_asm_srv
root      1845     1  0 05:12 ?        00:00:06 /opt/flashgrid/bin/flashgrid_cluster_srv
root      1879     1  0 05:12 ?        00:00:02 /opt/flashgrid/bin/flashgrid_iamback
root      1881     1  0 05:12 ?        00:00:00 /opt/flashgrid/bin/flashgrid_diskwatch
root      1884     1  0 05:12 ?        00:00:00 /opt/flashgrid/bin/flashgrid_reconstruct
root     10296 13775  0 05:43 pts/0    00:00:00 grep --color=auto d.bin

Make sure http_proxy is either not set or uses http:// instead of https:// as its value:

[root@rac1 ~]# unset http_proxy

[root@rac1 ~]# echo $http_proxy

Or

[root@rac1 ~]# export http_proxy=http://test

Try to start CRS now:

[root@rac1 ~]# crsctl start crs -wait
CRS-4123: Starting Oracle High Availability Services-managed resources
CRS-2672: Attempting to start 'ora.evmd' on 'rac1'
CRS-2672: Attempting to start 'ora.mdnsd' on 'rac1'
CRS-2676: Start of 'ora.mdnsd' on 'rac1' succeeded
CRS-2676: Start of 'ora.evmd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'rac1'
CRS-2676: Start of 'ora.gpnpd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.gipcd' on 'rac1'
CRS-2676: Start of 'ora.gipcd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.crf' on 'rac1'
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'rac1'
CRS-2676: Start of 'ora.cssdmonitor' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'rac1'
CRS-2672: Attempting to start 'ora.diskmon' on 'rac1'
CRS-2676: Start of 'ora.diskmon' on 'rac1' succeeded
CRS-2676: Start of 'ora.crf' on 'rac1' succeeded
CRS-2676: Start of 'ora.cssd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'rac1'
CRS-2672: Attempting to start 'ora.ctssd' on 'rac1'
CRS-2676: Start of 'ora.ctssd' on 'rac1' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'rac1'
CRS-2676: Start of 'ora.asm' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.storage' on 'rac1'
CRS-2676: Start of 'ora.storage' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'rac1'
CRS-2676: Start of 'ora.crsd' on 'rac1' succeeded
CRS-6017: Processing resource auto-start for servers: rac1
CRS-2673: Attempting to stop 'ora.LISTENER_SCAN1.lsnr' on 'rac2'
CRS-2672: Attempting to start 'ora.chad' on 'rac1'
CRS-2672: Attempting to start 'ora.ons' on 'rac1'
CRS-2677: Stop of 'ora.LISTENER_SCAN1.lsnr' on 'rac2' succeeded
CRS-2673: Attempting to stop 'ora.scan1.vip' on 'rac2'
CRS-2677: Stop of 'ora.scan1.vip' on 'rac2' succeeded
CRS-2672: Attempting to start 'ora.scan1.vip' on 'rac1'
CRS-2676: Start of 'ora.chad' on 'rac1' succeeded
CRS-2676: Start of 'ora.scan1.vip' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.LISTENER_SCAN1.lsnr' on 'rac1'
CRS-2676: Start of 'ora.LISTENER_SCAN1.lsnr' on 'rac1' succeeded
CRS-2676: Start of 'ora.ons' on 'rac1' succeeded
CRS-6016: Resource auto-start has completed for server rac1
CRS-6024: Completed start of Oracle Cluster Ready Services-managed resources
CRS-4123: Oracle High Availability Services has been started.

ORA-15041 during rebalance or adding a disk

Problem:

One of our customers had a disk offline for longer than disk_repair_time, which caused Oracle to drop the 1TB disk. The problems started after that: the drop triggered a rebalance operation, and because the diskgroup had less than 1TB of free space, the rebalance failed with ORA-15041. That rebalance left some of the disks 100% full, so free MB on some disks was 0.

Adding disks did not help, because when we checked free space on the existing disks we got the following output:

# su - grid
$ sqlplus / as sysasm
SQL> select disk_number "Disk #", free_mb
     from v$asm_disk
     where group_number = 1
     order by 2;

    Disk #    FREE_MB
---------- ----------
        13          0
         0          0
         4          0
         3          4
        11     132900
        ...

As mentioned, our rebalance kept failing.

Solution:

It was an AWS environment, and in the cloud we could easily increase disk sizes, so we increased all disks in the diskgroup by 200GB:

Resizing steps: https://dba010.com/2019/08/23/resize-asm-disks-in-aws-fg-enabled-cluster/

Then we triggered a rebalance:

# su - grid
$ sqlplus / as sysasm
SQL> ALTER DISKGROUP DATA REBALANCE POWER 13; 

And after several hours rebalance finished successfully.
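While a rebalance is running, its progress can be monitored from V$ASM_OPERATION, for example:

SQL> select operation, state, power, sofar, est_work, est_minutes
     from v$asm_operation;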

Please note that initially we increased the disks by only 1GB and the rebalance failed again; then we increased them by 200GB and the operation was successful. So you may need to increase the disk size more than once.

Useful note from Oracle Doc ID 473271.1