ora.storage fails, Error 4 querying length of attr ASM_DISCOVERY_ADDRESS, ORA-01017

Problem:

CRS on the 1st node is able to start, but not on the 2nd node.

CRS on the 2nd node hangs and later fails:

CRS-2672: Attempting to start 'ora.storage' on 'rac2'
ORA-01017: invalid username/password; logon denied
CRS-5055: unable to connect to an ASM instance because no ASM instance is running in the cluster

during that time CRS alert.log shows:

2022-03-15 20:15:23.722 [ORAROOTAGENT(63477)]CRS-5019: All OCR locations are on ASM disk groups [GRID], and none of these disk groups are mounted. Details are at "(:CLSN00140:)" in "/u01/app/grid/diag/crs/rac2/crs/trace/ohasd_orarootagent_root.trc".

ohasd_orarootagent_root.trc shows:

2022-03-15 20:23:35.108 : USRTHRD:1769867008: [     INFO] {0:5:3} [ora.storage] 9788 Error 4 querying length of attr ASM_DISCOVERY_ADDRESS

2022-03-15 20:23:35.110 : USRTHRD:1769867008: [     INFO] {0:5:3} [ora.storage] 9788 Error 4 querying length of attr ASM_STATIC_DISCOVERY_ADDRESS

2022-03-15 20:23:35.136 : USRTHRD:1769867008: [     INFO] {0:5:3} [ora.storage] 9506 Error 4 opening dom root in 0x7fa3100013a0

Reason:

Either the password file is corrupted or it does not exist. In our case, GRID diskgroup was created after clearing disk headers and forgot to copy ASM password file.

Solution:

  1. If you have ASM password file backup, then you can simply place it to the asm diskgroup:
$ asmcmd pwcopy --asm /tmp/asm_passwordfile +GRID/orapwASM -f

and stop/start CRS.

2. If you don’t have password file backup, you need to create a new one and add necessary users into it:

[grid@rac1 ~]$ asmcmd pwcreate --asm +GRID/orapwasm -f
Enter password: **********

Check existing users:

[grid@rac1 ~]$ asmcmd lspwusr
Username sysdba sysoper sysasm
SYS TRUE TRUE FALSE

Add necessary users and grant permissions:

$ asmcmd orapwusr --grant sysasm SYS
$ asmcmd orapwusr --add ASMSNMP
Enter password: *********
$ asmcmd orapwusr --grant sysdba ASMSNMP

Check permissions again:

$ asmcmd lspwusr
Username sysdba sysoper sysasm
     SYS   TRUE    TRUE   TRUE
 ASMSNMP   TRUE   FALSE  FALSE

Find out the user name and password for CRSD to connect, GI uses internal user CRSUSER__ASM_001 with an internally generated password to access ASM during startup:

Find the string SYSTEM.ASM.CREDENTIALS.USERS.CRSUSER__ASM_001 in the following output and save. ORATEXT value:

# ocrdump -stdout | less
...
[SYSTEM.ASM.CREDENTIALS.USERS.CRSUSER__ASM_001]
ORATEXT : d68aec9585136fa8ff8f79f483e4ae64:grid
SECURITY : {USER_PERMISSION : PROCR_ALL_ACCESS, GROUP_PERMISSION : PROCR_READ, OTHER_PERMISSION : PROCR_NONE, USER_NAME : grid, GROUP_NAME : oinstall}

Query password for GUID-user. GUID will be different in your case. Retrieve value from your output:

# crsctl get credmaint -path /ASM/Self/d68aec9585136fa8ff8f79f483e4ae64 -credtype userpass -id 0 -attr passwd -local
mB28wSM4AVFAVEYamUIvrMjEo2Nfa

Add this user to ASM password file:

$ asmcmd orapwusr --add CRSUSER__ASM_001
>>>>> provide <password> you retrieved earlier

Add necessary credentials to this user:

$ asmcmd orapwusr --grant sysdba CRSUSER__ASM_001
$ asmcmd orapwusr --grant sysasm CRSUSER__ASM_001

Check the list again:

$ asmcmd lspwusr
        Username sysdba sysoper sysasm
             SYS   TRUE    TRUE   TRUE
         ASMSNMP   TRUE   FALSE  FALSE
CRSUSER__ASM_001   TRUE   FALSE   TRUE

Stop/Start CRS on the remaining node.

Oracle 12c relink resets oracle group to oinstall, while 19c sets asmadmin

Problem:

In a mixed environment when you have 19c GI, but several versions of Oracle RDBMS home (19c, 12c), initially during 12c database restore you may encounter the following error:

ORA-15001: diskgroup "DATA" does not exist or is not mounted
ORA-15040: diskgroup is incomplete

There are several reasons for the above error and one of them is when the $ORACLE_HOME/bin/oracle binary under rdbms home does not have group asmadmin. After identifying it, you try to change permissions to oracle:asmadmin and but after a while, you may notice that the group is reset to oinstall. It can happen after patching or after running relink manually. So let’s see test case:

[oracle@rac1 lib]$ ll /u01/app/oracle/product/12.1.0/dbhome_1/bin/oracle
-rwsr-s--x 1 oracle asmadmin 323613256 Feb 19 17:00 /u01/app/oracle/product/12.1.0/dbhome_1/bin/oracle

Run relink:

[oracle@rac1 lib]$ make -f ins_rdbms.mk ioracle

Check permissions again:

[oracle@rac1 lib]$ ll /u01/app/oracle/product/12.1.0/dbhome_1/bin/oracle
-rwsr-s--x 1 oracle oinstall 323613256 Feb 19 17:03 /u01/app/oracle/product/12.1.0/dbhome_1/bin/oracle

If you do the same for 19c home, group is still asmadmin.

Reason:

ins_rdbms.mk script is different in 19c and 12c, comparing ioracle section:

12c:

ioracle: preinstall $(ORACLE)
        -$(NOT_EXIST) $(ORACLE_HOME)/bin/oracle ||\
           mv -f $(ORACLE_HOME)/bin/oracle $(ORACLE_HOME)/bin/oracleO
        -mv $(ORACLE_HOME)/rdbms/lib/oracle $(ORACLE_HOME)/bin/oracle
        -chmod 6751 $(ORACLE_HOME)/bin/oracle

19c:

ioracle: preinstall $(ORACLE)
        -$(RMF) $(ORACLE_HOME)/bin/oracle
        -mv $(ORACLE_HOME)/rdbms/lib/oracle $(ORACLE_HOME)/bin/oracle
        -chmod 6751 $(ORACLE_HOME)/bin/oracle

        -(if [ ! -f $(ORACLE_HOME)/bin/crsd.bin ]; then \
            getcrshome="$(ORACLE_HOME)/srvm/admin/getcrshome" ; \
            if [ -f "$$getcrshome" ]; then \
                crshome="`$$getcrshome`"; \
                if [ -n "$$crshome" ]; then \
                    if [ $$crshome != $(ORACLE_HOME) ]; then \
                        oracle="$(ORACLE_HOME)/bin/oracle"; \
                        $$crshome/bin/setasmgidwrap oracle_binary_path=$$oracle; \
                    fi \
                fi \
            fi \
        fi\
        );

As you see ioracle section in 12c does not contain the last IF statement which sets oracle binary group to asmadmin (using setasmgidwrap)

Solution:

Unfortunately, I cannot recommend adding that if statement in 12c ins_rdbms.mk although I did it in my test environment and works as expected.
After running make -f ins_rdbms.mk ioracle for 12c the group is set to asmadmin.

It is better to ask Oracle about this change and do only after you get an approval from them.

The only non-harmful recommendation from my side would be to have permission change script and run after every patch or relink:

# chown oracle:asmadmin  /u01/app/oracle/product/12.1.0/dbhome_1/bin/oracle
​# chmod 6751  /u01/app/oracle/product/12.1.0/dbhome_1/bin/oracle
# ll   /u01/app/oracle/product/12.1.0/dbhome_1/bin/oracle

If you still decide to add IF statement as I did, please make sure there is TAB instead of spaces before IF clause. Otherwise, you will get the following error and you can easily understand which line should be corrected.

[oracle@rac1 lib]$ make -f ins_rdbms.mk ioracle
ins_rdbms.mk:120: *** missing separator (did you mean TAB instead of 8 spaces?).  Stop.

Useful tools for analyzing OS diagnostic data

Apply GI patch 19.14 (33509923) on GI 19.3

Current environment:

$ $ORACLE_HOME/OPatch/opatch lspatches
29585399;OCW RELEASE UPDATE 19.3.0.0.0 (29585399)
29517247;ACFS RELEASE UPDATE 19.3.0.0.0 (29517247)
29517242;Database Release Update : 19.3.0.0.190416 (29517242)
29401763;TOMCAT RELEASE UPDATE 19.0.0.0.0 (29401763)

1. You must use the OPatch utility version 12.2.0.1.28 or later to apply this patch

Download p6880880_190000_Linux-x86-64.zip – OPatch 12.2.0.1.28 for DB 19.0.0.0.0 (Nov 2021), or later.

Replace existing opatch with the new one:

# export ORACLE_HOME=/u01/app/19c/grid
# rm -rf $ORACLE_HOME/OPatch
# su - grid 
$ export ORACLE_HOME=/u01/app/19c/grid
$ unzip -o /u01/swtmp/p6880880_190000_Linux-x86-64.zip -d $ORACLE_HOME

$ $ORACLE_HOME/OPatch/opatch version
OPatch Version: 12.2.0.1.28

2. Unzip the patch 33509923 as the Grid home owner:

# su - grid
$ cd /u01/swtmp
$ unzip p33509923_190000_Linux-x86-64.zip

3. Determine whether any currently installed one-off patches conflict with this patch 33509923 as follows:

# su - grid

$ $ORACLE_HOME/OPatch/opatch prereq CheckConflictAgainstOHWithDetail -phBaseDir /u01/swtmp/33509923/33515361

Oracle Home       : /u01/app/19c/grid
Central Inventory : /u01/app/oraInventory
   from           : /u01/app/19c/grid/oraInst.loc
OPatch version    : 12.2.0.1.28
OUI version       : 12.2.0.7.0
Log file location : /u01/app/19c/grid/cfgtoollogs/opatch/opatch2022-01-31_11-58-55AM_1.log

Invoking prereq "checkconflictagainstohwithdetail"
Prereq "checkConflictAgainstOHWithDetail" passed.
OPatch succeeded.

$ $ORACLE_HOME/OPatch/opatch prereq CheckConflictAgainstOHWithDetail -phBaseDir /u01/swtmp/33509923/33529556

$ $ORACLE_HOME/OPatch/opatch prereq CheckConflictAgainstOHWithDetail -phBaseDir /u01/swtmp/33509923/33534448

$ $ORACLE_HOME/OPatch/opatch prereq CheckConflictAgainstOHWithDetail -phBaseDir /u01/swtmp/33509923/33239955

$ $ORACLE_HOME/OPatch/opatch prereq CheckConflictAgainstOHWithDetail -phBaseDir /u01/swtmp/33509923/33575402

In case conflict is detected, stop the patch installation and contact Oracle Support Services, otherwise continue.

4. Stop database instances running on that server.

5. Apply patch using opatchauto:

# cd /u01/swtmp/33509923/
# export ORACLE_HOME=/u01/app/19c/grid/
# export PATH=$ORACLE_HOME/bin:$PATH
# $ORACLE_HOME/OPatch/opatchauto apply
...
==Following patches were SUCCESSFULLY applied:

Patch: /u01/swtmp/33509923/33239955
Log: /u01/app/19c/grid/cfgtoollogs/opatchauto/core/opatch/opatch2022-02-03_11-33-44AM_1.log

Patch: /u01/swtmp/33509923/33515361
Log: /u01/app/19c/grid/cfgtoollogs/opatchauto/core/opatch/opatch2022-02-03_11-33-44AM_1.log

Patch: /u01/swtmp/33509923/33529556
Log: /u01/app/19c/grid/cfgtoollogs/opatchauto/core/opatch/opatch2022-02-03_11-33-44AM_1.log

Patch: /u01/swtmp/33509923/33534448
Log: /u01/app/19c/grid/cfgtoollogs/opatchauto/core/opatch/opatch2022-02-03_11-33-44AM_1.log

Patch: /u01/swtmp/33509923/33575402
Log: /u01/app/19c/grid/cfgtoollogs/opatchauto/core/opatch/opatch2022-02-03_11-33-44AM_1.log

Check current version:

$ $ORACLE_HOME/OPatch/opatch lspatches
33575402;DBWLM RELEASE UPDATE 19.0.0.0.0 (33575402)
33534448;ACFS RELEASE UPDATE 19.14.0.0.0 (33534448)
33529556;OCW RELEASE UPDATE 19.14.0.0.0 (33529556)
33515361;Database Release Update : 19.14.0.0.220118 (33515361)
33239955;TOMCAT RELEASE UPDATE 19.0.0.0.0 (33239955)

6. After finishing patch it is recommened to backup clusterware components:

Azure: Get email when VM instance state changes

To set up email notification in Azure environment when VM changes the state, you should do the following:

1. In the Azure portal, select Service Health

2. From the left side panel choose Resource health -> click Create Resource Health alert rule

Fill in the necessary fields:

In the Actions section, you have to indicate action groups. As long as we have not created any before, let’s click Add action groups -> Create action group

And fill in the following fields:

Click Review + create -> Create.

After creating the group you will see that the new group was chosen automatically. Fill in fields under Alert rule details:

Click Create alert rule.

3. Go to the Resource Group (in my case marirac2) where you have created Action Group.

Resource groups -> marirac2 -> from the left-side panel choose Alerts -> Action groups -> choose action group name in my case mariactgrp -> on the Notifications section choose Email/SMS message/Push/Voice -> on the right-side panel click Email checkbox -> enter email address who will be responsible for receiving and handling these alerts -> click OK -> enter desirable name under Notifications section -> click Save changes.

4. Test alert by stopping and starting the VM (assuming it is a test environment)

After changing the state of the VM you will receive the following notification.

Please note notification should have been sent as soon as VM changes the state but email can come 2 – 3 min later.

AWS: Get email when EC2 instance state changes

To setup email notification when your EC2 instances change state, you should configure the following:

  • Amazon SNS topic – where you define recepients
  • Create an EventBridge event – where you define rule on instances that will trigger the notification

1. Create an Amazon Simple Notification Service (Amazon SNS) topic:

Go to Amazon SNS -> Create topic ->

Screen 1

-> Click Create topic

Go to Subscriptions -> Create subscription

Screen 2

A subscription confirmation email is sent to the address you have entered. Choose Confirm subscription in the email (Screen 3):

Screen 3

2. Create an EventBridge event

Open the EventBridge console -> Create rule

Screen 4

 To monitor all instances, in the Event pattern enter the following (Screen 4)

{
"source": ["aws.ec2"],
"detail-type": ["EC2 Instance State-change Notification"]
}

If you are planing to monitor particular instances then you need to indicate their instance IDs using the following way:

{
"source": ["aws.ec2"],
"detail-type": ["EC2 Instance State-change Notification"],
"detail": {
"instance-id": ["i-0897b9d936bda3bda", "i-0c76e28b0ec1a099c"]
}
}

Please note, this pattern can also be generated by choosing “Pre-defined pattern by service” radio button and choosing appropriate services. But for simplicity the above pattern can be entered.

In the Select targets section, choose previously created SNS topic.


Screen 5

In the Input transformer (Screen 5) paste the following strings accordingly:

Input Path

{"instance-id":"$.detail.instance-id", "state":"$.detail.state", "time":"$.time", "region":"$.region", "account":"$.account"}

Input Template

"At <time>, the status of your EC2 instance <instance-id> on account <account> in the AWS Region <region> has changed to <state>."

Click Create.

3. Test the configuration

Start/stop target instanes and check your email:

The home is not clean. This home cannot be used since there was a failed OPatch execution in this home. Use a different home to proceed

Problem:

While installing Oracle software using applyRU:

$ /u01/app/oracle/product/19.3.0/dbhome_1/runInstaller -silent -applyRU /u01/swtmp/32895426/32904851/ -responseFile /opt/rsp/db_19.3_EE_swonlyinstall.rsp -ignorePrereqFailure -waitforcompletion

Got an error:

ERROR: The home is not clean. This home cannot be used since there was a failed OPatch execution in this home. Use a different home to proceed.

Solution:

If a previous attempt to install Oracle software using applyRU or applyOneOffs option failed, the home becomes unusable. Before retrying the process, you need to clean up the home first:

$ rm -rf /u01/app/oracle/product/19.3.0/dbhome_1/*

$ unzip -o /u01/swtmp/LINUX.X64_193000_db_home.zip -d /u01/app/oracle/product/19.3.0/dbhome_1

As a best practice I always have a new Opatch, so need to unzip it again:

$ unzip -o /u01/swtmp/p6880880_190000_Linux-x86-64.zip -d /u01/app/oracle/product/19.3.0/dbhome_1

Rerun the command:

$ /u01/app/oracle/product/19.3.0/dbhome_1/runInstaller -silent -applyRU /u01/swtmp/32895426/32904851/ -responseFile /opt/rsp/db_19.3_EE_swonlyinstall.rsp -ignorePrereqFailure -waitforcompletion
Preparing the home to patch...
Applying the patch /u01/swtmp/32895426/32904851/...

After GI upgrade flashgrid-cluster shows wrong diskgroup status

Problem:

I have upgraded Oracle GI from 12c to 19c. ASM is up and diskgroups are mounted, but flashgrid-cluster still shows that diskgroups are not mounted on any node:

[root@rac1 ~]# flashgrid-cluster
FlashGrid 21.8.67.66809 #707efdc10a212421dbd3737ef8e693ae14c15964
License: Active, Marketplace
Licensee: Flashgrid Inc.
Support plan: 24x7
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
FlashGrid running: OK
Clocks check: OK
Configuration check: OK
Network check: OK

Querying nodes: rac1, rac2, racq ...

Cluster Name: mariGIMR
Cluster status: Warning
-------------------------------------------------------------
Node  Status   ASM_Node  Storage_Node  Quorum_Node  Failgroup
-------------------------------------------------------------
rac1  Warning  Yes       Yes           No           RAC1
rac2  Warning  Yes       Yes           No           RAC2
racq  Warning  No        No            Yes          RACQ
-------------------------------------------------------------
-----------------------------------------------------------------------------
GroupName  Status  Mounted  Type  TotalMiB  FreeMiB  OfflineDisks  LostDisks  Resync  ReadLocal  Vote
-----------------------------------------------------------------------------
GRID     Warning  No  N/A   N/A   N/A    N/A    N/A  N/A     N/A        N/A
-----------------------------------------------------------------------------

Solution:

After upgrade, you should restart flashgrid_asm service on all nodes, it does not require downtime:

[root@rac1 ~]# systemctl restart flashgrid_asm
[root@rac2 ~]# systemctl restart flashgrid_asm

Check the status again:

...
-----------------------------------------------------------------------------
GroupName  Status  Mounted   Type    TotalMiB  FreeMiB  OfflineDisks  LostDisks  Resync  ReadLocal  Vote
-----------------------------------------------------------------------------
GRID    Good    AllNodes  NORMAL  204800    154336   0   0  No Enabled    3/3
-----------------------------------------------------------------------------

Can fs.aio-max-nr be changed on a live system?

Short answer:

Yes, it can be changed without impacting the business operations.
Oracle source: Doc ID 2269728.1
RHEL source: https://access.redhat.com/solutions/437043

========================Additional information========================

Additional explanation, why you may need to change that value.

Problem:

Alert log shows:

ORA-27090: Unable to reserve kernel resources for asynchronous disk I/O
Linux-x86_64 Error: 2: No such file or directory
Additional information: 3
Additional information: 128
Additional information: 139817340277512

Reason:

The problem is caused by a lower than recommended value for aio-max-nr. The current value is 1048576, but recommended is 3145728.

Solution:

Set fs.aio-max-nr to 3145728 in /etc/sysctl.conf.

# grep fs.aio-max-nr /etc/sysctl.conf
fs.aio-max-nr = 3145728

Make a new value effective:

# sysctl -p  /etc/sysctl.conf

Verify the new value:

# sysctl -a|grep fs.aio-max-nr
fs.aio-max-nr = 3145728

THE SPECIFIED LOGFILE BLOCKSIZE (512) IS LESS THAN THE MEDIA SECTOR SIZE ON DISK (4096)

Problem:

Alert log contains:

WARNING! THE SPECIFIED LOGFILE BLOCKSIZE (512) IS LESS THAN THE MEDIA SECTOR SIZE ON DISK (4096). LOGFILE WRITES ARE NOT SECTOR ALIGNED. THIS MAY LEAD TO DEGRADED PERFORMANCE DUE TO READ-MODIFY-WRITE OPERATION

Reason/Reproduce:

The redo log groups have the default block size 512. So as the error mentions, the size of the redo log group member is less than the sector size attribute for the diskgroup.

This is reproducable when disk physical block size is 4096. In AWS environment, for example, r5b instances has this size for attached disks (this may change in future, need to be checked while reading this post).

1. Make sure attached disks has physical size 4096

[root@rac1 ~]# find /sys/block/*/queue -name physical_block_size | while read f ; do echo "$f $(cat $f)" ; done

/sys/block/nvme0n1/queue/physical_block_size 4096
/sys/block/nvme1n1/queue/physical_block_size 4096
/sys/block/nvme2n1/queue/physical_block_size 4096
/sys/block/nvme3n1/queue/physical_block_size 4096
/sys/block/nvme4n1/queue/physical_block_size 4096
/sys/block/nvme5n1/queue/physical_block_size 4096

2. Create diskgroup with the sector size 4096

CREATE DISKGROUP "DG4K" NORMAL REDUNDANCY                                                
 FAILGROUP "RAC1" DISK '/dev/flashgrid/rac1.xvdbd' NAME "RAC1$XVDBD" SIZE 10240M 
 FAILGROUP "RAC2" DISK '/dev/flashgrid/rac2.xvdbd' NAME "RAC2$XVDBD" SIZE 10240M 
 QUORUM FAILGROUP "RACQ" DISK '/dev/flashgrid/racq.xvdbz' NAME "RACQ$XVDBZ"      
 ATTRIBUTE 'au_size' = '4M', 'compatible.asm' = '19.0.0.0', 'compatible.rdbms' = '19.0.0.0', 'sector_size'='4096';

Make sure sector size is 4096 for this diskgroup:

SQL> select name,sector_size from v$asm_diskgroup;

NAME			       SECTOR_SIZE
------------------------------ -----------
DATA				       512
GRID				       512
DG4K				      4096
FRA				       512

Check log group block size:

SQL> select group#, blocksize from v$log;

    GROUP#  BLOCKSIZE
---------- ----------
	 1	  512
	 2	  512
	 3	  512
	 4	  512

Add a new logfile member on DG4K:

SQL> alter database add logfile member '+DG4K' to group 4;

Database altered.

Check alert log:

2021-12-09T15:32:04.533204+00:00
WARNING! THE SPECIFIED LOGFILE BLOCKSIZE (512) IS LESS THAN THE MEDIA SECTOR SIZE ON DISK (4096). LOGFILE WRITES ARE NOT SECTOR ALIGNED. THIS MAY LEAD TO DEGRADED PERFORMANCE DUE TO READ-MODIFY-WRITE OPERATIONS.
2021-12-09T15:32:21.080534+00:00
Completed: alter database add logfile member '+DG4K' to group 4

Solution:

You should create a new redo log group (not member, but group) with block size=4K:

SQL> alter database add logfile group 5 ('+DG4K') size 512M blocksize 4096;