OPatchauto fails: CLSRSC-180: An error occurred while executing the command ‘/bin/rpm -qf /sbin/init’

Problem:

During applying ACFS patch on top of GI home, I received the following error:

Command failure output:
...
2024/03/08 19:31:03 CLSRSC-180: An error occurred while executing the command '/bin/rpm -qf /sbin/init'

After fixing the cause of failure Run opatchauto resume

]
OPATCHAUTO-68061: The orchestration engine failed.
OPATCHAUTO-68061: The orchestration engine failed with return code 1
OPATCHAUTO-68061: Check the log for more details.
OPatchAuto failed.

OPatchauto session completed at Fri Mar 8 19:31:04 2024
Time taken to complete the session 3 minutes, 38 seconds

opatchauto failed with error code 42

Troubleshooting:

I attempted to manually execute the command that failed, and it returned a helpful error message:

[root@rac1 tmp]# /bin/rpm -qf /sbin/init
error: rpmdb: BDB0113 Thread/process 5003/139974823143296 failed: BDB1507 Thread died in Berkeley DB library
error: db5 error(-30973) from dbenv->failchk: BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
error: cannot open Packages index using db5 - (-30973)
error: cannot open Packages database in /var/lib/rpm
...

Solution:

I have a solution for this type of error in another post. Let’s solve it again:

[root@rac1 tmp]# rpm  --rebuilddb

Rerun the failing command to make sure it was resolved:

[root@rac1 tmp]# /bin/rpm -qf /sbin/init
systemd-239-78.0.3.el8.x86_64

If you encountered an error during patching, you can resume opatchauto at this point:

[root@rac1 tmp]# /u01/app/19.3.0/grid/OPatch/opatchauto resume

--------------------------------Summary--------------------------------

Patching is completed successfully. Please find the summary as follows:

Host:rac1
CRS Home:/u01/app/19.3.0/grid
Version:19.0.0.0.0
Summary:

==Following patches were SUCCESSFULLY applied:

Patch: /tmp/36114443/36114443
Log: /u01/app/19.3.0/grid/cfgtoollogs/opatchauto/core/opatch/opatch2024-03-08_19-29-48PM_1.log

Good luck, as always!

CRS-2549: Resource ‘ora.asmgroup’ cannot be placed on ‘rac1’ as it is not a valid candidate as per the placement policy

Problem:

After failed JDK patching on the 1st node, we tried troubleshooting and saw that ASM was not able to start:

# su - grid
$ sqlplus / as sysasm
SQL> startup nomount;
ORA-32004: obsolete or deprecated parameter(s) specified for ASM instance
ORA-39511: Start of CRS resource for instance '223' failed with error:[CRS-2549: Resource 'ora.asmgroup' cannot be placed on 'rac1' as it is not a valid candidate as per the placement policy
CRS-0223: Resource 'ora.asm' has placement error.
clsr_start_resource:260 status:223
clsrapi_start_asm:start_asmdbs status:223

Reason:

Prepatch modified RESOURCE_USE_ENABLED=0 for rac1 node:

[grid@rac1 ~]$ crsctl stat server -f

NAME=rac1
MEMORY_SIZE=63465
CPU_COUNT=8
CPU_CLOCK_RATE=2499
CPU_HYPERTHREADING=1
CPU_EQUIVALENCY=1000
DEPLOYMENT=other
CONFIGURED_CSS_ROLE=hub
RESOURCE_USE_ENABLED=0
SERVER_LABEL=
PHYSICAL_HOSTNAME=
CSS_CRITICAL=no
CSS_CRITICAL_TOTAL=0
RESOURCE_TOTAL=0
SITE_NAME=stsfilive
STATE=ONLINE
ACTIVE_POOLS=Free
STATE_DETAILS=
ACTIVE_CSS_ROLE=hub

NAME=rac2
MEMORY_SIZE=63465
CPU_COUNT=8
CPU_CLOCK_RATE=2499
CPU_HYPERTHREADING=1
CPU_EQUIVALENCY=1000
DEPLOYMENT=other
CONFIGURED_CSS_ROLE=hub
RESOURCE_USE_ENABLED=1
….

Solution:

Connect to the failing node and run:

[root@rac1 ~]# crsctl set resource use 1

Start ASM.

Part 2: ora.storage fails to start, ORA-01017

Problem:

One of our customers changed ASM password file by mistake and regarding other actions, we are not sure. After node restart, they encountered ora.storage startup issue on the second node.

CRS-2672: Attempting to start 'ora.storage' on 'orcl02'
ORA-01017: invalid username/password; logon denied
CRS-5055: unable to connect to an ASM instance because no ASM instance is running in the cluster
CRS-2883: Resource 'ora.storage' failed during Clusterware stack start.
CRS-4406: Oracle High Availability Services synchronous start failed.
CRS-41053: checking Oracle Grid Infrastructure for file permission issues
CRS-4000: Command Start failed, or completed with errors.

I have followed my blog post to recover ASM passwordfile and add CRSUSER__ASM_001. The CRS started successfully on the first node but it still was not able to start on the second.

Reason:

When we checked password for CRSUSER__ASM_001 on both nodes, we got different results:

[grid@orcl01 ~]$ crsctl get credmaint -path ASM/Self/0b5330fe4bdf6f6ebffb09beab078d6e -credtype userpass -id 0 -attr passwd -local 
zSZDts1PQx8v7gRrdmH1EjIpSBsAt
[grid@orcl02 ~]$ crsctl get credmaint -path ASM/Self/0b5330fe4bdf6f6ebffb09beab078d6e -credtype userpass -id 0 -attr passwd -local 
rHgulYGfY17Uxbb9Tbd9VF3yr2Kvr

Which is not normal and they must be the same. This was the reason CRS was not able to start on the second node, because ASM passwordfile for CRSUSER__ASM_001 had value zSZDts1PQx8v7gRrdmH1EjIpSBsAt

Solution:

Verify and fix the credentials:

If you are not able to set up root ssh passwordless connectivity, you can run the following command as grid. Note in that case you will get “credfix: could not delete crs credentials for jxrucJl3”, this is because the command was not run as root and old credentials were not deleted. But new credentials are successfully created.

[grid@orcl01 ~]$ asmcmd --nocp credverify
credverify: More than one credential in password file, please run 'credfix' to fix the credentials.
​
[grid@orcl01 ~]$ asmcmd --nocp credfix
credfix: Credentials for JXRUCJL3 not in password file, trying next credential.
op=addcrscreds wrap=/tmp/creds0.xml
credfix: Creating new credentials, no valid credentials in OCR.
credfix: New user CRSUSER__ASM_004 created.
op=credimport wrap=/tmp/creds0.xml olr=true force=true
credfix: OLR for orcl01 has been fixed if credentials were created incorrectly.
credfix: Starting SSH session on node orcl02.
credfix: OLR for orcl02 has been fixed if credentials were created incorrectly. Exiting SSH session.
op=delcrscreds crs_user=jxrucJl3
ASMCMD-8202: internal error:
credfix: could not delete crs credentials for jxrucJl3

It is recommended to setup passwordless ssh connectivity for root user and then run credfix as root to have clean configuration without old entries:

[root@rac1 ~]# asmcmd --nocp credfix
..

ora.evmd and ora.mdnsd fails to start when http_proxy is set to https://

Problem:

After setting http_proxy to https string (export http_proxy=https://test) and then stopping and starting CRS got the following error:

CRS-2883: Resource 'ora.evmd' failed during Clusterware stack start.
CRS-4406: Oracle High Availability Services synchronous start failed.
CRS-41053: checking Oracle Grid Infrastructure for file permission issues
PRVG-2031 : Owner of file "/u01/app/19.3.0/grid/bin/CommonSetup.pm" did not match the expected value on node "rac1". [Expected = "root(0)" ; Found = "grid(3002)"]
....
PRVG-2031 : Owner of file "/u01/app/19.3.0/grid/lib/libnl19.a" did not match the expected value on node "rac1". [Expected = "root(0)" ; Found = "grid(3002)"]
CRS-4000: Command Start failed, or completed with errors.

Even after unsetting http_proxy and trying to stop CRS got the following:

[root@rac1 ~]# crsctl start crs -wait
CRS-4640: Oracle High Availability Services is already active
CRS-4000: Command Start failed, or completed with errors.

[root@rac1 ~]# crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'rac1'
CRS-2679: Attempting to clean 'ora.mdnsd' on 'rac1'
CRS-2679: Attempting to clean 'ora.gpnpd' on 'rac1'
CRS-2679: Attempting to clean 'ora.evmd' on 'rac1'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'rac1'
CRS-2677: Stop of 'ora.drivers.acfs' on 'rac1' succeeded
CRS-2680: Clean of 'ora.evmd' on 'rac1' failed
CRS-2680: Clean of 'ora.gpnpd' on 'rac1' failed
CRS-2680: Clean of 'ora.mdnsd' on 'rac1' failed
CRS-2799: Failed to shut down resource 'ora.evmd' on 'rac1'
CRS-2799: Failed to shut down resource 'ora.gpnpd' on 'rac1'
CRS-2799: Failed to shut down resource 'ora.mdnsd' on 'rac1'
CRS-2795: Shutdown of Oracle High Availability Services-managed resources on 'rac1' has failed
CRS-4687: Shutdown command has completed with errors.
CRS-4000: Command Stop failed, or completed with errors

So https entry in http_proxy variable caused my CRS even not being able to stop.

Solution:

The solution is simple, find processes that were started during previous attempt and kill them (be careful, not to kill anything that is not started from GI home):

[root@rac1 ~]# ps -ef|grep d.bin
root      1817     1  0 05:12 ?        00:00:01 /opt/flashgrid/bin/flashgrid_aio_srv
root      1821     1  0 05:12 ?        00:00:06 /opt/flashgrid/bin/flashgrid_target_srv
root      1824     1  0 05:12 ?        00:00:13 /opt/flashgrid/bin/flashgrid_initiator_srv
grid      1832     1  0 05:12 ?        00:00:04 /opt/flashgrid/bin/flashgrid_asm_srv
root      1845     1  0 05:12 ?        00:00:06 /opt/flashgrid/bin/flashgrid_cluster_srv
root      1879     1  0 05:12 ?        00:00:02 /opt/flashgrid/bin/flashgrid_iamback
root      1881     1  0 05:12 ?        00:00:00 /opt/flashgrid/bin/flashgrid_diskwatch
root      1884     1  0 05:12 ?        00:00:00 /opt/flashgrid/bin/flashgrid_reconstruct
root     10228 13775  0 05:43 pts/0    00:00:00 grep --color=auto d.bin
root     20305     1  2 05:16 ?        00:00:33 /u01/app/19.3.0/grid/bin/ohasd.bin reboot _ORA_BLOCKING_STACK_LOCALE=AMERICAN_AMERICA.US7ASCII
root     20631     1  0 05:16 ?        00:00:05 /u01/app/19.3.0/grid/bin/orarootagent.bin

[root@rac1 ~]# kill -9 20305 20631

[root@rac1 ~]# ps -ef|grep d.bin
root      1817     1  0 05:12 ?        00:00:01 /opt/flashgrid/bin/flashgrid_aio_srv
root      1821     1  0 05:12 ?        00:00:06 /opt/flashgrid/bin/flashgrid_target_srv
root      1824     1  0 05:12 ?        00:00:13 /opt/flashgrid/bin/flashgrid_initiator_srv
grid      1832     1  0 05:12 ?        00:00:04 /opt/flashgrid/bin/flashgrid_asm_srv
root      1845     1  0 05:12 ?        00:00:06 /opt/flashgrid/bin/flashgrid_cluster_srv
root      1879     1  0 05:12 ?        00:00:02 /opt/flashgrid/bin/flashgrid_iamback
root      1881     1  0 05:12 ?        00:00:00 /opt/flashgrid/bin/flashgrid_diskwatch
root      1884     1  0 05:12 ?        00:00:00 /opt/flashgrid/bin/flashgrid_reconstruct
root     10296 13775  0 05:43 pts/0    00:00:00 grep --color=auto d.bin

Make sure http_proxy is not set or instead of https there is http as a value:

[root@rac1 ~]# unset http_proxy

[root@rac1 ~]# echo $http_proxy

Or

[root@rac1 ~]# export http_proxy=http://test

Try to start CRS now:

[root@rac1 ~]# crsctl start crs -wait
CRS-4123: Starting Oracle High Availability Services-managed resources
CRS-2672: Attempting to start 'ora.evmd' on 'rac1'
CRS-2672: Attempting to start 'ora.mdnsd' on 'rac1'
CRS-2676: Start of 'ora.mdnsd' on 'rac1' succeeded
CRS-2676: Start of 'ora.evmd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'rac1'
CRS-2676: Start of 'ora.gpnpd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.gipcd' on 'rac1'
CRS-2676: Start of 'ora.gipcd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.crf' on 'rac1'
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'rac1'
CRS-2676: Start of 'ora.cssdmonitor' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'rac1'
CRS-2672: Attempting to start 'ora.diskmon' on 'rac1'
CRS-2676: Start of 'ora.diskmon' on 'rac1' succeeded
CRS-2676: Start of 'ora.crf' on 'rac1' succeeded
CRS-2676: Start of 'ora.cssd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'rac1'
CRS-2672: Attempting to start 'ora.ctssd' on 'rac1'
CRS-2676: Start of 'ora.ctssd' on 'rac1' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'rac1'
CRS-2676: Start of 'ora.asm' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.storage' on 'rac1'
CRS-2676: Start of 'ora.storage' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'rac1'
CRS-2676: Start of 'ora.crsd' on 'rac1' succeeded
CRS-6017: Processing resource auto-start for servers: rac1
CRS-2673: Attempting to stop 'ora.LISTENER_SCAN1.lsnr' on 'rac2'
CRS-2672: Attempting to start 'ora.chad' on 'rac1'
CRS-2672: Attempting to start 'ora.ons' on 'rac1'
CRS-2677: Stop of 'ora.LISTENER_SCAN1.lsnr' on 'rac2' succeeded
CRS-2673: Attempting to stop 'ora.scan1.vip' on 'rac2'
CRS-2677: Stop of 'ora.scan1.vip' on 'rac2' succeeded
CRS-2672: Attempting to start 'ora.scan1.vip' on 'rac1'
CRS-2676: Start of 'ora.chad' on 'rac1' succeeded
CRS-2676: Start of 'ora.scan1.vip' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.LISTENER_SCAN1.lsnr' on 'rac1'
CRS-2676: Start of 'ora.LISTENER_SCAN1.lsnr' on 'rac1' succeeded
CRS-2676: Start of 'ora.ons' on 'rac1' succeeded
CRS-6016: Resource auto-start has completed for server rac1
CRS-6024: Completed start of Oracle Cluster Ready Services-managed resources
CRS-4123: Oracle High Availability Services has been started.

DPI-1030: unable to get or set error structure for thread local storage

Problem:

flashgrid-cluster command was showing that diskgroups were not mounted, while diskgroup were successfully mounted on all nodes:

Reason:

GI was upgraded and Flashgrid was not able to reconnect ASM.

Solution:

Restart flashgrid_asm service, please note that it does not cause any downtime and is safe to run during business hours:

 # systemctl restart flashgrid_asm.service

INS-45511: Installer has detected that an Oracle Grid Infrastructure home is marked incorrectly as configured

Problem:

After deconfiguring Oracle Restart stack using:

[root@rac1 ~]# /u01/app/19.3.0/grid/root.sh -deconfig
Performing root user operation.

The following environment variables are set as:
    ORACLE_OWNER= grid
    ORACLE_HOME=  /u01/app/19.3.0/grid

Enter the full pathname of the local bin directory: [/usr/local/bin]:
The contents of "dbhome" have not changed. No need to overwrite.
The contents of "oraenv" have not changed. No need to overwrite.
The contents of "coraenv" have not changed. No need to overwrite.

Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root script.
Now product-specific root actions will be performed.
Using configuration parameter file: /u01/app/19.3.0/grid/crs/install/crsconfig_params
The log of current session can be found at:
  /u01/app/grid/crsdata/rac1/crsconfig/hadeconfig.log
2020/07/04 10:49:21 CLSRSC-332: CRS resources for listeners are still configured
2020/07/04 10:49:49 CLSRSC-337: Successfully deconfigured Oracle Restart stack

Tried to configure GI as clusterware stack and got the following error:

INS-45511: Installer has detected that an Oracle Grid Infrastructure home is marked incorrectly  as configured

Solution:

Remove CRS="true" accross GI home entry in /u01/app/oraInventory/ContentsXML/inventory.xml

Original:

<HOME NAME="OraGI19Home1" LOC="/u01/app/19.3.0/grid" TYPE="O" IDX="1" CRS="true"/>

After modification:

<HOME NAME="OraGI19Home1" LOC="/u01/app/19.3.0/grid" TYPE="O" IDX="1"/>

Retry configuration.

Common usage examples of cluvfy utility

1. Discovering all available network interfaces and verifying the connectivity between the nodes in the cluster through those network interfaces:

$ cluvfy comp nodecon -n all -verbose

2. Check the reachability of specified nodes from a source node:

$ cluvfy comp nodereach -n all -verbose

3. Check the integrity of Oracle Cluster Registry (OCR) on all the specified nodes:

$ cluvfy comp ocr -n all -verbose

4. Verifying the integrity of the Oracle high availability services daemon on all nodes in the cluster:

$ cluvfy comp ohasd -n all -verbose

5. Comparing nodes:

$ cluvfy comp peer -n all -verbose

6. Verifying the software configuration on all nodes in the cluster for the Oracle Clusterware home directory:

$ cluvfy comp software -n all -verbose

7. Check the integrity of the cluster on all the nodes in the node list:

$ cluvfy comp clu -n all -verbose

Can’t call method “uid” on an undefined value at …DBUtilServices.pm line 28.

Problem:

opatchauto on GI fails with the following error:

# /u01/app/18.0.0/grid/OPatch/opatchauto apply /0/grid/29301682  -oh /u01/app/18.0.0/grid
 Can't call method "uid" on an undefined value at /u01/app/18.0.0/grid/OPatch/auto/database/bin/module/DBUtilServices.pm line 28.

Reason:

  1. GI is not setup yet. You may have unzipped GI installation file, but have not run gridSetup.sh
  2. $GI_HOME/oraInst.loc is missing.

Solution:

  1. Setup GI by running gridSetup.sh
  2. Copy the oraInst.loc from the other node, if you don’t have another node then please see the file content bellow:
# cat /u01/app/18.0.0/grid/oraInst.loc
inst_group=oinstall
inventory_loc=/u01/app/oraInventory

PRVG-11069 : IP address “169.254.0.2” of network interface “idrac” on the node “primrac1” would conflict with HAIP usage

Problem:

Oracle 18c GI configuration precheck was failing with the following error:

Summary of node specific errors 

primrac2  - PRVG-11069 : IP address "169.254.0.2" of network interface "idrac" on the node "primrac2" would conflict with HAIP usage.  
- Cause:  One or more network interfaces have IP addresses in the range (169.254..), the range used by HAIP which can create routing conflicts.  
- Action:  Make sure there are no IP addresses in the range (169.254..) on any network interfaces. 

primrac1  - PRVG-11069 : IP address "169.254.0.2" of network interface "idrac" on the node "primrac1" would conflict with HAIP usage. 
- Cause:  One or more network interfaces have IP addresses in the range (169.254..), the range used by HAIP which can create routing conflicts.  
- Action:  Make sure there are no IP addresses in the range (169.254..) on any network interfaces.  

On each node additional network interface – named idrac was started with the ip address 169.254.0.2. I tried to set static ip address in /etc/sysconfig/network-scripts/ifcfg-idrac , also tried to bring the interface down – but after some time interface was starting up automatically and getting the same ip address.

Cluster nodes were DELL servers with Dell Remote Access Controller(iDRAC) Service Module installed. For more information about this module installation/deinstallation… can be found here https://topics-cdn.dell.com/pdf/idrac-service-module-v32_users-guide_en-us.pdf

Servers were configured by system administrator and was not clear why this module was there, we are not using iDRAC module and the only option that we had was to remove/uninstall that module. (configuring module should also be possible to avoid such situation, but we keep our servers as clean as possible without having unsed services)

Solution:

Uninstalled iDRAC module (also expained in the above pdf):

# rpm -e dcism 

After uninstalling it idrac interface did not started anymore, so we could continue GI configuration.

PRVF-6402 : Core file name pattern is not same on all the nodes

Problem:

Oracle 18c GI configuration prerequisite checks failed with the following error:

PRVF-6402 : Core file name pattern is not same on all the nodes. Found core filename pattern "core" on nodes "primrac1". Found core filename pattern "core.%p" on nodes "primrac2".  
- Cause:  The core file name pattern is not same on all the nodes.  
- Action:  Ensure that the mechanism for core file naming works consistently on all the nodes. Typically for Linux, the elements to look into are the contents of two files /proc/sys/kernel/core_pattern or /proc/sys/kernel/core_uses_pid. Refer OS vendor documentation for platforms AIX, HP-UX, and Solaris.

Comparing parameter values on both nodes:

[root@primrac1 ~]# cat /proc/sys/kernel/core_uses_pid
0
[root@primrac2 ~]# cat /proc/sys/kernel/core_uses_pid
1 

[root@primrac1 ~]# sysctl -a|grep core_uses_pid
kernel.core_uses_pid = 0

[root@primrac2 ~]# sysctl -a|grep core_uses_pid
kernel.core_uses_pid = 1

Strange fact was that this parameter was not defined explicitly in sysctl.conf file, but still had different default values:

[root@primrac1 ~]# cat /etc/sysctl.conf |grep core_uses_pid
[root@primrac2 ~]# cat /etc/sysctl.conf |grep core_uses_pid 

Solution:

I’ve set parameter to 1 explicitly in sysctl.conf on both nodes:

[root@primrac1 ~]# cat /etc/sysctl.conf |grep core_uses_pid
kernel.core_uses_pid=1 

[root@primrac2 ~]# cat /etc/sysctl.conf |grep core_uses_pid
kernel.core_uses_pid=1

[root@primrac1 ~]# sysctl -p 
[root@primrac2 ~]# sysctl -p

[root@primrac1 ~]# sysctl -a|grep core_uses_pid 
kernel.core_uses_pid = 1

[root@primrac2 ~]# sysctl -a|grep core_uses_pid 
kernel.core_uses_pid = 1

Pressed Check Again button and GI configuration succeeded.