ora.evmd and ora.mdnsd fail to start when http_proxy is set to https://

Problem:

After setting http_proxy to an https:// value (export http_proxy=https://test) and then stopping and starting CRS, I got the following error:

CRS-2883: Resource 'ora.evmd' failed during Clusterware stack start.
CRS-4406: Oracle High Availability Services synchronous start failed.
CRS-41053: checking Oracle Grid Infrastructure for file permission issues
PRVG-2031 : Owner of file "/u01/app/19.3.0/grid/bin/CommonSetup.pm" did not match the expected value on node "rac1". [Expected = "root(0)" ; Found = "grid(3002)"]
....
PRVG-2031 : Owner of file "/u01/app/19.3.0/grid/lib/libnl19.a" did not match the expected value on node "rac1". [Expected = "root(0)" ; Found = "grid(3002)"]
CRS-4000: Command Start failed, or completed with errors.

Even after unsetting http_proxy, CRS could neither be started nor stopped:

[root@rac1 ~]# crsctl start crs -wait
CRS-4640: Oracle High Availability Services is already active
CRS-4000: Command Start failed, or completed with errors.

[root@rac1 ~]# crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'rac1'
CRS-2679: Attempting to clean 'ora.mdnsd' on 'rac1'
CRS-2679: Attempting to clean 'ora.gpnpd' on 'rac1'
CRS-2679: Attempting to clean 'ora.evmd' on 'rac1'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'rac1'
CRS-2677: Stop of 'ora.drivers.acfs' on 'rac1' succeeded
CRS-2680: Clean of 'ora.evmd' on 'rac1' failed
CRS-2680: Clean of 'ora.gpnpd' on 'rac1' failed
CRS-2680: Clean of 'ora.mdnsd' on 'rac1' failed
CRS-2799: Failed to shut down resource 'ora.evmd' on 'rac1'
CRS-2799: Failed to shut down resource 'ora.gpnpd' on 'rac1'
CRS-2799: Failed to shut down resource 'ora.mdnsd' on 'rac1'
CRS-2795: Shutdown of Oracle High Availability Services-managed resources on 'rac1' has failed
CRS-4687: Shutdown command has completed with errors.
CRS-4000: Command Stop failed, or completed with errors

So the https:// value in the http_proxy variable left CRS in a state where it could not even be stopped.

Solution:

The solution is simple: find the processes that were started during the previous attempt and kill them (be careful not to kill anything that was not started from the GI home):

[root@rac1 ~]# ps -ef|grep d.bin
root      1817     1  0 05:12 ?        00:00:01 /opt/flashgrid/bin/flashgrid_aio_srv
root      1821     1  0 05:12 ?        00:00:06 /opt/flashgrid/bin/flashgrid_target_srv
root      1824     1  0 05:12 ?        00:00:13 /opt/flashgrid/bin/flashgrid_initiator_srv
grid      1832     1  0 05:12 ?        00:00:04 /opt/flashgrid/bin/flashgrid_asm_srv
root      1845     1  0 05:12 ?        00:00:06 /opt/flashgrid/bin/flashgrid_cluster_srv
root      1879     1  0 05:12 ?        00:00:02 /opt/flashgrid/bin/flashgrid_iamback
root      1881     1  0 05:12 ?        00:00:00 /opt/flashgrid/bin/flashgrid_diskwatch
root      1884     1  0 05:12 ?        00:00:00 /opt/flashgrid/bin/flashgrid_reconstruct
root     10228 13775  0 05:43 pts/0    00:00:00 grep --color=auto d.bin
root     20305     1  2 05:16 ?        00:00:33 /u01/app/19.3.0/grid/bin/ohasd.bin reboot _ORA_BLOCKING_STACK_LOCALE=AMERICAN_AMERICA.US7ASCII
root     20631     1  0 05:16 ?        00:00:05 /u01/app/19.3.0/grid/bin/orarootagent.bin

[root@rac1 ~]# kill -9 20305 20631

[root@rac1 ~]# ps -ef|grep d.bin
root      1817     1  0 05:12 ?        00:00:01 /opt/flashgrid/bin/flashgrid_aio_srv
root      1821     1  0 05:12 ?        00:00:06 /opt/flashgrid/bin/flashgrid_target_srv
root      1824     1  0 05:12 ?        00:00:13 /opt/flashgrid/bin/flashgrid_initiator_srv
grid      1832     1  0 05:12 ?        00:00:04 /opt/flashgrid/bin/flashgrid_asm_srv
root      1845     1  0 05:12 ?        00:00:06 /opt/flashgrid/bin/flashgrid_cluster_srv
root      1879     1  0 05:12 ?        00:00:02 /opt/flashgrid/bin/flashgrid_iamback
root      1881     1  0 05:12 ?        00:00:00 /opt/flashgrid/bin/flashgrid_diskwatch
root      1884     1  0 05:12 ?        00:00:00 /opt/flashgrid/bin/flashgrid_reconstruct
root     10296 13775  0 05:43 pts/0    00:00:00 grep --color=auto d.bin
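
The d.bin filter above also matches the FlashGrid daemons, which must not be killed. A more targeted filter on the GI home path (adjust the path for your environment) lists only processes started from the GI home:

[root@rac1 ~]# ps -ef | grep '/u01/app/19.3.0/grid/' | grep -v grep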

Make sure http_proxy is either unset, or uses http:// instead of https:// as its value:

[root@rac1 ~]# unset http_proxy

[root@rac1 ~]# echo $http_proxy

Or

[root@rac1 ~]# export http_proxy=http://test
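
To double-check that no proxy variables remain set in the shell that will start CRS, a quick check like the following can be used (it also catches https_proxy and the uppercase variants):

[root@rac1 ~]# env | grep -i proxy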

Try to start CRS now:

[root@rac1 ~]# crsctl start crs -wait
CRS-4123: Starting Oracle High Availability Services-managed resources
CRS-2672: Attempting to start 'ora.evmd' on 'rac1'
CRS-2672: Attempting to start 'ora.mdnsd' on 'rac1'
CRS-2676: Start of 'ora.mdnsd' on 'rac1' succeeded
CRS-2676: Start of 'ora.evmd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'rac1'
CRS-2676: Start of 'ora.gpnpd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.gipcd' on 'rac1'
CRS-2676: Start of 'ora.gipcd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.crf' on 'rac1'
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'rac1'
CRS-2676: Start of 'ora.cssdmonitor' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'rac1'
CRS-2672: Attempting to start 'ora.diskmon' on 'rac1'
CRS-2676: Start of 'ora.diskmon' on 'rac1' succeeded
CRS-2676: Start of 'ora.crf' on 'rac1' succeeded
CRS-2676: Start of 'ora.cssd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'rac1'
CRS-2672: Attempting to start 'ora.ctssd' on 'rac1'
CRS-2676: Start of 'ora.ctssd' on 'rac1' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'rac1'
CRS-2676: Start of 'ora.asm' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.storage' on 'rac1'
CRS-2676: Start of 'ora.storage' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'rac1'
CRS-2676: Start of 'ora.crsd' on 'rac1' succeeded
CRS-6017: Processing resource auto-start for servers: rac1
CRS-2673: Attempting to stop 'ora.LISTENER_SCAN1.lsnr' on 'rac2'
CRS-2672: Attempting to start 'ora.chad' on 'rac1'
CRS-2672: Attempting to start 'ora.ons' on 'rac1'
CRS-2677: Stop of 'ora.LISTENER_SCAN1.lsnr' on 'rac2' succeeded
CRS-2673: Attempting to stop 'ora.scan1.vip' on 'rac2'
CRS-2677: Stop of 'ora.scan1.vip' on 'rac2' succeeded
CRS-2672: Attempting to start 'ora.scan1.vip' on 'rac1'
CRS-2676: Start of 'ora.chad' on 'rac1' succeeded
CRS-2676: Start of 'ora.scan1.vip' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.LISTENER_SCAN1.lsnr' on 'rac1'
CRS-2676: Start of 'ora.LISTENER_SCAN1.lsnr' on 'rac1' succeeded
CRS-2676: Start of 'ora.ons' on 'rac1' succeeded
CRS-6016: Resource auto-start has completed for server rac1
CRS-6024: Completed start of Oracle Cluster Ready Services-managed resources
CRS-4123: Oracle High Availability Services has been started.

DPI-1030: unable to get or set error structure for thread local storage

Problem:

The flashgrid-cluster command was showing that diskgroups were not mounted, while the diskgroups were in fact successfully mounted on all nodes.

Reason:

GI was upgraded and FlashGrid was not able to reconnect to ASM.

Solution:

Restart the flashgrid_asm service. Note that this does not cause any downtime and is safe to run during business hours:

 # systemctl restart flashgrid_asm.service
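
After the restart, re-check the cluster state with the same flashgrid-cluster command from the problem description and, if needed, the service state with systemctl:

 # systemctl status flashgrid_asm.service
 # flashgrid-cluster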

INS-45511: Installer has detected that an Oracle Grid Infrastructure home is marked incorrectly as configured

Problem:

After deconfiguring the Oracle Restart stack using:

[root@rac1 ~]# /u01/app/19.3.0/grid/root.sh -deconfig
Performing root user operation.

The following environment variables are set as:
    ORACLE_OWNER= grid
    ORACLE_HOME=  /u01/app/19.3.0/grid

Enter the full pathname of the local bin directory: [/usr/local/bin]:
The contents of "dbhome" have not changed. No need to overwrite.
The contents of "oraenv" have not changed. No need to overwrite.
The contents of "coraenv" have not changed. No need to overwrite.

Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root script.
Now product-specific root actions will be performed.
Using configuration parameter file: /u01/app/19.3.0/grid/crs/install/crsconfig_params
The log of current session can be found at:
  /u01/app/grid/crsdata/rac1/crsconfig/hadeconfig.log
2020/07/04 10:49:21 CLSRSC-332: CRS resources for listeners are still configured
2020/07/04 10:49:49 CLSRSC-337: Successfully deconfigured Oracle Restart stack

I then tried to configure GI as a clusterware stack and got the following error:

INS-45511: Installer has detected that an Oracle Grid Infrastructure home is marked incorrectly as configured

Solution:

Remove CRS="true" from the GI home entry in /u01/app/oraInventory/ContentsXML/inventory.xml:

Original:

<HOME NAME="OraGI19Home1" LOC="/u01/app/19.3.0/grid" TYPE="O" IDX="1" CRS="true"/>

After modification:

<HOME NAME="OraGI19Home1" LOC="/u01/app/19.3.0/grid" TYPE="O" IDX="1"/>
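
If you prefer to make this edit from the command line, a minimal sketch with GNU sed (it keeps a .bak backup copy; the inventory path matches the one above) could be:

# sed -i.bak 's/ CRS="true"//' /u01/app/oraInventory/ContentsXML/inventory.xml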

Retry configuration.

Common usage examples of cluvfy utility

1. Discover all available network interfaces and verify the connectivity between the nodes in the cluster through those network interfaces:

$ cluvfy comp nodecon -n all -verbose

2. Check the reachability of specified nodes from a source node:

$ cluvfy comp nodereach -n all -verbose

3. Check the integrity of Oracle Cluster Registry (OCR) on all the specified nodes:

$ cluvfy comp ocr -n all -verbose

4. Verify the integrity of the Oracle high availability services daemon on all nodes in the cluster:

$ cluvfy comp ohasd -n all -verbose

5. Compare nodes:

$ cluvfy comp peer -n all -verbose

6. Verify the software configuration on all nodes in the cluster for the Oracle Clusterware home directory:

$ cluvfy comp software -n all -verbose

7. Check the integrity of the cluster on all the nodes in the node list:

$ cluvfy comp clu -n all -verbose
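
8. Beyond the individual component checks, a full pre-installation stage check is often run before configuring GI (the node names below are just the ones used elsewhere in this post; -fixup asks cluvfy to generate fixup scripts where possible):

$ cluvfy stage -pre crsinst -n rac1,rac2 -fixup -verbose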

Can’t call method “uid” on an undefined value at …DBUtilServices.pm line 28.

Problem:

opatchauto on GI fails with the following error:

# /u01/app/18.0.0/grid/OPatch/opatchauto apply /0/grid/29301682  -oh /u01/app/18.0.0/grid
 Can't call method "uid" on an undefined value at /u01/app/18.0.0/grid/OPatch/auto/database/bin/module/DBUtilServices.pm line 28.

Reason:

  1. GI is not set up yet. You may have unzipped the GI installation file but have not run gridSetup.sh.
  2. $GI_HOME/oraInst.loc is missing.

Solution:

  1. Set up GI by running gridSetup.sh.
  2. Copy oraInst.loc from the other node; if you don’t have another node, the file content is shown below:
# cat /u01/app/18.0.0/grid/oraInst.loc
inst_group=oinstall
inventory_loc=/u01/app/oraInventory
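
If you create the file manually this way, make sure it is owned by the grid software owner and the install group (the names below match the ones used elsewhere in this post; adjust them for your environment):

# chown grid:oinstall /u01/app/18.0.0/grid/oraInst.loc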

PRVG-11069 : IP address “169.254.0.2” of network interface “idrac” on the node “primrac1” would conflict with HAIP usage

Problem:

Oracle 18c GI configuration precheck was failing with the following error:

Summary of node specific errors 

primrac2  - PRVG-11069 : IP address "169.254.0.2" of network interface "idrac" on the node "primrac2" would conflict with HAIP usage.  
- Cause:  One or more network interfaces have IP addresses in the range (169.254.*.*), the range used by HAIP which can create routing conflicts.
- Action:  Make sure there are no IP addresses in the range (169.254.*.*) on any network interfaces.

primrac1  - PRVG-11069 : IP address "169.254.0.2" of network interface "idrac" on the node "primrac1" would conflict with HAIP usage. 
- Cause:  One or more network interfaces have IP addresses in the range (169.254.*.*), the range used by HAIP which can create routing conflicts.
- Action:  Make sure there are no IP addresses in the range (169.254.*.*) on any network interfaces.

On each node an additional network interface named idrac was coming up with the IP address 169.254.0.2. I tried setting a static IP address in /etc/sysconfig/network-scripts/ifcfg-idrac and also tried bringing the interface down, but after some time the interface would start up again automatically and get the same IP address.

The cluster nodes were Dell servers with the Dell Remote Access Controller (iDRAC) Service Module installed. More information about installing and uninstalling this module can be found here: https://topics-cdn.dell.com/pdf/idrac-service-module-v32_users-guide_en-us.pdf

The servers had been configured by the system administrator and it was not clear why this module was there. We were not using the iDRAC module, so the only option we had was to remove/uninstall it. (Configuring the module to avoid this conflict should also be possible, but we keep our servers as clean as possible, without unused services.)

Solution:

Uninstalled the iDRAC module (also explained in the PDF above):

# rpm -e dcism 
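
To confirm the removal, query the package again; rpm should report that it is not installed:

# rpm -q dcism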

After uninstalling it, the idrac interface did not start anymore, so we could continue with the GI configuration.

PRVF-6402 : Core file name pattern is not same on all the nodes

Problem:

Oracle 18c GI configuration prerequisite checks failed with the following error:

PRVF-6402 : Core file name pattern is not same on all the nodes. Found core filename pattern "core" on nodes "primrac1". Found core filename pattern "core.%p" on nodes "primrac2".  
- Cause:  The core file name pattern is not same on all the nodes.  
- Action:  Ensure that the mechanism for core file naming works consistently on all the nodes. Typically for Linux, the elements to look into are the contents of two files /proc/sys/kernel/core_pattern or /proc/sys/kernel/core_uses_pid. Refer OS vendor documentation for platforms AIX, HP-UX, and Solaris.

Comparing parameter values on both nodes:

[root@primrac1 ~]# cat /proc/sys/kernel/core_uses_pid
0
[root@primrac2 ~]# cat /proc/sys/kernel/core_uses_pid
1 

[root@primrac1 ~]# sysctl -a|grep core_uses_pid
kernel.core_uses_pid = 0

[root@primrac2 ~]# sysctl -a|grep core_uses_pid
kernel.core_uses_pid = 1

The strange part was that this parameter was not defined explicitly in the sysctl.conf file on either node, yet the default values differed:

[root@primrac1 ~]# cat /etc/sysctl.conf |grep core_uses_pid
[root@primrac2 ~]# cat /etc/sysctl.conf |grep core_uses_pid 

Solution:

I set the parameter to 1 explicitly in sysctl.conf on both nodes:

[root@primrac1 ~]# cat /etc/sysctl.conf |grep core_uses_pid
kernel.core_uses_pid=1 

[root@primrac2 ~]# cat /etc/sysctl.conf |grep core_uses_pid
kernel.core_uses_pid=1

[root@primrac1 ~]# sysctl -p 
[root@primrac2 ~]# sysctl -p

[root@primrac1 ~]# sysctl -a|grep core_uses_pid 
kernel.core_uses_pid = 1

[root@primrac2 ~]# sysctl -a|grep core_uses_pid 
kernel.core_uses_pid = 1

Pressed the Check Again button, the prerequisite check passed, and the GI configuration succeeded.