‘udev’ rules continuously being reloaded resulted in ASM nvme disks going offline

Environment:

Linux Server release 7.2
kernel-3.10.0-514.26.2.el7.x86_64

Problem:

When Oracle processes are opening the device for writing and then closing it, this synthesizes a change event. And udev rules having  ACTION=="add|change" gets reloaded. This behavior causes ASM nvme disks to go offline:

Thu Jul 09 16:33:16 2020
WARNING: Disk 18 (rac1$disk1) in group 2 mode 0x7f is now being offlined

Fri Jul 10 10:04:34 2020
WARNING: Disk 19 (rac1$disk5) in group 2 mode 0x7f is now being offlined

Fri Jul 10 13:45:45 2020
WARNING: Disk 15 (rac1$disk8) in group 2 mode 0x7f is now being offlined

Solution:

To suppress the false positive change events disable the inotify watch for devices used for Oracle ASM using following steps:

  1.  Create /etc/udev/rules.d/96-nvme-nowatch.rules file with just one line in it:
ACTION=="add|change", KERNEL=="nvme*", OPTIONS:="nowatch"

2. After creating the file please run the following to activate the rule:

# udevadm control --reload-rules
# udevadm trigger --type=devices --action=change

The above command will reload the complete udev configuration and will trigger all the udev rules. On a busy production system this could disrupt ongoing operations, applications running on the server. Please use the above command during a scheduled maintenance window only.

Source: https://access.redhat.com/solutions/1465913 + our experience with customers.

Retrieving the Public Key for Key Pair on Linux or Mac OS

ssh-keygen command can be used on Linux or Mac OS to retrieve the public key from the private SSH key:

$ ssh-keygen -y -f MyKeyPair.pem 
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQChauLwkBK/vIiFY/t7uY6lzxESqZkZNvCAA3L42OH2fWzKptqGF+N32zjmLLSPFpYjoEHoHpi5e7yypTmiljtHcKUTwJTs3xclQrApCQvR+LneOi/5P5WaYl61G76osJesXiunLTa+RVr3LDR96LjPcql7JDnuh1RFhDqZ87nDfcGmXGV8iG7w3bk3R/2LuzzMYTgEVdv91S1OF1roH1baPXSV8MaYbOKhMUqV61+eP6/F5ZhT5Gk0BKX1KnQ3/gbgMqjMWRMZzYUeVjUbC52lYwrrBTQX5tHphAJtOTNJ/CpyuEuZ7ED+XYhX9Q1DNOZ47K51xbg5lsnyOBYSUqHz

-y – This option will read a private OpenSSH format file and print its public key.
-f – Specifies the filename of the key file.

Boot in single user mode and rescue your RHEL7

Problem:

One of our customer incorrectly changed fstab file and rebooted the OS. As a result, VM was not able to start. Fortunately, cloud where this VM was located supported serial console.

Solution:

We booted in single user mode through serial console and reverted the changes back. To boot in single user mode and update necessary file, do as follows:

Connect to the serial console and while OS is booting in a grub menu press e to edit the selected kernel:

Find line that starts with linux16 ( if you don’t see it press arrow down ), go to the end of this line and type rd.break.

Press ctrl+x.

Wait for a while and system will enter into single user mode:

During this time /sysroot is mounted in read only mode, you need to remount it in read write:

switch_root:/# mount -o remount,rw /sysroot
switch_root:/# chroot /sysroot

You can revert any changes back by updating any file, in our case we updated fstab:

sh-4.2# vim /etc/fstab

You are a real hero, because you rescued your system!

PRVG-11069 : IP address “169.254.0.2” of network interface “idrac” on the node “primrac1” would conflict with HAIP usage

Problem:

Oracle 18c GI configuration precheck was failing with the following error:

Summary of node specific errors 

primrac2  - PRVG-11069 : IP address "169.254.0.2" of network interface "idrac" on the node "primrac2" would conflict with HAIP usage.  
- Cause:  One or more network interfaces have IP addresses in the range (169.254..), the range used by HAIP which can create routing conflicts.  
- Action:  Make sure there are no IP addresses in the range (169.254..) on any network interfaces. 

primrac1  - PRVG-11069 : IP address "169.254.0.2" of network interface "idrac" on the node "primrac1" would conflict with HAIP usage. 
- Cause:  One or more network interfaces have IP addresses in the range (169.254..), the range used by HAIP which can create routing conflicts.  
- Action:  Make sure there are no IP addresses in the range (169.254..) on any network interfaces.  

On each node additional network interface – named idrac was started with the ip address 169.254.0.2. I tried to set static ip address in /etc/sysconfig/network-scripts/ifcfg-idrac , also tried to bring the interface down – but after some time interface was starting up automatically and getting the same ip address.

Cluster nodes were DELL servers with Dell Remote Access Controller(iDRAC) Service Module installed. For more information about this module installation/deinstallation… can be found here https://topics-cdn.dell.com/pdf/idrac-service-module-v32_users-guide_en-us.pdf

Servers were configured by system administrator and was not clear why this module was there, we are not using iDRAC module and the only option that we had was to remove/uninstall that module. (configuring module should also be possible to avoid such situation, but we keep our servers as clean as possible without having unsed services)

Solution:

Uninstalled iDRAC module (also expained in the above pdf):

# rpm -e dcism 

After uninstalling it idrac interface did not started anymore, so we could continue GI configuration.

sshd: /etc/ssh/sshd_config: Permission denied

Problem:

sshd and chronyd services on the database server were in a failed state and not able to start because of the permission problem on their configuration files. Permissions on these files were correct and services should have been able to start, so there was something else… let’s dig into the details.

# systemctl status sshd
 â sshd.service - OpenSSH server daemon
    Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled; vendor preset: enabled)
    Active: activating (auto-restart) (Result: exit-code) since Tue 2019-07-09 12:21:49 UTC; 32s ago
      Docs: man:sshd(8)
            man:sshd_config(5)
   Process: 124026 ExecStart=/usr/sbin/sshd -D $OPTIONS (code=exited, status=1/FAILURE)
Main PID: 124026 (code=exited, status=1/FAILURE)
Jul 09 12:21:49 node03 systemd[1]: Failed to start OpenSSH server daemon.
Jul 09 12:21:49 node03 systemd[1]: Unit sshd.service entered failed state.
Jul 09 12:21:49 node03 systemd[1]: sshd.service failed

`journalctl -xe` shows:

-- Unit sshd.service has begun starting up.
Jul 09 12:26:03 node03 sshd[129121]: /etc/ssh/sshd_config: Permission denied
Jul 09 12:26:03 node03 systemd[1]: sshd.service: main process exited, code=exited, status=1/FAILURE
Jul 09 12:26:03 node03 systemd[1]: Failed to start OpenSSH server daemon.
-- Subject: Unit sshd.service has failed

The same problem was happening with chronyd service. It was claiming about /etc/chrony.conf file. Incorrect time on database servers can cause node evictions.

Reason:

If permissions on these files are correct, we can think about SELinux, let’s check:

# getenforce 
Enforcing

Solution:

Disable SELinux and reboot the server:

# vim /etc/selinux/config
SELINUX=disabled

# reboot

Summary:

I consider SELinux as a non-desirable service on the database servers. But I appreciate opinion of my colleages/friends and I want to share it with you.

SELinux can be enabled with the correct config in RHEL 4,5,6 – “Starting with Oracle Database 11g Release 2 (11.2), the Security Enhanced Linux (SELinux) feature is supported for Oracle Linux 4, Oracle Linux 5, Oracle Linux 6, Red Hat Enterprise Linux 4, Red Hat Enterprise Linux 5, and Red Hat Enterprise Linux 6.
https://docs.oracle.com/cd/E11882_01/install.112/e47689/pre_install.htm#LADBI1092

SELinux is a good security tool and usually I only disable it as a last resort or if the software doesn’t support it.

Postfix: connect to gmail-smtp-in.l.google.com [2607:f8b0:400c:c0b::1a]:25: Network is unreachable

Problem:

I am not able to receive email alerts from database server. Because message transfer agent is trying to connect to the Google SMTP via IPv6, which fails.

# tail /var/log/maillog

Jun 12 15:35:10 rac1 postfix/smtp[19725]:connect to 
gmail-smtp-in.l.google.com [2607:f8b0:400c:c0b::1a]:25: 
Network is unreachable

Solution:

Configure Postfix not to use IPv6 by editing /etc/postfix/main.cf with the following:

[root@rac1 ~]# cat /etc/postfix/main.cf | grep inet_protocols
inet_protocols = ipv4

Restart Postfix and check the status:

[root@rac1 ~]# systemctl restart postfix

[root@rac1 ~]# systemctl status  postfix
 ● postfix.service - Postfix Mail Transport Agent
    Loaded: loaded (/usr/lib/systemd/system/postfix.service; enabled; vendor preset: disabled)
    Active: active (running) since Thu 2019-06-13 10:20:48 UTC; 52s ago
   Process: 17431 ExecStop=/usr/sbin/postfix stop (code=exited, status=0/SUCCESS)
   Process: 17449 ExecStart=/usr/sbin/postfix start (code=exited, status=0/SUCCESS)
   Process: 17445 ExecStartPre=/usr/libexec/postfix/chroot-update (code=exited, status=0/SUCCESS)
   Process: 17442 ExecStartPre=/usr/libexec/postfix/aliasesdb (code=exited, status=0/SUCCESS)
  Main PID: 17520 (master)
    Memory: 3.0M
    CGroup: /system.slice/postfix.service
            ├─17520 /usr/libexec/postfix/master -w
            ├─17521 pickup -l -t unix -u
            └─17522 qmgr -l -t unix -u
 Jun 13 10:20:48 rac1.example.com systemd[1]: Starting Postfix Mail Transport Agent…
 Jun 13 10:20:48 rac1.example.com postfix/postfix-script[17518]: starting the Postfix mail system
 Jun 13 10:20:48 rac1.example.com postfix/master[17520]: daemon started -- version 2.10.1, configuration /etc/postfix
 Jun 13 10:20:48 rac1.example.com systemd[1]: Started Postfix Mail Transport Agent

pam_systemd(sshd:session): Failed to create session: Failed to activate service ‘org.freedesktop.login1’: timed out

Problem:

1. Slow ssh connections
2. System seems slow when trying to su to another user

/var/log/secure contains the following errors:

pam_systemd(sshd:session): Failed to create session: Failed to activate
service 'org.freedesktop.login1': timed out

Solution:

1. Restart systemd-logind service

# systemctl restart systemd-logind

2. Restart server

# reboot 

Note that the mentioned solutions are considered as temporary solutions (Frankly, I’ve never seen this error after restart. The problem happened with our two customers, who changed sshd_config file and did “something” after that, so the problem was caused by humman error in my all cases), for more information about this problem please see article at redhat site
https://access.redhat.com/discussions/3536621 .

How to display certain line from a text file in Linux?

Sometimes script fails and error mentiones the line number, where there is a mistake. One option is to open a file and go to the mentioned line.

I am going to show you how to use SED to print only the certain line from the script file:

# sed -n ’80p’ /u01/app/18.3.0/grid/crs/script/mount-dbfs.sh

“My 80th line”

Where 80 is the line number and p “print[s] the current pattern space”

Check if a port on a remote system is reachable (without telnet)

Problem:

I wanted to check if our customer had an issue with network access. I asked him to run telnet command, but because of their security reasons, telnet rpm was not installed at all.

Solution:

One of the solution would be to run nc 🙂 but of course nc was not installed. 🙂

The following command helped me in that situation:

cat < /dev/tcp/192.168.3.4/3260

The port is open if there is no output, but if you receive the following, then the port is closed:

-bash: connect: Connection refused
-bash: /dev/tcp/192.168.3.4/3261: Connection refused

 

Install blobfuse on RHEL7 fails

Problem: 

Not able to install blobfuse on RHEL7.

Error:

Error: Package: blobfuse-1.0.2-1.x86_64 (packages-microsoft-com-prod)
Requires: libgnutls.so.26(GNUTLS_1_4)(64bit)
Error: Package: blobfuse-1.0.2-1.x86_64 (packages-microsoft-com-prod)
Requires: libgnutls.so.26(GNUTLS_2_10)(64bit)
Error: Package: blobfuse-1.0.2-1.x86_64 (packages-microsoft-com-prod)
Requires: libgnutls.so.26()(64bit)
You could try using –skip-broken to work around the problem
You could try running: rpm -Va –nofiles –nodigest

Solution:

sudo yum remove packages-microsoft-prod-1.0-1.el7.noarch

sudo yum clean all

sudo rm -rf /var/cache/yum

sudo rpm -Uvh https://packages.microsoft.com/config/rhel/7/packages-microsoft-prod.rpm

sudo yum install blobfuse fuse -y