Add a new virtual machine in VirtualBox and install Oracle Linux

Intro:

This blog post was written by my student at Business and Technology University, Ivane Metreveli. Thank you, Ivane, for participating in this project.

1. First, download the Oracle Linux ISO file from edelivery.oracle.com or from oracle.com. Then run VirtualBox, click the New button and create a new virtual machine:

2. Set the name of the virtual machine, select the operating system as follows and click Next

3. Select an appropriate amount of RAM (3 GB is recommended for normal processing) and click Next to move to the next step

4. Select the Create a virtual hard disk now option and click the Create button

5. Select VDI (VirtualBox Disk Image)

6. Select Dynamically allocated if you don’t want to take up the hard disk space immediately

7. Select the file size (the disk size for the VM) and the location, then click the Create button to finish creating the virtual machine

8. The virtual machine is now created. Before we start the VM, we need to load the ISO file into the machine: click Settings and follow along

9. Navigate to Storage and click the CD icon; then, on the right side of the window under Attributes, click the CD icon and add the Oracle Linux .iso file.

10. After that, you can click the Start button

11. Select the .iso file, or click the folder icon, open the folder where the .iso file is located, select it and click Start

12. The next step is the OS installation. Select Install Oracle Linux 7.6 and press Enter to start the installation:

13. Select the system language and click Continue

14. Select Installation Destination

15. Select the disk where you want to install the system. You can select the virtual disk that you created in the previous steps or add a new one. Select the disk and click the Done button.

16. Now all parameters are ready. Click Begin Installation and wait for the process to finish

17. Set the root password and click Done

18. The installation is in progress; you need to wait a bit longer

19. The installation process is finished. Click the Reboot button and move to the next step:

20. The installation is finished now, and you can start working with Oracle Linux:
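The VM creation steps (1 to 11) can also be scripted with VirtualBox's VBoxManage command-line tool. The following is only a minimal sketch under assumptions, not part of the original walkthrough: the function names, the 20 GB disk size and the SATA controller name are illustrative choices, and the OS installer steps (12 to 20) still have to be done in the console window.

```shell
# Dry-run wrapper: with DRY_RUN=1 the command is printed instead of executed.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# create_ol_vm NAME ISO_PATH DISK_PATH
# Assumptions (not from the article): 20480 MB disk, controller named "SATA".
create_ol_vm() {
  vm=$1 iso=$2 disk=$3
  # Steps 1-3: new VM of type Oracle Linux (64-bit) with 3 GB of RAM
  run VBoxManage createvm --name "$vm" --ostype Oracle_64 --register &&
  run VBoxManage modifyvm "$vm" --memory 3072 &&
  # Steps 4-7: dynamically allocated VDI disk (the "Standard" variant)
  run VBoxManage createmedium disk --filename "$disk" --size 20480 --variant Standard &&
  run VBoxManage storagectl "$vm" --name SATA --add sata &&
  run VBoxManage storageattach "$vm" --storagectl SATA --port 0 --device 0 --type hdd --medium "$disk" &&
  # Steps 8-9: attach the installation ISO as a virtual DVD
  run VBoxManage storageattach "$vm" --storagectl SATA --port 1 --device 0 --type dvddrive --medium "$iso" &&
  # Step 10: start the VM
  run VBoxManage startvm "$vm"
}
```

Setting DRY_RUN=1 before calling create_ol_vm prints the commands without touching VirtualBox, which is a cheap way to review them first.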

rpm -qa gets thread died in Berkeley DB library

Problem:

When I checked whether the flashgrid-clan package was installed, I got this error:

error: rpmdb: BDB0113 Thread/process 2884/140438918064192 failed: BDB1507 Thread died in Berkeley DB library
error: db5 error(-30973) from dbenv->failchk: BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
error: cannot open Packages index using db5 - (-30973)
error: cannot open Packages database in /var/lib/rpm
error: rpmdb: BDB0113 Thread/process 2884/140438918064192 failed: BDB1507 Thread died in Berkeley DB library
error: db5 error(-30973) from dbenv->failchk: BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
error: cannot open Packages database in /var/lib/rpm
package flashgrid-clan is not installed

Reason:

If you see rpmdb errors during package management (rpm, yum), it means that the RPM database is corrupted.

Solution:

# mkdir /var/lib/rpm/backup
# cp -a /var/lib/rpm/__db* /var/lib/rpm/backup/
# rm -f /var/lib/rpm/__db.[0-9][0-9]*
# rpm --quiet -qa
# rpm --rebuilddb
# yum clean all
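The backup and remove steps above can be combined by moving the `__db.*` environment files into a timestamped directory, so repeated repairs do not overwrite an earlier backup. This is a minimal sketch under assumptions: the function name stash_rpmdb_env and the backup-YYYYmmddHHMMSS directory naming are my own, not from the original procedure.

```shell
# stash_rpmdb_env [RPM_DIR]
# Moves the Berkeley DB environment files (__db.001, __db.002, ...) out of
# the RPM database directory into a timestamped backup directory and prints
# the backup path. The Packages file and the indexes are left untouched.
stash_rpmdb_env() {
  rpmdir=${1:-/var/lib/rpm}
  backup="$rpmdir/backup-$(date +%Y%m%d%H%M%S)"
  mkdir -p "$backup" || return 1
  for f in "$rpmdir"/__db.[0-9][0-9]*; do
    [ -e "$f" ] || continue        # glob matched nothing
    mv "$f" "$backup/" || return 1
  done
  echo "$backup"
}
```

After running it as root, the remaining steps are the same as above: rpm --quiet -qa, rpm --rebuilddb and yum clean all.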

Find the 5 biggest files in Linux

I have used this command many times, but the gaps between uses are so long that I almost always forget the syntax.

So here it is (note that du -a counts directories as well as files, so directories show up in the output too):

# du -a / | sort -n -r | head -n 5

51190272	/
37705424	/root
33040524	/root/apache-tomcat-7.0.53
32802516	/root/apache-tomcat-7.0.53/logs
32802440	/root/apache-tomcat-7.0.53/logs/catalina.out
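If only regular files are wanted, without the directory totals seen in the output above, find can do the filtering. This is a sketch that assumes GNU find (for -printf); the function name biggest_files is hypothetical. Sizes are printed in bytes, unlike du, which reports blocks.

```shell
# biggest_files [DIR] [COUNT]
# Lists the COUNT largest regular files under DIR as "size<TAB>path".
# -xdev keeps find on one filesystem; -type f skips directories.
biggest_files() {
  dir=${1:-/}
  count=${2:-5}
  find "$dir" -xdev -type f -printf '%s\t%p\n' 2>/dev/null |
    sort -n -r |
    head -n "$count"
}
```

For example, biggest_files /root 5 would list the five largest files below /root.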

Create shortcuts for frequently accessed servers

Life is too short, that’s why it’s mandatory to use shortcuts… Instead of typing frequently used ssh client options such as port, user, hostname, identity file and so on, you can save that information in the ssh client config file and then access it with a defined alias.

  • The system-wide config file is /etc/ssh/ssh_config
  • The user-specific config file is ~/.ssh/config (that is, $HOME/.ssh/config)

Instead of connecting to the server every time using the following command:

# ssh root@95.80.12.10 -i ~/.ssh/my_id_rsa

Save the following entries in ~/.ssh/config file:

# vim ~/.ssh/config
Host my_db
HostName 95.80.12.10
IdentityFile ~/.ssh/my_id_rsa
User root

Then connect to the server simply with:

# ssh my_db
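When there are many servers, adding the entries by hand gets repetitive. The following is a small sketch of a helper that appends a Host block only if the alias is not defined yet; the function name add_ssh_host and its argument order are hypothetical, not something ssh provides.

```shell
# add_ssh_host ALIAS HOSTNAME USER IDENTITY_FILE [CONFIG_FILE]
# Appends a Host block to the ssh client config unless ALIAS already exists.
add_ssh_host() {
  alias_name=$1 host=$2 user=$3 key=$4
  cfg=${5:-$HOME/.ssh/config}
  # The client config should not be writable by others, hence chmod 600.
  mkdir -p "$(dirname "$cfg")" && touch "$cfg" && chmod 600 "$cfg" || return 1
  if grep -q "^Host[[:space:]][[:space:]]*$alias_name[[:space:]]*\$" "$cfg"; then
    echo "Host $alias_name already defined in $cfg" >&2
    return 0
  fi
  printf 'Host %s\n    HostName %s\n    User %s\n    IdentityFile %s\n' \
    "$alias_name" "$host" "$user" "$key" >> "$cfg"
}
```

For the entry above, the call would be add_ssh_host my_db 95.80.12.10 root '~/.ssh/my_id_rsa'. On reasonably recent OpenSSH versions, ssh -G my_db prints the options that will actually be used for the alias, which helps to check the result.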

For other options check https://linuxize.com/post/using-the-ssh-config-file/

How to identify whether the OS is Oracle Linux or RHEL?

There are several ways to identify that. I will suggest one of them using rpm -qf, which finds out what package a file belongs to:

Oracle Linux:

# rpm -qf /etc/redhat-release
oraclelinux-release-7.8-1.0.7.el7.x86_64

RHEL:

# rpm -qf /etc/redhat-release
redhat-release-server-7.8-2.el7.x86_64
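In a script, the package name can be mapped to a vendor with a simple pattern match. This sketch relies only on the two package name prefixes shown above; the function name release_vendor is hypothetical, and any other distribution is reported as unknown.

```shell
# release_vendor PACKAGE_NAME
# Classifies the package that owns /etc/redhat-release.
release_vendor() {
  case "$1" in
    oraclelinux-release-*) echo "Oracle Linux" ;;
    redhat-release-*)      echo "RHEL" ;;
    *)                     echo "unknown" ;;
  esac
}

# Usage on a live system:
#   release_vendor "$(rpm -qf /etc/redhat-release)"
```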

Boot in single user mode and rescue your RHEL7

Problem:

One of our customers incorrectly edited the fstab file and rebooted the OS. As a result, the VM could not start. Fortunately, the cloud where this VM was hosted supported a serial console.

Solution:

We booted in single user mode through the serial console and reverted the changes. To boot in single user mode and fix the necessary file, do as follows:

Connect to the serial console and, while the OS is booting, press e in the GRUB menu to edit the selected kernel entry:

Find the line that starts with linux16 (if you don’t see it, press the down arrow), go to the end of this line and type rd.break.

Press Ctrl+x.

Wait for a while and the system will drop into a single-user emergency shell:

At this point /sysroot is mounted read-only, so you need to remount it read-write:

switch_root:/# mount -o remount,rw /sysroot
switch_root:/# chroot /sysroot

Now you can edit any file to revert the changes; in our case we fixed fstab:

sh-4.2# vim /etc/fstab
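Before rebooting, it can be worth doing a quick syntax check of the edited fstab, so the same mistake does not lock you out again. The sketch below is only a field-count check (each entry should have 4 to 6 whitespace-separated fields); it does not verify that the devices or mount points exist. The function name check_fstab is hypothetical.

```shell
# check_fstab [FILE]
# Reports non-comment lines that do not have 4-6 fields.
# Returns non-zero if any such line is found.
check_fstab() {
  awk '
    /^[ \t]*(#|$)/ { next }            # skip comments and blank lines
    NF < 4 || NF > 6 {
      printf "line %d: expected 4-6 fields, found %d\n", NR, NF
      bad = 1
    }
    END { exit bad }
  ' "${1:-/etc/fstab}"
}
```

Inside the chroot it would be run simply as check_fstab, which reads /etc/fstab by default.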

You are a real hero, because you rescued your system!

PRVG-11069 : IP address “169.254.0.2” of network interface “idrac” on the node “primrac1” would conflict with HAIP usage

Problem:

Oracle 18c GI configuration precheck was failing with the following error:

Summary of node specific errors 

primrac2  - PRVG-11069 : IP address "169.254.0.2" of network interface "idrac" on the node "primrac2" would conflict with HAIP usage.
- Cause:  One or more network interfaces have IP addresses in the range (169.254.*.*), the range used by HAIP which can create routing conflicts.
- Action:  Make sure there are no IP addresses in the range (169.254.*.*) on any network interfaces.

primrac1  - PRVG-11069 : IP address "169.254.0.2" of network interface "idrac" on the node "primrac1" would conflict with HAIP usage.
- Cause:  One or more network interfaces have IP addresses in the range (169.254.*.*), the range used by HAIP which can create routing conflicts.
- Action:  Make sure there are no IP addresses in the range (169.254.*.*) on any network interfaces.

On each node an additional network interface named idrac was started with the IP address 169.254.0.2. I tried to set a static IP address in /etc/sysconfig/network-scripts/ifcfg-idrac and also tried to bring the interface down, but after some time the interface came back up automatically with the same IP address.

The cluster nodes were Dell servers with the Dell Remote Access Controller (iDRAC) Service Module installed. More information about installing and uninstalling this module can be found here: https://topics-cdn.dell.com/pdf/idrac-service-module-v32_users-guide_en-us.pdf

The servers were configured by a system administrator and it was not clear why this module was there. We are not using the iDRAC module, so the simplest option for us was to uninstall it. (Configuring the module to avoid this situation should also be possible, but we keep our servers as clean as possible, without unused services.)

Solution:

Uninstall the iDRAC module (also explained in the above PDF):

# rpm -e dcism 

After uninstalling it, the idrac interface no longer started, so we could continue the GI configuration.
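To see up front which interfaces would trigger PRVG-11069, the output of ip can be filtered for addresses in 169.254.*.*. This is a minimal sketch, assuming the one-line-per-address format of ip -o -4 addr show; the function name link_local_ifaces is hypothetical.

```shell
# Reads `ip -o -4 addr show` output on stdin and prints "interface address"
# for every IPv4 address in 169.254.*.*, the range the precheck complains about.
link_local_ifaces() {
  awk '$3 == "inet" && $4 ~ /^169\.254\./ { print $2, $4 }'
}

# Usage on each cluster node before running the GI prechecks:
#   ip -o -4 addr show | link_local_ifaces
```

Empty output means no interface on that node holds such an address.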

UDEV rules for configuring ASM disks

Problem:

During my previous installations I used the following udev rule on multipath devices:

KERNEL=="dm-[0-9]*", BUS=="scsi", PROGRAM=="/sbin/scsi_id -g -u -d /dev/$parent", RESULT=="360050768028200a9a40000000000001c", NAME="oracleasm/asm-disk1", OWNER="oracle", GROUP="asmadmin", MODE="0660"

To identify the exact disk I used the PROGRAM option. The above rule looks through `/dev/dm-*` devices, and if any of them satisfies the condition, for example:

# scsi_id -gud /dev/dm-3
360050768028200a9a40000000000001c

then the device name will be changed to /dev/oracleasm/asm-disk1, owner:group to oracle:asmadmin and permissions to 0660.

But on my new servers the same udev rule no longer worked. (Of course, it needs more investigation, but our time is really valuable and never enough, and if we know another solution that works and is acceptable, let’s just use it.)

Solution:

I used udevadm command to identify other properties of these devices and wrote new udev rule (to see all properties, just remove grep):

# udevadm info --query=property --name /dev/mapper/asm1 | grep DM_UUID
DM_UUID=mpath-360050768028200a9a40000000000001c

New udev rule looks like this:

# cat /etc/udev/rules.d/99-oracle-asmdevices.rules
ENV{DM_UUID}=="mpath-360050768028200a9a40000000000001c",  SUBSYSTEM=="block", NAME="oracleasm/asm-disk1", OWNER="grid", GROUP="asmadmin", MODE="0660"

Trigger udev rules:

# udevadm trigger

Verify that name, owner, group and permissions are changed:

# ll /dev/oracleasm/
total 0
brw-rw---- 1 grid asmadmin 253, 3 Jul 17 17:33 asm-disk1
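With more than a handful of disks, writing one rule per device by hand is error-prone. The sketch below generates a rule line in the same form as the new rule above from a WWID and a disk name; the function name asm_udev_rule and the default owner and group arguments are my own choices.

```shell
# asm_udev_rule WWID DISK_NAME [OWNER] [GROUP]
# Prints one udev rule matching a multipath device by its DM_UUID.
asm_udev_rule() {
  wwid=$1 name=$2 owner=${3:-grid} group=${4:-asmadmin}
  printf 'ENV{DM_UUID}=="mpath-%s", SUBSYSTEM=="block", NAME="oracleasm/%s", OWNER="%s", GROUP="%s", MODE="0660"\n' \
    "$wwid" "$name" "$owner" "$group"
}

# Usage (as root), one call per disk:
#   asm_udev_rule 360050768028200a9a40000000000001c asm-disk1 \
#     >> /etc/udev/rules.d/99-oracle-asmdevices.rules
```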

sshd: /etc/ssh/sshd_config: Permission denied

Problem:

The sshd and chronyd services on the database server were in a failed state and could not start because of a permission problem on their configuration files. The permissions on these files were correct and the services should have been able to start, so there was something else… let’s dig into the details.

# systemctl status sshd
 ● sshd.service - OpenSSH server daemon
    Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled; vendor preset: enabled)
    Active: activating (auto-restart) (Result: exit-code) since Tue 2019-07-09 12:21:49 UTC; 32s ago
      Docs: man:sshd(8)
            man:sshd_config(5)
   Process: 124026 ExecStart=/usr/sbin/sshd -D $OPTIONS (code=exited, status=1/FAILURE)
Main PID: 124026 (code=exited, status=1/FAILURE)
Jul 09 12:21:49 node03 systemd[1]: Failed to start OpenSSH server daemon.
Jul 09 12:21:49 node03 systemd[1]: Unit sshd.service entered failed state.
Jul 09 12:21:49 node03 systemd[1]: sshd.service failed

`journalctl -xe` shows:

-- Unit sshd.service has begun starting up.
Jul 09 12:26:03 node03 sshd[129121]: /etc/ssh/sshd_config: Permission denied
Jul 09 12:26:03 node03 systemd[1]: sshd.service: main process exited, code=exited, status=1/FAILURE
Jul 09 12:26:03 node03 systemd[1]: Failed to start OpenSSH server daemon.
-- Subject: Unit sshd.service has failed

The same problem was happening with the chronyd service. It was complaining about the /etc/chrony.conf file. Incorrect time on database servers can cause node evictions.

Reason:

If the permissions on these files are correct, the next suspect is SELinux. Let’s check:

# getenforce 
Enforcing

Solution:

Disable SELinux and reboot the server:

# vim /etc/selinux/config
SELINUX=disabled

# reboot
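Before disabling SELinux entirely, it may be worth checking whether the config files simply carry a wrong SELinux label, for example after being copied or restored from a backup. Whether that was the cause in this case is not known; the following is only a sketch of how such a check could look. The function names are hypothetical, and it assumes that stat (GNU coreutils) and matchpathcon (libselinux-utils) are available.

```shell
# Extracts the type field from an SELinux context such as
# system_u:object_r:etc_t:s0 (user:role:type:level).
selinux_type() {
  printf '%s\n' "$1" | cut -d: -f3
}

# label_mismatch FILE
# Succeeds (returns 0) when the file's current type differs from the
# policy default; returns 2 if either context cannot be read.
label_mismatch() {
  current=$(stat -c %C "$1") || return 2
  expected=$(matchpathcon -n "$1") || return 2
  [ "$(selinux_type "$current")" != "$(selinux_type "$expected")" ]
}

# Usage (as root): relabel only when the label is actually wrong.
#   label_mismatch /etc/ssh/sshd_config && restorecon -v /etc/ssh/sshd_config
```

If the labels turn out to be correct, the label was not the problem, and the remaining options are to investigate the denials in the audit log or to disable SELinux as shown above.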

Summary:

I consider SELinux an undesirable service on database servers. But I appreciate the opinion of my colleagues/friends and I want to share it with you.

SELinux can be enabled with the correct config in RHEL 4, 5 and 6: “Starting with Oracle Database 11g Release 2 (11.2), the Security Enhanced Linux (SELinux) feature is supported for Oracle Linux 4, Oracle Linux 5, Oracle Linux 6, Red Hat Enterprise Linux 4, Red Hat Enterprise Linux 5, and Red Hat Enterprise Linux 6.”
https://docs.oracle.com/cd/E11882_01/install.112/e47689/pre_install.htm#LADBI1092

SELinux is a good security tool and usually I only disable it as a last resort or if the software doesn’t support it.

“kernel: serial8250: too much work for irq4” potential problem caused by Azure OMS Agent

Problem:

There are a lot of “kernel: serial8250: too much work for irq4” warnings in /var/log/messages, and the system is likely to experience stability problems. This can lead to Oracle cluster node evictions.
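To get a feel for how frequent the warnings are, they can simply be counted. A tiny sketch; the function name count_irq4_warnings is hypothetical and the log path defaults to /var/log/messages.

```shell
# count_irq4_warnings [LOGFILE]
# Prints the number of lines containing the serial8250 irq4 warning.
count_irq4_warnings() {
  grep -c 'serial8250: too much work for irq4' "${1:-/var/log/messages}"
}
```

Running it before and after removing the extension shows whether new warnings are still being logged.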

Cause:

The problem was related to the Azure OMS Agent pushing very large messages to the serial console. It was introduced by the latest update of the Azure OMS agent.

Temporary Solution:

Temporarily remove the OMS Linux Agent Extension until Microsoft resolves this bug:

1. On Azure portal click the link of the affected VM.
2. Click the “Extensions” section.
3. Click the OMS Linux Agent in the list.
4. Click the “Uninstall” button at the top

Once you have made sure that the OMS agent bug is fixed (this should be verified with Microsoft support), you can reinstall the extension.