sshd: /etc/ssh/sshd_config: Permission denied

Problem:

The sshd and chronyd services on the database server were in a failed state, unable to start because of a permission problem on their configuration files. Yet the permissions on these files were correct and the services should have been able to start, so there was something else… let’s dig into the details.

# systemctl status sshd
 ● sshd.service - OpenSSH server daemon
    Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled; vendor preset: enabled)
    Active: activating (auto-restart) (Result: exit-code) since Tue 2019-07-09 12:21:49 UTC; 32s ago
      Docs: man:sshd(8)
            man:sshd_config(5)
   Process: 124026 ExecStart=/usr/sbin/sshd -D $OPTIONS (code=exited, status=1/FAILURE)
Main PID: 124026 (code=exited, status=1/FAILURE)
Jul 09 12:21:49 node03 systemd[1]: Failed to start OpenSSH server daemon.
Jul 09 12:21:49 node03 systemd[1]: Unit sshd.service entered failed state.
Jul 09 12:21:49 node03 systemd[1]: sshd.service failed

`journalctl -xe` shows:

-- Unit sshd.service has begun starting up.
Jul 09 12:26:03 node03 sshd[129121]: /etc/ssh/sshd_config: Permission denied
Jul 09 12:26:03 node03 systemd[1]: sshd.service: main process exited, code=exited, status=1/FAILURE
Jul 09 12:26:03 node03 systemd[1]: Failed to start OpenSSH server daemon.
-- Subject: Unit sshd.service has failed

The same problem was happening with the chronyd service, which was complaining about the /etc/chrony.conf file. Incorrect time on database servers can cause node evictions.

Reason:

If the permissions on these files are correct, the next suspect is SELinux. Let’s check:

# getenforce 
Enforcing

Solution:

Disable SELinux and reboot the server:

# vim /etc/selinux/config
SELINUX=disabled

# reboot
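
If you prefer to keep SELinux enabled, restoring the default security contexts on the affected files may be enough instead of disabling it entirely (a sketch I have not tested in this environment; ls -Z and restorecon are standard SELinux tools):

# ls -Z /etc/ssh/sshd_config
# restorecon -v /etc/ssh/sshd_config
# restorecon -v /etc/chrony.conf
# systemctl restart sshd chronyd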

Summary:

I consider SELinux a non-desirable service on database servers. But I appreciate the opinion of my colleagues/friends and I want to share it with you.

SELinux can be enabled with the correct config in RHEL 4, 5, and 6 – “Starting with Oracle Database 11g Release 2 (11.2), the Security Enhanced Linux (SELinux) feature is supported for Oracle Linux 4, Oracle Linux 5, Oracle Linux 6, Red Hat Enterprise Linux 4, Red Hat Enterprise Linux 5, and Red Hat Enterprise Linux 6.”
https://docs.oracle.com/cd/E11882_01/install.112/e47689/pre_install.htm#LADBI1092

SELinux is a good security tool and usually I only disable it as a last resort or if the software doesn’t support it.

PRCS-1007 : Server pool RPPCOMS already exists

Problem:

While adding a database using the srvctl add database command, I got PRCS-1007 and PRCR-1086 errors:

$ srvctl add database -db RPPCOMS -oraclehome /u01/app/oracle/product/12.1.0/dbhome_1 -spfile +DATA/RPPCOMS/spfileRPPCOMS.ora -pwfile +DATA/RPPCOMS/PASSWORD/pwdrppcoms.256.1005727427 -dbname RPPCOMS  -diskgroup FRA,DATA

PRCS-1007 : Server pool RPPCOMS already exists
PRCR-1086 : server pool ora.RPPCOMS is already registered

Solution:

Remove the mentioned server pool as the root user:

# crsctl delete serverpool ora.RPPCOMS
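
You can verify that the pool is no longer registered before retrying (a quick sanity check):

# crsctl status serverpool ora.RPPCOMS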

Then retry adding the database.

19c GI & 12c RDBMS opatchauto: Re-link fails on target “procob”

Problem:

Environment: 19c GI | 12c RDBMS | RHEL 7.6

While applying the Apr 2019 RU on top of the RDBMS home, opatchauto failed:

# /u01/app/oracle/product/12.2.0/dbhome_1/OPatch/opatchauto apply /u01/swtmp/29314339/ -oh /u01/app/oracle/product/12.2.0/dbhome_1 

==Following patches FAILED in apply:
Patch: /u01/swtmp/29314339
Log: /u01/app/oracle/product/12.2.0/dbhome_1/cfgtoollogs/opatchauto/core/opatch/opatch2019-06-24_18-51-22PM_1.log
Reason: Failed during Patching: oracle.opatch.opatchsdk.OPatchException: Re-link fails on target "procob".
Re-link fails on target "proc". 

We tried using opatch instead of opatchauto, and it succeeded:

$ /u01/app/oracle/product/12.2.0/dbhome_1/OPatch/opatch apply

Patch 29314339 successfully applied.
OPatch succeeded.
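
To double-check, you can list the patch inventory afterwards:

$ /u01/app/oracle/product/12.2.0/dbhome_1/OPatch/opatch lspatches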

The problem is definitely related to opatchauto. We have opened an SR with Oracle and I will update this post as soon as I get the solution from them. Until then, I want to share three workarounds for this problem with you.

Workarounds:

1. The first workaround, as you have already guessed, is to use opatch instead of opatchauto.

$ /u01/app/oracle/product/12.2.0/dbhome_1/OPatch/opatch apply

2. The second workaround is to edit the actions.xml file under the unzipped patch directory (e.g. /u01/swtmp/29314339/etc/config/actions.xml) and remove the following entries (see Doc ID 2056670.1):

<oracle.precomp.lang opt_req="O" version="12.2.0.1.0">
<make change_dir="%ORACLE_HOME%/precomp/lib" make_file="ins_precomp.mk" make_target="procob"/>
</oracle.precomp.lang>
  
<oracle.precomp.common opt_req="O" version="12.2.0.1.0">
<make change_dir="%ORACLE_HOME%/precomp/lib" make_file="ins_precomp.mk" make_target="proc"/>
</oracle.precomp.common> 

Then retry opatchauto.
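
The retry is the same command that failed initially:

# /u01/app/oracle/product/12.2.0/dbhome_1/OPatch/opatchauto apply /u01/swtmp/29314339/ -oh /u01/app/oracle/product/12.2.0/dbhome_1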

3. The third workaround is to back up the libons.so file in the GI home and then copy the one from the RDBMS home in its place (you must put it back after opatchauto succeeds).

# mv  /u01/app/19.3.0/grid/lib/libons.so  /u01/app/19.3.0/grid/lib/libons.so_backup 
# cp  /u01/app/oracle/product/12.2.0/dbhome_1/lib/libons.so  /u01/app/19.3.0/grid/lib/libons.so  

The reason I used this workaround is that I found opatchauto was using the libons.so file from the GI home instead of the RDBMS home:

[WARNING]OUI-67200:Make failed to invoke "
/usr/bin/make -f ins_precomp.mk proc ORACLE_HOME=/u01/app/oracle/product/12.2.0/dbhome_1"….
'/u01/app/19.3.0/grid/lib/libons.so: undefined reference to `memcpy@GLIBC_2.14'                         

If we compare the libons.so files between the GI and RDBMS homes, we will find that they are not the same. The problem is that opatchauto uses that file from the wrong location. As long as it is not easy to force opatchauto to use the libons.so file from the correct location (we are working on this with Oracle Support), this workaround can also be considered.
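
A quick checksum comparison makes the difference visible (paths as in this environment):

# md5sum /u01/app/19.3.0/grid/lib/libons.so /u01/app/oracle/product/12.2.0/dbhome_1/lib/libons.so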

Please keep in mind that you must put the GI libons.so back after opatchauto succeeds. We do not know what happens if a 12c libons.so file is left under a 19c GI home.
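
The restore is simply the reverse of the backup step above:

# mv /u01/app/19.3.0/grid/lib/libons.so_backup /u01/app/19.3.0/grid/lib/libons.so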

Detach diskgroup from 12c GI and attach to 19c GI

Task:

We have two separate Real Application Clusters, one 12c and another 19c. We decided to migrate data from 12c to 19c by simply detaching all ASM disks from the source and attaching them to the destination.

Steps:

1. Connect to the 12c GI as the grid user and dismount the FRA diskgroup on all nodes:

[grid@rac1 ~]$ sqlplus  / as sysasm
Connected to:
Oracle Database 12c Enterprise Edition Release 12.2.0.1.0 - 64bit Production
SQL> alter diskgroup FRA dismount;
Diskgroup altered. 
[grid@rac2 ~]$ sqlplus  / as sysasm
Connected to:
Oracle Database 12c Enterprise Edition Release 12.2.0.1.0 - 64bit Production
SQL> alter diskgroup FRA dismount;
Diskgroup altered.

You can also use srvctl to stop the diskgroup with one command.
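
For example, something like this should dismount it clusterwide:

$ srvctl stop diskgroup -diskgroup FRA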

2. Detach the disks belonging to the specific diskgroup from the 12c cluster and attach them to the 19c cluster.
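
Before moving on, one way to confirm that the disks are visible on the 19c side (assuming the grid environment is set; the exact output depends on your disk discovery string):

$ asmcmd lsdsk --candidate -p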

3. After the ASM disks are visible on the 19c cluster, connect as sysasm via the grid user and mount the diskgroup:

# Check that there is no FRA resource registered with CRS:

[root@rac1 ~]# crsctl status res -t |grep FRA

# Mount the diskgroup on all nodes

[grid@rac1 ~]$ sqlplus / as sysasm
Connected to:
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.3.0.0.0
SQL> alter diskgroup FRA mount;
Diskgroup altered.
[grid@rac2 ~]$ sqlplus / as sysasm
Connected to:
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.3.0.0.0
SQL> alter diskgroup FRA mount;
Diskgroup altered.

# FRA diskgroup resource will automatically be registered with CRS:

[root@rac1 ~]# crsctl status res -t |grep FRA
ora.FRA.dg(ora.asmgroup)

And data will be there…
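
A quick look via asmcmd should also show the familiar directory tree, for example:

$ asmcmd ls +FRA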

Rollback RU patches from 12c GI home using opatchauto

Junior DBAs will find these steps useful 🙂

Environment details:

Two-node Real Application Cluster.
Database version: 12.2.0.1
Applied RU: 16-04-2019

1. Check existing patches

[grid@rac1 ~]$  /u01/app/12.2.0/grid/OPatch/opatch lspatches
29314424;OCW APR 2019 RELEASE UPDATE 12.2.0.1.190416 (29314424)
29314339;Database Apr 2019 Release Update : 12.2.0.1.190416 (29314339)
29301676;ACFS APR 2019 RELEASE UPDATE 12.2.0.1.190416 (29301676)
28566910;TOMCAT RELEASE UPDATE 12.2.0.1.0(ID:180802.1448.S) (28566910)
26839277;DBWLM RELEASE UPDATE 12.2.0.1.0(ID:170913) (26839277)
OPatch succeeded.

Note that all these patches are part of RU 16-04-2019.

2. Stop all database instances on that node:

# srvctl stop instance -db orclA -i orclA1
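
You can check which instances are running before and after stopping (a quick sanity check):

# srvctl status database -db orclA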

3. Download Release Update 16-04-2019 (p29301687_122010_Linux-x86-64.zip), unzip it, and go to the unzipped patch location:

To roll back all these patches, it is easier to have the unzipped Release Update 16-04-2019 patch on the server (all the existing patches are part of it).

If you cannot download the zipped RU, then you need to list all the patch IDs explicitly: opatchauto rollback -id 29314424,29314339,29301676,28566910,26839277

Since I have the unzipped RU on rac1, I will do it the following way:

[root@rac1 ~]# cd /u01/app/sw/29301687

[root@rac1 29301687]# ll
 total 132
 drwxr-x--- 4 grid oinstall     48 Mar 25 01:09 26839277
 drwxr-x--- 4 grid oinstall     48 Mar 25 01:08 28566910
 drwxr-x--- 5 grid oinstall     62 Mar 25 01:03 29301676
 drwxr-x--- 4 grid oinstall     67 Mar 25 01:08 29314339
 drwxr-x--- 5 grid oinstall     62 Mar 25 01:06 29314424
 drwxr-x--- 2 grid oinstall   4096 Mar 25 01:03 automation
 -rw-rw-r-- 1 grid oinstall   5828 Mar 25 01:29 bundle.xml
 -rw-r--r-- 1 grid oinstall 120219 Apr 10 18:07 README.html
 -rw-r----- 1 grid oinstall      0 Mar 25 01:03 README.txt

4. Rollback patches using opatchauto:

[root@rac1 29301687]# /u01/app/12.2.0/grid/OPatch/opatchauto rollback -oh /u01/app/12.2.0/grid
 ….
 ==Following patches were SUCCESSFULLY rolled back:
 Patch: /u01/app/sw/29301687/29314424
 Log: /u01/app/12.2.0/grid/cfgtoollogs/opatchauto/core/opatch/opatch2019-05-29_12-56-19PM_1.log
 Patch: /u01/app/sw/29301687/29301676
 Log: /u01/app/12.2.0/grid/cfgtoollogs/opatchauto/core/opatch/opatch2019-05-29_12-56-19PM_1.log
 Patch: /u01/app/sw/29301687/26839277
 Log: /u01/app/12.2.0/grid/cfgtoollogs/opatchauto/core/opatch/opatch2019-05-29_12-56-19PM_1.log
 Patch: /u01/app/sw/29301687/28566910
 Log: /u01/app/12.2.0/grid/cfgtoollogs/opatchauto/core/opatch/opatch2019-05-29_12-56-19PM_1.log
 Patch: /u01/app/sw/29301687/29314339
 Log: /u01/app/12.2.0/grid/cfgtoollogs/opatchauto/core/opatch/opatch2019-05-29_12-56-19PM_1.log

5. Start the database instance on the first node and shut it down on the second:

# srvctl start instance -db orclA -i orclA1
# srvctl stop instance -db orclA -i orclA2

6. Connect to the second node and repeat the same steps:

[root@rac2 ~]# cd /u01/app/sw/29301687

[root@rac2 29301687]# /u01/app/12.2.0/grid/OPatch/opatchauto rollback -oh /u01/app/12.2.0/grid

7. Start the database instance on rac2:

# srvctl start instance -db orclA -i orclA2

8. Check the inventory:

$  /u01/app/12.2.0/grid/OPatch/opatch lspatches

There are no Interim patches installed in this Oracle Home "/u01/app/12.2.0/grid".
 OPatch succeeded.

Start CRS even when getting “ORA-15040: diskgroup is incomplete” on the voting file/OCR diskgroup

Problem:

CRS was down on both nodes. During startup, the cluster encountered the following error while trying to mount the diskgroup containing the voting files and OCR:

WARNING: Disk Group VOTE containing configured OCR is not mounted
WARNING: Disk Group VOTE containing voting files is not mounted
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "0" is missing from group number "1" 

The diskgroup where the OCR and voting files were located could not be mounted because one disk was missing. As a result, CRS was down:

# crsctl status res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.

We know that a NORMAL redundancy diskgroup can tolerate the loss of one mirror at a time.

Solution:

1. Start HAS and check the status of the local resources:

# crsctl start has

# crsctl status res -t -init

---------------------------------------------------------------------------
Name          Target      State        Server      State details       
---------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------- 
ora.asm
       1       ONLINE      ONLINE       rac2         STABLE
 ora.cluster_interconnect.haip
       1        ONLINE     ONLINE       rac2         STABLE
 ora.crf
       1        OFFLINE    OFFLINE                   STABLE
 ora.crsd
       1        ONLINE      OFFLINE                  STABLE
 ora.cssd
       1        ONLINE      ONLINE       rac2        STABLE
 ora.cssdmonitor
       1        ONLINE      ONLINE       rac2        STABLE
 ora.ctssd
       1        ONLINE      ONLINE       rac2        OBSERVER,STABLE
 ora.diskmon
       1        OFFLINE      OFFLINE                 STABLE
 ora.drivers.acfs
       1        ONLINE      ONLINE       rac2        STABLE
 ora.evmd
       1        ONLINE      INTERMEDIATE rac2        STABLE
 ora.gipcd
       1        ONLINE      ONLINE       rac2        STABLE
 ora.gpnpd
       1        ONLINE      ONLINE       rac2        STABLE
 ora.mdnsd
       1        ONLINE      ONLINE       rac2        STABLE
 ora.storage
       1        ONLINE      OFFLINE      rac2        STABLE 

2. Connect to the ASM instance and mount the diskgroup using the force option.

The ASM instance will be in nomount state, because the diskgroup containing the voting files and OCR cannot be mounted.

The force option is mandatory; otherwise you will get the same ORA-15040 error.

# su - grid

$ sqlplus / as sysasm

SQL*Plus: Release 12.2.0.1.0 Production on Tue May 28 16:14:14 2019
Copyright (c) 1982, 2016, Oracle.  All rights reserved.

Connected to:
 Oracle Database 12c Enterprise Edition Release 12.2.0.1.0 - 64bit Production

SQL> alter diskgroup VOTE mount force;
Diskgroup altered.

This operation sometimes takes ~6 minutes to complete because of the following notification in alert_ASM?.log:

"WARNING: Background operations delayed until 05/28/19 16:19:47 because ASM was not stopped cleanly and there could be disconnected client(s)"

The message is self-explanatory.

3. The diskgroup mount operation in step 2 should trigger Clusterware autostart; if not, start it using the following command:

# crsctl start cluster

4. Check CRS status:

# crsctl status res -t 

---------------------------------------------------------------------------
Name           Target  State        Server       State details       
--------------------------------------------------------------------------- 
Local Resources

ora.ASMNET1LSNR_ASM.lsnr
                ONLINE  ONLINE       rac2        STABLE
ora.DATA.dg
                ONLINE  OFFLINE      rac2        STABLE
ora.FRA.dg
                ONLINE  OFFLINE      rac2        STABLE
ora.LISTENER.lsnr
                ONLINE  ONLINE       rac2        STABLE
ora.MGMT.dg
                ONLINE  OFFLINE      rac2        STABLE
ora.VOTE.dg
                ONLINE  ONLINE       rac2        STABLE
ora.chad
                ONLINE  OFFLINE      rac2        STABLE
ora.net1.network
                ONLINE  ONLINE       rac2        STABLE
ora.ons
                ONLINE  ONLINE       rac2        STABLE
ora.proxy_advm
                OFFLINE OFFLINE      rac2        STABLE
---------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------- 
ora.LISTENER_SCAN1.lsnr
       1        ONLINE  ONLINE       rac2        STABLE
 ora.LISTENER_SCAN2.lsnr
       1        ONLINE  ONLINE       rac2        STABLE
 ora.LISTENER_SCAN3.lsnr
       1        ONLINE  ONLINE       rac2        STABLE
 ora.MGMTLSNR
       1        OFFLINE OFFLINE                  STABLE
 ora.asm
       1        ONLINE  OFFLINE                  STABLE
       2        ONLINE  ONLINE       rac2        Started,STABLE
 ora.cvu
       1        ONLINE  ONLINE       rac2        STABLE
 ora.mgmtdb
       1        OFFLINE OFFLINE                  STABLE
 ora.qosmserver
       1        ONLINE  ONLINE       rac2        STABLE 
 ora.rac1.vip
       1        ONLINE  INTERMEDIATE rac2        FAILED OVER,STABLE
 ora.rac2.vip
       1        ONLINE  ONLINE       rac2        STABLE
 ora.scan1.vip
       1        ONLINE  ONLINE       rac2        STABLE
 ora.scan2.vip
       1        ONLINE  ONLINE       rac2        STABLE
 ora.scan3.vip
       1        ONLINE  ONLINE       rac2        STABLE

Recommendation:

Replace the corrupted disks as soon as possible and bring them back online.
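
Depending on whether the missing disk comes back or must be replaced, the follow-up looks roughly like this (a sketch; the disk name VOTE_0000 and the path /dev/asm-disk0 are hypothetical):

SQL> alter diskgroup VOTE online all;

Or, if the disk is permanently lost, drop it and add a replacement:

SQL> alter diskgroup VOTE drop disk VOTE_0000 force;
SQL> alter diskgroup VOTE add disk '/dev/asm-disk0';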

How to delete MGMTDB?

Introduction

MGMTDB is a repository database that stores Cluster Health Monitor (CHM) data. In Oracle 11g this information was stored in a Berkeley DB database (.bdb files), but starting with Oracle Database 12c it is configured as an Oracle single-instance database.

In Oracle 12.1.0.1 GIMR is optional, whereas in Oracle 12.1.0.2 it is mandatory and turning it off is not supported, with the exception of Exadata.
I have searched a lot to find out why turning it off is not supported, but I still do not have that answer. I only know that TFA collects some information from MGMTDB, so if we turn it off, TFA will not be able to retrieve that information. In 19c GIMR is optional again.

The reason I want to turn it off is that there are several bugs related to MGMTDB. We have seen several customers with performance issues caused by MGMTDB; the repository database was able to consume almost 100% of CPU resources. In addition, one customer noticed that MGMTDB had grown to 60GB and exhausted the GRID diskgroup where the OCR and voting files were located (this size is not normal for a 3-node cluster).

More information about the Grid Infrastructure Management Repository (GIMR) can be found in Doc ID 1568402.1.

Steps

Please consider that for 12.1.0.2 deleting it is not supported by Oracle (it is not clear why), so it is better to ask Oracle Support before doing this.

Instead of deleting the repository, it is better to apply all the bug fixes related to it and try to use its intelligence to proactively tune your databases. But if you still want to delete it, or at least know how to delete it, then let’s do that; it is not harmful.

1. Stop and disable CRF

# crsctl stop res ora.crf -init
CRS-2673: Attempting to stop 'ora.crf' on 'rac1'
CRS-2677: Stop of 'ora.crf' on 'rac1' succeeded 

# crsctl modify res ora.crf -attr ENABLED=0 -init

# crsctl status res ora.crf -init
NAME=ora.crf
TYPE=ora.crf.type
TARGET=OFFLINE
STATE=OFFLINE 

2. Delete MGMTDB using dbca

The database must be up and running for this step.
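
You can check that it is running first:

# srvctl status mgmtdb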

# su - grid

$ dbca -silent -deleteDatabase -sourceDB -MGMTDB
Connecting to database
4% complete
9% complete
14% complete
19% complete
23% complete
28% complete
47% complete
Updating network configuration files
52% complete
Deleting instance and datafiles
76% complete
100% complete    

3. Make sure that MGMTDB was deleted

# srvctl status mgmtdb
PRCD-1120 : The resource for database _mgmtdb could not be found.
PRCR-1001 : Resource ora.mgmtdb does not exist