[WARNING] [INS-08102] Unexpected error occurred while transitioning from state ‘DBIdentification’

Problem:

While using dbca to create a database I get the following error:

[oracle@rac1 ~]$ dbca -silent -createDatabase \
>   -responseFile NO_VALUE \
>   -templateName New_Database.dbt \
>   -dbOptions "OMS:false,JSERVER:false,SPATIAL:false,IMEDIA:false,ORACLE_TEXT:false,SAMPLE_SCHEMA:false,CWMLITE:false,APEX:false,DV:false" \
>   -gdbname racdb \
>   -characterSet AL32UTF8 \
>   -sysPassword Oracle123 \
>   -systemPassword Oracle123 \
>   -databaseType MULTIPURPOSE \
>   -automaticMemoryManagement false \
>   -totalMemory 6144 \
>   -redoLogFileSize 512 \
>   -nodelist rac1,rac2 \
>   -storageType ASM \
>   -diskGroupName DATA01 \
>   -recoveryGroupName DATA01 \
>   -recoveryAreaSize 20000 \
>   -initParams "log_buffer=128M,processes=3000" \
>   -asmsnmpPassword Oracle123
[WARNING] [INS-08102] Unexpected error occurred while transitioning from state 'DBIdentification'
   CAUSE: No additional information available.
   ACTION: Contact Oracle Support Services or refer to the software manual.

On DBCA GUI it looks like this:

Solution:

On the Oracle site or on the Internet, there is no useful information that can help so far. The only place you should search for the reason is in dbca logs generated under /u01/app/oracle/cfgtoollogs/dbca

[root@rac1 dbca]# pwd
/u01/app/oracle/cfgtoollogs/dbca

[root@rac1 dbca]# ll
total 1344
drwxr-xr-x 2 root   root        4096 Jul 14 18:48 old
drwxr-x--- 2 oracle oinstall       6 Apr 28 16:25 orcl
-rw-r----- 1 oracle oinstall 1370046 Jul 14 18:51 trace.log_2023-07-14_06-51-03PM
-rw-r----- 1 oracle oinstall       0 Jul 14 18:51 trace.log_2023-07-14_06-51-03PM.lck

The latest trace file contains the following error messages:

[Thread-355] [ 2023-07-14 18:51:35.375 UTC ] [StreamReader.run:66]  OUTPUT> DIA-49802: missing read, write, or execute permission on specified ADR home directory [/u01/app/grid/diag/crs/rac1/crs/log]
[Thread-355] [ 2023-07-14 18:51:35.375 UTC ] [StreamReader.run:66]  OUTPUT>DIA-49801: actual permissions [rwxr-xr-x], expected minimum permissions [rwxrwx---] for effective user [oracle]
[Thread-355] [ 2023-07-14 18:51:35.375 UTC ] [StreamReader.run:66]  OUTPUT>DIA-48188: user missing read, write, or exec permission on specified directory
....
[Thread-363] [ 2023-07-14 19:01:53.417 UTC ] [StreamReader.run:66]  OUTPUT> DIA-48141: error creating directory during ADR initialization [/u01/app/grid/diag/crs/rac1/crs/trace]
[Thread-363] [ 2023-07-14 19:01:53.417 UTC ] [StreamReader.run:66]  OUTPUT>DIA-48189: OS command to create directory failed
[Thread-363] [ 2023-07-14 19:01:53.417 UTC ] [StreamReader.run:66]  OUTPUT>Linux-x86_64 Error: 1: Operation not permitted

Let’s check permissions on folders under /u01/app/grid/diag/crs/rac1/crs

[root@rac1 dbca]# ll /u01/app/grid/diag/crs/rac1/crs/

drwxr-xr-x 2 grid oinstall   21 Jul 14 16:09 alert
drwxr-xr-x 2 grid oinstall    6 Jul 12 21:33 cdump
drwxr-xr-x 2 grid oinstall    6 Jul 12 21:33 incident
drwxr-xr-x 2 grid oinstall    6 Jul 12 21:33 incpkg
drwxr-xr-x 2 grid oinstall 4096 Jul 12 21:33 lck
drwxrwxr-x 5 grid oinstall   48 Jul 12 21:33 log
drwxr-xr-x 2 grid oinstall 4096 Jul 12 21:33 metadata
drwxr-xr-x 2 grid oinstall    6 Jul 12 21:33 metadata_dgif
drwxr-xr-x 2 grid oinstall    6 Jul 12 21:33 metadata_pv
drwxr-xr-x 2 grid oinstall    6 Jul 12 21:33 stage
drwxr-xr-x 2 grid oinstall    6 Jul 12 21:33 sweep
drwxr-xr-x 2 grid oinstall 8192 Jul 14 18:43 trace

When I have the same case, I always check the same folder on a healthy server, let’s verify permissions on a healthy node:

[root@rac2 oraInventory]# ll /u01/app/grid/diag/crs/rac2/crs

drwxrwxr-x 2 grid oinstall    21 Apr 26 17:53 alert
drwxrwxr-x 5 grid oinstall    57 Jul 13 21:43 cdump
drwxrwxr-x 7 grid oinstall    89 Jul 13 21:43 incident
drwxrwxr-x 2 grid oinstall     6 Apr 26 17:53 incpkg
drwxrwxr-x 2 grid oinstall  4096 Jul 13 21:43 lck
drwxrwxr-x 5 grid oinstall    65 Jun  1 00:36 log
drwxrwxr-x 2 grid oinstall  4096 Apr 26 17:53 metadata
drwxrwxr-x 2 grid oinstall     6 Apr 26 17:53 metadata_dgif
drwxrwxr-x 2 grid oinstall     6 Apr 26 17:53 metadata_pv
drwxrwxr-x 2 grid oinstall   119 Jul 13 21:43 stage
drwxrwxr-x 2 grid oinstall   124 Jul 13 21:43 sweep
drwxrwxr-x 2 grid oinstall 32768 Jul 14 18:43 trace

Now it’s time to correct permissions on the problematic node:

[root@rac1 dbca]# chmod -R 775 /u01/app/grid/diag/crs/rac1/crs/*

[root@rac1 dbca]# ll /u01/app/grid/diag/crs/rac1/crs/

drwxrwxr-x 2 grid oinstall   21 Jul 14 16:09 alert
drwxrwxr-x 2 grid oinstall    6 Jul 12 21:33 cdump
drwxrwxr-x 2 grid oinstall    6 Jul 12 21:33 incident
drwxrwxr-x 2 grid oinstall    6 Jul 12 21:33 incpkg
drwxrwxr-x 2 grid oinstall 4096 Jul 12 21:33 lck
drwxrwxr-x 5 grid oinstall   48 Jul 12 21:33 log
drwxrwxr-x 2 grid oinstall 4096 Jul 12 21:33 metadata
drwxrwxr-x 2 grid oinstall    6 Jul 12 21:33 metadata_dgif
drwxrwxr-x 2 grid oinstall    6 Jul 12 21:33 metadata_pv
drwxrwxr-x 2 grid oinstall    6 Jul 12 21:33 stage
drwxrwxr-x 2 grid oinstall    6 Jul 12 21:33 sweep
drwxrwxr-x 2 grid oinstall 8192 Jul 14 18:43 trace

It is now possible for DBCA to proceed without any issues.
Good Luck!

RMAN spread a backup job between many RAC instances in parallel to increase throughput

There are two options to allocate RMAN channels on different RAC instances to increase the throughput.

I will start with the option, that assures all RAC instances get one channel. Regarding the other option, it does load balance but in a random way, so with a small number of channels, you may see that all of them are allocated in one instance. So you decide which option is better for you.

The test is done on a 2-node cluster.

  1. Configure parallelism and two channels in RMAN. Indicate a connect string, one per instance:
$ RMAN target /
RMAN> CONFIGURE DEVICE TYPE disk PARALLELISM 2;
RMAN> CONFIGURE CHANNEL 1 DEVICE TYPE DISK CONNECT 'sys/Oracle123@ORCL1 as sysdba';
RMAN> CONFIGURE CHANNEL 2 DEVICE TYPE DISK CONNECT 'sys/Oracle123@ORCL2 as sysdba';

2. Define ORCL1 and ORCL2 aliases on each database node under $ORACLE_HOME/network/admin/tnsnames.ora:

ORCL1=
(DESCRIPTION=
           (ADDRESS= (PROTOCOL=tcp) (HOST=rac1.example.com) (PORT=1522))
           (CONNECT_DATA =
              (SERVER = DEDICATED)
              (SERVICE_NAME = orcl)
           )
      )
 
ORCL2=
(DESCRIPTION=
           (ADDRESS= (PROTOCOL=tcp) (HOST=rac2.example.com) (PORT=1522))
           (CONNECT_DATA =
              (SERVER = DEDICATED)
              (SERVICE_NAME = orcl)
           )
      )

Please note, in my case 1522 is a local listener port.

3. Run backup:

RMAN> backup database;
 
Starting backup at 22-MAR-22
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=507 instance=orcl1 device type=DISK
allocated channel: ORA_DISK_2
channel ORA_DISK_2: SID=753 instance=orcl2 device type=DISK
channel ORA_DISK_2: SID=753 instance=orcl2 device type=DISK
channel ORA_DISK_1: starting full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
input datafile file number=00001 name=+DATA/ORCL/DATAFILE/system.257.1098460673
input datafile file number=00004 name=+DATA/ORCL/DATAFILE/undotbs1.259.1098460743
channel ORA_DISK_1: starting piece 1 at 22-MAR-22
channel ORA_DISK_2: starting full datafile backup set
channel ORA_DISK_2: specifying datafile(s) in backup set
input datafile file number=00003 name=+DATA/ORCL/DATAFILE/sysaux.258.1098460717
input datafile file number=00005 name=+DATA/ORCL/DATAFILE/undotbs2.265.1098461311
input datafile file number=00007 name=+DATA/ORCL/DATAFILE/users.260.1098460743

As you see, two files were backed up by the 1st channel (1st instance) and the other three files by the 2nd channel (2nd instance).

Now let’s explain another possible variant:

  1. Configure one TNS string with load balance parameter:
ORCL_BALANCE=
     (DESCRIPTION=
           (TRANSPORT_CONNECT_TIMEOUT=3) (RETRY_COUNT=6)(LOAD_BALANCE=on)
           (ADDRESS= (PROTOCOL=tcp) (HOST=rac1.example.com) (PORT=1522))
           (ADDRESS= (PROTOCOL=tcp) (HOST=rac2.example.com) (PORT=1522))
           (CONNECT_DATA =
              (SERVER = DEDICATED)
              (SERVICE_NAME = orcl)
           )
      )

Both node addresses are defined and Oracle will pick each address randomly.

2. Configure parallelism and one channel with ORCL_BALANCE string:

Please note, I did this test case on the same server where I’ve already defined CHANNEL 1 and CHANNEL 2, so I had to clear them:

RMAN> CONFIGURE CHANNEL DEVICE TYPE DISK clear;
RMAN> CONFIGURE CHANNEL 1 DEVICE TYPE DISK clear;
RMAN> CONFIGURE CHANNEL 2 DEVICE TYPE DISK clear;

Define channel and parallelism:

RMAN> CONFIGURE DEVICE TYPE disk PARALLELISM 2;
RMAN> CONFIGURE CHANNEL 1 DEVICE TYPE DISK CONNECT 'sys/Oracle123@ORCL_BALANCE as sysdba';

RMAN> backup database;
 
Starting backup at 22-MAR-22
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=752 instance=orcl1 device type=DISK
allocated channel: ORA_DISK_2
channel ORA_DISK_2: SID=138 instance=orcl1 device type=DISK

Both channels were allocated on orcl1. Try one more time, or better configure parallelism 3, CHANNEL parameter is already defined and it is permanent until changed:

RMAN>  CONFIGURE DEVICE TYPE disk PARALLELISM 3;
 
RMAN> backup database;
 
Starting backup at 22-MAR-22
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=752 instance=orcl1 device type=DISK
allocated channel: ORA_DISK_2
channel ORA_DISK_2: SID=138 instance=orcl1 device type=DISK
allocated channel: ORA_DISK_3
channel ORA_DISK_3: SID=753 instance=orcl2 device type=DISK

As you see it really did a random choice. But two channels were allocated on the 1st node and the last one on the 2nd node.

I think here random algorithm is not a good option, but you better know which variant is appropriate in your case.

ORA-39701: database must be mounted EXCLUSIVE for UPGRADE or DOWNGRADE

Problem:

In cluster environment, I was not able to start database in upgrade mode:

SQL> startup upgrade

ORACLE instance started.
Total System Global Area 1996486272 bytes
Fixed Size		    8898176 bytes
Variable Size		  704643072 bytes
Database Buffers	 1275068416 bytes
Redo Buffers		    7876608 bytes
Database mounted.
ORA-00603: ORACLE server session terminated by fatal error
ORA-01092: ORACLE instance terminated. Disconnection forced
ORA-39701: database must be mounted EXCLUSIVE for UPGRADE or DOWNGRADE

The following also did not work:

SQL> startup mount exclusive

ORACLE instance started.
Total System Global Area 1996486272 bytes
Fixed Size		    8898176 bytes
Variable Size		  704643072 bytes
Database Buffers	 1275068416 bytes
Redo Buffers		    7876608 bytes
Database mounted.

SQL> alter database open upgrade;
alter database open upgrade
*
ERROR at line 1:
ORA-00603: ORACLE server session terminated by fatal error
ORA-01092: ORACLE instance terminated. Disconnection forced
ORA-39701: database must be mounted EXCLUSIVE for UPGRADE or DOWNGRADE

Solution:

Change parameter cluster_database to FALSE, and startup in upgrade mode:

$ sqlplus / as sysdba

SQL> startup nomount;

SQL> alter system set cluster_database=FALSE scope=spfile sid='*';

SQL> shutdown immediate;

SQL> startup upgrade

ORACLE instance started.
Total System Global Area 1996486272 bytes
Fixed Size		    8898176 bytes
Variable Size		  704643072 bytes
Database Buffers	 1275068416 bytes
Redo Buffers		    7876608 bytes
Database mounted.
Database opened.

After finishing your work, don’t forget to return cluster_database parameter to TRUE and restart your database:

SQL> alter system set cluster_database=TRUE scope=spfile sid='*';

TCPS configuration: ora.LISTENER.lsnr is in INTERMEDIATE state

Problem

After configuring TCPS endpoint in listener configuration using the following way:

In srvctl:

[grid@rac1 ~]$ srvctl modify listener -p "TCP:1522/TCPS:1524"

In listener.ora file:

LISTENER =
  (DESCRIPTION_LIST =
    (DESCRIPTION =
      (ADDRESS = (PROTOCOL = TCP)(HOST = rac1.example.com)(PORT = 1522))
      (ADDRESS = (PROTOCOL = IPC)(KEY = LISTENER))
      (ADDRESS = (PROTOCOL = TCPS)(HOST =rac1.example.com)(PORT = 1524))
    ))

So from the above, we see that I’ve configured as static endpoint registration as well as dynamic.

Restarted listener and saw that ora.LISTENER.lsnr resource left in the INTERMEDIATE state:

# srvctl stop listener;srvctl start listener

# crsctl status res -t|head
...
ora.LISTENER.lsnr 
ONLINE INTERMEDIATE rac1 Not All Endpoints Registered,STABLE
ONLINE INTERMEDIATE rac2 Not All Endpoints Registered,STABLE

Troubelshooting

After checking listener.log file, found the following messages:

Dynamic address is already listened on (DESCRIPTION=(ADDRESS=(PROTOCOL=tcps)(HOST=rac1.example.com)(PORT=1524)))
Dynamic address is already listened on (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=rac1.example.com)(PORT=1522)))
Dynamic address is already listened on (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=rac1.example.com)(PORT=1522)))

So, due to the static registration, the dynamic failed and Agent thought it could not register endpoints, which left resource in the INTERMEDIATE state (which should not be normal I think) 

I tried to disable dynamic registration by setting ENABLE_GLOBAL_DYNAMIC_ENDPOINT_LISTENER to OFF in listener.ora file, but after modifying and restarting listener using srvctl Oracle Agent changed the value back to ON. I could not find a way to disable it so far.

The second option was to remove static entries from listener.ora file and leave only dynamic registration, but after that, the important endpoint on TCPS/1524 was not registered:

(DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER)))
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcps)(HOST=192.168.1.11)(PORT=1524)))
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.1.1)(PORT=1522)))
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.1.11)(PORT=1522)))

The endpoint that I want is 192.168.1.1:1524 and not 192.168.1.11:1524

Workaround

Configured dynamic registration in srvctl for TCP/1522 port only:

[grid@rac1 ~]$ srvctl modify listener -p "TCP:1522"

Configured static registration in listener.ora file for TCPS/1524:

LISTENER =
  (DESCRIPTION_LIST =
    (DESCRIPTION =
      (ADDRESS = (PROTOCOL = IPC)(KEY = LISTENER))
      (ADDRESS = (PROTOCOL = TCPS)(HOST =rac1.example.com)(PORT = 1524))
    )
  )

After restarting listener, the status became ONLINE:

ora.LISTENER.lsnr
              ONLINE  ONLINE   rac1   STABLE
              ONLINE  ONLINE   rac2   STABLE

Another workaround that my colleage found is the following:

Set ENDPOINTS to null in crsctl (in srvctl it’s not possible):

# crsctl modify resource ora.LISTENER.lsnr -attr "ENDPOINTS=" –unsupported

And register all endpoints statically in listener.ora. Using that way, no dynamic registration is triggered, only static endpoints will be visible.

Please comment bellow, if you have a better way. Thank you!

Identify whether database is RAC or SI using crsctl

Environment:

GI – configured as a cluster
orclsingle – configured as a Single Instance
orclrac – configured as RAC
orclone – configured as RACOneNode

Q1: How to identify whether database is RAC or SI?

A1: crsctl stat res shows parameter CLUSTER_DATABASE. In case of RAC or RACOneNode the value is TRUE otherwise FALSE

# crsctl stat res ora.orclrac.db -p|grep CLUSTER_DATABASE
CLUSTER_DATABASE=true

# crsctl stat res ora.orclone.db -p|grep CLUSTER_DATABASE
CLUSTER_DATABASE=true

# crsctl stat res ora.orclsingle.db -p|grep CLUSTER_DATABASE
CLUSTER_DATABASE=false

Please note that the same can be identified from sqlplus by selecting CLUSTER_DATABASE initialization parameter.

Q2: How to identify whether my database is RAC or RACOneNode?

A2: srvctl config database shows Type, that can have the following values SINGLE, RAC, or RACOneNode.

$ srvctl config database -db orclrac|grep Type
Type: RAC

$ srvctl config database -db orclone|grep Type
Type: RACOneNode

$ srvctl config database -db orclsingle|grep Type
Type: SINGLE

INS-45511: Installer has detected that an Oracle Grid Infrastructure home is marked incorrectly as configured

Problem:

After deconfiguring Oracle Restart stack using:

[root@rac1 ~]# /u01/app/19.3.0/grid/root.sh -deconfig
Performing root user operation.

The following environment variables are set as:
    ORACLE_OWNER= grid
    ORACLE_HOME=  /u01/app/19.3.0/grid

Enter the full pathname of the local bin directory: [/usr/local/bin]:
The contents of "dbhome" have not changed. No need to overwrite.
The contents of "oraenv" have not changed. No need to overwrite.
The contents of "coraenv" have not changed. No need to overwrite.

Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root script.
Now product-specific root actions will be performed.
Using configuration parameter file: /u01/app/19.3.0/grid/crs/install/crsconfig_params
The log of current session can be found at:
  /u01/app/grid/crsdata/rac1/crsconfig/hadeconfig.log
2020/07/04 10:49:21 CLSRSC-332: CRS resources for listeners are still configured
2020/07/04 10:49:49 CLSRSC-337: Successfully deconfigured Oracle Restart stack

Tried to configure GI as clusterware stack and got the following error:

INS-45511: Installer has detected that an Oracle Grid Infrastructure home is marked incorrectly  as configured

Solution:

Remove CRS="true" accross GI home entry in /u01/app/oraInventory/ContentsXML/inventory.xml

Original:

<HOME NAME="OraGI19Home1" LOC="/u01/app/19.3.0/grid" TYPE="O" IDX="1" CRS="true"/>

After modification:

<HOME NAME="OraGI19Home1" LOC="/u01/app/19.3.0/grid" TYPE="O" IDX="1"/>

Retry configuration.

CRS-6706: Oracle Clusterware Release Patch Level (‘xxxx’) Does Not Match Software Patch Level

Problem:

We have restored OCR from backup, started CRS, and rolled back patch using opatchauto. After restarting CRS we’ve got:

CRS-6706: Oracle Clusterware Release Patch Level ('xxxx') Does Not Match Software Patch Level ('yyyy')

Solution:

1. Unlock GI home from root user:

# cd $GI_HOME
# $GI_HOME/crs/install/rootcrs.sh -unlock

2. Complete the patching setup from root user (local means it will correct OLR, not OCR):

# $ORACLE_HOME/bin/clscfg -localpatch

3. As the root user lock the GI home:

# $ORACLE_HOME/crs/install/rootcrs.sh -lock

4. Start CRS:

# crsctl start crs -wait


Restore OCR and correct Software Patch Level

Software patch level crsctl query crs softwarepatch and active patch level crsctl query crs activersion -f are introduced to ensure GI home has identical patches on all nodes.

It’s highly recommended to manually backup OCR using # ocrconfig -manualbackup before and after applying a patch or making configurtion changes (adding resources into CRS).

When you apply a patch to GI home, the software patch level will be updated on each node. On the last node, the active patch level will be updated for the cluster.

The patch level is stored in OCR, so if we restore old OCR backup, we need to update the patch level inside it using Oracle provided commands.

  1. Try to find patchlevel in the OCR backup that matches to activeversion:
[root@rac1 ~]# ocrdump -stdout -keyname SYSTEM.version.activeversion.patchlevel -backupfile +GRID:/marirac/OCRBACKUP/backup_20200425_171539.ocr.266.1038676539
...
[SYSTEM.version.activeversion.patchlevel]
UB4 (10) : 2701864972
...

If backup file is located on filesystem, then need to replace +GRID:/marirac/OCRBACKUP/backup_20200425_171539.ocr.266.1038676539 with the full path of the file.

Keep copies of OCR backups on filesystem also using asmcmd cp command.

2. Restore OCR backup

a) Stop crs services on all nodes
b) On one of the node, start cluster in exclusive and restore OCR backup:

# crsctl stop crs -f 
# crsctl start crs -excl -nocrs
# ocrconfig -restore /tmp/OCR/OCR_BACKUP_FILE
# crsctl stop crs -f

3. Correct software and active patch levels

a) Keep CRS down on all nodes
b) Start CRS on the 1st node, run clscfg, and check softwarepatch

# crsctl start crs -wait
# clscfg -patch 
# crsctl query crs softwarepatch

c) On the last node (do not run on other node, only from last node), issue the following command:

# crsctl stop rollingpatch

d) Verify that active patch level was updated

# crsctl query crs activeversion -f

Query the cluster active patchlevel in the OCR backup

To identify OCR backup where the cluster active patch level matches, you need to run the following:

Identify cluster active patch level:

[root@rac1 ~]# crsctl query crs activeversion -f
Oracle Clusterware active version on the cluster is [19.0.0.0.0]. 
The cluster upgrade state is [NORMAL]. 
The cluster active patch level is [2701864972].

If backup is located on diskgroup:

[root@rac1 ~]# ocrdump -stdout -keyname SYSTEM.version.activeversion.patchlevel -backupfile +GRID:/marirac/OCRBACKUP/backup_20200425_171539.ocr.266.1038676539
04/26/2020 12:28:20
+GRID:/marirac/OCRBACKUP/backup_20200425_171539.ocr.266.1038676539
/u01/app/19.3.0/grid/bin/ocrdump.bin -stdout -keyname SYSTEM.version.activeversion.patchlevel -backupfile +GRID:/marirac/OCRBACKUP/backup_20200425_171539.ocr.266.1038676539
[SYSTEM.version.activeversion.patchlevel]
UB4 (10) : 2701864972
SECURITY : {USER_PERMISSION : PROCR_ALL_ACCESS, GROUP_PERMISSION : PROCR_READ, OTHER_PERMISSION : PROCR_READ, USER_NAME : root, GROUP_NAME : root}

If backup is located on filesystem:

[root@rac1 ~]# ocrdump -stdout -keyname SYSTEM.version.activeversion.patchlevel -backupfile /tmp/backup_07_13_47_25
04/26/2020 12:26:57
/tmp/backup_07_13_47_25
/u01/app/19.3.0/grid/bin/ocrdump.bin -stdout -keyname SYSTEM.version.activeversion.patchlevel -backupfile /tmp/backup_07_13_47_25
[SYSTEM.version.activeversion.patchlevel]
UB4 (10) : 2701864972
SECURITY : {USER_PERMISSION : PROCR_ALL_ACCESS, GROUP_PERMISSION : PROCR_READ, OTHER_PERMISSION : PROCR_READ, USER_NAME : root, GROUP_NAME : root}

ORA-15137: The ASM cluster is in rolling patch state, CRS does not start

Problem:

The customer was not able to start cluster because ASM disks were offline and could not become online because the cluster was in ROLLING patch state.

SQL> ALTER DISKGROUP "GRID" ONLINE REGULAR DISK "RAC2$XVDN"
2020-04-23T17:28:10.851065+08:00ORA-15032: not all alterations performed
ORA-15137: The ASM cluster is in rolling patch state.

Troubleshooting:

  1. Check cluster activeversion
[root@rac1 ~]# crsctl query crs activeversion -f
Oracle Clusterware active version on the cluster is [12.2.0.1.0].
The cluster upgrade state is [ROLLING PATCH]. The cluster active patch level is [2250904419].

2. Check softwarepatch of each node:

[root@rac1 ~]# crsctl query crs softwarepatch
Oracle Clusterware patch level on node rac1 is [2269302628]

[root@rac2 ~]# crsctl query crs softwarepatch
Oracle Clusterware patch level on node rac2 is [2269302628]

[root@rac3 ~]# crsctl query crs softwarepatch
Oracle Clusterware patch level on node rac3 is [2269302628]

[root@rac4 ~]# crsctl query crs softwarepatch
Oracle Clusterware patch level on node rac4 is [3074559134]

As we see software version on rac4 is different then others. In addition to this activeversion [2250904419] is older, this is normal in ROLLING mode, when ROLLING mode is finished then this value in OCR is also updated.

3. From the output of the 2nd step we see that additional troubleshooting on installed patches are necessary, compare patches on each node:

Inventory showed the same output on all 4 nodes:

$ $GRID_HOME/OPatch/opatch lspatches
30593149;Database Jan 2020 Release Update : 12.2.0.1.200114 (30593149)
30591794;TOMCAT RELEASE UPDATE 12.2.0.1.0(ID:RELEASE) (30591794)
30586063;ACFS JAN 2020 RELEASE UPDATE 12.2.0.1.200114 (30586063)
30585969;OCW JAN 2020 RELEASE UPDATE 12.2.0.1.200114 (30585969)
26839277;DBWLM RELEASE UPDATE 12.2.0.1.0(ID:170913) (26839277)

But after checking kfod op=patches we’ve found that 4 additional patches existed on rac4 node only, the difference was not visible in inventory because they were inactive, left from old RU (they were not rolled back during applying superset patches, but they were made inactive).

The output from rac1, rac2 and rac3:

$ $GRID_HOME/bin/kfod op=patches
List of Patches
26710464
26737232
26839277
...
<extracted>

The output from rac4:

$ $GRID_HOME/bin/kfod op=patches
---------------
List of Patches
===============
26710464
26737232
26839277
...
<extracted>
28566910<<<<<Bug 28566910:TOMCAT RELEASE UPDATE 12.2.0.1.0
29757449<<<<<Bug 29757449:DATABASE JUL 2019 RELEASE UPDATE 12.2.0.1.190716
29770040<<<< Bug 29770040:OCW JUL 2019 RELEASE UPDATE 12.2.0.1.190716
29770090<<<<< Bug 29770090:ACFS JUL 2019 RELEASE UPDATE 12.2.0.1.190716

To rollback inactive patches you need to use patchgen, using opatch rollback -id it is not possible.

As root:

# $GRID_HOME/crs/install/rootcrs.sh -prepatch

As grid:

$ $GRID_HOME/bin/patchgen commit -rb 28566910
$ $GRID_HOME/bin/patchgen commit -rb 29757449
$ $GRID_HOME/bin/patchgen commit -rb 29770040
$ $GRID_HOME/bin/patchgen commit -rb 29770090

Validate patch level again on rac4, should be 2269302628:

$ $GRID_HOME/bin/kfod op=PATCHLVL
-------------------
Current Patch level
===================
2269302628

As root:

# $GRID_HOME/crs/install/rootcrs.sh -postpatch

4. If CRS was up at least on one node, then we would be able to stop rollingpatch state using crsctl stop rollingpatch (although, if CRS is up, postpatch should stop rolling also). But now our CRS is down, ASM is not able to access OCR file (this was caused by some other problem, which I consider as a bug, but will not mention here otherwise blog will become too long). We need to do extra steps to start CRS.

The only solution that I used here was to restore OCR from old backup. I needed OCR which was not in ROLLING mode. I needed NORMAL OCR backup, which was located on filesystem. I knew after restore my OCR activeversion would be lower than the current software version on nodes, but this is solvable (we will see this step also)

a) Stop crs services on all nodes
b) On one of the node, start cluster in exclusive and restore OCR backup:

# crsctl stop crs -f 
# crsctl start crs -excl -nocrs
# ocrconfig -restore /tmp/OCR/OCR_BACKUP_FILE
# crsctl stop crs -f

c) If you check softwarepatch here, it will not be 2269302628, but it will be something old (because OCR is old), so we need to correct it on each node:

# crsctl start crs -wait
# clscfg -patch
# crsctl query crs softwarepatch

Sofwarepatch will be 2269302628 on all nodes.

d) Note even correcting softwarepatch crsctl query crs activeversion -f still shows something old, which is normal. To correct it, run the following only on last node:

# crsctl stop rollingpatch
# crsctl query crs activeversion -f

e) If you had custom services or configuration in CRS after OCR backup was taken then you need to add them manually. For example, if you had service added srvctl add service, need to readd. That’s is why it’s is highly recommended to backup OCR manually before and after patching or any custom configuration changes using # ocrconfig -manualbackup