ORA-15137: The ASM cluster is in rolling patch state, CRS does not start
April 26, 2020
Problem:
The customer was not able to start the cluster: the ASM disks were offline and could not be brought online because the cluster was in ROLLING patch state.
SQL> ALTER DISKGROUP "GRID" ONLINE REGULAR DISK "RAC2$XVDN"

2020-04-23T17:28:10.851065+08:00
ORA-15032: not all alterations performed
ORA-15137: The ASM cluster is in rolling patch state.
Troubleshooting:
1. Check the cluster activeversion:
[root@rac1 ~]# crsctl query crs activeversion -f
Oracle Clusterware active version on the cluster is [12.2.0.1.0]. The cluster upgrade state is [ROLLING PATCH]. The cluster active patch level is [2250904419].
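The same rolling state can also be confirmed from the ASM side, provided an ASM instance is reachable. Below is a minimal sketch, assuming the grid user, ORACLE_SID=+ASM1 and a 12.2 Grid Home path (all environment-specific assumptions, adjust to your setup):

# Run as the grid user on a node where ASM is up; adjust SID and paths.
export ORACLE_SID=+ASM1
export ORACLE_HOME=/u01/app/12.2.0.1/grid
$ORACLE_HOME/bin/sqlplus -s / as sysasm <<'EOF'
-- "In Rolling Patch" is expected while the cluster is in rolling patch state
SELECT SYS_CONTEXT('SYS_CLUSTER_PROPERTIES', 'CLUSTER_STATE') AS cluster_state FROM dual;
-- State of the disks in the affected diskgroup
SELECT g.name diskgroup, d.name disk, d.mode_status, d.state
FROM v$asm_disk d JOIN v$asm_diskgroup g ON g.group_number = d.group_number
WHERE g.name = 'GRID';
EOF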
2. Check the softwarepatch on each node:
[root@rac1 ~]# crsctl query crs softwarepatch
Oracle Clusterware patch level on node rac1 is [2269302628]

[root@rac2 ~]# crsctl query crs softwarepatch
Oracle Clusterware patch level on node rac2 is [2269302628]

[root@rac3 ~]# crsctl query crs softwarepatch
Oracle Clusterware patch level on node rac3 is [2269302628]

[root@rac4 ~]# crsctl query crs softwarepatch
Oracle Clusterware patch level on node rac4 is [3074559134]
As we can see, the software patch level on rac4 is different from the other nodes. In addition, the activeversion [2250904419] is older; this is normal in ROLLING mode, and once the rolling operation finishes this value in OCR is updated as well.
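A quick way to compare the patch level on all nodes in one go is a small ssh loop like the sketch below (the node names and Grid Home path are assumptions, and it relies on passwordless ssh as root or another user allowed to run crsctl):

GRID_HOME=/u01/app/12.2.0.1/grid
for node in rac1 rac2 rac3 rac4; do
    echo "== $node =="
    ssh "$node" "$GRID_HOME/bin/crsctl query crs softwarepatch"
done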
3. The output of the 2nd step shows that additional troubleshooting of the installed patches is necessary, so compare the patches on each node:
Inventory showed the same output on all 4 nodes:
$ $GRID_HOME/OPatch/opatch lspatches
30593149;Database Jan 2020 Release Update : 12.2.0.1.200114 (30593149)
30591794;TOMCAT RELEASE UPDATE 12.2.0.1.0(ID:RELEASE) (30591794)
30586063;ACFS JAN 2020 RELEASE UPDATE 12.2.0.1.200114 (30586063)
30585969;OCW JAN 2020 RELEASE UPDATE 12.2.0.1.200114 (30585969)
26839277;DBWLM RELEASE UPDATE 12.2.0.1.0(ID:170913) (26839277)
But after checking kfod op=patches we found that 4 additional patches existed on the rac4 node only. The difference was not visible in the inventory because these patches were inactive, left over from an old RU (they were not rolled back while the superset patches were applied, they were only made inactive).
The output from rac1, rac2 and rac3:
$ $GRID_HOME/bin/kfod op=patches
List of Patches
26710464
26737232
26839277
... <extracted>
The output from rac4:
$ $GRID_HOME/bin/kfod op=patches
---------------
List of Patches
===============
26710464
26737232
26839277
... <extracted>
28566910  <<< Bug 28566910: TOMCAT RELEASE UPDATE 12.2.0.1.0
29757449  <<< Bug 29757449: DATABASE JUL 2019 RELEASE UPDATE 12.2.0.1.190716
29770040  <<< Bug 29770040: OCW JUL 2019 RELEASE UPDATE 12.2.0.1.190716
29770090  <<< Bug 29770090: ACFS JUL 2019 RELEASE UPDATE 12.2.0.1.190716
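One way to spot such single-node leftovers is to collect the kfod output from every node and compare it against a reference node. A rough sketch, assuming passwordless ssh as the grid user and the node names and Grid Home path used above:

GRID_HOME=/u01/app/12.2.0.1/grid
for node in rac1 rac2 rac3 rac4; do
    ssh "$node" "$GRID_HOME/bin/kfod op=patches" \
        | grep -Eo '[0-9]{6,}' | sort -u > "/tmp/patches_$node.txt"
done
# Patch numbers present on rac4 but missing from rac1:
comm -13 /tmp/patches_rac1.txt /tmp/patches_rac4.txt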
To roll back inactive patches you need to use patchgen; it is not possible with opatch rollback -id.
As root:
# $GRID_HOME/crs/install/rootcrs.sh -prepatch
As grid:
$ $GRID_HOME/bin/patchgen commit -rb 28566910
$ $GRID_HOME/bin/patchgen commit -rb 29757449
$ $GRID_HOME/bin/patchgen commit -rb 29770040
$ $GRID_HOME/bin/patchgen commit -rb 29770090
Validate the patch level again on rac4; it should now be 2269302628:
$ $GRID_HOME/bin/kfod op=PATCHLVL
-------------------
Current Patch level
===================
2269302628
As root:
# $GRID_HOME/crs/install/rootcrs.sh -postpatch
4. If CRS was up on at least one node, then we would be able to clear the rolling patch state using crsctl stop rollingpatch (although, if CRS is up, -postpatch should stop rolling as well). But in our case CRS was down and ASM was not able to access the OCR file (this was caused by another problem, which I consider a bug, but I will not describe it here, otherwise the blog post would become too long). So we need extra steps to start CRS.
The only solution I used here was to restore the OCR from an old backup. I needed an OCR that was not in ROLLING mode, i.e. a NORMAL OCR backup, which was located on the filesystem. I knew that after the restore my OCR activeversion would be lower than the current software version on the nodes, but this is solvable (we will see that step as well).
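Before the restore it is worth listing the backups Clusterware knows about, so you can pick one taken while the cluster was still in NORMAL state. The commands below are standard ocrconfig options; the backup location itself is environment-specific:

# As root (should work even with the stack down, as long as the backup info is readable)
ocrconfig -showbackup auto      # automatic backups, by default under <GRID_HOME>/cdata/<cluster_name>
ocrconfig -showbackup manual    # backups taken earlier with ocrconfig -manualbackup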
a) Stop crs services on all nodes
b) On one of the nodes, start the cluster in exclusive mode and restore the OCR backup:
# crsctl stop crs -f
# crsctl start crs -excl -nocrs
# ocrconfig -restore /tmp/OCR/OCR_BACKUP_FILE
# crsctl stop crs -f
c) If you check the softwarepatch here, it will not be 2269302628 but something old (because the OCR is old), so we need to correct it on each node:
# crsctl start crs -wait
# clscfg -patch
# crsctl query crs softwarepatch
The softwarepatch will now be 2269302628 on all nodes.
d) Note that even after correcting the softwarepatch, crsctl query crs activeversion -f still shows something old, which is normal. To correct it, run the following on the last node only:
# crsctl stop rollingpatch
# crsctl query crs activeversion -f
e) If you had custom services or configuration added to CRS after the OCR backup was taken, you need to add them again manually. For example, if you had added a service with srvctl add service, it needs to be re-added. That is why it is highly recommended to back up the OCR manually before and after patching or any custom configuration change, using # ocrconfig -manualbackup.
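For example, a pre-patching snapshot could look like the sketch below (the database name ORCL and the output file are hypothetical placeholders):

# As root: force a manual OCR backup and verify it is listed
ocrconfig -manualbackup
ocrconfig -showbackup manual

# As the oracle/grid user: dump the current database and service configuration,
# so custom services can be re-added easily if the OCR ever has to be restored
srvctl config database -db ORCL  > /tmp/srvctl_config_ORCL.txt
srvctl config service  -db ORCL >> /tmp/srvctl_config_ORCL.txt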
Comments:

In this case CRS is up on all the other three nodes, correct? Can't we use the stop rollingpatch command directly from node 4 instead of doing all the crs stop, exclusive start, stop -f, clscfg -patch steps?
I want to mention that step 4 is not always necessary. If CRS is startable, then you will stop at -postpatch, which should also stop the rolling state.
So step 4 is only necessary when there are additional problems (other than different patches) that block CRS startup, and that reason is the OCR rolling mode.
Do you mean in our environment? No, it was not starting up on any node; I mention this in the brackets:
4. If CRS was up on at least one node, then we would be able to clear the rolling patch state using crsctl stop rollingpatch. But in our case CRS was down and ASM was not able to access the OCR file (this was caused by another problem, which I consider a bug, but I will not describe it here, otherwise the blog post would become too long). So we need extra steps to start CRS.
So after rolling back the extra patches, ASM was terminating. alert_ASM2.log was showing:
2020-04-25T22:12:54.196741+08:00
ALTER SYSTEM START ROLLING PATCH
…
Errors in file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_gen0_18405.trc:
ORA-15138: cluster rolling patch incomplete
USER (ospid: 18405): terminating the instance due to error 15138
We noticed that ASM was able to start (watching crsctl status res -t -init); it stayed up for several seconds, and when the crsd daemon was trying to start, ASM terminated. From the ASM logs we can see that ASM somehow read the OCR content, identified that it was in ROLLING mode, tried to set this mode, and at this step it failed and terminated. When ASM terminates, CRS is not able to finish its startup.
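For reference, a simple way to watch this behaviour is to poll the lower-stack resources while tailing the ASM alert log. A rough sketch (the Grid Home path and the polling interval are assumptions; the alert log path is derived from the trace file shown above):

# As root, poll the init resources for a few minutes
GRID_HOME=/u01/app/12.2.0.1/grid
for i in $(seq 1 60); do
    date
    "$GRID_HOME"/bin/crsctl status res -t -init | grep -A3 'ora.asm'
    sleep 5
done

# In a second terminal, follow the ASM alert log:
# tail -f /u01/app/grid/diag/asm/+asm/+ASM2/trace/alert_+ASM2.log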
So the situation was much more complicated in their case. I consider it an ASM bug, because the cluster should normally be able to start in ROLLING mode.
Hi Maria,
did you face this issue after applying the Jan 2020 RU?
Hi,
It may not depend on the RU, and it was just a coincidence. Most likely it is caused by an unknown ASM bug (as far as I know, Oracle is working on it).
Yes, it was the Jan 2020 RU, but please also see my previous comment.
OUTSTANDING post. I put in a Sev 1 ticket with Oracle support (with platinum support no less) and after waiting over an hour for them to figure it out and reply to the SR I found this and resolved it. Thank you! Down time is bad! 😉