UDEV rules for configuring ASM disks

Problem:

During my previous installations I used the following udev rule on multipath devices:

KERNEL=="dm-[0-9]*", BUS=="scsi", PROGRAM=="/sbin/scsi_id -g -u -d /dev/$parent", RESULT=="360050768028200a9a40000000000001c", NAME="oracleasm/asm-disk1", OWNER="oracle", GROUP="asmadmin", MODE="0660"

So, to identify the exact disk, I used the PROGRAM option. The rule above scans `/dev/dm-*` devices and, if any of them satisfies the condition, for example:

# scsi_id -gud /dev/dm-3
360050768028200a9a40000000000001c 

then the device name is changed to /dev/oracleasm/asm-disk1, the owner:group to oracle:asmadmin, and the permissions to 0660.
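If a rule like this silently fails to match, udevadm can replay the rule processing for a single device and show what was applied. A debugging sketch, assuming dm-3 is the multipath device in question:

# udevadm test /sys/block/dm-3 2>&1 | grep -i asm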

But on my new servers the same udev rule no longer worked. (Of course, this deserves more investigation, but our time is valuable and never enough; if we know another solution that works and is acceptable, let's just use it.)

Solution:

I used the udevadm command to identify other properties of these devices and wrote a new udev rule (to see all the properties, just remove the grep):

# udevadm info --query=property --name /dev/mapper/asm1 | grep DM_UUID
DM_UUID=mpath-360050768028200a9a40000000000001c

The new udev rule, which matches on the DM_UUID property instead of calling scsi_id (the value is simply the same WWID prefixed with mpath-), looks like this:

# cat /etc/udev/rules.d/99-oracle-asmdevices.rules
ENV{DM_UUID}=="mpath-360050768028200a9a40000000000001c",  SUBSYSTEM=="block", NAME="oracleasm/asm-disk1", OWNER="grid", GROUP="asmadmin", MODE="0660"
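Reload the udev rules so the newly created file is picked up (standard udevadm usage, not Oracle-specific):

# udevadm control --reload-rules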

Trigger udev rules:

# udevadm trigger

Verify that the name, owner, group, and permissions have changed:

# ll /dev/oracleasm/
total 0
brw-rw---- 1 grid asmadmin 253, 3 Jul 17 17:33 asm-disk1

Detach diskgroup from 12c GI and attach to 19c GI

Task:

We have two separate Real Application Clusters, one 12c and the other 19c. We decided to migrate data from 12c to 19c by simply detaching all ASM disks from the source and attaching them to the destination.
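Before detaching anything, it is worth recording exactly which disks belong to the diskgroup. A minimal query on the source ASM instance, run before dismounting (FRA is the diskgroup from this task):

SQL> select d.path from v$asm_disk d, v$asm_diskgroup g where d.group_number = g.group_number and g.name = 'FRA';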

Steps:

1. Connect to the 12c GI as the grid user and dismount the FRA diskgroup on all nodes:

[grid@rac1 ~]$ sqlplus  / as sysasm
Connected to:
Oracle Database 12c Enterprise Edition Release 12.2.0.1.0 - 64bit Production
SQL> alter diskgroup FRA dismount;
Diskgroup altered. 
[grid@rac2 ~]$ sqlplus  / as sysasm
Connected to:
Oracle Database 12c Enterprise Edition Release 12.2.0.1.0 - 64bit Production
SQL> alter diskgroup FRA dismount;
Diskgroup altered.

You can also use srvctl to stop the diskgroup on all nodes with a single command, as shown below.
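A sketch of that alternative (srvctl stop diskgroup syntax as in 12c; it acts on all nodes unless -node is given):

[grid@rac1 ~]$ srvctl stop diskgroup -diskgroup FRA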

2. Detach the disks belonging to that diskgroup from the 12c cluster and attach them to the 19c cluster.

3. After the ASM disks are visible on the 19c cluster, connect as sysasm via the grid user and mount the diskgroup:

# Check that there is no FRA resource registered with CRS:

[root@rac1 ~]# crsctl status res -t |grep FRA

# Mount the diskgroup on all nodes

[grid@rac1 ~]$ sqlplus / as sysasm
Connected to:
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.3.0.0.0
SQL> alter diskgroup FRA mount;
Diskgroup altered.
[grid@rac2 ~]$ sqlplus / as sysasm
Connected to:
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.3.0.0.0
SQL> alter diskgroup FRA mount;
Diskgroup altered.

# FRA diskgroup resource will automatically be registered with CRS:

[root@rac1 ~]# crsctl status res -t |grep FRA
ora.FRA.dg(ora.asmgroup)
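The same can be cross-checked with srvctl (diskgroup status syntax as in 12c and later):

[grid@rac1 ~]$ srvctl status diskgroup -diskgroup FRA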

And data will be there…

What is Flex ASM and how to check if it is enabled?

In versions prior to 12c, an ASM instance had to run on each node of the cluster. If ASM was not able to start, the database instances located on the same node could not come up either: there was a hard dependency between the database and ASM instances.

With Oracle Flex ASM, database instances can connect to a remote ASM instance over a network connection (the ASM network). If an ASM instance fails, its database instances reconnect to an ASM instance on another node. This feature is called Oracle Flex ASM.

Check whether you are using this great feature with the following command:

[grid@rac1 ~]$ asmcmd
ASMCMD> showclustermode
ASM cluster : Flex mode enabled
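srvctl can cross-check this: on a Flex ASM cluster the ASM resource is configured with an instance count (cardinality) rather than one instance per node. The exact output varies by version:

[grid@rac1 ~]$ srvctl config asm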

 

ORA-17635: failure in obtaining physical sector size for ‘+DATA’

Action:

I was trying to create an spfile on an ASM diskgroup from a standby database.

SYS @ shcat > create spfile='+DATA' from pfile='/tmp/initshcat_stby.ora';
create spfile='+DATA' from pfile='/tmp/initshcat_stby.ora'
*
ERROR at line 1:
ORA-17635: failure in obtaining physical sector size for '+DATA'
ORA-12547: TNS:lost contact
ORA-12547: TNS:lost contact

Troubleshooting:

I checked the sector size and it was OK:

SQL> select name,sector_size from v$asm_diskgroup;

NAME       SECTOR_SIZE
---------- ------------
DATA       512

I checked whether the oracle account was able to see the ASM diskgroups in DBCA, and it was not: the diskgroup list was empty.

Causes:

There are two possible causes:

1) File permissions on the <Grid_home>/bin/oracle executable are not set properly.

2) The oracle user is not part of the asmdba group.

Solution: 

1) Change the permissions (mode 6751 sets the setuid and setgid bits, which produce the rws/r-s flags shown below):

[root@stbycat ~]# chmod 6751 /u01/app/18.3.0/grid/bin/oracle

2) Add the oracle user to the asmdba group:

[root@stbycat ~]# usermod -g oinstall -G oper,dba,asmdba oracle
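Note that -G replaces the whole supplementary group list, so include every group the user already has (as above), or append with usermod -a -G asmdba oracle. Either way, verify the membership afterwards with the standard id command:

[root@stbycat ~]# id oracle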

In my case it was the first one.

My permissions:

$ ll /u01/app/18.3.0/grid/bin/oracle
-rwxr-x--x 1 grid oinstall 413844056 Nov 4 09:14 /u01/app/18.3.0/grid/bin/oracle

Must be:

$ ll /u01/app/18.3.0/grid/bin/oracle
-rwsr-s--x 1 grid oinstall 413844056 Nov 4 08:45 /u01/app/18.3.0/grid/bin/oracle

 

kfed repair: clscfpinit: Failed clsdinitx [-1] ecode [64]

Action: 

Trying to repair an ASM disk header using kfed as the root user.

[root@rac1 ~]# kfed repair /dev/flashgrid/racq.lun1 aus=4M

clscfpinit: Failed clsdinitx [-1] ecode [64]
2018-11-03 02:47:34.338 [2576228864] gipclibInitializeClsd: clscfpinit failed with -1

Cause:

The command must be run as the grid user instead of root.

Solution:

[root@rac1 ~]# su - grid

[grid@rac1 ~]$ kfed repair /dev/flashgrid/racq.lun1 aus=4M
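After the repair, the header can be re-read to confirm it is valid (kfed read dumps the on-disk header; same LUN path and AU size as above):

[grid@rac1 ~]$ kfed read /dev/flashgrid/racq.lun1 aus=4M | head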

Add filegroup fails with ORA-15067: command or option incompatible with diskgroup redundancy

Problem:

I was trying to add a filegroup to the FRA diskgroup:

SQL> alter diskgroup FRA add filegroup high_filegroup database orcl set 'datafile.redundancy' = 'HIGH';

Error:

ORA-15067: command or option incompatible with diskgroup redundancy

Troubleshooting:

Checking diskgroup type:

SQL> select name, type, compatibility, database_compatibility from v$asm_diskgroup where name='FRA';

NAME       TYPE    COMPATIBILITY  DATABASE_COMPATIBILITY
---------- ------- -------------- ----------------------
FRA        NORMAL  18.0.0.0.0     12.2.0.1.0

Solution:

Change the diskgroup type to FLEX (filegroups and per-file redundancy settings require a FLEX or EXTENDED redundancy diskgroup):

SQL> alter diskgroup FRA convert redundancy to flex;
Diskgroup altered.

Check that the type was changed:

SQL> select name, type, compatibility, database_compatibility from v$asm_diskgroup where name='FRA';

NAME       TYPE    COMPATIBILITY  DATABASE_COMPATIBILITY
---------- ------- -------------- ----------------------
FRA        FLEX    18.0.0.0.0     12.2.0.1.0

Adding the filegroup now succeeds:

SQL> alter diskgroup FRA add filegroup high_filegroup database orcl set 'datafile.redundancy' = 'HIGH';
Diskgroup altered.
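The new filegroup should now be visible in the ASM views (V$ASM_FILEGROUP exists from 12.2 onward):

SQL> select name from v$asm_filegroup;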

Rebuild RAC clusterware without deleting data

As I mentioned in my previous posts, I was applying an interim patch on a database, and the patch had a post-installation script (# <GI_HOME>/crs/install/rootcrs.pl -postpatch).
The post script failed with a permission denied error on the ohasd file and left the clusterware in a messy state.

I opened an SR on Metalink, and after a huge amount of time talking and troubleshooting together, one of the support engineers said:

“We do not know what happened or what steps you have taken to reach this situation. You should open an SR with us before you deconfigure the node.
Please, do bare metal restore as it is recommended by previous engineer.
Bare Metal Restore Procedure for Compute Nodes on an Exadata Environment ( Doc ID 1084360.1 )”

A Bare Metal Restore means wiping everything, after which I would have had to configure RAC, Data Guard and everything else from scratch. I don't like such solutions; this is like "if your Windows runs slowly, reinstall it".. for Windows this might really be true 🙂 nothing but a reinstall helps 😀 but on Linux/Oracle you must troubleshoot first.
So I created another SR with another error (there were a lot of errors at this point), and the second time I was lucky.
I was working 24/7 with support, and the engineers rotated in shifts; three different engineers worked on this SR at different times.
I want to mention one Oracle support engineer, Venkata Pradeep Kumar: he is so clever, he helped me a lot and we rescued the system! :)

I want to share the steps with you; they should be interesting.

Problem:

After the patch post script failed on the first node, the clusterware on that node would not start. At this point the second node was fine.
I deconfigured the clusterware on the first node (this step is written out in the solution section) and it started, but with some problems with the OC4J service.

2016/09/27 06:56:15 CLSRSC-1003: Failed to start resource OC4J
2016/09/27 06:56:16 CLSRSC-287: FirstNode configuration failed

I deconfigured the clusterware on the second node as well and tried to run root.sh, but it said that root.sh could not be run because it had not completed successfully on the first node. 😦

So, until the root.sh script completes successfully on the first node, you should not deconfigure the clusterware on the second. But if you did, do not panic, as long as you have an OCR backup.

Solution:

# Deconfigure CRS on the problematic node. Note that a different approach may help in your case,
# such as deconfiguring just one node; in my situation all the nodes became problematic.
# Also please be careful: the steps below assume that you have a separate diskgroup for OCR.
# Datafiles must be on a different diskgroup, or that diskgroup will be wiped.

# As root, on both nodes (node1, node2)

/u01/app/12.1.0.2/grid/crs/install/rootcrs.sh -deconfig -force

# Run root.sh on node1; it may not be completely successful

/u01/app/12.1.0.2/grid/root.sh

# We need to find a good OCR backup. For me it is week.ocr, which was taken automatically on 2016/09/15 at 09:12:28.
# The patch was applied at 10:00 AM on 2016/09/25, so we need week.ocr: it predates the patching.

[root@lbdm01-dr-adm grid]# ocrconfig -showbackup

lbdm02-dr-adm 2016/09/27 02:35:23 /u01/app/12.1.0.2/grid/cdata/lbank-clus-dr/backup00.ocr 3351897854
lbdm02-dr-adm 2016/09/26 15:44:53 /u01/app/12.1.0.2/grid/cdata/lbank-clus-dr/backup01.ocr 3351897854
lbdm02-dr-adm 2016/09/26 11:44:52 /u01/app/12.1.0.2/grid/cdata/lbank-clus-dr/backup02.ocr 3351897854
lbdm02-dr-adm 2016/09/27 02:35:23 /u01/app/12.1.0.2/grid/cdata/lbank-clus-dr/day.ocr 3351897854
lbdm01-dr-adm 2016/09/15 09:12:28 /u01/app/12.1.0.2/grid/cdata/lbank-clus-dr/week.ocr 854493477
lbdm02-dr-adm 2016/09/25 15:29:18 /u01/app/12.1.0.2/grid/cdata/lbank-clus-dr/backup_20160925_152918.ocr 3351897854
lbdm02-dr-adm 2016/09/25 10:34:56 /u01/app/12.1.0.2/grid/cdata/lbank-clus-dr/backup_20160925_103456.ocr 2725022894
lbdm01-dr-adm 2015/07/29 19:46:28 /u01/app/12.1.0.2/grid/cdata/lbank-clus-dr/backup_20150729_194628.ocr 854493477
lbdm01-dr-adm 2015/07/29 19:46:27 /u01/app/12.1.0.2/grid/cdata/lbank-clus-dr/backup_20150729_194627.ocr 854493477
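# Before restoring, a backup can be inspected to make sure it is the right one; ocrdump
# accepts a backup file and writes a text dump (to a file named OCRDUMPFILE by default):

ocrdump -backupfile /u01/app/12.1.0.2/grid/cdata/lbank-clus-dr/week.ocr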

# Ensure that no processes are left
# node 1

crsctl stop crs -f
ps -ef | grep "/u01/app"

# If anything is still running, kill it!

# Start the clusterware stack in exclusive mode, without the CRS daemon, on node 1

crsctl start crs -excl -nocrs

# Restore the OCR on node 1

ocrconfig -restore /u01/app/12.1.0.2/grid/cdata/lbank-clus-dr/week.ocr
ocrcheck

# Restart CRS normally on node 1

crsctl stop crs -f
crsctl start crs

# Check the status

crsctl status res -t

# It should be OK

# Do the same steps on node 2 as root, but it may fail

/u01/app/12.1.0.2/grid/root.sh

# Failed

ORA-15160: rolling migration internal fatal error in module SKGXP,valNorm:not-native
. For details refer to "(:CLSN00107:)" in "/u01/app/oracle/diag/crs/lbdm02-dr-adm/crs/trace/ohasd_oraagent_oracle.trc".
CRS-2883: Resource 'ora.asm' failed during Clusterware stack start.
CRS-4406: Oracle High Availability Services synchronous start failed.
CRS-4000: Command Start failed, or completed with errors.
2016/09/28 09:11:00 CLSRSC-117: Failed to start Oracle Clusterware stack

# Deconfigure on both nodes
# node1, node2

 /u01/app/12.1.0.2/grid/crs/install/rootcrs.sh -deconfig -force

# and run root.sh again
# node 1

/u01/app/12.1.0.2/grid/root.sh

# This time it was completely successful.

# On the second node there is still a problem

# Read the following document ORA-15160: rolling migration internal fatal error in module SKGXP,valNorm:not-native (NOTE 1682591.1)

# Here the problem was the protocols used by ASM and the RDBMS.
# The RDBMS was using the RDS protocol while ASM was using UDP; see Oracle Clusterware and RAC Support for RDS Over Infiniband (NOTE 751343.1)
# The problem was in the libraries, and we had to relink them with the right protocols.
# As the ORACLE_HOME/GI_HOME owner, stop all resources (database, listener, ASM etc.) running from the home. When stopping the database, use the NORMAL or IMMEDIATE option.

# On the problematic node, where ASM or the database is not starting

crsctl stop crs
ps -ef|grep d.bin
ps -ef | grep "/u01/app"

# Kill any processes that are left

# If relinking Grid Infrastructure (GI) home, as root, unlock GI home: <GI_HOME>/crs/install/rootcrs.pl -unlock

/u01/app/12.1.0.2/grid/crs/install/rootcrs.sh -unlock

# As the ORACLE_HOME/GI_HOME owner, go to ORACLE_HOME/GI_HOME and cd to rdbms/lib
# As the ORACLE_HOME/GI_HOME owner, issue "make -f ins_rdbms.mk <protocol> ioracle"
# For the RDBMS:

[root@lbdm02-dr-adm lib]# su - oracle
[oracle@lbdm02-dr-adm ~]$ cd $ORACLE_HOME/rdbms/lib
make -f ins_rdbms.mk ipc_rds ioracle

# For ASM

# Set the environment to the ASM instance (answer +ASM2 at the oraenv prompt)
. oraenv
+ASM2
[oracle@lbdm02-dr-adm ~]$ cd $ORACLE_HOME/rdbms/lib
make -f ins_rdbms.mk ipc_g ioracle
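# To confirm which IPC protocol a home is now linked with, the skgxpinfo utility can be
# used (shipped in $ORACLE_HOME/bin since 11.2; it prints the protocol, e.g. udp or rds):

$ORACLE_HOME/bin/skgxpinfo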

# From root

/u01/app/12.1.0.2/grid/crs/install/rootcrs.sh -patch
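# Once the stack is back up, a cluster-wide health check can be run (standard crsctl):

crsctl check cluster -all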

# The last step should configure the clusterware as well. After that everything should be fine, and you can sleep now. 🙂