Moving GRID disk group files to another disk group

To migrate all content from the +GRID disk group to a newly created one, we first need to know which files are located on it:

  • ASM password file
  • ASM Spfile
  • OCR
  • Voting files
  • OCR backups (if configured on the same diskgroup)
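
Before starting, it can be useful to see exactly what is stored on +GRID. A quick recursive listing with asmcmd gives the full picture (output will vary per environment):

[grid@rac1 ~]$ asmcmd find '+GRID' '*'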

Let’s migrate all of them one by one:

Migrate ASM password file

1. Locate the Oracle ASM password file:

[grid@rac1 ~]$ asmcmd pwget --asm
+GRID/orapwASM

2. Migrate the password file:

[grid@rac1 ~]$ asmcmd pwmove --asm -f +GRID/orapwASM +GRID2/orapwASM
moving +GRID/orapwASM -> +GRID2/orapwASM

3. Verify that the file has a new path:

[grid@rac1 ~]$ asmcmd pwget --asm
+GRID2/orapwASM
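
If you want to double-check from SQL as well, the v$passwordfile_info view reports the same path when queried on the ASM instance:

[grid@rac1 ~]$ sqlplus / as sysasm
SQL> select * from v$passwordfile_info;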

Migrate ASM Spfile

1. Locate the Oracle ASM SPFILE:

[grid@rac1 ~]$ asmcmd spget
+GRID/marirac/ASMPARAMETERFILE/registry.253.1088678891

2. Migrate the spfile:

[grid@rac1 ~]$ asmcmd spmove +GRID/marirac/ASMPARAMETERFILE/registry.253.1088678891 +GRID2/marirac/ASMPARAMETERFILE/spfileASM
ORA-15032: not all alterations performed
ORA-15028: ASM file '+GRID/marirac/ASMPARAMETERFILE/registry.253.1088678891' not dropped; currently being accessed (DBD ERROR: OCIStmtExecute)

The error message can be ignored; the new location will be used after CRS is restarted.
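
For reference, a typical way to restart the stack is a rolling restart, node by node as root, which makes ASM pick up the new SPFILE location:

[root@rac1 ~]# crsctl stop crs
[root@rac1 ~]# crsctl start crs -wait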

3. Verify:

[grid@rac1 ~]$ asmcmd spget
+GRID2/marirac/ASMPARAMETERFILE/spfileASM

Migrate OCR

1. Get the current OCR location:

[grid@rac1 ~]$ ocrcheck -config
Oracle Cluster Registry configuration is :
	 Device/File Name         :      +GRID

2. Move OCR:

[grid@rac1 ~]$  ocrconfig -add +GRID2
PROT-20: Insufficient permission to proceed. Require privileged user

[grid@rac1 ~]$ exit
logout

[root@rac1 ~]# ocrconfig -add +GRID2
[root@rac1 ~]# ocrconfig -delete +GRID

3. Verify:

[root@rac1 ~]# ocrcheck -config
Oracle Cluster Registry configuration is :
	 Device/File Name         :     +GRID2
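
Running ocrcheck without the -config flag additionally performs a logical integrity check of the OCR contents, which is worth doing right after the move:

[root@rac1 ~]# ocrcheck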

Migrate voting files

1. Get the current location:

[root@rac1 ~]# crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
1. ONLINE   544b7b2dc9f14f8dbf8f5c560a32a95f (/dev/flashgrid/rac2.xvdba) [GRID]
2. ONLINE   c4035c7009be4f26bffd663651e4d520 (/dev/flashgrid/rac1.xvdba) [GRID]
3. ONLINE   5737c31731574fa8bf2acc107fbbd364 (/dev/flashgrid/racq.xvdba) [GRID]
Located 3 voting disk(s).

2. Move:

[root@rac1 ~]# crsctl replace votedisk +GRID2
Successful addition of voting disk 26221fd4d7334fa8bfc98be1908ee3ef.
Successful addition of voting disk 093f9c21b9864f87bfc4853547f05a16.
Successful addition of voting disk 9c2a9fd2fc334f7ebfb44c04bdb0cf57.
Successful deletion of voting disk 544b7b2dc9f14f8dbf8f5c560a32a95f.
Successful deletion of voting disk c4035c7009be4f26bffd663651e4d520.
Successful deletion of voting disk 5737c31731574fa8bf2acc107fbbd364.
Successfully replaced voting disk group with +GRID2.
CRS-4266: Voting file(s) successfully replaced

3. Verify:

[root@rac1 ~]# crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
1. ONLINE   26221fd4d7334fa8bfc98be1908ee3ef (/dev/flashgrid/rac1.xvdbc) [GRID2]
2. ONLINE   093f9c21b9864f87bfc4853547f05a16 (/dev/flashgrid/rac2.xvdbc) [GRID2]
3. ONLINE   9c2a9fd2fc334f7ebfb44c04bdb0cf57 (/dev/flashgrid/racq.xvdbz) [GRID2]
Located 3 voting disk(s).
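
With both OCR and voting files relocated, a quick cluster-wide health check confirms nothing was disturbed:

[root@rac1 ~]# crsctl check cluster -all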

Migrate OCR backup

Please note that keeping the OCR backup in the same disk group as the OCR itself is not good practice; you should have a separate disk group for it. So let’s assume we have a separate disk group for that.

1. Check the current location:

[root@rac1 ~]# ocrconfig -showbackup

rac2     2021/11/29 17:07:02     +GRID:/marirac/OCRBACKUP/backup00.ocr.276.1089911215     1443639413

rac2     2021/11/25 16:52:08     +GRID:/marirac/OCRBACKUP/backup01.ocr.275.1089564721     1443639413

rac2     2021/11/21 14:13:23     +GRID:/marirac/OCRBACKUP/backup02.ocr.277.1089209597     1443639413

rac2     2021/11/29 17:07:02     +GRID:/marirac/OCRBACKUP/day.ocr.272.1089911223     1443639413

rac1     2021/11/15 15:05:26     +GRID:/marirac/OCRBACKUP/week.ocr.273.1088694327     1443639413
PROT-25: Manual backups for the Oracle Cluster Registry are not available

2. Reconfigure:

[root@rac1 ~]# ocrconfig -backuploc +FRA

Clusterware keeps automatic OCR backups taken over the past 4 hours, 8 hours, and 12 hours, plus backups from the last day and week. Until new automatic backups accumulate in the new location, we can run a manual backup for safety:

[root@rac1 ~]# ocrconfig -manualbackup

rac2     2021/11/30 12:20:15     +FRA:/marirac/OCRBACKUP/backup_20211130_122015.ocr.257.1089980415     1443639413

3. Verify:

[root@rac1 ~]# ocrconfig -showbackup

rac2     2021/11/29 17:07:02     +GRID:/marirac/OCRBACKUP/backup00.ocr.276.1089911215     1443639413

rac2     2021/11/25 16:52:08     +GRID:/marirac/OCRBACKUP/backup01.ocr.275.1089564721     1443639413

rac2     2021/11/21 14:13:23     +GRID:/marirac/OCRBACKUP/backup02.ocr.277.1089209597     1443639413

rac2     2021/11/29 17:07:02     +GRID:/marirac/OCRBACKUP/day.ocr.272.1089911223     1443639413

rac1     2021/11/15 15:05:26     +GRID:/marirac/OCRBACKUP/week.ocr.273.1088694327     1443639413

rac2     2021/11/30 12:20:15     +FRA:/marirac/OCRBACKUP/backup_20211130_122015.ocr.257.1089980415     1443639413

Display ASM disk attributes while ASM is not running, using KFOD

$GRID_HOME/bin/kfod has many uses (see kfod -help); one of them is printing disk attributes without connecting to an ASM instance. Better still, you can display these attributes while ASM is not running. Imagine how useful that can be when troubleshooting ASM startup issues.

Let’s display disk size, header status, path, disk group name, owner user, owner group, physical sector size, and logical sector size:

[root@rac1~]# kfod op=disks status=true disks=all dscvgroup=true diskattr=all

Let’s see if ASM is running during that time:

[root@rac1~]# ps -ef|grep smon

root 3716 1     4 12:36 ?      00:00:01 /u01/app/19.3.0/grid/bin/osysmond.bin
root 5178 5083  0 12:37 pts/0  00:00:00 grep --color=auto smon

There is no asm_smon_+ASM1, which means ASM is down.
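
For completeness: when Clusterware itself is up, you can also check the ASM resource state through crsctl (unlike kfod, this requires a running stack):

[root@rac1 ~]# crsctl stat res ora.asm -t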

Print the content of multiple differently named files in Linux

When the number of files you are working with is large, you need automation as soon as possible.
This post describes the find -o option, which helps you process differently named files in a single pass.

For example, to print the contents of the physical_block_size and logical_block_size files located under /sys/block/*/queue, run the following:

# find /sys/block/*/queue -name physical_block_size -o -name logical_block_size | while read f ; do echo "$f $(cat $f)" ; done

..
/sys/block/dm-0/queue/physical_block_size 4096
/sys/block/dm-0/queue/logical_block_size 512
/sys/block/dm-1/queue/physical_block_size 512
...

Where -o means OR.
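
One caveat worth knowing: find’s implicit AND binds tighter than -o, so once you add an explicit action such as -print, you must wrap the OR expression in escaped parentheses:

# find /sys/block/*/queue \( -name physical_block_size -o -name logical_block_size \) -print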

Useful when working on ASM disks.

ora.evmd and ora.mdnsd fail to start when http_proxy is set to https://

Problem:

After setting http_proxy to an https value (export http_proxy=https://test) and then stopping and starting CRS, I got the following error:

CRS-2883: Resource 'ora.evmd' failed during Clusterware stack start.
CRS-4406: Oracle High Availability Services synchronous start failed.
CRS-41053: checking Oracle Grid Infrastructure for file permission issues
PRVG-2031 : Owner of file "/u01/app/19.3.0/grid/bin/CommonSetup.pm" did not match the expected value on node "rac1". [Expected = "root(0)" ; Found = "grid(3002)"]
....
PRVG-2031 : Owner of file "/u01/app/19.3.0/grid/lib/libnl19.a" did not match the expected value on node "rac1". [Expected = "root(0)" ; Found = "grid(3002)"]
CRS-4000: Command Start failed, or completed with errors.

Even after unsetting http_proxy, attempts to start and then stop CRS failed:

[root@rac1 ~]# crsctl start crs -wait
CRS-4640: Oracle High Availability Services is already active
CRS-4000: Command Start failed, or completed with errors.

[root@rac1 ~]# crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'rac1'
CRS-2679: Attempting to clean 'ora.mdnsd' on 'rac1'
CRS-2679: Attempting to clean 'ora.gpnpd' on 'rac1'
CRS-2679: Attempting to clean 'ora.evmd' on 'rac1'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'rac1'
CRS-2677: Stop of 'ora.drivers.acfs' on 'rac1' succeeded
CRS-2680: Clean of 'ora.evmd' on 'rac1' failed
CRS-2680: Clean of 'ora.gpnpd' on 'rac1' failed
CRS-2680: Clean of 'ora.mdnsd' on 'rac1' failed
CRS-2799: Failed to shut down resource 'ora.evmd' on 'rac1'
CRS-2799: Failed to shut down resource 'ora.gpnpd' on 'rac1'
CRS-2799: Failed to shut down resource 'ora.mdnsd' on 'rac1'
CRS-2795: Shutdown of Oracle High Availability Services-managed resources on 'rac1' has failed
CRS-4687: Shutdown command has completed with errors.
CRS-4000: Command Stop failed, or completed with errors

So an https value in the http_proxy variable left my CRS unable even to stop.

Solution:

The solution is simple: find the processes that were started during the previous attempt and kill them (be careful not to kill anything that was not started from the GI home):

[root@rac1 ~]# ps -ef|grep d.bin
root      1817     1  0 05:12 ?        00:00:01 /opt/flashgrid/bin/flashgrid_aio_srv
root      1821     1  0 05:12 ?        00:00:06 /opt/flashgrid/bin/flashgrid_target_srv
root      1824     1  0 05:12 ?        00:00:13 /opt/flashgrid/bin/flashgrid_initiator_srv
grid      1832     1  0 05:12 ?        00:00:04 /opt/flashgrid/bin/flashgrid_asm_srv
root      1845     1  0 05:12 ?        00:00:06 /opt/flashgrid/bin/flashgrid_cluster_srv
root      1879     1  0 05:12 ?        00:00:02 /opt/flashgrid/bin/flashgrid_iamback
root      1881     1  0 05:12 ?        00:00:00 /opt/flashgrid/bin/flashgrid_diskwatch
root      1884     1  0 05:12 ?        00:00:00 /opt/flashgrid/bin/flashgrid_reconstruct
root     10228 13775  0 05:43 pts/0    00:00:00 grep --color=auto d.bin
root     20305     1  2 05:16 ?        00:00:33 /u01/app/19.3.0/grid/bin/ohasd.bin reboot _ORA_BLOCKING_STACK_LOCALE=AMERICAN_AMERICA.US7ASCII
root     20631     1  0 05:16 ?        00:00:05 /u01/app/19.3.0/grid/bin/orarootagent.bin

[root@rac1 ~]# kill -9 20305 20631

[root@rac1 ~]# ps -ef|grep d.bin
root      1817     1  0 05:12 ?        00:00:01 /opt/flashgrid/bin/flashgrid_aio_srv
root      1821     1  0 05:12 ?        00:00:06 /opt/flashgrid/bin/flashgrid_target_srv
root      1824     1  0 05:12 ?        00:00:13 /opt/flashgrid/bin/flashgrid_initiator_srv
grid      1832     1  0 05:12 ?        00:00:04 /opt/flashgrid/bin/flashgrid_asm_srv
root      1845     1  0 05:12 ?        00:00:06 /opt/flashgrid/bin/flashgrid_cluster_srv
root      1879     1  0 05:12 ?        00:00:02 /opt/flashgrid/bin/flashgrid_iamback
root      1881     1  0 05:12 ?        00:00:00 /opt/flashgrid/bin/flashgrid_diskwatch
root      1884     1  0 05:12 ?        00:00:00 /opt/flashgrid/bin/flashgrid_reconstruct
root     10296 13775  0 05:43 pts/0    00:00:00 grep --color=auto d.bin

Make sure http_proxy is either unset or set to an http value instead of https:

[root@rac1 ~]# unset http_proxy

[root@rac1 ~]# echo $http_proxy

Or

[root@rac1 ~]# export http_proxy=http://test

Try to start CRS now:

[root@rac1 ~]# crsctl start crs -wait
CRS-4123: Starting Oracle High Availability Services-managed resources
CRS-2672: Attempting to start 'ora.evmd' on 'rac1'
CRS-2672: Attempting to start 'ora.mdnsd' on 'rac1'
CRS-2676: Start of 'ora.mdnsd' on 'rac1' succeeded
CRS-2676: Start of 'ora.evmd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'rac1'
CRS-2676: Start of 'ora.gpnpd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.gipcd' on 'rac1'
CRS-2676: Start of 'ora.gipcd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.crf' on 'rac1'
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'rac1'
CRS-2676: Start of 'ora.cssdmonitor' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'rac1'
CRS-2672: Attempting to start 'ora.diskmon' on 'rac1'
CRS-2676: Start of 'ora.diskmon' on 'rac1' succeeded
CRS-2676: Start of 'ora.crf' on 'rac1' succeeded
CRS-2676: Start of 'ora.cssd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'rac1'
CRS-2672: Attempting to start 'ora.ctssd' on 'rac1'
CRS-2676: Start of 'ora.ctssd' on 'rac1' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'rac1'
CRS-2676: Start of 'ora.asm' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.storage' on 'rac1'
CRS-2676: Start of 'ora.storage' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'rac1'
CRS-2676: Start of 'ora.crsd' on 'rac1' succeeded
CRS-6017: Processing resource auto-start for servers: rac1
CRS-2673: Attempting to stop 'ora.LISTENER_SCAN1.lsnr' on 'rac2'
CRS-2672: Attempting to start 'ora.chad' on 'rac1'
CRS-2672: Attempting to start 'ora.ons' on 'rac1'
CRS-2677: Stop of 'ora.LISTENER_SCAN1.lsnr' on 'rac2' succeeded
CRS-2673: Attempting to stop 'ora.scan1.vip' on 'rac2'
CRS-2677: Stop of 'ora.scan1.vip' on 'rac2' succeeded
CRS-2672: Attempting to start 'ora.scan1.vip' on 'rac1'
CRS-2676: Start of 'ora.chad' on 'rac1' succeeded
CRS-2676: Start of 'ora.scan1.vip' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.LISTENER_SCAN1.lsnr' on 'rac1'
CRS-2676: Start of 'ora.LISTENER_SCAN1.lsnr' on 'rac1' succeeded
CRS-2676: Start of 'ora.ons' on 'rac1' succeeded
CRS-6016: Resource auto-start has completed for server rac1
CRS-6024: Completed start of Oracle Cluster Ready Services-managed resources
CRS-4123: Oracle High Availability Services has been started.

ORA-15041 during rebalance or disk addition

Problem:

One of our customers had a disk offline for longer than disk_repair_time, which caused Oracle to drop the 1TB disk. The problems started after that: the drop command triggered a rebalance, and because the disk group had less than 1TB of free space, the rebalance failed with ORA-15041. That rebalance also left some of the disks 100% full, so free MB on some disks was 0.

Adding disks did not help; when we checked the free space on the existing disks, we got the following output:

# su - grid
$ sqlplus / as sysasm
SQL> select disk_number "Disk #", free_mb
     from v$asm_disk
     where group_number = 1
     order by 2;

    Disk #    FREE_MB
---------- ----------
        13          0
         0          0
         4          0
         3          4
        11     132900
        ...
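
It is also worth checking the numbers at the disk group level. USABLE_FILE_MB takes redundancy into account, so it shows how much space is genuinely available for new files (it can even go negative when there is not enough space to restore full redundancy):

SQL> select name, total_mb, free_mb, usable_file_mb
     from v$asm_diskgroup;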

As mentioned, the rebalance kept failing.

Solution:

It was an AWS environment, and in the cloud we could easily increase disk sizes, so we increased all disks in the disk group by 200GB:

Resizing steps: https://dba010.com/2019/08/23/resize-asm-disks-in-aws-fg-enabled-cluster/

Then we triggered the rebalance:

# su - grid
$ sqlplus / as sysasm
SQL> ALTER DISKGROUP DATA REBALANCE POWER 13; 

After several hours, the rebalance finished successfully.
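
While a rebalance is running, its progress and estimated completion time can be tracked from v$asm_operation:

SQL> select group_number, operation, state, power, sofar, est_work, est_minutes
     from v$asm_operation;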

Please note that we initially increased the disks by only 1GB and the rebalance failed again; we then increased them by 200GB and the operation succeeded. So you may need to increase the disk size more than once.

There is a useful note on this topic in Oracle Doc ID 473271.1.

asmcmd does not return error text for root user

Problem:

When asmcmd commands are run as the root user, error messages are not displayed, but running the same commands as grid does return them. In the following example, ORACLE_HOME is set to the RDBMS home (instead of the GI home) and we are trying to list a file that does not exist:

[root@rac1 ~]# echo $ORACLE_HOME
/u01/app/oracle/product/19.3.0/dbhome_1

[root@rac1 ~]# echo $PATH
/u01/app/19.3.0/grid/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin

[root@rac1 ~]# asmcmd ls +demodg/orcl/tempfile/TEMP.263.10617958
<<<<<< Nothing is displayed here

When a file does not exist, the ls command should display the error ASMCMD-8002: entry 'TEMP.263.10617958' does not exist in directory '+demodg/orcl/tempfile/'. But in our example, it returns nothing.

In the following example, when we try to copy a file, we see a message saying the file is being copied; in reality, nothing is copied, because the grid user has no write permission under /u01:

[root@rac1 ~]# asmcmd cp +demodg/orcl/tempfile/TEMP.263.1061795851 /u01
copying +demodg/orcl/tempfile/TEMP.263.1061795851 -> /u01/TEMP.263.1061795851

If we run the same command as grid, we get an understandable error message:

[grid@rac1 ~]$  asmcmd cp +demodg/orcl/tempfile/TEMP.263.1061795851 /u01
 ASMCMD-9463: operation failed due to lack of write permissions

Reason:

The environment variables are not set correctly: ORACLE_HOME should point to the GI home.

Solution:

[root@rac1 ~]# export ORACLE_HOME=/u01/app/19.3.0/grid

[root@rac1 ~]# asmcmd ls +demodg/orcl/tempfile/TEMP.263.10617958
ASMCMD-8002: entry 'TEMP.263.10617958' does not exist in directory '+demodg/orcl/tempfile/'

PRCD-1229 : An attempt to access configuration of database orcl was rejected because its version 12.2.0.1.0 differs from the program version 19.0.0.0.0

Problem:

After a manual database upgrade from 12c to 19c, I was not able to start the database instance using srvctl:

[oracle@rac1 ~]$ srvctl start instance -db orcl -n rac1

PRCD-1027 : Failed to retrieve database orcl
PRCD-1229 : An attempt to access configuration of database orcl was rejected because its version 12.2.0.1.0 differs from the program version 19.0.0.0.0. Instead run the program from /u01/app/oracle/product/12.2.0/dbhome_1.

Solution:

Run srvctl upgrade database from the new home:

$ /u01/app/oracle/product/19.3.0/dbhome_1/bin/srvctl upgrade database -d orcl -oraclehome /u01/app/oracle/product/19.3.0/dbhome_1

Try to start again:

$ srvctl start instance -db orcl -n rac1

ORA-15477: cannot communicate with the volume driver (DBD ERROR: OCIStmtExecute)

Problem:

I had a GI standalone installation, which I deconfigured and then reconfigured as a one-node RAC; that part was successful. Then I tried to create an ACFS volume, which failed with ORA-15477:

[root@host1 dbs]# asmcmd volcreate -G OGG -s 10G ACFSGG
ORA-15032: not all alterations performed
ORA-15477: cannot communicate with the volume driver (DBD ERROR: OCIStmtExecute)

Reason:

It seems the ACFS/ADVM modules are not loaded or are not functioning properly:

[root@host1 dbs]# lsmod | grep oracle
oracleacfs           5921415  0
oracleadvm           1236257  0
oracleoks             750688  2 oracleacfs,oracleadvm

Solution:

First, I will share two possible solutions that helped others but not me, and then a third one that actually worked in my case:

  1. Start the modules manually and make sure they are enabled:
# acfsload start
# acfsload enable

Check that the modules are loaded using lsmod | grep oracle and retry the volume creation.

2. Reinstall the ACFS/ADVM modules manually:

[root@host1 dbs]# acfsroot install
ACFS-9300: ADVM/ACFS distribution files found.
ACFS-9314: Removing previous ADVM/ACFS installation.
depmod: ERROR: fstatat(6, uds.ko): No such file or directory
depmod: ERROR: fstatat(6, kvdo.ko): No such file or directory
ACFS-9315: Previous ADVM/ACFS components successfully removed.
ACFS-9294: updating file /etc/sysconfig/oracledrivers.conf
ACFS-9307: Installing requested ADVM/ACFS software.
ACFS-9294: updating file /etc/sysconfig/oracledrivers.conf
ACFS-9308: Loading installed ADVM/ACFS drivers.
ACFS-9321: Creating udev for ADVM/ACFS.
ACFS-9323: Creating module dependencies - this may take some time.
depmod: ERROR: fstatat(6, uds.ko): No such file or directory
depmod: ERROR: fstatat(6, kvdo.ko): No such file or directory
ACFS-9390: The command 'echo '/lib/modules/3.10.0-862.el7.x86_64/extra/usm/oracleadvm.ko
/lib/modules/3.10.0-862.el7.x86_64/extra/usm/oracleoks.ko
/lib/modules/3.10.0-862.el7.x86_64/extra/usm/oracleacfs.ko
' | /sbin/weak-modules --no-initramfs --add-modules 3.10.0-1127.18.2.el7.x86_64 2>&1 |' returned unexpected output that may be important for system configuration:
depmod: ERROR: fstatat(6, kvdo.ko): No such file or directory

depmod: ERROR: fstatat(6, uds.ko): No such file or directory

depmod: ERROR: fstatat(6, uds.ko): No such file or directory

depmod: ERROR: fstatat(6, kvdo.ko): No such file or directory

ACFS-9154: Loading 'oracleoks.ko' driver.
ACFS-9154: Loading 'oracleadvm.ko' driver.
ACFS-9154: Loading 'oracleacfs.ko' driver.
ACFS-9327: Verifying ADVM/ACFS devices.
ACFS-9156: Detecting control device '/dev/asm/.asm_ctl_spec'.
ACFS-9156: Detecting control device '/dev/ofsctl'.
ACFS-9309: ADVM/ACFS installation correctness verified.

Retry the volume creation.

If neither of the above helps, try the third solution (which I did not find anywhere on the internet; it was my own idea):

3. Rebuild the initramfs:

[root@host1 ~]# cp -p /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
[root@host1 ~]# dracut -f
[root@host1 ~]# reboot

After the restart, you should be able to create the volume.
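
Before retrying volcreate, you can confirm the drivers came back cleanly after the reboot (acfsdriverstate ships with the GI home):

[root@host1 ~]# lsmod | grep oracle
[root@host1 ~]# acfsdriverstate loaded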

ACFS-05913: unable to contact the standby node stbyrac1

Problem:

I was trying to set up ACFS replication, where one of the steps is to validate the ssh keys using acfsutil; the validation failed with ACFS-05913:

[root@rac1 .ssh]# acfsutil repl info -c -u oggrepl stbyrac1 stbyrac2 /GG
acfsutil repl info: ACFS-05913: unable to contact the standby node stbyrac1
acfsutil repl info: ACFS-05913: unable to contact the standby node stbyrac2

Cause: 

An attempt to use the ping utility to contact a standby node failed.
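
You can verify the ICMP path yourself before retrying; if these time out, a firewall (or cloud security group) is still blocking ICMP:

[root@rac1 .ssh]# ping -c 3 stbyrac1
[root@rac1 .ssh]# ping -c 3 stbyrac2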

Solution:

Enable ICMP traffic between these nodes and retry validation:

[root@rac1 .ssh]# acfsutil repl info -c -u oggrepl stbyrac1 stbyrac2 /GG
A valid 'ssh' connection was detected for standby node stbyrac1 as user oggrepl.
A valid 'ssh' connection was detected for standby node stbyrac2 as user oggrepl.

srvctl start filesystem hangs

The title of this post is generic: there can be many reasons why srvctl start filesystem hangs. The aim of this blog post is to share just one of them.

Problem:

I created an ACFS volume and registered it with srvctl:

$ asmcmd volcreate -G OGG -s 10G ACFSGG
# srvctl add filesystem -device /dev/asm/acfsgg-11 -path /GG_HOME -volume acfsgg -diskgroup OGG -user oracle -fstype ACFS

then tried to start the filesystem:

# srvctl start filesystem -device /dev/asm/acfsgg-11

The command hung.

Troubleshooting:

I checked the logs under the trace folder in the Grid Infrastructure base, but could not find any clue. Even worse, stopping the filesystem was hanging as well.

But let’s stop here for a moment: the file that should have been checked was actually there, but I missed it and checked the wrong files. The file that shows the relevant error is named mount_<process id>.trc and is located under the same trace folder. So instead of manually mounting the filesystem to see the error, you can simply open that mount_<process id>.trc and find the reason there.
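
Assuming the default Grid Infrastructure diagnostic layout (here an Oracle Base of /u01/app/grid; the path is environment-specific), the newest mount traces can be located like this:

[root@stbyrac1 ~]# ls -lt /u01/app/grid/diag/crs/stbyrac1/crs/trace/mount_*.trc | head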

Then I tried mounting the filesystem manually, without srvctl:

[root@stbyrac1 trace]# /bin/mount -t acfs  /dev/asm/acfsgg-11 /GG_HOME
mount.acfs: ACFS-03037: not an ACFS file system

and saw the error that explained what was happening: my volume had not been formatted with the ACFS filesystem. Somehow I had missed that step on the standby cluster, so it was just a human error, but srvctl should at least have reported the problem instead of hanging and leaving the information in a trace file.

Solution:

Format the ACFS volume:

[root@stbyrac1 trace]# mkfs -t acfs /dev/asm/acfsgg-11
mkfs.acfs: version                   = 19.0.0.0.0
mkfs.acfs: on-disk version           = 46.0
mkfs.acfs: volume                    = /dev/asm/acfsgg-11
mkfs.acfs: volume size               = 10737418240  (  10.00 GB )
mkfs.acfs: Format complete.

Because the start and stop operations were hanging, you need to mount the filesystem on all database nodes manually:

[root@stbyrac1 ~]# /bin/mount -t acfs  /dev/asm/acfsgg-11 /GG_HOME
[root@stbyrac2 ~]# /bin/mount -t acfs  /dev/asm/acfsgg-11 /GG_HOME

Now stop and start the filesystem to make sure srvctl can do its job without any manual intervention:

[root@stbyrac1 ~]# srvctl stop filesystem -device /dev/asm/acfsgg-11
[root@stbyrac1 ~]# srvctl start filesystem -device /dev/asm/acfsgg-11
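
As a final check, srvctl should now report the filesystem resource as running:

[root@stbyrac1 ~]# srvctl status filesystem -device /dev/asm/acfsgg-11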