Moving GRID disk group files to another disk group

To migrate all content from +GRID diskgroup to another newly created one, we need to know what is the list of necessary files that are located on it:

  • ASM password file
  • ASM Spfile
  • OCR
  • Voting files
  • OCR backups (if configured on the same diskgroup)

Let’s migrate all of them one by one:

Migrate ASM password file

1. Locate the Oracle ASM password file:

[grid@rac1 ~]$ asmcmd pwget --asm
+GRID/orapwASM

2. Migrate the password file:

[grid@rac1 ~]$ asmcmd pwmove --asm -f +GRID/orapwASM +GRID2/orapwASM
moving +GRID/orapwASM -> +GRID2/orapwASM

3. Verify that the file has a new path:

[grid@rac1 ~]$ asmcmd pwget --asm
+GRID2/orapwASM

Migrate ASM Spfile

1. Locate the Oracle ASM SPFILE:

[grid@rac1 ~]$ asmcmd spget
+GRID/marirac/ASMPARAMETERFILE/registry.253.1088678891

2. Migrate the spfile:

[grid@rac1 ~]$ asmcmd spmove +GRID/marirac/ASMPARAMETERFILE/registry.253.1088678891 +GRID2/marirac/ASMPARAMETERFILE/spfileASM
ORA-15032: not all alterations performed
ORA-15028: ASM file '+GRID/marirac/ASMPARAMETERFILE/registry.253.1088678891' not dropped; currently being accessed (DBD ERROR: OCIStmtExecute)

The error message can be ignored, the new location will be used after we restart CRS.

3. Verify:

[grid@rac1 ~]$ asmcmd spget
+GRID2/marirac/ASMPARAMETERFILE/spfileASM

Migrate OCR

1. Get the current OCR location:

[grid@rac1 ~]$ ocrcheck -config
Oracle Cluster Registry configuration is :
	 Device/File Name         :      +GRID

2. Move OCR:

[grid@rac1 ~]$  ocrconfig -add +GRID2
PROT-20: Insufficient permission to proceed. Require privileged user

[grid@rac1 ~]$ exit
logout

[root@rac1 ~]# ocrconfig -add +GRID2
[root@rac1 ~]# ocrconfig -delete +GRID

3. Verify:

[root@rac1 ~]# ocrcheck -config
Oracle Cluster Registry configuration is :
	 Device/File Name         :     +GRID2

Migrate voting files

1. Get the current location:

[root@rac1 ~]# crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
1. ONLINE   544b7b2dc9f14f8dbf8f5c560a32a95f (/dev/flashgrid/rac2.xvdba) [GRID]
2. ONLINE   c4035c7009be4f26bffd663651e4d520 (/dev/flashgrid/rac1.xvdba) [GRID]
3. ONLINE   5737c31731574fa8bf2acc107fbbd364 (/dev/flashgrid/racq.xvdba) [GRID]
Located 3 voting disk(s).

2. Move:

[root@rac1 ~]# crsctl replace votedisk +GRID2
Successful addition of voting disk 26221fd4d7334fa8bfc98be1908ee3ef.
Successful addition of voting disk 093f9c21b9864f87bfc4853547f05a16.
Successful addition of voting disk 9c2a9fd2fc334f7ebfb44c04bdb0cf57.
Successful deletion of voting disk 544b7b2dc9f14f8dbf8f5c560a32a95f.
Successful deletion of voting disk c4035c7009be4f26bffd663651e4d520.
Successful deletion of voting disk 5737c31731574fa8bf2acc107fbbd364.
Successfully replaced voting disk group with +GRID2.
CRS-4266: Voting file(s) successfully replaced

3. Verify:

[root@rac1 ~]# crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
1. ONLINE   26221fd4d7334fa8bfc98be1908ee3ef (/dev/flashgrid/rac1.xvdbc) [GRID2]
2. ONLINE   093f9c21b9864f87bfc4853547f05a16 (/dev/flashgrid/rac2.xvdbc) [GRID2]
3. ONLINE   9c2a9fd2fc334f7ebfb44c04bdb0cf57 (/dev/flashgrid/racq.xvdbz) [GRID2]
Located 3 voting disk(s).

Moving OCR backup

Please note that having OCR backup on the same location where OCR is located is not a good practice, you should have another disgroup for that. So let’s assume, we have separate DG for that.

1. Check the current location:

[root@rac1 ~]# ocrconfig -showbackup

rac2     2021/11/29 17:07:02     +GRID:/marirac/OCRBACKUP/backup00.ocr.276.1089911215     1443639413

rac2     2021/11/25 16:52:08     +GRID:/marirac/OCRBACKUP/backup01.ocr.275.1089564721     1443639413

rac2     2021/11/21 14:13:23     +GRID:/marirac/OCRBACKUP/backup02.ocr.277.1089209597     1443639413

rac2     2021/11/29 17:07:02     +GRID:/marirac/OCRBACKUP/day.ocr.272.1089911223     1443639413

rac1     2021/11/15 15:05:26     +GRID:/marirac/OCRBACKUP/week.ocr.273.1088694327     1443639413
PROT-25: Manual backups for the Oracle Cluster Registry are not available

2. Reconfigure:

[root@rac1 ~]# ocrconfig -backuploc +FRA

There are automatic OCR backups that are taken in the past 4 hours, 8 hours, 12 hours, and in the last day and week. Until this time passes, we can run manual backup for safety:

[root@rac1 ~]# ocrconfig -manualbackup

rac2     2021/11/30 12:20:15     +FRA:/marirac/OCRBACKUP/backup_20211130_122015.ocr.257.1089980415     1443639413

3. Verify:

[root@rac1 ~]# ocrconfig -showbackup

rac2     2021/11/29 17:07:02     +GRID:/marirac/OCRBACKUP/backup00.ocr.276.1089911215     1443639413

rac2     2021/11/25 16:52:08     +GRID:/marirac/OCRBACKUP/backup01.ocr.275.1089564721     1443639413

rac2     2021/11/21 14:13:23     +GRID:/marirac/OCRBACKUP/backup02.ocr.277.1089209597     1443639413

rac2     2021/11/29 17:07:02     +GRID:/marirac/OCRBACKUP/day.ocr.272.1089911223     1443639413

rac1     2021/11/15 15:05:26     +GRID:/marirac/OCRBACKUP/week.ocr.273.1088694327     1443639413

rac2     2021/11/30 12:20:15     +FRA:/marirac/OCRBACKUP/backup_20211130_122015.ocr.257.1089980415     1443639413

One of the solutions for ORA-27300: OS system dependent operation:fork failed with status: 11

Problem:

Databases were crashed and alert logs were showing errors:

Fri Nov 12 13:23:39 2021
Process startup failed, error stack:
Errors in file /app/oracle/diag/rdbms/orcl/orcl/trace/orcl_psp0_25852.trc:
ORA-27300: OS system dependent operation:fork failed with status: 11
ORA-27301: OS failure message: Resource temporarily unavailable
ORA-27302: failure occurred at: skgpspawn5

We’ve implemented a procedure to take process list (by ps -ef), so output during that time was the following:

oracle    592    1  0 13:57 ?        00:00:00 [oracle] <defunct>
oracle    593    1  0 13:55 ?        00:00:00 [oracle] <defunct>
oracle    615    1  0 13:57 ?        00:00:00 [oracle] <defunct>
oracle    618    1  0 13:57 ?        00:00:00 [oracle] <defunct>
...

Not only Oracle, but sshd and some other processes were also experiencing the same:

oracle  22335 22331  0 13:52 ?        00:00:00 [ps] <defunct>
oracle  22336 22331  0 13:52 ?        00:00:00 [grep] <defunct>
oracle  22338 22331  0 13:52 ?        00:00:00 [grep] <defunct>
oracle  14389    1  0 13:24 ?        00:00:00 [sshd] <defunct>
oracle  15852    1  0 13:23 ?        00:00:00 [sshd] <defunct>

Reason:

A large amount of Zombie processes, causing applications to fail.

Solution:

You may find a lot of recommendations about increasing kernel.pid_max, similarly ORA-27300: OS System Dependent Operation:fork Failed With Status: 11 (Doc ID 1546393.1). Of course, you can make this parameter unlimited, but this will not solve the problem, it will just postpone it.

The reason for high number of defunct processes is described here, https://access.redhat.com/solutions/2438581.

The parent process for our defunct processes was systemd (pid=1) and the version of it was systemd-219-19.el7.x86_64.

The solution is to update systemd to the latest version.

Display ASM disk attributes while ASM is not running, using KFOD

$GRID_HOME/bin/kfod has many usages (kfod -help), one of them is to print disk attributes without connecting to an ASM instance. Even more, you can display these attributes while ASM is not running. Imagine how useful can it be for you, when troubleshooting ASM startup issues.

Let’s display: disk size, header, path, diskgroup name, owner user, owner group, physical sector size, logical sector size.

[root@rac1~]# kfod op=disks status=true disks=all dscvgroup=true diskattr=all

Let’s see if ASM is running during that time:

[root@rac1~]# ps -ef|grep smon

root 3716 1     4 12:36 ?      00:00:01 /u01/app/19.3.0/grid/bin/osysmond.bin
root 5178 5083  0 12:37 pts/0  00:00:00 grep --color=auto smon

There is no asm_smon_+ASM1, which means ASM is down.

Print the content of multiple differently named files in Linux

If the number of files you are working on is big, then you need automation as soon as possible.
This post describes find -o option, which helps you work on differently named files when their number is big.

For example, if you want to output the content of files physical_block_size and logical_block_size located under /sys/block/*/queue, run the following:

# find /sys/block/*/queue -name physical_block_size -o -name logical_block_size | while read f ; do echo "$f $(cat $f)" ; done

..
/sys/block/dm-0/queue/physical_block_size 4096
/sys/block/dm-0/queue/logical_block_size 512
/sys/block/dm-1/queue/physical_block_size 512
...

Where -o means OR.

Useful when working on ASM disks.

ORA-65086: cannot open/close the pluggable database

Problem:

Tried to open the pluggable database and got the error:

SQL> alter pluggable database PDB11 open;
alter pluggable database PDB11 open
*
ERROR at line 1:
ORA-65086: cannot open/close the pluggable database

Reason:

I have run unplug database before so database status was UNPLUGGED.

SQL> col PDB_NAME for a15
SQL> set linesize 800
SQL> select pdb_name,status from dba_pdbs;

PDB_NAME	STATUS
--------------- ----------
PDB11		UNPLUGGED
PDB$SEED	NORMAL
PDB12		NORMAL
PDB13		NORMAL

Solution:

To plug the database back to the container with the same name, first you need to drop it:

SQL> drop pluggable database PDB11 keep datafiles;
SQL> create pluggable database PDB11 using '/u01/app/oracle/myxml/PDB11.xml' nocopy tempfile reuse;

Check the status again:

SQL> col PDB_NAME for a15
SQL> set linesize 800
SQL> select pdb_name,status from dba_pdbs;

PDB_NAME	STATUS
--------------- ----------
PDB$SEED	NORMAL
PDB12		NORMAL
PDB13		NORMAL
PDB11		NEW

Open the database and make sure status became NORMAL:

SQL> alter pluggable database PDB11 open;

SQL> select pdb_name,status from dba_pdbs;

--------------- ----------
PDB$SEED	NORMAL
PDB12		NORMAL
PDB13		NORMAL
PDB11		NORMAL

Draw graph for Linux sar output using ksar

I’ve recently heard about this tool , as it is said we are learning things until the death (:

Our company is saving sar output in a text file periodicly and after performance or other issues we need to analyze it’s output to find out which resource was busy and when.. analyzing text file is time-consuming and can also cause eye tension.

Output in sar:

00:00:01        CPU      %usr     %nice      %sys   %iowait    %steal      %irq     %soft    %guest    %gnice     %idle
00:10:01        all      3.14      0.00      2.43      1.64      0.00      0.00      0.60      0.00      0.00     92.20
00:10:01          0      3.64      0.00      2.33      4.10      0.00      0.00      1.10      0.00      0.00     88.83

00:00:01      scall/s badcall/s  packet/s     udp/s     tcp/s     hit/s    miss/s   sread/s  swrite/s saccess/s sgetatt/s
00:10:01         0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

...
00:00:01       totsck    tcpsck    udpsck    rawsck   ip-frag    tcp-tw
00:10:01         5682       656      1783         0         6       502
00:20:01         5651       668      1748         0         0       804

CPU, Network, Disk I/O, etc. activities are logged.

Same text file analyzed by ksar tool and graphycally displayed is the following:

Full list of items that can be seen graphycally are the following:

Now I will show all necessary information that is necessary to use this tool:

1. Download a pre-built jar from GitHub releases page.

2. Run jar on your computer:

   java -jar ksar-5.2.4-b396_gf0680721-SNAPSHOT-all.jar

3. Click Data -> Load from a file…and choose output of sar in a text file

Full information about this tool: https://github.com/vlsi/ksar

Create database using DBCA in VNC

In this tutorial, we will configure VNC and create a database using DBCA.

Source: https://support.flashgrid.io/hc/en-us…

Gmail blocks emails from Postfix client on Linux

Problem:

I want to send email notification to my Gmail account from Linux server using Postfix client. Mails are not received and /var/log/maillog is full of the following error messages:

Aug 18 17:24:29 rac1 postfix/smtp[17580]: connect to gmail-smtp-in.l.google.com[74.125.69.27]:25: Connection timed out
Aug 18 17:24:29 rac1 postfix/smtp[17580]: connect to gmail-smtp-in.l.google.com[2607:f8b0:4001:c0d::1a]:25: Network is unreachable
Aug 18 17:24:59 rac1 postfix/smtp[17580]: connect to alt1.gmail-smtp-in.l.google.com[173.194.77.27]:25: Connection timed out
Aug 18 17:25:29 rac1 postfix/smtp[17580]: connect to alt2.gmail-smtp-in.l.google.com[173.194.219.27]:25: Connection timed out

Solution:

Configure Postfix and Gmail account accordingly.

1. Confirm that the myhostname parameter is configured with your server’s FQDN:

# grep ^myhostname /etc/postfix/main.cf
myhostname = rac1.example.com

2. Generate an App Password for Postfix:

Click on App passwords -> Select app dropdown -> choose Other (custom name) -> Enter “Postfix” -> click GENERATE.

Postfix app password is generated in yellow box, copy and save it (generated_password_goes_here will be changed by this value).

3. Fill SMTP Host, username, and password in /etc/postfix/sasl_passwd

# cat /etc/postfix/sasl_passwd
smtp.gmail.com your_username@gmail.com:generated_password_goes_here

4. Create the hash db file

# postmap /etc/postfix/sasl_passwd

5. Configure the Postfix Relay Server:

# grep ^relayhost /etc/postfix/main.cf
relayhost = [smtp.gmail.com]:587

6.  To enable authentication, add the following parameters in /etc/postfix/main.cf

smtp_sasl_auth_enable = yes
smtpd_tls_auth_only = yes
smtp_sasl_password_maps = hash:/etc/postfix/sasl_passwd
smtp_sasl_security_options = noanonymous
smtp_tls_security_level = encrypt

7. Reload Postfix service:

# systemctl reload postfix

8. For sending test email, I use Flashgrid tool:

[root@rac1 ~]# flashgrid-node test-alerts
FlashGrid 21.2.24.58935 #bb6005e9d66650d1996184c38d2fb8a2a78420a8
License: Active, Marketplace
Licensee: Flashgrid Inc.
Support plan: 24x7
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Test alerts were sent

The alert is received now:

Flashgrid CHM and Basic Troubleshooting Part 3

In this tutorial, we are gracefully stopping one node in 2-node cluster using flashgrid-node stop command.

The command is useful when you are doing maintenance tasks on one node, e.g upgrading OS, upgrading kernel.

ora.evmd and ora.mdnsd fails to start when http_proxy is set to https://

Problem:

After setting http_proxy to https string (export http_proxy=https://test) and then stopping and starting CRS got the following error:

CRS-2883: Resource 'ora.evmd' failed during Clusterware stack start.
CRS-4406: Oracle High Availability Services synchronous start failed.
CRS-41053: checking Oracle Grid Infrastructure for file permission issues
PRVG-2031 : Owner of file "/u01/app/19.3.0/grid/bin/CommonSetup.pm" did not match the expected value on node "rac1". [Expected = "root(0)" ; Found = "grid(3002)"]
....
PRVG-2031 : Owner of file "/u01/app/19.3.0/grid/lib/libnl19.a" did not match the expected value on node "rac1". [Expected = "root(0)" ; Found = "grid(3002)"]
CRS-4000: Command Start failed, or completed with errors.

Even after unsetting http_proxy and trying to stop CRS got the following:

[root@rac1 ~]# crsctl start crs -wait
CRS-4640: Oracle High Availability Services is already active
CRS-4000: Command Start failed, or completed with errors.

[root@rac1 ~]# crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'rac1'
CRS-2679: Attempting to clean 'ora.mdnsd' on 'rac1'
CRS-2679: Attempting to clean 'ora.gpnpd' on 'rac1'
CRS-2679: Attempting to clean 'ora.evmd' on 'rac1'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'rac1'
CRS-2677: Stop of 'ora.drivers.acfs' on 'rac1' succeeded
CRS-2680: Clean of 'ora.evmd' on 'rac1' failed
CRS-2680: Clean of 'ora.gpnpd' on 'rac1' failed
CRS-2680: Clean of 'ora.mdnsd' on 'rac1' failed
CRS-2799: Failed to shut down resource 'ora.evmd' on 'rac1'
CRS-2799: Failed to shut down resource 'ora.gpnpd' on 'rac1'
CRS-2799: Failed to shut down resource 'ora.mdnsd' on 'rac1'
CRS-2795: Shutdown of Oracle High Availability Services-managed resources on 'rac1' has failed
CRS-4687: Shutdown command has completed with errors.
CRS-4000: Command Stop failed, or completed with errors

So https entry in http_proxy variable caused my CRS even not being able to stop.

Solution:

The solution is simple, find processes that were started during previous attempt and kill them (be careful, not to kill anything that is not started from GI home):

[root@rac1 ~]# ps -ef|grep d.bin
root      1817     1  0 05:12 ?        00:00:01 /opt/flashgrid/bin/flashgrid_aio_srv
root      1821     1  0 05:12 ?        00:00:06 /opt/flashgrid/bin/flashgrid_target_srv
root      1824     1  0 05:12 ?        00:00:13 /opt/flashgrid/bin/flashgrid_initiator_srv
grid      1832     1  0 05:12 ?        00:00:04 /opt/flashgrid/bin/flashgrid_asm_srv
root      1845     1  0 05:12 ?        00:00:06 /opt/flashgrid/bin/flashgrid_cluster_srv
root      1879     1  0 05:12 ?        00:00:02 /opt/flashgrid/bin/flashgrid_iamback
root      1881     1  0 05:12 ?        00:00:00 /opt/flashgrid/bin/flashgrid_diskwatch
root      1884     1  0 05:12 ?        00:00:00 /opt/flashgrid/bin/flashgrid_reconstruct
root     10228 13775  0 05:43 pts/0    00:00:00 grep --color=auto d.bin
root     20305     1  2 05:16 ?        00:00:33 /u01/app/19.3.0/grid/bin/ohasd.bin reboot _ORA_BLOCKING_STACK_LOCALE=AMERICAN_AMERICA.US7ASCII
root     20631     1  0 05:16 ?        00:00:05 /u01/app/19.3.0/grid/bin/orarootagent.bin

[root@rac1 ~]# kill -9 20305 20631

[root@rac1 ~]# ps -ef|grep d.bin
root      1817     1  0 05:12 ?        00:00:01 /opt/flashgrid/bin/flashgrid_aio_srv
root      1821     1  0 05:12 ?        00:00:06 /opt/flashgrid/bin/flashgrid_target_srv
root      1824     1  0 05:12 ?        00:00:13 /opt/flashgrid/bin/flashgrid_initiator_srv
grid      1832     1  0 05:12 ?        00:00:04 /opt/flashgrid/bin/flashgrid_asm_srv
root      1845     1  0 05:12 ?        00:00:06 /opt/flashgrid/bin/flashgrid_cluster_srv
root      1879     1  0 05:12 ?        00:00:02 /opt/flashgrid/bin/flashgrid_iamback
root      1881     1  0 05:12 ?        00:00:00 /opt/flashgrid/bin/flashgrid_diskwatch
root      1884     1  0 05:12 ?        00:00:00 /opt/flashgrid/bin/flashgrid_reconstruct
root     10296 13775  0 05:43 pts/0    00:00:00 grep --color=auto d.bin

Make sure http_proxy is not set or instead of https there is http as a value:

[root@rac1 ~]# unset http_proxy

[root@rac1 ~]# echo $http_proxy

Or

[root@rac1 ~]# export http_proxy=http://test

Try to start CRS now:

[root@rac1 ~]# crsctl start crs -wait
CRS-4123: Starting Oracle High Availability Services-managed resources
CRS-2672: Attempting to start 'ora.evmd' on 'rac1'
CRS-2672: Attempting to start 'ora.mdnsd' on 'rac1'
CRS-2676: Start of 'ora.mdnsd' on 'rac1' succeeded
CRS-2676: Start of 'ora.evmd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'rac1'
CRS-2676: Start of 'ora.gpnpd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.gipcd' on 'rac1'
CRS-2676: Start of 'ora.gipcd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.crf' on 'rac1'
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'rac1'
CRS-2676: Start of 'ora.cssdmonitor' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'rac1'
CRS-2672: Attempting to start 'ora.diskmon' on 'rac1'
CRS-2676: Start of 'ora.diskmon' on 'rac1' succeeded
CRS-2676: Start of 'ora.crf' on 'rac1' succeeded
CRS-2676: Start of 'ora.cssd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'rac1'
CRS-2672: Attempting to start 'ora.ctssd' on 'rac1'
CRS-2676: Start of 'ora.ctssd' on 'rac1' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'rac1'
CRS-2676: Start of 'ora.asm' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.storage' on 'rac1'
CRS-2676: Start of 'ora.storage' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'rac1'
CRS-2676: Start of 'ora.crsd' on 'rac1' succeeded
CRS-6017: Processing resource auto-start for servers: rac1
CRS-2673: Attempting to stop 'ora.LISTENER_SCAN1.lsnr' on 'rac2'
CRS-2672: Attempting to start 'ora.chad' on 'rac1'
CRS-2672: Attempting to start 'ora.ons' on 'rac1'
CRS-2677: Stop of 'ora.LISTENER_SCAN1.lsnr' on 'rac2' succeeded
CRS-2673: Attempting to stop 'ora.scan1.vip' on 'rac2'
CRS-2677: Stop of 'ora.scan1.vip' on 'rac2' succeeded
CRS-2672: Attempting to start 'ora.scan1.vip' on 'rac1'
CRS-2676: Start of 'ora.chad' on 'rac1' succeeded
CRS-2676: Start of 'ora.scan1.vip' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.LISTENER_SCAN1.lsnr' on 'rac1'
CRS-2676: Start of 'ora.LISTENER_SCAN1.lsnr' on 'rac1' succeeded
CRS-2676: Start of 'ora.ons' on 'rac1' succeeded
CRS-6016: Resource auto-start has completed for server rac1
CRS-6024: Completed start of Oracle Cluster Ready Services-managed resources
CRS-4123: Oracle High Availability Services has been started.