Exadata Performance Diagnostics with AWR

” The contents of this paper apply to all Exadata deployments – whether onpremises, Public Cloud, or Cloud at Customer “

https://www.oracle.com/technetwork/database/exadata/exadata-awr-5100655.pdf

Exadata: Rebuild RAC clusterware without deleting data Version 2

This post is differrent from my previous post Rebuild RAC clusterware without deleting data . Because two days ago I was upgrading grid infrastructure from 12.1 to 12.2 that was successfull on first node , but failed on second node. I will not describe why this happend, but the whole process was something complicated instead of being simple. We have installed several patches before the installation(gridSetup has this option to indicate patches before installation)… Seems the 12.2 software has many bugs even during upgrade process.(But I agree  with other DBA-s that 12.2 database is very stable itself).

So what happend now is that during first node upgrade OCR files was changed. I tried deconfigure from 12.2 home and it was also failed. So now I am with my clusterware that has corrupted OCR and voting disks(it belongs 12.2 version). In my presious post I was starting clusterware in exclusive mode with nocrs and restoring OCR from backup, but now because of voting disks are different version  it does not starting in even exclusive mode.

So I have followed the steps that recreate diskgroup , where OCR and voting disks are saved. Because it is Exadata Cell Storage disks , it was more complicated than with ordinary disks, where you can cleanup header using “dd”. Instead of dd you use cellcli.

So let’s start:

Connect to each cell server(I have three of them) and drop grid disks that belong to DBFS(it contains OCR and Voting disks). Be careful dropping griddisk causes data to be erased. So DBFS must contain only OCR and Vdisks not !DATA!

#Find the name, celldisk and size of the grid disk:

CellCLI> list griddisk where name like ‘DBFS_.*’ attributes name, cellDisk, size
DBFS_CD_02_lbcel01_dr_adm CD_02_lbcel01_dr_adm 33.796875G
DBFS_CD_03_lbcel01_dr_adm CD_03_lbcel01_dr_adm 33.796875G
DBFS_CD_04_lbcel01_dr_adm CD_04_lbcel01_dr_adm 33.796875G
DBFS_CD_05_lbcel01_dr_adm CD_05_lbcel01_dr_adm 33.796875G
DBFS_CD_06_lbcel01_dr_adm CD_06_lbcel01_dr_adm 33.796875G
DBFS_CD_07_lbcel01_dr_adm CD_07_lbcel01_dr_adm 33.796875G
DBFS_CD_08_lbcel01_dr_adm CD_08_lbcel01_dr_adm 33.796875G
DBFS_CD_09_lbcel01_dr_adm CD_09_lbcel01_dr_adm 33.796875G
DBFS_CD_10_lbcel01_dr_adm CD_10_lbcel01_dr_adm 33.796875G
DBFS_CD_11_lbcel01_dr_adm CD_11_lbcel01_dr_adm 33.796875G

 

#Drop

CellCLI> drop griddisk DBFS_CD_02_lbcel01_dr_adm
drop griddisk DBFS_CD_03_lbcel01_dr_adm
drop griddisk DBFS_CD_04_lbcel01_dr_adm
drop griddisk DBFS_CD_05_lbcel01_dr_adm
drop griddisk DBFS_CD_06_lbcel01_dr_adm
drop griddisk DBFS_CD_07_lbcel01_dr_adm
drop griddisk DBFS_CD_08_lbcel01_dr_adm
drop griddisk DBFS_CD_09_lbcel01_dr_adm
drop griddisk DBFS_CD_10_lbcel01_dr_adm
drop griddisk DBFS_CD_11_lbcel01_dr_adm

#Create

cellcli> create griddisk DBFS_CD_02_lbcel01_dr_adm celldisk=CD_02_lbcel01_dr_adm, size=33.796875G
create griddisk DBFS_CD_03_lbcel01_dr_adm celldisk=CD_03_lbcel01_dr_adm, size=33.796875G
create griddisk DBFS_CD_04_lbcel01_dr_adm celldisk=CD_04_lbcel01_dr_adm, size=33.796875G
create griddisk DBFS_CD_05_lbcel01_dr_adm celldisk=CD_05_lbcel01_dr_adm, size=33.796875G
create griddisk DBFS_CD_06_lbcel01_dr_adm celldisk=CD_06_lbcel01_dr_adm, size=33.796875G
create griddisk DBFS_CD_07_lbcel01_dr_adm celldisk=CD_07_lbcel01_dr_adm, size=33.796875G
create griddisk DBFS_CD_08_lbcel01_dr_adm celldisk=CD_08_lbcel01_dr_adm, size=33.796875G
create griddisk DBFS_CD_09_lbcel01_dr_adm celldisk=CD_09_lbcel01_dr_adm, size=33.796875G
create griddisk DBFS_CD_10_lbcel01_dr_adm celldisk=CD_10_lbcel01_dr_adm, size=33.796875G
create griddisk DBFS_CD_11_lbcel01_dr_adm celldisk=CD_11_lbcel01_dr_adm, size=33.796875G

Do the same steps on other cells.

2.  Deconfigure root.sh on each node

# Run deconfig

/u01/app/12.1.0.2/grid/crs/install/rootcrs.sh -deconfig -force

#rename gpnp profile

mv /u01/app/12.1.0.2/grid/gpnp/profiles/peer/profile.xml /tmp/profile_backup.xml

3. Run root.sh on first node

/u01/app/12.1.0.2/grid/root.sh

It will fail because will not find the disk group DBFS for mounting and of course OCR inside.  But now asm is started in nomount mode and we are able to recreate diskgroup

4. Create DBFS diskgroup

sqlplus / as sysasm

SQL> create diskgroup DBFS
failgroup LBCEL01_DR_ADM disk ‘o/*/DBFS_CD_02_lbcel01_dr_adm’,’o/*/DBFS_CD_03_lbcel01_dr_adm’,’o/*/DBFS_CD_04_lbcel01_dr_adm’,’o/*/DBFS_CD_05_lbcel01_dr_adm’,’o/*/DBFS_CD_06_lbcel01_dr_adm’,’o/*/DBFS_CD_07_lbcel01_dr_adm’,’o/*/DBFS_CD_08_lbcel01_dr_adm’,’o/*/DBFS_CD_09_lbcel01_dr_adm’,’o/*/DBFS_CD_10_lbcel01_dr_adm’,’o/*/DBFS_CD_11_lbcel01_dr_adm’
failgroup LBCEL02_DR_ADM disk ‘o/*/DBFS_CD_02_lbcel02_dr_adm’,’o/*/DBFS_CD_03_lbcel02_dr_adm’,’o/*/DBFS_CD_04_lbcel02_dr_adm’,’o/*/DBFS_CD_05_lbcel02_dr_adm’,’o/*/DBFS_CD_06_lbcel02_dr_adm’,’o/*/DBFS_CD_07_lbcel02_dr_adm’,’o/*/DBFS_CD_08_lbcel02_dr_adm’,’o/*/DBFS_CD_09_lbcel02_dr_adm’,’o/*/DBFS_CD_10_lbcel02_dr_adm’,’o/*/DBFS_CD_11_lbcel02_dr_adm’
failgroup lbcel03_dr_adm disk ‘o/*/DBFS_CD_02_lbcel03_dr_adm’,’o/*/DBFS_CD_03_lbcel03_dr_adm’,’o/*/DBFS_CD_04_lbcel03_dr_adm’,’o/*/DBFS_CD_05_lbcel03_dr_adm’,’o/*/DBFS_CD_06_lbcel03_dr_adm’,’o/*/DBFS_CD_07_lbcel03_dr_adm’,’o/*/DBFS_CD_08_lbcel03_dr_adm’,’o/*/DBFS_CD_09_lbcel03_dr_adm’,’o/*/DBFS_CD_10_lbcel03_dr_adm’,’o/*/DBFS_CD_11_lbcel03_dr_adm’
ATTRIBUTE
‘compatible.asm’=’12.1.0.2.0’,
‘compatible.rdbms’=’11.2.0.2.0’,
‘au_size’=’4194304’,
‘cell.smart_scan_capable’=’TRUE’;

5. Do the following steps:

* Deconfigure root.sh again from first node
* remove gpnp profile
* run root.sh again on first node

At this time root.sh should be successful.

6. Restore OCR

/u01/app/12.1.0.2/grid/cdata/<clustername> directory contans OCR backups by default

crsctl stop crs -f
crsctl start crs -excl -nocrs
ocrconfig -restore /u01/app/12.1.0.2/grid/cdata/lbank-clus-dr/backup00.ocr
crsctl stop crs -f
crsctl start crs

7. Run root.sh on the second node

/u01/app/12.1.0.2/grid/root.sh

Restart Exadata storage cell service without affecting ASM

Brief history:

One week ago on our DR Exadata cell service hanged, which caused all databases located on Exadata to become inaccessible.

CellCLI> LIST ALERTHISTORY
9 2017-10-13T11:56:05+04:00 critical “RS-7445 [Serv CELLSRV hang detected] [It will be restarted] [] [] [] [] [] [] [] [] [] []”

In cell’s alert history there was written that the service would be restarted itself , but it did not and I restarted it by the following way:

CellCLI> ALTER CELL RESTART SERVICES CELLSRV

The databases started to work correctly.

Today, the same problem happend on the HQ side which of course caused to stop everything for a while until I’ve restarted the service.

But identifying which cell was problematic was a little bit difficult, because there was no error in alerthistory.

BUT when I entered the following command on the third cell node – it hanged, other cells were OK.

CellCLI> LIST ACTIVEREQUEST

So I restarted the same service on that node and problem was resolved.

CellCLI> ALTER CELL RESTART SERVICES CELLSRV

Of course, this is not a solution and cell service must not hang! , but this is the simple workaround when you have stopped PRODUCTION database.

I have created SR and waiting answer from them , if there is any usefull news will update this post.

===================================================================================================

Writing down the correct steps of restarting Cell Services without affecting ASM:

1.  Run the following command to check if there are offline disks on other cells that are mirrored with disks on this cell:

CellCLI > LIST GRIDDISK ATTRIBUTES name WHERE asmdeactivationoutcome != ‘Yes’

Warning : If any grid disks are listed in the returned output, then it is not safe to stop or re-start the CELLSRV process because proper Oracle ASM disk group redundancy will not be intact and will cause Oracle ASM to dismount the affected disk group, causing the databases to shut down abruptly.

If no grid disks are listed in the returned output, you can safely restart cellsrv or all services in step #2 below.

2.  Re-start the cell services using either of the following commands:

CellCLI> ALTER CELL RESTART SERVICES CELLSRV

CellCLI> ALTER CELL RESTART SERVICES ALL

BUT what is good news cell has self-defence on reduced redundancy, if you try to restart it when redundancy check is not satisfied you get:

CellCLI> ALTER CELL RESTART SERVICES ALL;

Stopping the RS, CELLSRV, and MS services…
The SHUTDOWN of ALL services was not successful.
CELL-01548: Unable to shut down CELLSRV because disk group DATA, RECO may be forced to dismount due to reduced redundancy.
Getting the state of CELLSRV services… running
Getting the state of MS services… running
Getting the state of RS services… running

 

Creating a Snapshot-Based Backup of Oracle Linux Database Server(Exadata)

Story: We have Exadata x5-2 servers and they must be PCI compliant. Our PCI scanner found a lot of vulnerabilities on these machines, most of them were rpm upgrades.

So we are in a hell (devil) . Exadata is not a toy and upgrading it’s system is not like playing :):)

We must backup the system , and take the image before doing such things.

Because of Exadata has LVM-s installed we can use LVM based backups.

So let’s start:

Creating a Snapshot-Based Backup of Oracle Linux Database Server with Uncustomized Partitions

This means that you have not changed anything after the initial installation.

a) create mount point where you save backups. It is better if you use NFS. But if you don’t have NFS you may temporarily use local storage and then copy backup to another disk to be more safe.

Exadata local disk is 1.6TB in size and there is about 1.4TB free , so we can create another logical volume for backup. Let’s create logical volume 400GB  in size.
Here, I can tell you the difference between DBA and Sysadmin 🙂 DBA needs more and more space and always takes more than required for being calm for next 5 years. And sysadmin is giving to DBA less space than required and everytime you ask him/her storage his/her face becomes like thinking_80_anim_gif

# lvcreate -L 400GB -n backup VGExaDb
# mkdir /backup
# mkfs.ext4 /dev/VGExaDb/backup
# echo “/dev/VGExaDb/backup     /backup                 ext4    defaults        0 0” >> /etc/fstab
# mount /backup

b) Take a snapshot-based backup of the / (root), /u01. Name them root_snap and u01_snap. Mount them to the newly created directories called exadata_os_backup and exadata_u01_backup.

# mkdir /backup/exadata_os_backup
# mkdir /backup/exadata_u01_backup
# lvcreate -L1G -s -c 32K -n root_snap /dev/VGExaDb/LVDbSys1
Logical volume “root_snap” created.
# e2label /dev/VGExaDb/root_snap DBSYS_SNAP
# mount /dev/VGExaDb/root_snap /backup/exadata_os_backup -t ext4
# lvcreate -L5G -s -c 32K -n u01_snap /dev/VGExaDb/LVDbOra1
Logical volume “u01_snap” created.
# e2label /dev/VGExaDb/u01_snap DBORA_SNAP
# mount /dev/VGExaDb/u01_snap  /backup/exadata_u01_backup -t ext4

c)  Create the backup file using the following command:

# cd  /backup
# tar -pjcvf /backup/os_backup.tar.bz2 * –exclude os_backup.tar.bz2 > /backup/os_backup.stdout 2> /backup/os_backup.stderr

Check the /tmp/backup_tar.stderr file for any significant errors. Errors about failing to tar open sockets, and other similar errors, can be ignored.

d) Unmount the snapshots and remove the snapshots for the root and /01 directories using the following commands:

umount /backup/exadata_u01_backup
umount /backup/exadata_os_backup
lvremove /dev/VGExaDb/u01_snap
lvremove /dev/VGExaDb/root_snap

More information:
https://docs.oracle.com/cd/E50790_01/doc/doc.121/e51951/db_server.htm#DBMMN21380
There is also discussed “Creating a Snapshot-Based Backup of Oracle Linux Database Server with Customized Partitions”.