Managing a RAC environment is considerably more complex and challenging than managing a non-RAC environment. It demands a high level of skill and a great deal of effort to keep a RAC environment stable, and the more complex the environment, the harder it is to manage and control. This article touches on some of the best practices that help keep your RAC environment healthy and calm.
As a co-author of Oracle RAC 11g and 12c books, and having designed, implemented and managed large, complex cluster environments here in the Middle East, I have often been asked by Oracle DBAs, friends and followers to help them with the following:
This article explores some of the best practices for stabilizing a RAC environment and for verifying that they are actually applied on the cluster. It also discusses the strengths and weaknesses of a single large, complex RAC environment versus several smaller ones, and how to decide between them.
One cluster health check best practice is to proactively verify the health of the cluster stack, ASM and the database instances configured on a cluster. ORAchk, a RAC configuration auditing tool popularly known by its earlier name, RACcheck, is a lightweight tool that proactively scans for known issues across the various layers of the cluster and can also be used for an Upgrade Readiness Assessment when planning an upgrade. It can be run to audit only the high availability configuration or the entire cluster stack, including databases. ORAchk includes the following best practice checks:
The ORAchk utility is available on most popular Unix-flavored operating systems: Linux, Oracle Solaris, AIX and HP-UX. The following procedure explains how to configure and invoke the tool:
If you are already familiar with the RACcheck report, the HTML report generated by ORAchk is no different. It provides a comprehensive, detailed report on the various layers of a RAC system. You can also configure the tool to notify you by email whenever an error is detected.
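As a rough illustration of the run modes mentioned above, the following Python sketch builds ORAchk command lines for a few common cases. The helper name build_orachk_command, the mode labels and the ./orachk path are hypothetical. The flags (-a, -b, -p, -u -o pre/post) are the ones I recall from the RACcheck/ORAchk usage text, so verify them against the usage output of your own version before relying on them.

```python
def build_orachk_command(mode, executable="./orachk"):
    """Return an argv list for a given ORAchk run mode.

    The mode names are illustrative labels, not ORAchk terminology.
    The flags should be checked against the usage text of your release.
    """
    options = {
        "all": ["-a"],                       # best practice + patch checks
        "best_practice": ["-b"],             # best practice checks only
        "patch": ["-p"],                     # recommended patch checks only
        "upgrade_pre": ["-u", "-o", "pre"],  # pre-upgrade readiness
        "upgrade_post": ["-u", "-o", "post"],  # post-upgrade verification
    }
    if mode not in options:
        raise ValueError(f"unknown mode: {mode!r}")
    return [executable] + options[mode]


if __name__ == "__main__":
    # Print the commands rather than running them, so the sketch is
    # safe to execute on a machine without ORAchk installed.
    for mode in ("all", "upgrade_pre"):
        print(" ".join(build_orachk_command(mode)))
```

To actually run a command, the returned list could be passed to subprocess.run from the directory where the tool was unzipped, as the software owner.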
Oracle releases a Patch Set Update (PSU) every quarter, which typically includes the latest Critical Patch Updates (CPUs) along with other bug fixes. It is critical to have the latest PSU applied on the cluster environment to keep the system stable; running the latest patch level prevents many known issues.
Maintaining a statistical history of OS resource utilization helps when troubleshooting Clusterware-related issues such as node evictions or an unhealthy cluster stack. The OS Watcher and Cluster Health Monitor (CHM) tools collect OS resource statistics in real time, every second. This information can later be used to measure the health of the system, such as CPU load and memory utilization, and can be used extensively to diagnose cluster evictions and other Clusterware-related issues. CHM is configured by default with Oracle 11.2.0.2 on Linux and Solaris. The OS Watcher tool needs to be downloaded from the My Oracle Support site, which requires login credentials. If CHM is not configured on your RAC environment, make sure you at least have OS Watcher configured and active at all times.
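To give a feel for the kind of data these tools record, here is a minimal Python sketch that parses Linux /proc/meminfo and /proc/loadavg style text into a memory utilization percentage and load averages. It is not how OS Watcher or CHM are implemented; the function names and the sample values are made up for illustration, and the MemAvailable field is assumed to be present (it only exists on reasonably recent Linux kernels).

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: value in kB}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "name: value" shape
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_used_percent(meminfo):
    """Percentage of memory in use, based on MemTotal and MemAvailable."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]  # assumption: field is present
    return 100.0 * (total - available) / total


def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages as floats."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


# Illustrative sample data, not taken from a real server.
SAMPLE_MEMINFO = """MemTotal:       16384000 kB
MemFree:         2048000 kB
MemAvailable:    4096000 kB
"""
SAMPLE_LOADAVG = "0.52 0.61 0.70 2/345 6789"

if __name__ == "__main__":
    used = memory_used_percent(parse_meminfo(SAMPLE_MEMINFO))
    print(f"memory used: {used:.1f}%")  # prints "memory used: 75.0%"
    print("load averages:", parse_loadavg(SAMPLE_LOADAVG))
```

On a live Linux host the same functions could be fed the contents of /proc/meminfo and /proc/loadavg at a fixed interval and the results appended to a log, which is the basic idea behind keeping a utilization history.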
It is recommended to configure and use Linux HugePages on a database server with 12 GB or more of memory for a significant kernel performance improvement. As a general rule of thumb, use HugePages for any Oracle database with an SGA larger than 8 GB. It also helps avoid node eviction issues.
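The arithmetic behind a HugePages setting is simple: divide the SGA size by the huge page size and round up. The Python sketch below shows that calculation. The function names are hypothetical, the 2048 kB default assumes the common x86-64 huge page size, and a real sizing should be based on the shared memory actually needed by all instances on the host rather than on a single SGA figure.

```python
def parse_hugepage_size_kb(meminfo_text):
    """Return the Hugepagesize value (in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("Hugepagesize:"):
            return int(line.split()[1])
    raise ValueError("Hugepagesize not found")


def hugepages_needed(sga_bytes, hugepage_kb=2048):
    """Number of huge pages needed to hold an SGA of sga_bytes.

    hugepage_kb defaults to 2048 kB, an assumption that holds on
    typical x86-64 Linux systems; read the real value from
    /proc/meminfo where possible.
    """
    page_bytes = hugepage_kb * 1024
    # Integer ceiling division: a partly filled page still counts.
    return (sga_bytes + page_bytes - 1) // page_bytes


if __name__ == "__main__":
    sga = 8 * 1024**3  # an 8 GB SGA, the threshold mentioned above
    print("vm.nr_hugepages =", hugepages_needed(sga))  # prints 4096
```

The resulting number is the value that would go into the vm.nr_hugepages kernel parameter, usually together with matching memlock limits for the Oracle software owner.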
On Windows, it is recommended to increase the noninteractive desktop heap size (held in a REG_EXPAND_SZ registry value) to overcome application connectivity issues and avoid RAC instability. The recommended noninteractive desktop heap size is 1 MB.
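On Windows, the desktop heap sizes live in the SharedSection setting inside the "Windows" value under the Session Manager SubSystems registry key, and the third number is the noninteractive desktop heap in KB. The Python sketch below only rewrites such a string; it does not touch the registry. The function name and the sample string are illustrative, and 1 MB is taken to mean 1024 KB.

```python
import re

# SharedSection=<system-wide>,<interactive>,<noninteractive>, all in KB.
_SHARED_SECTION = re.compile(r"SharedSection=(\d+),(\d+),(\d+)")


def set_noninteractive_heap(value, size_kb=1024):
    """Return the registry string with the third SharedSection number replaced.

    size_kb defaults to 1024, assuming the recommended 1 MB means 1024 KB.
    """
    def repl(match):
        return f"SharedSection={match.group(1)},{match.group(2)},{size_kb}"

    new_value, count = _SHARED_SECTION.subn(repl, value, count=1)
    if count == 0:
        raise ValueError("no three-part SharedSection setting found")
    return new_value


# Shortened, illustrative example of the registry string.
SAMPLE = (r"%SystemRoot%\system32\csrss.exe ObjectDirectory=\Windows "
          r"SharedSection=1024,20480,768 Windows=On SubSystemType=Windows")

if __name__ == "__main__":
    print(set_noninteractive_heap(SAMPLE))
```

The new string would then be written back with a registry editor, and a reboot is normally required before the change takes effect.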
Good network architecture is a critical component of an optimal RAC configuration, and the interconnect configuration in particular plays a pivotal role in cluster stability. The Clusterware interconnect is used for heartbeat communication, node state changes, message exchange between the nodes and so on. If the interconnect suffers from problems, you can expect frequent node evictions. There are several ways to improve interconnect stability, including the following:
Another tough decision facing a RAC DBA in a large organization is choosing between one large, complex RAC environment and a number of smaller RAC environments. There is no rule of thumb that settles the question, but once the strengths and weaknesses of the two setups are carefully weighed, you should be able to make the right choice.
Large-scale cluster setups are always tough to control and manage because of the complexity inherent in the cluster. With a large cluster, say more than 10 nodes with 50 databases running across it, cluster management, ASM management and database management all become difficult. In such a setup, any maintenance task is time consuming, collecting diagnostic information is a big ask even with tools to gather it, and managing ASM disks and disk groups is equally demanding.
In my view, if it is possible to run many small RAC environments instead of one large cluster, I would strongly recommend that approach.