Note: This paper is based on collaborative research effort between Quest Software and Dell Computers, with all research having been done by the four authors whose bios are listed at the end of this paper. This presentation is being offered on behalf of both companies and all the authors involved. Also an extra special thanks to Dell for allocating a million dollars worth of equipment to make such testing possible.Smile

When businesses start small and have low capital budget to invest, configuration of computer systems are minimal, consisting of a single database server with a storage array to support the database implementation. However, as the business grows, expansion follows and consequently the current configuration (single node – storage system) is not able to handle the increased load due to increased business. It is typical at this point to either upgrade the current hardware by adding more CPU’s and other required resources or to add additional servers to work as a cohesive unit /cluster. Increasing resources is similar to placing a temporary bandage on the system as it solves the current situation of increased business and demand, but only shifts the issue to be dealt with at a future date as the workload will invariably increase once again. Instead of adding more memory or CPU to the existing configuration, which could be referred to as vertical scaling, additional server(s) should be added to provide load balancing, workload distribution and availability. This functionality is achieved by scaling the boxes horizontally, or in a linear fashion, and by configuring these systems to work as a single cohesive unit or a cluster.


Figure 1.0. Vertical and Horizontal / Linear Scalability Representation

Figure 1.0 illustrates the advantages of horizontal scalability over vertical scalability. The growth potential on a vertically scalable system is limited and, as explained earlier, reaches a point where the addition of resources does not provide proportionally improved results. From a systems perspective, a hardware clustering provides this. ( Note: A cluster is a group of independent hardware systems or nodes that are interconnected to provide a single computer source). Linear (horizontal) scalability brought about by clustered hardware solutions also provides distribution of user workload Vertical Scalability Horizontal Scalability among many servers (nodes). Clusters offer both horizontal (linear) and vertical scalability, the cluster model provides investment protection. Horizontal scalability is the ability to add more nodes to the cluster to provide more functionality. These nodes may be relatively small and/or inexpensive commodity hardware, offering economical upgradeability options that might offer enhancements to a single large system. Vertical scalability is the ability to upgrade individual nodes to higher specifications. Oracle real application cluster (RAC) version 10g with its unique architecture is the only database solution available today that provides true clustering, in a shared database environment and provides horizontal scalability.

REAL APPLICATION CLUSTER

RAC is a configuration of two or more nodes comprising of two or more instances clustered together using Oracle Clusterware 2 using its unique cache fusion technology it is able to share resources and balance workload providing optimal scalability for today’s high end computing environments. A RAC configuration consists of:

  • Many instances of Oracle running on many nodes
  • Many instances sharing a single physical database
  • All instances having common data and control files
  • Each instance contains individual log files and undo segments
  • All instances can simultaneously execute transactions against the single physical database
  • Cache synchronization between users requests across various instances using the cluster interconnect 

COMPONENTS OF RAC CLUSTER


Figure 2.0. Cluster components in a 10g RAC cluster

ORACLE CLUSTERWARE

Oracle Clusterware is three tiered, comprising of the cluster synchronization services (CSS), event manager (EVM) and the cluster ready services (CRS) that provides a unified integrated solution bringing maximum scalability to the RAC environment.

INTERCONNECT

Interconnect is a dedicated private network between the various nodes in the cluster. RAC architecture uses the cluster interconnect for instance to instance block transfers by providing cache coherency. Ideally interconnects are Gigabit Ethernet adapters configured to transfer packets of maximum size supported by the operating system. Depending on the operating system the suggested protocols may vary, on Linux clusters the recommended protocol is UDP.

VIP

Traditionally users and applications have connected to the RAC cluster and database using a public network interface. The network protocol used for this connection has typically been TCP/IP. In a RAC environment, when a node or instance fails, the application unware of such failure attempts to make a connection, at which time the time taken for TCP/IP to acknowledge such a failure could be over 10 minutes causing unresponsive behaviors to the users.

VIP is virtual connection over the public interface. When an application or user makes a connection using the VIP, during a node failure, the Oracle Clusterware based on an event received from the EVM will transfer the VIP address to another surviving instance. Now when a new connection is attempted by the application there are two possible situations based on the Oracle 10g database features implemented.

  1. If the application uses fast application notification calls, Oracle notification services (ONS) will inform the ONS running on the client machines and the application using Oracle provided API can receive such notifications and make connections to one of the other available instances in the cluster. Such proactive notification mechanisms will avoid connections to the failed node.
  2. If the application attempts to connect using the VIP address of the failed node, an immediate failure is notified to the through an acknowledgement when the connection is refused because of a mismatch in the hardware address.

SHARED STORAGE

Another important component of RAC cluster is its shared storage that all participating instances in the cluster will access. The shared storage contains the datafiles, control files, redo logs and undo files. Oracle Database 10g supports three different methods for storing files on shared storage.

  1. Raw devices

    A raw device partition is a contiguous region of a disk accessed by a UNIX or Linux character-device interface. This interface provides raw access to the underlying device, arranging for direct I/O between a process and the logical disk. Therefore, the issuance of a write command by a process to the I/O system directly moves the data to the device.
  2. Oracle clustered file system (OCFS)

    OCFS is a clustered file system developed by Oracle Corporation to provide a solution for easy data file management at the same time providing performance characteristics similar to RAW devices. Oracle initial release OCFS 1.0 only supports database files to be stored on devices formatted using OCFS. The latest release OCFS 2.0 is a more generic file system support both Oracle and non-Oracle files. OCFS supports both Linux and Windows operating systems.
  3. Automatic Storage Management (ASM)

    ASM is new storage management solution introduced in Oracle Database 10 g, it bridges the gap in today’s storage management solution and I/O tuning burden present among the administrators. ASM integrates the file system, volume manager and using the OMF architecture distributes I/O load across all available resources optimizing performance and throughput using the SAME methodology.

TESTING FOR SCALABILITY

Primary functions that RAC systems provide apart from improved performance are availability and scalability. Availability because when one of the nodes or the instances in the cluster where to fail, the reminder of the instances would continue to provide access to the physical database. Scalability because when the user workload or work pattern increases users can access the database from any of the available instances that have higher amount of resources available. Scalability also provides the option to add additional nodes when the user base increases.

When organizations move to a RAC environment, it would be in their best interests to perform independent performance tests to determine the capacity of the cluster configured. Such tests will help determine at what stage in the life cycle of the organization and its application, the cluster will require additional instances to be added to accommodate higher workload.

The basis of these tests is to provide a similar baseline to users regarding the scalability levels of the DELL servers using ASM.

TEST ENVIRONMENT

TOOLS

BENCHMARK FACTORY

Benchmark Factory (BMF) for Databases provides a simple yet robust GUI (figure 3.0) for creating, managing and scheduling industry standard database benchmarks and real-world workload simulation so as to determine accurate production database hardware and software configurations for optimal effectiveness, efficiency and scalability. Using BMF for Databases, DBA’s can easily address the two of the most challenging tasks they face: what hardware architecture and platform should they deploy, and what performance related SLA’s (Service Level Agreements) can they agree to.


Figure 3.0: Benchmark Factory for Databases (BMF)

While BMF for Databases offers numerous industry standard benchmarks, the tests performed for this article were similar to the TPC-C (shown above):

  • The TPC-C like benchmark measures on-line transaction processing (OLTP) workloads. It combines readonly and update intensive transactions simulating the activities found in complex OLTP enterprise environments.

The benchmark tests were setup to simulate loads from 100 to 5000 concurrent users in increments of 100 against a 10 GB database (which BMF for Databases creates). The idea was to ascertain two critical data points: how many concurrent users can each RAC node sustain, and does the RAC cluster scale both predictably and linearly as additional nodes and users are added.

SPOTLIGHT ON RAC

Spotlight on RAC (SoRAC) is a new, innovative database monitoring and diagnostic tool for RAC. It extends the proven architecture and intuitive GUI of Quest Software’s Spotlight on Oracle to RAC environments. Spotlight on RAC is designed to provide a comprehensive yet comprehendible overview of numerous RAC internals visualized by a world-class dash-board like display that makes clustered database monitoring and diagnostics a snap. With its simple street-light like color coded scheme (where green is good and red is bad), plus its point-and-click to drill-down into details design – DBA’s can easily monitor their clusters in order to detect, diagnose and correct all potential problems or hotspots. SoRAC (figure 4.0) even offers alarms with automatic prioritization and weighted escalation rankings to help less experienced RAC DBA’s focus their attention on the most critical or problematic issues. SoRAC is shown below monitoring over 2300 users on a 10 node cluster – with all being well.


Figure 4.0: Spotlight on RAC (SoRAC)

Note that SoRAC requires only a Windows client install to monitor all the nodes on the cluster. It requires no server side agents and no data repository. It’s truly a light footprint, simple to install, launch and start to use tool. The hardware configuration illustrated in figure 5.0 reflects the primary configuration and hardware layout used for this benchmark operation and the tables below provide the details:

Hardware and Software Setup:

  

Database Configuration

Database Version Oracle database 10g R1 (10.1.04) Enterprise Edition
ASM Diskgroups:

SYSTEMDG: 50 GB
DATADG: 50GB
INDEXDG: 50GB
REDO01DG: 20 GB
REDO02 DG: 20GB
All disk groups were created using external redundancy option of ASM
Tablespaces:

Quest_data in DATADG disk group size 40GB using OMF feature

Quest_index in INDEXDG disk group size 10GB using OMF feature
All Other database tablespaces were created in SYSTEMDG disk group.

Redo log files were created in REDO01DG REDO02DG disk groups

 


Figure 5.0 10 Node physical hardware layout.

TESTING METHODOLOGY

Methodology is of critical importance for any reliable benchmarking exercise – especially those of a complex and repetitive nature. A methodology allows for comparison of current activity with previous activities, while recording any changes to the baseline criteria. RAC testing is no different – a methodology to identify performance candidates, tune parameters or settings, run the tests, and then record the results is critical. And because of its highly complex multi-node architecture, RAC benchmarking should follow an iterative testing process that proceeds as follows:

  1. For a single node and instance
    1. Establish a fundamental baseline
      1. Install the operating system and Oracle database (keeping all normal installation defaults)
      2. Create and populate the test database schema
      3. Shutdown and startup the database
      4. Run a simple benchmark (e.g. TPC-C for 200 users) to establish a baseline for default operating system and database settings
    2. Optimize the basic operating system
      1. Manually optimize typical operating system settings
      2. Shutdown and startup the database
      3. Run a simple benchmark (e.g. TPC-C for 200 users) to establish a new baseline for basic operating system improvements
      4. Repeat prior three steps until a performance balance results
    3. Optimize the basic non-RAC database
      1. Manually optimize typical database “spfile” parameters
      2. Shutdown and startup the database
      3. Run a simple benchmark (e.g. TPC-C for 200 users) to establish a new baseline for basic Oracle database improvements iv. Repeat prior three steps until a performance balance results
    4. Ascertain the reasonable per-node load
      1. Manually optimize scalability database “spfile” parameters
      2. Shutdown and startup the database
      3. Run an increasing user load benchmark (e.g. TPC-C for 100 to 800 users increment by 100) to find the “sweet spot” of how many concurrent users a node can reasonably support
      4. Monitor the benchmark run via the vmstat command, looking for the point where excessive paging and swapping begins – and where the CPU idle time consistently approaches zero
      5. Record the “sweet spot” number of concurrent users – this represents an upper limit vi. Reduce the “sweet spot” number of concurrent users by some reasonable percentage to account for RAC architecture and inter/intra-node overheads (e.g. reduce by say 10%)
    5. Establish the baseline RAC benchmark
      1. Shutdown and startup the database
      2. Create an increasing user load benchmark based upon the node count and the “sweet spot” (e.g. TPC-C for 100 to node count * sweet spot users increment by 100) iii. Run the baseline RAC benchmark 2.

  2. For 2nd through Nth nodes and instances
    1. Duplicate the environment
      1. Install the operating system
      2. Duplicate all of the base node’s operating system setting
    2. Add the node to the cluster
      1. Perform node registration task
      2. Propagate the Oracle software to the new node
      3. Update the database “spfile” parameters for the new node
      4. Alter the database to add node specific items (e.g. redo logs
    3. Run the baseline RAC benchmark
      1. Update the baseline benchmark criteria to include user load scenarios from the prior run’s maximum up to the new maximum based upon node count * “sweet spot” of concurrent users using the baseline benchmark’s constant for increment by
      2. Shutdown and startup the database – adding the new instance
      3. Run the baseline RAC benchmark 4.Plot the transactions per second graph showing this run versus all the prior baseline benchmark runs – the results should show a predictable and reliable scalability factor

As with any complex testing endeavor, the initial benchmarking setup and sub-optimization procedure is very time consuming. In fact reviewing the steps above, nearly two thirds of the overall effort is expended in getting the single node and instance correctly setup, plus the baseline benchmark properly defined. Of course once that initial work is completed, then the remaining steps of adding another node and retesting progresses rather quickly. Plus if the DBA simply duplicates all the nodes and instances like the first once it’s done, then the additional node benchmarking can be run with little or no DBA interaction (i.e. eliminates steps 2-a and 2-b). This also provides the greatest flexibility to test any scenario and in any order that one might prefer (e.g. test 10 nodes down to 1 node). So a little up front work can go a long way.

TESTING

In our benchmarking test case, the first three steps (establish a fundamental baseline, optimize the basic operating system and optimize the non-RAC database) were very straight forward and quite uneventful. We simply installed Redhat Advanced Server 4.0 update 1 and all the device drivers necessary for our hardware, installed Oracle 10g Release 1, and patched Oracle to version 10.1.0.4. We of course then modified the Linux kernel parameters to best support Oracle by adding the following entries to /etc/sysctl.conf:

      • kernel.shmmax = 2147483648
      • kernel.sem = 250 32000 100 128
      • fs.file-max = 65536
      • fs.aio-max-nr = 1048576
      • net.ipv4.ip_local_port_range = 1024 65000
      • net.core.rmem_default = 262144
      • net.core.rmem_max = 262144
      • net.core.wmem_default = 262144
      • net.core.wmem_max = 262144

We then made sure that asynchronous IO was compiled in and being used by performing the following steps:

      1. cd to $ORACLE_HOME/rdbms/lib
        1. make -f ins_rdbms.mk async_on
        2. make -f ins_rdbms.mk ioracle
      2. Set necessary “spfile” parameter settings
        1. disk_asynch_io = true (default value is true)
        2. filesystemio_options = setall (for both async and direct io)

Note that in Oracle 10g Release 2 asynchronous IO is now compiled in by default. Next we then created our RAC database and initial instance using Oracle’s Database Configuration Assistant (DBCA), being very careful to choose parameter settings that made sense for our proposed maximum scalability (i.e. 10 nodes). Finally, we made the final manual “spfile” adjustments shown below:

      • cluster_database=true
      • cluster_database_instances=10
      • db_block_size=8192
      • processes=16000
      • sga_max_size=1500m
      • sga_target=1500m
      • pga_aggregate_target=700m
      • db_writer_processes=2
      • open_cursors=300
      • optimizer_index_caching=80
      • optimizer_index_cost_adj=40

The key idea was to eek out as much SGA memory usage as possible within the 32-bit operating system limit (about 1.7 GB). Since our servers had only 4 GB of RAM each, we figured that allocating half to Oracle was sufficient – with the remaining memory to be shared by the operating system and the thousands of dedicated Oracle server processes that the TPC-C like benchmark would be creating as its user load.

Now it was time to ascertain the reasonable per node load that our servers could accommodate. This is arguably the most critical aspect of the entire benchmark testing process – and especially for RAC environments with more than just a few nodes. We initially ran a TPC-C on the single node without monitoring the benchmark run via the vmstat command. So simply looking at the transactions per second graph in BMF yielded a deceiving belief that we could set the “sweet spot” at 700 users per node. The issue was that even though the graph continued in a positive direction up to 700 users, that in reality the operating system was being overstressed and exhibited minimal thrashing characteristics at about 600 users. Moreover we did not temper that value by reducing for RAC overhead. The end result was that our first attempt at running a series of benchmarks for 700 users per node did not scale either reliably or predictably beyond four servers. Our belief is that by taking each box to a near thrashing threshold by our overzealous per node user load selection, the nodes did not have sufficient resources available to communicate in a timely enough fashion for inter/intra-node messaging – and thus Oracle began to think that nodes were either dead or non-respondent.. Furthermore when relying upon Oracle’s client and server side load balancing feature, which allocates connections based upon node responding, the user load per node became skewed and then exceeded our per node “sweet spot” value. For example when we tested 7000 users for 10 nodes, since some nodes appeared dead to Oracle – the load balancer simply directed all the sessions across whatever node were responding. So we ended up with nodes trying to handle far more than 700 users – and thus the thrashing was even worse.

Note: This will not be of any concern in Oracle database 10g Release 2. With the runtime connection load balancing feature and FAN technology the client will be proactively notified regarding the resource availability on each node and the client can place connections on instances that have more resources. Load balancing can be performed either by connections or by response time.

So with a valuable lesson learned by our first attempt, we made two major improvements. First, we re-ascertained what our true “sweet spot” was by monitoring the single node 100 to 800 user load test – watching very carefully for the onset of excessive paging, swapping or consistent CPU idle time near zero percent. That number was 600 users, not 700 as we had tried before. We then adjusted that number down to 500 users – by simply figuring that the RAC architecture would require 15% overhead. This is not a recommendation per se; we simply wanted to pick a number that would yield a purely positive scalability experience for the next set of benchmarking test runs. If we had more time, we could have selected a less conservative “sweet spot” and kept repeating our tests until a definitive reduction percentage could be factually derived. Again, we simply erred on the side of caution and chose a “sweet spot” value that we expected to work well and yet that did not overcompensate. And second, we decided to rely upon Benchmark Factory’s load balancing feature for clusters – which simply allocates or directs 1/n th of the jobs to each node. That way we could be absolutely sure that we never had more than our “sweet spot” of users running on any given node.

With the correct per node user load now correctly identified and guaranteed load balancing, it was now a very simple (although time consuming) exercise to run the TPC-C like benchmarks listed below:

      • 1 Node: 100 to 500 users, increment by 100
      • 2 Node 100 to 1000 users, increment by 100
      • 4 Node 100 to 2000 users, , increment by 100
      • 6 Node 100 to 3000 users, , increment by 100
      • 8 Node 100 to 4000 users, , increment by 100
      • 10 Node 100 to 5000 users, , increment by 100

Benchmark Factory’s default TPC-C like test iteration requires about 4 minutes for a given user load. So for the single node with five user load scenarios, the overall benchmark test run requires 20 minutes. During the entire testing process the load was monitored to identify any hiccups using SoRAC. As illustrated in figure 6.0 when we reached our four node tests we did identify that CPU’s on node racdb1 and racdb3 reached 84% and 76% respectively. Analyzing the root cause of the problem it was related to temporary overload of users on these servers, and the ASM response time.


Figure 6.0 Four node tests - high CPU usage on nodes racdb1 and racdb3

We increased the following parameters on the ASM instance ran our four node tests again and all was well beyond this:

Parameter Default Value New Value
SHARED_POOL 32M 67M
LARGE_POOL 12M 67M

This was the only parameter change we had to make to the ASM instance and beyond this everything work just smooth. Figure 7.0 gives another look at the cluster level latency charts from SoRAC during our eight node run. This indicated that the interconnect latency was well within expectations and in par with any industry network latency numbers.


Figure 7.0 Cluster Latency charts eight node tests

And once you factor in certain inherent latencies and delays between test runs, it actually takes about 30 minutes. So as you add nodes and thus more user load scenarios, the overall benchmarking test runs take longer and longer. In fact the 10 node test takes well over four hours. Other than some basic monitoring to make sure that all is well and the tests are working, there’s really not very much to do while these tests run – so bring a good book to read. The final results are shown below in Figure 8.0.


Figure 8.0. Results for 1 to 10 RAC nodes

Monitoring the storage subsystem using the SoRAC (figure 9.0) indicated that ASM was performing excellently well at this user load. 10 instances with over 5000 users indicated an excellent service time from ASM, actually the I/O’s per second was pretty high and noticeably good toping to over 2500 I/O’s per second.


Figure 9.0. ASM performance with 10 RAC nodes

The results are quite interesting. As the graph clearly shows, Oracle’s RAC and ASM are very predictable and reliable in terms of its scalability. Each successive node seems to continue the near linear line almost without issue. Now there are 3 or 4 noticeable troughs in the graph for the 8 and 10 node test runs that seem out of place. Note that we had one database instance that was throwing numerous ORA-00600 [4194] errors related to its UNDO tablespace. And that one node took significantly longer to startup and shutdown than all the other nodes combined. A search of Oracle’s metalink web site located references to a known problem that would require a database restore or rebuild. Since we were tight on time, we decided to ignore those couple of valleys in the graph, because it’s pretty obvious from the overall results we obtained that smoothing over those few inconsistent points would yield a near perfect graph – showing that RAC is truly reliable and predictable in terms of scalability. And using the 6 node graph results to project forward, Figure10.0 shows a reasonable expectation in terms of realizable scalability – where 17 nodes should equal nearly 500 TPS and support about 10,000 concurrent users.


Figure 10.0. Projected RAC Scalability

CONCLUSION

Apart from the minor hiccups at the initial round where we tried to determine the optimal user load on a node for the given hardware and processor configuration, beyond this scalability of the RAC cluster was outstanding. Addition of every node to the cluster showed steady - close to linear scalability. Close to linear scalability because of the small overhead that the cluster interconnect would consume during block transfer between instances.

The interconnect also performed very well, in this particular case NIC paring/bonding feature of Linux was implemented to provide load balancing across the redundant interconnects which also helped provide availability should any one interconnect fails.

The DELL|EMC storage subsystem that consisted of six ASM diskgroups for the various data files types performed with high throughput also indicating high scalability numbers. EMC PowerPath provided IO load balancing and redundancy utilizing dual Fibre Channel host bus adapters on each server.

It’s the unique architecture of RAC that makes this possible, because irrespective of the number of instances in the cluster, the maximum number of hops that will be performed to before the requestor gets the block requested will not exceed three under any circumstances. This unique architecture of RAC removes any limitations in clustering technology (available from other database vendors) giving maximum scalability. This was demonstrated through the tests above.

These tests would also conclude that Oracle’s direction to bring a true technology grid environment where RAC and ASM are only the stepping stones is encouraging. Oracle® 10g Real Application Clusters (RAC) software running on standards-based Dell™ PowerEdge™ servers and Dell/EMC storage can provide a flexible, reliable platform for a database grid. In particular, Oracle 10g RAC databases on Dell hardware can easily be scaled out to provide the redundancy or additional capacity that the grid environment requires.

ABOUT THE AUTHORS

Anthony Fernandez: Fernandez is a senior analyst with the Dell Database and Applications Team of Enterprise Solutions Engineering, Dell Product Group. His focus is on Database optimization and performance. Anthony has a bachelor’s degree in Computer Science from Florida International University.

Bert Scalzo: Bert Scalzo is a product architect for Quest Software and a member of the TOAD development team. He designed many of the features in the TOAD DBA module. Mr. Scalzo has worked as an Oracle DBA with versions 4 through 10g. He has worked for both Oracle Education and Consulting. Mr. Scalzo holds several Oracle Masters, a BS, MS and PhD in Computer Science, an MBA and several insurance industry designations. His key areas of DBA interest are Linux and data warehousing (he designed 7-Eleven Corporation's multi-terabyte, star-schema data warehouse). Mr. Scalzo has also written articles for Oracle’s Technology Network (OTN), Oracle Magazine, Oracle Informant, PC Week, Linux Journal and www.linux.com. He also has written three books: "Oracle DBA Guide to Data Warehousing and Star Schemas", "TOAD Handbook" and "TOAD Pocket Reference". Mr. Scalzo can be reached at bert.scalzo@quest.com or bert.scalzo@comcast.net.

Murali Vallath: Murali Vallath has over 17 years of IT experience designing and developing databases with over 13 years on Oracle products. Vallath has successfully completed over 50 successful small, medium and terabyte sized RAC implementations (Oracle 9i & Oracle 10g) for reputed corporate firms.

Vallath is the author of the book titled ‘Oracle Real Application Clusters’ and an upcoming book titled ‘Oracle 10g RAC, Grid, Services & Clustering’. Vallath is a regular speaker at industry conferences and user groups, including the Oracle Open World, UKOUG and IOUG on RAC and Oracle RDBMS performance tuning topics.

Vallath is the president of the RAC SIG ( www.oracleracsig.org ) and the Charlotte Oracle Users Group (www.cltoug.org).

Zafar Mahmood: is a senior consultant with the Dell Database and Applications Team of Enterprise Solutions Engineering, Dell Product Group. He has been involved in database performance optimization, database systems, and database clustering solutions for more than eight years. Zafar has a B.S. and M.S. in Electrical Engineering with specialization in Computer Communications from the City University of New York.