Hello, you are not logged in.  Login or sign up
EXPERTS >> Guy Harrison's Improving Oracle Performance >> Resolving Oracle Contention >> Mar 2008 - Resolving Latch Contention
Toad on Twitter Follow Toad Search Toad World Search
Resolving Latch Contention
 Print  

What are latches?

Latches are serialization mechanisms that protect areas of Oracle’s shared memory (the SGA).  In simple terms latches prevent two processes from simultaneously updating - and possibly corrupting - the same area of the SGA.

Oracle sessions need to update or read from the SGA for almost all database operations.  For instance:
  • When a session reads a block from disk, it must modify a free block in the buffer cache and adjust the buffer cache LRU (Least Recently Used) chain.
  • When a session reads a block from the SGA, it will modify the LRU chain.
  • When a new SQL statement is parsed, it will be added to the library cache within the SGA.
  • As modifications are made to blocks, entries are placed in the redo buffer.
  • The database writer periodically writes buffers from the cache to disk (and must update their status from “dirty” to “clean”).
  • The redo log writer writes entries from the redo buffer to the redo logs.
Latches prevent any of these operations from colliding and possibly corrupting the SGA.   

How latches work

Because the duration of operations against memory is very small (typically in the order of nanoseconds) and the frequency of latch requests very high, the latching mechanism needs to be very light-weight.   On most systems, a single machine instruction called “test and set” is used to see if the latch is taken (by looking at a specific memory address) and if not, acquire it (by changing the value in the memory address).

If the latch is already in use, Oracle can assume that it will not be in use for long, so rather than go into a passive wait (e.g., relinquish the CPU and go to sleep) Oracle will retry the operation a number of times before giving up.  This algorithm is called acquiring a spin lock and the number of “spins” before sleeping is controlled by the Oracle initialization parameter “_spin_count”.

The first time the session fails to acquire the latch by spinning it will attempt to awaken after a millisecond or so.  Subsequent waits will increase in duration and in extreme circumstances may reach 100s of milliseconds.   In a system suffering from intense contention for latches, these waits will have a severe impact on response time and throughput.
Figure 1 uses Spotlight’s wait histogram display to show how latch waits typically last only a millisecond or two.  

 Latch histogram
Figure 1 Latch wait durations shown in Spotlights wait histogram display

Causes of latch contention

The latches that most frequently affect performance are those protecting the buffer cache, areas of the shared pool and the redo buffer.
  • Library cache and shared pool latches:  These latches protect the library cache in which sharable SQL is stored.  In a well defined application there should be little or no contention for these latches, but in an application that uses literals instead of bind variables (for instance “WHERE surname=’HARRISON’” rather that “WHERE surname=:surname”, library cache contention is common.  
  • Redo copy/redo allocation latches:  These latches protect the redo log buffer, which buffers entries made to the redo log.   These latches were a significant problem in earlier versions of Oracle, but are rarely encountered today.  
  • Cache buffers chain latches:  These latches are held when sessions read or write to buffers in the buffer cache.   There are typically a very large number of these latches each of which protects only a handful of blocks.  Contention on these latches is typically caused by concurrent access to a very “hot” block and the most common type of such a hot block is an index root or branch block (since any index based query must access the root block).  

Detecting latch contention

Oracle’s wait interface makes it relatively easy to detect latch contention and – from 10g onwards – to accurately identify the specific latch involved.   In 10 and 11g, each latch has it’s own wait category if waits on the specific latch become significant then we can deduce a latch contention problem.    In Spotlight general latch alarms result in the server processes changing color and a general latch alarm firing (Figure 2).

 latch alarm
Figure 2 The Spotlight latch contention alarm

As often as not, latch contention will be associated with other symptoms that can help diagnose the root cause of the problem.  For instance,  in Figure 3 we see that a high shared pool miss rate and shared pool lock alarm are also current.  These alarms are typical of a system in which a high rate of dynamic (non sharable) SQL is causing shared pool or library cache latch contention.

 latchAlarmOtherCauses
Figure 3 Latch alarms are usually associated with other symptoms

In the “bad old days”,  we often had to use ratio based techniques to determine which latch was causing a problem, since Oracle collated all latch waits into a single category.  Since we didn’t know what latch was responsible for the greatest amount of waits, we would typically examine “miss” and “sleep” rates.   A “miss” occurs when a session cannot immediately obtain a latch.  A sleep occurs when the session cannot obtain the latch even when retrying to the value of the parameter “_spin_count”.  These values can be found in the V$LATCH table or  in the Spotlight Latches drill down (Figure 4).  You may still need to refer to this data when dealing with latch contention in Oracle 9i and earlier.  In 10g and 11g, you can use the wait statistics (Figure 1) to determine the latch which is resulting in the most impact.

 Latch DrillDown
Figure 4 The Spotlight latch Drilldown

It’s a valid assumption that the latch with the most sleeps is contributing to the most latch free waits.  However,  the “latch miss rate” – the metric most commonly used to identify latch contention in the past – is not an accurate measure of latch contention.

Tuning the application to avoid latch contention

There are some things we can do within our application design that can reduce contention for latches.

Using bind variables

As noted earlier, failure to use bind variables within an application is the major cause of library cache and/or shared pool latch contention.   All Oracle applications should make use of bind variables whenever possible.

However, all is not lost if you are unable to modify your application code.  You can also try the “CURSOR_SHARING” parameter to cause Oracle to modify SQL on the fly to use bind variables.   A setting of FORCE causes all literals to be converted to bind variables.  A setting of SIMILAR causes statements to be rewritten only if it would not cause the statements execution plan to vary (which can happen if there are histogram statistics defined on a column referenced in the WHERE clause).

CURSOR_SHARING is one of the few silver bullet parameters that can instantly improve performance.  Figure 5 shows how performance changed on my latch constrained system when I changed CURSOR_SHARING to FORCE (using Spotlights Parameters page).   On changing the parameter (at 4:10pm) the execution rate more than doubled, latch contention was eliminated and my SQL Area miss rate halved.  

 Effect of the cursor_sharing parameter
Figure 5 The cursor_sharing parameter can be a silver bullet for library cache/shared pool latch contention

Dealing with cache buffer chains contention

Cache buffers chains latch contention is one of the most intractable types of latch contention.  There are a couple of things you can do at the application level to reduce the severity of this type of contention.

I always recommend that the application workload be optimized before dealing with contention issues (see for instance SystematicOracletuning.pdf).  But in the case of cache buffer chains contention it is particularly important.  Cache buffer chains contention occurs most often because of very high logical read rates on a relatively small number of database blocks.  A common cause of this phenomenon is SQL which repeatedly and unnecessarily  reads the same blocks over and over again.  So first, identify the SQL that is associated with the most cache buffer chains latch activity, and see if that SQL should be tuned.   Spotlight's waiting events Screen (Figure 6) can be used to find the SQL concerned and - of course - SQL optimizer will provide you with options for tuning the SQL.

Finding SQL associated with latch contention
Figure 6: Finding the SQL associated with a latch free wait

If you are satisfied that the SQL is optimized but you still have a cache buffer chains latch contention problem, try and identify the blocks that are “hot".   Metalink note 163424.1 “How to Identify a Hot Block Within The Database” describes how to do this.

Having identified the identity of the hot block, you may find that it is an index root or branch block.  If this is the case, there are two application design changes that may help.
  1. Consider partitioning the table and using local indexes.  This might allow you to spread the heat amongst multiple indexes (you will probably want to use a hash partition to ensure an even spread of load amongst the partitions).
  2. Consider converting the table to a hash cluster keyed on the columns of the index.  This allows the index to be bypassed completely and may also result in some other performance improvements.   However, hash clusters are suitable only for tables of relatively static size, and determining an optimal setting for the SIZE and HASHKEYS storage parameters are essential.
If the block is a table block that just happens to be very heavily accessed, then perhaps partitioning can still help by spreading the load across multiple partitions.  However, if it actually a single row that is hot, then you may need to review your application design.  Alternatively, you can try adjusting the _spin_count parameter (as discussed below).

Is latch contention inevitable?

While conducting performance tuning consultancies or visiting customer sites over the years I have noticed that the most highly optimized databases running on the most high end hardware seem to be the ones that suffer most significantly from latch contention.

It would appear that as we remove all other constraints on databases performance, contention for latches becomes the ultimate limiting factor on database throughput.   

Imagine we have a perfectly tuned application:  we have allocated sufficient memory to the SGA and have a sufficiently low latency IO sub-system that waits for IO are negligible.  CPU is abundant and exceeds the demands of the application.  When we reach this highly desirable state the database will be doing almost nothing but performing shared memory accesses and hence latches – which prevent simultaneous access to the same shared memory areas - will become the limiting factor.

So it may be that some degree of latch contention – possibly on the cache buffers chains latch – has to be accepted in very high volume systems running on extremely powerful hardware.

Investigating spin_count

Back when I started working with Oracle (Oracle version 5 if you must know ), the spin count parameter  (or latch_spin_count) was a documented parameter and many DBAs attempted to adjust it to resolve latch contention.  However, ever since Oracle8i the parameter is been “undocumented”: it does not appear in v$parameter and is not documented in the Oracle reference manual - it's now the hidden parameter "_spin_count".  Why did Oracle do this?

The official Oracle Corporate line is that the value of spin_count is correct for almost all systems and that adjusting it can cause degraded performance.  For instance Metalink  Note:30832.1  says: “If a system is not tight on CPU resource SPIN_COUNT can be left at higher values but anything above 2000 is unlikely to be of any benefit.”.   However,  I think the real reason is that Oracle was unable to provide good guidance on the correct value for _spin_count and therefore decided that - since adjusting it was just as likely to cause harm as to improve performance - the best option was to address latch contention through other means.

What I've found is that that higher values of _spin_count can relieve latch contention in many circumstances and I think Oracle depreciated the parameter incorrectly. But, I do think it's critical that any adjustment to _spin_count be performed with some valid measurement structure in place so that you can determine if the change of parameter has had a beneficial effect.

Oracle set the default value of spin_count to 2000 in Oracle7.  Over the subsequent 10 or so years,  CPU clock speed has increased by more than a factor of 10.  This means that Oracle systems are spending a decreasing amount of time trying to obtain the latch before dropping into a sleep.  So it is arguable that the default value of spin_count should have been increased with each release of Oracle.

I conducted some experiments into the effect of adjusting spin count on the performance of a system suffering from heavy latch contention. For my research, I created a simulation in which severe latch contention was induced on an 11g database.   I then adjusted _spin_count programmatically across a wide range of values and recorded the impact on database throughput, latch waits and CPU utilization.  

Figure 7 summarizes the relation ship between database throughput (as measured by the number of SQL statement executions per second), the amount of time spent in latch waits and the CPU utilization of the system (as measured by the CPU run queue).

The data indicates that as _spin_count increased, waits for latches reduced as CPU utilization increased.  As CPU utilization saturated (an average run queue per processor of one or more) improvements in throughput and reduction in latch free time reduced.

Note that the optimal value for spin_count in this simulation was somewhere in the vicinity of 10,000 – 5 times the default value provided by Oracle.  Throughput had increased by about 80% at this value.

 Spin Count chart
Figure 7 Relationship between spin count, CPU, latch waits and throughput

Clearly, manipulating the value of _spin_count can result in very significant reductions in latch free waits and improve the throughput of latch constrained applications.  As an undocumented parameter, many DBAs will be reluctant to manipulate _spin_count.  However, if faced with intractable latch contention  manipulating spin_count may be the best  available option for improving database throughput.

_spin_count should only be adjusted when there are available CPU resources on the system.  Specifically, if the average CPU Queue length is approaching or greater than 1, then increasing _spin_count is unlikely to be effective.

Note that as of 9iR2, you can use the undocumented _latch_class_* parameters to change the spin count for individual latches, which might be desirable in certain unusual circumstances.

Adjusting your spin count with Spotlight

Spotlight on Oracle incorporates an automated tuning module that will attempt to establish the optimal value for spin count on your system.   Spotlight will experiment with various values of spin count and determine which value results in the best throughput on your system.  Built in constraints – which you can configure – prevent Spotlight from continuing if any performance degradation is encountered.   

Spotlight on Oracle latch tuning
Figure 8: Finding the optimal value for _spin_count with Spotlight

Conclusion

Latches protect areas of Oracle shared memory from concurrent access in roughly the same way that locks protect data in tables.   When a session wants a latch it will repeatedly attempt to obtain the latch until reaching the value of "_spin_count" after which it will sleep and a "latch free" wait will occur.  Excessive latch sleeps can create restrictions on throughput and response time.

The two most frequently encountered forms of latch contention in modern Oracle (10g/11g) are:
  1. Library cache/shared pool latch contention.  This is usually caused when an application issues high volumes of SQL which are non-sharable due to an absence of bind variables.  The CURSOR_SHARING parameter can often be used to alleviate this form of contention.
  2. cache buffer chains contention. This is usually associated with very high logical read rates and "hot" blocks within the database (sometimes index blocks).   After tuning SQL to reduce logical IO and eliminate repetitive reads of the same information, partitioning is often a possible solution. 
If latch contention is causing serious problems, and the system has some free CPU capacity, adjusting the value of the undocumented parameter _spin_count may be effective in reducing contention.  As always, modifying undocumented parameters should be approached with great caution.

Spotlight has a variety of alarms, advice and diagnostic screens for identifying and diagnosing latch contention.   This includes a capability of establishing the optimum _spin_count value for a particular workload.   
Copyright 2010 by Quest Software  | Terms Of Use | Privacy Statement | Contact Us