Monitoring Wait Statistics in Oracle database. Part 1

Oracle Community

Monitoring Wait Statistics in Oracle database. Part 1

By Juan Carlos Olamendy Turruellas

 

In this series of articles, I’ll talk about important wait statistics for watching in order to avoid system degradation, that is, getting on slow response time.

 

Let’s start by saying that the response time (total time to finish the work) of a request sent to a server process depends on two factors:

  • Service time: the time that the server process takes for executing the SQL query/statement
  • Wait time: the time that the server process spends waiting for available shared resources (to be released by other server processes) such as latches, locks, data buffers, and so on

We can mathematically express the former concept using the formula: response_time = service_time + wait_time.

We’ve talked in previous articles about techniques/designs/configurations in order to improve and monitor the service time; so in this article, we’ll focus on the wait time and related events.

It’s remarkable to say the wait time and related events are only the symptoms of the problems, so they show what’s slowing down the performance, but they don’t tell why. Thus, it’s responsibility of the DBA to investigate the real cause by doing a causal analysis.

 

There are dynamic views that show all wait events related to an instance. These wait event views show similar information but from different point of view of the instance and they’re very important for resolving performance issues. There are more than 950 wait events since Oracle database 11g, although the most important ones are related to resource contention such as latch, lock, buffer and I/O contention. Wait events are grouped together by wait classes as shown below:

  • Administratives: waits caused by administrative commands. For example, index rebuilding
  • Application: Waits caused by the program code
  • Cluster: Waits caused by RAC
  • Commit: Waits caused by commits
  • Concurrency: Waits caused by waiting for used shared resources. For example, latches
  • Idle: Waits caused when a session is not active.
  • Network: Waits caused by the network messaging
  • Scheduler: Waits caused by the resource manager
  • System I/O: Waits caused by background I/O processes. For example, the DB writer or log writer
  • User I/O: Waits caused by user I/O. For example, sequential reads to a database file
  • Other: Miscellaneous waits

 

The first metric to analyze is the instance performance, that’s, to calculate the ratio of total time dedicated to service time (working) vs wait time. This information can be extracted from the V$SYSMETRIC view as shown below in the listing 01.

 

SELECT METRIC_NAME, VALUE

FROM V$SYSMETRIC

WHERE METRIC_NAME IN ('Database CPU Time Ratio', 'Database Wait Time Ratio') AND INTSIZE_CSEC =(SELECT max(INTSIZE_CSEC) FROM V$SYSMETRIC);

Listing 01

 

From the output of the former SQL query, we can validate whether or not the wait time is higher than the working time (CPU time). From this point, we go in further details to discover the real causes related to the wait events. Wait classes provide a view to discover where the database instance is performing poorly as shown below in the listing 02 using the V$SYSTEM_WAIT_CLASS view.

 

SELECT WAIT_CLASS, TOTAL_WAITS, round(100*(TOTAL_WAITS/SUM_WAITS),2) PCT_TOTAL_WAITS,

ROUND((TIME_WAITED/100),2) TIME_WAITED, round(100*(TIME_WAITED/SUM_TIME_WAITED),2) PCT_TIME_WAITED

FROM (SELECT WAIT_CLASS, TOTAL_WAITS, TIME_WAITED FROM V$SYSTEM_WAIT_CLASS WHERE WAIT_CLASS != 'Idle'),

(SELECT sum(TOTAL_WAITS) SUM_WAITS, sum(TIME_WAITED) SUM_TIME_WAITED FROM V$SYSTEM_WAIT_CLASS WHERE WAIT_CLASS != 'Idle')

ORDER BY PCT_TIME_WAITED DESC

Listing 02

 

In our case, that Other, Configuration, User I/O and System I/O are responsible for most of the wait time by watching the PCT_TIME_WAITED column in the report. We can see that although User I/O and System I/O are responsible for the most wait events by watching at the PCT_TOTAL_WAITS column, they don’t represent the most wait time by watching the PCT_TIME_WAITED column. We can conclude that I/O operations are not the bottleneck in the system in this moment as shown in the listing 03.

 


 

Listing 03

 

The V$SYSTEM_EVENT view shows the total time waited for all the events in the instance since it started up. We can use this view to discover what are the top wait events. We can calculate the top n wait events by dividing the event’s wait time by the total wait time for all events. For getting the 10 wait events, we need to execute the SQL query as shown in the listing 04.

 

SELECT EVENT, TIME_WAITED, round(100*(TIME_WAITED/SUM_TIME_WAITED),2) PCT_TIME_WAITED,

TOTAL_WAITS, round(100*(TOTAL_WAITS/SUM_TOTAL_WAITS),2) PCT_TOTAL_WAITS

FROM (SELECT EVENT, TIME_WAITED, TOTAL_WAITS FROM V$SYSTEM_EVENT),

(SELECT sum(TIME_WAITED) SUM_TIME_WAITED, sum(TOTAL_WAITS) SUM_TOTAL_WAITS FROM V$SYSTEM_EVENT)

WHERE ROWNUM<=10

ORDER BY PCT_TIME_WAITED DESC

Listing 04

 

We can see that there is no significant I/O wait events. In this case, the most significant wait event is “rdbms ipc message” which means that background processes are waiting for IPC messages (indicating a wait for more work) and according to the vendor documentation it’s not necessary to worry because it’s like a sort of idle event. The report is shown in the listing 05.

 


 

Listing 05

 

We can filter the events like “db file%” as shown below in the listing 06 in order to validate if there are (or not) heavy I/O operations over the database files.

 

SELECT EVENT, TIME_WAITED, round(100*(TIME_WAITED/SUM_TIME_WAITED),2) PCT_TIME_WAITED,

TOTAL_WAITS, round(100*(TOTAL_WAITS/SUM_TOTAL_WAITS),2) PCT_TOTAL_WAITS

FROM (SELECT EVENT, TIME_WAITED, TOTAL_WAITS FROM V$SYSTEM_EVENT WHERE EVENT like 'db file%'),

(SELECT sum(TIME_WAITED) SUM_TIME_WAITED, sum(TOTAL_WAITS) SUM_TOTAL_WAITS FROM V$SYSTEM_EVENT)

ORDER BY PCT_TIME_WAITED DESC

Listing 06

 

We can see that although there are some waits associated to I/O operations over database files, they don’t impact on the percent of waited time that is almost close to 0 as shown in the report in the listing 07.

 

Let me tell that in Oracle we see two types of data block access:

  • db file sequential read”. A single block is read in contiguous way (i.e. reading on indexes, reading the data file headers)
  • db file scatter read”. Multiple blocks are read at the same time (i.e. a full-table scan, a sorting operation)

 

We need to monitor these metrics under a heavy I/O workload because they may constitute the bottleneck of your system. If we find out that there is a lot of wait time related to these metrics (it’s not the case for our example, fortunately), then we need to follow the strategies:

  1. Increase the system RAM and databases buffer cache
  2. Tune the SQL queries to reduce I/O access
  3. Invest on fast storage medium (for example, on SSD devices)

 

We can see in the listing 07 that we’re doing a lot of “db file sequential read” (a 12.22% of waits), although not impacting fortunately on the waited time for now.

 


 

Listing 07

 

We can check the current wait events for a given session using the SQL query as shown below in the listing 08.

 

SELECT event, state, wait_time, seconds_in_wait

FROM V$SESSION

WHERE sid=<sid_parameter>

Listing 08

 

We can also use the V$SESSION_WAIT_HISTORY view to display information about the last ten wait events filtering by a given session as shown below in the listing 09.

 

SELECT seq#, event, wait_time

FROM V$SESSION_WAIT_HISTORY

WHERE sid=<sid_parameter>

ORDER BY seq#

Listing 09

 

 

Conclusion

 

In this first article, I’ve started talking about monitoring wait statistics in Oracle databases. Now you can apply these knowledge and techniques to your own Oracle database instances in order to monitor the performance.

 

 

8468 1 /
Follow / 12 Jan 2016 at 9:55pm

Excellent article, thanks.