Health Does Not Equal Performance

Most technicians and managers equate health with performance; specifically, a DB2 subsystem is healthy if some performance measure or metric remains high. There is a concurrent emphasis on CPU and DASD capacity planning, with many tools available for tracking and predicting the need for future hardware upgrades.

A typical myth is that high-performing systems are healthy, and that making systems healthier means getting more bang for your buck. However, this view concentrates on efficiency while neglecting several other important categories of health.

Organizations that simply equate health with performance spend a great deal of time on performance tuning and on reacting to real or perceived performance issues. Such organizations tend to be dealing with symptoms rather than problems, and their typical reaction is to attempt to increase performance, not health. Common issues in this environment are:

  • Lots of access path tuning
     
  • Reactive, rather than proactive
     
  • CPU constraints
     
  • Poor change control procedures
     
  • Poor (or no) disaster recovery plans

For example, would you consider the following healthy? A DB2 subsystem where:

  1. No Catalog/Directory backups exist
     
  2. There is only one copy of the Active Log (single logging)
     
  3. You are running DB2 Version 4
     
  4. Maintenance is current to May 2002
     
  5. The CPU is running at 100% capacity all day, every day

In fact, there are several classes of health that combine into a single overall measure. Further, the DB2 subsystem does not exist in a vacuum; to function, it requires support from IT professionals (DBAs, systems programmers, etc.). Beyond that, as a DBA you maintain several processes as part of your support infrastructure: regular backups, reorgs, consistency checks, and so forth.

A Working Definition of Health

When we think about software health we think about efficiency, code path length, speed, lack of bugs, and so forth. In a broader sense we also consider how easy new features are to use, whether new functions perform well, and how generally fault tolerant the system is.

Overall health, therefore, is a combination of the following:

  • Efficiency - raw performance
     
  • Flexibility - ability to handle and exploit features and functions
     
  • Fault Tolerance - ability to tolerate failures

Further, there are several components to a DB2 subsystem. They include:

  • Software
     
  • Process
     
  • People

So, an overall health measure for a DB2 subsystem consists of a combination of these three classes of health across three categories of items. This is easily visualized as a table:

                    Software   Processes   People
  Efficiency
  Flexibility
  Fault Tolerance

The myth that health equals performance is an emphasis on software efficiency alone. While this is important, there are eight other cells in the table that deserve our attention. Software health is generally well understood, so let us briefly consider the other two categories.

Process Health

Few consider process health. What does it mean for processes to be efficient, flexible, or fault tolerant? Here are examples of typical processes related to DB2 software:

  • Automated backups (Image Copy, Incremental Copy, volume backups)
     
  • Automated Reorgs
     
  • Regular collection and reporting of performance and trace data
     
  • Regular quality review meetings
     
  • Documentation for contingencies (e.g., what to do if DB2 crashes)

People Health

This category includes the people that support a DB2 subsystem: DBAs, systems programmers, and other support personnel. Typical metrics include:

  • Levels of skill and education
     
  • Time management
     
  • Communications skills
     
  • Ability to set priorities
     
  • Analytical ability
     
  • Problem-solving ability

While it isn't typical to consider whether someone is "fault tolerant", you can still analyze this area in terms of overall system health. For example, there should be more than one person competent in each essential skills area (e.g., database recovery, DB2 system restart, problem diagnosis and repair).

Typical Measurement Methods

There are many techniques for measuring each health category. A representative sample appears in the following chart.

                    Software               Processes           People
  Efficiency        Tools (e.g., RMF)      QA Metrics          Performance Appraisal
  Flexibility       Spare Capacity         QA Process Review   Skills Inventory
  Fault Tolerance   Maintenance Strategy   Best Practices      Practice, Practice, Practice!

Typical Health Strategies

The most effective method of developing a strategy for maintaining overall DB2 subsystem health is to extend the definitions given above for each health class: efficiency, flexibility, and fault tolerance. Then create an extended definition for each new area, concentrating on examples of good health.

Here is the strategy matrix for an example company:

                                       Software   Processes   People
  Ability to Recover from a Disaster
  Availability of Extra Capacity
  Ability of System to be:
    Proactive
    Predictive
    Self-Healing
  Stability
  Efficiency
  Maturity (CMM level)

The case study company used this template for several comparative analyses, including:

  • Health Measure
     
  • Measurement Tools
     
  • Current Documentation
     
  • Budget Allocation

Here is an extended outline of some of these health classes.

  1. Ability to Recover from a Disaster
     
    1. Coordination of logs and image copies
       
    2. Existence of a D/R plan
       
    3. Physical Health
       
    4. Backups
  2. Availability of Extra Capacity

    1. Measurement of percent busy
       
    2. CPU-bound versus I/O bound
       
    3. MB of traffic through DDF
       
    4. Numbers of SQL statements (Query versus Update)
       
    5. Logging activity
       
    6. Memory activity (DBM1 address space)
       
    7. Determination of current bottlenecks
  3. Proactive, Predictive, Self-Healing

    1. Real-time in-memory statistics
       
    2. Bufferpool thresholds
       
    3. Sequential detection
       
    4. Automatic page recovery
       
    5. Dual Logging

    For more information on autonomic functions, see the IBM DB2 for z/OS Version 8 site at: www-306.ibm.com/software/data/db2/os390/db2zosv8.html


  4. Stability
    1. Standardized maintenance process (version, currency)

    2. Regular maintenance upgrades

    3. Software maintenance strategy

    4. People

      1. Business skills: time management, meetings, communications skills
         
      2. Problem solving methodology
         
      3. Education, training, certification
    5. Processes

      1. Documentation: update, upgrade, centralize, review
         
      2. Process measurement
         
      3. Continuous improvement

    Additional information and recommendations on software management may be found in the IBM Redbook, Parallel Sysplex – Software Management for Availability, SG24-5451.

  5. Maturity (CMM Level)

    The Capability Maturity Model™ (CMM) was originally created for the software development process; however, it is relevant to any set of processes used by human beings. We won't cover it here; more information on CMM may be found at: www.sei.cmu.edu/cmm/cmm.html

DB2 Subsystem Health – Specifics

To measure overall DB2 subsystem health you can break it down into classes of health (efficiency, stability, etc.) and examine the three components of the system: software, processes, and people.

Here are some sample categories. They were derived by beginning with the health strategy matrix from the case study above. Yours will be different, depending upon your particular priorities. For example, this company has not implemented data sharing; consequently, there were no areas involving data sharing, global buffer pool, or coupling facility health.

  • Subsystem Configuration
     
  • Catalog and Directory
     
  • Access Paths
     
  • Data
     
  • Process Objects

For each category, it is possible to determine specific items for measurement and tuning, which results in the following expanded outline:

  1. Subsystem Configuration
    1. ZParm settings
       
    2. Logs and Archives
       
    3. Work Files
       
    4. Data Sharing
       
    5. Virtual Pool sizing
       
    6. Memory pool sizing (EDM, RID, Sort)
       
    7. Processes for regular reporting
       
    8. Processes for automated changes
  2. Catalog and Directory

    1. Reorg requirements and frequency
       
    2. Standard consistency check (DSNTESQ)
       
    3. User-defined indexes
  3. Access Paths

    1. Regular saving and reporting of IFCID 0022 and 0316 records
       
    2. Regular capture of Explain information
       
    3. Regular (and automated) analysis
  4. Data: Volumetric and Configurational

    1. Volumetrics on large objects
       
    2. Highly volatile pagesets
       
    3. Message traffic
       
    4. Bulk loads and unloads
       
    5. Image Copies
       
    6. Other utilities (RunStats)
       
    7. Pageset extents
       
    8. Data compression
       
    9. Index and column statistics
  5. Process Objects

    1. Stored Procedures with non-standard run options
       
    2. User-Defined Function activity
       
    3. Triggers
       
    4. Any process automation

A note on process objects: These objects are often implemented with little thought given to error checking or reporting. And, since object invocation can be nested (e.g., a stored procedure can invoke a user-defined function (UDF), which then performs SQL that causes triggers to fire), these objects need to be designed to respond to errors or warnings from the objects they invoke.

Do you have any stored procedures that issue SQL statements? Do these statements ever generate negative SQLCodes? How would you know? Do all applications (including other stored procedures) react properly to these SQLCodes? How do you know?
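
Here is a minimal sketch of one way to make such failures visible, assuming an SQL-language stored procedure. The procedure, table, and column names (DBA.UPDATE_BALANCE, DBA.ACCOUNTS, DBA.PROC_ERROR_LOG) are hypothetical; the point is simply to trap a negative SQLCODE and record it rather than letting it disappear.

  -- Hypothetical SQL procedure that records any negative SQLCODE it hits
  CREATE PROCEDURE DBA.UPDATE_BALANCE
    (IN P_ACCT INTEGER,
     IN P_AMT  DECIMAL(11,2))
    LANGUAGE SQL
  BEGIN
    DECLARE SQLCODE   INTEGER DEFAULT 0;
    DECLARE V_SQLCODE INTEGER DEFAULT 0;

    -- On any SQL error, remember the SQLCODE and keep going so it can be logged
    DECLARE CONTINUE HANDLER FOR SQLEXCEPTION
      SET V_SQLCODE = SQLCODE;

    UPDATE DBA.ACCOUNTS                      -- hypothetical application table
       SET BALANCE = BALANCE + P_AMT
     WHERE ACCT_ID = P_ACCT;

    IF V_SQLCODE < 0 THEN
      -- Record the failure where DBAs (and callers) can see it
      INSERT INTO DBA.PROC_ERROR_LOG (PROC_NAME, SQLCODE_VAL, LOGGED_AT)
        VALUES ('UPDATE_BALANCE', V_SQLCODE, CURRENT TIMESTAMP);
    END IF;
  END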

It is important to implement a process for gathering and interpreting health data. A standard process should include the following items:

  • Set up a snapshot process - Develop SQL queries, REXX procs, utilities, third-party tools, etc., to regularly gather and store measurements
     
  • Develop a time-dependent storage mechanism - Store the measurements in a form where you can associate measurement with time and date gathered
     
  • Begin regular data gathering - Implement automated snapshots
     
  • Develop analysis reports - These include time-series analysis such as snapshot-to-snapshot comparisons and time-dependent graphs and charts
     
  • Design an automated review process - Consider ways to use the analysis reports as inputs to jobs that will address issues

For example, you set up a method of snapshotting pageset sizes and extents on DASD. You store the results in a DB2 table including a timestamp with each measurement. Next, as a team you analyze the results and determine that there is an issue with pagesets that go into multiple extents. Finally, you write a REXX procedure that reads the snapshot data, extracts the dataset names of pagesets in more than (say) 25 extents, and creates jobs to Reorg the pagesets.
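
As a minimal sketch of the storage and analysis pieces, consider the following SQL. The table DBA.PAGESET_SNAP and its columns are hypothetical; in practice its rows would be inserted by the REXX procedure or a monitoring tool.

  -- Hypothetical snapshot table: one row per pageset per snapshot
  CREATE TABLE DBA.PAGESET_SNAP
    (SNAP_TS   TIMESTAMP    NOT NULL,    -- when the measurement was taken
     DBNAME    CHAR(8)      NOT NULL,
     TSNAME    CHAR(8)      NOT NULL,
     DSNAME    VARCHAR(44)  NOT NULL,    -- underlying dataset name
     EXTENTS   SMALLINT     NOT NULL,
     SPACE_KB  INTEGER      NOT NULL);

  -- Reorg candidates: pagesets in more than 25 extents as of the latest snapshot
  SELECT DSNAME, DBNAME, TSNAME, EXTENTS
    FROM DBA.PAGESET_SNAP
   WHERE EXTENTS > 25
     AND SNAP_TS = (SELECT MAX(SNAP_TS) FROM DBA.PAGESET_SNAP)
   ORDER BY EXTENTS DESC;

The REXX procedure would read this result set and tailor Reorg JCL for each dataset name returned.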

What to Look For

After a period of time, review the analyses. You are looking for things on three levels:

  • What is the status right now? (Snapshot)
     
  • How are things changing? (Trend Analysis)
     
  • What processes need to be put in place? (Continuous Improvement)

In the example above, it isn't enough just to know which pagesets have many extents. You also want to know how fast they are growing (trends), and if there is anything you need to do to improve the process.
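
Continuing with the hypothetical DBA.PAGESET_SNAP table sketched earlier, a simple trend query might compare the two most recent snapshots to see how quickly each pageset is acquiring extents:

  -- Extent growth between the two most recent snapshots;
  -- only pagesets that added extents are reported
  SELECT C.DBNAME, C.TSNAME,
         P.EXTENTS AS PRIOR_EXTENTS,
         C.EXTENTS AS CURRENT_EXTENTS,
         C.EXTENTS - P.EXTENTS AS GROWTH
    FROM DBA.PAGESET_SNAP C, DBA.PAGESET_SNAP P
   WHERE C.SNAP_TS = (SELECT MAX(SNAP_TS) FROM DBA.PAGESET_SNAP)
     AND P.SNAP_TS = (SELECT MAX(SNAP_TS) FROM DBA.PAGESET_SNAP
                       WHERE SNAP_TS < C.SNAP_TS)
     AND C.DBNAME = P.DBNAME
     AND C.TSNAME = P.TSNAME
     AND C.EXTENTS > P.EXTENTS
   ORDER BY GROWTH DESC;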

Health Rating

Although the Capability Maturity Model (CMM) was originally created for the software development process, it can also be used as a framework for organizing a quality improvement program for documentation and processes.

Based on the discussion above, there is now enough information to define measurements and scoring for the various health areas. Each IT department should review itself against the Capability Maturity Model, with an emphasis on classifying measures of health. This discussion will concentrate on DB2 infrastructure support.

You also need to have a way of rating what you do. Here is the scale used by one company:

  CMM Level    Documentation                    Processes                        SCORE
               Does not exist                   No process in place              0
  Initial      Exists                           Process in place                 1
  Repeatable   Centrally Available              Documented                       2
  Defined      Understood by all                Consistently Applied             3
  Managed      Regularly Reviewed and Updated   Regularly Reviewed and Updated   4
  Optimized    Regular Quality Improvement      Regular Quality Improvement      5

The case study company chose to concentrate on Documentation and Processes. (The third area, People, was assigned as a separate project.) They then evaluated each area of their IT subsystems in terms of this chart and logged a score for each area in both Documentation and Processes. When they finished, they reviewed their findings for accuracy and completeness. Next, they began to set priorities for improving documentation and processes.

Each IT enterprise should follow this procedure: evaluate current health measures and systems, devise health measures relevant to the enterprise, rate each area, and prioritize the areas that require improvement, based either on a low score or on enterprise or departmental goals. Any area receiving a score of zero for either documentation or processes should get a high priority.
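
One way to keep and query these scores is sketched below; the table DBA.HEALTH_RATING and its columns are hypothetical, not part of the case study:

  -- Hypothetical rating table: one row per health area per review cycle
  CREATE TABLE DBA.HEALTH_RATING
    (REVIEW_DATE  DATE         NOT NULL,
     AREA         VARCHAR(40)  NOT NULL,   -- e.g. 'Catalog and Directory'
     DOC_SCORE    SMALLINT     NOT NULL,   -- 0 to 5, per the chart above
     PROC_SCORE   SMALLINT     NOT NULL);  -- 0 to 5

  -- High-priority areas: any zero score for documentation or processes
  SELECT AREA, DOC_SCORE, PROC_SCORE
    FROM DBA.HEALTH_RATING
   WHERE REVIEW_DATE = (SELECT MAX(REVIEW_DATE) FROM DBA.HEALTH_RATING)
     AND (DOC_SCORE = 0 OR PROC_SCORE = 0)
   ORDER BY DOC_SCORE + PROC_SCORE, AREA;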

Documentation and Process Upgrades

The process of updating documentation should follow the levels listed in the chart above. First, ensure that documentation exists for each area. Then ensure it is centrally available to all who might use it, and make certain it can be read and understood by all. Next implement a process where the department regularly reviews and updates the documentation. Last, institute a quality improvement process.

Process updates are more difficult, primarily because of their complexity. Processes such as regular tablespace reorgs or statistics gathering may not be implemented consistently across the enterprise, or even across a single DB2 subsystem.

Strategies and tactics for improving documentation and processes are part of the CMM.

Indicators and Events for Automation

Part of any review of IT processes and procedures is a look at automation. Before describing automation that is built into the DBMS, we need to review automation that is constructed by DBAs or support staff, or that is implemented as part of a third-party software tool.

Most shops define the automation they desire with simple If-Then logic. For example, "If any pageset is in over 30 extents, execute a Reorg utility for the pageset." These needs are typically implemented as batch jobs that are executed when the condition is true (the sketch below shows one way to generate such a job's input). The conditions themselves are defined in terms of indicators.
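
As a sketch, again using the hypothetical DBA.PAGESET_SNAP table from earlier, the If-Then rule above could be expressed as a query that generates the Reorg utility control statements for a batch job to execute:

  -- Generate a Reorg control statement for every pageset over 30 extents
  SELECT 'REORG TABLESPACE ' || RTRIM(DBNAME) || '.' || RTRIM(TSNAME)
    FROM DBA.PAGESET_SNAP
   WHERE SNAP_TS = (SELECT MAX(SNAP_TS) FROM DBA.PAGESET_SNAP)
     AND EXTENTS > 30;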

There are two types of indicators: state-based, and threshold-based. State-based indicators usually have two values such as On/Off or True/False. Examples of these would be whether or not DB2 is up, whether or not an Active Log is being archived, or whether or not an index is defined as the clustering index.

Threshold-based indicators are used when a measurement varies over a range of values. Usually they indicate one of three conditions for the current value:

  • It is within an acceptable range (normal)
     
  • It has increased (or decreased) to a level where there is some concern (warning)
     
  • It has increased (or decreased) to a level where immediate action must be taken (danger)

An example of a threshold-based indicator is the number of extents of a pageset. For most enterprises this number ranges from 1 to 255, depending upon the initial allocation of the pageset and whether it can extend to multiple volumes. Along with the measurement itself (say, 30 extents), you must also define thresholds that designate the boundaries between the normal/warning areas and the warning/danger areas.
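
For illustration, a query like the following classifies the latest extent count of each pageset in the hypothetical DBA.PAGESET_SNAP table against assumed boundaries of 20 extents (warning) and 40 extents (danger):

  -- Classify each pageset's extent count as normal, warning, or danger
  -- (the 20/40 boundaries are illustrative assumptions)
  SELECT DBNAME, TSNAME, EXTENTS,
         CASE
           WHEN EXTENTS >= 40 THEN 'DANGER'
           WHEN EXTENTS >= 20 THEN 'WARNING'
           ELSE 'NORMAL'
         END AS INDICATOR_STATUS
    FROM DBA.PAGESET_SNAP
   WHERE SNAP_TS = (SELECT MAX(SNAP_TS) FROM DBA.PAGESET_SNAP)
   ORDER BY EXTENTS DESC;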

With the indicators now defined, the event definitions become clear. An event occurs when either:

  • A state-based indicator changes state
     
  • A threshold-based indicator value crosses a threshold

In summary, support personnel:

  • Define the conditions and processes
     
  • Define the indicators and events
     
  • Implement data gathering processes to regularly calculate indicator values
     
  • Implement processes to signal events based on indicators
     
  • Implement additional processes that execute based on the events

Automation Efforts

As your automation effort gets going, you go through certain levels of sophistication corresponding to the levels of the CMM. Here, the levels might represent the following:

  1. Produce and gather multiple sources of event, threshold, and statistical data
     
  2. Consolidate, summarize, and report data through management tools
     
  3. Implement monitoring software that correlates and recommends actions
     
  4. Implement action scripts for monitoring software to automatically take action
     
  5. Integrate components for dynamic management

This area of study is called autonomic computing. For more DB2-specific information on this topic, see An Autonomic Computing Roadmap and Understand Autonomic Maturity Levels.

You now have all of the tools required to determine DB2 health.

  • Identified classes of health:
    • Ability to recover from a disaster
       
    • Availability of extra capacity
       
    • Proactive, predictive, self-healing
       
    • Stability
       
    • Maturity
  • Identified categories of DB2 health:

    • Subsystem configuration
       
    • Catalog and directory
       
    • Access Paths
       
    • Data: volumetric and configurational
       
    • Process objects
  • Developed methods for rating the health of our documentation and processes

  • Optionally developed methods for rating the health of our people

  • For each health category defined indicators, events, and processes

  • Optionally implemented these processes as automation

This, in a nutshell, is the way that you go about implementing a DB2 health strategy.

Implementing Health Category Measurement and Automation

This case study company identified several categories of DB2 system health measures. One in particular was Subsystem Configuration. This was broken down into more granular areas as follows:

Subsystem Configuration

  • MVS Environment; DBMS operational state
     
  • DBMS WLM assignments
     
  • Data Sharing and Parallel Sysplex exploitation
     
  • DBMS maintenance
     
  • IRLM configuration
     
  • Disaster Recovery readiness
     
  • ZParm settings
     
  • Logs, Archives, and Log utilization
     
  • Work Files
     
  • Data Sharing
     
  • Virtual Pool sizing, thresholds and tuning
     
  • Global Buffer Pool tuning
     
  • Memory pool sizing (EDM, RID, Sort) and usage
     
  • Processes for regular reporting
     
  • Processes for automated changes

For each area they defined several indicators. For example, the first area is MVS Environment; DBMS operational state. The indicators chosen were:

  • (S) DB2 IRLM address space active
     
  • (S) DB2 MSTR address space active
     
  • (S) DB2 DBM1 address space active
     
  • (S) DB2 DDF address space active
     
  • (T) DB2 subsystem startup: recovery complete
     
  • (T) Production DB2 WLM resource "on"

They implemented measurements for these indicators through a series of REXX procedures and console commands. State changes kicked off console messages, e-mails, and other notifications.

In addition to this monitoring and reporting activity, additional logic combined indicator states and date/time information and stored records in a data warehouse for historical reporting and additional monitoring. As the database grew they were able to spot trends and further refine their measurements and indicator/event definitions.
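
A sketch of how such warehouse data might surface events follows; the table DBA.INDICATOR_HIST and its layout are hypothetical, not the case study company's actual design:

  -- Hypothetical history table: one row per indicator per polling interval
  CREATE TABLE DBA.INDICATOR_HIST
    (SNAP_TS    TIMESTAMP    NOT NULL,
     INDICATOR  VARCHAR(40)  NOT NULL,   -- e.g. 'MSTR ADDRESS SPACE ACTIVE'
     STATE      CHAR(1)      NOT NULL);  -- 'Y' or 'N' for state-based indicators

  -- Events: state-based indicators whose value changed since the prior poll
  SELECT C.INDICATOR,
         P.STATE AS PRIOR_STATE,
         C.STATE AS CURRENT_STATE,
         C.SNAP_TS AS EVENT_TIME
    FROM DBA.INDICATOR_HIST C, DBA.INDICATOR_HIST P
   WHERE C.SNAP_TS = (SELECT MAX(SNAP_TS) FROM DBA.INDICATOR_HIST)
     AND P.SNAP_TS = (SELECT MAX(SNAP_TS) FROM DBA.INDICATOR_HIST
                       WHERE SNAP_TS < C.SNAP_TS)
     AND C.INDICATOR = P.INDICATOR
     AND C.STATE <> P.STATE;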

For another example, look at Virtual Pool sizing, thresholds and tuning. This is a much more granular area, concentrating as it does on the DB2 DBM1 address space and its allocation of memory to the virtual pools.

A set of indicators was defined for each Virtual Pool. Since each pool was used for a certain class of pagesets (e.g., work files, small tables, indexes), the indicator threshold values differed across the pools, as did some of the indicators themselves. Here is a selection of the indicators they defined:

  • (T) Active pages as a percent of pool size
     
  • (T) Deferred Write Threshold exceeded
     
  • (T) Additional thresholds exceeded
     
  • (T) Virtual Pool to Hiperpool movement
     
  • (T) Page writes to DASD per unit time

In these cases, the case study company used the DB2-related SMF records as a data source and analyzed the results with a third-party software tool. As before, they also summarized and timestamped the data and stored it in the data warehouse.
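
As a final sketch (the tables and thresholds below are assumptions for illustration, not the company's actual design), keeping per-pool thresholds in their own table lets one evaluation query serve every pool even though the boundary values differ:

  -- Hypothetical summarized pool statistics and per-pool thresholds
  CREATE TABLE DBA.VPOOL_STATS
    (SNAP_TS     TIMESTAMP  NOT NULL,
     BPNAME      CHAR(8)    NOT NULL,    -- e.g. 'BP1'
     PCT_ACTIVE  SMALLINT   NOT NULL);   -- active pages as a percent of pool size

  CREATE TABLE DBA.VPOOL_THRESHOLDS
    (BPNAME      CHAR(8)    NOT NULL,
     WARN_PCT    SMALLINT   NOT NULL,
     DANGER_PCT  SMALLINT   NOT NULL);

  -- Evaluate the latest snapshot of each pool against its own thresholds
  SELECT S.BPNAME, S.PCT_ACTIVE,
         CASE
           WHEN S.PCT_ACTIVE >= T.DANGER_PCT THEN 'DANGER'
           WHEN S.PCT_ACTIVE >= T.WARN_PCT   THEN 'WARNING'
           ELSE 'NORMAL'
         END AS INDICATOR_STATUS
    FROM DBA.VPOOL_STATS S, DBA.VPOOL_THRESHOLDS T
   WHERE S.BPNAME = T.BPNAME
     AND S.SNAP_TS = (SELECT MAX(SNAP_TS) FROM DBA.VPOOL_STATS);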

Summary

The health of your DB2 subsystems consists of a combination of factors. The system itself is a combination of software, processes, documentation, and people. Good health can be defined as a blend of recoverability, capacity, self-healing, stability, maturity, and more.

Smart companies will embark on a course that includes the Capability Maturity Model as a method of organizing and guiding their health strategy. It also has the side effect of helping them implement quality improvements in the form of Best Practices. As you begin to implement these practices, attempt to be proactive. Implement automation when you can, and gather and store performance data for later trend analysis.

Finally, realize that this is not a one-time effort but an ongoing process. The point of a health check is to implement a process of continuous quality improvement. While it's nice to have a healthy DB2, it's even better to keep it that way.