Most technicians and managers equate health with performance; specifically, a DB2 subsystem is healthy if some performance measure or metric remains high. There is a concurrent emphasis on CPU and DASD capacity planning, with many tools available for tracking and predicting the need for future hardware upgrades.
A typical myth is that high-performing systems are healthy, and that making systems healthier involves getting more bang for your buck. However, this view ignores several other important categories of health. It concentrates on efficiency, while neglecting other areas.
Organizations that simply equate health with performance spend lots of time on performance tuning and reacting to real or perceived performance issues. Organizations like this tend to be dealing with symptoms rather than problems, and their typical reaction is to attempt to increase performance, not health. Typical issues in this environment are:
For example, would you consider the following healthy? A DB2 subsystem where:
In fact, there are several classes of health that are combined into a single overall measure. Further, the DB2 subsystem does not exist in a vacuum; to function, it requires support from IT professionals (DBAs, systems programmers, etc.). Even more, as a DBA you have several processes you maintain as a part of your support infrastructure, like regular backups, reorgs, consistency checks, and so forth.
When we think about software health we think about efficiency, code path length, speed, lack of bugs, and so forth. In a broader sense we also consider how easy new features are to use, whether new functions perform well, and how generally fault tolerant the system is.
Overall health, therefore, is a combination of the following:
Further, there are several components to a DB2 subsystem. They include:
So, an overall health measure for a DB2 subsystem consists of a combination of these three classes of health across three categories of items. This is easily visualized as a table:
The myth that health equals performance is an emphasis on software efficiency. While this is important, there are eight other areas on the graph that deserve our attention. Software health is generally well-understood; so, briefly consider the other two areas.
Few consider process health. What does it mean for processes to be efficient, flexible, or fault tolerant? Here are examples of typical processes related to DB2 software:
This category includes the people that support a DB2 subsystem: DBAs, systems programmers, and other support personnel. Typical metrics include:
While it isn't typical to consider whether someone is "fault tolerant", you can still analyze this area in terms of overall system health. For example, there should be more than one person competent in each essential skills area (e.g., database recovery, DB2 system restart, problem diagnosis and repair).
There are many techniques for measuring each health category. A representative sample occurs in the following chart.
The most effective method of developing a strategy for maintaining overall DB2 subsystem health is to extend the definitions given above for each health class: efficiency, flexibility, and fault tolerance. Then create an extended definition for each new area, concentrating on examples of good health.
Here is the strategy matrix for an example company:
The case study company used this template for several comparative analyses, including:
Here is an extended outline of some of these health classes.
Availability of Extra Capacity
Proactive, Predictive, Self-Healing
For more information on autonomic functions, see the IBM DB2 for z/OS Version 8 site at: www-306.ibm.com/software/data/db2/os390/db2zosv8.html
Standardized maintenance process (version, currency)
Regular maintenance upgrades
Software maintenance strategy
Additional information and recommendations on software management may be found in the IBM Red Book, Parallel Sysplex – Software Management for Availability, SG24-5451.
Maturity (CMM Level)
The Capability Maturity Model ™ (CMM) was originally created for the software development process; however, it is relevant to any set of processes used by human beings. We won't cover it here; more information on CMM may be found at: www.sei.cmu.edu/cmm/cmm.html
To measure overall DB2 subsystem health you can break it down into classes of health (efficiency, stability, etc.) and examine the three components of the system: software, processes, and people.
Here are some sample categories. They were derived by beginning with the heath strategy matrix from the Case Study above. Yours will be different depending upon your particular priorities. For example, this company has not implemented data sharing; consequently, there were no areas involving data sharing, global buffer pool, or coupling facility health.
For each category, it is possible to determine specific items for measurement and tuning, which results in the following expanded outline:
Catalog and Directory
Data: Volumetric and Configurational
A note on process objects: These objects are often implemented with little thought given to error checking or reporting. And, since object invocation can be nested (e.g. a Stored Procedure can invoke a user-defined function (UDF), which then performs SQL that causes triggers to fire), these objects need to be designed to respond / react to errors or warning from objects they invoke.
Do you have any stored procedures that issue SQL statements? Do these statements ever generate negative SQLCodes? How would you know? Do all applications (including other stored procedures) react properly to these SQLCodes? How do you know?
It is important to implement a process for gathering and interpreting health data. A standard process should include the following items:
For example, you set up a method of snapshotting pageset sizes and extents on DASD. You store the results in a DB2 table including a timestamp with each measurement. Next, as a team you analyze the results and determine that there is an issue with pagesets that go into multiple extents. Finally, you write a REXX procedure that reads the snapshot data, extracts the dataset names of pagesets in more than (say) 25 extents, and creates jobs to Reorg the pagesets.
After a period of time, review the analyses. You are looking for things on three levels:
In the example above, it isn't enough just to know which pagesets have many extents. You also want to know how fast they are growing (trends), and if there is anything you need to do to improve the process.
Although the Capability Maturity Model (CMM) was originally created for the software development process, can also be used as a framework for organizing a quality improvement program for documentation and processes.
Based on the information above, there is now enough information to define measurements and scoring for the various health areas. Each IT department should review itself based on the Capability Maturity Model with an emphasis on classifying measure of health. This discussion will concentrate on DB2 infrastructure support.
You also need to have a way of rating what you do. Here is the scale used by one company:
This company chose to concentrate on Documentation and Processes. (The third area, People, was assigned as a separate project.) Our case study company then evaluated each area of its IT subsystems in terms of this chart. They logged the score for each area (Documentation and Processes). When they finished, they reviewed their findings for accuracy and completeness. Next they began to set priorities for improving documentation and processes.
Each IT enterprise should follow this procedure: evaluate current health measures and systems, devise health measures relevant to IT, rate each area, and prioritize areas that require improvement either based on a low score or on your enterprise or departmental goals. Any area receiving a score of zero for either documentation or processes should get a high priority.
The process of updating documentation should follow the levels listed in the chart above. First, ensure that documentation exists for each area. Then ensure it is centrally available to all who might use it, and make certain it can be read and understood by all. Next implement a process where the department regularly reviews and updates the documentation. Last, institute a quality improvement process.
Process updates are more difficult, primarily because of their complexity. Processes such as regular tablespace reorgs or statistics gathering may not be implemented consistently across the enterprise, or even across a single DB2 subsystem.
Strategies and tactics for improving documentation and processes are part of the CMM.
Part of any review of IT processes and procedures is looking at automation. Before describing automation that is part of the DBMS we need to review that which is either constructed by DBAs or support staff or is implemented as part of a third-party software tool.
Most shops define the automation they desire with a simple If-Then logic. For example, "If any pageset is in over 30 extents, execute a Reorg utility for the pageset." These needs are then implemented typically as batch jobs that are executed if the condition is true. The conditions are defined in terms of indicators.
There are two types of indicators: state-based, and threshold-based. State-based indicators usually have two values such as On/Off or True/False. Examples of these would be whether or not DB2 is up, whether or not an Active Log is being archived, or whether or not an index is defined as the clustering index.
Threshold-based indicators are used when a measurement varies over a range of values. Usually they indicate one of three conditions for the current value:
An example of a threshold-based indicators is the number of extents of a pageset. For most enterprises this number ranges from 1 to 255 depending upon the initial allocation of the pageset and whether it can extend to multiple volumes. Along with the measurement itself (say, 30 extents) you must also define thresholds that designate the boundaries between the normal/warning areas and the warning/danger areas.
With the indicators now defined the event definitions become clear. An event occurs when either:
In summary: support personnel:
As your automation effort gets going you go through certain levels of sophistication, corresponding to the levels of the CMM. Here, the levels might represent the following:
This area of study is called autonomic computing. For more DB2-specific information on this topic, see An Autonomic Computing Roadmap and Understand Autonomic Maturity Levels.
You now have all of the tools required to determine DB2 health.
Identified categories of DB2 health:
Developed methods for rating the health of our documentation and processes
Optionally developed methods for rating the health of our people
For each health category defined indicators, events, and processes
Optionally implemented these processes as automation
This, in a nutshell, is the way that you go about implementing a DB2 health strategy.
This case study company identified several categories of DB2 system health measures. One in particular was Subsystem Configuration. This was broken down into more granular areas as follows:
For each area they defined several indicators. For example, the first area is MVS Environment; DBMS operational state. The indicators chosen were:
They implemented measurements for these indicators through a series of REXX procedures and console commands. State changes kicked off console messages, e-mails, and other notifications.
In addition to this monitoring and reporting activity, additional logic combined indicator states and date/time information and stored records in a data warehouse for historical reporting and additional monitoring. As the database grew they were able to spot trends and further refine their measurements and indicator/event definitions.
For another example, look at Virtual Pool sizing, thresholds and tuning. This is a much more granular area, concentrating as it does on the DB2 DBM1 address space and its allocation of memory to the virtual pools.
A set of indicators was defined for each Virtual Pool. Since each pool was used for a certain class of pagesets (ex: workfiles, small tables, indexes) the indicator threshold values differed across the pools as well as some of the indicators themselves. Here is a selection of the indicators they defined:
In these cases, the case study company used the DB2-related SMF records as a data source and analyzed the results with a third-party software tool. As before, they also summarized and timestamped the data and stored it in the data warehouse.
The health of your DB2 subsystems consists of a combination of factors. The system itself is a combination of software, processes, documentation, and people. Good health can be defined as a blend of recoverability, capacity, self-healing, stability, maturity, and more.
Smart companies will embark on a course that includes the Capability Maturity Model as a method of organizing and guiding your health strategy. It also has the side effect of helping you implement quality improvements in the form of Best Practices. As you begin to implement these practices, attempt to be proactive. Implement automation when you can, and gather and store performance data for later trend analysis.
Finally, realize that this is not a one-time effort but an ongoing process. The point of a health check is to implement a process of continuous quality improvement. While it's nice to have a healthy DB2, it's even better to keep it that way.