Designing an application for high availability

After your application goes online, the burden of keeping it online is often carried by the operations staff. The operations staff must do what they can to prevent the application from going down and, if it does go down, get the application back online quickly. The role of the application designer and developer is to prevent application outages when possible, and to provide instrumentation to assist operators with recovery operations and with their analysis of application failures.

Your role as application designer and developer is to build in instrumentation that helps operators prevent, detect, analyze, resolve, and recover from failures.

In other words, you should provide the same kinds of functions in the business application that Hewlett Packard Enterprise provides for control of many subsystems.

If you use standard interfaces for reporting application events and for receiving commands, you can write applications that automatically perform the majority of operations tasks. This approach reduces the burden on the operations staff and provides a quicker response to state changes in the application, which increases application availability.

Reducing application downtime by providing instrumentation

The key to reducing application downtime through instrumentation is understanding what happens when a problem occurs and what must happen before the problem is fixed and the application is back online.

If a problem occurs that takes the application offline, the following tasks must finish before the application comes back online:

  1. Detect the failure.

    Some mechanism must be in place to detect and report the failure to a human or automated operator.

  2. Analyze the failure.

    The failure must be analyzed to determine exactly what went wrong and the circumstances under which the failure occurred.

  3. Resolve the failure.

    The failure must be fixed. For example, a switch must be brought back online, disk space must be allocated, or some other action must be taken.

  4. Recover from the failure.

    The program or operator must take action to resume normal application operations after the failure is resolved. These actions can be as automated as receiving notification that a resource is back online and continuing program execution, or as manual as requiring the operator to restart the application. Other approaches might involve determining that the failure cannot be resolved and taking actions to work around the failure.

Because any of these tasks can represent a significant amount of downtime for an application, reducing application downtime involves the following:

  • Avoiding the occurrence of the problem.

  • If the problem cannot be avoided, using automated procedures (instrumentation) wherever possible to detect, analyze, and correct the problem in the shortest possible time.

Instrumenting for failure prevention

Instrumentation for failure prevention includes:

  • Providing a command interface to monitor and control critical objects within an application. Such an interface allows the human or automated operator to query the status of application objects and perform preventive action to ensure the continued availability of the application.

  • Generating and capturing events that indicate that a resource has crossed a critical threshold; for example, a disk is 95 percent full. Operators can use that information to take action to prevent a subsequent outage.

  • Taking and reporting performance measurements. Human or automated operators can then take preventive measures before significant performance degradation becomes critical.
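The threshold events described above can be illustrated with a minimal Python sketch. It is not taken from any Hewlett Packard Enterprise interface: the names ThresholdEvent and check_threshold are hypothetical, and the 95 percent default simply mirrors the disk example in the list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ThresholdEvent:
    """Hypothetical event record; not part of any vendor API."""
    resource: str
    percent_used: float
    threshold: float


def check_threshold(resource: str, used: int, total: int,
                    threshold: float = 95.0) -> Optional[ThresholdEvent]:
    """Return an event if usage has reached the threshold, otherwise None."""
    if total <= 0:
        raise ValueError("total must be positive")
    percent_used = 100.0 * used / total
    if percent_used >= threshold:
        # The caller decides where the event goes (log, console, event collector).
        return ThresholdEvent(resource, percent_used, threshold)
    return None
```

In practice the used and total values could come from a call such as shutil.disk_usage(path) in the Python standard library, and the returned event would be forwarded to whatever event mechanism the operations staff already monitors.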

Instrumenting for failure detection

Instrumentation for failure detection includes generating and capturing events indicating that a critical object has gone offline; for example, a connection to an external database is down or an application software error has been detected. Immediate reactive response keeps downtime to a minimum.
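One way to generate such events is to poll a critical object and report only when its state changes, so that the operator sees one event per transition rather than one per poll. The following Python sketch assumes that approach; AvailabilityMonitor, the probe callable, and the report callable are hypothetical names chosen for illustration.

```python
from typing import Callable


class AvailabilityMonitor:
    """Reports an event each time a monitored object changes state.

    Hypothetical helper: probe() returns True while the object is reachable,
    and report() delivers the event text to a human or automated operator.
    """

    def __init__(self, name: str, probe: Callable[[], bool],
                 report: Callable[[str], None]) -> None:
        self.name = name
        self._probe = probe
        self._report = report
        self._online = True  # assume the object starts online

    def poll(self) -> bool:
        """Run the probe once and report a transition if one occurred."""
        try:
            online = bool(self._probe())
        except Exception:
            # A probe that raises is treated as evidence that the object is down.
            online = False
        if online != self._online:
            state = "online" if online else "offline"
            self._report(f"{self.name} is {state}")
            self._online = online
        return online
```

Calling poll() on a timer, with report bound to the application's event or logging interface, would give the immediate notification that the paragraph above calls for.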

Instrumenting for failure analysis

Although instrumentation cannot provide all the tools and procedures necessary for analyzing every fault, you can help to make fault analysis possible by:

  • Considering failure data capture strategies and functions in the design of your application. For example, consider what failure data might be needed by an automated management application and how your application might provide that data.

  • Including appropriate diagnostic information in alerts and log entries.

  • Providing an interface or using controller services to enable operators to retrieve configuration information, status of objects, and internal usage statistics.
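As a rough illustration of including diagnostic information in log entries, the following Python sketch packages an error and its surrounding context as one structured record that an operator or a management application could parse. The function name failure_record and the field names are assumptions made for this example, not a prescribed format.

```python
import json
import traceback
from datetime import datetime, timezone


def failure_record(component: str, exc: BaseException, context: dict) -> str:
    """Serialize a failure and its context as a single JSON log entry.

    The field names are illustrative; choose the ones your tooling expects.
    """
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "component": component,
        "error_type": type(exc).__name__,
        "message": str(exc),
        # Traceback lines help reconstruct the circumstances of the failure.
        "traceback": traceback.format_exception(type(exc), exc, exc.__traceback__),
        # Application state at the time of failure (configuration, counters).
        "context": context,
    }
    # default=str keeps the entry writable even if context holds unusual types.
    return json.dumps(record, default=str)
```

Writing one self-describing record per failure means the analysis step does not depend on correlating several free-text messages after the fact.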

Instrumenting for failure resolution and recovery

Instrumentation can help resolve the failure and recover the application through a command interface that can alter the status of objects by starting, stopping, suspending, or activating parts of the application.
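A command interface of this kind can be sketched as a small state machine. The Python example below is only an illustration: the class name ManagedObject, the state names, and the set of permitted transitions are assumptions, and a real application would define its own.

```python
class ManagedObject:
    """A part of the application whose status operators can alter by command."""

    # (current state, command) -> new state; hypothetical transition table.
    _TRANSITIONS = {
        ("stopped", "start"): "started",
        ("started", "suspend"): "suspended",
        ("suspended", "activate"): "started",
        ("started", "stop"): "stopped",
        ("suspended", "stop"): "stopped",
    }

    def __init__(self, name: str) -> None:
        self.name = name
        self.state = "stopped"

    def command(self, verb: str) -> str:
        """Apply a command and return the new state, or reject it."""
        key = (self.state, verb)
        if key not in self._TRANSITIONS:
            # Rejecting invalid commands keeps the reported status trustworthy.
            raise ValueError(f"cannot {verb!r} {self.name} while {self.state}")
        self.state = self._TRANSITIONS[key]
        return self.state
```

Because every command either succeeds with a known resulting state or fails explicitly, an automated operator can drive recovery with the same interface a human operator uses.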