Design philosophy for error handling

Highly available applications should check for all possible error conditions and inform the user or the system operator as appropriate. Not every error return indicates a problem with the application software. Many indicate usage errors, such as a user entering the name of a nonexistent file or an out-of-range value. In these circumstances, the application is functioning according to its specification.

For error conditions that might indicate a problem, your application must provide recovery wherever possible and provide adequate diagnostic information when recovery is not possible. Ideally, your application should provide the operator with control over the amount and kind of detail collected for problem diagnosis.
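One way to give the operator control over diagnostic detail is to read the desired level from configuration at startup. The following is a minimal sketch using Python's standard logging module. The environment variable name APP_DIAG_LEVEL and the logger name are assumptions made for illustration, not settings of any real product.

```python
import logging
import os

# Levels the operator may select. The set offered here is an assumption.
LEVELS = {
    "ERROR": logging.ERROR,
    "WARNING": logging.WARNING,
    "INFO": logging.INFO,
    "DEBUG": logging.DEBUG,
}


def configure_diagnostics(env=None):
    """Return a logger whose verbosity is chosen by the operator.

    APP_DIAG_LEVEL is a hypothetical environment variable. Unknown or
    missing values fall back to WARNING so that a typing mistake never
    silences error reporting.
    """
    env = os.environ if env is None else env
    name = env.get("APP_DIAG_LEVEL", "WARNING").upper()
    logger = logging.getLogger("app.diagnostics")
    logger.setLevel(LEVELS.get(name, logging.WARNING))
    return logger
```

An operator could then raise the level to DEBUG while reproducing a problem and lower it again afterward, without any change to the application code.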

Checking for errors

No simple formula exists for determining the correct response that an application must make to an error. The correct response varies greatly depending on the severity of the error, the type of application, the type of user, and so on.

Applications that require high levels of availability, however, should avoid simply terminating when an error occurs, unless some form of corruption is indicated. Your application and supporting software should first try every means possible to keep the application running. If the application must go offline, it should do so as gracefully as possible, for example by leaving any databases in known and consistent states and by creating files that operators and support representatives can use for analysis.

Applications should check for all possible error return values. Errors can occur in several situations, including the following:

  • Communications input/output

  • Database input/output

  • Opening and closing resources

  • Using controller services

  • Communicating between instances of applications on teamed controllers


NOTE: Hewlett Packard Enterprise occasionally adds to the list of errors that a given product might generate. Check the accompanying documentation when you install new revisions of Hewlett Packard Enterprise software and ensure that your application handles any new error codes in an appropriate way.
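Because the set of possible error codes can grow between software revisions, it helps to route every return value through a single table that has an explicit default for codes the application does not recognize. The sketch below illustrates the idea. The numeric codes and the three responses are invented for the example and do not correspond to any actual product's error list.

```python
from enum import Enum


class Action(Enum):
    RETRY = "retry"
    REPORT_TO_USER = "report_to_user"
    TERMINATE = "terminate"


# Hypothetical error codes, chosen only for illustration.
RESPONSES = {
    2: Action.REPORT_TO_USER,   # e.g. the named file does not exist
    11: Action.RETRY,           # e.g. resource temporarily unavailable
    5: Action.TERMINATE,        # e.g. unrecoverable I/O error
}


def classify(code):
    """Map a return code to a response; 0 is assumed to mean success.

    Codes missing from the table (for example, ones introduced by a
    newer software revision) get the most conservative response
    instead of being silently ignored.
    """
    if code == 0:
        return None
    return RESPONSES.get(code, Action.TERMINATE)
```

Treating unknown codes as fatal is only one possible default. An application could equally log them and retry, provided the choice is deliberate.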


Attempting to recover from an error

Many errors indicate a temporary loss of service, which can be corrected simply by retrying the operation. The interval between retries and the number of retries depend on the error. If the error persists or recurs, it might be appropriate to perform a different recovery action, such as trying to communicate with a different resource or terminating the application.
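The retry-then-escalate pattern described above might look like the following sketch. TransientError, the default attempt count, and the linear back-off are assumptions chosen for illustration. A real application would choose them per error type.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


def call_with_retry(operation, attempts=3, delay=0.5, fallback=None,
                    sleep=time.sleep):
    """Run operation, retrying on TransientError up to `attempts` times.

    If every attempt fails, try `fallback` (for example, a different
    resource) when one is given. Otherwise re-raise the last error so
    the caller can decide whether to terminate.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError as exc:
            last_error = exc
            if attempt < attempts:
                sleep(delay * attempt)  # wait a little longer each time
    if fallback is not None:
        return fallback()
    raise last_error
```

The `sleep` parameter exists only so that the waiting can be replaced in tests. Errors other than TransientError are deliberately not caught, so they propagate immediately.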

Generally, your application should attempt recovery whenever a resource is lost. Your program should repeatedly retry the associated operations until the resource reappears, and advise the user as appropriate for the application.

For some errors, it might be appropriate to retreat to a known safe point, such as a transaction boundary. Having done so, your application might be able to continue processing, or at least leave the database in a known, consistent state before terminating.
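As an illustration of retreating to a transaction boundary, the sketch below uses Python's built-in sqlite3 module, whose connection object commits on success and rolls back on an exception when used as a context manager. The accounts table and the transfer function are hypothetical and exist only for this example.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` between two rows atomically.

    Returns True if the transaction committed. On any error the
    transaction is rolled back, so the database stays in its previous
    consistent state, and False is returned.
    """
    try:
        with conn:  # commit on success, roll back on exception
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            row = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if row is None or row[0] < 0:
                raise ValueError("unknown account or insufficient balance")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except (sqlite3.Error, ValueError):
        return False
```

After a failed call, the caller can continue with other work or shut down, knowing that no partial update was left behind.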

For other errors, your application might be designed to continue with delayed or partial functionality. For example, if your client process receives an error indicating that it can no longer access an external database, it can do as much processing as it can locally and send the request to a queue file for later processing.
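The queue-file approach in this example could be sketched as follows. DatabaseUnavailable, the `send` callable, and the one-JSON-object-per-line file layout are assumptions made for illustration.

```python
import json
import os


class DatabaseUnavailable(Exception):
    """Hypothetical error raised when the external database is unreachable."""


def submit(request, send, queue_path):
    """Send a request now, or append it to a queue file for later."""
    try:
        send(request)
        return "sent"
    except DatabaseUnavailable:
        with open(queue_path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(request) + "\n")
        return "queued"


def drain(send, queue_path):
    """Resend queued requests in order; return how many were delivered.

    Requests that still cannot be delivered stay in the queue file.
    """
    if not os.path.exists(queue_path):
        return 0
    with open(queue_path, encoding="utf-8") as fh:
        pending = [json.loads(line) for line in fh if line.strip()]
    sent = 0
    try:
        for request in pending:
            send(request)
            sent += 1
    except DatabaseUnavailable:
        pass  # stop at the first failure and keep the rest queued
    with open(queue_path, "w", encoding="utf-8") as fh:
        for request in pending[sent:]:
            fh.write(json.dumps(request) + "\n")
    return sent
```

This sketch does not guard against a crash while the queue file is being rewritten. A production design would need an atomic replace or a proper queuing service.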

Terminating the application

For a highly available application, process termination should be considered a last resort. However, termination is usually the correct action when an unexpected error occurs, when error detectors report other software faults, when errors indicate potential data corruption, or when recovery attempts repeatedly fail.

On termination, your application should create a file that can be used for further analysis and should send appropriate notifications to users and controller services.
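A minimal sketch of this termination path is shown below. The report layout, the `notify` callback, and the exit codes are assumptions. A real application would send its notifications through whatever operator or controller interface it actually uses.

```python
import json
import time
import traceback


def write_diagnostics(exc, path, context=None):
    """Write a JSON report describing a fatal error to `path`."""
    report = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "error_type": type(exc).__name__,
        "message": str(exc),
        "traceback": traceback.format_exception(
            type(exc), exc, exc.__traceback__
        ),
        "context": context or {},
    }
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(report, fh, indent=2)
    return path


def run(main, diag_path, notify=print):
    """Run main(); on an unhandled error, record it, notify, and return 1."""
    try:
        main()
        return 0
    except Exception as exc:
        write_diagnostics(exc, diag_path)
        notify(f"fatal: {exc!r}; diagnostics written to {diag_path}")
        return 1
```

The value returned by `run` can be passed to `sys.exit` so that supervising software sees a nonzero exit status.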