Controller fault tolerance

The threshold for controller fault tolerance is 2n+1: a team must contain 2n+1 controllers to tolerate the failure of n of them. HP VAN SDN Controller teaming supports a team of three controllers, so n = 1, and one controller in a team of three can fail without suspending team operation. If one controller does go down, a stateful failover occurs, in which the remaining two members resume together from the point of failure, and the team continues to operate. As long as any two of the teamed controllers in the network are active, the network remains in a managed state.

If a second controller in the team fails, the number of failed controllers exceeds n, and the remaining controller transitions to the Suspended state. A controller that is suspended because it is in the minority returns a 503 error to any REST API call. When at least two controllers in the team are active again, a new team manager is elected and team operation resumes.
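A REST client can treat a 503 response as a signal to try another team member. The following is a minimal client-side sketch of that idea, not part of the controller software. The function name first_available, the fetch callable, and the URLs in the usage note are hypothetical; fetch stands in for whatever HTTP library the client uses and is assumed to return a (status code, body) pair.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical transport: takes a URL and returns (HTTP status code, response body).
Fetch = Callable[[str], Tuple[int, str]]


def first_available(
    base_urls: Iterable[str], path: str, fetch: Fetch
) -> Optional[Tuple[str, str]]:
    """Return (base_url, body) from the first controller that does not answer 503.

    Any status other than 503 is handed back to the caller unchanged, so the
    caller still has to deal with ordinary errors such as 401 or 404.
    """
    for base in base_urls:
        try:
            status, body = fetch(base + path)
        except OSError:
            # The controller is down or not reachable from this client.
            continue
        if status == 503:
            # The controller is suspended (for example, in the minority).
            continue
        return base, body
    # Every controller was unreachable or suspended.
    return None
```

With a team of three, the caller would pass the three controller base URLs in order of preference and a fetch function built on its HTTP library. A return value of None means that no controller in the list was able to serve the call.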


NOTE: In teamed controller operation, maintaining the integrity of the controller state information requires that at least two controllers in a team of three be active at all times.
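The 2n+1 arithmetic can be checked with a short sketch. The function names below are illustrative only. The generalization to team sizes other than three is an assumption, since the text above describes only teams of three controllers.

```python
def tolerated_failures(team_size: int) -> int:
    """Largest n such that 2n + 1 <= team_size."""
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    return (team_size - 1) // 2


def quorum(team_size: int) -> int:
    """Smallest number of active controllers that forms a strict majority."""
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    return team_size // 2 + 1


def team_operational(team_size: int, active: int) -> bool:
    """True while the active controllers still form a majority of the team."""
    return active >= quorum(team_size)
```

For a team of three, tolerated_failures(3) is 1 and quorum(3) is 2, so team_operational(3, 2) is True and team_operational(3, 1) is False, which matches the behavior described above.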

A controller might transition to the Suspended state for health reasons. In addition, when quorum is lost during a network partition, some of the services (link, node, device, and so on) are not operational on the side of the partition that has a single controller. This behavior maintains high consistency. The following summarizes the controller states:

  • Active: The controller is healthy.

  • Suspended: The controller is unhealthy.

  • Unreachable: If the connection between two controllers is broken, they see each other as unreachable.

Considerations:

  • A system never sees itself as unreachable. Unreachable is a state for the remote controllers.

  • Health does not affect cluster quorum. If two controllers in a team of three are unhealthy, the third one remains active as long as there is a link between the controllers.
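The states and considerations above can be modeled in a few lines. This is one reading of the text, not the controller's actual implementation: the names State, local_state, and peer_view are hypothetical, and the sketch assumes that quorum is counted from connectivity alone, with health evaluated separately for each controller.

```python
from enum import Enum


class State(Enum):
    ACTIVE = "Active"
    SUSPENDED = "Suspended"
    UNREACHABLE = "Unreachable"


def local_state(healthy: bool, reachable_peers: int, team_size: int = 3) -> State:
    """State a controller assigns to itself. It is never UNREACHABLE."""
    # Quorum counts connected members (this controller plus the peers it can
    # reach), whether or not those peers are healthy.
    connected = reachable_peers + 1
    has_quorum = connected >= team_size // 2 + 1
    if healthy and has_quorum:
        return State.ACTIVE
    return State.SUSPENDED


def peer_view(link_up: bool, peer_state: State) -> State:
    """State a controller sees for a remote peer.

    peer_state is the peer's own state (Active or Suspended); a broken link
    hides it and the peer appears as Unreachable.
    """
    return peer_state if link_up else State.UNREACHABLE
```

Under these assumptions, a healthy controller that still has links to two unhealthy peers evaluates local_state(True, 2) as Active, while a healthy controller isolated by a partition evaluates local_state(True, 0) as Suspended.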