How to Set Up Failover Monitoring

To enable failover for the calculation flow processor, set the ServerRole "Calculation Flow Failover Processor" on the node(s) that should take over processing if the master node (i.e. a node with the "Calculation Flow Processor" role) fails.

Detailed scenario:

  1. The master node becomes unavailable. 
  2. The other nodes notice this during their heartbeat call (they see an expired distributed lock) and increase the "failover counter". This counter can be used for monitoring.
    Processing does not start immediately. It waits for the 2nd heartbeat to avoid collisions with a master that is restarting or starting slowly.
  3. On the next heartbeat, the first node picks up the work and starts processing it.
  4. Once the master is available again, the failover node notices, increases the "recovery counter", and lets the master take over its work again.
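
The scenario above can be pictured as a small state machine on the failover node. The following Python sketch is only an illustration under stated assumptions, not the product's implementation: the class FailoverNode, its states and the master_lock_expired flag are hypothetical names, and counting "failoverProcessing" once per processing heartbeat is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class FailoverNode:
    """Toy model of a failover node (hypothetical, for illustration only)."""
    counters: dict = field(default_factory=lambda: {
        "failover": 0, "failoverProcessing": 0, "recovery": 0})
    state: str = "standby"  # standby -> armed -> processing

    def heartbeat(self, master_lock_expired: bool) -> str:
        if master_lock_expired:
            if self.state == "standby":
                # First heartbeat that sees the expired lock: only get ready,
                # so a restarting master still has a chance to come back.
                self.counters["failover"] += 1
                self.state = "armed"
            else:
                # Following heartbeats: take over and process.
                # Assumption: the counter grows once per processing heartbeat.
                self.counters["failoverProcessing"] += 1
                self.state = "processing"
        elif self.state != "standby":
            # Master is back: count the recovery and step aside.
            # Assumption: this also applies if processing had not started yet.
            self.counters["recovery"] += 1
            self.state = "standby"
        return self.state


# Master healthy, then down for three heartbeats, then back.
node = FailoverNode()
states = [node.heartbeat(expired) for expired in [False, True, True, True, False]]
print(states)         # ['standby', 'armed', 'processing', 'processing', 'standby']
print(node.counters)  # {'failover': 1, 'failoverProcessing': 2, 'recovery': 1}
```

The deliberate one-heartbeat delay between "armed" and "processing" is what the sketch is meant to show; heartbeat timing and lock handling in a real deployment are outside its scope.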

When there are two nodes:

  • If both are masters (i.e. both have the "Calculation Flow Processor" role), only one of them is allowed to process.
  • If both are failover nodes and neither is a master, it works the same as described above, but it takes longer to decide which node will process (the failover node starts on the 3rd heartbeat).
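
The "only one node processes" guarantee is the kind of behavior a lock with an expiry time provides. Here is a minimal in-memory sketch, assuming a lock that must be renewed on each heartbeat. ExpiringLock, its ttl parameter and the node names are hypothetical, and the real distributed lock is not described in this article.

```python
class ExpiringLock:
    """In-memory stand-in for a distributed lock with a time-to-live."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self.owner = None
        self.expires_at = 0.0

    def try_acquire(self, node: str, now: float) -> bool:
        # A node gets the lock if it is free, already its own (renewal),
        # or if the previous owner let it expire.
        if self.owner in (None, node) or now >= self.expires_at:
            self.owner = node
            self.expires_at = now + self.ttl
            return True
        return False


lock = ExpiringLock(ttl=30.0)
print(lock.try_acquire("node-a", now=0.0))   # True: node-a processes
print(lock.try_acquire("node-b", now=10.0))  # False: lock is held by node-a
print(lock.try_acquire("node-b", now=45.0))  # True: node-a stopped renewing
```

The sketch does not model the extra heartbeats that failover nodes wait before taking over. It only shows why two nodes with the same role cannot process at the same time.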

The counters mentioned above are metrics counters and are exported to monitoring, so you can build your own alerts on them. The following counters are found under "Tasks" in the metric class:

  • BackgroundCalculationFlowTaskWorker.failover – Incremented whenever a master vanishes and another node gets ready to take over.
  • BackgroundCalculationFlowTaskWorker.failoverProcessing – Incremented whenever a failover node does the processing.
  • BackgroundCalculationFlowTaskWorker.recovery – Incremented whenever the master comes back and is expected to take over the processing within the next heartbeat.
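
One way to turn these counters into alerts is to compare two consecutive snapshots and react to any increase. The sketch below assumes the monitoring system can deliver the counters as plain name-to-value dictionaries. The function failover_alerts and the message texts are hypothetical, and how the values are actually fetched depends on your monitoring setup.

```python
PREFIX = "BackgroundCalculationFlowTaskWorker."


def failover_alerts(previous: dict, current: dict) -> list:
    """Return alert messages for counters that grew between two snapshots."""

    def delta(name: str) -> int:
        key = PREFIX + name
        # Missing counters are treated as 0 (assumption).
        return current.get(key, 0) - previous.get(key, 0)

    alerts = []
    if delta("failover") > 0:
        alerts.append("master lost: a failover node is getting ready")
    if delta("failoverProcessing") > 0:
        alerts.append("a failover node is processing")
    if delta("recovery") > 0:
        alerts.append("master recovered")
    return alerts


before = {PREFIX + "failover": 2}
after = {PREFIX + "failover": 3, PREFIX + "failoverProcessing": 1}
print(failover_alerts(before, after))
# ['master lost: a failover node is getting ready', 'a failover node is processing']
```

A counter that resets (for example after a node restart) produces a negative difference, which this sketch ignores.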