
Failover & Replication

This page describes concepts and considerations for handling generic failures among cluster members, such as hardware failures or power outages, that leave one or more cluster nodes unable to perform their intended functions.

Goals

The cluster handles member loss by:

  • Preserving data as long as the number of a stream's lost members is less than its replication factor.
  • Ensuring the cluster remains accessible for reads and writes after member loss.

The cluster also aims to support late joins.

Success Criteria

Failure handling is successful when the following criteria are met:

  • Streams with a replication factor (R) less than or equal to the number of cluster members (M) have all blocks on at least R servers (R ≤ M).
  • Streams with a replication factor greater than the number of cluster members (R > M) store all blocks on all available M servers.
  • Operations can continue as before, except when replication criteria for a specific stream (R > M) cannot be met.
  • Long-running operations (e.g., replication, purge, schema change) active during a failure must either fully complete or be rolled back if completion is not possible.

In addition, disruption to connected clients must be minimized wherever possible.
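
Taken together, the first two criteria say that a stream's effective replication target is min(R, M). The sketch below illustrates that rule; the class and method names are illustrative and not part of any real API.

```java
// Illustrative sketch of the effective replication target for a stream.
// ReplicationMath and its methods are hypothetical, not a real TimeBase API.
public final class ReplicationMath {

    /**
     * A stream with replication factor R on a cluster with M available members
     * should keep each block on min(R, M) servers:
     *   R <= M: every block on at least R servers
     *   R >  M: every block on all M available servers
     */
    public static int effectiveReplicationTarget(int replicationFactor, int availableMembers) {
        if (replicationFactor <= 0 || availableMembers <= 0) {
            throw new IllegalArgumentException("replicationFactor and availableMembers must be positive");
        }
        return Math.min(replicationFactor, availableMembers);
    }

    /** A block is under-replicated when it has fewer copies than the effective target. */
    public static boolean isUnderReplicated(int copies, int replicationFactor, int availableMembers) {
        return copies < effectiveReplicationTarget(replicationFactor, availableMembers);
    }
}
```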

Recovery Priorities

The recovery priorities from highest to lowest are as follows:

  1. Restore the ability to write new data.
    Writing is prioritized over reading because writers have limited buffering, so unwritten data is at risk of being lost.
  2. Restore the ability to read live data.
  3. Restore the ability to read historical data.
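
One way to capture this ordering in code is an enum whose declaration order mirrors the priority order; this is a hypothetical sketch, not the actual implementation.

```java
// Hypothetical sketch of the recovery priority ordering described above;
// enum declaration order mirrors the documented order (lower ordinal = higher priority).
public enum RecoveryPriority {
    RESTORE_WRITES,           // 1. ability to write new data (highest)
    RESTORE_LIVE_READS,       // 2. ability to read live data
    RESTORE_HISTORICAL_READS  // 3. ability to read historical data (lowest)
}
```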

Implementation

Startup States

When a cluster node starts up, it transitions through the following states:

  1. Offline: Node is down.
  2. Initialization: Timebase server starts up and performs internal consistency checks. It doesn't accept connections from clients or communicate with the cluster in this state.
  3. Syncing: Timebase is up, internal checks are complete, the node has joined the cluster, and it synchronizes its local stream state with cluster metadata. It can accept client and server connections, but the leader does not yet allocate new blocks or failover tasks to it.
  4. Live: Node is fully operational.
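
A hypothetical sketch of these states and their allowed transitions is shown below; the enum and method names are illustrative and do not reflect actual TimeBase classes.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch of the node startup state machine described above.
public enum NodeState {
    OFFLINE,         // node is down
    INITIALIZATION,  // internal consistency checks; no client or cluster connections
    SYNCING,         // joined the cluster, syncing stream state; no new blocks or failover tasks yet
    LIVE;            // fully operational

    /** States this node may move to directly from the current state. */
    public Set<NodeState> allowedTransitions() {
        switch (this) {
            case OFFLINE:        return EnumSet.of(INITIALIZATION);
            case INITIALIZATION: return EnumSet.of(SYNCING, OFFLINE);
            case SYNCING:        return EnumSet.of(LIVE, OFFLINE);
            case LIVE:           return EnumSet.of(OFFLINE);
            default:             return EnumSet.noneOf(NodeState.class);
        }
    }

    /** Client and server connections are accepted from the Syncing state onwards. */
    public boolean acceptsConnections() {
        return this == SYNCING || this == LIVE;
    }
}
```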

Note that any node can be marked for maintenance, with a maintenance timestamp recorded in Raft. Nodes marked for maintenance are not treated as failed while offline, as long as the current time is before the maintenance timestamp.
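
A minimal sketch of that check, assuming the maintenance deadline is stored as an epoch-millisecond timestamp (the parameter names and representation are assumptions):

```java
import java.time.Instant;

// Illustrative sketch: deciding whether an offline node is treated as failed.
public final class MaintenanceCheck {

    /**
     * An offline node marked for maintenance is NOT treated as failed
     * while the current time is still before its maintenance timestamp.
     */
    public static boolean isTreatedAsFailed(boolean offline,
                                            Long maintenanceDeadlineMillis, // null when not in maintenance
                                            Instant now) {
        if (!offline) {
            return false;                 // only offline nodes can be considered failed
        }
        if (maintenanceDeadlineMillis == null) {
            return true;                  // offline without a maintenance status => failed
        }
        // Failed only once the maintenance window has expired.
        return !now.isBefore(Instant.ofEpochMilli(maintenanceDeadlineMillis));
    }
}
```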

Task Queue

The cluster leader maintains a prioritized task queue that holds different types of tasks. Each task records what was scheduled for replication and at what time.

Task types include:

  • ACTIVE_BLOCK_REPLICATION
  • FINISHED_BLOCK_REPLICATION
  • PARTITION_REPLICATION
  • STREAM_REPLICATION

The cluster leader takes tasks from the queue and assigns them to specific cluster members. A task cannot be assigned to an arbitrary cluster member.

For instance, node Y cannot be assigned a replication task for block X if Y is already in the replication list for that block.

The queue is not logged in Raft: because valid assignments depend on the current cluster membership, the queue must be reconstructed after a leader switch from the current cluster state in any case.
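
A hypothetical sketch of such a queue is given below: task types are ordered by priority, and the eligibility check skips members that already hold a replica of the block. None of these classes belong to a real TimeBase API.

```java
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.Set;

// Hypothetical sketch of the leader's prioritized replication task queue.
public final class ReplicationTaskQueue {

    public enum TaskType {                 // declaration order doubles as priority (highest first)
        ACTIVE_BLOCK_REPLICATION,
        FINISHED_BLOCK_REPLICATION,
        PARTITION_REPLICATION,
        STREAM_REPLICATION
    }

    /** What was scheduled for replication, when, and which members already hold it. */
    public record Task(TaskType type, String blockId, long scheduledAtMillis,
                       Set<String> currentReplicas) {}

    private final PriorityQueue<Task> queue = new PriorityQueue<>(
            Comparator.comparing(Task::type)                        // higher-priority type first
                      .thenComparingLong(Task::scheduledAtMillis)); // then oldest first

    public void add(Task task) {
        queue.add(task);
    }

    /** A member is eligible only if it is not already in the block's replication list. */
    public boolean isEligible(Task task, String memberId) {
        return !task.currentReplicas().contains(memberId);
    }

    /** Remove and return the next task the given member may execute, or null if none. */
    public Task pollFor(String memberId) {
        // Simple scan in priority order; a production queue would be more careful.
        for (Task task : queue.stream().sorted(queue.comparator()).toList()) {
            if (isEligible(task, memberId)) {
                queue.remove(task);
                return task;
            }
        }
        return null;
    }
}
```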

Queue Processing

To keep processing balanced and avoid overloading any single node, each node is assigned at most K tasks at a time (for example, K = 2). A task's validity must be verified before execution, because the queue may contain obsolete tasks that are no longer relevant.
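
A sketch of this assignment loop, building on the queue sketch above; the validity check and in-flight bookkeeping are assumptions, not the real implementation.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of balanced task assignment with a per-node cap K.
public final class TaskAssigner {

    private static final int MAX_TASKS_PER_NODE = 2;   // K = 2 in the example above

    private final ReplicationTaskQueue queue;
    private final Map<String, Integer> inFlightByNode; // memberId -> running task count

    public TaskAssigner(ReplicationTaskQueue queue, Map<String, Integer> inFlightByNode) {
        this.queue = queue;
        this.inFlightByNode = inFlightByNode;
    }

    /** Assign queued tasks to live members without exceeding K tasks per node. */
    public void assign(List<String> liveMembers) {
        for (String member : liveMembers) {
            while (inFlightByNode.getOrDefault(member, 0) < MAX_TASKS_PER_NODE) {
                ReplicationTaskQueue.Task task = queue.pollFor(member);
                if (task == null) {
                    break;                              // nothing this member can work on
                }
                if (!isStillRelevant(task)) {
                    continue;                           // drop obsolete tasks before execution
                }
                dispatch(member, task);
                inFlightByNode.merge(member, 1, Integer::sum);
            }
        }
    }

    /** Placeholder: re-check against current cluster metadata that the task is still needed. */
    private boolean isStillRelevant(ReplicationTaskQueue.Task task) {
        return true;                                    // real logic would consult current stream state
    }

    /** Placeholder: hand the task to the member for execution. */
    private void dispatch(String member, ReplicationTaskQueue.Task task) {
        // e.g. send an RPC to the member
    }
}
```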

Queue processing is triggered in the following cases:

  • Trigger: A cluster member switches from LIVE to OFFLINE without a maintenance status.
    Response: Add replication tasks for the offline member's streams and assign them to live cluster members.
  • Trigger: An OFFLINE cluster member's maintenance status expires.
    Response: Add replication tasks for the offline member's streams and assign them to live cluster members.
  • Trigger: A cluster member finishes processing a replication task and is ready to start the next task.
    Response: Try to assign a new replication task to this cluster member.
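
These triggers could be wired to the queue and assigner roughly as follows; the event handler names and wiring are illustrative only.

```java
import java.util.List;

// Hypothetical sketch of how the leader might react to the triggers above.
public final class QueueProcessingTriggers {

    private final ReplicationTaskQueue queue;
    private final TaskAssigner assigner;
    private final List<String> liveMembers;

    public QueueProcessingTriggers(ReplicationTaskQueue queue, TaskAssigner assigner,
                                   List<String> liveMembers) {
        this.queue = queue;
        this.assigner = assigner;
        this.liveMembers = liveMembers;
    }

    /** Member went LIVE -> OFFLINE without maintenance, or its maintenance status expired. */
    public void onMemberConsideredFailed(List<ReplicationTaskQueue.Task> tasksForItsStreams) {
        tasksForItsStreams.forEach(queue::add);          // add replication tasks for its streams
        assigner.assign(liveMembers);                    // and hand them out to live members
    }

    /** Member finished a replication task and is ready for the next one. */
    public void onTaskFinished(String memberId) {
        // Assumes the member's in-flight count was decremented when the task completed.
        assigner.assign(List.of(memberId));              // try to assign a new task to this member
    }
}
```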