Manual Recovery
Overview
This document covers disaster scenarios in which the Cluster metadata state becomes corrupted, so that the information about streams, partitions, or blocks no longer matches the content in the local Timebase storage.
Possible causes of metadata corruption include:
- A major outage involving more than half of the Cluster servers
- A logical error in the code base
- An error in the Cluster maintenance procedures
- Damage to metadata files on more than half of the Cluster members
In such cases, the built-in automatic Cluster recovery mechanisms may not work. The following section provides instructions for manual metadata recovery.
Manual Recovery Procedure
To recover a Cluster state from local data:
- Stop the Cluster.
- Run the tb-read-metadata tool, which reads stream metadata and exports it to a file for each Cluster member.
- Put all metadata files generated during the previous step into a single location that has network access to the Cluster, for example, one of the Cluster servers.
- Start up the Cluster but do not allow clients to write new data to it yet. To achieve this, you may need to isolate the Cluster servers from clients.
- Run the tb-merge-metadata tool, which merges the metadata files and loads the results to the Cluster.
- Allow data write access to the Cluster.
Manual Recovery Example
Assume a Cluster of three nodes that is stopped and has data in its streams. The Cluster directory layout looks like this:
/tbcluster
    /server-1
        /timebase
    /server-2
        /timebase
    /server-3
        /timebase
To restore the Cluster on a local machine:
- Ensure all servers are stopped.
- Take a metadata snapshot for each node by running the following commands. If you have a local setup, you do not need to move files around because they are already on the same machine.
  # tb-read-metadata.sh -dbDir /tbcluster/server-1/timebase -outDir /tmp/node-snapshot -memberKey member1
  # tb-read-metadata.sh -dbDir /tbcluster/server-2/timebase -outDir /tmp/node-snapshot -memberKey member2
  # tb-read-metadata.sh -dbDir /tbcluster/server-3/timebase -outDir /tmp/node-snapshot -memberKey member3
- Start up the Cluster but do not allow clients to connect to it or write data into it yet.
- Merge the snapshots and load the result to the Cluster by running the following command:
  # tb-merge-metadata.sh -in /tmp/node-snapshot -dbUrl dxctick://localhost:8011 -chunkSize 1024
Note: The -chunkSize argument is in bytes and is optional. The default value is 1 MB.
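On a Cluster with more than a few nodes, typing one tb-read-metadata.sh command per node by hand is error-prone. The following is a minimal sketch, not part of the product, of a helper that prints one snapshot command per node so the commands can be reviewed before they are run. It assumes that nodes live under <root>/server-N/timebase and that the member key of node N is memberN, as in the example above. The function name snapshot_commands is hypothetical, and real member keys may follow a different scheme.

```shell
#!/bin/sh
# Print one tb-read-metadata.sh command per node directory under a cluster root.
# Assumptions (taken from the example layout, not from the tool itself):
#   - each node's data directory is <root>/server-N/timebase
#   - the member key of node N is "memberN"
# Nothing is executed here; the commands are only printed for review.
snapshot_commands() {
    root=$1   # cluster root, for example /tbcluster
    out=$2    # shared output directory, for example /tmp/node-snapshot
    for dir in "$root"/server-*/timebase; do
        [ -d "$dir" ] || continue     # skip when the glob matched nothing
        node=${dir%/timebase}         # strip the trailing /timebase
        n=${node##*/server-}          # keep only the node number
        printf '%s\n' "tb-read-metadata.sh -dbDir $dir -outDir $out -memberKey member$n"
    done
}
```

For the layout in this example, calling snapshot_commands /tbcluster /tmp/node-snapshot prints three commands that use the same arguments as the ones listed above. Once the output looks right, it can be saved to a file and run with sh, provided tb-read-metadata.sh is on the PATH.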