Horizontal Scaling

TimeBase Cluster can be horizontally scaled by adding more servers to the cluster. This allows for increased data storage capacity and improved performance by distributing the workload across multiple nodes.

Capacity Planning

When planning the capacity of a TimeBase Cluster, consider the following factors:

  • Data Volume: Estimate the amount of data that will be stored in the cluster.
  • Replication Factor: Determine the replication factor for data blocks, which affects the number of copies stored across servers.
    • If your cluster is supposed to handle single server failure, then the replication factor should be at least 2 (main and one replica). This means that data will occupy twice the space on cluster disks.
  • Additional space for recovery: Your cluster should have enough free space to accommodate the data that has to be re-replicated in case of server failures. For instance, if you have 5 servers and want to tolerate 2 stopped servers, then any 3 servers of your cluster must be able to store all data of the cluster.

Generally, the following formula can be used to estimate the number of servers needed in a TimeBase Cluster; it follows from the requirement that the N − K servers remaining after failures must be able to hold all K + 1 copies of the data:

N = ⌈ (DataVolume / CapacityPerServer) × (K + 1) + K ⌉

Where:

  • N is the number of servers in the cluster.
  • DataVolume is the total amount of data to be stored in the cluster (without duplication).
  • K (K=ReplicationFactor - 1) is the number of servers that can be stopped at the same time (due to failure or maintenance).

Example

Suppose that you have 10 TB of data to store in the cluster, and each server has a capacity of 4 TB. You want to tolerate 1 server failure, so the replication factor is 2.

With K = 1: 10 TB / 4 TB × (1 + 1) + 1 = 6, so you need at least 6 servers in the cluster.

If you want to tolerate 2 server failures, then the replication factor is 3, and the formula becomes:

With K = 2: 10 TB / 4 TB × (2 + 1) + 2 = 9.5, so you need at least 10 servers in the cluster.
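
For quick estimates, this calculation can be scripted. The snippet below is a minimal sketch of the formula above; the class and method names are illustrative only and are not part of any TimeBase API.

```java
public class CapacityPlanner {

    /**
     * Estimate the number of servers needed:
     * N = ceil(dataVolume / capacityPerServer * (k + 1) + k)
     *
     * @param dataVolumeTb        total data to store, without duplication (TB)
     * @param capacityPerServerTb usable disk capacity of one server (TB)
     * @param k                   number of servers that may be stopped at once
     *                            (k = replication factor - 1)
     */
    static int serversNeeded(double dataVolumeTb, double capacityPerServerTb, int k) {
        return (int) Math.ceil(dataVolumeTb / capacityPerServerTb * (k + 1) + k);
    }

    public static void main(String[] args) {
        // 10 TB of data, 4 TB per server, tolerate 1 failure (replication factor 2)
        System.out.println(serversNeeded(10, 4, 1)); // 6
        // Same data, tolerate 2 failures (replication factor 3)
        System.out.println(serversNeeded(10, 4, 2)); // 10
    }
}
```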

Throughput Scaling

TimeBase Cluster scaling increases overall throughput and read performance by distributing the workload across multiple servers in the cluster.

This is especially useful when the amount of data being written into TimeBase exceeds the maximum throughput of a single server.

caution

Horizontal scaling does not improve the performance of a single loader or cursor unless the existing setup is overloaded.

Throughput Scaling Test

To illustrate the throughput scaling capabilities of TimeBase Cluster, we conducted a scalability test with the following parameters:

  • Cluster Size: 3, 6, 9, and 12 servers, AWS i3en.6xlarge instances, 2.50 GHz × 24 logical cores each, NVMe SSD.
  • Client machine: single AWS m5n.8xlarge instance, 2.50 GHz × 32 logical cores.
  • Messages: deltix.timebase.api.messages.BarMessage messages, ~30 bytes each in TimeBase wire binary format.

Benchmark logic:

  • Create P producer/consumer pairs, where each producer writes messages into a single partition of a stream and one consumer reads messages from that partition. Each pair therefore works with its own partition and is independent of the other pairs.
  • 60 seconds of warmup
  • 120 seconds of measurement

The value of P is selected as follows:

  • For 3 servers: select the value that yields the maximum throughput.
  • For the rest of the cluster sizes: increase P proportionally to the number of servers in the cluster, so that each server handles the same number of producer/consumer pairs at every cluster size.
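
A rough skeleton of this harness in plain Java is sketched below. It is only an illustration of the benchmark structure under the stated parameters (P independent pairs, 60 s warmup, 120 s measurement, P = 8 as in the 3-server run); the writeMessage/readMessage calls are hypothetical placeholders standing in for a TimeBase loader writing to a partition and a cursor reading from it, not actual API calls.

```java
import java.util.concurrent.atomic.LongAdder;

/** Sketch of the throughput benchmark: P independent producer/consumer pairs,
 *  each bound to its own stream partition. Placeholder I/O methods stand in
 *  for TimeBase loader and cursor operations. */
public class ThroughputBenchmarkSketch {

    static final int PAIRS = 8;              // P: one pair per partition (3-server run)
    static final long WARMUP_MS = 60_000;    // 60 seconds of warmup
    static final long MEASURE_MS = 120_000;  // 120 seconds of measurement

    public static void main(String[] args) throws InterruptedException {
        LongAdder[] counters = new LongAdder[PAIRS];
        Thread[] threads = new Thread[PAIRS * 2];

        for (int p = 0; p < PAIRS; p++) {
            final int partition = p;
            counters[p] = new LongAdder();
            // Producer: writes messages into its own partition.
            threads[p * 2] = new Thread(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    writeMessage(partition);      // hypothetical loader call
                }
            });
            // Consumer: reads from the same partition and counts messages.
            threads[p * 2 + 1] = new Thread(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    readMessage(partition);       // hypothetical cursor call
                    counters[partition].increment();
                }
            });
        }
        for (Thread t : threads) t.start();

        Thread.sleep(WARMUP_MS);                  // warmup window
        for (LongAdder c : counters) c.reset();   // discard warmup counts

        Thread.sleep(MEASURE_MS);                 // measurement window
        long total = 0;
        for (LongAdder c : counters) total += c.sum();
        for (Thread t : threads) t.interrupt();

        System.out.printf("Total: %.2f M msg/s, per consumer: %.0f k msg/s%n",
                total / (MEASURE_MS / 1000.0) / 1e6,
                total / (MEASURE_MS / 1000.0) / PAIRS / 1e3);
    }

    // Placeholders: in the real test these are TimeBase loader/cursor operations.
    static void writeMessage(int partition) { /* send one message */ }
    static void readMessage(int partition)  { /* receive one message */ }
}
```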

The following table shows how overall ingest throughput and read throughput scale with the number of servers in the cluster.

Servers | Producer/Consumer Pairs | Total Throughput | Throughput per Consumer (Avg)
3       | 8                       | 2.3 M msg/s      | 288 k msg/s
6       | 16                      | 4.4 M msg/s      | 275 k msg/s
9       | 24                      | 6.3 M msg/s      | 263 k msg/s
12      | 32                      | 8.3 M msg/s      | 259 k msg/s

As the table shows, the overall throughput scales near-linearly with the number of servers in the cluster, while the throughput per consumer (total throughput divided by the number of pairs, e.g. 2.3 M / 8 ≈ 288 k msg/s) remains approximately constant.

info

Note that in this experiment all producer/consumer pairs ran on a single client machine with high but finite resources, which may slightly decrease the throughput per consumer as the number of pairs grows.