Architecture

Partitions & Data Blocks

TimeBase (TB) streams consist of one or more partitions, which are essentially logical "substreams." Only a single data loader (writer) can write data to a specific partition in the cluster at any given time.

The cluster splits data from a single partition into a sequence of blocks, referred to as "Data Blocks", according to message timestamp.

In the following diagram, the time axis goes from left to right.

Splitting a partition into blocks allows the cluster to store partitions that are bigger than the disk storage of an individual server.
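
As an illustration, a data block can be thought of as a small metadata record describing one time slice of one partition, plus the list of servers holding its copies. The class and field names below are hypothetical and only sketch the concept; they do not mirror the actual TimeBase implementation.

```java
import java.util.List;

// Hypothetical sketch of the metadata a cluster could keep per data block.
// Names are illustrative only; they are not the real TimeBase classes.
public class DataBlockInfo {
    final String streamKey;       // stream this block belongs to
    final String partitionId;     // partition ("space") within the stream
    final long startTimestamp;    // first message timestamp covered by the block
    final long endTimestamp;      // last covered timestamp; Long.MIN_VALUE while the block is still open
    final List<String> locations; // ordered replication chain: head first, tail last

    DataBlockInfo(String streamKey, String partitionId,
                  long startTimestamp, long endTimestamp, List<String> locations) {
        this.streamKey = streamKey;
        this.partitionId = partitionId;
        this.startTimestamp = startTimestamp;
        this.endTimestamp = endTimestamp;
        this.locations = locations;
    }

    boolean isOpen() {
        return endTimestamp == Long.MIN_VALUE; // newest block has no end timestamp yet
    }
}
```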

Storage

When writing data, a server may initiate the creation of a new data block when the current one reaches a soft limit (about 100 MB, configurable). Multiple blocks can be stored on one server as long as there is sufficient space.

Once the new location is ready, the TimeBase loader writes incoming data to the new data block. As a result, a partition's time range is divided into intervals, with the messages of each interval stored in the corresponding block.
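
A minimal sketch of this rollover rule, assuming a hypothetical writer that tracks the current block's on-disk size and switches to a freshly allocated block once the configurable soft limit is exceeded. The class, method, and constant names are illustrative, not the actual TimeBase API.

```java
// Hypothetical sketch: roll over to a new data block when the current one
// exceeds the configurable soft size limit (~100 MB by default, assumed here).
public class PartitionWriter {
    static final long SOFT_BLOCK_LIMIT_BYTES = 100L * 1024 * 1024;

    private long currentBlockBytes = 0;

    void append(byte[] encodedMessage, long timestamp) {
        if (currentBlockBytes + encodedMessage.length > SOFT_BLOCK_LIMIT_BYTES) {
            // Close the current block: its end timestamp becomes the last written timestamp,
            // and all further messages for this partition go to the new block.
            startNewBlock(timestamp);
            currentBlockBytes = 0;
        }
        writeToCurrentBlock(encodedMessage, timestamp);
        currentBlockBytes += encodedMessage.length;
    }

    private void startNewBlock(long firstTimestamp) { /* allocate block, register its locations */ }

    private void writeToCurrentBlock(byte[] msg, long timestamp) { /* append to local storage */ }
}
```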

Data blocks are indivisible: each block must be stored in its entirety on a single physical machine's file system. When data is relocated or replicated, the complete block is moved or replicated.

The cluster can relocate data blocks during failover-related procedures. When there's neither an active loader for a partition nor ongoing failover procedures, all block copies across servers contain identical content. By default, block distribution seeks to maintain balanced server workloads.

Time Intervals

Data blocks store data for specific time intervals, featuring both startTimestamp and endTimestamp attributes, although endTimestamp may be empty for new blocks. TimeBase cluster exclusively supports "append mode" for loaders, preventing messages with decreasing timestamps within the same partition or "space." Consequently, each block represents an ordered sequence of messages with monotonically non-decreasing timestamps written into a single "space."
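
The "append mode" constraint can be expressed as a simple check on each write: a message whose timestamp is older than the previous one for the same partition is rejected. The sketch below is illustrative, not the real loader code.

```java
// Hypothetical sketch of the append-only rule: within one partition ("space"),
// timestamps must be monotonically non-decreasing.
public class AppendOnlyPartition {
    private long lastTimestamp = Long.MIN_VALUE;

    void append(long timestamp, Object message) {
        if (timestamp < lastTimestamp) {
            throw new IllegalArgumentException(
                "Out-of-order message: " + timestamp + " < " + lastTimestamp);
        }
        lastTimestamp = timestamp;
        // ... write the message into the currently open block ...
    }
}
```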

Time intervals within different blocks can vary greatly in size, as blocks are created based on data size on disk rather than time elapsed or message count. Note that the time ranges of all blocks in a single partition never overlap. Moreover, the time ranges of different partitions are not related to each other and usually have a unique set of borders between blocks.

Finding a message with a given timestamp is easy: within a partition, only one block covers any specific point in time.
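
Because block time ranges within a partition never overlap, locating the block for a given timestamp reduces to a binary search over the ordered block list. The sketch below reuses the hypothetical DataBlockInfo structure from the earlier example and is only an illustration of the idea.

```java
import java.util.List;

// Hypothetical sketch: find the single block of a partition that covers a timestamp.
// Blocks are assumed to be sorted by startTimestamp and non-overlapping.
public class BlockIndex {
    static DataBlockInfo findBlock(List<DataBlockInfo> blocks, long timestamp) {
        int lo = 0, hi = blocks.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            DataBlockInfo b = blocks.get(mid);
            if (timestamp < b.startTimestamp) {
                hi = mid - 1;
            } else if (!b.isOpen() && timestamp > b.endTimestamp) {
                lo = mid + 1;
            } else {
                return b; // timestamp falls inside this block's range
            }
        }
        return null; // timestamp is outside the partition's stored range
    }
}
```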

Replication Chain

The content of each block is replicated to multiple TimeBase servers, forming what is known as the "block location list" or "replication chain." Servers within the block location list always replicate data from the server placed before them in the replication sequence. In other words, if you have servers labeled A, B, C, and D, data is replicated from A to B, from B to C, and from C to D.

The data loader (writer) exclusively writes data to the very first server in the block location list, while consumers (readers) solely retrieve data from the very last server in this list. This setup guarantees that messages traverse all the servers in the replication chain, ensuring that the desired replication factor is achieved before reaching any reader.
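
Conceptually, the ordered location list drives three routing decisions: the loader connects to the first server, each server pulls newly received data from its predecessor, and cursors connect to the last server. The sketch below only illustrates how such a list could be interpreted; server selection in the real cluster is handled internally.

```java
import java.util.List;

// Hypothetical sketch of how an ordered block location list is interpreted.
public final class ReplicationChain {
    private ReplicationChain() {}

    // Loaders write only to the head of the chain.
    static String writeTarget(List<String> locations) {
        return locations.get(0);
    }

    // Cursors read only from the tail, so a message becomes visible to readers
    // only after it has traversed every server in the chain.
    static String readTarget(List<String> locations) {
        return locations.get(locations.size() - 1);
    }

    // Each server replicates from its predecessor: A -> B -> C -> D.
    static String replicationSource(List<String> locations, String server) {
        int i = locations.indexOf(server);
        return i > 0 ? locations.get(i - 1) : null; // the head has no upstream
    }
}
```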

Replication Factor

TimeBase Cluster stores multiple copies of each data block on distinct servers to ensure data recovery in case of server failure. The replication factor, defined at stream creation (defaulting to 2), determines how many data block copies are kept.

Copies (replicas) of a single data block are always stored on different cluster nodes. Cluster metadata maintains the list of locations (nodes) where the data for each block is stored. This list of a block's locations (servers) is ordered, and the order has special meaning:

  • The client loader always writes to the first location in the list.
  • Client cursors read from the last location in the list.
  • Data replication progresses from the first block location to the second, from the second location to the third, and so on.
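
One way to picture the placement rule: when a new block is created, the cluster needs replicationFactor distinct servers, and by default it prefers less loaded ones to keep workloads balanced. The selection strategy below is a simplified illustration under those assumptions, not the actual placement algorithm.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: choose an ordered location list for a new block.
// Picks 'replicationFactor' distinct servers, least loaded first.
public final class BlockPlacement {
    private BlockPlacement() {}

    static List<String> chooseLocations(Map<String, Long> usedBytesPerServer, int replicationFactor) {
        if (usedBytesPerServer.size() < replicationFactor) {
            throw new IllegalStateException("Not enough servers for the requested replication factor");
        }
        return usedBytesPerServer.entrySet().stream()
                .sorted(Map.Entry.comparingByValue()) // least loaded servers first
                .limit(replicationFactor)             // distinct servers by construction
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());        // head of the chain is the first element
    }
}
```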

Data Flow Example

When a loader starts to write to a partition, the initial block structure may look like this (with the replication factor = 2):

After the first block gets full, the loader starts to write to the next block:

Physical Structure on Server

The diagram illustrates the following information:

  • The cluster has 3 servers.
  • The replication factor for the stream is 2.
  • Arrows represent replication direction.
  • Blocks are distributed between servers in a semi-random way.
  • On the storage layer, there is no notion of "partitions". From the storage layer perspective, each local stream contains a random set of blocks.
  • On the storage layer, there is no notion of relationships between blocks in a stream. From the storage layer perspective, all blocks are independent but share the same message schema.

Component Interaction Structure

Some things to note about the diagram:

  • The Raft Cluster Leader is not a separate entity. It's one of the Raft Cluster Members that was appointed to be a Cluster Leader.
  • Under the hood, the Cluster Client creates multiple instances of a "regular" TimeBase client and has connections to multiple TimeBase servers. The client picks a random known server to initialize the connection and then switches to the server that holds the needed data blocks (see the sketch after this list).
  • Raft Cluster Members do not directly interact with the Cluster Database (DB). Raft agents are embedded into the TB server application but are mostly isolated from the rest of TB. However, there is an exception: data regarding server state (disk usage and so on) is directly passed from TB to the Raft Cluster Member.
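
A rough sketch of the client behavior described in the second bullet: the cluster client bootstraps through any known server, then routes each read to whichever server holds the tail replica of the block it needs. It reuses the hypothetical DataBlockInfo structure from earlier; all names are illustrative, and the real client manages this internally.

```java
import java.util.List;
import java.util.Random;

// Hypothetical sketch of cluster-client connection routing.
public class ClusterClientSketch {
    private final List<String> knownServers;
    private final Random random = new Random();

    ClusterClientSketch(List<String> knownServers) {
        this.knownServers = knownServers;
    }

    // The initial connection goes to any known server, chosen at random.
    String bootstrapServer() {
        return knownServers.get(random.nextInt(knownServers.size()));
    }

    // Reads are then redirected to the server holding the tail replica of the needed block.
    String serverForRead(DataBlockInfo block) {
        return block.locations.get(block.locations.size() - 1);
    }
}
```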

Global Data Flow

The diagram illustrates an example of data flow in a cluster where there are multiple Loaders and Cursors.

  • Cursor 1 only consumes data from Loader 1.
  • Cursor 2 consumes data from all three Loaders.
  • Cursor 3 consumes data from Loader 1 and Loader 3.
  • Cursor 4 only consumes data from Loader 3.

Cursors always read from the last block replica. If multiple cursors need the same data, they read it from the same server.

Glossary

  • Server: An instance of the TB application embedding the cluster component and data storage. It must provide an implementation of the Cluster Member interface.
  • Cluster API: The API used by a cluster member to query or change the state of the cluster.
  • Cluster Leader: The cluster member currently appointed as the cluster leader.
  • Cluster Member: A TimeBase server instance that has joined a cluster.
  • Data Block: A container for a subset of messages within a specific time range, for a specific "space," in a specific stream.
  • Block Location List: An ordered list of cluster members responsible for replicating a specific block.
  • Block Replication Chain: Another name for the Block Location List: an ordered list of cluster members responsible for replicating a specific block.
  • Head Member: The member that is first in the location list for a specific block.
  • Tail Member: The member that is last in the location list for a specific block.