Time Slice Files
Messages are always stored into and read from TimeBase in the order of their primary
Within a single stream, TimeBase distributes messages by the
timestamp into a number of files.
Each file holds messages for a given range of timestamps.
These files are called Time Slice Files, or TSFs. The number of TSFs in a single stream could be very large.
TSFs are written in compacted, and often compressed, form.
The internal structure of a TSF allows random read access to its sectors, called Data Blocks.
However, data is never updated inside a TSF without re-writing the entire file.
TSFs are created to be of a given configurable size, on the order of 10MB.
A single TSF holds messages for all entities, which fall into the given time range. Therefore, the width of a TSF’s time range is widely variable.
In extreme cases, an entire TSF may contain several messages, all with the same timestamp or a TSF may contain a wide internal time gap.
TimeBase ensures that all TSFs are of a reasonable size - not too small, not too large.
The reason a TSF should not be too small is because there is non-trivial overhead of handling and maintaining an individual file.
Experiments on modern hardware show that when a TSF size is around 10MB (or more), TimeBase can store data at speeds close to 50% of sustained throughput of the underlying storage.
If file size were set to about 1MB (bad), the speed would fall dramatically.
There are two reasons why a TSF should not be too large.
- The less important reason is that if you want to edit even one record, you have to re-write an entire TSF. The larger the file, the more work needs to be done.
- The more important consideration in sizing a TSF is that it limits the amount of memory consumed by readers and writers.
A reader that is reading data for all existing entities, without filtering, will need to retrieve entire TSFs from disk, one at a time.
- If the maximum size of a TSF is 10MB, then we can be certain that, for example, 1GB of RAM is sufficient to support 100 readers.
- If we set the maximum TSF size to 100 MB, then only 10 simultaneous readers will fit in 1GB of RAM.
When data is written to TimeBase, the writer buffers (at least) the most recent time slice in memory.
Messages are added to this dynamic data structure. As the cumulative size of the time slice exceeds the threshold, the time slice is scheduled for Commit (write to disk), and the Writer switches over to the next time slice.
Note that the very last time slice being formed, is never committed to disk, until it reaches the size threshold, or the writer is closed.
Otherwise, the commit process would have to acquire a lock on the last time slice and delay the process of writing data to it.
As a result of this architecture, should TimeBase experience abnormal termination, such as system crash or accidental loss of power, the contents of the uncommitted time slice will be lost.
That may be another reason to limit maximum time slice size.
Most recently committed time slices are kept in memory for as long as a pre-configured Data Cache Size allows.
When Data Cache gets full, the least recently used time slice is evicted from it (assuming, of course, that it has been committed).
Time Slice File Structure
A Time Slice File has the following internal structure:
It starts with an Index Block, followed by a number of Data Blocks.
Each Data Block contains messages for one given entity, all, of course, falling within the time range of the TSF.
Inside a Data Block, messages are ordered by time.
The Index Block tells where the Data Blocks are within the TSF. Therefore, for example, to read data for a single entity, the reader only needs to first read the small Index Block, and then it can directly access the Data Block for the entity of interest.
There is little overhead in terms of the amount of extra data read from disk (besides the useful data).
Typically, Data Blocks are lightly compressed on write (and then, obviously, uncompressed on read).
The reduction in the amount of disk space (sometimes 4X) comes at the expense of relatively little additional CPU consumption.
When data needs to be transmitted over the network, such as when using Hadoop, light compression actually speeds up the reading.
Time Slice File System
Time slice files are organized into a balanced tree of folders in the file system.
Every stream is maintained in a separate root folder.
Under the root, the may be zero or more levels of intermediate folders.
At the last level of the tree are Time Slice Files.
One intermediate level of folders is shown on the below diagram:
TimeBase maintains balance in the tree by ensuring that every folder has between N/2 and N files, where N is a configurable value called Maximum Folder Size.
When a folder fills up (has N files or child folders), it is split into two folders, each with N/2 children. This process increases the number of child nodes in the folder’s parent, and may cause the parent to split, etc.
Along with the basic tree structure, the folders remember a small amount of additional information about their children. in particular, every folder remembers for each entity the first and last TSF that has data for it.
Based on this index, TimeBase is able to very quickly find the first and last TSFs for each entity, as well as jump from one TSF to the next that contains data for the given entity, or a set thereof.
Inside a stream, messages are stored in a number of so-called M-Files.
While many TimeBase users do not need to know about M-Files, this concept is critical for the administration of TimeBase.
Each M-File is represented by two physical files on disk, called
data file and
Message data is stored in the
data file, while the
index file allows the server to quickly find the location of any message inside the data file by the timestamp.
Both physical files grow as messages are appended to the M-File.
It is important to understand that messages in an M-File are placed sequentially in the order of their timestamps, and that different messages occupy different amount of space in the file.
A critical stream parameter is called Distribution Factor (DF).
DF determines how many M-Files a stream will contain, and, subsequently, how messages written to a stream will be distributed among those files.
The DF is specified when a stream is created.
DF = Max (dedicated M-Files)
DF is a positive number, which can also have a special out-of-band value of Max.
This is the default setting. When DF is set to Max, messages for each symbol are stored into a separate, dedicated M-File.
For example, all messages related to AAPL will be stored into one M-File, and all messages related to GOOG will be stored into another M-File.
Distributing messages into dedicated M-Files brings three advantages of varying importance:
- Most importantly, the data for each symbol can be manipulated independently. For example, you can re-load AAPL data without wiping out, or affecting in any way, the GOOG data. Because of this, Market Data Aggregator requires that the destination stream be set up with DF = Max.
- Also, you get very fast retrieval of small symbol subsets. For example, let’s say you have a stream with tick data for 1000 stocks, and you are backtesting an order execution strategy. In order to simulate the execution of orders, the strategy needs tick data, but only for those symbols for which it has outstanding orders. This is likely to be a very small subset of the original 1000-symbol universe. TimeBase server only needs to read a few files in order to supply the data to the strategy being backtested.
- Finally, since TimeBase knows that all messages in any given M-File are related to a specific symbol, it does not need to store the symbol in every message, resulting in space savings compared to alternative configurations.
DF = N (shared M-Files)
There is only one, but important, disadvantage of configuring a stream with dedicated M-Files: when selecting data simultaneously for more than a few hundred symbols, retrieval throughput begins to fall sharply.
For example, backtesting a typical equity alpha-generation or portfolio strategy on intraday data (such as one-minute bars) requires retrieving a fairly large amount of data for a large number of symbols. If you selected all data from a stream with one-minute bars for 10,000 equities stored in dedicated M-Files, TimeBase would have to join messages from 10,000 M-Files on time stamp value. TimeBase uses a highly optimized join algorithm, but at a certain point this algorithm begins to run out of Level II CPU cache.
The fall of retrieval performance is non-linear, and begins abruptly at the point when the join algorithm begins to miss Level II CPU cache. The exact number of symbols after which performance starts to fall depends on the CPU architecture. As a rule of thumb, this number is somewhere between high hundreds and low thousands of symbols. In order to overcome the join performance limitation in cases when mass-retrieval of data is important, the DF has to be set to a finite number.
When DF is set to a positive number between 1 and (usually) a few hundred, then messages are proportionally distributed among the specified number of M-Files.
Of course, when DF = 1, all messages in a stream are simply placed in a single M-File.
When DF is greater than 1, then every symbol is assigned to a specific M-File, but multiple symbols will share the same M-File.
Limiting the number of M-Files solves the CPU cache miss problem, but entails the following costs:
- Most importantly, the data for each symbol can no longer be manipulated independently. Market Data Aggregator cannot work with streams with a finite DF.
- Depending on the DF value, small symbol subsets may or may not take longer to retrieve. When DF is relatively high, the impact is small.
- TimeBase has to store the symbol in every message, resulting in slightly increased disk space consumption.
Often, the user must have both the ability to
mass-select data, as well as run
selective retrieval, for a large number of symbols.
For these purposes TimeBase provides several tools:
- If the data sets are fundamentally different, they can be stored into distinct streams, each with the properly configured DF. For instance, a hybrid alpha/execution strategy is frequently designed to perform the slower but broader alpha phase on, say, one-minute price bars, and then execute algo-orders on tick data. In this case, store one-minute bars in a stream with DF = 1, and ticks in a stream with DF = Max.
- If the same data set must be used in both modes, set DF to a relatively high value. Again, the exact number depends on the hardware, the type of queries you run, and the acceptable trade-off between mass and selective query performance. Start with a number in low hundreds, such as 200, and adjust based on experimental results.
- For ultimate performance, you can have two versions of the same stream, one with Max DF and one with DF = 1. It is possible to set up replication so that one stream is automatically produced from another, or this can be easily done manually using TimeBase Administrator’s Distribute and Consolidate commands.
TimeBase always stores messages within an M-File sequentially, in the order of their timestamps.
This ensures that when a client is reading messages (always in the order of time), the data file is read strictly sequentially.
There is no need to jump from one section of the data file to another.
On a defragmented file system, such design ensures maximum performance that can be possibly achieved without caching a substantial amount of data in computer memory (impractical with very large data sets, such as large collections of tick data).
Strict adherence to the above policy implies that it is impossible to insert a message into the middle of an M-File without re-writing the entire section of the file following the modification point:
In fact, TimeBase does strictly adhere to sequential ordering policy.
An attempt to insert data into the middle of any M-File causes immediate truncation of the affected M-File at the insertion point.
Note that only the M-File being inserted into is truncated; while the rest of the M-Files in the stream are unaffected. Therefore, what happens when data is inserted critically depends on the Distribution Factor.
A TimeBase instance occupies one or more folders on the hard disk.
When multiple folders are in use, TimeBase files can be distributed among them using a round robin approach to selecting storage locations.
This is possible because TimeBase makes no assumptions about which folders contain which files.
All files have a unique name within the instance, and could be placed in any of the multiple folders.
On startup, TimeBase makes a catalog of each of the folders and will find any of the required files if it has been moved around.
There are several reasons for using multiple folders:
- Spread data among multiple storage devices (not connected into a disk array) in order to increase the amount of data that can be stored.
- Spread data among multiple storage devices (not connected into a disk array) in order to increase the performance of data retrieval.
- Overcome the inefficiencies of specific file systems related to a large number of files being placed in a single folder. For example, it is undesirable to place more than a few thousand files into a single NTFS folder. NTFS performance begins to fall exponentially after folder size exceeds about 4000 files. In this case, using multiple folders even on the same physical device is beneficial.