TimeBase Storage 5.0
Time Slice Files
Messages are recorded and read from TimeBase in a chronological order according to their
timestamps. On a stream level, TimeBase distributes messages by the
timestamp into a number of compact and often compresses files, each storing messages for a specific time range. These files are called Time Slice Files, or TSFs. The number of TSFs in a single stream could be very large.
Each file size can be configured when creating it and TimeBase ensures that all TSFs are of a reasonable size - not too small, not too large. The reason a TSF should not be too small is because there is non-trivial overhead of handling and maintaining individual files. Experiments on modern hardware show, that when a TSF size is around 10MB (or more), TimeBase can store data at speeds close to 50% of a sustained throughput of the underlying storage. If TSF size is set to about 1MB (bad), the speed falls dramatically. Reasons why TSFs should not be too large:
- You have to re-write an entire TSF to edit even one record. The larger the file, the more work needs to be done.
- The more important consideration in sizing a TSF is that it limits the amount of memory consumed by readers and writers.
A reader needs to retrieve all TSFs from disk, one at a time (if no filter applies)
- If the maximum TSF size is 10MB, then we can be certain that, for example, 1GB of RAM is sufficient to support 100 readers.
- If the maximum TSF size is 100MB, then only 10 simultaneous readers will fit in 1GB of RAM.
The TSF's time range is widely variable, because a single TSF holds messages for all entities, which fall into the given time range. In extreme cases, a TSF may contain several messages, all with the same
timestamp or a TSF may contain a wide internal time gap.
When data is written to TimeBase, loaders buffer the most recent TSFs. Buffer is a dynamic data structure. As the cumulative size of the stored TSFs exceeds the defined threshold, TSFs are scheduled for Commit (write to disk), and the Loader switches over to the next TSF.
Note, that the last TSF is never committed to disk, until the threshold is reached or the writer is closed. Otherwise, the commit process would lock on the last TSF and delay the process of writing data to it. As a result of such architecture, should TimeBase experience abnormal termination, system crash or accidental power blackout, the content of all uncommitted TSFs would be lost.
Another reason for limiting the TSF maximum size is that recent TSFs are kept in Data Cache. When Data Cache gets full, the latest committed TSFs are evicted from it.
The Index Block tells where the Data Blocks are within the TSF. Each Data Block contains chronologically arranged messages for one given entity and a time range of the corresponding TSF. Data Blocks can be randomly accessed by readers (cursors). Therefore, for example, to read data for a specific entity, the Cursor needs to read the Index Block, and then it can access directly the required Data Block.
There is little overhead in terms of the amount of extra data read from the disk. Typically, Data Blocks are lightly compressed on write and decompressed on read. The disk space reduction (sometimes 4X) comes at the expense of a relatively small additional CPU consumption. When data needs to be transmitted over the network, for example when using Hadoop, light compression speeds up the reading.
TSFs are organized into a balanced tree of folders in the file system.
Every stream is maintained in a separate root folder. Under the root, the may be zero or more levels of intermediate folders. TSFs are at the bottom of the hierarchy.
TimeBase keeps balance in the tree by ensuring that every folder has between
N files, where
N is a configurable Maximum Folder Size. When a folder fills up (has N files or child folders), it is split into two folders, each having N/2 children. This process increases the number of child nodes in the parent folders, and may cause the parent to split.
Along with the basic tree structure, folders remember an additional information about their children. Every folder remembers its first and last TSF. Based on this index, TimeBase is able to very quickly find the first and last TSFs for each entity, as well as to navigate between TSFs.
TimeBase Storage 4.0
Messages are stored in M-Files inside streams.
M-Files are critical for TimeBase administration. Each M-File is represented by
data file and
index file stored on disk. Message data is stored in the
data file, while the
index file allows the server to quickly find messages inside the data file by their timestamps. Both files grow as messages are added to the M-File.
Note, that messages are placed chronologically in M-Files, and each of them occupies different amount of space.
Another important stream parameter is called Distribution Factor (DF). DF is specified when a new stream is created and determines the number of M-Files in a stream and how messages are distributed among these files.
DF = Max (dedicated M-Files)
DF is a positive number, which can also have a special out-of-band Max value. This is the default setting. When DF is set to Max, messages for each symbol are stored into a separate, dedicated M-File. For example, all messages related to AAPL will be written in one M-File, and all messages related to GOOG will be written in another M-File. Distributing messages between M-Files is very important for various reasons:
- Each symbol data can be treated independently. For example, you can re-load AAPL data without wiping out, or affecting in any way, the GOOG data. This is why Market Data Aggregator requires the destination stream to be set up with DF = Max.
- Very fast retrieval of small symbol subsets. For example, you have a stream with tick data for 1000 stocks, and you are backtesting an order execution strategy. In order to simulate the orders execution, the strategy needs tick data, but only for symbols it has outstanding orders for. This is likely to be a very small subset of the original 1000-symbol universe. TimeBase server needs to read only a few files in order to supply the data to the backtested strategy.
- Since TimeBase knows that all messages in any given M-File are related to a specific symbol, it is not necessary to store the symbol in every message, resulting in space savings compared to alternative configurations.
DF = N (shared M-Files)
There is only one, but important, disadvantage in configuring a stream with dedicated M-Files: when selecting data simultaneously for more than a few hundred symbols, retrieval throughput begins to fall sharply.
For example, backtesting a typical equity alpha-generation or portfolio strategy on intraday data (such as one-minute bars) requires retrieving a fairly large data volume for a large number of symbols. If you selected all data from a stream with one-minute bars for 10,000 equities stored in dedicated M-Files, TimeBase would have to join messages from 10,000 M-Files on time stamp value. TimeBase uses a highly optimized join algorithm, but at a certain point this algorithm begins to run out of Level II CPU cache.
The fall of retrieval performance is non-linear, and begins abruptly at the point when the join algorithm begins to miss Level II CPU cache. The exact number of symbols after which performance starts to fall depends on the CPU architecture. As a rule of thumb, this number is somewhere between high hundreds and low thousands of symbols. In order to overcome the join performance limitation in cases when mass-retrieval of data is important, the DF has to be set to a finite number.
When DF is set to a positive number between 1 and (usually) a few hundred, then messages are proportionally distributed among the specified number of M-Files. When DF = 1, all messages in a stream are simply placed in a single M-File. When DF is greater than 1, then every symbol is assigned to a specific M-File, but multiple symbols will share the same M-File. Limiting the number of M-Files solves the CPU cache miss problem, but entails the following costs:
- The data for each symbol can no longer be manipulated independently. Market Data Aggregator cannot work with streams with a finite DF.
- Depending on the DF value, small symbol subsets may or may not take longer to retrieve. When DF is relatively high, the impact is small.
- TimeBase has to store the symbol in every message, resulting in slightly increased disk space consumption.
Often, users require the ability to both
mass-select data, as well as to run
selective retrieval, for a large number of symbols. TimeBase provides several tools for these purposes:
- In case of different datasets, they can be stored into distinct streams, each with the properly configured DF. For example, a hybrid alpha/execution strategy is frequently designed to perform the slower but broader alpha phase on one-minute price bars and then execute algo-orders on tick data. In this case, store one-minute bars in a stream with DF = 1, and ticks in a stream with DF = Max.
- If the same dataset must be used in both modes, set DF to a relatively high value. The exact DF value depends on the hardware, the type of queries you run, and the acceptable trade-off between mass and selective query performance. Start with a number in low hundreds, such as 200, and adjust based on experimental results.
- For ultimate performance, you can have two versions of the same stream, one with DF = Max and one with DF = 1. It is possible to set up replication so that one stream is automatically produced from another.
TimeBase always stores messages in M-Files chronologically. This ensures that when a client is reading messages (always chronologically), the data file is also read strictly chronologically. There is no need to jump from one section of the data file to another.
Such design ensures maximum performance achieved on a defragmented file system without caching a substantial amount of data in the computer memory (not practical with very large datasets, such as large collections of tick data).
Strict adherence to the above policy excludes the possibility of a message insertion in the middle of an M-File without re-writing the entire section of the file, following the modification point:
In fact, TimeBase does strictly adhere to sequential ordering policy. An attempt to insert data in the middle of any M-File causes immediate truncation of the affected M-File at the insertion point.
Note, that only the M-File being inserted into is truncated; while the rest of the M-Files in the stream remain unaffected. Therefore, what happens when data is inserted critically depends on the Distribution Factor.
A TimeBase instance occupies one or more folders on the hard disk. When multiple folders are in use, TimeBase files can be distributed among them using a round robin approach to selecting storage locations. This is possible because TimeBase makes no assumptions about which folders contain which files. All files have unique names within the instance, and could be placed in any of the multiple folders. On startup, TimeBase makes a catalog of folders and can find any of the required files even if it has been moved around. There are several reasons for using multiple folders:
- To spread data among multiple storage devices (not connected into a disk array) to increase the amount of stored data.
- To spread data among multiple storage devices (not connected into a disk array) to increase the data retrieval performance.
- To overcome specific file systems shortcomings related to a large number of files being placed in a single folder. For example, it is not recommended placing more than a few thousand files into a single NTFS folder. NTFS performance begins to fall exponentially after folder size exceeds about 4000 files. In this case, using multiple folders even on the same physical device is beneficial.