TimeBase Read Throughput
This article examines the read performance of TimeBase, focusing on how overall server throughput scales as the number of concurrent consumers increases when reading a single historical data stream. The second part of this article explores how READ throughput scales for multiple streams.
Summary
Our measurements show that TimeBase can deliver historical data from a single stream to multiple concurrent consumers at a combined rate exceeding 7 million messages per second.
When reading multiple independent streams in parallel, combined throughput scaled up to 48.9 million messages per second with 16 consumers.
Per-consumer throughput decreased as more consumers were added, from about 4.6 million messages per second with 4 consumers to about 3.0 million with 16.
This shows that while overall throughput grows with concurrency, efficiency per consumer declines.
Experiment 1 - Parallel READs of single stream
Environment
For this experiment, results were obtained on a developer workstation with the following configuration:
- CPU: i7-13850HX, 2100 MHz, 20 cores
- RAM: 32 GB
- SSD: Intel 660p NVMe
- OS: Windows 11 PRO
- OpenJDK 17.0.12
- TimeBase 5.6.161 (default settings)
TimeBase reported the default data cache configuration at startup:
INFO: Initializing RAMDisk. Data cache size = 128MB.
INFO: Initializing Data Cache [PageSize = 65,536 bytes; Pages = 2044; MaxFileLength = 134,217,727 MB]
note
Disclaimer: This experiment was conducted on a developer laptop. Network throughput is explored in Experiment 2, described later in this document.
TimeBase and all consumers were running on the same machine. The TickDB Shell (TimeBase CLI) was used to conduct the test (see the Appendix for logs).
Specifically, the tptimea command was executed to read a given stream with a specified number of concurrent consumers. Each consumer read the entire stream independently. Once all consumers completed, the tool reported the combined average data rate, which is shown in the results.
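Outside the shell, the same access pattern can be reproduced directly with the TimeBase Java client. The sketch below is illustrative only: it assumes the client API exposed by the deltix.qsrv.hf.tickdb.pub package (TickDBFactory, DXTickDB, DXTickStream, SelectionOptions, TickCursor), and exact class or method names may differ between TimeBase releases; the URL and stream name are taken from the Appendix logs.

```java
import deltix.qsrv.hf.tickdb.pub.DXTickDB;
import deltix.qsrv.hf.tickdb.pub.DXTickStream;
import deltix.qsrv.hf.tickdb.pub.SelectionOptions;
import deltix.qsrv.hf.tickdb.pub.TickCursor;
import deltix.qsrv.hf.tickdb.pub.TickDBFactory;

public class ParallelStreamReaders {

    public static void main(String[] args) throws InterruptedException {
        int consumers = 10;                                  // concurrent readers of the same stream
        Thread[] readers = new Thread[consumers];
        for (int i = 0; i < consumers; i++) {
            readers[i] = new Thread(ParallelStreamReaders::readEntireStream, "reader-" + i);
            readers[i].start();
        }
        for (Thread reader : readers)
            reader.join();                                   // wait for every consumer to finish its full scan
    }

    private static void readEntireStream() {
        // Each consumer uses its own connection and cursor, so the scans are fully independent.
        DXTickDB db = TickDBFactory.createFromUrl("dxtick://localhost:8011");
        db.open(true);                                       // read-only connection
        try {
            DXTickStream stream = db.getStream("BLOOMBERG_TICKS");
            SelectionOptions options = new SelectionOptions(true, false);  // raw (undecoded), historical
            TickCursor cursor = stream.select(Long.MIN_VALUE, options);    // read from the beginning of the stream
            long count = 0;
            try {
                while (cursor.next())
                    count++;                                 // payload is available via cursor.getMessage()
            } finally {
                cursor.close();
            }
            System.out.println(Thread.currentThread().getName() + " read " + count + " messages");
        } finally {
            db.close();
        }
    }
}
```

Each reader holds its own connection and cursor, so the server is serving N independent sequential scans of the same stream, which is the access pattern measured in this experiment.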
The following diagram illustrates the test setup:
Historical Data
A 2 GB sample of compressed Level 1 market data from Bloomberg (Trades and BBO) was used as the test dataset.
The average encoded message size was 44 bytes.
Results
The TimeBase client library was configured with the default number of transport channels (2).
Consumers | Data Rate (msg/s) |
---|---|
1 | 1,031,602 |
2 | 1,959,044 |
3 | 2,779,573 |
4 | 3,567,595 |
5 | 5,155,398 |
6 | 5,984,051 |
7 | 6,298,562 |
8 | 6,497,084 |
9 | 6,574,442 |
10 | 6,730,122 |
Analysis:
- Throughput scales almost linearly up to 5–6 consumers (≈6 M msg/s).
- Beyond that, gains flatten: 7–10 consumers only add ~0.4 M msg/s total.
- Efficiency per consumer drops from ~1.0 M msg/s (1–5 consumers) to ~0.67 M msg/s at 10 consumers.
note
Observation: Overall CPU utilization increased from ~25% to ~48% as the number of consumers grew across tests.
Parallel Reads of a Single Stream with Multiple Channels
In this test, the same dataset is consumed by 10 concurrent consumers, but the number of transport channels is increased beyond the default setting.
Consumers | Channels | Data Rate (msg/s) |
---|---|---|
10 | 2 | 6,730,122 |
10 | 4 | 6,927,077 |
10 | 6 | 7,095,391 |
In this experiment, throughput varied between runs. For each series, the median rate was used in the results table above.
Increasing the number of channels beyond 8 proved counter-productive.
Experiment 2 - AWS, Linux, Multiple streams, remote consumers
Environment
- CPU: Intel(R) Xeon(R) Platinum 8488C, 64 vCPUs (m7i.16xlarge)
- RAM: 256 GB
- Disk: GP3 EBS (3000 IOPS)
- Network: 25 Gigabit
- OS: Amazon Linux 2023
- OpenJDK 17.0.16
- TimeBase 5.6.161 (default settings)
- TimeBase.network.socket.receiveBufferSize=4194304
- TimeBase.ramCacheSize=10737418240 (10 GB)
- Storage format: 5.0
The following diagram illustrates this test setup:
All consumers run remotely; each consumer runs in its own JVM process and uses a separate instance of the TimeBase client connection.
TimeBase Server startup parameters:
export DELTIX_HOME=/home/ec2-user/timebase-home
export TIMEBASE_SERIAL=***
java -Xms60G -Xmx60G -XX:+AlwaysPreTouch $ADD_OPENS -cp "/home/ec2-user/timebase-home/lib/*" deltix.qsrv.comm.cat.TBServerCmd -tb -home /home/ec2-user/QSHome -port 8011
TimeBase clients startup parameters:
java -DTimeBase.network.socket.bufferSize=4194304 $ADD_OPENS -cp "/home/ec2-user/jars-bench/*" deltix.qsrv.hf.tickdb.benchmark.Benchmark_HistoricThroughput -url dxtick://172.31.17.217:8011 -producers 4 -consumers 1 -channelType streams -messageCount 200000000 -stream thr_4 -skipGeneration
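The -DTimeBase.network.socket.bufferSize override shown above can also be applied by an application that embeds the client, as long as the property is set before the first TimeBase connection is opened. A minimal sketch, assuming the property is read when the connection is created (the class name is illustrative):

```java
public class ClientTuning {
    public static void main(String[] args) {
        // Equivalent to -DTimeBase.network.socket.bufferSize=4194304 (4 MB socket buffers).
        // Must run before the TimeBase client opens its first connection.
        System.setProperty("TimeBase.network.socket.bufferSize", String.valueOf(4 * 1024 * 1024));

        // ... connect to TimeBase and start consumers here ...
    }
}
```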
Historical Data
The market data in this test consists of historical OHLC bars stored in several TimeBase streams (about 1 GB each). Each bar message is encoded into 36 bytes.
Results
Consumers | Total Throughput (msg/s) | Per-Consumer Throughput (msg/s) |
---|---|---|
4 | 18,544,919 | 4,636,229 |
8 | 34,215,267 | 4,276,908 |
16 | 48,876,041 | 3,054,752 |
With 16 consumers reading data, the combined rate approaches 50 million messages per second; at that point the test consumes about 22 Gigabit per second of network bandwidth (these instances use a 25 Gigabit network). Of that, roughly 14 Gbit/s is encoded message payload (48.9 M msg/s × 36 bytes ≈ 1.76 GB/s); the remainder is transport and protocol overhead.
CPU Usage
These tests were run on large AWS instances with 64 vCPUs, primarily to ensure sufficient network bandwidth and to minimize the impact of “noisy neighbors.” From a CPU standpoint, the servers are very much over-provisioned for the workload under test.
To evaluate CPU scaling effects, we restricted TimeBase to a smaller subset of vCPUs and observed the impact on performance. The test setup was the same as before, with 4 consumers. We first measured throughput with the full CPU set available to TimeBase, then repeated the tests while limiting the server to 16, 8, 4, and 2 vCPUs.
As an example, to restrict TimeBase to 8 vCPUs we prefixed the server command line with:
taskset -c 10-13,42-45 java -XX:ActiveProcessorCount=8 ...
On this instance, there are 32 physical cores with hyper-threading enabled; vCPUs 0 and 32 map to physical core 0. Hence CPU groups 10-13 and 42-45 are HT pairs.
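To confirm that the JVM honors the restriction, it is enough to check the processor count the runtime reports; a trivial check using only standard JDK APIs (the class name is arbitrary):

```java
public class CpuCheck {
    public static void main(String[] args) {
        // When launched as: taskset -c 10-13,42-45 java -XX:ActiveProcessorCount=8 CpuCheck
        // this prints 8 rather than the instance's full 64 vCPUs.
        System.out.println("JVM sees " + Runtime.getRuntime().availableProcessors() + " processors");
    }
}
```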
CPU Allocation | Median Throughput (M msg/s) |
---|---|
All 64 vCPUs | 16.9 |
16 vCPUs | 19.1 |
8 vCPUs | 15.7 |
4 vCPUs | 12.3 |
2 vCPUs | 6.3 |
Analysis
The results show that throughput does not scale linearly with the number of vCPUs allocated to TimeBase. Even when restricted to 16 or 8 vCPUs, performance remains close to the levels observed with the full 64 vCPU server. Only at very low allocations (4 and especially 2 vCPUs) does throughput drop substantially.
This indicates that the 64 vCPU instance type is significantly over-provisioned in terms of raw compute. The main reason for choosing such large instances is network bandwidth and isolation from noisy neighbors, rather than CPU capacity. In practice, TimeBase requires only a fraction of the available cores to sustain tens of millions of messages per second.
Appendix: Sample test logs from the TickDB Shell
The following logs show how the TickDB Shell was used to execute these tests:
#/QuantServer/bin/tickdb.sh
==> set db dxtick://localhost:8011
==> open
==> set stream BLOOMBERG_TICKS
==> tptimea 1
54,986,443 messages in 53.302s; speed: 1,031,602 msg/s
==> tptimea 2
109,972,886 messages in 56.136s; speed: 1,959,044 msg/s
==> tptimea 3
164,959,329 messages in 59.347s; speed: 2,779,573 msg/s
==> tptimea 4
219,945,772 messages in 61.651s; speed: 3,567,595 msg/s
==> tptimea 5
274,932,215 messages in 53.329s; speed: 5,155,398 msg/s
==> tptimea 6
329,918,658 messages in 55.133s; speed: 5,984,051 msg/s
==> tptimea 7
384,905,101 messages in 61.110s; speed: 6,298,562 msg/s
==> tptimea 8
439,891,544 messages in 67.706s; speed: 6,497,084 msg/s
==> tptimea 9
494,877,987 messages in 75.273s; speed: 6,574,442 msg/s
==> tptimea 10
549,864,430 messages in 81.702s; speed: 6,730,122 msg/s
Variation of test results:
==> tptimea 5
274,932,215 messages in 52.905s; speed: 5,196,715 msg/s
==> tptimea 5
274,932,215 messages in 53.738s; speed: 5,116,160 msg/s
==> tptimea 5
274,932,215 messages in 53.329s; speed: 5,155,398 msg/s
CPU Limiting results
Baseline (all 64 vCPUs)
===================
Messages per second (total): 10691935
Messages per second (total): 13834487
Messages per second (total): 16897332
Messages per second (total): 19118745
Messages per second (total): 19728608
16 vCPUs / taskset -c 10-17,42-49 -XX:ActiveProcessorCount=16
===========================================================
Messages per second (total): 12326513
Messages per second (total): 13547267
Messages per second (total): 19073504
Messages per second (total): 19130289
Messages per second (total): 19212987
8 vCPUs / taskset -c 10-13,42-45 -XX:ActiveProcessorCount=8
===========================================================
Messages per second (total): 12520492
Messages per second (total): 11066078
Messages per second (total): 15768828
Messages per second (total): 15740207
Messages per second (total): 15798178
4 vCPUs / taskset -c 10-11,42-43 java -XX:ActiveProcessorCount=4
================================================================
Messages per second (total): 9379350
Messages per second (total): 8794736
Messages per second (total): 12498340
Messages per second (total): 12413300
Messages per second (total): 12313185
2 vCPUs / taskset -c 10,42 java -XX:ActiveProcessorCount=2
==========================================================
Messages per second (total): 4662106
Messages per second (total): 4629602
Messages per second (total): 6314626
Messages per second (total): 6325223
Messages per second (total): 6267848