Architecture

Timeplus Proton is a unified streaming and historical data processing engine built on top of ClickHouse. It extends ClickHouse’s columnar storage and vectorized query execution with native stream processing, delivering real-time analytics from a single binary.

Overview

Proton combines:

ClickHouse’s proven OLAP engine for historical data queries and storage
Native stream processing for real-time data ingestion and continuous queries
A distributed write-ahead log (WAL), implemented using Kafka, for replication and fault tolerance
A single C++ binary with no JVM or ZooKeeper dependencies

Core Components

Stream Storage Engine

Proton introduces a new storage engine called StorageStream that extends ClickHouse’s MergeTreeData:

// From src/Storages/Stream/StorageStream.h
class StorageStream : public MergeTreeData
{
    // Distributed data ingestion via Kafka WAL
    // Streaming query support
    // Simplified usability
};

Key capabilities:

Distributed ingestion: Data is written to a distributed WAL (Kafka) and replicated across shards
Streaming queries: Native support for continuous queries with WHERE, GROUP BY, and window functions
MergeTree integration: Uses ClickHouse’s MergeTree family for storage with automatic background merging

Columnar Processing

Like ClickHouse, Proton uses vectorized query execution:

Columnar storage: Data is stored by column (IColumn interface) in a contiguous memory layout for cache efficiency
Vectorized execution: Operations are dispatched on arrays (vectors/chunks) rather than individual values for better CPU utilization
SIMD optimization: Single Instruction Multiple Data operations accelerate aggregations and filters

From ClickHouse architecture:

“Operations are dispatched on arrays, rather than on individual values. This helps lower the cost of actual data processing through vectorized query execution.”
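
To make the idea concrete, here is a minimal sketch of column-at-a-time processing. It is not Proton or ClickHouse code: Float64Column, greaterThan, and sumWhere are hypothetical names, and a std::vector stands in for a real IColumn. Each function is a tight loop over one contiguous array, which is the shape compilers can auto-vectorize.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a numeric column: values stored contiguously.
using Float64Column = std::vector<double>;

// Column-at-a-time filter: one pass over the whole chunk produces a byte mask.
std::vector<std::uint8_t> greaterThan(const Float64Column & col, double threshold)
{
    std::vector<std::uint8_t> mask(col.size());
    for (std::size_t i = 0; i < col.size(); ++i)
        mask[i] = col[i] > threshold ? 1 : 0;
    return mask;
}

// Column-at-a-time aggregation: sums only the rows selected by the mask.
// Assumes mask.size() == col.size().
double sumWhere(const Float64Column & col, const std::vector<std::uint8_t> & mask)
{
    double total = 0.0;
    for (std::size_t i = 0; i < col.size(); ++i)
        total += mask[i] ? col[i] : 0.0;
    return total;
}
```

The filter and the aggregation each touch a whole chunk per call, so per-row dispatch overhead is paid once per chunk rather than once per value.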

Block Streams

Data flows through the system as Blocks, which are containers of column chunks:

Block = [(IColumn, IDataType, column_name), ...]

Processing pipeline:

IBlockInputStream reads blocks from sources (streams, tables, Kafka)
Transformations filter, aggregate, and join blocks, producing new blocks rather than modifying their input
IBlockOutputStream writes results to destinations (views, external streams, storage)

Proton uses a “pull” approach: when you pull a block from a stream, it recursively pulls from nested streams, creating an execution pipeline.
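
The pull model can be sketched in a few lines. This is a simplified illustration, not the actual IBlockInputStream interface: MiniBlock, IMiniInputStream, VectorSource, and PositiveFilter are hypothetical names, and a block here holds a single integer column instead of (IColumn, IDataType, column_name) triples.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Simplified block: one named integer column.
struct MiniBlock
{
    std::string column_name;
    std::vector<long long> values;
};

// Pull interface: read() returns the next block, or nullopt when exhausted.
struct IMiniInputStream
{
    virtual ~IMiniInputStream() = default;
    virtual std::optional<MiniBlock> read() = 0;
};

// Source stream: hands out pre-built blocks one at a time.
class VectorSource : public IMiniInputStream
{
public:
    explicit VectorSource(std::vector<MiniBlock> blocks_) : blocks(std::move(blocks_)) {}
    std::optional<MiniBlock> read() override
    {
        if (pos >= blocks.size())
            return std::nullopt;
        return blocks[pos++];
    }
private:
    std::vector<MiniBlock> blocks;
    std::size_t pos = 0;
};

// Filter stream: pulling from it pulls from its child, then builds a new block
// that keeps only positive values. The input block is left untouched.
class PositiveFilter : public IMiniInputStream
{
public:
    explicit PositiveFilter(std::unique_ptr<IMiniInputStream> child_) : child(std::move(child_)) {}
    std::optional<MiniBlock> read() override
    {
        auto in = child->read();
        if (!in)
            return std::nullopt;
        MiniBlock out{in->column_name, {}};
        for (long long v : in->values)
            if (v > 0)
                out.values.push_back(v);
        return out;
    }
private:
    std::unique_ptr<IMiniInputStream> child;
};
```

Calling read() on the outermost stream is the only driver: wrapping a VectorSource in a PositiveFilter and calling read() in a loop walks the whole pipeline one block at a time.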

Query Processing Stages

Streaming Query

-- Continuous aggregation over a stream
SELECT device, count(*), avg(temperature)
FROM sensor_data
GROUP BY device;

Execution flow:

Parser creates AST from SQL
InterpreterSelectQuery builds streaming execution pipeline
Stream shards read from WAL or external sources
Streaming aggregator maintains state
Results emit incrementally as data arrives

Historical Query

-- Query stored data in table mode
SELECT device, count(*), avg(temperature)
FROM table(sensor_data)
GROUP BY device;

Execution flow:

Parser creates AST from SQL
InterpreterSelectQuery builds batch execution pipeline
Read MergeTree parts from disk
Apply aggregations on complete data set
Return final results
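
The difference between the two modes comes down to when aggregation state is finalized. The sketch below illustrates this for count(*) and avg(temperature) grouped by device. It is an assumption-laden toy, not Proton's aggregator: Reading, AggState, StreamingAggregator, and aggregateAll are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct Reading
{
    std::string device;
    double temperature;
};

// Running state per device: enough to answer count(*) and avg(temperature).
struct AggState
{
    std::size_t count = 0;
    double sum = 0.0;
    double avg() const { return count ? sum / static_cast<double>(count) : 0.0; }
};

class StreamingAggregator
{
public:
    // Fold one arriving batch into the retained state and return a snapshot,
    // mimicking incremental emission in a streaming query.
    std::map<std::string, AggState> consume(const std::vector<Reading> & batch)
    {
        for (const auto & r : batch)
        {
            auto & s = state[r.device];
            ++s.count;
            s.sum += r.temperature;
        }
        return state;
    }
private:
    std::map<std::string, AggState> state;
};

// Historical ("table mode") equivalent: one pass over the complete data set,
// with a single final result.
std::map<std::string, AggState> aggregateAll(const std::vector<Reading> & rows)
{
    StreamingAggregator agg;
    return agg.consume(rows);
}
```

Feeding the same rows through consume() in several batches or through aggregateAll() in one pass ends in the same state; the streaming path simply exposes the intermediate snapshots along the way.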

Stream Sharding

Proton distributes streams across multiple shards for parallelism:

// Sharding configuration
UInt32 shards;  // Number of shards
ExpressionActionsPtr sharding_key_expr;  // Sharding expression
std::vector<StreamShardStorePtr> stream_shards;  // Shard stores

Sharding strategies:

Random: Round-robin distribution across shards
Expression-based: Hash on specified columns for co-location
Deterministic: The same key always routes to the same shard

Sharding is transparent to queries: Proton automatically routes reads and writes to the appropriate shards.
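
As a rough illustration of the two routing styles, the sketch below hashes a key modulo the shard count and, separately, rotates through shards without a key. The names shardForKey and RoundRobinRouter are hypothetical, and std::hash is only a stand-in: this document does not show which hash function Proton applies to the sharding expression.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>

// Expression-based routing: hash the sharding key, then reduce it modulo the
// shard count. The same key therefore always lands on the same shard.
// Assumes shards > 0. std::hash is a placeholder hash for this sketch.
std::uint32_t shardForKey(const std::string & key, std::uint32_t shards)
{
    return static_cast<std::uint32_t>(std::hash<std::string>{}(key) % shards);
}

// Keyless routing: rotate through the shards in order.
class RoundRobinRouter
{
public:
    explicit RoundRobinRouter(std::uint32_t shards_) : shards(shards_) {}
    std::uint32_t next() { return counter++ % shards; }
private:
    std::uint32_t shards;
    std::uint32_t counter = 0;
};
```

Key-based routing keeps all rows for one key on one shard, which is what allows per-key state such as a GROUP BY aggregate to stay local to that shard.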

Storage Modes

Proton supports multiple storage backends:

Mode	Description	Use Case
`memory`	In-memory storage	High-speed streaming, temporary data
`default`	Disk-based MergeTree	Historical data, persistent streams
`append_only`	Append-only log	Event sourcing, audit logs
`changelog_kv`	CDC changelog	Database replication, change tracking

Query Execution Pipeline

Write Path

Ingest: Client sends data via INSERT or external stream connector
Shard routing: Sharding expression evaluates to determine target shard(s)
WAL append: Data appends to distributed WAL (Kafka) with async callbacks
Background consumption: Stream shards consume from WAL and write to MergeTree parts
Merge: Background threads merge parts to optimize storage and queries

Unlike LSM-tree engines, Proton does not use a memtable: data is written directly to the filesystem as MergeTree parts. Use batch inserts rather than single-row inserts for best performance.
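
The reason batching matters can be seen in a toy model of the write path. This sketch assumes, purely for illustration, that each WAL record becomes exactly one part when consumed; the real part-creation policy is not described here. ShardSketch, Row, and Part are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Row = std::string;
using Part = std::vector<Row>; // stand-in for one immutable MergeTree part

class ShardSketch
{
public:
    // WAL append: each INSERT batch becomes one log record.
    void append(std::vector<Row> batch) { wal.push_back(std::move(batch)); }

    // Background consumption: every unconsumed record is written as one part
    // (a simplifying assumption of this sketch).
    void consume()
    {
        for (; consumed < wal.size(); ++consumed)
            parts.push_back(wal[consumed]);
    }

    // Merge: combine all existing parts into a single larger part.
    void merge()
    {
        if (parts.size() < 2)
            return;
        Part merged;
        for (const auto & p : parts)
            merged.insert(merged.end(), p.begin(), p.end());
        parts.assign(1, merged);
    }

    std::size_t partCount() const { return parts.size(); }

    std::size_t rowCount() const
    {
        std::size_t n = 0;
        for (const auto & p : parts)
            n += p.size();
        return n;
    }

private:
    std::vector<std::vector<Row>> wal;
    std::size_t consumed = 0;
    std::vector<Part> parts;
};
```

In this model, three single-row inserts leave three small parts for the merger to clean up, while one three-row insert produces a single part immediately.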

Read Path

Streaming Mode

// From StorageStream::read()
void read(
    QueryPlan & query_plan,
    const Names & column_names,
    SelectQueryInfo & query_info,
    ContextPtr context,
    size_t max_block_size,
    size_t num_streams
);

Determine shards to read based on query predicates
Create parallel stream readers for each shard
Apply filters and projections
Stream results continuously

Historical Mode

Uses ClickHouse’s standard MergeTree read path:

Consult primary index to identify candidate parts
Read mark files for column offsets
Decompress and read column blocks
Apply filters and aggregations
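
The first step, narrowing a range scan with a sparse index, can be sketched as a pair of binary searches. This is an illustrative model rather than MergeTree's implementation: SparseIndex and candidateGranules are hypothetical names, keys are plain integers, and rows are assumed to be sorted by key with one index entry per granule.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Sparse index: stores the first key of every granule, not every row.
struct SparseIndex
{
    std::vector<long long> first_keys; // ascending, one entry per granule
};

// Return the half-open range [begin, end) of granule numbers that may contain
// keys in the closed interval [lo, hi]. Granules outside it are never read.
std::pair<std::size_t, std::size_t> candidateGranules(const SparseIndex & idx, long long lo, long long hi)
{
    const auto & k = idx.first_keys;

    // The granule just before the first entry >= lo may still contain lo,
    // so step back by one (conservative when keys repeat across granules).
    auto b = std::lower_bound(k.begin(), k.end(), lo);
    std::size_t begin = (b == k.begin()) ? 0 : static_cast<std::size_t>(b - k.begin()) - 1;

    // Granules whose first key is greater than hi cannot match.
    auto e = std::upper_bound(k.begin(), k.end(), hi);
    std::size_t end = static_cast<std::size_t>(e - k.begin());

    return {begin, end};
}
```

Because the index holds one entry per granule, it stays small enough to keep in memory, at the cost of sometimes reading a granule that turns out to hold no matching rows.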

Performance Characteristics

Benchmarks on Apple M2 Max:

90 million events/sec ingestion throughput
4ms end-to-end latency for streaming queries
High-cardinality aggregation over 1 million unique keys

Optimizations:

Zero-copy reads from Kafka with direct block conversion
SIMD-accelerated aggregations and filters
Sparse primary indexes for range scans
Column-level compression (LZ4, ZSTD)

Fault Tolerance

Proton inherits ClickHouse’s replication model with streaming enhancements:

Replication mechanism

Each stream shard maps to a Kafka partition (or native log)
Data replicates via Kafka replication factor
On failure, consumers resume from last committed offset
No distributed consensus required (no ZooKeeper)

Consistency model

At-least-once delivery by default
Exactly-once with idempotent keys
Eventual consistency for replicas
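
How offset-based recovery leads to at-least-once delivery, and how idempotent keys turn that into an exactly-once effect, can be shown with a small simulation. It is a sketch under stated assumptions, not Proton's consumer: Record and ConsumerSketch are hypothetical names, and the set of seen keys is assumed to survive the simulated crash.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

struct Record
{
    std::string idempotent_key;
    int value;
};

class ConsumerSketch
{
public:
    // Process records from the last committed offset up to `upto` (exclusive).
    // Passing commit = false simulates a crash after processing but before
    // the offset is committed, so the next poll re-reads the same records.
    void poll(const std::vector<Record> & log, std::size_t upto, bool commit)
    {
        const std::size_t limit = std::min(upto, log.size());
        for (std::size_t off = committed; off < limit; ++off)
        {
            ++deliveries;
            // Idempotent apply: a key that was already applied is skipped.
            if (seen.insert(log[off].idempotent_key).second)
                total += log[off].value;
        }
        if (commit)
            committed = limit;
    }

    std::size_t committed = 0;  // last committed offset
    std::size_t deliveries = 0; // how many times any record was delivered
    int total = 0;              // effect of applying records

private:
    std::set<std::string> seen; // assumed durable across restarts in this sketch
};
```

Redelivery after a failure inflates the delivery count, but the applied total stays correct because duplicates are filtered by key.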

Integration with ClickHouse

Proton maintains full compatibility with ClickHouse. For example, an external table lets Proton query a ClickHouse table and join it with live streams:

-- Query ClickHouse external table from Proton
CREATE EXTERNAL TABLE ch_events
SETTINGS type='clickhouse',
         address='clickhouse.example.com:9000',
         database='analytics',
         table='events';

-- Join streaming data with historical ClickHouse data
SELECT s.user_id, s.event, c.user_name
FROM live_events s
JOIN ch_events c ON s.user_id = c.user_id;

Next Steps

Streams: Learn about stream types and operations
External Streams: Connect to Kafka, Pulsar, and other sources
Materialized Views: Build real-time data pipelines
Windows: Time-based aggregations and windowing