Use Case Overview
CDC with Proton enables:- Real-time Database Replication: Sync changes from MySQL/PostgreSQL to ClickHouse or other databases
- Event-driven Architectures: Trigger downstream processes based on data changes
- Audit Logging: Track all modifications to database records
- Data Warehouse Synchronization: Keep analytics databases up-to-date
- Microservice Data Consistency: Propagate changes across service boundaries
Architecture
A typical CDC pipeline looks like:- Debezium: Captures changes from database transaction logs
- Kafka/Redpanda: Streams CDC events reliably
- Timeplus Proton: Transforms and routes CDC data with SQL
- Destination: ClickHouse, another database, or application
Tutorial: MySQL to ClickHouse Replication
This example demonstrates capturing MySQL changes and replicating them to ClickHouse via Proton.Prerequisites
Start the CDC demo stack:docker compose up -d
Step 1: Create ClickHouse Table
In ClickHouse, create the destination table:Step 2: Configure Debezium Connector
Create a Debezium MySQL connector:inventory database.
Step 3: Create External Stream in Proton
Read CDC events from Kafka:payload.op: Operation type (c=create,u=update,d=delete,r=read/snapshot)payload.before: Record state before changepayload.after: Record state after changepayload.ts_ms: Change timestamp
Step 4: Create ClickHouse External Table
Connect Proton to ClickHouse:Step 5: Create Streaming ETL Pipeline
Set up a materialized view to continuously replicate data:- Reads CDC events from Kafka
- Extracts the new record from
payload.after - Writes to ClickHouse continuously
Step 6: Test the Pipeline
Insert data in MySQL:Advanced Pattern: Full CDC with Updates and Deletes
For complete change tracking including updates and deletes, use Proton’s changelog stream:Step 1: Create Changelog Stream
Step 2: Handle All CDC Operations
Create materialized views for each operation type: Read (initial snapshot):Step 3: Query Current State
Get the current snapshot (latest version of each row):CDC with Avro Format
Debezium can produce CDC events in Avro format with schema registry:Step 1: Configure Debezium with Avro
Step 2: Create Avro External Stream
Step 3: Query Avro CDC Stream
Use Case: Real-time Analytics on CDC Data
Perform analytics on database changes:Track Most Updated Tables
Detect High-frequency Updates
Audit Trail
PostgreSQL CDC
The same pattern works for PostgreSQL:Performance Tips
Use seek_to=‘earliest’ Carefully
Batch Writes to ClickHouse
Filter Early
Monitoring CDC Pipeline
Check Connector Status
Monitor Lag
Track Processing Rate
Next Steps
- Learn about Real-time ETL for data transformation
- Explore Feature Pipeline for ML use cases
- Build Telemetry Pipeline for observability