Use Case Overview
Real-time ETL pipelines are essential for:- Data Integration: Moving data between systems in real-time
- Data Masking: Protecting sensitive information like IP addresses or PII
- Format Conversion: Converting between different data formats (JSON, Avro, CSV)
- Data Enrichment: Adding calculated fields or joining with reference data
- Continuous Replication: Keeping downstream systems synchronized
Architecture
A typical real-time ETL pipeline with Proton consists of:- Source: External stream reading from Kafka/Redpanda
- Transform: SQL-based transformations (filter, map, aggregate)
- Sink: External stream or table writing to destination (ClickHouse, Kafka, etc.)
Tutorial: Read from Kafka, Transform, Write to ClickHouse
This example uses the Owl Shop ecommerce simulator to generate realistic frontend events and demonstrates masking sensitive data.Prerequisites
Start the demo stack with Docker Compose:Step 1: Create Source External Stream
First, create an external stream to read raw data from Kafka:Step 2: Prepare ClickHouse Destination
In your ClickHouse instance, create the target table:Step 3: Create ClickHouse External Table in Proton
Connect Proton to ClickHouse using an external table:Step 4: Build the ETL Pipeline with Materialized View
Create a materialized view that continuously:- Reads from the source stream
- Extracts JSON fields
- Masks the IP address using MD5 hash
- Writes to ClickHouse
Step 5: Query Results in ClickHouse
Verify data is arriving in ClickHouse:Alternative: Kafka-to-Kafka ETL
You can also write transformed data back to Kafka:Common Transformation Patterns
JSON Parsing and Extraction
Data Type Conversion
Filtering and Enrichment
Aggregation in ETL
Performance Considerations
Batching
Control batch size for external table writes:Parallel Processing
Enable parallel inserts for better throughput:Monitoring
Check materialized view progress:Error Handling
Skip Invalid Records
Dead Letter Queue
Real-World Example: AWS MSK Integration
For production deployments with AWS MSK (Managed Streaming for Apache Kafka):Next Steps
- Learn about Change Data Capture for database replication
- Explore Telemetry Pipeline for logs and metrics
- Build Feature Pipelines for ML use cases