Basic Aggregate Functions
count()
Counts the number of rows or non-NULL values. Syntax:- Returns
0when input is empty count(column)excludes NULL valuescount(*)andcount()include all rows
src/AggregateFunctions/AggregateFunctionCount.cpp
sum()
Calculates the sum of numeric values. Syntax:sum(x)- Standard sum with type promotionsum_with_overflow(x)- Sum with potential overflow, same result typesum_kahan(x)- Kahan compensated summation for better floating-point precision
- Returns NULL for empty input
- Type promotion: integer → Int64/UInt64, float → Float64
- Supports integers, floats, and decimals
src/AggregateFunctions/AggregateFunctionSum.cpp
avg()
Calculates the arithmetic mean of values. Syntax:- Returns NULL for empty input
- Supports int, uint, float, and decimal types
- Result type is always Float64 for integers, preserves precision for decimals
src/AggregateFunctions/AggregateFunctionAvg.cpp
min() / max()
Return the minimum or maximum value. Syntax:- Returns NULL for empty input
- Works with any comparable type (numbers, strings, dates)
- Maintains data type of input column
src/AggregateFunctions/AggregateFunctionMin.cpp and AggregateFunctionMax.cpp
Advanced Aggregate Functions
arg_min() / arg_max()
Returns the value of one column when another column reaches its minimum or maximum. Syntax:- Order-dependent operation
- Useful for “latest value” patterns in streaming
uniq()
Estimates the number of unique values using HyperLogLog algorithm. Syntax:- Approximate count with memory efficiency
- Accurate for small cardinalities, ~2% error for large sets
group_array()
Collects all values into an array. Syntax:quantile() / median()
Calculate quantiles (percentiles) of numeric data. Syntax:stddev() / variance()
Calculate standard deviation and variance. Syntax:Streaming Aggregation Semantics
Continuous Computation
In streaming queries, aggregations are computed continuously as new data arrives:- Results update incrementally with each new event
- State is maintained in memory for active groups
- No final “complete” state - always evolving
Window-Based Aggregation
Combine aggregations with window functions for bounded computation:- Bounded state: Each window has finite data
- Window emission: Results emitted when window completes
- Resource management: Completed windows can be freed from memory
State Management
Proton manages aggregation state efficiently:- Incremental updates: State updated with each event, not recalculated
- Memory optimization: Compact state representations
- Checkpointing: State can be persisted for fault tolerance (in clustering mode)
Combinator Functions
Aggregation functions can be modified with combinators for special behaviors:-
-Ifcombinator: Conditional aggregation -
-Statecombinator: Return intermediate state -
-Mergecombinator: Merge states
Aggregation with Multiple Levels
Group by multiple columns for hierarchical aggregation:Performance Considerations
Memory Usage
- Unbounded groups: Unlimited GROUP BY keys consume memory
- Use windows: Bound memory with time-based windows
- State cleanup: Completed windows release memory
Computational Efficiency
- Incremental computation: O(1) update per event for most aggregates
- Pre-aggregation: Consider pre-aggregating in materialized views
- Parallelization: Aggregations can be parallelized across partitions
Best Practices
- Use appropriate windows for streaming queries to bound state
- Choose efficient aggregates:
count()andsum()are more efficient thanmedian() - Limit GROUP BY cardinality when possible
- Consider approximate algorithms:
uniq()vscount(distinct)
Aggregation in Materialized Views
Materialized views can maintain pre-computed aggregations:- Pre-computed results for fast queries
- Reduced computation on raw data
- Automatic maintenance as data arrives
Examples: Common Patterns
Running Totals
Moving Averages
Count Distinct Users
Top K Items
Statistical Analysis
See Also
- Window Functions - Time-based windowing for aggregations
- Functions Overview - All function categories
- Time Functions - Working with timestamps in aggregations