ClickHouse query optimization.

Primary key (ORDER BY)

ORDER BY (date, user_id, event)

Queries filtering on prefix use primary index → fast.

WHERE date = today() AND user_id = 1   -- fast
WHERE event = 'x'                       -- slow (last in key)

Skip index

ALTER TABLE events ADD INDEX idx_event event TYPE bloom_filter(0.01) GRANULARITY 4;

For non-key filter columns.

Projections

ALTER TABLE events ADD PROJECTION top_users (
    SELECT user_id, count()
    GROUP BY user_id
);

CH auto-uses for matching queries.

Sampling

SELECT count() FROM events SAMPLE 0.1   -- 10% sample

Requires SAMPLE BY in table.

LIMIT BY

SELECT user_id, event, ts FROM events ORDER BY ts DESC LIMIT 5 BY user_id

Top 5 per user.

arrayJoin / explode

SELECT arrayJoin([1,2,3]) AS x

uniqCombined / uniqExact

uniqCombined(user_id)    -- HyperLogLog, fast, approximate
uniqExact(user_id)       -- exact, more memory

quantile

quantile(0.99)(latency)
quantileExact / quantileTDigest

any / argMax

SELECT user_id, argMax(name, version) AS latest_name FROM users GROUP BY user_id

anyLast / first_value

For pre-aggregated picks.

arrayMap / filter

SELECT arrayMap(x -> x * 2, [1,2,3])
SELECT arrayFilter(x -> x > 5, [1,5,10])

JOIN strategies

SELECT * FROM events ANY LEFT JOIN users USING user_id   -- get first match
SELECT * FROM events INNER JOIN users ON ...

ClickHouse JOINs less optimized than Postgres. Prefer denormalized or dictGet.

Dictionaries (precomputed joins)

CREATE DICTIONARY users_dict (
    user_id UInt32, name String, country String
) PRIMARY KEY user_id
SOURCE(CLICKHOUSE(table 'users'))
LAYOUT(HASHED())
LIFETIME(3600);

-- Use
SELECT dictGet('users_dict', 'name', user_id) FROM events;

In-memory lookup. Replaces small-table joins.

Materialized columns

ALTER TABLE events ADD COLUMN day Date MATERIALIZED toDate(ts);

Computed on insert. Indexable.

EXPLAIN

EXPLAIN syntax SELECT ...;
EXPLAIN AST SELECT ...;
EXPLAIN PIPELINE SELECT ...;
EXPLAIN ESTIMATE SELECT ...;

system.parts

SELECT name, rows, bytes_on_disk FROM system.parts WHERE table = 'events';

system.query_log

SELECT query, query_duration_ms FROM system.query_log
WHERE event_time > now() - INTERVAL 1 HOUR
ORDER BY query_duration_ms DESC LIMIT 20;

Settings per query

SET max_threads = 8;
SET max_memory_usage = 10000000000;

Or SETTINGS max_threads=8 at query end.

Sample optimizations

-- Filter early
WHERE date = today()
   AND user_id IN (SELECT id FROM ...)

-- Use PREWHERE for big filters
SELECT * FROM events PREWHERE event = 'x' WHERE date = today()

-- Limit columns
SELECT user_id, event FROM events   -- not SELECT *

Common mistakes

  • SELECT * on huge table.
  • JOIN on big tables (use dict or denorm).
  • Wrong ORDER BY for query patterns.
  • No partition pruning.
  • Inserting one row at a time.

Read this next

If you want my CH query patterns, they’re at rajpoot.dev .


Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work — see my projects and ways to reach me at rajpoot.dev .