ClickHouse distributed tables
A Distributed table is ClickHouse's sharding primitive: a virtual table that fans queries out to local tables on multiple shards and merges the results. The setup has three steps: define the cluster topology in the server config, create a local MergeTree table on each shard, then create a Distributed table that references the cluster and the local table name. A SELECT against the Distributed table runs on every shard in parallel, and the initiating node aggregates the partial results; an INSERT is routed to one shard based on a sharding key (a hash expression, rand(), or an explicit value). Combined with per-shard replication, this is how petabyte-scale ClickHouse deployments work: data is sharded horizontally, and each shard is internally highly available.

Two practical points. First, the sharding key shapes your query patterns: sharding by user_id keeps each user's data, and therefore per-user queries, on one shard, while rand() spreads data evenly but forces every query to touch every shard. Second, cross-shard JOINs are expensive: GLOBAL JOIN ships the smaller side to all shards, which is usually acceptable for dimension tables but painful otherwise.
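The topology step can be sketched as a server-config fragment. This is a minimal sketch assuming a hypothetical two-shard cluster named my_cluster with one replica per shard; the hostnames are placeholders, and the block lives under the config root alongside the other server settings:

```xml
<!-- config fragment: cluster "my_cluster", two shards, one replica each.
     Hostnames are illustrative placeholders. -->
<remote_servers>
    <my_cluster>
        <shard>
            <replica>
                <host>ch-shard1.internal</host>
                <port>9000</port>
            </replica>
        </shard>
        <shard>
            <replica>
                <host>ch-shard2.internal</host>
                <port>9000</port>
            </replica>
        </shard>
    </my_cluster>
</remote_servers>
```

A real HA deployment would list multiple replica entries per shard; one replica each keeps the sketch minimal.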
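With a cluster defined, the local-table and Distributed-table steps might look like this. The cluster name my_cluster, the table names, and the columns are illustrative assumptions; cityHash64(user_id) is one common choice of hash sharding key:

```sql
-- Local storage table, created on every shard via ON CLUSTER.
CREATE TABLE default.events_local ON CLUSTER my_cluster
(
    user_id    UInt64,
    event_time DateTime,
    event_type String
)
ENGINE = MergeTree
ORDER BY (user_id, event_time);

-- Virtual table: fans SELECTs out to events_local on each shard and
-- routes each INSERT to one shard by cityHash64(user_id).
CREATE TABLE default.events ON CLUSTER my_cluster
AS default.events_local
ENGINE = Distributed(my_cluster, default, events_local, cityHash64(user_id));
```

Queries and inserts then target default.events; sharding on user_id keeps each user's rows together on one shard, whereas swapping the key for rand() would spread rows evenly at the cost of every query touching every shard.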
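The cross-shard JOIN point can be illustrated with a hypothetical dimension table default.users. A plain JOIN would match only against each shard's local data; GLOBAL JOIN instead materializes the right side once on the initiating node and broadcasts it to every shard, so the right side should be small:

```sql
-- The right side (users) is built once and shipped to all shards,
-- so it should be small -- typically a dimension table.
SELECT u.country, count() AS events
FROM default.events AS e
GLOBAL JOIN default.users AS u ON e.user_id = u.id
GROUP BY u.country;
```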