DuckDB has become the embedded analytics engine of 2026. It crossed 1M weekly PyPI downloads, replaced Pandas in many data engineering pipelines, and shows up inside SaaS products as the in-process query engine. This post is the working knowledge.

If you’re a backend engineer who wonders why people are excited about a “SQLite for analytics,” this post is for you.

What DuckDB is

DuckDB is an embedded analytical database. It runs in your process, stores data column-oriented, and executes SQL with vectorized query execution.

  • In-process. No server. Your Python / Node / Rust code links to DuckDB and queries directly.
  • Columnar storage. Optimized for “scan many rows of a few columns” — the analytical workload.
  • Vectorized execution. Operations on chunks of values, not row-by-row.
  • Reads many formats natively. Parquet, CSV, JSON, Arrow, Postgres tables, S3 buckets — all queryable as if they were tables.

Think “SQLite for analytics” with the qualifier that it’s wildly fast at what it’s good at.

When to reach for DuckDB

1. Replacing Pandas for medium-large data

import pandas as pd                                # OOM on 5 GB CSV
df = pd.read_csv("orders.csv")
df.groupby("country")["total"].sum()
import duckdb                                       # streams the file
duckdb.sql("""
    SELECT country, SUM(total) FROM 'orders.csv'
    GROUP BY country
""").to_df()

DuckDB streams the file. Memory stays bounded. The query runs faster than Pandas would, often 5–50×.

2. Querying Parquet directly

duckdb.sql("""
    SELECT user_id, COUNT(*) AS sessions
    FROM 's3://bucket/sessions/year=2026/month=4/*.parquet'
    WHERE country = 'IN'
    GROUP BY user_id
    ORDER BY sessions DESC
    LIMIT 100
""")

DuckDB pushes filters down to Parquet’s row groups. It only reads what it needs. For “ad-hoc analytics over a data lake,” this beats firing up a Spark cluster.

3. Joining Postgres + Parquet + CSV in one query

ATTACH 'postgres://localhost/orders' AS pg (TYPE postgres);

SELECT
  o.order_id,
  o.total,
  c.country
FROM pg.orders o
JOIN 's3://lake/customers/*.parquet' c ON c.id = o.customer_id;

A federated query without an ETL pipeline. The Postgres extension reads from your live OLTP database; the Parquet extension reads from your data lake.

4. Embedded inside SaaS for customer analytics

Each customer’s data → a per-tenant DuckDB file (or Parquet on S3). Queries run in-process inside your app server. No analytics warehouse to maintain. No per-tenant Snowflake bills.

5. Notebook / dev workflows

Replace pd.read_csv with duckdb.sql("SELECT * FROM 'file.csv'"). SQL ergonomics, faster execution, no DataFrame memory limits.

DuckDB next to Postgres

DuckDBPostgres
WorkloadOLAP (read-heavy aggregates)OLTP (concurrent writes)
StorageColumnarRow
ConcurrencyOne writer; many readers (in-process)Multi-writer at scale
Best atAnalytical queries on large dataTransactional integrity, ACID across many sessions
ReplacePandas, small Snowflake/BQ workloadsNone — use a real OLTP database

A typical 2026 stack: Postgres for the application data, DuckDB for analytics. Pipe Postgres → Parquet → DuckDB for read-heavy reports. Or use DuckDB’s Postgres extension to query Postgres tables directly when the data fits.

For Postgres specifics see PostgreSQL 18 features and PostgreSQL JSONB Advanced Patterns .

DuckDB next to ClickHouse / Snowflake

DuckDB is embedded; ClickHouse and Snowflake are servers.

  • For one-laptop or one-VM analytics: DuckDB.
  • For a multi-tenant warehouse with thousands of concurrent analyst queries: ClickHouse / Snowflake.
  • For ad-hoc Parquet reads even against terabytes: DuckDB will surprise you.

The wave that surprised me through 2024–2025: many teams replaced their Snowflake/Redshift monthly bills with DuckDB on a beefy box reading Parquet from S3. Saved 60–80%. Did not regret.

The DuckLake extension

DuckDB v1.5+ ships DuckLake — a table format for transactional, multi-writer analytics on S3-backed Parquet. Think Apache Iceberg / Delta Lake but lighter and DuckDB-native.

CREATE TABLE events (
  id BIGINT, ts TIMESTAMP, user_id BIGINT, action TEXT
)
TBLPROPERTIES ('format'='ducklake');

INSERT INTO events VALUES (...);
COPY events TO 's3://bucket/events/' (FORMAT 'ducklake');

This turns DuckDB into a real lakehouse engine. ACID writes, time travel, schema evolution — without a Spark cluster. For mid-sized data (TB-scale), it’s a credible alternative to Iceberg + Snowflake.

A practical pattern: DuckDB + Postgres + S3

The pattern I see at companies that ship analytics in 2026:

                Postgres (OLTP)
                     
                       CDC / nightly export
                     
                Parquet on S3 (immutable history)
                     
                       query
                     
                DuckDB in app server
                     
                     
                Customer-facing dashboard

Postgres holds the live data. A nightly (or CDC-based ) job exports to Parquet on S3, partitioned by date. DuckDB reads the Parquet for analytics queries. Fresh data goes through Postgres directly when freshness matters.

Result: Postgres stays a small OLTP database; analytics scales with S3 storage; query latency is consistent.

Performance notes

A few production observations:

  • Parquet beats CSV by orders of magnitude. If your data is in CSV today, converting to Parquet is the single biggest perf win.
  • Partition Parquet files by date / tenant. DuckDB reads only the partitions matching your filter.
  • Set memory_limit. Default is 80% of RAM; bound it on shared servers.
  • PRAGMA threads = N. DuckDB parallelizes queries; tune to your cores.
  • Connection per thread. DuckDB connections aren’t thread-safe; cheap to open one per worker.

Encryption and compliance

DuckDB v1.4+ supports AES-256 encryption at rest:

ATTACH 'mydata.duckdb' AS encrypted (ENCRYPTION_KEY 'secret-key-32-bytes-long-here');

Combined with column-level masks (PRAGMA enable_object_cache, view-based access control), DuckDB has reached compliance bar for healthcare, finance, and regulated industries.

DuckDB in agents and AI

A pattern getting popular: agents use DuckDB as scratch space.

  • LLM gets a CSV → DuckDB queries it → LLM gets answers.
  • Agent receives a Postgres dump → loads as DuckDB → analyzes.
  • Multi-step reasoning over data → DuckDB holds intermediate state.

The lightweight in-process model fits agents perfectly. Pair with MCP servers that expose a “query CSV” tool to your agent.

Common mistakes

1. Treating DuckDB as Postgres

DuckDB isn’t multi-writer. Don’t expect concurrent application writes from many processes to work. It’s OLAP.

2. Loading everything into memory

SELECT * FROM 's3://bucket/*.parquet' materialized into a DataFrame defeats DuckDB’s streaming. Keep work as SQL until you’ve reduced; only materialize the result.

3. Forgetting partition pruning

Querying WHERE date >= '2026-01-01' against unpartitioned Parquet scans the lot. Partition by date.

4. Mixing DuckDB and Postgres without thinking

The Postgres extension is amazing but a join across DuckDB + Postgres can pull GB across the wire. For large joins, prefer one side or pre-export.

5. No backup story

DuckDB is a file. Back it up like any file. Or — better — keep the source-of-truth in Postgres / S3 and treat DuckDB as a derivable cache.

Read this next

If you want a DuckDB + Parquet + Postgres CDC starter pipeline, it’s at rajpoot.dev .


Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work — see my projects and ways to reach me at rajpoot.dev .