When should I use DuckDB over Pandas?

Use DuckDB whenever your data won't comfortably fit in Pandas memory (>1 GB), when you want SQL ergonomics, or when you need to query Parquet/CSV/Arrow without ETL. DuckDB will be 5–50× faster for analytical workloads on real-sized data.

Should I use DuckDB instead of Postgres?

DuckDB is OLAP — for analytical queries on large columnar data. Postgres is OLTP — for transactional workloads with concurrent writes. They're complementary. A common 2026 pattern: Postgres for the application; DuckDB for analytics and reports against Parquet exports of Postgres data.

Is DuckDB production-ready?

Yes. v1.0 shipped in 2024, v1.4 added AES-256 encryption, and v1.5+ added DuckLake for managed table formats. It's production-ready for embedded analytics, in-process OLAP, and as a query engine over Parquet/Arrow data lakes. It is not a multi-writer transactional database.

DuckDB in Production — Embedded Analytics Eating the Data World

DuckDB has become the embedded analytics engine of 2026. It crossed 1M weekly PyPI downloads, replaced Pandas in many data engineering pipelines, and shows up inside SaaS products as the in-process query engine. This post is the working knowledge.

If you’re a backend engineer who wonders why people are excited about a “SQLite for analytics,” this post is for you.

What DuckDB is

DuckDB is an embedded analytical database. It runs in your process, stores data column-oriented, and executes SQL with vectorized query execution.

In-process. No server. Your Python / Node / Rust code links to DuckDB and queries directly.
Columnar storage. Optimized for “scan many rows of a few columns” — the analytical workload.
Vectorized execution. Operations on chunks of values, not row-by-row.
Reads many formats natively. Parquet, CSV, JSON, Arrow, Postgres tables, S3 buckets — all queryable as if they were tables.

Think “SQLite for analytics” with the qualifier that it’s wildly fast at what it’s good at.

When to reach for DuckDB

1. Replacing Pandas for medium-large data

import pandas as pd                                # OOM on 5 GB CSV
df = pd.read_csv("orders.csv")
df.groupby("country")["total"].sum()

import duckdb                                       # streams the file
duckdb.sql("""
    SELECT country, SUM(total) FROM 'orders.csv'
    GROUP BY country
""").to_df()

DuckDB streams the file. Memory stays bounded. The query runs faster than Pandas would, often 5–50×.

2. Querying Parquet directly

duckdb.sql("""
    SELECT user_id, COUNT(*) AS sessions
    FROM 's3://bucket/sessions/year=2026/month=4/*.parquet'
    WHERE country = 'IN'
    GROUP BY user_id
    ORDER BY sessions DESC
    LIMIT 100
""")

DuckDB pushes filters down to Parquet’s row groups. It only reads what it needs. For “ad-hoc analytics over a data lake,” this beats firing up a Spark cluster.

3. Joining Postgres + Parquet + CSV in one query

ATTACH 'postgres://localhost/orders' AS pg (TYPE postgres);

SELECT
  o.order_id,
  o.total,
  c.country
FROM pg.orders o
JOIN 's3://lake/customers/*.parquet' c ON c.id = o.customer_id;

A federated query without an ETL pipeline. The Postgres extension reads from your live OLTP database; the Parquet extension reads from your data lake.

4. Embedded inside SaaS for customer analytics

Each customer’s data → a per-tenant DuckDB file (or Parquet on S3). Queries run in-process inside your app server. No analytics warehouse to maintain. No per-tenant Snowflake bills.

5. Notebook / dev workflows

Replace pd.read_csv with duckdb.sql("SELECT * FROM 'file.csv'"). SQL ergonomics, faster execution, no DataFrame memory limits.

DuckDB next to Postgres

	DuckDB	Postgres
Workload	OLAP (read-heavy aggregates)	OLTP (concurrent writes)
Storage	Columnar	Row
Concurrency	One writer; many readers (in-process)	Multi-writer at scale
Best at	Analytical queries on large data	Transactional integrity, ACID across many sessions
Replace	Pandas, small Snowflake/BQ workloads	None — use a real OLTP database

A typical 2026 stack: Postgres for the application data, DuckDB for analytics. Pipe Postgres → Parquet → DuckDB for read-heavy reports. Or use DuckDB’s Postgres extension to query Postgres tables directly when the data fits.

For Postgres specifics see PostgreSQL 18 features and PostgreSQL JSONB Advanced Patterns .

DuckDB next to ClickHouse / Snowflake

DuckDB is embedded; ClickHouse and Snowflake are servers.

For one-laptop or one-VM analytics: DuckDB.
For a multi-tenant warehouse with thousands of concurrent analyst queries: ClickHouse / Snowflake.
For ad-hoc Parquet reads even against terabytes: DuckDB will surprise you.

The wave that surprised me through 2024–2025: many teams replaced their Snowflake/Redshift monthly bills with DuckDB on a beefy box reading Parquet from S3. Saved 60–80%. Did not regret.

The DuckLake extension

DuckDB v1.5+ ships DuckLake — a table format for transactional, multi-writer analytics on S3-backed Parquet. Think Apache Iceberg / Delta Lake but lighter and DuckDB-native.

CREATE TABLE events (
  id BIGINT, ts TIMESTAMP, user_id BIGINT, action TEXT
)
TBLPROPERTIES ('format'='ducklake');

INSERT INTO events VALUES (...);
COPY events TO 's3://bucket/events/' (FORMAT 'ducklake');

This turns DuckDB into a real lakehouse engine. ACID writes, time travel, schema evolution — without a Spark cluster. For mid-sized data (TB-scale), it’s a credible alternative to Iceberg + Snowflake.

A practical pattern: DuckDB + Postgres + S3

The pattern I see at companies that ship analytics in 2026:

                Postgres (OLTP)
                     │
                     │  CDC / nightly export
                     ▼
                Parquet on S3 (immutable history)
                     │
                     │  query
                     ▼
                DuckDB in app server
                     │
                     ▼
                Customer-facing dashboard

Postgres holds the live data. A nightly (or CDC-based ) job exports to Parquet on S3, partitioned by date. DuckDB reads the Parquet for analytics queries. Fresh data goes through Postgres directly when freshness matters.

Result: Postgres stays a small OLTP database; analytics scales with S3 storage; query latency is consistent.

Performance notes

A few production observations:

Parquet beats CSV by orders of magnitude. If your data is in CSV today, converting to Parquet is the single biggest perf win.
Partition Parquet files by date / tenant. DuckDB reads only the partitions matching your filter.
Set memory_limit. Default is 80% of RAM; bound it on shared servers.
PRAGMA threads = N. DuckDB parallelizes queries; tune to your cores.
Connection per thread. DuckDB connections aren’t thread-safe; cheap to open one per worker.

Encryption and compliance

DuckDB v1.4+ supports AES-256 encryption at rest:

ATTACH 'mydata.duckdb' AS encrypted (ENCRYPTION_KEY 'secret-key-32-bytes-long-here');

Combined with column-level masks (PRAGMA enable_object_cache, view-based access control), DuckDB has reached compliance bar for healthcare, finance, and regulated industries.

DuckDB in agents and AI

A pattern getting popular: agents use DuckDB as scratch space.

LLM gets a CSV → DuckDB queries it → LLM gets answers.
Agent receives a Postgres dump → loads as DuckDB → analyzes.
Multi-step reasoning over data → DuckDB holds intermediate state.

The lightweight in-process model fits agents perfectly. Pair with MCP servers that expose a “query CSV” tool to your agent.

Common mistakes

1. Treating DuckDB as Postgres

DuckDB isn’t multi-writer. Don’t expect concurrent application writes from many processes to work. It’s OLAP.

2. Loading everything into memory

SELECT * FROM 's3://bucket/*.parquet' materialized into a DataFrame defeats DuckDB’s streaming. Keep work as SQL until you’ve reduced; only materialize the result.

3. Forgetting partition pruning

Querying WHERE date >= '2026-01-01' against unpartitioned Parquet scans the lot. Partition by date.

4. Mixing DuckDB and Postgres without thinking

The Postgres extension is amazing but a join across DuckDB + Postgres can pull GB across the wire. For large joins, prefer one side or pre-export.

5. No backup story

DuckDB is a file. Back it up like any file. Or — better — keep the source-of-truth in Postgres / S3 and treat DuckDB as a derivable cache.

What DuckDB is#

When to reach for DuckDB#

1. Replacing Pandas for medium-large data#

2. Querying Parquet directly#

3. Joining Postgres + Parquet + CSV in one query#

4. Embedded inside SaaS for customer analytics#

5. Notebook / dev workflows#

DuckDB next to Postgres#

DuckDB next to ClickHouse / Snowflake#

The DuckLake extension#

A practical pattern: DuckDB + Postgres + S3#

Performance notes#

Encryption and compliance#

DuckDB in agents and AI#

Common mistakes#

1. Treating DuckDB as Postgres#

2. Loading everything into memory#

3. Forgetting partition pruning#

4. Mixing DuckDB and Postgres without thinking#

5. No backup story#

Read this next#