Python data tooling improved dramatically. Polars matured; DuckDB took off; Ibis grew. By 2026 a Python data pipeline can be 10× faster than the 2020 version with a fraction of the memory. This post is the working set.
The new stack
- Polars: DataFrame library; Rust core; lazy evaluation.
- DuckDB: in-process analytical SQL DB; queries files directly.
- Ibis: portable expression API; same code → DuckDB / BigQuery / Snowflake / Postgres.
- Arrow: shared columnar format.
- Parquet: columnar storage on disk.
These compose. Each plays a role.
Polars basics
import polars as pl
df = pl.read_csv("data.csv")
result = (
df.filter(pl.col("price") > 100)
.group_by("category")
.agg(pl.col("price").mean())
.sort("price", descending=True)
)
Method chaining; no inplace. Beats pandas on speed and ergonomics.
Lazy
result = (
pl.scan_csv("huge.csv") # lazy; doesn't load
.filter(pl.col("price") > 100)
.group_by("category")
.agg(pl.col("price").mean())
.collect() # executes optimized plan
)
Query optimizer decides what to read; predicate pushdown to file level. Massive memory savings on big files.
Polars vs pandas
For a 10GB CSV with filter+groupby+agg:
- pandas: ~120s, peaks 8GB RAM.
- Polars (eager): ~25s, ~3GB.
- Polars (lazy): ~12s, ~1GB.
Speed and memory both. For ETL: Polars wins.
DuckDB
import duckdb
# Query a CSV / Parquet / Postgres directly
df = duckdb.sql("""
SELECT category, AVG(price) as avg
FROM 'data.csv'
WHERE price > 100
GROUP BY category
ORDER BY avg DESC
""").df()
Full SQL on files. No load step. No DB to manage.
# Query Postgres directly
duckdb.sql("INSTALL postgres; LOAD postgres;")
df = duckdb.sql("""
SELECT *
FROM postgres_scan('host=db user=app dbname=app', 'public', 'orders')
WHERE created_at > '2026-01-01'
""").df()
Queries straight from Postgres. For mixed-source analytics.
Polars + DuckDB
import polars as pl
import duckdb
df = pl.read_parquet("orders.parquet")
# Use DuckDB for SQL
result = duckdb.sql("""
SELECT category, AVG(price) FROM df GROUP BY category
""").pl() # back to Polars
# Continue Polars chain
result = result.filter(pl.col("avg") > 50).sort("avg")
Zero-copy via Arrow. Fluid composition.
Ibis
import ibis
# Connect once
con = ibis.duckdb.connect()
# or ibis.bigquery.connect(...), ibis.snowflake.connect(...), etc.
orders = con.table("orders")
result = (
orders.filter(orders.price > 100)
.group_by("category")
.aggregate(avg_price=orders.price.mean())
.order_by(ibis.desc("avg_price"))
)
# Same code; runs on whatever backend
df = result.execute()
Write once; run on any supported engine. Useful when you start local (DuckDB) and graduate to warehouse (BigQuery / Snowflake).
Real pipeline
import polars as pl
from datetime import datetime
def ingest(date: str):
raw = pl.scan_csv(f"raw/{date}/*.csv")
cleaned = (
raw
.with_columns(
pl.col("ts").str.to_datetime(),
pl.col("amount").cast(pl.Float64),
)
.filter(pl.col("amount") > 0)
)
cleaned.sink_parquet(f"clean/{date}.parquet", compression="snappy")
def aggregate(date: str):
df = pl.scan_parquet(f"clean/{date}.parquet")
summary = (
df.group_by(["category", pl.col("ts").dt.date()])
.agg(pl.col("amount").sum().alias("total"), pl.len().alias("count"))
.collect()
)
summary.write_parquet(f"summary/{date}.parquet")
Composes: ingest from raw, transform, write parquet, aggregate. All lazy / streaming where possible.
Streaming
For files larger than RAM:
df = pl.scan_csv("huge.csv")
df.group_by("k").agg(pl.col("v").sum()).sink_parquet("out.parquet")
sink_parquet streams output. Memory stays bounded.
When to keep pandas
- Sklearn / PyTorch interop: ML libs expect pandas/numpy.
- Plotly / Matplotlib: pandas is first-class.
- Existing notebooks: don’t rewrite working code.
- Small data: < 100MB; perf doesn’t matter; staff comfort matters.
Common pattern: ETL in Polars, hand off .to_pandas() at the model boundary.
Parquet
For all storage: Parquet, not CSV.
- 5-10× smaller (columnar compression).
- Faster reads.
- Schema preserved (types, nullability).
- Push-down filtering.
df.write_parquet("out.parquet", compression="zstd")
Always.
Distributed: when to graduate
Polars is in-memory. For multi-TB:
- DuckDB scales surprisingly far (single machine, all-core).
- dbt + warehouse (BigQuery / Snowflake / Redshift).
- Spark if you need distributed; Polars-shaped APIs via PySpark or Sail.
- Daft / Ray Data for distributed Polars-like.
Most data fits on one big machine in 2026 (256GB+ available). Graduate when it doesn’t.
Versioning data
For pipelines that evolve:
- DVC: Git for data.
- LakeFS: branchable data lakes.
- Iceberg / Delta: table formats with versioning.
Hash inputs / outputs; pin pipeline versions.
Common mistakes
1. CSV everywhere
CSV is great for handoff; bad for storage. Convert to Parquet at ingest.
2. Pandas because “you know it”
For >1GB: Polars / DuckDB are dramatically faster. Learn one.
3. Eager everything
Loading 50GB CSV before filtering. pl.scan_* + lazy + collect.
4. Custom SQL in DuckDB for what Polars does
Both work; pick the cleaner expression. Don’t mix unnecessarily.
5. No schema validation
Bad data flows through; later steps fail mysteriously. pydantic or schema-aware reading.
What I’d ship today
For new Python data pipelines:
- Polars for ETL.
- DuckDB for SQL where natural.
- Parquet for all stored data.
- Ibis if multi-backend (start DuckDB; graduate later).
- Pandas at ML boundary.
- Schema validation at ingest.
- DAG runner (Argo / Dagster / Prefect) for scheduling.
Read this next
If you want my Polars + DuckDB pipeline starter, it’s at rajpoot.dev .
Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work — see my projects and ways to reach me at rajpoot.dev .