Python data tooling improved dramatically. Polars matured; DuckDB took off; Ibis grew. By 2026 a Python data pipeline can be 10× faster than the 2020 version with a fraction of the memory. This post is the working set.

The new stack

  • Polars: DataFrame library; Rust core; lazy evaluation.
  • DuckDB: in-process analytical SQL DB; queries files directly.
  • Ibis: portable expression API; same code → DuckDB / BigQuery / Snowflake / Postgres.
  • Arrow: shared columnar format.
  • Parquet: columnar storage on disk.

These compose. Each plays a role.

Polars basics

import polars as pl

df = pl.read_csv("data.csv")
result = (
    df.filter(pl.col("price") > 100)
    .group_by("category")
    .agg(pl.col("price").mean())
    .sort("price", descending=True)
)

Method chaining; no inplace. Beats pandas on speed and ergonomics.

Lazy

result = (
    pl.scan_csv("huge.csv")        # lazy; doesn't load
    .filter(pl.col("price") > 100)
    .group_by("category")
    .agg(pl.col("price").mean())
    .collect()                     # executes optimized plan
)

Query optimizer decides what to read; predicate pushdown to file level. Massive memory savings on big files.

Polars vs pandas

For a 10GB CSV with filter+groupby+agg:

  • pandas: ~120s, peaks 8GB RAM.
  • Polars (eager): ~25s, ~3GB.
  • Polars (lazy): ~12s, ~1GB.

Speed and memory both. For ETL: Polars wins.

DuckDB

import duckdb

# Query a CSV / Parquet / Postgres directly
df = duckdb.sql("""
SELECT category, AVG(price) as avg
FROM 'data.csv'
WHERE price > 100
GROUP BY category
ORDER BY avg DESC
""").df()

Full SQL on files. No load step. No DB to manage.

# Query Postgres directly
duckdb.sql("INSTALL postgres; LOAD postgres;")
df = duckdb.sql("""
SELECT *
FROM postgres_scan('host=db user=app dbname=app', 'public', 'orders')
WHERE created_at > '2026-01-01'
""").df()

Queries straight from Postgres. For mixed-source analytics.

Polars + DuckDB

import polars as pl
import duckdb

df = pl.read_parquet("orders.parquet")

# Use DuckDB for SQL
result = duckdb.sql("""
SELECT category, AVG(price) FROM df GROUP BY category
""").pl()  # back to Polars

# Continue Polars chain
result = result.filter(pl.col("avg") > 50).sort("avg")

Zero-copy via Arrow. Fluid composition.

Ibis

import ibis

# Connect once
con = ibis.duckdb.connect()
# or ibis.bigquery.connect(...), ibis.snowflake.connect(...), etc.

orders = con.table("orders")

result = (
    orders.filter(orders.price > 100)
    .group_by("category")
    .aggregate(avg_price=orders.price.mean())
    .order_by(ibis.desc("avg_price"))
)

# Same code; runs on whatever backend
df = result.execute()

Write once; run on any supported engine. Useful when you start local (DuckDB) and graduate to warehouse (BigQuery / Snowflake).

Real pipeline

import polars as pl
from datetime import datetime

def ingest(date: str):
    raw = pl.scan_csv(f"raw/{date}/*.csv")
    cleaned = (
        raw
        .with_columns(
            pl.col("ts").str.to_datetime(),
            pl.col("amount").cast(pl.Float64),
        )
        .filter(pl.col("amount") > 0)
    )
    cleaned.sink_parquet(f"clean/{date}.parquet", compression="snappy")

def aggregate(date: str):
    df = pl.scan_parquet(f"clean/{date}.parquet")
    summary = (
        df.group_by(["category", pl.col("ts").dt.date()])
        .agg(pl.col("amount").sum().alias("total"), pl.len().alias("count"))
        .collect()
    )
    summary.write_parquet(f"summary/{date}.parquet")

Composes: ingest from raw, transform, write parquet, aggregate. All lazy / streaming where possible.

Streaming

For files larger than RAM:

df = pl.scan_csv("huge.csv")
df.group_by("k").agg(pl.col("v").sum()).sink_parquet("out.parquet")

sink_parquet streams output. Memory stays bounded.

When to keep pandas

  • Sklearn / PyTorch interop: ML libs expect pandas/numpy.
  • Plotly / Matplotlib: pandas is first-class.
  • Existing notebooks: don’t rewrite working code.
  • Small data: < 100MB; perf doesn’t matter; staff comfort matters.

Common pattern: ETL in Polars, hand off .to_pandas() at the model boundary.

Parquet

For all storage: Parquet, not CSV.

  • 5-10× smaller (columnar compression).
  • Faster reads.
  • Schema preserved (types, nullability).
  • Push-down filtering.
df.write_parquet("out.parquet", compression="zstd")

Always.

Distributed: when to graduate

Polars is in-memory. For multi-TB:

  • DuckDB scales surprisingly far (single machine, all-core).
  • dbt + warehouse (BigQuery / Snowflake / Redshift).
  • Spark if you need distributed; Polars-shaped APIs via PySpark or Sail.
  • Daft / Ray Data for distributed Polars-like.

Most data fits on one big machine in 2026 (256GB+ available). Graduate when it doesn’t.

Versioning data

For pipelines that evolve:

  • DVC: Git for data.
  • LakeFS: branchable data lakes.
  • Iceberg / Delta: table formats with versioning.

Hash inputs / outputs; pin pipeline versions.

Common mistakes

1. CSV everywhere

CSV is great for handoff; bad for storage. Convert to Parquet at ingest.

2. Pandas because “you know it”

For >1GB: Polars / DuckDB are dramatically faster. Learn one.

3. Eager everything

Loading 50GB CSV before filtering. pl.scan_* + lazy + collect.

4. Custom SQL in DuckDB for what Polars does

Both work; pick the cleaner expression. Don’t mix unnecessarily.

5. No schema validation

Bad data flows through; later steps fail mysteriously. pydantic or schema-aware reading.

What I’d ship today

For new Python data pipelines:

  • Polars for ETL.
  • DuckDB for SQL where natural.
  • Parquet for all stored data.
  • Ibis if multi-backend (start DuckDB; graduate later).
  • Pandas at ML boundary.
  • Schema validation at ingest.
  • DAG runner (Argo / Dagster / Prefect) for scheduling.

Read this next

If you want my Polars + DuckDB pipeline starter, it’s at rajpoot.dev .


Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work — see my projects and ways to reach me at rajpoot.dev .