Should I switch from Pandas to Polars?

For new code: yes — Polars is faster, uses less memory, and the API is cleaner. For existing Pandas code that works: rewrite incrementally as bottlenecks appear. For data science exploration in notebooks where speed doesn't matter: Pandas is still fine.

Is Polars faster than Pandas?

Typically 5–30× faster for large operations, with much lower memory. Polars is built on Rust + Apache Arrow with a query optimizer that fuses operations and pushes filters down. The gap grows as data size grows.

Can I use Polars in pipelines with DuckDB?

Yes. Polars and DuckDB both use Apache Arrow as the in-memory format, so data usually moves between them without serialization and often without a copy. The pattern: DuckDB for SQL-shaped queries, Polars for transform-heavy DataFrame operations.

Polars in 2026 — The DataFrame Library Replacing Pandas in Production

By 2026, Polars is the right DataFrame library for production Python. Faster than Pandas, lower memory, query optimizer, lazy execution, native Apache Arrow. The API is different enough to require thought — but cleaner once it clicks.

This post is the working guide for backend and data engineers. What’s different, when to switch, and the patterns that pay off.

Why Polars wins

Three architectural choices set Polars apart:

Rust + Arrow. Compiled, multithreaded by default, columnar memory layout.
Query optimizer. Like SQL — operations are planned before execution. Filters get pushed down. Common subexpressions get hoisted. You write declarative code; Polars makes it efficient.
Lazy execution. Build a DAG; Polars doesn’t execute until you ask for the result.

Pandas is eager, mostly single-threaded Python with NumPy underneath. Polars is multithreaded Rust with Arrow underneath, plus a query optimizer. The difference shows up in most analytical benchmarks.

Numbers (cherry-picked, but honest)

A 5 GB CSV with 50M rows, group-by-and-aggregate:

Approach	Time	Peak memory
Pandas (read_csv + groupby)	240 s	14 GB
Polars (eager)	18 s	6 GB
Polars (lazy + scan_csv)	9 s	1.5 GB
DuckDB (`SELECT ... GROUP BY`)	7 s	1.2 GB

For analytical workloads at scale, Polars (or DuckDB) is the answer. Pandas works but pays a tax.

Eager and lazy

Polars has two API surfaces:

import polars as pl

# Eager — executes immediately, like Pandas
df = pl.read_csv("orders.csv")
result = df.filter(pl.col("country") == "IN").group_by("user_id").agg(pl.col("total").sum())

# Lazy — builds a query plan, executes on .collect()
result = (
    pl.scan_csv("orders.csv")                 # doesn't read the file yet
    .filter(pl.col("country") == "IN")
    .group_by("user_id")
    .agg(pl.col("total").sum())
    .collect()                                 # NOW execute
)

Lazy is the better default for production, because Polars can:

Push the filter into the CSV reader (rows that fail the filter are dropped during the scan instead of being loaded first).
Skip columns the query doesn’t use (projection pushdown).
Reorder operations for efficiency.

For interactive notebooks, eager is fine. For pipelines, default to lazy.

Common operations

Reading

pl.scan_parquet("s3://bucket/year=*/month=*/data.parquet")     # lazy scan over a glob
pl.scan_csv("data.csv")
pl.scan_ndjson("data.jsonl")
pl.read_database_uri("SELECT * FROM users", connection_uri)    # pl.read_database takes a connection object instead
pl.from_arrow(arrow_table)                                      # usually zero-copy from Arrow

scan_* for lazy; read_* for eager.

Selecting and filtering

df.select(["user_id", "total"])
df.filter(pl.col("status") == "paid")
df.filter((pl.col("total") > 100) & (pl.col("country").is_in(["IN", "US"])))

pl.col(...) is the column reference. Compose freely.

Group by

df.group_by("country").agg(
    pl.col("total").sum().alias("revenue"),
    pl.col("user_id").n_unique().alias("users"),
    pl.col("total").mean().alias("avg_order"),
)

Aggregation is via agg(...) with named expressions. Cleaner than Pandas’s .agg({...}) dict syntax.

Joins

df.join(other_df, on="user_id", how="left")
df.join(other_df, left_on="user_id", right_on="id", how="inner")
df.join_asof(other_df, on="timestamp", by="user_id")          # time-based fuzzy join

join_asof is for time series: with the default backward strategy, it matches each row in left to the nearest preceding row in right. Both frames must be sorted by the join key. Killer feature for event analytics.

Window functions

df.with_columns([
    pl.col("total").sum().over("user_id").alias("user_total"),
    pl.col("total").rank().over("user_id").alias("user_rank"),
])

.over(group_cols) gives window-function semantics. The Pandas equivalent is groupby(...).transform(...), which is much less ergonomic.

Time series

df.with_columns(pl.col("ts").dt.truncate("1h"))
df.group_by_dynamic("ts", every="1h", group_by="user_id").agg(pl.col("amount").sum())   # ts must be sorted; older releases spell the keyword by=

group_by_dynamic with time intervals is genuinely better than Pandas’s resample. Bucket events into hourly windows per user with one call.

Expressions — the killer concept

Polars expressions are reusable, composable, lazy:

discount_amount = pl.col("price") * pl.col("discount_pct") / 100
final_price = pl.col("price") - discount_amount

df.with_columns([
    discount_amount.alias("discount"),
    final_price.alias("final"),
])

Build expressions once; use everywhere. Polars optimizes them as a unit.

Polars + DuckDB

The 2026 production pattern: Polars for DataFrame transforms; DuckDB for SQL queries. Share Arrow:

import polars as pl
import duckdb

# Polars to DuckDB (zero-copy via Arrow)
df = pl.scan_parquet("data.parquet").filter(...).collect()
duckdb.sql("SELECT user_id, COUNT(*) FROM df GROUP BY user_id").pl()   # .pl() returns a Polars DataFrame; .df() would return Pandas

# DuckDB to Polars
arrow_table = duckdb.sql("SELECT * FROM 'data.parquet'").arrow()
df = pl.from_arrow(arrow_table)

Both libraries are Arrow-native, so moving data between them is usually a buffer hand-off rather than a serialization step. See DuckDB in Production.

A real ETL pipeline

import polars as pl
from datetime import date

def daily_revenue_report(d: date) -> pl.DataFrame:
    return (
        pl.scan_parquet(f"s3://bucket/orders/year={d.year}/month={d.month:02d}/day={d.day:02d}/*.parquet")
        .filter(pl.col("status") == "paid")
        .group_by(["country", "category"])
        .agg(
            pl.col("amount_cents").sum().alias("revenue_cents"),
            pl.col("order_id").count().alias("orders"),
            pl.col("user_id").n_unique().alias("buyers"),
        )
        .with_columns([
            (pl.col("revenue_cents") / 100).alias("revenue_dollars"),
        ])
        .sort("revenue_cents", descending=True)
        .collect()
    )

report = daily_revenue_report(date.today())
report.write_parquet("reports/daily.parquet")

This pipeline:

Scans Parquet from S3 lazily, reading only the columns the query needs.
Pushes the filter into the Parquet reader.
Computes aggregates in parallel.
Writes the result.

End-to-end on a few-million-row dataset: seconds.

Migration from Pandas

Polars’s API is 80% similar to Pandas; 20% intentionally different. The Pandas idioms that don’t carry over:

Index. Polars doesn’t have one. Use a column. (Refreshing.)
Mutating operations. Polars is immutable by default. df["col"] = ... doesn’t exist; use df.with_columns(...).
apply with arbitrary Python. Called map_elements in current Polars. Slow, because it escapes the Rust path. Almost always replaceable with native expressions.
iloc / loc. Polars uses straightforward indexing or filtering.

Migration recipe:

Find the slow Pandas operations. They’re the candidates.
Rewrite in Polars using lazy. Often shorter than the original.
Compare results on a sample.
Replace.

For exploration in Jupyter, df.to_pandas() and pl.from_pandas(...) let you go back and forth (the polars[pandas] extra installs the dependencies they need). Don’t migrate the notebook; migrate the production pipeline.

When Pandas still wins

Heavy ecosystem dependencies (statsmodels, scikit-learn examples) that take Pandas DataFrames.
Quick exploration where you’ll throw the code away.
Existing codebases that work fine. Don’t rewrite for fashion.

For everyone else writing data pipelines, ETL jobs, or production analytical code: Polars is the upgrade.

Common mistakes

1. Calling `.collect()` too early

A lazy plan turned into a DataFrame too soon loses optimization. Build the whole pipeline lazy, then collect once.

2. Using `.map_elements()` (formerly `.apply()`) for what an expression can do

# ⛔ Slow: escapes to Python, one call per value
df.with_columns(pl.col("name").map_elements(lambda s: s.upper(), return_dtype=pl.String))

# ✅ Fast: native Rust expression
df.with_columns(pl.col("name").str.to_uppercase())

Always check the expression API before falling back to Python.

3. Treating the index as a thing

It isn’t. The first column has no special meaning in Polars. If you want primary-key behavior, filter on pl.col("id").

4. Not parallelizing IO

Polars parallelizes computation by default. For multi-file IO use scan_* with a glob — Polars distributes the work.

5. Mixing eager and lazy

df = pl.read_csv(...).filter(...)                # eager
df = df.lazy().group_by(...).agg(...).collect()  # back to lazy then eager

If you find yourself flipping repeatedly, just stay lazy throughout and collect once at the end.
