You can’t optimize what you don’t measure. Python’s profiling story has improved dramatically in the last few years: production-safe sampling with py-spy, CPU, GPU, and memory in one tool with scalene, leak hunting with memray. This post is the working playbook.

py-spy: production sampler

pip install py-spy

# Live top
py-spy top --pid 12345

# Flamegraph
py-spy record -o profile.svg --pid 12345 --duration 30

# Dump (current stack of all threads)
py-spy dump --pid 12345

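Zero code changes. Works on running production processes. py-spy samples call stacks from a separate process instead of instrumenting your code, so overhead stays low (typically ~5%; it rises with the sampling rate).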
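To make “sampling” concrete, here is a minimal in-process sketch of the idea: wake up on a timer, look at a thread's stack, count what is on it. This is not how py-spy is implemented (py-spy reads the target's memory from outside the process); `sample_stacks`, `busy_work`, and `run_demo` are hypothetical names for illustration, and `sys._current_frames()` is CPython-specific.

```python
import collections
import sys
import threading
import time


def sample_stacks(target_thread_id, duration=0.5, interval=0.01):
    """Periodically sample one thread's stack and count every function on it."""
    counts = collections.Counter()
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        # CPython-specific: maps thread id -> that thread's current frame.
        frame = sys._current_frames().get(target_thread_id)
        while frame is not None:
            counts[frame.f_code.co_name] += 1
            frame = frame.f_back
        time.sleep(interval)
    return counts


def busy_work(stop_event):
    # Hypothetical hot function: spins until told to stop.
    total = 0
    while not stop_event.is_set():
        total += sum(i * i for i in range(1000))
    return total


def run_demo():
    stop = threading.Event()
    worker = threading.Thread(target=busy_work, args=(stop,))
    worker.start()
    try:
        return sample_stacks(worker.ident)
    finally:
        stop.set()
        worker.join()


if __name__ == "__main__":
    for name, hits in run_demo().most_common(5):
        print(f"{hits:5d}  {name}")
```

Functions that show up in most samples are where the time goes. The cost is one stack walk per tick, which is why sampling is cheap compared to tracing every call.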

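For a deployed service (py-spy has to be installed in the container, and the container typically needs the SYS_PTRACE capability):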

docker exec -it api-pod py-spy record -o /tmp/profile.svg --pid 1 --duration 30
docker cp api-pod:/tmp/profile.svg ./

Open the SVG in a browser. Hot paths jump out.

scalene: line-level

pip install scalene

scalene myscript.py
# or
python -m scalene myscript.py

Per-line CPU, memory, GPU usage. Distinguishes Python time from native (C extension) time. Output: HTML report.
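A small workload to try this on, assuming you save it as a hypothetical `workload.py` and run `scalene workload.py`. The function names are made up for illustration. The loop in `python_heavy` should show up mostly as Python time, while the `sorted`/`sum` line in `native_heavy` should show up mostly as native time, since those builtins run in C.

```python
def python_heavy(n):
    """Pure-Python loop: time is spent in the bytecode interpreter."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def native_heavy(n):
    """Similar volume of work pushed into C: sorted() and sum() run natively."""
    values = list(range(n, 0, -1))
    return sum(sorted(values))


def main(n=200_000):
    return python_heavy(n), native_heavy(n)


if __name__ == "__main__":
    print(main())
```

If the Python-time column dominates on a line, vectorizing or pushing work into builtins may help. If native time dominates, the fix is usually a different algorithm or library, not micro-tuning the Python around it.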

For an API:

scalene --html --outfile profile.html myapp.py
# load test for a few minutes; ctrl-c
# open profile.html

Beats cProfile for finding which lines actually matter, since cProfile only reports per-function totals.

memray: memory profiling

pip install memray

memray run -o output.bin myscript.py
memray flamegraph output.bin

Records every allocation. Find:

  • Peak memory users.
  • Memory leaks (allocations that are never freed).
  • Allocation hot paths.

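For long-running processes:

memray attach 12345

Live tracking on a running process. Needs gdb or lldb available to inject the tracker. To watch a script live from startup instead, use memray run --live myscript.py.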

tracemalloc (built-in)

import tracemalloc

tracemalloc.start()
# ... run ...
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:10]:
    print(stat)

Built-in; less powerful than memray but no install. Good for “where did all my memory go?” smoke tests.
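For leak hunting, a single snapshot is less useful than a diff. A minimal sketch, assuming a hypothetical `handle_request` that leaks into a module-level list: take a snapshot before and after the suspect workload and let `compare_to` rank lines by growth.

```python
import tracemalloc

_cache = []  # hypothetical leak: grows on every request and is never cleared


def handle_request(payload_size=1000):
    _cache.append(bytearray(payload_size))


def top_growth(n_requests=1000, limit=3):
    """Return the source lines whose allocated size grew the most."""
    tracemalloc.start()
    try:
        before = tracemalloc.take_snapshot()
        for _ in range(n_requests):
            handle_request()
        after = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    # StatisticDiff entries, largest absolute size change first.
    return after.compare_to(before, "lineno")[:limit]


if __name__ == "__main__":
    for diff in top_growth():
        print(diff)
```

The top entry should point at the `_cache.append(...)` line. In a real service, take the two snapshots a few minutes apart under steady load; lines that keep growing are the suspects.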

cProfile (built-in, offline)

python -m cProfile -o out.prof myscript.py
python -m pstats out.prof
> sort cumulative
> stats 30

A deterministic (tracing) profiler: it records every function call, so it suits workloads you can re-run offline. Higher overhead than sampling.
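When you only care about one code path, you can drive cProfile from Python instead of wrapping the whole script. A sketch using the stdlib `cProfile.Profile` and `pstats.Stats`; `profile_call` and `slow_sum` are hypothetical helpers.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    return sum(i * i for i in range(n))


def profile_call(func, *args, top=5):
    """Profile a single call and return (result, text report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args)
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


if __name__ == "__main__":
    _, report = profile_call(slow_sum, 1_000_000)
    print(report)
```

Same data as the command-line version, scoped to the call you wrapped.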

SnakeViz for cProfile

pip install snakeviz
python -m cProfile -o out.prof myscript.py
snakeviz out.prof

Interactive icicle and sunburst charts in a browser. Friendlier than pstats.

Async profiling

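py-spy and scalene both work with asyncio. For task-level visibility, turn on asyncio debug mode:

import asyncio

# main() is your entry-point coroutine; setting PYTHONASYNCIODEBUG=1 has the same effect
asyncio.run(main(), debug=True)

Logs slow callbacks (anything over loop.slow_callback_duration, 100 ms by default). Cheap signal for “this coroutine blocked the loop.”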
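Here is a small self-contained demo of that warning, assuming a hypothetical `blocking_handler` coroutine that calls `time.sleep` by mistake. It captures what the `asyncio` logger emits in debug mode so you can see the slow-callback message.

```python
import asyncio
import logging
import time


class ListHandler(logging.Handler):
    """Collects formatted log messages in a list."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


async def blocking_handler():
    time.sleep(0.2)  # blocks the event loop for longer than the 0.1 s threshold


def run_demo():
    handler = ListHandler()
    logger = logging.getLogger("asyncio")
    logger.addHandler(handler)
    try:
        asyncio.run(blocking_handler(), debug=True)
    finally:
        logger.removeHandler(handler)
    return handler.messages


if __name__ == "__main__":
    for message in run_demo():
        print(message)
```

The warning names the task and how long the step took, which is usually enough to find the offending call.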

Workflow

1. Measure (don't guess).
2. Identify the top-3 hot spots.
3. Pick one; optimize.
4. Re-measure. Did it help?
5. Repeat until cost / benefit ratio favors stopping.

Most “optimizations” without measurement are wasted work.

Common bottleneck patterns

1. Sync code in async paths

time.sleep, requests.get, sync DB drivers. py-spy shows the event-loop thread stuck inside the blocking call. Use asyncio.to_thread or async libraries.
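A minimal sketch of the `asyncio.to_thread` route (Python 3.9+), with a hypothetical `blocking_io` standing in for `requests.get` or a sync DB call. Three 0.2 s calls finish in roughly 0.2 s total because each runs in the default thread pool while the loop stays free.

```python
import asyncio
import time


def blocking_io(delay):
    time.sleep(delay)  # stand-in for a sync HTTP or DB call
    return delay


async def fetch_all(delays):
    # Each blocking call is handed to a worker thread; gather awaits them together.
    return await asyncio.gather(
        *(asyncio.to_thread(blocking_io, d) for d in delays)
    )


def run_demo():
    start = time.perf_counter()
    results = asyncio.run(fetch_all([0.2, 0.2, 0.2]))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = run_demo()
    print(results, f"{elapsed:.2f}s")
```

Threads are a stopgap: fine for a handful of calls, but a native async client scales better under high concurrency.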

2. N+1 queries

The profile shows the same DB query issued inside a loop. Refactor to one query (a JOIN or an IN clause).
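A sketch of the before and after using the stdlib `sqlite3` module and a made-up authors/books schema. Both functions return the same mapping; the first issues one query per author, the second issues one query in total.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
        INSERT INTO authors VALUES (1, 'Ada'), (2, 'Grace');
        INSERT INTO books VALUES (1, 1, 'Notes'), (2, 2, 'Compilers'), (3, 1, 'Engines');
        """
    )
    return conn


def titles_n_plus_one(conn):
    # One query for the authors, then one more per author: N+1 round trips.
    result = {}
    for author_id, name in conn.execute("SELECT id, name FROM authors ORDER BY id"):
        rows = conn.execute(
            "SELECT title FROM books WHERE author_id = ? ORDER BY id", (author_id,)
        )
        result[name] = [title for (title,) in rows]
    return result


def titles_single_query(conn):
    # One round trip. The inner join skips authors with no books;
    # use LEFT JOIN if those must appear too.
    query = """
        SELECT a.name, b.title
        FROM authors AS a JOIN books AS b ON b.author_id = a.id
        ORDER BY a.id, b.id
    """
    result = {}
    for name, title in conn.execute(query):
        result.setdefault(name, []).append(title)
    return result


if __name__ == "__main__":
    conn = make_db()
    print(titles_n_plus_one(conn))
    print(titles_single_query(conn))
```

With an in-memory SQLite the difference is invisible; over a network, each extra round trip costs real latency, which is what the profile is showing you.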

3. JSON serialization in hot paths

Stdlib json in a 100k req/sec hot path. Switching to orjson can give a 5–10x serialization speedup, depending on payload shape; measure on your own data.
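One low-risk way to roll this out is a thin wrapper that uses orjson when it is installed and falls back to the stdlib otherwise. A sketch; `dumps_bytes` is a hypothetical helper name. Note that `orjson.dumps` returns `bytes`, not `str`, so the fallback encodes to match.

```python
import json

try:
    import orjson
except ImportError:  # orjson is optional here; fall back to the stdlib
    orjson = None


def dumps_bytes(obj) -> bytes:
    """Serialize obj to compact JSON bytes, using orjson when available."""
    if orjson is not None:
        return orjson.dumps(obj)
    return json.dumps(obj, separators=(",", ":")).encode("utf-8")


if __name__ == "__main__":
    print(dumps_bytes({"user_id": 42, "tags": ["a", "b"]}))
```

The two libraries differ on edge cases (custom types, non-string keys, NaN handling), so treat the fallback as approximate and test with your real payloads.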

4. Pickle / deepcopy

Often unnecessary. If the profile shows time in pickle or copy.deepcopy, replace it with an explicit copy of only the parts you actually mutate.
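A sketch of what “explicit copy” can look like, assuming a hypothetical `Order` dataclass whose `metadata` is large and treated as read-only. `dataclasses.replace` builds a new object and shares every field you do not override.

```python
import copy
from dataclasses import dataclass, field, replace


@dataclass
class Order:
    order_id: int
    items: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)  # large, treated as read-only


def with_item_deepcopy(order, item):
    clone = copy.deepcopy(order)  # walks and copies every nested object
    clone.items.append(item)
    return clone


def with_item_explicit(order, item):
    # Copy only the list being changed; metadata is shared, not duplicated.
    return replace(order, items=[*order.items, item])


if __name__ == "__main__":
    original = Order(1, ["a"], {"source": "web"})
    print(with_item_explicit(original, "b"))
```

The catch is that shared fields must really be read-only. If anything mutates `metadata` later, both objects see the change.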

5. Logging in tight loops

log.info with eager string formatting on every iteration. Pass arguments lazily, and sample or aggregate instead of logging each item.
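A sketch of the noisy version next to a sampled one. The logger name `worker` and both function names are hypothetical. The quiet version logs every `sample_every` items plus one summary line, and uses %-style arguments so the message is only built if the record is emitted.

```python
import logging

log = logging.getLogger("worker")  # hypothetical logger name


def process_noisy(items):
    for item in items:
        # The f-string is formatted on every iteration, even if INFO is disabled.
        log.info(f"processed item {item}")


def process_quiet(items, sample_every=1000):
    processed = 0
    for _ in items:
        processed += 1
        if processed % sample_every == 0:
            # Lazy args: formatting happens only when the record is handled.
            log.info("processed %d items so far", processed)
    log.info("done: %d items", processed)
    return processed


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    process_quiet(range(2500))
```

For 2,500 items that is three log records instead of 2,500.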

Continuous profiling in production

For long-term visibility (not one-off):

  • Pyroscope — continuous profiling, Grafana integration.
  • Datadog Continuous Profiler — SaaS.
  • Granulate (free tier) — open-source agent.
  • Parca — OSS continuous profiling.

Hot paths over time; correlate with deploys; spot regressions before users do.

Microbenchmarks

import timeit

setup = "data = list(range(1000))"
t1 = timeit.timeit("[x*2 for x in data]", setup=setup, number=10000)
t2 = timeit.timeit("list(map(lambda x: x*2, data))", setup=setup, number=10000)
print(t1, t2)

For small “is X faster than Y” questions. Avoid microbenchmark obsession — macro perf usually dominates.
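A single `timeit.timeit` call is noisy. A slightly sturdier sketch uses `timeit.repeat` and takes the minimum, which is the usual recommendation because slower runs mostly reflect interference from other processes. `best_of` is a hypothetical helper name.

```python
import timeit


def best_of(stmt, setup="pass", repeat=5, number=1000):
    """Return the fastest per-call time in seconds across several repeats."""
    timings = timeit.repeat(stmt, setup=setup, repeat=repeat, number=number)
    return min(timings) / number


if __name__ == "__main__":
    setup = "data = list(range(1000))"
    comprehension = best_of("[x*2 for x in data]", setup)
    mapped = best_of("list(map(lambda x: x*2, data))", setup)
    print(f"comprehension: {comprehension * 1e6:.1f} us")
    print(f"map + lambda:  {mapped * 1e6:.1f} us")
```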

pyperf for stable benchmarks

pip install pyperf
pyperf timeit -s "data = list(range(1000))" "[x*2 for x in data]"

Statistical rigor: warmup, multiple runs, variance reporting. Better than timeit for serious benchmarking.

Common mistakes

1. Optimizing without measuring

You “knew” the SQL was slow. It wasn’t; the JSON serialization was. Wasted day.

2. cProfile in production

Deterministic tracing adds roughly 10-30% overhead, and more on call-heavy code. Use py-spy.

3. Micro before macro

Cython-rewriting a function that runs once a minute. Find the hot path first.

4. Ignoring memory

CPU-only profiling misses leaks; the pod restarts mysteriously. Profile both.

5. One-off profiling

Deployed; never profiled again. Continuous profiling catches drift.

What I’d ship today

For Python services:

  • py-spy in production debugging tooling.
  • scalene for one-off deep dives.
  • memray when memory misbehaves.
  • Pyroscope or similar for continuous profiling.
  • orjson for fast JSON in hot paths.
  • Async-aware libraries everywhere; never sync I/O in async code paths.
  • Observability + profiling correlated by trace_id.

Read this next

If you want my profiling cheat sheet + py-spy production setup, it’s at rajpoot.dev.

