You can’t optimize what you don’t measure. Python’s profiling story has improved dramatically in the last few years: production-safe sampling with py-spy, GPU+CPU+memory in one tool with scalene, leak hunting with memray. This post is the working playbook.
py-spy: production sampler
pip install py-spy
# Live top
py-spy top --pid 12345
# Flamegraph
py-spy record -o profile.svg --pid 12345 --duration 30
# Dump (current stack of all threads)
py-spy dump --pid 12345
Zero code changes. Works on running production processes. Sampling: ~5% overhead.
For a deployed service running in a container (py-spy needs the SYS_PTRACE capability to attach):
docker exec -it api-pod py-spy record -o /tmp/profile.svg --pid 1 --duration 30
docker cp api-pod:/tmp/profile.svg ./
Open the SVG in a browser. Hot paths jump out.
scalene: line-level
pip install scalene
scalene myscript.py
# or
python -m scalene myscript.py
Per-line CPU, memory, GPU usage. Distinguishes Python time from native (C extension) time. Output: HTML report.
For an API:
scalene --html --outfile profile.html myapp.py
# load test for a few minutes; ctrl-c
# open profile.html
cProfile reports per function; scalene's per-line view is better for finding which lines actually matter.
memray: memory profiling
pip install memray
memray run -o output.bin myscript.py
memray flamegraph output.bin
Records every allocation. Find:
- Peak memory users.
- Memory leaks (allocations that are never freed).
- Allocation hot paths.
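For long-running processes:
memray attach 12345
Attaches to an already-running process and opens live tracking. Requires gdb or lldb on the host. (memray run --live launches a new script with the live view; it does not take a PID.)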
tracemalloc (built-in)
import tracemalloc
tracemalloc.start()
# ... run ...
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:10]:
    print(stat)
Built-in; less powerful than memray but no install. Good for “where did all my memory go?” smoke tests.
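When hunting growth rather than a one-time peak, comparing two snapshots tends to be more telling than a single one. A minimal sketch, assuming a hypothetical `leaky_cache` list and `handle_request` function standing in for whatever retains memory in your app:

```python
import tracemalloc

leaky_cache = []  # hypothetical stand-in for state that keeps growing


def handle_request(i):
    # Simulates a handler that accidentally retains ~10 KB per call.
    leaky_cache.append("x" * 10_000 + str(i))


tracemalloc.start()
before = tracemalloc.take_snapshot()

for i in range(200):
    handle_request(i)

after = tracemalloc.take_snapshot()
tracemalloc.stop()

# A positive size_diff means that line holds more memory than before.
top_growth = after.compare_to(before, "lineno")[:5]
for stat in top_growth:
    print(stat)
```

The first entry should point at the `append` line, since that is where the retained strings are created.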
cProfile (built-in, offline)
python -m cProfile -o out.prof myscript.py
python -m pstats out.prof
> sort cumulative
> stats 30
For deterministic benchmarks where you can re-run. Higher overhead than sampling.
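If you only want to profile one section of a program rather than the whole script, cProfile can also be driven from code. A rough sketch, where `slow_sum` and `workload` are made-up functions used purely for illustration:

```python
import cProfile
import io
import pstats


def slow_sum(n):
    # Deliberately wasteful: builds a list just to sum it.
    return sum([i * i for i in range(n)])


def workload():
    return [slow_sum(20_000) for _ in range(50)]


profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Render the top entries by cumulative time into a string.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

This is the same data the `python -m pstats` session shows, just scoped to the code between `enable()` and `disable()`.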
SnakeViz for cProfile
pip install snakeviz
python -m cProfile -o out.prof myscript.py
snakeviz out.prof
Interactive icicle and sunburst charts in a browser. Friendlier than pstats.
Async profiling
py-spy and scalene both work with asyncio. For task-level visibility, run the loop in debug mode:
import asyncio
asyncio.run(main(), debug=True)  # main() is your entry-point coroutine; or set PYTHONASYNCIODEBUG=1
Logs slow callbacks (>100ms). Cheap signal for “this coroutine blocked the loop.”
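To see what that signal looks like, here is a small sketch that blocks the loop on purpose and captures the resulting warning from the `asyncio` logger. `blocks_the_loop` and `ListHandler` are illustrative names, not part of any library:

```python
import asyncio
import logging
import time

captured = []


class ListHandler(logging.Handler):
    # Collects formatted log messages so we can inspect them afterwards.
    def emit(self, record):
        captured.append(record.getMessage())


asyncio_logger = logging.getLogger("asyncio")
asyncio_logger.addHandler(ListHandler())
asyncio_logger.setLevel(logging.WARNING)


async def blocks_the_loop():
    time.sleep(0.25)  # sync sleep inside a coroutine: blocks the event loop


async def main():
    await blocks_the_loop()


# debug=True enables slow-callback detection (default threshold: 0.1 s).
asyncio.run(main(), debug=True)

slow = [m for m in captured if "took" in m]
print(slow)
```

The threshold is adjustable through `loop.slow_callback_duration` if 100 ms is too noisy or too lax for your service.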
Workflow
1. Measure (don't guess).
2. Identify the top-3 hot spots.
3. Pick one; optimize.
4. Re-measure. Did it help?
5. Repeat until the cost/benefit ratio favors stopping.
Most “optimizations” without measurement are wasted work.
Common bottleneck patterns
1. Sync code in async paths
time.sleep, requests.get, sync DB drivers. py-spy shows the event loop blocked. Use asyncio.to_thread or async libraries.
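A minimal sketch of the `asyncio.to_thread` fix, assuming a hypothetical `blocking_io` function that stands in for `requests.get` or a sync DB call (requires Python 3.9+):

```python
import asyncio
import time


def blocking_io(delay):
    # Stand-in for requests.get or a sync DB driver call.
    time.sleep(delay)
    return delay


async def main():
    start = time.perf_counter()
    # Each call runs in a worker thread, so the event loop stays free
    # and the three calls overlap instead of running back to back.
    results = await asyncio.gather(
        *(asyncio.to_thread(blocking_io, 0.2) for _ in range(3))
    )
    elapsed = time.perf_counter() - start
    return results, elapsed


results, elapsed = asyncio.run(main())
print(results, round(elapsed, 2))
```

Called directly inside the coroutine, the same three calls would run serially and stall every other task on the loop while they did.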
2. N+1 queries
Profile shows DB query in a loop. Refactor to one query.
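The pattern and its fix can be sketched with an in-memory SQLite database. The `authors`/`books` schema below is invented for illustration; `set_trace_callback` is used only to count the statements each approach issues:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO books VALUES (1, 1, 'A1'), (2, 1, 'A2'), (3, 2, 'B1');
    """
)

queries = []
conn.set_trace_callback(queries.append)  # record every SQL statement executed


def titles_n_plus_one():
    # One query for the authors, then one more query per author.
    out = {}
    authors = conn.execute("SELECT id, name FROM authors ORDER BY id").fetchall()
    for author_id, name in authors:
        rows = conn.execute(
            "SELECT title FROM books WHERE author_id = ? ORDER BY id", (author_id,)
        )
        out[name] = [title for (title,) in rows]
    return out


def titles_single_query():
    # One LEFT JOIN fetches everything; grouping happens in Python.
    out = {}
    rows = conn.execute(
        "SELECT a.name, b.title FROM authors a "
        "LEFT JOIN books b ON b.author_id = a.id ORDER BY a.id, b.id"
    )
    for name, title in rows:
        out.setdefault(name, [])
        if title is not None:
            out[name].append(title)
    return out


queries.clear()
slow_result = titles_n_plus_one()
n_plus_one_count = len(queries)

queries.clear()
fast_result = titles_single_query()
single_count = len(queries)

print(n_plus_one_count, single_count)
```

Both functions return the same mapping; the difference is that the first one's query count grows with the number of authors.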
3. JSON serialization in hot paths
Stdlib json in a 100k req/sec hot path adds up. Switch to orjson; serialization speedups of 5–10x are commonly reported, but measure on your own payloads.
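One low-risk way to adopt it is behind a small wrapper that falls back to the stdlib when orjson is not installed. This is a sketch; `dumps_bytes` is a hypothetical helper name, and note that `orjson.dumps` returns `bytes` rather than `str`:

```python
import json

try:
    import orjson  # third-party; assumed installed via `pip install orjson`
except ImportError:
    orjson = None


def dumps_bytes(obj):
    # orjson.dumps returns bytes; mirror that with the stdlib fallback.
    if orjson is not None:
        return orjson.dumps(obj)
    return json.dumps(obj, separators=(",", ":")).encode("utf-8")


payload = {"id": 1, "tags": ["a", "b"], "score": 9.5}
encoded = dumps_bytes(payload)
print(encoded)
```

Returning bytes in both branches keeps call sites identical, which matters because most web frameworks accept bytes for a response body.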
4. Pickle / deepcopy
Often unnecessary. When the profile shows time in pickle or copy.deepcopy, replace it with an explicit copy of only the data you actually mutate.
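As an illustration, assuming a hypothetical `Order` dataclass where only the `items` list is ever mutated, an explicit copy can share everything else:

```python
import copy
from dataclasses import dataclass, field, replace


@dataclass
class Order:
    order_id: int
    items: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)  # treated as read-only here

    def with_items_copy(self):
        # Copy only the list we intend to mutate; share the rest.
        return replace(self, items=list(self.items))


original = Order(1, ["apple", "pear"], {"region": "eu"})

deep = copy.deepcopy(original)      # walks and copies every nested object
cheap = original.with_items_copy()  # copies just the items list

cheap.items.append("plum")
print(original.items, cheap.items)
```

The trade-off is that shared fields such as `metadata` must genuinely stay read-only; if they do not, the explicit copy needs to cover them too.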
5. Logging in tight loops
log.info with string formatting per iteration. Sample or aggregate.
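A sketch of the sample-or-aggregate idea: log once every N iterations with lazy `%s` arguments, plus one summary line at the end. `LOG_EVERY`, `process`, and `ListHandler` are illustrative names:

```python
import logging

log = logging.getLogger("batch")
log.setLevel(logging.INFO)
log.propagate = False

records = []


class ListHandler(logging.Handler):
    # Collects messages in memory so the example can count them.
    def emit(self, record):
        records.append(record.getMessage())


log.addHandler(ListHandler())

LOG_EVERY = 1000  # hypothetical sampling interval


def process(items):
    total = 0
    for i, item in enumerate(items, start=1):
        total += item
        if i % LOG_EVERY == 0:
            # Lazy %s arguments: no string formatting happens
            # when the INFO level is disabled.
            log.info("processed %s items, running total %s", i, total)
    log.info("done: %s items, total %s", len(items), total)
    return total


result = process(list(range(5000)))
print(len(records), result)
```

Five thousand iterations produce six log records here instead of five thousand.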
Continuous profiling in production
For long-term visibility (not one-off):
- Pyroscope — continuous profiling, Grafana integration.
- Datadog Continuous Profiler — SaaS.
- Granulate (free tier) — open-source agent.
- Parca — OSS continuous profiling.
Hot paths over time; correlate with deploys; spot regressions before users do.
Microbenchmarks
import timeit
setup = "data = list(range(1000))"
t1 = timeit.timeit("[x*2 for x in data]", setup=setup, number=10000)
t2 = timeit.timeit("list(map(lambda x: x*2, data))", setup=setup, number=10000)
print(t1, t2)
For small “is X faster than Y” questions. Avoid microbenchmark obsession — macro perf usually dominates.
pyperf for stable benchmarks
pip install pyperf
pyperf timeit -s "data = list(range(1000))" "[x*2 for x in data]"
Statistical rigor: warmup, multiple runs, variance reporting. Better than timeit for serious benchmarking.
Common mistakes
1. Optimizing without measuring
You “knew” the SQL was slow. It wasn’t; the JSON serialization was. Wasted day.
2. cProfile in production
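Deterministic tracing hooks every function call, so overhead scales with call volume: often 10-30%, and more for call-heavy code. Use py-spy.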
3. Micro before macro
Cython-rewriting a function that runs once a minute. Find the hot path first.
4. Ignoring memory
CPU-only profiling misses leaks; pod restarts mysteriously. Profile both.
5. One-off profiling
Deployed; never profiled again. Continuous profiling catches drift.
What I’d ship today
For Python services:
- py-spy in production debugging tooling.
- scalene for one-off deep dives.
- memray when memory misbehaves.
- Pyroscope or similar for continuous profiling.
- orjson for fast JSON in hot paths.
- Async-aware libraries everywhere; never sync IO in async.
- Observability + profiling correlated by trace_id.