Rust gives you fast by default, not fast guaranteed. Naive Rust beats naive Python easily; tuned Rust can beat naive Rust by another 5-10×. This post is the working playbook for finding and closing that gap.
Always release builds for perf
cargo run # debug; ~10× slower
cargo run --release # use this for any perf measurement
Debug-build benchmarks are meaningless.
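One cheap guard against measuring the wrong build, sketched with std only: check cfg!(debug_assertions) before printing any timing. The names sum_of_squares and is_debug_build are hypothetical, and the check is a heuristic, since a release profile can be configured to keep debug assertions on.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Hypothetical workload, used only for illustration.
fn sum_of_squares(n: u64) -> u64 {
    (0..n).map(|x| x.wrapping_mul(x)).fold(0, u64::wrapping_add)
}

/// True when compiled with debug assertions, which is the default for
/// `cargo run` / `cargo build` without `--release`.
fn is_debug_build() -> bool {
    cfg!(debug_assertions)
}

fn main() {
    if is_debug_build() {
        eprintln!("warning: debug build; timings below are not representative");
    }
    let start = Instant::now();
    let result = sum_of_squares(black_box(1_000_000));
    println!("result = {result}, elapsed = {:?}", start.elapsed());
}
```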
Criterion benchmarks
[dev-dependencies]
criterion = { version = "0.5", features = ["html_reports"] }
[[bench]]
name = "my_bench"
harness = false
// benches/my_bench.rs
use criterion::{black_box, criterion_group, criterion_main, Criterion};
fn bench_parse(c: &mut Criterion) {
    let input = "hello world";
    c.bench_function("parse", |b| {
        b.iter(|| parse(black_box(input)))
    });
}
criterion_group!(benches, bench_parse);
criterion_main!(benches);
cargo bench
Criterion warms up, collects many samples, and reports mean, median, and spread, plus the change against the previous run. Statistical rigor.
black_box prevents the compiler from optimizing the call away.
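To see what black_box is doing without pulling in Criterion, here is a minimal sketch using std::hint::black_box and a plain timer. checksum and time_checksum are hypothetical names, and this hand-rolled loop has none of Criterion's statistics; it only illustrates where the barriers go.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical function under test.
fn checksum(input: &str) -> u32 {
    input
        .bytes()
        .fold(0u32, |acc, b| acc.wrapping_mul(31).wrapping_add(u32::from(b)))
}

/// Times `iters` calls. `black_box` on the input keeps the compiler from
/// treating it as a known constant; `black_box` on the output keeps the
/// result "used" so the call is not discarded as dead code.
fn time_checksum(input: &str, iters: u32) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        black_box(checksum(black_box(input)));
    }
    start.elapsed()
}

fn main() {
    let elapsed = time_checksum("hello world", 1_000_000);
    println!("1M calls took {elapsed:?}");
}
```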
Flamegraphs
cargo install flamegraph
cargo flamegraph --bin myapp -- args
Or with perf directly:
perf record -g target/release/myapp
perf script | inferno-flamegraph > flame.svg
Open the SVG; hot functions are wide. Focus there. Set debug = true under [profile.release] so the frames carry readable function names.
Allocation profiling
[dependencies] # not dev-dependencies: dhat is used from main
dhat = "0.3"
#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;
fn main() {
    let _profiler = dhat::Profiler::new_heap();
    // ... run workload ...
}
Run the program; when the profiler is dropped it writes dhat-heap.json, which you open in the DHAT viewer (dh_view.html). It shows allocation sites grouped by stack trace.
Allocations are often the hidden cost in Rust. dhat finds them.
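For a feel of what an allocation profiler hooks into, here is a rough std-only sketch: a global allocator that wraps the system allocator and counts requests. This is not how dhat is implemented internally; CountingAlloc and allocations_during are hypothetical names, and the count is process-wide, so allocations from other threads are included.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::hint::black_box;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Wraps the system allocator and counts every allocation request.
struct CountingAlloc;

static ALLOCATIONS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATIONS.fetch_add(1, Ordering::Relaxed);
        // Delegate the real work to the system allocator.
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Number of allocations observed while `f` runs.
fn allocations_during<F: FnOnce()>(f: F) -> usize {
    let before = ALLOCATIONS.load(Ordering::Relaxed);
    f();
    ALLOCATIONS.load(Ordering::Relaxed) - before
}

fn main() {
    let n = allocations_during(|| {
        for i in 0..100 {
            // Each format! builds a fresh String on the heap.
            black_box(format!("item: {i}"));
        }
    });
    println!("format! in a loop made {n} allocations");
}
```

A counter like this tells you that a code path allocates; dhat tells you where, which is what you need to fix it.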
Common Rust slowness
1. Cloning everywhere
fn process(s: String) { // takes ownership; caller must clone to keep using the string
    // ...
}
process(my_string.clone()); // BAD: allocates and copies just to make the call
vs
fn process(s: &str) { // borrowed; no clone
    // ...
}
process(&my_string); // free
Default to borrowing; clone only when necessary.
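A small runnable version of the same idea, as a sketch: count_words and into_shouted are hypothetical functions standing in for process. The first only reads, so it borrows; the second actually needs the String, so it takes it by move, and the caller still does not clone.

```rust
/// Borrowing version: reads the text without taking ownership.
fn count_words(s: &str) -> usize {
    s.split_whitespace().count()
}

/// Owning version, for contrast: only worth it when the function
/// really needs to keep or mutate the String.
fn into_shouted(mut s: String) -> String {
    s.make_ascii_uppercase();
    s
}

fn main() {
    let my_string = String::from("profile before you optimize");

    // No clone: `&String` derefs to `&str`, and `my_string` stays usable.
    let n = count_words(&my_string);
    // String literals work too, with no allocation at all.
    let m = count_words("hello world");

    // Ownership is handed over by move; still no clone, because
    // `my_string` is not used afterwards.
    let shouted = into_shouted(my_string);
    println!("{n} {m} {shouted}");
}
```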
2. Allocations in hot paths
for x in items {
    let s = format!("item: {}", x); // allocates per iteration
    // ...
}
Pre-allocate or reuse:
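use std::fmt::Write; // needed so write! can target a String
let mut s = String::with_capacity(64);
for x in items {
    s.clear(); // keeps the allocation, resets the length
    write!(&mut s, "item: {}", x).unwrap();
    // use s
}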
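To check that the reused buffer really stops reallocating, one option is to compare its capacity before and after the loop. A sketch, with format_all as a hypothetical helper and the assumption that every formatted item fits in the initial 64 bytes:

```rust
use std::fmt::Write;

/// Formats each item into one reused buffer. Returns the total number of
/// bytes formatted, plus the buffer capacity before and after the loop.
fn format_all(items: &[u32]) -> (usize, usize, usize) {
    let mut buf = String::with_capacity(64);
    let cap_before = buf.capacity();
    let mut total = 0;
    for x in items {
        buf.clear(); // keeps the allocation, resets the length
        write!(buf, "item: {x}").expect("formatting a u32 into a String does not fail");
        total += buf.len();
    }
    (total, cap_before, buf.capacity())
}

fn main() {
    let (total, before, after) = format_all(&[1, 22, 333]);
    println!("{total} bytes formatted; capacity {before} -> {after}");
}
```

If the capacity is unchanged, the loop ran on the single up-front allocation.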
3. Vec growth
let mut v = Vec::new();
for x in 0..10_000 {
    v.push(x); // reallocates as it grows
}
vs
let mut v = Vec::with_capacity(10_000);
When you know the size: preallocate.
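The difference is easy to observe by watching the capacity change while pushing. A sketch, where count_regrowths is a hypothetical helper and a capacity change is used as a proxy for a reallocation:

```rust
/// Pushes `n` items and counts how often the Vec's capacity changed.
fn count_regrowths(mut v: Vec<u32>, n: u32) -> usize {
    let mut regrowths = 0;
    let mut cap = v.capacity();
    for x in 0..n {
        v.push(x);
        if v.capacity() != cap {
            regrowths += 1;
            cap = v.capacity();
        }
    }
    regrowths
}

fn main() {
    let grown = count_regrowths(Vec::new(), 10_000);
    let preallocated = count_regrowths(Vec::with_capacity(10_000), 10_000);
    println!("Vec::new(): {grown} regrowths; with_capacity: {preallocated}");
}
```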
4. Box<dyn Trait> in hot paths
Dynamic dispatch costs an indirect call per invocation and usually blocks inlining. Static dispatch via generics is faster:
// Slower
fn process(items: &[Box<dyn Process>]) { ... }
// Faster (when type is known)
fn process<T: Process>(items: &[T]) { ... }
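Trade-off: generics are monomorphized per type (bigger binary, longer compile) and need a slice of one concrete type; Box<dyn Process> allows mixed types.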
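Both forms side by side, as a runnable sketch. The Process trait name comes from the snippet above; its process method, the Doubler type, and the run_dyn / run_generic functions are made up for illustration.

```rust
trait Process {
    fn process(&self, x: u64) -> u64;
}

struct Doubler;

impl Process for Doubler {
    fn process(&self, x: u64) -> u64 {
        x * 2
    }
}

/// Dynamic dispatch: every call goes through the trait object's vtable.
fn run_dyn(items: &[Box<dyn Process>], x: u64) -> u64 {
    items.iter().map(|p| p.process(x)).sum()
}

/// Static dispatch: compiled once per concrete `T`, so calls can be inlined.
fn run_generic<T: Process>(items: &[T], x: u64) -> u64 {
    items.iter().map(|p| p.process(x)).sum()
}

fn main() {
    let boxed: Vec<Box<dyn Process>> = vec![Box::new(Doubler), Box::new(Doubler)];
    let plain = [Doubler, Doubler];
    println!("{} {}", run_dyn(&boxed, 21), run_generic(&plain, 21));
}
```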
5. Sync vs async overhead
Async is great for IO; for CPU work it adds overhead. For pure CPU: use threads (rayon).
Rayon for parallel CPU work
use rayon::prelude::*;
let total: i64 = items.par_iter().map(|x| heavy_compute(x)).sum();
Drop-in parallel iteration. Near-linear speedup on N cores for embarrassingly parallel work, as long as each item does enough work to outweigh the scheduling overhead.
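For intuition about what the parallel iterator saves you from writing, here is a rough std-only equivalent using scoped threads instead of rayon. It is a sketch, not rayon's algorithm: rayon uses work stealing, while this just splits the slice into fixed chunks. heavy_compute is a hypothetical stand-in for the real workload and parallel_sum is a made-up helper.

```rust
use std::thread;

/// Hypothetical CPU-heavy function; the modulo keeps results small so the
/// final sum cannot overflow.
fn heavy_compute(x: &i64) -> i64 {
    (0..1_000).fold(*x, |acc, i| acc.wrapping_mul(31).wrapping_add(i) % 1_000_003)
}

/// Splits `items` into roughly one chunk per worker and sums each chunk
/// on its own thread.
fn parallel_sum(items: &[i64], workers: usize) -> i64 {
    let workers = workers.max(1);
    let chunk_len = ((items.len() + workers - 1) / workers).max(1);
    thread::scope(|s| {
        // Spawn all workers first so they run concurrently, then join.
        let handles: Vec<_> = items
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || chunk.iter().map(heavy_compute).sum::<i64>()))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .sum()
    })
}

fn main() {
    let items: Vec<i64> = (1..=10_000).collect();
    let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    println!("total = {}", parallel_sum(&items, workers));
}
```

With rayon the whole of parallel_sum collapses into the one par_iter line, and uneven items get balanced across threads for you.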
Tokio for async IO
let results = futures::future::join_all(urls.iter().map(|u| client.get(u).send())).await; // assuming client is a reqwest::Client
For IO-bound: scales to thousands of concurrent ops.
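Don’t mix carelessly: CPU-heavy rayon work called directly from a tokio task blocks the executor thread, so hand it off with tokio::task::spawn_blocking. Blocking on async code from inside a rayon worker stalls the pool. Pick by workload.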
SIMD
For numeric crunching, portable SIMD via std::simd (nightly) or wide / simba:
use wide::f32x8;
let a = f32x8::from([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]);
let b = f32x8::from([1.0; 8]);
let c = a + b; // 8 floats, one instruction
8-wide ops on AVX. Up to 4-8× speedup for numerical loops that are compute-bound and not already auto-vectorized. See also ndarray, nalgebra.
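Before reaching for explicit SIMD types, it is often worth shaping the loop so the compiler can vectorize it on its own. A std-only sketch, with add_in_place as a hypothetical helper; whether LLVM actually emits vector instructions depends on the target and flags, so check the assembly or a benchmark.

```rust
/// Adds `b` to `a` element-wise, processing fixed chunks of 8 lanes.
/// The fixed-width inner loop is a shape the optimizer can often vectorize.
fn add_in_place(a: &mut [f32], b: &[f32]) {
    assert_eq!(a.len(), b.len(), "slices must have equal length");
    let mut a_chunks = a.chunks_exact_mut(8);
    let mut b_chunks = b.chunks_exact(8);
    for (ca, cb) in (&mut a_chunks).zip(&mut b_chunks) {
        for lane in 0..8 {
            ca[lane] += cb[lane];
        }
    }
    // Scalar tail for the elements that did not fill a whole chunk.
    for (x, y) in a_chunks.into_remainder().iter_mut().zip(b_chunks.remainder()) {
        *x += *y;
    }
}

fn main() {
    let mut a: Vec<f32> = (1..=10).map(|x| x as f32).collect();
    let b = vec![1.0f32; 10];
    add_in_place(&mut a, &b);
    println!("{a:?}");
}
```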
Compiler hints
#[inline]
fn hot_function() { ... }
#[inline(always)]
fn really_hot() { ... }
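Helps small functions get inlined across crate boundaries: without LTO, non-generic functions generally need #[inline] to be candidates for cross-crate inlining (generic functions already are). Don’t slap it on every function; it bloats the binary and compile times.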
For likely / unlikely:
if std::intrinsics::likely(common) {
    fast_path()
} else {
    slow_path()
}
Nightly only; on stable, marking the slow path #[cold] is the usual substitute. Use sparingly.
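On stable, a similar hint can be expressed by moving the rare branch into a function marked #[cold]. A minimal sketch; parse_digit and report_invalid are hypothetical names, and the attribute is only a hint, so measure before relying on it.

```rust
/// Rarely taken error path. `#[cold]` hints that calls to this function are
/// unlikely; `#[inline(never)]` keeps its body out of the hot caller.
#[cold]
#[inline(never)]
fn report_invalid(byte: u8) -> Option<u32> {
    eprintln!("invalid digit byte: {byte:#04x}");
    None
}

/// Hot path: the common case stays small and branch-friendly.
fn parse_digit(byte: u8) -> Option<u32> {
    if byte.is_ascii_digit() {
        Some(u32::from(byte - b'0'))
    } else {
        report_invalid(byte)
    }
}

fn main() {
    println!("{:?} {:?}", parse_digit(b'7'), parse_digit(b'x'));
}
```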
Profile-guided optimization
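RUSTFLAGS="-Cprofile-generate=/tmp/pgo" cargo build --release
# Run workload to gather profile (writes .profraw files into /tmp/pgo)
target/release/myapp ...
# Merge the raw profiles; llvm-profdata comes with the llvm-tools-preview rustup component
llvm-profdata merge -o /tmp/pgo/merged.profdata /tmp/pgo
RUSTFLAGS="-Cprofile-use=/tmp/pgo/merged.profdata" cargo build --release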
Often a 5-15% speedup on top of regular release builds, provided the profiling workload resembles production. Worth it for production binaries.
LTO
[profile.release]
lto = "fat" # or "thin"
codegen-units = 1 # better optimization, slower compile
Typically 5-10% faster, at the cost of longer build times. Standard for shipping binaries.
Realistic benchmarking
- Realistic workload, not synthetic.
- Production-shaped data.
- Warm caches before measuring.
- Multiple runs; report variance.
- Compare against the previous version, not against ideal.
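The last two bullets can be sketched as a pair of small helpers: summarize several runs as mean and sample standard deviation, then express the new version as a relative change against the old one. mean_and_stddev and relative_change are hypothetical names, and the durations in main are placeholder values, not measurements.

```rust
use std::time::Duration;

/// Mean and sample standard deviation of run times, in seconds.
/// Returns None for fewer than two samples, where spread is undefined.
fn mean_and_stddev(samples: &[Duration]) -> Option<(f64, f64)> {
    if samples.len() < 2 {
        return None;
    }
    let n = samples.len() as f64;
    let secs: Vec<f64> = samples.iter().map(Duration::as_secs_f64).collect();
    let mean = secs.iter().sum::<f64>() / n;
    let variance = secs.iter().map(|s| (s - mean).powi(2)).sum::<f64>() / (n - 1.0);
    Some((mean, variance.sqrt()))
}

/// Relative change of `current` against `baseline`; negative means faster.
fn relative_change(baseline: f64, current: f64) -> f64 {
    (current - baseline) / baseline
}

fn main() {
    // Placeholder timings for illustration only.
    let previous = [Duration::from_millis(200), Duration::from_millis(210)];
    let current = [Duration::from_millis(190), Duration::from_millis(195)];
    let (prev_mean, prev_sd) = mean_and_stddev(&previous).expect("need two samples");
    let (cur_mean, cur_sd) = mean_and_stddev(&current).expect("need two samples");
    println!("previous: {prev_mean:.4}s ± {prev_sd:.4}s");
    println!("current:  {cur_mean:.4}s ± {cur_sd:.4}s");
    println!("change:   {:+.1}%", relative_change(prev_mean, cur_mean) * 100.0);
}
```

If the change is smaller than the spread, treat it as noise and gather more runs.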
Common mistakes
1. Optimizing without profiling
You “know” what’s slow. You’re wrong. Profile.
2. Debug-build benchmarks
Useless; report meaningless numbers. Always --release.
3. Microbenchmarks dominating
Optimizing a 1ms function called once per request. The 200ms DB query is the bottleneck.
4. Cloning to “make it work”
my_data.clone() to bypass borrow checker. Sometimes ok; in hot path: bad.
5. Async for everything
Async overhead matters for functions that are already fast synchronously. A long CPU loop inside an async task never yields, so it starves the other tasks on that executor thread.
What I’d ship today
For perf-sensitive Rust:
- Criterion for benchmarks; CI runs them.
- Flamegraph for hot path analysis.
- dhat for allocation analysis.
- Rayon for parallel CPU.
- Tokio for IO concurrency.
- LTO + PGO for production builds.
- Profile-then-optimize discipline.
Read this next
If you want my Rust perf playbook (criterion + flamegraph + PGO), it’s at rajpoot.dev.