Rust makes fast the default, but it doesn’t guarantee it. Naive Rust usually beats naive Python easily; tuned Rust can beat naive Rust by another 5-10×, depending on the workload. This post is the working playbook for closing that gap.

Always release builds for perf

cargo run            # debug build: unoptimized, often 10× or more slower
cargo run --release  # use this for any perf measurement

Debug-build benchmarks are meaningless.

Criterion benchmarks

[dev-dependencies]
criterion = { version = "0.5", features = ["html_reports"] }

[[bench]]
name = "my_bench"
harness = false

// benches/my_bench.rs
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_parse(c: &mut Criterion) {
    let input = "hello world";
    c.bench_function("parse", |b| {
        b.iter(|| parse(black_box(input)))
    });
}

criterion_group!(benches, bench_parse);
criterion_main!(benches);

cargo bench

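Criterion warms up, collects many samples, and reports the mean, median, and spread with confidence intervals. It also compares each run against the previous one and flags statistically significant changes.

black_box keeps the compiler from constant-folding the input or optimizing the call away.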
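The benchmark above calls a `parse` function that the post never defines. As a stand-in so the bench file has something to measure, here is a purely hypothetical `parse` that counts whitespace-separated tokens; swap in whatever you actually want to benchmark.

```rust
/// Hypothetical stand-in for the `parse` function used in the benchmark.
/// It counts whitespace-separated tokens and returns the count.
pub fn parse(input: &str) -> usize {
    input.split_whitespace().count()
}
```

Returning a value matters here: a function whose result is never observed is exactly the kind of call the optimizer is allowed to delete.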

Flamegraphs

cargo install flamegraph
cargo flamegraph --bin myapp -- args

Or with perf directly:

perf record -g target/release/myapp
perf script | inferno-collapse-perf | inferno-flamegraph > flame.svg

Open the SVG; wide frames are where the time goes. Focus there. Set debug = true under [profile.release] so the frames carry readable function names.

Allocation profiling

[dependencies]
dhat = "0.3"  # a regular dependency, because main.rs uses it

#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;

fn main() {
    let _profiler = dhat::Profiler::new_heap();
    // ... run workload ...
}

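Run the program; when the profiler is dropped it writes dhat-heap.json, which you open in DHAT’s viewer (dh_view.html). It shows each allocation site with its stack trace, total bytes, and allocation count.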

Allocations are often the hidden cost in Rust. dhat finds them.

Common Rust slowness

1. Cloning everywhere

fn process(s: String) {  // takes ownership; caller must clone to keep its copy
    // ...
}
process(my_string.clone());  // BAD: allocates and copies on every call

vs

fn process(s: &str) {  // borrowed; no clone
    // ...
}
process(&my_string);  // free

Default to borrowing; clone only when necessary.
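One way to express "clone only when necessary" in a signature is `std::borrow::Cow`. The sketch below uses a hypothetical `normalize_tabs` helper (not from the post) that returns the borrowed input untouched and allocates only when it actually has to change something.

```rust
use std::borrow::Cow;

/// Hypothetical helper: replaces tabs with spaces.
/// Allocates a new String only if the input contains a tab.
fn normalize_tabs(input: &str) -> Cow<'_, str> {
    if input.contains('\t') {
        Cow::Owned(input.replace('\t', " "))
    } else {
        Cow::Borrowed(input) // no allocation, no copy
    }
}
```

Callers can treat the result as a `&str` through deref and call `into_owned()` only if they really need a `String`.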

2. Allocations in hot paths

for x in items {
    let s = format!("item: {}", x);  // allocates per iteration
    // ...
}

Pre-allocate or reuse:

use std::fmt::Write;  // needed for write! on a String

let mut s = String::with_capacity(64);
for x in items {
    s.clear();
    write!(&mut s, "item: {}", x).unwrap();
    // use s
}
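A compilable version of the buffer-reuse pattern, as a minimal sketch. The `total_label_len` function and its workload are made up for illustration; the point is that `clear()` keeps the allocation, so the loop formats into the same buffer each time.

```rust
use std::fmt::Write;

/// Hypothetical workload: total byte length of all rendered labels,
/// formatting into one reused buffer instead of a fresh String per item.
fn total_label_len(items: &[u32]) -> usize {
    let mut buf = String::with_capacity(64);
    let mut total = 0;
    for x in items {
        buf.clear(); // drops the contents, keeps the allocation
        write!(buf, "item: {}", x).expect("writing to a String cannot fail");
        total += buf.len();
    }
    total
}
```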

3. Vec growth

let mut v = Vec::new();
for x in 0..10_000 {
    v.push(x);  // reallocates as it grows
}

vs

let mut v = Vec::with_capacity(10_000);

When you know the size: preallocate.
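To see the difference rather than take it on faith, you can count capacity changes as a rough proxy for reallocations. This is a small sketch with a hypothetical `count_reallocations` helper; the exact growth pattern of `Vec` is an implementation detail, so only the zero-versus-nonzero contrast is meaningful.

```rust
/// Pushes `n` integers and counts how often the Vec's capacity changes,
/// which is a proxy for how often it had to reallocate.
fn count_reallocations(n: usize, preallocate: bool) -> usize {
    let mut v: Vec<usize> = if preallocate {
        Vec::with_capacity(n)
    } else {
        Vec::new()
    };
    let mut reallocations = 0;
    let mut last_capacity = v.capacity();
    for x in 0..n {
        v.push(x);
        if v.capacity() != last_capacity {
            reallocations += 1;
            last_capacity = v.capacity();
        }
    }
    reallocations
}
```

When the values come from an iterator with a known length, `collect()` can usually size the `Vec` up front for you.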

4. Box<dyn Trait> in hot paths

Dynamic dispatch costs an indirect call per item and blocks inlining. Static dispatch via generics is usually faster:

// Slower
fn process(items: &[Box<dyn Process>]) { ... }

// Faster (when type is known)
fn process<T: Process>(items: &[T]) { ... }

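Trade-off: generics are monomorphized per type, so you pay in code size and compile time, and the slice must hold a single concrete type.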
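Here are the two signatures filled in so they compile side by side. The `Process` trait body and the `Double` type are assumptions made up for this sketch; the post only shows the function signatures.

```rust
/// Hypothetical trait standing in for the post's `Process`.
trait Process {
    fn run(&self) -> u64;
}

struct Double(u64);

impl Process for Double {
    fn run(&self) -> u64 {
        self.0 * 2
    }
}

/// Dynamic dispatch: each call goes through a vtable pointer.
fn process_dyn(items: &[Box<dyn Process>]) -> u64 {
    items.iter().map(|item| item.run()).sum()
}

/// Static dispatch: monomorphized per `T`, so `run` can be inlined.
fn process_static<T: Process>(items: &[T]) -> u64 {
    items.iter().map(|item| item.run()).sum()
}
```

Both return the same result; the difference only shows up in a profile, so measure before converting a `dyn` API to generics.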

5. Sync vs async overhead

Async is great for IO; for CPU work it adds overhead. For pure CPU: use threads (rayon).

Rayon for parallel CPU work

use rayon::prelude::*;

let total: i64 = items.par_iter().map(|x| heavy_compute(x)).sum();

Drop-in parallel iteration. Near-linear speedup on N cores for embarrassingly parallel work, as long as each item does enough work to amortize the scheduling cost.
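To make the "embarrassingly parallel" idea concrete without pulling in a dependency, here is a std-only sketch of the same sum using scoped threads. This is not how rayon works internally (rayon uses a work-stealing pool); it is a simplified fixed-chunk version, and `heavy_compute` is a made-up stand-in for the post's function.

```rust
use std::thread;

/// Hypothetical CPU-heavy function standing in for the post's `heavy_compute`:
/// counts how many integers in 1..=1000 divide `x`.
fn heavy_compute(x: &i64) -> i64 {
    (1..=1_000i64).filter(|&d| *x % d == 0).count() as i64
}

/// Splits `items` into roughly one chunk per worker and sums the chunks in parallel.
fn parallel_sum(items: &[i64], workers: usize) -> i64 {
    let chunk_len = items.len().div_ceil(workers.max(1)).max(1);
    thread::scope(|scope| {
        let handles: Vec<_> = items
            .chunks(chunk_len)
            .map(|chunk| scope.spawn(move || chunk.iter().map(heavy_compute).sum::<i64>()))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .sum::<i64>()
    })
}
```

Fixed chunks only balance well when items cost about the same; uneven workloads are where rayon's work stealing earns its keep.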

Tokio for async IO

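// Assuming a reqwest-style client: get() only builds the request, send() returns the future.
let results = futures::future::join_all(urls.iter().map(|u| client.get(u).send())).await;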

For IO-bound: scales to thousands of concurrent ops.

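Don’t mix carelessly: calling blocking rayon work directly on a tokio worker thread stalls the event loop, so hand CPU-heavy work off with tokio::task::spawn_blocking or a channel. Don’t drive tokio futures from inside a rayon worker either. Pick by workload.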

SIMD

For numeric crunching, portable SIMD via std::simd (nightly) or wide / simba:

use wide::f32x8;

let a = f32x8::from([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]);
let b = f32x8::from([1.0; 8]);
let c = a + b;  // 8 floats, one instruction

8-wide ops on AVX. Expect up to 4-8× on compute-bound numerical loops; memory-bound loops gain less. See also ndarray, nalgebra.
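Before reaching for explicit SIMD, it is often worth shaping the loop so the compiler can vectorize it on its own. The sketch below is an assumption-laden illustration, not a guarantee: it keeps 8 independent accumulators (mirroring `f32x8`) so there is no serial dependency between lanes, but whether SIMD instructions are actually emitted depends on the target and optimization level.

```rust
/// Sums a slice using 8 independent accumulators so the compiler is free
/// to map them onto one SIMD register. Auto-vectorization is not guaranteed.
fn sum_lanes(values: &[f32]) -> f32 {
    const LANES: usize = 8;
    let mut acc = [0.0f32; LANES];
    let chunks = values.chunks_exact(LANES);
    let tail = chunks.remainder(); // leftover elements that do not fill a chunk
    for chunk in chunks {
        for lane in 0..LANES {
            acc[lane] += chunk[lane];
        }
    }
    acc.iter().sum::<f32>() + tail.iter().sum::<f32>()
}
```

Note that this changes the order of floating-point additions compared with a plain left-to-right sum, so results can differ in the last bits for general inputs.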

Compiler hints

#[inline]
fn hot_function() { ... }

#[inline(always)]
fn really_hot() { ... }

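Helps small functions get inlined across crate boundaries. Without LTO, a non-generic function generally needs #[inline] to be inlined from another crate; generic functions are already eligible. Don’t slap it on every function: it bloats the binary and slows compilation.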

For likely / unlikely:

if std::intrinsics::likely(common) {
    fast_path()
} else {
    slow_path()
}

Nightly only (std::intrinsics is unstable); use sparingly. On stable, marking the slow path #[cold] gives the compiler a similar hint.
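On stable Rust, one way to get a similar effect to likely / unlikely is to move the rare branch into its own function and mark it `#[cold]`. A minimal sketch, with hypothetical `digit_value` and `report_invalid` functions invented for the example:

```rust
/// Rarely taken error path. `#[cold]` tells the optimizer this call is unlikely;
/// `#[inline(never)]` keeps its body out of the hot caller.
#[cold]
#[inline(never)]
fn report_invalid(byte: u8) -> Option<u32> {
    eprintln!("invalid digit byte: {byte:#04x}");
    None
}

/// Hypothetical hot function: converts an ASCII digit to its numeric value.
#[inline]
fn digit_value(byte: u8) -> Option<u32> {
    if byte.is_ascii_digit() {
        Some(u32::from(byte - b'0')) // common case stays small and inlinable
    } else {
        report_invalid(byte)
    }
}
```

Like every hint in this section, it is only worth keeping if a benchmark shows a difference.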

Profile-guided optimization

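RUSTFLAGS="-Cprofile-generate=/tmp/pgo" cargo build --release
# Run workload to gather profile (writes .profraw files into /tmp/pgo)
target/release/myapp ...
# Merge the raw profiles; llvm-profdata comes with the llvm-tools-preview rustup component
llvm-profdata merge -o /tmp/pgo/merged.profdata /tmp/pgo
RUSTFLAGS="-Cprofile-use=/tmp/pgo/merged.profdata" cargo build --release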

Typically a 5-15% speedup on top of a regular release build, provided the training workload resembles production. Worth it for production binaries.

LTO

[profile.release]
lto = "fat"            # or "thin"
codegen-units = 1      # better optimization, slower compile

Often 5-10% faster, at the cost of longer build times. Standard for shipping binaries.

Realistic benchmarking

  • Realistic workload, not synthetic.
  • Production-shaped data.
  • Warm caches before measuring.
  • Multiple runs; report variance.
  • Compare against the previous version, not against ideal.
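For whole-program or end-to-end measurements where Criterion does not fit, the checklist above can be approximated with a few lines of std. This is a rough sketch, not a replacement for a real harness; `measure` and `mean_and_std_dev` are hypothetical helpers written for this example.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Mean and sample standard deviation of a set of measurements.
/// Needs at least two samples, otherwise the deviation is not defined.
fn mean_and_std_dev(samples: &[f64]) -> (f64, f64) {
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    let variance = samples.iter().map(|s| (s - mean).powi(2)).sum::<f64>() / (n - 1.0);
    (mean, variance.sqrt())
}

/// Runs `work` a few times to warm caches, then times `runs` more executions
/// and returns the per-run durations in microseconds.
fn measure<T>(mut work: impl FnMut() -> T, warmup: usize, runs: usize) -> Vec<f64> {
    for _ in 0..warmup {
        black_box(work());
    }
    (0..runs)
        .map(|_| {
            let start = Instant::now();
            black_box(work());
            start.elapsed().as_secs_f64() * 1e6
        })
        .collect()
}
```

Feed the output of `measure` into `mean_and_std_dev` and report both numbers; a mean without its spread says little about whether a change is real.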

Common mistakes

1. Optimizing without profiling

You “know” what’s slow. You’re wrong. Profile.

2. Debug-build benchmarks

Useless; report meaningless numbers. Always --release.

3. Microbenchmarks dominating

Optimizing a 1ms function called once per request while the 200ms DB query is the real bottleneck.

4. Cloning to “make it work”

Calling my_data.clone() to get past the borrow checker. Sometimes OK; in a hot path it’s an allocation and a copy on every call.

5. Async for everything

Async overhead matters for functions that would return quickly anyway. Long CPU loops inside async tasks hog the executor thread and starve other tasks.

What I’d ship today

For perf-sensitive Rust:

  • Criterion for benchmarks; CI runs them.
  • Flamegraph for hot path analysis.
  • dhat for allocation analysis.
  • Rayon for parallel CPU.
  • Tokio for IO concurrency.
  • LTO + PGO for production builds.
  • Profile-then-optimize discipline.
