Rust Performance Profiling: Why Your Fast Code Is Lying to You
Rust gives you control over memory, zero-cost abstractions, and a compiler that feels like it's on your side. So why does your service still have p99 spikes at 400ms? The honest answer is that writing safe, correct Rust and writing fast Rust are two different skills — and most people never cross that gap because they're optimizing what feels slow instead of what actually is.
The Benchmark Is Not the Code You Think It Is
Before touching a profiler, most developers write a quick benchmark, see a good number, and move on. The problem: LLVM is smarter than both of us. If your benchmark doesn't produce a side effect the compiler can observe, it will happily delete the computation entirely, hand you a 0-nanosecond result, and you'll ship "optimized" code with zero actual changes. This is not a gotcha — it's the default behavior, and it bites everyone eventually.
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_parse(c: &mut Criterion) {
    // Initialize data outside the loop
    let raw_bytes = vec![0u8; 1024];
    c.bench_function("parse_packet", |b| {
        b.iter(|| {
            // black_box on both input and output to prevent LLVM from deleting the call
            black_box(parse_packet(black_box(&raw_bytes)))
        })
    });
}

criterion_group!(benches, bench_parse);
criterion_main!(benches);
What black_box Actually Does — and Why It's Not Magic
black_box is a compiler fence, nothing more. It's a hint to LLVM that says, "assume you know nothing about this value." It stops the optimizer from deleting your code entirely (dead-code elimination) or pre-calculating the result (constant folding). But don't get comfortable: black_box doesn't turn your function into a sacred, untouchable block. If your code is trivial, LLVM will still inline it and gut the internals to fit the surrounding context. You're buying honesty at the input boundary, but everything past that is still fair game for the compiler's scalpel. A benchmark can be technically correct and still be total garbage if it doesn't mirror how the function actually behaves in the wild.
Flamegraphs Tell You Where Time Goes — Not Why
A flamegraph is the first thing you reach for when something feels slow. It's a sampling profiler output: every horizontal slice is a stack frame, every width is proportional to how often it showed up in samples. The insight is visual and immediate — you see the plateau of a wide frame and know that's where CPU cycles are bleeding. What flamegraphs don't tell you is why that frame is wide: is it doing real work, waiting on memory, or spinning on a lock? That distinction matters enormously for what you do next.
# Record with perf, generate flamegraph
cargo install flamegraph
sudo cargo flamegraph --bin my_service -- --workers 4
# Or with perf directly for more control
perf record -F 999 -g ./target/release/my_service
perf script | stackcollapse-perf.pl | flamegraph.pl > out.svg
Sampling Frequency and the Heisenberg Problem
The -F 999 flag sets sampling to ~1000 Hz — one snapshot per millisecond. Too low and you miss short hot paths. Too high and the profiler itself starts affecting the results: you're measuring the observer as much as the code. For most services, an odd frequency like 997 or 999 Hz is a practical sweet spot — it avoids sampling in lockstep with periodic work in your application. More importantly: run your profiler under realistic load, not idle. A flamegraph taken while your service handles 10 req/s looks nothing like one at 10,000 req/s — cache behavior, branch prediction, and lock contention all shift. Cold-start profiling is one of the most common ways engineers waste an afternoon chasing the wrong function.
Memory Profiling: The Allocator Is Not Your Friend
Here's a thing nobody tells you early enough: in Rust, you can write perfectly safe, zero-unsafe code that allocates like crazy and you'll never see it in a flamegraph. Sampling profilers catch CPU time. Heap fragmentation and allocation pressure show up as vague slowness — slightly worse throughput, tail latency that creeps up under load, p99 that's fine at 100 RPS and ugly at 1000. You look at the flamegraph, shrug, and blame the network.
# Cargo.toml
[dependencies]
dhat = "0.3"

// main.rs
#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;

fn main() {
    let _profiler = dhat::Profiler::new_heap();
    run_server();
}
Why dhat Shows You What perf Can't
dhat instruments every allocation and deallocation, tracking where memory comes from and where it dies. Unlike perf, it doesn't sample — it counts. You get exact numbers: how many bytes were allocated in a given call path, peak heap usage, how many short-lived allocations were created and immediately dropped. That last one is the killer. A tight loop that allocates a Vec on every iteration and drops it a microsecond later is invisible to a flamegraph but absolutely murders your allocator throughput. dhat makes it obvious.
Monomorphization: When Generics Become a Performance Tax
Rust generics are compiled via monomorphization — for every concrete type you instantiate a generic function with, the compiler generates a separate copy of that function. This is how Rust achieves zero-cost abstractions: no virtual dispatch, no runtime type erasure. But zero-cost doesn't mean free. Each monomorphized copy is real machine code. If you have a generic serializer called with twelve different types across your codebase, you have twelve copies of that function sitting in your binary — and probably none of them are in L1 cache at the same time.
// This generates N copies — one per concrete type T
fn serialize<T: Serialize>(val: &T) -> Vec<u8> { ... }

// This generates one copy — dynamic dispatch, one vtable hop
fn serialize_dyn(val: &dyn Serialize) -> Vec<u8> { ... }

// Hybrid: thin monomorphized wrapper over a dyn core
fn serialize<T: Serialize>(val: &T) -> Vec<u8> {
    serialize_inner(val as &dyn Serialize)
}
The Cache Miss You Didn't See Coming
The real cost of excessive monomorphization isn't compile time — it's instruction cache pressure at runtime. When your hot path jumps between twelve slightly different versions of the same function, the CPU's i-cache fills up fast and branch prediction goes sideways. You won't see this as a single obvious hotspot in a flamegraph. You'll see it as everything being slightly slower than it should be, across the board. The hybrid pattern — a thin monomorphized wrapper delegating to a dyn core — is one of those genuinely useful tricks that feels wrong until you measure it and realize it's faster for the workloads that matter.
Async Rust and the Profiler Blind Spot
Async Rust is wonderful until you try to profile it. Traditional sampling profilers see threads. Tokio runs thousands of tasks on a handful of threads — so what you get in perf is a wall of executor internals with your actual application logic buried somewhere inside. You can spend an hour staring at tokio::runtime::task::harness frames and learn absolutely nothing useful about why your latency spiked.
# Cargo.toml
[dependencies]
console-subscriber = "0.2"

# console-subscriber also needs tokio's unstable instrumentation:
# build with RUSTFLAGS="--cfg tokio_unstable"

// main.rs
fn main() {
    console_subscriber::init();
    tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        .build()
        .unwrap()
        .block_on(run());
}
tokio-console Thinks in Tasks, Not Threads
tokio-console is a different kind of tool — it's not a sampling profiler, it's a runtime inspector. It shows you individual tasks: how long they've been alive, how often they're polled, whether they're stuck waiting on something. The thing that makes it genuinely useful is seeing tasks that are alive but never making progress — tasks that get polled, do nothing, and yield repeatedly. That's usually a sign of a poorly implemented future, a channel that's always empty, or a timeout that fires too aggressively. perf would show you busy threads. tokio-console shows you idle tasks pretending to be busy.
PGO: The Optimization Nobody Uses But Everyone Should
Profile-Guided Optimization has been in LLVM for years. The idea is straightforward: compile your binary, run it under real load to collect execution data, then recompile using that data to guide optimization decisions. The compiler learns which branches are hot, which functions get called constantly, where inlining actually pays off. Sounds obvious. And yet almost nobody does it — partly because the workflow is annoying, partly because people assume the compiler already does its best. It doesn't. Not without data.
# Step 1: instrument build
RUSTFLAGS="-Cprofile-generate=/tmp/pgo-data" \
cargo build --release
# Step 2: run under real load
./target/release/my_service --bench-mode
# Step 3: merge profiles
llvm-profdata merge -o /tmp/pgo-data/merged.profdata \
/tmp/pgo-data/*.profraw
# Step 4: optimized build
RUSTFLAGS="-Cprofile-use=/tmp/pgo-data/merged.profdata" \
cargo build --release
What PGO Actually Changes Under the Hood
Without profile data, LLVM makes conservative guesses: it inlines based on function size heuristics, lays out code assuming branches are roughly equally likely, and makes call-site decisions without knowing which paths are hot. With real execution data, it gets specific. Cold code gets pushed out of the hot path. Frequently called small functions get inlined aggressively even if they look too big by heuristic. Branch layout changes so the CPU's branch predictor is right more often. On real-world workloads — parsers, servers, anything with complex branching — PGO typically gets you 10–20% throughput improvement essentially for free. The catch: your profiling load needs to resemble production. PGO optimized for the wrong workload can make things marginally worse.
LTO and the Final Frontier of Inlining
Most Rust projects compile with LTO disabled. That means each crate is optimized in isolation — LLVM sees one translation unit at a time and has no idea what's happening across crate boundaries. So a hot function in your utility crate that gets called constantly from your main crate? Unless it's generic or marked #[inline], it's never inlined — the compiler literally cannot see both sides at once. Thin LTO fixes this with a reasonable compile-time cost. Full LTO fixes it more aggressively with a significant compile-time cost. For release builds that go to production, the tradeoff is almost always worth it.
# Cargo.toml — start with thin, measure, then decide
[profile.release]
lto = "thin" # cross-crate inlining, reasonable build time
codegen-units = 1 # single LLVM module, better optimization
opt-level = 3 # default for release, explicit is cleaner
# Full LTO if you have the CI budget
# lto = true
codegen-units = 1 and Why It Matters More Than You Think
By default, Rust splits compilation into multiple codegen units to parallelize LLVM work — faster builds, worse optimization. With codegen-units = 1, the entire crate goes through LLVM as a single module. That means more inlining opportunities, better dead-code elimination, and smarter register allocation across function boundaries. Combined with thin LTO, this is often the highest-leverage single change you can make to a release profile. It won't save a bad algorithm, but if your code is already reasonable and you're chasing that last 15% — this is where it lives.
The Uncomfortable Truth About Performance Work
After all of this — the flamegraphs, the dhat runs, the PGO cycles — the most important thing is still the one everyone skips: measure first, change second. Not because it sounds wise, but because performance intuition in Rust is genuinely unreliable. The compiler does things you don't expect. The hardware does things the compiler doesn't expect. A change that looks like an obvious win sometimes isn't, and a change that feels like a minor cleanup sometimes drops your p99 by 40%. The only way to know is to have numbers before and numbers after. Everything else is just a theory.