Rust Performance Profiling: Why Your Fast Code Is Lying to You

Rust gives you control over memory, zero-cost abstractions, and a compiler that feels like it’s on your side. So why does your service still have p99 spikes at 400ms? The honest answer is that writing safe, correct Rust and writing fast Rust are two different skills — and most people never cross that gap because they’re optimizing what feels slow instead of what actually is.

The Benchmark Is Not the Code You Think It Is

Before touching a profiler, most developers write a quick benchmark, see a good number, and move on. The problem: LLVM is smarter than both of us. If your benchmark doesn’t produce a side effect the compiler can observe, it will happily delete the computation entirely, hand you a 0-nanosecond result, and you’ll ship “optimized” code with zero actual changes. This is not a gotcha — it’s the default behavior, and it bites everyone eventually.

use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_parse(c: &mut Criterion) {
    // Initialize data outside the measured loop
    let raw_bytes = vec![0u8; 1024];

    c.bench_function("parse_packet", |b| {
        b.iter(|| {
            // black_box on both input and output to prevent LLVM from deleting the call
            black_box(parse_packet(black_box(&raw_bytes)))
        })
    });
}

criterion_group!(benches, bench_parse);
criterion_main!(benches);

What black_box Actually Does — and Why It’s Not Magic

black_box is a compiler fence, nothing more. It’s a sign for LLVM that says, “Assume you know nothing about this value.” It stops the optimizer from deleting your code entirely (dead-code elimination) or pre-calculating the result (constant folding). But don’t get comfortable: black_box doesn’t turn your function into a sacred, untouchable block. If your code is trivial, LLVM will still inline it and gut the internals to fit the surrounding context. You’re buying honesty at the input boundary, but everything past that is still fair game for the compiler’s scalpel. A benchmark can be technically “correct” and still be total garbage if it doesn’t mirror how the function actually behaves in the wild.
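A minimal sketch of the effect, using std::hint::black_box (criterion's black_box serves the same purpose). sum_of_squares here is a made-up stand-in for real work — with known inputs and an unobserved result, LLVM can fold the whole loop away; black_box on the input and the output keeps the call alive:

```rust
use std::hint::black_box;

// A trivial function LLVM can constant-fold when its input is known.
fn sum_of_squares(n: u64) -> u64 {
    (1..=n).map(|i| i * i).sum()
}

fn main() {
    let mut acc = 0u64;
    for _ in 0..1_000 {
        // black_box(10) hides the input from the optimizer;
        // the outer black_box marks the result as observed,
        // so the call cannot be deleted or precomputed.
        acc = acc.wrapping_add(black_box(sum_of_squares(black_box(10))));
    }
    assert_eq!(acc, 385 * 1_000);
}
```

Even here, LLVM is free to inline sum_of_squares and optimize its body — only the boundary values are protected.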

Flamegraphs Tell You Where Time Goes — Not Why

A flamegraph is the first thing you reach for when something feels slow. It’s a sampling profiler output: every horizontal slice is a stack frame, every width is proportional to how often it showed up in samples. The insight is visual and immediate — you see the plateau of a wide frame and know that’s where CPU cycles are bleeding. What flamegraphs don’t tell you is why that frame is wide: is it doing real work, waiting on memory, or spinning on a lock? That distinction matters enormously for what you do next.

# Record with perf, generate flamegraph
cargo install flamegraph
sudo cargo flamegraph --bin my_service -- --workers 4

# Or with perf directly for more control
perf record -F 999 -g ./target/release/my_service
perf script | stackcollapse-perf.pl | flamegraph.pl > out.svg

Sampling Frequency and the Heisenberg Problem

The -F 999 flag sets sampling to ~1000 Hz — one snapshot per millisecond. Too low and you miss short hot paths. Too high and the profiler itself starts affecting the results: you’re measuring the observer as much as the code. For most services, 997–999 Hz is a practical sweet spot — an odd frequency also avoids sampling in lockstep with work scheduled on round-number timer ticks. More importantly: run your profiler under realistic load, not idle. A flamegraph taken while your service handles 10 req/s looks nothing like one at 10,000 req/s — cache behavior, branch prediction, and lock contention all shift. Cold-start profiling is one of the most common ways engineers waste an afternoon chasing the wrong function.



Memory Profiling: The Allocator Is Not Your Friend

Here’s a thing nobody tells you early enough: in Rust, you can write perfectly safe, zero-unsafe code that allocates like crazy and you’ll never see it in a flamegraph. Sampling profilers catch CPU time. Heap fragmentation and allocation pressure show up as vague slowness — slightly worse throughput, tail latency that creeps up under load, p99 that’s fine at 100 RPS and ugly at 1000. You look at the flamegraph, shrug, and blame the network.

# Cargo.toml
[dependencies]
dhat = "0.3"

#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;

fn main() {
    // Writes dhat-heap.json when the profiler is dropped at exit;
    // open it in Valgrind's DHAT viewer (dh_view.html)
    let _profiler = dhat::Profiler::new_heap();
    run_server();
}

Why dhat Shows You What perf Can’t

dhat instruments every allocation and deallocation, tracking where memory comes from and where it dies. Unlike perf, it doesn’t sample — it counts. You get exact numbers: how many bytes were allocated in a given call path, peak heap usage, how many short-lived allocations were created and immediately dropped. That last one is the killer. A tight loop that allocates a Vec on every iteration and drops it a microsecond later is invisible to a flamegraph but absolutely murders your allocator throughput. dhat makes it obvious.
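A sketch of the kind of fix a dhat report usually points at — the function names here are hypothetical, but the pattern is the classic one: hoist the allocation out of the loop and reuse the buffer.

```rust
// Allocating a fresh Vec per iteration: invisible in a CPU flamegraph,
// but every pass hits the allocator twice (alloc + drop).
fn process_allocating(inputs: &[u32]) -> u64 {
    let mut total = 0u64;
    for &x in inputs {
        let scratch: Vec<u32> = (0..x).collect();
        total += scratch.iter().map(|&v| v as u64).sum::<u64>();
    }
    total
}

// Same logic, one buffer reused across iterations.
fn process_reusing(inputs: &[u32]) -> u64 {
    let mut scratch: Vec<u32> = Vec::new();
    let mut total = 0u64;
    for &x in inputs {
        scratch.clear(); // keeps capacity — no reallocation
        scratch.extend(0..x);
        total += scratch.iter().map(|&v| v as u64).sum::<u64>();
    }
    total
}

fn main() {
    let inputs = [3, 5, 8];
    assert_eq!(process_allocating(&inputs), process_reusing(&inputs));
}
```

clear() keeps the capacity, so after the first few iterations the reusing version stops touching the allocator entirely — exactly the short-lived-allocation count dhat would flag in the first version.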

Monomorphization: When Generics Become a Performance Tax

Rust generics are compiled via monomorphization — for every concrete type you instantiate a generic function with, the compiler generates a separate copy of that function. This is how Rust achieves zero-cost abstractions: no virtual dispatch, no runtime type erasure. But “zero-cost” doesn’t mean “free.” Each monomorphized copy is real machine code. If you have a generic serializer called with twelve different types across your codebase, you have twelve copies of that function sitting in your binary — and probably none of them are in L1 cache at the same time.

// This generates N copies — one per concrete type T
fn serialize<T: Serialize>(val: &T) -> Vec<u8> { /* ... */ }

// This generates one copy — dynamic dispatch, one vtable hop
fn serialize_dyn(val: &dyn Serialize) -> Vec<u8> { /* ... */ }

// Hybrid: thin monomorphized wrapper over a dyn core
fn serialize<T: Serialize>(val: &T) -> Vec<u8> {
    serialize_dyn(val as &dyn Serialize)
}

The Cache Miss You Didn’t See Coming

The real cost of excessive monomorphization isn’t compile time — it’s instruction cache pressure at runtime. When your hot path jumps between twelve slightly different versions of the same function, the CPU’s i-cache fills up fast and branch prediction goes sideways. You won’t see this as a single obvious hotspot in a flamegraph. You’ll see it as everything being slightly slower than it should be, across the board. The hybrid pattern — a thin monomorphized wrapper delegating to a dyn core — is one of those genuinely useful tricks that feels wrong until you measure it and realize it’s faster for the workloads that matter.
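A compilable sketch of the hybrid pattern. Note that serde's Serialize trait is not object-safe, so this uses a hypothetical object-safe Encode trait to keep the example self-contained — the shape of the trick is the same:

```rust
trait Encode {
    fn encode(&self, out: &mut Vec<u8>);
}

impl Encode for u32 {
    fn encode(&self, out: &mut Vec<u8>) {
        out.extend_from_slice(&self.to_le_bytes());
    }
}

impl Encode for &str {
    fn encode(&self, out: &mut Vec<u8>) {
        out.extend_from_slice(self.as_bytes());
    }
}

// One copy of the real logic, shared by every caller.
fn encode_dyn(val: &dyn Encode) -> Vec<u8> {
    let mut out = Vec::new();
    val.encode(&mut out);
    out
}

// Thin generic shim: still monomorphized per type, but each copy is
// just an unsizing coercion plus a call — a few instructions, not a
// full duplicated function body competing for i-cache.
fn encode<T: Encode>(val: &T) -> Vec<u8> {
    encode_dyn(val)
}

fn main() {
    assert_eq!(encode(&7u32), vec![7, 0, 0, 0]);
    assert_eq!(encode(&"hi"), b"hi".to_vec());
}
```

Callers keep the ergonomic generic API; the binary carries one copy of the heavy code.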


Async Rust and the Profiler Blind Spot

Async Rust is wonderful until you try to profile it. Traditional sampling profilers see threads. Tokio runs thousands of tasks on a handful of threads — so what you get in perf is a wall of executor internals with your actual application logic buried somewhere inside. You can spend an hour staring at tokio::runtime::task::harness frames and learn absolutely nothing useful about why your latency spiked.

# Cargo.toml
[dependencies]
console-subscriber = "0.2"

// main.rs
fn main() {
    console_subscriber::init();
    tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        .build()
        .unwrap()
        .block_on(run());
}

tokio-console Thinks in Tasks, Not Threads

tokio-console is a different kind of tool — it’s not a sampling profiler, it’s a runtime inspector. It shows you individual tasks: how long they’ve been alive, how often they’re polled, whether they’re stuck waiting on something. The thing that makes it genuinely useful is seeing tasks that are alive but never making progress — tasks that get polled, do nothing, and yield repeatedly. That’s usually a sign of a poorly implemented future, a channel that’s always empty, or a timeout that fires too aggressively. perf would show you busy threads. tokio-console shows you idle tasks pretending to be busy.
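Getting data into the console takes one extra build flag — console-subscriber only has task data to export when tokio is compiled with the tokio_unstable cfg. The commands below are build/tooling configuration, not application code:

```shell
# Build the app with the tokio_unstable cfg, or the runtime
# won't emit the instrumentation console-subscriber needs
RUSTFLAGS="--cfg tokio_unstable" cargo build --release

# Install and launch the console UI; by default it connects to
# console-subscriber's default endpoint (http://127.0.0.1:6669)
cargo install tokio-console
tokio-console
```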


PGO: The Optimization Nobody Uses But Everyone Should

Profile-Guided Optimization has been in LLVM for years. The idea is straightforward: compile your binary, run it under real load to collect execution data, then recompile using that data to guide optimization decisions. The compiler learns which branches are hot, which functions get called constantly, where inlining actually pays off. Sounds obvious. And yet almost nobody does it — partly because the workflow is annoying, partly because people assume the compiler already does its best. It doesn’t. Not without data.

# Note: llvm-profdata must match the LLVM version rustc uses;
# `rustup component add llvm-tools` ships a matching copy.

# Step 1: instrumented build
RUSTFLAGS="-Cprofile-generate=/tmp/pgo-data" \
    cargo build --release

# Step 2: run under real load — .profraw files are written
# when the instrumented binary exits cleanly
./target/release/my_service --bench-mode

# Step 3: merge profiles
llvm-profdata merge -o /tmp/pgo-data/merged.profdata \
    /tmp/pgo-data/*.profraw

# Step 4: optimized build using the merged profile
RUSTFLAGS="-Cprofile-use=/tmp/pgo-data/merged.profdata" \
    cargo build --release

What PGO Actually Changes Under the Hood

Without profile data, LLVM makes conservative guesses: it inlines based on function size heuristics, lays out code assuming branches are roughly equally likely, and makes call-site decisions without knowing which paths are hot. With real execution data, it gets specific. Cold code gets pushed out of the hot path. Frequently-called small functions get inlined aggressively even if they look “too big” by heuristic. Branch layout changes so the CPU’s branch predictor is right more often. On real-world workloads — parsers, servers, anything with complex branching — PGO typically gets you 10–20% throughput improvement essentially for free. The catch: your profiling load needs to resemble production. PGO optimized for the wrong workload can make things marginally worse.


LTO and the Final Frontier of Inlining

Most Rust projects compile with LTO disabled. That means each crate is optimized in isolation — LLVM sees one translation unit at a time and has no idea what’s happening across crate boundaries. So a hot function in your utility crate that gets called constantly from your main crate? Never inlined. The compiler literally cannot see both sides at once. Thin LTO fixes this with a reasonable compile-time cost. Full LTO fixes it more aggressively with a significant compile-time cost. For release builds that go to production, the tradeoff is almost always worth it.

# Cargo.toml — start with thin, measure, then decide
[profile.release]
lto = "thin"        # cross-crate inlining, reasonable build time
codegen-units = 1   # single LLVM module, better optimization
opt-level = 3       # default for release, explicit is cleaner

# Full LTO if you have the CI budget
# lto = true

codegen-units = 1 and Why It Matters More Than You Think

By default, Rust splits compilation into multiple codegen units to parallelize LLVM work — faster builds, worse optimization. With codegen-units = 1, the entire crate goes through LLVM as a single module. That means more inlining opportunities, better dead-code elimination, and smarter register allocation across function boundaries. Combined with thin LTO, this is often the highest-leverage single change you can make to a release profile. It won’t save a bad algorithm, but if your code is already reasonable and you’re chasing that last 15% — this is where it lives.

The Uncomfortable Truth About Performance Work

After all of this — the flamegraphs, the dhat runs, the PGO cycles — the most important thing is still the one everyone skips: measure first, change second. Not because it sounds wise, but because performance intuition in Rust is genuinely unreliable. The compiler does things you don’t expect. The hardware does things the compiler doesn’t expect. A change that looks like an obvious win sometimes isn’t, and a change that feels like a minor cleanup sometimes drops your p99 by 40%. The only way to know is to have numbers before and numbers after. Everything else is just a theory.
