Rust Coroutines and the Abstraction Tax Your Profiler Won't Show You
The async/await syntax landed in Rust in 2019 and immediately became the default answer to concurrent I/O. It was the right call for ecosystem growth — and a deliberate compromise for systems-level engineers who needed something the standard model quietly refuses to give: full, auditable control over where execution pauses, what state gets saved, and who pays for the context transition. That cost exists. It's just hidden behind generated code you didn't write and can't easily inspect.
This isn't a tutorial. It's a dissection.
Waker overhead analysis
Every Future in Rust's async model gets polled. When it's not ready, it registers a Waker — a handle the executor uses to reschedule the task. Sounds clean. The problem is how that waker is constructed and passed. Under the hood, RawWaker is a fat pointer: a data pointer plus a vtable pointer. That vtable carries four function pointers — clone, wake, wake_by_ref, drop — and every use of the waker is an indirect call through one of them. For a futures chain three levels deep, you're chasing pointers through memory that may not be in cache.
This is vtable bloat in practice, not in theory.
The deeper issue is implicit state propagation. When an async function awaits, the compiler snapshots the live locals into an anonymous struct — a generated machine you never see. That struct gets heap-allocated (in most executors, at spawn time) and accessed through a Box<dyn Future> or an equivalent type-erased handle. Opaque type erasure is the mechanism; indirection is the cost. You lose static dispatch, you lose inlining opportunities, and the optimizer has less to work with. The zero-cost abstraction guarantee only holds if the compiler can see through the abstraction — and here it often can't.
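The snapshotting is easy to measure on a host build: a local that is live across an .await must be stored inside the generated future, and std::mem::size_of_val reports it directly. A minimal sketch (function names here are illustrative, not from the original):

```rust
use std::mem::size_of_val;

// A trivial suspension point.
async fn tick() {}

// `buf` is used after the .await, so the compiler must keep all
// 1024 bytes alive inside the generated state machine.
async fn with_buffer() -> u8 {
    let buf = [0u8; 1024];
    tick().await;
    buf[0]
}

// `buf` is consumed before the .await; it never enters the saved state.
async fn without_buffer() -> u8 {
    let buf = [0u8; 1024];
    let first = buf[0];
    drop(buf);
    tick().await;
    first
}

fn main() {
    // Constructing a future does not run it, so its size is inspectable.
    println!("{}", size_of_val(&with_buffer()));    // at least 1024
    println!("{}", size_of_val(&without_buffer())); // a handful of bytes
}
```

The size difference is the abstraction tax made visible: one await-spanning buffer inflates the state machine by a kilobyte.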
// What the runtime actually juggles per wakeup
struct RawWaker {
    data: *const (),
    vtable: &'static RawWakerVTable,
}

struct RawWakerVTable {
    clone: unsafe fn(*const ()) -> RawWaker,
    wake: unsafe fn(*const ()),
    wake_by_ref: unsafe fn(*const ()),
    drop: unsafe fn(*const ()),
}

// Four fn pointers. Four potential cache misses.
// Per. Task. Per. Poll.
Static dispatch optimization
The fix isn't to avoid wakers — it's to know when the dynamic dispatch is actually necessary. In a single-threaded embedded executor with a fixed task count, every future type is known at compile time. You can build a waker that's a no-op or a direct index into a task array. No vtable chasing. No heap. No indirection. The standard machinery assumes a general-purpose runtime; when your runtime isn't general-purpose, you're carrying weight that buys you nothing.
Manual Future implementation in Rust
The async keyword is a code generator. It takes what looks like linear code and emits a state machine enum — one variant per suspension point. That's all it does. The compiler isn't magic; it's a pattern you can replicate by hand, with full visibility into every byte of generated state. Moving away from async means you write that enum yourself. You decide what gets saved. You decide the memory layout.
Here's where the standard abstraction breaks down: the generated state machine includes everything live in scope at each await point — whether you need it after resumption or not. The compiler is conservative. It can save state you won't touch again, because proving liveness across suspension points is hard. Write it manually, and you save exactly what the next state needs. That's the difference between a 48-byte state struct and a 200-byte one on a microcontroller with 256KB of RAM.
use core::task::{Context, Poll};
use core::pin::Pin;
use core::future::Future;

enum ReadFuture {
    Init,
    Waiting { buf: [u8; 64], pos: usize },
    Done,
}

impl Future for ReadFuture {
    type Output = usize;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        match *self {
            ReadFuture::Init => {
                *self = ReadFuture::Waiting { buf: [0u8; 64], pos: 0 };
                cx.waker().wake_by_ref();
                Poll::Pending
            }
            ReadFuture::Waiting { buf: _, pos } => {
                // check a hardware register here, not a runtime queue
                *self = ReadFuture::Done;
                Poll::Ready(pos)
            }
            ReadFuture::Done => panic!("polled after completion"),
        }
    }
}
Zero-allocation concurrency
This code doesn't allocate. The state lives where you put it — stack, static, memory-mapped region. No executor owns it through a Box. The poll contract is the same as the standard library expects, so it slots into any executor that speaks Future. What changed is that you control the state transitions explicitly, and the compiler has no room to insert phantom saves. That's zero-allocation concurrency without a runtime policy enforcing it — just architecture.
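ReadFuture can in fact be driven with no runtime at all: pin it on the stack and busy-poll it with a do-nothing waker. A host-runnable sketch under that assumption (it restates ReadFuture so the block stands alone, with an explicit transition to Done; `block_on_read` is an illustrative name):

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Same state machine as above, restated so this sketch is self-contained.
enum ReadFuture {
    Init,
    Waiting { buf: [u8; 64], pos: usize },
    Done,
}

impl Future for ReadFuture {
    type Output = usize;
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<usize> {
        match *self {
            ReadFuture::Init => {
                *self = ReadFuture::Waiting { buf: [0u8; 64], pos: 0 };
                cx.waker().wake_by_ref();
                Poll::Pending
            }
            ReadFuture::Waiting { buf: _, pos } => {
                *self = ReadFuture::Done;
                Poll::Ready(pos)
            }
            ReadFuture::Done => panic!("polled after completion"),
        }
    }
}

// A waker that does nothing: sufficient for a busy-poll loop.
fn noop_waker() -> Waker {
    fn clone(p: *const ()) -> RawWaker { RawWaker::new(p, &VT) }
    fn nop(_: *const ()) {}
    static VT: RawWakerVTable = RawWakerVTable::new(clone, nop, nop, nop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VT)) }
}

// Drive the future to completion; returns (poll count, output).
fn block_on_read() -> (u32, usize) {
    let mut fut = ReadFuture::Init; // lives on the stack — no Box
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    let mut polls = 0;
    loop {
        polls += 1;
        if let Poll::Ready(n) = Pin::new(&mut fut).poll(&mut cx) {
            return (polls, n);
        }
    }
}

fn main() {
    let (polls, n) = block_on_read();
    println!("{polls} polls, read {n} bytes"); // 2 polls
}
```

Pin::new works here because the state machine holds no self-references and is therefore Unpin — another property you can see at a glance when you wrote the enum yourself.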
Stackless state machine logic
A stackful coroutine saves the entire call stack on suspension — registers, return addresses, local frames. That's 4KB to 8KB minimum per coroutine on most platforms. Stackless means you save only what the state machine enum carries. The instruction pointer equivalent is the enum variant itself: Init means start here, Waiting means resume from this branch. There's no hidden stack frame. There's no dedicated stack segment. The memory layout is exactly the size of the largest variant, aligned to its strictest field.
On bare metal, this distinction is the difference between supporting 500 concurrent tasks and supporting 5.
// Memory layout: compiler picks largest variant + alignment padding
enum SensorPoll {
    Idle,                      // no payload
    Reading { register: u32,   // 4 bytes
              retries: u8 },   // 1 byte + padding
    Faulted { code: u32 },     // 4 bytes
}
// size_of::<SensorPoll>() == 8 bytes total, not 4KB
// No stack allocation. No guard pages. No context switch overhead.
Memory layout alignment
Saving register states without a stack isn't a limitation — it's a constraint that forces honest design. If your coroutine needs 12 local variables across a suspension point, your state enum tells you that explicitly. There's no hiding it behind an opaque generated struct. The enum is the contract: every field in every variant is something the system agreed to keep alive across a yield. When that enum gets large, it's a design signal, not a compiler artifact. You fix the architecture, not the annotation.
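The layout claims above are checkable directly with core::mem on a host build. A quick sketch, restating SensorPoll (exact numbers assume the default Rust representation on a typical 32/64-bit target):

```rust
use std::mem::{align_of, size_of};

// Restates the SensorPoll enum from above so the sketch stands alone.
#[allow(dead_code)]
enum SensorPoll {
    Idle,
    Reading { register: u32, retries: u8 },
    Faulted { code: u32 },
}

fn main() {
    // Tag + largest payload (u32 + u8), rounded up to u32 alignment.
    println!("size  = {}", size_of::<SensorPoll>());  // 8
    println!("align = {}", align_of::<SensorPoll>()); // 4
}
```

Putting assertions like these in a unit test turns the "design signal" into an enforced budget: the build breaks the moment a variant grows past what the system agreed to keep alive.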
No-std async executor
Tokio is a production-grade runtime built for servers. It has a thread pool, a work-stealing scheduler, a timer wheel, and a reactor that wraps epoll or kqueue. That's roughly 50,000 lines of infrastructure you drag into your binary the moment you add it as a dependency. On a bare-metal STM32 with no OS, no allocator, and 128KB of flash, that's not a tradeoff — it's a non-starter.
The minimal executor isn't a stripped-down Tokio. It's a different animal entirely.
What you actually need is a poll loop, a fixed task list, and a waker implementation that doesn't require heap allocation. The entire executor can fit in under 60 lines of Rust. No trait objects for the task list — you know your task types at compile time, so you use an array of concrete state machines. No dynamic waker registration — when a task signals readiness, it writes to a bitmask. The poll loop reads the bitmask, iterates over ready tasks, calls poll directly. That's it. That's the runtime.
#![no_std]
#![no_main]

use core::sync::atomic::{AtomicU32, Ordering};
use core::task::{RawWaker, RawWakerVTable};

static READY_MASK: AtomicU32 = AtomicU32::new(0);

unsafe fn mask_clone(p: *const ()) -> RawWaker {
    RawWaker::new(p, &MASK_VTABLE)
}

unsafe fn mask_wake(p: *const ()) {
    // The data "pointer" smuggles the task index, not an address.
    let idx = p as usize;
    READY_MASK.fetch_or(1 << idx, Ordering::Release);
}

unsafe fn mask_drop(_: *const ()) {}

static MASK_VTABLE: RawWakerVTable =
    RawWakerVTable::new(mask_clone, mask_wake, mask_wake, mask_drop);
Instruction pointer
The waker here carries the task index as a raw pointer — an abuse of the API that the contract technically permits. On wake, it sets a bit in a static atomic bitmask. The poll loop checks that mask each iteration. No heap. No Arc. No cross-thread synchronization beyond a single atomic read-modify-write. The instruction pointer for each task is the enum variant it left in — that's all the resume address you need when there's no real stack to restore.
This works because the contract between executor and future is narrow: call poll, get Ready or Pending, reschedule if Pending. Everything else — priority queues, fairness, timer integration — is policy layered on top of that contract. Strip the policy and the mechanism is trivially small. Most embedded executors that claim to be lightweight are still carrying that policy weight. A real no-std executor discards it entirely and rebuilds only what the hardware demands.
The cost you pay is expressiveness. You can't dynamically spawn tasks. Your task count is fixed at compile time. Priorities are whatever order you poll the bitmask bits. For a sensor fusion loop running on a Cortex-M4, those aren't bugs — they're features. Determinism and auditability matter more than flexibility when the system has no recovery path.
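To show how little remains once policy is stripped, here is a host-runnable std sketch of the whole executor — the bitmask-waker idea restated so it can run off-target, with a toy two-poll task standing in for real hardware state machines (Tick and run are illustrative names, not from the original):

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicU32, Ordering};
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

static READY_MASK: AtomicU32 = AtomicU32::new(0);

unsafe fn mask_clone(p: *const ()) -> RawWaker { RawWaker::new(p, &VTABLE) }
unsafe fn mask_wake(p: *const ()) {
    // The data "pointer" carries the task index; wake = set that bit.
    READY_MASK.fetch_or(1 << (p as usize), Ordering::Release);
}
unsafe fn mask_drop(_: *const ()) {}
static VTABLE: RawWakerVTable =
    RawWakerVTable::new(mask_clone, mask_wake, mask_wake, mask_drop);

fn waker_for(task_idx: usize) -> Waker {
    unsafe { Waker::from_raw(RawWaker::new(task_idx as *const (), &VTABLE)) }
}

// Toy task: Pending once (re-arming itself), then Ready.
enum Tick { Pending, Done }

impl Future for Tick {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        match *self {
            Tick::Pending => {
                *self = Tick::Done;
                cx.waker().wake_by_ref(); // ask to be polled once more
                Poll::Pending
            }
            Tick::Done => Poll::Ready(()),
        }
    }
}

// The entire runtime: read the mask, poll whoever is ready, repeat.
fn run(tasks: &mut [Tick]) -> u32 {
    let mut live = (1u32 << tasks.len()) - 1;  // bitmask of unfinished tasks
    READY_MASK.store(live, Ordering::Release); // everything starts ready
    let mut iterations = 0;
    while live != 0 {
        iterations += 1;
        let ready = READY_MASK.swap(0, Ordering::Acquire);
        for i in 0..tasks.len() {
            if ready & live & (1 << i) != 0 {
                let waker = waker_for(i);
                let mut cx = Context::from_waker(&waker);
                if Pin::new(&mut tasks[i]).poll(&mut cx).is_ready() {
                    live &= !(1 << i); // never poll a finished task again
                }
            }
        }
    }
    iterations
}

fn main() {
    let mut tasks = [Tick::Pending, Tick::Pending, Tick::Pending];
    println!("{}", run(&mut tasks)); // completes in 2 scheduler iterations
}
```

On real hardware, the std atomics become core atomics, the task array becomes your concrete driver state machines, and the loop body sits inside main's infinite loop with a wfi between iterations.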
Custom suspension points
The await keyword hardcodes where a function yields. The compiler inserts the suspension point; you get no say in the conditions under which it fires. That sounds like a minor complaint until you're writing a DMA transfer routine where you need to yield only after confirming the descriptor ring advanced — not after the I/O request was submitted, not after an interrupt fired, but after a specific hardware register bit transitions from 0 to 1 under a memory barrier.
Standard async gives you no hook for that. You wrap it in a custom Future anyway, which means you're already writing manual poll logic — but now you're also carrying the async machinery around it for no reason.
// start_dma, dma_complete, dma_byte_count, current_tick: HAL calls
enum DmaTransfer {
    Armed { descriptor: u32 },
    Polling { descriptor: u32, deadline: u32 },
    Complete { bytes: usize },
    Faulted,
}

impl Future for DmaTransfer {
    type Output = Result<usize, ()>;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        match *self {
            DmaTransfer::Armed { descriptor } => {
                start_dma(descriptor);
                *self = DmaTransfer::Polling {
                    descriptor,
                    deadline: current_tick() + 1000,
                };
                cx.waker().wake_by_ref();
                Poll::Pending
            }
            DmaTransfer::Polling { descriptor, deadline } => {
                if dma_complete(descriptor) {
                    let n = dma_byte_count(descriptor);
                    *self = DmaTransfer::Complete { bytes: n };
                    Poll::Ready(Ok(n))
                } else if current_tick() > deadline {
                    *self = DmaTransfer::Faulted;
                    Poll::Ready(Err(()))
                } else {
                    cx.waker().wake_by_ref();
                    Poll::Pending
                }
            }
            DmaTransfer::Complete { bytes } => Poll::Ready(Ok(bytes)),
            DmaTransfer::Faulted => Poll::Ready(Err(())),
        }
    }
}
Explicit control over suspension
This is explicit control over suspension. The state machine encodes the hardware protocol directly: arm the DMA, poll the completion bit, enforce a deadline, fault cleanly on timeout. Each variant is a documented checkpoint in the transfer lifecycle. There's no implicit state propagation — no hidden variable saved across an await point that you forgot was there. When the system auditor asks what this task is doing at cycle 47,000, you point to the enum variant. That's the answer.
By moving away from generated code, we gain state determinism. In safety-critical systems, being able to map every possible byte of memory to a known hardware state isn't just a nice-to-have — it's a certification requirement. Manual state machines transform magic async logic into a traceable execution graph.
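One way to make that traceability concrete is a trace hook that maps each variant to a stable checkpoint ID — something a logic analyzer or audit log can record. A hypothetical sketch (the checkpoint numbering is illustrative, not from the original):

```rust
// Restates DmaTransfer from above so the sketch stands alone.
#[allow(dead_code)]
enum DmaTransfer {
    Armed { descriptor: u32 },
    Polling { descriptor: u32, deadline: u32 },
    Complete { bytes: usize },
    Faulted,
}

impl DmaTransfer {
    // Stable ID per variant: the value an audit trail records at each yield.
    fn checkpoint(&self) -> u8 {
        match self {
            DmaTransfer::Armed { .. } => 0,
            DmaTransfer::Polling { .. } => 1,
            DmaTransfer::Complete { .. } => 2,
            DmaTransfer::Faulted => 3,
        }
    }
}

fn main() {
    let t = DmaTransfer::Polling { descriptor: 0x4000_0000, deadline: 1_000 };
    println!("checkpoint {}", t.checkpoint()); // prints "checkpoint 1"
}
```

Because every suspension point is an enum variant you wrote, the mapping from memory state to lifecycle checkpoint is total — there is no anonymous compiler-generated state the table can't name.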
Custom suspension points change how you think about driver architecture. Instead of writing a function that blocks until an operation completes, you write a state machine that advances when the hardware is ready. The driver becomes a description of valid hardware states and the transitions between them — which is what it always should have been. The OS kernel scheduler, the interrupt controller, and the task executor all speak the same language: poll, yield, resume. No magic. No hidden runtime. Just state and transitions.
This approach effectively eliminates the async-sync impedance mismatch. When the driver is a state machine, it doesn't care whether it's being polled by a high-level executor or a simple interrupt handler. You've decoupled the logic from the execution strategy, achieving truly runtime-agnostic code.
The honesty of manual implementation
The compiler lies to you about how cheap async is at the systems boundary. The vtable is there. The implicit saves are there. The opaque type erasure is there. None of it disappears because the syntax looks clean. Going manual doesn't mean going primitive — it means going honest. You trade ergonomics for auditability, and on systems where auditability is the only acceptable currency, that trade is obvious.
The engineers who will push Rust into real-time kernels and safety-critical firmware aren't waiting for a better async runtime. They're already writing enums — building by hand the executor-level control that the standard library is still working out how to stabilize.