Go Stack Management: The 2KB Lie and What Happens After

Everyone loves the "goroutines are cheap, start a million of them" pitch. And it's not wrong — a 2KB initial stack versus 2MB for an OS thread is a real win. But that number is a starting point, not a ceiling, and the machinery Go uses to grow that stack is where the real cost hides. Understanding Go's memory management internals means understanding what happens between "goroutine launched" and "goroutine done" when the stack has to move.


TL;DR: Quick Takeaways

  • Go dropped segmented stacks in 1.3 — contiguous stacks mean the entire stack gets copied to a new memory block when it needs to grow.
  • Every non-leaf function call pays a stack guard check in the prologue — that's CPU cycles on every call frame, not just when growth happens.
  • When runtime.copystack fires, Go rewrites every pointer on the stack — pointer adjustment is not free, especially on goroutines with deep frames.
  • Stack thrashing — repeated grow/shrink cycles — can spike P99 latency in high-load systems with no obvious cause in profiler output.

The Morestack Ceremony: runtime.morestack Performance Overhead

Before your function body executes a single line of business logic, the Go runtime has already run a check. This is the function prologue — a few instructions injected by the compiler into every non-leaf function to verify that there's enough stack space left. The check compares the current stack pointer against the stack guard value (stackguard0) stored in the goroutine's g struct. If the pointer is within bounds, execution falls through. If not, control drops into runtime.morestack and from there into runtime.newstack, which allocates a bigger stack and migrates everything. Sounds fast. Across ten million goroutines doing deep call chains, those extra instructions accumulate into something you can actually measure.

What the Prologue Looks Like in Practice

The disassembly doesn't lie. Here's what a simple Go function compiles to on amd64 — before it does anything you actually wrote:

// func processRequest(r *Request) string
// go tool objdump output (simplified; labels added for readability):

TEXT main.processRequest(SB)
  MOVQ (TLS), CX               // load current goroutine (g) from TLS
  CMPQ SP, 16(CX)              // compare stack pointer vs g.stackguard0
  JBE  growstack               // not enough stack: take the slow path
  SUBQ $64, SP                 // allocate local frame
  MOVQ BP, 56(SP)              // save caller's frame pointer
  LEAQ 56(SP), BP
  // ... your actual function body starts here
growstack:
  CALL runtime.morestack_noctxt(SB)  // grow the stack, then...
  JMP  main.processRequest(SB)       // ...retry from the top

The CMPQ and JBE pair is the prologue. It's in every function that allocates a local frame. On the fast path (no growth needed), it's a compare and a branch not taken — cheap but not zero. In tight loops that call many small functions, you're paying this tax on every frame. When growth does happen, the JBE fires and you drop into runtime.newstack — that's where the real work begins. Microbenchmarks on functions called in hot paths attribute a few nanoseconds (3–8ns) of per-call overhead to the prologue and frame setup, which compounds fast at high goroutine counts.
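To put a rough number on the fast path yourself, a harness like the following works. Note it measures total per-call cost (prologue plus frame setup and return), not the guard check in isolation; the function name frame is illustrative.

```go
package main

import (
	"fmt"
	"testing"
)

//go:noinline
func frame() int {
	// noinline plus a local frame keeps the stack-guard prologue
	// in the compiled output instead of letting the inliner
	// erase the call entirely
	var buf [64]byte
	return int(buf[0])
}

func main() {
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			_ = frame()
		}
	})
	// total per-call cost: guard check + frame setup + return
	fmt.Println(res.NsPerOp(), "ns/op")
}
```

Compare against the same benchmark with the //go:noinline directive removed to see what inlining buys you on this path.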

Why Leaf Functions Get a Pass

The compiler is smart enough to skip the guard check on leaf functions — functions that don't call anything else and whose stack usage is statically provable to be small. That's your optimization target if you're writing hot-path code: push the prologue cost down by flattening call chains and inlining aggressively. The compiler's inliner does this automatically up to a cost budget; you can inspect it with -gcflags="-m=2" to see what made the cut.
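A minimal illustration of the difference, with illustrative function names: add is a leaf the inliner will flatten, so the loop body carries no prologue of its own.

```go
package main

import "fmt"

// add is a leaf: it calls nothing and its frame is statically
// tiny, so it gets no stack-guard check and is inlined anyway
func add(a, b int) int { return a + b }

// sum pays the guard check once per call, not once per element,
// because the add calls inside the loop are inlined away
func sum(xs []int) int {
	total := 0
	for _, x := range xs {
		total = add(total, x)
	}
	return total
}

func main() {
	fmt.Println(sum([]int{1, 2, 3, 4}))
	// Verify with: go build -gcflags="-m=2" .
	// Look for "can inline add" in the compiler output.
}
```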

The Great Migration: How Go Copies Stacks

When the stack guard trips and runtime.newstack decides growth is needed, Go doesn't ask the OS for another memory segment and chain it on. It allocates a new, larger contiguous block — typically double the current size — copies the entire old stack into it, and then throws the old one away. This is the contiguous stack model, adopted in Go 1.3 to kill the hot split problem that plagued segmented stacks. The trade-off is that you never pay the hot split penalty, but you pay a full copy every time the stack outgrows its current allocation.


Pointer Adjustment: The Part Nobody Talks About

Copying raw bytes is easy. The hard part is pointer adjustment. When the stack moves to a new address, every pointer on the stack that points back into the stack is now wrong. Go's runtime has to scan the entire stack, identify all such pointers, and rewrite them to reflect the new base address. This is what runtime.copystack actually does — it's not a memcpy, it's a memcpy followed by a pointer-rewriting pass. For a goroutine with 50 frames and a few hundred stack-resident pointers, this is a non-trivial amount of work done while that goroutine is paused.

// Triggering copystack deliberately — deep recursion with a large
// stack-resident argument (Go performs no tail-call elimination)
func recurse(depth int, buf [512]byte) {
    if depth == 0 {
        return
    }
    // The [512]byte passed by value lives in every frame and
    // exhausts the initial 2KB stack within a handful of calls;
    // runtime.copystack fires and doubles the stack.
    recurse(depth-1, buf)
}

// To observe: record with runtime/trace and open the result in
// go tool trace to see the stack growth events.

The [512]byte argument is what forces the issue — a copy of it sits in every frame and eats through the initial allocation fast. Once runtime.copystack fires, execution resumes on the new stack, but the latency spike is already in the trace. In production systems, this shows up as sudden P99 jitter on endpoints that hit deep call paths — and it's invisible to most profilers because the time is spent in runtime internals, not in your code.

Stack Shrinking and the GC Connection

Go also shrinks stacks — it doesn't hold onto bloated allocations forever. Stack shrinking happens during garbage collection cycles: if a goroutine's stack is less than 25% utilized, the GC will halve it. This is how stack memory gets reclaimed — long-lived goroutines that had a spike of deep recursion and then went back to shallow work will eventually get their memory back. But the shrink is itself a copy operation, same mechanics as growth. Goroutines that oscillate between deep and shallow call stacks pay this cost repeatedly.
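One way to watch the grow-then-shrink cycle from inside a program is to sample runtime.MemStats around a deep call. A sketch, assuming only that StackInuse moves with goroutine stack sizes; since shrinking halves a stack at most once per GC cycle, several collections may pass before memory returns to baseline.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// deep burns stack: each frame holds a 256-byte array copy
//go:noinline
func deep(n int, pad [256]byte) {
	if n > 0 {
		deep(n-1, pad)
	}
}

func stackInuse() uint64 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.StackInuse
}

func main() {
	before := stackInuse()
	deep(200, [256]byte{}) // forces several copystack doublings
	grown := stackInuse()  // stack is still large at this point
	debug.FreeOSMemory()   // force a GC; under-used stacks become eligible to shrink
	after := stackInuse()
	fmt.Println(before, grown, after)
}
```

Don't expect `after` to drop all the way back in one shot: the shrink is incremental and the running goroutine's own stack may only shrink at a later safepoint.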

Stack vs Heap: Golang Escape Analysis Performance

The compiler's escape analysis decides whether a variable lives on the stack or gets promoted to the heap. Stack allocation is normally faster — no GC involvement, better cache locality, deterministic lifetime. But "normally" is doing a lot of work in that sentence. When a goroutine's stack is already large and a new allocation would trigger another copy, putting that variable on the heap can actually be the cheaper option. Escape to heap trades a one-time GC overhead for avoiding a full stack migration — and in some cases that's the right trade.

When the GC is Cheaper Than copystack

Stack allocation costs roughly 1–3ns per allocation in the fast path. Heap allocation via mallocgc costs 20–100ns depending on size class and GC pressure. That sounds like an obvious win for the stack. But runtime.copystack on a 64KB goroutine stack runs in the 500ns–2µs range, depending on pointer density. If your goroutine is already at 60KB and the next function call pushes it over, you're paying that copy cost. A heap-escaped object that sits in the same GC cycle adds maybe 10–20ns of GC overhead amortized. The math isn't always in the stack's favor.

Allocation Path              Typical Cost    GC Involvement           Pointer Adjustment
Stack (no growth)            1–3 ns          None                     None
Stack (triggers copystack)   500 ns – 2 µs   Indirect (shrink at GC)  Full stack rescan
Heap (mallocgc)              20–100 ns       Yes (mark + sweep)       None
Heap (sync.Pool reuse)       5–15 ns         Minimal                  None
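The sync.Pool row deserves a sketch: reuse keeps the object off the allocation fast path entirely after warm-up. The 4KB buffer size and the handle function are illustrative; the pointer-to-slice trick avoids an extra interface allocation on each Put.

```go
package main

import (
	"fmt"
	"sync"
)

// bufPool recycles 4KB scratch buffers across requests,
// amortizing the mallocgc cost from the table above
var bufPool = sync.Pool{
	New: func() any { b := make([]byte, 4096); return &b },
}

func handle(payload string) int {
	bp := bufPool.Get().(*[]byte)
	defer bufPool.Put(bp) // return the buffer for reuse, no GC churn
	return copy(*bp, payload)
}

func main() {
	fmt.Println(handle("hello world")) // number of bytes staged in the pooled buffer
}
```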

Track stack allocation rate with runtime.MemStats — specifically StackInuse and StackSys. A system where StackInuse is climbing steadily under constant goroutine count is telling you stacks are growing, not that you're creating more goroutines. That's your signal to look at call depth. For deeper analysis of how objects escape and affect GC pressure, see our breakdown of golang escape analysis internals.


Reading Escape Analysis Output

The garbage collector overhead from escaped variables is visible if you know where to look. Run your code with -gcflags="-m" and read the output for "escapes to heap" annotations. Every escape is a potential GC object — not a disaster, but worth auditing in hot paths. The less-obvious cost is that heap-allocated objects increase pointer density in the GC scan, which extends mark phase duration. On a high-throughput service, the difference between 10K stack-allocated request objects and 10K heap-escaped ones can show up as 10–15% longer GC pauses.
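A minimal pair to run -gcflags="-m" against, with illustrative names: one constructor keeps its value on the stack, the other leaks a pointer out of the frame.

```go
package main

import "fmt"

type Request struct{ ID int }

// stays on the stack: the compiler proves r dies with the frame
func localOnly() int {
	r := Request{ID: 1}
	return r.ID
}

// escapes to heap: the returned pointer outlives the frame, so
// go build -gcflags="-m" reports "moved to heap: r" here
func escapes() *Request {
	r := Request{ID: 2}
	return &r
}

func main() {
	fmt.Println(localOnly(), escapes().ID)
}
```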

Highload Pitfalls: Hidden Cost of Goroutine Stack Growth

In production, stack management issues don't announce themselves cleanly. They show up as unexplained P99 latency spikes, as GC pause anomalies, or as "the service gets slow under load and then recovers" symptoms that make on-call engineers question their life choices. The culprit is often not the goroutine count — it's what those goroutines are doing to their stacks while they run.

Stack Thrashing: The Loop Nobody Warns You About

Stack thrashing happens when a goroutine repeatedly crosses the growth boundary in a loop. Grow, do work, shrink (at next GC), grow again, repeat. Each cycle is a full copy. A common pattern: a worker goroutine that handles requests with variable depth — shallow requests do nothing interesting, but deep requests trigger deep recursion that pushes the stack past its current size. After the deep request finishes, the stack shrinks at GC. Next deep request: another copy. If deep requests arrive faster than GC cycles complete, you accumulate copy latency linearly.

// Stack thrashing scenario — worker processes mixed-depth tasks
func worker(tasks <-chan Task) {
    for task := range tasks {
        // shallow tasks: stack stays at 2KB, no growth
        // deep tasks: stack grows to 8KB+ on processDeep
        // After GC between deep tasks: stack shrinks back to 2KB
        // Next deep task: copystack fires again
        if task.IsDeep {
            processDeep(task, 0) // recursive, 20+ frames
        } else {
            processShallow(task) // 3 frames max
        }
    }
}

// Fix: pre-warm the stack or restructure to avoid recursion
// (observe growth events with runtime/trace + go tool trace)

The fix is either pre-allocating stack space by doing a dummy deep call at goroutine startup (forcing the stack to grow once, at least until a later GC shrinks it again), or restructuring the recursive logic into an explicit stack-based iteration. The former is a hack that works; the latter is the correct engineering answer. There's also runtime/debug.SetMaxStack — though that's a ceiling, not a floor, and doesn't prevent the thrashing; it just turns a silent performance issue into a panic at the stack limit, which is at least debuggable.
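The pre-warm hack looks roughly like this. A sketch assuming you know the worst-case working depth; growStack, its frame size, and the worker shape are all illustrative.

```go
package main

import "fmt"

// growStack forces the stack past roughly 64KB before any real
// work runs: each frame pins a 1KB array, and using pad's value
// keeps the compiler from optimizing it away
//go:noinline
func growStack(n int) byte {
	var pad [1024]byte
	pad[0] = byte(n)
	if n > 0 {
		return growStack(n-1) + pad[0]
	}
	return pad[0]
}

func worker(tasks <-chan int, done chan<- int) {
	growStack(64) // pay copystack once, up front
	for t := range tasks {
		done <- t * 2 // real work starts on a pre-grown stack
	}
}

func main() {
	tasks := make(chan int, 1)
	done := make(chan int, 1)
	go worker(tasks, done)
	tasks <- 21
	fmt.Println(<-done)
	close(tasks)
}
```

Remember the caveat from above: the runtime may shrink this stack at a later GC, so the pre-warm only holds until then — thrashing has to be fixed structurally if it persists.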

Monitoring Without Going Insane

For a live production system, the runtime.MemStats fields StackInuse and StackSys give you aggregate stack memory. More useful is exporting these to your metrics system and watching the ratio — StackInuse/StackSys under load tells you utilization efficiency. A ratio that drops sharply under load means stacks grew and then GC shrank them — classic thrashing signature. Pair that with runtime/trace goroutine events to identify which specific goroutines are growing. For one-off diagnosis, debug.FreeOSMemory() followed by a MemStats snapshot gives a clean baseline. For deeper profiling patterns on memory-heavy services, the approach in go profiling in production covers continuous MemStats integration.
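A polling sketch of the ratio check described above; the exporter wiring (Prometheus and friends) is omitted, and the one-second interval stands in for a real 5–10 second scrape.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// stackStats returns in-use and OS-reserved stack bytes plus
// their ratio — the utilization-efficiency signal from above
func stackStats() (inuse, sys uint64, ratio float64) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.StackInuse, m.StackSys, float64(m.StackInuse) / float64(m.StackSys)
}

func main() {
	for i := 0; i < 3; i++ {
		inuse, sys, ratio := stackStats()
		// export these as gauges; alert on a ratio that drops
		// sharply under load (the thrashing signature)
		fmt.Printf("stack_inuse=%d stack_sys=%d ratio=%.2f\n", inuse, sys, ratio)
		time.Sleep(time.Second) // stand-in for the scrape interval
	}
}
```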

FAQ

What is the maximum size of a goroutine stack in Go?

On a 64-bit architecture, the default stack limit is 1GB, controlled by runtime/debug.SetMaxStack. On 32-bit systems it's 250MB. The runtime will throw a stack overflow panic if a goroutine hits this ceiling — typically from runaway recursion with no base case or from a genuinely pathological call depth. In practice, most production goroutines never exceed a few hundred KB. The 1GB limit is a safety net, not a design target — if you're approaching it, the problem is in the code, not the limit.


Does Go ever shrink the stack automatically?

Yes. Stack shrinking is built into the garbage collection cycle. During GC, the runtime inspects each goroutine's stack utilization. If a goroutine is using less than 25% of its current stack allocation, the runtime halves the stack via the same copy mechanism used for growth. This keeps long-lived goroutines from holding onto stack memory they no longer need after a spike. The cost is the same as a growth copy — it's a full migration. Goroutines with consistently variable call depth pay this cost both ways, which is exactly the thrashing scenario described above.

How can I monitor stack growth in a running Go service?

runtime.MemStats exposes StackInuse (bytes currently in use by goroutine stacks) and StackSys (total stack memory obtained from the OS). Export these to Prometheus or your metrics backend with a 5–10 second scrape interval. Spikes in StackInuse that correlate with latency increases are your first signal. For granular diagnosis, runtime/trace records goroutine stack growth events with timestamps — use go tool trace to visualize them. debug.FreeOSMemory() before a StackInuse snapshot forces GC and gives you a clean post-shrink baseline to compare against peak.

Why did Go abandon segmented stacks?

The hot split problem made segmented stacks impractical at scale. In the segmented model, a function call near a segment boundary would allocate a new segment, execute the function, and then free it — only to repeat the same allocation on the next call iteration. A tight loop calling a function that happened to land on a segment boundary would hammer the allocator continuously. Go's switch to contiguous stacks in 1.3 eliminated this by design: you copy once, then run on a larger contiguous block. The hot split is gone. The cost moved from unpredictable per-call segment churn to a predictable but heavier periodic copy — a better trade for most workloads.

When does escape analysis actually hurt performance?

Go's escape analysis is conservative by design — when the compiler can't prove a variable's lifetime is bounded to the current stack frame, it escapes to heap. This is correct behavior, but it means interface conversions, closures capturing variables, and pointers returned from functions will often escape even when you'd prefer them not to. The hurt comes from GC pressure accumulation: in a hot path that processes 100K requests/sec, even a single extra heap escape per request adds 100K objects/sec to the GC's mark queue. Use -gcflags="-m" to audit escapes in critical paths, and benchmark before and after with go test -bench -benchmem to see if the allocation rate change is actually affecting throughput.

Can I control how much stack a goroutine starts with?

Not directly through a public API. The initial goroutine stack size is set internally by the runtime — 2KB as of recent Go versions — and there's no go func() { ... }(stackSize: 64*1024) syntax. The workaround for pre-allocating stack space is to call a dummy recursive function at goroutine startup that forces the stack to grow to your expected working size, then return and let the real work begin. It's inelegant, but it eliminates first-call growth latency for goroutines you know will hit deep call paths immediately. A cleaner architectural answer is to bound recursion depth explicitly and restructure algorithms to use explicit stack data structures on the heap instead of relying on the call stack. See also our analysis of go concurrency patterns and performance for goroutine lifecycle design.
