Go Garbage Collector Internals: Mastering Performance Beyond GOGC=off
Go's garbage collector is the engine under the hood. Most engineers ignore it until P99 spikes start killing production SLAs — and by then, the heap is already a mess. The GC doesn't announce itself; it just quietly inflates latency until someone opens a trace and starts asking uncomfortable questions.
What separates teams that tune Go GC well from those that just throw GOGC=off at the problem is a solid mental model of what the collector is actually doing — phase by phase, byte by byte.
TL;DR: Quick Takeaways
- Go's GC uses a concurrent tri-color mark-and-sweep algorithm — most collection work runs alongside your code, not instead of it.
- True Stop-the-World pauses happen only at Mark Termination and Sweep Termination, and typically stay under 500 microseconds on modern hardware.
- GOGC controls heap growth ratio; GOMEMLIMIT caps absolute memory — they solve different problems and need to be tuned together.
- Allocation rate, not live heap size, is the primary driver of GC pressure in high-throughput services.
How Go Garbage Collector Works: The Core Architecture
Go's garbage collector is a concurrent, tri-color mark-and-sweep collector with a hybrid write barrier for concurrent correctness. The design goal is explicit and documented in the runtime source: minimize latency, not maximize throughput. This shapes every architectural decision — from how the heap is structured in 8KB spans to how background goroutines are scheduled against your mutator threads.
The collector works across two main phases: marking (finding live objects) and sweeping (reclaiming dead ones). Most of that work happens concurrently with your code running.
The Tri-Color Mark and Sweep Algorithm
The tri-color abstraction gives the GC a way to reason about object reachability during a concurrent scan — where the mutator (your application code) is actively modifying the heap at the same time. Every object in the heap is assigned to one of three sets at any given moment:
- White: Candidates for collection — they haven't been reached yet.
- Grey: Objects discovered by the collector, but their outgoing pointers haven't been fully scanned.
- Black: Confirmed live objects, with all their references fully traced.
The invariant the GC maintains is simple: a black object must never point directly to a white object. If this happened, and the pointer was the only reference, the white object would be incorrectly collected. The algorithm starts by marking all root objects (globals, goroutine stacks, runtime pointers) as grey. Background mark workers then pull grey objects off a work queue, scan their fields, shade any white objects they point to grey, and finally mark the original object black.
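The worklist loop described above can be sketched as a toy model in ordinary Go. This is purely didactic — objects, colors, and the grey queue here are plain data structures, nothing like the runtime's actual per-P work buffers:

```go
package main

import "fmt"

type color int

const (
	white color = iota
	grey
	black
)

// obj is a toy heap object: a color plus its outgoing pointers.
type obj struct {
	color color
	refs  []*obj
}

// mark runs the tri-color worklist algorithm from the given roots.
func mark(roots []*obj) {
	var work []*obj
	// Shade all roots grey and enqueue them.
	for _, r := range roots {
		if r.color == white {
			r.color = grey
			work = append(work, r)
		}
	}
	// Drain the grey set: scan each object's fields, shade white
	// children grey, then blacken the scanned object.
	for len(work) > 0 {
		o := work[len(work)-1]
		work = work[:len(work)-1]
		for _, child := range o.refs {
			if child.color == white {
				child.color = grey
				work = append(work, child)
			}
		}
		o.color = black
	}
}

func main() {
	// a -> b -> c is a live chain; d is unreachable garbage.
	c := &obj{}
	b := &obj{refs: []*obj{c}}
	a := &obj{refs: []*obj{b}}
	d := &obj{}

	mark([]*obj{a})
	fmt.Println(a.color == black, b.color == black, c.color == black, d.color == white)
}
```

When the loop finishes, everything reachable from the roots is black and anything still white can be reclaimed — which is exactly the state the sweep phase relies on.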
Hybrid Write Barriers and Concurrent Scanning
The biggest challenge in a concurrent GC is the Mutator Problem: your code might move a pointer while the collector is busy scanning. To prevent the GC from losing track of objects, Go uses a Hybrid Write Barrier (introduced in Go 1.8).
Unlike generational collectors in Java or .NET that track pointers between young and old generations, Go's hybrid barrier focuses strictly on maintaining the tri-color invariant during the concurrent mark phase. It combines two concepts:
- Dijkstra-style: Shading the new object grey when you create a new pointer.
- Yuasa-style: Shading the old referent grey when the pointer to it is overwritten.
This hybrid approach allows Go to skip the expensive Stack Rescan phase that plagued older versions, keeping STW pauses consistently low. The barrier intercepts every pointer store while GC is marking:
// Logical representation of the Hybrid Write Barrier.
// The compiler inserts this around pointer stores; it is active
// ONLY during the _GCmark phase.
func gcWriteBarrier(ptr *unsafe.Pointer, newVal unsafe.Pointer) {
	oldVal := *ptr
	// 1. Yuasa: shade the old referent so a deleted path is not lost.
	shade(oldVal)
	// 2. Dijkstra: shade the new referent so it is captured.
	//    (The real runtime only does this while the current goroutine's
	//    stack is still unscanned — shown unconditionally here.)
	shade(newVal)
	// 3. Finalize the store.
	*ptr = newVal
}
The cost of the write barrier is a slight increase in CPU overhead during the mark phase, but it is the primary reason Go can handle massive heaps without the multi-millisecond pauses seen in less optimized runtimes.
Understanding GC Phases: From Mark Termination to Sweep
A single GC cycle moves through a defined sequence of phases. Understanding which phases are concurrent and which require stopping the world is the foundation for interpreting GODEBUG=gctrace=1 output and understanding where your latency actually comes from. The runtime exposes phase transitions through runtime.ReadMemStats and through the gctrace output, which logs two STW durations per cycle: one for Sweep Termination and one for Mark Termination.
Stop-the-World Pauses in Go
Go has two STW pauses per GC cycle — not one. They are short by design, but they exist, and on large heaps with deep goroutine stacks, they can surprise you.
Sweep Termination happens first. The runtime stops all goroutines, finishes any remaining sweep work from the previous cycle, and enables the write barrier. This is typically the shorter of the two pauses — sub-100 microseconds on most services — because there's usually little residual sweep work left by the time the next cycle starts.
Mark Termination is the second STW. After the concurrent mark phase finishes (grey set drained), the runtime stops all goroutines again, flushes the remaining write-barrier buffers and per-P caches, disables the write barrier, and resets GC state for the next cycle. Thanks to the hybrid write barrier, Go no longer performs a full stack rescan here — but the pause still requires preempting every goroutine to a safe point. On a service with thousands of busy goroutines, getting the world fully stopped is where Mark Termination P99 outliers tend to come from.
// Read STW pause durations from the runtime
var stats runtime.MemStats
runtime.ReadMemStats(&stats)
// PauseNs is a circular buffer of the last 256 GC pause durations in ns.
// PauseEnd holds the timestamps of when each pause ended.
// NumGC is the number of completed GC cycles.
for i := 0; i < int(stats.NumGC) && i < 256; i++ {
	idx := (int(stats.NumGC) - 1 - i + 256) % 256
	fmt.Printf("GC %d pause: %v\n", stats.NumGC-uint32(i),
		time.Duration(stats.PauseNs[idx]))
}
The concurrent mark phase between these two STW events is where most GC CPU is spent. Background mark workers run on goroutines dedicated by the scheduler, consuming up to 25% of available CPU by default. If your allocation rate is high, the GC also recruits mutator goroutines to assist — this is the assist ratio kicking in, which we'll cover shortly. From a user perspective: the concurrent phase doesn't stop your code, but it contends for CPU. The STW pauses do stop your code, and they're the ones that show up as P99 spikes in APM traces.
Go GC Performance: Latency vs. Throughput
This is the design trade-off that defines the Go GC and that trips up engineers coming from JVM backgrounds. Go's collector is built for web services — specifically for services where tail latency at the 99th and 99.9th percentile matters more than maximizing total work done per second. A JVM G1GC or ZGC can collect garbage more efficiently in terms of raw throughput — more objects reclaimed per CPU cycle — but achieves that at the cost of longer or less predictable pause times. Go makes the opposite bet.
The concrete consequence: Go GC favors frequent, short collection cycles over infrequent, thorough ones. A service processing 50,000 requests/second might trigger a GC cycle every 200ms, spending ~25% of its CPU budget on the collector, but keeping every STW pause under 1ms. A JVM service doing equivalent work might collect half as often but pause for 5–10ms when it does. For web services with SLOs at P99 ≤ 5ms, Go's approach wins. For batch jobs maximizing total throughput over hours, it doesn't.
The assist ratio is where this trade-off becomes visible under load. When the allocation rate outpaces the background mark workers' ability to drain the grey set, the runtime forces mutator goroutines to do mark work proportional to how fast they're allocating — one assist per N bytes allocated, where N is computed from the current heap growth target. Assists show up as increased goroutine CPU time during GC cycles. In gctrace output, the assist portion of the cpu breakdown tells you how much CPU was diverted from your code to keep the mark phase from falling behind.
# Enable GC tracing to see real cycle data
GODEBUG=gctrace=1 ./your-service
# Output format (documented in the runtime package):
# gc # @#s #%: #+#+# ms clock, #+#/#/#+# ms cpu, #->#-># MB, # MB goal, # P
#
# Example:
# gc 14 @4.231s 3%: 0.037+2.1+0.18 ms clock,
#   0.29+0.45/1.8/0.12+1.4 ms cpu, 42->51->28 MB, 56 MB goal, 8 P
#
# Read it as:
# 0.037ms sweep termination STW + 2.1ms concurrent mark + 0.18ms mark termination STW
# cpu breakdown: sweep term + (assist/background/idle) mark + mark term
# 42 MB heap at cycle start, 51 MB at end of mark, 28 MB live after marking
# 3% of this program's CPU spent in GC since start, running on 8 processors
The three heap sizes — heap at cycle start, heap at end of mark, live heap after marking — tell you your actual live set. If the final number keeps growing cycle over cycle, you have a memory leak (objects genuinely staying alive). If the first number grows quickly between cycles but the final one stays flat, you have a high allocation rate: lots of short-lived objects being created and collected, which is GC pressure without a leak. The distinction matters for choosing the right fix.
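The same distinction can be checked programmatically. A hedged sketch: forcing a collection and reading HeapAlloc approximates the post-GC live set, so sampling it around a workload separates churn from retention (the "leak" here is simulated with a global slice; `liveSet` and `retained` are illustrative names, not runtime APIs):

```go
package main

import (
	"fmt"
	"runtime"
)

var (
	retained [][]byte // simulated leak: buffers retained globally
	sink     []byte   // defeats dead-code elimination for the churn loop
)

// liveSet forces a collection and returns the post-GC heap size,
// approximating the final heap figure from gctrace output.
func liveSet() uint64 {
	runtime.GC()
	var s runtime.MemStats
	runtime.ReadMemStats(&s)
	return s.HeapAlloc
}

func main() {
	base := liveSet()

	// Pure churn: short-lived garbage should leave the live set roughly flat.
	for i := 0; i < 1000; i++ {
		sink = make([]byte, 64*1024)
	}
	sink = nil
	afterChurn := liveSet()

	// Simulated leak: retain 16MB globally.
	for i := 0; i < 16; i++ {
		retained = append(retained, make([]byte, 1<<20))
	}
	afterLeak := liveSet()

	fmt.Printf("baseline:     %5d KB\n", base/1024)
	fmt.Printf("after churn:  %5d KB (flat: pressure, not a leak)\n", afterChurn/1024)
	fmt.Printf("after 'leak': %5d KB (growing: retained memory)\n", afterLeak/1024)
}
```

In production you would get the same signal from gctrace or heap profiles rather than forcing collections, but the principle — compare post-GC heap sizes over time, not peak heap — is identical.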
Advanced GC Tuning: GOGC and GOMEMLIMIT Guide
Go exposes two knobs for tuning GC behavior. They were designed for different problems, and using only one while ignoring the other is the most common tuning mistake in production Go services.
GOGC has been the primary control since Go 1.0. It sets the percentage by which the heap is allowed to grow before a new GC cycle is triggered. The default is 100, meaning the runtime triggers a collection when the live heap doubles from its size after the previous cycle. Set GOGC=50 and the GC triggers earlier, at 1.5× live heap — more frequent collections, lower peak memory, higher CPU overhead. Set GOGC=200 and collections happen less frequently, at 3× live heap — less CPU on GC, higher peak memory. The trade-off is linear and predictable: halving GOGC roughly doubles collection frequency and CPU cost while halving peak heap growth.
GOMEMLIMIT, introduced in Go 1.19, is a soft absolute memory cap. It tells the runtime: don't let total memory managed by the runtime exceed N bytes. When the live heap approaches the limit, the runtime increases collection frequency regardless of GOGC, even running synchronous GC if necessary to stay under the cap. This is the right knob for containerized workloads where your pod has a hard memory limit — set GOMEMLIMIT to about 90% of your container's memory limit and let the runtime manage within that budget.
import "runtime/debug"

func init() {
	// Set GOGC to 50: collect when heap reaches 1.5x live set.
	// Useful for latency-sensitive services with tight memory budgets.
	debug.SetGCPercent(50)

	// Set a soft 512MB memory limit.
	// Prevents OOM kills in containers without hard-stopping allocations.
	// Set to ~90% of your container's memory limit.
	debug.SetMemoryLimit(512 * 1024 * 1024)
}
The GC death spiral is what happens when you set GOGC very low (or to -1 and rely entirely on GOMEMLIMIT) on a service that's already under memory pressure. The runtime is near the memory limit, so it triggers a collection. The collection consumes CPU. With less CPU available, request processing slows and allocations pile up. The heap grows faster than the collector can drain it. The runtime triggers another collection immediately. Now you're spending 80–90% of CPU on GC and getting almost no useful work done — the service is alive but effectively frozen. The fix: never set GOMEMLIMIT below your service's steady-state live heap. Leave at least 25–30% headroom above the live set for the allocator to work in without immediately triggering collection.
How to Reduce GC Pressure in Production Services
GC pressure is a function of allocation rate, not heap size. A service with a 2GB live heap but a stable allocation pattern puts far less pressure on the collector than a service with a 200MB heap that's generating 500MB of short-lived garbage per second. The strategies that matter are the ones that reduce how much you allocate — not how much you keep alive.
Object Pooling with sync.Pool
sync.Pool is the most direct tool for reducing allocation pressure for objects that are frequently created and discarded. A pool maintains a per-P (logical processor) free list of objects. When you call pool.Get(), you get an existing object if one's available, avoiding an allocation entirely. When you call pool.Put(), the object goes back to the pool rather than becoming garbage. The GC drains pool contents across cycles (since Go 1.13 through a one-cycle victim cache), so pools don't prevent collection — they reduce the number of allocations that need to happen in the first place.
var bufPool = sync.Pool{
	New: func() any {
		// Allocate a buffer large enough for most requests.
		// This runs only when the pool is empty.
		b := make([]byte, 0, 64*1024)
		return &b
	},
}

func handleRequest(w http.ResponseWriter, r *http.Request) {
	// Get a buffer from the pool — zero allocation if the pool is warm.
	bp := bufPool.Get().(*[]byte)
	buf := (*bp)[:0] // reset length, keep capacity
	defer func() {
		*bp = buf
		bufPool.Put(bp) // return to pool, not to GC
	}()
	// Use buf for request processing...
	buf = append(buf, processRequest(r)...)
	w.Write(buf)
}
Benchmark your pool before and after with go test -bench=. -benchmem. The allocs/op metric tells you directly whether pooling is reducing heap allocations. A well-warmed pool on a high-throughput HTTP handler can cut allocs/op by 60–80% for the buffering path alone, which translates directly to fewer GC cycles per second. One important constraint: objects coming out of the pool may carry state from a previous use — always reset object state after Get (or before Put) before relying on it.
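As a sketch of what such a measurement looks like, testing.Benchmark can report allocs/op from a plain program, outside go test. The workload below is illustrative (withoutPool, withPool, and the sink variable are invented for the comparison); real numbers depend on your handler:

```go
package main

import (
	"fmt"
	"sync"
	"testing"
)

var pool = sync.Pool{
	New: func() any {
		b := make([]byte, 0, 4096)
		return &b
	},
}

// sinkSlice forces the buffer to escape to the heap so the
// unpooled path genuinely allocates on every call.
var sinkSlice []byte

func withoutPool() {
	buf := make([]byte, 0, 4096)
	buf = append(buf, "payload"...)
	sinkSlice = buf
}

func withPool() {
	bp := pool.Get().(*[]byte)
	buf := (*bp)[:0]
	buf = append(buf, "payload"...)
	*bp = buf
	pool.Put(bp)
}

func main() {
	// testing.Benchmark lets us measure allocs/op without `go test`.
	noPool := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			withoutPool()
		}
	})
	pooled := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			withPool()
		}
	})
	fmt.Printf("no pool: %d allocs/op\n", noPool.AllocsPerOp())
	fmt.Printf("pooled:  %d allocs/op\n", pooled.AllocsPerOp())
}
```

The unpooled path should show at least one allocation per operation, while a warm pool amortizes its allocations toward zero — the same delta you would read off allocs/op in a normal benchmark run.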
Avoiding Pointer-Heavy Structures
The GC must scan every pointer in every live object during the mark phase. A struct with 10 pointer fields costs 10 pointer scans per mark cycle. A struct with 10 integer fields costs zero scans — the GC skips it entirely during the pointer scan phase. This is not a micro-optimization. In services with millions of live objects, the composition of those objects directly affects how long the concurrent mark phase runs and how much work Mark Termination needs to do during its STW.
The practical rule: prefer value types over pointer types in struct fields wherever semantics allow. Use string instead of *string. Use []byte instead of *[]byte — a slice header is only three words, and dropping the extra pointer removes an indirection the GC must follow. Prefer flat arrays over linked lists for data that's traversed sequentially. If a struct field is a map or interface, it contains pointers internally — keep those at the outer layer of your data model, not scattered through every inner struct. Escape analysis tooling (go build -gcflags='-m') can help identify which allocations move to the heap unnecessarily, reducing the pointer density the GC has to chase.
Pre-allocating Slices and Reducing Small Allocations
Every append that exceeds a slice's current capacity triggers a new allocation and a copy. In hot paths — request serialization, log formatting, response building — these incremental allocations add up fast. Pre-allocating with a reasonable capacity estimate eliminates the growth allocations entirely and gives the GC less to chase.
// Expensive: multiple allocations as the slice grows
func buildResponse(items []Item) []byte {
	var buf []byte
	for _, item := range items {
		buf = append(buf, item.Encode()...)
	}
	return buf
}

// Better: single allocation, zero reallocs for expected sizes
func buildResponseFast(items []Item) []byte {
	// Estimate: each encoded item averages ~128 bytes
	buf := make([]byte, 0, len(items)*128)
	for _, item := range items {
		buf = append(buf, item.Encode()...)
	}
	return buf
}

// For maps: pre-size to avoid rehashing and internal allocation churn
cache := make(map[string]Entry, expectedSize)
Small allocations — anything under 32KB — are served from per-P mcache structures backed by mspan free lists. They're fast, but they still create GC work. A service making 2 million small allocations per second will trigger GC cycles at a rate determined by GOGC and the average object lifetime, regardless of how individually cheap each allocation is. The goal isn't to eliminate allocations — it's to keep them in the critical path only where they're genuinely necessary. Benchmark suspicious paths with -benchmem before and after every change, and treat a reduction in allocs/op as a first-class optimization metric alongside CPU and wall time.
FAQ
When exactly does Go's garbage collector trigger a collection cycle?
The Go GC triggers a new cycle when the heap size reaches a target computed from the live heap at the end of the previous cycle, scaled by GOGC. With the default GOGC=100, if the live heap after the last collection was 100MB, the next cycle starts when total heap allocation reaches 200MB. The runtime also triggers a forced GC cycle if more than 2 minutes elapse without one, regardless of allocation. GOMEMLIMIT can cause cycles to trigger earlier if the heap approaches the configured limit. You can inspect the current heap target via runtime.ReadMemStats — it's the NextGC field.
What is the assist ratio and why does it spike my goroutine CPU?
When the background mark workers can't drain the grey set fast enough to keep up with the allocation rate, the Go runtime forces allocating goroutines to perform mark work proportional to how much they're allocating. This is the assist mechanism — each goroutine pays for its allocations in mark work before the allocation completes. Assist ratio spikes appear as increased CPU time on your request-handling goroutines during GC cycles. You'll see this in pprof CPU profiles as time spent inside runtime.gcAssistAlloc. The fix is reducing allocation rate: pooling, pre-allocation, or raising GOGC to give the heap more room before collection kicks in.
How do STW pauses interact with goroutine count?
Both STW phases — Sweep Termination and Mark Termination — require the scheduler to preempt every running goroutine and wait for them to reach a safe point. Go's runtime uses asynchronous preemption (since Go 1.14) to interrupt goroutines via signals rather than waiting for cooperative yield points, so most preemptions happen within microseconds. Goroutine stacks themselves are scanned during the concurrent mark phase, briefly pausing each goroutine in turn — work proportional to the total volume of goroutine stack memory. A service running 10,000 goroutines with large stack frames therefore spends measurably more of its mark phase on stack scanning than one running 500 goroutines, and has far more goroutines to bring to a stop at each STW. Reducing goroutine count or stack depth reduces both costs.
What is the GC death spiral and how do I recover from it?
The GC death spiral occurs when the runtime is under severe memory pressure — typically when GOMEMLIMIT is set close to the live heap size — and begins spending so much CPU on collection that it can't process requests fast enough to prevent new allocations. The heap never shrinks because the service is still receiving load; the GC runs continuously; CPU utilization approaches 100% with negligible useful throughput. Recovery requires one of: shedding load immediately (circuit breaking), restarting the service, or increasing available memory. Prevention requires always maintaining at least 25–30% headroom between your live heap size and your GOMEMLIMIT value.
Is it safe to call runtime.GC() manually in production?
runtime.GC() triggers a full garbage collection cycle and blocks the calling goroutine until it completes — the cycle still uses the concurrent machinery, but the caller waits, and the cycle's STW phases hit every goroutine. It's appropriate in very specific cases: immediately before a memory-intensive batch operation to start with a clean heap, or in test setup to get a deterministic baseline. In production request-handling code, calling it manually is almost always wrong — you're imposing collection work on a random request, and you're fighting the runtime's own scheduling, which already knows when the heap needs collection. Use GOGC and GOMEMLIMIT to guide the runtime's decisions instead of overriding them.
How do I tell the difference between GC pressure and an actual memory leak in Go?
The clearest signal is the after heap size in gctrace output across multiple GC cycles. If the post-collection heap size is stable or grows slowly in proportion to active goroutines and caches, you have GC pressure — high allocation rate of short-lived objects, but no accumulation. If the post-collection heap grows steadily across many cycles while request load stays constant, you have a genuine memory leak: objects are being retained longer than they should be. Confirm it with go tool pprof heap profiles taken 5–10 minutes apart and diffed — growing allocations concentrated in specific call paths are your leak. GC pressure shows up as high allocation rate with flat retained heap; leaks show up as growing retained heap with potentially normal allocation rate.