Goroutine Leak Patterns That Kill Your Service Without Warning
A goroutine leak is a goroutine that was spawned and never terminated — it holds stack memory, blocks on a channel or syscall, and the Go runtime scheduler has no mechanism to reclaim it automatically.
- Audit every go func() call — confirm it has an explicit exit condition via context cancellation or a done channel.
- Run the pprof goroutine endpoint in staging before every release; a rising goroutine count under stable load is a leak, not a feature.
- Pass context.Context as the first argument to any function that spawns goroutines — and propagate the caller's context rather than creating a fresh one.
- Never write to an unbuffered channel without a select with a default case or a cancellation guard.
// Leaky: goroutine blocks forever if nobody reads from result
func fetchData(url string) chan string {
	result := make(chan string) // unbuffered
	go func() {
		resp, _ := http.Get(url)
		body, _ := io.ReadAll(resp.Body)
		result <- string(body) // blocks if caller already moved on
	}()
	return result
}
The caller times out, drops the channel reference, and moves on. The goroutine inside doesn't. It's still waiting to send. That stack — 2 KB minimum, growing on demand — is yours to keep until the process dies.
The Anatomy of a Leaky Bucket: Why the Scheduler Wont Save You
There's a persistent myth in the Go community: goroutines are cheap, so leaking a few is harmless. The scheduler is smart. It'll clean things up. It won't. The Go runtime scheduler is a cooperative M:N scheduler — it multiplexes goroutines across OS threads, parks blocked ones, and resumes them when they're unblocked. What it does not do is terminate goroutines that are stuck. It has no concept of "this goroutine has been waiting too long." Orphaned goroutines are invisible to the garbage collector because they're not garbage — they're parked, which looks identical to a goroutine legitimately waiting for I/O.
The memory profile tells the real story. Each goroutine starts with a 2–8 KB stack that grows dynamically. A service under moderate load that leaks 10 goroutines per request, processing 500 requests per minute, accumulates 300,000 goroutines per hour. At 4 KB average stack, that's 1.2 GB of stack memory — before heap allocations for any closures captured inside those goroutines. The heap profiler won't show this cleanly because stacks live outside the main heap. You're chasing a ghost the default tooling isn't built to catch.
The leaky bucket in Go concurrency isn't the channel or the goroutine — it's the architectural assumption that something else will handle cleanup. Nothing will. Ownership of termination belongs to the creator, full stop.
// Leaky HTTP handler: goroutine outlives the request
func handler(w http.ResponseWriter, r *http.Request) {
	go func() {
		// r.Context() is cancelled when handler returns,
		// but this goroutine ignores it entirely
		time.Sleep(30 * time.Second)
		log.Println("background work done")
	}()
	w.WriteHeader(http.StatusAccepted)
}
The handler returns in microseconds. The goroutine lives for 30 seconds. Multiply by concurrent users and you have a slow-motion OOM that only manifests at 2 AM on the highest-traffic day of the year.
Treat goroutine lifetime as a resource contract: if you open it, you close it — with the same discipline you'd apply to a file descriptor or a database connection.
The Silent Killer: Blocking on Unbuffered Channels
The blocked-channel-send leak is one of the most common goroutine leak patterns in production Go codebases, and it's subtle precisely because it looks correct at first glance. An unbuffered channel send blocks until a receiver is ready. If that receiver exits early — due to a timeout, a context cancellation it didn't propagate, or an error it decided to handle by returning — the sender goroutine parks indefinitely. The nil-channel variant is even worse: a send or receive on a nil channel blocks forever, no panic, no error, just silence.
// Fixed: select with context prevents indefinite block
func fetchData(ctx context.Context, url string) chan string {
	result := make(chan string, 1) // buffered: sender won't block
	go func() {
		req, _ := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			result <- "" // non-blocking due to buffer
			return
		}
		defer resp.Body.Close()
		body, _ := io.ReadAll(resp.Body)
		select {
		case result <- string(body):
		case <-ctx.Done(): // exit if caller is gone
		}
	}()
	return result
}
Two changes do the work: a buffered channel of size 1 means the goroutine can always send without waiting for a receiver, and the select on ctx.Done() gives the goroutine an exit when the caller's context is cancelled. Neither change is complex. Both require explicit intent — they don't happen by accident.
An unbuffered channel is a synchronization point, not a communication medium — use it only when you explicitly want the sender to wait for the receiver, and always pair it with a cancellation path.
Diagnostics: Hunting Ghosts in the Runtime
You suspect a leak. The service's memory is trending upward, latency percentiles are drifting, and the on-call alert fired at 3 AM. The heap profile looks normal. The CPU profile looks normal. The leak is in goroutines, and the first tool you reach for should be pprof. The goroutine profile in pprof gives you a stack trace for every live goroutine, grouped by stack signature. A goroutine-leak pprof workflow starts with a single terminal command against your running service — assuming you've imported net/http/pprof.
# Capture goroutine profile and open in interactive mode
go tool pprof http://localhost:6060/debug/pprof/goroutine
# Or dump a raw text snapshot sorted by count
# (quote the URL so the shell doesn't interpret the ?)
curl 'http://localhost:6060/debug/pprof/goroutine?debug=1' | head -100
# Compare two snapshots 30 seconds apart to find growing stacks
curl 'http://localhost:6060/debug/pprof/goroutine?debug=2' > snap1.txt
sleep 30
curl 'http://localhost:6060/debug/pprof/goroutine?debug=2' > snap2.txt
diff snap1.txt snap2.txt
The diff approach is underused and brutally effective. If the same stack trace grows between snapshots under stable load, you've found your leak. The debug=2 flag gives you full goroutine state — running, syscall, chan receive, chan send — which tells you exactly where the goroutine is parked. chan receive on a channel that should have been closed by now is the signature of an orphan goroutine waiting for a signal that will never arrive.
Analyze the stack-trace output methodically: look for goroutines in chan receive or semacquire state with stack frames pointing to your application code, not the standard library. Standard-library goroutines in those states are expected. Yours are not — each one is a resource you're paying for without getting anything in return.
Enable the pprof endpoint in staging unconditionally; the performance overhead is negligible and the diagnostic value during an incident is irreplaceable.
Visualizing Chaos: Beyond the Heap
The pprof web UI renders goroutine stacks as a flame graph, but for hunting memory leaks in Go at scale, the goroutine count over time is the metric that matters first. Wire up a Prometheus gauge to runtime.NumGoroutine() and alert when it grows monotonically over a 5-minute window under constant load. This is not a sophisticated observability setup — it's table stakes. Visualizing the goroutine graph in Grafana takes twenty minutes to configure and will catch every class of leak that pprof identifies post-mortem, but weeks earlier.
The heap profile misleads you here because leaked goroutines primarily consume stack memory, not heap. The runtime reports stack memory separately in runtime.MemStats.StackInuse — expose that metric alongside goroutine count and you get a correlated view: goroutine count rises, stack inuse rises in lockstep, heap is stable. That pattern eliminates half the false hypotheses before you even open pprof.
Don't wait for heap exhaustion to investigate goroutine growth — by the time the heap pressure shows, you've already been leaking for hours.
The Ownership Pattern: Preventive Architecture
Diagnostics find leaks that already exist. Architecture prevents them. The ownership pattern is a single rule applied consistently: the goroutine that creates a child goroutine is responsible for ensuring it terminates. Not the runtime. Not the garbage collector. Not the framework. The creator. This maps directly to Go's context-propagation best practices — a context carries the cancellation signal, and it must flow from the creator to the child, not be synthesized fresh inside the child.
I've seen codebases where every goroutine creates its own context.Background() because the developer didn't want to pollute the function signature with a context parameter. The result is a forest of goroutines with no common cancellation root. When the HTTP request that triggered the whole tree times out, the tree keeps running — because none of its nodes know the root is gone. Context propagation isn't a convention; it's the mechanism by which cancellation signals travel. Skip it and you've opted out of Go's entire lifecycle-management model.
// Fixed: context flows from caller to child goroutine
func processItems(ctx context.Context, items []string) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // guarantee cleanup when processItems returns
	errCh := make(chan error, 1)
	go func() {
		for _, item := range items {
			select {
			case <-ctx.Done():
				errCh <- ctx.Err()
				return
			default:
				if err := process(item); err != nil {
					errCh <- err
					return
				}
			}
		}
		errCh <- nil
	}()
	select {
	case err := <-errCh:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}
The defer cancel() is non-negotiable. It guarantees that when processItems returns — for any reason, including panic recovery — the child goroutine receives its termination signal. The child checks ctx.Done() in the hot loop, which means it responds to cancellation without waiting for the next I/O round-trip.
Pass context as the first parameter of any function that spawns goroutines — this isn't style guidance, it's the load-bearing wall of your service's lifecycle architecture.
Cleanup Duty: WaitGroups vs. Signal Channels
Two tools dominate goroutine lifecycle management in Go, and the choice between them is architectural, not stylistic. sync.WaitGroup answers the question "has this set of goroutines finished?" — it's a counter that blocks until it reaches zero. A signal channel (or a context's Done channel) answers a different question: "should this goroutine stop?" These are complementary, not competing, and conflating them produces code that either leaks or deadlocks.
Use WaitGroup when you spawn a bounded set of worker goroutines and need to wait for all of them before proceeding. Use a signal channel or context when you need goroutines to stop on demand, asynchronously. Closing channels safely in the WaitGroup pattern requires exactly one rule: only the sender closes the channel, never the receiver, and only after all senders have finished — which is precisely what WaitGroup tracks.
// Defensive pattern: WaitGroup + context for bounded worker pool
func runWorkers(ctx context.Context, jobs <-chan Job, workerCount int) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // stop remaining workers if we return on the first error
	var wg sync.WaitGroup
	errs := make(chan error, workerCount) // one slot per worker: sends never block
	for i := 0; i < workerCount; i++ {
		wg.Add(1) // increment before the goroutine starts, never inside it
		go func() {
			defer wg.Done()
			for {
				select {
				case job, ok := <-jobs:
					if !ok {
						return // channel closed, clean exit
					}
					if err := job.Execute(ctx); err != nil {
						errs <- err // at most one send per worker, so never blocks
						return
					}
				case <-ctx.Done():
					return // context cancelled, clean exit
				}
			}
		}()
	}
	// Wait for all workers, then close the error channel.
	go func() {
		wg.Wait()
		close(errs)
	}()
	// Return the first error; the deferred cancel tells the rest to stop.
	for err := range errs {
		if err != nil {
			return err
		}
	}
	return nil
}
The pattern composes both tools: WaitGroup tracks completion, context handles cancellation, and the error channel is closed only after all workers exit — preventing a send-on-closed-channel panic. The goroutine that calls wg.Wait() and then close(errs) is the designated closer, so there's no ambiguity about ownership.
The WaitGroup counter must be incremented before the goroutine starts, not inside it — a race between Add and Wait is one of the few ways to make this pattern silently incorrect.
FAQ
How do I handle a blocking on channel send leak in an existing codebase?
Start by converting unbuffered channels to buffered channels of size 1 anywhere the sender doesn't need to synchronize with the receiver. Then wrap every channel send in a select with a ctx.Done() case. The default case in a select is useful for non-blocking sends, but it silently drops messages — use it only when message loss is acceptable, and document that decision explicitly in the code.
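That trade-off reads like this in code; trySend is a hypothetical helper name, not a standard function:

```go
package main

import "fmt"

// trySend performs a non-blocking send: if no receiver is ready and
// the buffer is full, the message is DROPPED by design. Use only
// when loss is acceptable (e.g. best-effort metrics), and say so.
func trySend(ch chan<- string, msg string) bool {
	select {
	case ch <- msg:
		return true
	default:
		return false // dropped: receiver not ready, buffer full
	}
}

func main() {
	ch := make(chan string, 1)
	fmt.Println(trySend(ch, "a")) // true: buffer has room
	fmt.Println(trySend(ch, "b")) // false: buffer full, message dropped
}
```

The boolean return forces callers to acknowledge the drop instead of letting it happen silently inside a bare select.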
Why do context-propagation best practices matter if my service has low traffic?
Traffic is irrelevant to leak mechanics. A goroutine that leaks at 10 req/s leaks at 0.1 req/s too — it just takes longer to manifest. More importantly, the Done() signal is the only mechanism by which a parent process communicates its termination to sub-goroutines. Without it, sub-goroutines become orphaned when the parent times out or is cancelled. Low traffic means you get more time before the incident, not immunity from one.
Can I use uber-go/goleak in production monitoring?
No. goleak is a testing-phase tool — it integrates with Go's testing framework via TestMain and checks for unexpected goroutines at the end of a test run. It's designed to catch leaks before they reach the build, not to monitor a running service. For production monitoring, use runtime.NumGoroutine() exposed as a Prometheus gauge, and pair it with runtime.MemStats.StackInuse. That combination gives you continuous visibility without the overhead and false-positive risk of running goleak's detection logic against live traffic.
What's the difference between a goroutine leak and a goroutine surge?
A surge is temporary — goroutine count spikes under load and returns to baseline when load drops. A leak is monotonic — goroutine count grows regardless of load and never decreases without a restart. The diagnostic is a goroutine count graph over time. If it trends upward at constant load, it's a leak. If it tracks load and recovers, it's a surge that may indicate a pool-sizing problem but not necessarily a lifecycle bug.
Is it safe to close a channel inside a goroutine?
Only if that goroutine is the designated, sole sender for that channel. Closing a channel from a receiver panics immediately. Closing a channel from a second sender while the first is still writing causes a panic on the redundant close. The pattern that eliminates both hazards is: one owner goroutine sends and closes; all others only receive. When multiple goroutines need to send, use a separate done channel and close the data channel only after a WaitGroup confirms all senders have exited.
How do I detect goroutine leaks before production without goleak?
Check runtime.NumGoroutine() at the start and end of each integration test. If the count is higher at the end, something your test exercised didn't clean up. This is manual but requires zero dependencies and catches the same class of bugs as goleak. Combine it with race-detector-enabled tests (go test -race) — race conditions and goroutine leaks frequently co-occur because both stem from missing synchronization on goroutine lifecycle.
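A sketch of that before/after check, with a short retry loop so goroutines that are merely slow to exit are not flagged as leaks. checkGoroutines is an illustrative helper, not a library API:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// checkGoroutines compares the goroutine count before and after fn,
// retrying for up to ~500ms because legitimate goroutines need a
// moment to finish exiting after fn returns.
func checkGoroutines(fn func()) (leaked int) {
	before := runtime.NumGoroutine()
	fn()
	for i := 0; i < 50; i++ {
		if runtime.NumGoroutine() <= before {
			return 0
		}
		time.Sleep(10 * time.Millisecond)
	}
	return runtime.NumGoroutine() - before
}

func main() {
	leaked := checkGoroutines(func() {
		ch := make(chan int)
		go func() { <-ch }() // never receives: deliberate leak
	})
	fmt.Println("leaked goroutines:", leaked) // prints 1
}
```

Drop the helper into shared test code and call it around any code path that spawns goroutines; it is the same class of check goleak automates, minus the dependency.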