How to Debug Concurrency Issues: Race Conditions, Deadlocks & Thread Starvation
Failure states in concurrent and asynchronous code don't look the same across ecosystems. A Go runtime panic, a C++ undefined behavior spiral, a Kotlin coroutine deadlock, and a Rust compile-time rejection are four completely different experiences — but they all originate from the same root: uncoordinated access to shared state. This guide skips theory and cuts straight to production-breaking code in all four languages, with actual fixes you can commit today.
TL;DR: Quick Takeaways
- Go's runtime will fatally crash on concurrent map access — not serve garbage, not log a warning. Crash. Use sync.RWMutex or sync.Map, and run -race in CI.
- A data race in C++ is not just a bug — it's Undefined Behavior that gives the compiler legal permission to make your program do literally anything.
- Deadlocks in C++ are a lock-ordering problem. std::scoped_lock (C++17) solves it atomically. In Kotlin coroutines, a non-reentrant Mutex locked twice on the same path is a silent, permanent freeze.
- Thread starvation in Kotlin and Tokio is self-inflicted: you blocked a cooperative scheduler by running synchronous work without telling it to yield. The fix is always a context switch — Dispatchers.IO or spawn_blocking.
1. Data Races and Mutated State (Go & C++)
Shared mutable state is fine until two goroutines, threads, or async tasks decide to touch it at the same time without asking each other first. The way each language handles that collision ranges from a clean compile error to a full application grenade. Both Go and C++ let you write the broken code without complaint — they just punish you at different moments and with different levels of brutality. Understanding shared state mutation in async code is the baseline before touching any of the language-specific tooling.
The Go Trap: goroutine concurrent map read and map write
Picture a net/http telemetry ingestion server. Someone wires up a background goroutine to expire stale entries from a global map[string]Session while live HTTP handlers simultaneously read and write that same map. This is the most common source of goroutine concurrent map read and map write panics in Go services, and unlike most runtime errors, it's fatal and unrecoverable — the process is dead, no defer, no recover saves it.
// BAD — concurrent map access without a lock
var sessions = make(map[string]Session)

// Handler: reads + writes on every request
func handleRequest(w http.ResponseWriter, r *http.Request) {
    id := r.Header.Get("X-Session-ID")
    sessions[id] = Session{LastSeen: time.Now()} // WRITE
    s := sessions[id]                            // READ
    _ = s
}

// Background goroutine: deletes expired sessions concurrently
func expireSessions() {
    for {
        time.Sleep(30 * time.Second)
        for k, v := range sessions { // CONCURRENT READ + DELETE
            if time.Since(v.LastSeen) > 5*time.Minute {
                delete(sessions, k)
            }
        }
    }
}
Go's map implementation is deliberately not thread-safe. The runtime has a concurrency detector baked in, and the moment two goroutines touch the same map — one writing, one ranging — it throws fatal error: concurrent map read and map write and terminates. The fix is gating all access behind a sync.RWMutex (concurrent reads allowed, exclusive write lock on mutations) or replacing the plain map with sync.Map for write-once-read-many patterns.
// GOOD — protected with sync.RWMutex
var (
    mu       sync.RWMutex
    sessions = make(map[string]Session)
)

func handleRequest(w http.ResponseWriter, r *http.Request) {
    id := r.Header.Get("X-Session-ID")
    mu.Lock()
    sessions[id] = Session{LastSeen: time.Now()}
    mu.Unlock()

    mu.RLock()
    s := sessions[id]
    mu.RUnlock()
    _ = s
}

func expireSessions() {
    for {
        time.Sleep(30 * time.Second)
        mu.Lock()
        for k, v := range sessions {
            if time.Since(v.LastSeen) > 5*time.Minute {
                delete(sessions, k)
            }
        }
        mu.Unlock()
    }
}
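Where the access pattern really is write-once-read-many, the sync.Map route mentioned above drops the external lock entirely. A minimal sketch, assuming the same Session struct; the touch and expire helpers are hypothetical names, not from the original code:

```go
package main

import (
    "fmt"
    "sync"
    "time"
)

type Session struct{ LastSeen time.Time }

var sessions sync.Map // safe for concurrent use without an external mutex

func touch(id string) {
    sessions.Store(id, Session{LastSeen: time.Now()}) // atomic write
}

func expire(maxAge time.Duration) {
    sessions.Range(func(k, v any) bool { // safe to Delete while ranging
        if time.Since(v.(Session).LastSeen) > maxAge {
            sessions.Delete(k)
        }
        return true // keep iterating
    })
}

func main() {
    touch("a")
    expire(time.Minute) // "a" was touched just now, so it survives
    _, ok := sessions.Load("a")
    fmt.Println(ok) // true
}
```

Note the tradeoff: sync.Map gives up static typing (values come back as any) and performs worse than a mutex-guarded map under write-heavy contention, which is why its documentation recommends it only for specific access patterns.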
On top of the fix, wire the Go race detector into your CI: go test -race ./.... The -race flag instruments every memory access at runtime and will catch this entire class of bugs even on code paths that aren't hit by the happy-path tests. It costs roughly 5–10× CPU overhead — fine for staging, not for prod binaries. Run it in CI, not in your production build.
The C++ Nightmare: data race undefined behavior C++
A high-frequency counter in a game engine tick loop or a financial ledger's position tracker — somewhere, a dev wrote counter++ and thought it's one operation. In C++, that assumption is wrong and the consequences are catastrophic. Data race undefined behavior in C++ doesn't mean you get the wrong number — it means the C++ standard has formally declared your program's behavior unpredictable, giving the compiler legal permission to optimize it into anything, including an infinite loop or silent memory corruption.
// BAD — non-atomic increment across threads
#include <iostream>
#include <thread>

int counter = 0; // shared global, no protection

void increment() {
    for (int i = 0; i < 100'000; ++i)
        counter++; // load + add + store: 3 instructions, NOT atomic
}

int main() {
    std::thread t1(increment);
    std::thread t2(increment);
    t1.join();
    t2.join();
    std::cout << counter; // could print 100012, 199999, 200000, or loop forever
}
counter++ compiles to a load, an add, and a store — three instructions. Two threads can each load the same stale value, both add 1, and both store the same result, losing increments. But that's the optimistic scenario. The compiler, seeing that counter is accessed across threads without synchronization, can assume no race exists and legally hoist the variable into a register indefinitely or reorder instructions past the race point. CPU cache coherency adds another layer: even a well-intentioned compiler doesn't control instruction reordering at the hardware level. The fix is replacing the raw int with std::atomic<int>, which guarantees each operation is a single indivisible read-modify-write with appropriate memory barriers.
// GOOD — std::atomic eliminates the race entirely
#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> counter{0};

void increment() {
    for (int i = 0; i < 100'000; ++i)
        counter.fetch_add(1, std::memory_order_relaxed);
        // relaxed: no ordering guarantees needed for a plain counter
}

int main() {
    std::thread t1(increment);
    std::thread t2(increment);
    t1.join();
    t2.join();
    std::cout << counter.load(); // always 200000
}
Atomic operation overhead is real but small — a locked instruction on x86, a load-linked/store-conditional pair on ARM. For a counter, it's negligible. The tradeoff only gets interesting when you're doing complex multi-field updates spanning multiple variables, at which point you're back to a mutex for the transaction boundary anyway.
2. The Deadlock Graveyard (C++ & Kotlin)
Deadlocks are tedious to diagnose and brutal in production. The pattern is always the same: a circular dependency between threads, where two or more execution units each hold a resource the other needs, so everyone waits forever. The OS has no way to resolve it — it just parks the threads until someone kills the process. Both C++ and Kotlin have distinct flavors of this failure, and both have clean solutions that eliminate the condition entirely.
C++ Circular Locks: C++ nested mutex deadlock
The bank transfer is the canonical example of a C++ nested mutex deadlock, and it keeps appearing in production codebases. Thread 1 transfers from Account A to Account B: locks A, then tries to lock B. Simultaneously, Thread 2 transfers from B to A: locks B, then tries to lock A. Each thread holds exactly what the other one needs. Neither can proceed. This is a textbook circular dependency and the OS will park both threads indefinitely.
// BAD — reversed lock order creates deadlock
#include <chrono>
#include <mutex>
#include <thread>

using namespace std::chrono_literals;

std::mutex mtx_a, mtx_b;

void transfer_a_to_b(double amount) {
    std::lock_guard<std::mutex> lk_a(mtx_a); // locks A
    std::this_thread::sleep_for(1ms);        // realistic scheduling gap
    std::lock_guard<std::mutex> lk_b(mtx_b); // BLOCKS — B held by t2
    // never reached
}

void transfer_b_to_a(double amount) {
    std::lock_guard<std::mutex> lk_b(mtx_b); // locks B
    std::this_thread::sleep_for(1ms);
    std::lock_guard<std::mutex> lk_a(mtx_a); // BLOCKS — A held by t1
    // never reached
}
The root cause is lock-ordering: each function acquires the mutexes in a different order, which is the exact condition required for a deadlock. The C++17 solution is preventing deadlocks with std::scoped_lock. Unlike lock_guard, scoped_lock is variadic — it takes multiple mutexes and acquires all of them atomically using a deadlock-avoidance algorithm. The semantic is "give me all of these or none", which makes reversed acquisition order impossible.
// GOOD — std::scoped_lock acquires both atomically
struct Account { std::mutex mtx; double balance = 0; };

void transfer(Account& from, Account& to, double amount) {
    // acquires both mutexes atomically — argument order is irrelevant
    std::scoped_lock lk(from.mtx, to.mtx);
    from.balance -= amount;
    to.balance += amount;
} // both locks released on scope exit
scoped_lock with two mutexes uses std::lock() internally, which implements a non-livelock algorithm via try-lock with backoff. The context switching cost is identical to two sequential lock_guards on the non-contended path; the overhead only appears under actual contention — exactly when you'd be deadlocked without it.
Kotlin Coroutines: Kotlin coroutine mutex deadlock
Kotlin's kotlinx.coroutines.sync.Mutex is not reentrant, was never designed to be, and the omission is intentional. In a Ktor backend or an Android ViewModel managing complex shared state, a Kotlin coroutine mutex deadlock typically appears when a coroutine locks the mutex, calls a suspending function, and that suspending function — somewhere down the call chain — tries to acquire the same mutex again on the same coroutine execution path.
// BAD — nested mutex acquisition on the same coroutine path
val mutex = Mutex()

suspend fun processItem(item: Item) {
    mutex.lock() // coroutine holds the lock
    try {
        updateState(item) // calls another suspend fun...
    } finally {
        mutex.unlock()
    }
}

suspend fun updateState(item: Item) {
    mutex.lock() // tries to lock again — DEADLOCK
    try { /* ... */ }
    finally { mutex.unlock() }
}
In a regular Java thread, ReentrantLock allows the same thread to acquire the lock again. Coroutines don't map cleanly to threads — a coroutine can suspend and resume on different threads, making "same thread" meaningless for ownership tracking. Mutex tracks the coroutine holding the lock, not the thread. Trying to acquire from the same coroutine that already holds it suspends immediately with nobody left to unlock. The fix is refactoring so nested acquisition is structurally impossible: extract the critical section into an internal function that doesn't re-acquire, and use coroutine context propagation to enforce the boundary.
// GOOD — single acquisition point, internal function does not re-lock
val mutex = Mutex()

suspend fun processItem(item: Item) {
    mutex.withLock {
        updateStateInternal(item) // called under existing lock — no re-acquire
    }
}

// private, non-suspend: caller is responsible for holding the lock
private fun updateStateInternal(item: Item) {
    /* mutate state here */
}
The discipline is simple: a function either acquires the lock or assumes it's already held — never both. Mixing both responsibilities in the same function is how you end up here on a Friday at 2am.
3. Thread Starvation and Runtime Blocking (Kotlin & Rust)
Cooperative multitasking schedulers are efficient precisely because tasks are expected to yield. When a task refuses — because it's doing heavy CPU work or synchronous blocking I/O — it hijacks the thread it's on and starves everything else queued behind it. This is cooperative multitasking starvation, and both Kotlin's coroutine runtime and Rust's Tokio executor have specific, well-documented failure modes for it. The symptoms look identical: application throughput collapses, latency spikes, and nothing in your logs explains why.
Kotlin: blocking Dispatchers.Default in Kotlin
A Ktor service ingests large JSON payloads or decodes images in bulk. A developer spins up 1,000 coroutines and processes everything inline on Dispatchers.Default. The problem: blocking Dispatchers.Default in Kotlin exhausts the thread pool. Dispatchers.Default has exactly max(2, CPU cores) threads — on an 8-core machine that's 8 threads. Block them all with CPU-bound or I/O work and every other coroutine in the process, including your health check endpoint and your UI layer on Android, freezes indefinitely.
// BAD — blocking work on the wrong dispatcher exhausts the pool
fun processAll(items: List<ByteArray>) = runBlocking {
    items.map { payload ->
        launch(Dispatchers.Default) { // wrong dispatcher
            val result = heavyJsonParse(payload) // blocks the thread
            saveToDatabase(result) // also blocking I/O
        }
    }.joinAll()
    // after ~8 launches on an 8-core machine: Default pool exhausted
    // remaining coroutines queue up indefinitely
}
The fix is shifting blocking work to Dispatchers.IO, which maintains a larger elastic pool (up to 64 threads by default) designed to absorb blocking calls without starving the scheduler. For pure CPU-bound computation with no I/O, Dispatchers.Default is actually correct — but then your task must never block on any I/O. Use withContext(Dispatchers.IO) to hop dispatchers for the blocking portion. This is coroutine context propagation working as designed: structured concurrency lets you switch context without losing the parent job's cancellation scope.
// GOOD — blocking work dispatched to IO pool via withContext
fun processAll(items: List<ByteArray>) = runBlocking {
    items.map { payload ->
        launch(Dispatchers.Default) {
            val result = withContext(Dispatchers.IO) {
                heavyJsonParse(payload) // blocking parse on IO pool
            }
            withContext(Dispatchers.IO) {
                saveToDatabase(result) // blocking I/O on IO pool
            }
            // back on Default for any CPU-bound post-processing
        }
    }.joinAll()
}
If you're writing library code and can't predict the caller's dispatcher, default to Dispatchers.IO for anything touching the file system, network, or a database driver. Safer than assuming the caller set up the right context.
Rust Tokio: tokio block_in_place thread starvation
A high-performance async proxy written in Rust handles thousands of concurrent connections on Tokio — runs beautifully until it needs to verify a TLS certificate against a local CRL file or run a CPU-intensive key derivation. Someone drops a synchronous std::fs::read into an async task without a second thought. Now you have Tokio block_in_place thread starvation: Tokio's work-stealing scheduler has one of its OS threads frozen on a blocking syscall, and all async tasks queued on that thread are now stalled. For cases where you cannot restructure the code into spawn_blocking, Tokio provides block_in_place to notify the scheduler to move other tasks to a different thread.
// BAD — blocking I/O inside an async task hijacks the executor thread
use tokio::net::TcpListener;

#[tokio::main]
async fn main() {
    let listener = TcpListener::bind("0.0.0.0:8080").await.unwrap();
    loop {
        let (socket, _) = listener.accept().await.unwrap();
        tokio::spawn(async move {
            // std::fs::read blocks the OS thread — no .await, no yield
            let cert = std::fs::read("/etc/certs/ca.crt").unwrap();
            handle_connection(socket, cert).await;
        });
    }
}
Tokio's runtime maps async tasks to a small pool of OS threads (typically one per CPU core). The scheduler is cooperative: tasks yield at every .await. std::fs::read has no .await — it's a synchronous syscall that parks the OS thread until the kernel responds. The task has hijacked that thread. Every async task queued on it stalls. Under load this cascades across the pool. The fix is tokio::task::spawn_blocking, which offloads the synchronous work to a dedicated blocking thread pool that is completely separate from the async executor. For cases where you need to call blocking code deep inside an async context that you can't restructure, tokio::task::block_in_place signals Tokio to evacuate other tasks off the current thread before blocking.
// GOOD — blocking work offloaded via spawn_blocking
use tokio::net::TcpListener;

#[tokio::main]
async fn main() {
    let listener = TcpListener::bind("0.0.0.0:8080").await.unwrap();
    loop {
        let (socket, _) = listener.accept().await.unwrap();
        tokio::spawn(async move {
            // offload to dedicated blocking pool — executor threads never blocked
            let cert = tokio::task::spawn_blocking(|| {
                std::fs::read("/etc/certs/ca.crt")
            })
            .await
            .unwrap()
            .unwrap();
            handle_connection(socket, cert).await;
        });
    }
}
spawn_blocking draws from a separate pool (512 threads by default). For a one-time file read per connection, that's fine. For sustained CPU-heavy crypto work on every request, consider a dedicated rayon thread pool instead — spinning up hundreds of OS threads under sustained load has its own overhead.
4. Advanced Diagnostics: detecting deadlocks in async runtimes
Getting a concurrency bug past CI means your static analysis and race detector didn't catch it. When you're staring at a hung process or a CPU pegged at 100% with zero throughput, you need tooling. Detecting deadlocks in async runtimes is a different discipline from synchronous debugging — stack traces lie, traditional debuggers have limited visibility into suspended coroutines, and most profilers don't distinguish "blocked on a lock" from "blocked on I/O".
Go: pprof goroutine dumps. Hit /debug/pprof/goroutine?debug=2 on a live server (requires importing net/http/pprof). You'll get a full dump of every goroutine and exactly what it's blocked on. Goroutines stuck on sync.Mutex.Lock or a channel receive for longer than your expected latency are your suspects. For CPU profiling that reveals context switching cost from mutex overuse, use pprof.StartCPUProfile and look for time spent in runtime.lock and runtime.schedule. Excessive time there means your critical sections are too coarse.
Rust: the tracing crate and tokio-console. Instrument async tasks with #[tracing::instrument] on every async fn that matters. Run tokio-console — a runtime subscriber that exposes a live TUI showing every running task, its poll count, total lifetime, and whether it's been stuck on the same poll for an abnormal duration. A task polling for 30 seconds without yielding is your blocking call. The overhead of the tracing crate is low enough to leave on in production if you use sampling spans rather than per-request traces.
Kotlin: coroutine debugger and thread dumps. IntelliJ's coroutine debugger gives a coroutine stack view showing which coroutines are suspended and what they're waiting on. In production, a JVM thread dump (kill -3 <pid> or jstack) shows threads parked on kotlinx.coroutines.sync.Mutex. Match thread names against your dispatcher config — a DefaultDispatcher-worker-1 thread in WAITING state for 60 seconds is a deadlock, not a slow query.
One thing to watch regardless of language: atomic operation overhead from over-synchronizing hot paths. A mutex acquisition under contention involves a syscall and a context switch — roughly 1–10 μs depending on the OS. If your profiler shows 40% of CPU time in lock contention, you've over-synchronized. Fix it by reducing critical section length, sharding the lock across multiple instances, or replacing mutexes with lock-free thread-safe data structures backed by compare-and-swap — sync/atomic in Go, std::atomic in C++, crossbeam's lock-free queues in Rust.
Final Verdict: Choosing the Right Tool for the Job
No hedging. Each language has a specific concurrency niche where it genuinely outperforms the others. Picking based on team familiarity alone is a valid engineering decision — but you should at least make it consciously.
- Rust — if memory safety and compile-time data race prevention are non-negotiable. Financial systems, safety-critical infrastructure, anything where a race condition has real-world consequences. The Rust Arc Mutex pattern and the borrow checker enforce correctness before the binary exists. Pay the learning curve; it pays back in production stability.
- Go — if you're building networked services that need to scale horizontally fast, with built-in telemetry and low operational complexity. The goroutine model is opinionated, debugging race conditions is a solved problem with -race, and the runtime is fast enough for most backend workloads without touching a single lock.
- C++ — if you need raw hardware control, zero-overhead abstractions, or you're maintaining a large legacy codebase. std::atomic and std::scoped_lock are genuinely good tools. But data race undefined behavior is a live grenade that requires ThreadSanitizer in CI and experienced reviewers on every concurrent code path.
- Kotlin — if you're already on the JVM or building for Android, and developer ergonomics matter more than raw throughput. Coroutines are clean and readable once you understand the dispatcher model and the non-reentrant mutex behavior. If your concurrency problems are I/O-bound, don't switch runtimes. Fix your dispatchers.
FAQ
What is the fastest way to start debugging concurrency issues in a live Go service?
Enable net/http/pprof and pull a goroutine dump from /debug/pprof/goroutine?debug=2 immediately. Look for goroutines blocked on sync.Mutex or channel operations for longer than your expected latency. If you can reproduce in staging, recompile with the Go race detector flag -race — it catches goroutine concurrent map read and map write and most other shared-state races before they escalate to production.
Why can't Kotlin's coroutine Mutex be reentrant like Java's ReentrantLock?
A Kotlin coroutine mutex deadlock from attempted reentrancy is actually the safe failure mode. The alternative — allowing reentry — would let a coroutine silently hold a lock across suspension points, allowing other coroutines to observe partially mutated state. Because coroutines suspend and resume across threads, "same thread" is meaningless for ownership tracking. The non-reentrant design is intentional; refactor to avoid nested acquisition instead.
Is std::atomic always the right fix for data races on primitive types in C++?
Data race undefined behavior in C++ on a single primitive is always fixed with std::atomic. But if your atomic operation is logically a multi-step transaction across multiple fields — debit one account, credit another — atomics don't compose. You still need a mutex for the transaction boundary. Use atomics for truly independent counters and flags; reach for a mutex the moment an invariant spans more than one variable.
How do I distinguish Tokio thread starvation from genuine overload?
Overloaded systems degrade gradually under increasing load. Tokio block_in_place thread starvation causes sudden throughput collapse at a specific concurrency threshold — often the first time a particular code path runs under load. Use tokio-console to watch task poll times. A task that has been polling for seconds without a yield is a blocking call. A task waiting to be polled is just queued. That distinction tells you whether you have a scheduler problem or a capacity problem.
Does Rust actually prevent all concurrency bugs at compile time?
Rust eliminates the data race class of bugs entirely — two threads cannot simultaneously hold a mutable reference to the same data; the ownership model makes it a compile error. What it doesn't prevent are logical bugs: deadlocks, priority inversion, incorrect lock ordering, and async task blocking executor thread scenarios all compile cleanly. Rust compile-time data race prevention is a narrowly defined and valuable guarantee, not a blanket concurrency correctness certificate.
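The Arc<Mutex<T>> pattern that answer leans on, as a std-only sketch with no async runtime involved:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Arc = shared ownership across threads; Mutex = exclusive access to the data
    let counter = Arc::new(Mutex::new(0i64));
    let mut handles = Vec::new();

    for _ in 0..8 {
        let counter = Arc::clone(&counter); // each thread gets its own handle
        handles.push(thread::spawn(move || {
            for _ in 0..10_000 {
                *counter.lock().unwrap() += 1; // guard drops at end of statement
            }
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    println!("{}", counter.lock().unwrap()); // 80000
}
```

Remove the Mutex and the borrow checker rejects the program at compile time; remove the Arc and the move closures cannot share ownership. That is the compile-time guarantee. Nothing here, though, stops you from building a lock-ordering deadlock out of two such mutexes.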
When should I use unbuffered channels instead of mutexes for shared state in Go?
If ownership of the state is already flowing through the application — one goroutine produces, one consumes — model it with channels and eliminate the shared state entirely. An unbuffered channel blocks on send until a receiver is ready, which makes it a natural synchronization primitive. Use mutexes when you have genuinely shared state with multiple concurrent readers and writers where the single-ownership model breaks down, such as a session cache or a connection registry.
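A minimal sketch of that ownership pattern, with hypothetical command and owner names: one goroutine owns the map, and every read or write is serialized through an unbuffered channel.

```go
package main

import "fmt"

type command struct {
    key   string
    value string      // empty for reads
    reply chan string // always answered with the current value
}

func owner(cmds chan command) {
    state := make(map[string]string) // only this goroutine ever touches it
    for cmd := range cmds {
        if cmd.value != "" {
            state[cmd.key] = cmd.value
        }
        cmd.reply <- state[cmd.key]
    }
}

func main() {
    cmds := make(chan command) // unbuffered: send blocks until the owner receives
    go owner(cmds)

    reply := make(chan string)
    cmds <- command{key: "user", value: "alice", reply: reply}
    <-reply // discard write acknowledgement
    cmds <- command{key: "user", reply: reply}
    fmt.Println(<-reply) // alice
}
```

The map needs no lock because exactly one goroutine touches it; the channel handoff is the synchronization.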