How Kotlin Observability Works in Production Systems

Most Kotlin services ship with fake observability: default Logback setup, no proper MDC propagation, and trace context that disappears when coroutines suspend. Kotlin observability in production is about keeping logs, metrics, and traces connected when the system goes async.

The real problem isn’t missing data — it’s losing context. If a 500 spike in Prometheus can’t be quickly tied to a trace in Jaeger and matching logs in Loki, you don’t have observability, just scattered signals. In Kotlin, ThreadLocal breaks under coroutines, and that’s usually where the tracing chain falls apart.


TL;DR: Quick Takeaways

  • ThreadLocal-based MDC breaks silently in coroutines — MDCContext() from kotlinx-coroutines-slf4j is the fix, not a workaround.
  • JSON structured logging isn’t optional if you run ELK or Loki at any scale — plain text is unindexable garbage at volume.
  • High-cardinality tags in Prometheus will OOM your metrics backend — UserID as a label is a career-limiting move.
  • OpenTelemetry auto-instrumentation covers 80% of the work, but coroutine context propagation across Kafka requires manual spans.

The Three Pillars Are One Pipeline — Not Three Separate Tools

Engineers treat logs, metrics, and traces as independent concerns. They’re not. They’re three projections of the same telemetry pipeline, and the only way observability works in a distributed system is if all three share a correlation ID. A metric spike without a trace to drill into is just noise. A log line without a trace ID is archaeology. The mental model shift is this: you’re not adding tools, you’re building an instrumentation layer that answers questions you didn’t know you’d need to ask at 3 AM.

Pillar  | What It Answers       | Breaks When                              | Production Signal
Logs    | What happened and why | No structure, no correlation ID          | Error messages, stack traces
Metrics | How often and how bad | High-cardinality tags, wrong aggregation | Rate, error %, latency p99
Traces  | Where the time went   | Context lost across async boundaries     | Span duration, service graph

High-Performance Logging: Stop Writing Plain Text Trash

Out-of-the-box Logback produces human-readable plaintext. That format was designed for a world where one engineer tailed one log file. In a distributed system with five microservices, a log aggregator like Loki or Elasticsearch, and 10K req/s, that format is actively harmful. Machines parse logs. Humans read dashboards. The moment you commit to that distinction, your entire logging strategy changes.

Structured Logging and JSON Patterns

The default Logback encoder writes unindexable strings that Loki has to brute-force parse with regex. Switch to logstash-logback-encoder and every log line becomes a JSON object with consistent field names. Your log aggregator can then filter by severity, service name, trace ID, or any custom field — without regex. The performance overhead is negligible; the operational gain at 1M+ logs/day is not.

<!-- logback.xml — drop the default encoder, use JSON -->
<appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
    <encoder class="net.logstash.logback.encoder.LogstashEncoder">
        <includeMdcKeyName>traceId</includeMdcKeyName>
        <includeMdcKeyName>spanId</includeMdcKeyName>
        <includeMdcKeyName>userId</includeMdcKeyName>
    </encoder>
</appender>

This config pulls MDC keys directly into the JSON payload. Every log line now carries traceId and spanId as first-class fields — no parsing, no regex, direct filter in Kibana or Grafana. The critical part is deciding which MDC keys matter before you hit production, not after.

The Coroutine Context Nightmare

Here’s where most Kotlin services fall apart. MDC (Mapped Diagnostic Context) is ThreadLocal under the hood. When a coroutine suspends and resumes on a different thread — which happens constantly in IO dispatchers — the ThreadLocal is gone. Your traceId vanishes mid-request with zero warning. No exception, no log entry about it, just missing context in downstream log lines.

// BROKEN — MDC lost after suspension
suspend fun handleRequest(requestId: String) {
    MDC.put("traceId", requestId)
    delay(100) // coroutine may resume on a different thread
    log.info("Processing") // traceId is gone here
}

// FIXED — MDCContext propagates the MDC map across suspensions
// (requires kotlinx-coroutines-slf4j: import kotlinx.coroutines.slf4j.MDCContext)
suspend fun handleRequest(requestId: String) {
    MDC.put("traceId", requestId)
    withContext(MDCContext()) {
        delay(100)
        log.info("Processing") // traceId survives
    }
}

MDCContext() from kotlinx-coroutines-slf4j snapshots the current MDC map and restores it every time the coroutine resumes — regardless of which thread picks it up. The fix is one line of import and one wrapper. The production impact of not doing this is that coroutine logging context becomes unreliable the moment you use any suspend function with IO dispatcher, which is essentially every real service.

Correlation IDs Across Suspend Functions

A traceId that survives a single service is half the solution. In a microservices architecture, the traceId needs to propagate through HTTP headers, Kafka message headers, and gRPC metadata. The pattern that holds up is generating the ID at the entry point (HTTP filter or Kafka consumer), placing it in both MDC and CoroutineContext, and extracting it in every outbound interceptor. Implementing a custom CoroutineContext.Element for the trace ID gives you type-safe propagation that doesn’t depend on MDC at all — useful when switching between Kotlin and Java services in the same call chain.
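
A minimal sketch of that approach, assuming nothing beyond kotlinx-coroutines; the element and helper names (TraceIdElement, currentTraceId, handleIncoming) are illustrative, not a library API:

// Custom CoroutineContext.Element carrying the trace ID, no ThreadLocal involved
import kotlin.coroutines.AbstractCoroutineContextElement
import kotlin.coroutines.CoroutineContext
import kotlin.coroutines.coroutineContext
import kotlinx.coroutines.withContext

class TraceIdElement(val traceId: String) : AbstractCoroutineContextElement(Key) {
    companion object Key : CoroutineContext.Key<TraceIdElement>
}

// Read the trace ID anywhere below the entry point
suspend fun currentTraceId(): String? = coroutineContext[TraceIdElement]?.traceId

// Entry point (HTTP filter, Kafka consumer) seeds the element once per request
suspend fun handleIncoming(traceId: String) {
    withContext(TraceIdElement(traceId)) {
        // every suspend function in this scope can call currentTraceId(),
        // regardless of which thread resumes it
    }
}

At the entry point you would typically seed both at once, e.g. withContext(TraceIdElement(id) + MDCContext()), so the typed element and the logging context stay in sync.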

Metrics Strategy: What Actually Matters at 3 AM

Metrics are not about coverage. Every team that tries to instrument everything ends up with 50K time series, a Prometheus instance on the edge of OOM, and zero useful alerts. The engineering discipline in a metrics strategy is knowing what not to measure. The RED model — Rate, Errors, Duration — applied per service endpoint gives you operational visibility without the cardinality explosion.

Micrometer and Prometheus Integration

Micrometer is the right abstraction layer for Kotlin services. It decouples your instrumentation code from the metrics backend — you write against the MeterRegistry API, and whether the backend is Prometheus, Datadog, or InfluxDB is a configuration concern, not a code concern. In Spring Boot, auto-configuration wires up the registry; in a pure Kotlin service you instantiate PrometheusMeterRegistry directly and expose /metrics via an HTTP endpoint.

// Registering a service-level timer with Micrometer
val registry = PrometheusMeterRegistry(PrometheusConfig.DEFAULT)

val timer = Timer.builder("http.request.duration")
    .tag("endpoint", "/api/orders")
    .tag("method", "POST")
    .publishPercentileHistogram() // emit histogram buckets so Prometheus can compute p50/p95/p99
    .register(registry)

timer.record {
    // your business logic here
}

Two tags here: endpoint and method. Both low-cardinality — finite, bounded sets of values. This is intentional. The timer produces a histogram that Prometheus can aggregate into p50, p95, p99 latency — the latency tracking that actually drives alerting decisions in production.

RED vs USE and JVM-Specific Signals

RED covers your service layer: request Rate, Error rate, Duration. USE covers your infrastructure: Utilization, Saturation, Errors on CPU/memory/IO. For JVM services running on Kotlin, you need both. GC pause duration is the metric that kills p99 latency without showing up in your application code — a full GC that pauses for 800ms appears as a latency spike on every concurrent request. Micrometer’s JvmGcMetrics binder gives you this out of the box. Heap utilization crossing 80% before GC is your early warning for memory leak scenarios.
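
Wiring those binders in takes a few lines against the same registry the timers use. A minimal sketch, assuming Micrometer's built-in JVM binders and the Prometheus registry from the earlier example:

// JVM-level metrics via Micrometer's built-in binders
import io.micrometer.core.instrument.binder.jvm.JvmGcMetrics
import io.micrometer.core.instrument.binder.jvm.JvmMemoryMetrics
import io.micrometer.core.instrument.binder.jvm.JvmThreadMetrics
import io.micrometer.prometheus.PrometheusConfig
import io.micrometer.prometheus.PrometheusMeterRegistry

val registry = PrometheusMeterRegistry(PrometheusConfig.DEFAULT)

JvmGcMetrics().bindTo(registry)      // GC pause duration (the p99 killer described above)
JvmMemoryMetrics().bindTo(registry)  // heap / non-heap utilization
JvmThreadMetrics().bindTo(registry)  // thread counts and states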

The High-Cardinality Trap

This is the most common way teams destroy their own observability stack. Prometheus stores one time series per unique combination of label values. If you add userId as a label on any metric, you now have one time series per user — potentially millions. At 10K active users, you’re looking at 10K+ time series for a single metric. Prometheus will OOM, scraping will time out, and your SRE will start asking uncomfortable questions in Slack. The rule is binary: if a tag value comes from a bounded, known set (HTTP method, status code, service name, region) — it’s safe. If it comes from user data, database IDs, or any unbounded domain — it never goes in a tag.
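
As a small sketch of that rule (the metric and function names are illustrative): the bounded value becomes a tag, the unbounded one goes into the structured log context, where the trace ID already ties it back to whatever spike you're chasing:

import io.micrometer.core.instrument.Counter
import io.micrometer.core.instrument.MeterRegistry
import org.slf4j.MDC

fun recordOrderCreated(registry: MeterRegistry, userId: String, statusCode: Int) {
    // Safe: HTTP status codes are a bounded, known set
    Counter.builder("orders.created")
        .tag("status", statusCode.toString())
        .register(registry)
        .increment()

    // Unsafe as a metric tag: userId is unbounded, so keep it as a structured log field
    MDC.put("userId", userId)
}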

Distributed Tracing: Connecting the Dots with OpenTelemetry

Distributed tracing is what turns a wall of logs from five services into a single timeline of a failed request. OpenTelemetry is the standard — not Jaeger, not Zipkin, not vendor-specific SDKs. OTel is the instrumentation layer; Jaeger and Tempo are storage and UI backends. Getting this distinction right means your instrumentation code survives backend migrations.

Auto-Instrumentation vs Manual Spans in Kotlin

The OTel Java agent covers the obvious cases automatically: HTTP client calls via OkHttp or Ktor client, JDBC queries, Spring MVC request handling. You get spans for these with zero code changes — attach the agent via JVM argument and you’re done for the 80% case. The 20% that requires manual instrumentation is where Kotlin-specific patterns appear: coroutine boundaries, custom async processing, Kafka consumers where you need to extract the parent span from message headers and re-attach it to the current coroutine context.

// Manual span around a coroutine block
// (span.asContextElement() requires the opentelemetry-extension-kotlin module)
val tracer = GlobalOpenTelemetry.getTracer("order-service")

suspend fun processOrder(orderId: String) {
    val span = tracer.spanBuilder("processOrder")
        .setAttribute("order.id", orderId)
        .startSpan()

    withContext(span.asContextElement()) {
        try {
            // actual processing
        } finally {
            span.end()
        }
    }
}

span.asContextElement() is the OTel equivalent of MDCContext() — it propagates the active span across coroutine suspension points. Without this, every suspend call inside the span creates an orphaned trace that doesn’t connect to the parent. The finally block guarantees span.end() fires even on exception — missing this produces unclosed spans that corrupt the trace timeline.

Trace Context Propagation Across Service Boundaries

HTTP propagation is handled by OTel auto-instrumentation — W3C TraceContext headers (traceparent, tracestate) get injected and extracted automatically on WebClient and OkHttp calls. Kafka is the tricky part. Kafka message headers carry the trace context, but the consumer runs in a different thread, often a different JVM restart later. The OTel Kafka instrumentation handles header extraction, but you must explicitly create a child span in your consumer and link it to the extracted parent context — otherwise every Kafka-triggered operation appears as a new root trace with no connection to the producer side.
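
A sketch of that consumer-side linking, assuming the plain kafka-clients consumer API, the OTel SDK, and the opentelemetry-extension-kotlin module; the tracer and span names are illustrative:

// Consumer side: extract the producer's trace context from Kafka headers,
// start a child span, and carry it into the coroutine that does the work
import io.opentelemetry.api.GlobalOpenTelemetry
import io.opentelemetry.api.trace.SpanKind
import io.opentelemetry.context.Context
import io.opentelemetry.context.propagation.TextMapGetter
import io.opentelemetry.extension.kotlin.asContextElement
import kotlinx.coroutines.withContext
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.header.Headers

// Reads W3C traceparent/tracestate values out of Kafka record headers
private val kafkaHeaderGetter = object : TextMapGetter<Headers> {
    override fun keys(carrier: Headers): Iterable<String> = carrier.map { it.key() }
    override fun get(carrier: Headers?, key: String): String? =
        carrier?.lastHeader(key)?.value()?.toString(Charsets.UTF_8)
}

suspend fun consume(record: ConsumerRecord<String, String>) {
    val otel = GlobalOpenTelemetry.get()

    // Parent context that the producer-side instrumentation injected into the headers
    val parentContext: Context = otel.propagators.textMapPropagator
        .extract(Context.root(), record.headers(), kafkaHeaderGetter)

    // Explicit child span; without setParent() this shows up as a disconnected root trace
    val span = otel.getTracer("order-consumer")
        .spanBuilder("kafka.process")
        .setParent(parentContext)
        .setSpanKind(SpanKind.CONSUMER)
        .startSpan()

    try {
        // Propagate the span across coroutine suspensions, same as in the HTTP example
        withContext(span.asContextElement()) {
            // actual message handling
        }
    } finally {
        span.end()
    }
}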

Common Pitfalls: Where Observability Actually Breaks

These aren’t theoretical edge cases. These are the failure modes that appear in real production incidents across Kotlin microservice deployments.

  • Context loss in async blocks. Coroutines, CompletableFuture chains, and RxJava streams all break ThreadLocal-based context. MDCContext() and OTel’s context propagation API fix the coroutine case. Everything else requires explicit context passing.
  • Excessive logging volume. DEBUG logs left on in production at 5K req/s generate gigabytes per hour. Log aggregation pipelines back up, disk I/O spikes, and the useful signals drown in noise. Log at INFO in production, DEBUG only behind a runtime flag.
  • Unstructured log data. Free-text log messages with string interpolation are unqueryable at scale. Every field that you’ll ever filter on — request ID, user tier, error code — must be a structured MDC key, not part of the message string.
  • High-cardinality tags in metrics. Covered above. The pattern is always the same: someone adds a “helpful” tag, the metrics backend degrades over a week, and the root cause takes two days to diagnose.
  • Missing span context on Kafka consumers. Produces disconnected traces. Every async processing pipeline needs explicit parent span linking on the consumer side.

FAQ

Why does coroutine logging context break in Kotlin when MDC works fine in Java?

In Java, you’re predominantly on thread-per-request models where ThreadLocal MDC is reliable because the request lifecycle stays on one thread. Kotlin coroutines are designed to suspend and resume on different threads — that’s how dispatchers schedule work. When resumption happens on a different thread, ThreadLocal state is simply not there. Java services running Spring MVC with blocking IO don’t hit this because the thread doesn’t change. Kotlin services using suspend functions and the IO dispatcher hit it constantly. MDCContext() from kotlinx-coroutines-slf4j solves this by snapshotting the MDC map when the context element is created and restoring it on every resumption.

What’s the actual performance cost of structured JSON logging in Kotlin microservices?

Benchmarks on logstash-logback-encoder show roughly 10–15% higher serialization overhead versus plain text Logback at equivalent log volume — typically under 1ms per log event at normal throughput. In practice, at 1K–5K req/s this overhead is invisible against network and DB latency. The operational gain — indexed, queryable, field-filterable logs in Loki or Elasticsearch — is orders of magnitude more valuable than the CPU cost. The only scenario where it matters is extreme log volume (100K+ events/sec on constrained hardware), at which point your logging design has other problems.

How does OpenTelemetry distributed tracing handle context propagation across Kafka in Kotlin?

OTel injects trace context into Kafka message headers on the producer side using the W3C TraceContext format. On the consumer side, the OTel Kafka instrumentation automatically extracts these headers and creates a new span linked to the remote parent. However, if you’re processing Kafka messages inside Kotlin coroutines, you need to explicitly call span.asContextElement() to propagate the span into the coroutine context — the automatic instrumentation only covers the thread-level OTel context. Without this step, your Kafka consumer spans appear as disconnected root traces in Jaeger or Tempo.

What Micrometer metrics should every Kotlin microservice expose by default?

The baseline set covers HTTP server request duration (histogram, tagged by endpoint and status code), JVM heap usage and GC pause duration (via JvmGcMetrics and JvmMemoryMetrics binders), thread pool saturation for coroutine dispatchers, and external dependency call duration for any DB or HTTP client. This covers the RED model at the service level and the USE model at the JVM level. From this baseline you’ll catch 90% of production incidents without building a custom metrics empire. Add domain-specific metrics only when RED signals are insufficient to diagnose an actual recurring incident.

How do you prevent high-cardinality label explosion in Prometheus when using Micrometer?

The engineering control is a label whitelist enforced at the registry level. Micrometer’s MeterFilter API lets you define which tag keys are allowed per meter name — any tag key not in the whitelist gets dropped before the metric is registered. This prevents an engineer from accidentally shipping a userId tag in a PR that sneaks past code review. In teams with strict SLA on metrics backend stability, this filter runs in CI as a validation step against the MeterRegistry configuration. The alternative — trusting developers to always remember cardinality rules — is a recipe for a 3 AM Prometheus OOM alert.
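
A minimal sketch of such a whitelist, assuming Micrometer's MeterFilter API; the allowed-key set is illustrative:

import io.micrometer.core.instrument.Meter
import io.micrometer.core.instrument.Tags
import io.micrometer.core.instrument.config.MeterFilter
import io.micrometer.prometheus.PrometheusConfig
import io.micrometer.prometheus.PrometheusMeterRegistry

// Only these tag keys survive; anything else is stripped before the meter is registered
val allowedTagKeys = setOf("endpoint", "method", "status", "service", "region")

val tagWhitelist = object : MeterFilter {
    override fun map(id: Meter.Id): Meter.Id =
        id.replaceTags(Tags.of(id.tags.filter { it.key in allowedTagKeys }))
}

val registry = PrometheusMeterRegistry(PrometheusConfig.DEFAULT).apply {
    config().meterFilter(tagWhitelist)
}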

Is OpenTelemetry auto-instrumentation sufficient for Kotlin coroutine-based services?

Auto-instrumentation handles the entry points reliably: incoming HTTP requests, outbound HTTP calls, JDBC, messaging. Where it falls short is inside coroutine chains that span multiple suspend functions — the OTel context propagates on the thread, but coroutine hops create gaps in the trace. For services where business logic lives in deep coroutine call graphs, you need manual span creation at meaningful boundaries using span.asContextElement(). A practical approach is auto-instrumentation for infrastructure-level spans and manual instrumentation for business-level operations (order processing, payment handling) where trace granularity drives actual debugging decisions.
