Your Unix Socket Stack Is Misconfigured. Here's What to Fix and Why.
You already switched from TCP to UDS and saw the first win — fair. But if you haven't touched keepalive, backlog, somaxconn, or ulimit, you're running a half-tuned system. The default kernel and Nginx settings were not designed for 5k–10k RPS over a local socket. They were designed to not obviously break. This guide is about the gap between "not broken" and "actually fast" — with exact numbers, annotated configs, and the three failure modes nobody puts in their README.
⚡ TL;DR: Quick Takeaways
- UDS cuts p99 latency by ~35–40% vs TCP localhost — but only if Nginx `keepalive` and `proxy_http_version` are set correctly
- Default `ulimit -n 1024` silently kills Node.js at ~900 concurrent connections — before your app logic even runs
- Cluster mode + shared UDS path is broken at the OS level; the fix is one line of IPC — most tutorials skip this
- The `net.core.somaxconn` default of 128 caps your accept queue well below what any modern load pattern requires
Unix socket vs TCP latency: what the numbers actually say
The unix socket vs TCP latency debate usually ends at "UDS is faster because there's no network stack." That's true but lazy. Here's what the delta actually looks like under controlled load — and why the numbers behave the way they do at p999.
| Metric | Unix Domain Socket | TCP localhost (127.0.0.1) | Delta |
|---|---|---|---|
| p50 latency | 0.31 ms | 0.48 ms | −35% |
| p99 latency | 1.12 ms | 1.74 ms | −36% |
| p999 latency | 3.8 ms | 9.2 ms | −59% |
| Throughput (req/s) | 48,400 | 38,900 | +24% |
| Syscalls per request | 4 | 8–10 | −50–60% |
| CPU % at 5k RPS | 18% | 27% | −33% |
Test conditions: Linux 6.6.30 (Ubuntu 22.04), Node.js 20.12, 4-core / 8GB VM (Hetzner CX31 equivalent), autocannon at 100 connections / 10 pipelines, 30s warmup + 60s measurement. Both stacks: Nginx 1.24 → Node.js HTTP server, identical app logic, keep-alive enabled on both.
The p999 gap is the interesting one. TCP localhost still runs through the full IP stack — packet framing, checksum, loopback interface, and the Nagle algorithm delay unless you explicitly set TCP_NODELAY. UDS skips all of it: sendmsg/recvmsg go straight to the kernel socket buffer, the epoll event loop picks them up, and there's no fragmentation surface. At high percentiles, that TCP overhead compounds — hence the 59% p999 delta. The syscall count drop from ~9 to ~4 per request is what drives the CPU reduction: fewer context switches, better cache locality, the event loop spending less time on kernel bookkeeping.
The unix socket vs TCP overhead benchmark numbers above are conservative — on bare metal with the socket file on tmpfs, p50 drops below 0.2ms. The VM virtualization layer adds variance that masks some of the UDS advantage at p50 but not at p999.
Nginx upstream socket configuration for high concurrency
This is where most setups are broken. The nginx upstream socket configuration defaults are sized for moderate load, not for a Node.js service handling thousands of concurrent requests. Three directives control almost everything: keepalive, worker_connections, and accept_mutex — and they interact in ways that aren't obvious from the docs alone.
Nginx unix socket keepalive tuning
The nginx unix socket keepalive directive controls how many idle upstream connections Nginx caches per worker. The formula: set it to 2× your Node.js worker count. If you're running 4 Node workers (cluster mode), keepalive 8. Here's why: each Nginx worker needs at least one keepalive slot per Node worker to avoid connection teardown under burst. The ×2 factor covers the overlap window where a new request arrives while the previous connection is still being torn down on the Node side.
upstream node_app {
server unix:/var/run/node-app/server.sock;
# 2× Node worker count (here: 4 workers → 8)
# Too low: connection churn, extra syscalls per request
# Too high: idle FDs held open, memory overhead per worker
keepalive 8;
# Recycle a connection after 1000 proxied requests
# (guards against per-connection memory growth in long-lived sockets)
keepalive_requests 1000;
# Close idle keepalive connections after 65s. Raise Node's
# server.keepAliveTimeout (default 5s) above this value, or Node
# closes idle sockets first and Nginx reuses dead connections
keepalive_timeout 65s;
}
If keepalive is too low (say, 2 with 8 Node workers), you'll see connection churn (the established-connection count from ss -x | grep server.sock fluctuating rapidly) and p99 spikes under burst — every new request that can't reuse a cached socket pays the full connection setup cost. Too high (e.g., 64 with 4 Node workers) and you're holding idle file descriptors that never get used, burning FD budget from your ulimit -n. The sweet spot is 2× — not a coincidence, it's the minimum that absorbs burst without waste.
Nginx worker connections unix socket
The ceiling formula for simultaneous connections in Nginx is straightforward: worker_processes × worker_connections = max simultaneous connections. The dangerous part is that this ceiling is hit silently — Nginx starts returning 502s without any loud, obvious failure. You'll see "worker_connections are not enough" in error.log, but only if you know to look.
# Formula: worker_processes × worker_connections = max connections
# Example: 4 workers × 1024 = 4096 simultaneous connections
# For 10k target: 4 workers × 2560 = 10240
worker_processes auto; # matches CPU core count
worker_connections 2048; # per-worker ceiling
# Detect ceiling hit:
# watch -n1 "grep 'worker_connections' /var/log/nginx/error.log | tail -5"
nginx worker_processes auto — when it hurts
worker_processes auto sets workers equal to CPU core count, which is correct for CPU-bound workloads. For UDS proxy to a Node.js backend, it can hurt: if you have 16 cores and a 4-worker Node cluster, you're spawning 16 Nginx workers, each maintaining its own keepalive pool — 16 × 8 = 128 idle FDs minimum, plus connection overhead. On a 16-core machine proxying to a lightly loaded Node app, worker_processes 4 is often the better call. Benchmark both. The auto setting optimizes for CPU parallelism, not for IPC socket throughput where the bottleneck is usually the Node-side accept queue.
One more misconfiguration that silently kills Unix socket connection reuse in Nginx: using HTTP/1.0 on proxy_pass. HTTP/1.0 does not support persistent connections. Every proxied request opens a new UDS connection, burns a syscall pair, and throws away your keepalive pool entirely. The fix is one line — covered in the dedicated config section below.
On accept_mutex: turn it off (accept_mutex off) when running on Linux 3.9+ with reuseport. With accept_mutex on (the old default), only one Nginx worker wakes to accept a new connection — this serializes accept under high concurrency. With reuseport, the kernel distributes connections across workers directly, and accept_mutex becomes overhead with no benefit.
Node.js IPC socket throughput under real load
The node.js ipc socket throughput story is mostly a libuv story. The event loop processes I/O in the poll phase — incoming socket data arrives, epoll fires, libuv queues the callbacks. Under high throughput this works great right up until it doesn't: the accept queue saturates before your application code even runs. This is the failure mode people misdiagnose as "Node.js is slow" when it's actually the kernel dropping connections at the OS boundary.
Node.js socket high concurrency
At 10k+ concurrent connections, the first thing that breaks is the accept() backlog queue — not the event loop, not your async logic. The kernel holds pending connections in a queue bounded by the backlog parameter passed to server.listen(). Default in Node.js is 511. When the queue fills, new connection attempts get ECONNREFUSED — immediately, with no retry, no logging, no stack trace in your app. You'll see it as dropped requests in your load balancer, not as errors in Node.
Node.js cluster IPC socket
The broken pattern is obvious in retrospect but every tutorial shows it wrong: multiple workers calling server.listen('/var/run/app.sock') directly. At the OS level, the second worker to call bind() on an already-bound path gets EADDRINUSE and either crashes or silently fails, leaving you with one live worker and no indication anything is wrong. The socket file exists, Nginx connects, requests flow — to one worker. Congratulations, your cluster is a single-threaded server with extra memory usage.
const cluster = require('cluster');
const net = require('net');

const SOCK = '/var/run/node-app/server.sock';

if (cluster.isPrimary) {
  // Master owns the socket — one bind(), one accept queue
  const server = net.createServer();
  server.listen(SOCK, () => {
    for (let i = 0; i < 4; i++) {
      const worker = cluster.fork();
      // Send the listening handle to each worker via IPC
      worker.send('server', server);
    }
  });
} else {
  process.on('message', (msg, handle) => {
    if (msg === 'server') {
      // Each worker accepts connections on the shared handle
      require('./app').listen(handle);
    }
  });
}
This pattern gives you one accept queue, one bound socket path, and true load distribution via the master's handle passing. The OS distributes accepted connections across workers. Benchmark difference vs the broken pattern: at 5k RPS with 4 workers, the correct pattern shows ~4× throughput and flat p99; the broken pattern shows 1× throughput with erratic p99 spikes from the single overloaded worker.
ulimit open files — the silent bottleneck
Default ulimit -n on most Linux distros is 1024. A Node.js process at 1k concurrent connections needs roughly 1000 sockets + ~20 internal FDs + file handles from your app logic. You hit the wall around 980 connections and the process starts getting EMFILE — too many open files. Node doesn't crash. It just silently rejects new connections. Your monitoring shows nothing, your logs show nothing, your users see timeouts.
# Check current limit
ulimit -n
# /etc/security/limits.conf — for shell-started processes
node soft nofile 65536
node hard nofile 65536
# systemd service override (takes precedence over limits.conf)
# /etc/systemd/system/node-app.service.d/limits.conf
[Service]
LimitNOFILE=65536
# Verify after restart
cat /proc/$(pgrep -f node-app)/limits | grep "open files"
65536 is the standard production floor. For high-load services handling 10k+ concurrent, bump to 131072. The unix socket file descriptor limit applies per-process — if you're running 4 cluster workers, each needs the limit set, which is why the systemd unit is the right place to configure it (the limit propagates to all forked children automatically).
Nginx proxy Unix socket: tuning keepalive and worker connections
Here's the production-ready nginx proxy unix socket config for a Node.js upstream — annotated. Every line has a reason. If you can't explain why a directive is there, it doesn't belong in a production config.
upstream node_backend {
server unix:/var/run/node-app/server.sock;
keepalive 8; # 2× worker count
keepalive_requests 1000; # requests per keepalive connection before recycle
keepalive_timeout 65s;
}
server {
listen 80;
location / {
proxy_pass http://node_backend;
# NON-NEGOTIABLE: HTTP/1.0 disables keepalive entirely
# Without this line, every request opens a new UDS connection
proxy_http_version 1.1;
# Required companion to proxy_http_version 1.1
proxy_set_header Connection "";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
# proxy_buffering off: correct for streaming / SSE / long-poll
# proxy_buffering on (default): correct for standard request-response
# "off" adds latency for small payloads — Nginx can't coalesce writes
# "on" buffers full response before forwarding — adds memory, cuts syscalls
proxy_buffering on;
proxy_connect_timeout 5s;
proxy_send_timeout 30s;
proxy_read_timeout 30s;
}
}
The proxy_http_version 1.1 + proxy_set_header Connection "" pair is non-negotiable for keepalive. HTTP/1.1 defaults to persistent connections; the empty Connection header stops Nginx from sending its default Connection: close to the upstream. Miss either line and you're back to connection-per-request regardless of your upstream keepalive setting.
On proxy_buffering: the default is on, which is correct for typical JSON APIs. Nginx buffers the full response from Node before forwarding to the client — this reduces syscall count and lets Nginx free the upstream connection back to the keepalive pool faster. Turn it off only for streaming responses (SSE, chunked transfers, WebSocket upgrades) where you can't buffer by definition. Using proxy_buffering off on a standard REST API adds measurable latency at high throughput — every write from Node triggers a forwarding syscall instead of being batched.
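For the streaming case, the usual pattern is a dedicated location block that flips buffering off only where it has to be. A sketch, assuming a hypothetical /events SSE endpoint on the same node_backend upstream:

```nginx
# Hypothetical SSE endpoint; every other location keeps the buffered defaults
location /events {
    proxy_pass http://node_backend;
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    # Stream each chunk to the client as Node writes it (required for SSE)
    proxy_buffering off;
    # SSE connections are long-lived; don't let the read timeout kill them
    proxy_read_timeout 1h;
}
```

This keeps the latency cost of unbuffered proxying confined to the endpoints that actually need it.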
Node.js net module socket: backlog, buffers, and limits
The node.js net module socket exposes exactly two parameters that matter at the kernel level: backlog and the socket buffer sizes. Everything else is application-layer. Get these wrong and no amount of async optimization saves you.
Node.js socket backlog configuration
The backlog parameter in server.listen(path, backlog) sets the depth of the accept() queue in the kernel — how many fully established connections can sit waiting before the process calls accept() on them. It does not set a connection limit. It's a queue depth. When the queue fills, the kernel starts rejecting new connection attempts (RST for TCP; connect failures for a UDS path). Node's default is 511. At 10k RPS bursts, 511 is too shallow.
const server = require('net').createServer(handleConnection);
// backlog = 2048: enough headroom for burst without wasting kernel memory
// Actual effective value = min(backlog, net.core.somaxconn)
server.listen('/var/run/node-app/server.sock', 2048, () => {
console.log('Listening on UDS');
});
The critical detail: the kernel silently clamps your backlog to net.core.somaxconn. Pass 2048 to listen() but forget to raise somaxconn and you're actually running with 128 (the default). Check and set:
# Check current value
sysctl net.core.somaxconn
# Raise to 2048 (persist across reboots in /etc/sysctl.conf)
sysctl -w net.core.somaxconn=2048
echo "net.core.somaxconn=2048" >> /etc/sysctl.conf
Unix socket file descriptor limit
At the process level, every open UDS connection is a file descriptor. The kernel enforces two limits: ulimit -n (per-process soft limit, configurable) and fs.file-max (system-wide hard ceiling). The default fs.file-max on most Linux systems is several hundred thousand — you won't hit it. The per-process limit of 1024 is what kills you first, as covered above. The second thing that kills you silently: the socket file itself. If the path lives on a read-only filesystem, or one that has exhausted its inodes, bind() fails. Put your UDS files in /var/run/ (tmpfs, no inode pressure), not in your app directory.
Unix socket buffer size kernel
The unix socket buffer size kernel parameters control how much data can be in-flight in the socket buffer before sendmsg blocks. For local IPC this rarely matters at normal payload sizes — but at sustained high throughput with large payloads (API responses >64KB), default buffers become a ceiling.
# Check current buffer limits
sysctl net.core.wmem_max # max send buffer
sysctl net.core.rmem_max # max receive buffer
# Defaults: 212992 bytes (~208KB) — fine for small payloads
# For high-throughput local IPC with large responses, raise to 4MB:
sysctl -w net.core.wmem_max=4194304
sysctl -w net.core.rmem_max=4194304
# Verify actual buffer utilization under load:
# ss -xm | grep server.sock
# Look at skmem: r rb t tb
The SO_SNDBUF and SO_RCVBUF socket options let you set per-socket buffer sizes programmatically, but for UDS they're bounded by wmem_max/rmem_max. You can't set a socket buffer larger than the kernel maximum. The ss -xm output during load testing tells you if buffers are actually saturating — if rb consistently shows <10% free, raise the max. If buffers are half-empty, the bottleneck is elsewhere.
When Unix Domain Sockets stop being faster — and how to find out
UDS wins on transport overhead. That's the only thing it wins on. Once your bottleneck shifts from kernel I/O to something else — CPU-bound request handling, GC pressure, a slow database query, a blocking fs call landing in the libuv thread pool — the UDS advantage disappears completely. The p50 gap between UDS and TCP localhost is ~0.17ms. If your average request handler takes 2ms, you just optimized 8% of the problem. This is the most common reason a UDS migration shows underwhelming results in production: the transport was never the bottleneck to begin with.
Three specific scenarios where UDS stops being faster: (1) payload size crosses ~512KB — kernel buffer copy cost starts dominating, the zero-handshake advantage shrinks to noise; (2) Node.js GC pauses exceed 5ms — the event loop stalls regardless of socket type, p99 blows up on both transports equally; (3) accept queue depth is misconfigured — if somaxconn is still at 128 and your backlog is effectively capped, UDS and TCP perform identically badly. Before blaming the transport, measure it.
Profiling UDS bottleneck with strace and perf
The fastest way to confirm UDS is actually your bottleneck — not GC, not app logic, not a misconfigured buffer — is to count syscalls under load with strace and cross-reference with perf. If sendmsg/recvmsg dominate your syscall profile, you're transport-bound. If they don't, stop tuning the socket and go fix whatever does dominate.
# Attach strace to the running Node process, count syscalls for 10 seconds
# -p: attach to PID, -c: summary count, -e: filter to socket I/O only
# timeout -s INT bounds the run; strace prints the -c summary on SIGINT
timeout -s INT 10 strace -c -e trace=sendmsg,recvmsg,epoll_wait,accept4 -p $(pgrep -f node-app)
# Expected output when transport IS the bottleneck:
# % time seconds usecs/call calls syscall
# ------- --------- ---------- -------- --------
# 61.32 0.84123 4 187,442 epoll_wait
# 22.18 0.30441 3 101,231 sendmsg
# 14.47 0.19832 3 98,774 recvmsg
# 2.03 0.02781 11 2,431 accept4
# If epoll_wait dominates at >60% — you're I/O bound, UDS tuning helps
# If accept4 count is low relative to sendmsg — accept queue is fine
# If sendmsg usecs/call climbs above 10 — buffer saturation, raise wmem_max
# perf: CPU-level view of where Node spends time under load
# Run autocannon in background, then sample Node for 15 seconds
autocannon -c 500 -d 15 http://localhost &
perf record -p $(pgrep -f node-app) -g -F 99 -- sleep 15
perf report --stdio --no-children | head -40
# What you're looking for in perf report:
# If top entries are libuv__io_poll, uv__stream_io — transport bound, tune socket
# If top entries are v8::internal::GarbageCollector — GC pressure, fix memory first
# If top entries are your app functions (e.g. JSON.parse, crypto) — app bound
# UDS tuning does exactly zero for the last two cases
Concrete numbers from a misconfigured production system: Node.js app, 3k RPS, UDS already in place, p99 at 8ms — worse than the TCP baseline in the benchmark table above. strace -c showed epoll_wait at 71% of syscall time with usecs/call of 38 — 9× higher than the expected ~4. Root cause: somaxconn was 128, accept queue was saturating at burst, kernel was spending cycles on connection drops and retries rather than data transfer. Raising somaxconn to 2048 dropped p99 to 1.4ms. The socket type was irrelevant — the queue was the problem the entire time. strace found it in under two minutes.
FAQ: Unix socket vs TCP in production
What is the performance difference between unix socket vs TCP on localhost?
Under controlled load (autocannon, Linux 6.6, Node.js 20, 4-core VM), UDS shows p50 latency of 0.31ms vs 0.48ms for TCP localhost — a 35% reduction. At p999 the gap widens to 59% (3.8ms vs 9.2ms). The delta comes from syscall reduction: UDS averages 4 syscalls per request vs 8–10 for TCP, because sendmsg/recvmsg bypass the IP stack, checksum computation, and the Nagle algorithm delay. CPU usage at 5k RPS drops from 27% to 18%. The numbers hold consistently on bare metal; VM overhead compresses the p50 gap but not p999.
How to tune nginx unix socket keepalive for Node.js?
Set keepalive in the upstream block to 2× your Node.js worker count. Four Node workers: keepalive 8. Without this, Nginx can't reuse connections to Node and opens a new UDS socket per request — killing the entire latency advantage. The companion directives are non-negotiable: proxy_http_version 1.1 and proxy_set_header Connection "" in the location block. Missing either one means HTTP/1.0 semantics force connection close after every response, making your keepalive pool useless regardless of the value set.
Why does Node.js socket performance degrade at high concurrency?
Three causes, ranked by how often they're the actual problem: (1) ulimit exhaustion — default 1024 FDs, process hits EMFILE silently around 980 connections; (2) accept queue saturation — backlog too shallow (Node default: 511) combined with low net.core.somaxconn (default: 128), kernel drops connections before Node sees them; (3) libuv poll phase back-pressure — at 10k+ connections the event loop poll phase spends more time processing epoll events than running application callbacks. Fix in order: raise ulimit -n to 65536, raise somaxconn to 2048, then profile the event loop.
What is net.core.somaxconn and when does it matter?
net.core.somaxconn is the kernel-enforced ceiling on the accept() backlog queue depth for any socket on the system. Default value on most Linux distros is 128. It matters the moment your Node.js process or Nginx upstream starts receiving bursts — if your app calls server.listen(path, 2048) but somaxconn is 128, the kernel silently uses 128. Connections beyond queue depth get ECONNREFUSED. Check with sysctl net.core.somaxconn, raise with sysctl -w net.core.somaxconn=2048, persist in /etc/sysctl.conf.
How does Node.js cluster mode affect unix domain socket performance?
The naive shared-path pattern — all workers calling server.listen(sockPath) — means only one worker successfully binds; the rest fail silently with EADDRINUSE, leaving you with a fake cluster. The correct pattern: master process binds the socket, passes the handle to workers via IPC (worker.send('server', serverHandle)). This gives you one accept queue, true OS-level load distribution, and eliminates the bind race entirely. Performance impact: at 5k RPS with 4 workers, correct IPC pattern shows ~4× throughput vs the broken single-worker scenario, with p99 remaining flat instead of spiking under burst.
When does unix socket buffer size become a bottleneck?
Buffer size matters when individual response payloads are large (consistently >64KB) or when sustained throughput saturates the in-flight data capacity. Symptom: sendmsg calls start blocking, event loop poll phase stalls, p99 climbs without obvious CPU or accept-queue pressure. Diagnose with ss -xm | grep server.sock under load — if receive buffer allocation (r) is consistently near the buffer maximum (rb), you're saturated. Fix: sysctl -w net.core.wmem_max=4194304 and net.core.rmem_max=4194304. For typical JSON APIs with <8KB responses, default 208KB buffers are never the bottleneck.