TIL: When Autoscaling Node.js by CPU Is Not Enough
CPU looks like the obvious metric for autoscaling, until a Node.js service starts timing out while the CPU graph still looks calm. For Node.js workloads, the event loop can tell a more honest story.
A strange thing can happen when you run Node.js inside Kubernetes.
The dashboard looks fine. CPU is not high. Memory is stable. The pods are alive. Health checks are green.
And still, users complain that the API feels slow.
At first, this sounds like one of those vague production issues where everyone looks at the same Grafana board and waits for the graph to confess. The backend team checks the database. The DevOps side checks the cluster. Someone asks if the frontend is calling the endpoint too many times.
Then you notice the p95 or p99 latency.
That is where the story usually changes. The average request still looks acceptable, but the slow requests are getting slower. Some requests wait longer before the application even starts doing useful work. Not because the pod is dead. Not because Kubernetes forgot to scale.
Because the Node.js process is busy in a way CPU does not always explain well.
CPU is a good signal, but not the whole story
Most teams start with CPU-based autoscaling because it is simple and it works well enough for a lot of services.
A typical Horizontal Pod Autoscaler setup looks something like this:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: node-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
There is nothing wrong with this as a starting point. If the service is doing CPU-heavy work, like image processing, encryption, parsing large files, or heavy data transforms, CPU can be a useful pressure signal.
But many Node.js APIs do not fail only because they run out of CPU.
They fail because the event loop gets blocked, delayed, or saturated. That can happen due to synchronous code, large JSON operations, slow middleware, heavy logging, bad metrics collection, expensive validation, too many callbacks waiting to run, or a few innocent-looking lines that sit inside the hot path.
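The pod may still report moderate CPU usage. JavaScript runs on one thread, so a fully blocked event loop keeps roughly one core busy. On a pod that requests two CPUs, that reads as about 50% utilization, which is still below a 70% target. Kubernetes sees that and thinks, "Nothing dramatic here."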
The user sees a spinner.
The event loop is where latency starts to hurt
Node.js is good at handling many concurrent operations because it does not dedicate a thread to each request. Application JavaScript runs on a single thread, and that thread is driven by the event loop.
That means one overloaded event loop can hurt many requests at the same time.
A request arrives. It waits for its turn. Then another one arrives. Then the app does some synchronous work. Maybe it parses a big payload. Maybe it runs a blocking transform. Maybe it builds a large response object. The event loop keeps getting less room to breathe.
The CPU graph may not scream. The event loop does.
That is why event loop lag and event loop utilization are useful metrics for Node.js services. They give you a better view of how much room the process has left to accept and process work.
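To make the blocking concrete, here is a minimal sketch. It is not taken from any real service, and `blockFor` and `measureTimerDelay` are hypothetical helper names. It schedules a 0 ms timer, runs synchronous work, and reports how late the timer actually fired.

```javascript
// Busy-wait synchronously, standing in for a large JSON.parse or a heavy transform.
function blockFor(ms) {
  const end = performance.now() + ms;
  while (performance.now() < end) {
    // intentionally empty: nothing else can run on the event loop here
  }
}

// Schedule a 0 ms timer, block the loop, and resolve with how late the timer fired.
function measureTimerDelay(blockMs) {
  return new Promise((resolve) => {
    const scheduledAt = performance.now();
    setTimeout(() => resolve(performance.now() - scheduledAt), 0);
    blockFor(blockMs);
  });
}

measureTimerDelay(200).then((delayMs) => {
  console.log(`timer fired ${delayMs.toFixed(0)} ms after it was scheduled`);
});
```

The timer cannot fire until the synchronous loop finishes, so the reported delay is at least the blocking time. Every request queued behind that work waits in the same way.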
A very small custom metric can already make the service easier to understand:
import express from 'express';
import client from 'prom-client';
import { performance } from 'node:perf_hooks';

const app = express();
const register = new client.Registry();

// Default process metrics are collected into the same registry.
client.collectDefaultMetrics({ register });

const eventLoopUtilization = new client.Gauge({
  name: 'nodejs_event_loop_utilization_ratio',
  help: 'Ratio of time the event loop spent active since the last sample',
  registers: [register],
});

// Baseline reading; each tick reports utilization since the previous one.
let previous = performance.eventLoopUtilization();

setInterval(() => {
  const current = performance.eventLoopUtilization();
  // Passing both readings returns the delta between them.
  const diff = performance.eventLoopUtilization(current, previous);
  eventLoopUtilization.set(diff.utilization);
  previous = current;
}, 1000).unref(); // unref so the sampler never keeps the process alive

app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(3000);
This does not magically fix scaling. It only gives you a better signal.
And that is the whole point.
Before scaling based on a metric, you need to know which metric actually describes pain for your application.
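Event loop lag, the other metric named earlier, can be sampled with `monitorEventLoopDelay` from `node:perf_hooks`. The sketch below is only meant to show where the number comes from. As far as I know, prom-client's default metrics already export lag gauges built on the same API. The 20 ms resolution, the 5 second interval, and the `readEventLoopLagMs` helper are my own assumptions.

```javascript
import { monitorEventLoopDelay } from 'node:perf_hooks';

// Histogram of event loop delay, sampled every 20 ms (values are in nanoseconds).
const loopDelay = monitorEventLoopDelay({ resolution: 20 });
loopDelay.enable();

// Read the current window in milliseconds, then reset the histogram so each
// call describes only the time since the previous call.
function readEventLoopLagMs() {
  const snapshot = {
    mean: loopDelay.mean / 1e6,
    p99: loopDelay.percentile(99) / 1e6,
    max: loopDelay.max / 1e6,
  };
  loopDelay.reset();
  return snapshot;
}

setInterval(() => {
  const { mean, p99, max } = readEventLoopLagMs();
  console.log(`loop lag ms mean=${mean.toFixed(1)} p99=${p99.toFixed(1)} max=${max.toFixed(1)}`);
}, 5000).unref();
```

I would treat these readings as a relative signal and compare them against a quiet baseline for the same service, rather than trusting any single absolute value.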
The autoscaler reacts after the problem starts
There is another small trap here.
Even when you add a better metric, autoscaling is not instant. Kubernetes does not sit there watching every request in real time. The HPA controller runs on a loop, every 15 seconds by default. On each pass it reads the metrics, calculates what it thinks the replica count should be, then asks the workload to scale.
After that, new pods still need to be created. The image may need to be pulled. The app needs to boot. Readiness checks need to pass. The load balancer needs to send traffic to the new pods.
So, by the time extra capacity is ready, the spike may already have done the damage.
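The calculation the HPA performs on each pass is simple enough to sketch. The function below follows the formula in the Kubernetes documentation, including the default 10% tolerance band. It deliberately ignores minReplicas, maxReplicas, stabilization windows, and pods that are not ready yet, and `desiredReplicas` is just a name I picked for the example.

```javascript
// Replica count the HPA would ask for:
// desired = ceil(currentReplicas * currentMetric / targetMetric)
function desiredReplicas(currentReplicas, currentMetric, targetMetric, tolerance = 0.1) {
  const ratio = currentMetric / targetMetric;
  // Inside the tolerance band, the HPA leaves the replica count alone.
  if (Math.abs(ratio - 1) <= tolerance) return currentReplicas;
  return Math.ceil(currentReplicas * ratio);
}

console.log(desiredReplicas(4, 90, 70)); // 4 pods averaging 90% CPU against a 70% target
console.log(desiredReplicas(4, 72, 70)); // close enough to the target to do nothing
```

The first call asks for 6 replicas and the second keeps 4. Either way the input is an average that has already been observed, which is why the decision always trails the spike.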
This is why simply lowering the CPU threshold often feels like a cheap fix but creates a different problem. You get more pods earlier, yes. You also pay for more idle capacity when traffic is normal. The system becomes safer, but not smarter.
The same can happen with event loop metrics if we treat them as a magic threshold.
"Scale when event loop utilization (ELU) is above 0.75."
That sounds nice. But if the traffic spike is sharp, you are still reacting to a value that already crossed the line.
What I would measure before changing the HPA
I would not start by deleting CPU-based autoscaling.
I would start by adding visibility.
For a Node.js API, I would want to see CPU, memory, request rate, response time, error rate, event loop lag, event loop utilization, and active handles. If the service uses a queue, I would also track queue depth and queue age. If it calls a database, I would check connection pool usage and slow queries.
Only then would I decide what scaling should care about.
For example, a public API that spends most of its time waiting on I/O may not need CPU as the main scaling signal. Request rate, in-flight requests, or event loop utilization might tell a better story.
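In-flight requests are cheap to count. A minimal sketch, assuming handlers are async functions: `trackInFlight` and `slowHandler` are hypothetical names, and the timer stands in for real I/O.

```javascript
// Number of requests that have started but not yet finished.
let inFlight = 0;

// Wrap an async handler so the counter goes up on entry and down on exit,
// even when the handler throws.
function trackInFlight(handler) {
  return async (...args) => {
    inFlight += 1;
    try {
      return await handler(...args);
    } finally {
      inFlight -= 1;
    }
  };
}

// Hypothetical handler standing in for a route that waits on I/O.
const slowHandler = trackInFlight(
  () => new Promise((resolve) => setTimeout(resolve, 50)),
);

slowHandler();
slowHandler();
console.log(inFlight); // both calls are still waiting on their timers
```

In a real service the same counter would be exported as a gauge next to the other metrics, and the increment and decrement would hang off the request and response lifecycle instead of a wrapper.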
A worker that consumes jobs from a queue may care more about queue age than CPU. If the oldest job keeps getting older, the system is falling behind, even if CPU still looks calm.
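Queue age needs nothing more than a timestamp on each job. The sketch below assumes every job carries a hypothetical `enqueuedAt` field in milliseconds. How you list pending jobs depends entirely on the queue you use.

```javascript
// Age in seconds of the oldest job still waiting, or 0 when the queue is empty.
// Each job is assumed to carry an `enqueuedAt` timestamp in milliseconds.
function oldestJobAgeSeconds(jobs, nowMs = Date.now()) {
  if (jobs.length === 0) return 0;
  let oldest = Infinity;
  for (const job of jobs) {
    oldest = Math.min(oldest, job.enqueuedAt);
  }
  return Math.max(0, (nowMs - oldest) / 1000);
}

const now = Date.now();
const pending = [
  { id: 'a', enqueuedAt: now - 45_000 },
  { id: 'b', enqueuedAt: now - 5_000 },
];
console.log(oldestJobAgeSeconds(pending, now)); // 45
```

If this number keeps climbing between samples, the workers are falling behind no matter what the CPU graph says.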
A backend-for-frontend service may need a mix. CPU can catch heavy render or transform work. Event loop metrics can catch blocking code. Latency can catch the pain that users actually feel.
There is no single perfect metric.
There is only a metric that matches the way your service breaks.
A better HPA shape
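After exposing the right metric, the autoscaler can use custom or external metrics. The HPA reads these through the custom metrics API, so the cluster needs an adapter, such as Prometheus Adapter, that serves them. The exact setup depends on the adapter, the Prometheus stack, and the cluster, but the shape usually moves in this direction: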
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: node-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: node-api
  minReplicas: 2
  maxReplicas: 12
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: nodejs_event_loop_utilization_ratio
        target:
          type: AverageValue
          averageValue: "0.7"
This is not a copy-paste production config. The values need load testing.
A threshold of 0.7 may be fine for one app and too high for another. Some services can run hot without visible user impact. Others start showing tail latency much earlier.
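One way to get a number out of a load test instead of guessing is to record ELU and p99 latency at each load step, then look for the highest ELU that still met the latency target. This is a rough sketch under that assumption. The `highestSafeElu` helper and the sample values are invented for illustration and are not measurements from a real service.

```javascript
// Return the highest ELU observed before p99 latency first exceeds the SLO,
// or null if even the lowest-ELU sample breaches it.
function highestSafeElu(samples, sloMs) {
  const sorted = [...samples].sort((a, b) => a.elu - b.elu);
  let safe = null;
  for (const sample of sorted) {
    if (sample.p99Ms > sloMs) break;
    safe = sample.elu;
  }
  return safe;
}

// Made-up numbers for illustration only.
const samples = [
  { elu: 0.4, p99Ms: 80 },
  { elu: 0.6, p99Ms: 120 },
  { elu: 0.7, p99Ms: 210 },
  { elu: 0.8, p99Ms: 650 },
];
console.log(highestSafeElu(samples, 250)); // 0.7
```

The scaling target should probably sit somewhat below whatever this returns, because new pods take time to become ready.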
The important part is not the number. It is the shift in thinking.
You stop asking only, "How much CPU are we using?"
You also ask, "How busy is the runtime that actually accepts and processes the work?"
The real fix may not be scaling
This is the part that is easy to skip.
Autoscaling can hide a performance issue, but it cannot always fix it.
If one route blocks the event loop for 300 ms, adding more pods will reduce the blast radius. It will not make that route good. If a JSON payload is too large, more replicas may help for a while. If logging is synchronous or metrics are too heavy, scaling can make the bill bigger while the bug stays in place.
A better debugging path is usually:
First, measure the event loop. Then compare it with latency. After that, find which route or job causes the spike. Only then decide whether to scale, optimize, move work to a queue, use worker threads, or split the service.
Sometimes the right fix is boring. Cache a response. Add pagination. Stop doing sync work inside a request. Move PDF parsing away from the API. Reduce a payload. Change a bad loop.
Scaling should be the safety net, not the first excuse.
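Of the options above, worker threads are the easiest to show in a few lines. This is a minimal sketch using `node:worker_threads`, with the worker source inlined so the example is self-contained. The summing loop is a stand-in for real CPU-heavy work, and `runHeavyTask` is a hypothetical helper name.

```javascript
import { Worker } from 'node:worker_threads';

// Worker source kept inline so the sketch is self-contained; a real service
// would usually load it from its own file.
const workerSource = `
  const { parentPort, workerData } = require('node:worker_threads');
  // Stand-in for CPU-heavy work such as parsing or transforming a large payload.
  let sum = 0;
  for (let i = 0; i < workerData.iterations; i++) sum += i % 7;
  parentPort.postMessage(sum);
`;

// Run the heavy loop on a separate thread and resolve with its result,
// so the main event loop stays free to accept other requests.
function runHeavyTask(iterations) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, { eval: true, workerData: { iterations } });
    worker.once('message', resolve);
    worker.once('error', reject);
    worker.once('exit', (code) => {
      if (code !== 0) reject(new Error(`worker exited with code ${code}`));
    });
  });
}

runHeavyTask(1_000_000).then((sum) => console.log(`worker finished, sum=${sum}`));
```

Starting a thread per request has its own cost, so a small pool of long-lived workers is the more common shape in production. The point here is only that the main thread stays responsive while the work runs elsewhere.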
What I learned
CPU-based autoscaling is not wrong. It is just incomplete for many Node.js services.
Node.js can be under pressure while CPU still looks acceptable. The event loop can be busy, delayed, or blocked, and that pressure often shows up first in tail latency. If the autoscaler only watches CPU, it may scale too late or not at all.
The better approach is to treat autoscaling as part of observability, not just infrastructure.
Measure the thing that hurts. For Node.js, that often means event loop lag, event loop utilization, request concurrency, and queue age. CPU still matters, but it should not be the only voice in the room.
Today I learned that "add more pods" is not always the wrong answer.
Sometimes it is just answering the wrong question.
I am not a DevOps expert, and this post is not written from that point of view.
It is mostly a note from a frontend/backend developer who started paying more attention to what the DevOps team discusses during real production issues. Things like CPU usage, event loop pressure, metrics, autoscaling, and Kubernetes behavior are easy to ignore when they are “not your part” of the stack.
But the more I listen, the more I realize that application code and infrastructure are not separate worlds. A small blocking function in Node.js can become a scaling problem. A bad metric can hide a real user-facing issue. And sometimes, understanding the question the DevOps team is asking is enough to make you a better developer.