Docker Compose monitoring: how to track all your services
Practical guide to monitoring a Docker Compose stack. Healthchecks, monitoring all services from one place, and getting alerted when one fails.
Most Docker setups start with a docker-compose.yml. Web server, database, Redis, maybe a background worker — a handful of services running together on a single host. Docker Compose monitoring is the practice of keeping visibility over all of those services from one place and getting notified when any of them fails. This is a practical guide to healthchecks, what they test vs what they miss, and how to monitor a Compose stack properly.
How Docker Compose handles health
Docker Compose supports a healthcheck configuration that tells Docker to run a command inside your container on a schedule. If the command succeeds (exit code 0), the container is "healthy." If it fails `retries` times in a row, it becomes "unhealthy."
services:
  api:
    image: myapp:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
The fields:
- test: Command to run inside the container to check health (the binary it calls, such as curl or pg_isready, must exist in the image)
- interval: How often to run the check (30s is Docker's default and a reasonable choice)
- timeout: How long a single check may run before it counts as a failure
- retries: How many consecutive failures before the container is marked unhealthy
- start_period: Grace period during startup; failures in this window don't count toward retries
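Together, these fields determine how long an outage can go unnoticed. Docker starts the next check `interval` after the previous one completes, and it marks the container unhealthy only after `retries` consecutive failures. The sketch below estimates that detection window. It ignores `start_period`, and the function name is illustrative:

```sh
#!/bin/sh
# detection_window INTERVAL RETRIES TIMEOUT (all in plain seconds)
# Prints the range of time between a service breaking and Docker
# marking its container unhealthy.
detection_window() {
  interval=$1 retries=$2 timeout=$3
  # Best case: the fault starts just before a check runs, and every
  # failing check returns instantly: (retries - 1) intervals.
  min=$(( (retries - 1) * interval ))
  # Worst case: the fault starts just after a passing check, and every
  # failing check hangs until it is killed at the timeout.
  max=$(( retries * (interval + timeout) ))
  echo "${min}-${max}s"
}
```

Take the api example above (30s interval, 3 retries, 10s timeout). `detection_window 30 3 10` prints `60-120s`. That figure is worth knowing before you pick an alert duration.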
# See healthcheck status
docker ps
# CONTAINER ID ... STATUS
# a1b2c3d4e5f6 ... Up 2 minutes (healthy)
# f6e5d4c3b2a1 ... Up 2 minutes (unhealthy)
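That status text is easy to scan by script. The sketch below reads `name status` lines and lists every container that is either not up or up but unhealthy. It returns non-zero when it finds one, which makes it usable from cron or CI. The function name is hypothetical:

```sh
#!/bin/sh
# Reads "name status" lines, as printed by:
#   docker ps -a --format '{{.Names}} {{.Status}}'
# Prints containers that are not up, or up but unhealthy.
# "(health: starting)" is not flagged. Returns 1 if anything was found.
find_problem_containers() {
  awk '
    {
      name = $1
      status = $0
      sub(/^[^ \t]+[ \t]+/, "", status)    # drop the name, keep the full status
      if (status !~ /^Up /)                { print name ": " status; bad = 1 }
      else if (status ~ /\(unhealthy\)/)   { print name ": unhealthy"; bad = 1 }
    }
    END { exit bad }
  '
}
```

Container names can't contain spaces, so splitting off the first word is safe. The status keeps its internal spaces (for example `Exited (1) 3 minutes ago`).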
Healthcheck patterns for common services
Web/API services
api:
  image: myapp:latest
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
    interval: 30s
    timeout: 5s
    retries: 3
    start_period: 30s
Your /health endpoint should:
- Return HTTP 200 on success
- Check that critical dependencies (DB connection, cache) are reachable
- Respond in under 1s (not a full integration test)
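- Return an error status such as 503 if the service can't serve requests (curl -f only fails on HTTP 400 and above, so a redirect or another 2xx code still counts as healthy)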
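To hold the endpoint to the 1-second budget from the outside, combine curl's `-f` with `--max-time`. The sketch below is one way to do it. The helper name `check_health` is illustrative, and the URL is whatever your service exposes:

```sh
#!/bin/sh
# check_health URL: exit 0 only if URL answers with a status below 400
# within 1 second.
check_health() {
  # -f            treat HTTP status >= 400 as failure (non-zero exit)
  # -sS           no progress meter, but still print errors to stderr
  # --max-time 1  abort the whole request after 1 second
  # -o /dev/null  discard the response body
  curl -fsS --max-time 1 -o /dev/null "$1"
}
```

The same flags work directly in a Compose healthcheck: `test: ["CMD", "curl", "-fsS", "--max-time", "1", "-o", "/dev/null", "http://localhost:8080/health"]`.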
PostgreSQL
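postgres:
  image: postgres:16
  healthcheck:
    test: ["CMD-SHELL", "pg_isready -U $${POSTGRES_USER} -d $${POSTGRES_DB}"]
    interval: 10s
    timeout: 5s
    retries: 5
    start_period: 10s
pg_isready checks that PostgreSQL is accepting connections. It's included in the Postgres image, so no additional tooling is needed. The doubled $$ stops Compose from substituting these variables from your host shell when it parses the file. Instead, the container's shell expands them at check time. This assumes POSTGRES_USER and POSTGRES_DB are set in the service's environment.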
Redis
redis:
  image: redis:7-alpine
  healthcheck:
    test: ["CMD", "redis-cli", "ping"]
    interval: 10s
    timeout: 3s
    retries: 5
redis-cli ping returns PONG if Redis is healthy. Quick and reliable.
Nginx / reverse proxy
nginx:
  image: nginx:alpine
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost/health"]
    interval: 30s
    timeout: 5s
    retries: 3
You'll need to add a /health location to your nginx config:
location /health {
    access_log off;
    # default_type sets the response's Content-Type; add_header would add a second one
    default_type text/plain;
    return 200 "healthy\n";
}
Background workers (queue consumers)
Workers are harder to healthcheck because they don't listen on HTTP. Options:
worker:
  image: myapp:latest
  command: python worker.py
  healthcheck:
    # Check a PID file that the worker writes on startup
    test: ["CMD", "test", "-f", "/tmp/worker.pid"]
    interval: 30s
    timeout: 5s
    retries: 3
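A PID file only proves that the worker started once. If the worker process dies but the container keeps running (because the worker isn't the container's main process, for example), the file is still there. A slightly stronger check sends signal 0 to the recorded PID. That tests whether the process exists without affecting it. This sketch assumes the worker writes its PID to /tmp/worker.pid as above, and the helper name is hypothetical:

```sh
#!/bin/sh
# pid_alive FILE: succeed only if FILE exists, holds a numeric PID, and
# that process is still running. "kill -0" delivers no signal; it only
# checks that the PID exists and can be signalled.
pid_alive() {
  [ -f "$1" ] || return 1
  pid=$(cat "$1")
  case "$pid" in
    ''|*[!0-9]*) return 1 ;;   # empty or not a number
  esac
  kill -0 "$pid" 2>/dev/null
}
```

Save this as a script that ends with `pid_alive /tmp/worker.pid`, and it can replace the `test -f` command in the worker's healthcheck.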
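Or use a custom health script that checks whether the worker is making progress. One way is to read the Unix timestamp of the last processed job from Redis; this assumes the worker updates worker:last_processed_at after each job. Wire it up with test: ["CMD", "/usr/local/bin/worker-health.sh"]. The worker image needs bash and redis-cli for this to run:
#!/bin/bash
# /usr/local/bin/worker-health.sh
# Connect to the "redis" service by name: inside the worker container,
# localhost is the worker itself, not Redis.
LAST_PROCESSED=$(redis-cli -h redis get worker:last_processed_at)
NOW=$(date +%s)
# Fail if the key is missing or isn't a Unix timestamp
case "$LAST_PROCESSED" in
  ''|*[!0-9]*) exit 1 ;;
esac
# Fail if no message was processed in the last 5 minutes
if [ $((NOW - LAST_PROCESSED)) -gt 300 ]; then
  exit 1
fi
exit 0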
What healthchecks test vs what they miss
Docker healthchecks are useful but limited. Understanding the gap helps you decide where additional monitoring is needed.
What healthchecks tell you
- The container process is running and responding
- The service can accept basic requests
- Critical dependencies (DB, cache) are reachable at the moment of the check
What healthchecks miss
Degraded performance — a healthcheck that passes in 0.1 seconds doesn't tell you when p99 latency is 8 seconds. The service is "healthy" by the check but users are experiencing timeouts.
Memory leaks — a container leaking 50MB/hour passes every healthcheck until it hits its memory limit and is OOM-killed. By then it's too late.
Queue backlogs — a worker container is "running" but has 10,000 unprocessed jobs backed up. Healthcheck passes, users are waiting.
Partial failures — one replica in a multi-replica setup is unhealthy while the others keep handling traffic. If you only check one container (or only the load-balanced endpoint), everything looks fine and the failing replica goes unnoticed.
Network isolation — the container thinks it's healthy, but outbound network access broke and it can't reach external APIs. A simple localhost healthcheck passes.
This is why healthchecks are necessary but not sufficient for production monitoring. You need:
- Healthchecks: immediate detection of complete service failure
- Resource monitoring: detecting degradation before it becomes failure
- Uptime tracking: historical view of when services were healthy vs not
- Alerts: proactive notification when health status changes
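Resource monitoring covers the memory-leak case that a pass/fail probe misses. `docker stats --no-stream` prints a single snapshot. Its `{{.MemPerc}}` placeholder gives memory use as a percentage of the container's limit, or of host memory when no limit is set. Run periodically, a threshold check like this sketch warns you before a leaking container reaches its limit. The function name and the 80% default are illustrative:

```sh
#!/bin/sh
# Reads "name mem%" lines, as printed by:
#   docker stats --no-stream --format '{{.Name}} {{.MemPerc}}'
# Prints containers using more than THRESHOLD percent of their memory
# limit (default 80). Returns 1 if any container is over the threshold.
memory_over() {
  awk -v limit="${1:-80}" '
    {
      pct = $2
      sub(/%$/, "", pct)                 # "85.30%" -> "85.30"
      if (pct + 0 > limit + 0) { print $1 ": " $2; over = 1 }
    }
    END { exit over }
  '
}
```

A single snapshot can't show a trend. For leaks, you still want the history that a monitoring tool keeps, so you can see memory climbing hour over hour.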
Monitoring all services in a Compose stack
Adding the Kernus agent to your Compose stack
You can run the Kernus agent as a service in your docker-compose.yml alongside your application services:
# docker-compose.yml
services:
  api:
    image: myapp:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
  postgres:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
      POSTGRES_DB: myapp
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres -d myapp"]
      interval: 10s
      retries: 5
  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
  worker:
    image: myapp:latest
    command: python worker.py
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
  # Add Kernus agent alongside your services
  kernus-agent:
    image: kernus/agent:latest
    environment:
      - KERNUS_AGENT_TOKEN=${KERNUS_AGENT_TOKEN}
      - KERNUS_HOST_NAME=${HOSTNAME:-my-server}
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    restart: unless-stopped
    network_mode: host # Needed to resolve host metrics
The agent reads Docker's API via the mounted socket. It automatically discovers every container in your stack — you don't configure which services to monitor, it finds them all.
Alternatively, install the agent directly on the host instead of as a container, which is the approach we recommend for production:
curl -fsSL https://kernus.app/install | sh
kernus token YOUR_TOKEN --host company-backend
kernus agent start
Windows: Use WSL or Git Bash for this one-liner, or use PowerShell (see Windows install).
A typical compose monitoring setup
For a production api + postgres + redis + worker stack:
Alert rules to create:
Rule 1: Any service down
Condition: Container status != running
Duration: 1 minute
Filter: (all containers)
Channel: #production-alerts
Rule 2: High API memory
Condition: Memory > 80% of limit
Duration: 10 minutes
Filter: api (container name)
Channel: #infrastructure-warn
Rule 3: Worker restart storm
Condition: Restart count increases by 5
Duration: 30 minutes
Filter: worker
Channel: #production-alerts
Rule 4: Any OOM kill
Condition: Container OOM killed
Duration: Immediate
Filter: (all containers)
Channel: SMS + Slack
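Rules 3 and 4 key on two fields Docker already records for every container: `RestartCount` and `State.OOMKilled`. Both are readable with `docker inspect`. Kernus evaluates these rules for you. The sketch below is only a hand-rolled local approximation that shows what the rules look at. The function name, snapshot file and threshold are illustrative:

```sh
#!/bin/sh
# check_restarts PREV_FILE MAX_NEW
# stdin:     current snapshot, "name restart_count oom_killed" per line, e.g. from
#   docker inspect --format '{{.Name}} {{.RestartCount}} {{.State.OOMKilled}}' $(docker ps -aq)
# PREV_FILE: an earlier snapshot in the same format (may not exist yet)
# Reports containers with more than MAX_NEW restarts since PREV_FILE, or whose
# last exit was an OOM kill. Returns 1 on any hit.
check_restarts() {
  awk -v prev_file="$1" -v max_new="$2" '
    BEGIN {
      # Load the previous snapshot; getline returns -1 if the file is missing.
      if (prev_file != "")
        while ((getline line < prev_file) > 0) {
          split(line, f)
          prev[f[1]] = f[2]
        }
    }
    {
      name = $1
      sub(/^\//, "", name)                  # inspect names start with "/"
      delta = $2 - prev[$1]                 # unknown container counts from 0
      if (delta > max_new + 0) { print name ": " delta " new restarts"; hit = 1 }
      if ($3 == "true")        { print name ": OOM killed"; hit = 1 }
    }
    END { exit hit }
  '
}
```

`State.OOMKilled` describes the most recent exit, so a container that restarts quickly can hide it between polls. `docker events --filter event=oom` is the event-driven alternative.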
The depends_on healthcheck pattern (critical for reliability)
One of the most common causes of container restart loops in Compose stacks is service startup order. Your API starts before the database is ready, tries to connect, fails, restarts. The solution is depends_on with condition: service_healthy:
services:
  api:
    depends_on:
      postgres:
        condition: service_healthy  # API won't start until postgres is HEALTHY
      redis:
        condition: service_healthy
  worker:
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
      api:
        condition: service_healthy  # Worker waits for API too (optional)
This requires each dependency to have a healthcheck configured. Without one, condition: service_healthy has nothing to evaluate. Compose reports an error instead of starting the dependent service.
The full flow on docker compose up:
- Postgres starts, healthcheck begins polling
- Redis starts, healthcheck begins polling
- Once both are healthy, API and worker start
- If the API healthcheck fails after startup, Docker marks the container unhealthy but does not restart it, because restart policies only react to a container exiting. Dependent services keep running as well, so your monitoring has to alert on it
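The same gating is often needed outside Compose too, for example in a CI job that must wait for the stack before running tests. Compose v2 has `docker compose up --wait` built in. If you need a custom loop, here is a sketch that polls any status command. The function name and the retry numbers are illustrative:

```sh
#!/bin/sh
# wait_for_healthy TRIES DELAY CMD...
# Runs CMD (which should print a health status) up to TRIES times, DELAY
# seconds apart, until it prints "healthy". Returns 0 once healthy, 1 if not.
wait_for_healthy() {
  tries=$1 delay=$2
  shift 2
  i=0
  while :; do
    # Ignore errors, e.g. the container not existing yet.
    status=$("$@" 2>/dev/null || true)
    [ "$status" = "healthy" ] && return 0
    i=$((i + 1))
    [ "$i" -ge "$tries" ] && return 1
    sleep "$delay"
  done
}

# Example (requires Docker; "api" is the Compose service from above):
#   wait_for_healthy 30 2 docker inspect --format '{{.State.Health.Status}}' \
#     "$(docker compose ps -q api)"
```

Passing the status command as arguments keeps the loop independent of Docker, so the same function works for any probe that prints a status.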
Viewing your Compose stack health
# See all services with health status
docker compose ps
# Output:
# NAME IMAGE SERVICE STATUS PORTS
# api myapp api running 0.0.0.0:8080->8080/tcp
# postgres postgres postgres running 5432/tcp
# redis redis redis running 6379/tcp
# worker myapp worker running (unhealthy)
# Watch in real time
watch docker compose ps
The (unhealthy) suffix means the container is running but failing its healthcheck. It's still running — it hasn't crashed — but something is wrong.
For setting up alerts when a service fails: How to set up Docker container alerts for Slack, Discord, Telegram. For the complete Docker monitoring approach: Docker container monitoring — complete guide 2025.
Try Kernus free
Set up Docker monitoring in 2 minutes. Free for 1 host — no credit card required.