---
title: Global Health Check
url: https://blog.guigpap.com/en/workflows/health-check/
url_md: https://blog.guigpap.com/en/workflows/health-check.md
category: automation
date: '2026-01-31'
maturite: production
techno:
  - docker
  - n8n
  - telegram
  - prometheus
application:
  - monitoring
  - operations
---

# Global Health Check

> Proactive Docker infrastructure monitoring with detection of failed containers

## 1. What? — Definition and context

The **Global Health Check** workflow monitors the health of Docker containers every 5 minutes. It detects unhealthy or unexpectedly stopped containers and notifies admins via Telegram.

> **Note - Health check**
>
> A **health check** is a periodic verification of a service's state. Docker can mark a container as "unhealthy" if its own checks fail (e.g., a web server that no longer responds).

### Detection method

| Method | Tool | Status |
|--------|------|--------|
| **Filtered docker ps** | SSH + CLI | Current (production) |
| Prometheus cAdvisor | `container_health_status` | Not working (Docker 29+) |
| HTTP endpoints | `curl` to `/healthz` | Specified, not implemented |

### Monitored services

| Service | Stack | Critical |
|---------|-------|----------|
| Caddy | security | Yes |
| CrowdSec | security | Yes |
| N8N | n8n | Yes |
| N8N-Postgres | n8n | Yes |
| Redis | n8n | Yes |
| N8N Workers | n8n | No |
| Odoo | odoo | Yes |
| Odoo-Postgres | odoo | Yes |
| Qdrant | ai | No |
| Claude-Ollama | ai | No |
| Prometheus | monitoring | No |
| Grafana | monitoring | No |

---

## 2. Why? — Stakes and motivations

### Problems solved

| Problem | Without health check | With health check |
|---------|---------------------|-------------------|
| **Container crash** | Discovered by a user | Alert within 5 minutes |
| **Unhealthy service** | No visibility | Automatic detection |
| **Extended downtime** | No notification | Quick intervention |

### Why SSH instead of Prometheus?

> **Caution - cAdvisor bug**
>
> The `container_health_status` cAdvisor metric does not work on Docker 29+ with overlayfs: it always returns 0.

| Method | Advantage | Drawback |
|--------|-----------|----------|
| **SSH + docker ps** | Always works | No Prometheus history |
| cAdvisor metrics | History, graphs | Docker 29+ bug |

---

## 3. How? — Technical implementation

### Current architecture

```mermaid
flowchart TD
  Sched["Schedule · 5 min"]
  SSH["SSH Docker Health · docker ps --filter"]
  Parse["Parse Results"]
  HasIssues{"count > 0 ?"}
  Skip["Skip"]
  Prep["Prepare Notification"]
  Hub["Notification Hub"]
  TG["Telegram"]

  Sched --> SSH --> Parse --> HasIssues
  HasIssues -->|No| Skip
  HasIssues -->|Yes| Prep --> Hub --> TG
```

A `Docker Health Check with Retry` sub-workflow (`jDN2QV3nEMGacrCvgEBBV`) wraps this check with a retry policy (2 attempts spaced 30s apart) before notifying, which filters out false positives on containers in the middle of a restart.
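
The retry policy can be sketched roughly as follows. This is only an illustration of the idea, not the sub-workflow itself, which is built from n8n nodes: `checkWithRetry`, `runCheck` and `sleep` are hypothetical names, and `runCheck` is assumed to resolve to the list of failing containers.

```javascript
// Hypothetical helper: resolves after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Runs `runCheck` up to `attempts` times, `delayMs` apart.
// Stops early as soon as a run reports no failing container.
// Returns the failing containers from the last run performed.
async function checkWithRetry(runCheck, attempts = 2, delayMs = 30_000) {
  let failing = [];
  for (let i = 0; i < attempts; i++) {
    failing = await runCheck();
    if (failing.length === 0) return []; // recovered, or never failed
    if (i < attempts - 1) await sleep(delayMs);
  }
  return failing;
}
```

A container that is mid-restart on the first attempt and back up 30 seconds later yields an empty list, so no notification is sent.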

### Detection command

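```bash
# docker ps combines filters with different keys using AND, so each
# condition needs its own query. A single command carrying both filters
# would only list containers that match both conditions at once.
docker ps -a --filter "health=unhealthy" --format json
docker ps -a --filter "status=exited" --format json
```

Together, these commands return:
- Containers with a failed healthcheck (`health=unhealthy`)
- Stopped containers (`status=exited`), whether they crashed or were stopped on purpose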

### N8N configuration

**SSH Docker Health Node:**

```yaml
Type: SSH
Host: localhost
Command: docker ps -a --filter "health=unhealthy" --format json; docker ps -a --filter "status=exited" --format json
```

**Parse Results (Code):**

```javascript
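// stdout of the SSH node: one JSON object per line, one line per container
const output = $json.stdout;
if (!output || output.trim() === '') {
  return [{ json: { count: 0, containers: [] } }];
}

const seen = new Set();
const containers = output.trim().split('\n')
  .filter(line => line.trim())
  .map(line => JSON.parse(line))
  // Drop repeats in case the same container is listed more than once
  .filter(c => !seen.has(c.Names) && seen.add(c.Names))
  .map(c => ({
    name: c.Names,
    status: c.State,  // e.g. "exited" or "running"
    health: c.Status  // e.g. "Up 3 hours (unhealthy)" or "Exited (1) 2 minutes ago"
  }));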

return [{
  json: {
    count: containers.length,
    containers: containers
  }
}];
```
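
Because `State` and `Status` carry different information, it can help to reduce them to a single reason per container. The sketch below is a standalone variant of the parsing step, not the production node: `classifyContainers` and the `issue` and `exitCode` fields are hypothetical, and the regular expressions assume the usual `docker ps` status strings such as `Exited (137) 2 minutes ago` and `Up 3 hours (unhealthy)`.

```javascript
// Parses the line-delimited JSON printed by `docker ps --format json`
// and labels each container with the reason it was flagged.
function classifyContainers(stdout) {
  const seen = new Set();
  const result = [];
  for (const line of stdout.split('\n')) {
    if (!line.trim()) continue;
    let c;
    try {
      c = JSON.parse(line);
    } catch {
      continue; // ignore lines that are not valid JSON (e.g. an SSH banner)
    }
    if (seen.has(c.Names)) continue; // same container listed twice
    seen.add(c.Names);

    const status = c.Status || '';
    const exit = /^Exited \((\d+)\)/.exec(status);
    const issue = c.State === 'exited' ? 'exited'
      : /\(unhealthy\)/.test(status) ? 'unhealthy'
      : 'unknown';
    result.push({ name: c.Names, issue, exitCode: exit ? Number(exit[1]) : null });
  }
  return result;
}
```

Keeping the exit code makes it possible to tell a clean stop (code 0) from a crash when deciding whether to alert.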

### Notification format

```json
{
  "source": "health_check",
  "type": "health_issue",
  "severity": "critical",
  "title": "2 container(s) in trouble",
  "message": "Detected containers:\n- n8n-worker-1 (unhealthy)\n- redis (exited)",
  "container": "n8n-worker-1",
  "containers": ["n8n-worker-1", "redis"],
  "timestamp": "2026-01-20T10:00:00.000Z"
}
```
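
A payload of this shape could be produced by a function along these lines. This is a sketch of what the Prepare Notification step might do, under several assumptions: `buildNotification` and `CRITICAL` are hypothetical names, the container names in `CRITICAL` are guessed from the monitored services table, and the `warning` severity and the rule "one critical container is enough to escalate" are not taken from the workflow.

```javascript
// Hypothetical list of critical containers, derived from the
// "Monitored services" table; the real container names may differ.
const CRITICAL = new Set([
  'caddy', 'crowdsec', 'n8n', 'n8n-postgres', 'redis', 'odoo', 'odoo-postgres',
]);

// containers: [{ name, issue }] where issue is "unhealthy" or "exited".
function buildNotification(containers, now = new Date()) {
  const names = containers.map((c) => c.name);
  const lines = containers.map((c) => `- ${c.name} (${c.issue})`);
  return {
    source: 'health_check',
    type: 'health_issue',
    // Assumption: a single critical container escalates the whole alert.
    severity: names.some((n) => CRITICAL.has(n)) ? 'critical' : 'warning',
    title: `${containers.length} container(s) in trouble`,
    message: `Detected containers:\n${lines.join('\n')}`,
    container: names[0],
    containers: names,
    timestamp: now.toISOString(),
  };
}
```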

### Sample Telegram notification

```
🚨 HEALTH CHECK ALERT

2 container(s) in trouble

Detected containers:
❌ n8n-worker-1 (unhealthy)
❌ redis (exited)

Affected stack: n8n-stack

[🔄 Restart] [📋 Logs] [🔇 Mute 1h]
```
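
If the three actions are rendered as Telegram inline buttons, the request body for the Bot API `sendMessage` method could look like the sketch below. The function name and the `callback_data` strings are hypothetical; how the Notification Hub actually encodes the actions is not documented here.

```javascript
// Builds the JSON body for the Telegram Bot API `sendMessage` method.
// containers: [{ name, issue }]; stack: name of the affected stack.
function buildTelegramPayload(chatId, containers, stack) {
  const text = [
    '🚨 HEALTH CHECK ALERT',
    '',
    `${containers.length} container(s) in trouble`,
    '',
    'Detected containers:',
    ...containers.map((c) => `❌ ${c.name} (${c.issue})`),
    '',
    `Affected stack: ${stack}`,
  ].join('\n');

  return {
    chat_id: chatId,
    text,
    reply_markup: {
      // One row of three buttons. Telegram limits callback_data to 64 bytes,
      // so the hypothetical action codes are kept short.
      inline_keyboard: [[
        { text: '🔄 Restart', callback_data: `restart:${stack}` },
        { text: '📋 Logs', callback_data: `logs:${stack}` },
        { text: '🔇 Mute 1h', callback_data: `mute:${stack}:3600` },
      ]],
    },
  };
}
```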

### Useful commands

```bash
# See unhealthy containers
docker ps -a --filter "health=unhealthy"

# See stopped containers
docker ps -a --filter "status=exited"

# Health of a specific container
docker inspect --format='{{.State.Health.Status}}' n8n

# Docker health-check logs
docker inspect --format='{{range .State.Health.Log}}{{.Output}}{{end}}' n8n
```

---

## 4. What if? — Outlook and limits

### Current limits

| Limit | Impact | Mitigation |
|-------|--------|------------|
| **No HTTP checks** | Detects Docker state, not application state | Planned evolution |
| **No daily report** | No historical view | Grafana dashboard |
| **5 min interval** | Detection latency up to 5 min, plus 30 s when the retry runs | Acceptable for personal usage |

### Evolution scenarios

**If HTTP checks are needed**:
- Add curl checks to each service's /healthz
- Differentiate Docker healthy vs HTTP responding
- More precise alerts
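
One possible shape for such a check, as a sketch only: `probe` and `diagnose` are hypothetical names, the result labels are invented for illustration, and the code assumes a Node.js 18+ runtime where `fetch` and `AbortSignal.timeout` are available globally.

```javascript
// Probes one HTTP endpoint. `fetchFn` is injectable so the sketch can be
// exercised without a network; it defaults to the global fetch.
async function probe(url, fetchFn = fetch, timeoutMs = 3000) {
  try {
    const res = await fetchFn(url, { signal: AbortSignal.timeout(timeoutMs) });
    return { url, responding: res.ok, status: res.status };
  } catch {
    // Connection refused, DNS failure or timeout all count as "not responding".
    return { url, responding: false, status: null };
  }
}

// Separates "Docker says unhealthy" from "Docker is happy but HTTP is silent".
function diagnose(dockerHealthy, httpResponding) {
  if (!dockerHealthy) return 'container_unhealthy';
  return httpResponding ? 'ok' : 'app_not_responding';
}
```

The `app_not_responding` case is the one the current Docker-only check cannot see.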

**If a daily report is needed**:
- Aggregate incidents over 24h
- Compute uptime per service
- Send digest at 8 AM
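
Uptime per service could be approximated from the checks themselves, as in the sketch below. It assumes each 5-minute run records the names of the failing containers, and it counts a failed run as 5 minutes of downtime, which is only an estimate. `uptimeByService` and `CHECKS_PER_DAY` are hypothetical names.

```javascript
const CHECKS_PER_DAY = (24 * 60) / 5; // 288 runs at one check every 5 minutes

// services: names to report on.
// failedRuns: one array of failing container names per run that found issues.
// Returns { serviceName: uptimePercent } rounded to two decimals.
function uptimeByService(services, failedRuns, totalRuns = CHECKS_PER_DAY) {
  const failures = new Map(services.map((s) => [s, 0]));
  for (const run of failedRuns) {
    // A Set avoids counting a container twice within the same run.
    for (const name of new Set(run)) {
      if (failures.has(name)) failures.set(name, failures.get(name) + 1);
    }
  }
  const report = {};
  for (const [name, failed] of failures) {
    report[name] = Number((100 * (1 - failed / totalRuns)).toFixed(2));
  }
  return report;
}
```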

**If cAdvisor becomes functional again**:
- Migrate to Prometheus metrics
- Drop the SSH check
- Native history and graphs

### Troubleshooting

| Problem | Check |
|---------|-------|
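| False positives | `status=exited` also matches containers stopped on purpose and one-shot jobs; containers without a healthcheck report `none` and are never flagged as unhealthy |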
| SSH timeout | Valid SSH credential? Port 22 reachable? |
| No notification | Workflow active? Notification Hub OK? |
| Too many alerts | Adjust the `docker ps` filters |

---

## Related pages

### Workflows
- [Notification Hub](/en/workflows/notification-hub/) — Notification routing
- [Docker Auto-Updates](/en/workflows/docker-updates/) — Image updates

### Infrastructure
- [Monitoring Stack](/en/infrastructure/monitoring-stack/) — Prometheus & Grafana
- [Security Stack](/en/infrastructure/security-stack/) — Caddy & CrowdSec

## Agent metadata

- This article comes from the GuiGPaP Lab blog.
- Global blog context: https://blog.guigpap.com/llms.txt
- Author contact: https://odoo.guigpap.com/mon-cv
- License: CC-BY-SA 4.0