Health Check Global

1. What? Definition and context

The Health Check Global workflow checks the health of the Docker containers every 5 minutes. It detects containers that are unhealthy or have stopped unexpectedly, and notifies the admins via Telegram.

Detection methods

Method	Tool	Status
Filtered docker ps	SSH + CLI	Current (production)
Prometheus cAdvisor	`container_health_status`	Not working (Docker 29+)
HTTP endpoints	curl to /healthz	Specified, not implemented

Monitored services

Service	Stack	Critical
Caddy	security	Yes
CrowdSec	security	Yes
N8N	n8n	Yes
N8N-Postgres	n8n	Yes
Redis	n8n	Yes
N8N Workers	n8n	No
Odoo	odoo	Yes
Odoo-Postgres	odoo	Yes
Qdrant	ai	No
Claude-Ollama	ai	No
Prometheus	monitoring	No
Grafana	monitoring	No

2. Why? Stakes and motivations

Problems solved

Problem	Without health check	With health check
Container crash	Discovered by a user	Alert within 5 minutes
Service unhealthy	No visibility	Automatic detection
Prolonged downtime	No notification	Fast intervention

Why SSH instead of Prometheus?

Method	Advantage	Drawback
SSH + docker ps	Always works (no dependency on cAdvisor)	No Prometheus history
cAdvisor metrics	History, graphs	Broken on Docker 29+

3. How? Technical implementation

Current architecture

Schedule (5min)
       │
       ▼
  SSH Docker Health
  (docker ps --filter)
       │
       ▼
  Parse Results
       │
       ▼
  IF: count > 0
       │
  ┌────┴────┐
  │ yes     │ no
  ▼         ▼
Prepare    Skip
Notification
       │
       ▼
  Notification Hub → Telegram

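Detection commands

# Filters with different keys are ANDed, so each condition gets its own call
docker ps -a --filter "health=unhealthy" --format json
docker ps -a --filter "status=exited" --format json

Together, these commands return:

Containers whose healthcheck is failing (health=unhealthy)
Containers that have exited (status=exited), whether they crashed or were stopped on purpose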

N8N configuration

SSH Docker Health node:

Type: SSH
Host: localhost
Command: docker ps -a --filter "health=unhealthy" --format json; docker ps -a --filter "status=exited" --format json

Parse Results (Code):

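// stdout of the SSH node: one JSON object per line, one line per container
const output = $json.stdout;
if (!output || output.trim() === '') {
  return [{ json: { count: 0, containers: [] } }];
}

const parsed = output.trim().split('\n')
  .filter(line => line.trim())
  .map(line => JSON.parse(line))
  .map(c => ({
    name: c.Names,
    status: c.State,   // e.g. "running", "exited"
    health: c.Status   // human-readable text, e.g. "Up 5 minutes (unhealthy)"
  }));

// Deduplicate by name in case a container is listed more than once
const containers = [...new Map(parsed.map(c => [c.name, c])).values()];

return [{
  json: {
    count: containers.length,
    containers: containers
  }
}];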

Notification format

{
  "source": "health_check",
  "type": "health_issue",
  "severity": "critical",
  "title": "2 container(s) en problème",
  "message": "Containers détectés:\n- n8n-worker-1 (unhealthy)\n- redis (exited)",
  "container": "n8n-worker-1",
  "containers": ["n8n-worker-1", "redis"],
  "timestamp": "2026-01-20T10:00:00.000Z"
}
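
A minimal sketch of how the Prepare Notification step could assemble this payload from the Parse Results output. The function names buildNotification and describeState are hypothetical, the severity is hard-coded only because the example above shows "critical", and the actual node may be written differently.

```javascript
// Hypothetical helper: builds the payload shown above.
// `containers` items follow the Parse Results shape: { name, status, health }.
function buildNotification(containers, now = new Date()) {
  const names = containers.map((c) => c.name);
  const lines = containers.map((c) => `- ${c.name} (${describeState(c)})`);
  return {
    source: 'health_check',
    type: 'health_issue',
    severity: 'critical', // assumption: copied from the example payload
    title: `${containers.length} container(s) en problème`,
    message: `Containers détectés:\n${lines.join('\n')}`,
    container: names[0],
    containers: names,
    timestamp: now.toISOString(),
  };
}

// "exited" comes from the State field; "unhealthy" only appears inside the
// Status text of a running container, e.g. "Up 5 minutes (unhealthy)".
function describeState(c) {
  if (c.status !== 'running') return c.status;
  return /\(unhealthy\)/.test(c.health) ? 'unhealthy' : c.status;
}
```

Passing the timestamp in as a parameter keeps the function deterministic, which makes it easy to check against the example payload.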

Example Telegram notification

🚨 HEALTH CHECK ALERT

2 container(s) en problème

Containers détectés:
❌ n8n-worker-1 (unhealthy)
❌ redis (exited)

Stack affecté: n8n-stack

[🔄 Restart] [📋 Logs] [🔇 Mute 1h]

Useful commands

# List unhealthy containers
docker ps -a --filter "health=unhealthy"

# List exited containers
docker ps -a --filter "status=exited"

# Health status of a specific container
docker inspect --format='{{.State.Health.Status}}' n8n

# Output of the Docker health check runs
docker inspect --format='{{range .State.Health.Log}}{{.Output}}{{end}}' n8n

4. What if? Outlook and limits

Extended specification (not implemented)

HTTP checks:

{
  "checks": [
    {
      "name": "Caddy",
      "type": "http",
      "url": "https://guigpap.com",
      "expected_status": 200,
      "timeout_ms": 10000
    },
    {
      "name": "N8N",
      "type": "http",
      "url": "https://n8n.guigpap.com/healthz"
    }
  ]
}
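
Since this part is only specified, here is one possible sketch of a runner for the checks above, assuming Node 18+ (global fetch and AbortSignal.timeout). The names runCheck, runChecks and DEFAULTS are hypothetical, and the fallback values for a check that omits expected_status or timeout_ms (as the N8N entry does) are an assumption, not part of the specification.

```javascript
// Assumed defaults for checks that omit these fields.
const DEFAULTS = { expected_status: 200, timeout_ms: 10000 };

// Runs one check. `fetchImpl` is injectable so the logic can be tested offline.
async function runCheck(check, fetchImpl = fetch) {
  const { name, url, expected_status, timeout_ms } = { ...DEFAULTS, ...check };
  try {
    const res = await fetchImpl(url, { signal: AbortSignal.timeout(timeout_ms) });
    return { name, ok: res.status === expected_status, status: res.status };
  } catch (err) {
    // Network failure, DNS error or timeout: the service counts as down.
    return { name, ok: false, error: err.name };
  }
}

// Runs all checks in parallel and returns one result per check, in order.
async function runChecks(checks, fetchImpl = fetch) {
  return Promise.all(checks.map((c) => runCheck(c, fetchImpl)));
}
```

A failed request is reported as a result rather than thrown, so one unreachable service does not hide the status of the others.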

Daily report:

📊 HEALTH REPORT - 20 Janvier 2026

════════════════════════
UPTIME 24H
════════════════════════

🟢 Caddy          100.0%
🟢 N8N            99.9% (1 incident)
🟢 Odoo           100.0%
🟡 CrowdSec       98.5% (2 incidents)

════════════════════════
HEALTH SCORE: 97/100
════════════════════════

Escalation:

Downtime	Action
> 5 min	First alert
> 15 min	Reminder
> 30 min	Critical escalation
Multiple services	Immediate alert
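
The escalation table can be read as a simple threshold function. The sketch below is one interpretation: the function name and the action labels are hypothetical, and it assumes that the "multiple services" rule only bypasses the initial 5-minute wait, since the table does not say which rule wins when both apply.

```javascript
// Maps an outage to an action from the escalation table.
// downMinutes: how long the outage has lasted.
// downServices: how many services are down at the same time.
function escalationAction(downMinutes, downServices = 1) {
  if (downMinutes > 30) return 'critical_escalation';
  if (downMinutes > 15) return 'reminder';
  if (downMinutes > 5) return 'first_alert';
  // Assumption: several services down at once skips the 5-minute wait.
  if (downServices > 1) return 'immediate_alert';
  return 'none';
}
```

The thresholds are strict, as in the table: an outage of exactly 5 minutes on a single service does not alert yet.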

Current limits

Limitation	Impact	Mitigation
No HTTP checks	Detects the Docker state, not the application state	Planned as a future evolution
No daily report	No historical view	Grafana dashboard
5-minute interval	Detection latency of up to 5 minutes	Acceptable for personal use

Evolution scenarios

If HTTP checks are needed:

Add curl checks against the /healthz endpoint of each service
Distinguish "Docker healthy" from "HTTP responding"
More precise alerts

If a daily report is needed:

Aggregate incidents over 24 hours
Compute uptime per service
Send a digest at 8:00
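
The aggregation and uptime steps above could look like the following sketch. The incident shape ({ service, start, end } in epoch milliseconds) and the function name dailyUptime are assumptions made for illustration; the sketch also assumes that incidents of the same service do not overlap, and it only returns services that had at least one incident (all others are at 100%).

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// incidents: [{ service, start, end }], timestamps in epoch milliseconds.
// Returns { [service]: { downMs, incidents, uptime } } for the 24 hours
// ending at windowEnd; uptime is a percentage rounded to one decimal.
function dailyUptime(incidents, windowEnd) {
  const windowStart = windowEnd - DAY_MS;
  const report = {};
  for (const { service, start, end } of incidents) {
    // Clip each incident to the window so long outages are not over-counted.
    const from = Math.max(start, windowStart);
    const to = Math.min(end, windowEnd);
    if (to <= from) continue; // entirely outside the window
    const entry = (report[service] ??= { downMs: 0, incidents: 0 });
    entry.downMs += to - from;
    entry.incidents += 1;
  }
  for (const entry of Object.values(report)) {
    entry.uptime = Number((100 * (1 - entry.downMs / DAY_MS)).toFixed(1));
  }
  return report;
}
```

As a point of reference, 99.9% uptime over 24 hours corresponds to 86.4 seconds of downtime, so a single missed 5-minute check already costs about 0.35 points.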

If cAdvisor works again:

Migrate to Prometheus metrics
Remove the SSH check
Native history and graphs

Troubleshooting

Problem	What to check
False positives	Containers without a healthcheck report a health status of "none"
SSH timeout	Is the SSH credential valid? Is port 22 reachable?
No notification	Is the workflow active? Is the Notification Hub OK?
Too many alerts	Adjust the docker ps filters