---
title: Global Error Handler
url: https://blog.guigpap.com/en/workflows/error-handler/
url_md: https://blog.guigpap.com/en/workflows/error-handler.md
category: automation
date: '2026-03-28'
maturite: production
techno:
  - n8n
  - telegram
  - claude
application:
  - automation
  - operations
---

# Global Error Handler

> Centralised N8N error handling with smart classification, Dead Letter Queue and automatic retry

## 1. What? — Definition and context

Imagine about forty N8N workflows running continuously — GitHub synchronisation, Docker updates, Telegram notifications, content pipeline. When one of them crashes at 3 AM, how do you know what happened, whether it is serious, and what to do?

The **Global Error Handler** (GEH) is the answer: a centralised system that intercepts every error, classifies it automatically, stores it in a dedicated queue, and sends a Telegram notification with action buttons. A single entry point for every error in the N8N infrastructure.

> **Note - Dead Letter Queue**
>
> A **Dead Letter Queue** (DLQ) is a concept borrowed from messaging systems: when a message cannot be processed, instead of being lost, it is stored in a dedicated queue for later investigation. Here, every N8N error becomes a DLQ entry with its full context.

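
To make the DLQ idea concrete, here is a minimal Python sketch of what one entry might hold. It is an illustration under assumptions, not the actual implementation: the `DlqEntry` class, the `store_error` helper, the in-memory list and most field names are hypothetical, and the real table schema is not shown in this article. Only the `err_` prefix followed by 16 hex characters and the `retry_status` field are taken from the article itself.

```python
import secrets
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class DlqEntry:
    """One stored error. Field names are illustrative, not the real schema."""
    workflow_name: str
    failed_node: str
    error_message: str
    stack_trace: str = ""
    # "err_" followed by 16 hex characters (8 random bytes).
    error_id: str = field(default_factory=lambda: "err_" + secrets.token_hex(8))
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    retry_count: int = 0
    retry_status: str = "pending"


# An in-memory list stands in for the database table in this sketch.
dead_letter_queue = []


def store_error(workflow_name, failed_node, error_message, stack_trace=""):
    """Keep the failed execution's context instead of losing it."""
    entry = DlqEntry(workflow_name, failed_node, error_message, stack_trace)
    dead_letter_queue.append(entry)
    return entry


# Example values are made up.
entry = store_error("github-sync", "HTTP Request", "ETIMEDOUT after 30s")
print(entry.error_id, entry.retry_status)
```

Because the entry carries the workflow, the failing node and the message, it can be inspected or retried later without going back to the original execution logs.
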
### The 4 workflows of the system

| Workflow | Nodes | Role |
|----------|-------|------|
| **Global Error Handler** | 19 | Capture, classification, notification |
| **GEH Callback Actions** | 30 | Retry, AI analysis, ignore, fix |
| **GEH Fix Applier** | 31 | Application and rollback of AI fixes |
| **DLQ Weekly Digest** | 8 | Weekly error summary |

### Architecture

```mermaid
flowchart TD
    WFs["~42 N8N workflows · Error Trigger"]

    subgraph GEH["Global Error Handler · 19 nodes"]
        direction TB
        Extract["Extract Error · redact secrets"]
        Config["Check error_handling_config"]
        DLQ["DLQ Insert · err_"]
        Classify["Classify · keyword rules"]
    end

    Hub["Notification Hub · Telegram with buttons"]

    subgraph CB["GEH Callback Actions · 30 nodes"]
        direction TB
        Retry["Retry · 6n"]
        Details["AI Details · 9n"]
        Ignore["Ignore · 1n"]
        Fix["AI Fix · 12n"]
    end

    FixApplier["GEH Fix Applier · 31 nodes"]
    Digest["DLQ Weekly Digest · 8 nodes"]

    WFs --> Extract --> Config --> DLQ --> Classify --> Hub
    Hub --> CB
    Fix --> FixApplier
    DLQ --> Digest --> Hub
```

---

## 2. Why? — Stakes and motivations

Before the GEH, N8N errors were silent. A workflow failed, N8N noted it in its internal logs, and nobody knew until someone noticed a malfunction. No classification, no notification, no visibility.

### Problems solved

| Problem | Without GEH | With GEH |
|---------|-------------|----------|
| **Silent errors** | Discovered by chance in logs | Immediate Telegram notification |
| **No context** | "Workflow failed" without details | Classification, failing node, stack trace |
| **Manual retry** | Open N8N, find the execution, restart | [Retry] button in Telegram |
| **No history** | Errors lost after N8N purge | Persistent Dead Letter Queue |
| **Repetitive errors** | Same alert in a loop | Deduplication + per-workflow config |

### Two-level retry strategy

The GEH does not handle retries blindly.
It relies first on the native N8N retry at the node level, then offers a manual retry at the workflow level:

| Level | Mechanism | When |
|-------|-----------|------|
| **Node** | Native N8N Retry on Fail | Transient errors (timeout, 503, rate limit) |
| **Workflow** | Retry button via GEH | When all node retries are exhausted |

> **Tip - Native Retry on Fail**
>
> Every HTTP, Telegram or AI node is configured with 2-3 automatic retries and a 5-10 second delay. If a retry succeeds, the workflow continues normally. The GEH only intervenes when those retries are exhausted, so any error it receives is truly persistent.

---

## 3. How? — Technical implementation

### An error's journey

When a workflow fails, here is what happens, step by step:

**1. Capture** — The Error Trigger intercepts the failure (including activation errors).

**2. Extraction** — A Code node normalises the context: source workflow, failing node, error message, stack trace. Sensitive data (tokens, passwords, connection strings) is automatically masked by a `redact()` function.

**3. Configuration** — The GEH consults the `error_handling_config` table to determine whether this workflow has specific rules (notifications disabled, temporary suppression, custom max retries).

**4. DLQ insert** — The error is stored in the `error_dead_letter_queue` table with a unique identifier (`err_<16hex>`).

**5. Classification** — Keyword rules analyse the error message to determine type and severity:

| Detected type | Keywords | Severity |
|---------------|----------|----------|
| `timeout` | timeout, ETIMEDOUT, deadline | warning |
| `network` | ECONNREFUSED, ENOTFOUND, socket | warning |
| `authentication` | 401, 403, unauthorized | critical |
| `rate_limit` | 429, rate limit, quota | warning |
| `data_validation` | invalid, schema, parse error | info |
| `resource` | out of memory, disk full | critical |
| `configuration` | missing credential, not found | critical |

**6. Notification** — If notifications are enabled for that workflow, the GEH calls the [Notification Hub](/en/workflows/notification-hub/) with a formatted message and action buttons.

> **Caution - Loop prevention**
>
> The GEH itself has no Error Workflow. If the GEH crashes, it does not trigger another GEH. The Notification Hub, for its part, has a minimal fallback (direct Telegram) to avoid error cascades.

### The 4 Telegram actions

When the notification arrives on Telegram, it offers four buttons:

**[Retry]** — Restarts the failed execution via the N8N API. Before restarting, the system checks that the error is eligible for retry (`can_auto_retry`). The retry counter is incremented in the DLQ.

**[Details]** — Requests an AI analysis from Claude. The first request generates the analysis (error type, probable cause, fix suggestion); subsequent ones use the cache stored in the DLQ. Useful to understand a complex error without opening N8N.

**[Ignore]** — Marks the error as resolved in the DLQ. Useful for false positives or ephemeral errors already fixed.

**[Fix]** — Advanced feature: Claude analyses the failing workflow and proposes an automatic fix. The fix is stored as a proposal, then a second workflow (GEH Fix Applier) handles application with backup, confirmation and rollback.

### Per-workflow configuration

Each workflow can have its own rules in the `error_handling_config` table:

| Parameter | Default | Usage |
|-----------|---------|-------|
| `error_handling_enabled` | true | Disable for a workflow under maintenance |
| `max_retries` | 3 | Override the retry count |
| `notify_on_error` | true | Mute notifications without disabling the DLQ |
| `auto_retry_enabled` | false | Automatic retry without intervention |
| `suppress_until` | null | Temporary suppression (ISO timestamp) |

> **Danger - Maintenance mode**
>
> When N8N self-updates (Docker self-update), a maintenance flag is inserted into `error_handling_config` to suppress notifications during the restart.
Without this, every worker that stops would generate a false alarm.

### DLQ Weekly Digest

Every Sunday at 9 AM, the DLQ Weekly Digest workflow generates a summary of the week's errors and sends it through the Notification Hub. This digest helps spot patterns: a workflow failing regularly, a recurring error type, retries that never solve the problem.

The digest includes:

- Total error count by severity
- Top failing workflows
- Unresolved errors (`retry_status` = pending/exhausted)
- Trends compared to the previous week

---

## 4. What if? — Outlook and limits

### Current limits

| Limit | Impact | Mitigation |
|-------|--------|------------|
| **Rule-based classification** | Unknown types classified as "unknown" | On-demand AI analysis via [Details] |
| **No auto-retry** | Each retry requires a click | `auto_retry_enabled` flag prepared but not deployed |
| **Experimental AI Fix** | Proposed fixes are not always applicable | Double confirmation before application + automatic rollback |
| **No correlation** | Related errors not grouped | Identifiable via the weekly digest |

### Evolution scenarios

**If error volume grows**:

- Enable auto-retry for transient errors (timeout, rate limit)
- Group similar errors in the digest
- Add an alert threshold: "5 errors from the same workflow in 1h = escalate"

**If cross-workflow correlation is needed**:

- Trace execution chains (workflow A calls B which calls C)
- If C fails, show the full chain context
- Allow retry from the parent workflow

**If the team grows**:

- Assign errors by domain (Docker → ops, Odoo → business)
- Escalate if not handled after a configurable delay
- Grafana dashboard with DLQ metrics

---

## Related pages

### Infrastructure

- [N8N in queue mode](/en/infrastructure/n8n-queue-mode/) — Backend running the workflows
- [Monitoring Stack](/en/infrastructure/monitoring-stack/) — Prometheus and Grafana

### Workflows

- [Notification Hub](/en/workflows/notification-hub/) — Error notification routing
- [Telegram Orchestrator](/en/workflows/telegram-orchestrator/) — Receiver for Retry/Details/Ignore callbacks
- [Docker Auto-Updates](/en/workflows/docker-updates/) — Maintenance mode during updates

### Reference

- [Glossary](/en/reference/glossary/) — Dead Letter Queue, Error Trigger

## Agent metadata

- This article comes from the GuiGPaP Lab blog.
- Global blog context: https://blog.guigpap.com/llms.txt
- Author contact: https://odoo.guigpap.com/mon-cv
- Licence: CC-BY-SA 4.0