Global Error Handler
1. What? — Definition and context
Imagine about forty N8N workflows running continuously: GitHub synchronisation, Docker updates, Telegram notifications, a content pipeline. When one of them crashes at 3 AM, how do you know what happened, whether it is serious, and what to do?
The Global Error Handler (GEH) is the answer: a centralised system that intercepts every error, classifies it automatically, stores it in a dedicated queue, and sends a Telegram notification with action buttons. A single entry point for every error in the N8N infrastructure.
The 4 workflows of the system
| Workflow | Nodes | Role |
|---|---|---|
| Global Error Handler | 19 | Capture, classification, notification |
| GEH Callback Actions | 30 | Retry, AI analysis, ignore, fix |
| GEH Fix Applier | 31 | Application and rollback of AI fixes |
| DLQ Weekly Digest | 8 | Weekly error summary |
Architecture
2. Why? — Stakes and motivations
Before the GEH, N8N errors were silent. A workflow failed, N8N recorded it in its internal logs, and nobody knew until someone noticed a malfunction. No classification, no notification, no visibility.
Problems solved
| Problem | Without GEH | With GEH |
|---|---|---|
| Silent errors | Discovered by chance in logs | Immediate Telegram notification |
| No context | “Workflow failed” without details | Classification, failing node, stack trace |
| Manual retry | Open N8N, find the execution, restart | [Retry] button in Telegram |
| No history | Errors lost after N8N purge | Persistent Dead Letter Queue |
| Repetitive errors | Same alert in a loop | Deduplication + per-workflow config |
Two-level retry strategy
The GEH does not handle retries blindly. It relies first on the native N8N retry at the node level, and offers a workflow-level retry only once those attempts are exhausted:
| Level | Mechanism | When |
|---|---|---|
| Node | Native N8N Retry on Fail | Transient errors (timeout, 503, rate limit) |
| Workflow | Retry button via GEH | When all node retries are exhausted |
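The relationship between the two levels can be pictured with a minimal Python sketch. This is an illustration only, not N8N's implementation: `run_with_node_retry`, `TransientError`, `max_tries` and `wait_seconds` are hypothetical names standing in for the node-level Retry on Fail setting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Stands in for a timeout, a 503 or a rate-limit response."""


def run_with_node_retry(call: Callable[[], T], max_tries: int = 3,
                        wait_seconds: float = 0.0) -> T:
    """Level 1: retry the node itself, then re-raise once tries are exhausted."""
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    for attempt in range(1, max_tries + 1):
        try:
            return call()
        except TransientError:
            if attempt == max_tries:
                # Level 2 takes over: the failure surfaces to the error
                # handler, which can offer a workflow-level retry.
                raise
            time.sleep(wait_seconds)
    raise AssertionError("unreachable")
```

Only failures that survive every node-level attempt reach the second level, which keeps transient noise out of the error queue.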
3. How? — Technical implementation
An error’s journey
When a workflow fails, here is what happens, step by step:
1. Capture — The Error Trigger intercepts the failure (including activation errors).
2. Extraction — A Code node normalises the context: source workflow, failing node, error message, stack trace. Sensitive data (tokens, passwords, connection strings) are automatically masked by a redact() function.
3. Configuration — The GEH consults the error_handling_config table to determine whether this workflow has specific rules (notifications disabled, temporary suppression, custom max retries).
4. DLQ insert — The error is stored in the error_dead_letter_queue table with a unique identifier (err_<16hex>).
5. Classification — Keyword rules analyse the error message to determine type and severity:
| Detected type | Keywords | Severity |
|---|---|---|
| timeout | timeout, ETIMEDOUT, deadline | warning |
| network | ECONNREFUSED, ENOTFOUND, socket | warning |
| authentication | 401, 403, unauthorized | critical |
| rate_limit | 429, rate limit, quota | warning |
| data_validation | invalid, schema, parse error | info |
| resource | out of memory, disk full | critical |
| configuration | missing credential, not found | critical |
6. Notification — If notifications are enabled for that workflow, the GEH calls the Notification Hub with a formatted message and action buttons.
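Steps 2, 4 and 5 above can be sketched in a few lines of Python. This is a hedged illustration, not the GEH's actual Code node. The masking patterns, the `make_error_id` helper, the first-match-wins rule order and the `warning` severity for unknown errors are all assumptions. Only the keyword table and the `err_<16hex>` format come from this page.

```python
import re
import secrets

# Hypothetical masking patterns; the real redact() rules are not documented here.
_KEY_VALUE = re.compile(r"(?i)\b(token|password|secret|api[_-]?key)\b(\s*[=:]\s*)\S+")
_URL_CREDENTIALS = re.compile(r"\b([a-z][a-z0-9+.-]*://)[^\s/@:]+:[^\s/@]+@")

# Keyword rules copied from the classification table, checked in table order.
RULES = [
    ("timeout", ("timeout", "etimedout", "deadline"), "warning"),
    ("network", ("econnrefused", "enotfound", "socket"), "warning"),
    ("authentication", ("401", "403", "unauthorized"), "critical"),
    ("rate_limit", ("429", "rate limit", "quota"), "warning"),
    ("data_validation", ("invalid", "schema", "parse error"), "info"),
    ("resource", ("out of memory", "disk full"), "critical"),
    ("configuration", ("missing credential", "not found"), "critical"),
]


def redact(text: str) -> str:
    """Mask key=value secrets and user:password pairs in connection strings."""
    text = _KEY_VALUE.sub(r"\1\2***", text)
    return _URL_CREDENTIALS.sub(r"\1***@", text)


def make_error_id() -> str:
    """Build an identifier in the err_<16hex> format (8 random bytes)."""
    return "err_" + secrets.token_hex(8)


def classify(message: str) -> tuple[str, str]:
    """Return (type, severity) for the first rule whose keyword appears."""
    lowered = message.lower()
    for error_type, keywords, severity in RULES:
        if any(keyword in lowered for keyword in keywords):
            return error_type, severity
    return "unknown", "warning"  # assumed default severity
```

Plain substring matching is cheap but coarse: a message that merely contains the digits 401 would be classified as an authentication error, which is one reason the on-demand AI analysis remains useful.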
The 4 Telegram actions
When the notification arrives on Telegram, it offers four buttons:
[Retry] — Restarts the failed execution via the N8N API. Before restarting, the system checks that the error is eligible for retry (can_auto_retry). The retry counter is incremented in the DLQ.
[Details] — Requests an AI analysis from Claude. The first request generates the analysis (error type, probable cause, fix suggestion); subsequent ones use the cache stored in the DLQ. Useful to understand a complex error without opening N8N.
[Ignore] — Marks the error as resolved in the DLQ. Useful for false positives or ephemeral errors already fixed.
[Fix] — Advanced feature: Claude analyses the failing workflow and proposes an automatic fix. The fix is stored as a proposal, then a second workflow (GEH Fix Applier) handles application with backup, confirmation and rollback.
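The logic behind [Retry], [Details] and [Ignore] can be sketched as follows. This is a simplified Python illustration under stated assumptions. `DlqEntry` and its field names other than `can_auto_retry` are hypothetical, as is the choice to bound manual retries by `max_retries`. The `restart` and `analyse` callables stand in for the N8N API call and the Claude request.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DlqEntry:
    error_id: str
    can_auto_retry: bool
    retry_count: int = 0
    max_retries: int = 3
    status: str = "pending"            # assumed values: pending, exhausted, resolved
    ai_analysis: Optional[str] = None  # cached result of the first [Details] click


def handle_retry(entry: DlqEntry, restart: Callable[[str], None]) -> bool:
    """Restart the execution only if the entry is still eligible."""
    if not entry.can_auto_retry:
        return False
    if entry.retry_count >= entry.max_retries:
        entry.status = "exhausted"
        return False
    entry.retry_count += 1
    restart(entry.error_id)
    return True


def handle_details(entry: DlqEntry, analyse: Callable[[str], str]) -> str:
    """Generate the analysis once, then serve it from the cached field."""
    if entry.ai_analysis is None:
        entry.ai_analysis = analyse(entry.error_id)
    return entry.ai_analysis


def handle_ignore(entry: DlqEntry) -> None:
    """Mark the error as resolved."""
    entry.status = "resolved"
```

Caching the analysis on the entry itself means repeated clicks on [Details] cost a single AI request per error.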
Per-workflow configuration
Each workflow can have its own rules in the error_handling_config table:
| Parameter | Default | Usage |
|---|---|---|
| error_handling_enabled | true | Disable for a workflow under maintenance |
| max_retries | 3 | Override the retry count |
| notify_on_error | true | Mute notifications without disabling the DLQ |
| auto_retry_enabled | false | Automatic retry without intervention |
| suppress_until | null | Temporary suppression (ISO timestamp) |
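To show how these parameters might combine, here is a minimal Python sketch of the notification decision. The field names and defaults mirror the table above. The `should_notify` helper is hypothetical, and treating `suppress_until` as a notification mute (rather than a full bypass of the handler) is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ErrorHandlingConfig:
    error_handling_enabled: bool = True
    max_retries: int = 3
    notify_on_error: bool = True
    auto_retry_enabled: bool = False
    suppress_until: Optional[str] = None  # ISO timestamp, e.g. 2025-01-02T00:00:00+00:00


def should_notify(config: ErrorHandlingConfig, now: datetime) -> bool:
    """Decide whether a notification goes out for this workflow right now."""
    if not config.error_handling_enabled or not config.notify_on_error:
        return False
    if config.suppress_until is not None:
        until = datetime.fromisoformat(config.suppress_until)
        if until.tzinfo is None:
            # Assumption: timestamps without an offset are stored in UTC.
            until = until.replace(tzinfo=timezone.utc)
        if now < until:
            return False
    return True
```

A workflow without a row in the table would simply use the defaults, so `should_notify(ErrorHandlingConfig(), now)` returns `True`.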
DLQ Weekly Digest
Every Sunday at 9 AM, the DLQ Weekly Digest workflow generates a summary of the week’s errors and sends it through the Notification Hub. This digest helps spot patterns: a workflow failing regularly, a recurring error type, retries that never solve the problem.
The digest includes:
- Total error count by severity
- Top failing workflows
- Unresolved errors (retry_status = pending/exhausted)
- Trends compared to the previous week
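The four items above amount to a handful of aggregations over the week's DLQ rows. The following Python sketch shows one possible shape. The `build_digest` function and the row keys `id`, `workflow` and `severity` are hypothetical. Only `retry_status` and its pending/exhausted values appear on this page, and the trend is reduced here to a simple difference in totals.

```python
from collections import Counter
from typing import Any


def build_digest(errors: list[dict[str, Any]], previous_total: int,
                 top_n: int = 3) -> dict[str, Any]:
    """Summarise one week of DLQ rows for the digest message."""
    by_severity = Counter(row["severity"] for row in errors)
    top_workflows = Counter(row["workflow"] for row in errors).most_common(top_n)
    unresolved = [
        row["id"] for row in errors
        if row["retry_status"] in ("pending", "exhausted")
    ]
    return {
        "total": len(errors),
        "by_severity": dict(by_severity),
        "top_workflows": top_workflows,
        "unresolved": unresolved,
        "delta_vs_previous_week": len(errors) - previous_total,
    }
```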
4. What if? — Outlook and limits
Current limits
| Limit | Impact | Mitigation |
|---|---|---|
| Rule-based classification | Unknown types classified as “unknown” | On-demand AI analysis via [Details] |
| No auto-retry | Each retry requires a click | auto_retry_enabled flag prepared but not deployed |
| Experimental AI Fix | Proposed fixes are not always applicable | Double confirmation before application + automatic rollback |
| No correlation | Related errors not grouped | Identifiable via the weekly digest |
Evolution scenarios
If error volume grows:
- Enable auto-retry for transient errors (timeout, rate limit)
- Group similar errors in the digest
- Add an alert threshold: “5 errors from the same workflow in 1h = escalate”
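The alert threshold in the last item could be implemented as a sliding window per workflow. The sketch below is one possible approach in Python, not an existing part of the GEH. `EscalationTracker` is a hypothetical name, and the code assumes errors are recorded in chronological order.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


class EscalationTracker:
    """Counts recent errors per workflow inside a sliding time window."""

    def __init__(self, threshold: int = 5,
                 window: timedelta = timedelta(hours=1)) -> None:
        self.threshold = threshold
        self.window = window
        self._seen: dict[str, deque] = defaultdict(deque)

    def record(self, workflow: str, at: datetime) -> bool:
        """Register an error and return True when escalation is due."""
        times = self._seen[workflow]
        times.append(at)
        # Drop timestamps that have fallen out of the window.
        while at - times[0] > self.window:
            times.popleft()
        return len(times) >= self.threshold
```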
If cross-workflow correlation is needed:
- Trace execution chains (workflow A calls B which calls C)
- If C fails, show the full chain context
- Allow retry from the parent workflow
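Tracing a chain mostly requires knowing which execution started which. A minimal Python sketch, assuming a hypothetical `parent_of` mapping from each workflow to its caller (nothing on this page says the GEH records this today):

```python
def execution_chain(failed: str, parent_of: dict[str, str]) -> list[str]:
    """Walk from the failed workflow up to the root, returned root first."""
    chain = [failed]
    seen = {failed}
    while chain[-1] in parent_of:
        parent = parent_of[chain[-1]]
        if parent in seen:
            break  # guard against a cyclic mapping
        chain.append(parent)
        seen.add(parent)
    chain.reverse()
    return chain
```

With `parent_of = {"C": "B", "B": "A"}`, a failure in C yields the chain A, B, C, and the first element is the parent workflow a retry would start from.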
If the team grows:
- Assign errors by domain (Docker → ops, Odoo → business)
- Escalate if not handled after a configurable delay
- Grafana dashboard with DLQ metrics
Related pages
Infrastructure
- N8N in queue mode — Backend running the workflows
- Monitoring Stack — Prometheus and Grafana
Workflows
- Notification Hub — Error notification routing
- Telegram Orchestrator — Receiver for Retry/Details/Ignore callbacks
- Docker Auto-Updates — Maintenance mode during updates
Reference
- Glossary — Dead Letter Queue, Error Trigger