Global Error Handler

1. What? — Definition and context

Imagine about forty N8N workflows running continuously: GitHub synchronisation, Docker updates, Telegram notifications, a content pipeline. When one of them crashes at 3 AM, how do you know what happened, how serious it is, and what to do?

The Global Error Handler (GEH) is the answer: a centralised system that intercepts every error, classifies it automatically, stores it in a dedicated queue, and sends a Telegram notification with action buttons. A single entry point for every error in the N8N infrastructure.

The 4 workflows of the system

Workflow	Nodes	Role
Global Error Handler	19	Capture, classification, notification
GEH Callback Actions	30	Retry, AI analysis, ignore, fix
GEH Fix Applier	31	Application and rollback of AI fixes
DLQ Weekly Digest	8	Weekly error summary

Architecture

2. Why? — Stakes and motivations

Before the GEH, N8N errors were silent. A workflow failed, N8N recorded it in its internal logs, and nobody knew until someone noticed a malfunction. No classification, no notification, no visibility.

Problems solved

Problem	Without GEH	With GEH
Silent errors	Discovered by chance in logs	Immediate Telegram notification
No context	“Workflow failed” without details	Classification, failing node, stack trace
Manual retry	Open N8N, find the execution, restart	[Retry] button in Telegram
No history	Errors lost after N8N purge	Persistent Dead Letter Queue
Repetitive errors	Same alert in a loop	Deduplication + per-workflow config

Two-level retry strategy

The GEH does not handle retries blindly. It relies first on the native N8N retry at the node level, and offers a workflow-level retry only once those attempts are exhausted:

Level	Mechanism	When
Node	Native N8N Retry on Fail	Transient errors (timeout, 503, rate limit)
Workflow	Retry button via GEH	When all node retries are exhausted
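
As an illustration of how the two levels fit together, here is a minimal sketch of the node-level behaviour. It is not N8N's implementation: `run_with_node_retry` and `RetriesExhausted` are hypothetical names, and the exception stands for the point where the error would reach the GEH and the [Retry] button.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RetriesExhausted(Exception):
    """Node-level retries are used up; the workflow-level retry takes over."""


def run_with_node_retry(call: Callable[[], T], max_tries: int = 3,
                        wait_seconds: float = 0.0) -> T:
    """Level 1: retry a single call a fixed number of times."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_tries + 1):
        try:
            return call()
        except Exception as exc:  # transient errors: timeout, 503, rate limit
            last_error = exc
            if attempt < max_tries:
                time.sleep(wait_seconds)
    # Level 2 starts here: the failure is handed over for a manual retry.
    raise RetriesExhausted(f"failed after {max_tries} tries: {last_error}") from last_error
```

A transient failure that clears on the second or third attempt never reaches the GEH; only a failure that survives every attempt does.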

3. How? — Technical implementation

An error’s journey

When a workflow fails, here is what happens, step by step:

1. Capture — The Error Trigger intercepts the failure (including activation errors).

2. Extraction — A Code node normalises the context: source workflow, failing node, error message, stack trace. Sensitive data (tokens, passwords, connection strings) are automatically masked by a redact() function.

3. Configuration — The GEH consults the error_handling_config table to determine whether this workflow has specific rules (notifications disabled, temporary suppression, custom max retries).

4. DLQ insert — The error is stored in the error_dead_letter_queue table with a unique identifier (err_<16hex>).

5. Classification — Keyword rules analyse the error message to determine type and severity:

Detected type	Keywords	Severity
`timeout`	timeout, ETIMEDOUT, deadline	warning
`network`	ECONNREFUSED, ENOTFOUND, socket	warning
`authentication`	401, 403, unauthorized	critical
`rate_limit`	429, rate limit, quota	warning
`data_validation`	invalid, schema, parse error	info
`resource`	out of memory, disk full	critical
`configuration`	missing credential, not found	critical

6. Notification — If notifications are enabled for that workflow, the GEH calls the Notification Hub with a formatted message and action buttons.
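
Steps 2, 4 and 5 can be sketched as follows. This is a hedged illustration, not the GEH's actual Code node: the masking patterns inside `redact()` are assumptions, `classify()` and `new_error_id()` are hypothetical helper names, and the severity attached to the "unknown" fallback is a guess. The keyword rules themselves come from the table above, applied in table order with the first match winning.

```python
import re
import secrets
from typing import Tuple

# Assumed masking patterns; the real redact() rules are not documented here.
_KEY_VALUE = re.compile(r"(?i)\b(token|password|secret|api[_-]?key)\b(\s*[=:]\s*)\S+")
_URL_CREDENTIALS = re.compile(r"\b(\w+://)[^/\s:@]+:[^/\s@]+@")


def redact(text: str) -> str:
    """Mask key=value secrets and credentials embedded in connection strings."""
    text = _KEY_VALUE.sub(r"\1\2***", text)
    return _URL_CREDENTIALS.sub(r"\1***:***@", text)


# (type, keywords, severity), copied from the classification table.
RULES = [
    ("timeout", ("timeout", "etimedout", "deadline"), "warning"),
    ("network", ("econnrefused", "enotfound", "socket"), "warning"),
    ("authentication", ("401", "403", "unauthorized"), "critical"),
    ("rate_limit", ("429", "rate limit", "quota"), "warning"),
    ("data_validation", ("invalid", "schema", "parse error"), "info"),
    ("resource", ("out of memory", "disk full"), "critical"),
    ("configuration", ("missing credential", "not found"), "critical"),
]


def classify(message: str) -> Tuple[str, str]:
    """Return (type, severity) for the first rule whose keyword appears."""
    lowered = message.lower()
    for error_type, keywords, severity in RULES:
        if any(keyword in lowered for keyword in keywords):
            return error_type, severity
    return "unknown", "warning"  # fallback severity is an assumption


def new_error_id() -> str:
    """Build an identifier of the form err_<16hex>."""
    return "err_" + secrets.token_hex(8)
```

Because the match is a plain substring test, rule order matters: a message containing both "timeout" and "401" is classified as `timeout`, since that rule comes first.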

The 4 Telegram actions

When the notification arrives on Telegram, it offers four buttons:

[Retry] — Restarts the failed execution via the N8N API. Before restarting, the system checks that the error is eligible for retry (can_auto_retry). The retry counter is incremented in the DLQ.

[Details] — Requests an AI analysis from Claude. The first request generates the analysis (error type, probable cause, fix suggestion); subsequent ones use the cache stored in the DLQ. Useful to understand a complex error without opening N8N.

[Ignore] — Marks the error as resolved in the DLQ. Useful for false positives or ephemeral errors already fixed.

[Fix] — Advanced feature: Claude analyses the failing workflow and proposes an automatic fix. The fix is stored as a proposal, then a second workflow (GEH Fix Applier) handles application with backup, confirmation and rollback.
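
The four actions can be pictured as a small dispatcher working on one DLQ entry. The sketch below is an assumption-laden illustration, not the GEH Callback Actions workflow: apart from `can_auto_retry`, every field name (`execution_id`, `retry_count`, `ai_analysis`, `status`, `fix_proposal`, `workflow_id`) is hypothetical, and the N8N API and Claude calls are replaced by injected functions.

```python
from typing import Any, Callable, Dict


def handle_callback(
    action: str,
    entry: Dict[str, Any],
    rerun: Callable[[str], None],
    analyse: Callable[[str], str],
    propose_fix: Callable[[str], str],
) -> str:
    """Apply one Telegram button action to a DLQ entry and return a reply text."""
    if action == "retry":
        # Eligibility is checked before anything is restarted.
        if not entry.get("can_auto_retry", False):
            return "not eligible for retry"
        rerun(entry["execution_id"])
        entry["retry_count"] = entry.get("retry_count", 0) + 1
        return "retry started"
    if action == "details":
        # The first request generates the analysis; later ones reuse the cache.
        if entry.get("ai_analysis") is None:
            entry["ai_analysis"] = analyse(entry["message"])
        return entry["ai_analysis"]
    if action == "ignore":
        entry["status"] = "resolved"
        return "marked as resolved"
    if action == "fix":
        # Only a proposal is stored; applying it belongs to a separate workflow.
        entry["fix_proposal"] = propose_fix(entry["workflow_id"])
        return "fix proposal stored"
    raise ValueError(f"unknown action: {action}")
```

Passing the external calls in as parameters keeps the caching and eligibility logic testable without a live N8N instance or AI backend.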

Per-workflow configuration

Each workflow can have its own rules in the error_handling_config table:

Parameter	Default	Usage
`error_handling_enabled`	true	Disable for a workflow under maintenance
`max_retries`	3	Override the retry count
`notify_on_error`	true	Mute notifications without disabling the DLQ
`auto_retry_enabled`	false	Automatic retry without intervention
`suppress_until`	null	Temporary suppression (ISO timestamp)
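
To make the interplay of these parameters concrete, here is a minimal sketch of how a row could be merged with the defaults and turned into a notify/do-not-notify decision. `effective_config` and `should_notify` are hypothetical names, the row is represented as a plain dictionary, and the treatment of `error_handling_enabled = false` as "no notification" is an assumption.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional

# Defaults taken from the parameter table.
DEFAULTS: Dict[str, Any] = {
    "error_handling_enabled": True,
    "max_retries": 3,
    "notify_on_error": True,
    "auto_retry_enabled": False,
    "suppress_until": None,
}


def effective_config(row: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge a workflow's row (or None if it has no row) over the defaults."""
    config = dict(DEFAULTS)
    if row:
        config.update({k: v for k, v in row.items() if k in DEFAULTS and v is not None})
    return config


def should_notify(config: Dict[str, Any], now: datetime) -> bool:
    """Decide whether a notification should be sent for this workflow."""
    if not config["error_handling_enabled"] or not config["notify_on_error"]:
        return False
    until = config["suppress_until"]
    # suppress_until is an ISO timestamp with an explicit offset, e.g. "+00:00".
    if until is not None and now < datetime.fromisoformat(until):
        return False
    return True
```

For example, a row containing only `{"suppress_until": "2025-01-01T18:00:00+00:00"}` keeps every other default and mutes notifications until 18:00 UTC that day.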

DLQ Weekly Digest

Every Sunday at 9 AM, the DLQ Weekly Digest workflow generates a summary of the week’s errors and sends it through the Notification Hub. This digest helps spot patterns: a workflow failing regularly, a recurring error type, retries that never solve the problem.

The digest includes:

Total error count by severity
Top failing workflows
Unresolved errors (retry_status = pending/exhausted)
Trends compared to the previous week
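
The four parts of the digest amount to a handful of aggregations over the week's DLQ rows. The sketch below assumes each row is a dictionary with `id`, `workflow`, `severity` and `retry_status` keys; apart from `retry_status`, these names are hypothetical, as is the `build_digest` function, and the trend is reduced to a simple difference in totals.

```python
from collections import Counter
from typing import Any, Dict, List


def build_digest(errors: List[Dict[str, Any]], previous_total: int,
                 top_n: int = 3) -> Dict[str, Any]:
    """Summarise one week of DLQ entries."""
    by_severity = Counter(error["severity"] for error in errors)
    by_workflow = Counter(error["workflow"] for error in errors)
    unresolved = [
        error["id"] for error in errors
        if error["retry_status"] in ("pending", "exhausted")
    ]
    return {
        "total": len(errors),
        "by_severity": dict(by_severity),
        "top_workflows": by_workflow.most_common(top_n),
        "unresolved": unresolved,
        # Positive means more errors than the previous week.
        "trend": len(errors) - previous_total,
    }
```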

4. What if? — Outlook and limits

Current limits

Limit	Impact	Mitigation
Rule-based classification	Unrecognised errors are classified as “unknown”	On-demand AI analysis via [Details]
No auto-retry	Each retry requires a click	`auto_retry_enabled` flag prepared but not deployed
Experimental AI Fix	Proposed fixes are not always applicable	Double confirmation before application + automatic rollback
No correlation	Related errors not grouped	Identifiable via the weekly digest

Evolution scenarios

If error volume grows:

Enable auto-retry for transient errors (timeout, rate limit)
Group similar errors in the digest
Add an alert threshold: “5 errors from the same workflow in 1h = escalate”
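
The alert threshold could be sketched as a sliding window per workflow. This is only one possible design for a feature that does not exist yet: `EscalationTracker` is a hypothetical name and the state is kept in memory, whereas a real version would more likely query the DLQ.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta
from typing import DefaultDict, Deque


class EscalationTracker:
    """Flag a workflow once it reaches `threshold` errors inside `window`."""

    def __init__(self, threshold: int = 5, window: timedelta = timedelta(hours=1)):
        self.threshold = threshold
        self.window = window
        self._events: DefaultDict[str, Deque[datetime]] = defaultdict(deque)

    def record(self, workflow: str, at: datetime) -> bool:
        """Record one error and return True if the workflow should escalate."""
        events = self._events[workflow]
        events.append(at)
        # Drop errors that fell out of the sliding window.
        while events and at - events[0] > self.window:
            events.popleft()
        return len(events) >= self.threshold
```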

If cross-workflow correlation is needed:

Trace execution chains (workflow A calls B which calls C)
If C fails, show the full chain context
Allow retry from the parent workflow

If the team grows:

Assign errors by domain (Docker → ops, Odoo → business)
Escalate if not handled after a configurable delay
Add a Grafana dashboard with DLQ metrics

Infrastructure

N8N in queue mode — Backend running the workflows
Monitoring Stack — Prometheus and Grafana

Workflows

Notification Hub — Error notification routing
Telegram Orchestrator — Receiver for Retry/Details/Ignore callbacks
Docker Auto-Updates — Maintenance mode during updates

Reference

Glossary — Dead Letter Queue, Error Trigger