---
title: 'Monitoring Stack: Prometheus + Grafana'
url: https://blog.guigpap.com/en/infrastructure/monitoring-stack/
url_md: https://blog.guigpap.com/en/infrastructure/monitoring-stack.md
category: infrastructure
date: '2026-01-20'
maturite: production
techno:
  - prometheus
  - grafana
  - docker
application:
  - monitoring
  - infrastructure
---

# Monitoring Stack: Prometheus + Grafana

> Full observability with metric collection, dashboards and automated alerts

## 1. What? — Definition and context

The **Monitoring Stack** delivers observability for the whole infrastructure. It collects system and application metrics, visualises them through dashboards, and triggers alerts when anomalies appear.

### Components

| Service | Port | Memory limit | Role |
|---------|------|--------------|------|
| **Prometheus** | 9090 | 2 GB | Metric collection and storage (pull mode) |
| **Grafana** | 3000 | 1 GB | Visualisation and dashboards |
| **Alertmanager** | 9093 | 512 MB | Alert routing and grouping |
| **Node Exporter** | 9100 | 256 MB | System metrics (CPU, RAM, disk) |
| **Docker Exporter** | 9487 | 256 MB | Per-container metrics (CPU, RAM, state) |
| **OTEL Collector** | 4317/4318 (in), 8889 (scrape) | 512 MB | Claude Code telemetry ingestion |

> **Note - Pull mode**
>
> **Prometheus** runs in "pull" mode: at each scrape interval (15 s here) it requests the HTTP metrics endpoint of every target, so services expose their metrics rather than pushing them.
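
The memory limits listed in the Components table are typically enforced in the Compose file. As a minimal sketch, assuming the services are declared in `docker-compose.yaml` (the service names and images below are assumptions, not taken from the actual stack), the limits could be set with `mem_limit`:

```yaml
# docker-compose.yaml (fragment); service names and images are assumptions
services:
  prometheus:
    image: prom/prometheus
    mem_limit: 2g      # 2 GB, as listed in the table
  grafana:
    image: grafana/grafana
    mem_limit: 1g      # 1 GB
  alertmanager:
    image: prom/alertmanager
    mem_limit: 512m    # 512 MB
```

A container that exceeds its limit is killed by the kernel OOM killer, so the limit also bounds the damage a leaking service can do to its neighbours.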

### Architecture diagram

```mermaid
flowchart TD
  subgraph Scrape["Scrape targets"]
    direction LR
    NE["Node Exporter · host:9100"]
    DE["Docker Exporter · host:9487"]
    OT["OTEL Collector · host:8889"]
    DEN["Docker Engine · host:9323"]
  end

  subgraph Prom["Prometheus · :9090"]
    direction TB
    P1["Scrape every 15s"]
    P2["Evaluate alerting rules"]
    P3["Store 15 days / 5 GB max"]
  end

  subgraph AM["Alertmanager · :9093"]
    direction TB
    A1["Group alerts"]
    A2["Route → N8N"]
  end

  subgraph Graf["Grafana · :3000"]
    direction TB
    G1["Linux System"]
    G2["Docker Containers"]
    G3["Claude Code"]
  end

  Hook["N8N webhook · /webhook/prometheus/alert"]
  Tg["Notification Hub → Telegram"]

  Scrape --> Prom
  Prom --> AM
  Prom --> Graf
  AM --> Hook
  Hook --> Tg
```

---

## 2. Why? — Stakes and motivations

### Problems monitoring solves

| Problem | Without monitoring | With monitoring |
|---------|--------------------|-----------------|
| **Container crash** | Reported by a user | Immediate alert |
| **Disk full** | Service unreachable | Caught before saturation |
| **Memory leak** | Random OOM kills | Trend visible, preventive action |
| **Claude costs** | End-of-month surprise | Real-time tracking |

### Most useful alerts in practice

| Alert | Triggered by | Benefit in practice |
|-------|--------------|---------------------|
| **ContainerDown** | Service crash | Quick detection, manual or auto restart |
| **Claude Code telemetry** | Claude sessions | Track time spent and tokens used |
| **DiskSpaceLow** | Free disk space < 15% | Prevention before incidents |
| **HighMemoryUsage** | RAM > 85% | Not yet triggered (sufficient headroom) |

> **Tip - Claude Code telemetry**
>
> The OTEL integration tracks Claude Code sessions: tokens consumed, estimated cost, active time. These metrics feed Grafana dashboards for productivity tracking.

---

## 3. How? — Technical implementation

### Prometheus configuration

```yaml
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'docker-exporter'
    static_configs:
      - targets: ['docker-exporter:9487']

  - job_name: 'docker-engine'
    static_configs:
      - targets: ['host.docker.internal:9323']

  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8889']
```

> **Note - cAdvisor removed (security audit 2026-02-20)**
>
> cAdvisor was historically used for per-container metrics, but it was removed: (1) broken on Docker 29+ with overlayfs, (2) security risk (privileged container with `rootfs` + Docker socket). Per-container metrics now come from **docker-exporter**, and global daemon counters from **Docker Engine metrics** (`host.docker.internal:9323`).
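
Scraping `host.docker.internal:9323` from inside a container has two prerequisites that the configuration above does not show. The Docker daemon has to expose its metrics (the `metrics-addr` key in `/etc/docker/daemon.json`) on an address the container can reach, and on Linux the `host.docker.internal` name has to be mapped explicitly. A minimal sketch of the second part, assuming the Prometheus service is named `prometheus` in the Compose file:

```yaml
# docker-compose.yaml (fragment); the service name `prometheus` is an assumption
services:
  prometheus:
    extra_hosts:
      # On Linux, host.docker.internal is not defined by default;
      # host-gateway resolves to the host's gateway IP.
      - "host.docker.internal:host-gateway"
```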

### Data retention

```yaml
# In docker-compose.yaml, Prometheus command
command:
  - '--storage.tsdb.retention.time=15d'
  - '--storage.tsdb.retention.size=5GB'
```

> **Note - Sizing**
>
> 15 days and 5 GB are enough for a personal infrastructure. The two limits are independent: whichever is reached first triggers deletion of the oldest data. For longer history, look at Thanos or VictoriaMetrics.
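
When the size limit is hit before the time limit, history silently becomes shorter than 15 days. One way to notice this is to alert on Prometheus' own counter of size-based deletions. The rule below is a sketch, not part of the documented rule set: the group name, alert name and 6h window are illustrative, and it relies on the existing `prometheus` self-scrape job.

```yaml
# prometheus/alerts.yml (extra group, to be merged under the existing `groups:` key)
groups:
  - name: prometheus-self
    rules:
      - alert: PrometheusSizeRetentionActive
        # Counter incremented each time blocks are deleted because the
        # --storage.tsdb.retention.size limit was exceeded.
        expr: increase(prometheus_tsdb_size_retentions_total[6h]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Prometheus is deleting data to stay under the size limit"
```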

### Alerting rules

```yaml
# prometheus/alerts.yml
groups:
  - name: infrastructure
    rules:
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory on {{ $labels.instance }}"
          description: "Memory usage: {{ $value | printf \"%.1f\" }}%"

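      - alert: HighCPUUsage
        # rate() is preferred over irate() for alerting; "by (instance)" keeps the host label
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning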

      - alert: DiskSpaceLow
        expr: (1 - (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: critical

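      - alert: ContainerDown
        # Caveat: absent() fires only when no series matches at all, so this
        # detects "no container metrics", not a single stopped container.
        # container_last_seen is a cAdvisor metric name; check that the
        # exporter in use still exposes it.
        expr: absent(container_last_seen{name!=""})
        for: 1m
        labels:
          severity: critical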

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
```
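
For these rules to be evaluated and forwarded, `prometheus.yml` also has to load the rule file and know where Alertmanager lives. Neither section appears in the configuration shown earlier, so the fragment below is a sketch under assumptions: the mount path of `alerts.yml` and the `alertmanager` hostname are guesses based on the naming used elsewhere in this article.

```yaml
# prometheus/prometheus.yml (fragment); path and hostname are assumptions
rule_files:
  - /etc/prometheus/alerts.yml   # wherever alerts.yml is mounted in the container

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
```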

### Alertmanager → N8N

```yaml
# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  receiver: 'n8n'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: 'n8n'
    webhook_configs:
      - url: 'http://n8n:5678/webhook/prometheus/alert'
        send_resolved: true
```

The [Notification Hub](/en/workflows/notification-hub/) inspects severity to route alerts: criticals → instant Telegram, warnings → grouped.
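
With the route shown above, every new alert group, criticals included, waits `group_wait: 30s` before Alertmanager calls the webhook. If critical alerts should reach the Notification Hub sooner, one option is a child route. This is a sketch of a possible variant, not the configuration in use; the `10s` and `1h` values are illustrative.

```yaml
# alertmanager/alertmanager.yml (variant of the route block)
route:
  receiver: 'n8n'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Child route: inherits the 'n8n' receiver, overrides the timings
    - matchers:
        - severity="critical"
      group_wait: 10s
      repeat_interval: 1h
```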

### Claude Code metrics (OTEL)

To export telemetry from Claude Code, set the following in `~/.claude/settings.json`:

```json
{
  "env": {
    "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
    "OTEL_METRICS_EXPORTER": "otlp",
    "OTEL_LOGS_EXPORTER": "otlp",
    "OTEL_EXPORTER_OTLP_PROTOCOL": "http/protobuf",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4318",
    "OTEL_SERVICE_NAME": "claude-code"
  }
}
```

| Metric | Description |
|--------|-------------|
| `claude_code_token_usage_tokens_total` | Tokens per model and type |
| `claude_code_cost_usage_USD_total` | Cumulative cost in USD |
| `claude_code_active_time_seconds_total` | Active time |
| `claude_code_lines_of_code_count_total` | Lines changed |

### PromQL examples

```promql
# CPU usage percentage
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk usage percentage
(1 - (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes)) * 100

# Container memory usage (top 5)
topk(5, container_memory_usage_bytes{name!=""})

# Claude Code tokens total by model
sum(claude_code_token_usage_tokens_total) by (model)

# Claude Code cost USD
sum(claude_code_cost_usage_USD_total)
```
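
The two Claude Code queries above sum raw counters, so they return the cumulative total since the counters started and are affected by counter resets. For per-period figures, `increase()` over a window is the usual approach. A minimal sketch as recording rules, assuming they live in a rule file loaded by Prometheus; the group name, rule names and 24h window are illustrative choices:

```yaml
# prometheus/recording.yml (hypothetical file)
groups:
  - name: claude_code
    rules:
      # Estimated cost over the last 24 hours, in USD
      - record: claude_code:cost_usd:increase24h
        expr: sum(increase(claude_code_cost_usage_USD_total[24h]))
      # Tokens consumed over the last 24 hours, per model
      - record: claude_code:tokens:increase24h
        expr: sum by (model) (increase(claude_code_token_usage_tokens_total[24h]))
```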

### Grafana dashboards

| Dashboard | Metrics |
|-----------|---------|
| **Linux System** | CPU, RAM, disk, network, load average |
| **Docker Containers** | CPU/RAM per container, I/O, restarts |
| **Claude Code** | Tokens, costs, active time, lines of code |
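
All three dashboards read from Prometheus. If the data source is provisioned from a file rather than created in the UI, the definition could look like the sketch below. The file name and the `prometheus` hostname are assumptions; `/etc/grafana/provisioning/datasources/` is Grafana's default provisioning directory in the official image.

```yaml
# grafana/provisioning/datasources/prometheus.yml (hypothetical file name)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                 # Grafana's backend queries Prometheus
    url: http://prometheus:9090
    isDefault: true
```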

---

## 4. What if? — Perspectives and limits

### Claude Code telemetry → Odoo integration

The full pipeline goes beyond raw observability: a `SessionEnd` hook on the dev machine sends session metadata to N8N, which queries Prometheus for the metrics (tokens, cost, active time) and updates the matching Odoo task via XML-RPC.

```
~/.claude (SessionEnd hook)
   │
   ▼  POST /webhook/telemetry/session-end
N8N Telemetry workflow
   │  Query Prometheus for the session
   ▼  XML-RPC to Odoo
project.task (x_claude_time_total, x_claude_cost_total, …)
```

See [Claude Code Telemetry](/en/workflows/claude-code-telemetry/) for the workflow side.

### Current limits

| Limit | Impact | Mitigation |
|-------|--------|------------|
| **15-day retention** | No long-term history | Export to S3/Thanos if needed |
| **No tracing** | Limited workflow debugging | Consider Jaeger if needed |
| **Single OTEL Collector** | SPOF for telemetry | Acceptable for personal use |

### Evolution scenarios

**If history > 15 days is required**:
- Deploy Thanos for long-term storage
- Or export snapshots to S3

**If N8N workflow tracing is needed**:
- Add Jaeger or Tempo
- Instrument N8N with OTEL traces

**If metric volume explodes**:
- Raise the Prometheus size limit (`--storage.tsdb.retention.size`) so the 15-day window still fits
- Consider VictoriaMetrics (more efficient)
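
For the VictoriaMetrics option, Prometheus can keep scraping as it does today and forward samples through `remote_write`. The fragment below is a sketch that assumes a single-node VictoriaMetrics container reachable as `victoriametrics` (a hypothetical hostname) on its default port 8428:

```yaml
# prometheus/prometheus.yml (fragment); the hostname is an assumption
remote_write:
  - url: http://victoriametrics:8428/api/v1/write
```

Local retention can then stay at 15 days, with the longer history queried from the remote store.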

### Troubleshooting commands

```bash
# Inspect Prometheus targets
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

# Test exporter connectivity
docker exec prometheus wget -qO- http://node-exporter:9100/metrics | head

# Check active alerts (Alertmanager API v2; the v1 API was removed in 0.27)
curl -s http://localhost:9093/api/v2/alerts

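# Test the N8N webhook (the `n8n` hostname resolves only inside the Docker network, so run this from a container attached to it)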
curl -X POST http://n8n:5678/webhook/prometheus/alert \
  -H "Content-Type: application/json" \
  -d '{"alerts":[{"labels":{"alertname":"test"}}]}'
```

---

## Related pages

### Infrastructure
- [VPS Architecture](/en/infrastructure/architecture-vps/) — Overview
- [Security Stack](/en/infrastructure/security-stack/) — Caddy exposes Grafana

### Workflows
- [Notification Hub](/en/workflows/notification-hub/) — Alert routing

### Reference
- [Glossary](/en/reference/glossary/) — Prometheus, PromQL, OTEL, scrape

## Agent metadata

- This article comes from the GuiGPaP Lab blog.
- Global blog context: https://blog.guigpap.com/llms.txt
- Author contact: https://odoo.guigpap.com/mon-cv
- License: CC-BY-SA 4.0