---
title: 'Monitoring Stack: Prometheus + Grafana'
url: https://blog.guigpap.com/en/infrastructure/monitoring-stack/
url_md: https://blog.guigpap.com/en/infrastructure/monitoring-stack.md
category: infrastructure
date: '2026-01-20'
maturite: production
techno:
  - prometheus
  - grafana
  - docker
application:
  - monitoring
  - infrastructure
---

# Monitoring Stack: Prometheus + Grafana

> Full observability with metric collection, dashboards and automated alerts

## 1. What? - Definition and context

The **Monitoring Stack** delivers observability for the whole infrastructure. It collects system and application metrics, visualises them through dashboards, and triggers alerts when anomalies appear.

### Components

| Service | Port | Memory limit | Role |
|---------|------|--------------|------|
| **Prometheus** | 9090 | 2 GB | Metric collection and storage (pull mode) |
| **Grafana** | 3000 | 1 GB | Visualisation and dashboards |
| **Alertmanager** | 9093 | 512 MB | Alert routing and grouping |
| **Node Exporter** | 9100 | 256 MB | System metrics (CPU, RAM, disk) |
| **Docker Exporter** | 9487 | 256 MB | Per-container metrics (CPU, RAM, state) |
| **OTEL Collector** | 4317/4318 (in), 8889 (scrape) | 512 MB | Claude Code telemetry ingestion |

> **Note - Pull mode**
>
> **Prometheus** runs in "pull" mode: it regularly polls (scrapes) services to fetch their metrics. Prometheus reaches out for the data; services do not push it.

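
To make the pull model concrete: on each scrape, Prometheus receives a plain-text page in the exposition format from the target's `/metrics` endpoint. The sketch below parses a hand-written sample of such a page. It is a simplified illustration (no timestamps, no escaped label values), and the sample values are invented for the example, not taken from this infrastructure.

```python
def parse_exposition(text: str) -> dict[str, float]:
    """Parse Prometheus text-exposition lines into {series: value}.

    Simplified on purpose: ignores timestamps and escaped label values.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        # The value is the last space-separated field; the rest is the series.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Invented sample of what a node exporter could return on one scrape.
SAMPLE_SCRAPE = """\
# HELP node_memory_MemTotal_bytes Memory information field MemTotal_bytes.
# TYPE node_memory_MemTotal_bytes gauge
node_memory_MemTotal_bytes 8.0e+09
node_memory_MemAvailable_bytes 2.0e+09
node_filesystem_avail_bytes{fstype="ext4",mountpoint="/"} 1.5e+10
"""

samples = parse_exposition(SAMPLE_SCRAPE)
used_pct = (
    1 - samples["node_memory_MemAvailable_bytes"] / samples["node_memory_MemTotal_bytes"]
) * 100
print(f"memory used: {used_pct:.1f}%")
```

The exporter only has to serve this text; scheduling, storage and alert evaluation all stay on the Prometheus side.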
### Architecture diagram

```mermaid
flowchart TD
    subgraph Scrape["Scrape targets"]
        direction LR
        NE["Node Exporter · host:9100"]
        DE["Docker Exporter · host:9487"]
        OT["OTEL Collector · host:8889"]
        DEN["Docker Engine · host:9323"]
    end

    subgraph Prom["Prometheus · :9090"]
        direction TB
        P1["Scrape every 15s"]
        P2["Evaluate alerting rules"]
        P3["Store 15 days / 5 GB max"]
    end

    subgraph AM["Alertmanager · :9093"]
        direction TB
        A1["Group alerts"]
        A2["Route → N8N"]
    end

    subgraph Graf["Grafana · :3000"]
        direction TB
        G1["Linux System"]
        G2["Docker Containers"]
        G3["Claude Code"]
    end

    Hook["N8N webhook · /webhook/prometheus/alert"]
    Tg["Notification Hub → Telegram"]

    Scrape --> Prom
    Prom --> AM
    Prom --> Graf
    AM --> Hook
    Hook --> Tg
```

---

## 2. Why? - Stakes and motivations

### Problems monitoring solves

| Problem | Without monitoring | With monitoring |
|---------|--------------------|-----------------|
| **Container crash** | Reported by a user | Immediate alert |
| **Disk full** | Service unreachable | Caught before saturation |
| **Memory leak** | Random OOM kills | Trend visible, preventive action |
| **Claude costs** | End-of-month surprise | Real-time tracking |

### Most useful alerts in practice

| Alert | Triggered by | Observed value |
|-------|--------------|----------------|
| **ContainerDown** | Service crash | Quick detection, manual or auto restart |
| **Claude Code telemetry** | Claude sessions | Track time spent and tokens used |
| **DiskSpaceLow** | Disk space < 15% | Prevention before incidents |
| **HighMemoryUsage** | RAM > 85% | Not yet triggered (sufficient headroom) |

> **Tip - Claude Code telemetry**
>
> The OTEL integration tracks Claude Code sessions: tokens consumed, estimated cost, active time. These metrics feed Grafana dashboards for productivity tracking.

---

## 3. How? - Technical implementation

### Prometheus configuration

```yaml
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'docker-exporter'
    static_configs:
      - targets: ['docker-exporter:9487']

  - job_name: 'docker-engine'
    static_configs:
      - targets: ['host.docker.internal:9323']

  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8889']
```

> **Note - cAdvisor removed (security audit 2026-02-20)**
>
> cAdvisor was historically used for per-container metrics, but it was removed: (1) broken on Docker 29+ with overlayfs, (2) security risk (privileged container with `rootfs` + Docker socket). Per-container metrics now come from **docker-exporter**, and global daemon counters from **Docker Engine metrics** (`host.docker.internal:9323`).

### Data retention

```yaml
# In docker-compose.yaml, Prometheus command
command:
  - '--storage.tsdb.retention.time=15d'
  - '--storage.tsdb.retention.size=5GB'
```

> **Note - Sizing**
>
> 15 days and 5 GB are enough for a personal infrastructure. For longer history, look at Thanos or Victoria Metrics.

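
As a rough check on the 15-day / 5 GB figures, the Prometheus documentation gives a rule of thumb: disk needed ≈ retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample after compression. The sketch below applies it to the 15 s scrape interval used here. The 10,000 active series is a purely illustrative assumption, not a measured value from this stack.

```python
def estimate_tsdb_bytes(
    active_series: int,
    scrape_interval_s: float,
    retention_days: float,
    bytes_per_sample: float = 2.0,  # upper end of the documented 1-2 bytes range
) -> float:
    """Rule-of-thumb TSDB size: retention_seconds * samples_per_second * bytes_per_sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 86_400
    return retention_seconds * samples_per_second * bytes_per_sample


# Prometheus size units are powers of two, so "5GB" means 5 * 1024**3 bytes.
RETENTION_SIZE_BYTES = 5 * 1024**3

# Hypothetical workload: 10,000 active series scraped every 15 s, kept 15 days.
needed = estimate_tsdb_bytes(active_series=10_000, scrape_interval_s=15, retention_days=15)
print(f"estimated: {needed / 1024**3:.2f} GiB of {RETENTION_SIZE_BYTES / 1024**3:.0f} GiB")
print("fits" if needed < RETENTION_SIZE_BYTES else "size limit reached before 15 days")
```

When both retention flags are set, whichever limit is hit first applies, so a larger series count shortens the effective history rather than growing the disk footprint past 5 GB.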
### Alerting rules

```yaml
# prometheus/alerts.yml
groups:
  - name: infrastructure
    rules:
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory on {{ $labels.instance }}"
          description: "Memory usage: {{ $value | printf \"%.1f\" }}%"

      - alert: HighCPUUsage
        # rate() rather than irate(): short spikes in irate() can reset the `for` clause
        expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning

      - alert: DiskSpaceLow
        expr: (1 - (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: critical

      - alert: ContainerDown
        # NOTE: container_last_seen is a cAdvisor metric name, and absent() only
        # fires when no matching series exists at all. Adapt this expression to
        # a metric actually exposed by docker-exporter.
        expr: absent(container_last_seen{name!=""})
        for: 1m
        labels:
          severity: critical

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
```

### Alertmanager → N8N

```yaml
# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  receiver: 'n8n'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: 'n8n'
    webhook_configs:
      - url: 'http://n8n:5678/webhook/prometheus/alert'
        send_resolved: true
```

The [Notification Hub](/en/workflows/notification-hub/) inspects the `severity` label to route alerts: critical alerts go to Telegram immediately, warnings are grouped.

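
The actual routing lives in the N8N workflow, which is not shown here. As a minimal sketch of the same decision, the function below takes a payload shaped like the Alertmanager webhook body (an `alerts` array whose items carry `status` and `labels`) and splits it by severity. The function name, the `instant`/`grouped` channel names and the default for a missing `severity` label are assumptions made for the example.

```python
from collections import defaultdict


def split_by_severity(payload: dict) -> dict[str, list[str]]:
    """Split alerts from an Alertmanager-style webhook payload into two channels.

    Hypothetical helper: critical alerts go to "instant", everything else to "grouped".
    """
    routes = defaultdict(list)
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        # Assumption: alerts without a severity label are treated as warnings.
        severity = labels.get("severity", "warning")
        channel = "instant" if severity == "critical" else "grouped"
        status = alert.get("status", "firing")
        routes[channel].append(f"[{status}] {labels.get('alertname', 'unknown')}")
    return dict(routes)


# Trimmed example payload; real payloads carry more fields (annotations, startsAt, ...).
example_payload = {
    "status": "firing",
    "alerts": [
        {"status": "firing", "labels": {"alertname": "ContainerDown", "severity": "critical"}},
        {"status": "firing", "labels": {"alertname": "HighMemoryUsage", "severity": "warning"}},
    ],
}

routes = split_by_severity(example_payload)
print(routes)
```

Because `send_resolved: true` is set, the same handler also receives alerts with `status` set to `resolved`, which is why the status is kept in the message text.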
### Claude Code metrics (OTEL)

Claude Code configuration to export telemetry, in `~/.claude/settings.json`:

```json
{
  "env": {
    "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
    "OTEL_METRICS_EXPORTER": "otlp",
    "OTEL_LOGS_EXPORTER": "otlp",
    "OTEL_EXPORTER_OTLP_PROTOCOL": "http/protobuf",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4318",
    "OTEL_SERVICE_NAME": "claude-code"
  }
}
```

| Metric | Description |
|--------|-------------|
| `claude_code_token_usage_tokens_total` | Tokens per model and type |
| `claude_code_cost_usage_USD_total` | Cumulative cost in USD |
| `claude_code_active_time_seconds_total` | Active time |
| `claude_code_lines_of_code_count_total` | Lines changed |

### PromQL examples

```promql
# CPU usage percentage
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk usage percentage
(1 - (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes)) * 100

# Container memory usage (top 5)
# (cAdvisor-style metric name; adapt to the names docker-exporter exposes)
topk(5, container_memory_usage_bytes{name!=""})

# Claude Code tokens total by model
sum(claude_code_token_usage_tokens_total) by (model)

# Claude Code cost USD
sum(claude_code_cost_usage_USD_total)
```

### Grafana dashboards

| Dashboard | Metrics |
|-----------|---------|
| **Linux System** | CPU, RAM, disk, network, load average |
| **Docker Containers** | CPU/RAM per container, I/O, restarts |
| **Claude Code** | Tokens, costs, active time, lines of code |

---

## 4. What if? - Perspectives and limits

### Claude Code telemetry → Odoo integration

The full pipeline goes beyond raw observability: a `SessionEnd` hook on the dev machine sends session metadata to N8N, which queries Prometheus for the metrics (tokens, cost, active time) and updates the matching Odoo task via XML-RPC.

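
The real implementation is an N8N workflow; the sketch below only illustrates the shape of the two calls involved. It builds a Prometheus instant-query URL (`/api/v1/query`) and the argument tuple for Odoo's `execute_kw` external API. Several things here are assumptions: that the metrics carry a `session_id` label, that the time field stores seconds, and both helper names are hypothetical.

```python
from urllib.parse import urlencode

PROMETHEUS_URL = "http://localhost:9090"  # Prometheus port from the components table


def session_query_url(metric: str, session_id: str) -> str:
    """Build an instant-query URL summing one metric for one session.

    Assumption: the exported metrics carry a `session_id` label.
    """
    promql = f'sum({metric}{{session_id="{session_id}"}})'
    return f"{PROMETHEUS_URL}/api/v1/query?{urlencode({'query': promql})}"


def odoo_write_args(task_id: int, active_seconds: float, cost_usd: float) -> tuple:
    """Arguments to pass after (db, uid, password) in Odoo's execute_kw call.

    Assumption: x_claude_time_total stores seconds; the real unit is not documented here.
    """
    values = {
        "x_claude_time_total": active_seconds,
        "x_claude_cost_total": cost_usd,
    }
    return ("project.task", "write", [[task_id], values])


# Illustrative values only: session id, task id and amounts are made up.
print(session_query_url("claude_code_cost_usage_USD_total", "abc123"))
print(odoo_write_args(task_id=42, active_seconds=1800.0, cost_usd=1.25))
```

With Python's standard library, the second tuple would be unpacked into `xmlrpc.client.ServerProxy(...).execute_kw(db, uid, password, *args)` against Odoo's `/xmlrpc/2/object` endpoint.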
```
~/.claude (SessionEnd hook)
    │
    ▼ POST /webhook/telemetry/session-end
N8N Telemetry workflow
    │ Query Prometheus for the session
    ▼ XML-RPC to Odoo
project.task (x_claude_time_total, x_claude_cost_total, …)
```

See [Claude Code Telemetry](/en/workflows/claude-code-telemetry/) for the workflow side.

### Current limits

| Limit | Impact | Mitigation |
|-------|--------|------------|
| **15-day retention** | No long-term history | Export to S3/Thanos if needed |
| **No tracing** | Limited workflow debugging | Consider Jaeger if needed |
| **Single OTEL Collector** | SPOF for telemetry | Acceptable for personal use |

### Evolution scenarios

**If history > 15 days is required**:

- Deploy Thanos for long-term storage
- Or export snapshots to S3

**If N8N workflow tracing is needed**:

- Add Jaeger or Tempo
- Instrument N8N with OTEL traces

**If metric volume explodes**:

- Raise the Prometheus size-based retention limit (`--storage.tsdb.retention.size`)
- Consider Victoria Metrics (more efficient)

### Troubleshooting commands

```bash
# Inspect Prometheus targets
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

# Test exporter connectivity
docker exec prometheus wget -qO- http://node-exporter:9100/metrics | head

# Check active alerts (API v2; the v1 API was removed in Alertmanager 0.27)
curl http://localhost:9093/api/v2/alerts

# Test the N8N webhook (the n8n hostname resolves only inside the Docker network)
curl -X POST http://n8n:5678/webhook/prometheus/alert \
  -H "Content-Type: application/json" \
  -d '{"alerts":[{"labels":{"alertname":"test"}}]}'
```

---

## Related pages

### Infrastructure

- [VPS Architecture](/en/infrastructure/architecture-vps/): Overview
- [Security Stack](/en/infrastructure/security-stack/): Caddy exposes Grafana

### Workflows

- [Notification Hub](/en/workflows/notification-hub/): Alert routing

### Reference

- [Glossary](/en/reference/glossary/): Prometheus, PromQL, OTEL, scrape

## Agent metadata

- This article comes from the GuiGPaP Lab blog.
- Blog-wide context: https://blog.guigpap.com/llms.txt
- Author contact: https://odoo.guigpap.com/mon-cv
- License: CC-BY-SA 4.0