Monitoring Stack: Prometheus + Grafana
1. What? — Definition and context
The Monitoring Stack delivers observability for the whole infrastructure. It collects system and application metrics, visualises them through dashboards, and triggers alerts when anomalies appear.
Components
| Service | Port | Memory limit | Role |
|---|---|---|---|
| Prometheus | 9090 | 2 GB | Metric collection and storage (pull mode) |
| Grafana | 3000 | 1 GB | Visualisation and dashboards |
| Alertmanager | 9093 | 512 MB | Alert routing and grouping |
| Node Exporter | 9100 | 256 MB | System metrics (CPU, RAM, disk) |
| Docker Exporter | 9487 | 256 MB | Per-container metrics (CPU, RAM, state) |
| OTEL Collector | 4317/4318 (in), 8889 (scrape) | 512 MB | Claude Code telemetry ingestion |
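For capacity planning, the memory limits in the table can be added up to get the worst-case footprint of the stack. A minimal sketch, assuming the limits are binary units as Docker interprets them (2 GB read as 2048 MB); the dictionary and function names are illustrative:

```python
# Memory limits from the components table, in MB (1 GB taken as 1024 MB).
MEMORY_LIMITS_MB = {
    "Prometheus": 2048,
    "Grafana": 1024,
    "Alertmanager": 512,
    "Node Exporter": 256,
    "Docker Exporter": 256,
    "OTEL Collector": 512,
}


def total_memory_mb(limits: dict[str, int]) -> int:
    """Worst-case memory footprint if every service reaches its limit."""
    return sum(limits.values())


if __name__ == "__main__":
    total = total_memory_mb(MEMORY_LIMITS_MB)
    print(f"Stack memory ceiling: {total} MB ({total / 1024:.1f} GB)")
```

The result is a ceiling, not typical usage: each service only consumes what it needs up to its limit.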
Architecture diagram
2. Why? — Stakes and motivations
Problems monitoring solves
| Problem | Without monitoring | With monitoring |
|---|---|---|
| Container crash | Reported by a user | Immediate alert |
| Disk full | Service unreachable | Caught before saturation |
| Memory leak | Random OOM kills | Trend visible, preventive action |
| Claude costs | End-of-month surprise | Real-time tracking |
Most useful alerts in practice
| Alert | Triggered by | Observed benefit |
|---|---|---|
| ContainerDown | Service crash | Quick detection, manual or auto restart |
| Claude Code telemetry | Claude sessions | Track time spent and tokens used |
| DiskSpaceLow | Disk space < 15% | Prevention before incidents |
| HighMemoryUsage | RAM > 85% | Not yet triggered (sufficient headroom) |
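The two threshold alerts in the table reduce to simple ratios. The sketch below mirrors them in Python, assuming the usual node_exporter readings for available and total bytes; the function names and sample figures are illustrative, not measurements from this stack.

```python
GIB = 1024 ** 3


def usage_percent(available_bytes: float, total_bytes: float) -> float:
    """Share of a resource in use: (1 - available / total) * 100."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return (1 - available_bytes / total_bytes) * 100


def high_memory_usage(available_bytes: float, total_bytes: float) -> bool:
    """HighMemoryUsage condition: more than 85% of RAM in use."""
    return usage_percent(available_bytes, total_bytes) > 85


def disk_space_low(available_bytes: float, total_bytes: float) -> bool:
    """DiskSpaceLow condition: less than 15% of the disk free."""
    return usage_percent(available_bytes, total_bytes) > 85


if __name__ == "__main__":
    # Hypothetical host: 16 GiB of RAM with 2 GiB available (87.5% used).
    print(high_memory_usage(2 * GIB, 16 * GIB))
    # Hypothetical disk: 100 GiB with 20 GiB free (about 80% used).
    print(disk_space_low(20 * GIB, 100 * GIB))
```

In Prometheus the condition must additionally hold for the rule's `for` duration before the alert fires, which this single-sample check does not model.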
3. How? — Technical implementation
Prometheus configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'docker-exporter'
    static_configs:
      - targets: ['docker-exporter:9487']

  - job_name: 'docker-engine'
    static_configs:
      - targets: ['host.docker.internal:9323']

  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8889']
Data retention
# In docker-compose.yaml, Prometheus command
command:
  - '--storage.tsdb.retention.time=15d'
  - '--storage.tsdb.retention.size=5GB'
Alerting rules
groups:
  - name: infrastructure
    rules:
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory on {{ $labels.instance }}"
          description: "Memory usage: {{ $value | printf \"%.1f\" }}%"

      - alert: HighCPUUsage
        expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning

      - alert: DiskSpaceLow
        # Fires when usage exceeds 85%, i.e. less than 15% of the filesystem is free.
        expr: (1 - (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: critical

      - alert: ContainerDown
        # Note: absent() returns a result only when no matching series exists,
        # so this fires when every container series is gone, not when a single
        # container stops.
        expr: absent(container_last_seen{name!=""})
        for: 1m
        labels:
          severity: critical

      - alert: ServiceDown
        # up == 0 means the last scrape of a target failed.
        expr: up == 0
        for: 1m
        labels:
          severity: critical
Alertmanager → N8N
global:
  resolve_timeout: 5m

route:
  receiver: 'n8n'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: 'n8n'
    webhook_configs:
      - url: 'http://n8n:5678/webhook/prometheus/alert'
        send_resolved: true

The Notification Hub inspects severity to route alerts: criticals → instant Telegram, warnings → grouped.
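The routing decision described above can be sketched as follows. This is an illustration, not the actual N8N workflow: it assumes the standard Alertmanager webhook payload (an `alerts` array whose items carry `status` and `labels`), and the function name and the channel names `telegram_instant` and `grouped_digest` are hypothetical.

```python
from collections import defaultdict


def route_alerts(payload: dict) -> dict[str, list[str]]:
    """Split the alerts of one webhook payload by their severity label.

    Critical alerts go to the instant channel; everything else is
    collected for a grouped notification.
    """
    routed: dict[str, list[str]] = defaultdict(list)
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        name = labels.get("alertname", "unknown")
        status = alert.get("status", "firing")
        channel = (
            "telegram_instant"
            if labels.get("severity") == "critical"
            else "grouped_digest"
        )
        routed[channel].append(f"{name} ({status})")
    return dict(routed)


if __name__ == "__main__":
    # Hand-written sample payload, trimmed to the fields used here.
    sample = {
        "status": "firing",
        "alerts": [
            {"status": "firing", "labels": {"alertname": "ServiceDown", "severity": "critical"}},
            {"status": "firing", "labels": {"alertname": "HighCPUUsage", "severity": "warning"}},
        ],
    }
    print(route_alerts(sample))
```

Alerts without a severity label fall into the grouped channel in this sketch; a real workflow may prefer to treat them as critical.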
Claude Code metrics (OTEL)
Claude Code configuration to export telemetry:
{
  "env": {
    "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
    "OTEL_METRICS_EXPORTER": "otlp",
    "OTEL_LOGS_EXPORTER": "otlp",
    "OTEL_EXPORTER_OTLP_PROTOCOL": "http/protobuf",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4318",
    "OTEL_SERVICE_NAME": "claude-code"
  }
}

| Metric | Description |
|---|---|
| claude_code_token_usage_tokens_total | Tokens per model and type |
| claude_code_cost_usage_USD_total | Cumulative cost in USD |
| claude_code_active_time_seconds_total | Active time |
| claude_code_lines_of_code_count_total | Lines changed |
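To make the metric names concrete, the sketch below parses a few lines in the Prometheus text exposition format, as the collector's scrape endpoint on port 8889 would serve them, and totals tokens per model. The sample lines, the model names and the exact label names (`model`, `type`) are assumptions made for illustration; only the metric names come from the table.

```python
import re
from collections import defaultdict

# Made-up sample in Prometheus text exposition format.
SAMPLE = """\
# TYPE claude_code_token_usage_tokens_total counter
claude_code_token_usage_tokens_total{model="model-a",type="input"} 1200
claude_code_token_usage_tokens_total{model="model-a",type="output"} 300
claude_code_token_usage_tokens_total{model="model-b",type="input"} 50
claude_code_cost_usage_USD_total{model="model-a"} 0.42
"""

# metric_name{label="value",...} sample_value
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def tokens_by_model(text: str) -> dict[str, float]:
    """Sum claude_code_token_usage_tokens_total per model label."""
    totals: dict[str, float] = defaultdict(float)
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match or match["name"] != "claude_code_token_usage_tokens_total":
            continue  # comments, other metrics, unlabelled samples
        labels = dict(LABEL_RE.findall(match["labels"]))
        totals[labels.get("model", "unknown")] += float(match["value"])
    return dict(totals)


if __name__ == "__main__":
    print(tokens_by_model(SAMPLE))
```

The regular expressions are deliberately simple and ignore escaped quotes and optional timestamps, so this is a reading aid rather than a general parser.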
PromQL examples
# CPU usage percentage
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk usage percentage
(1 - (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes)) * 100

# Container memory usage (top 5)
topk(5, container_memory_usage_bytes{name!=""})

# Claude Code tokens total by model
sum(claude_code_token_usage_tokens_total) by (model)

# Claude Code cost USD
sum(claude_code_cost_usage_USD_total)
Grafana dashboards
| Dashboard | Metrics |
|---|---|
| Linux System | CPU, RAM, disk, network, load average |
| Docker Containers | CPU/RAM per container, I/O, restarts |
| Claude Code | Tokens, costs, active time, lines of code |
4. What if? — Perspectives and limits
Claude Code telemetry → Odoo integration
The full pipeline goes beyond raw observability: a SessionEnd hook on the dev machine sends session metadata to N8N, which queries Prometheus for the metrics (tokens, cost, active time) and updates the matching Odoo task via XML-RPC.

~/.claude (SessionEnd hook)
    │
    ▼  POST /webhook/telemetry/session-end
N8N Telemetry workflow
    │  Query Prometheus for the session
    ▼  XML-RPC to Odoo
project.task (x_claude_time_total, x_claude_cost_total, …)

See Claude Code Telemetry for the workflow side.
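The two middle steps of this pipeline (query Prometheus, then write to Odoo) can be sketched in Python. This is not the N8N workflow itself. It assumes that sessions are distinguishable through a `session_id` label, that Prometheus is reachable on localhost:9090, and that the Odoo fields accept raw seconds and USD; all function names are hypothetical. It relies on the Prometheus instant-query endpoint (`/api/v1/query`) and on Odoo's XML-RPC `execute_kw` call.

```python
import json
import urllib.parse
import urllib.request
import xmlrpc.client

PROMETHEUS_URL = "http://localhost:9090"  # assumed; port from the components table


def session_query(metric: str, session_id: str) -> str:
    """PromQL for one metric restricted to a session (label name is assumed)."""
    return f'sum({metric}{{session_id="{session_id}"}})'


def instant_query_url(promql: str, base_url: str = PROMETHEUS_URL) -> str:
    """URL of the Prometheus instant-query endpoint for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def scalar_from_response(body: dict) -> float:
    """First sample value of an instant-query response, or 0.0 if empty."""
    result = body.get("data", {}).get("result", [])
    return float(result[0]["value"][1]) if result else 0.0


def fetch_metric(metric: str, session_id: str) -> float:
    """Query Prometheus over HTTP (network call, never run on import)."""
    url = instant_query_url(session_query(metric, session_id))
    with urllib.request.urlopen(url, timeout=10) as resp:
        return scalar_from_response(json.load(resp))


def update_task(odoo_url, db, uid, password, task_id, seconds, cost):
    """Write the totals to project.task through Odoo's XML-RPC object endpoint."""
    models = xmlrpc.client.ServerProxy(f"{odoo_url}/xmlrpc/2/object")
    values = {"x_claude_time_total": seconds, "x_claude_cost_total": cost}
    return models.execute_kw(
        db, uid, password, "project.task", "write", [[task_id], values]
    )
```

A caller would chain `fetch_metric("claude_code_cost_usage_USD_total", session_id)` and `fetch_metric("claude_code_active_time_seconds_total", session_id)` into `update_task(...)`, with the Odoo credentials and task id supplied by the session metadata.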
Current limits
| Limit | Impact | Mitigation |
|---|---|---|
| 15-day retention | No long-term history | Export to S3/Thanos if needed |
| No tracing | Limited workflow debugging | Consider Jaeger if needed |
| Single OTEL Collector | SPOF for telemetry | Acceptable for personal use |
Evolution scenarios
If history > 15 days is required:
- Deploy Thanos for long-term storage
- Or export snapshots to S3
If N8N workflow tracing is needed:
- Add Jaeger or Tempo
- Instrument N8N with OTEL traces
If metric volume explodes:
- Raise the Prometheus size cap (--storage.tsdb.retention.size) so the 15-day window still fits
- Consider VictoriaMetrics (more efficient)
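To judge when metric volume outgrows the current retention settings, the rule of thumb from the Prometheus storage documentation can be applied: needed disk space = retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample. The sketch below is an estimate under that rule, assuming 2 bytes per sample and reading 5GB as 5 × 1024³ bytes; the function names are illustrative.

```python
SECONDS_PER_DAY = 86_400


def needed_disk_bytes(retention_days: float, samples_per_second: float,
                      bytes_per_sample: float = 2.0) -> float:
    """Disk space needed to keep the given ingestion rate for the retention window."""
    return retention_days * SECONDS_PER_DAY * samples_per_second * bytes_per_sample


def max_samples_per_second(size_cap_bytes: float, retention_days: float,
                           bytes_per_sample: float = 2.0) -> float:
    """Highest ingestion rate that still fits the size cap over the full window."""
    return size_cap_bytes / (retention_days * SECONDS_PER_DAY * bytes_per_sample)


def series_budget(samples_per_second: float, scrape_interval_seconds: float) -> float:
    """Active series that produce this rate at one sample per scrape interval."""
    return samples_per_second * scrape_interval_seconds


if __name__ == "__main__":
    cap = 5 * 1024 ** 3                       # --storage.tsdb.retention.size=5GB
    rate = max_samples_per_second(cap, 15)    # --storage.tsdb.retention.time=15d
    print(f"~{rate:.0f} samples/s, ~{series_budget(rate, 15):.0f} active series")
```

With both retention flags set, Prometheus drops data as soon as either limit is reached, so beyond this budget the effective history becomes shorter than 15 days.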
Troubleshooting commands
# Inspect Prometheus targets
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

# Test exporter connectivity
docker exec prometheus wget -qO- http://node-exporter:9100/metrics | head

# Check active alerts (API v2; the v1 API is removed in recent Alertmanager releases)
curl http://localhost:9093/api/v2/alerts

# Test the N8N webhook (the n8n hostname resolves only inside the Docker network)
curl -X POST http://n8n:5678/webhook/prometheus/alert \
  -H "Content-Type: application/json" \
  -d '{"alerts":[{"labels":{"alertname":"test"}}]}'
Related pages
Infrastructure
- VPS Architecture — Overview
- Security Stack — Caddy exposes Grafana
Workflows
- Notification Hub — Alert routing
Reference
- Glossary — Prometheus, PromQL, OTEL, scrape