Monitoring Stack: Prometheus + Grafana

1. What? — Definition and context

The Monitoring Stack delivers observability for the whole infrastructure. It collects system and application metrics, visualises them through dashboards, and triggers alerts when anomalies appear.

Components

Service	Port	Memory limit	Role
Prometheus	9090	2 GB	Metric collection and storage (pull mode)
Grafana	3000	1 GB	Visualisation and dashboards
Alertmanager	9093	512 MB	Alert routing and grouping
Node Exporter	9100	256 MB	System metrics (CPU, RAM, disk)
Docker Exporter	9487	256 MB	Per-container metrics (CPU, RAM, state)
OTEL Collector	4317/4318 (in), 8889 (scrape)	512 MB	Claude Code telemetry ingestion

Architecture diagram

2. Why? — Stakes and motivations

Problems monitoring solves

Problem	Without monitoring	With monitoring
Container crash	Reported by a user	Immediate alert
Disk full	Service unreachable	Caught before saturation
Memory leak	Random OOM kills	Trend visible, preventive action
Claude costs	End-of-month surprise	Real-time tracking

Most useful alerts in practice

Alert or signal	Triggered by	Benefit observed
ContainerDown	Service crash	Quick detection, manual or auto restart
Claude Code telemetry	Claude sessions	Track time spent and tokens used
DiskSpaceLow	Free disk space < 15%	Prevention before incidents
HighMemoryUsage	RAM > 85%	Not yet triggered (sufficient headroom)

3. How? — Technical implementation

Prometheus configuration

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'docker-exporter'
    static_configs:
      - targets: ['docker-exporter:9487']

  - job_name: 'docker-engine'
    static_configs:
      - targets: ['host.docker.internal:9323']

  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8889']

Data retention

# In docker-compose.yaml, Prometheus command
command:
  - '--storage.tsdb.retention.time=15d'
  - '--storage.tsdb.retention.size=5GB'
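Which of the two caps is reached first depends on ingestion volume. A rough estimate can be sketched with the sizing rule from the Prometheus storage documentation (retention in seconds × ingested samples per second × bytes per sample, at roughly 1 to 2 bytes per sample). The series count below is a hypothetical figure, not a measurement from this stack; the real value can be read from the `prometheus_tsdb_head_series` metric.

```python
# Rough TSDB sizing sketch under stated assumptions.
SCRAPE_INTERVAL_S = 15            # global scrape_interval from the config above
RETENTION_DAYS = 15               # --storage.tsdb.retention.time=15d
SIZE_CAP_BYTES = 5 * 1024**3      # --storage.tsdb.retention.size=5GB (units are powers of 2)
ACTIVE_SERIES = 20_000            # HYPOTHETICAL: replace with prometheus_tsdb_head_series


def estimated_disk_bytes(active_series: int,
                         scrape_interval_s: float,
                         retention_days: float,
                         bytes_per_sample: float = 2.0) -> float:
    """Estimate disk usage as retention_seconds * samples_per_second * bytes_per_sample."""
    retention_seconds = retention_days * 24 * 3600
    # samples_per_second = active_series / scrape_interval_s
    return retention_seconds * active_series * bytes_per_sample / scrape_interval_s


estimate = estimated_disk_bytes(ACTIVE_SERIES, SCRAPE_INTERVAL_S, RETENTION_DAYS)
print(f"estimated: {estimate / 1024**3:.2f} GiB, cap: {SIZE_CAP_BYTES / 1024**3:.2f} GiB")
print("size cap reached first" if estimate > SIZE_CAP_BYTES else "time cap reached first")
```

If the estimate exceeds the size cap, Prometheus drops the oldest blocks before they reach 15 days, so the effective history is shorter than the time setting suggests.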

Alerting rules

groups:
  - name: infrastructure
    rules:
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory on {{ $labels.instance }}"
          description: "Memory usage: {{ $value | printf \"%.1f\" }}%"

      - alert: HighCPUUsage
        expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning

      - alert: DiskSpaceLow
        expr: (1 - (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: critical

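      - alert: ContainerDown
        # Caveat: absent() returns a result only when no series matches at all,
        # so this fires when every container_last_seen series disappears,
        # not when a single container stops.
        expr: absent(container_last_seen{name!=""})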
        for: 1m
        labels:
          severity: critical

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
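The `for:` clause in the rules above delays firing: the expression has to stay true across consecutive evaluations for the whole duration, and the alert is only pending until then. The following is a simplified model of that behaviour, not Prometheus code; the function name and the sample values are illustrative.

```python
from typing import Iterable, List


def alert_states(values: Iterable[float], threshold: float,
                 for_seconds: int, eval_interval_s: int) -> List[str]:
    """Return 'inactive', 'pending' or 'firing' for each rule evaluation.

    Simplified model: one series, evaluations exactly eval_interval_s apart.
    """
    states: List[str] = []
    active_since = None  # evaluation time (s) at which the condition became true
    for i, value in enumerate(values):
        now = i * eval_interval_s
        if value > threshold:
            if active_since is None:
                active_since = now
            held_for = now - active_since
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            # Any evaluation below the threshold resets the pending timer.
            active_since = None
            states.append("inactive")
    return states


# HighMemoryUsage: threshold 85 %, for: 5m, evaluation_interval: 15s.
# Illustrative samples: 21 evaluations above the threshold.
samples = [90.0] * 21
states = alert_states(samples, threshold=85, for_seconds=300, eval_interval_s=15)
print(states[0], states[19], states[20])
```

With a 15-second evaluation interval, a 5-minute `for:` means a short spike that recovers within a few evaluations never leaves the pending state.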

Alertmanager → N8N

global:
  resolve_timeout: 5m

route:
  receiver: 'n8n'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: 'n8n'
    webhook_configs:
      - url: 'http://n8n:5678/webhook/prometheus/alert'
        send_resolved: true

The Notification Hub inspects the severity label to route alerts: critical alerts are sent to Telegram immediately, while warnings are grouped.
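The severity split can be illustrated outside N8N. The sketch below assumes the standard Alertmanager webhook payload (a top-level `alerts` list whose entries carry `labels`); the function name, the bucket names and the sample alerts are hypothetical and do not describe the actual Notification Hub workflow.

```python
from typing import Any, Dict, List


def split_by_severity(payload: Dict[str, Any]) -> Dict[str, List[Dict[str, Any]]]:
    """Split Alertmanager webhook alerts into immediate and grouped buckets."""
    routed: Dict[str, List[Dict[str, Any]]] = {"immediate": [], "grouped": []}
    for alert in payload.get("alerts", []):
        # Assumption: alerts without a severity label are treated as warnings.
        severity = alert.get("labels", {}).get("severity", "warning")
        bucket = "immediate" if severity == "critical" else "grouped"
        routed[bucket].append(alert)
    return routed


# Illustrative payload, trimmed to the fields used here.
sample_payload = {
    "status": "firing",
    "alerts": [
        {"status": "firing", "labels": {"alertname": "ServiceDown", "severity": "critical"}},
        {"status": "firing", "labels": {"alertname": "HighCPUUsage", "severity": "warning"}},
    ],
}

routed = split_by_severity(sample_payload)
print([a["labels"]["alertname"] for a in routed["immediate"]])
print([a["labels"]["alertname"] for a in routed["grouped"]])
```

Because the route groups by `alertname` and `severity`, each webhook call normally carries alerts of one severity, but checking per alert keeps the logic safe if the grouping changes.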

Claude Code metrics (OTEL)

Claude Code configuration to export telemetry:

{
  "env": {
    "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
    "OTEL_METRICS_EXPORTER": "otlp",
    "OTEL_LOGS_EXPORTER": "otlp",
    "OTEL_EXPORTER_OTLP_PROTOCOL": "http/protobuf",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4318",
    "OTEL_SERVICE_NAME": "claude-code"
  }
}

Metric	Description
`claude_code_token_usage_tokens_total`	Tokens per model and type
`claude_code_cost_usage_USD_total`	Cumulative cost in USD
`claude_code_active_time_seconds_total`	Active time
`claude_code_lines_of_code_count_total`	Lines changed

PromQL examples

# CPU usage percentage
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk usage percentage
(1 - (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes)) * 100

# Container memory usage (top 5)
topk(5, container_memory_usage_bytes{name!=""})

# Claude Code tokens total by model
sum(claude_code_token_usage_tokens_total) by (model)

# Claude Code cost USD
sum(claude_code_cost_usage_USD_total)
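These expressions can also be run programmatically through the Prometheus HTTP API (`/api/v1/query`). The sketch below uses only the standard library; the base URL is assumed to be the local Prometheus on port 9090, and the helper names and the label value in the sample response are illustrative.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, List, Tuple

PROMETHEUS_URL = "http://localhost:9090"  # assumption: Prometheus reachable locally


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_vector(body: Dict[str, Any]) -> List[Tuple[Dict[str, str], float]]:
    """Turn an instant-vector response into (labels, value) pairs."""
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    data = body["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample is {"metric": {...}, "value": [timestamp, "value-as-string"]}.
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


def instant_query(promql: str, base_url: str = PROMETHEUS_URL):
    """Run an instant query against a live Prometheus (network call)."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=10) as resp:
        return parse_vector(json.load(resp))


# Offline illustration with a hand-written response; "model-a" is a made-up label value.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {"model": "model-a"}, "value": [1700000000, "1234"]}],
    },
}
print(parse_vector(sample_response))
```

Against the running stack, `instant_query('sum(claude_code_token_usage_tokens_total) by (model)')` would return one pair per model label.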

Grafana dashboards

Dashboard	Metrics
Linux System	CPU, RAM, disk, network, load average
Docker Containers	CPU/RAM per container, I/O, restarts
Claude Code	Tokens, costs, active time, lines of code

4. What if? — Perspectives and limits

Claude Code telemetry → Odoo integration

The full pipeline goes beyond raw observability: a SessionEnd hook on the dev machine sends session metadata to N8N, which queries Prometheus for the metrics (tokens, cost, active time) and updates the matching Odoo task via XML-RPC.

~/.claude (SessionEnd hook)
   │
   ▼  POST /webhook/telemetry/session-end
N8N Telemetry workflow
   │  Query Prometheus for the session
   ▼  XML-RPC to Odoo
project.task (x_claude_time_total, x_claude_cost_total, …)

See Claude Code Telemetry for the workflow side.
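For orientation, the final step of the pipeline (writing the totals to the Odoo task) could look like the sketch below, which uses Odoo's external XML-RPC API from the Python standard library. In the real setup this call is made by the N8N workflow; the function names, the connection parameters and the units stored in the custom fields are assumptions, and only the field names `x_claude_time_total` and `x_claude_cost_total` come from the pipeline description above.

```python
import xmlrpc.client
from typing import Dict


def build_task_values(active_seconds: float, cost_usd: float) -> Dict[str, float]:
    """Map session totals to the custom task fields.

    Assumption: the fields store seconds and USD as-is; the real units and
    whether values are overwritten or accumulated are not known here.
    """
    return {
        "x_claude_time_total": active_seconds,
        "x_claude_cost_total": cost_usd,
    }


def update_task(url: str, db: str, user: str, api_key: str,
                task_id: int, values: Dict[str, float]) -> bool:
    """Write the values to one project.task record over XML-RPC (network call)."""
    common = xmlrpc.client.ServerProxy(f"{url}/xmlrpc/2/common")
    uid = common.authenticate(db, user, api_key, {})
    if not uid:
        raise PermissionError("Odoo authentication failed")
    models = xmlrpc.client.ServerProxy(f"{url}/xmlrpc/2/object")
    return models.execute_kw(db, uid, api_key,
                             "project.task", "write", [[task_id], values])


# Offline illustration with made-up session totals.
print(build_task_values(active_seconds=5400.0, cost_usd=1.25))
```

Keeping the value mapping separate from the network call makes the mapping easy to check without an Odoo instance.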

Current limits

Limit	Impact	Mitigation
15-day retention	No long-term history	Export to S3/Thanos if needed
No tracing	Limited workflow debugging	Consider Jaeger if needed
Single OTEL Collector	SPOF for telemetry	Acceptable for personal use

Evolution scenarios

If more than 15 days of history is required:

Deploy Thanos for long-term storage
Or export snapshots to S3

If N8N workflow tracing is needed:

Add Jaeger or Tempo
Instrument N8N with OTEL traces

If metric volume explodes:

Raise the Prometheus retention size cap (--storage.tsdb.retention.size) and the disk allocated to it
Consider VictoriaMetrics (more efficient storage)

Troubleshooting commands

# Inspect Prometheus targets
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

# Test exporter connectivity
docker exec prometheus wget -qO- http://node-exporter:9100/metrics | head

# Check active alerts (Alertmanager API v2; the v1 API was removed in Alertmanager 0.27)
curl -s http://localhost:9093/api/v2/alerts

# Test the N8N webhook
curl -X POST http://n8n:5678/webhook/prometheus/alert \
  -H "Content-Type: application/json" \
  -d '{"alerts":[{"labels":{"alertname":"test"}}]}'
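The target check above can also be scripted, for example to list only unhealthy targets together with their last scrape error. This sketch parses the JSON returned by `/api/v1/targets`; the function name and the sample data are illustrative, and fetching the JSON from the API is left out.

```python
from typing import Any, Dict, List


def unhealthy_targets(targets_response: Dict[str, Any]) -> List[Dict[str, str]]:
    """List active targets whose health is not 'up', with their last error."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    return [
        {
            "job": target.get("labels", {}).get("job", ""),
            "instance": target.get("labels", {}).get("instance", ""),
            "error": target.get("lastError", ""),
        }
        for target in active
        if target.get("health") != "up"
    ]


# Illustrative response, trimmed to the fields used here.
sample_targets = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "node-exporter", "instance": "node-exporter:9100"},
             "health": "up", "lastError": ""},
            {"labels": {"job": "docker-engine", "instance": "host.docker.internal:9323"},
             "health": "down", "lastError": "connection refused"},
        ]
    },
}
print(unhealthy_targets(sample_targets))
```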

Infrastructure

VPS Architecture — Overview
Security Stack — Caddy exposes Grafana

Workflows

Notification Hub — Alert routing

Reference

Glossary — Prometheus, PromQL, OTEL, scrape