---
title: 'AI Stack: Qdrant + CLI Ollama'
url: https://blog.guigpap.com/en/infrastructure/ai-stack/
url_md: https://blog.guigpap.com/en/infrastructure/ai-stack.md
category: infrastructure
date: '2026-01-20'
maturite: production
techno:
  - qdrant
  - claude
  - redis
application:
  - ai
  - infrastructure
---

# AI Stack: Qdrant + CLI Ollama

> Local AI infrastructure with a vector database, multi-model gateway, scoped-privilege profiles, and a whitelisted MCP gateway

## 1. What? — Definition and context

The **AI Stack** brings together five services that work in concert to serve every AI need on the infrastructure: a vector database for RAG, a multi-model Ollama-compatible gateway, session memory, and two MCP (Model Context Protocol) building blocks that expose N8N tools to models with strict access control.

### Components

| Service | Port | Role |
|---------|------|------|
| **Qdrant** | 6333 | Vector database (embeddings, semantic search) |
| **CLI Ollama** | 11434 | Ollama-compatible gateway towards Codex and Gemini |
| **Claude Redis** | 6379 | Conversation session memory (LRU 256 MB) |
| **MCP Gateway** | 3001 | MCP reverse proxy: tool whitelist + SSE/JSON negotiation |
| **N8N MCP** | 3000 | MCP bridge exposing the N8N API to models |

> **Note - Ollama-compatible**
>
> **CLI Ollama** re-implements the Ollama API (`/api/generate`, `/api/chat`, `/api/tags`) but does not run any model locally: every request is translated into a CLI call to upstream providers (OpenAI's Codex CLI, Google's Gemini CLI). It is a wrapper, not a runtime. N8N and other consumers can treat its API like that of a regular Ollama instance.

### Architecture diagram

```mermaid
flowchart TD
  subgraph N8N["N8N Workflows"]
    direction TB
    I1["AI Router · intent detection"]
    I2["Service Handler General · free chat"]
    I3["Conversation Agent · multi-turn sessions"]
    I4["RAG · future"]
  end

  subgraph AI["ai-internal"]
    direction TB
    Qdrant["Qdrant · :6333"]
    Ollama["CLI Ollama · :11434"]
    Redis["Claude Redis · sessions"]
    Sub["CLI subprocess · codex / gemini"]
    Ollama --> Redis
    Ollama --> Sub
  end

  subgraph MCPNet["mcp-backend (isolated)"]
    direction TB
    Gateway["MCP Gateway · :3001 · whitelist + Bearer"]
    NMCP["N8N MCP bridge · :3000"]
  end

  N8N --> AI
  Ollama -->|MCP JSON-RPC| Gateway
  Gateway -->|"20 whitelisted tools"| NMCP
  NMCP -->|"REST API key"| N8N
```

---

## 2. Why? — Stakes and motivations

### Goals of the AI Stack

| Goal | Solution |
|------|----------|
| **No vendor lock-in** | Ollama-compatible API, swappable providers (Codex / Gemini) |
| **Cross-provider sessions** | `SessionRecord` shares history across models, fallback through injection |
| **MCP tool control** | Gateway whitelist + Telegram confirmation on destructive actions |
| **Scoped profiles** | The same service exposes different permissions depending on the profile (`error-analyst` vs `n8n-admin`) |
| **Local RAG** | Self-hosted Qdrant, no leak of private documents to the cloud |

### Models available through CLI Ollama

| Virtual name | Provider | Real ID | Notes |
|--------------|----------|---------|-------|
| `codex-max` | OpenAI Codex | `gpt-5.1-codex-max` | Frontier, agentic, complex tasks |
| `codex` | OpenAI Codex | `gpt-5.1-codex` | Standard agentic |
| `codex-mini` | OpenAI Codex | `gpt-5.1-codex-mini` | Smaller, more economical |
| `gemini-flash` | Google Gemini | `gemini-2.5-flash` | Fast, low-cost |
| `gemini-pro` | Google Gemini | `gemini-2.5-pro` | Most capable on the Gemini side |
| `<model>-yolo` | All | — | Skip approval/confirmation flow |

> **Caution - YOLO mode**
>
> The `-yolo` suffix bypasses human approval. N8N calls coming from automated workflows **must** use a `-yolo` model (e.g. `codex-yolo`) — otherwise the CLI starts in interactive plan mode and the webhook times out at 120 s.
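
The name resolution implied by the table can be sketched in Python as follows. This is a minimal illustration, not the actual `config.py`: the `MODELS` dictionary, the `ResolvedModel` tuple, the `resolve_model` function and the `"codex"` / `"gemini"` provider labels are all hypothetical names chosen for the example.

```python
from typing import NamedTuple

# Virtual name -> (provider label, real model ID), copied from the table above.
# The provider labels are illustrative, not taken from the real service.
MODELS = {
    "codex-max": ("codex", "gpt-5.1-codex-max"),
    "codex": ("codex", "gpt-5.1-codex"),
    "codex-mini": ("codex", "gpt-5.1-codex-mini"),
    "gemini-flash": ("gemini", "gemini-2.5-flash"),
    "gemini-pro": ("gemini", "gemini-2.5-pro"),
}

YOLO_SUFFIX = "-yolo"


class ResolvedModel(NamedTuple):
    provider: str
    real_id: str
    yolo: bool  # True when the approval/confirmation flow is skipped


def resolve_model(virtual_name: str) -> ResolvedModel:
    """Strip an optional -yolo suffix, then look the base name up."""
    yolo = virtual_name.endswith(YOLO_SUFFIX)
    base = virtual_name[: -len(YOLO_SUFFIX)] if yolo else virtual_name
    if base not in MODELS:
        raise ValueError(f"unknown model: {virtual_name!r}")
    provider, real_id = MODELS[base]
    return ResolvedModel(provider, real_id, yolo)
```

With this sketch, `resolve_model("codex-yolo")` resolves to the `gpt-5.1-codex` model with `yolo` set, while an unknown name raises `ValueError` instead of silently falling back to a default.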

### Why Codex as the default provider?

| Criterion | Codex (`gpt-5.1`) | Claude | Gemini |
|-----------|-------------------|--------|--------|
| **Native agentic mode** | Yes | Yes (via tool use) | Limited |
| **Persistent sessions** | Native `thread_id` | Limited | Native `session_id` |
| **Cost for workflows** | Medium | High | Low |
| **`-yolo` (full-auto) models** | Yes (`--full-auto`) | No equivalent | Yes (`--yolo`) |

Codex covers the agentic needs (analysis, refactoring, code generation). Gemini serves as a fast, economical fallback. Claude is no longer served through CLI Ollama: it is still used directly through Claude Code on the developer workstation.

---

## 3. How? — Technical implementation

### Qdrant: collection management

```bash
# List collections
curl http://localhost:6333/collections | jq '.result.collections'

# Create a collection
curl -X PUT http://localhost:6333/collections/documents \
  -H "Content-Type: application/json" \
  -d '{
    "vectors": { "size": 1536, "distance": "Cosine" }
  }'

# Vector search (the elided vector must be a full 1536-dimension embedding)
curl -X POST http://localhost:6333/collections/documents/points/search \
  -H "Content-Type: application/json" \
  -d '{ "vector": [0.1, 0.2, ...], "limit": 5 }'
```
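
The commands above cover listing, creating and searching. Insertion goes through Qdrant's points endpoint (`PUT /collections/<name>/points`). The sketch below only builds and validates the request body. The helper name `build_upsert_body` and the sample payload are assumptions made for the example, and the zero vector stands in for a real embedding.

```python
import json
from typing import Any

VECTOR_SIZE = 1536  # same "size" as in the collection created above


def build_upsert_body(points: list[dict[str, Any]], size: int = VECTOR_SIZE) -> str:
    """Check every vector's dimension and return the JSON body for a points upsert."""
    for point in points:
        got = len(point["vector"])
        if got != size:
            raise ValueError(f"point {point['id']}: expected {size} dims, got {got}")
    return json.dumps({"points": points})


# Placeholder point: a real vector would come from the embedding model.
body = build_upsert_body(
    [{"id": 1, "vector": [0.0] * VECTOR_SIZE, "payload": {"source": "example.md"}}]
)
# The body would then be sent with:
#   PUT http://localhost:6333/collections/documents/points
```

Checking the dimension before sending catches a mismatch between the embedding model and the collection early, with a message that names the offending point.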

> **Tip - Embeddings**
>
> Qdrant does not generate embeddings; it only stores and searches them. Use the OpenAI API (`text-embedding-3-small`) or a local model to produce the vectors before insertion.

### Multi-provider sessions (#14)

CLI Ollama maintains a `SessionRecord` per `session_id` that follows the user across providers. If a conversation starts on `codex-max` and then switches to `gemini-pro`, the previous history is re-injected as a prompt prefix.

```mermaid
flowchart TD
  Req["Request · session_id, model"]
  Get["get_session_record() · loads SessionRecord"]
  Match{"Same model?"}
  Native["Native resume · Codex thread_id or Gemini session_id"]
  Inject["History injection · _inject_history()"]
  Exec["CLI execution"]
  Save["record_turn() · updates SessionRecord + current_model"]

  Req --> Get --> Match
  Match -->|yes| Native --> Exec
  Match -->|no, or no ID| Inject --> Exec
  Exec --> Save
```

`SessionRecord` fields: `session_id`, `current_provider`, `current_model`, `codex_thread_id`, `gemini_session_id`, `turn_count`, `total_tokens`. Stored in memory (no Redis persistence for the SessionRecords themselves — the message history lives in Claude Redis).
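
As a rough illustration of the decision in the diagram, the record and the "native resume or injection" choice might look like the sketch below. The field names come from the list above. `plan_turn`, `native_id`, the function signatures and the `"codex"` / `"gemini"` provider labels are assumptions for the example, not the actual implementation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SessionRecord:
    session_id: str
    current_provider: Optional[str] = None
    current_model: Optional[str] = None
    codex_thread_id: Optional[str] = None
    gemini_session_id: Optional[str] = None
    turn_count: int = 0
    total_tokens: int = 0


def native_id(record: SessionRecord, provider: str) -> Optional[str]:
    """Return the provider-native resume ID, if one was recorded."""
    if provider == "codex":
        return record.codex_thread_id
    if provider == "gemini":
        return record.gemini_session_id
    return None


def plan_turn(record: SessionRecord, provider: str, model: str) -> Tuple[str, Optional[str]]:
    """Choose 'native' when the model is unchanged and an ID exists, else 'inject'."""
    resume_id = native_id(record, provider)
    if record.current_model == model and resume_id is not None:
        return ("native", resume_id)
    return ("inject", None)


def record_turn(record: SessionRecord, provider: str, model: str,
                tokens: int, resume_id: Optional[str] = None) -> None:
    """Update counters, the current model and the provider-native ID after a turn."""
    record.current_provider = provider
    record.current_model = model
    record.turn_count += 1
    record.total_tokens += tokens
    if resume_id and provider == "codex":
        record.codex_thread_id = resume_id
    elif resume_id and provider == "gemini":
        record.gemini_session_id = resume_id
```

A first turn has no native ID, so it takes the injection branch with an empty history. Later turns on the same model resume natively, and a switch to another model falls back to injection.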

> **Note - MVP limits**
>
> Multi-provider handling runs only in **non-streaming** mode. Streaming paths do not record `turn` events (to be fixed). No TTL on SessionRecords and no cap on the size of the injected history.

### Profile system

CLI Ollama exposes several **YAML profiles** that scope the usable MCP tools and the allowed models. The same service therefore offers different permissions depending on the profile sent in the request.

| Profile | Allowed tools | Tools requiring approval | Knowledge base |
|---------|---------------|--------------------------|----------------|
| `error-analyst` | 5 read-only (`n8n_get_workflow`, `n8n_executions`, …) | none | DLQ format, workflow architecture |
| `n8n-admin` | 5 read + 2 write | `n8n_update_partial_workflow`, `n8n_update_full_workflow` | Admin guide |

> **Tip - Ceiling semantics**
>
> A profile defines the **ceiling** (`allowed_tools`). The request can **narrow** further through `mcp_config`, but never broaden beyond the ceiling. That is what allows the same endpoint to be used for workflows of very different risk levels.

The `.md` files listed in `knowledge_base` are injected into the system prompt on every request. The model therefore gets a stable context, and the caller does not have to resend it on each turn.
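
The ceiling semantics amount to a set intersection. The sketch below assumes `mcp_config` arrives as a JSON string with an `allowed_tools` key, as in the N8N call example further down. The function name `effective_tools` is hypothetical, and the ceiling used in the usage note lists only the two `error-analyst` tools named in the table.

```python
import json
from typing import FrozenSet, Optional


def effective_tools(ceiling: FrozenSet[str], mcp_config: Optional[str]) -> FrozenSet[str]:
    """Intersect the profile ceiling with the tools requested in mcp_config.

    A request can narrow the set but never add a tool outside the ceiling.
    """
    if not mcp_config:
        return ceiling  # no narrowing requested
    requested = json.loads(mcp_config).get("allowed_tools")
    if requested is None:
        return ceiling
    return ceiling & frozenset(requested)
```

For example, with a ceiling of `n8n_get_workflow` and `n8n_executions`, a request asking for `n8n_get_workflow` and `n8n_delete_workflow` ends up with `n8n_get_workflow` only: the write tool is dropped because it is outside the ceiling.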

### MCP Gateway (whitelist + transport negotiation)

`mcp-gateway` is an MCP reverse proxy interposed between `cli-ollama` and `n8n-mcp`. It plays three roles:

1. **Server-side tool whitelist** — blocks any tool not on the list before the request reaches `n8n-mcp`.
2. **Transport negotiation** — always asks for SSE upstream, then re-formats based on the client's `Accept` header (Claude CLI prefers JSON, Codex CLI prefers SSE/`rmcp`).
3. **Bearer auth + network isolation** — `mcp-backend` is not reachable from `ai-internal` other than through the gateway.

**Whitelisted tools (20)**: every required tool (`n8n_list_workflows`, `n8n_get_workflow`, `n8n_validate_workflow`, `tools_documentation`, `search_nodes`, `get_node`, `validate_node`, `search_templates`, `get_template`, `validate_workflow`, `n8n_executions`, `n8n_health_check`, `n8n_create_workflow`, `n8n_update_full_workflow`, `n8n_update_partial_workflow`, `n8n_delete_workflow`, `n8n_deploy_template`, `n8n_autofix_workflow`, `n8n_test_workflow`, `n8n_workflow_versions`).

**Critical tools (7)** — pass the whitelist but require Telegram confirmation before execution:
`n8n_create_workflow`, `n8n_update_full_workflow`, `n8n_update_partial_workflow`, `n8n_delete_workflow`, `n8n_deploy_template`, `n8n_autofix_workflow`, `n8n_test_workflow`. The confirmation is handled by `mcp_confirmation.py` on the `cli-ollama` side, which posts a webhook to the `MCP Confirmation Handler` workflow.
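
The two lists can be combined into a single three-way decision. In the real stack the whitelist is enforced by the gateway and the confirmation by `cli-ollama`, so the single `classify` function below is only an illustrative simplification, and its name and return values are assumptions.

```python
WHITELIST = frozenset({
    "n8n_list_workflows", "n8n_get_workflow", "n8n_validate_workflow",
    "tools_documentation", "search_nodes", "get_node", "validate_node",
    "search_templates", "get_template", "validate_workflow",
    "n8n_executions", "n8n_health_check", "n8n_create_workflow",
    "n8n_update_full_workflow", "n8n_update_partial_workflow",
    "n8n_delete_workflow", "n8n_deploy_template", "n8n_autofix_workflow",
    "n8n_test_workflow", "n8n_workflow_versions",
})

CRITICAL = frozenset({
    "n8n_create_workflow", "n8n_update_full_workflow",
    "n8n_update_partial_workflow", "n8n_delete_workflow",
    "n8n_deploy_template", "n8n_autofix_workflow", "n8n_test_workflow",
})


def classify(tool: str) -> str:
    """Return 'block', 'confirm' or 'allow' for a requested tool name."""
    if tool not in WHITELIST:
        return "block"    # never reaches n8n-mcp
    if tool in CRITICAL:
        return "confirm"  # needs Telegram confirmation first
    return "allow"
```

Keeping `CRITICAL` a strict subset of `WHITELIST` means a tool removed from the whitelist is blocked outright, whatever its confirmation status.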

### N8N MCP bridge

`n8n-mcp` is the MCP server that actually exposes the N8N tools. It authenticates against the N8N REST API through an API key (`N8N_MCP_API_KEY`), and requires a Bearer token (`N8N_MCP_AUTH_TOKEN`) on the MCP client side.

```bash
# Test the full chain from the host (via the gateway)
curl -X POST http://127.0.0.1:3001/mcp \
  -H "Authorization: Bearer ${N8N_MCP_AUTH_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}'
```
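
Because the gateway always requests SSE upstream, a JSON-RPC response may need to be unwrapped from SSE framing before it is returned as plain JSON. The sketch below shows that re-formatting step under simple assumptions: a single event per response, and hypothetical helper names `pick_format` and `sse_to_json`.

```python
import json


def pick_format(accept_header: str) -> str:
    """Return 'sse' when the client accepts event streams, else 'json'."""
    return "sse" if "text/event-stream" in accept_header.lower() else "json"


def sse_to_json(body: str) -> dict:
    """Decode the data lines of the first SSE event as one JSON document."""
    data_lines = []
    for line in body.splitlines():
        if line.startswith("data:"):
            value = line[len("data:"):]
            # Per the SSE format, one leading space after the colon is dropped.
            data_lines.append(value[1:] if value.startswith(" ") else value)
        elif line == "" and data_lines:
            break  # a blank line terminates the event
    if not data_lines:
        raise ValueError("no data field found in SSE body")
    return json.loads("\n".join(data_lines))
```

A client that sends `Accept: application/json` would get the decoded object re-serialised as JSON, while a client that accepts `text/event-stream` could receive the upstream body unchanged.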

> **Caution - Claude CLI --print mode**
>
> Claude CLI's `--print` (headless) mode **does not load** the HTTP servers listed in `.mcp.json`. The MCP configuration is set at runtime through `claude mcp add --scope user` in the container's `entrypoint.sh`, which writes to `.claude.json`. The server is registered as `n8n-local` (tool prefix: `mcp__n8n-local__`).

### Approval workflow (non-yolo mode)

```mermaid
flowchart TD
  Req(["Request · non-YOLO"])

  subgraph Ollama["CLI Ollama"]
    direction TB
    L1["Detects tool_use or question"]
    L2["N8N webhook + Pending state"]
    L3["Wait for approval · 5 min timeout"]
  end

  subgraph Tg["N8N Telegram Orchestrator"]
    direction TB
    T1["Render plan or question"]
    T2["Inline keyboard buttons"]
  end

  Decision{Decision}
  Exec["Actual execution"]
  Reject["'Rejected' response"]
  Result["Return result"]

  Req --> Ollama --> Tg --> Decision
  Decision -->|Approved| Exec
  Decision -->|Rejected/Expired| Reject
  Exec --> Result
```
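
The decision step of this flow can be modelled as a small pure function. This is a sketch under assumptions: the `PendingApproval` class, the `resolve` function and the decision strings are hypothetical. Only the 5-minute timeout and the rule that expired requests follow the rejection branch come from the diagram.

```python
from dataclasses import dataclass
from typing import Optional

APPROVAL_TIMEOUT_S = 5 * 60  # 5 min timeout from the diagram above


@dataclass
class PendingApproval:
    request_id: str
    created_at: float  # seconds, e.g. taken from time.monotonic()


def resolve(pending: PendingApproval, decision: Optional[str], now: float) -> str:
    """Return 'execute', 'rejected' or 'pending' for an approval request."""
    if now - pending.created_at > APPROVAL_TIMEOUT_S:
        return "rejected"  # expired requests take the same branch as rejections
    if decision is None:
        return "pending"   # still waiting for a button press
    return "execute" if decision == "approved" else "rejected"
```

Passing `now` explicitly keeps the function deterministic: the timeout can be tested without sleeping, and a late approval is treated as expired.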

Pluggable hooks (`pre_tool_use`, `post_tool_use`, `on_response`, `on_error`) allow cross-cutting rules to be added without touching the providers.
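
A hook mechanism of this kind is often a small registry keyed by hook point. The sketch below uses the four hook names from the text. The `HookRegistry` class and its `register` / `run` methods are hypothetical and only illustrate the idea.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List

HOOK_POINTS = ("pre_tool_use", "post_tool_use", "on_response", "on_error")


class HookRegistry:
    """Keeps an ordered list of callbacks for each hook point."""

    def __init__(self) -> None:
        self._hooks: DefaultDict[str, List[Callable[..., Any]]] = defaultdict(list)

    def register(self, point: str, fn: Callable[..., Any]) -> None:
        if point not in HOOK_POINTS:
            raise ValueError(f"unknown hook point: {point!r}")
        self._hooks[point].append(fn)

    def run(self, point: str, **context: Any) -> List[Any]:
        """Call every callback registered for `point`, in registration order."""
        if point not in HOOK_POINTS:
            raise ValueError(f"unknown hook point: {point!r}")
        return [fn(**context) for fn in self._hooks[point]]
```

An audit rule, for instance, could register a `pre_tool_use` callback that logs the tool name. The provider code would only call `run("pre_tool_use", tool=...)` and would not need to know which rules are installed.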

### Calling from N8N

```javascript
// Intent detection (non-streaming, no approval)
{
  "url": "http://cli-ollama:11434/api/generate",
  "method": "POST",
  "body": {
    "model": "codex-yolo",
    "prompt": "Analyse this Telegram message and return a JSON …",
    "stream": false
  }
}

// With profile and restricted MCP tools
{
  "model": "codex-yolo",
  "prompt": "Inspect workflow X and propose a fix",
  "profile": "error-analyst",
  "mcp_config": "{\"allowed_tools\": [\"n8n_get_workflow\", \"n8n_executions\"]}"
}
```
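
Note that `mcp_config` in the second example is a JSON document encoded as a string inside the request body. A client written in Python could build it as sketched below. The helper name `build_generate_body` is hypothetical, and setting `stream` to `False` on every call is an assumption based on multi-provider handling being non-streaming only.

```python
import json
from typing import Any, Dict, Iterable, Optional


def build_generate_body(model: str, prompt: str,
                        profile: Optional[str] = None,
                        allowed_tools: Optional[Iterable[str]] = None) -> Dict[str, Any]:
    """Build an /api/generate body; mcp_config is serialised as a nested JSON string."""
    body: Dict[str, Any] = {"model": model, "prompt": prompt, "stream": False}
    if profile is not None:
        body["profile"] = profile
    if allowed_tools is not None:
        # Double encoding: a JSON string inside the JSON request body.
        body["mcp_config"] = json.dumps({"allowed_tools": list(allowed_tools)})
    return body


request_body = build_generate_body(
    "codex-yolo",
    "Inspect workflow X and propose a fix",
    profile="error-analyst",
    allowed_tools=["n8n_get_workflow", "n8n_executions"],
)
```

Serialising `mcp_config` with `json.dumps` avoids hand-written escaping such as `\"allowed_tools\"`, which is easy to get wrong in workflow expressions.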

### System resources

| Service | Memory limit | CPU |
|---------|--------------|-----|
| Qdrant | 4 GB | 2 vCPU |
| CLI Ollama | 4 GB | 2 vCPU |
| Claude Redis | 512 MB | — |
| MCP Gateway | 128 MB | — |
| N8N MCP | 256 MB | — |

> **Note - No GPU**
>
> This stack does not run any model locally. ML workloads run in the cloud (Codex / Gemini), and Qdrant performs CPU-only vector computation. The VPS does not need a GPU.

---

## 4. What if? — Outlook and limits

### Current limits

| Limit | Impact | Mitigation |
|-------|--------|------------|
| **Sparsely populated Qdrant** | No production RAG today | Progressive import from the Obsidian vault |
| **Embeddings via external API** | OpenAI dependency for vectorisation | Local embedding model is conceivable |
| **No streaming SessionRecord** | Streaming conversations do not benefit from session history | Roadmap: record streaming chunks in memory |
| **In-memory SessionRecord** | Lost on container restart | Redis persistence planned post-MVP |
| **Codex / Gemini cloud** | Cost + external dependency | Redis caching for intent detection, `-yolo` models for hot paths |

### Evolution scenarios

**If RAG goes to production**:
- N8N worker that ingests the Obsidian vault into a dedicated Qdrant collection.
- `obsidian-rag` profile on the CLI Ollama side with a knowledge base + search tools.
- Pattern: embeddings via OpenAI, Qdrant search, prompt enriched on the CLI Ollama side.

**If I want to add another provider**:
- Implement `ProviderProtocol` in `services/providers/<name>.py`.
- Add the virtual-name → real-model mapping in `config.py`.
- The SessionRecord and hooks work automatically (unified `ExecutionMessage` interface).

**If API costs grow**:
- Extend Claude Redis caching to frequent answers (notably intent detection).
- Route simple requests to `gemini-flash` (the cheapest) through the AI Router.
- Enable a per-session_id rate limit on the `cli-ollama` side.

**If MCP security must be tightened**:
- Reduce the gateway whitelist to the 5 read-only tools only.
- Add use-case-specific profiles (instead of a permissive `n8n-admin`).
- Extend Telegram confirmation to more tools (currently 7 critical out of 20).

### Troubleshooting commands

```bash
# Health checks
curl http://localhost:6333/healthz                  # Qdrant
curl http://localhost:11434/api/tags                # CLI Ollama
docker logs cli-ollama --tail 100

# Pending approvals
curl http://cli-ollama:11434/api/approvals
curl http://cli-ollama:11434/api/questions

# Test the MCP gateway → n8n-mcp chain
curl -s -X POST http://127.0.0.1:3001/mcp \
  -H "Authorization: Bearer $N8N_MCP_AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}' | jq '.result.tools[].name'

# Verify network isolation
docker network inspect mcp-backend | jq '.[0].Containers'
```

---

## Related pages

### Infrastructure
- [VPS Architecture](/en/infrastructure/architecture-vps/) — Overview
- [N8N in queue mode](/en/infrastructure/n8n-queue-mode/) — Consumer workers

### Workflows
- [Conversational system](/en/workflows/systeme-conversationnel/) — Multi-turn sessions through CLI Ollama
- [Telegram Orchestrator](/en/workflows/telegram-orchestrator/) — AI Router and MCP confirmation
- [Notification Hub](/en/workflows/notification-hub/) — AI alert routing

### Reference
- [Glossary](/en/reference/glossary/) — RAG, embeddings, vector database, MCP, LLM

## Agent metadata

- This article comes from the GuiGPaP Lab blog.
- Global blog context: https://blog.guigpap.com/llms.txt
- Author contact: https://odoo.guigpap.com/mon-cv
- License: CC-BY-SA 4.0