---
title: 'AI Stack: Qdrant + CLI Ollama'
url: https://blog.guigpap.com/en/infrastructure/ai-stack/
url_md: https://blog.guigpap.com/en/infrastructure/ai-stack.md
category: infrastructure
date: '2026-01-20'
maturite: production
techno:
  - qdrant
  - claude
  - redis
application:
  - ai
  - infrastructure
---

# AI Stack: Qdrant + CLI Ollama

> Local AI infrastructure with a vector database, multi-model gateway, scoped-privilege profiles, and a whitelisted MCP gateway

## 1. What? - Definition and context

The **AI Stack** brings together five services that work in concert to serve every AI need on the infrastructure: a vector database for RAG, a multi-model Ollama-compatible gateway, session memory, and two MCP (Model Context Protocol) building blocks that expose N8N tools to models with strict access control.

### Components

| Service | Port | Role |
|---------|------|------|
| **Qdrant** | 6333 | Vector database (embeddings, semantic search) |
| **CLI Ollama** | 11434 | Ollama-compatible gateway towards Codex and Gemini |
| **Claude Redis** | 6379 | Conversation session memory (LRU 256 MB) |
| **MCP Gateway** | 3001 | MCP reverse proxy: tool whitelist + SSE/JSON negotiation |
| **N8N MCP** | 3000 | MCP bridge exposing the N8N API to models |

> **Note - Ollama-compatible**
>
> **CLI Ollama** re-implements the Ollama API (`/api/generate`, `/api/chat`, `/api/tags`) but does not run any model locally: every request is translated into a CLI call to upstream providers (OpenAI's Codex CLI, Google's Gemini CLI). It is a wrapper, not a runtime. It exposes an API that N8N and other consumers treat like a regular Ollama.

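
Because the gateway speaks the Ollama API, any Ollama-style client should be able to talk to it. The sketch below is a minimal illustration using only the Python standard library. It assumes the gateway is reachable on `localhost:11434` and that non-streaming replies carry the text under a `response` key, as a stock Ollama server does. The helper names `build_generate_request` and `generate` are hypothetical and not part of CLI Ollama.

```python
import json
import urllib.request

# Assumption: the gateway is published on the host port listed in the table above.
BASE_URL = "http://localhost:11434"


def build_generate_request(model: str, prompt: str,
                           base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a non-streaming POST for the Ollama-style /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the reply text (empty string if absent)."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    # Assumed to mirror stock Ollama, which returns the text under "response".
    return body.get("response", "")
```

Calling `generate("codex-yolo", "ping")` would exercise the full path through the gateway. Building the request separately from sending it keeps the payload easy to inspect without a running service.
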
### Architecture diagram

```mermaid
flowchart TD
    subgraph N8N["N8N Workflows"]
        direction TB
        I1["AI Router · intent detection"]
        I2["Service Handler General · free chat"]
        I3["Conversation Agent · multi-turn sessions"]
        I4["RAG · future"]
    end
    subgraph AI["ai-internal"]
        direction TB
        Qdrant["Qdrant · :6333"]
        Ollama["CLI Ollama · :11434"]
        Redis["Claude Redis · sessions"]
        Sub["CLI subprocess · codex / gemini"]
        Ollama --> Redis
        Ollama --> Sub
    end
    subgraph MCPNet["mcp-backend (isolated)"]
        direction TB
        Gateway["MCP Gateway · :3001 · whitelist + Bearer"]
        NMCP["N8N MCP bridge · :3000"]
    end
    N8N --> AI
    Ollama -->|MCP JSON-RPC| Gateway
    Gateway -->|"20 whitelisted tools"| NMCP
    NMCP -->|"REST API key"| N8N
```

---

## 2. Why? - Stakes and motivations

### Goals of the AI Stack

| Goal | Solution |
|------|----------|
| **No vendor lock-in** | Ollama-compatible API, swappable providers (Codex / Gemini) |
| **Cross-provider sessions** | `SessionRecord` shares history across models, fallback through injection |
| **MCP tool control** | Gateway whitelist + Telegram confirmation on destructive actions |
| **Scoped profiles** | The same service exposes different permissions depending on the profile (`error-analyst` vs `n8n-admin`) |
| **Local RAG** | Self-hosted Qdrant, no leak of private documents to the cloud |

### Models available through CLI Ollama

| Virtual name | Provider | Real ID | Notes |
|--------------|----------|---------|-------|
| `codex-max` | OpenAI Codex | `gpt-5.1-codex-max` | Frontier, agentic, complex tasks |
| `codex` | OpenAI Codex | `gpt-5.1-codex` | Standard agentic |
| `codex-mini` | OpenAI Codex | `gpt-5.1-codex-mini` | Smaller, more economical |
| `gemini-flash` | Google Gemini | `gemini-2.5-flash` | Fast, low-cost |
| `gemini-pro` | Google Gemini | `gemini-2.5-pro` | Most capable on the Gemini side |
| `-yolo` | All | - | Skip approval/confirmation flow |

> **Caution - YOLO mode**
>
> The `-yolo` suffix bypasses human approval.
> N8N calls coming from automated workflows **must** use a `-yolo` model (e.g. `codex-yolo`); otherwise the CLI starts in interactive plan mode and the webhook times out at 120 s.

### Why Codex as the default provider?

| Criterion | Codex (`gpt-5.1`) | Claude | Gemini |
|-----------|-------------------|--------|--------|
| **Native agentic mode** | Yes | Yes (via tool use) | Limited |
| **Persistent sessions** | Native `thread_id` | Limited | Native `session_id` |
| **Cost for workflows** | Medium | High | Low |
| **`-yolo` (full-auto) models** | Yes (`--full-auto`) | No equivalent | Yes (`--yolo`) |

Codex covers the agentic needs (analysis, refactoring, code generation). Gemini serves as a fast, economical fallback. Claude is no longer served through CLI Ollama: it is still used directly through Claude Code on the developer workstation.

---

## 3. How? - Technical implementation

### Qdrant: collection management

```bash
# List collections
curl -s http://localhost:6333/collections | jq '.result.collections'

# Create a collection (1536-dimension vectors, cosine distance)
curl -X PUT http://localhost:6333/collections/documents \
  -H "Content-Type: application/json" \
  -d '{
    "vectors": {
      "size": 1536,
      "distance": "Cosine"
    }
  }'

# Vector search (replace the elided values with a full 1536-dimension query vector)
curl -X POST http://localhost:6333/collections/documents/points/search \
  -H "Content-Type: application/json" \
  -d '{
    "vector": [0.1, 0.2, ...],
    "limit": 5
  }'
```

> **Tip - Embeddings**
>
> Qdrant does not generate embeddings; it stores and searches them. Use the OpenAI API (`text-embedding-3-small`) or a local model to produce the vectors before insertion.

### Multi-provider sessions (#14)

CLI Ollama maintains a `SessionRecord` per `session_id` that follows the user across providers. If a conversation starts on `codex-max` and then switches to `gemini-pro`, the previous history is re-injected as a prompt prefix.

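
The resume-or-inject decision can be sketched in a few lines. This is an illustrative reconstruction, not the service's actual code: the `SessionRecord` fields follow the names given in this article, while `native_resume_id`, `plan_turn`, and the `role: text` history format are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SessionRecord:
    """Per-session state; field names follow the article, types are assumed."""
    session_id: str
    current_provider: Optional[str] = None
    current_model: Optional[str] = None
    codex_thread_id: Optional[str] = None
    gemini_session_id: Optional[str] = None
    turn_count: int = 0
    total_tokens: int = 0


def native_resume_id(record: SessionRecord, provider: str) -> Optional[str]:
    """Return the provider-native conversation ID, if one was recorded."""
    if provider == "codex":
        return record.codex_thread_id
    if provider == "gemini":
        return record.gemini_session_id
    return None


def plan_turn(record: SessionRecord, provider: str, model: str,
              history: List[Tuple[str, str]],
              prompt: str) -> Tuple[str, str, Optional[str]]:
    """Decide between native resume and history injection.

    Returns (mode, prompt_to_send, resume_id).
    """
    resume_id = native_resume_id(record, provider)
    if record.current_model == model and resume_id:
        # Same model and a native ID exists: let the provider resume the thread.
        return "native", prompt, resume_id
    # Model changed or no native ID: replay the history as a prompt prefix.
    prefix = "\n".join(f"{role}: {text}" for role, text in history)
    full_prompt = f"{prefix}\n\n{prompt}" if prefix else prompt
    return "inject", full_prompt, None
```

Returning the decision as a tuple keeps the routing logic testable without invoking any CLI.
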
```mermaid
flowchart TD
    Req["Request · session_id, model"]
    Get["get_session_record() · loads SessionRecord"]
    Match{"Same model?"}
    Native["Native resume · Codex thread_id or Gemini session_id"]
    Inject["History injection · _inject_history()"]
    Exec["CLI execution"]
    Save["record_turn() · updates SessionRecord + current_model"]
    Req --> Get --> Match
    Match -->|yes| Native --> Exec
    Match -->|no, or no ID| Inject --> Exec
    Exec --> Save
```

`SessionRecord` fields: `session_id`, `current_provider`, `current_model`, `codex_thread_id`, `gemini_session_id`, `turn_count`, `total_tokens`. The records are stored in memory only: there is no Redis persistence for the SessionRecords themselves, whereas the message history lives in Claude Redis.

> **Note - MVP limits**
>
> Multi-provider handling runs only in **non-streaming** mode. Streaming paths do not record `turn` events (to be fixed). There is no TTL on SessionRecords and no cap on the size of the injected history.

### Profile system

CLI Ollama exposes several **YAML profiles** that scope the usable MCP tools and the allowed models. The same service therefore offers different permissions depending on the profile sent in the request.

| Profile | Allowed tools | Tools requiring approval | Knowledge base |
|---------|---------------|--------------------------|----------------|
| `error-analyst` | 5 read-only (`n8n_get_workflow`, `n8n_executions`, …) | none | DLQ format, workflow architecture |
| `n8n-admin` | 5 read + 2 write | `n8n_update_partial_workflow`, `n8n_update_full_workflow` | Admin guide |

> **Tip - Ceiling semantics**
>
> A profile defines the **ceiling** (`allowed_tools`). The request can **narrow** further through `mcp_config`, but never broaden beyond the ceiling. That is what allows the same endpoint to be used for workflows of very different risk levels.

The `.md` files listed in `knowledge_base` are injected into the system prompt on every request, giving the model a stable context without resending it every turn.

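
The ceiling rule amounts to a set intersection. The sketch below is a hedged illustration, assuming `mcp_config` arrives as a JSON-encoded string with an optional `allowed_tools` list; the function name `effective_tools` is hypothetical and the real service may resolve permissions differently.

```python
import json
from typing import Iterable, Optional, Set


def effective_tools(ceiling: Iterable[str], mcp_config: Optional[str]) -> Set[str]:
    """Intersect the profile ceiling with the tools requested in mcp_config.

    A request can only narrow the ceiling: tools outside it are silently dropped.
    """
    allowed = set(ceiling)
    if not mcp_config:
        return allowed  # No narrowing requested: the full ceiling applies.
    requested = json.loads(mcp_config).get("allowed_tools")
    if requested is None:
        return allowed
    return allowed & set(requested)


# Illustration with the two read-only tools named in the table above.
ceiling = {"n8n_get_workflow", "n8n_executions"}
request_config = '{"allowed_tools": ["n8n_get_workflow", "n8n_delete_workflow"]}'
granted = effective_tools(ceiling, request_config)
```

Here `n8n_delete_workflow` is requested but sits outside the ceiling, so only `n8n_get_workflow` is granted.
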
### MCP Gateway (whitelist + transport negotiation)

`mcp-gateway` is an MCP reverse proxy interposed between `cli-ollama` and `n8n-mcp`. It plays three roles:

1. **Server-side tool whitelist**: blocks any tool not on the list before the request reaches `n8n-mcp`.
2. **Transport negotiation**: always asks for SSE upstream, then re-formats based on the client's `Accept` header (Claude CLI prefers JSON, Codex CLI prefers SSE/`rmcp`).
3. **Bearer auth + network isolation**: `mcp-backend` is not reachable from `ai-internal` other than through the gateway.

**Whitelisted tools (20)**: every required tool (`n8n_list_workflows`, `n8n_get_workflow`, `n8n_validate_workflow`, `tools_documentation`, `search_nodes`, `get_node`, `validate_node`, `search_templates`, `get_template`, `validate_workflow`, `n8n_executions`, `n8n_health_check`, `n8n_create_workflow`, `n8n_update_full_workflow`, `n8n_update_partial_workflow`, `n8n_delete_workflow`, `n8n_deploy_template`, `n8n_autofix_workflow`, `n8n_test_workflow`, `n8n_workflow_versions`).

**Critical tools (7)**: these pass the whitelist but require Telegram confirmation before execution: `n8n_create_workflow`, `n8n_update_full_workflow`, `n8n_update_partial_workflow`, `n8n_delete_workflow`, `n8n_deploy_template`, `n8n_autofix_workflow`, `n8n_test_workflow`. The confirmation is handled by `mcp_confirmation.py` on the `cli-ollama` side, which posts a webhook to the `MCP Confirmation Handler` workflow.

### N8N MCP bridge

`n8n-mcp` is the MCP server that actually exposes the N8N tools. It authenticates against the N8N REST API through an API key (`N8N_MCP_API_KEY`), and requires a Bearer token (`N8N_MCP_AUTH_TOKEN`) on the MCP client side.

```bash
# Test the full chain from the host (via the gateway)
curl -X POST http://127.0.0.1:3001/mcp \
  -H "Authorization: Bearer ${N8N_MCP_AUTH_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}'
```

> **Caution - Claude CLI --print mode**
>
> Claude CLI's `--print` (headless) mode **does not load** the HTTP servers listed in `.mcp.json`. The MCP configuration is set at runtime through `claude mcp add --scope user` in the container's `entrypoint.sh`, which writes to `.claude.json`. The server is registered as `n8n-local` (tool prefix: `mcp__n8n-local__`).

### Approval workflow (non-yolo mode)

```mermaid
flowchart TD
    Req(["Request · non-YOLO"])
    subgraph Ollama["CLI Ollama"]
        direction TB
        L1["Detects tool_use or question"]
        L2["N8N webhook + Pending state"]
        L3["Wait for approval · 5 min timeout"]
    end
    subgraph Tg["N8N Telegram Orchestrator"]
        direction TB
        T1["Render plan or question"]
        T2["Inline keyboard buttons"]
    end
    Decision{Decision}
    Exec["Actual execution"]
    Reject["'Rejected' response"]
    Result["Return result"]
    Req --> Ollama --> Tg --> Decision
    Decision -->|Approved| Exec
    Decision -->|Rejected/Expired| Reject
    Exec --> Result
```

Pluggable hooks (`pre_tool_use`, `post_tool_use`, `on_response`, `on_error`) allow cross-cutting rules to be added without touching the providers.

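
One way to picture the hook mechanism is a small registry keyed by event name. The sketch below is purely illustrative: only the four event names come from this article, while `HookRegistry`, `register`, `fire`, and the shape of the context dictionary are assumptions.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

# Event names taken from the article; everything else here is hypothetical.
HOOK_EVENTS = ("pre_tool_use", "post_tool_use", "on_response", "on_error")

Hook = Callable[[Dict[str, Any]], None]


class HookRegistry:
    """Maps each lifecycle event to an ordered list of callbacks."""

    def __init__(self) -> None:
        self._hooks: Dict[str, List[Hook]] = defaultdict(list)

    def register(self, event: str, hook: Hook) -> None:
        """Attach a callback to one of the known events."""
        if event not in HOOK_EVENTS:
            raise ValueError(f"unknown hook event: {event}")
        self._hooks[event].append(hook)

    def fire(self, event: str, context: Dict[str, Any]) -> Dict[str, Any]:
        """Run every callback registered for the event, in registration order."""
        for hook in self._hooks[event]:
            hook(context)
        return context


# Example: an audit trail added without modifying any provider code.
audit_log: List[str] = []
registry = HookRegistry()
registry.register("pre_tool_use", lambda ctx: audit_log.append(ctx["tool"]))
registry.fire("pre_tool_use", {"tool": "n8n_get_workflow"})
```

Because providers would only ever call `fire`, a new rule such as logging or rate limiting is a single `register` call.
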
### Calling from N8N

```javascript
// Intent detection (non-streaming, no approval)
const intentDetectionRequest = {
  "url": "http://cli-ollama:11434/api/generate",
  "method": "POST",
  "body": {
    "model": "codex-yolo",
    "prompt": "Analyse this Telegram message and return a JSON …",
    "stream": false
  }
};

// With profile and restricted MCP tools (request body only)
const scopedToolsBody = {
  "model": "codex-yolo",
  "prompt": "Inspect workflow X and propose a fix",
  "profile": "error-analyst",
  // mcp_config is passed as a JSON-encoded string, not a nested object
  "mcp_config": "{\"allowed_tools\": [\"n8n_get_workflow\", \"n8n_executions\"]}"
};
```

### System resources

| Service | Memory limit | CPU |
|---------|--------------|-----|
| Qdrant | 4 GB | 2 vCPU |
| CLI Ollama | 4 GB | 2 vCPU |
| Claude Redis | 512 MB | - |
| MCP Gateway | 128 MB | - |
| N8N MCP | 256 MB | - |

> **Note - No GPU**
>
> This stack does not run any model locally. ML workloads run in the cloud (Codex / Gemini); Qdrant performs CPU-only vector computation. The VPS does not need a GPU.

---

## 4. What if? - Outlook and limits

### Current limits

| Limit | Impact | Mitigation |
|-------|--------|------------|
| **Sparsely populated Qdrant** | No production RAG today | Progressive import from the Obsidian vault |
| **Embeddings via external API** | OpenAI dependency for vectorisation | A local embedding model is conceivable |
| **No streaming SessionRecord** | Streaming conversations do not benefit from history | Roadmap: record streaming chunks in memory |
| **In-memory SessionRecord** | Lost on container restart | Redis persistence planned post-MVP |
| **Codex / Gemini cloud** | Cost + external dependency | Redis caching for intent detection, `-yolo` models for hot paths |

### Evolution scenarios

**If RAG goes to production**:

- N8N worker that ingests the Obsidian vault into a dedicated Qdrant collection.
- `obsidian-rag` profile on the CLI Ollama side with a knowledge base + search tools.
- Pattern: embeddings via OpenAI, Qdrant search, prompt enriched on the CLI Ollama side.

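
The search and enrichment steps of that RAG pattern could look roughly like the sketch below. It is a sketch under stated assumptions: the embedding call is omitted, the search body uses Qdrant's documented `vector`, `limit`, and `with_payload` fields, and the payload key `text` plus the helper names `build_search_body` and `enrich_prompt` are hypothetical.

```python
from typing import Any, Dict, List, Sequence


def build_search_body(vector: Sequence[float], limit: int = 5) -> Dict[str, Any]:
    """Body for POST /collections/<name>/points/search, asking for payloads."""
    return {"vector": list(vector), "limit": limit, "with_payload": True}


def enrich_prompt(question: str, hits: List[Dict[str, Any]],
                  text_key: str = "text") -> str:
    """Prefix the question with the passages found in the search hits.

    Assumes each hit stores its source passage under payload[text_key];
    hits without that key are skipped.
    """
    passages = [
        hit["payload"][text_key]
        for hit in hits
        if text_key in (hit.get("payload") or {})
    ]
    if not passages:
        return question  # Nothing retrieved: send the question unchanged.
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return f"Context:\n{context}\n\nQuestion: {question}"
```

The enriched string would then be sent as the `prompt` of a regular `/api/generate` call, so the model never needs direct access to Qdrant.
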

**If I want to add another provider**:

- Implement `ProviderProtocol` in `services/providers/.py`.
- Add the virtual-name → real-model mapping in `config.py`.
- The SessionRecord and hooks work automatically (unified `ExecutionMessage` interface).

**If API costs grow**:

- Extend Claude Redis caching to frequent answers (notably intent detection).
- Route simple requests to `gemini-flash` (the cheapest) through the AI Router.
- Enable a per-`session_id` rate limit on the `cli-ollama` side.

**If MCP security must be tightened**:

- Reduce the gateway whitelist to the 5 read-only tools only.
- Add use-case-specific profiles (instead of a permissive `n8n-admin`).
- Extend Telegram confirmation to more tools (currently 7 critical out of 20).

### Troubleshooting commands

```bash
# Health checks
curl http://localhost:6333/healthz     # Qdrant
curl http://localhost:11434/api/tags   # CLI Ollama
docker logs cli-ollama --tail 100

# Pending approvals
curl http://cli-ollama:11434/api/approvals
curl http://cli-ollama:11434/api/questions

# Test the MCP gateway → n8n-mcp chain
curl -s -X POST http://127.0.0.1:3001/mcp \
  -H "Authorization: Bearer $N8N_MCP_AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}' | jq '.result.tools[].name'

# Verify network isolation
docker network inspect mcp-backend | jq '.[0].Containers'
```

---

## Related pages

### Infrastructure

- [VPS Architecture](/en/infrastructure/architecture-vps/): Overview
- [N8N in queue mode](/en/infrastructure/n8n-queue-mode/): Consumer workers

### Workflows

- [Conversational system](/en/workflows/systeme-conversationnel/): Multi-turn sessions through CLI Ollama
- [Telegram Orchestrator](/en/workflows/telegram-orchestrator/): AI Router and MCP confirmation
- [Notification Hub](/en/workflows/notification-hub/): AI alert routing

### Reference

- [Glossary](/en/reference/glossary/): RAG, embeddings, vector database, MCP, LLM

## Agent metadata

- This article comes from the GuiGPaP Lab blog.
- Blog-wide context: https://blog.guigpap.com/llms.txt
- Author contact: https://odoo.guigpap.com/mon-cv
- Licence: CC-BY-SA 4.0