AI Stack: Qdrant + CLI Ollama
1. What? — Definition and context
The AI Stack brings together five services that work in concert to serve every AI need on the infrastructure: a vector database for RAG, a multi-model Ollama-compatible gateway, session memory, and two MCP (Model Context Protocol) building blocks that expose N8N tools to models with strict access control.
Components
| Service | Port | Role |
|---|---|---|
| Qdrant | 6333 | Vector database (embeddings, semantic search) |
| CLI Ollama | 11434 | Ollama-compatible gateway towards Codex and Gemini |
| Claude Redis | 6379 | Conversation session memory (LRU 256 MB) |
| MCP Gateway | 3001 | MCP reverse proxy: tool whitelist + SSE/JSON negotiation |
| N8N MCP | 3000 | MCP bridge exposing the N8N API to models |
Architecture diagram
2. Why? — Stakes and motivations
Goals of the AI Stack
| Goal | Solution |
|---|---|
| No vendor lock-in | Ollama-compatible API, swappable providers (Codex / Gemini) |
| Cross-provider sessions | SessionRecord shares history across models, fallback through injection |
| MCP tool control | Gateway whitelist + Telegram confirmation on destructive actions |
| Scoped profiles | The same service exposes different permissions depending on the profile (error-analyst vs n8n-admin) |
| Local RAG | Self-hosted Qdrant, no leak of private documents to the cloud |
Models available through CLI Ollama
| Virtual name | Provider | Real ID | Notes |
|---|---|---|---|
| codex-max | OpenAI Codex | gpt-5.1-codex-max | Frontier, agentic, complex tasks |
| codex | OpenAI Codex | gpt-5.1-codex | Standard agentic |
| codex-mini | OpenAI Codex | gpt-5.1-codex-mini | Smaller, more economical |
| gemini-flash | Google Gemini | gemini-2.5-flash | Fast, low-cost |
| gemini-pro | Google Gemini | gemini-2.5-pro | Most capable on the Gemini side |
| <model>-yolo | All | — | Skip approval/confirmation flow |
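The mapping in this table can be read as a small lookup plus a suffix rule. The sketch below is only an illustration of that reading: the dictionary layout, the short provider labels (`codex`, `gemini`) and the `resolve_model` helper are assumptions, not the actual contents of the service's configuration.

```python
from dataclasses import dataclass

# Virtual name -> (provider label, real model ID), copied from the table.
# The provider labels are hypothetical short names chosen for this sketch.
MODEL_MAP = {
    "codex-max": ("codex", "gpt-5.1-codex-max"),
    "codex": ("codex", "gpt-5.1-codex"),
    "codex-mini": ("codex", "gpt-5.1-codex-mini"),
    "gemini-flash": ("gemini", "gemini-2.5-flash"),
    "gemini-pro": ("gemini", "gemini-2.5-pro"),
}

YOLO_SUFFIX = "-yolo"


@dataclass(frozen=True)
class ResolvedModel:
    provider: str
    real_id: str
    skip_approval: bool


def resolve_model(virtual_name: str) -> ResolvedModel:
    """Resolve a virtual name, treating a trailing -yolo as 'skip approval'."""
    skip_approval = virtual_name.endswith(YOLO_SUFFIX)
    base = virtual_name[: -len(YOLO_SUFFIX)] if skip_approval else virtual_name
    if base not in MODEL_MAP:
        raise ValueError(f"unknown model: {virtual_name!r}")
    provider, real_id = MODEL_MAP[base]
    return ResolvedModel(provider, real_id, skip_approval)
```

With this reading, `resolve_model("codex-yolo")` points at `gpt-5.1-codex` with the approval flow skipped, while `resolve_model("gemini-pro")` keeps it enabled.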
Why Codex as the default provider?
| Criterion | Codex (gpt-5.1) | Claude | Gemini |
|---|---|---|---|
| Native agentic mode | Yes | Yes (via tool use) | Limited |
| Persistent sessions | Native thread_id | Limited | Native session_id |
| Cost for workflows | Medium | High | Low |
| -yolo (full-auto) models | Yes (--full-auto) | No equivalent | Yes (--yolo) |
Codex covers the agentic needs (analysis, refactoring, code generation). Gemini serves as a fast, economical fallback. Claude is no longer served through CLI Ollama: it is still used directly through Claude Code on the developer workstation.
3. How? — Technical implementation
Qdrant: collection management
```bash
# List collections
curl http://localhost:6333/collections | jq '.result.collections'

# Create a collection
curl -X PUT http://localhost:6333/collections/documents \
  -H "Content-Type: application/json" \
  -d '{ "vectors": { "size": 1536, "distance": "Cosine" } }'

# Vector search
curl http://localhost:6333/collections/documents/points/search \
  -H "Content-Type: application/json" \
  -d '{ "vector": [0.1, 0.2, ...], "limit": 5 }'
```
Multi-provider sessions (#14)
CLI Ollama maintains a SessionRecord per session_id that follows the user across providers. If a conversation starts on codex-max and then switches to gemini-pro, the previous history is re-injected as a prompt prefix.
SessionRecord fields: session_id, current_provider, current_model, codex_thread_id, gemini_session_id, turn_count, total_tokens. Stored in memory (no Redis persistence for the SessionRecords themselves — the message history lives in Claude Redis).
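As a rough illustration of the record and of the fallback through injection, here is a minimal sketch. The field names come from the list above; the types, the defaults, the `build_prompt` helper and the wording of the injected prefix are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SessionRecord:
    # Field names are taken from the text; types and defaults are assumed.
    session_id: str
    current_provider: str
    current_model: str
    codex_thread_id: Optional[str] = None
    gemini_session_id: Optional[str] = None
    turn_count: int = 0
    total_tokens: int = 0


def build_prompt(record: SessionRecord, provider: str, model: str,
                 prompt: str, history: List[Tuple[str, str]]) -> str:
    """Return the prompt to send, prefixed with history after a provider switch.

    `history` is a list of (role, text) pairs read back from the message store.
    """
    switched = record.turn_count > 0 and provider != record.current_provider
    record.current_provider = provider
    record.current_model = model
    record.turn_count += 1
    if not switched or not history:
        # Same provider: its native thread/session already holds the context.
        return prompt
    transcript = "\n".join(f"{role}: {text}" for role, text in history)
    return f"Previous conversation:\n{transcript}\n\nCurrent request:\n{prompt}"
```

The prefix is only built on a switch because, per the comparison table, both Codex and Gemini keep their own native thread or session while the provider stays the same.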
Profile system
CLI Ollama exposes several YAML profiles that scope the usable MCP tools and the allowed models. The same service therefore offers different permissions depending on the profile sent in the request.
| Profile | Allowed tools | Tools requiring approval | Knowledge base |
|---|---|---|---|
| error-analyst | 5 read-only (n8n_get_workflow, n8n_executions, …) | none | DLQ format, workflow architecture |
| n8n-admin | 5 read + 2 write | n8n_update_partial_workflow, n8n_update_full_workflow | Admin guide |
The .md files listed in knowledge_base are injected into the system prompt on every request, which gives the model a stable context without the caller having to resend it on every turn.
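To make the profile mechanics concrete, the following sketch models a profile as a plain Python object rather than YAML. The `Profile` class, its attribute names (other than knowledge_base), and the two helpers are hypothetical; only the behaviour (tool scoping, approval list, knowledge-base injection) follows the description above.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import List, Set


@dataclass
class Profile:
    # Hypothetical in-memory form of one YAML profile.
    name: str
    allowed_tools: Set[str] = field(default_factory=set)
    approval_tools: Set[str] = field(default_factory=set)
    knowledge_base: List[str] = field(default_factory=list)  # paths to .md files


def tool_decision(profile: Profile, tool: str) -> str:
    """Return 'deny', 'approve' (ask first) or 'allow' for a tool call."""
    if tool not in profile.allowed_tools:
        return "deny"
    return "approve" if tool in profile.approval_tools else "allow"


def build_system_prompt(profile: Profile, base_prompt: str) -> str:
    """Append each existing knowledge-base file to the base system prompt."""
    parts = [base_prompt]
    for path in map(Path, profile.knowledge_base):
        if path.is_file():  # missing files are skipped silently in this sketch
            parts.append(f"## {path.name}\n{path.read_text(encoding='utf-8')}")
    return "\n\n".join(parts)
```

Under this model, an n8n-admin profile would answer `approve` for n8n_update_partial_workflow, while an error-analyst profile would answer `deny` for the same tool because it is not in its allowed set.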
MCP Gateway (whitelist + transport negotiation)
mcp-gateway is an MCP reverse proxy interposed between cli-ollama and n8n-mcp. It plays three roles:
- Server-side tool whitelist: blocks any tool not on the list before the request reaches n8n-mcp.
- Transport negotiation: always asks for SSE upstream, then re-formats based on the client’s Accept header (Claude CLI prefers JSON, Codex CLI prefers SSE/rmcp).
- Bearer auth + network isolation: mcp-backend is not reachable from ai-internal other than through the gateway.
Whitelisted tools (20): every required tool (n8n_list_workflows, n8n_get_workflow, n8n_validate_workflow, tools_documentation, search_nodes, get_node, validate_node, search_templates, get_template, validate_workflow, n8n_executions, n8n_health_check, n8n_create_workflow, n8n_update_full_workflow, n8n_update_partial_workflow, n8n_delete_workflow, n8n_deploy_template, n8n_autofix_workflow, n8n_test_workflow, n8n_workflow_versions).
Critical tools (7) — pass the whitelist but require Telegram confirmation before execution:
n8n_create_workflow, n8n_update_full_workflow, n8n_update_partial_workflow, n8n_delete_workflow, n8n_deploy_template, n8n_autofix_workflow, n8n_test_workflow. The confirmation is handled by mcp_confirmation.py on the cli-ollama side which posts a webhook to the MCP Confirmation Handler workflow.
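The whitelist, the critical-tool check and the transport negotiation can be sketched as three small functions. This is an illustration only: the function names are invented, the Accept parsing ignores q-values for brevity, and in the stack described here the whitelist lives in the gateway while the confirmation step runs on the cli-ollama side.

```python
# Tool names copied from the two lists above.
CRITICAL = frozenset({
    "n8n_create_workflow", "n8n_update_full_workflow",
    "n8n_update_partial_workflow", "n8n_delete_workflow",
    "n8n_deploy_template", "n8n_autofix_workflow", "n8n_test_workflow",
})
WHITELIST = CRITICAL | frozenset({
    "n8n_list_workflows", "n8n_get_workflow", "n8n_validate_workflow",
    "tools_documentation", "search_nodes", "get_node", "validate_node",
    "search_templates", "get_template", "validate_workflow",
    "n8n_executions", "n8n_health_check", "n8n_workflow_versions",
})


def gateway_allows(message: dict) -> bool:
    """Gateway side: let a JSON-RPC message through unless it calls a non-whitelisted tool."""
    if message.get("method") != "tools/call":
        return True  # e.g. tools/list or initialize
    tool = (message.get("params") or {}).get("name")
    return tool in WHITELIST


def needs_confirmation(tool: str) -> bool:
    """cli-ollama side: critical tools wait for a Telegram confirmation."""
    return tool in CRITICAL


def response_format(accept_header: str) -> str:
    """Pick 'sse' or 'json' from the first matching media type in Accept."""
    for item in accept_header.split(","):
        media_type = item.split(";")[0].strip().lower()
        if media_type == "text/event-stream":
            return "sse"
        if media_type == "application/json":
            return "json"
    return "json"  # assumed default when the client states no preference
```

Deriving WHITELIST from CRITICAL keeps the invariant that every critical tool is also whitelisted, which matches the "7 critical out of 20" count.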
N8N MCP bridge
n8n-mcp is the MCP server that actually exposes the N8N tools. It authenticates against the N8N REST API through an API key (N8N_MCP_API_KEY), and requires a Bearer token (N8N_MCP_AUTH_TOKEN) on the MCP client side.
```bash
# Test the full chain from the host (via the gateway)
curl -X POST http://127.0.0.1:3001/mcp \
  -H "Authorization: Bearer ${N8N_MCP_AUTH_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}'
```
Approval workflow (non-yolo mode)
Pluggable hooks (pre_tool_use, post_tool_use, on_response, on_error) allow cross-cutting rules to be added without touching the providers.
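One way to picture such hooks is a small registry keyed by the four hook points. The `HookRegistry` class, its methods and the example hook below are hypothetical; the source only names the hook points, not their API.

```python
from collections import defaultdict

HOOK_POINTS = ("pre_tool_use", "post_tool_use", "on_response", "on_error")


class HookRegistry:
    """Hypothetical registry: hooks run in registration order per hook point."""

    def __init__(self):
        self._hooks = defaultdict(list)

    def register(self, point, func):
        if point not in HOOK_POINTS:
            raise ValueError(f"unknown hook point: {point}")
        self._hooks[point].append(func)
        return func

    def run(self, point, payload):
        """Pass the payload through each hook; a hook returning None leaves it unchanged."""
        for func in self._hooks[point]:
            result = func(payload)
            if result is not None:
                payload = result
        return payload


# Example rule: flag workflow deletions before the tool is executed.
def flag_deletes(call):
    if call["tool"] == "n8n_delete_workflow":
        return {**call, "needs_confirmation": True}


registry = HookRegistry()
registry.register("pre_tool_use", flag_deletes)
```

Because the rule is attached to a hook point rather than to a provider, it would apply identically whether the call originates from Codex or from Gemini.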
Calling from N8N
```jsonc
// Intent detection (non-streaming, no approval)
{
  "url": "http://cli-ollama:11434/api/generate",
  "method": "POST",
  "body": {
    "model": "codex-yolo",
    "prompt": "Analyse this Telegram message and return a JSON …",
    "stream": false
  }
}

// With profile and restricted MCP tools
{
  "model": "codex-yolo",
  "prompt": "Inspect workflow X and propose a fix",
  "profile": "error-analyst",
  "mcp_config": "{\"allowed_tools\": [\"n8n_get_workflow\", \"n8n_executions\"]}"
}
```
System resources
| Service | Memory limit | CPU |
|---|---|---|
| Qdrant | 4 GB | 2 vCPU |
| CLI Ollama | 4 GB | 2 vCPU |
| Claude Redis | 512 MB | — |
| MCP Gateway | 128 MB | — |
| N8N MCP | 256 MB | — |
4. What if? — Outlook and limits
Current limits
| Limit | Impact | Mitigation |
|---|---|---|
| Sparsely populated Qdrant | No production RAG today | Progressive import from the Obsidian vault |
| Embeddings via external API | OpenAI dependency for vectorisation | Local embedding model is conceivable |
| No streaming SessionRecord | Streaming conversations do not capitalise on history | Roadmap: record streaming chunks in memory |
| In-memory SessionRecord | Lost on container restart | Redis persistence planned post-MVP |
| Codex / Gemini cloud | Cost + external dependency | Redis caching for intent detection, -yolo models for hot paths |
Evolution scenarios
If RAG goes to production:
- N8N worker that ingests the Obsidian vault into a dedicated Qdrant collection.
- obsidian-rag profile on the CLI Ollama side with a knowledge base + search tools.
- Pattern: embeddings via OpenAI, Qdrant search, prompt enriched on the CLI Ollama side.
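The last two steps of that pattern (Qdrant search, then prompt enrichment) could look roughly like the sketch below. The embedding call is left out; `search_body`, `enrich_prompt` and the `text` payload key are assumptions for illustration, and the response shape follows Qdrant's `points/search` result list.

```python
import json


def search_body(vector, limit=5):
    """JSON body for POST /collections/<name>/points/search.

    with_payload asks Qdrant to return the stored payload alongside each hit.
    """
    return json.dumps({"vector": list(vector), "limit": limit, "with_payload": True})


def enrich_prompt(question, search_response, text_key="text"):
    """Prepend retrieved passages to the question.

    `text_key` is the payload field assumed to hold the passage text.
    """
    hits = search_response.get("result", [])
    passages = [
        hit["payload"][text_key]
        for hit in hits
        if text_key in (hit.get("payload") or {})
    ]
    if not passages:
        return question  # nothing retrieved: send the question unchanged
    context = "\n---\n".join(passages)
    return f"Context:\n{context}\n\nQuestion: {question}"
```

Keeping the enrichment on the CLI Ollama side, as the pattern suggests, means the N8N workflow only has to send the raw question and the profile name.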
If I want to add another provider:
- Implement ProviderProtocol in services/providers/<name>.py.
- Add the virtual-name → real-model mapping in config.py.
- The SessionRecord and hooks work automatically (unified ExecutionMessage interface).
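The steps above can be sketched with a structural protocol. The names ProviderProtocol and ExecutionMessage come from the text, but their fields and the `run` signature below are guesses made for this example, and `EchoProvider` is a toy provider invented to show the shape.

```python
from dataclasses import dataclass
from typing import Iterator, Protocol, runtime_checkable


@dataclass
class ExecutionMessage:
    # Hypothetical shape; the source only names the type.
    kind: str      # e.g. "text", "tool_call", "error"
    content: str


@runtime_checkable
class ProviderProtocol(Protocol):
    # Hypothetical interface: a name plus one method yielding unified messages.
    name: str

    def run(self, model: str, prompt: str) -> Iterator[ExecutionMessage]:
        ...


class EchoProvider:
    """Toy provider that satisfies the protocol without inheriting from it."""

    name = "echo"

    def run(self, model: str, prompt: str) -> Iterator[ExecutionMessage]:
        yield ExecutionMessage("text", f"[{model}] {prompt}")
```

Since every provider yields the same message type, session tracking and hooks can consume the stream without knowing which provider produced it, which is presumably what the unified interface is meant to achieve.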
If API costs grow:
- Extend Claude Redis caching to frequent answers (notably intent detection).
- Route simple requests to gemini-flash (the cheapest) through the AI Router.
- Enable a per-session_id rate limit on the cli-ollama side.
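A per-session_id rate limit could be as simple as a sliding window kept in memory. The class below is a sketch under that assumption; the name `SessionRateLimiter`, its parameters and the injectable clock are illustrative choices, not an existing part of cli-ollama.

```python
import time
from collections import defaultdict, deque


class SessionRateLimiter:
    """Sliding-window limiter: at most max_requests per window per session."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._calls = defaultdict(deque)

    def allow(self, session_id):
        now = self.clock()
        calls = self._calls[session_id]
        # Drop timestamps that have left the window.
        while calls and now - calls[0] >= self.window:
            calls.popleft()
        if len(calls) >= self.max_requests:
            return False
        calls.append(now)
        return True
```

Like the in-memory SessionRecord, this state would be lost on container restart; a shared store would be needed if several cli-ollama instances had to enforce the same limit.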
If MCP security must be tightened:
- Reduce the gateway whitelist to the 5 read-only tools only.
- Add use-case-specific profiles (instead of a permissive n8n-admin).
- Extend Telegram confirmation to more tools (currently 7 critical out of 20).
Troubleshooting commands
```bash
# Health checks
curl http://localhost:6333/healthz      # Qdrant
curl http://localhost:11434/api/tags    # CLI Ollama
docker logs cli-ollama --tail 100

# Pending approvals
curl http://cli-ollama:11434/api/approvals
curl http://cli-ollama:11434/api/questions

# Test the MCP gateway → n8n-mcp chain
curl -X POST http://127.0.0.1:3001/mcp \
  -H "Authorization: Bearer $N8N_MCP_AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}' | jq '.result.tools[].name'

# Verify network isolation
docker network inspect mcp-backend | jq '.[0].Containers'
```
Related pages
Infrastructure
- VPS Architecture — Overview
- N8N in queue mode — Consumer workers
Workflows
- Conversational system — Multi-turn sessions through CLI Ollama
- Telegram Orchestrator — AI Router and MCP confirmation
- Notification Hub — AI alert routing
Reference
- Glossary — RAG, embeddings, vector database, MCP, LLM