AI Stack: Qdrant + CLI Ollama

The AI Stack brings together five services that work in concert to serve every AI need on the infrastructure: a vector database for RAG, a multi-model Ollama-compatible gateway, session memory, and two MCP (Model Context Protocol) building blocks that expose N8N tools to models with strict access control.

| Service | Port | Role |
| --- | --- | --- |
| Qdrant | 6333 | Vector database (embeddings, semantic search) |
| CLI Ollama | 11434 | Ollama-compatible gateway towards Codex and Gemini |
| Claude Redis | 6379 | Conversation session memory (LRU 256 MB) |
| MCP Gateway | 3001 | MCP reverse proxy: tool whitelist + SSE/JSON negotiation |
| N8N MCP | 3000 | MCP bridge exposing the N8N API to models |

[Architecture diagram. Networks: ai-internal and mcp-backend (isolated). Nodes: N8N Workflows (AI Router · intent detection; Service Handler General · free chat; Conversation Agent · multi-turn sessions; RAG · future), Qdrant · :6333, CLI Ollama · :11434, Claude Redis · sessions, CLI subprocess · codex / gemini, MCP Gateway · :3001 · whitelist + Bearer, N8N MCP bridge · :3000. Edge labels: MCP JSON-RPC, 20 whitelisted tools, REST API key.]


| Goal | Solution |
| --- | --- |
| No vendor lock-in | Ollama-compatible API, swappable providers (Codex / Gemini) |
| Cross-provider sessions | SessionRecord shares history across models, fallback through injection |
| MCP tool control | Gateway whitelist + Telegram confirmation on destructive actions |
| Scoped profiles | The same service exposes different permissions depending on the profile (error-analyst vs n8n-admin) |
| Local RAG | Self-hosted Qdrant, no leak of private documents to the cloud |

| Virtual name | Provider | Real ID | Notes |
| --- | --- | --- | --- |
| codex-max | OpenAI Codex | gpt-5.1-codex-max | Frontier, agentic, complex tasks |
| codex | OpenAI Codex | gpt-5.1-codex | Standard agentic |
| codex-mini | OpenAI Codex | gpt-5.1-codex-mini | Smaller, more economical |
| gemini-flash | Google Gemini | gemini-2.5-flash | Fast, low-cost |
| gemini-pro | Google Gemini | gemini-2.5-pro | Most capable on the Gemini side |
| <model>-yolo | All | | Skip approval/confirmation flow |

| Criterion | Codex (gpt-5.1) | Claude | Gemini |
| --- | --- | --- | --- |
| Native agentic mode | Yes | Yes (via tool use) | Limited |
| Persistent sessions | Native thread_id | Limited | Native session_id |
| Cost for workflows | Medium | High | Low |
| -yolo (full-auto) models | Yes (--full-auto) | No equivalent | Yes (--yolo) |

Codex covers the agentic needs (analysis, refactoring, code generation), and Gemini serves as a fast, economical fallback. Claude is no longer served through CLI Ollama; it is still used directly through Claude Code on the developer workstation.


# List collections
curl http://localhost:6333/collections | jq '.result.collections'

# Create a collection
curl -X PUT http://localhost:6333/collections/documents \
  -H "Content-Type: application/json" \
  -d '{
    "vectors": { "size": 1536, "distance": "Cosine" }
  }'

# Vector search
curl http://localhost:6333/collections/documents/points/search \
  -H "Content-Type: application/json" \
  -d '{ "vector": [0.1, 0.2, ...], "limit": 5 }'

CLI Ollama maintains a SessionRecord per session_id that follows the user across providers. If a conversation starts on codex-max and then switches to gemini-pro, the previous history is re-injected as a prompt prefix.

[Session flow diagram: Request (session_id, model) → get_session_record() loads the SessionRecord → Same model? If yes: native resume (Codex thread_id or Gemini session_id). If no, or no ID: history injection via _inject_history(). Both branches → CLI execution → record_turn() updates the SessionRecord and current_model.]

SessionRecord fields: session_id, current_provider, current_model, codex_thread_id, gemini_session_id, turn_count, total_tokens. Stored in memory (no Redis persistence for the SessionRecords themselves — the message history lives in Claude Redis).
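The resume-or-inject decision can be illustrated with a minimal Python sketch. The field names are the ones listed above; the types, defaults, the provider keys "codex" and "gemini", and the helpers native_resume_id and plan_turn are assumptions made for illustration, and this record_turn signature is hypothetical rather than the real one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionRecord:
    # Field names follow the documentation; types and defaults are assumptions.
    session_id: str
    current_provider: Optional[str] = None
    current_model: Optional[str] = None
    codex_thread_id: Optional[str] = None
    gemini_session_id: Optional[str] = None
    turn_count: int = 0
    total_tokens: int = 0


def native_resume_id(record: SessionRecord, provider: str) -> Optional[str]:
    """Return the provider-native ID stored for this provider, if any."""
    if provider == "codex":
        return record.codex_thread_id
    if provider == "gemini":
        return record.gemini_session_id
    return None


def plan_turn(record: SessionRecord, provider: str, model: str) -> str:
    """Hypothetical helper: native resume only if the model is unchanged and an ID exists."""
    if record.current_model == model and native_resume_id(record, provider):
        return "native_resume"
    return "inject_history"


def record_turn(record: SessionRecord, provider: str, model: str,
                tokens: int, native_id: Optional[str] = None) -> None:
    """Update counters and remember which model served the last turn."""
    record.current_provider = provider
    record.current_model = model
    record.turn_count += 1
    record.total_tokens += tokens
    if native_id is not None:
        if provider == "codex":
            record.codex_thread_id = native_id
        elif provider == "gemini":
            record.gemini_session_id = native_id
```

In this sketch a brand-new session, or a switch from codex-max to gemini-pro, falls through to history injection, while a second turn on the same model reuses the stored native ID.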

CLI Ollama exposes several YAML profiles that scope the usable MCP tools and the allowed models. The same service therefore offers different permissions depending on the profile sent in the request.

| Profile | Allowed tools | Tools requiring approval | Knowledge base |
| --- | --- | --- | --- |
| error-analyst | 5 read-only (n8n_get_workflow, n8n_executions, …) | none | DLQ format, workflow architecture |
| n8n-admin | 5 read + 2 write | n8n_update_partial_workflow, n8n_update_full_workflow | Admin guide |

The .md files listed in knowledge_base are injected into the system prompt on every request, giving the model a stable context without the caller having to resend it on every turn.
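A rough Python sketch of how a profile could scope tools and contribute its knowledge base to the system prompt is shown here. The Profile class, its attribute names other than knowledge_base, the helper functions, and the file name dlq_format.md are all hypothetical; the real profiles are YAML files whose schema is not shown in this page. Only the two tool names the table spells out are used.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    # Hypothetical in-memory view of a YAML profile.
    name: str
    allowed_tools: list
    approval_required: list = field(default_factory=list)
    knowledge_base: list = field(default_factory=list)  # names of .md files


def is_tool_allowed(profile: Profile, tool: str) -> bool:
    """A tool is usable only if the profile lists it."""
    return tool in profile.allowed_tools


def needs_approval(profile: Profile, tool: str) -> bool:
    """True when the profile marks the tool as requiring approval."""
    return tool in profile.approval_required


def build_system_prompt(base_prompt: str, documents: dict, profile: Profile) -> str:
    """Append every knowledge-base document of the profile to the base prompt.

    `documents` maps a file name to its already-loaded text, so the sketch
    stays free of file I/O.
    """
    parts = [base_prompt]
    for name in profile.knowledge_base:
        parts.append(f"## {name}\n{documents[name]}")
    return "\n\n".join(parts)


error_analyst = Profile(
    name="error-analyst",
    allowed_tools=["n8n_get_workflow", "n8n_executions"],  # the remaining tools are elided in the source
    knowledge_base=["dlq_format.md"],  # hypothetical file name
)
```

Building the prompt server-side on each request is what lets the caller send only the user message and the profile name.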

MCP Gateway (whitelist + transport negotiation)

mcp-gateway is an MCP reverse proxy interposed between cli-ollama and n8n-mcp. It plays three roles:

  1. Server-side tool whitelist: blocks any tool not on the list before the request reaches n8n-mcp.
  2. Transport negotiation: always asks for SSE upstream, then re-formats based on the client’s Accept header (Claude CLI prefers JSON, Codex CLI prefers SSE/rmcp).
  3. Bearer auth + network isolation: mcp-backend is not reachable from ai-internal other than through the gateway.

Whitelisted tools (20): every required tool (n8n_list_workflows, n8n_get_workflow, n8n_validate_workflow, tools_documentation, search_nodes, get_node, validate_node, search_templates, get_template, validate_workflow, n8n_executions, n8n_health_check, n8n_create_workflow, n8n_update_full_workflow, n8n_update_partial_workflow, n8n_delete_workflow, n8n_deploy_template, n8n_autofix_workflow, n8n_test_workflow, n8n_workflow_versions).

Critical tools (7): these pass the whitelist but require Telegram confirmation before execution. They are n8n_create_workflow, n8n_update_full_workflow, n8n_update_partial_workflow, n8n_delete_workflow, n8n_deploy_template, n8n_autofix_workflow and n8n_test_workflow. The confirmation is handled by mcp_confirmation.py on the cli-ollama side, which posts a webhook to the MCP Confirmation Handler workflow.
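The two tool lists lend themselves to a small classification sketch in Python. The tool names are taken from the lists above; the function names and the three outcome labels are assumptions. In the real stack the whitelist is enforced by mcp-gateway and the confirmation by cli-ollama, so combining both in one function is purely illustrative.

```python
WHITELIST = frozenset({
    "n8n_list_workflows", "n8n_get_workflow", "n8n_validate_workflow",
    "tools_documentation", "search_nodes", "get_node", "validate_node",
    "search_templates", "get_template", "validate_workflow",
    "n8n_executions", "n8n_health_check", "n8n_create_workflow",
    "n8n_update_full_workflow", "n8n_update_partial_workflow",
    "n8n_delete_workflow", "n8n_deploy_template", "n8n_autofix_workflow",
    "n8n_test_workflow", "n8n_workflow_versions",
})

CRITICAL = frozenset({
    "n8n_create_workflow", "n8n_update_full_workflow",
    "n8n_update_partial_workflow", "n8n_delete_workflow",
    "n8n_deploy_template", "n8n_autofix_workflow", "n8n_test_workflow",
})


def classify_tool(tool_name: str) -> str:
    """Return 'blocked', 'needs_confirmation' or 'allowed' (labels are assumptions)."""
    if tool_name not in WHITELIST:
        return "blocked"
    if tool_name in CRITICAL:
        return "needs_confirmation"
    return "allowed"


def classify_request(message: dict) -> str:
    """Classify an MCP JSON-RPC message; only tools/call carries a tool name."""
    if message.get("method") != "tools/call":
        return "allowed"
    return classify_tool(message.get("params", {}).get("name", ""))
```

For example, a tools/call for n8n_delete_workflow yields "needs_confirmation", while any tool name outside the 20 listed is "blocked" before it could reach n8n-mcp.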

n8n-mcp is the MCP server that actually exposes the N8N tools. It authenticates against the N8N REST API through an API key (N8N_MCP_API_KEY), and requires a Bearer token (N8N_MCP_AUTH_TOKEN) on the MCP client side.

# Test the full chain from the host (via the gateway)
curl -X POST http://127.0.0.1:3001/mcp \
  -H "Authorization: Bearer ${N8N_MCP_AUTH_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}'

[Approval flow diagram: Request (non-YOLO) → CLI Ollama detects tool_use or question → N8N webhook + Pending state → N8N Telegram Orchestrator renders the plan or question with inline keyboard buttons → Wait for approval (5 min timeout) → Decision. Approved: actual execution, then return result. Rejected/Expired: 'Rejected' response.]
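The pending state and its 5 minute timeout could be modelled roughly as follows in Python. The class PendingApproval, the function finish, and the status strings are hypothetical; only the timeout value and the approved / rejected / expired outcomes come from the flow described here. The clock is passed in explicitly so the logic can be exercised without waiting.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional

APPROVAL_TIMEOUT_S = 5 * 60  # 5 min timeout from the approval flow


@dataclass
class PendingApproval:
    request_id: str
    created_at: float = field(default_factory=time.monotonic)
    # Set to "approved" or "rejected" when the Telegram button is pressed.
    decision: Optional[str] = None

    def status(self, now: Optional[float] = None) -> str:
        """Return 'approved', 'rejected', 'expired' or 'pending'."""
        now = time.monotonic() if now is None else now
        if self.decision is not None:
            return self.decision
        if now - self.created_at >= APPROVAL_TIMEOUT_S:
            return "expired"
        return "pending"


def finish(approval: PendingApproval, execute: Callable[[], str],
           now: Optional[float] = None) -> Optional[str]:
    """Run the tool only when approved; rejected and expired both answer 'Rejected'."""
    state = approval.status(now)
    if state == "approved":
        return execute()
    if state == "pending":
        return None  # caller keeps waiting
    return "Rejected"
```

Treating an expired request exactly like a rejection keeps the caller's handling to two outcomes, which matches the single Rejected/Expired branch of the flow.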

Pluggable hooks (pre_tool_use, post_tool_use, on_response, on_error) allow cross-cutting rules to be added without touching the providers.
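A hook mechanism of this kind can be sketched in a few lines of Python. The four hook point names are the ones listed above; the HookRegistry class, its methods, and the convention that each hook receives a context dict are assumptions, not the actual cli-ollama interface.

```python
from collections import defaultdict
from typing import Any, Callable, Dict

HOOK_POINTS = ("pre_tool_use", "post_tool_use", "on_response", "on_error")


class HookRegistry:
    """Hypothetical registry: providers call run(), rules are added via register()."""

    def __init__(self) -> None:
        self._hooks = defaultdict(list)

    def register(self, point: str, fn: Callable[[Dict[str, Any]], None]) -> None:
        if point not in HOOK_POINTS:
            raise ValueError(f"unknown hook point: {point}")
        self._hooks[point].append(fn)

    def run(self, point: str, context: Dict[str, Any]) -> None:
        """Call every hook registered for this point, in registration order."""
        if point not in HOOK_POINTS:
            raise ValueError(f"unknown hook point: {point}")
        for fn in self._hooks[point]:
            fn(context)


# Usage: an audit rule added without modifying any provider code.
audit_log = []
registry = HookRegistry()
registry.register("pre_tool_use", lambda ctx: audit_log.append(ctx["tool"]))
registry.run("pre_tool_use", {"tool": "n8n_get_workflow"})
```

Because providers only ever call run(), a new cross-cutting rule such as auditing or redaction is a single register() call.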

// Intent detection (non-streaming, no approval)
{
  "url": "http://cli-ollama:11434/api/generate",
  "method": "POST",
  "body": {
    "model": "codex-yolo",
    "prompt": "Analyse this Telegram message and return a JSON …",
    "stream": false
  }
}

// With profile and restricted MCP tools
{
  "model": "codex-yolo",
  "prompt": "Inspect workflow X and propose a fix",
  "profile": "error-analyst",
  "mcp_config": "{\"allowed_tools\": [\"n8n_get_workflow\", \"n8n_executions\"]}"
}
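Note that mcp_config in the second request is a JSON document serialised into a string, not a nested object. The Python sketch below builds such a body and posts it with the standard library; the function names and the timeout are assumptions, and the shape of the response is left uninterpreted because it is not documented here.

```python
import json
import urllib.request
from typing import Iterable, Optional


def build_generate_payload(model: str, prompt: str,
                           profile: Optional[str] = None,
                           allowed_tools: Optional[Iterable[str]] = None,
                           stream: bool = False) -> dict:
    """Build a request body shaped like the examples above."""
    payload = {"model": model, "prompt": prompt, "stream": stream}
    if profile is not None:
        payload["profile"] = profile
    if allowed_tools is not None:
        # mcp_config is sent as a JSON string, so it is encoded separately.
        payload["mcp_config"] = json.dumps({"allowed_tools": list(allowed_tools)})
    return payload


def post_generate(payload: dict,
                  url: str = "http://cli-ollama:11434/api/generate",
                  timeout: float = 120.0) -> dict:
    """POST the payload and return the decoded JSON body (non-streaming only)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_generate_payload(
    "codex-yolo",
    "Inspect workflow X and propose a fix",
    profile="error-analyst",
    allowed_tools=["n8n_get_workflow", "n8n_executions"],
)
```

Calling post_generate(payload) requires the cli-ollama hostname to resolve, which in this stack presumably means running inside the Docker network.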

| Service | Memory limit | CPU |
| --- | --- | --- |
| Qdrant | 4 GB | 2 vCPU |
| CLI Ollama | 4 GB | 2 vCPU |
| Claude Redis | 512 MB | |
| MCP Gateway | 128 MB | |
| N8N MCP | 256 MB | |

| Limit | Impact | Mitigation |
| --- | --- | --- |
| Sparsely populated Qdrant | No production RAG today | Progressive import from the Obsidian vault |
| Embeddings via external API | OpenAI dependency for vectorisation | Local embedding model is conceivable |
| No streaming SessionRecord | Streaming conversations do not capitalise on history | Roadmap: record streaming chunks in memory |
| In-memory SessionRecord | Lost on container restart | Redis persistence planned post-MVP |
| Codex / Gemini cloud | Cost + external dependency | Redis caching for intent detection, -yolo models for hot paths |

If RAG goes to production:

  • N8N worker that ingests the Obsidian vault into a dedicated Qdrant collection.
  • obsidian-rag profile on the CLI Ollama side with a knowledge base + search tools.
  • Pattern: embeddings via OpenAI, Qdrant search, prompt enriched on the CLI Ollama side.

If I want to add another provider:

  • Implement ProviderProtocol in services/providers/<name>.py.
  • Add the virtual-name → real-model mapping in config.py.
  • The SessionRecord and hooks work automatically (unified ExecutionMessage interface).
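The three steps above might look like the following Python sketch. ProviderProtocol and ExecutionMessage are named in the text, but their real definitions are not shown, so the method signature, the message fields, the provider keys "codex" and "gemini", and the handling of the -yolo suffix are all assumptions. The virtual-name mapping reuses the model table from this page.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, Tuple, runtime_checkable


@dataclass
class ExecutionMessage:
    # Hypothetical fields; the real unified interface is not documented here.
    role: str
    content: str


@runtime_checkable
class ProviderProtocol(Protocol):
    def execute(self, real_model: str, prompt: str,
                resume_id: Optional[str] = None) -> ExecutionMessage: ...


# Virtual name -> (provider key, real model ID), taken from the model table.
MODEL_MAP = {
    "codex-max": ("codex", "gpt-5.1-codex-max"),
    "codex": ("codex", "gpt-5.1-codex"),
    "codex-mini": ("codex", "gpt-5.1-codex-mini"),
    "gemini-flash": ("gemini", "gemini-2.5-flash"),
    "gemini-pro": ("gemini", "gemini-2.5-pro"),
}


def resolve_model(virtual_name: str) -> Tuple[str, str, bool]:
    """Return (provider key, real model ID, yolo flag) for a virtual name."""
    yolo = virtual_name.endswith("-yolo")
    base = virtual_name[: -len("-yolo")] if yolo else virtual_name
    provider, real_id = MODEL_MAP[base]
    return provider, real_id, yolo


class EchoProvider:
    """Toy provider that satisfies the protocol without calling any CLI."""

    def execute(self, real_model: str, prompt: str,
                resume_id: Optional[str] = None) -> ExecutionMessage:
        return ExecutionMessage(role="assistant", content=f"[{real_model}] {prompt}")
```

Structural typing means a new provider does not have to inherit from anything; matching the method signature is enough for the rest of the pipeline to accept it.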

If API costs grow:

  • Extend Claude Redis caching to frequent answers (notably intent detection).
  • Route simple requests to gemini-flash (the cheapest) through the AI Router.
  • Enable a per-session_id rate limit on the cli-ollama side.

If MCP security must be tightened:

  • Reduce the gateway whitelist to the 5 read-only tools only.
  • Add use-case-specific profiles (instead of a permissive n8n-admin).
  • Extend Telegram confirmation to more tools (currently 7 critical out of 20).

# Health checks
curl http://localhost:6333/healthz      # Qdrant
curl http://localhost:11434/api/tags    # CLI Ollama
docker logs cli-ollama --tail 100

# Pending approvals
curl http://cli-ollama:11434/api/approvals
curl http://cli-ollama:11434/api/questions

# Test the MCP gateway → n8n-mcp chain
curl -s -X POST http://127.0.0.1:3001/mcp \
  -H "Authorization: Bearer $N8N_MCP_AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}' | jq '.result.tools[].name'

# Verify network isolation
docker network inspect mcp-backend | jq '.[0].Containers'

  • Glossary — RAG, embeddings, vector database, MCP, LLM