AI Stack: Qdrant + CLI Ollama

1. What? — Definition and context

The AI Stack brings together five services that work in concert to serve every AI need on the infrastructure: a vector database for RAG, a multi-model Ollama-compatible gateway, session memory, and two MCP (Model Context Protocol) building blocks that expose N8N tools to models with strict access control.

Components

Service	Port	Role
Qdrant	6333	Vector database (embeddings, semantic search)
CLI Ollama	11434	Ollama-compatible gateway towards Codex and Gemini
Claude Redis	6379	Conversation session memory (LRU 256 MB)
MCP Gateway	3001	MCP reverse proxy: tool whitelist + SSE/JSON negotiation
N8N MCP	3000	MCP bridge exposing the N8N API to models

Architecture diagram

2. Why? — Stakes and motivations

Goals of the AI Stack

Goal	Solution
No vendor lock-in	Ollama-compatible API, swappable providers (Codex / Gemini)
Cross-provider sessions	`SessionRecord` shares history across models; on a provider switch, the history is re-injected into the prompt
MCP tool control	Gateway whitelist + Telegram confirmation on destructive actions
Scoped profiles	The same service exposes different permissions depending on the profile (`error-analyst` vs `n8n-admin`)
Local RAG	Self-hosted Qdrant, no leak of private documents to the cloud

Models available through CLI Ollama

Virtual name	Provider	Real ID	Notes
`codex-max`	OpenAI Codex	`gpt-5.1-codex-max`	Frontier, agentic, complex tasks
`codex`	OpenAI Codex	`gpt-5.1-codex`	Standard agentic
`codex-mini`	OpenAI Codex	`gpt-5.1-codex-mini`	Smaller, more economical
`gemini-flash`	Google Gemini	`gemini-2.5-flash`	Fast, low-cost
`gemini-pro`	Google Gemini	`gemini-2.5-pro`	Most capable on the Gemini side
`<model>-yolo`	All	—	Skip approval/confirmation flow

Why Codex as the default provider?

Criterion	Codex (`gpt-5.1`)	Claude	Gemini
Native agentic mode	Yes	Yes (via tool use)	Limited
Persistent sessions	Native `thread_id`	Limited	Native `session_id`
Cost for workflows	Medium	High	Low
`-yolo` (full-auto) models	Yes (`--full-auto`)	No equivalent	Yes (`--yolo`)

Codex covers the agentic needs (analysis, refactoring, code generation). Gemini serves as a fast, economical fallback. Claude is no longer served through CLI Ollama: it is still used directly through Claude Code on the developer workstation.

3. How? — Technical implementation

Qdrant: collection management

# List collections
curl http://localhost:6333/collections | jq '.result.collections'

# Create a collection
curl -X PUT http://localhost:6333/collections/documents \
  -H "Content-Type: application/json" \
  -d '{
    "vectors": { "size": 1536, "distance": "Cosine" }
  }'

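# Vector search (replace the placeholder with a full 1536-dimension vector;
# the literal "..." is not valid JSON)
curl -X POST http://localhost:6333/collections/documents/points/search \
  -H "Content-Type: application/json" \
  -d '{ "vector": [0.1, 0.2, ...], "limit": 5 }'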

Multi-provider sessions (#14)

CLI Ollama maintains a SessionRecord per session_id that follows the user across providers. If a conversation starts on codex-max and then switches to gemini-pro, the previous history is re-injected as a prompt prefix.

SessionRecord fields: session_id, current_provider, current_model, codex_thread_id, gemini_session_id, turn_count, total_tokens. The records are held in process memory only and are not persisted to Redis; the message history itself lives in Claude Redis.
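The record and the history re-injection described above can be sketched as follows. This is a minimal illustration, not the actual implementation: the field names come from the list above, but the types, the defaults, the `build_prompt` helper and the wording of the prefix are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionRecord:
    """Field names follow the list above; types and defaults are assumed."""
    session_id: str
    current_provider: str
    current_model: str
    codex_thread_id: Optional[str] = None
    gemini_session_id: Optional[str] = None
    turn_count: int = 0
    total_tokens: int = 0


def build_prompt(record, provider, model, history, user_prompt):
    """Hypothetical helper: prefix the history when the provider changes.

    `history` is a list of (role, text) pairs, e.g. loaded from Claude Redis.
    """
    switched = record.turn_count > 0 and provider != record.current_provider
    record.current_provider = provider
    record.current_model = model
    record.turn_count += 1
    if not switched:
        # Same provider: its native thread/session already holds the context.
        return user_prompt
    # Provider switch: replay earlier turns as a plain-text prompt prefix.
    transcript = "\n".join(f"{role}: {text}" for role, text in history)
    return f"Previous conversation:\n{transcript}\n\nUser: {user_prompt}"
```

A conversation that starts on `codex-max` sends its prompts unchanged; the first request routed to `gemini-pro` receives the earlier turns as a prefix.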

Profile system

CLI Ollama exposes several YAML profiles that scope the usable MCP tools and the allowed models. The same service therefore offers different permissions depending on the profile sent in the request.

Profile	Allowed tools	Tools requiring approval	Knowledge base
`error-analyst`	5 read-only (`n8n_get_workflow`, `n8n_executions`, …)	none	DLQ format, workflow architecture
`n8n-admin`	5 read + 2 write	`n8n_update_partial_workflow`, `n8n_update_full_workflow`	Admin guide

The .md files listed in knowledge_base are injected into the system prompt on every request, which gives the model a stable context without the caller having to resend it on each turn.
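One way to picture the knowledge-base injection is the sketch below. It assumes the profile's `knowledge_base` entry is a list of file names relative to some root directory; the `build_system_prompt` helper, the section headers and the directory layout are hypothetical.

```python
from pathlib import Path


def build_system_prompt(base_prompt, knowledge_base, root):
    """Append each knowledge-base file to the base system prompt.

    `knowledge_base` is the list of .md file names from the profile,
    resolved against `root` (an assumed layout).
    """
    sections = [base_prompt]
    for name in knowledge_base:
        text = (Path(root) / name).read_text(encoding="utf-8")
        sections.append(f"## {name}\n{text.strip()}")
    return "\n\n".join(sections)
```

Because the prompt is rebuilt server-side on each request, callers only send the profile name and their own prompt.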

MCP Gateway (whitelist + transport negotiation)

mcp-gateway is an MCP reverse proxy interposed between cli-ollama and n8n-mcp. It plays three roles:

Server-side tool whitelist — blocks any tool not on the list before the request reaches n8n-mcp.
Transport negotiation — always asks for SSE upstream, then re-formats based on the client’s Accept header (Claude CLI prefers JSON, Codex CLI prefers SSE/rmcp).
Bearer auth + network isolation — mcp-backend is not reachable from ai-internal other than through the gateway.
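The transport negotiation can be illustrated with a small sketch. It assumes the upstream reply is a single SSE event whose `data:` lines carry one JSON-RPC message, and it uses a simplified rule (SSE whenever the client lists `text/event-stream`, ignoring q-values). The function names are hypothetical and do not come from the gateway's code.

```python
import json


def wants_sse(accept_header):
    """True when the client's Accept header lists text/event-stream."""
    media_types = [
        part.split(";")[0].strip().lower()
        for part in accept_header.split(",")
    ]
    return "text/event-stream" in media_types


def sse_to_json(sse_body):
    """Return the JSON object carried by the first complete SSE event."""
    data_lines = []
    for line in sse_body.splitlines():
        if line.startswith("data:"):
            data_lines.append(line[5:].lstrip())
        elif line == "" and data_lines:
            break  # a blank line terminates the event
    return json.loads("\n".join(data_lines))


def reformat(accept_header, sse_body):
    """Pass SSE through, or unwrap it to plain JSON for JSON clients."""
    if wants_sse(accept_header):
        return "text/event-stream", sse_body
    return "application/json", json.dumps(sse_to_json(sse_body))
```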

Whitelisted tools (20): n8n_list_workflows, n8n_get_workflow, n8n_validate_workflow, tools_documentation, search_nodes, get_node, validate_node, search_templates, get_template, validate_workflow, n8n_executions, n8n_health_check, n8n_create_workflow, n8n_update_full_workflow, n8n_update_partial_workflow, n8n_delete_workflow, n8n_deploy_template, n8n_autofix_workflow, n8n_test_workflow, n8n_workflow_versions.

Critical tools (7) pass the whitelist but require Telegram confirmation before execution: n8n_create_workflow, n8n_update_full_workflow, n8n_update_partial_workflow, n8n_delete_workflow, n8n_deploy_template, n8n_autofix_workflow, n8n_test_workflow. Confirmation is handled by mcp_confirmation.py on the cli-ollama side, which posts a webhook to the MCP Confirmation Handler workflow.
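The two checks can be summarised in a single decision function. This is a sketch only: in the real stack the whitelist is enforced by the gateway and the confirmation by cli-ollama, whereas here both are folded into one hypothetical `gate` function, and `confirm` is a stand-in for the Telegram round trip.

```python
WHITELIST = frozenset({
    "n8n_list_workflows", "n8n_get_workflow", "n8n_validate_workflow",
    "tools_documentation", "search_nodes", "get_node", "validate_node",
    "search_templates", "get_template", "validate_workflow",
    "n8n_executions", "n8n_health_check", "n8n_create_workflow",
    "n8n_update_full_workflow", "n8n_update_partial_workflow",
    "n8n_delete_workflow", "n8n_deploy_template", "n8n_autofix_workflow",
    "n8n_test_workflow", "n8n_workflow_versions",
})

CRITICAL = frozenset({
    "n8n_create_workflow", "n8n_update_full_workflow",
    "n8n_update_partial_workflow", "n8n_delete_workflow",
    "n8n_deploy_template", "n8n_autofix_workflow", "n8n_test_workflow",
})


def gate(tool, confirm):
    """Return 'blocked', 'rejected' or 'forwarded' for a tool call.

    `confirm` is a callable taking the tool name and returning True when
    the operator approves (placeholder for the Telegram confirmation).
    """
    if tool not in WHITELIST:
        return "blocked"
    if tool in CRITICAL and not confirm(tool):
        return "rejected"
    return "forwarded"
```

Read-only tools never invoke `confirm`, so they are forwarded without any operator interaction.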

N8N MCP bridge

n8n-mcp is the MCP server that actually exposes the N8N tools. It authenticates against the N8N REST API through an API key (N8N_MCP_API_KEY), and requires a Bearer token (N8N_MCP_AUTH_TOKEN) on the MCP client side.

# Test the full chain from the host (via the gateway)
curl -X POST http://127.0.0.1:3001/mcp \
  -H "Authorization: Bearer ${N8N_MCP_AUTH_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}'

Approval workflow (non-yolo mode)

Pluggable hooks (pre_tool_use, post_tool_use, on_response, on_error) allow cross-cutting rules to be added without touching the providers.
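A hook mechanism of this kind might look like the sketch below. Only the four hook names come from the text above; the `HookRegistry` class, its methods and the convention that each hook receives and returns a payload are assumptions made for illustration.

```python
from collections import defaultdict

HOOK_POINTS = ("pre_tool_use", "post_tool_use", "on_response", "on_error")


class HookRegistry:
    """Hypothetical registry: hooks run in registration order."""

    def __init__(self):
        self._hooks = defaultdict(list)

    def register(self, point, fn):
        if point not in HOOK_POINTS:
            raise ValueError(f"unknown hook point: {point}")
        self._hooks[point].append(fn)

    def run(self, point, payload):
        """Pass the payload through every hook registered for `point`."""
        for fn in self._hooks[point]:
            payload = fn(payload)
        return payload
```

A cross-cutting rule such as audit tagging then becomes `registry.register("pre_tool_use", add_audit_flag)`, with no change to the provider code.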

Calling from N8N

// Intent detection (non-streaming, no approval)
{
  "url": "http://cli-ollama:11434/api/generate",
  "method": "POST",
  "body": {
    "model": "codex-yolo",
    "prompt": "Analyse this Telegram message and return a JSON …",
    "stream": false
  }
}

// With profile and restricted MCP tools
{
  "model": "codex-yolo",
  "prompt": "Inspect workflow X and propose a fix",
  "profile": "error-analyst",
  "mcp_config": "{\"allowed_tools\": [\"n8n_get_workflow\", \"n8n_executions\"]}"
}
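For callers outside N8N, the same request bodies can be built programmatically. The sketch below only assembles the payload; `build_generate_payload` is a hypothetical helper, and the only fields it uses are the ones shown in the examples above. Note that `mcp_config` is a JSON-encoded string, not a nested object.

```python
import json


def build_generate_payload(model, prompt, profile=None, allowed_tools=None,
                           stream=False):
    """Assemble a body for POST /api/generate on cli-ollama."""
    payload = {"model": model, "prompt": prompt, "stream": stream}
    if profile is not None:
        payload["profile"] = profile
    if allowed_tools is not None:
        # mcp_config is sent as a JSON string, as in the example above.
        payload["mcp_config"] = json.dumps(
            {"allowed_tools": list(allowed_tools)}
        )
    return payload
```

The result can be serialised with `json.dumps` and posted to `http://cli-ollama:11434/api/generate` with any HTTP client.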

System resources

Service	Memory limit	CPU
Qdrant	4 GB	2 vCPU
CLI Ollama	4 GB	2 vCPU
Claude Redis	512 MB	—
MCP Gateway	128 MB	—
N8N MCP	256 MB	—

4. What if? — Outlook and limits

Current limits

Limit	Impact	Mitigation
Sparsely populated Qdrant	No production RAG today	Progressive import from the Obsidian vault
Embeddings via external API	OpenAI dependency for vectorisation	Local embedding model is conceivable
No SessionRecord for streaming	Streaming conversations do not benefit from the recorded history	Roadmap: record streaming chunks in memory
In-memory SessionRecord	Lost on container restart	Redis persistence planned post-MVP
Codex / Gemini cloud	Cost + external dependency	Redis caching for intent detection, `-yolo` models for hot paths

Evolution scenarios

If RAG goes to production:

N8N worker that ingests the Obsidian vault into a dedicated Qdrant collection.
obsidian-rag profile on the CLI Ollama side with a knowledge base + search tools.
Pattern: embeddings via OpenAI, Qdrant search, prompt enriched on the CLI Ollama side.
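The prompt-enrichment step of that pattern could be sketched as follows. It assumes the Qdrant search was issued with payloads returned and that each point stores its chunk under a `text` payload key; that key, the `enrich_prompt` helper and the character budget are all hypothetical.

```python
def enrich_prompt(question, hits, max_chars=2000):
    """Prefix the question with the text of the best-scoring hits.

    `hits` is the `result` list of a Qdrant search response, where each
    item carries `score` and `payload`.
    """
    context, used = [], 0
    for hit in sorted(hits, key=lambda h: h["score"], reverse=True):
        text = hit.get("payload", {}).get("text", "")
        if not text or used + len(text) > max_chars:
            continue  # skip empty chunks and chunks that exceed the budget
        context.append(f"- {text}")
        used += len(text)
    if not context:
        return question
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {question}"
```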

If I want to add another provider:

Implement ProviderProtocol in services/providers/<name>.py.
Add the virtual-name → real-model mapping in config.py.
The SessionRecord and hooks work automatically (unified ExecutionMessage interface).
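The shape of such a provider can be sketched with `typing.Protocol`. The names `ProviderProtocol` and `ExecutionMessage` come from the steps above, but their members are not documented here, so the `name` attribute, the `execute` signature, the message fields and the `echo` mapping are purely illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol, runtime_checkable


@dataclass
class ExecutionMessage:
    """Assumed minimal shape of the unified message."""
    role: str
    content: str


@runtime_checkable
class ProviderProtocol(Protocol):
    """Assumed interface; the real protocol may differ."""
    name: str

    def execute(self, model: str, prompt: str) -> Iterable[ExecutionMessage]:
        ...


class EchoProvider:
    """Toy provider used only to exercise the interface."""
    name = "echo"

    def execute(self, model, prompt):
        yield ExecutionMessage(role="assistant", content=f"[{model}] {prompt}")


# Virtual name -> (provider name, real model ID); hypothetical entry.
MODEL_MAP = {"echo": ("echo", "echo-v1")}


def run(virtual_name, prompt, providers):
    """Resolve the virtual name and collect the provider's messages."""
    provider_name, real_id = MODEL_MAP[virtual_name]
    return list(providers[provider_name].execute(real_id, prompt))
```

Because callers only see `ExecutionMessage` objects, session tracking and hooks do not need to know which provider produced them.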

If API costs grow:

Extend Claude Redis caching to frequent answers (notably intent detection).
Route simple requests to gemini-flash (the cheapest) through the AI Router.
Enable a per-session_id rate limit on the cli-ollama side.
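The per-session_id rate limit could take the form of a sliding window, as in the sketch below. The class name, the limits and the injectable clock are assumptions; nothing here reflects an existing cli-ollama setting.

```python
import time
from collections import defaultdict, deque


class SessionRateLimiter:
    """Sliding-window limiter keyed by session_id (illustrative only)."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self._max = max_requests
        self._window = window_seconds
        self._clock = clock
        self._calls = defaultdict(deque)

    def allow(self, session_id):
        """Record the call and return True if the session is under its limit."""
        now = self._clock()
        calls = self._calls[session_id]
        # Drop timestamps that have left the window.
        while calls and now - calls[0] >= self._window:
            calls.popleft()
        if len(calls) >= self._max:
            return False
        calls.append(now)
        return True
```

Rejected calls are not recorded, so a client that keeps retrying does not extend its own lockout.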

If MCP security must be tightened:

Reduce the gateway whitelist to the 5 read-only tools.
Add use-case-specific profiles (instead of a permissive n8n-admin).
Extend Telegram confirmation to more tools (currently 7 critical out of 20).

Troubleshooting commands

# Health checks
curl http://localhost:6333/healthz                  # Qdrant
curl http://localhost:11434/api/tags                # CLI Ollama
docker logs cli-ollama --tail 100

# Pending approvals
curl http://cli-ollama:11434/api/approvals
curl http://cli-ollama:11434/api/questions

# Test the MCP gateway → n8n-mcp chain
curl -s -X POST http://127.0.0.1:3001/mcp \
  -H "Authorization: Bearer $N8N_MCP_AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}' | jq '.result.tools[].name'

# Verify network isolation
docker network inspect mcp-backend | jq '.[0].Containers'

Infrastructure

VPS Architecture — Overview
N8N in queue mode — Consumer workers

Workflows

Conversational system — Multi-turn sessions through CLI Ollama
Telegram Orchestrator — AI Router and MCP confirmation
Notification Hub — AI alert routing

Reference

Glossary — RAG, embeddings, vector database, MCP, LLM