
Question Hub & Vision

When Claude Code (or Codex, or Gemini) needs to ask a multiple-choice question — “Which framework should we use? React, Vue, or Svelte?” — it cannot do so ergonomically inside a terminal. The Question Hub displays that question in Telegram with interactive buttons, supports multi-select and pagination, and returns the answer to the CLI.

In addition, the Vision OCR system analyses photos sent on Telegram. Business cards, invoices, screenshots, and handwritten notes each go through an extraction tailored to that document type.

| System | Nodes | Trigger | Role |
|---|---|---|---|
| Question Hub | ~35 (parent + callback) | CLI webhook | Interactive Telegram questions |
| Vision OCR | 14 | Sub-workflow (photo) | Document extraction |

| Problem | Without these workflows | With these workflows |
|---|---|---|
| Unreadable CLI questions | Numbered list inside the terminal | Telegram buttons with emojis |
| No multi-select | Type the numbers one by one | ✅ toggle and confirmation |
| Useless photos | Photo = binary file, no info | Type-specific structured extraction |
| Generic OCR | Same processing for everything | Invoice ≠ business card ≠ screenshot |

The Vision OCR classifies each photo before extracting it:

| Type | Extracted fields |
|---|---|
| business_card | Name, function, company, email, phone |
| invoice | Vendor, number, lines, total, date |
| screenshot | Visible text, identified UI |
| handwritten_note | Transcription, confidence, language |
| general_document | Structured raw text |
| not_document | (Not a document: photo, landscape, etc.) |
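
The routing implied by this table can be sketched as a simple lookup. This is only an illustration: the field keys, the `fields_for` helper, and the fallback to the generic branch for unknown labels are assumptions, not the workflow's actual node logic.

```python
from typing import List, Optional

# Hypothetical field keys mirroring the table above.
FIELDS_BY_TYPE = {
    "business_card": ["name", "function", "company", "email", "phone"],
    "invoice": ["vendor", "number", "lines", "total", "date"],
    "screenshot": ["visible_text", "identified_ui"],
    "handwritten_note": ["transcription", "confidence", "language"],
    "general_document": ["raw_text"],
}


def fields_for(doc_type: str) -> Optional[List[str]]:
    """Return the fields to extract, or None when the photo is skipped."""
    if doc_type == "not_document":
        return None
    # Assumption: an unknown label falls back to the generic branch.
    return FIELDS_BY_TYPE.get(doc_type, FIELDS_BY_TYPE["general_document"])
```

Keeping the mapping in one table means a new document type only needs one new entry plus its prompt.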

1. Reception — The CLI sends a webhook with the options, the question type (single/multi-select), and a timeout (300s by default).

2. Formatting — n8n builds an inline keyboard sized to the option count. Beyond 4 options, pagination kicks in automatically (4 options per page with ◀️ ▶️ arrows).

3. Interaction — The user clicks options. In multi-select, every click toggles the ✅ and updates the keyboard in real time (via editMessageReplyMarkup). Selections are persisted in a Data Table to survive page changes.

4. Confirmation — A [Confirm] button validates the choices. The answer is returned to the CLI through the callback.

5. Free text — An optional [Other] button enables Telegram’s ForceReply mode to capture a free-form answer.
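
As a rough illustration of steps 2 to 5, the sketch below builds one page of a Telegram `inline_keyboard` (a list of button rows, each button carrying `text` and `callback_data`). The `build_keyboard` name and the `opt:`, `page:`, `confirm`, and `other` callback formats are assumptions made for this example; the real workflow may encode its callbacks differently.

```python
PAGE_SIZE = 4  # options per page, as described in step 2


def build_keyboard(options, selected, page=0, allow_other=False):
    """Build the button rows for one page of the question."""
    start = page * PAGE_SIZE
    rows = []
    for index, label in enumerate(options[start:start + PAGE_SIZE], start=start):
        mark = "✅ " if index in selected else ""
        # One option per row; callback_data format is hypothetical.
        rows.append([{"text": mark + label, "callback_data": f"opt:{index}"}])

    # Navigation arrows only appear when there is a page to go to.
    last_page = (len(options) - 1) // PAGE_SIZE
    nav = []
    if page > 0:
        nav.append({"text": "◀️", "callback_data": f"page:{page - 1}"})
    if page < last_page:
        nav.append({"text": "▶️", "callback_data": f"page:{page + 1}"})
    if nav:
        rows.append(nav)

    footer = []
    if allow_other:
        footer.append({"text": "Other", "callback_data": "other"})
    footer.append({"text": "Confirm", "callback_data": "confirm"})
    rows.append(footer)
    return rows
```

Telegram limits `callback_data` to 64 bytes, which is why the sketch sends option indexes rather than full labels.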

Everything happens inside a single Telegram message — no spam of one message per interaction.
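
The single-message behaviour relies on editing the keyboard in place rather than sending a new message. Below is a minimal sketch of that loop. The `state` dictionary stands in for the Data Table row, and the `opt:<i>` / `page:<n>` callback formats are assumptions. Only the `chat_id`, `message_id`, and `reply_markup` parameters come from the Bot API's `editMessageReplyMarkup` method.

```python
def apply_callback(state: dict, callback_data: str) -> dict:
    """Return the updated state after one button press.

    `state` is {"selected": set of option indexes, "page": current page}.
    """
    kind, _, value = callback_data.partition(":")
    selected, page = set(state["selected"]), state["page"]
    if kind == "opt":
        selected ^= {int(value)}  # toggle: add if absent, remove if present
    elif kind == "page":
        page = int(value)  # selections are kept when the page changes
    return {"selected": selected, "page": page}


def edit_markup_payload(chat_id, message_id, keyboard):
    """Request body for editMessageReplyMarkup on the existing message."""
    return {
        "chat_id": chat_id,
        "message_id": message_id,
        "reply_markup": {"inline_keyboard": keyboard},
    }
```

Because the state is copied and returned rather than mutated, each callback can be persisted as a single write before the keyboard is redrawn.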

1. Classification — Gemini Flash analyses the base64 image and returns a document type with a confidence score.

2. Specialised extraction — Depending on the detected type, a specific prompt is sent to Gemini Vision. Each branch extracts different fields:

Photo received → Gemini Flash · Classification →

  • business_card → name, email, phone, company
  • invoice → vendor, number, lines, total
  • screenshot → text, UI
  • handwritten_note → transcription, confidence
  • general_document → structured text
  • not_document → ignore

3. Normalisation — The reply is formatted as sanitised HTML for Telegram with a uniform contract: {status, docType, extracted, text}.
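
The normalisation step might look like the sketch below. The four keys come from the contract above. The `normalise` name, the `"ok"` / `"ignored"` status values, and the bold-label layout are assumptions. Escaping uses `html.escape` with `quote=False` so that only `&`, `<`, and `>` are rewritten, which is what Telegram's HTML parse mode requires.

```python
import html


def normalise(doc_type: str, extracted: dict) -> dict:
    """Build the uniform reply {status, docType, extracted, text}."""
    if doc_type == "not_document":
        # Hypothetical status value for photos that are skipped.
        return {"status": "ignored", "docType": doc_type, "extracted": {}, "text": ""}

    lines = [f"<b>{html.escape(doc_type, quote=False)}</b>"]
    for key, value in extracted.items():
        safe_key = html.escape(str(key), quote=False)
        safe_value = html.escape(str(value), quote=False)
        lines.append(f"<b>{safe_key}</b>: {safe_value}")

    return {
        "status": "ok",
        "docType": doc_type,
        "extracted": extracted,
        "text": "\n".join(lines),
    }
```

Escaping every extracted value matters because OCR output can contain `<` or `&`, which would otherwise make Telegram reject the message.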

The profile system lets you configure different AI personas with specific knowledge and tools. Each profile is a YAML file in /workspace/profiles/ that defines:

  • A specialised system prompt
  • A knowledge base (Markdown files injected into the context)
  • A list of allowed MCP tools (ceiling semantics: the profile cannot use anything outside this list)
  • Tools requiring Telegram approval

Two profiles are deployed: error-analyst (DLQ analysis, 5 read tools) and n8n-admin (workflow administration, 5 read + 2 write with confirmation).
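
To make the tool rules concrete, here is a small sketch of how a profile could gate a tool call. It reads "ceiling" as an upper bound on what a profile may use. The dictionary keys (`allowed_tools`, `approval_required`) and the tool names are hypothetical and do not reflect the real YAML schema.

```python
def authorise(profile: dict, tool: str) -> str:
    """Return "deny", "ask" (Telegram approval needed), or "allow"."""
    if tool not in profile["allowed_tools"]:
        return "deny"  # ceiling: nothing outside the list can be used
    if tool in profile["approval_required"]:
        return "ask"
    return "allow"


# Hypothetical profile with one read tool and one write tool.
example_profile = {
    "allowed_tools": ["workflow_list", "workflow_update"],
    "approval_required": ["workflow_update"],
}
```

With this profile, `authorise(example_profile, "workflow_update")` returns `"ask"`, while a tool missing from the list is denied outright.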


| Limit | Impact | Mitigation |
|---|---|---|
| 5-min timeout | Question expires without an answer | Sufficient for interactive use |
| 4 options/page | Lots of pages with 20+ options | Pagination preserving selections |
| OCR depends on Gemini | No local fallback | Fast and reliable in practice |

If more accurate OCR is needed:

  • Add specialised models (Tesseract for standard fonts)
  • Post-process invoices with total validation
  • Direct integration with Odoo accounting
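
For the invoice post-processing idea, total validation could be as simple as the sketch below. It assumes each extracted line carries an `amount` field and that the invoice total is the plain sum of those lines, with taxes appearing as their own line. Both are assumptions of this example, not the extraction's actual output format.

```python
from decimal import Decimal


def total_matches(lines, total, tolerance=Decimal("0.01")) -> bool:
    """Check that the line amounts add up to the stated total."""
    # Decimal avoids binary floating-point drift on currency values.
    computed = sum((Decimal(str(line["amount"])) for line in lines), Decimal("0"))
    return abs(computed - Decimal(str(total))) <= tolerance
```

A mismatch does not say which value is wrong, so it is best used as a flag for manual review.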

If multi-user support is needed:

  • Map CLI sessions to Telegram users
  • Queue if several questions arrive at once

  • AI Stack — CLI Ollama and Gemini Vision