Vision OCR

1. What? — Definition and context

Vision OCR is an N8N sub-workflow called from the Telegram Orchestrator (through its Binary Content Handler) every time a user sends a photo. It classifies the image into one of six types (five document categories plus `not_document`), then applies a specialised extraction schema to each document category and returns structured data ready to be consumed by business workflows (Odoo, notes, invoices).

Metadata

Field	Value
Workflow ID	`2ZDgU3TWbF4OeOKY`
Type	Execute Workflow Trigger (passthrough)
Nodes	14
Called by	Binary Content Handler (`QtkJDN8XAGlpcSPV`)
Model	`gemini-flash-yolo` via cli-ollama
Issue	#176

Architecture diagram

The 6 document types

Type	Extracted fields	Typical use
`business_card`	name, function, company_name, email, phone, mobile, street, city, zip, country, website, comment	Odoo contact creation
`invoice`	vendor, invoice_number, date, items[], subtotal, tax, total, currency	Odoo accounting record
`screenshot`	visible_text, ui_elements[], error_messages[], context	Conversational chat, tickets
`handwritten_note`	transcribed_text, confidence, language	Note in Obsidian vault
`general_document`	full_text, document_title, key_sections[]	Searchable text
`not_document`	(none)	Scene photo, AI Router fallback

2. Why? — Stakes and motivations

The problem without Vision OCR

Before this sub-workflow, every photo sent to the Telegram bot was treated as an empty message or required manual data entry. Photographing a business card and then typing every field into Odoo took several minutes per contact.

Problem without extraction	Consequence
Manual entry	Several minutes per card, high error rate
No structure	Impossible to filter/search later
No classification	Every photo handled identically
No fallback	A landscape photo produced an error

Why a single prompt isn’t enough

Asking a vision model “extract the useful information” on any image yields uneven results. A business card has predictable fields, an invoice has a different layout, an error screenshot has no “fields”. The two-phase pipeline resolves that tension:

Approach	Pros	Cons
Single generic prompt	Simple, 1 LLM call	Inconsistent fields, unstable JSON format
Pipeline classify → extract	Schema tailored per type	2 LLM calls, latency x2

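The two LLM calls together take about 3-6 s (1-2 s for classification plus 2-4 s for extraction, per the Performance table), roughly double a single call. This is acceptable because the user is already waiting for a reply. The extraction quality and the stable output contract justify the cost.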
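The control flow of the two-phase pipeline can be sketched as follows. This is a minimal illustration, not the workflow's actual nodes: `run_pipeline`, `classify` and `extract` are hypothetical names, the two callables stand in for the LLM calls, and the HTML `text` field of the real contract is omitted.

```python
from typing import Callable

# The five categories that have an extraction schema (from the table above).
DOC_TYPES = {
    "business_card",
    "invoice",
    "screenshot",
    "handwritten_note",
    "general_document",
}


def run_pipeline(
    image_b64: str,
    classify: Callable[[str], str],
    extract: Callable[[str, str], dict],
) -> dict:
    """Phase 1 picks a type; phase 2 runs only for real documents."""
    doc_type = classify(image_b64)  # first LLM call
    if doc_type not in DOC_TYPES:
        # not_document, or any label the classifier was not asked for
        return {"status": "fallback", "docType": "not_document"}
    extracted = extract(image_b64, doc_type)  # second LLM call, per-type schema
    return {"status": "success", "docType": doc_type, "extracted": extracted}
```

Passing the model calls in as parameters keeps the sketch testable without a running model: a scene photo never triggers the second call, so only real documents pay the doubled latency.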

Why Gemini Flash over GPT-4 Vision and Claude 3 Vision

Criterion	Gemini Flash	GPT-4 Vision	Claude 3 Vision
Cost	Free (cli-ollama)	$0.01 / image	$0.005 / image
Latency	1-3 s	3-8 s	2-5 s
Structured JSON	Variable, defensive parsing required	Stable	Stable
Latin OCR	Excellent	Excellent	Excellent
Handwritten OCR	Good	Excellent	Very good
Hosting	Self-hosted (cli-ollama)	OpenAI API	Anthropic API

The Gemini Flash choice aligns with the cli-ollama multi-provider strategy: no per-image cost, acceptable latency, parseable JSON with a defensive regex fallback.

3. How? — Technical implementation

Output contract

The workflow returns one of three statuses. Every response is a JSON object with a `status` field; the remaining fields depend on the status:

Success:

{
  "status": "success",
  "docType": "invoice",
  "extracted": {
    "vendor": "Amazon",
    "invoice_number": "INV-2026-001",
    "total": 23.98,
    "currency": "EUR"
  },
  "text": "<b>Invoice</b>\n\nVendor: <b>Amazon</b>\n..."
}

Fallback (not a document):

{
  "status": "fallback",
  "docType": "not_document"
}

Error (cli-ollama request failed):

{
  "status": "error",
  "error": "cli-ollama request failed (HTTP 500)"
}

The `text` field contains an HTML-formatted representation ready to send to Telegram. The `extracted` field contains structured data that a caller workflow can consume, for instance to create an Odoo record.
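A caller can check a response against this contract before using it. The sketch below is one way to do that: the key sets are read off the three examples above, and `missing_keys` is a hypothetical helper, not part of the workflow.

```python
# Keys expected alongside "status", as shown in the three examples above.
REQUIRED_KEYS = {
    "success": {"docType", "extracted", "text"},
    "fallback": {"docType"},
    "error": {"error"},
}


def missing_keys(result: dict) -> set:
    """Return the contract keys absent from a Vision OCR response."""
    status = result.get("status")
    if not isinstance(status, str) or status not in REQUIRED_KEYS:
        # Without a known status the rest of the payload cannot be interpreted.
        return {"status"}
    return REQUIRED_KEYS[status] - set(result)
```

An empty set means the response has every key its status requires.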

Fallback strategy

JSON parsing of Gemini output is defensive at several levels:

Case	Behaviour
Clean JSON	Direct parse
JSON wrapped in markdown backticks	Strip ``` then parse
JSON with text preamble	Regex `\{[\s\S]*\}` then parse
Malformed JSON	Fall back to `general_document` with `text: <raw response>`
Unknown type returned	Force `not_document`
HTTP error from cli-ollama	`status: error` propagated without crashing

This defensive layer is necessary because Gemini Flash sometimes returns extra text (“Here is the result:”) before the JSON, or wraps the JSON in backticks. Without it, the caller workflow would receive parsing errors instead of a graceful fallback.
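The parsing cascade in the table can be sketched in a few lines. This is an illustrative reimplementation rather than the code of the workflow's own node: `parse_model_json` and `normalise_type` are hypothetical names, and backtick stripping is done with plain string operations.

```python
import json
import re

KNOWN_TYPES = {
    "business_card",
    "invoice",
    "screenshot",
    "handwritten_note",
    "general_document",
    "not_document",
}


def parse_model_json(raw: str):
    """Return the JSON object found in a model reply, or None if unrecoverable."""
    text = raw.strip()
    # Case 2: drop surrounding backticks and an optional "json" language tag.
    unfenced = text.strip("`").strip()
    if unfenced.lower().startswith("json"):
        unfenced = unfenced[4:]
    candidates = [text, unfenced]
    # Case 3: take everything from the first "{" to the last "}".
    match = re.search(r"\{[\s\S]*\}", text)
    if match:
        candidates.append(match.group(0))
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None  # Case 4: malformed, the caller applies its fallback


def normalise_type(parsed) -> str:
    """Map a missing or unknown classifier label to not_document."""
    doc_type = parsed.get("type") if isinstance(parsed, dict) else None
    if isinstance(doc_type, str) and doc_type in KNOWN_TYPES:
        return doc_type
    return "not_document"
```

Candidates are tried from least to most invasive, so a clean reply is never altered by the regex step.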

Per-type prompts

The initial detector classifies the image with a short prompt:

Analyze this image and classify it as ONE of:
- business_card
- invoice
- screenshot
- handwritten_note
- general_document
- not_document
Return ONLY JSON: {"type": "...", "confidence": 0.95}

The Switch node then routes the image to a specialised extraction prompt. In the business card example below, the fields are aligned with Odoo’s `res.partner` model to allow direct creation via XML-RPC:

Extract contact information from this business card.
Return ONLY JSON (null if not found):
{
  "name": "Full name",
  "function": "Job title",
  "company_name": "Company",
  "email": "...",
  "phone": "...",
  "mobile": "...",
  "street": "...",
  "city": "...",
  "zip": "...",
  "country": "...",
  "website": "...",
  "comment": "LinkedIn or notes"
}
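Before the extracted card reaches Odoo, null fields need to be dropped. The sketch below shows one possible preparation step; `to_partner_values` is a hypothetical helper. It sets the country aside because `res.partner` stores it as a `country_id` relation to `res.country` rather than as free text, so the name has to be resolved to an ID by the caller.

```python
from typing import Optional, Tuple

# Prompt fields that are written to res.partner under the same name.
PARTNER_FIELDS = (
    "name", "function", "company_name", "email", "phone", "mobile",
    "street", "city", "zip", "website", "comment",
)


def to_partner_values(extracted: dict) -> Tuple[dict, Optional[str]]:
    """Build res.partner values from an extracted card, skipping empty fields.

    Returns (values, country_name); country_name still needs a lookup in
    res.country before it can be written as country_id.
    """
    values = {}
    for field in PARTNER_FIELDS:
        value = extracted.get(field)
        if isinstance(value, str):
            value = value.strip()
        if value:  # skips None and empty strings
            values[field] = value
    country = extracted.get("country")
    country = country.strip() if isinstance(country, str) else None
    return values, (country or None)
```

With Odoo's external API, the resulting dictionary would typically be passed to `execute_kw(db, uid, password, "res.partner", "create", [values])`. The connection details are deployment-specific and not shown here.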

Telegram Orchestrator integration

The Binary Content Handler calls Vision OCR via Execute Workflow then routes the result based on status:

Status	Routing
`success` + `docType=business_card`	Offer to save as Odoo contact (Telegram buttons)
`success` + `docType=invoice`	Offer to store as accounting record
`success` + `docType` in `screenshot`, `handwritten_note`, `general_document`	Direct display + “Discuss” button (starts a conversation)
`fallback` (`not_document`)	Switch to AI Router for conversational handling
`error`	Error notification via Notification Hub
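The routing table above amounts to a small decision function. In this sketch the returned action labels are hypothetical placeholders for the corresponding branches. Treating an unexpected payload as an error is an assumption, not a documented behaviour.

```python
DISPLAY_TYPES = {"screenshot", "handwritten_note", "general_document"}


def route(result: dict) -> str:
    """Pick the next step for a Vision OCR response."""
    status = result.get("status")
    if status == "fallback":
        return "ai_router"  # conversational handling
    if status == "success":
        doc_type = result.get("docType")
        if doc_type == "business_card":
            return "offer_odoo_contact"
        if doc_type == "invoice":
            return "offer_accounting_record"
        if doc_type in DISPLAY_TYPES:
            return "display_with_discuss_button"
    # status == "error", or a payload outside the contract (assumption)
    return "notify_error"
```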

Performance

Step	Typical latency
base64 encoding + transfer	100-500 ms (depending on photo size)
Classification detection	1-2 s
Schema-driven extraction	2-4 s
HTML format + return	< 100 ms
End-to-end total	3-7 s

Telegram compression already brings photos down to ~1 MB max, which keeps latency in an acceptable range even for high-resolution business cards.
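The end-to-end range can be cross-checked by summing the per-step bounds from the table. The lower bound of 0 ms for the formatting step is an assumption, since the table only gives "< 100 ms".

```python
# (min_ms, max_ms) per step, taken from the performance table.
STEPS_MS = {
    "encode_and_transfer": (100, 500),
    "classification": (1000, 2000),
    "extraction": (2000, 4000),
    "format_and_return": (0, 100),  # lower bound assumed
}

low_ms = sum(lo for lo, _ in STEPS_MS.values())
high_ms = sum(hi for _, hi in STEPS_MS.values())
print(f"{low_ms / 1000:.1f}-{high_ms / 1000:.1f} s")  # prints "3.1-6.6 s"
```

The sum of 3.1-6.6 s is consistent with the rounded 3-7 s total.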

4. What if? — Outlook and limits

Current limits

Limit	Impact	Mitigation
Latency x2	Two successive LLM calls	Acceptable, the user is already waiting
No multi-doc classification	One photo with a card + a receipt = a single type	Ask for two separate photos
No Odoo validation	A malformed email would create an invalid partner	Validation in the caller workflow
No memory across photos	Each photo is handled in isolation	The conversational system keeps context once the photo is extracted
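For the "No Odoo validation" limit, the caller workflow could apply a basic check before creating a partner. The sketch below is deliberately loose: it only rejects obviously malformed values. The pattern and the `looks_like_email` name are illustrative assumptions, not the workflow's actual validation.

```python
import re

# One "@", no whitespace, and at least one dot in the domain part.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def looks_like_email(value) -> bool:
    """Cheap sanity check; it does not prove that the address exists."""
    return isinstance(value, str) and EMAIL_PATTERN.fullmatch(value.strip()) is not None
```

A value that fails the check can be dropped from the record or shown to the user for correction instead of being written to Odoo.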

Evolution scenarios

If extraction quality degrades:

Move to gemini-pro-yolo for complex types (multi-line invoices)
Add a validation/correction step via a second prompt
Compare with an alternative model (Claude Vision via Anthropic API if quota allows)

If new document types emerge:

Add a new case to the Switch + a dedicated extraction prompt
Keep the {status, docType, extracted, text} contract unchanged so caller workflows don’t break
Candidate examples: id_card, passport, recipe, prescription

If volumes grow significantly:

Add a shared classification cache so that an identical image resent within 1-2 days skips the classification call
Move to a quantised local Vision model (LLaVA, MiniCPM) hosted on the VPS to remove the network round-trip to an external API
Batch several images in a single API call
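The caching idea could look like the sketch below. It assumes an in-memory store keyed by the SHA-256 of the image bytes, with a two-day expiry. `ClassificationCache` is a hypothetical class, and a real deployment would need a store shared between workers. Hashing raw bytes only matches byte-identical files, so a re-compressed copy of the same photo would miss.

```python
import hashlib
import time


class ClassificationCache:
    """Remember the document type of recently seen images."""

    def __init__(self, ttl_seconds: float = 2 * 24 * 3600, clock=time.time):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._entries = {}  # sha256 hex digest -> (stored_at, doc_type)

    @staticmethod
    def _key(image_bytes: bytes) -> str:
        return hashlib.sha256(image_bytes).hexdigest()

    def get(self, image_bytes: bytes):
        """Return the cached type, or None if absent or expired."""
        key = self._key(image_bytes)
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, doc_type = entry
        if self._clock() - stored_at > self._ttl:
            del self._entries[key]  # expired: forget it
            return None
        return doc_type

    def put(self, image_bytes: bytes, doc_type: str) -> None:
        self._entries[self._key(image_bytes)] = (self._clock(), doc_type)
```

On a cache hit the workflow can go straight to extraction, which saves the 1-2 s classification call.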

Infrastructure

AI Stack — cli-ollama and the Gemini/Claude routing
N8N Queue Mode — Backend running the sub-workflow

Workflows

Telegram Orchestrator — Calling Binary Content Handler
Conversational system — Logical follow-up after a screenshot extraction
Voice Transcription — Analogous pipeline for audio

Reference

Glossary — OCR, Vision, Sub-workflow, cli-ollama