Vision OCR

Vision OCR is an n8n sub-workflow called from the Telegram Orchestrator every time a user sends a photo. It classifies the image as one of five recognised document types (or as not_document), then applies a specialised extraction schema and returns structured data ready for business workflows (Odoo, notes, invoices).

| Field | Value |
| --- | --- |
| Workflow ID | 2ZDgU3TWbF4OeOKY |
| Type | Execute Workflow Trigger (passthrough) |
| Nodes | 14 |
| Called by | Binary Content Handler (QtkJDN8XAGlpcSPV) |
| Model | gemini-flash-yolo via cli-ollama |
| Issue | #176 |

[Diagram: pipeline flow. Input (passthrough from the Binary Content Handler): Telegram photo as base64 + mimeType. Phase 1, Classification: HTTP call to cli-ollama (gemini-flash-yolo), then Parse Detection (regex fallback), then Switch Doc Type. Phase 2, Schema-driven extraction: Extract Business Card, Extract Invoice, Extract Screenshot, Extract Handwritten, Extract General Document. Output, uniform contract: Parse and Format (escape HTML). A recognized type ends in status: success, not_document ends in status: fallback, and an HTTP error ends in status: error.]

| Type | Extracted fields | Typical use |
| --- | --- | --- |
| business_card | name, function, company_name, email, phone, mobile, street, city, zip, country, website, comment | Odoo contact creation |
| invoice | vendor, invoice_number, date, items[], subtotal, tax, total, currency | Odoo accounting record |
| screenshot | visible_text, ui_elements[], error_messages[], context | Conversational chat, tickets |
| handwritten_note | transcribed_text, confidence, language | Note in Obsidian vault |
| general_document | full_text, document_title, key_sections[] | Searchable text |
| not_document | (none) | Scene photo, AI Router fallback |

Before this sub-workflow, every photo sent to the Telegram bot was treated as an empty message or required manual data entry. Photographing a business card and then typing every field into Odoo took several minutes per contact.

| Problem without extraction | Consequence |
| --- | --- |
| Manual entry | Several minutes per card, high error rate |
| No structure | Impossible to filter/search later |
| No classification | Every photo handled identically |
| No fallback | A landscape photo produced an error |

Asking a vision model to “extract the useful information” from any image yields uneven results. A business card has predictable fields, an invoice has a different layout, and an error screenshot has no “fields” at all. The two-phase pipeline resolves that tension:

| Approach | Pros | Cons |
| --- | --- | --- |
| Single generic prompt | Simple, 1 LLM call | Inconsistent fields, unstable JSON format |
| Pipeline classify → extract | Schema tailored per type | 2 LLM calls, latency x2 |

The latency overhead (≈ 4-6 s) is acceptable because the user is already waiting for the transcription. The extraction quality and the stable output contract justify the cost.
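
The classify-then-extract flow can be sketched as follows. This is a minimal illustration, not the workflow's actual node logic: `call_model` is a hypothetical callable standing in for the HTTP request to cli-ollama, the prompt strings are placeholders, and only two of the five extraction schemas are listed for brevity.

```python
import json
from typing import Callable

# Placeholder prompts (assumption); the real ones live in the workflow nodes.
CLASSIFY_PROMPT = "Classify this image. Return ONLY JSON."
EXTRACT_PROMPTS = {
    "business_card": "Extract contact information. Return ONLY JSON.",
    "invoice": "Extract invoice fields. Return ONLY JSON.",
}


def run_pipeline(image_b64: str, call_model: Callable[[str, str], str]) -> dict:
    """Two-phase flow: one call to classify, one call to extract."""
    # Phase 1: classification
    detected = json.loads(call_model(CLASSIFY_PROMPT, image_b64))
    doc_type = detected.get("type", "not_document")
    if doc_type not in EXTRACT_PROMPTS:
        # No schema for this type: skip the second call entirely.
        return {"status": "fallback", "docType": "not_document"}
    # Phase 2: extraction with the prompt selected by the detected type
    extracted = json.loads(call_model(EXTRACT_PROMPTS[doc_type], image_b64))
    return {"status": "success", "docType": doc_type, "extracted": extracted}
```

The second call only happens for recognised types, so a scene photo costs a single LLM call rather than two.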

| Criterion | Gemini Flash | GPT-4 Vision | Claude 3 Vision |
| --- | --- | --- | --- |
| Cost | Free (cli-ollama) | $0.01 / image | $0.005 / image |
| Latency | 1-3 s | 3-8 s | 2-5 s |
| Structured JSON | Variable, defensive parsing required | Stable | Stable |
| Latin OCR | Excellent | Excellent | Excellent |
| Handwritten OCR | Good | Excellent | Very good |
| Hosting | Self-hosted (cli-ollama) | OpenAI API | Anthropic API |

The Gemini Flash choice aligns with the cli-ollama multi-provider strategy: no per-image cost, acceptable latency, parseable JSON with a defensive regex fallback.


Three possible statuses, all returned in the same format:

Success:

{
  "status": "success",
  "docType": "invoice",
  "extracted": {
    "vendor": "Amazon",
    "invoice_number": "INV-2026-001",
    "total": 23.98,
    "currency": "EUR"
  },
  "text": "<b>Invoice</b>\n\nVendor: <b>Amazon</b>\n..."
}

Fallback (not a document):

{
  "status": "fallback",
  "docType": "not_document"
}

Error (cli-ollama unreachable):

{
  "status": "error",
  "error": "cli-ollama request failed (HTTP 500)"
}

The text field contains an HTML-formatted representation ready to send to Telegram. The extracted field contains structured data that a caller workflow can consume, for instance to create an Odoo record.
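
As a rough sketch of how the three result shapes and the HTML `text` field could be built, assuming a simple "Label: value" layout: the helper names (`format_text`, `success_result`, `fallback_result`, `error_result`) are hypothetical and do not come from the workflow. Values are escaped so that characters such as `<` or `&` in extracted data cannot break Telegram's HTML parsing.

```python
import html


def format_text(doc_type: str, extracted: dict) -> str:
    """Render extracted fields as Telegram-compatible HTML, escaping values."""
    title = html.escape(doc_type.replace("_", " ").title())
    lines = [f"<b>{title}</b>", ""]
    for key, value in extracted.items():
        if value is None:
            continue  # skip fields the model could not find
        label = html.escape(key.replace("_", " ").capitalize())
        lines.append(f"{label}: <b>{html.escape(str(value))}</b>")
    return "\n".join(lines)


def success_result(doc_type: str, extracted: dict) -> dict:
    return {
        "status": "success",
        "docType": doc_type,
        "extracted": extracted,
        "text": format_text(doc_type, extracted),
    }


def fallback_result() -> dict:
    return {"status": "fallback", "docType": "not_document"}


def error_result(message: str) -> dict:
    return {"status": "error", "error": message}
```

Callers can branch on `status` first and only then read `docType`, `extracted`, or `error`, since those keys are present only for the statuses shown above.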

JSON parsing of Gemini output is defensive at several levels:

| Case | Behaviour |
| --- | --- |
| Clean JSON | Direct parse |
| JSON wrapped in markdown backticks | Strip ``` then parse |
| JSON with text preamble | Regex \{[\s\S]*\} then parse |
| Malformed JSON | Fall back to general_document with text: <raw response> |
| Unknown type returned | Force not_document |
| HTTP error from cli-ollama | status: error propagated without crashing |

This defensive layer is necessary because Gemini Flash sometimes returns extra text (“Here is the result:”) before the JSON, or wraps the JSON in backticks. Without this layer, the caller workflow received raw parsing errors instead of a graceful fallback.
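
The parsing cascade could look roughly like the sketch below. It tries a direct parse, then a fence-stripped parse, then the `\{[\s\S]*\}` regex, and finally falls back to `general_document`. The function names and the exact shape of the fallback dictionary are assumptions made for illustration, not the workflow's actual code.

```python
import json
import re

FENCE = "`" * 3  # three backticks, the markdown code-fence marker
OBJECT_RE = re.compile(r"\{[\s\S]*\}")  # first "{" through last "}"


def strip_fence(text: str) -> str:
    """Remove a surrounding markdown fence and its optional language tag."""
    text = text.strip()
    if len(text) >= 2 * len(FENCE) and text.startswith(FENCE) and text.endswith(FENCE):
        inner = text[len(FENCE):-len(FENCE)]
        head, sep, tail = inner.partition("\n")
        if sep and not head.strip().startswith("{"):
            inner = tail  # drop the opening fence line (e.g. "json")
        return inner.strip()
    return text


def parse_model_output(raw: str) -> dict:
    """Return the first candidate that parses as a JSON object."""
    candidates = [raw.strip(), strip_fence(raw)]
    match = OBJECT_RE.search(raw)
    if match:
        candidates.append(match.group(0))
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    # Malformed JSON: keep the raw response as plain text (assumed shape).
    return {"docType": "general_document", "text": raw}
```

Ordering the candidates from strictest to loosest means the regex, which can over-match when the response contains several brace groups, is only used when the cleaner options have failed.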

The initial detector classifies the image with a short prompt:

Analyze this image and classify it as ONE of:
- business_card
- invoice
- screenshot
- handwritten_note
- general_document
- not_document
Return ONLY JSON: {"type": "...", "confidence": 0.95}
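
Once the detector's JSON has been parsed, the result still needs to be constrained to the six labels listed in the prompt. A minimal sketch, assuming a hypothetical helper named `normalize_detection`: the lower-casing and the clamping of `confidence` to the 0-1 range are assumptions added here, while forcing unknown types to `not_document` follows the behaviour this page describes.

```python
KNOWN_TYPES = {
    "business_card",
    "invoice",
    "screenshot",
    "handwritten_note",
    "general_document",
    "not_document",
}


def normalize_detection(detection: dict) -> dict:
    """Constrain the detector output to a known type and a 0-1 confidence."""
    doc_type = str(detection.get("type", "")).strip().lower()
    if doc_type not in KNOWN_TYPES:
        doc_type = "not_document"  # unknown label: force the fallback type
    try:
        confidence = float(detection.get("confidence", 0.0))
    except (TypeError, ValueError):
        confidence = 0.0  # missing or non-numeric confidence
    confidence = min(max(confidence, 0.0), 1.0)
    return {"type": doc_type, "confidence": confidence}
```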

The Switch node then routes the image to a specialised extraction prompt. For a business card, for example, the fields are aligned with Odoo’s res.partner model to allow direct creation via XML-RPC:

Extract contact information from this business card.
Return ONLY JSON (null if not found):
{
  "name": "Full name",
  "function": "Job title",
  "company_name": "Company",
  "email": "...",
  "phone": "...",
  "mobile": "...",
  "street": "...",
  "city": "...",
  "zip": "...",
  "country": "...",
  "website": "...",
  "comment": "LinkedIn or notes"
}
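
To illustrate the hand-off to Odoo, the sketch below turns the extracted object into values for `res.partner` and creates the record through Odoo's external XML-RPC API (`execute_kw`). The helper names and connection parameters are hypothetical. One assumption worth flagging: `country` is left out of the direct copy because `res.partner` stores the country as a relation (`country_id`), so a caller would have to resolve the name to a record ID first.

```python
import xmlrpc.client

# Fields copied as-is from the extraction result (assumed to map 1:1).
DIRECT_FIELDS = (
    "name", "function", "company_name", "email", "phone", "mobile",
    "street", "city", "zip", "website", "comment",
)


def to_partner_vals(extracted: dict) -> dict:
    """Keep only the fields that were found (drop null and empty values)."""
    return {
        key: extracted[key]
        for key in DIRECT_FIELDS
        if extracted.get(key) not in (None, "")
    }


def create_partner(url: str, db: str, uid: int, password: str, vals: dict) -> int:
    """Create a res.partner record and return its ID (needs a reachable Odoo)."""
    models = xmlrpc.client.ServerProxy(f"{url}/xmlrpc/2/object")
    return models.execute_kw(db, uid, password, "res.partner", "create", [vals])
```

Dropping null fields before the call keeps Odoo's defaults intact instead of overwriting them with empty values.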

The Binary Content Handler calls Vision OCR via Execute Workflow, then routes the result based on status:

| Status | Routing |
| --- | --- |
| success + docType=business_card | Offer to save as Odoo contact (Telegram buttons) |
| success + docType=invoice | Offer to store as accounting record |
| success + docType=screenshot/handwritten/general | Direct display + “Discuss” button (starts a conversation) |
| fallback (not_document) | Switch to AI Router for conversational handling |
| error | Error notification via Notification Hub |
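
The routing by status and document type can be expressed as a small decision function. This is a sketch only: the branch names returned here are hypothetical labels, not node names from the Binary Content Handler, and treating an unrecognised status as an error is an assumption.

```python
def route(result: dict) -> str:
    """Map a Vision OCR result to the branch the caller should take."""
    status = result.get("status")
    if status == "fallback":
        return "ai_router"  # not a document: conversational handling
    if status == "success":
        doc_type = result.get("docType")
        if doc_type == "business_card":
            return "offer_odoo_contact"
        if doc_type == "invoice":
            return "offer_accounting_record"
        return "display_with_discuss"  # screenshot, handwritten, general
    # "error", or any status outside the contract (assumption)
    return "notify_error"
```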

| Step | Typical latency |
| --- | --- |
| base64 encoding + transfer | 100-500 ms (depending on photo size) |
| Classification detection | 1-2 s |
| Schema-driven extraction | 2-4 s |
| HTML format + return | < 100 ms |
| End-to-end total | 3-7 s |

Telegram compression already brings photos down to ~1 MB max, which keeps latency in an acceptable range even for high-resolution business cards.
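
As a quick check of the end-to-end figure, the per-step ranges from the latency table can be summed. The only assumption is that "< 100 ms" is treated as a 0-100 ms range.

```python
# (low, high) bounds in milliseconds, taken from the per-step latencies
STEPS_MS = {
    "base64 encoding + transfer": (100, 500),
    "classification detection": (1000, 2000),
    "schema-driven extraction": (2000, 4000),
    "HTML format + return": (0, 100),  # "< 100 ms" read as 0-100 ms
}

low_ms = sum(low for low, _ in STEPS_MS.values())
high_ms = sum(high for _, high in STEPS_MS.values())

print(f"{low_ms / 1000:.1f}-{high_ms / 1000:.1f} s")  # prints "3.1-6.6 s"
```

The sum of 3.1-6.6 s is consistent with the rounded 3-7 s total, and the two LLM calls account for nearly all of it.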


| Limit | Impact | Mitigation |
| --- | --- | --- |
| Latency x2 | Two successive LLM calls | Acceptable, the user is already waiting |
| No multi-doc classification | One photo with a card + a receipt = a single type | Ask for two separate photos |
| No Odoo validation | A malformed email would create an invalid partner | Validation in the caller workflow |
| No memory across photos | Each photo is handled in isolation | The conversational system keeps context once the photo is extracted |
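
For the validation left to the caller workflow, a minimal check before creating an Odoo partner might look like this. The function name, the rule set, and the deliberately loose email pattern are all assumptions. The pattern is a basic sanity check, not a full address validator.

```python
import re

# Loose shape check: something@something.tld, no spaces, exactly one "@"
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_contact(extracted: dict) -> list[str]:
    """Return a list of problems; an empty list means the record looks usable."""
    problems = []
    if not extracted.get("name"):
        problems.append("name is missing")
    email = extracted.get("email")
    if email and not EMAIL_RE.match(email):
        problems.append(f"email looks malformed: {email}")
    return problems
```

Returning a list of problems rather than raising lets the caller show every issue to the user in one Telegram message.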

If extraction quality degrades:

  • Move to gemini-pro-yolo for complex types (multi-line invoices)
  • Add a validation/correction step via a second prompt
  • Compare with an alternative model (Claude Vision via Anthropic API if quota allows)

If new document types emerge:

  • Add a new case to the Switch + a dedicated extraction prompt
  • Keep the {status, docType, extracted, text} contract unchanged so caller workflows don’t break
  • Candidate examples: id_card, passport, recipe, prescription
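
Adding a document type while keeping the contract stable can be pictured as a registry plus a contract check. In the workflow itself this means a new Switch case and a new prompt node. The registry, the helper names, and the `id_card` prompt text below are purely illustrative assumptions.

```python
# Keys every success result is expected to carry, whatever the document type
CONTRACT_KEYS = {"status", "docType", "extracted", "text"}

# Placeholder prompts (assumption); the real ones live in the workflow nodes.
EXTRACTION_PROMPTS = {
    "business_card": "Extract contact information from this business card.",
    "invoice": "Extract invoice fields from this document.",
}


def register_doc_type(doc_type: str, prompt: str) -> None:
    """Add a new type without touching the existing ones."""
    if doc_type in EXTRACTION_PROMPTS:
        raise ValueError(f"{doc_type} is already registered")
    EXTRACTION_PROMPTS[doc_type] = prompt


def keeps_contract(result: dict) -> bool:
    """True when a success result exposes exactly the contract keys."""
    return set(result) == CONTRACT_KEYS


# Hypothetical new type with a placeholder prompt
register_doc_type("id_card", "Extract identity fields from this ID card. Return ONLY JSON.")
```

Only the content of `extracted` changes per type, so existing callers keep working as long as `keeps_contract` holds for the new branch.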

If volumes grow significantly:

  • Share the classification cache (same images within 1-2 days)
  • Move to a quantised local Vision model (LLaVA, MiniCPM) hosted on the VPS for zero network latency
  • Batch several images in a single API call
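
The classification cache idea could be sketched as an in-memory store keyed by a hash of the image bytes, with a time-to-live of two days. This is an assumption-laden illustration: the class name is hypothetical, a shared cache would need an external store rather than a Python dict, and an exact byte hash only matches identical files, not recompressed copies of the same photo.

```python
import hashlib
import time
from typing import Callable, Optional


class ClassificationCache:
    """Remember the detected type of an image for a limited time."""

    def __init__(self, ttl_seconds: float = 2 * 24 * 3600,
                 clock: Callable[[], float] = time.time) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable for testing
        self._entries: dict[str, tuple[str, float]] = {}

    @staticmethod
    def key(image_bytes: bytes) -> str:
        return hashlib.sha256(image_bytes).hexdigest()

    def get(self, image_bytes: bytes) -> Optional[str]:
        cache_key = self.key(image_bytes)
        entry = self._entries.get(cache_key)
        if entry is None:
            return None
        doc_type, stored_at = entry
        if self._clock() - stored_at > self._ttl:
            del self._entries[cache_key]  # expired: drop and report a miss
            return None
        return doc_type

    def put(self, image_bytes: bytes, doc_type: str) -> None:
        self._entries[self.key(image_bytes)] = (doc_type, self._clock())
```

A cache hit skips the classification call only. The extraction call still runs unless its result is cached as well.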

  • Glossary — OCR, Vision, Sub-workflow, cli-ollama