---
title: Vision OCR
url: https://blog.guigpap.com/en/workflows/vision-ocr/
url_md: https://blog.guigpap.com/en/workflows/vision-ocr.md
category: automation
date: '2026-05-04'
maturite: production
techno:
  - n8n
  - claude
application:
  - ai
  - content
---

# Vision OCR

> Image classification and structured extraction sub-workflow via Gemini Vision and cli-ollama

## 1. What? — Definition and context

**Vision OCR** is an N8N sub-workflow called from the [Telegram Orchestrator](/en/workflows/telegram-orchestrator/) every time a user sends a photo. It classifies the image into one of five recognised document categories, then applies a specialised extraction schema to return structured data ready to be consumed by business workflows (Odoo, notes, invoices).

> **Note - Pure-extraction sub-workflow**
>
> Vision OCR does not talk to Telegram, does not write to Odoo, does not notify anyone. It receives an image as input, returns a `{status, docType, extracted, text}` object as output. It is a plumbing component — the decision about what to do with the extraction belongs to the caller workflow.

### Metadata

| Field | Value |
|-------|-------|
| **Workflow ID** | `2ZDgU3TWbF4OeOKY` |
| **Type** | Execute Workflow Trigger (passthrough) |
| **Nodes** | 14 |
| **Called by** | Binary Content Handler (`QtkJDN8XAGlpcSPV`) |
| **Model** | `gemini-flash-yolo` via cli-ollama |
| **Issue** | #176 |

### Architecture diagram

```mermaid
flowchart TD
    subgraph Input["Input · passthrough Binary Content Handler"]
        direction TB
        Photo["Telegram photo · base64 + mimeType"]
    end
    subgraph Detect["Phase 1 · Classification"]
        direction TB
        HTTP1["HTTP cli-ollama gemini-flash-yolo"]
        Parse["Parse Detection · regex fallback"]
        SwitchType["Switch Doc Type"]
    end
    subgraph Extract["Phase 2 · Schema-driven extraction"]
        direction TB
        BC["Extract Business Card"]
        INV["Extract Invoice"]
        SC["Extract Screenshot"]
        HW["Extract Handwritten"]
        GEN["Extract General Document"]
    end
    subgraph Output["Output · uniform contract"]
        direction TB
        FormatResult["Parse and Format · escape HTML"]
        Success["status: success"]
        Fallback["status: fallback · not_document"]
        ErrOut["status: error"]
    end
    Input --> HTTP1 --> Parse
    Parse -->|HTTP error| ErrOut
    Parse -->|not_document| Fallback
    Parse -->|recognized type| SwitchType
    SwitchType --> BC
    SwitchType --> INV
    SwitchType --> SC
    SwitchType --> HW
    SwitchType --> GEN
    BC --> FormatResult
    INV --> FormatResult
    SC --> FormatResult
    HW --> FormatResult
    GEN --> FormatResult
    FormatResult --> Success
```

### The 6 document types

| Type | Extracted fields | Typical use |
|------|------------------|-------------|
| `business_card` | name, function, company_name, email, phone, mobile, street, city, zip, country, website, comment | Odoo contact creation |
| `invoice` | vendor, invoice_number, date, items[], subtotal, tax, total, currency | Odoo accounting record |
| `screenshot` | visible_text, ui_elements[], error_messages[], context | Conversational chat, tickets |
| `handwritten_note` | transcribed_text, confidence, language | Note in Obsidian vault |
| `general_document` | full_text, document_title, key_sections[] | Searchable text |
| `not_document` | (none) | Scene photo, AI Router fallback |

---

## 2. Why? — Stakes and motivations

### The problem without Vision OCR

Before this sub-workflow, every photo sent to the Telegram bot was treated as an empty message or required manual data entry. Photographing a business card and then typing every field into Odoo took several minutes per contact.

| Problem without extraction | Consequence |
|----------------------------|-------------|
| **Manual entry** | Several minutes per card, high error rate |
| **No structure** | Impossible to filter/search later |
| **No classification** | Every photo handled identically |
| **No fallback** | A landscape photo produced an error |

### Why a single prompt isn't enough

Asking a vision model *"extract the useful information"* on any image yields uneven results. A business card has predictable fields, an invoice has a different layout, an error screenshot has no "fields". The two-phase pipeline resolves that tension:

| Approach | Pros | Cons |
|----------|------|------|
| **Single generic prompt** | Simple, 1 LLM call | Inconsistent fields, unstable JSON format |
| **Pipeline classify → extract** | Schema tailored per type | 2 LLM calls, latency x2 |

The latency overhead (≈ 4-6 s) is acceptable because the user is already waiting for the transcription. The extraction quality and the stable output contract justify the cost.

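
The classify → extract approach compared above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the workflow's actual nodes: `ask_model`, `render_text`, `run_vision_ocr` and the prompt placeholders are hypothetical names, and treating a failure in the extraction call as `status: error` is an assumption (the architecture diagram only shows the error branch after classification).

```python
import html
from typing import Callable

# Hypothetical placeholders; the real prompts live in the workflow nodes.
CLASSIFY_PROMPT = "classify"
EXTRACTION_PROMPTS = {
    "business_card": "extract business card",
    "invoice": "extract invoice",
    "screenshot": "extract screenshot",
    "handwritten_note": "extract handwritten note",
    "general_document": "extract general document",
}

# Assumed signature: ask_model(prompt, image_b64, mime_type) returns the
# model's JSON answer as a dict and raises when the HTTP call fails.
AskModel = Callable[[str, str, str], dict]


def render_text(doc_type: str, extracted: dict) -> str:
    """Build an HTML-escaped summary, one line per non-null field."""
    lines = [f"<b>{html.escape(doc_type)}</b>"]
    for key, value in extracted.items():
        if value is not None:
            lines.append(f"{html.escape(key)}: {html.escape(str(value))}")
    return "\n".join(lines)


def run_vision_ocr(image_b64: str, mime_type: str, ask_model: AskModel) -> dict:
    """Run both phases and return an object shaped like the output contract."""
    try:
        # Phase 1: classification with the short generic prompt.
        detection = ask_model(CLASSIFY_PROMPT, image_b64, mime_type)
        doc_type = detection.get("type")
        if doc_type not in EXTRACTION_PROMPTS:
            # not_document (or anything unrecognised) goes to the fallback.
            return {"status": "fallback", "docType": "not_document"}
        # Phase 2: extraction with the prompt tailored to the detected type.
        extracted = ask_model(EXTRACTION_PROMPTS[doc_type], image_b64, mime_type)
    except Exception as exc:  # assumption: any model-call failure maps to "error"
        return {"status": "error", "error": str(exc)}
    return {
        "status": "success",
        "docType": doc_type,
        "extracted": extracted,
        "text": render_text(doc_type, extracted),
    }
```

Passing the model call in as a parameter keeps the sketch testable without a running cli-ollama instance; a stub that returns canned dictionaries is enough to exercise all three statuses.
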
### Why Gemini Flash over GPT-4 Vision

| Criterion | Gemini Flash | GPT-4 Vision | Claude 3 Vision |
|-----------|--------------|--------------|-----------------|
| **Cost** | Free (cli-ollama) | $0.01 / image | $0.005 / image |
| **Latency** | 1-3 s | 3-8 s | 2-5 s |
| **Structured JSON** | Variable, defensive parsing required | Stable | Stable |
| **Latin OCR** | Excellent | Excellent | Excellent |
| **Handwritten OCR** | Good | Excellent | Very good |
| **Hosting** | Self-hosted (cli-ollama) | OpenAI API | Anthropic API |

The Gemini Flash choice aligns with the cli-ollama multi-provider strategy: no per-image cost, acceptable latency, parseable JSON with a defensive regex fallback.

---

## 3. How? — Technical implementation

### Output contract

Three possible statuses, all returned in the same format:

**Success:**

```json
{
  "status": "success",
  "docType": "invoice",
  "extracted": {
    "vendor": "Amazon",
    "invoice_number": "INV-2026-001",
    "total": 23.98,
    "currency": "EUR"
  },
  "text": "Invoice\n\nVendor: Amazon\n..."
}
```

**Fallback (not a document):**

```json
{
  "status": "fallback",
  "docType": "not_document"
}
```

**Error (cli-ollama unreachable):**

```json
{
  "status": "error",
  "error": "cli-ollama request failed (HTTP 500)"
}
```

The `text` field contains an HTML-formatted representation ready to send into Telegram. The `extracted` field contains structured data a caller workflow can consume to create an Odoo record, for instance.

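
A caller that wants to guard against contract drift could check the keys each status is expected to carry. The sketch below is a minimal illustration in Python; `REQUIRED_KEYS` and `check_contract` are hypothetical names, and the key sets are inferred from the three JSON examples above rather than taken from a published schema.

```python
from typing import Any

# Keys each status carries, as inferred from the three examples above.
REQUIRED_KEYS = {
    "success": {"status", "docType", "extracted", "text"},
    "fallback": {"status", "docType"},
    "error": {"status", "error"},
}


def check_contract(result: dict[str, Any]) -> list[str]:
    """Return a list of contract violations; an empty list means valid."""
    status = result.get("status")
    if not isinstance(status, str) or status not in REQUIRED_KEYS:
        return [f"unknown status: {status!r}"]
    # Report every expected key that is absent, in a stable order.
    missing = sorted(REQUIRED_KEYS[status] - set(result))
    problems = [f"missing key: {key}" for key in missing]
    if status == "fallback" and result.get("docType") != "not_document":
        problems.append("fallback must carry docType 'not_document'")
    return problems
```

Returning a list of problems instead of raising lets the caller decide whether a violation should be logged, surfaced to the user, or treated as an `error` status.
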
### Fallback strategy

JSON parsing of Gemini output is defensive at several levels:

| Case | Behaviour |
|------|-----------|
| **Clean JSON** | Direct parse |
| **JSON wrapped in markdown backticks** | Strip ` ``` ` then parse |
| **JSON with text preamble** | Regex `\{[\s\S]*\}` then parse |
| **Malformed JSON** | Fall back to `general_document` with `text: ` |
| **Unknown type returned** | Force `not_document` |
| **HTTP error from cli-ollama** | `status: error` propagated without crashing |

This defensive layer is necessary because Gemini Flash sometimes returns extra text ("Here is the result:") before the JSON, or wraps JSON in backticks. Without the fallback, the caller workflow received parsing errors instead of a graceful fallback.

### Per-type prompts

The initial detector classifies the image with a short prompt:

```
Analyze this image and classify it as ONE of:
- business_card
- invoice
- screenshot
- handwritten_note
- general_document
- not_document

Return ONLY JSON: {"type": "...", "confidence": 0.95}
```

Then the Switch routing sends the image to a specialised extraction prompt. Example for a business card — the fields are aligned with Odoo's `res.partner` model to allow direct creation via XML-RPC:

```
Extract contact information from this business card.
Return ONLY JSON (null if not found):
{
  "name": "Full name",
  "function": "Job title",
  "company_name": "Company",
  "email": "...",
  "phone": "...",
  "mobile": "...",
  "street": "...",
  "city": "...",
  "zip": "...",
  "country": "...",
  "website": "...",
  "comment": "LinkedIn or notes"
}
```

> **Tip - cli-ollama -yolo suffix**
>
> The model is `gemini-flash-yolo` (not `gemini-flash`). The `-yolo` suffix means "auto-approve dangerous", which disables the CLI Ollama interactive plan mode and avoids a timeout on the N8N webhook (120 s limit). See [AI Stack](/en/infrastructure/ai-stack/) for details.

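
The parsing levels from the fallback strategy table, applied to the classification answer, might look like the following Python sketch. It is not the workflow's actual Parse Detection node: `parse_model_json` and `resolve_doc_type` are hypothetical names, and only the document type is resolved here, leaving aside what goes into `text` when the JSON is malformed.

```python
import json
import re
from typing import Optional

KNOWN_TYPES = {
    "business_card", "invoice", "screenshot",
    "handwritten_note", "general_document", "not_document",
}


def _try_load(candidate: str) -> Optional[dict]:
    """Parse candidate as JSON and keep it only if it is an object."""
    try:
        parsed = json.loads(candidate)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


def parse_model_json(raw: str) -> Optional[dict]:
    """Try each defensive level in turn; return None when all of them fail."""
    text = raw.strip()
    # Level 1: the answer is already clean JSON.
    parsed = _try_load(text)
    if parsed is not None:
        return parsed
    # Level 2: JSON wrapped in markdown backticks (with or without "json").
    unfenced = re.sub(r"```(?:json)?", "", text).strip()
    parsed = _try_load(unfenced)
    if parsed is not None:
        return parsed
    # Level 3: preamble such as "Here is the result:" before the object.
    match = re.search(r"\{[\s\S]*\}", unfenced)
    return _try_load(match.group(0)) if match else None


def resolve_doc_type(raw: str) -> str:
    """Map a raw classification answer to one of the six known types."""
    parsed = parse_model_json(raw)
    if parsed is None:
        return "general_document"  # malformed JSON row of the table
    doc_type = parsed.get("type")
    if isinstance(doc_type, str) and doc_type in KNOWN_TYPES:
        return doc_type
    return "not_document"  # unknown type row of the table
```

The regex is greedy on purpose: it spans from the first `{` to the last `}`, so nested objects survive, at the cost of failing when the model emits two separate JSON objects in one answer.
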
### Telegram Orchestrator integration

The Binary Content Handler calls Vision OCR via Execute Workflow then routes the result based on `status`:

| Status | Routing |
|--------|---------|
| `success` + `docType=business_card` | Offer to save as Odoo contact (Telegram buttons) |
| `success` + `docType=invoice` | Offer to store as accounting record |
| `success` + `docType=screenshot`, `handwritten_note` or `general_document` | Direct display + "Discuss" button (starts a conversation) |
| `fallback` (`not_document`) | Switch to AI Router for conversational handling |
| `error` | Error notification via [Notification Hub](/en/workflows/notification-hub/) |

### Performance

| Step | Typical latency |
|------|-----------------|
| base64 encoding + transfer | 100-500 ms (depending on photo size) |
| Classification detection | 1-2 s |
| Schema-driven extraction | 2-4 s |
| HTML format + return | < 100 ms |
| **End-to-end total** | **3-7 s** |

Telegram compression already brings photos down to ~1 MB max, which keeps latency in an acceptable range even for high-resolution business cards.

---

## 4. What if? — Outlook and limits

### Current limits

| Limit | Impact | Mitigation |
|-------|--------|------------|
| **Latency x2** | Two successive LLM calls | Acceptable, the user is already waiting |
| **No multi-doc classification** | One photo with a card + a receipt = a single type | Ask for two separate photos |
| **No Odoo validation** | A malformed email would create an invalid partner | Validation in the caller workflow |
| **No memory across photos** | Each photo is handled in isolation | The conversational system keeps context once the photo is extracted |

### Evolution scenarios

**If extraction quality degrades**:

- Move to `gemini-pro-yolo` for complex types (multi-line invoices)
- Add a validation/correction step via a second prompt
- Compare with an alternative model (Claude Vision via Anthropic API if quota allows)

**If new document types emerge**:

- Add a new case to the Switch + a dedicated extraction prompt
- Keep the `{status, docType, extracted, text}` contract unchanged so caller workflows don't break
- Candidate examples: `id_card`, `passport`, `recipe`, `prescription`

**If volumes grow significantly**:

- Share the classification cache (same images within 1-2 days)
- Move to a quantised local Vision model (LLaVA, MiniCPM) hosted on the VPS for zero network latency
- Batch several images in a single API call

---

## Related pages

### Infrastructure

- [AI Stack](/en/infrastructure/ai-stack/) — cli-ollama and the Gemini/Claude routing
- [N8N Queue Mode](/en/infrastructure/n8n-queue-mode/) — Backend running the sub-workflow

### Workflows

- [Telegram Orchestrator](/en/workflows/telegram-orchestrator/) — Calling Binary Content Handler
- [Conversational system](/en/workflows/systeme-conversationnel/) — Logical follow-up after a screenshot extraction
- [Voice Transcription](/en/workflows/voice-transcription/) — Analogous pipeline for audio

### Reference

- [Glossary](/en/reference/glossary/) — OCR, Vision, Sub-workflow, cli-ollama

## Agent metadata

- This article comes from the GuiGPaP Lab blog.
- Global blog context: https://blog.guigpap.com/llms.txt
- Author contact: https://odoo.guigpap.com/mon-cv
- License: CC-BY-SA 4.0