Vision OCR
1. What? — Definition and context
Vision OCR is an N8N sub-workflow called from the Telegram Orchestrator every time a user sends a photo. It classifies the image into one of six types (five document categories plus a not_document fallback), then applies a specialised extraction schema to return structured data ready to be consumed by business workflows (Odoo, notes, invoices).
Metadata
| Field | Value |
|---|---|
| Workflow ID | 2ZDgU3TWbF4OeOKY |
| Type | Execute Workflow Trigger (passthrough) |
| Nodes | 14 |
| Called by | Binary Content Handler (QtkJDN8XAGlpcSPV) |
| Model | gemini-flash-yolo via cli-ollama |
| Issue | #176 |
Architecture diagram
The 6 document types
| Type | Extracted fields | Typical use |
|---|---|---|
| business_card | name, function, company_name, email, phone, mobile, street, city, zip, country, website, comment | Odoo contact creation |
| invoice | vendor, invoice_number, date, items[], subtotal, tax, total, currency | Odoo accounting record |
| screenshot | visible_text, ui_elements[], error_messages[], context | Conversational chat, tickets |
| handwritten_note | transcribed_text, confidence, language | Note in Obsidian vault |
| general_document | full_text, document_title, key_sections[] | Searchable text |
| not_document | (none) | Scene photo, AI Router fallback |
2. Why? — Stakes and motivations
The problem without Vision OCR
Before this sub-workflow, every photo sent to the Telegram bot was treated as an empty message or required manual data entry. Photographing a business card and then typing every field into Odoo took several minutes per contact.
| Problem without extraction | Consequence |
|---|---|
| Manual entry | Several minutes per card, high error rate |
| No structure | Impossible to filter/search later |
| No classification | Every photo handled identically |
| No fallback | A landscape photo produced an error |
Why a single prompt isn’t enough
Asking a vision model “extract the useful information” on any image yields uneven results. A business card has predictable fields, an invoice has a different layout, an error screenshot has no “fields”. The two-phase pipeline resolves that tension:
| Approach | Pros | Cons |
|---|---|---|
| Single generic prompt | Simple, 1 LLM call | Inconsistent fields, unstable JSON format |
| Pipeline classify → extract | Schema tailored per type | 2 LLM calls, latency x2 |
The latency overhead (≈ 4-6 s) is acceptable because the user is already waiting for the transcription. The extraction quality and the stable output contract justify the cost.
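The classify-then-extract flow can be sketched as below. This is a minimal illustration rather than the workflow's actual nodes: `call_vision_model` is a hypothetical callable standing in for the cli-ollama request, the prompt strings are placeholders, and the HTML `text` field is left out for brevity.

```python
from typing import Callable

DOC_TYPES = {
    "business_card", "invoice", "screenshot",
    "handwritten_note", "general_document", "not_document",
}

# Placeholder prompts (assumption); the real ones live in the workflow.
CLASSIFY_PROMPT = "classify"
EXTRACT_PROMPTS = {t: f"extract:{t}" for t in DOC_TYPES - {"not_document"}}


def classify_then_extract(
    image_b64: str,
    call_vision_model: Callable[[str, str], dict],
) -> dict:
    """Phase 1 picks a type; phase 2 runs the prompt tailored to that type."""
    doc_type = call_vision_model(CLASSIFY_PROMPT, image_b64).get("type")
    if doc_type not in EXTRACT_PROMPTS:
        # not_document (or anything unknown) skips the second LLM call.
        return {"status": "fallback", "docType": "not_document"}
    extracted = call_vision_model(EXTRACT_PROMPTS[doc_type], image_b64)
    return {"status": "success", "docType": doc_type, "extracted": extracted}
```

Passing the model call in as a parameter keeps the sketch testable without a running cli-ollama instance.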
Why Gemini Flash over GPT-4 Vision
| Criterion | Gemini Flash | GPT-4 Vision | Claude 3 Vision |
|---|---|---|---|
| Cost | Free (cli-ollama) | $0.01 / image | $0.005 / image |
| Latency | 1-3 s | 3-8 s | 2-5 s |
| Structured JSON | Variable, defensive parsing required | Stable | Stable |
| Latin OCR | Excellent | Excellent | Excellent |
| Handwritten OCR | Good | Excellent | Very good |
| Hosting | Self-hosted (cli-ollama) | OpenAI API | Anthropic API |
The Gemini Flash choice aligns with the cli-ollama multi-provider strategy: no per-image cost, acceptable latency, parseable JSON with a defensive regex fallback.
3. How? — Technical implementation
Output contract
Three possible statuses, all returned in the same format:

Success:

```json
{
  "status": "success",
  "docType": "invoice",
  "extracted": {
    "vendor": "Amazon",
    "invoice_number": "INV-2026-001",
    "total": 23.98,
    "currency": "EUR"
  },
  "text": "<b>Invoice</b>\n\nVendor: <b>Amazon</b>\n..."
}
```

Fallback (not a document):

```json
{
  "status": "fallback",
  "docType": "not_document"
}
```

Error (cli-ollama unreachable):

```json
{
  "status": "error",
  "error": "cli-ollama request failed (HTTP 500)"
}
```

The text field contains an HTML-formatted representation ready to send into Telegram. The extracted field contains structured data a caller workflow can consume to create an Odoo record, for instance.
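A caller that wants to guard against contract drift could check the key set for each status. The sketch below is one possible way to model that in Python. The helper names (`success`, `fallback`, `error`, `is_valid`) are hypothetical and not part of the workflow.

```python
from typing import Any, TypedDict


class OcrResult(TypedDict, total=False):
    status: str                # "success" | "fallback" | "error"
    docType: str               # present on success and fallback
    extracted: dict[str, Any]  # present on success only
    text: str                  # HTML for Telegram, present on success only
    error: str                 # present on error only


def success(doc_type: str, extracted: dict[str, Any], text: str) -> OcrResult:
    return {"status": "success", "docType": doc_type,
            "extracted": extracted, "text": text}


def fallback() -> OcrResult:
    return {"status": "fallback", "docType": "not_document"}


def error(message: str) -> OcrResult:
    return {"status": "error", "error": message}


def is_valid(result: dict) -> bool:
    """Check that a result carries exactly the keys its status requires."""
    required = {
        "success": {"status", "docType", "extracted", "text"},
        "fallback": {"status", "docType"},
        "error": {"status", "error"},
    }
    expected = required.get(result.get("status"))
    return expected is not None and set(result) == expected
```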
Fallback strategy
JSON parsing of Gemini output is defensive at several levels:
| Case | Behaviour |
|---|---|
| Clean JSON | Direct parse |
| JSON wrapped in markdown backticks | Strip ``` then parse |
| JSON with text preamble | Regex \{[\s\S]*\} then parse |
| Malformed JSON | Fall back to general_document with text: <raw response> |
| Unknown type returned | Force not_document |
| HTTP error from cli-ollama | status: error propagated without crashing |
This defensive layer is necessary because Gemini Flash sometimes returns extra text (“Here is the result:”) before the JSON, or wraps the JSON in backticks. Without it, the caller workflow would receive parsing errors instead of a graceful fallback.
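The parsing levels in the table can be sketched as a chain of attempts. This is an illustrative Python version, not the code of the N8N node. The function names are hypothetical, and the shape returned for malformed JSON is an assumption based on the table row.

```python
import json
import re

KNOWN_TYPES = {
    "business_card", "invoice", "screenshot",
    "handwritten_note", "general_document", "not_document",
}

FENCE = "`" * 3  # built this way to avoid a literal fence inside the example
FENCE_RE = re.compile(r"^" + FENCE + r"[A-Za-z]*\s*|\s*" + FENCE + r"$")
OBJECT_RE = re.compile(r"\{[\s\S]*\}")  # same regex as in the table


def parse_model_json(raw: str):
    """Try clean JSON, then fence-stripped, then the outermost {...} span."""
    stripped = raw.strip()
    candidates = [stripped, FENCE_RE.sub("", stripped)]
    match = OBJECT_RE.search(raw)
    if match:
        candidates.append(match.group(0))
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None  # every level failed


def normalise_doc_type(classification) -> str:
    """Force not_document when the type is missing or unknown."""
    doc_type = (classification or {}).get("type")
    if isinstance(doc_type, str) and doc_type in KNOWN_TYPES:
        return doc_type
    return "not_document"


def extract_or_raw(raw: str, doc_type: str) -> dict:
    parsed = parse_model_json(raw)
    if parsed is None:
        # Malformed JSON: degrade to general_document carrying the raw text.
        return {"docType": "general_document", "text": raw}
    return {"docType": doc_type, "extracted": parsed}
```

Trying the cheapest parse first means well-formed responses never touch the regex path.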
Per-type prompts
The initial detector classifies the image with a short prompt:

```
Analyze this image and classify it as ONE of:
- business_card
- invoice
- screenshot
- handwritten_note
- general_document
- not_document

Return ONLY JSON: {"type": "...", "confidence": 0.95}
```

Then the Switch routing sends the image to a specialised extraction prompt. Example for a business card, where the fields are aligned with Odoo’s res.partner model to allow direct creation via XML-RPC:

```
Extract contact information from this business card.
Return ONLY JSON (null if not found):
{
  "name": "Full name",
  "function": "Job title",
  "company_name": "Company",
  "email": "...",
  "phone": "...",
  "mobile": "...",
  "street": "...",
  "city": "...",
  "zip": "...",
  "country": "...",
  "website": "...",
  "comment": "LinkedIn or notes"
}
```

Telegram Orchestrator integration
The Binary Content Handler calls Vision OCR via Execute Workflow, then routes the result based on status:
| Status | Routing |
|---|---|
| success + docType=business_card | Offer to save as Odoo contact (Telegram buttons) |
| success + docType=invoice | Offer to store as accounting record |
| success + docType=screenshot/handwritten_note/general_document | Direct display + “Discuss” button (starts a conversation) |
| fallback (not_document) | Switch to AI Router for conversational handling |
| error | Error notification via Notification Hub |
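The routing table can be read as a two-level dispatch on `status` and then `docType`. The sketch below expresses it in Python for illustration only. The route labels are hypothetical names for the branches, and sending unexpected results to the error branch is an assumption.

```python
# Hypothetical labels for the branches listed in the table.
SUCCESS_ROUTES = {
    "business_card": "offer_odoo_contact",
    "invoice": "offer_accounting_record",
    "screenshot": "display_with_discuss_button",
    "handwritten_note": "display_with_discuss_button",
    "general_document": "display_with_discuss_button",
}


def route(result: dict) -> str:
    """Map a Vision OCR result to the branch the caller should take."""
    status = result.get("status")
    if status == "fallback":
        return "ai_router"
    if status == "success":
        branch = SUCCESS_ROUTES.get(result.get("docType"))
        if branch is not None:
            return branch
    # "error", or anything outside the contract (assumption), is reported.
    return "notification_hub"
```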
Performance
| Step | Typical latency |
|---|---|
| base64 encoding + transfer | 100-500 ms (depending on photo size) |
| Classification detection | 1-2 s |
| Schema-driven extraction | 2-4 s |
| HTML format + return | < 100 ms |
| End-to-end total | 3-7 s |
Telegram compression already brings photos down to ~1 MB max, which keeps latency in an acceptable range even for high-resolution business cards.
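The end-to-end figure is simply the sum of the per-step ranges. The short check below redoes that arithmetic in milliseconds. Treating "< 100 ms" as a 0-100 ms range is an assumption made for the calculation.

```python
# (min_ms, max_ms) per step, taken from the latency table;
# "< 100 ms" is modelled as 0-100 ms (assumption).
STEPS_MS = {
    "encode_and_transfer": (100, 500),
    "classification": (1000, 2000),
    "extraction": (2000, 4000),
    "format_and_return": (0, 100),
}


def total_range_ms(steps: dict) -> tuple:
    """Sum the lower and upper bounds of every step."""
    low = sum(lo for lo, _ in steps.values())
    high = sum(hi for _, hi in steps.values())
    return low, high


print(total_range_ms(STEPS_MS))  # (3100, 6600), i.e. roughly 3-7 s
```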
4. What if? — Outlook and limits
Current limits
| Limit | Impact | Mitigation |
|---|---|---|
| Latency x2 | Two successive LLM calls | Acceptable, the user is already waiting |
| No multi-doc classification | One photo with a card + a receipt = a single type | Ask for two separate photos |
| No Odoo validation | A malformed email would create an invalid partner | Validation in the caller workflow |
| No memory across photos | Each photo is handled in isolation | The conversational system keeps context once the photo is extracted |
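Since Vision OCR does not validate fields, the caller has to do it before creating a partner. The sketch below shows one possible pre-check for business card data. The function name is hypothetical, and the email regex is deliberately loose, not a full RFC 5322 validator.

```python
import re

# Loose shape check: something@something.tld, no spaces or extra "@".
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def clean_business_card(extracted: dict) -> tuple[dict, list[str]]:
    """Return (cleaned fields, problems) for a business_card extraction."""
    # Drop nulls and blank strings, trim whitespace on the rest.
    cleaned = {
        key: value.strip()
        for key, value in extracted.items()
        if isinstance(value, str) and value.strip()
    }
    problems = []
    email = cleaned.get("email")
    if email is not None and not EMAIL_RE.match(email):
        problems.append(f"invalid email: {email}")
        del cleaned["email"]  # do not send a malformed email to Odoo
    return cleaned, problems
```

Returning the problems alongside the cleaned fields lets the caller decide whether to ask the user for a correction or to create the contact without the email.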
Evolution scenarios
If extraction quality degrades:
- Move to `gemini-pro-yolo` for complex types (multi-line invoices)
- Add a validation/correction step via a second prompt
- Compare with an alternative model (Claude Vision via Anthropic API if quota allows)
If new document types emerge:
- Add a new case to the Switch + a dedicated extraction prompt
- Keep the `{status, docType, extracted, text}` contract unchanged so caller workflows don’t break
- Candidate examples: `id_card`, `passport`, `recipe`, `prescription`
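Adding a type then amounts to one new registry entry, while the result keeps the same four keys. The sketch below illustrates the idea in Python. The registry, the placeholder prompts and the helper names are all hypothetical, since the real routing is an N8N Switch node.

```python
CONTRACT_KEYS = ("status", "docType", "extracted", "text")

# Placeholder prompts (assumption); one entry per Switch case.
EXTRACTION_PROMPTS = {
    "business_card": "<business card prompt>",
    "invoice": "<invoice prompt>",
    "screenshot": "<screenshot prompt>",
    "handwritten_note": "<handwritten note prompt>",
    "general_document": "<general document prompt>",
}


def register_doc_type(registry: dict, doc_type: str, prompt: str) -> None:
    """Add a new case; refuse to silently overwrite an existing one."""
    if doc_type in registry:
        raise ValueError(f"{doc_type} is already registered")
    registry[doc_type] = prompt


def build_result(registry: dict, doc_type: str, extracted: dict, text: str) -> dict:
    """Build a success result with the unchanged four-key contract."""
    if doc_type not in registry:
        raise KeyError(doc_type)
    return dict(zip(CONTRACT_KEYS, ("success", doc_type, extracted, text)))
```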
If volumes grow significantly:
- Share the classification cache (same images within 1-2 days)
- Move to a quantised local Vision model (LLaVA, MiniCPM) hosted on the VPS for zero network latency
- Batch several images in a single API call
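The classification cache mentioned above could be keyed on a hash of the image bytes with a time-to-live. The sketch below is an in-memory Python illustration under those assumptions. The class name is hypothetical, and exact-byte hashing only matches identical files, so a re-compressed copy of the same photo would miss.

```python
import hashlib
import time


class ClassificationCache:
    """In-memory cache of docType per image, expiring after ttl_seconds."""

    def __init__(self, ttl_seconds: float = 2 * 24 * 3600, clock=time.time):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._entries = {}   # sha256 hex digest -> (stored_at, doc_type)

    @staticmethod
    def key(image_bytes: bytes) -> str:
        return hashlib.sha256(image_bytes).hexdigest()

    def get(self, image_bytes: bytes):
        k = self.key(image_bytes)
        entry = self._entries.get(k)
        if entry is None:
            return None
        stored_at, doc_type = entry
        if self._clock() - stored_at > self._ttl:
            del self._entries[k]  # expired: force a fresh classification
            return None
        return doc_type

    def put(self, image_bytes: bytes, doc_type: str) -> None:
        self._entries[self.key(image_bytes)] = (self._clock(), doc_type)
```

A cache hit skips only the classification call. The extraction call would still run unless its result is cached as well.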
Related pages
Infrastructure
- AI Stack: cli-ollama and the Gemini/Claude routing
- N8N Queue Mode: Backend running the sub-workflow
Workflows
- Telegram Orchestrator: Calling Binary Content Handler
- Conversational system: Logical follow-up after a screenshot extraction
- Voice Transcription: Analogous pipeline for audio
Reference
- Glossary: OCR, Vision, Sub-workflow, cli-ollama