---
title: Vision OCR
url: https://blog.guigpap.com/en/workflows/vision-ocr/
url_md: https://blog.guigpap.com/en/workflows/vision-ocr.md
category: automation
date: '2026-05-04'
maturite: production
techno:
- n8n
- claude
application:
- ai
- content
---
# Vision OCR
> Image classification and structured extraction sub-workflow via Gemini Vision and cli-ollama
## 1. What? — Definition and context
**Vision OCR** is an N8N sub-workflow called from the [Telegram Orchestrator](/en/workflows/telegram-orchestrator/) every time a user sends a photo. It classifies the image into one of five recognised document categories (or `not_document` when the photo is not a document), then applies a specialised extraction schema to return structured data ready to be consumed by business workflows (Odoo, notes, invoices).
> **Note - Pure-extraction sub-workflow**
>
> Vision OCR does not talk to Telegram, does not write to Odoo, and does not notify anyone. It receives an image as input and returns a `{status, docType, extracted, text}` object as output. It is a plumbing component: the decision about what to do with the extraction belongs to the caller workflow.
### Metadata
| Field | Value |
|-------|-------|
| **Workflow ID** | `2ZDgU3TWbF4OeOKY` |
| **Type** | Execute Workflow Trigger (passthrough) |
| **Nodes** | 14 |
| **Called by** | Binary Content Handler (`QtkJDN8XAGlpcSPV`) |
| **Model** | `gemini-flash-yolo` via cli-ollama |
| **Issue** | #176 |
### Architecture diagram
```mermaid
flowchart TD
subgraph Input["Input · passthrough Binary Content Handler"]
direction TB
Photo["Telegram photo · base64 + mimeType"]
end
subgraph Detect["Phase 1 · Classification"]
direction TB
HTTP1["HTTP cli-ollama gemini-flash-yolo"]
Parse["Parse Detection · regex fallback"]
SwitchType["Switch Doc Type"]
end
subgraph Extract["Phase 2 · Schema-driven extraction"]
direction TB
BC["Extract Business Card"]
INV["Extract Invoice"]
SC["Extract Screenshot"]
HW["Extract Handwritten"]
GEN["Extract General Document"]
end
subgraph Output["Output · uniform contract"]
direction TB
FormatResult["Parse and Format · escape HTML"]
Success["status: success"]
Fallback["status: fallback · not_document"]
ErrOut["status: error"]
end
Input --> HTTP1 --> Parse
Parse -->|HTTP error| ErrOut
Parse -->|not_document| Fallback
Parse -->|recognized type| SwitchType
SwitchType --> BC
SwitchType --> INV
SwitchType --> SC
SwitchType --> HW
SwitchType --> GEN
BC --> FormatResult
INV --> FormatResult
SC --> FormatResult
HW --> FormatResult
GEN --> FormatResult
FormatResult --> Success
```
### The 6 document types
| Type | Extracted fields | Typical use |
|------|------------------|-------------|
| `business_card` | name, function, company_name, email, phone, mobile, street, city, zip, country, website, comment | Odoo contact creation |
| `invoice` | vendor, invoice_number, date, items[], subtotal, tax, total, currency | Odoo accounting record |
| `screenshot` | visible_text, ui_elements[], error_messages[], context | Conversational chat, tickets |
| `handwritten_note` | transcribed_text, confidence, language | Note in Obsidian vault |
| `general_document` | full_text, document_title, key_sections[] | Searchable text |
| `not_document` | (none) | Scene photo, AI Router fallback |
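
The table above can be read as a small schema registry. The sketch below is illustrative only: it is written in Python rather than as the workflow's actual node code, and `DOC_TYPE_FIELDS` and `normalize_extracted` are hypothetical names. It assumes the caller wants every expected field present, with `None` for anything the model did not return, and unexpected keys dropped.

```python
from typing import Any

# Field names copied from the table above; the registry itself is an assumption.
DOC_TYPE_FIELDS: dict[str, tuple[str, ...]] = {
    "business_card": (
        "name", "function", "company_name", "email", "phone", "mobile",
        "street", "city", "zip", "country", "website", "comment",
    ),
    "invoice": (
        "vendor", "invoice_number", "date", "items",
        "subtotal", "tax", "total", "currency",
    ),
    "screenshot": ("visible_text", "ui_elements", "error_messages", "context"),
    "handwritten_note": ("transcribed_text", "confidence", "language"),
    "general_document": ("full_text", "document_title", "key_sections"),
    "not_document": (),
}


def normalize_extracted(doc_type: str, raw: dict[str, Any]) -> dict[str, Any]:
    """Keep only the fields expected for doc_type; missing ones become None."""
    return {field: raw.get(field) for field in DOC_TYPE_FIELDS[doc_type]}
```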
---
## 2. Why? — Stakes and motivations
### The problem without Vision OCR
Before this sub-workflow, every photo sent to the Telegram bot was treated as an empty message or required manual data entry. Photographing a business card and then typing every field into Odoo took several minutes per contact.
| Problem without extraction | Consequence |
|----------------------------|-------------|
| **Manual entry** | Several minutes per card, high error rate |
| **No structure** | Impossible to filter/search later |
| **No classification** | Every photo handled identically |
| **No fallback** | A landscape photo produced an error |
### Why a single prompt isn't enough
Asking a vision model *"extract the useful information"* on any image yields uneven results. A business card has predictable fields, an invoice has a different layout, an error screenshot has no "fields". The two-phase pipeline resolves that tension:
| Approach | Pros | Cons |
|----------|------|------|
| **Single generic prompt** | Simple, 1 LLM call | Inconsistent fields, unstable JSON format |
| **Pipeline classify → extract** | Schema tailored per type | 2 LLM calls, latency x2 |
The latency overhead (≈ 4-6 s) is acceptable because the user is already waiting for the transcription. The extraction quality and the stable output contract justify the cost.
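
The classify-then-extract idea can be sketched independently of any model. In the Python sketch below, `run_pipeline`, `classify` and `extractors` are hypothetical names; the two callables stand in for the two LLM calls and are injected so the control flow can be exercised without a network. The `text` field of the real contract is left out for brevity.

```python
from typing import Any, Callable

Extractor = Callable[[str], dict[str, Any]]


def run_pipeline(
    image_b64: str,
    classify: Callable[[str], str],
    extractors: dict[str, Extractor],
) -> dict[str, Any]:
    """Phase 1 picks a type; phase 2 runs the extractor registered for it."""
    try:
        doc_type = classify(image_b64)      # LLM call 1: classification
        extractor = extractors.get(doc_type)
        if extractor is None:
            # No schema for this type: treated here as "not a document".
            return {"status": "fallback", "docType": "not_document"}
        extracted = extractor(image_b64)    # LLM call 2: schema-driven extraction
    except Exception as exc:
        # Any failure of either call becomes an error object, not a crash.
        return {"status": "error", "error": str(exc)}
    return {"status": "success", "docType": doc_type, "extracted": extracted}
```

Each extractor only has to know one schema, which is what keeps the per-type prompts short and the output shape predictable.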
### Why Gemini Flash over GPT-4 Vision
| Criterion | Gemini Flash | GPT-4 Vision | Claude 3 Vision |
|-----------|--------------|--------------|-----------------|
| **Cost** | Free (cli-ollama) | $0.01 / image | $0.005 / image |
| **Latency** | 1-3 s | 3-8 s | 2-5 s |
| **Structured JSON** | Variable, defensive parsing required | Stable | Stable |
| **Latin OCR** | Excellent | Excellent | Excellent |
| **Handwritten OCR** | Good | Excellent | Very good |
| **Hosting** | Self-hosted (cli-ollama) | OpenAI API | Anthropic API |
The Gemini Flash choice aligns with the cli-ollama multi-provider strategy: no per-image cost, acceptable latency, parseable JSON with a defensive regex fallback.
---
## 3. How? — Technical implementation
### Output contract
Three possible statuses, all returned in the same format:
**Success:**
```json
{
"status": "success",
"docType": "invoice",
"extracted": {
"vendor": "Amazon",
"invoice_number": "INV-2026-001",
"total": 23.98,
"currency": "EUR"
},
"text": "Invoice\n\nVendor: Amazon\n..."
}
```
**Fallback (not a document):**
```json
{
"status": "fallback",
"docType": "not_document"
}
```
**Error (cli-ollama request failed):**
```json
{
"status": "error",
"error": "cli-ollama request failed (HTTP 500)"
}
```
The `text` field contains an HTML-formatted representation ready to send to Telegram. The `extracted` field contains structured data that a caller workflow can consume, for instance to create an Odoo record.
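
As an illustration of how the `text` field could be derived from `extracted`, here is a minimal Python sketch. The layout (bold title, one `Label: value` line per field) and the name `format_text` are assumptions, not the workflow's actual formatter. The part that matters is escaping every value before it is sent with Telegram's HTML parse mode, since OCR output can contain `<`, `>` or `&`.

```python
import html
from typing import Any


def format_text(doc_type: str, extracted: dict[str, Any]) -> str:
    """Render extracted fields as Telegram-safe HTML, skipping null values."""
    title = html.escape(doc_type.replace("_", " ").title())
    lines = [f"<b>{title}</b>", ""]
    for key, value in extracted.items():
        if value is None:
            continue  # fields the model could not find are omitted
        label = html.escape(key.replace("_", " ").capitalize())
        lines.append(f"{label}: {html.escape(str(value))}")
    return "\n".join(lines)
```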
### Fallback strategy
JSON parsing of Gemini output is defensive at several levels:
| Case | Behaviour |
|------|-----------|
| **Clean JSON** | Direct parse |
| **JSON wrapped in markdown backticks** | Strip ` ``` ` then parse |
| **JSON with text preamble** | Regex `\{[\s\S]*\}` then parse |
| **Malformed JSON** | Fall back to `general_document` with `text: ` |
| **Unknown type returned** | Force `not_document` |
| **HTTP error from cli-ollama** | `status: error` propagated without crashing |
This defensive layer is necessary because Gemini Flash sometimes returns extra text ("Here is the result:") before the JSON, or wraps the JSON in backticks. Without it, the caller workflow would receive raw parsing errors instead of a graceful fallback.
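
The first three rows of the table, plus the unknown-type rule, can be sketched as follows. This is a hedged Python illustration, not the code of the workflow's parsing node: `extract_json`, `normalize_type` and `KNOWN_TYPES` are hypothetical names. `extract_json` returns `None` when nothing parses, which leaves the malformed-JSON fallback to the caller.

```python
import json
import re
from typing import Any, Optional

KNOWN_TYPES = {
    "business_card", "invoice", "screenshot",
    "handwritten_note", "general_document", "not_document",
}


def extract_json(raw: str) -> Optional[dict[str, Any]]:
    """Return the JSON object found in raw, or None if nothing parses."""
    stripped = raw.strip()
    candidates = [stripped]
    # Remove a leading ```lang fence and a trailing ``` fence, if present.
    candidates.append(re.sub(r"^```[a-zA-Z]*\s*|\s*```$", "", stripped))
    # Greedy match from the first "{" to the last "}" skips any text preamble.
    match = re.search(r"\{[\s\S]*\}", raw)
    if match:
        candidates.append(match.group(0))
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None


def normalize_type(parsed: dict[str, Any]) -> str:
    """Force any missing or unknown type to not_document."""
    doc_type = parsed.get("type")
    if isinstance(doc_type, str) and doc_type in KNOWN_TYPES:
        return doc_type
    return "not_document"
```

Trying the cheapest interpretation first keeps the common case (clean JSON) free of regex work, and the greedy pattern is the same `\{[\s\S]*\}` expression listed in the table.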
### Per-type prompts
The initial detector classifies the image with a short prompt:
```
Analyze this image and classify it as ONE of:
- business_card
- invoice
- screenshot
- handwritten_note
- general_document
- not_document
Return ONLY JSON: {"type": "...", "confidence": 0.95}
```
The Switch node then routes the image to a specialised extraction prompt. Example for a business card, whose fields are aligned with Odoo's `res.partner` model to allow direct creation via XML-RPC:
```
Extract contact information from this business card.
Return ONLY JSON (null if not found):
{
"name": "Full name",
"function": "Job title",
"company_name": "Company",
"email": "...",
"phone": "...",
"mobile": "...",
"street": "...",
"city": "...",
"zip": "...",
"country": "...",
"website": "...",
"comment": "LinkedIn or notes"
}
```
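
On the caller side, the extracted card still has to be turned into a value dictionary before a partner can be created. The Python sketch below is an assumption about how that could look, with hypothetical names `DIRECT_FIELDS` and `to_partner_values`. It drops empty fields and sets `country` aside, because a standard `res.partner` stores the country as a `country_id` relation that has to be looked up rather than written as free text. Field availability may also vary between Odoo versions.

```python
from typing import Any, Optional

# Prompt keys assumed to map one-to-one onto res.partner fields.
DIRECT_FIELDS = (
    "name", "function", "company_name", "email", "phone", "mobile",
    "street", "city", "zip", "website", "comment",
)


def to_partner_values(
    extracted: dict[str, Any],
) -> tuple[dict[str, Any], Optional[str]]:
    """Split the extraction into writable values and a country name to resolve."""
    values = {
        key: extracted[key]
        for key in DIRECT_FIELDS
        if extracted.get(key) not in (None, "")
    }
    # The country name is returned separately so the caller can resolve it.
    return values, extracted.get("country") or None
```

The returned `values` would typically be passed as the argument of a `create` call on `res.partner` through Odoo's external API, once the country has been resolved by the caller.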
> **Tip - cli-ollama -yolo suffix**
>
> The model is `gemini-flash-yolo` (not `gemini-flash`). The `-yolo` suffix means "auto-approve dangerous", which disables cli-ollama's interactive plan mode and avoids a timeout on the N8N webhook (120 s limit). See [AI Stack](/en/infrastructure/ai-stack/) for details.
### Telegram Orchestrator integration
The Binary Content Handler calls Vision OCR via Execute Workflow then routes the result based on `status`:
| Status | Routing |
|--------|---------|
| `success` + `docType=business_card` | Offer to save as Odoo contact (Telegram buttons) |
| `success` + `docType=invoice` | Offer to store as accounting record |
| `success` + `docType` in `screenshot`, `handwritten_note`, `general_document` | Direct display + "Discuss" button (starts a conversation) |
| `fallback` (`not_document`) | Switch to AI Router for conversational handling |
| `error` | Error notification via [Notification Hub](/en/workflows/notification-hub/) |
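
The routing table can be expressed as a single decision function. This Python sketch is illustrative: the action names it returns (`offer_odoo_contact`, `ai_router`, and so on) are hypothetical labels for the branches above, and treating an unexpected result shape as an error is an assumption.

```python
from typing import Any

DISPLAY_TYPES = {"screenshot", "handwritten_note", "general_document"}


def route(result: dict[str, Any]) -> str:
    """Map a Vision OCR result to the branch the caller should take."""
    status = result.get("status")
    if status == "error":
        return "notify_error"
    if status == "fallback":
        return "ai_router"
    doc_type = result.get("docType")
    if status == "success" and doc_type == "business_card":
        return "offer_odoo_contact"
    if status == "success" and doc_type == "invoice":
        return "offer_accounting_record"
    if status == "success" and doc_type in DISPLAY_TYPES:
        return "display_with_discuss"
    # Assumption: anything outside the contract is reported as an error.
    return "notify_error"
```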
### Performance
| Step | Typical latency |
|------|-----------------|
| base64 encoding + transfer | 100-500 ms (depending on photo size) |
| Classification detection | 1-2 s |
| Schema-driven extraction | 2-4 s |
| HTML format + return | < 100 ms |
| **End-to-end total** | **3-7 s** |
Telegram compression already brings photos down to ~1 MB max, which keeps latency in an acceptable range even for high-resolution business cards.
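
As a quick sanity check, the end-to-end figure follows from summing the per-step ranges. The sketch below uses the table's values, models "< 100 ms" as 0 to 0.1 s (an assumption), and uses hypothetical names. The bounds come out at roughly 3.1 s and 6.6 s, consistent with the 3-7 s total.

```python
# (min, max) in seconds, taken from the table above.
STEP_RANGES_S = {
    "base64 encoding + transfer": (0.1, 0.5),
    "classification": (1.0, 2.0),
    "extraction": (2.0, 4.0),
    "format + return": (0.0, 0.1),  # "< 100 ms" modelled as 0 to 0.1 s
}


def total_range(steps: dict[str, tuple[float, float]]) -> tuple[float, float]:
    """Sum the lower and upper bounds of sequential steps."""
    low = sum(lo for lo, _ in steps.values())
    high = sum(hi for _, hi in steps.values())
    return low, high
```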
---
## 4. What if? — Outlook and limits
### Current limits
| Limit | Impact | Mitigation |
|-------|--------|------------|
| **Latency x2** | Two successive LLM calls | Acceptable, the user is already waiting |
| **No multi-doc classification** | One photo with a card + a receipt = a single type | Ask for two separate photos |
| **No Odoo validation** | A malformed email would create an invalid partner | Validation in the caller workflow |
| **No memory across photos** | Each photo is handled in isolation | The conversational system keeps context once the photo is extracted |
### Evolution scenarios
**If extraction quality degrades**:
- Move to `gemini-pro-yolo` for complex types (multi-line invoices)
- Add a validation/correction step via a second prompt
- Compare with an alternative model (Claude Vision via Anthropic API if quota allows)
**If new document types emerge**:
- Add a new case to the Switch + a dedicated extraction prompt
- Keep the `{status, docType, extracted, text}` contract unchanged so caller workflows don't break
- Candidate examples: `id_card`, `passport`, `recipe`, `prescription`
**If volumes grow significantly**:
- Cache and share classification results, so the same image resent within 1-2 days is not classified again
- Move to a quantised local Vision model (LLaVA, MiniCPM) hosted on the VPS to remove the network round trip
- Batch several images in a single API call
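
A classification cache of the kind mentioned above could be keyed on a hash of the image bytes with a time-to-live. The Python sketch below is a possible shape, not an existing component: `ClassificationCache` is a hypothetical in-memory class, the two-day default is taken from the upper bound of the 1-2 day window, and exact hashing only matches byte-identical resends (a recompressed copy of the same photo would miss).

```python
import hashlib
import time
from typing import Callable, Optional


class ClassificationCache:
    """In-memory cache of docType per image hash, with expiry."""

    def __init__(
        self,
        ttl_seconds: float = 2 * 24 * 3600,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._entries: dict[str, tuple[float, str]] = {}

    @staticmethod
    def key(image_bytes: bytes) -> str:
        return hashlib.sha256(image_bytes).hexdigest()

    def put(self, image_bytes: bytes, doc_type: str) -> None:
        self._entries[self.key(image_bytes)] = (self._clock(), doc_type)

    def get(self, image_bytes: bytes) -> Optional[str]:
        cache_key = self.key(image_bytes)
        entry = self._entries.get(cache_key)
        if entry is None:
            return None
        stored_at, doc_type = entry
        if self._clock() - stored_at > self._ttl:
            del self._entries[cache_key]  # expired: drop and report a miss
            return None
        return doc_type
```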
---
## Related pages
### Infrastructure
- [AI Stack](/en/infrastructure/ai-stack/) — cli-ollama and the Gemini/Claude routing
- [N8N Queue Mode](/en/infrastructure/n8n-queue-mode/) — Backend running the sub-workflow
### Workflows
- [Telegram Orchestrator](/en/workflows/telegram-orchestrator/) — Calling Binary Content Handler
- [Conversational system](/en/workflows/systeme-conversationnel/) — Logical follow-up after a screenshot extraction
- [Voice Transcription](/en/workflows/voice-transcription/) — Analogous pipeline for audio
### Reference
- [Glossary](/en/reference/glossary/) — OCR, Vision, Sub-workflow, cli-ollama
## Agent metadata
- This article comes from the GuiGPaP Lab blog.
- Global blog context: https://blog.guigpap.com/llms.txt
- Author contact: https://odoo.guigpap.com/mon-cv
- License: CC-BY-SA 4.0