---
title: Vision OCR
url: https://blog.guigpap.com/en/workflows/vision-ocr/
url_md: https://blog.guigpap.com/en/workflows/vision-ocr.md
category: automation
date: '2026-05-04'
maturite: production
techno:
  - n8n
  - claude
application:
  - ai
  - content
---

# Vision OCR

> Image classification and structured extraction sub-workflow via Gemini Vision and cli-ollama

## 1. What? — Definition and context

**Vision OCR** is an N8N sub-workflow called from the [Telegram Orchestrator](/en/workflows/telegram-orchestrator/) every time a user sends a photo. It classifies the image into one of five recognised document categories, or as `not_document` when the photo is not a document, then applies a specialised extraction schema to return structured data ready to be consumed by business workflows (Odoo, notes, invoices).

> **Note - Pure-extraction sub-workflow**
>
> Vision OCR does not talk to Telegram, does not write to Odoo, does not notify anyone. It receives an image as input, returns a `{status, docType, extracted, text}` object as output. It is a plumbing component — the decision about what to do with the extraction belongs to the caller workflow.

### Metadata

| Field | Value |
|-------|-------|
| **Workflow ID** | `2ZDgU3TWbF4OeOKY` |
| **Type** | Execute Workflow Trigger (passthrough) |
| **Nodes** | 14 |
| **Called by** | Binary Content Handler (`QtkJDN8XAGlpcSPV`) |
| **Model** | `gemini-flash-yolo` via cli-ollama |
| **Issue** | #176 |

### Architecture diagram

```mermaid
flowchart TD
  subgraph Input["Input · passthrough Binary Content Handler"]
    direction TB
    Photo["Telegram photo · base64 + mimeType"]
  end

  subgraph Detect["Phase 1 · Classification"]
    direction TB
    HTTP1["HTTP cli-ollama gemini-flash-yolo"]
    Parse["Parse Detection · regex fallback"]
    SwitchType["Switch Doc Type"]
  end

  subgraph Extract["Phase 2 · Schema-driven extraction"]
    direction TB
    BC["Extract Business Card"]
    INV["Extract Invoice"]
    SC["Extract Screenshot"]
    HW["Extract Handwritten"]
    GEN["Extract General Document"]
  end

  subgraph Output["Output · uniform contract"]
    direction TB
    FormatResult["Parse and Format · escape HTML"]
    Success["status: success"]
    Fallback["status: fallback · not_document"]
    ErrOut["status: error"]
  end

  Input --> HTTP1 --> Parse
  Parse -->|HTTP error| ErrOut
  Parse -->|not_document| Fallback
  Parse -->|recognized type| SwitchType
  SwitchType --> BC
  SwitchType --> INV
  SwitchType --> SC
  SwitchType --> HW
  SwitchType --> GEN
  BC --> FormatResult
  INV --> FormatResult
  SC --> FormatResult
  HW --> FormatResult
  GEN --> FormatResult
  FormatResult --> Success
```

### The 6 document types

| Type | Extracted fields | Typical use |
|------|------------------|-------------|
| `business_card` | name, function, company_name, email, phone, mobile, street, city, zip, country, website, comment | Odoo contact creation |
| `invoice` | vendor, invoice_number, date, items[], subtotal, tax, total, currency | Odoo accounting record |
| `screenshot` | visible_text, ui_elements[], error_messages[], context | Conversational chat, tickets |
| `handwritten_note` | transcribed_text, confidence, language | Note in Obsidian vault |
| `general_document` | full_text, document_title, key_sections[] | Searchable text |
| `not_document` | (none) | Scene photo, AI Router fallback |

---

## 2. Why? — Stakes and motivations

### The problem without Vision OCR

Before this sub-workflow, every photo sent to the Telegram bot was treated as an empty message or required manual data entry. Photographing a business card and then typing every field into Odoo took several minutes per contact.

| Problem without extraction | Consequence |
|----------------------------|-------------|
| **Manual entry** | Several minutes per card, high error rate |
| **No structure** | Impossible to filter/search later |
| **No classification** | Every photo handled identically |
| **No fallback** | A landscape photo produced an error |

### Why a single prompt isn't enough

Asking a vision model *"extract the useful information"* on any image yields uneven results. A business card has predictable fields, an invoice has a different layout, an error screenshot has no "fields". The two-phase pipeline resolves that tension:

| Approach | Pros | Cons |
|----------|------|------|
| **Single generic prompt** | Simple, 1 LLM call | Inconsistent fields, unstable JSON format |
| **Pipeline classify → extract** | Schema tailored per type | 2 LLM calls, latency x2 |

The resulting latency (≈ 3-7 s end to end, see the Performance table) is acceptable because the user is already waiting for the extraction result. The extraction quality and the stable output contract justify the cost.

### Why Gemini Flash over GPT-4 Vision

| Criterion | Gemini Flash | GPT-4 Vision | Claude 3 Vision |
|-----------|--------------|--------------|-----------------|
| **Cost** | Free (cli-ollama) | $0.01 / image | $0.005 / image |
| **Latency** | 1-3 s | 3-8 s | 2-5 s |
| **Structured JSON** | Variable, defensive parsing required | Stable | Stable |
| **Latin OCR** | Excellent | Excellent | Excellent |
| **Handwritten OCR** | Good | Excellent | Very good |
| **Hosting** | Self-hosted (cli-ollama) | OpenAI API | Anthropic API |

The Gemini Flash choice aligns with the cli-ollama multi-provider strategy: no per-image cost, acceptable latency, parseable JSON with a defensive regex fallback.

---

## 3. How? — Technical implementation

### Output contract

Three possible statuses, each returned as a JSON object whose `status` field tells the caller which other fields are present:

**Success:**
```json
{
  "status": "success",
  "docType": "invoice",
  "extracted": {
    "vendor": "Amazon",
    "invoice_number": "INV-2026-001",
    "total": 23.98,
    "currency": "EUR"
  },
  "text": "<b>Invoice</b>\n\nVendor: <b>Amazon</b>\n..."
}
```

**Fallback (not a document):**
```json
{
  "status": "fallback",
  "docType": "not_document"
}
```

**Error (cli-ollama request failed):**
```json
{
  "status": "error",
  "error": "cli-ollama request failed (HTTP 500)"
}
```

The `text` field contains an HTML-formatted representation ready to send into Telegram. The `extracted` field contains structured data a caller workflow can consume to create an Odoo record, for instance.
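
The contract above can be sketched as three small builder functions. This is a minimal illustration, not the workflow's actual code: the function names and the `Label: <b>value</b>` layout of `text` are assumptions modelled on the success example, and `html.escape` stands in for the "escape HTML" step shown in the architecture diagram.

```python
import html
from typing import Any


def build_success(doc_type: str, extracted: dict[str, Any], title: str) -> dict[str, Any]:
    """Build a success result; every value is HTML-escaped before it enters `text`."""
    lines = [f"<b>{html.escape(title)}</b>", ""]
    for key, value in extracted.items():
        if value is None:  # fields the model could not find are skipped
            continue
        label = key.replace("_", " ").capitalize()  # hypothetical label format
        lines.append(f"{label}: <b>{html.escape(str(value))}</b>")
    return {
        "status": "success",
        "docType": doc_type,
        "extracted": extracted,
        "text": "\n".join(lines),
    }


def build_fallback() -> dict[str, Any]:
    """Result for an image that is not a document."""
    return {"status": "fallback", "docType": "not_document"}


def build_error(message: str) -> dict[str, Any]:
    """Result for a failed cli-ollama request."""
    return {"status": "error", "error": message}
```

Escaping matters because Telegram's HTML parse mode rejects messages with unbalanced or unknown tags, so a vendor name such as `A&B <Ltd>` has to be neutralised before it is sent.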

### Fallback strategy

JSON parsing of Gemini output is defensive at several levels:

| Case | Behaviour |
|------|-----------|
| **Clean JSON** | Direct parse |
| **JSON wrapped in markdown backticks** | Strip ` ``` ` then parse |
| **JSON with text preamble** | Regex `\{[\s\S]*\}` then parse |
| **Malformed JSON** | Fall back to `general_document` with `text: <raw response>` |
| **Unknown type returned** | Force `not_document` |
| **HTTP error from cli-ollama** | `status: error` propagated without crashing |

This defensive layer is necessary because Gemini Flash sometimes returns extra text ("Here is the result:") before the JSON, or wraps the JSON in backticks. Without it, the caller workflow would receive parsing errors instead of a graceful fallback.
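
The parsing cases in the table can be sketched as follows. This is an illustrative Python version under stated assumptions: the real logic lives in an N8N node, and the names `extract_json`, `parse_detection` and `KNOWN_TYPES` are hypothetical. Only the regex `\{[\s\S]*\}` and the fallback behaviours come from the table.

```python
import json
import re
from typing import Any, Optional

KNOWN_TYPES = {
    "business_card", "invoice", "screenshot",
    "handwritten_note", "general_document", "not_document",
}


def extract_json(raw: str) -> Optional[dict[str, Any]]:
    """Try progressively looser strategies; return None if nothing parses."""
    stripped = raw.strip()
    candidates = [stripped]
    # Remove a leading and trailing markdown fence (with optional language tag).
    candidates.append(re.sub(r"^```[a-zA-Z]*\s*|\s*```$", "", stripped))
    # Keep only the outermost {...} span, dropping any preamble or trailing text.
    match = re.search(r"\{[\s\S]*\}", raw)
    if match:
        candidates.append(match.group(0))
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None


def parse_detection(raw: str) -> dict[str, Any]:
    """Map the model's raw reply to a detection result, never raising."""
    parsed = extract_json(raw)
    if parsed is None:
        # Malformed JSON: keep the raw response as plain text.
        return {"type": "general_document", "text": raw}
    if parsed.get("type") not in KNOWN_TYPES:
        # Unknown type: force the non-document path.
        return {"type": "not_document"}
    return parsed
```

For instance, `parse_detection('Here is the result: {"type": "invoice"}')` goes through the regex step and yields a dict whose `type` is `invoice`.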

### Per-type prompts

The initial detector classifies the image with a short prompt:

```
Analyze this image and classify it as ONE of:
- business_card
- invoice
- screenshot
- handwritten_note
- general_document
- not_document
Return ONLY JSON: {"type": "...", "confidence": 0.95}
```

Then the Switch routing sends the image to a specialised extraction prompt. Example for a business card — the fields are aligned with Odoo's `res.partner` model to allow direct creation via XML-RPC:

```
Extract contact information from this business card.
Return ONLY JSON (null if not found):
{
  "name": "Full name",
  "function": "Job title",
  "company_name": "Company",
  "email": "...",
  "phone": "...",
  "mobile": "...",
  "street": "...",
  "city": "...",
  "zip": "...",
  "country": "...",
  "website": "...",
  "comment": "LinkedIn or notes"
}
```
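
The Switch step amounts to a lookup from document type to extraction prompt. The sketch below is hypothetical: it assumes cli-ollama accepts an Ollama-style request body (`model`, `prompt`, `images`, `stream`), which this article does not confirm. The prompt strings are shortened placeholders, not the real prompts, and the function only builds the payload without sending it.

```python
import base64
from typing import Any

# Placeholder prompts; only the business card prompt is documented in this article.
EXTRACTION_PROMPTS = {
    "business_card": "Extract contact information from this business card. Return ONLY JSON ...",
    "invoice": "<invoice extraction prompt>",
    "screenshot": "<screenshot extraction prompt>",
    "handwritten_note": "<handwritten note extraction prompt>",
    "general_document": "<general document extraction prompt>",
}


def build_extraction_request(doc_type: str, image_bytes: bytes) -> dict[str, Any]:
    """Pick the prompt for a detected type and build an Ollama-style payload."""
    if doc_type not in EXTRACTION_PROMPTS:
        # not_document and unknown types never reach the extraction phase.
        raise ValueError(f"no extraction prompt for {doc_type!r}")
    return {
        "model": "gemini-flash-yolo",
        "prompt": EXTRACTION_PROMPTS[doc_type],
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # assumed: one complete response rather than chunks
    }
```

Keeping the prompts in one mapping means a new document type only needs a new entry plus a matching Switch case.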

> **Tip - cli-ollama -yolo suffix**
>
> The model is `gemini-flash-yolo` (not `gemini-flash`). The `-yolo` suffix means "auto-approve dangerous": it disables the interactive plan mode of cli-ollama and so avoids a timeout on the N8N webhook (120 s limit). See [AI Stack](/en/infrastructure/ai-stack/) for details.

### Telegram Orchestrator integration

The Binary Content Handler calls Vision OCR via Execute Workflow then routes the result based on `status`:

| Status | Routing |
|--------|---------|
| `success` + `docType=business_card` | Offer to save as Odoo contact (Telegram buttons) |
| `success` + `docType=invoice` | Offer to store as accounting record |
| `success` + `docType=screenshot/handwritten_note/general_document` | Direct display + "Discuss" button (starts a conversation) |
| `fallback` (`not_document`) | Switch to AI Router for conversational handling |
| `error` | Error notification via [Notification Hub](/en/workflows/notification-hub/) |
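
The routing table can be read as a small decision function. The sketch below is illustrative only: the action names are hypothetical labels for the branches of the Binary Content Handler, and treating an unexpected result as an error is an assumption.

```python
from typing import Any

DISPLAY_TYPES = {"screenshot", "handwritten_note", "general_document"}


def route_result(result: dict[str, Any]) -> str:
    """Return the name of the branch the caller should take for a Vision OCR result."""
    status = result.get("status")
    if status == "fallback":
        return "ai_router"
    if status == "success":
        doc_type = result.get("docType")
        if doc_type == "business_card":
            return "offer_odoo_contact"
        if doc_type == "invoice":
            return "offer_accounting_record"
        if doc_type in DISPLAY_TYPES:
            return "display_with_discuss_button"
    # status == "error", or anything outside the contract (assumed to be an error).
    return "notify_error"
```

Because the function depends only on `status` and `docType`, a new document type can be added without touching the existing branches.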

### Performance

| Step | Typical latency |
|------|-----------------|
| base64 encoding + transfer | 100-500 ms (depending on photo size) |
| Classification detection | 1-2 s |
| Schema-driven extraction | 2-4 s |
| HTML format + return | < 100 ms |
| **End-to-end total** | **3-7 s** |

Telegram compression already brings photos down to ~1 MB max, which keeps latency in an acceptable range even for high-resolution business cards.
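
As a sanity check, the end-to-end range follows from summing the per-step bounds in the table. The snippet below is a small worked calculation that takes "< 100 ms" as the range 0-100 ms; the step names are only labels.

```python
# Per-step latency bounds in milliseconds, taken from the table above.
STEPS_MS = {
    "base64 encoding + transfer": (100, 500),
    "classification": (1000, 2000),
    "schema-driven extraction": (2000, 4000),
    "HTML format + return": (0, 100),
}


def total_latency_ms(steps: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum the lower and upper bounds of sequential steps."""
    low = sum(bounds[0] for bounds in steps.values())
    high = sum(bounds[1] for bounds in steps.values())
    return low, high


low, high = total_latency_ms(STEPS_MS)
print(f"{low / 1000:.1f}-{high / 1000:.1f} s")  # prints "3.1-6.6 s"
```

The computed 3.1-6.6 s matches the rounded 3-7 s total, with the two LLM calls accounting for almost all of it.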

---

## 4. What if? — Outlook and limits

### Current limits

| Limit | Impact | Mitigation |
|-------|--------|------------|
| **Latency x2** | Two successive LLM calls | Acceptable, the user is already waiting |
| **No multi-doc classification** | One photo with a card + a receipt = a single type | Ask for two separate photos |
| **No Odoo validation** | A malformed email would create an invalid partner | Validation in the caller workflow |
| **No memory across photos** | Each photo is handled in isolation | The conversational system keeps context once the photo is extracted |
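
The "validation in the caller workflow" mitigation could look like the sketch below. It is a hypothetical pre-check run before creating a `res.partner` record: the function name, the rules and the deliberately simple email pattern are assumptions, not what the caller workflow actually does.

```python
import re
from typing import Any

# Deliberately loose pattern: one "@", no spaces, a dot in the domain part.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_contact(extracted: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the contact can be created."""
    problems = []
    if not (extracted.get("name") or extracted.get("company_name")):
        problems.append("missing name and company_name")
    email = extracted.get("email")
    if email is not None and not (isinstance(email, str) and EMAIL_RE.match(email)):
        problems.append(f"invalid email: {email!r}")
    return problems
```

Returning the list of problems rather than raising lets the caller show them to the user in Telegram and ask for a correction before anything is written to Odoo.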

### Evolution scenarios

**If extraction quality degrades**:
- Move to `gemini-pro-yolo` for complex types (multi-line invoices)
- Add a validation/correction step via a second prompt
- Compare with an alternative model (Claude Vision via Anthropic API if quota allows)

**If new document types emerge**:
- Add a new case to the Switch + a dedicated extraction prompt
- Keep the `{status, docType, extracted, text}` contract unchanged so caller workflows don't break
- Candidate examples: `id_card`, `passport`, `recipe`, `prescription`

**If volumes grow significantly**:
- Share the classification cache (same images within 1-2 days)
- Move to a quantised local Vision model (LLaVA, MiniCPM) hosted on the VPS for zero network latency
- Batch several images in a single API call
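
The classification cache idea could be sketched as an in-memory store keyed by a hash of the image bytes. Everything here is an assumption: the class name, the SHA-256 key, the two-day default TTL and the in-process dictionary (a shared cache would more likely live in Redis or a database).

```python
import hashlib
import time
from typing import Callable, Optional


class ClassificationCache:
    """Remember the detected type of byte-identical images for a limited time."""

    def __init__(self, ttl_seconds: float = 2 * 24 * 3600,
                 clock: Callable[[], float] = time.time) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._entries: dict[str, tuple[float, str]] = {}

    @staticmethod
    def key(image_bytes: bytes) -> str:
        return hashlib.sha256(image_bytes).hexdigest()

    def put(self, image_bytes: bytes, doc_type: str) -> None:
        self._entries[self.key(image_bytes)] = (self._clock(), doc_type)

    def get(self, image_bytes: bytes) -> Optional[str]:
        k = self.key(image_bytes)
        entry = self._entries.get(k)
        if entry is None:
            return None
        stored_at, doc_type = entry
        if self._clock() - stored_at > self._ttl:
            del self._entries[k]  # expired: drop it and report a miss
            return None
        return doc_type
```

A hit skips the first LLM call only. Note that a content hash matches byte-identical files, so the same photo re-sent after a different compression pass would still be a miss.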

---

## Related pages

### Infrastructure
- [AI Stack](/en/infrastructure/ai-stack/) — cli-ollama and the Gemini/Claude routing
- [N8N Queue Mode](/en/infrastructure/n8n-queue-mode/) — Backend running the sub-workflow

### Workflows
- [Telegram Orchestrator](/en/workflows/telegram-orchestrator/) — Calling Binary Content Handler
- [Conversational system](/en/workflows/systeme-conversationnel/) — Logical follow-up after a screenshot extraction
- [Voice Transcription](/en/workflows/voice-transcription/) — Analogous pipeline for audio

### Reference
- [Glossary](/en/reference/glossary/) — OCR, Vision, Sub-workflow, cli-ollama

## Agent metadata

- This article comes from the GuiGPaP Lab blog.
- Global blog context: https://blog.guigpap.com/llms.txt
- Author contact: https://odoo.guigpap.com/mon-cv
- License: CC-BY-SA 4.0