Voice Transcription

1. What? — Definition and context

The Voice Transcription workflow automatically transcribes Telegram voice messages into text. A routing step picks the transcription service according to the message duration: Groq Whisper for messages of up to 30 seconds, ElevenLabs Scribe for longer ones.

Services used

Service	Usage	Advantage
Groq Whisper	Messages ≤ 30s	Free, fast (1-2s)
ElevenLabs Scribe	Messages > 30s	Diarisation, long files

Routing by duration

Duration	Service	Reason
≤ 30 seconds	Groq Whisper	Fast, free
> 30 seconds	ElevenLabs Scribe	Diarisation, long files
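The routing rule in the table above can be sketched as a small function. This is a minimal illustration, not the workflow's actual n8n node: the names `pick_service` and `SHORT_MESSAGE_MAX_SECONDS` are hypothetical, and the label `"elevenlabs"` is an assumption modelled on the `"groq"` label that appears in the workflow output.

```python
SHORT_MESSAGE_MAX_SECONDS = 30  # threshold taken from the routing table


def pick_service(duration_seconds: int) -> str:
    """Return the transcription service label for a voice message of this length."""
    if duration_seconds < 0:
        raise ValueError("duration must be non-negative")
    # Up to and including 30 s goes to Groq Whisper; anything longer to ElevenLabs Scribe.
    if duration_seconds <= SHORT_MESSAGE_MAX_SECONDS:
        return "groq"
    return "elevenlabs"  # assumed label, not confirmed by this page
```

Note that the boundary is inclusive: a message of exactly 30 seconds still goes to Groq.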

2. Why? — Stakes and motivations

Problems solved

Problem	Without transcription	With transcription
Mandatory listening	Replay the audio to understand it	Instantly readable text
No search	No Ctrl+F on audio	Indexable text
Hard to share	Send the audio file	Copy-paste the text
Accessibility	Not accessible to deaf users	Text readable by everyone

Why two services?

Criterion	Groq Whisper	ElevenLabs Scribe
Cost	Free	Paid (per hour)
Speed	~1-2s	~10-30s
File limit	25 MB	3 GB
Diarisation	No	Yes
Best for	Short messages	Meetings, podcasts

3. How? — Technical implementation

Architecture

Workflow input

{
  "message": {
    "voice": {
      "file_id": "AwACAgIAAxkB...",
      "duration": 15,
      "mime_type": "audio/ogg"
    },
    "from": {
      "id": 123456789,
      "first_name": "Guillaume"
    },
    "chat": {
      "id": 123456789
    }
  }
}
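As a minimal sketch of how a consumer might read this payload, the following pulls out the fields the workflow relies on. The `VoiceNote` class and `parse_voice_note` function are hypothetical names introduced for illustration. Defaulting `mime_type` to `audio/ogg` is an assumption based on the format Telegram uses for voice notes.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class VoiceNote:
    file_id: str
    duration: int  # seconds
    mime_type: str
    chat_id: int


def parse_voice_note(update: dict[str, Any]) -> Optional[VoiceNote]:
    """Extract the voice fields from a payload shaped like the example above.

    Returns None when the message carries no voice note.
    """
    message = update.get("message", {})
    voice = message.get("voice")
    if voice is None:
        return None
    return VoiceNote(
        file_id=voice["file_id"],
        duration=int(voice["duration"]),
        mime_type=voice.get("mime_type", "audio/ogg"),  # assumed default
        chat_id=message["chat"]["id"],
    )
```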

Groq Whisper configuration (≤30s)

Community Node: n8n-nodes-groq

Parameter	Value
Credential	`Groq account - N8N`
Operation	Transcribe
Model	`whisper-large-v3-turbo`
Input Data Field	`data`
Language	`fr` (optional)
Response Format	`json`
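For readers who want to reproduce the same call outside n8n, the node parameters above map roughly onto a multipart HTTP request. The sketch below only builds the request arguments and sends nothing. The endpoint URL, the Bearer authentication scheme, the file name `voice.ogg`, and the function name `build_groq_request` are assumptions that do not come from this page.

```python
from typing import Optional

# Assumption: Groq's OpenAI-compatible transcription endpoint; not stated on this page.
GROQ_TRANSCRIBE_URL = "https://api.groq.com/openai/v1/audio/transcriptions"


def build_groq_request(audio: bytes, api_key: str, language: Optional[str] = "fr") -> dict:
    """Build keyword arguments for an HTTP POST mirroring the node parameters above."""
    data = {
        "model": "whisper-large-v3-turbo",
        "response_format": "json",
    }
    if language:  # Language is optional in the node configuration
        data["language"] = language
    return {
        "url": GROQ_TRANSCRIBE_URL,
        "headers": {"Authorization": f"Bearer {api_key}"},
        "data": data,
        # Telegram voice notes are OGG/Opus; the file name is illustrative.
        "files": {"file": ("voice.ogg", audio, "audio/ogg")},
    }
```

The returned dictionary is shaped so that, with the third-party `requests` library installed, `requests.post(**build_groq_request(audio, key))` would send it.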

ElevenLabs Scribe configuration (>30s)

HTTP Request Node

Parameter	Value
Method	POST
URL	`https://api.elevenlabs.io/v1/speech-to-text`
Authentication	Header Auth → `ElevenLabs API`
Body Content Type	Form-Data

Form Parameters:

Name	Type	Value
file	Binary	`{{ $binary.data }}`
model_id	String	`scribe_v1`
language_code	String	`fr`
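To make the Form-Data body concrete, here is a sketch that assembles the same three form parameters into a multipart request using only the Python standard library. It builds the request without sending it. The `xi-api-key` header name, the file name `voice.ogg`, and the function name `build_scribe_request` are assumptions, since the page only refers to the `ElevenLabs API` credential.

```python
import urllib.request
import uuid

ELEVENLABS_STT_URL = "https://api.elevenlabs.io/v1/speech-to-text"


def build_scribe_request(audio: bytes, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a multipart POST with the three form parameters above."""
    boundary = uuid.uuid4().hex
    parts = []
    # Plain string fields.
    for name, value in (("model_id", "scribe_v1"), ("language_code", "fr")):
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    # Binary field: the file name is illustrative; Telegram voice notes are OGG/Opus.
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; filename="voice.ogg"\r\n'
        f"Content-Type: audio/ogg\r\n\r\n".encode() + audio + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return urllib.request.Request(
        ELEVENLABS_STT_URL,
        data=b"".join(parts),
        method="POST",
        headers={
            "xi-api-key": api_key,  # assumed header name for the credential
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Sending would then be a call such as `urllib.request.urlopen(request, timeout=180)`, where 180 seconds matches the timeout suggested for long files in the troubleshooting table.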

Output

{
  "success": true,
  "text": "Remind me to call Jean tomorrow",
  "duration": 15,
  "service": "groq"
}
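Both services return their own response shape, so the workflow has to normalise them into the envelope above. A minimal sketch follows, assuming the provider response exposes the transcript under a `text` key and that an empty transcript counts as a failure. Both points are assumptions, as is the function name `build_output`.

```python
def build_output(service: str, duration: int, api_response: dict) -> dict:
    """Normalise a provider response into the workflow's output envelope."""
    # Assumption: the transcript is under "text" in the provider's JSON response.
    text = (api_response.get("text") or "").strip()
    return {
        "success": bool(text),  # assumption: an empty transcript is reported as a failure
        "text": text,
        "duration": duration,
        "service": service,
    }
```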

Integration with the orchestrator

The Telegram Orchestrator detects voice notes and calls this sub-workflow:

IF message.voice exists:
  Execute Workflow: Voice Transcription
  Input: $json (contains message.voice)

  IF response.success:
    IF active conversation exists (#231/#232):
      Send Transcript Preview + route text to Conversation Agent
    ELSE:
      Send message: "🎤 {response.text}"
  ELSE:
    Send message: "❌ Transcription failed"

Since Phase 5 (#231/#232), if a conversation is active when a voice note arrives, the transcribed text is injected as a message into the conversation instead of being returned as-is to the user. This enables a voice discussion with the bot.
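The branching described above can be expressed as a pure function that returns the actions to perform, which makes the three outcomes easy to test. This is a sketch only: the action names and the function `handle_voice_result` are hypothetical, and the real logic lives in n8n nodes.

```python
def handle_voice_result(response: dict, active_conversation: bool) -> list[tuple[str, str]]:
    """Return (action, payload) pairs mirroring the orchestrator's branching."""
    if not response.get("success"):
        return [("send_message", "❌ Transcription failed")]
    text = response["text"]
    if active_conversation:
        # Since Phase 5: show a preview, then inject the text into the conversation.
        return [
            ("send_transcript_preview", text),
            ("route_to_conversation_agent", text),
        ]
    return [("send_message", f"🎤 {text}")]
```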

Post-transcription callbacks

Callback	Action
`voice_retry_{msg_id}`	Retry transcription
`voice_process_{msg_id}`	Process with Claude (summary, extraction)
`voice_save_{msg_id}`	Save as a note
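A handler for these callbacks first has to split the callback data into an action and a message id. The sketch below assumes that `msg_id` is a numeric Telegram message id. The names `CALLBACK_PATTERN` and `parse_voice_callback` are illustrative.

```python
import re
from typing import Optional

# Matches voice_retry_<id>, voice_process_<id> and voice_save_<id>.
CALLBACK_PATTERN = re.compile(r"^voice_(retry|process|save)_(\d+)$")


def parse_voice_callback(data: str) -> Optional[tuple[str, int]]:
    """Return (action, msg_id) for a known voice callback, or None otherwise."""
    match = CALLBACK_PATTERN.match(data)
    if match is None:
        return None
    return match.group(1), int(match.group(2))
```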

4. What if? — Outlook and limits

Limits and costs

Service	File limit	Cost	Speed
Groq Whisper	25 MB	Free	~1-2s
ElevenLabs Scribe	3 GB	Paid (per hour)	~10-30s

Current limits

Limit	Impact	Mitigation
Groq quota	Possible rate limiting	ElevenLabs fallback
OGG format	Telegram delivers voice notes as OGG (Opus) only	Natively supported by both APIs
No diarisation for messages ≤ 30s	No speaker identification	Acceptable for short messages

Evolution scenarios

If the Groq rate limit is hit:

Temporarily lower the routing threshold to 0 seconds (condition `duration ≤ 0`) so that every message is sent to ElevenLabs
Or add OpenAI Whisper as an intermediate fallback
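The fallback idea can be sketched as an ordered chain of providers that is walked until one of them accepts the request. The `RateLimited` exception, the `Transcriber` type, and `transcribe_with_fallback` are hypothetical names. How each provider actually signals a rate limit is not covered by this page, so the transcribers are passed in as plain callables.

```python
from typing import Callable, Sequence


class RateLimited(Exception):
    """Raised by a transcriber when the provider rejects the call for quota reasons."""


Transcriber = Callable[[bytes], str]


def transcribe_with_fallback(
    audio: bytes, chain: Sequence[tuple[str, Transcriber]]
) -> tuple[str, str]:
    """Try each (service, transcriber) in order; return (service, text) from the first success."""
    for service, transcriber in chain:
        try:
            return service, transcriber(audio)
        except RateLimited:
            continue  # quota exhausted: move on to the next provider
    raise RuntimeError("all transcription services are rate limited")
```

A chain such as `[("groq", groq_fn), ("openai", openai_fn), ("elevenlabs", scribe_fn)]` would correspond to the intermediate-fallback scenario. Only rate-limit errors trigger the fallback; any other exception propagates to the caller.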

If systematic diarisation is needed:

Route every message to ElevenLabs
Or use a local model with speaker detection

If multilingual support is needed:

Auto-detect the language of each message
Adapt the service parameters (`language`, `language_code`) to the detected language

Troubleshooting

Problem	Check
Empty transcription	Does the audio file actually contain speech?
ElevenLabs timeout	For recordings longer than 5 min, raise the HTTP Request timeout to 180s
Groq rate limit	Check quotas, fall back to ElevenLabs
Unsupported format	Telegram sends .ogg (Opus), which both APIs support natively
