Voice Integration
DELPHOS provides voice capabilities for two primary use cases:
- Real-time clinical transcription — continuous audio capture during consultations, producing structured medical notes in SOAP format.
- Voice-based scheduling — patients book and manage appointments through phone calls and WhatsApp audio messages.
This article covers the core voice infrastructure: speech-to-text transcription, text-to-speech synthesis, audio processing requirements, and LGPD compliance. For the scheduling-specific flow, see Voice-Enabled Scheduling.
Architecture Overview
The voice pipeline is composed of three main components:
| Component | Purpose | Deployment |
|---|---|---|
| Transcription Engine | Converts speech to text with per-segment timestamps and confidence scoring | Dedicated GPU server (can be deployed remotely) |
| TTS Engine | Synthesizes natural-sounding speech from text with configurable voice personas | External cloud service called from the API layer |
| Audio Processor | Validates, converts, and sanitizes audio input in memory | Embedded in the API layer |
Speech-to-Text (Transcription)
The Transcription Engine processes audio input and returns structured text with timing information and confidence metrics.
Capabilities
- Primary language: Portuguese (pt)
- Response format: Verbose JSON with per-segment timestamps
- Quality filtering: Silence segments are automatically excluded when the no-speech probability exceeds 0.8
- Confidence scoring: Computed as an exponential average of log probabilities per segment
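Combined, the two rules above might look like this (a minimal sketch; the `avg_logprob` and `no_speech_prob` field names follow common speech-to-text engine output and are assumptions, not the documented schema):

```python
import math

def score_segments(segments, no_speech_threshold=0.8):
    """Drop probable-silence segments, then compute per-segment confidence
    as exp(average log probability), mapping it back into the 0..1 range."""
    kept = []
    for seg in segments:
        if seg.get("no_speech_prob", 0.0) > no_speech_threshold:
            continue  # probable silence: excluded from the transcript
        scored = dict(seg)
        scored["confidence"] = round(math.exp(seg["avg_logprob"]), 2)
        kept.append(scored)
    return kept
```

The overall transcript confidence can then be taken as the mean of the per-segment values, matching the top-level `confidence` field in the response structure below.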
Transcription Response Structure
{ "text": "Paciente relata dor abdominal ha tres dias...", "segments": [ { "start": 0.0, "end": 3.2, "text": "Paciente relata dor abdominal", "confidence": 0.92 }, { "start": 3.2, "end": 6.1, "text": "ha tres dias com piora progressiva", "confidence": 0.87 } ], "confidence": 0.89, "language": "pt"}Audio Format Requirements
All voice endpoints enforce strict audio validation before processing.
Supported Formats
| Format | MIME Type | Notes |
|---|---|---|
| OGG/Opus | audio/ogg | Common for WhatsApp voice messages |
| WAV | audio/wav | Optimal format for transcription |
| MP3 | audio/mpeg | Widely supported, lossy compression |
| WebM | audio/webm | Browser recording format |
Constraints
| Parameter | Limit |
|---|---|
| Maximum file size | 10 MB |
| Maximum duration | 60 seconds |
| Optimal format | WAV, 16 kHz, mono, PCM 16-bit |
Validation Pipeline
Every uploaded audio file passes through a two-stage validation:
- Content-Type whitelist — the MIME type in the request header must match a supported format.
- Magic byte signature detection — the actual file bytes are inspected to confirm the format matches the declared content type, preventing spoofed uploads.
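A sketch of that two-stage check (the signatures below are the standard magic bytes for each container; the validator's actual internals are not documented here, and the WAV check is simplified — a full check would also confirm "WAVE" at offset 8):

```python
# Stage 1: the Content-Type must appear in the whitelist.
# Stage 2: the leading bytes must match the declared format's signature.
MAGIC_SIGNATURES = {
    "audio/ogg": (b"OggS",),
    "audio/wav": (b"RIFF",),
    "audio/webm": (b"\x1a\x45\xdf\xa3",),  # EBML header used by WebM
    "audio/mpeg": (b"ID3", b"\xff\xfb", b"\xff\xf3", b"\xff\xf2"),
}

def matches_declared_type(content_type: str, data: bytes) -> bool:
    signatures = MAGIC_SIGNATURES.get(content_type)
    if signatures is None:
        return False  # stage 1 failed: not in the whitelist
    return any(data.startswith(sig) for sig in signatures)  # stage 2
```

A spoofed upload — say, an OGG payload declared as `audio/wav` — passes stage 1 but fails stage 2 and is rejected with a 400.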
Consultation Transcription (Chunked Upload)
For real-time transcription during medical consultations, DELPHOS provides a chunked upload protocol. Audio is captured progressively and transcribed as it arrives, allowing physicians to see text appearing in near-real-time.
Session Lifecycle
The consultation transcription follows a five-step lifecycle:
Start Session --> Send Audio Chunks --> (repeat) --> Final Chunk --> SOAP Note Generated

1. Start a Session
POST /v1/consultation/start

Creates a new streaming transcription session and returns a session_id.
2. Submit Audio Chunks
POST /v1/consultation/chunk

Send audio data progressively as the consultation proceeds.
{ "session_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890", "chunk_sequence": 1, "audio_base64": "UklGRi4AAABXQVZFZm10IBAAAA...", "audio_format": "wav", "is_final": false}| Field | Type | Required | Description |
|---|---|---|---|
| session_id | UUID | Yes | Session identifier from the start endpoint |
| chunk_sequence | integer | Yes | Auto-incremented sequence number for ordering |
| audio_base64 | string | Yes | Base64-encoded audio data |
| audio_format | string | No | Format of the audio chunk: wav, mp3, or webm. Defaults to wav |
| is_final | boolean | No | When true, triggers SOAP note generation after transcription. Defaults to false |
{ "transcription": "Paciente relata dor abdominal", "segments": [ { "start": 0.0, "end": 3.2, "text": "Paciente relata dor abdominal" } ], "confidence": 0.92, "session_stats": { "total_chunks": 5, "total_duration_seconds": 24.3, "average_confidence": 0.88 }}3. Check Session Status
GET /v1/consultation/{session_id}/status

Returns the current session state: active, processing, completed, or error.
4. Retrieve Accumulated Transcript
GET /v1/consultation/{session_id}/transcript

Returns the full transcript assembled from all chunks received so far. Useful for UI synchronization if the client missed a chunk response.
5. End Session
POST /v1/consultation/end

| Field | Type | Required | Description |
|---|---|---|---|
| session_id | UUID | Yes | The session to close |
Explicitly ends the session and triggers SOAP note generation if not already triggered by a final chunk.
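Putting the lifecycle together, a client might build its chunk payloads like this (an illustrative sketch; the helper is not part of any DELPHOS SDK):

```python
import base64

def chunk_payloads(session_id: str, chunks: list[bytes], audio_format: str = "wav"):
    """Build /v1/consultation/chunk request bodies for a recording split
    into raw audio chunks, marking the last one final so the server
    generates the SOAP note after its transcription."""
    payloads = []
    for index, raw in enumerate(chunks, start=1):
        payloads.append({
            "session_id": session_id,
            "chunk_sequence": index,
            "audio_base64": base64.b64encode(raw).decode("ascii"),
            "audio_format": audio_format,
            "is_final": index == len(chunks),
        })
    return payloads
```

Each payload is POSTed in order after /v1/consultation/start returns the session_id; if a chunk response is missed, /v1/consultation/{session_id}/transcript resynchronizes the UI.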
Text-to-Speech (TTS)
The TTS engine synthesizes natural speech from text input, with configurable voice personas, speed control, and SSML prosody support.
Endpoint
POST /v1/voice/synthesize

```json
{
  "text": "Sua consulta está confirmada para amanhã às 14 horas.",
  "voice_persona": "receptionist",
  "speed": 1.0,
  "format": "mp3",
  "stream": false
}
```

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| text | string | Yes | — | Text to synthesize into speech |
| voice_persona | string | No | receptionist | Voice profile to use (see Voice Personas below) |
| speed | float | No | 1.0 | Playback speed multiplier, from 0.5x to 2.0x |
| format | string | No | mp3 | Output audio format: mp3, wav, or opus |
| stream | boolean | No | false | When true, returns audio via chunked transfer encoding |
On success, the response body contains the synthesized audio, with the following headers:

| Header | Description |
|---|---|
| Content-Type | Audio MIME type matching the requested format |
| X-Voice-Persona | The voice persona used for synthesis |
Voice Personas
DELPHOS ships with three pre-configured voice personas, each tuned for a specific interaction context:
| Persona | Identifier | Characteristics | Use Case |
|---|---|---|---|
| Clinical Assistant | clinical_assistant | Male voice, deeper pitch (-15%), slightly slower rate (-5%), soft volume | Clinical interactions, reading back patient information to physicians |
| Receptionist | receptionist | Female voice, warm natural tone, standard rate | Scheduling confirmations, WhatsApp voice messages, patient-facing interactions |
| Presentation | presentation | Male voice, standard professional tone | System announcements, presentations, formal communications |
Streaming vs. Non-Streaming
| Mode | Timeout | Behavior |
|---|---|---|
| Streaming (stream: true) | 5 seconds per chunk | Audio is returned via chunked transfer encoding as it is generated. Ideal for real-time playback in phone calls. |
| Non-streaming (stream: false) | 30 seconds total | The complete audio file is generated and returned in a single response. Suitable for pre-generated messages. |
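As a rule of thumb, interactive callers (phone calls) stream while pre-generated messages (WhatsApp confirmations) do not. A request-building sketch — the helper and its format defaults are illustrative, not part of the API:

```python
def synthesis_request(text: str, interactive: bool, persona: str = "receptionist"):
    """Build a /v1/voice/synthesize body, enabling streaming and a
    compact codec for interactive (phone-call) playback."""
    return {
        "text": text,
        "voice_persona": persona,
        "speed": 1.0,
        "format": "opus" if interactive else "mp3",
        "stream": interactive,
    }
```

Interactive callers should then read the response with a per-chunk timeout of 5 seconds, per the table above.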
Voice Pipeline for Scheduling
DELPHOS also exposes a dedicated voice endpoint for appointment scheduling via phone calls and WhatsApp.
Endpoint
POST /v1/scheduling/voice
Content-Type: multipart/form-data

| Parameter | Type | Required | Description |
|---|---|---|---|
| audio | file | Yes | Audio file (multipart upload) |
| patient_id | UUID | Yes | Patient identifier |
| session_id | UUID | No | Existing session to continue (for multi-turn conversations) |
| channel | string | No | Source channel: whatsapp, phone, or web |
Pipeline Steps
- Validate audio — format, size, and duration checks
- Convert to WAV — normalize to 16 kHz mono PCM via in-memory ffmpeg
- Transcribe — speech-to-text with confidence scoring
- NLP extraction — the AI Engine extracts scheduling intent, dates, times, and patient references
- Scheduling action — the extracted intent is routed to the appropriate scheduling operation
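The five steps can be wired as a simple sequential pipeline. The sketch below uses injected callables for each stage, since the actual engine internals are not documented:

```python
def process_voice_scheduling(audio_bytes: bytes, patient_id: str, stages: dict):
    """Run the documented pipeline in order. Each value in `stages` is a
    callable standing in for one stage: validate, convert, transcribe,
    extract (NLP), and route (scheduling action)."""
    stages["validate"](audio_bytes)             # format, size, duration checks
    wav = stages["convert"](audio_bytes)        # 16 kHz mono PCM, in memory
    text = stages["transcribe"](wav)            # speech-to-text
    intent = stages["extract"](text)            # scheduling intent + entities
    return stages["route"](intent, patient_id)  # book / reschedule / cancel
```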
For the complete scheduling flow including intent extraction, confirmation dialogs, and multi-turn conversations, see Voice-Enabled Scheduling.
LGPD Audio Compliance
All audio processing in DELPHOS is designed with LGPD (Lei Geral de Proteção de Dados) compliance as a first-class requirement. Patient voice data is sensitive personal data under LGPD Article 5, and receives the following protections:
| Measure | Implementation |
|---|---|
| In-memory processing only | Audio is never written to disk. All conversion and transcription operates on in-memory buffers. |
| No temporary files | ffmpeg is invoked with stdin/stdout pipes, eliminating filesystem exposure. |
| Immediate deletion | Audio buffers are released immediately after transcription completes. |
| Log sanitization | Audio metadata (file names, sizes, durations) is sanitized in application logs. No audio content is ever logged. |
| Buffer cleanup | In-memory file handles are explicitly closed after the audio bytes are read, ensuring memory is reclaimed promptly. |
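The pipe-only ffmpeg invocation might look like the following sketch. The exact flags DELPHOS uses are not documented; these produce the 16 kHz mono PCM 16-bit target named in the constraints table:

```python
import subprocess

def ffmpeg_pipe_command() -> list[str]:
    """ffmpeg arguments that read stdin and write stdout, so no audio
    ever touches the filesystem."""
    return [
        "ffmpeg", "-i", "pipe:0",  # input from stdin
        "-ar", "16000",            # resample to 16 kHz
        "-ac", "1",                # downmix to mono
        "-acodec", "pcm_s16le",    # PCM 16-bit samples
        "-f", "wav", "pipe:1",     # WAV container to stdout
    ]

def convert_to_wav_in_memory(audio_bytes: bytes) -> bytes:
    """Convert input audio to the optimal transcription format entirely
    in memory, with no temporary files."""
    result = subprocess.run(ffmpeg_pipe_command(), input=audio_bytes,
                            capture_output=True, check=True)
    return result.stdout
```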
Error Handling
All voice endpoints return structured error responses. The following table summarizes the expected error codes:
| HTTP Status | Condition | Response Body |
|---|---|---|
| 400 | Invalid audio format | { "detail": "Unsupported audio format. Supported: ogg, wav, mp3, webm" } |
| 400 | Duration exceeds limit | { "detail": "Audio duration exceeds maximum of 60 seconds" } |
| 413 | File size exceeds limit | { "detail": "Audio file exceeds maximum size of 10 MB" } |
| 429 | Concurrency limit reached | { "detail": "Voice processing at capacity. Please retry." } with Retry-After header |
| 503 | Transcription service unavailable | { "detail": "Transcription service is currently unavailable" } |
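Clients should treat 429 as transient and honor the Retry-After header. A retry sketch, assuming a callable that raises `urllib.error.HTTPError` on non-2xx responses (the helper itself is illustrative):

```python
import time
import urllib.error

def call_with_retry(make_request, max_attempts: int = 3, default_wait: int = 5):
    """Invoke `make_request`, retrying on 429 and sleeping for the number
    of seconds requested by Retry-After (falling back to 5 seconds)."""
    for attempt in range(max_attempts):
        try:
            return make_request()
        except urllib.error.HTTPError as err:
            if err.code != 429 or attempt == max_attempts - 1:
                raise  # not a capacity problem, or out of attempts
            time.sleep(int(err.headers.get("Retry-After", default_wait)))
```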
{ "detail": "Unsupported audio format. Supported: ogg, wav, mp3, webm", "error_code": "INVALID_AUDIO_FORMAT"}{ "detail": "Audio file exceeds maximum size of 10 MB", "error_code": "FILE_TOO_LARGE", "max_size_bytes": 10485760}HTTP/1.1 429 Too Many RequestsRetry-After: 5
{ "detail": "Voice processing at capacity. Please retry.", "error_code": "CONCURRENCY_LIMIT"}{ "detail": "Transcription service is currently unavailable", "error_code": "TRANSCRIPTION_UNAVAILABLE"}Integration Checklist
Before enabling voice features in a DELPHOS deployment, verify the following:
- Transcription Engine is deployed and accessible from the API server
- GPU server has sufficient VRAM allocated for the transcription model
- ffmpeg is available in the API container image (bundled by default; required for audio conversion)
- Network connectivity between the API layer and the Transcription Engine is confirmed
- TTS engine is configured with the desired voice personas
- Concurrency limits are tuned for expected call volume (default: 5 concurrent voice requests)
- Application logs are configured to sanitize audio metadata per LGPD requirements
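The concurrency limit could be enforced with a semaphore in the API layer. A minimal asyncio sketch (illustrative, not the actual DELPHOS implementation, which fails fast with 429 + Retry-After rather than queueing):

```python
import asyncio

async def run_with_voice_limit(requests, limit: int = 5):
    """Process voice-request coroutine factories with at most `limit`
    running concurrently; excess requests wait for a free slot."""
    slots = asyncio.Semaphore(limit)

    async def guarded(make_coro):
        async with slots:  # one slot per in-flight voice request
            return await make_coro()

    return await asyncio.gather(*(guarded(r) for r in requests))
```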