Voice Integration

DELPHOS provides voice capabilities for two primary use cases:

  1. Real-time clinical transcription — continuous audio capture during consultations, producing structured medical notes in SOAP format.
  2. Voice-based scheduling — patients book and manage appointments through phone calls and WhatsApp audio messages.

This article covers the core voice infrastructure: speech-to-text transcription, text-to-speech synthesis, audio processing requirements, and LGPD compliance. For the scheduling-specific flow, see Voice-Enabled Scheduling.


Architecture Overview

The voice pipeline is composed of three main components:

| Component | Purpose | Deployment |
|---|---|---|
| Transcription Engine | Converts speech to text with per-segment timestamps and confidence scoring | Dedicated GPU server (can be deployed remotely) |
| TTS Engine | Synthesizes natural-sounding speech from text with configurable voice personas | External cloud service called from the API layer |
| Audio Processor | Validates, converts, and sanitizes audio input in memory | Embedded in the API layer |

Speech-to-Text (Transcription)

The Transcription Engine processes audio input and returns structured text with timing information and confidence metrics.

Capabilities

  • Primary language: Portuguese (pt)
  • Response format: Verbose JSON with per-segment timestamps
  • Quality filtering: Silence segments are automatically excluded when the no-speech probability exceeds 0.8
  • Confidence scoring: Computed as an exponential average of log probabilities per segment
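The filtering and scoring rules above can be expressed directly. Note that the avg_logprob and no_speech_prob field names below are assumptions for illustration, not documented response fields:

```python
import math

NO_SPEECH_THRESHOLD = 0.8  # segments above this are treated as silence

def filter_segments(segments):
    """Drop segments the model considers silence."""
    return [s for s in segments if s["no_speech_prob"] <= NO_SPEECH_THRESHOLD]

def overall_confidence(segments):
    """Exponential of the mean log-probability across retained segments."""
    if not segments:
        return 0.0
    mean_logprob = sum(s["avg_logprob"] for s in segments) / len(segments)
    return math.exp(mean_logprob)

segments = [
    {"avg_logprob": -0.08, "no_speech_prob": 0.02},
    {"avg_logprob": -0.14, "no_speech_prob": 0.05},
    {"avg_logprob": -2.50, "no_speech_prob": 0.95},  # silence: excluded
]
kept = filter_segments(segments)
print(len(kept), round(overall_confidence(kept), 2))
```

Because the score is exp() of an average log probability, it stays in the 0–1 range and degrades smoothly as per-segment probabilities drop.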

Transcription Response Structure

{
  "text": "Paciente relata dor abdominal há três dias...",
  "segments": [
    {
      "start": 0.0,
      "end": 3.2,
      "text": "Paciente relata dor abdominal",
      "confidence": 0.92
    },
    {
      "start": 3.2,
      "end": 6.1,
      "text": "há três dias com piora progressiva",
      "confidence": 0.87
    }
  ],
  "confidence": 0.89,
  "language": "pt"
}

Audio Format Requirements

All voice endpoints enforce strict audio validation before processing.

Supported Formats

| Format | MIME Type | Notes |
|---|---|---|
| OGG/Opus | audio/ogg | Common for WhatsApp voice messages |
| WAV | audio/wav | Optimal format for transcription |
| MP3 | audio/mpeg | Widely supported, lossy compression |
| WebM | audio/webm | Browser recording format |

Constraints

| Parameter | Limit |
|---|---|
| Maximum file size | 10 MB |
| Maximum duration | 60 seconds |
| Optimal format | WAV, 16 kHz, mono, PCM 16-bit |

Validation Pipeline

Every uploaded audio file passes through a two-stage validation:

  1. Content-Type whitelist — the MIME type in the request header must match a supported format.
  2. Magic byte signature detection — the actual file bytes are inspected to confirm the format matches the declared content type, preventing spoofed uploads.
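The two-stage check can be sketched as follows. DELPHOS's actual signature list is not documented, so the magic numbers below are the standard container signatures for the four supported formats:

```python
def sniff_audio_format(data: bytes):
    """Return the detected format, or None if no supported signature matches."""
    if data[:4] == b"OggS":
        return "ogg"
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    if data[:3] == b"ID3" or (len(data) > 1 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0):
        return "mp3"  # ID3 tag or a bare MPEG frame sync
    if data[:4] == b"\x1a\x45\xdf\xa3":
        return "webm"  # EBML header (Matroska/WebM)
    return None

DECLARED_TO_FORMAT = {
    "audio/ogg": "ogg", "audio/wav": "wav",
    "audio/mpeg": "mp3", "audio/webm": "webm",
}

def validate_upload(content_type: str, data: bytes) -> bool:
    """Stage 1: Content-Type whitelist. Stage 2: signature must match it."""
    expected = DECLARED_TO_FORMAT.get(content_type)
    return expected is not None and sniff_audio_format(data) == expected
```

The second stage is what defeats spoofed uploads: a file declared as audio/wav whose bytes start with an OGG signature is rejected even though its MIME type is whitelisted.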

Consultation Transcription (Chunked Upload)

For real-time transcription during medical consultations, DELPHOS provides a chunked upload protocol. Audio is captured progressively and transcribed as it arrives, allowing physicians to see text appearing in near-real-time.

Session Lifecycle

The consultation transcription follows a five-step lifecycle:

Start Session --> Send Audio Chunks --> (repeat) --> Final Chunk --> SOAP Note Generated

1. Start a Session

POST /v1/consultation/start

Creates a new streaming transcription session and returns a session_id.

2. Submit Audio Chunks

POST /v1/consultation/chunk

Send audio data progressively as the consultation proceeds.

{
  "session_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "chunk_sequence": 1,
  "audio_base64": "UklGRi4AAABXQVZFZm10IBAAAA...",
  "audio_format": "wav",
  "is_final": false
}

3. Check Session Status

GET /v1/consultation/{session_id}/status

Returns the current session state: active, processing, completed, or error.

4. Retrieve Accumulated Transcript

GET /v1/consultation/{session_id}/transcript

Returns the full transcript assembled from all chunks received so far. Useful for UI synchronization if the client missed a chunk response.

5. End Session

POST /v1/consultation/end
| Field | Type | Required | Description |
|---|---|---|---|
| session_id | UUID | Yes | The session to close |

Explicitly ends the session and triggers SOAP note generation if not already triggered by a final chunk.
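The lifecycle can be driven by a small client-side helper. A hedged sketch follows; BASE_URL, post(), and capture_audio() in the commented portion are hypothetical stand-ins for your HTTP client and audio source:

```python
import base64

def build_chunk(session_id: str, sequence: int, audio_bytes: bytes,
                audio_format: str = "wav", is_final: bool = False) -> dict:
    """Assemble the JSON body for POST /v1/consultation/chunk."""
    return {
        "session_id": session_id,
        "chunk_sequence": sequence,
        "audio_base64": base64.b64encode(audio_bytes).decode("ascii"),
        "audio_format": audio_format,
        "is_final": is_final,
    }

# Lifecycle sketch (BASE_URL, post, and capture_audio are hypothetical):
# sid = post(f"{BASE_URL}/v1/consultation/start")["session_id"]
# for i, chunk in enumerate(capture_audio(), start=1):
#     post(f"{BASE_URL}/v1/consultation/chunk", json=build_chunk(sid, i, chunk))
# post(f"{BASE_URL}/v1/consultation/end", json={"session_id": sid})
```

Keeping chunk_sequence monotonically increasing on the client side lets the server detect gaps, which is also when the accumulated-transcript endpoint is useful for resynchronization.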


Text-to-Speech (TTS)

The TTS engine synthesizes natural speech from text input, with configurable voice personas, speed control, and SSML prosody support.

Endpoint

POST /v1/voice/synthesize
{
  "text": "Sua consulta está confirmada para amanhã às 14 horas.",
  "voice_persona": "receptionist",
  "speed": 1.0,
  "format": "mp3",
  "stream": false
}

Voice Personas

DELPHOS ships with three pre-configured voice personas, each tuned for a specific interaction context:

| Persona | Identifier | Characteristics | Use Case |
|---|---|---|---|
| Clinical Assistant | clinical_assistant | Male voice, deeper pitch (-15%), slightly slower rate (-5%), soft volume | Clinical interactions, reading back patient information to physicians |
| Receptionist | receptionist | Female voice, warm natural tone, standard rate | Scheduling confirmations, WhatsApp voice messages, patient-facing interactions |
| Presentation | presentation | Male voice, standard professional tone | System announcements, presentations, formal communications |

Streaming vs. Non-Streaming

| Mode | Timeout | Behavior |
|---|---|---|
| Streaming (stream: true) | 5 seconds per chunk | Audio is returned via chunked transfer encoding as it is generated. Ideal for real-time playback in phone calls. |
| Non-streaming (stream: false) | 30 seconds total | The complete audio file is generated and returned in a single response. Suitable for pre-generated messages. |
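As a sketch of how a client might build the request and pick a timeout for each mode (persona identifiers and timeout values are taken from the tables above; the requests-based snippet in the comment is illustrative, and BASE_URL is an assumption):

```python
VALID_PERSONAS = {"clinical_assistant", "receptionist", "presentation"}

def build_tts_request(text: str, persona: str = "receptionist",
                      speed: float = 1.0, fmt: str = "mp3",
                      stream: bool = False):
    """Return (payload, timeout_seconds) for POST /v1/voice/synthesize.
    Timeouts mirror the documented limits: 5 s per chunk when streaming,
    30 s total otherwise."""
    if persona not in VALID_PERSONAS:
        raise ValueError(f"unknown persona: {persona}")
    payload = {"text": text, "voice_persona": persona,
               "speed": speed, "format": fmt, "stream": stream}
    return payload, 5 if stream else 30

# Streaming playback sketch (requests and BASE_URL are assumptions):
# payload, timeout = build_tts_request("Olá!", stream=True)
# with requests.post(f"{BASE_URL}/v1/voice/synthesize", json=payload,
#                    stream=True, timeout=timeout) as r:
#     for chunk in r.iter_content(chunk_size=4096):
#         audio_sink.write(chunk)  # feed the audio sink as chunks arrive
```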

Voice Pipeline for Scheduling

DELPHOS also exposes a dedicated voice endpoint for appointment scheduling via phone calls and WhatsApp.

Endpoint

POST /v1/scheduling/voice
Content-Type: multipart/form-data
| Parameter | Type | Required | Description |
|---|---|---|---|
| audio | file | Yes | Audio file (multipart upload) |
| patient_id | UUID | Yes | Patient identifier |
| session_id | UUID | No | Existing session to continue (for multi-turn conversations) |
| channel | string | No | Source channel: whatsapp, phone, or web |

Pipeline Steps

  1. Validate audio — format, size, and duration checks
  2. Convert to WAV — normalize to 16 kHz mono PCM via in-memory ffmpeg
  3. Transcribe — speech-to-text with confidence scoring
  4. NLP extraction — the AI Engine extracts scheduling intent, dates, times, and patient references
  5. Scheduling action — the extracted intent is routed to the appropriate scheduling operation
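Step 2 can be implemented with ffmpeg driven entirely over stdin/stdout pipes, which is also what underpins the "no temporary files" guarantee in the LGPD section. The flags below are standard ffmpeg options; DELPHOS's exact invocation is not documented, so treat this as a sketch:

```python
import subprocess

def ffmpeg_wav_cmd() -> list:
    """ffmpeg argv that reads from stdin and writes 16 kHz mono 16-bit
    PCM WAV to stdout, so no audio ever touches the filesystem."""
    return ["ffmpeg", "-hide_banner", "-loglevel", "error",
            "-i", "pipe:0",        # read input from stdin
            "-ar", "16000",        # resample to 16 kHz
            "-ac", "1",            # downmix to mono
            "-c:a", "pcm_s16le",   # 16-bit PCM
            "-f", "wav", "pipe:1"] # write WAV container to stdout

def convert_to_wav(audio_bytes: bytes) -> bytes:
    """Pipe the upload through ffmpeg entirely in memory."""
    result = subprocess.run(ffmpeg_wav_cmd(), input=audio_bytes,
                            capture_output=True, check=True)
    return result.stdout
```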

For the complete scheduling flow including intent extraction, confirmation dialogs, and multi-turn conversations, see Voice-Enabled Scheduling.


LGPD Audio Compliance

All audio processing in DELPHOS is designed with LGPD (Lei Geral de Proteção de Dados) compliance as a first-class requirement. Patient voice data is sensitive personal data under LGPD Article 5, and receives the following protections:

| Measure | Implementation |
|---|---|
| In-memory processing only | Audio is never written to disk. All conversion and transcription operates on in-memory buffers. |
| No temporary files | ffmpeg is invoked with stdin/stdout pipes, eliminating filesystem exposure. |
| Immediate deletion | Audio buffers are released immediately after transcription completes. |
| Log sanitization | Audio metadata (file names, sizes, durations) is sanitized in application logs. No audio content is ever logged. |
| Buffer cleanup | In-memory file handles are explicitly closed after the audio bytes are read, ensuring memory is reclaimed promptly. |
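A sanitizing log filter along these lines would satisfy the metadata rule; the regex patterns are illustrative assumptions, not DELPHOS's actual implementation:

```python
import re

def sanitize_audio_metadata(message: str) -> str:
    """Redact audio file names and sizes before a log line is emitted,
    so audio metadata never reaches persistent logs."""
    message = re.sub(r"\b[\w.-]+\.(?:ogg|wav|mp3|webm)\b", "[audio-file]", message)
    message = re.sub(r"\b\d+(?:\.\d+)?\s*(?:KB|MB|bytes)\b", "[size]",
                     message, flags=re.IGNORECASE)
    return message

print(sanitize_audio_metadata("received consulta_123.ogg (2.4 MB)"))
```

In practice this would hang off the logging framework (e.g. a logging.Filter subclass) so every handler applies it uniformly.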

Error Handling

All voice endpoints return structured error responses. The following table summarizes the expected error codes:

| HTTP Status | Condition | Response Body |
|---|---|---|
| 400 | Invalid audio format | { "detail": "Unsupported audio format. Supported: ogg, wav, mp3, webm" } |
| 400 | Duration exceeds limit | { "detail": "Audio duration exceeds maximum of 60 seconds" } |
| 413 | File size exceeds limit | { "detail": "Audio file exceeds maximum size of 10 MB" } |
| 429 | Concurrency limit reached | { "detail": "Voice processing at capacity. Please retry." } with Retry-After header |
| 503 | Transcription service unavailable | { "detail": "Transcription service is currently unavailable" } |
Error bodies may also carry a machine-readable error_code alongside the human-readable detail:

{
  "detail": "Unsupported audio format. Supported: ogg, wav, mp3, webm",
  "error_code": "INVALID_AUDIO_FORMAT"
}
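On the client side, the 429 path can be handled by honoring Retry-After. A minimal sketch, assuming responses represented as plain dicts (a stand-in for whatever HTTP client is in use):

```python
import time

def retry_after_seconds(headers: dict, default: float = 1.0) -> float:
    """Parse the Retry-After header (seconds form) from a 429 response."""
    try:
        return max(0.0, float(headers.get("Retry-After")))
    except (TypeError, ValueError):
        return default

def call_with_backoff(send, max_attempts: int = 3):
    """Retry a voice request while the service answers 429."""
    for attempt in range(max_attempts):
        response = send()
        if response["status"] != 429:
            return response
        if attempt < max_attempts - 1:
            time.sleep(retry_after_seconds(response.get("headers", {})))
    return response
```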

Integration Checklist

Before enabling voice features in a DELPHOS deployment, verify the following:

  • Transcription Engine is deployed and accessible from the API server
  • GPU server has sufficient VRAM allocated for the transcription model
  • ffmpeg is available in the API container image (bundled by default; required for audio conversion)
  • Network connectivity between the API layer and the Transcription Engine is confirmed
  • TTS engine is configured with the desired voice personas
  • Concurrency limits are tuned for expected call volume (default: 5 concurrent voice requests)
  • Application logs are configured to sanitize audio metadata per LGPD requirements