Voice Integration

DELPHOS provides voice capabilities for two primary use cases:

  1. Real-time clinical transcription — continuous audio capture during consultations, producing structured medical notes in SOAP format.
  2. Voice-based scheduling — patients book and manage appointments through phone calls and WhatsApp audio messages.

This article covers the core voice infrastructure: speech-to-text transcription, text-to-speech synthesis, audio processing requirements, and LGPD compliance. For the scheduling-specific flow, see Voice-Enabled Scheduling.


Architecture Overview

The voice pipeline is composed of three main components:

| Component | Purpose | Deployment |
|---|---|---|
| Transcription Engine | Converts speech to text with per-segment timestamps and confidence scoring | Dedicated GPU server (can be deployed remotely) |
| TTS Engine | Synthesizes natural-sounding speech from text with configurable voice personas | External cloud service called from the API layer |
| Audio Processor | Validates, converts, and sanitizes audio input in memory | Embedded in the API layer |

Speech-to-Text (Transcription)

The Transcription Engine processes audio input and returns structured text with timing information and confidence metrics.

Capabilities

  • Primary language: Portuguese (pt)
  • Response format: Verbose JSON with per-segment timestamps
  • Quality filtering: Silence segments are automatically excluded when the no-speech probability exceeds 0.8
  • Confidence scoring: Computed as an exponential average of log probabilities per segment
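The filtering and scoring rules above can be expressed directly. Note that the avg_logprob and no_speech_prob field names below are assumptions for illustration, not documented response fields:

```python
import math

NO_SPEECH_THRESHOLD = 0.8  # segments above this are treated as silence

def filter_segments(segments):
    """Drop segments the model considers silence."""
    return [s for s in segments if s["no_speech_prob"] <= NO_SPEECH_THRESHOLD]

def overall_confidence(segments):
    """Exponential of the mean log-probability across retained segments."""
    if not segments:
        return 0.0
    mean_logprob = sum(s["avg_logprob"] for s in segments) / len(segments)
    return math.exp(mean_logprob)

segments = [
    {"avg_logprob": -0.08, "no_speech_prob": 0.02},
    {"avg_logprob": -0.14, "no_speech_prob": 0.05},
    {"avg_logprob": -2.50, "no_speech_prob": 0.95},  # silence: excluded
]
kept = filter_segments(segments)
print(len(kept), round(overall_confidence(kept), 2))
```

Because the score is exp() of an average log probability, it stays in the 0–1 range and degrades smoothly as per-segment probabilities drop.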

Transcription Response Structure

{
  "text": "Paciente relata dor abdominal há três dias...",
  "segments": [
    {
      "start": 0.0,
      "end": 3.2,
      "text": "Paciente relata dor abdominal",
      "confidence": 0.92
    },
    {
      "start": 3.2,
      "end": 6.1,
      "text": "há três dias com piora progressiva",
      "confidence": 0.87
    }
  ],
  "confidence": 0.89,
  "language": "pt"
}

Audio Format Requirements

All voice endpoints enforce strict audio validation before processing.

Supported Formats

| Format | MIME Type | Notes |
|---|---|---|
| OGG/Opus | audio/ogg | Common for WhatsApp voice messages |
| WAV | audio/wav | Optimal format for transcription |
| MP3 | audio/mpeg | Widely supported, lossy compression |
| WebM | audio/webm | Browser recording format |

Constraints

| Parameter | Limit |
|---|---|
| Maximum file size | 10 MB |
| Maximum duration | 60 seconds |
| Optimal format | WAV, 16 kHz, mono, PCM 16-bit |

Validation Pipeline

Every uploaded audio file passes through a two-stage validation:

  1. Content-Type whitelist — the MIME type in the request header must match a supported format.
  2. Magic byte signature detection — the actual file bytes are inspected to confirm the format matches the declared content type, preventing spoofed uploads.
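The two-stage check can be sketched as follows. DELPHOS's actual signature list is not documented, so the magic numbers below are the standard container signatures for the four supported formats:

```python
def sniff_audio_format(data: bytes):
    """Return the detected format, or None if no supported signature matches."""
    if data[:4] == b"OggS":
        return "ogg"
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    if data[:3] == b"ID3" or (len(data) > 1 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0):
        return "mp3"  # ID3 tag or a bare MPEG frame sync
    if data[:4] == b"\x1a\x45\xdf\xa3":
        return "webm"  # EBML header (Matroska/WebM)
    return None

DECLARED_TO_FORMAT = {
    "audio/ogg": "ogg", "audio/wav": "wav",
    "audio/mpeg": "mp3", "audio/webm": "webm",
}

def validate_upload(content_type: str, data: bytes) -> bool:
    """Stage 1: Content-Type whitelist. Stage 2: signature must match it."""
    expected = DECLARED_TO_FORMAT.get(content_type)
    return expected is not None and sniff_audio_format(data) == expected
```

The second stage is what defeats spoofed uploads: a file declared as audio/wav whose bytes start with an OGG signature is rejected even though its MIME type is whitelisted.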

Consultation Transcription (Chunked Upload)

For real-time transcription during medical consultations, DELPHOS provides a chunked upload protocol. Audio is captured progressively and transcribed as it arrives, allowing physicians to see text appearing in near-real-time.

Session Lifecycle

The consultation transcription follows a five-step lifecycle:

Start Session --> Send Audio Chunks --> (repeat) --> Final Chunk --> SOAP Note Generated

1. Start a Session

POST /v1/consultation/start

Creates a new streaming transcription session and returns a session_id.

2. Submit Audio Chunks

POST /v1/consultation/chunk

Send audio data progressively as the consultation proceeds.

{
  "session_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "chunk_sequence": 1,
  "audio_base64": "UklGRi4AAABXQVZFZm10IBAAAA...",
  "audio_format": "wav",
  "is_final": false
}

3. Check Session Status

GET /v1/consultation/{session_id}/status

Returns the current session state: active, processing, completed, or error.

4. Retrieve Accumulated Transcript

GET /v1/consultation/{session_id}/transcript

Returns the full transcript assembled from all chunks received so far. Useful for UI synchronization if the client missed a chunk response.

5. End Session

POST /v1/consultation/end
| Field | Type | Required | Description |
|---|---|---|---|
| session_id | UUID | Yes | The session to close |

Explicitly ends the session and triggers SOAP note generation if not already triggered by a final chunk.
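The lifecycle can be driven by a small client-side helper. A hedged sketch follows; BASE_URL, post(), and capture_audio() in the commented portion are hypothetical stand-ins for your HTTP client and audio source:

```python
import base64

def build_chunk(session_id: str, sequence: int, audio_bytes: bytes,
                audio_format: str = "wav", is_final: bool = False) -> dict:
    """Assemble the JSON body for POST /v1/consultation/chunk."""
    return {
        "session_id": session_id,
        "chunk_sequence": sequence,
        "audio_base64": base64.b64encode(audio_bytes).decode("ascii"),
        "audio_format": audio_format,
        "is_final": is_final,
    }

# Lifecycle sketch (BASE_URL, post, and capture_audio are hypothetical):
# sid = post(f"{BASE_URL}/v1/consultation/start")["session_id"]
# for i, chunk in enumerate(capture_audio(), start=1):
#     post(f"{BASE_URL}/v1/consultation/chunk", json=build_chunk(sid, i, chunk))
# post(f"{BASE_URL}/v1/consultation/end", json={"session_id": sid})
```

Keeping chunk_sequence monotonically increasing on the client side lets the server detect gaps, which is also when the accumulated-transcript endpoint is useful for resynchronization.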


Text-to-Speech (TTS)

The TTS engine synthesizes natural speech from text input, with configurable voice personas, speed control, and SSML prosody support.

Endpoint

POST /v1/voice/synthesize
{
  "text": "Sua consulta está confirmada para amanhã às 14 horas.",
  "voice_persona": "receptionist",
  "speed": 1.0,
  "format": "mp3",
  "stream": false
}

Voice Personas

DELPHOS ships with three pre-configured voice personas, each tuned for a specific interaction context:

| Persona | Identifier | Characteristics | Use Case |
|---|---|---|---|
| Clinical Assistant | clinical_assistant | Male voice, deeper pitch (-15%), slightly slower rate (-5%), soft volume | Clinical interactions, reading back patient information to physicians |
| Receptionist | receptionist | Female voice, warm natural tone, standard rate | Scheduling confirmations, WhatsApp voice messages, patient-facing interactions |
| Presentation | presentation | Male voice, standard professional tone | System announcements, presentations, formal communications |

Streaming vs. Non-Streaming

| Mode | Timeout | Behavior |
|---|---|---|
| Streaming (stream: true) | 5 seconds per chunk | Audio is returned via chunked transfer encoding as it is generated. Ideal for real-time playback in phone calls. |
| Non-streaming (stream: false) | 30 seconds total | The complete audio file is generated and returned in a single response. Suitable for pre-generated messages. |
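As a sketch of how a client might build the request and pick a timeout for each mode (persona identifiers and timeout values are taken from the tables above; the requests-based snippet in the comment is illustrative, and BASE_URL is an assumption):

```python
VALID_PERSONAS = {"clinical_assistant", "receptionist", "presentation"}

def build_tts_request(text: str, persona: str = "receptionist",
                      speed: float = 1.0, fmt: str = "mp3",
                      stream: bool = False):
    """Return (payload, timeout_seconds) for POST /v1/voice/synthesize.
    Timeouts mirror the documented limits: 5 s per chunk when streaming,
    30 s total otherwise."""
    if persona not in VALID_PERSONAS:
        raise ValueError(f"unknown persona: {persona}")
    payload = {"text": text, "voice_persona": persona,
               "speed": speed, "format": fmt, "stream": stream}
    return payload, 5 if stream else 30

# Streaming playback sketch (requests and BASE_URL are assumptions):
# payload, timeout = build_tts_request("Olá!", stream=True)
# with requests.post(f"{BASE_URL}/v1/voice/synthesize", json=payload,
#                    stream=True, timeout=timeout) as r:
#     for chunk in r.iter_content(chunk_size=4096):
#         audio_sink.write(chunk)  # feed the audio sink as chunks arrive
```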

Voice Pipeline for Scheduling

DELPHOS also exposes a dedicated voice endpoint for appointment scheduling via phone calls and WhatsApp.

Endpoint

POST /v1/scheduling/voice
Content-Type: multipart/form-data
| Parameter | Type | Required | Description |
|---|---|---|---|
| audio | file | Yes | Audio file (multipart upload) |
| patient_id | UUID | Yes | Patient identifier |
| session_id | UUID | No | Existing session to continue (for multi-turn conversations) |
| channel | string | No | Source channel: whatsapp, phone, or web |

Pipeline Steps

  1. Validate audio — format, size, and duration checks
  2. Convert to WAV — normalize to 16 kHz mono PCM via in-memory ffmpeg
  3. Transcribe — speech-to-text with confidence scoring
  4. NLP extraction — the AI Engine extracts scheduling intent, dates, times, and patient references
  5. Scheduling action — the extracted intent is routed to the appropriate scheduling operation
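Step 2 can be implemented with ffmpeg driven entirely over stdin/stdout pipes, which is also what underpins the "no temporary files" guarantee in the LGPD section. The flags below are standard ffmpeg options; DELPHOS's exact invocation is not documented, so treat this as a sketch:

```python
import subprocess

def ffmpeg_wav_cmd() -> list:
    """ffmpeg argv that reads from stdin and writes 16 kHz mono 16-bit
    PCM WAV to stdout, so no audio ever touches the filesystem."""
    return ["ffmpeg", "-hide_banner", "-loglevel", "error",
            "-i", "pipe:0",        # read input from stdin
            "-ar", "16000",        # resample to 16 kHz
            "-ac", "1",            # downmix to mono
            "-c:a", "pcm_s16le",   # 16-bit PCM
            "-f", "wav", "pipe:1"] # write WAV container to stdout

def convert_to_wav(audio_bytes: bytes) -> bytes:
    """Pipe the upload through ffmpeg entirely in memory."""
    result = subprocess.run(ffmpeg_wav_cmd(), input=audio_bytes,
                            capture_output=True, check=True)
    return result.stdout
```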

For the complete scheduling flow including intent extraction, confirmation dialogs, and multi-turn conversations, see Voice-Enabled Scheduling.


LGPD Audio Compliance

All audio processing in DELPHOS is designed with LGPD (Lei Geral de Proteção de Dados) compliance as a first-class requirement. Patient voice data is sensitive personal data under LGPD Article 5, and receives the following protections:

| Measure | Implementation |
|---|---|
| In-memory processing only | Audio is never written to disk. All conversion and transcription operates on in-memory buffers. |
| No temporary files | ffmpeg is invoked with stdin/stdout pipes, eliminating filesystem exposure. |
| Immediate deletion | Audio buffers are released immediately after transcription completes. |
| Log sanitization | Audio metadata (file names, sizes, durations) is sanitized in application logs. No audio content is ever logged. |
| Buffer cleanup | In-memory file handles are explicitly closed after the audio bytes are read, ensuring memory is reclaimed promptly. |
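A sanitizing log filter along these lines would satisfy the metadata rule; the regex patterns are illustrative assumptions, not DELPHOS's actual implementation:

```python
import re

def sanitize_audio_metadata(message: str) -> str:
    """Redact audio file names and sizes before a log line is emitted,
    so audio metadata never reaches persistent logs."""
    message = re.sub(r"\b[\w.-]+\.(?:ogg|wav|mp3|webm)\b", "[audio-file]", message)
    message = re.sub(r"\b\d+(?:\.\d+)?\s*(?:KB|MB|bytes)\b", "[size]",
                     message, flags=re.IGNORECASE)
    return message

print(sanitize_audio_metadata("received consulta_123.ogg (2.4 MB)"))
```

In practice this would hang off the logging framework (e.g. a logging.Filter subclass) so every handler applies it uniformly.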

Error Handling

All voice endpoints return structured error responses. The following table summarizes the expected error codes:

| HTTP Status | Condition | Response Body |
|---|---|---|
| 400 | Invalid audio format | { "detail": "Unsupported audio format. Supported: ogg, wav, mp3, webm" } |
| 400 | Duration exceeds limit | { "detail": "Audio duration exceeds maximum of 60 seconds" } |
| 413 | File size exceeds limit | { "detail": "Audio file exceeds maximum size of 10 MB" } |
| 429 | Concurrency limit reached | { "detail": "Voice processing at capacity. Please retry." } with Retry-After header |
| 503 | Transcription service unavailable | { "detail": "Transcription service is currently unavailable" } |
Error bodies may also carry a machine-readable error_code alongside the human-readable detail:

{
  "detail": "Unsupported audio format. Supported: ogg, wav, mp3, webm",
  "error_code": "INVALID_AUDIO_FORMAT"
}
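On the client side, the 429 path can be handled by honoring Retry-After. A minimal sketch, assuming responses represented as plain dicts (a stand-in for whatever HTTP client is in use):

```python
import time

def retry_after_seconds(headers: dict, default: float = 1.0) -> float:
    """Parse the Retry-After header (seconds form) from a 429 response."""
    try:
        return max(0.0, float(headers.get("Retry-After")))
    except (TypeError, ValueError):
        return default

def call_with_backoff(send, max_attempts: int = 3):
    """Retry a voice request while the service answers 429."""
    for attempt in range(max_attempts):
        response = send()
        if response["status"] != 429:
            return response
        if attempt < max_attempts - 1:
            time.sleep(retry_after_seconds(response.get("headers", {})))
    return response
```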

Integration Checklist

Before enabling voice features in a DELPHOS deployment, verify the following:

  • Transcription Engine is deployed and accessible from the API server
  • GPU server has sufficient VRAM allocated for the transcription model
  • ffmpeg is available in the API container image (bundled by default; required for audio conversion)
  • Network connectivity between the API layer and the Transcription Engine is confirmed
  • TTS engine is configured with the desired voice personas
  • Concurrency limits are tuned for expected call volume (default: 5 concurrent voice requests)
  • Application logs are configured to sanitize audio metadata per LGPD requirements