Real-time Transcription
DELPHOS provides real-time audio transcription during consultations. Audio
is sent in chunks, each transcribed by Transcriptor DELPHOS and
appended to the session transcript. Diarizador DELPHOS runs in
parallel and labels each utterance as DOCTOR, PATIENT, or UNKNOWN
— producing an enriched, speaker-attributed transcript suitable for
rendering a conversation UI.
How It Works
```text
Client                        DELPHOS
  │                            │
  │── POST /chunk (audio 1) ──→│  Transcribe → Append
  │←── transcription text ─────│
  │                            │
  │── POST /chunk (audio 2) ──→│  Transcribe → Append
  │←── transcription text ─────│
  │                            │
  │── POST /chunk (final) ────→│  Transcribe → Generate SOAP
  │←── full result ────────────│
```

Each chunk is processed independently. The client can display progressive transcription results to the physician as they arrive.
Sending Audio Chunks
```bash
curl -X POST "https://your-instance.delphos.app/v1/consultation/chunk" \
  -H "x-api-key: $DELPHOS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "session_id": "550e8400-e29b-41d4-a716-446655440000",
    "chunk_sequence": 1,
    "audio_base64": "UklGRiQAAABXQVZFZm10IBAAAAABAAEAQB8AAIA...",
    "audio_format": "wav",
    "is_final": false
  }'
```

```python
import base64
import os

import httpx

# The API key is read from the environment, mirroring the curl example
DELPHOS_API_KEY = os.environ["DELPHOS_API_KEY"]
# Session ID returned by /consultation/start
session_id = "550e8400-e29b-41d4-a716-446655440000"

# Read audio from microphone or file
with open("chunk_001.wav", "rb") as f:
    audio_bytes = f.read()

response = httpx.post(
    "https://your-instance.delphos.app/v1/consultation/chunk",
    headers={"x-api-key": DELPHOS_API_KEY},
    json={
        "session_id": session_id,
        "chunk_sequence": 1,
        "audio_base64": base64.b64encode(audio_bytes).decode(),
        "audio_format": "wav",
        "is_final": False,
    },
)
response.raise_for_status()  # Fail early on 404, 409, or 503

result = response.json()
print(f"Transcription: {result['transcription']['text']}")
print(f"Duration: {result['transcription']['duration_seconds']}s")
```

Request Parameters
| Field | Type | Required | Description |
|---|---|---|---|
| session_id | uuid | Yes | Active session ID from /consultation/start |
| chunk_sequence | integer | No | Sequence number (auto-incremented if omitted) |
| audio_base64 | string | Yes | Base64-encoded audio data |
| audio_format | string | No | Audio format (default: wav) |
| is_final | boolean | No | Set true on the last chunk to trigger SOAP generation |
Supported Audio Formats
| Format | Extension | Notes |
|---|---|---|
| WAV | .wav | Recommended — lossless, best transcription quality |
| MP3 | .mp3 | Compressed, good quality |
| WebM | .webm | Common in web browsers |
| OGG | .ogg | Open format |
| FLAC | .flac | Lossless compression |
| M4A | .m4a | Apple format |
| AAC | .aac | Advanced Audio Coding |
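Since WAV is the recommended format, a recording usually has to be cut into chunks before upload. The sketch below is one way to do that with Python's standard `wave` module. It assumes each chunk should be a standalone WAV file with its own header (as the `chunk_001.wav` example suggests); `split_wav` and the 5-second default are illustrative choices, not part of the DELPHOS API.

```python
import io
import wave


def split_wav(wav_bytes: bytes, chunk_seconds: float = 5.0) -> list[bytes]:
    """Split one WAV recording into standalone WAV chunks (hypothetical helper)."""
    chunks: list[bytes] = []
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        params = src.getparams()
        # Number of audio frames that make up one chunk (at least one frame)
        frames_per_chunk = max(1, int(params.framerate * chunk_seconds))
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            # Write the frames into a fresh in-memory WAV with its own header
            buf = io.BytesIO()
            with wave.open(buf, "wb") as dst:
                dst.setnchannels(params.nchannels)
                dst.setsampwidth(params.sampwidth)
                dst.setframerate(params.framerate)
                dst.writeframes(frames)
            chunks.append(buf.getvalue())
    return chunks
```

Each element of the returned list can be base64-encoded and sent as `audio_base64`, with `is_final` set on the last one. The final chunk may be shorter than `chunk_seconds`.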
Response
```json
{
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "chunk_id": "660e8400-e29b-41d4-a716-446655440001",
  "chunk_sequence": 1,
  "transcription": {
    "text": "Médico: Bom dia, como está se sentindo hoje?",
    "segments": [
      {
        "start": 0.0,
        "end": 3.5,
        "text": "Bom dia, como está se sentindo hoje?",
        "speaker": "DOCTOR"
      }
    ],
    "duration_seconds": 3.5
  },
  "session_state": "ACTIVE",
  "total_chunks": 1,
  "total_transcript_length": 47,
  "processing_time_ms": 145.3,
  "message": null,
  "soap_generation": null
}
```

Speaker Diarization
When diarize: true is set at session start (the default for audio
consultations), DELPHOS runs Diarizador DELPHOS
(pyannote/speaker-diarization-3.1 on K.A.R.R.-01) in parallel with
Transcriptor DELPHOS on every audio chunk. The two outputs are
aligned server-side via temporal overlap, and raw speaker IDs are
mapped to clinical roles using the first-speaker heuristic:
| Heuristic | Mapping | Rationale |
|---|---|---|
| Speaker whose first segment starts earliest in the consultation | DOCTOR | In standard outpatient consultations the doctor initiates the conversation (“Bom dia, como está?”) before the patient responds |
| Second unique speaker | PATIENT | The patient responds after the doctor’s opening |
| Third and later speakers | UNKNOWN | Family members, nurses, or any additional speakers — safe-degrade, render as unattributed |
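The alignment and role mapping described above can be sketched in a few lines. This is an illustration of the idea only, not the server's implementation: the `Turn` type, the function names, and the `SPEAKER_xx` IDs in the usage note are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One diarization turn: a raw speaker ID active from start to end (seconds)."""
    start: float
    end: float
    speaker_id: str


def map_roles(turns: list[Turn]) -> dict[str, str]:
    """First-speaker heuristic: earliest speaker is DOCTOR, the next is PATIENT."""
    roles: dict[str, str] = {}
    for turn in sorted(turns, key=lambda t: t.start):
        if turn.speaker_id not in roles:
            order = len(roles)  # 0 for the first unique speaker, 1 for the second
            roles[turn.speaker_id] = ("DOCTOR", "PATIENT")[order] if order < 2 else "UNKNOWN"
    return roles


def label_segment(seg_start: float, seg_end: float,
                  turns: list[Turn], roles: dict[str, str]) -> str:
    """Label a transcript segment with the role that overlaps it the longest."""
    overlap_by_speaker: dict[str, float] = {}
    for turn in turns:
        overlap = min(seg_end, turn.end) - max(seg_start, turn.start)
        if overlap > 0:
            overlap_by_speaker[turn.speaker_id] = (
                overlap_by_speaker.get(turn.speaker_id, 0.0) + overlap
            )
    if not overlap_by_speaker:
        return "UNKNOWN"  # Segment has zero diarization overlap
    best = max(overlap_by_speaker, key=lambda s: overlap_by_speaker[s])
    return roles.get(best, "UNKNOWN")
```

For example, with turns `SPEAKER_01` at 0.0 to 3.5 s and `SPEAKER_00` at 3.6 to 8.0 s, `map_roles` assigns DOCTOR to `SPEAKER_01` because it speaks first, regardless of the numeric ID, and a transcript segment spanning 3.4 to 7.0 s is labeled PATIENT because most of it overlaps the second speaker.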
Speaker label values
| Value | Meaning |
|---|---|
| "DOCTOR" | First speaker by audio-clock (heuristic) |
| "PATIENT" | Second unique speaker by audio-clock (heuristic) |
| "UNKNOWN" | Third+ speakers, or transcript segments with zero diarization overlap (rare; near-silent segments) |
| null | Diarization disabled for this session (diarize: false) OR Diarizador DELPHOS was degraded for this chunk (graceful failure: the transcript still flows, labels are skipped) |
The diarization is performed per-chunk. The segments array in each
chunk response includes timestamps and speaker labels, enabling your UI
to render conversation bubbles incrementally as chunks arrive — no need
to wait for end-of-session.
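One possible way to build conversation bubbles incrementally is to fold each chunk's `segments` into a running list, merging consecutive segments from the same speaker. The helper below is a hypothetical client-side sketch; the bubble dictionary shape is an assumption, and it copies timestamps as received without assuming whether they are relative to the chunk or to the session.

```python
def append_to_bubbles(bubbles: list[dict], segments: list[dict]) -> list[dict]:
    """Fold one chunk's segments into the running bubble list (modified in place)."""
    for seg in segments:
        speaker = seg.get("speaker")  # May be None when diarization is off or degraded
        if bubbles and bubbles[-1]["speaker"] == speaker:
            # Same speaker keeps talking: extend the current bubble
            bubbles[-1]["text"] += " " + seg["text"]
            bubbles[-1]["end"] = seg["end"]
        else:
            # Speaker changed: start a new bubble
            bubbles.append({
                "speaker": speaker,
                "text": seg["text"],
                "start": seg["start"],
                "end": seg["end"],
            })
    return bubbles
```

Calling it once per chunk response with `result["transcription"]["segments"]` keeps the bubble list current as audio arrives.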
Enriched Transcript
When you fetch the session status (GET /v1/consultation/{session_id}/status),
the SessionStatusResponse payload includes both the plain-text accumulated
transcript and an optional enriched transcript with full speaker
attribution.
Field shape
| Field | Type | Description |
|---|---|---|
| current_transcript | string | Full plain-text transcript (always present) |
| enriched_transcript | list[TranscriptionSegment] \| null | Speaker-labeled segments (when diarization is on) |
| enriched_transcript[].start | float | Segment start in seconds (audio-clock) |
| enriched_transcript[].end | float | Segment end in seconds (audio-clock) |
| enriched_transcript[].text | string | Verbatim utterance |
| enriched_transcript[].speaker | string \| null | "DOCTOR", "PATIENT", "UNKNOWN", or null |
Backwards-compatible rendering
The enriched_transcript field is OPTIONAL. Pre-diarization sessions
and sessions started with diarize: false always return
enriched_transcript: null. Render gracefully:
```typescript
function renderTranscript(status: SessionStatusResponse) {
  if (status.enriched_transcript && status.enriched_transcript.length > 0) {
    // Render speaker bubbles
    return status.enriched_transcript.map(seg => ({
      speaker: seg.speaker ?? 'UNKNOWN',
      text: seg.text,
      timestamp: seg.start,
    }));
  }
  // Fallback: plain-text transcript without speaker attribution
  return [{ speaker: null, text: status.current_transcript, timestamp: 0 }];
}
```

Hallucination Guards
Speech-to-text models are prone to generating fabricated content on silent or near-silent audio (the “Whisper hallucination” failure mode — repeated phrases like “Obrigado por assistir” or “Legendado por…” on segments with no real speech).
DELPHOS forwards five guard parameters to Transcriptor DELPHOS on every chunk to suppress this failure mode. In our 14-day live audit, these guards reduced the observed hallucination rate from ~60% to ~0% on silence-heavy audio.
| Parameter | Value | What it does |
|---|---|---|
| vad_filter | "true" | Silero VAD trims silence regions before transcription, removing the dominant trigger for hallucinations |
| no_speech_threshold | "0.6" | Rejects low-confidence silence-mislabeled-as-speech segments |
| condition_on_previous_text | "false" | Breaks the auto-regressive cascade where one hallucination conditions the next |
| compression_ratio_threshold | "2.4" | Rejects decodes with abnormal compression ratios (a hallucination signature) |
| log_prob_threshold | "-1.0" | Rejects low-likelihood token sequences |
These are applied automatically — you don’t need to set them on the client side. They’re documented here so you understand the failure modes the platform mitigates on your behalf.
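To make the three numeric thresholds concrete, the sketch below shows how a decoded segment could be checked against them. It is a simplified illustration, not the Transcriptor DELPHOS implementation: the function names are hypothetical, each threshold is treated as an independent rejection rule, and real Whisper-style decoders combine these checks in more nuanced ways (for example, retrying at a higher temperature instead of dropping the segment). The compression ratio is computed as raw byte length over zlib-compressed length, so looping text scores high.

```python
import zlib

NO_SPEECH_THRESHOLD = 0.6
COMPRESSION_RATIO_THRESHOLD = 2.4
LOG_PROB_THRESHOLD = -1.0


def compression_ratio(text: str) -> float:
    """Raw size divided by zlib-compressed size; repetitive text compresses well."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


def looks_hallucinated(text: str, avg_logprob: float, no_speech_prob: float) -> bool:
    """Return True if the segment trips any of the three thresholds (simplified)."""
    if compression_ratio(text) > COMPRESSION_RATIO_THRESHOLD:
        return True  # Repeated phrases, the typical looping signature
    if avg_logprob < LOG_PROB_THRESHOLD:
        return True  # Token sequence the model itself found unlikely
    if no_speech_prob > NO_SPEECH_THRESHOLD:
        return True  # Model believes the audio is mostly silence
    return False
```

A phrase such as "Obrigado por assistir." repeated twenty times compresses far beyond the 2.4 ratio and is flagged, while a single ordinary sentence with a healthy log-probability passes.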
Progressive Transcript
At any time during an active session, retrieve the full accumulated transcript:
```bash
curl -X GET "https://your-instance.delphos.app/v1/consultation/{session_id}/transcript" \
  -H "x-api-key: $DELPHOS_API_KEY"
```

```json
{
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "transcript": "Médico: Bom dia, como está se sentindo hoje?\nPaciente: As dores de cabeça estão menos frequentes..."
}
```

This endpoint is useful for displaying the full conversation in your UI while chunks continue to arrive.
Sending the Final Chunk
Set is_final: true on the last audio chunk to signal the end of
recording. This triggers SOAP note generation:

```json
{
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "audio_base64": "UklGR...",
  "audio_format": "wav",
  "is_final": true
}
```

The response will include a soap_generation field with the status:

```json
{
  "transcription": { "..." },
  "soap_generation": {
    "status": "processing",
    "message": "SOAP note generation started"
  }
}
```

Poll the session status endpoint to check when the SOAP note is ready.
Integration Pattern
A typical integration streams chunks from the client’s microphone:
```python
import base64
import os
import time

import httpx

# The API key is read from the environment, as in the curl examples
DELPHOS_API_KEY = os.environ["DELPHOS_API_KEY"]
CHUNK_DURATION_SECONDS = 5


def stream_consultation(session_id: str, audio_chunks: list[bytes]):
    """Stream audio chunks and display progressive transcription."""
    # The context manager closes the connection pool when streaming ends
    with httpx.Client(
        base_url="https://your-instance.delphos.app/v1",
        headers={"x-api-key": DELPHOS_API_KEY},
        timeout=30.0,
    ) as client:
        for i, chunk in enumerate(audio_chunks):
            is_last = i == len(audio_chunks) - 1

            response = client.post(
                "/consultation/chunk",
                json={
                    "session_id": session_id,
                    "chunk_sequence": i + 1,
                    "audio_base64": base64.b64encode(chunk).decode(),
                    "audio_format": "wav",
                    "is_final": is_last,
                },
            )
            response.raise_for_status()
            result = response.json()

            # Display progressive transcription
            text = result["transcription"]["text"]
            print(f"[Chunk {i+1}] {text}")

        # Poll for SOAP completion (runs once, after the final chunk is sent)
        while True:
            status = client.get(f"/consultation/{session_id}/status").json()
            if status["state"] == "COMPLETED":
                return status["soap_note"]
            if status["state"] == "ERROR":
                raise RuntimeError(status["error"])
            time.sleep(1)
```

Error Handling
| Status | Cause |
|---|---|
| 404 Not Found | Session does not exist |
| 409 Conflict | Session is not in ACTIVE state (already ended or errored) |
| 503 Service Unavailable | Transcription service temporarily unavailable (retry with backoff) |
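The "retry with backoff" advice for 503 can be sketched as a small helper. Everything here is an assumption for illustration: `send_with_backoff`, the attempt count, and the delay schedule are not DELPHOS recommendations. The helper takes any zero-argument callable that returns an object with a `status_code` attribute, which keeps it independent of the HTTP library.

```python
import time
from typing import Any, Callable


def send_with_backoff(
    send: Callable[[], Any],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call send() until it returns a non-503 response or attempts run out."""
    response = None
    for attempt in range(max_attempts):
        response = send()
        if response.status_code != 503:
            return response  # Success, or a 404/409 that retrying will not fix
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 0.5 s, 1 s, 2 s, ...
    return response  # Still 503 after the last attempt; the caller decides what to do
```

With an `httpx.Client` configured as in the integration pattern, `send` could be `lambda: client.post("/consultation/chunk", json=payload)`. Only 503 is retried, because 404 and 409 describe session states that another attempt will not change.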
Next Steps
- SOAP Streaming — Generate structured SOAP notes from the accumulated transcript in real time
- Streaming Prescription Extraction — Extract a safety-checked prescription from the same accumulated transcript in parallel with SOAP
- Consultation Lifecycle — Full session management
- Working with Records — Post-consultation editing
- Clinical Summaries — Patient intelligence