Real-time Transcription

DELPHOS provides real-time audio transcription during consultations. Audio is sent in chunks, each transcribed by Transcriptor DELPHOS and appended to the session transcript. Diarizador DELPHOS runs in parallel and labels each utterance as DOCTOR, PATIENT, or UNKNOWN — producing an enriched, speaker-attributed transcript suitable for rendering a conversation UI.

How It Works

Client                              DELPHOS
  │                                    │
  │── POST /chunk (audio 1) ─────────→│  Transcribe → Append
  │←── transcription text ─────────────│
  │                                    │
  │── POST /chunk (audio 2) ─────────→│  Transcribe → Append
  │←── transcription text ─────────────│
  │                                    │
  │── POST /chunk (final) ────────────→│  Transcribe → Generate SOAP
  │←── full result ────────────────────│

Each chunk is processed independently. The client can display progressive transcription results to the physician as they arrive.


Sending Audio Chunks

curl -X POST "https://your-instance.delphos.app/v1/consultation/chunk" \
  -H "x-api-key: $DELPHOS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "session_id": "550e8400-e29b-41d4-a716-446655440000",
    "chunk_sequence": 1,
    "audio_base64": "UklGRiQAAABXQVZFZm10IBAAAAABAAEAQB8AAIA...",
    "audio_format": "wav",
    "is_final": false
  }'

Request Parameters

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| session_id | uuid | Yes | Active session ID from /consultation/start |
| chunk_sequence | integer | No | Sequence number (auto-incremented if omitted) |
| audio_base64 | string | Yes | Base64-encoded audio data |
| audio_format | string | No | Audio format (default: wav) |
| is_final | boolean | No | Set true on the last chunk to trigger SOAP generation |
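
As an illustration of the fields above, the following is a minimal sketch of a client-side helper that assembles the request body. The function name `build_chunk_payload` and the client-side UUID check are assumptions made for this example; they are not part of the DELPHOS API.

```python
import base64
import uuid
from typing import Optional


def build_chunk_payload(
    session_id: str,
    audio: bytes,
    audio_format: str = "wav",
    chunk_sequence: Optional[int] = None,
    is_final: bool = False,
) -> dict:
    """Assemble the JSON body for POST /consultation/chunk (hypothetical helper)."""
    # Client-side sanity check (an assumption of this sketch):
    # raises ValueError if session_id is not a well-formed UUID.
    uuid.UUID(session_id)
    payload = {
        "session_id": session_id,
        "audio_base64": base64.b64encode(audio).decode("ascii"),
        "audio_format": audio_format,
        "is_final": is_final,
    }
    # chunk_sequence is optional; the server auto-increments it when omitted.
    if chunk_sequence is not None:
        payload["chunk_sequence"] = chunk_sequence
    return payload
```

Leaving `chunk_sequence` out of the body, rather than sending a null value, keeps the request consistent with the "auto-incremented if omitted" behavior described in the table.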

Supported Audio Formats

| Format | Extension | Notes |
| --- | --- | --- |
| WAV | .wav | Recommended: lossless, best transcription quality |
| MP3 | .mp3 | Compressed, good quality |
| WebM | .webm | Common in web browsers |
| OGG | .ogg | Open format |
| FLAC | .flac | Lossless compression |
| M4A | .m4a | Apple format |
| AAC | .aac | Advanced Audio Coding |
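
A client that uploads recorded files may want to derive `audio_format` from the file name and reject unsupported types before making a request. The sketch below assumes that the `audio_format` value is the extension without the leading dot, as with the documented default `wav`; the helper name `audio_format_for` is hypothetical.

```python
from pathlib import Path

# Extensions from the table above, without the leading dot.
SUPPORTED_FORMATS = {"wav", "mp3", "webm", "ogg", "flac", "m4a", "aac"}


def audio_format_for(path: str) -> str:
    """Return the audio_format value for a file path, or raise ValueError."""
    ext = Path(path).suffix.lower().lstrip(".")
    if ext not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported audio format: {ext or '(no extension)'}")
    return ext
```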

Response

{
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "chunk_id": "660e8400-e29b-41d4-a716-446655440001",
  "chunk_sequence": 1,
  "transcription": {
    "text": "Médico: Bom dia, como está se sentindo hoje?",
    "segments": [
      {
        "start": 0.0,
        "end": 3.5,
        "text": "Bom dia, como está se sentindo hoje?",
        "speaker": "DOCTOR"
      }
    ],
    "duration_seconds": 3.5
  },
  "session_state": "ACTIVE",
  "total_chunks": 1,
  "total_transcript_length": 47,
  "processing_time_ms": 145.3,
  "message": null,
  "soap_generation": null
}

Speaker Diarization

When diarize: true is set at session start (the default for audio consultations), DELPHOS runs Diarizador DELPHOS (pyannote/speaker-diarization-3.1 on K.A.R.R.-01) in parallel with Transcriptor DELPHOS on every audio chunk. The two outputs are aligned server-side via temporal overlap, and raw speaker IDs are mapped to clinical roles using the first-speaker heuristic:

| Heuristic | Mapping | Rationale |
| --- | --- | --- |
| Speaker whose first segment starts earliest in the consultation | DOCTOR | In standard outpatient consultations the doctor initiates the conversation (“Bom dia, como está?”) before the patient responds |
| Second unique speaker | PATIENT | The patient responds after the doctor’s opening |
| Third and later speakers | UNKNOWN | Family members, nurses, or any additional speakers; safe-degrade, render as unattributed |
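
The heuristic and the overlap-based alignment can be sketched as follows. This is an illustration under stated assumptions, not the server implementation: the `Turn` type, the function names, the raw IDs such as `SPEAKER_00`, and the choice of the speaker with the largest total overlap are all hypothetical, since the documentation only states that alignment uses temporal overlap.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One diarization turn: a raw speaker ID over a time span in seconds."""
    speaker_id: str
    start: float
    end: float


ROLE_ORDER = ("DOCTOR", "PATIENT")  # first and second unique speakers


def map_roles(turns: list[Turn]) -> dict[str, str]:
    """First-speaker heuristic: rank speakers by their earliest start time."""
    roles: dict[str, str] = {}
    for turn in sorted(turns, key=lambda t: t.start):
        if turn.speaker_id not in roles:
            rank = len(roles)
            roles[turn.speaker_id] = ROLE_ORDER[rank] if rank < len(ROLE_ORDER) else "UNKNOWN"
    return roles


def label_segment(seg_start: float, seg_end: float,
                  turns: list[Turn], roles: dict[str, str]) -> str:
    """Label a transcript segment with the role that overlaps it the most."""
    overlap_by_speaker: dict[str, float] = {}
    for turn in turns:
        overlap = min(seg_end, turn.end) - max(seg_start, turn.start)
        if overlap > 0:
            overlap_by_speaker[turn.speaker_id] = (
                overlap_by_speaker.get(turn.speaker_id, 0.0) + overlap
            )
    if not overlap_by_speaker:
        return "UNKNOWN"  # zero diarization overlap
    best = max(overlap_by_speaker, key=overlap_by_speaker.get)
    return roles[best]
```

Because roles are assigned by earliest start time rather than by raw ID, the mapping does not depend on how the diarizer happens to number its speakers.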

Speaker label values

| Value | Meaning |
| --- | --- |
| "DOCTOR" | First speaker by audio-clock (heuristic) |
| "PATIENT" | Second unique speaker by audio-clock (heuristic) |
| "UNKNOWN" | Third+ speakers, or transcript segments with zero diarization overlap (rare; near-silent segments) |
| null | Diarization disabled for this session (diarize: false) OR Diarizador DELPHOS was degraded for this chunk (graceful failure: the transcript still flows, labels are skipped) |

Diarization is performed per chunk. The segments array in each chunk response includes timestamps and speaker labels, so your UI can render conversation bubbles incrementally as chunks arrive, without waiting for the end of the session.
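
One way to build bubbles incrementally is to fold each chunk's segments into a running list, merging consecutive segments from the same speaker. The sketch below is an assumption about client-side rendering, not a DELPHOS API; `append_to_bubbles` is a hypothetical helper. Whether segment timestamps are relative to the chunk or to the whole session is not stated here, so the sketch stores them as received.

```python
def append_to_bubbles(bubbles: list[dict], segments: list[dict]) -> list[dict]:
    """Fold one chunk's segments into the running bubble list (modified in place)."""
    for seg in segments:
        speaker = seg.get("speaker")  # may be None when diarization is off or degraded
        if bubbles and bubbles[-1]["speaker"] == speaker:
            # Same speaker as the previous bubble: extend it instead of starting a new one.
            bubbles[-1]["text"] += " " + seg["text"]
            bubbles[-1]["end"] = seg["end"]
        else:
            bubbles.append({
                "speaker": speaker,
                "start": seg["start"],
                "end": seg["end"],
                "text": seg["text"],
            })
    return bubbles
```

Calling this once per chunk response with `result["transcription"]["segments"]` keeps the UI state in a single list that only ever grows or extends its last entry.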


Enriched Transcript

When you fetch the session status (GET /v1/consultation/{session_id}/status), the SessionStatusResponse payload includes both the plain-text accumulated transcript and an optional enriched transcript with full speaker attribution.

Field shape

| Field | Type | Description |
| --- | --- | --- |
| current_transcript | string | Full plain-text transcript (always present) |
| enriched_transcript | list[TranscriptionSegment] \| null | Speaker-labeled segments (when diarization is on) |
| enriched_transcript[].start | float | Segment start in seconds (audio-clock) |
| enriched_transcript[].end | float | Segment end in seconds (audio-clock) |
| enriched_transcript[].text | string | Verbatim utterance |
| enriched_transcript[].speaker | string \| null | "DOCTOR", "PATIENT", "UNKNOWN", or null |
Backwards-compatible rendering

The enriched_transcript field is OPTIONAL. Pre-diarization sessions and sessions started with diarize: false always return enriched_transcript: null. Render gracefully:

function renderTranscript(status: SessionStatusResponse) {
  if (status.enriched_transcript && status.enriched_transcript.length > 0) {
    // Render speaker bubbles
    return status.enriched_transcript.map(seg => ({
      speaker: seg.speaker ?? 'UNKNOWN',
      text: seg.text,
      timestamp: seg.start,
    }));
  }
  // Fallback: plain-text transcript without speaker attribution
  return [{ speaker: null, text: status.current_transcript, timestamp: 0 }];
}

Hallucination Guards

Speech-to-text models are prone to generating fabricated content on silent or near-silent audio (the “Whisper hallucination” failure mode — repeated phrases like “Obrigado por assistir” or “Legendado por…” on segments with no real speech).

DELPHOS forwards five guard parameters to Transcriptor DELPHOS on every chunk to suppress this failure mode. In our 14-day live audit, these guards reduced the observed hallucination rate from ~60% to ~0% on silence-heavy audio.

| Parameter | Value | What it does |
| --- | --- | --- |
| vad_filter | "true" | Silero VAD trims silence regions before transcription, removing the dominant trigger for hallucinations |
| no_speech_threshold | "0.6" | Rejects low-confidence silence-mislabeled-as-speech segments |
| condition_on_previous_text | "false" | Breaks the auto-regressive cascade where one hallucination conditions the next |
| compression_ratio_threshold | "2.4" | Rejects decodes with abnormal compression ratios (a hallucination signature) |
| log_prob_threshold | "-1.0" | Rejects low-likelihood token sequences |

These are applied automatically — you don’t need to set them on the client side. They’re documented here so you understand the failure modes the platform mitigates on your behalf.
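
To make the compression-ratio guard concrete, the sketch below computes the ratio the way open-source Whisper implementations define it: UTF-8 byte length divided by zlib-compressed length. That Transcriptor DELPHOS uses exactly this definition is an assumption; the function names are hypothetical, and the platform applies the real check server-side.

```python
import zlib

COMPRESSION_RATIO_THRESHOLD = 2.4  # value from the table above


def compression_ratio(text: str) -> float:
    """Ratio of raw UTF-8 size to zlib-compressed size; repetitive text scores high."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


def looks_repetitive(text: str) -> bool:
    """True when the text compresses suspiciously well, as looping hallucinations do."""
    return compression_ratio(text) > COMPRESSION_RATIO_THRESHOLD
```

A looping decode such as the same short phrase repeated twenty times compresses far better than ordinary speech, which is why the ratio works as a hallucination signature.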


Progressive Transcript

At any time during an active session, retrieve the full accumulated transcript:

curl -X GET "https://your-instance.delphos.app/v1/consultation/{session_id}/transcript" \
  -H "x-api-key: $DELPHOS_API_KEY"

{
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "transcript": "Médico: Bom dia, como está se sentindo hoje?\nPaciente: As dores de cabeça estão menos frequentes..."
}

This endpoint is useful for displaying the full conversation in your UI while chunks continue to arrive.


Sending the Final Chunk

Set is_final: true on the last audio chunk to signal the end of recording. This triggers SOAP note generation:

{
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "audio_base64": "UklGR...",
  "audio_format": "wav",
  "is_final": true
}

The response will include a soap_generation field with the status:

{
  "transcription": { "..." },
  "soap_generation": {
    "status": "processing",
    "message": "SOAP note generation started"
  }
}

Poll the session status endpoint to check when the SOAP note is ready.
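
Polling is safer with an upper bound on how long the client waits. The following is a minimal sketch, assuming the status payload exposes a `state` field with the values `COMPLETED` and `ERROR` (the names used in the Integration Pattern on this page). The helper `wait_for_soap`, its parameters, and its default timeout are hypothetical.

```python
import time
from typing import Callable


def wait_for_soap(
    fetch_status: Callable[[], dict],
    timeout_s: float = 120.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll fetch_status() until the session completes, errors, or times out."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        state = status.get("state")
        if state == "COMPLETED":
            return status
        if state == "ERROR":
            raise RuntimeError(status.get("error", "SOAP generation failed"))
        if time.monotonic() >= deadline:
            raise TimeoutError(f"SOAP note not ready after {timeout_s} seconds")
        sleep(interval_s)
```

With an `httpx.Client` configured as in the Integration Pattern, `fetch_status` could be `lambda: client.get(f"/consultation/{session_id}/status").json()`. Passing the fetcher in as a callable keeps the loop independent of any particular HTTP library.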


Integration Pattern

A typical integration streams chunks from the client’s microphone:

import base64
import os
import time

import httpx

CHUNK_DURATION_SECONDS = 5  # duration of each captured audio chunk
DELPHOS_API_KEY = os.environ["DELPHOS_API_KEY"]  # read the key from the environment


def stream_consultation(session_id: str, audio_chunks: list[bytes]):
    """Stream audio chunks and display progressive transcription."""
    client = httpx.Client(
        base_url="https://your-instance.delphos.app/v1",
        headers={"x-api-key": DELPHOS_API_KEY},
        timeout=30.0,
    )
    for i, chunk in enumerate(audio_chunks):
        # Only the last chunk is flagged final; that triggers SOAP generation.
        is_last = i == len(audio_chunks) - 1
        response = client.post(
            "/consultation/chunk",
            json={
                "session_id": session_id,
                "chunk_sequence": i + 1,
                "audio_base64": base64.b64encode(chunk).decode(),
                "audio_format": "wav",
                "is_final": is_last,
            },
        )
        response.raise_for_status()
        result = response.json()
        # Display progressive transcription
        text = result["transcription"]["text"]
        print(f"[Chunk {i+1}] {text}")
    # Poll for SOAP completion
    while True:
        status = client.get(f"/consultation/{session_id}/status").json()
        if status["state"] == "COMPLETED":
            return status["soap_note"]
        if status["state"] == "ERROR":
            raise RuntimeError(status["error"])
        time.sleep(1)
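
The function above takes a list of ready-made chunks. For a pre-recorded file, one way to produce them is to split the recording into standalone WAV files of a fixed duration with the standard-library `wave` module, so that every chunk carries its own valid header. This is a sketch for uncompressed PCM WAV input only; `split_wav` is a hypothetical helper, and a live microphone integration would capture chunks directly instead.

```python
import io
import wave


def split_wav(wav_bytes: bytes, chunk_seconds: float = 5.0) -> list[bytes]:
    """Split one WAV recording into standalone WAV chunks of at most chunk_seconds."""
    chunks: list[bytes] = []
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        params = src.getparams()
        frames_per_chunk = int(src.getframerate() * chunk_seconds)
        if frames_per_chunk <= 0:
            raise ValueError("chunk_seconds is too small for this sample rate")
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            buf = io.BytesIO()
            with wave.open(buf, "wb") as dst:
                # Copy channels, sample width and rate; the frame count in the
                # header is corrected automatically when the writer closes.
                dst.setparams(params)
                dst.writeframes(frames)
            chunks.append(buf.getvalue())
    return chunks
```

The result can be passed straight to `stream_consultation(session_id, split_wav(data, CHUNK_DURATION_SECONDS))`.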

Error Handling

| Status | Cause |
| --- | --- |
| 404 Not Found | Session does not exist |
| 409 Conflict | Session is not in ACTIVE state (already ended or errored) |
| 503 Service Unavailable | Transcription service temporarily unavailable (retry with backoff) |
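
The "retry with backoff" advice for 503 responses can be sketched as a small wrapper around whatever function sends the chunk. The helper name, the attempt count, and the delays are assumptions chosen for illustration. Whether resending the same chunk is idempotent on the server is not stated here, so sending an explicit `chunk_sequence` with each attempt is a reasonable precaution.

```python
import time
from typing import Callable


def send_with_backoff(
    send: Callable[[], tuple[int, dict]],
    max_attempts: int = 4,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[int, dict]:
    """Call send() until it returns a non-503 status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status_code, body = send()
        if status_code != 503:
            # Success, or an error such as 404/409 that retrying will not fix.
            return status_code, body
        if attempt < max_attempts - 1:
            sleep(base_delay_s * 2 ** attempt)  # 0.5 s, 1 s, 2 s, ...
    return status_code, body
```

Here `send` is expected to return the HTTP status code and the parsed JSON body; 404 and 409 are returned to the caller immediately because they indicate a problem with the session rather than a transient outage.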
