Real-time Transcription

DELPHOS provides real-time audio transcription during consultations. Audio is sent in chunks, each transcribed by Transcriptor DELPHOS and appended to the session transcript. Diarizador DELPHOS runs in parallel and labels each utterance as DOCTOR, PATIENT, or UNKNOWN — producing an enriched, speaker-attributed transcript suitable for rendering a conversation UI.

How It Works

Client                      DELPHOS
  │                           │
  │── POST /chunk (audio 1) ──→│ Transcribe → Append
  │←── transcription text ────│
  │                           │
  │── POST /chunk (audio 2) ──→│ Transcribe → Append
  │←── transcription text ────│
  │                           │
  │── POST /chunk (final) ────→│ Transcribe → Generate SOAP
  │←── text + SOAP status ────│

Each chunk is processed independently. The client can display progressive transcription results to the physician as they arrive.

curl -X POST "https://your-instance.delphos.app/v1/consultation/chunk" \
  -H "x-api-key: $DELPHOS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "session_id": "550e8400-e29b-41d4-a716-446655440000",
    "chunk_sequence": 1,
    "audio_base64": "UklGRiQAAABXQVZFZm10IBAAAAABAAEAQB8AAIA...",
    "audio_format": "wav",
    "is_final": false
  }'

import base64
import os

import httpx

DELPHOS_API_KEY = os.environ["DELPHOS_API_KEY"]
session_id = "550e8400-e29b-41d4-a716-446655440000"  # from /consultation/start

# Read one audio chunk from a file (or from your microphone capture buffer)
with open("chunk_001.wav", "rb") as f:
    audio_bytes = f.read()

response = httpx.post(
    "https://your-instance.delphos.app/v1/consultation/chunk",
    headers={"x-api-key": DELPHOS_API_KEY},
    json={
        "session_id": session_id,
        "chunk_sequence": 1,
        "audio_base64": base64.b64encode(audio_bytes).decode(),
        "audio_format": "wav",
        "is_final": False,
    },
    timeout=30.0,
)
response.raise_for_status()  # surface 404/409/503 instead of parsing an error body

result = response.json()
print(f"Transcription: {result['transcription']['text']}")
print(f"Duration: {result['transcription']['duration_seconds']}s")

Request Parameters

Field	Type	Required	Description
`session_id`	`uuid`	Yes	Active session ID from `/consultation/start`
`chunk_sequence`	`integer`	No	Sequence number (auto-incremented if omitted)
`audio_base64`	`string`	Yes	Base64-encoded audio data
`audio_format`	`string`	No	Audio format (default: `wav`)
`is_final`	`boolean`	No	Set `true` on the last chunk to trigger SOAP generation
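
The table above can be turned into a small payload builder. This is a minimal sketch, not part of any official SDK: the function name `build_chunk_payload` is hypothetical, and the choice to leave `chunk_sequence` out of the body when it is not supplied (so the server auto-increments it) is an assumption based on the "auto-incremented if omitted" note.

```python
import base64
import uuid
from typing import Any, Optional


def build_chunk_payload(
    session_id: str,
    audio_bytes: bytes,
    *,
    audio_format: str = "wav",
    chunk_sequence: Optional[int] = None,
    is_final: bool = False,
) -> dict[str, Any]:
    """Build the JSON body for POST /v1/consultation/chunk (hypothetical helper)."""
    uuid.UUID(session_id)  # raises ValueError if session_id is not a valid UUID
    if not audio_bytes:
        raise ValueError("audio_bytes must not be empty")

    payload: dict[str, Any] = {
        "session_id": session_id,
        "audio_base64": base64.b64encode(audio_bytes).decode("ascii"),
        "audio_format": audio_format,
        "is_final": is_final,
    }
    # Optional field: only send it when the caller tracks sequence numbers.
    if chunk_sequence is not None:
        payload["chunk_sequence"] = chunk_sequence
    return payload
```

Validating the session ID and the audio buffer on the client side catches the most common mistakes before a request is sent.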

Supported Audio Formats

Format	Extension	Notes
WAV	`.wav`	Recommended — lossless, best transcription quality
MP3	`.mp3`	Compressed, good quality
WebM	`.webm`	Common in web browsers
OGG	`.ogg`	Open format
FLAC	`.flac`	Lossless compression
M4A	`.m4a`	Apple format
AAC	`.aac`	Advanced Audio Coding
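
If your client records to files, the `audio_format` value can be derived from the file extension. The sketch below assumes that the accepted `audio_format` string is the extension without the leading dot; the document only confirms this for `wav`, so treat the other values as an assumption. The helper name `audio_format_for` is hypothetical.

```python
from pathlib import Path

# Extensions listed in the table above (without the leading dot).
SUPPORTED_FORMATS = {"wav", "mp3", "webm", "ogg", "flac", "m4a", "aac"}


def audio_format_for(filename: str) -> str:
    """Return the audio_format value for a file, based on its extension."""
    ext = Path(filename).suffix.lower().lstrip(".")
    if ext not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported audio format: {ext or '(none)'}")
    return ext
```

Rejecting unknown extensions locally avoids uploading audio that the transcription service cannot decode.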

Response

{
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "chunk_id": "660e8400-e29b-41d4-a716-446655440001",
  "chunk_sequence": 1,
  "transcription": {
    "text": "Médico: Bom dia, como está se sentindo hoje?",
    "segments": [
      {
        "start": 0.0,
        "end": 3.5,
        "text": "Bom dia, como está se sentindo hoje?",
        "speaker": "DOCTOR"
      }
    ],
    "duration_seconds": 3.5
  },
  "session_state": "ACTIVE",
  "total_chunks": 1,
  "total_transcript_length": 47,
  "processing_time_ms": 145.3,
  "message": null,
  "soap_generation": null
}
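
To work with this response in typed code, you might parse it into small data classes. The following is a sketch that uses only the fields shown in the sample above; the class and function names (`Segment`, `ChunkResult`, `parse_chunk_response`) are hypothetical and not part of a DELPHOS SDK.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Segment:
    start: float
    end: float
    text: str
    speaker: Optional[str]  # "DOCTOR", "PATIENT", "UNKNOWN", or None


@dataclass
class ChunkResult:
    chunk_sequence: int
    text: str
    segments: list[Segment]
    session_state: str
    soap_status: Optional[str]  # None until the final chunk has been sent


def parse_chunk_response(data: dict[str, Any]) -> ChunkResult:
    """Convert the decoded JSON body of a chunk response into a ChunkResult."""
    transcription = data["transcription"]
    segments = [
        Segment(
            start=float(s["start"]),
            end=float(s["end"]),
            text=s["text"],
            speaker=s.get("speaker"),
        )
        for s in transcription.get("segments", [])
    ]
    soap = data.get("soap_generation") or {}  # null on non-final chunks
    return ChunkResult(
        chunk_sequence=data["chunk_sequence"],
        text=transcription["text"],
        segments=segments,
        session_state=data["session_state"],
        soap_status=soap.get("status"),
    )
```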

Speaker Diarization

When diarize: true is set at session start (the default for audio consultations), DELPHOS runs Diarizador DELPHOS (pyannote/speaker-diarization-3.1 on K.A.R.R.-01) in parallel with Transcriptor DELPHOS on every audio chunk. The two outputs are aligned server-side via temporal overlap, and raw speaker IDs are mapped to clinical roles using the first-speaker heuristic:

Heuristic	Mapping	Rationale
Speaker whose first segment starts earliest in the consultation	`DOCTOR`	In standard outpatient consultations the doctor initiates the conversation (“Bom dia, como está?”) before the patient responds
Second unique speaker	`PATIENT`	The patient responds after the doctor’s opening
Third and later speakers	`UNKNOWN`	Family members, nurses, or any additional speakers — safe-degrade, render as unattributed
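
To make the alignment and the first-speaker heuristic concrete, here is a simplified sketch of the idea. It is not the server implementation: the function names, the tuple layouts, and the raw speaker IDs such as `SPEAKER_00` are assumptions for illustration. Each transcript segment is assigned to the raw speaker with the largest temporal overlap, and raw speakers are mapped to roles in order of first appearance.

```python
from collections import defaultdict

ROLES_IN_ORDER = ["DOCTOR", "PATIENT"]  # third and later speakers become UNKNOWN


def map_roles(turns):
    """turns: list of (start, end, raw_speaker_id). Returns {raw_id: role}."""
    first_seen = {}
    for start, _end, spk in turns:
        if spk not in first_seen or start < first_seen[spk]:
            first_seen[spk] = start
    ordered = sorted(first_seen, key=first_seen.get)  # earliest speaker first
    return {
        spk: ROLES_IN_ORDER[i] if i < len(ROLES_IN_ORDER) else "UNKNOWN"
        for i, spk in enumerate(ordered)
    }


def label_segments(segments, turns):
    """segments: list of (start, end, text). Returns dicts with a speaker role."""
    roles = map_roles(turns)
    labeled = []
    for seg_start, seg_end, text in segments:
        overlap = defaultdict(float)
        for t_start, t_end, spk in turns:
            # Length of the intersection of the two time intervals (0 if disjoint).
            overlap[spk] += max(0.0, min(seg_end, t_end) - max(seg_start, t_start))
        best = max(overlap, key=overlap.get, default=None)
        if best is None or overlap[best] <= 0.0:
            speaker = "UNKNOWN"  # no diarization overlap for this segment
        else:
            speaker = roles[best]
        labeled.append(
            {"start": seg_start, "end": seg_end, "text": text, "speaker": speaker}
        )
    return labeled
```

Because the role depends only on who speaks first, a consultation in which the patient opens the conversation will have the two roles swapped. A UI that lets the physician correct the labels is a reasonable safeguard.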

Speaker label values

Value	Meaning
`"DOCTOR"`	First speaker by audio-clock (heuristic)
`"PATIENT"`	Second unique speaker by audio-clock (heuristic)
`"UNKNOWN"`	Third+ speakers, or transcript segments with zero diarization overlap (rare; near-silent segments)
`null`	Diarization is disabled for this session (`diarize: false`), or Diarizador DELPHOS was degraded for this chunk (graceful failure: the transcript is still returned, but speaker labels are omitted)

Diarization is performed per chunk. The segments array in each chunk response includes timestamps and speaker labels, so your UI can render conversation bubbles incrementally as chunks arrive, without waiting for the end of the session.
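
One way to build those bubbles is to fold each chunk's segments into a running list, merging consecutive segments from the same speaker. This is a client-side sketch under assumptions: the helper name `append_segments` is hypothetical, merging adjacent same-speaker segments is a UI choice rather than platform behavior, and a `null` speaker is shown as `UNKNOWN`. The document does not state whether segment timestamps are relative to the chunk or to the session, so verify that before using them for ordering.

```python
def append_segments(bubbles, segments):
    """Fold one chunk's segments into the running list of UI bubbles (in place)."""
    for seg in segments:
        speaker = seg.get("speaker") or "UNKNOWN"  # null label: render unattributed
        text = seg["text"].strip()
        if not text:
            continue  # skip empty segments
        if bubbles and bubbles[-1]["speaker"] == speaker:
            # Same speaker keeps talking: extend the previous bubble.
            bubbles[-1]["text"] += " " + text
            bubbles[-1]["end"] = seg["end"]
        else:
            bubbles.append(
                {
                    "speaker": speaker,
                    "text": text,
                    "start": seg["start"],
                    "end": seg["end"],
                }
            )
    return bubbles
```

Call it once per chunk response with `result["transcription"]["segments"]` and re-render the list after each call.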

Enriched Transcript

When you fetch the session status (GET /v1/consultation/{session_id}/status), the SessionStatusResponse payload includes both the plain-text accumulated transcript and an optional enriched transcript with full speaker attribution.

Field shape

Field	Type	Description
`current_transcript`	`string`	Full plain-text transcript (always present)
`enriched_transcript`	`list[TranscriptionSegment] \| null`	Speaker-labeled segments (when diarization is on)
`enriched_transcript[].start`	`float`	Segment start in seconds (audio-clock)
`enriched_transcript[].end`	`float`	Segment end in seconds (audio-clock)
`enriched_transcript[].text`	`string`	Verbatim utterance
`enriched_transcript[].speaker`	`string \| null`	`"DOCTOR"`, `"PATIENT"`, `"UNKNOWN"`, or `null`

Backwards-compatible rendering

The enriched_transcript field is OPTIONAL. Pre-diarization sessions and sessions started with diarize: false always return enriched_transcript: null. Render gracefully:

function renderTranscript(status: SessionStatusResponse) {
  if (status.enriched_transcript && status.enriched_transcript.length > 0) {
    // Render speaker bubbles
    return status.enriched_transcript.map(seg => ({
      speaker: seg.speaker ?? 'UNKNOWN',
      text: seg.text,
      timestamp: seg.start,
    }));
  }
  // Fallback: plain-text transcript without speaker attribution
  return [{ speaker: null, text: status.current_transcript, timestamp: 0 }];
}

Hallucination Guards

Speech-to-text models are prone to generating fabricated content on silent or near-silent audio (the “Whisper hallucination” failure mode — repeated phrases like “Obrigado por assistir” or “Legendado por…” on segments with no real speech).

DELPHOS forwards five guard parameters to Transcriptor DELPHOS on every chunk to suppress this failure mode. In our 14-day live audit, these guards reduced the observed hallucination rate from ~60% to ~0% on silence-heavy audio.

Parameter	Value	What it does
`vad_filter`	`"true"`	Silero VAD trims silence regions before transcription, removing the dominant trigger for hallucinations
`no_speech_threshold`	`"0.6"`	Rejects low-confidence silence-mislabeled-as-speech segments
`condition_on_previous_text`	`"false"`	Breaks the auto-regressive cascade where one hallucination conditions the next
`compression_ratio_threshold`	`"2.4"`	Rejects decodes with abnormal compression ratios (a hallucination signature)
`log_prob_threshold`	`"-1.0"`	Rejects low-likelihood token sequences

These are applied automatically — you don’t need to set them on the client side. They’re documented here so you understand the failure modes the platform mitigates on your behalf.
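
For intuition about `compression_ratio_threshold`, the sketch below computes the zlib-based ratio used by the open-source Whisper reference implementation: the UTF-8 length of the decoded text divided by its compressed length. Whether Transcriptor DELPHOS computes the value in exactly this way is an assumption; the function names are illustrative.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Raw UTF-8 size divided by zlib-compressed size; high values mean repetition."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def looks_repetitive(text: str, threshold: float = 2.4) -> bool:
    """True when the text compresses better than the guard threshold allows."""
    return compression_ratio(text) > threshold
```

A looped phrase such as "Obrigado por assistir." repeated twenty times compresses extremely well and is flagged, while a single natural sentence stays far below the threshold.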

Progressive Transcript

At any time during an active session, retrieve the full accumulated transcript:

curl -X GET "https://your-instance.delphos.app/v1/consultation/{session_id}/transcript" \
  -H "x-api-key: $DELPHOS_API_KEY"

{
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "transcript": "Médico: Bom dia, como está se sentindo hoje?\nPaciente: As dores de cabeça estão menos frequentes..."
}

This endpoint is useful for displaying the full conversation in your UI while chunks continue to arrive.

Sending the Final Chunk

Set is_final: true on the last audio chunk to signal the end of recording. This triggers SOAP note generation:

{
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "audio_base64": "UklGR...",
  "audio_format": "wav",
  "is_final": true
}

The response will include a soap_generation field with the status:

{
  "transcription": { "..." },
  "soap_generation": {
    "status": "processing",
    "message": "SOAP note generation started"
  }
}

Poll the session status endpoint (GET /v1/consultation/{session_id}/status) until the SOAP note is ready.

Integration Pattern

A typical integration streams chunks from the client’s microphone:

import base64
import os
import time

import httpx

DELPHOS_API_KEY = os.environ["DELPHOS_API_KEY"]
CHUNK_DURATION_SECONDS = 5  # target length of each element in audio_chunks
SOAP_POLL_TIMEOUT_SECONDS = 120  # illustrative upper bound; tune for your deployment

def stream_consultation(session_id: str, audio_chunks: list[bytes]):
    """Stream audio chunks, display progressive transcription, return the SOAP note."""
    with httpx.Client(
        base_url="https://your-instance.delphos.app/v1",
        headers={"x-api-key": DELPHOS_API_KEY},
        timeout=30.0,
    ) as client:
        for i, chunk in enumerate(audio_chunks):
            is_last = i == len(audio_chunks) - 1

            response = client.post(
                "/consultation/chunk",
                json={
                    "session_id": session_id,
                    "chunk_sequence": i + 1,
                    "audio_base64": base64.b64encode(chunk).decode(),
                    "audio_format": "wav",
                    "is_final": is_last,
                },
            )
            response.raise_for_status()
            result = response.json()

            # Display progressive transcription
            text = result["transcription"]["text"]
            print(f"[Chunk {i+1}] {text}")

        # Poll for SOAP completion, giving up once the timeout has elapsed
        deadline = time.monotonic() + SOAP_POLL_TIMEOUT_SECONDS
        while time.monotonic() < deadline:
            response = client.get(f"/consultation/{session_id}/status")
            response.raise_for_status()
            status = response.json()
            if status["state"] == "COMPLETED":
                return status["soap_note"]
            if status["state"] == "ERROR":
                raise RuntimeError(status["error"])
            time.sleep(1)
        raise TimeoutError("SOAP note was not ready before the polling timeout")
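
The function above expects a list of ready-made chunks. If you start from a single WAV recording, one way to produce that list is to split it with the standard-library `wave` module. This is a sketch under the assumption that every chunk should be a standalone WAV file with its own header, which seems consistent with chunks being processed independently; the helper name `split_wav` is hypothetical.

```python
import io
import wave


def split_wav(wav_bytes: bytes, chunk_seconds: float = 5.0) -> list[bytes]:
    """Split one WAV recording into standalone WAV chunks of about chunk_seconds."""
    chunks: list[bytes] = []
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        frames_per_chunk = int(src.getframerate() * chunk_seconds)
        if frames_per_chunk <= 0:
            raise ValueError("chunk_seconds is too small for this sample rate")
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break  # end of recording
            buf = io.BytesIO()
            with wave.open(buf, "wb") as dst:
                # Copy the audio parameters so each chunk decodes on its own.
                dst.setnchannels(src.getnchannels())
                dst.setsampwidth(src.getsampwidth())
                dst.setframerate(src.getframerate())
                dst.writeframes(frames)
            chunks.append(buf.getvalue())
    return chunks
```

The last chunk is usually shorter than the others. Splitting at fixed intervals can also cut a word in half, so a production client may prefer to split on silence instead.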

Error Handling

Status	Cause
`404 Not Found`	Session does not exist
`409 Conflict`	Session is not in `ACTIVE` state (already ended or errored)
`503 Service Unavailable`	Transcription service temporarily unavailable (retry with backoff)
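
The "retry with backoff" advice for `503` can be sketched as follows. The retry count and delays are illustrative defaults, not documented limits, and `send_with_retry` is a hypothetical helper. It takes a zero-argument `send` callable that returns a `(status_code, body)` tuple so that it stays independent of any particular HTTP client.

```python
import time
from typing import Any, Callable, Tuple


def send_with_retry(
    send: Callable[[], Tuple[int, Any]],
    *,
    max_attempts: int = 5,
    base_delay_s: float = 0.5,
    max_delay_s: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, Any]:
    """Call send() until it stops returning 503 or the attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        if status != 503:
            # 2xx, 404, and 409 are returned as-is; retrying would not help.
            return status, body
        if attempt < max_attempts - 1:
            # Exponential backoff: 0.5 s, 1 s, 2 s, ... capped at max_delay_s.
            sleep(min(max_delay_s, base_delay_s * 2 ** attempt))
    return status, body  # still 503 after the final attempt
```

When you resend a chunk, sending the same explicit `chunk_sequence` on every attempt seems prudent, although this page does not state how the server treats duplicate sequence numbers. Adding random jitter to the delay is a common refinement when many clients retry at once.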

Next Steps

SOAP Streaming — Generate structured SOAP notes from the accumulated transcript in real time
Streaming Prescription Extraction — Extract a safety-checked prescription from the same accumulated transcript in parallel with SOAP
Consultation Lifecycle — Full session management
Working with Records — Post-consultation editing
Clinical Summaries — Patient intelligence