
Deployment Guide

DELPHOS is a containerized platform deployed via Docker Compose. This guide covers the complete infrastructure: AI engine services, application layer, data stores, networking, and observability.

Architecture Overview

The platform follows a three-tier architecture. Only the API Gateway (port 8000) is exposed publicly. All other services bind exclusively to 127.0.0.1.

```
                      Internet / LAN
                            |
                        [ :8000 ]
                  API Gateway (FastAPI)
               /            |            \
   AI Engine Tier    Application Tier     Data Tier
   (GPU Services)     (CPU Services)    (Persistence)
  +--------------+  +----------------+  +------------+
  | Reasoning    |  | Evolution API  |  | PostgreSQL |
  | Engine :8002 |  | (WhatsApp)     |  | pgvector   |
  | [GPU 0]      |  | :8080          |  | :5432      |
  +--------------+  +----------------+  +------------+
  | Orchestrator |                      | Redis      |
  | :8001        |                      | :6379      |
  | [GPU 1]      |                      +------------+
  +--------------+
  | Sem. Search  |
  | :8013 [GPU1] |
  +--------------+
  | Reranker     |
  | :8014 [GPU1] |
  +--------------+
  | Lex. Search  |
  | :8015 [GPU1] |
  +--------------+
```

All inter-service communication happens over the internal delphos-network Docker bridge network. Remote services (Transcription Engine, Document Intelligence, Speaker Diarization) run on a secondary server and are accessed via HTTP.


Hardware Requirements

Primary Server

| Component | Minimum | Recommended |
| --- | --- | --- |
| GPUs | 2x NVIDIA (24 GB each) | 2x NVIDIA (32 GB each) |
| CPU | 12 cores | 16+ cores |
| RAM | 192 GB | 256 GB |
| Storage | 250 GB NVMe | 500 GB NVMe |
| OS | Linux with NVIDIA Container Toolkit | Ubuntu 22.04+ |

GPU Memory Distribution

GPU 0 — Reasoning Engine (dedicated)

| Resource | Allocation |
| --- | --- |
| Reasoning Engine | ~25 GB (78% utilization) |
| Headroom | ~7 GB |

GPU 1 — Orchestrator + Retrieval Services

| Resource | Allocation |
| --- | --- |
| Orchestrator | ~10 GB (62% utilization) |
| Semantic Search Engine | ~2 GB (6%) |
| Reranker | ~1.3 GB (4%) |
| Lexical Search Engine | ~2 GB (6%) |
| Headroom | ~6.7 GB |

Optional Secondary Server

For Transcription Engine, Document Intelligence, and Speaker Diarization:

| Component | Minimum |
| --- | --- |
| GPU | 1x NVIDIA (16 GB+) |
| CPU | 8 cores |
| RAM | 32 GB |
| VRAM usage | ~10.7 GB total |

Services Reference

AI Engine Tier (GPU)

Reasoning Engine

Complex medical reasoning, differential diagnosis, and clinical decision support with chain-of-thought processing.

| Parameter | Value |
| --- | --- |
| Internal port | 8002 |
| GPU | 0 (dedicated) |
| Context window | 16,384 tokens |
| Max concurrent requests | 2 |
| Quantization | AWQ 4-bit |
| CUDA graphs | PIECEWISE (2-3x throughput) |
| Prefix caching | Disabled (thinking mode) |
| Thinking mode | Enabled by default |
| CPU swap | 32 GB |

Orchestrator

Primary chat interface, medical conversation management, tool-calling orchestration, and routing to downstream services.

| Parameter | Value |
| --- | --- |
| Internal port | 8001 |
| GPU | 1 |
| Context window | 24,576 tokens |
| Max concurrent requests | 16 |
| Quantization | AWQ 4-bit |
| CUDA graphs | PIECEWISE |
| Prefix caching | Enabled |
| Tool calling | Enabled (auto tool choice) |
| CPU swap | 64 GB |

Semantic Search Engine

Powers semantic search over clinical guidelines, medical codes, and knowledge base retrieval.

| Parameter | Value |
| --- | --- |
| Internal port | 8013 |
| GPU | 1 |
| Context window | 8,192 tokens |
| Output dimensions | 1,024 |
| API endpoint | POST /v1/embeddings |
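The endpoint follows the familiar embeddings-API shape (one vector per input text). As an illustrative sketch, not taken from the deployment itself (the `model` name is a placeholder), a client could build the request body and compare the returned 1,024-dimensional vectors like this:

```python
import json
import math


def embed_request(texts, model="delphos-embeddings"):
    """Build an embeddings request body for POST /v1/embeddings.

    The model name is a placeholder; substitute whatever the
    Semantic Search Engine actually serves.
    """
    return json.dumps({"model": model, "input": texts})


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```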

Reranker

Cross-encoder reranking of search results for higher retrieval precision. Scores query-document pairs and returns relevance scores.

| Parameter | Value |
| --- | --- |
| Internal port | 8014 |
| GPU | 1 |
| Context window | 512 tokens |
| API endpoint | POST /v1/score |
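A minimal sketch of consuming reranker output. The only assumption made here is the one stated above: the service returns one relevance score per query-document pair, in input order.

```python
def rerank(documents, scores, top_k=3):
    """Order documents by reranker relevance scores, highest first.

    `scores` are the floats returned by POST /v1/score for the
    query-document pairs, aligned index-for-index with `documents`.
    """
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```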

Lexical Search Engine

Lexical matching for hybrid search. Combines with the Semantic Search Engine to improve retrieval precision on medical terminology and procedure codes.

| Parameter | Value |
| --- | --- |
| Internal port | 8015 |
| GPU | 1 |
| API endpoint | POST /v1/sparse/encode |
| Start period | 120s (model loading) |
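The guide does not specify how semantic and lexical results are fused; reciprocal rank fusion (RRF) is one common approach, sketched here purely for illustration:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked result lists (e.g. semantic + lexical) into one.

    Each ranking is a list of document ids, best first. A document's
    fused score is the sum of 1 / (k + rank) over every list it appears
    in, so items ranked well by multiple retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```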

Application Tier (CPU)

API Gateway

Unified FastAPI entry point for all DELPHOS services. Routes requests to the appropriate backend service.

| Parameter | Value |
| --- | --- |
| Port | 8000 (public) |
| Framework | FastAPI |
| Documentation | /docs (Swagger), /redoc (ReDoc) |
| Health check | GET /health |
| API prefix | /v1/* |

The API Gateway starts only after all upstream services pass their health checks.

Evolution API (WhatsApp)

Self-hosted WhatsApp Business API for patient communication and voice rescheduling workflows.

| Parameter | Value |
| --- | --- |
| Internal port | 8080 |
| Feature flag | WHATSAPP_ENABLED |
| Condition | Optional — enable via environment variable |
| Database | Shares PostgreSQL instance |

Data Tier

PostgreSQL (pgvector)

| Parameter | Value |
| --- | --- |
| Image | pgvector/pgvector:pg16 |
| Internal port | 5432 |
| Extensions | pgvector (vector similarity search) |
| Data volume | /data/delphos/databases/postgres |
| Secrets | Password via Docker secret file |

Redis

| Parameter | Value |
| --- | --- |
| Image | redis:7-alpine |
| Internal port | 6379 |
| Max memory | 512 MB |
| Eviction policy | allkeys-lru |
| Persistence | AOF (append-only file) |
| Data volume | /data/delphos/databases/redis |

Redis serves as session cache, response cache, rate limiter, and GPU mutex coordinator.


Environment Variables

Database

| Variable | Description | Example |
| --- | --- | --- |
| DB_HOST | PostgreSQL hostname | postgres |
| DB_PORT | PostgreSQL port | 5432 |
| DB_NAME | Database name | delphos |
| DB_USER | Application user | delphos_app |
| DB_PASSWORD | Application password | (use Docker secret) |
| REDIS_URL | Redis connection string | redis://redis:6379 |
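For illustration, a service could assemble its PostgreSQL connection URI from these variables as follows. This helper is hypothetical; the DELPHOS services may build their DSNs differently.

```python
import os


def postgres_dsn(env=os.environ):
    """Assemble a PostgreSQL connection URI from the DB_* variables.

    Defaults mirror the example values in the table above.
    """
    return (
        f"postgresql://{env['DB_USER']}:{env['DB_PASSWORD']}"
        f"@{env.get('DB_HOST', 'postgres')}:{env.get('DB_PORT', '5432')}"
        f"/{env.get('DB_NAME', 'delphos')}"
    )
```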

Service Endpoints

| Variable | Description | Default |
| --- | --- | --- |
| ORCHESTRATOR_ENDPOINT | Orchestrator base URL | http://orchestrator:8001 |
| REASONING_ENGINE_ENDPOINT | Reasoning Engine base URL | http://reasoning-engine:8002 |
| EMBEDDINGS_ENDPOINT | Semantic Search base URL | http://embeddings:8013 |
| RERANKER_ENDPOINT | Reranker base URL | http://reranker:8014 |
| SPARSE_EMBEDDINGS_URL | Sparse encode endpoint | http://sparse-embeddings:8015/v1/sparse/encode |

Remote Services (Secondary Server)

| Variable | Description | Example |
| --- | --- | --- |
| TRANSCRIPTION_ENDPOINT | Transcription Engine URL | http://&lt;secondary-ip&gt;:8001 |
| OCR_ENDPOINT | Document Intelligence URL | http://&lt;secondary-ip&gt;:8002 |
| DIARIZATION_ENDPOINT | Speaker Diarization URL | http://&lt;secondary-ip&gt;:8005 |

Feature Flags

| Variable | Description | Default |
| --- | --- | --- |
| USE_KITT_DISPATCHER | Enable intelligent request dispatcher | true |
| USE_AGENTIC_ROUTER | Enable agentic tool-calling router | true |
| QUERY_EXPANSION_ENABLED | Enable query expansion for retrieval | true |
| WHATSAPP_ENABLED | Enable WhatsApp integration | false |
| KITT_PLANNING_TIMEOUT | Dispatcher planning timeout (seconds) | 15 |
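Boolean flags arrive as strings ("true"/"false"). A minimal parsing sketch, assuming the common convention of accepting a few truthy spellings (the services' actual parsing may differ):

```python
import os

_TRUTHY = {"1", "true", "yes", "on"}


def flag(name, default=False, env=os.environ):
    """Interpret an environment feature flag such as WHATSAPP_ENABLED.

    Missing variables fall back to the documented default.
    """
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUTHY
```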

CORS

| Variable | Description |
| --- | --- |
| CORS_ALLOWED_ORIGINS | Comma-separated list of allowed origins |
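Splitting the comma-separated value into a clean origin list could look like this (an illustrative helper, not the gateway's actual code):

```python
def parse_origins(value):
    """Split CORS_ALLOWED_ORIGINS into a list of trimmed, non-empty origins."""
    return [origin.strip() for origin in value.split(",") if origin.strip()]
```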

Logging

| Variable | Description | Default |
| --- | --- | --- |
| LOG_FORMAT | Log output format | json |
| LOG_LEVEL | Minimum log level | INFO |
| LOG_SERVICE_NAME | Service identifier in logs | api-gateway |

WhatsApp (Evolution API)

| Variable | Description |
| --- | --- |
| EVOLUTION_API_KEY | Authentication key for Evolution API |
| EVOLUTION_WEBHOOK_URL | Webhook URL for incoming WhatsApp events |
| EVOLUTION_DATABASE_URL | PostgreSQL connection URI for Evolution |

Startup Procedure

Services must start in dependency order. GPU model loading takes significant time; the health checks gate downstream services automatically.

Step 1 — Data services

```sh
docker compose -f docker-compose-v2.yml up -d postgres redis
```

Wait for both to report healthy:

```sh
docker compose -f docker-compose-v2.yml ps postgres redis
```

Step 2 — GPU 1 services (Orchestrator + Retrieval)

```sh
docker compose -f docker-compose-v2.yml up -d \
  orchestrator embeddings reranker sparse-embeddings
```

The Orchestrator loads in approximately 90 seconds. The Semantic Search Engine and Reranker load in approximately 60 seconds. The Lexical Search Engine requires up to 120 seconds for model initialization.

Step 3 — GPU 0 service (Reasoning Engine)

```sh
docker compose -f docker-compose-v2.yml up -d reasoning-engine
```

The Reasoning Engine loads a large model and may take up to 120 seconds to become healthy.

Step 4 — API Gateway

```sh
docker compose -f docker-compose-v2.yml up -d api-gateway
```

The gateway has depends_on conditions for all upstream services. It will not start until every dependency passes its health check.

Step 5 — WhatsApp (optional)

```sh
docker compose -f docker-compose-v2.yml up -d evolution-api
```

Verify Deployment

```sh
# Check all services are healthy
docker compose -f docker-compose-v2.yml ps

# Test API Gateway
curl http://localhost:8000/health

# Monitor GPU memory
watch -n 1 nvidia-smi
```

Expected idle usage: GPU 0 approximately 25 GB, GPU 1 approximately 16 GB. Significantly higher readings may indicate a resource leak.


Health Checks

Every service includes a Docker health check. The API Gateway uses these to gate its own startup.

| Service | Endpoint | Interval | Start Period | Retries |
| --- | --- | --- | --- | --- |
| Reasoning Engine | GET /health | 30s | 120s | 3 |
| Orchestrator | GET /health | 30s | 90s | 3 |
| Semantic Search | GET /health | 30s | 60s | 3 |
| Reranker | GET /health | 30s | 60s | 3 |
| Lexical Search | GET /health | 30s | 120s | 3 |
| API Gateway | GET /health | 30s | 15s | 3 |
| Evolution API | GET /api/health | 30s | 30s | 3 |
| PostgreSQL | pg_isready | 10s | 30s | 5 |
| Redis | redis-cli ping | 10s | 10s | 5 |
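Docker itself performs this gating via `depends_on` health conditions; the interval/start-period/retries semantics can be sketched in Python for clarity. The `probe` callable stands in for one GET /health attempt (a hypothetical interface, shown only to explain the semantics):

```python
import time


def wait_healthy(probe, interval=30.0, start_period=0.0, retries=3,
                 sleep=time.sleep, clock=time.monotonic):
    """Mimic Docker healthcheck gating.

    Failures during the start period do not count; afterwards,
    `retries` consecutive failures mark the service unhealthy.
    Returns True once a probe succeeds, False when retries are spent.
    """
    started = clock()
    failures = 0
    while True:
        if probe():
            return True
        if clock() - started >= start_period:
            failures += 1
            if failures >= retries:
                return False
        sleep(interval)
```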

GPU Mutex

GPU 0 is dedicated to the Reasoning Engine with a hard limit of 2 concurrent sequences. To prevent request queuing and timeouts, the API Gateway implements a Redis-based distributed lock.

Implementation details:

| Parameter | Value |
| --- | --- |
| Redis key | delphos:gpu0:mutex |
| Lock mechanism | SETNX (atomic acquire) |
| TTL | 30 seconds |
| Timeout | 120 seconds (max inference time) |

The lock is acquired before any call to the Reasoning Engine and released upon completion. If the lock holder crashes, the TTL ensures automatic release.

```python
# Usage pattern (simplified)
async with GPUMutex(redis, "delphos:gpu0:mutex", timeout=120):
    response = await call_reasoning_engine(prompt)
```
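A minimal sketch of how such a lock can be built, assuming a Redis-like async client exposing `set(key, value, nx=True, ex=ttl)`, `get`, and `delete`. The gateway's actual GPUMutex implementation may differ:

```python
import asyncio
import time
import uuid


class GPUMutex:
    """Redis-backed async lock sketch (hypothetical implementation)."""

    def __init__(self, redis, key, timeout=120, ttl=30, poll=0.1):
        self.redis = redis
        self.key = key
        self.timeout = timeout   # max seconds to wait for the lock
        self.ttl = ttl           # lock auto-expires if the holder crashes
        self.poll = poll
        self.token = uuid.uuid4().hex  # identifies this holder

    async def __aenter__(self):
        deadline = time.monotonic() + self.timeout
        while time.monotonic() < deadline:
            # SET ... NX EX is an atomic SETNX with a TTL
            if await self.redis.set(self.key, self.token, nx=True, ex=self.ttl):
                return self
            await asyncio.sleep(self.poll)
        raise TimeoutError(f"could not acquire {self.key} in {self.timeout}s")

    async def __aexit__(self, *exc):
        # Release only if we still hold the lock (the TTL may have expired).
        if await self.redis.get(self.key) == self.token:
            await self.redis.delete(self.key)
```

A production implementation would perform the compare-and-delete in `__aexit__` atomically, typically via a Lua script, rather than the separate get/delete shown here.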

Data Volumes

All persistent data lives under /data/delphos/ on the host:

```
/data/delphos/
  models/
    reasoning-engine/      # Reasoning Engine weights
    orchestrator/          # Orchestrator weights
    dense-embeddings/      # Semantic Search Engine weights
    sparse-embeddings/     # Lexical Search Engine weights
    reranker/              # Reranker weights
  databases/
    postgres/              # PostgreSQL data directory
    redis/                 # Redis AOF persistence
  cache/
    huggingface/           # Shared HuggingFace cache
  secrets/
    postgres_password.txt  # Docker secret for DB superuser
  monitoring/
    prometheus/            # Prometheus TSDB
    grafana/               # Grafana dashboards and state
    loki/                  # Loki log index
```

Network Security

Port Binding

Only the API Gateway is accessible from the network. All other services bind to 127.0.0.1 (localhost only).

| Service | Binding | Accessible From |
| --- | --- | --- |
| API Gateway | 0.0.0.0:8000 | Network (public) |
| Reasoning Engine | 127.0.0.1:8002 | Localhost only |
| Orchestrator | 127.0.0.1:8001 | Localhost only |
| Semantic Search | 127.0.0.1:8013 | Localhost only |
| Reranker | 127.0.0.1:8014 | Localhost only |
| Lexical Search | 127.0.0.1:8015 | Localhost only |
| Evolution API | 127.0.0.1:8080 | Localhost only |
| PostgreSQL | 127.0.0.1:5432 | Localhost only |
| Redis | 127.0.0.1:6379 | Localhost only |

Secrets Management

  • PostgreSQL superuser password is loaded via Docker secrets (/run/secrets/postgres_password)
  • The secret file resides at /data/delphos/secrets/postgres_password.txt on the host
  • Application database password is set via environment variable (migration to Docker secret planned)
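A common pattern for reading such secrets is to prefer the mounted secret file and fall back to an environment variable. This helper is illustrative, not the platform's actual code:

```python
import os


def read_secret(name, env=os.environ, secrets_dir="/run/secrets"):
    """Read a Docker secret (e.g. /run/secrets/postgres_password),
    falling back to the upper-cased environment variable of the
    same name when no secret file is mounted."""
    path = os.path.join(secrets_dir, name)
    if os.path.exists(path):
        with open(path) as fh:
            return fh.read().strip()
    return env.get(name.upper())
```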

Recommendations

  • Place a reverse proxy (NGINX, Caddy, or Traefik) in front of port 8000 for TLS termination
  • Configure CORS_ALLOWED_ORIGINS to list only trusted frontend origins
  • Use firewall rules (ufw or iptables) to restrict port 8000 access to known clients
  • Rotate the Evolution API key and database passwords periodically

Monitoring Stack

An observability stack is deployed via a separate Compose file. It joins the same delphos-network to scrape metrics from application services.

```sh
docker compose -f docker-compose-monitoring.yml up -d
```

Components

| Service | Image | Port | Purpose |
| --- | --- | --- | --- |
| Prometheus | prom/prometheus | 9090 | Metrics collection and alerting |
| Grafana | grafana/grafana | 3000 | Dashboards and visualization |
| Loki | grafana/loki | 127.0.0.1:3100 | Log aggregation |
| Promtail | grafana/promtail | 9080 | Log collection agent |
| Node Exporter | prom/node-exporter | 9100 | Host CPU, memory, disk metrics |
| DCGM Exporter | nvidia/dcgm-exporter | 9400 | GPU utilization, temperature, memory |
| Postgres Exporter | prometheuscommunity/postgres-exporter | 9187 | Query performance, connections |
| Redis Exporter | oliver006/redis_exporter | 9121 | Memory, commands, key counts |
| cAdvisor | gcr.io/cadvisor/cadvisor | 8081 | Per-container resource usage |

Key Metrics to Watch

  • GPU: DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_FB_USED (VRAM), DCGM_FI_DEV_GPU_TEMP
  • Inference: Request latency per service, queue depth, token throughput
  • Database: Active connections, query duration p95, pgvector index scans
  • Redis: Memory usage vs. 512 MB limit, eviction rate, mutex lock wait time
  • Host: CPU utilization, available RAM (watch for swap pressure from KV cache)

Retention

  • Prometheus: 30 days / 10 GB (whichever is reached first)
  • Loki: Configured via loki-config.yaml (default 7 days)
  • Grafana dashboards: Provisioned from monitoring/grafana/dashboards/

Troubleshooting

Common Issues

Reasoning Engine fails to start (OOM)

GPU 0 requires at least 25 GB free VRAM. Ensure no other processes occupy the GPU:

```sh
nvidia-smi
# Kill stray processes if needed
sudo fuser -v /dev/nvidia0
```

Lexical Search Engine reports unhealthy

This service has a 120-second start period for model loading. Wait at least 2 minutes before investigating. Check logs:

```sh
docker logs sparse-embeddings --tail 50
```

API Gateway exits immediately

The gateway requires all upstream services to be healthy. Check which dependency is failing:

```sh
docker compose -f docker-compose-v2.yml ps --format "table {{.Name}}\t{{.Status}}"
```

Redis eviction warnings

If Redis exceeds 512 MB, LRU eviction activates automatically. This is expected behavior under load. Monitor with:

```sh
docker exec redis redis-cli info memory | grep used_memory_human
```
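For programmatic monitoring, the `INFO` output (lines of `key:value`, with `#` comment lines) is easy to parse. A small sketch, assuming you already captured the command's text output:

```python
def parse_info(text):
    """Parse `redis-cli info` output into a dict of key -> value strings."""
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        out[key] = value
    return out
```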

Rollback

If the v2 configuration has issues, revert to the base Compose file. The base file (docker-compose.yml) contains only core data and application services without the full AI engine stack — use it as an emergency fallback.

```sh
docker compose -f docker-compose-v2.yml down
docker compose -f docker-compose.yml up -d
```