Deployment Guide
DELPHOS is a containerized platform deployed via Docker Compose. This guide covers the complete infrastructure: AI engine services, application layer, data stores, networking, and observability.
Architecture Overview
The platform follows a three-tier architecture. Only the API Gateway (port 8000)
is exposed publicly. All other services bind exclusively to 127.0.0.1.
```
                      Internet / LAN
                            |
              [ :8000 ] API Gateway (FastAPI)
                      /     |     \
            ---------       |       ---------
           |                |                |
   AI Engine Tier    Application Tier    Data Tier
   (GPU Services)     (CPU Services)    (Persistence)
  +--------------+   +--------------+  +--------------+
  | Reasoning    |   | Evolution API|  | PostgreSQL   |
  | Engine :8002 |   | (WhatsApp)   |  | pgvector     |
  | [GPU 0]      |   | :8080        |  | :5432        |
  +--------------+   +--------------+  +--------------+
  | Orchestrator |                     | Redis        |
  | :8001        |                     | :6379        |
  | [GPU 1]      |                     +--------------+
  +--------------+
  | Sem. Search  |
  | :8013 [GPU1] |
  +--------------+
  | Reranker     |
  | :8014 [GPU1] |
  +--------------+
  | Lex. Search  |
  | :8015 [GPU1] |
  +--------------+
```

All inter-service communication happens over the internal delphos-network Docker bridge network. Remote services (Transcription Engine, Document Intelligence, Speaker Diarization) run on a secondary server and are accessed via HTTP.
Hardware Requirements
Primary Server
| Component | Minimum | Recommended |
|---|---|---|
| GPUs | 2x NVIDIA (24 GB each) | 2x NVIDIA (32 GB each) |
| CPU | 12 cores | 16+ cores |
| RAM | 192 GB | 256 GB |
| Storage | 250 GB NVMe | 500 GB NVMe |
| OS | Linux with NVIDIA Container Toolkit | Ubuntu 22.04+ |
GPU Memory Distribution
GPU 0 — Reasoning Engine (dedicated)
| Resource | Allocation |
|---|---|
| Reasoning Engine | ~25 GB (78% utilization) |
| Headroom | ~7 GB |
GPU 1 — Orchestrator + Retrieval Services
| Resource | Allocation |
|---|---|
| Orchestrator | ~10 GB (62% utilization) |
| Semantic Search Engine | ~2 GB (6%) |
| Reranker | ~1.3 GB (4%) |
| Lexical Search Engine | ~2 GB (6%) |
| Headroom | ~6.7 GB |
Optional Secondary Server
For Transcription Engine, Document Intelligence, and Speaker Diarization:
| Component | Minimum |
|---|---|
| GPU | 1x NVIDIA (16 GB+) |
| CPU | 8 cores |
| RAM | 32 GB |
| VRAM usage | ~10.7 GB total |
Services Reference
AI Engine Tier (GPU)
Reasoning Engine
Complex medical reasoning, differential diagnosis, clinical decision support with chain-of-thought processing.
| Parameter | Value |
|---|---|
| Internal port | 8002 |
| GPU | 0 (dedicated) |
| Context window | 16,384 tokens |
| Max concurrent requests | 2 |
| Quantization | AWQ 4-bit |
| CUDA graphs | PIECEWISE (2-3x throughput) |
| Prefix caching | Disabled (thinking mode) |
| Thinking mode | Enabled by default |
| CPU swap | 32 GB |
Orchestrator
Primary chat interface, medical conversation management, tool-calling orchestration, and routing to downstream services.
| Parameter | Value |
|---|---|
| Internal port | 8001 |
| GPU | 1 |
| Context window | 24,576 tokens |
| Max concurrent requests | 16 |
| Quantization | AWQ 4-bit |
| CUDA graphs | PIECEWISE |
| Prefix caching | Enabled |
| Tool calling | Enabled (auto tool choice) |
| CPU swap | 64 GB |
Semantic Search Engine
Powers semantic search over clinical guidelines, medical codes, and knowledge base retrieval.
| Parameter | Value |
|---|---|
| Internal port | 8013 |
| GPU | 1 |
| Context window | 8,192 tokens |
| Output dimensions | 1,024 |
| API endpoint | POST /v1/embeddings |
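The `/v1/embeddings` route suggests an OpenAI-compatible embeddings API. A minimal client sketch, assuming the standard `model`/`input` request fields and `data[*].embedding` response shape (field names are assumptions, not confirmed by this guide):

```python
import json
import urllib.request

# Internal hostname from the Compose network; from the host, use 127.0.0.1:8013.
EMBEDDINGS_URL = "http://embeddings:8013/v1/embeddings"

def embed_payload(texts: list[str], model: str = "default") -> dict:
    """Build an OpenAI-style embeddings request body (field names assumed)."""
    return {"model": model, "input": texts}

def embed(texts: list[str], url: str = EMBEDDINGS_URL) -> list[list[float]]:
    """POST the texts and return one vector per input.
    Each vector should have the advertised 1,024 output dimensions."""
    req = urllib.request.Request(
        url,
        data=json.dumps(embed_payload(texts)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = json.load(resp)
    return [item["embedding"] for item in body["data"]]
```

Inputs longer than the 8,192-token context window should be chunked by the caller before embedding.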
Reranker
Cross-encoder reranking of search results for higher retrieval precision. Scores query-document pairs and returns relevance scores.
| Parameter | Value |
|---|---|
| Internal port | 8014 |
| GPU | 1 |
| Context window | 512 tokens |
| API endpoint | POST /v1/score |
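A hedged sketch of using the score endpoint. The `text_1`/`text_2` request fields mirror the scoring API some OpenAI-compatible inference servers expose; they and the `rerank` helper are assumptions, not taken from this guide:

```python
def score_payload(query: str, documents: list[str]) -> dict:
    """Request body for POST /v1/score: one query against many documents.
    Field names are an assumption; check the service's own API docs.
    Note the 512-token context window: truncate long documents first."""
    return {"text_1": query, "text_2": documents}

def rerank(documents: list[str], scores: list[float], top_k: int = 5) -> list[str]:
    """Order documents by the relevance scores the service returned, best first."""
    ranked = sorted(zip(scores, documents), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]
```

Typical flow: retrieve a candidate set via the search engines, then pass the query-document pairs through the Reranker and keep the top few.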
Lexical Search Engine
Lexical matching for hybrid search. Combines with the Semantic Search Engine to improve retrieval precision on medical terminology and procedure codes.
| Parameter | Value |
|---|---|
| Internal port | 8015 |
| GPU | 1 |
| API endpoint | POST /v1/sparse/encode |
| Start period | 120s (model loading) |
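The guide does not state how the dense and lexical rankings are merged for hybrid search; Reciprocal Rank Fusion (RRF) is one common choice, shown here purely as an illustration:

```python
def rrf_fuse(dense: list[str], lexical: list[str], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank).
    Documents appearing high in either ranking float to the top; k damps
    the influence of the very first positions."""
    scores: dict[str, float] = {}
    for ranking in (dense, lexical):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The fused list can then be passed to the Reranker for a final precision pass.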
Application Tier (CPU)
API Gateway
Unified FastAPI entry point for all DELPHOS services. Routes requests to the appropriate backend service.
| Parameter | Value |
|---|---|
| Port | 8000 (public) |
| Framework | FastAPI |
| Documentation | /docs (Swagger), /redoc (ReDoc) |
| Health check | GET /health |
| API prefix | /v1/* |
The API Gateway starts only after all upstream services pass their health checks.
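A minimal client sketch against the gateway. The `build_request` helper and the example `chat` route are hypothetical; the real routes are listed at `/docs`:

```python
import json
import urllib.request

GATEWAY = "http://localhost:8000"

def health_check(base_url: str = GATEWAY) -> bool:
    """GET /health on the API Gateway; True on HTTP 200, False if unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

def build_request(path: str, payload: dict, base_url: str = GATEWAY) -> urllib.request.Request:
    """Build a JSON POST against the gateway's /v1/* prefix."""
    return urllib.request.Request(
        url=f"{base_url}/v1/{path.lstrip('/')}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

if __name__ == "__main__":
    # "chat" is a placeholder route; consult /docs (Swagger) for the real paths.
    req = build_request("chat", {"message": "hello"})
    print(req.full_url)
```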
Evolution API (WhatsApp)
Self-hosted WhatsApp Business API for patient communication and voice rescheduling workflows.
| Parameter | Value |
|---|---|
| Internal port | 8080 |
| Feature flag | WHATSAPP_ENABLED |
| Condition | Optional — enable via environment variable |
| Database | Shares PostgreSQL instance |
Data Tier
PostgreSQL (pgvector)
| Parameter | Value |
|---|---|
| Image | pgvector/pgvector:pg16 |
| Internal port | 5432 |
| Extensions | pgvector (vector similarity search) |
| Data volume | /data/delphos/databases/postgres |
| Secrets | Password via Docker secret file |
Redis
| Parameter | Value |
|---|---|
| Image | redis:7-alpine |
| Internal port | 6379 |
| Max memory | 512 MB |
| Eviction policy | allkeys-lru |
| Persistence | AOF (append-only file) |
| Data volume | /data/delphos/databases/redis |
Redis serves as session cache, response cache, rate limiter, and GPU mutex coordinator.
Environment Variables
Database
| Variable | Description | Example |
|---|---|---|
| `DB_HOST` | PostgreSQL hostname | `postgres` |
| `DB_PORT` | PostgreSQL port | `5432` |
| `DB_NAME` | Database name | `delphos` |
| `DB_USER` | Application user | `delphos_app` |
| `DB_PASSWORD` | Application password | (use Docker secret) |
| `REDIS_URL` | Redis connection string | `redis://redis:6379` |
Service Endpoints
| Variable | Description | Default |
|---|---|---|
| `ORCHESTRATOR_ENDPOINT` | Orchestrator base URL | `http://orchestrator:8001` |
| `REASONING_ENGINE_ENDPOINT` | Reasoning Engine base URL | `http://reasoning-engine:8002` |
| `EMBEDDINGS_ENDPOINT` | Semantic Search base URL | `http://embeddings:8013` |
| `RERANKER_ENDPOINT` | Reranker base URL | `http://reranker:8014` |
| `SPARSE_EMBEDDINGS_URL` | Sparse encode endpoint | `http://sparse-embeddings:8015/v1/sparse/encode` |
Remote Services (Secondary Server)
| Variable | Description | Example |
|---|---|---|
| `TRANSCRIPTION_ENDPOINT` | Transcription Engine URL | `http://<secondary-ip>:8001` |
| `OCR_ENDPOINT` | Document Intelligence URL | `http://<secondary-ip>:8002` |
| `DIARIZATION_ENDPOINT` | Speaker Diarization URL | `http://<secondary-ip>:8005` |
Feature Flags
| Variable | Description | Default |
|---|---|---|
| `USE_KITT_DISPATCHER` | Enable intelligent request dispatcher | `true` |
| `USE_AGENTIC_ROUTER` | Enable agentic tool-calling router | `true` |
| `QUERY_EXPANSION_ENABLED` | Enable query expansion for retrieval | `true` |
| `WHATSAPP_ENABLED` | Enable WhatsApp integration | `false` |
| `KITT_PLANNING_TIMEOUT` | Dispatcher planning timeout (seconds) | `15` |
CORS
| Variable | Description |
|---|---|
| `CORS_ALLOWED_ORIGINS` | Comma-separated list of allowed origins |
Logging
| Variable | Description | Default |
|---|---|---|
| `LOG_FORMAT` | Log output format | `json` |
| `LOG_LEVEL` | Minimum log level | `INFO` |
| `LOG_SERVICE_NAME` | Service identifier in logs | `api-gateway` |
WhatsApp (Evolution API)
| Variable | Description |
|---|---|
| `EVOLUTION_API_KEY` | Authentication key for Evolution API |
| `EVOLUTION_WEBHOOK_URL` | Webhook URL for incoming WhatsApp events |
| `EVOLUTION_DATABASE_URL` | PostgreSQL connection URI for Evolution |
Startup Procedure
Services must start in dependency order. GPU model loading takes significant time; the health checks gate downstream services automatically.
Step 1 — Data services
```bash
docker compose -f docker-compose-v2.yml up -d postgres redis
```

Wait for both to report healthy:
```bash
docker compose -f docker-compose-v2.yml ps postgres redis
```

Step 2 — GPU 1 services (Orchestrator + Retrieval)
```bash
docker compose -f docker-compose-v2.yml up -d \
  orchestrator embeddings reranker sparse-embeddings
```

The Orchestrator loads in approximately 90 seconds. The Semantic Search Engine and Reranker load in approximately 60 seconds. The Lexical Search Engine requires up to 120 seconds for model initialization.
Step 3 — GPU 0 service (Reasoning Engine)
```bash
docker compose -f docker-compose-v2.yml up -d reasoning-engine
```

The Reasoning Engine loads a large model and may take up to 120 seconds to become healthy.
Step 4 — API Gateway
```bash
docker compose -f docker-compose-v2.yml up -d api-gateway
```

The gateway has `depends_on` conditions for all upstream services. It will not start until every dependency passes its health check.
Step 5 — WhatsApp (optional)
```bash
docker compose -f docker-compose-v2.yml up -d evolution-api
```

Docker Compose respects `depends_on` with health check conditions, so a single command works — services will wait for their dependencies automatically:

```bash
docker compose -f docker-compose-v2.yml up -d
```

Verify Deployment
```bash
# Check all services are healthy
docker compose -f docker-compose-v2.yml ps

# Test API Gateway
curl http://localhost:8000/health

# Monitor GPU memory
watch -n 1 nvidia-smi
```

Expected idle usage: GPU 0 approximately 25 GB, GPU 1 approximately 16 GB. Significantly higher readings may indicate a resource leak.
Health Checks
Every service includes a Docker health check. The API Gateway uses these to gate its own startup.
| Service | Endpoint | Interval | Start Period | Retries |
|---|---|---|---|---|
| Reasoning Engine | GET /health | 30s | 120s | 3 |
| Orchestrator | GET /health | 30s | 90s | 3 |
| Semantic Search | GET /health | 30s | 60s | 3 |
| Reranker | GET /health | 30s | 60s | 3 |
| Lexical Search | GET /health | 30s | 120s | 3 |
| API Gateway | GET /health | 30s | 15s | 3 |
| Evolution API | GET /api/health | 30s | 30s | 3 |
| PostgreSQL | pg_isready | 10s | 30s | 5 |
| Redis | redis-cli ping | 10s | 10s | 5 |
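As a sketch, a row of the table above maps onto a Compose healthcheck like the following (the `curl` test command is illustrative; the authoritative definitions live in docker-compose-v2.yml):

```yaml
services:
  sparse-embeddings:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8015/health"]
      interval: 30s
      start_period: 120s   # grace period for model loading
      retries: 3
```

Dependent services reference these checks with `depends_on: { sparse-embeddings: { condition: service_healthy } }`, which is how the API Gateway gates its own startup.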
GPU Mutex
GPU 0 is dedicated to the Reasoning Engine with a hard limit of 2 concurrent sequences. To prevent request queuing and timeouts, the API Gateway implements a Redis-based distributed lock.
Implementation details:
| Parameter | Value |
|---|---|
| Redis key | delphos:gpu0:mutex |
| Lock mechanism | SETNX (atomic acquire) |
| TTL | 30 seconds |
| Timeout | 120 seconds (max inference time) |
The lock is acquired before any call to the Reasoning Engine and released upon completion. If the lock holder crashes, the TTL ensures automatic release.
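A minimal, self-contained sketch of such a lock, using an in-memory stand-in for Redis so the example runs anywhere. The real `SET NX EX` call has the same shape in redis-py's asyncio client; the gateway's actual implementation may differ in details such as poll interval and lock tokens:

```python
import asyncio
import time

class FakeRedis:
    """In-memory stand-in for the two Redis commands the lock needs."""
    def __init__(self):
        self._store = {}
    async def set(self, key, value, nx=False, ex=None):
        now = time.monotonic()
        current = self._store.get(key)
        if nx and current and current[1] > now:
            return False  # key exists and has not expired: acquire fails
        self._store[key] = (value, now + (ex or float("inf")))
        return True
    async def delete(self, key):
        self._store.pop(key, None)

class GPUMutex:
    """Async context manager: acquire with SET NX EX (atomic, with TTL),
    polling until `timeout` expires; release by deleting the key."""
    def __init__(self, redis, key, ttl=30, timeout=120, poll=0.05):
        self.redis, self.key, self.ttl, self.timeout, self.poll = redis, key, ttl, timeout, poll
    async def __aenter__(self):
        deadline = time.monotonic() + self.timeout
        while not await self.redis.set(self.key, "locked", nx=True, ex=self.ttl):
            if time.monotonic() > deadline:
                raise TimeoutError(f"could not acquire {self.key}")
            await asyncio.sleep(self.poll)
        return self
    async def __aexit__(self, *exc):
        await self.redis.delete(self.key)

async def demo():
    redis = FakeRedis()
    async with GPUMutex(redis, "delphos:gpu0:mutex"):
        # call_reasoning_engine(prompt) would run here, serialized per GPU.
        # A second acquire attempt must fail while the lock is held:
        held = not await redis.set("delphos:gpu0:mutex", "x", nx=True, ex=30)
    return held

print(asyncio.run(demo()))  # True: the lock was held inside the context
```

If the holder crashes before `__aexit__` runs, the `ex` TTL expires the key and the lock frees itself, matching the 30-second TTL in the table.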
```python
# Usage pattern (simplified)
async with GPUMutex(redis, "delphos:gpu0:mutex", timeout=120):
    response = await call_reasoning_engine(prompt)
```

Data Volumes
All persistent data lives under /data/delphos/ on the host:
```
/data/delphos/
  models/
    reasoning-engine/      # Reasoning Engine weights
    orchestrator/          # Orchestrator weights
    dense-embeddings/      # Semantic Search Engine weights
    sparse-embeddings/     # Lexical Search Engine weights
    reranker/              # Reranker weights
  databases/
    postgres/              # PostgreSQL data directory
    redis/                 # Redis AOF persistence
  cache/
    huggingface/           # Shared HuggingFace cache
  secrets/
    postgres_password.txt  # Docker secret for DB superuser
  monitoring/
    prometheus/            # Prometheus TSDB
    grafana/               # Grafana dashboards and state
    loki/                  # Loki log index
```

Network Security
Port Binding
Only the API Gateway is accessible from the network. All other services bind to
127.0.0.1 (localhost only).
| Service | Binding | Accessible From |
|---|---|---|
| API Gateway | 0.0.0.0:8000 | Network (public) |
| Reasoning Engine | 127.0.0.1:8002 | Localhost only |
| Orchestrator | 127.0.0.1:8001 | Localhost only |
| Semantic Search | 127.0.0.1:8013 | Localhost only |
| Reranker | 127.0.0.1:8014 | Localhost only |
| Lexical Search | 127.0.0.1:8015 | Localhost only |
| Evolution API | 127.0.0.1:8080 | Localhost only |
| PostgreSQL | 127.0.0.1:5432 | Localhost only |
| Redis | 127.0.0.1:6379 | Localhost only |
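In Compose terms, the distinction is the host-interface prefix on the port mapping (illustrative fragment; service names follow the Compose file):

```yaml
services:
  api-gateway:
    ports:
      - "8000:8000"            # binds 0.0.0.0: reachable from the network
  reasoning-engine:
    ports:
      - "127.0.0.1:8002:8002"  # reachable from the host only
```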
Secrets Management
- PostgreSQL superuser password is loaded via Docker secrets (`/run/secrets/postgres_password`)
- The secret file resides at `/data/delphos/secrets/postgres_password.txt` on the host
- Application database password is set via environment variable (migration to Docker secret planned)
Recommendations
- Place a reverse proxy (NGINX, Caddy, or Traefik) in front of port 8000 for TLS termination
- Configure `CORS_ALLOWED_ORIGINS` to list only trusted frontend origins
- Use firewall rules (`ufw` or `iptables`) to restrict port 8000 access to known clients
- Rotate the Evolution API key and database passwords periodically
Monitoring Stack
An observability stack is deployed via a separate Compose file. It joins the
same delphos-network to scrape metrics from application services.
```bash
docker compose -f docker-compose-monitoring.yml up -d
```

Components
| Service | Image | Port | Purpose |
|---|---|---|---|
| Prometheus | prom/prometheus | 9090 | Metrics collection and alerting |
| Grafana | grafana/grafana | 3000 | Dashboards and visualization |
| Loki | grafana/loki | 127.0.0.1:3100 | Log aggregation |
| Promtail | grafana/promtail | 9080 | Log collection agent |
| Node Exporter | prom/node-exporter | 9100 | Host CPU, memory, disk metrics |
| DCGM Exporter | nvidia/dcgm-exporter | 9400 | GPU utilization, temperature, memory |
| Postgres Exporter | prometheuscommunity/postgres-exporter | 9187 | Query performance, connections |
| Redis Exporter | oliver006/redis_exporter | 9121 | Memory, commands, key counts |
| cAdvisor | gcr.io/cadvisor/cadvisor | 8081 | Per-container resource usage |
Key Metrics to Watch
- GPU: `DCGM_FI_DEV_GPU_UTIL`, `DCGM_FI_DEV_FB_USED` (VRAM), `DCGM_FI_DEV_GPU_TEMP`
- Inference: Request latency per service, queue depth, token throughput
- Database: Active connections, query duration p95, pgvector index scans
- Redis: Memory usage vs. 512 MB limit, eviction rate, mutex lock wait time
- Host: CPU utilization, available RAM (watch for swap pressure from KV cache)
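As an illustration, the VRAM metric can drive a Prometheus alerting rule. The rule name and threshold below are placeholders to tune against the idle figures given earlier (`DCGM_FI_DEV_FB_USED` is reported in MiB):

```yaml
groups:
  - name: delphos-gpu
    rules:
      - alert: Gpu0MemoryHigh
        expr: DCGM_FI_DEV_FB_USED{gpu="0"} > 30000   # MiB; idle is ~25 GB
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU 0 VRAM above ~30 GiB for 5 minutes (possible leak)"
```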
Retention
- Prometheus: 30 days / 10 GB (whichever is reached first)
- Loki: Configured via `loki-config.yaml` (default 7 days)
- Grafana dashboards: Provisioned from `monitoring/grafana/dashboards/`
Troubleshooting
Common Issues
Reasoning Engine fails to start (OOM)
GPU 0 requires at least 25 GB free VRAM. Ensure no other processes occupy the GPU:
```bash
nvidia-smi

# List processes holding the GPU device; kill stray ones if needed
sudo fuser -v /dev/nvidia0
```

Lexical Search Engine reports unhealthy
This service has a 120-second start period for model loading. Wait at least 2 minutes before investigating. Check logs:
```bash
docker logs sparse-embeddings --tail 50
```

API Gateway exits immediately
The gateway requires all upstream services to be healthy. Check which dependency is failing:
```bash
docker compose -f docker-compose-v2.yml ps --format "table {{.Name}}\t{{.Status}}"
```

Redis eviction warnings
If Redis exceeds 512 MB, LRU eviction activates automatically. This is expected behavior under load. Monitor with:
```bash
docker exec redis redis-cli info memory | grep used_memory_human
```

Rollback
If the v2 configuration has issues, revert to the base Compose file.
The base file (docker-compose.yml) contains only core data and application
services without the full AI engine stack — use it as an emergency fallback.
```bash
docker compose -f docker-compose-v2.yml down
docker compose -f docker-compose.yml up -d
```