Deployment Guide
DELPHOS is a containerized platform deployed via Docker Compose. This guide covers the complete infrastructure: AI engine services, application layer, data stores, networking, and observability.
Architecture Overview
The platform follows a three-tier architecture. Only the API Gateway (port 8000)
is exposed publicly. All other services bind exclusively to 127.0.0.1.
```
                      Internet / LAN
                            |
              [ :8000 ] API Gateway (FastAPI)
                      /     |     \
            ---------       |       ---------
           |                |                |
   AI Engine Tier    Application Tier    Data Tier
   (GPU Services)     (CPU Services)    (Persistence)
  +--------------+   +--------------+  +--------------+
  | Reasoning    |   | Evolution API|  | PostgreSQL   |
  | Engine :8002 |   | (WhatsApp)   |  | pgvector     |
  | [GPU 0]      |   | :8080        |  | :5432        |
  +--------------+   +--------------+  +--------------+
  | Orchestrator |                     | Redis        |
  | :8001        |                     | :6379        |
  | [GPU 1]      |                     +--------------+
  +--------------+
  | Sem. Search  |
  | :8013 [GPU1] |
  +--------------+
  | Reranker     |
  | :8014 [GPU1] |
  +--------------+
  | Lex. Search  |
  | :8015 [GPU1] |
  +--------------+
```

All inter-service communication happens over the internal delphos-network Docker bridge network. Remote services (Transcription Engine, Document Intelligence, Speaker Diarization) run on a secondary server and are accessed via HTTP.
Hardware Requirements
Primary Server
| Component | Minimum | Recommended |
|---|---|---|
| GPUs | 2x NVIDIA (24 GB each) | 2x NVIDIA (32 GB each) |
| CPU | 12 cores | 16+ cores |
| RAM | 192 GB | 256 GB |
| Storage | 250 GB NVMe | 500 GB NVMe |
| OS | Linux with NVIDIA Container Toolkit | Ubuntu 22.04+ |
GPU Memory Distribution
GPU 0 — Reasoning Engine (dedicated)
| Resource | Allocation |
|---|---|
| Reasoning Engine | ~25 GB (78% utilization) |
| Headroom | ~7 GB |
GPU 1 — Orchestrator + Retrieval Services
| Resource | Allocation |
|---|---|
| Orchestrator | ~10 GB (62% utilization) |
| Semantic Search Engine | ~2 GB (6%) |
| Reranker | ~1.3 GB (4%) |
| Lexical Search Engine | ~2 GB (6%) |
| Headroom | ~6.7 GB |
Optional Secondary Server
For Transcription Engine, Document Intelligence, and Speaker Diarization:
| Component | Minimum |
|---|---|
| GPU | 1x NVIDIA (16 GB+) |
| CPU | 8 cores |
| RAM | 32 GB |
| VRAM usage | ~10.7 GB total |
Services Reference
AI Engine Tier (GPU)
Reasoning Engine
Complex medical reasoning, differential diagnosis, clinical decision support with chain-of-thought processing.
| Parameter | Value |
|---|---|
| Internal port | 8002 |
| GPU | 0 (dedicated) |
| Context window | 16,384 tokens |
| Max concurrent requests | 2 |
| Quantization | AWQ 4-bit |
| CUDA graphs | PIECEWISE (2-3x throughput) |
| Prefix caching | Disabled (thinking mode) |
| Thinking mode | Enabled by default |
| CPU swap | 32 GB |
Orchestrator
Primary chat interface, medical conversation management, tool-calling orchestration, and routing to downstream services.
| Parameter | Value |
|---|---|
| Internal port | 8001 |
| GPU | 1 |
| Context window | 24,576 tokens |
| Max concurrent requests | 16 |
| Quantization | AWQ 4-bit |
| CUDA graphs | PIECEWISE |
| Prefix caching | Enabled |
| Tool calling | Enabled (auto tool choice) |
| CPU swap | 64 GB |
Semantic Search Engine
Powers semantic search over clinical guidelines, medical codes, and knowledge base retrieval.
| Parameter | Value |
|---|---|
| Internal port | 8013 |
| GPU | 1 |
| Context window | 8,192 tokens |
| Output dimensions | 1,024 |
| API endpoint | POST /v1/embeddings |
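The `/v1/embeddings` route suggests an OpenAI-compatible embeddings API. A minimal client sketch, assuming the standard `model`/`input` request fields and `data[*].embedding` response shape (field names are assumptions, not confirmed by this guide):

```python
import json
import urllib.request

# Internal hostname from the Compose network; from the host, use 127.0.0.1:8013.
EMBEDDINGS_URL = "http://embeddings:8013/v1/embeddings"

def embed_payload(texts: list[str], model: str = "default") -> dict:
    """Build an OpenAI-style embeddings request body (field names assumed)."""
    return {"model": model, "input": texts}

def embed(texts: list[str], url: str = EMBEDDINGS_URL) -> list[list[float]]:
    """POST the texts and return one vector per input.
    Each vector should have the advertised 1,024 output dimensions."""
    req = urllib.request.Request(
        url,
        data=json.dumps(embed_payload(texts)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = json.load(resp)
    return [item["embedding"] for item in body["data"]]
```

Inputs longer than the 8,192-token context window should be chunked by the caller before embedding.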
Reranker
Cross-encoder reranking of search results for higher retrieval precision. Scores query-document pairs and returns relevance scores.
| Parameter | Value |
|---|---|
| Internal port | 8014 |
| GPU | 1 |
| Context window | 512 tokens |
| API endpoint | POST /v1/score |
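A hedged sketch of using the score endpoint. The `text_1`/`text_2` request fields mirror the scoring API some OpenAI-compatible inference servers expose; they and the `rerank` helper are assumptions, not taken from this guide:

```python
def score_payload(query: str, documents: list[str]) -> dict:
    """Request body for POST /v1/score: one query against many documents.
    Field names are an assumption; check the service's own API docs.
    Note the 512-token context window: truncate long documents first."""
    return {"text_1": query, "text_2": documents}

def rerank(documents: list[str], scores: list[float], top_k: int = 5) -> list[str]:
    """Order documents by the relevance scores the service returned, best first."""
    ranked = sorted(zip(scores, documents), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]
```

Typical flow: retrieve a candidate set via the search engines, then pass the query-document pairs through the Reranker and keep the top few.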
Lexical Search Engine
Lexical matching for hybrid search. Combines with the Semantic Search Engine to improve retrieval precision on medical terminology and procedure codes.
| Parameter | Value |
|---|---|
| Internal port | 8015 |
| GPU | 1 |
| API endpoint | POST /v1/sparse/encode |
| Start period | 120s (model loading) |
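The guide does not state how the dense and lexical rankings are merged for hybrid search; Reciprocal Rank Fusion (RRF) is one common choice, shown here purely as an illustration:

```python
def rrf_fuse(dense: list[str], lexical: list[str], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank).
    Documents appearing high in either ranking float to the top; k damps
    the influence of the very first positions."""
    scores: dict[str, float] = {}
    for ranking in (dense, lexical):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The fused list can then be passed to the Reranker for a final precision pass.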
Application Tier (CPU)
API Gateway
Unified FastAPI entry point for all DELPHOS services. Routes requests to the appropriate backend service.
| Parameter | Value |
|---|---|
| Port | 8000 (public) |
| Framework | FastAPI |
| Documentation | /docs (Swagger), /redoc (ReDoc) |
| Health check | GET /health |
| API prefix | /v1/* |
The API Gateway starts only after all upstream services pass their health checks.
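A minimal client sketch against the gateway. The `build_request` helper and the example `chat` route are hypothetical; the real routes are listed at `/docs`:

```python
import json
import urllib.request

GATEWAY = "http://localhost:8000"

def health_check(base_url: str = GATEWAY) -> bool:
    """GET /health on the API Gateway; True on HTTP 200, False if unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

def build_request(path: str, payload: dict, base_url: str = GATEWAY) -> urllib.request.Request:
    """Build a JSON POST against the gateway's /v1/* prefix."""
    return urllib.request.Request(
        url=f"{base_url}/v1/{path.lstrip('/')}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

if __name__ == "__main__":
    # "chat" is a placeholder route; consult /docs (Swagger) for the real paths.
    req = build_request("chat", {"message": "hello"})
    print(req.full_url)
```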
Evolution API (WhatsApp)
Self-hosted WhatsApp Business API for patient communication and voice rescheduling workflows.
| Parameter | Value |
|---|---|
| Internal port | 8080 |
| Feature flag | WHATSAPP_ENABLED |
| Condition | Optional — enable via environment variable |
| Database | Shares PostgreSQL instance |
Data Tier
PostgreSQL (pgvector)
| Parameter | Value |
|---|---|
| Image | pgvector/pgvector:pg16 |
| Internal port | 5432 |
| Extensions | pgvector (vector similarity search) |
| Data volume | /data/delphos/databases/postgres |
| Secrets | Password via Docker secret file |
Redis
| Parameter | Value |
|---|---|
| Image | redis:7-alpine |
| Internal port | 6379 |
| Max memory | 512 MB |
| Eviction policy | allkeys-lru |
| Persistence | AOF (append-only file) |
| Data volume | /data/delphos/databases/redis |
Redis serves as session cache, response cache, rate limiter, and GPU mutex coordinator.
Environment Variables
Database
| Variable | Description | Example |
|---|---|---|
| `DB_HOST` | PostgreSQL hostname | `postgres` |
| `DB_PORT` | PostgreSQL port | `5432` |
| `DB_NAME` | Database name | `delphos` |
| `DB_USER` | Application user | `delphos_app` |
| `DB_PASSWORD` | Application password | (use Docker secret) |
| `REDIS_URL` | Redis connection string | `redis://redis:6379` |
Service Endpoints
| Variable | Description | Default |
|---|---|---|
| `ORCHESTRATOR_ENDPOINT` | Orchestrator base URL | `http://orchestrator:8001` |
| `REASONING_ENGINE_ENDPOINT` | Reasoning Engine base URL | `http://reasoning-engine:8002` |
| `EMBEDDINGS_ENDPOINT` | Semantic Search base URL | `http://embeddings:8013` |
| `RERANKER_ENDPOINT` | Reranker base URL | `http://reranker:8014` |
| `SPARSE_EMBEDDINGS_URL` | Sparse encode endpoint | `http://sparse-embeddings:8015/v1/sparse/encode` |
Remote Services (Secondary Server)
| Variable | Description | Example |
|---|---|---|
| `TRANSCRIPTION_ENDPOINT` | Transcription Engine URL | `http://<secondary-ip>:8001` |
| `OCR_ENDPOINT` | Document Intelligence URL | `http://<secondary-ip>:8002` |
| `DIARIZATION_ENDPOINT` | Speaker Diarization URL | `http://<secondary-ip>:8005` |
Feature Flags
| Variable | Description | Default |
|---|---|---|
| `USE_KITT_DISPATCHER` | Enable intelligent request dispatcher | `true` |
| `USE_AGENTIC_ROUTER` | Enable agentic tool-calling router | `true` |
| `QUERY_EXPANSION_ENABLED` | Enable query expansion for retrieval | `true` |
| `WHATSAPP_ENABLED` | Enable WhatsApp integration | `false` |
| `KITT_PLANNING_TIMEOUT` | Dispatcher planning timeout (seconds) | `15` |
CORS
| Variable | Description |
|---|---|
| `CORS_ALLOWED_ORIGINS` | Comma-separated list of allowed origins |
Logging
| Variable | Description | Default |
|---|---|---|
| `LOG_FORMAT` | Log output format | `json` |
| `LOG_LEVEL` | Minimum log level | `INFO` |
| `LOG_SERVICE_NAME` | Service identifier in logs | `api-gateway` |
WhatsApp (Evolution API)
| Variable | Description |
|---|---|
| `EVOLUTION_API_KEY` | Authentication key for Evolution API |
| `EVOLUTION_WEBHOOK_URL` | Webhook URL for incoming WhatsApp events |
| `EVOLUTION_DATABASE_URL` | PostgreSQL connection URI for Evolution |
Startup Procedure
Services must start in dependency order. GPU model loading takes significant time; the health checks gate downstream services automatically.
Step 1 — Data services
```bash
docker compose -f docker-compose-v2.yml up -d postgres redis
```

Wait for both to report healthy:
```bash
docker compose -f docker-compose-v2.yml ps postgres redis
```

Step 2 — GPU 1 services (Orchestrator + Retrieval)
```bash
docker compose -f docker-compose-v2.yml up -d \
  orchestrator embeddings reranker sparse-embeddings
```

The Orchestrator loads in approximately 90 seconds. The Semantic Search Engine and Reranker load in approximately 60 seconds. The Lexical Search Engine requires up to 120 seconds for model initialization.
Step 3 — GPU 0 service (Reasoning Engine)
```bash
docker compose -f docker-compose-v2.yml up -d reasoning-engine
```

The Reasoning Engine loads a large model and may take up to 120 seconds to become healthy.
Step 4 — API Gateway
```bash
docker compose -f docker-compose-v2.yml up -d api-gateway
```

The gateway has `depends_on` conditions for all upstream services. It will not start until every dependency passes its health check.
Step 5 — WhatsApp (optional)
```bash
docker compose -f docker-compose-v2.yml up -d evolution-api
```

Docker Compose respects `depends_on` with health check conditions, so a single command works — services will wait for their dependencies automatically:

```bash
docker compose -f docker-compose-v2.yml up -d
```

Verify Deployment
```bash
# Check all services are healthy
docker compose -f docker-compose-v2.yml ps

# Test API Gateway
curl http://localhost:8000/health

# Monitor GPU memory
watch -n 1 nvidia-smi
```

Expected idle usage: GPU 0 approximately 25 GB, GPU 1 approximately 16 GB. Significantly higher readings may indicate a resource leak.
Health Checks
Every service includes a Docker health check. The API Gateway uses these to gate its own startup.
| Service | Endpoint | Interval | Start Period | Retries |
|---|---|---|---|---|
| Reasoning Engine | GET /health | 30s | 120s | 3 |
| Orchestrator | GET /health | 30s | 90s | 3 |
| Semantic Search | GET /health | 30s | 60s | 3 |
| Reranker | GET /health | 30s | 60s | 3 |
| Lexical Search | GET /health | 30s | 120s | 3 |
| API Gateway | GET /health | 30s | 15s | 3 |
| Evolution API | GET /api/health | 30s | 30s | 3 |
| PostgreSQL | pg_isready | 10s | 30s | 5 |
| Redis | redis-cli ping | 10s | 10s | 5 |
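As a sketch, a row of the table above maps onto a Compose healthcheck like the following (the `curl` test command is illustrative; the authoritative definitions live in docker-compose-v2.yml):

```yaml
services:
  sparse-embeddings:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8015/health"]
      interval: 30s
      start_period: 120s   # grace period for model loading
      retries: 3
```

Dependent services reference these checks with `depends_on: { sparse-embeddings: { condition: service_healthy } }`, which is how the API Gateway gates its own startup.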
GPU Mutex
GPU 0 is dedicated to the Reasoning Engine with a hard limit of 2 concurrent sequences. To prevent request queuing and timeouts, the API Gateway implements a Redis-based distributed lock.
Implementation details:
| Parameter | Value |
|---|---|
| Redis key | delphos:gpu0:mutex |
| Lock mechanism | SETNX (atomic acquire) |
| TTL | 30 seconds |
| Timeout | 120 seconds (max inference time) |
The lock is acquired before any call to the Reasoning Engine and released upon completion. If the lock holder crashes, the TTL ensures automatic release.
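A minimal, self-contained sketch of such a lock, using an in-memory stand-in for Redis so the example runs anywhere. The real `SET NX EX` call has the same shape in redis-py's asyncio client; the gateway's actual implementation may differ in details such as poll interval and lock tokens:

```python
import asyncio
import time

class FakeRedis:
    """In-memory stand-in for the two Redis commands the lock needs."""
    def __init__(self):
        self._store = {}
    async def set(self, key, value, nx=False, ex=None):
        now = time.monotonic()
        current = self._store.get(key)
        if nx and current and current[1] > now:
            return False  # key exists and has not expired: acquire fails
        self._store[key] = (value, now + (ex or float("inf")))
        return True
    async def delete(self, key):
        self._store.pop(key, None)

class GPUMutex:
    """Async context manager: acquire with SET NX EX (atomic, with TTL),
    polling until `timeout` expires; release by deleting the key."""
    def __init__(self, redis, key, ttl=30, timeout=120, poll=0.05):
        self.redis, self.key, self.ttl, self.timeout, self.poll = redis, key, ttl, timeout, poll
    async def __aenter__(self):
        deadline = time.monotonic() + self.timeout
        while not await self.redis.set(self.key, "locked", nx=True, ex=self.ttl):
            if time.monotonic() > deadline:
                raise TimeoutError(f"could not acquire {self.key}")
            await asyncio.sleep(self.poll)
        return self
    async def __aexit__(self, *exc):
        await self.redis.delete(self.key)

async def demo():
    redis = FakeRedis()
    async with GPUMutex(redis, "delphos:gpu0:mutex"):
        # call_reasoning_engine(prompt) would run here, serialized per GPU.
        # A second acquire attempt must fail while the lock is held:
        held = not await redis.set("delphos:gpu0:mutex", "x", nx=True, ex=30)
    return held

print(asyncio.run(demo()))  # True: the lock was held inside the context
```

If the holder crashes before `__aexit__` runs, the `ex` TTL expires the key and the lock frees itself, matching the 30-second TTL in the table.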
```python
# Usage pattern (simplified)
async with GPUMutex(redis, "delphos:gpu0:mutex", timeout=120):
    response = await call_reasoning_engine(prompt)
```

Data Volumes
All persistent data lives under /data/delphos/ on the host:
```
/data/delphos/
  models/
    reasoning-engine/      # Reasoning Engine weights
    orchestrator/          # Orchestrator weights
    dense-embeddings/      # Semantic Search Engine weights
    sparse-embeddings/     # Lexical Search Engine weights
    reranker/              # Reranker weights
  databases/
    postgres/              # PostgreSQL data directory
    redis/                 # Redis AOF persistence
  cache/
    huggingface/           # Shared HuggingFace cache
  secrets/
    postgres_password.txt  # Docker secret for DB superuser
  monitoring/
    prometheus/            # Prometheus TSDB
    grafana/               # Grafana dashboards and state
    loki/                  # Loki log index
```

Network Security
Port Binding
Only the API Gateway is accessible from the network. All other services bind to
127.0.0.1 (localhost only).
| Service | Binding | Accessible From |
|---|---|---|
| API Gateway | 0.0.0.0:8000 | Network (public) |
| Reasoning Engine | 127.0.0.1:8002 | Localhost only |
| Orchestrator | 127.0.0.1:8001 | Localhost only |
| Semantic Search | 127.0.0.1:8013 | Localhost only |
| Reranker | 127.0.0.1:8014 | Localhost only |
| Lexical Search | 127.0.0.1:8015 | Localhost only |
| Evolution API | 127.0.0.1:8080 | Localhost only |
| PostgreSQL | 127.0.0.1:5432 | Localhost only |
| Redis | 127.0.0.1:6379 | Localhost only |
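In Compose terms, the distinction is the host-interface prefix on the port mapping (illustrative fragment; service names follow the Compose file):

```yaml
services:
  api-gateway:
    ports:
      - "8000:8000"            # binds 0.0.0.0: reachable from the network
  reasoning-engine:
    ports:
      - "127.0.0.1:8002:8002"  # reachable from the host only
```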
Secrets Management
- PostgreSQL superuser password is loaded via Docker secrets (`/run/secrets/postgres_password`)
- The secret file resides at `/data/delphos/secrets/postgres_password.txt` on the host
- Application database password is set via environment variable (migration to Docker secret planned)
Recommendations
- Place a reverse proxy (NGINX, Caddy, or Traefik) in front of port 8000 for TLS termination
- Configure `CORS_ALLOWED_ORIGINS` to list only trusted frontend origins
- Use firewall rules (`ufw` or `iptables`) to restrict port 8000 access to known clients
- Rotate the Evolution API key and database passwords periodically
Monitoring Stack
An observability stack is deployed via a separate Compose file. It joins the
same delphos-network to scrape metrics from application services.
```bash
docker compose -f docker-compose-monitoring.yml up -d
```

Components
| Service | Image | Port | Purpose |
|---|---|---|---|
| Prometheus | prom/prometheus | 9090 | Metrics collection and alerting |
| Grafana | grafana/grafana | 3000 | Dashboards and visualization |
| Loki | grafana/loki | 127.0.0.1:3100 | Log aggregation |
| Promtail | grafana/promtail | 9080 | Log collection agent |
| Node Exporter | prom/node-exporter | 9100 | Host CPU, memory, disk metrics |
| DCGM Exporter | nvidia/dcgm-exporter | 9400 | GPU utilization, temperature, memory |
| Postgres Exporter | prometheuscommunity/postgres-exporter | 9187 | Query performance, connections |
| Redis Exporter | oliver006/redis_exporter | 9121 | Memory, commands, key counts |
| cAdvisor | gcr.io/cadvisor/cadvisor | 8081 | Per-container resource usage |
Key Metrics to Watch
- GPU: `DCGM_FI_DEV_GPU_UTIL`, `DCGM_FI_DEV_FB_USED` (VRAM), `DCGM_FI_DEV_GPU_TEMP`
- Inference: Request latency per service, queue depth, token throughput
- Database: Active connections, query duration p95, pgvector index scans
- Redis: Memory usage vs. 512 MB limit, eviction rate, mutex lock wait time
- Host: CPU utilization, available RAM (watch for swap pressure from KV cache)
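As an illustration, the VRAM metric can drive a Prometheus alerting rule. The rule name and threshold below are placeholders to tune against the idle figures given earlier (`DCGM_FI_DEV_FB_USED` is reported in MiB):

```yaml
groups:
  - name: delphos-gpu
    rules:
      - alert: Gpu0MemoryHigh
        expr: DCGM_FI_DEV_FB_USED{gpu="0"} > 30000   # MiB; idle is ~25 GB
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU 0 VRAM above ~30 GiB for 5 minutes (possible leak)"
```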
Retention
- Prometheus: 30 days / 10 GB (whichever is reached first)
- Loki: Configured via `loki-config.yaml` (default 7 days)
- Grafana dashboards: Provisioned from `monitoring/grafana/dashboards/`
Troubleshooting
Common Issues
Reasoning Engine fails to start (OOM)
GPU 0 requires at least 25 GB free VRAM. Ensure no other processes occupy the GPU:
```bash
nvidia-smi

# List processes holding the GPU device; kill stray ones if needed
sudo fuser -v /dev/nvidia0
```

Lexical Search Engine reports unhealthy
This service has a 120-second start period for model loading. Wait at least 2 minutes before investigating. Check logs:
```bash
docker logs sparse-embeddings --tail 50
```

API Gateway exits immediately
The gateway requires all upstream services to be healthy. Check which dependency is failing:
```bash
docker compose -f docker-compose-v2.yml ps --format "table {{.Name}}\t{{.Status}}"
```

Redis eviction warnings
If Redis exceeds 512 MB, LRU eviction activates automatically. This is expected behavior under load. Monitor with:
```bash
docker exec redis redis-cli info memory | grep used_memory_human
```

Rollback
If the v2 configuration has issues, revert to the base Compose file.
The base file (docker-compose.yml) contains only core data and application
services without the full AI engine stack — use it as an emergency fallback.
```bash
docker compose -f docker-compose-v2.yml down
docker compose -f docker-compose.yml up -d
```