Production architecture
This page covers two audiences. SaaS customers integrating against a hosted Ledgix Vault only need the first section. Enterprise customers self-hosting Vault should read all three sections.
Integration requirements (SaaS customers)
Most customers do not need Ledgix hosting details. You do need a clean picture of what your own application must have before it can integrate safely.
1. A Vault URL per environment. Use the correct Ledgix URL for development, staging, and production. Do not mix API keys across environments.
2. A tenant API key. Create the key in the customer dashboard and store it in your own secret-management flow. Ledgix shows the raw key only once.
3. Outbound HTTPS access. Your application, worker, or server-side route must be able to reach the Vault URL over TLS.
4. One protected tool boundary. Decide whether your first rollout will call the protected service directly after approval or send the token to an optional gateway in front of that service.
Recommended integration shape
Customer responsibilities
- store LEDGIX_VAULT_API_KEY securely in your own environment
- keep LEDGIX_VAULT_URL aligned with the environment you are deploying to
- make sure server-side code, workers, and background jobs use the same customer settings
- decide who owns manual review and notification routing before going live
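The first two responsibilities can be enforced at startup. A minimal sketch, assuming a Python integration; the helper name `load_ledgix_settings` is illustrative and not part of any Ledgix SDK:

```python
import os

def load_ledgix_settings() -> dict:
    """Fail fast if the Ledgix settings listed above are missing.

    Illustrative sketch: only the environment variable names come from
    this page; the helper itself is not part of any Ledgix SDK.
    """
    settings = {
        "vault_url": os.environ.get("LEDGIX_VAULT_URL", ""),
        "api_key": os.environ.get("LEDGIX_VAULT_API_KEY", ""),
    }
    missing = [name for name, value in settings.items() if not value]
    if missing:
        raise RuntimeError(f"missing Ledgix settings: {missing}")
    if not settings["vault_url"].startswith("https://"):
        # Outbound access must be TLS, so reject plain-HTTP URLs early.
        raise RuntimeError("LEDGIX_VAULT_URL must be an https:// URL")
    return settings
```

Calling this once at process start, in every runtime that performs the protected action, catches the mixed-environment and missing-key failures listed later on this page before any traffic flows.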
Optional gateway decision
You can start without a gateway if the protected action is simple and your application owns the full execution path. Add a gateway when:
- more than one service can trigger the same sensitive action
- you want the protected service to require a Ledgix token centrally
- you need a clean boundary between approval and execution
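The second bullet can be made concrete. A deliberately thin sketch of the gateway boundary, assuming the token arrives as a Bearer header; a real gateway would verify the A-JWT signature against the Vault JWKS rather than merely check for its presence:

```python
def gateway_allows(headers: dict) -> bool:
    """Sketch of the gateway boundary: require a Ledgix token centrally.

    Illustrative only. A production gateway must verify the A-JWT
    signature, claims, and jti against the Vault, not just the header.
    """
    auth = headers.get("Authorization", "")
    return auth.startswith("Bearer ") and len(auth) > len("Bearer ")
```

The point of the sketch is the placement, not the check: once this sits in front of the protected service, every caller, not just your first integration, must present a token before the sensitive action runs.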
Go-live checklist
- The application can reach the Vault URL from every runtime that performs the protected action.
- The correct API key is present in that runtime.
- A policy source has already been uploaded for the first tool you are guarding.
- Reviewers know where pending requests appear and how Slack or email notifications are routed.
- The team has tested at least one approved and one blocked or paused request before production traffic.
Common integration failures
- Wrong Vault URL for the environment.
- Expired, deleted, or misplaced API key.
- Policy content uploaded after the team started testing, which makes early results look inconsistent.
- Threshold and notification settings left at defaults without deciding who handles review.
Runtime architecture (self-hosted enterprise)
A Ledgix production deployment is two services, one TLS ingress, and a backing data plane.
Two-database model
Ledgix deliberately splits control-plane state from tenant state.
- Control plane (Supabase Postgres): memberships, tenant metadata, and the clients table. clients.tenant_secret_ref points at an AWS Secrets Manager entry; the row itself never contains tenant DB passwords, ledger transport keys, or Confluence tokens.
- Tenant databases (one Postgres per tenant): policies (with pgvector embeddings) and the tenant's ledger. Credentials come from Secrets Manager at runtime.
This split is what makes tenant isolation real. A breach of the control plane does not expose tenant policy content or ledger entries.
A-JWT signing
Approval tokens are signed with Ed25519 (EdDSA). Vault publishes the public key at GET /.well-known/jwks.json.
Claims embedded in every A-JWT:
| Claim | Meaning |
|---|---|
| iss | alcv-vault by default (override with VAULT_JWT_ISSUER) |
| aud | ledgix-sdk by default (override with VAULT_JWT_AUDIENCE) |
| exp | iat + VAULT_JWT_TTL (default 300 seconds) |
| jti | Unique token ID. Burned on /consume-token; replay returns 409. |
| tool | The tool the request was approved for |
| agent_id, session_id, policy_id | Context for audit and review |
| decision | yes, no, or review |
| tool_args_hash | SHA-256 of canonical JSON of the approved arguments |
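The tool_args_hash claim can be recomputed client-side to confirm that what executes is exactly what was approved. A sketch that assumes sorted-key, compact-separator JSON as the canonical form, since the table above does not pin down the canonicalization:

```python
import hashlib
import json

def tool_args_hash(args: dict) -> str:
    # Sorted keys + compact separators as the assumed canonical JSON
    # form; the claim table only says "canonical JSON", so treat this
    # canonicalization as an assumption, not the Vault specification.
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because the hash is key-order independent, two services that build the same arguments in different orders still agree on the digest, which is what makes it usable as a tamper check at execution time.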
Merkle ledger and anchoring
Every decision is persisted to the tenant's ledger on DB commit. A background anchor loop sequences accepted events into an append-only Merkle tree, signs the checkpoint, and exports it to S3.
- Durability is synchronous on DB commit.
- Sequencing and anchoring are asynchronous; they run on the VAULT_LEDGER_ANCHOR_BACKFILL_INTERVAL_SECONDS loop (default 30s).
- The anchor bucket must have versioning enabled. Object lock is recommended but not required.
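To make the sequencing step concrete, here is a generic Merkle-root computation over ledger entries. It is illustrative only; the real tree's leaf encoding, domain separation, and odd-node handling are implementation details not documented on this page:

```python
import hashlib

def merkle_root(leaves: list[bytes]) -> bytes:
    """Compute a Merkle root over a batch of ledger entries.

    Generic sketch: hash each leaf, then repeatedly hash adjacent
    pairs until one root remains. Any single changed entry changes
    the root, which is what makes the signed checkpoint tamper-evident.
    """
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate last node on odd levels
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]
```

Signing this root and exporting it to versioned S3 is what lets an auditor later prove that no accepted event was silently rewritten or dropped.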
Async clearance queue
POST /request-clearance is queue-backed so slow judge calls cannot block the HTTP path.
- Default: in-memory queue (VAULT_CLEARANCE_ASYNC_WORKERS, VAULT_CLEARANCE_ASYNC_QUEUE_SIZE).
- Production: point VAULT_CLEARANCE_SQS_QUEUE_URL at an SQS FIFO queue. VAULT_CLEARANCE_SQS_FIFO_GROUP_SHARDS controls parallelism across FIFO groups.
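One way to picture the shard setting: each principal hashes to one of the FIFO message groups, messages within a group are processed in order, and the group count therefore bounds consumer parallelism. The hashing scheme below is an assumption for illustration, not the Vault source:

```python
import hashlib

def fifo_group_id(tenant_id: str, shards: int = 32) -> str:
    """Map a tenant to one of the FIFO message groups (hypothetical).

    SQS FIFO processes each message group in order, so the shard
    count (VAULT_CLEARANCE_SQS_FIFO_GROUP_SHARDS, default 32) caps
    how many clearance requests can be drained concurrently.
    """
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:4], "big") % shards
    return f"clearance-shard-{shard}"
```

Raising the shard count increases throughput at the cost of weaker ordering guarantees across requests that used to share a group.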
Self-hosted configuration reference
Vault reads configuration from environment variables. When AWS_SECRET_NAME is set, Vault pulls the named Secrets Manager bundle on startup and merges it over the environment.
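Reading "merged over the environment" as "bundle values win on collision", the startup merge behaves like this sketch (hypothetical helper, not Vault code):

```python
def effective_config(environ: dict, bundle: dict) -> dict:
    # "Merged over the environment" is read here as: the Secrets
    # Manager bundle wins on key collisions. A sketch of the assumed
    # precedence, not the Vault implementation.
    merged = dict(environ)
    merged.update(bundle)
    return merged
```

Under that precedence, a container can ship only non-sensitive defaults in its environment and rely on the AWS_SECRET_NAME bundle to supply or override everything sensitive at startup.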
Vault — transport and signing
| Field | Type | Required | Description |
|---|---|---|---|
| VAULT_HOST | string | No | Bind address. Defaults to 0.0.0.0. |
| VAULT_PORT | int | No | Listen port. Defaults to 8000. |
| VAULT_SIGNER_BACKEND | enum | No | local or aws_kms. Default local. |
| VAULT_PRIVATE_KEY_FILE | path | Conditional | Ed25519 private key path (local backend). |
| VAULT_PRIVATE_KEY_BASE64 | string | Conditional | Base64-encoded Ed25519 private key (local backend). |
| VAULT_KMS_KEY_ID | string | Conditional | KMS key id (aws_kms backend). |
| VAULT_KEY_ID | string | No | JWKS kid header. Default vault-key-001. |
| VAULT_JWT_TTL | seconds | No | A-JWT validity. Default 300. |
| VAULT_JWT_ISSUER | string | No | iss claim. Default alcv-vault. |
| VAULT_JWT_AUDIENCE | string | No | aud claim. Default ledgix-sdk. |
| VAULT_CORS_ALLOWED_ORIGIN | string | No | CORS origin, e.g. https://dashboard.example.com. |
Vault — data plane
| Field | Type | Required | Description |
|---|---|---|---|
| DATABASE_URL | DSN | Yes | Control plane Postgres DSN. |
| VAULT_CONTROL_PLANE_DB_MAX_OPEN_CONNS | int | No | Control plane pool size. Default 50. |
| VAULT_TENANT_DB_MAX_OPEN_CONNS | int | No | Per-tenant pool size. Default 25. |
| VAULT_TENANT_DB_SSLMODE | enum | No | Per-tenant SSL mode. Default require. |
| TENANT_SECRET_PREFIX | string | Yes | Secrets Manager prefix for per-tenant bundles. Example: ledgix/tenants/prod. |
| VAULT_TENANT_SECRET_CACHE_TTL_SECONDS | seconds | No | Tenant secret cache TTL. Default 300. |
| AWS_SECRET_NAME | string | No | Name of the Vault bootstrap secret. Pulled on startup and merged over env. |
| AWS_REGION | string | No | Default AWS region. Default us-east-1. |
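The tenant secret cache TTL trades Secrets Manager call volume against rotation staleness: within the TTL a rotated secret is served stale, and after it the next lookup refetches. A minimal sketch of that behavior with an injectable clock (the class and method names are illustrative):

```python
import time

class TenantSecretCache:
    """TTL cache in the spirit of VAULT_TENANT_SECRET_CACHE_TTL_SECONDS.

    Illustrative sketch: the default 300s TTL bounds both the Secrets
    Manager request rate and the window during which a rotated tenant
    secret can still be served from cache.
    """

    def __init__(self, ttl_seconds: float = 300.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries: dict[str, tuple[float, dict]] = {}

    def get(self, tenant_id: str, fetch) -> dict:
        now = self.clock()
        entry = self._entries.get(tenant_id)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]  # fresh: no Secrets Manager round trip
        secret = fetch(tenant_id)  # expired or missing: refetch
        self._entries[tenant_id] = (now, secret)
        return secret
```

Shortening the TTL makes rotation take effect faster at the cost of more Secrets Manager traffic per tenant.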
Vault — clearance queue and rate limiting
| Field | Type | Required | Description |
|---|---|---|---|
| VAULT_CLEARANCE_ASYNC_WORKERS | int | No | Workers draining the in-memory queue. Default 16. |
| VAULT_CLEARANCE_ASYNC_QUEUE_SIZE | int | No | In-memory queue depth. Default 2048. |
| VAULT_CLEARANCE_SQS_QUEUE_URL | URL | No | When set, switches async clearance to SQS FIFO. |
| VAULT_CLEARANCE_SQS_FIFO_GROUP_SHARDS | int | No | FIFO group shard count. Default 32. |
| VAULT_CLEARANCE_SQS_WAIT_TIME_SECONDS | seconds | No | Long-poll wait. Default 20. |
| VAULT_CLEARANCE_SQS_VISIBILITY_TIMEOUT_SECONDS | seconds | No | Message visibility timeout. Default 60. |
| VAULT_RATE_LIMIT_RPS | int | No | Per-principal rate limit. Default 20. Set 0 to disable. |
| VAULT_RATE_LIMIT_BURST | int | No | Burst allowance. Default 40. |
| VAULT_RATE_LIMIT_TTL_SECONDS | seconds | No | Rate limit window TTL. Default 180. |
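The RPS and burst defaults describe token-bucket semantics: a refill of 20 tokens per second with a bucket capacity of 40. A sketch of those semantics only; the Vault implementation may differ:

```python
class TokenBucket:
    """Per-principal limiter matching the VAULT_RATE_LIMIT_* defaults.

    Sketch of the semantics: 20 requests/second steady state
    (VAULT_RATE_LIMIT_RPS) with a burst allowance of 40
    (VAULT_RATE_LIMIT_BURST). Not the Vault source.
    """

    def __init__(self, rps: float = 20.0, burst: float = 40.0):
        self.rps = rps
        self.burst = burst
        self.tokens = burst  # start full: a cold principal may burst
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rps)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Setting VAULT_RATE_LIMIT_RPS to 0 disables the limiter entirely, per the table above.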
Vault — ledger anchoring
| Field | Type | Required | Description |
|---|---|---|---|
| VAULT_LEDGER_ANCHOR_BUCKET | string | Conditional | S3 bucket for manifest anchoring. |
| VAULT_LEDGER_ANCHOR_BUCKET_TEMPLATE | string | No | Per-tenant bucket template, e.g. ledgix-anchor-{client_id}. |
| VAULT_LEDGER_ANCHOR_PREFIX | string | No | Object key prefix inside the bucket. |
| VAULT_LEDGER_ANCHOR_REGION | string | No | Bucket region. Falls back to AWS_REGION. |
| VAULT_LEDGER_REQUIRE_BUCKET_VERSIONING | bool | No | Default true. Startup fails if the bucket does not have versioning enabled. |
| VAULT_LEDGER_REQUIRE_OBJECT_LOCK | bool | No | Default false. Enforces S3 object lock when true. |
| VAULT_LEDGER_ANCHOR_BACKFILL_INTERVAL_SECONDS | seconds | No | Checkpoint export cadence. Default 30. |
| VAULT_LEDGER_ANCHOR_BACKFILL_BATCH_SIZE | int | No | Max entries per export batch. Default 500. |
| VAULT_TRANSPORT_KEY_BASE64 | string | Conditional | 32-byte base64 key for ledger entry encryption at rest. |
Vault — judge integration
| Field | Type | Required | Description |
|---|---|---|---|
| VAULT_JUDGE_URL | URL | Yes | llm-judge endpoint. |
| VAULT_JUDGE_API_KEY | string | Yes | Service-to-service key sent as X-API-Key. |
| VAULT_ALLOW_STUB_JUDGE | bool | No | Use the deterministic stub judge (development only). |
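The stub judge's value in development is determinism: the same request always yields the same decision, so tests are repeatable without an LLM call. A hypothetical stand-in with that property (the real stub's rules are not documented on this page):

```python
import hashlib

def stub_judge(tool: str, args_hash: str) -> str:
    """Deterministic stand-in in the spirit of VAULT_ALLOW_STUB_JUDGE.

    Hypothetical: only the decision vocabulary (yes / no / review)
    comes from the A-JWT claims table; the mapping below is invented
    purely to illustrate determinism.
    """
    digest = hashlib.sha256(f"{tool}:{args_hash}".encode("utf-8")).digest()
    return ("yes", "no", "review")[digest[0] % 3]
```

Because the decision depends only on the request, an integration test can assert on exact outcomes, something a real model-backed judge cannot guarantee.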
llm-judge — configuration
| Field | Type | Required | Description |
|---|---|---|---|
| EMBEDDING_MODEL | string | No | LiteLLM embedding model. Default bedrock/amazon.titan-embed-text-v2:0. Changing this requires re-embedding existing policy chunks. |
| EVAL_MODEL | string | No | LiteLLM evaluation model. Default bedrock/amazon.nova-pro-v1:0. |
| JUDGE_API_KEY | string | Yes | Service-to-service API key matching VAULT_JUDGE_API_KEY. |
| JUDGE_ENDPOINT_RATE_LIMIT | string | No | slowapi rate limit string. Default 600/minute. |
| JUDGE_UVICORN_WORKERS | int | No | Worker process count. Default 2. |
| DATABASE_URL | DSN | Yes | Postgres DSN (control plane access for tenant routing). |
| TENANT_SECRET_PREFIX | string | Yes | Must match the Vault prefix. |
| AWS_SECRET_NAME | string | No | Judge bootstrap secret name. |
| LOG_FORMAT | enum | No | json (default, Datadog/CloudWatch) or text. |
Deploy via docker-compose
The reference stack lives in deployments/docker-compose.yml. It launches vault and llm_judge on a private Docker network with Caddy as the TLS ingress.
Run docker compose --env-file .env.prod up -d --build. The .env.prod file only needs non-sensitive values, primarily VAULT_AWS_SECRET_NAME, JUDGE_AWS_SECRET_NAME, TENANT_SECRET_PREFIX, and AWS_REGION. Each container fetches its secret bundle from AWS Secrets Manager on startup.
For brownfield enterprise rollouts onto pre-existing infrastructure, follow deployments/EXISTING_INFRA_ENTERPRISE_ROLLOUT.md rather than the Terraform stack in deployments/terraform/ (which is greenfield only).
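For reference, a .env.prod along these lines is enough to bootstrap the compose stack; the variable names come from this page, while every value shown is a placeholder:

```shell
# Non-sensitive bootstrap values only; real secrets stay in Secrets Manager.
VAULT_AWS_SECRET_NAME=ledgix/vault/prod    # placeholder secret name
JUDGE_AWS_SECRET_NAME=ledgix/judge/prod    # placeholder secret name
TENANT_SECRET_PREFIX=ledgix/tenants/prod   # example prefix from this page
AWS_REGION=us-east-1                       # default region from this page
```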