Durable Execution Engine Setup Guide
This guide explains how to run Everruns with the custom PostgreSQL-backed durable execution engine.
Overview
The durable execution engine is a PostgreSQL-backed workflow orchestration system that provides:
- Event-sourced workflows with automatic retries
- Distributed task queue with backpressure support
- Circuit breakers and dead letter queues
- No additional infrastructure required (uses existing PostgreSQL)
Quick Start
1. Prerequisites
- PostgreSQL running and accessible
- `DATABASE_URL` environment variable set
- Migrations applied (includes durable tables)
2. Start API in Durable Mode
```bash
# Set runner mode to durable
export RUNNER_MODE=durable
export DATABASE_URL="postgres://postgres:postgres@localhost/everruns"

# Start the API server
cargo run -p everruns-server
```

You should see:

```
Using Durable execution engine runner (PostgreSQL-backed)
```

3. Start Durable Worker
In a separate terminal:
```bash
# Workers only need gRPC address - NO DATABASE_URL required!
export GRPC_ADDRESS="127.0.0.1:9001"

# Start the durable worker
cargo run -p everruns-worker --bin durable-worker
```

Important: Workers communicate with the control-plane via gRPC and do not require direct database access. This improves security and simplifies deployment.
Or programmatically:
```rust
use everruns_worker::{DurableWorker, DurableWorkerConfig};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let mut worker = DurableWorker::from_env().await?;
    worker.run().await
}
```

Configuration
Environment Variables
| Variable | Description | Default |
|---|---|---|
| `RUNNER_MODE` | Runner mode (`durable` only) | `durable` |
| `DATABASE_URL` | PostgreSQL connection URL | Required |
| `GRPC_ADDRESS` | Control-plane gRPC address | `127.0.0.1:9001` |
| `WORKER_ID` | Unique worker identifier | Auto-generated |
| `MAX_CONCURRENT_TASKS` | Max tasks per worker | `10` |
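As a rough illustration of how these variables map onto worker configuration, here is a minimal sketch that reads them with the documented defaults. The struct and field names are hypothetical and do not necessarily match the real `DurableWorkerConfig`.

```rust
use std::env;

// Hypothetical illustration of reading the environment variables above.
// Field names are assumptions for this example, not the actual
// `DurableWorkerConfig` API.
struct WorkerEnvConfig {
    grpc_address: String,
    worker_id: String,
    max_concurrent_tasks: usize,
}

impl WorkerEnvConfig {
    fn from_env() -> WorkerEnvConfig {
        WorkerEnvConfig {
            // Control-plane gRPC address, defaulting to the documented value
            grpc_address: env::var("GRPC_ADDRESS")
                .unwrap_or_else(|_| "127.0.0.1:9001".to_string()),
            // Worker ID; auto-generated here with a simple placeholder if unset
            worker_id: env::var("WORKER_ID")
                .unwrap_or_else(|_| format!("worker-{}", std::process::id())),
            // Maximum number of tasks executed concurrently
            max_concurrent_tasks: env::var("MAX_CONCURRENT_TASKS")
                .ok()
                .and_then(|v| v.parse().ok())
                .unwrap_or(10),
        }
    }
}

fn main() {
    let cfg = WorkerEnvConfig::from_env();
    println!(
        "grpc={} worker={} max_tasks={}",
        cfg.grpc_address, cfg.worker_id, cfg.max_concurrent_tasks
    );
}
```

In practice, `DurableWorker::from_env()` handles this for you, as shown in the programmatic example above.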
Database Tables
The durable engine uses these tables (created by migration 008):
- `durable_workflow_instances` - Workflow state and metadata
- `durable_workflow_events` - Event sourcing log
- `durable_task_queue` - Distributed task queue
- `durable_dead_letter_queue` - Failed tasks for manual inspection
- `durable_workers` - Worker registration and heartbeats
- `durable_signals` - Workflow signals (cancel, custom)
- `durable_circuit_breaker_state` - Circuit breaker states
Testing
Unit Tests (No Dependencies)
```bash
cargo test -p everruns-durable --lib
```

Expected: 91+ tests passing
Integration Tests (Requires PostgreSQL)
```bash
# Create test database
psql -U postgres -c "CREATE DATABASE everruns_test;"

# Run migrations (required for tests - server auto-migrates but tests don't start the server)
DATABASE_URL="postgres://postgres:postgres@localhost/everruns_test" \
  sqlx migrate run --source crates/server/migrations

# Run integration tests
DATABASE_URL="postgres://postgres:postgres@localhost/everruns_test" \
  cargo test -p everruns-durable --test postgres_integration_test -- --test-threads=1
```

Expected: 17 tests passing
Note: In production, migrations are auto-applied when `everruns-server` starts. For tests, we run migrations manually since tests don't start the server.
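For context, startup auto-migration with sqlx usually looks like the sketch below. This is a generic illustration, not the actual `everruns-server` startup code, and the migrations path is an assumption.

```rust
use sqlx::postgres::PgPoolOptions;

// Generic sketch of applying embedded migrations at startup with sqlx.
// The "./migrations" path is an assumption for this example.
#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let database_url = std::env::var("DATABASE_URL")?;

    // Connect to PostgreSQL
    let pool = PgPoolOptions::new()
        .max_connections(5)
        .connect(&database_url)
        .await?;

    // Apply any pending migrations before serving traffic
    sqlx::migrate!("./migrations").run(&pool).await?;

    Ok(())
}
```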
Workflow Lifecycle
- Message Created: User sends message via API
- Workflow Started: `DurableRunner` creates workflow and enqueues `process_input` task
- Input Processing: Worker claims task, processes input, enqueues `reason` task
- LLM Reasoning: Worker executes LLM call, may enqueue `act` tasks for tools
- Completion: Workflow marked as `completed` after final response
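To make the chaining concrete, here is a small illustrative sketch of how each finished step could decide what to enqueue next. The enum and function are hypothetical; the real engine drives this through `DurableRunner` and the task queue tables.

```rust
// Hypothetical sketch of the task chain above; not the actual everruns-durable API.
#[derive(Debug, Clone)]
enum Activity {
    ProcessInput,
    Reason,
    Act { tool: String },
    Complete,
}

// Given a finished activity, decide which task(s) to enqueue next.
fn next_steps(finished: &Activity, tool_calls: &[String]) -> Vec<Activity> {
    match finished {
        // Input processing hands off to LLM reasoning.
        Activity::ProcessInput => vec![Activity::Reason],
        // Reasoning fans out to tool calls, or finishes the workflow.
        Activity::Reason if !tool_calls.is_empty() => tool_calls
            .iter()
            .map(|t| Activity::Act { tool: t.clone() })
            .collect(),
        Activity::Reason => vec![Activity::Complete],
        // Tool results feed back into another reasoning step.
        Activity::Act { .. } => vec![Activity::Reason],
        // Completion ends the chain.
        Activity::Complete => vec![],
    }
}

fn main() {
    let next = next_steps(&Activity::Reason, &["web_search".to_string()]);
    println!("{next:?}"); // [Act { tool: "web_search" }]
}
```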
Monitoring
Check Active Workflows

```sql
SELECT id, workflow_type, status, created_at
FROM durable_workflow_instances
WHERE status IN ('pending', 'running')
ORDER BY created_at DESC;
```

Check Pending Tasks

```sql
SELECT id, workflow_id, activity_type, status, attempt
FROM durable_task_queue
WHERE status = 'pending'
ORDER BY created_at;
```

Check Dead Letter Queue

```sql
SELECT id, workflow_id, activity_type, last_error, dead_at
FROM durable_dead_letter_queue
ORDER BY dead_at DESC;
```

Check Worker Status

```sql
SELECT id, status, current_load, last_heartbeat_at
FROM durable_workers
WHERE status = 'active';
```

Crash Recovery
The durable execution engine provides automatic crash recovery through:
Worker Heartbeats
Workers send heartbeats every 10 seconds while executing tasks. If a worker crashes:
- The task remains in `claimed` status with a stale `heartbeat_at`
- Control-plane background task detects stale tasks (30s threshold)
- Stale tasks are automatically reset to `pending` status
- Another worker can claim and retry the task
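For illustration, the worker-side pattern amounts to "run the task and tick a heartbeat every 10 seconds until it finishes". The sketch below is a generic tokio version of that idea; the heartbeat call, task body, and function names are assumptions, not the actual worker code (which reports over gRPC).

```rust
use std::time::Duration;
use tokio::time;

// Hypothetical heartbeat sender; the real worker would report to the
// control-plane over gRPC rather than printing locally.
async fn send_heartbeat(task_id: u64) {
    println!("heartbeat for task {task_id}");
}

// Placeholder for the actual activity execution.
async fn execute_task(task_id: u64) {
    time::sleep(Duration::from_secs(35)).await;
    println!("task {task_id} done");
}

// Run the task while emitting a heartbeat every 10 seconds, as described above.
async fn run_with_heartbeat(task_id: u64) {
    let mut ticker = time::interval(Duration::from_secs(10));
    let task = execute_task(task_id);
    tokio::pin!(task);

    loop {
        tokio::select! {
            // Task finished: stop heartbeating.
            _ = &mut task => break,
            // Otherwise keep refreshing the heartbeat so the task is not reclaimed.
            _ = ticker.tick() => send_heartbeat(task_id).await,
        }
    }
}

#[tokio::main]
async fn main() {
    run_with_heartbeat(42).await;
}
```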
Stale Task Reclamation
The control-plane runs a background task (every 10s) that:
- Finds tasks with `status = 'claimed'` and `heartbeat_at` older than 30s
- Resets them to `pending` status
- Logs reclaimed task IDs for monitoring
```sql
-- View tasks that may need reclamation
SELECT id, workflow_id, activity_type, claimed_by, heartbeat_at
FROM durable_task_queue
WHERE status = 'claimed'
  AND heartbeat_at < NOW() - INTERVAL '30 seconds';
```
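On the control-plane side, this pass is essentially a periodic UPDATE. Here is a rough sqlx-based sketch of that loop; it is not the actual control-plane code, and it logs only a count, whereas the real background task logs the reclaimed task IDs.

```rust
use std::time::Duration;
use sqlx::PgPool;

// Rough sketch of the stale-task reclamation pass described above.
// Table and column names follow this guide; everything else is illustrative.
async fn reclaim_stale_tasks(pool: &PgPool) -> anyhow::Result<()> {
    let mut ticker = tokio::time::interval(Duration::from_secs(10));
    loop {
        ticker.tick().await;

        // Reset claimed tasks whose heartbeat is older than 30 seconds.
        let result = sqlx::query(
            "UPDATE durable_task_queue
             SET status = 'pending'
             WHERE status = 'claimed'
               AND heartbeat_at < NOW() - INTERVAL '30 seconds'",
        )
        .execute(pool)
        .await?;

        // The real implementation logs the reclaimed task IDs; a count is
        // enough for this sketch.
        if result.rows_affected() > 0 {
            println!("reclaimed {} stale task(s)", result.rows_affected());
        }
    }
}
```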
Troubleshooting

Worker Not Processing Tasks
- Check worker is running and connected to correct `GRPC_ADDRESS`
- Verify `activity_types` match task types in queue
- Check worker heartbeat in `durable_workers` table
Workflows Stuck in Running
- Check for claimed tasks that haven't completed
- Look for errors in worker logs
- Check DLQ for failed tasks
- Wait for stale task reclamation (30s threshold)
Task Retries Exhausted
Tasks are moved to the DLQ after exhausting their retries:
```sql
-- View DLQ entries
SELECT * FROM durable_dead_letter_queue ORDER BY dead_at DESC;

-- Requeue a task
UPDATE durable_dead_letter_queue SET requeued_at = NOW() WHERE id = '<dlq_id>';
```

Implementation Status
| Phase | Status | Description |
|---|---|---|
| Phase 1-4 | ✅ Complete | Core abstractions, persistence, reliability, worker pool |
| Phase 5 | 🔄 Planned | Observability & Metrics (OpenTelemetry integration) |
| Phase 6 | 🔄 Planned | Scale Testing (1000+ concurrent workers) |
| Phase 7 | ✅ Core Complete | gRPC-based worker integration, crash recovery |
The durable execution engine is production-ready for single-instance deployments.