
Durable Execution Engine Setup

This guide explains how to run Everruns with the custom PostgreSQL-backed durable execution engine.

The durable execution engine is a PostgreSQL-backed workflow orchestration system that provides:

  • Event-sourced workflows with automatic retries
  • Distributed task queue with backpressure support
  • Circuit breakers and dead letter queues
  • No additional infrastructure required (uses existing PostgreSQL)

Prerequisites:

  • PostgreSQL running and accessible
  • DATABASE_URL environment variable set
  • Migrations applied (includes durable tables)

Start the API server with the durable runner:
# Set runner mode to durable
export RUNNER_MODE=durable
export DATABASE_URL="postgres://postgres:postgres@localhost/everruns"
# Start the API server
cargo run -p everruns-server

You should see:

Using Durable execution engine runner (PostgreSQL-backed)

In a separate terminal:

# Workers only need gRPC address - NO DATABASE_URL required!
export GRPC_ADDRESS="127.0.0.1:9001"
# Start the durable worker
cargo run -p everruns-worker --bin durable-worker

Important: Workers communicate with the control-plane via gRPC and do not require direct database access. This improves security and simplifies deployment.

Or programmatically:

use everruns_worker::{DurableWorker, DurableWorkerConfig};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Worker configuration (GRPC_ADDRESS, WORKER_ID, etc.) is read from the environment
    let mut worker = DurableWorker::from_env().await?;
    worker.run().await
}

Environment variables:

Variable               Description                  Default
RUNNER_MODE            Runner mode (durable only)   durable
DATABASE_URL           PostgreSQL connection URL    Required
GRPC_ADDRESS           Control-plane gRPC address   127.0.0.1:9001
WORKER_ID              Unique worker identifier     Auto-generated
MAX_CONCURRENT_TASKS   Max tasks per worker         10

The durable engine uses these tables (created by migration 008):

  • durable_workflow_instances - Workflow state and metadata
  • durable_workflow_events - Event sourcing log
  • durable_task_queue - Distributed task queue
  • durable_dead_letter_queue - Failed tasks for manual inspection
  • durable_workers - Worker registration and heartbeats
  • durable_signals - Workflow signals (cancel, custom)
  • durable_circuit_breaker_state - Circuit breaker states
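
To confirm the migration created these tables, they can be listed with a plain information_schema query (this assumes the default public schema and is not specific to Everruns):

-- List the durable_* tables in the current database
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'public'
  AND table_name LIKE 'durable_%'
ORDER BY table_name;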

Run the unit tests:
cargo test -p everruns-durable --lib

Expected: 91+ tests passing

To run the PostgreSQL integration tests:
# Create test database
psql -U postgres -c "CREATE DATABASE everruns_test;"
# Run migrations (required for tests - server auto-migrates but tests don't start server)
DATABASE_URL="postgres://postgres:postgres@localhost/everruns_test" \
sqlx migrate run --source crates/server/migrations
# Run integration tests
DATABASE_URL="postgres://postgres:postgres@localhost/everruns_test" \
cargo test -p everruns-durable --test postgres_integration_test -- --test-threads=1

Expected: 17 tests passing

Note: In production, migrations are auto-applied when everruns-server starts. For tests, we run migrations manually since tests don’t start the server.
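
If you are unsure whether migrations have been applied to a given database, sqlx records them in a _sqlx_migrations bookkeeping table by default:

-- Show applied migrations
SELECT version, description, success
FROM _sqlx_migrations
ORDER BY version;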

A message-processing workflow moves through these stages:

  1. Message Created: User sends message via API
  2. Workflow Started: DurableRunner creates workflow and enqueues process_input task
  3. Input Processing: Worker claims task, processes input, enqueues reason task
  4. LLM Reasoning: Worker executes LLM call, may enqueue act tasks for tools
  5. Completion: Workflow marked as completed after final response
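
To trace a single workflow through these stages, the event-sourcing log can be read directly. The column names below (workflow_id, event_type, created_at) are assumptions about the durable_workflow_events schema and may differ from the actual migration:

-- Event history for one workflow (assumed column names)
SELECT event_type, created_at
FROM durable_workflow_events
WHERE workflow_id = '<workflow_id>'
ORDER BY created_at;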

Useful monitoring queries:

-- Active workflows
SELECT id, workflow_type, status, created_at
FROM durable_workflow_instances
WHERE status IN ('pending', 'running')
ORDER BY created_at DESC;

-- Pending tasks in the queue
SELECT id, workflow_id, activity_type, status, attempt
FROM durable_task_queue
WHERE status = 'pending'
ORDER BY created_at;

-- Dead letter queue entries
SELECT id, workflow_id, activity_type, last_error, dead_at
FROM durable_dead_letter_queue
ORDER BY dead_at DESC;

-- Active workers
SELECT id, status, current_load, last_heartbeat_at
FROM durable_workers
WHERE status = 'active';

The durable execution engine provides automatic crash recovery through worker heartbeats and stale-task reclamation.

Workers send heartbeats every 10 seconds while executing tasks. If a worker crashes:

  1. The task remains in claimed status with stale heartbeat_at
  2. Control-plane background task detects stale tasks (30s threshold)
  3. Stale tasks are automatically reset to pending status
  4. Another worker can claim and retry the task

The control-plane runs a background task (every 10s) that:

  • Finds tasks with status = 'claimed' and heartbeat_at older than 30s
  • Resets them to pending status
  • Logs reclaimed task IDs for monitoring
-- View tasks that may need reclamation
SELECT id, workflow_id, activity_type, claimed_by, heartbeat_at
FROM durable_task_queue
WHERE status = 'claimed'
AND heartbeat_at < NOW() - INTERVAL '30 seconds';
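
Reclamation is automatic, but as an illustration, the same reset can be done by hand. Treat this as a sketch: the engine's own reclamation may update additional columns.

-- Manually return stale claimed tasks to the queue
UPDATE durable_task_queue
SET status = 'pending', claimed_by = NULL
WHERE status = 'claimed'
  AND heartbeat_at < NOW() - INTERVAL '30 seconds';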

If workers are not picking up tasks:

  1. Check that the worker is running and connected to the correct GRPC_ADDRESS
  2. Verify the worker's activity_types match the task types in the queue (see the query below)
  3. Check the worker's heartbeat in the durable_workers table
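
To see which task types are currently waiting, and compare them against the worker's configured activity types:

-- Distinct task types pending in the queue
SELECT DISTINCT activity_type
FROM durable_task_queue
WHERE status = 'pending';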

If a workflow appears stuck:

  1. Check for claimed tasks that haven't completed (queries below)
  2. Look for errors in the worker logs
  3. Check the DLQ for failed tasks
  4. Wait for stale task reclamation (30s threshold)
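
To inspect a specific workflow and its tasks, using the same columns as the monitoring queries above:

-- Workflow status
SELECT id, workflow_type, status, created_at
FROM durable_workflow_instances
WHERE id = '<workflow_id>';

-- Its tasks, oldest first
SELECT id, activity_type, status, attempt, claimed_by, heartbeat_at
FROM durable_task_queue
WHERE workflow_id = '<workflow_id>'
ORDER BY created_at;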

Tasks are moved to the DLQ after exhausting their retries:

-- View DLQ entries
SELECT * FROM durable_dead_letter_queue ORDER BY dead_at DESC;
-- Requeue a task
UPDATE durable_dead_letter_queue SET requeued_at = NOW() WHERE id = '<dlq_id>';

Implementation status:

Phase        Status             Description
Phase 1-4    ✅ Complete        Core abstractions, persistence, reliability, worker pool
Phase 5      🔄 Planned         Observability & Metrics (OpenTelemetry integration)
Phase 6      🔄 Planned         Scale Testing (1000+ concurrent workers)
Phase 7      ✅ Core Complete   gRPC-based worker integration, crash recovery

The durable execution engine is production-ready for single-instance deployments.