Durable Execution Engine Setup Guide
This guide explains how to run Everruns with the custom PostgreSQL-backed durable execution engine.
Overview
The durable execution engine is a PostgreSQL-backed workflow orchestration system that provides:
- Event-sourced workflows with automatic retries
- Distributed task queue with backpressure support
- Circuit breakers and dead letter queues
- No additional infrastructure required (uses existing PostgreSQL)
Quick Start
1. Prerequisites
- PostgreSQL running and accessible
- `DATABASE_URL` environment variable set
- Migrations applied (includes durable tables)
2. Start API in Durable Mode
```bash
# Set runner mode to durable
export RUNNER_MODE=durable
export DATABASE_URL="postgres://postgres:postgres@localhost/everruns"

# Start the API server
cargo run -p everruns-server
```

You should see:

```
Using Durable execution engine runner (PostgreSQL-backed)
```

3. Start Durable Worker
In a separate terminal:
```bash
# Workers only need gRPC address - NO DATABASE_URL required!
export GRPC_ADDRESS="127.0.0.1:9001"

# Start the durable worker
cargo run -p everruns-worker --bin durable-worker
```

Important: Workers communicate with the control-plane via gRPC and do not require direct database access. This improves security and simplifies deployment.
Or programmatically:
```rust
use everruns_worker::{DurableWorker, DurableWorkerConfig};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let mut worker = DurableWorker::from_env().await?;
    worker.run().await
}
```

Configuration
Environment Variables
| Variable | Description | Default |
|---|---|---|
| `RUNNER_MODE` | Runner mode (`durable` only) | `durable` |
| `DATABASE_URL` | PostgreSQL connection URL | Required |
| `GRPC_ADDRESS` | Control-plane gRPC address | `127.0.0.1:9001` |
| `WORKER_ID` | Unique worker identifier | Auto-generated |
| `MAX_CONCURRENT_TASKS` | Max tasks per worker | `10` |
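As a rough illustration of how these variables map onto worker configuration, here is a minimal sketch that reads them with the documented defaults. The struct and field names are hypothetical and do not necessarily match the real `DurableWorkerConfig`.

```rust
use std::env;

// Hypothetical illustration of reading the environment variables above.
// Field names are assumptions for this example, not the actual
// `DurableWorkerConfig` API.
struct WorkerEnvConfig {
    grpc_address: String,
    worker_id: String,
    max_concurrent_tasks: usize,
}

impl WorkerEnvConfig {
    fn from_env() -> WorkerEnvConfig {
        WorkerEnvConfig {
            // Control-plane gRPC address, defaulting to the documented value
            grpc_address: env::var("GRPC_ADDRESS")
                .unwrap_or_else(|_| "127.0.0.1:9001".to_string()),
            // Worker ID; auto-generated here with a simple placeholder if unset
            worker_id: env::var("WORKER_ID")
                .unwrap_or_else(|_| format!("worker-{}", std::process::id())),
            // Maximum number of tasks executed concurrently
            max_concurrent_tasks: env::var("MAX_CONCURRENT_TASKS")
                .ok()
                .and_then(|v| v.parse().ok())
                .unwrap_or(10),
        }
    }
}

fn main() {
    let cfg = WorkerEnvConfig::from_env();
    println!(
        "grpc={} worker={} max_tasks={}",
        cfg.grpc_address, cfg.worker_id, cfg.max_concurrent_tasks
    );
}
```

In practice, `DurableWorker::from_env()` handles this for you, as shown in the programmatic example above.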
Database Tables
The durable engine uses these tables (created by migration 008):
- `durable_workflow_instances` - Workflow state and metadata
- `durable_workflow_events` - Event sourcing log
- `durable_task_queue` - Distributed task queue
- `durable_dead_letter_queue` - Failed tasks for manual inspection
- `durable_workers` - Worker registration and heartbeats
- `durable_signals` - Workflow signals (cancel, custom)
- `durable_circuit_breaker_state` - Circuit breaker states
Testing
Unit Tests (No Dependencies)
```bash
cargo test -p everruns-durable --lib
```

Expected: 91+ tests passing
Integration Tests (Requires PostgreSQL)
```bash
# Create test database
psql -U postgres -c "CREATE DATABASE everruns_test;"

# Run migrations (required for tests - server auto-migrates but tests don't start the server)
DATABASE_URL="postgres://postgres:postgres@localhost/everruns_test" \
  sqlx migrate run --source crates/server/migrations

# Run integration tests
DATABASE_URL="postgres://postgres:postgres@localhost/everruns_test" \
  cargo test -p everruns-durable --test postgres_integration_test -- --test-threads=1
```

Expected: 17 tests passing
Note: In production, migrations are auto-applied when `everruns-server` starts. For tests, we run migrations manually since tests don't start the server.
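For context, startup auto-migration with sqlx usually looks like the sketch below. This is a generic illustration, not the actual `everruns-server` startup code, and the migrations path is an assumption.

```rust
use sqlx::postgres::PgPoolOptions;

// Generic sketch of applying embedded migrations at startup with sqlx.
// The "./migrations" path is an assumption for this example.
#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let database_url = std::env::var("DATABASE_URL")?;

    // Connect to PostgreSQL
    let pool = PgPoolOptions::new()
        .max_connections(5)
        .connect(&database_url)
        .await?;

    // Apply any pending migrations before serving traffic
    sqlx::migrate!("./migrations").run(&pool).await?;

    Ok(())
}
```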
Workflow Lifecycle
- Message Created: User sends message via API
- Workflow Started: `DurableRunner` creates workflow and enqueues `process_input` task
- Input Processing: Worker claims task, processes input, enqueues `reason` task
- LLM Reasoning: Worker executes LLM call, may enqueue `act` tasks for tools
- Completion: Workflow marked as `completed` after final response
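To make the chaining concrete, here is a small illustrative sketch of how each finished step could decide what to enqueue next. The enum and function are hypothetical; the real engine drives this through `DurableRunner` and the task queue tables.

```rust
// Hypothetical sketch of the task chain above; not the actual everruns-durable API.
#[derive(Debug, Clone)]
enum Activity {
    ProcessInput,
    Reason,
    Act { tool: String },
    Complete,
}

// Given a finished activity, decide which task(s) to enqueue next.
fn next_steps(finished: &Activity, tool_calls: &[String]) -> Vec<Activity> {
    match finished {
        // Input processing hands off to LLM reasoning.
        Activity::ProcessInput => vec![Activity::Reason],
        // Reasoning fans out to tool calls, or finishes the workflow.
        Activity::Reason if !tool_calls.is_empty() => tool_calls
            .iter()
            .map(|t| Activity::Act { tool: t.clone() })
            .collect(),
        Activity::Reason => vec![Activity::Complete],
        // Tool results feed back into another reasoning step.
        Activity::Act { .. } => vec![Activity::Reason],
        // Completion ends the chain.
        Activity::Complete => vec![],
    }
}

fn main() {
    let next = next_steps(&Activity::Reason, &["web_search".to_string()]);
    println!("{next:?}"); // [Act { tool: "web_search" }]
}
```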
Monitoring
Check Active Workflows

```sql
SELECT id, workflow_type, status, created_at
FROM durable_workflow_instances
WHERE status IN ('pending', 'running')
ORDER BY created_at DESC;
```

Check Pending Tasks

```sql
SELECT id, workflow_id, activity_type, status, attempt
FROM durable_task_queue
WHERE status = 'pending'
ORDER BY created_at;
```

Check Dead Letter Queue

```sql
SELECT id, workflow_id, activity_type, last_error, dead_at
FROM durable_dead_letter_queue
ORDER BY dead_at DESC;
```

Check Worker Status

```sql
SELECT id, status, current_load, last_heartbeat_at
FROM durable_workers
WHERE status = 'active';
```

Crash Recovery
The durable execution engine provides automatic crash recovery through:
Worker Heartbeats
Workers send heartbeats every 10 seconds while executing tasks. If a worker crashes:
- The task remains in `claimed` status with a stale `heartbeat_at`
- Control-plane background task detects stale tasks (30s threshold)
- Stale tasks are automatically reset to `pending` status
- Another worker can claim and retry the task
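For illustration, the worker-side pattern amounts to "run the task and tick a heartbeat every 10 seconds until it finishes". The sketch below is a generic tokio version of that idea; the heartbeat call, task body, and function names are assumptions, not the actual worker code (which reports over gRPC).

```rust
use std::time::Duration;
use tokio::time;

// Hypothetical heartbeat sender; the real worker would report to the
// control-plane over gRPC rather than printing locally.
async fn send_heartbeat(task_id: u64) {
    println!("heartbeat for task {task_id}");
}

// Placeholder for the actual activity execution.
async fn execute_task(task_id: u64) {
    time::sleep(Duration::from_secs(35)).await;
    println!("task {task_id} done");
}

// Run the task while emitting a heartbeat every 10 seconds, as described above.
async fn run_with_heartbeat(task_id: u64) {
    let mut ticker = time::interval(Duration::from_secs(10));
    let task = execute_task(task_id);
    tokio::pin!(task);

    loop {
        tokio::select! {
            // Task finished: stop heartbeating.
            _ = &mut task => break,
            // Otherwise keep refreshing the heartbeat so the task is not reclaimed.
            _ = ticker.tick() => send_heartbeat(task_id).await,
        }
    }
}

#[tokio::main]
async fn main() {
    run_with_heartbeat(42).await;
}
```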
Stale Task Reclamation
The control-plane runs a background task (every 10s) that:
- Finds tasks with `status = 'claimed'` and `heartbeat_at` older than 30s
- Resets them to `pending` status
- Logs reclaimed task IDs for monitoring
```sql
-- View tasks that may need reclamation
SELECT id, workflow_id, activity_type, claimed_by, heartbeat_at
FROM durable_task_queue
WHERE status = 'claimed'
  AND heartbeat_at < NOW() - INTERVAL '30 seconds';
```
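On the control-plane side, this pass is essentially a periodic UPDATE. Here is a rough sqlx-based sketch of that loop; it is not the actual control-plane code, and it logs only a count, whereas the real background task logs the reclaimed task IDs.

```rust
use std::time::Duration;
use sqlx::PgPool;

// Rough sketch of the stale-task reclamation pass described above.
// Table and column names follow this guide; everything else is illustrative.
async fn reclaim_stale_tasks(pool: &PgPool) -> anyhow::Result<()> {
    let mut ticker = tokio::time::interval(Duration::from_secs(10));
    loop {
        ticker.tick().await;

        // Reset claimed tasks whose heartbeat is older than 30 seconds.
        let result = sqlx::query(
            "UPDATE durable_task_queue
             SET status = 'pending'
             WHERE status = 'claimed'
               AND heartbeat_at < NOW() - INTERVAL '30 seconds'",
        )
        .execute(pool)
        .await?;

        // The real implementation logs the reclaimed task IDs; a count is
        // enough for this sketch.
        if result.rows_affected() > 0 {
            println!("reclaimed {} stale task(s)", result.rows_affected());
        }
    }
}
```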
Troubleshooting

Worker Not Processing Tasks
- Check worker is running and connected to correct `GRPC_ADDRESS`
- Verify `activity_types` match task types in queue
- Check worker heartbeat in `durable_workers` table
Workflows Stuck in Running
- Check for claimed tasks that haven't completed
- Look for errors in worker logs
- Check DLQ for failed tasks
- Wait for stale task reclamation (30s threshold)
Task Retries Exhausted
Tasks are moved to the DLQ after exhausting their retries:
```sql
-- View DLQ entries
SELECT * FROM durable_dead_letter_queue ORDER BY dead_at DESC;

-- Requeue a task
UPDATE durable_dead_letter_queue SET requeued_at = NOW() WHERE id = '<dlq_id>';
```

Implementation Status
| Phase | Status | Description |
|---|---|---|
| Phase 1-4 | ✅ Complete | Core abstractions, persistence, reliability, worker pool |
| Phase 5 | 🔄 Planned | Observability & Metrics (OpenTelemetry integration) |
| Phase 6 | 🔄 Planned | Scale Testing (1000+ concurrent workers) |
| Phase 7 | ✅ Core Complete | gRPC-based worker integration, crash recovery |
The durable execution engine is production-ready for single-instance deployments.