Automatic error detection, recovery, and retry mechanisms for resilient agent workflows
Self-Healing enables agents to automatically detect, diagnose, and recover from errors, making your workflows robust and resilient without manual intervention.
The Self-Healing primitive provides automatic error detection and recovery capabilities that make agent workflows resilient to failures. Instead of failing completely when encountering errors, self-healing agents can detect problems, analyze their cause, attempt fixes, and retry operations automatically.
Self-healing is essential for:
Production Reliability: Recover from transient failures without human intervention
Error Tolerance: Handle network issues, rate limits, and temporary service outages
Intelligent Retry: Use smart retry strategies based on error type and context
Automatic Debugging: Detect and fix common programming and configuration errors
Graceful Degradation: Provide partial results when complete success isn’t possible
Automatic Detection
Agents detect errors from tool failures, exceptions, and unexpected outputs
Context-Aware Recovery
Recovery strategies adapt based on error type, context, and previous attempts
Intelligent Retry Logic
Exponential backoff, jitter, and adaptive retry policies prevent cascading failures
Built-In by Default
Self-healing is automatically available in all Agentbase agents
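To make the exponential backoff and jitter mentioned above concrete, here is a minimal sketch of the general technique. It is not Agentbase's internal implementation: `backoffDelay`, `retryWithBackoff`, and `BackoffOptions` are hypothetical names used only for illustration, and the "full jitter" variant (a uniform delay between zero and the exponential ceiling) is one common choice among several.

```typescript
// Hypothetical helpers for illustration; not part of the Agentbase SDK.
interface BackoffOptions {
  baseMs: number;          // ceiling for the first retry delay
  maxMs: number;           // upper bound on any single delay
  random?: () => number;   // injectable for tests; defaults to Math.random
}

// Full jitter: a uniform delay in [0, min(maxMs, baseMs * 2^attempt)).
function backoffDelay(attempt: number, opts: BackoffOptions): number {
  const random = opts.random ?? Math.random;
  const ceiling = Math.min(opts.maxMs, opts.baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

// Runs `operation`, retrying up to `maxRetries` times with jittered backoff.
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  maxRetries: number,
  opts: BackoffOptions,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      // Out of retries: surface the last error to the caller.
      if (attempt >= maxRetries) throw err;
      const delayMs = backoffDelay(attempt, opts);
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

Randomizing the delay spreads out retries from many clients, so they do not all hit a recovering service at the same moment.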
import { Agentbase } from '@agentbase/sdk';

const agentbase = new Agentbase({
  apiKey: process.env.AGENTBASE_API_KEY
});

// Agent automatically recovers from errors
const result = await agentbase.runAgent({
  message: "Download data from https://api.example.com/data and save it",
  mode: "base"
});

// If API is temporarily unavailable, agent will:
// 1. Detect the connection error
// 2. Wait briefly (exponential backoff)
// 3. Retry the request
// 4. Continue with the task once successful
// Agent handles rate limiting intelligently
const result = await agentbase.runAgent({
  message: "Fetch user data for IDs 1-1000 from the API",
  mode: "base",
  mcpServers: [{
    serverName: 'api',
    serverUrl: 'https://api.company.com/mcp'
  }]
});

// Self-healing for rate limits:
// 1. Makes API requests
// 2. Receives "429 Too Many Requests" error
// 3. Parses "Retry-After" header (e.g., 60 seconds)
// 4. Waits the specified time
// 5. Resumes requests
// 6. May also implement batching to reduce request rate
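Step 3 above, parsing the `Retry-After` header, can be sketched as follows. This illustrates the HTTP convention (the header carries either a number of seconds or an HTTP date), not how the agent does it internally. `parseRetryAfter` is a hypothetical helper name.

```typescript
// Hypothetical helper for illustration; not part of the Agentbase SDK.
// Returns the number of milliseconds to wait, or null if the header is
// missing or cannot be interpreted.
function parseRetryAfter(value: string | null, nowMs: number = Date.now()): number | null {
  if (value === null) return null;
  const trimmed = value.trim();

  // Form 1: delay in whole seconds, e.g. "60".
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;

  // Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) return null;

  // A date in the past means the caller may retry immediately.
  return Math.max(0, dateMs - nowMs);
}
```

Passing the current time in as `nowMs` keeps the function deterministic, which makes it straightforward to unit test.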
Agents detect when a service is consistently failing:
// Agent recognizes repeated failures and adapts strategy
const result = await agentbase.runAgent({
  message: "Process data using external API, fallback to local processing if API is down",
  mode: "base"
});

// Circuit breaker logic:
// 1. Try API: fails
// 2. Retry API: fails
// 3. Detect pattern of failures
// 4. Open circuit breaker (stop trying API)
// 5. Use fallback strategy (local processing)
// 6. Periodically test API recovery
// Self-healing data pipeline
const pipeline = await agentbase.runAgent({
  message: `
    Build a data pipeline:
    1. Fetch data from https://api.source.com/data
    2. Transform and clean the data
    3. Upload to https://api.destination.com/data

    Handle any network errors, rate limits, or timeouts automatically.
  `,
  mode: "base"
});

// Agent automatically handles:
// - Source API downtime (retry with backoff)
// - Rate limiting (wait and resume)
// - Destination upload failures (retry with exponential backoff)
// - Network timeouts (retry with increased timeout)
// - Data validation errors (skip invalid rows, continue processing)
Web Scraping with Recovery
async function robustScraping(urls: string[]) {
  return await agentbase.runAgent({
    message: `
      Scrape content from these URLs: ${urls.join(', ')}

      For each URL:
      - Handle timeouts by retrying
      - Handle 404s by logging and continuing
      - Handle rate limits by waiting
      - Extract main content even if page structure varies
    `,
    mode: "base"
  });

  // Self-healing handles:
  // - Connection timeouts → retry
  // - 503 errors → exponential backoff
  // - Rate limits → respect Retry-After
  // - Missing elements → adapt selector strategy
  // - Partial page loads → retry or use what's available
}
async function batchProcessing(fileList: string[]) {
  return await agentbase.runAgent({
    message: `
      Process ${fileList.length} files:
      - Convert each file from JSON to CSV
      - Validate data format
      - Upload to S3

      Track progress and handle errors gracefully.
      If a file fails, log it and continue with others.
    `,
    mode: "base"
  });

  // Self-healing ensures:
  // - Individual file failures don't stop entire job
  // - Progress is tracked and resumed if interrupted
  // - S3 upload retries on network issues
  // - Malformed files are logged and skipped
  // - Final report shows success/failure counts
}
async function weatherDataAggregation() {
  return await agentbase.runAgent({
    message: `
      Collect weather data from 5 different weather APIs.
      Aggregate the data and calculate average temperatures.
      Handle API failures gracefully - use data from available APIs.
    `,
    mode: "base",
    mcpServers: [
      { serverName: 'weather-api-1', serverUrl: 'https://api1.weather.com/mcp' },
      { serverName: 'weather-api-2', serverUrl: 'https://api2.weather.com/mcp' },
      // ... more APIs
    ]
  });

  // Self-healing provides:
  // - Parallel requests with timeout handling
  // - Retry failed APIs with backoff
  // - Calculate result from available APIs if some fail
  // - Report which APIs failed for monitoring
}
// Good: Clear error handling expectations
const result = await agentbase.runAgent({
  message: `
    Download data from API.

    Error handling:
    - Network errors: Retry up to 3 times with exponential backoff
    - 404 errors: Skip and log the missing resource
    - 401 errors: Report authentication failure (don't retry)
    - 500 errors: Retry with backoff
  `,
  mode: "base"
});

// Avoid: Vague instructions
const vague = await agentbase.runAgent({
  message: "Download data from API and handle errors",
  mode: "base"
});
Set Appropriate Retry Limits
// Good: Specify retry limits for different error types
const result = await agentbase.runAgent({
  message: `
    Process data with these retry policies:
    - Transient errors (network, timeout): Retry up to 5 times
    - Rate limits: Wait and retry, max 3 attempts
    - Data validation errors: Skip invalid items, don't retry
    - Authentication errors: Fail immediately, don't retry
  `,
  mode: "base"
});
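If you were to write the same policies in application code rather than in a prompt, they could look roughly like the sketch below. The `ErrorKind` categories, the `POLICIES` table, and `shouldRetry` are hypothetical names chosen to mirror the prompt above; Agentbase does not expose them.

```typescript
// Hypothetical types for illustration; not part of the Agentbase SDK.
type ErrorKind = 'network' | 'timeout' | 'rate_limit' | 'validation' | 'auth';

type RetryPolicy =
  | { action: 'retry'; maxRetries: number }
  | { action: 'skip' }    // drop the item and continue
  | { action: 'fail' };   // stop immediately

// One policy per error category, matching the prompt's bullet list.
const POLICIES: Record<ErrorKind, RetryPolicy> = {
  network: { action: 'retry', maxRetries: 5 },
  timeout: { action: 'retry', maxRetries: 5 },
  rate_limit: { action: 'retry', maxRetries: 3 },
  validation: { action: 'skip' },
  auth: { action: 'fail' },
};

// True if another attempt is allowed after `retriesSoFar` retries.
function shouldRetry(kind: ErrorKind, retriesSoFar: number): boolean {
  const policy = POLICIES[kind];
  return policy.action === 'retry' && retriesSoFar < policy.maxRetries;
}
```

Keeping the policy in a lookup table makes the "never retry" cases (validation and authentication errors) as explicit as the retryable ones.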
Implement Graceful Degradation
// Good: Define acceptable partial success
const result = await agentbase.runAgent({
  message: `
    Fetch data from 20 sources.

    Success criteria: At least 15/20 sources must succeed.
    If fewer than 15 succeed, report error.
    Always return data from successful sources.
  `,
  mode: "base"
});
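The success criterion in this prompt amounts to a threshold check over per-source outcomes. The sketch below shows one way to express it. The `Outcome` shape and `evaluatePartialSuccess` are hypothetical and only illustrate the rule, not how the agent reports results.

```typescript
// Hypothetical types for illustration; not part of the Agentbase SDK.
type Outcome<T> =
  | { ok: true; source: string; value: T }
  | { ok: false; source: string; error: string };

interface PartialResult<T> {
  status: 'ok' | 'error';     // 'error' when too few sources succeeded
  data: T[];                  // always contains every successful value
  failedSources: string[];    // names of sources that failed
}

function evaluatePartialSuccess<T>(
  outcomes: Outcome<T>[],
  minSuccesses: number,
): PartialResult<T> {
  const data: T[] = [];
  const failedSources: string[] = [];
  for (const outcome of outcomes) {
    if (outcome.ok) data.push(outcome.value);
    else failedSources.push(outcome.source);
  }
  const status = data.length >= minSuccesses ? 'ok' : 'error';
  return { status, data, failedSources };
}
```

Note that the data from successful sources is returned even when the status is `'error'`, matching the last line of the prompt.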
Monitor and Log Recovery
// Track self-healing events
const result = await agentbase.runAgent({
  message: `
    Process data and log all error recovery attempts:
    - Log when errors occur
    - Log retry attempts and outcomes
    - Log when recovery succeeds
    - Report summary of all recovery events
  `,
  mode: "base",
  stream: true
});

for await (const event of result) {
  if (event.type === 'agent_error') {
    console.log('Error detected:', event.error);
  }
  if (event.type === 'agent_tool_use') {
    console.log('Recovery attempt:', event.tool);
  }
}
// Implement circuit breaker pattern
const result = await agentbase.runAgent({
  message: `
    Call external API repeatedly.

    Circuit breaker rules:
    - If 5 consecutive failures occur, open circuit
    - When circuit is open, use fallback strategy (cached data)
    - Test circuit recovery every 60 seconds
    - Close circuit after 3 consecutive successes
  `,
  mode: "base"
});
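For readers unfamiliar with the pattern, the rules in this prompt correspond to a small state machine with three states: closed, open, and half-open. The sketch below is a generic illustration using the thresholds from the prompt (5 failures, 60 seconds, 3 successes). The `CircuitBreaker` class is hypothetical and is not an Agentbase API. Reopening the circuit on a failed recovery probe is a common convention that the prompt does not specify.

```typescript
// Hypothetical class for illustration; not part of the Agentbase SDK.
type CircuitState = 'closed' | 'open' | 'half-open';

class CircuitBreaker {
  private state: CircuitState = 'closed';
  private consecutiveFailures = 0;
  private consecutiveSuccesses = 0;
  private openedAtMs = 0;
  private readonly failureThreshold: number;
  private readonly successThreshold: number;
  private readonly probeIntervalMs: number;

  constructor(failureThreshold = 5, successThreshold = 3, probeIntervalMs = 60000) {
    this.failureThreshold = failureThreshold;
    this.successThreshold = successThreshold;
    this.probeIntervalMs = probeIntervalMs;
  }

  getState(): CircuitState {
    return this.state;
  }

  // True if the primary API may be called; false means "use the fallback".
  allowRequest(nowMs: number): boolean {
    if (this.state === 'open' && nowMs - this.openedAtMs >= this.probeIntervalMs) {
      // Probe interval elapsed: let requests through to test recovery.
      this.state = 'half-open';
      this.consecutiveSuccesses = 0;
    }
    return this.state !== 'open';
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    if (this.state === 'half-open') {
      this.consecutiveSuccesses++;
      if (this.consecutiveSuccesses >= this.successThreshold) this.state = 'closed';
    }
  }

  recordFailure(nowMs: number): void {
    // Assumption: any failure while probing reopens the circuit at once.
    if (this.state === 'half-open') {
      this.trip(nowMs);
      return;
    }
    this.consecutiveFailures++;
    if (this.consecutiveFailures >= this.failureThreshold) this.trip(nowMs);
  }

  private trip(nowMs: number): void {
    this.state = 'open';
    this.openedAtMs = nowMs;
    this.consecutiveFailures = 0;
    this.consecutiveSuccesses = 0;
  }
}
```

A caller would check `allowRequest(Date.now())` before each call, fall back to cached data when it returns `false`, and report the outcome with `recordSuccess()` or `recordFailure(Date.now())`.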
// Minimize recovery time
const result = await agentbase.runAgent({
  message: `
    Download data with optimized retry:
    - First retry: immediate
    - Second retry: 1 second wait
    - Third retry: 2 second wait
    - Max retries: 3
    - After 3 failures, use cached data if available
  `,
  mode: "base"
});
// Balance reliability and resource usage
const result = await agentbase.runAgent({
  message: `
    Download data efficiently:
    - Use cache when available (avoid unnecessary requests)
    - Implement aggressive timeout (fail fast on hung connections)
    - Limit retries to 3 attempts max
    - Use exponential backoff to avoid overwhelming services
  `,
  mode: "base"
});
Problem: Agent keeps retrying without success
Solution: Set clear retry limits and failure conditions
const result = await agentbase.runAgent({
  message: `
    Download data with strict retry limits:
    - Max 3 retry attempts
    - If all retries fail, report error and stop
    - Don't retry on 404 or 401 errors
  `,
  mode: "base"
});
Excessive Retry Delays
Problem: Recovery takes too long due to exponential backoff
Solution: Configure reasonable backoff limits
const result = await agentbase.runAgent({
  message: `
    Use moderate backoff strategy:
    - Wait times: 1s, 2s, 4s (max 4 seconds)
    - Total max retry time: 10 seconds
    - After 10 seconds, fail and report
  `,
  mode: "base"
});
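To see why the wait times in this prompt stop at 1s, 2s, 4s, it helps to compute the schedule. The sketch below uses a hypothetical helper, `backoffSchedule`, which doubles the delay each attempt, caps each delay, and stops once the next wait would exceed the total budget. It counts waiting time only and ignores how long each request itself takes.

```typescript
// Hypothetical helper for illustration; not part of the Agentbase SDK.
// Returns the list of waits (in ms) that fit inside `budgetMs`.
function backoffSchedule(baseMs: number, maxDelayMs: number, budgetMs: number): number[] {
  const delays: number[] = [];
  let totalMs = 0;
  for (let attempt = 0; ; attempt++) {
    // Double each time, but never exceed the per-delay cap.
    const delayMs = Math.min(maxDelayMs, baseMs * Math.pow(2, attempt));
    // Stop when the next wait would overrun the budget (or is non-positive).
    if (delayMs <= 0 || totalMs + delayMs > budgetMs) break;
    delays.push(delayMs);
    totalMs += delayMs;
  }
  return delays;
}
```

With a 1-second base, a 4-second cap, and a 10-second budget, the waits are 1s, 2s, and 4s (7 seconds in total); a fourth 4-second wait would push the total to 11 seconds, so the schedule ends there.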
Not Recovering from Fixable Errors
Problem: Agent gives up on errors that could be fixed
Solution: Explicitly guide recovery strategies
const result = await agentbase.runAgent({
  message: `
    Run Python script with automatic dependency resolution:
    - If ModuleNotFoundError: install missing package with pip
    - If SyntaxError: show error and ask for fix
    - If FileNotFoundError: create missing directories
    - Retry after fixing each error
  `,
  mode: "base"
});
Remember: Self-healing is automatic and built-in. Agents detect and recover from most errors without configuration. Provide clear guidance for complex error scenarios and retry policies.