Evals (evaluations) are a primitive for testing agent behavior systematically, ensuring quality, consistency, and correctness before deploying to production.
Overview
The Evals primitive provides a structured approach to testing agent capabilities, validating outputs, and measuring performance. Like unit tests for code, evals for agents ensure reliability and catch regressions before they impact users.
Evals are essential for:
Quality Assurance: Verify agents produce correct, high-quality outputs
Regression Testing: Catch behavior changes when updating prompts or configs
Performance Validation: Ensure agents meet speed and cost targets
Confidence Building: Deploy with certainty that agents work as expected
Continuous Improvement: Track quality metrics over time
Systematic Testing: Define test cases that cover expected behavior and edge cases
Automated Validation: Run evals automatically in CI/CD or on a schedule
Multiple Eval Types: Unit tests, integration tests, performance tests, and quality assessments
Actionable Insights: Clear pass/fail results with detailed feedback for improvements
How Evals Work
Evaluation Process
Evals follow a structured testing process:
Test Case Definition: Define inputs, expected outputs, and validation criteria
Execution: Run the agent with test inputs
Output Capture: Collect agent responses
Validation: Compare outputs against expected results or quality criteria
Scoring: Assign pass/fail or numerical scores
Reporting: Generate detailed test reports with insights
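Put together, the process is a small loop: run the agent, capture the output, validate it, and report. A minimal sketch of that loop (the agentbase client and runAgent call are shown in full in the code examples further down this page):

// Minimal sketch of the eval loop: execute → capture → validate → score → report.
async function evaluate(
  name: string,
  input: string,
  validate: (output: string) => boolean
): Promise<boolean> {
  const result = await agentbase.runAgent({ message: input, mode: "base" }); // execution
  const output = result.message;                                             // output capture
  const passed = validate(output);                                           // validation + scoring
  console.log(`${passed ? 'PASS' : 'FAIL'} ${name}`);                        // reporting
  return passed;
}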
Eval Types
Different eval types serve different purposes:
Functional Evals: Test core capabilities (can the agent do X?)
Quality Evals: Assess output quality (is the output good enough?)
Performance Evals: Measure speed, cost, and efficiency
Regression Evals: Ensure changes don't break existing functionality
Edge Case Evals: Test behavior in unusual or problematic scenarios
Continuous Testing: Evals should run regularly, on every code change, on every prompt update, or at scheduled intervals, to catch issues early.
Code Examples
Basic Eval Structure
import { Agentbase } from '@agentbase/sdk';

const agentbase = new Agentbase({
  apiKey: process.env.AGENTBASE_API_KEY
});

// Define an eval case
interface EvalCase {
  name: string;
  input: string;
  expectedBehavior: string;
  validate: (output: string) => boolean;
}

// Create eval case
const evalCase: EvalCase = {
  name: "Basic Math",
  input: "What is 15 multiplied by 23?",
  expectedBehavior: "Should calculate and return 345",
  validate: (output: string) => {
    return output.includes('345');
  }
};

// Run eval
async function runEval(evalCase: EvalCase): Promise<boolean> {
  const result = await agentbase.runAgent({
    message: evalCase.input,
    mode: "base"
  });

  const passed = evalCase.validate(result.message);

  console.log(`Eval: ${evalCase.name}`);
  console.log(`Input: ${evalCase.input}`);
  console.log(`Output: ${result.message}`);
  console.log(`Result: ${passed ? '✅ PASS' : '❌ FAIL'}`);

  return passed;
}

// Execute
const passed = await runEval(evalCase);
Eval Suite
// Comprehensive eval suite
class EvalSuite {
  private results: Map<string, boolean> = new Map();

  async run(evalCases: EvalCase[]): Promise<void> {
    console.log(`\n🧪 Running ${evalCases.length} evals...\n`);

    for (const evalCase of evalCases) {
      try {
        const passed = await this.runSingleEval(evalCase);
        this.results.set(evalCase.name, passed);
      } catch (error) {
        console.error(`Error in eval ${evalCase.name}:`, error);
        this.results.set(evalCase.name, false);
      }
    }

    this.printSummary();
  }

  async runSingleEval(evalCase: EvalCase): Promise<boolean> {
    const result = await agentbase.runAgent({
      message: evalCase.input,
      mode: "base"
    });

    const passed = evalCase.validate(result.message);
    const status = passed ? '✅ PASS' : '❌ FAIL';
    console.log(`${status} ${evalCase.name}`);

    return passed;
  }

  printSummary(): void {
    const total = this.results.size;
    const passed = Array.from(this.results.values()).filter(r => r).length;
    const failed = total - passed;
    const passRate = (passed / total) * 100;

    console.log('\n' + '='.repeat(50));
    console.log('📊 Eval Summary');
    console.log('='.repeat(50));
    console.log(`Total: ${total}`);
    console.log(`Passed: ${passed} ✅`);
    console.log(`Failed: ${failed} ❌`);
    console.log(`Pass Rate: ${passRate.toFixed(1)}%`);
    console.log('='.repeat(50) + '\n');

    if (passRate < 100) {
      console.log('❌ Some evals failed. Review failures above.\n');
      process.exit(1);
    } else {
      console.log('✅ All evals passed!\n');
    }
  }

  getResults() {
    return this.results;
  }
}

// Define eval cases
const evalCases: EvalCase[] = [
  {
    name: "File Reading",
    input: "Read the file data.csv and show the first row",
    expectedBehavior: "Should read file and display header",
    validate: (output) => output.toLowerCase().includes('header') || output.includes(',')
  },
  {
    name: "Data Calculation",
    input: "Calculate the sum of 123 and 456",
    expectedBehavior: "Should return 579",
    validate: (output) => output.includes('579')
  },
  {
    name: "Error Handling",
    input: "Read nonexistent_file.txt",
    expectedBehavior: "Should handle missing file gracefully",
    validate: (output) => output.toLowerCase().includes('not found') || output.toLowerCase().includes('does not exist')
  }
];

// Run suite
const suite = new EvalSuite();
await suite.run(evalCases);
Performance Evals
// Test performance metrics
interface PerformanceEval {
  name: string;
  input: string;
  maxSteps: number;
  maxDuration: number;  // milliseconds
  maxCost: number;      // dollars
}

async function runPerformanceEval(eval_: PerformanceEval): Promise<boolean> {
  const startTime = Date.now();
  let stepCount = 0;
  let totalCost = 0;

  const result = await agentbase.runAgent({
    message: eval_.input,
    mode: "base",
    stream: true
  });

  for await (const event of result) {
    if (event.type === 'agent_step') {
      stepCount++;
    }
    if (event.type === 'agent_cost') {
      totalCost += parseFloat(event.cost);
    }
  }

  const duration = Date.now() - startTime;

  const stepsPass = stepCount <= eval_.maxSteps;
  const durationPass = duration <= eval_.maxDuration;
  const costPass = totalCost <= eval_.maxCost;
  const allPass = stepsPass && durationPass && costPass;

  console.log(`\n${allPass ? '✅' : '❌'} ${eval_.name}`);
  console.log(`  Steps: ${stepCount}/${eval_.maxSteps} ${stepsPass ? '✅' : '❌'}`);
  console.log(`  Duration: ${duration}ms/${eval_.maxDuration}ms ${durationPass ? '✅' : '❌'}`);
  console.log(`  Cost: $${totalCost.toFixed(4)}/$${eval_.maxCost} ${costPass ? '✅' : '❌'}`);

  return allPass;
}

// Performance test cases
const performanceEvals: PerformanceEval[] = [
  {
    name: "Quick Query",
    input: "What is 2+2?",
    maxSteps: 2,
    maxDuration: 3000,
    maxCost: 0.01
  },
  {
    name: "File Processing",
    input: "Read data.csv and count rows",
    maxSteps: 5,
    maxDuration: 10000,
    maxCost: 0.05
  }
];

// Run performance evals
for (const eval_ of performanceEvals) {
  await runPerformanceEval(eval_);
}
Quality Evals with LLM-as-Judge
// Use LLM to evaluate quality
async function llmAsJudge(
  taskDescription: string,
  agentOutput: string,
  criteria: string[]
): Promise<{ scores: Record<string, number>; overallScore: number; feedback: string }> {
  const evaluationPrompt = `
Evaluate this agent response on a scale of 1-10 for each criterion.

Task: ${taskDescription}

Agent Response:
${agentOutput}

Evaluation Criteria:
${criteria.map((c, i) => `${i + 1}. ${c}`).join('\n')}

Provide:
1. Score for each criterion (1-10)
2. Overall score (average)
3. Brief feedback explaining the scores

Format as JSON:
{
  "scores": { "criterion1": score, "criterion2": score, ... },
  "overallScore": average,
  "feedback": "explanation"
}
`;

  const evaluation = await agentbase.runAgent({
    message: evaluationPrompt,
    mode: "base"
  });

  // Parse JSON from response
  const jsonMatch = evaluation.message.match(/\{[\s\S]*\}/);
  if (!jsonMatch) {
    throw new Error('Could not parse evaluation JSON');
  }

  return JSON.parse(jsonMatch[0]);
}

// Quality eval case
async function runQualityEval(
  input: string,
  criteria: string[],
  minScore: number = 7
): Promise<boolean> {
  // Get agent response
  const result = await agentbase.runAgent({
    message: input,
    mode: "base"
  });

  // Evaluate quality
  const evaluation = await llmAsJudge(input, result.message, criteria);
  const passed = evaluation.overallScore >= minScore;

  console.log(`\n${passed ? '✅' : '❌'} Quality Eval`);
  console.log(`  Input: ${input}`);
  console.log(`  Score: ${evaluation.overallScore}/10`);
  console.log(`  Feedback: ${evaluation.feedback}`);

  return passed;
}

// Example usage
await runQualityEval(
  "Explain how photosynthesis works",
  [
    "Accuracy: Is the explanation scientifically correct?",
    "Clarity: Is it easy to understand?",
    "Completeness: Does it cover key concepts?",
    "Examples: Does it include helpful examples?"
  ],
  8.0  // Minimum score
);
Regression Testing
// Test that changes don't break existing functionality
import { promises as fs } from 'fs';

class RegressionTester {
  private baseline: Map<string, string> = new Map();

  async createBaseline(testCases: EvalCase[]): Promise<void> {
    console.log('📸 Creating baseline...');

    for (const test of testCases) {
      const result = await agentbase.runAgent({
        message: test.input,
        mode: "base"
      });
      this.baseline.set(test.name, result.message);
    }

    // Save baseline to file
    await this.saveBaseline();
    console.log('✅ Baseline created\n');
  }

  async testRegression(testCases: EvalCase[]): Promise<void> {
    console.log('🔍 Testing for regressions...\n');

    // Load baseline
    await this.loadBaseline();

    let regressions = 0;

    for (const test of testCases) {
      const result = await agentbase.runAgent({
        message: test.input,
        mode: "base"
      });

      const baselineOutput = this.baseline.get(test.name);
      if (!baselineOutput) {
        console.log(`⚠️ ${test.name}: No baseline found`);
        continue;
      }

      // Compare outputs
      const similar = this.compareOutputs(baselineOutput, result.message);

      if (!similar) {
        regressions++;
        console.log(`❌ REGRESSION: ${test.name}`);
        console.log(`  Baseline: ${baselineOutput.substring(0, 100)}...`);
        console.log(`  Current: ${result.message.substring(0, 100)}...`);
      } else {
        console.log(`✅ ${test.name}`);
      }
    }

    if (regressions > 0) {
      console.log(`\n❌ Found ${regressions} regression(s)`);
      process.exit(1);
    } else {
      console.log('\n✅ No regressions detected');
    }
  }

  compareOutputs(baseline: string, current: string): boolean {
    // Implement comparison logic. Options include:
    // - Exact match
    // - Semantic similarity
    // - Key phrase matching
    // - LLM-based comparison

    // Simple example: check if key information is present
    const baselineKeywords = this.extractKeywords(baseline);
    const currentKeywords = this.extractKeywords(current);

    const overlap = baselineKeywords.filter(k => currentKeywords.includes(k));
    const similarity = overlap.length / baselineKeywords.length;

    return similarity > 0.8;  // 80% of keywords should match
  }

  extractKeywords(text: string): string[] {
    // Simple keyword extraction
    return text
      .toLowerCase()
      .split(/\W+/)
      .filter(word => word.length > 4)
      .slice(0, 20);
  }

  async saveBaseline(): Promise<void> {
    const data = Object.fromEntries(this.baseline);
    await fs.writeFile(
      'eval-baseline.json',
      JSON.stringify(data, null, 2)
    );
  }

  async loadBaseline(): Promise<void> {
    try {
      const data = await fs.readFile('eval-baseline.json', 'utf-8');
      const parsed = JSON.parse(data);
      this.baseline = new Map(Object.entries(parsed));
    } catch (error) {
      console.error('Could not load baseline:', error);
    }
  }
}

// Usage
const tester = new RegressionTester();

// Create baseline before making changes
await tester.createBaseline(evalCases);

// After making changes, test for regressions
await tester.testRegression(evalCases);
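compareOutputs above uses a simple keyword-overlap heuristic. When outputs legitimately vary in wording, an LLM-based comparison is usually more robust; a minimal sketch that reuses the agentbase client (the prompt wording and the YES/NO convention are illustrative assumptions, not a fixed API):

// Sketch: LLM-based comparison of baseline vs. current output.
// The prompt text and YES/NO convention are illustrative choices.
async function compareWithLLM(baseline: string, current: string): Promise<boolean> {
  const result = await agentbase.runAgent({
    message: `Do these two responses convey the same key information? Answer only YES or NO.\n\nResponse A:\n${baseline}\n\nResponse B:\n${current}`,
    mode: "base"
  });
  return result.message.toUpperCase().includes('YES');
}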
Eval Patterns
Categorized Test Suite
Organize evals by category:
Test individual capabilities in isolation:
const unitTests = {
  math: [
    {
      name: "Addition",
      input: "Calculate 10 + 15",
      validate: (output) => output.includes('25')
    },
    {
      name: "Multiplication",
      input: "Calculate 12 * 8",
      validate: (output) => output.includes('96')
    }
  ],
  fileOps: [
    {
      name: "Read File",
      input: "Read data.txt",
      validate: (output) => !output.includes('error')
    },
    {
      name: "List Files",
      input: "List files in current directory",
      validate: (output) => output.includes('data.txt')
    }
  ]
};
Test multi-step workflows:
const integrationTests = [
  {
    name: "Data Pipeline",
    input: "Load data.csv, filter for active users, calculate average age",
    validate: (output) => {
      return output.includes('average') &&
        output.match(/\d+(\.\d+)?/);  // Has a number
    }
  },
  {
    name: "Report Generation",
    input: "Analyze sales data and create PDF report",
    validate: (output) => {
      return output.includes('report') &&
        output.includes('pdf');
    }
  }
];
Test unusual or problematic scenarios:
const edgeCases = [
  {
    name: "Empty Input",
    input: "",
    validate: (output) => output.length > 0  // Should handle gracefully
  },
  {
    name: "Very Long Input",
    input: "A".repeat(10000),
    validate: (output) => !output.includes('error')
  },
  {
    name: "Special Characters",
    input: "Process text with émojis 🎉 and spëcial çharacters",
    validate: (output) => output.length > 0
  }
];
Test error handling:
const errorScenarios = [
  {
    name: "Missing File",
    input: "Read nonexistent.txt",
    validate: (output) => {
      return output.toLowerCase().includes('not found') ||
        output.toLowerCase().includes('does not exist');
    }
  },
  {
    name: "Invalid Calculation",
    input: "Divide 10 by 0",
    validate: (output) => {
      return output.toLowerCase().includes('cannot') ||
        output.toLowerCase().includes('undefined') ||
        output.toLowerCase().includes('infinity');
    }
  }
];
CI/CD Integration
Automate evals in your deployment pipeline:
// GitHub Actions example
// .github/workflows/evals.yml

// In your test script:
async function runCIEvals() {
  const suite = new EvalSuite();

  const testCases = [
    ...unitTests.math,
    ...unitTests.fileOps,
    ...integrationTests,
    ...edgeCases
  ];

  await suite.run(testCases);

  const results = suite.getResults();
  const passRate = calculatePassRate(results);

  // Fail CI if pass rate is below threshold
  if (passRate < 95) {
    console.error(`❌ Pass rate ${passRate}% is below required 95%`);
    process.exit(1);
  }

  console.log(`✅ All evals passed (${passRate}%)`);
}

runCIEvals();
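The CI and monitoring snippets call a calculatePassRate helper that is not defined elsewhere on this page; a minimal sketch, assuming it receives the Map returned by getResults():

// Assumed helper: percentage of passing evals in a results map.
function calculatePassRate(results: Map<string, boolean>): number {
  if (results.size === 0) return 0;
  const passed = Array.from(results.values()).filter(Boolean).length;
  return (passed / results.size) * 100;
}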
Continuous Monitoring
Run evals on schedule to detect production issues:
// Scheduled eval monitoring
async function scheduledEvals() {
  const suite = new EvalSuite();

  // Run critical production tests
  await suite.run(productionCriticalTests);

  const results = suite.getResults();
  const passRate = calculatePassRate(results);

  // Send results to monitoring
  await sendToMonitoring({
    timestamp: new Date(),
    passRate,
    results: Array.from(results.entries())
  });

  // Alert if issues detected
  if (passRate < 100) {
    await sendAlert({
      severity: 'warning',
      message: `Production evals: ${passRate}% pass rate`,
      details: results
    });
  }
}

// Run every hour
setInterval(scheduledEvals, 60 * 60 * 1000);
Use Cases
1. Pre-Deployment Validation
Ensure quality before deploying changes:
async function preDeploymentChecks() {
  console.log('🚀 Running pre-deployment validation...\n');

  // Run all eval categories
  const categories = [
    { name: 'Unit Tests', tests: [...unitTests.math, ...unitTests.fileOps] },
    { name: 'Integration Tests', tests: integrationTests },
    { name: 'Performance Tests', tests: performanceTests },
    { name: 'Regression Tests', tests: regressionTests }
  ];

  let allPassed = true;

  for (const category of categories) {
    console.log(`\n📋 ${category.name}`);

    const suite = new EvalSuite();
    await suite.run(category.tests);

    const results = suite.getResults();
    const passRate = calculatePassRate(results);

    if (passRate < 100) {
      allPassed = false;
    }
  }

  if (allPassed) {
    console.log('\n✅ All pre-deployment checks passed. Safe to deploy!');
    return true;
  } else {
    console.log('\n❌ Some checks failed. Fix issues before deploying.');
    return false;
  }
}
2. Prompt Engineering
Test and refine prompts:
// Compare different prompt versions
async function comparePrompts() {
  const prompts = [
    "You are a helpful assistant.",
    "You are a professional and concise assistant.",
    "You are a friendly assistant who provides clear, detailed explanations."
  ];

  const testInputs = [
    "Explain photosynthesis",
    "How do I reset my password?",
    "What is machine learning?"
  ];

  for (const prompt of prompts) {
    console.log(`\n🔍 Testing prompt: "${prompt}"`);

    let totalScore = 0;

    for (const input of testInputs) {
      const result = await agentbase.runAgent({
        message: input,
        system: prompt,
        mode: "base"
      });

      const evaluation = await llmAsJudge(
        input,
        result.message,
        ["Clarity", "Completeness", "Helpfulness"]
      );

      totalScore += evaluation.overallScore;
      console.log(`  ${input}: ${evaluation.overallScore}/10`);
    }

    const avgScore = totalScore / testInputs.length;
    console.log(`  Average Score: ${avgScore.toFixed(1)}/10`);
  }
}
3. Version Validation
Test new versions before promotion:
async function validateVersion(version: string) {
  console.log(`\n🧪 Validating version: ${version}`);

  const testCases = loadTestCases();

  // Run all tests against this version
  const results = [];

  for (const test of testCases) {
    const result = await agentbase.runAgent({
      message: test.input,
      version: version,
      mode: "base"
    });

    const passed = test.validate(result.message);
    results.push(passed);
  }

  const passRate = (results.filter(r => r).length / results.length) * 100;

  console.log(`\n📊 Version ${version} Results:`);
  console.log(`  Pass Rate: ${passRate.toFixed(1)}%`);

  if (passRate >= 95) {
    console.log(`  ✅ Version ${version} is ready for production`);
    return true;
  } else {
    console.log(`  ❌ Version ${version} needs improvement`);
    return false;
  }
}
Best Practices
Effective Test Design
Base tests on actual user scenarios:
// Good: Real user scenarios
const realWorldTests = [
  {
    name: "Customer Support - Order Status",
    input: "Where is my order #12345?",
    validate: (output) => {
      return output.includes('order') &&
        output.includes('12345');
    }
  }
];

// Avoid: Artificial test cases
const artificialTest = {
  name: "Test1",
  input: "Test input",
  validate: (output) => true
};
Tests should produce consistent results:
// Good: Deterministic validation
validate: (output) => {
  return output.includes('expected phrase') ||
    output.match(/\d+ users/);
}

// Avoid: Non-deterministic validation
validate: (output) => {
  return Math.random() > 0.5;  // Random results
}
Test boundary conditions and unusual inputs:
const edgeCases = [
  { input: "", name: "Empty input" },
  { input: "A".repeat(10000), name: "Very long input" },
  { input: "Special chars: @#$%^&*()", name: "Special characters" },
  { input: "Тест на русском", name: "Non-English characters" }
];
Integration with Other Primitives
With Traces
Use traces to debug failing evals:
// Debug failing eval with traces
async function debugFailingEval(evalCase: EvalCase) {
  const result = await agentbase.runAgent({
    message: evalCase.input,
    mode: "base",
    stream: true  // Get traces
  });

  const trace = [];

  for await (const event of result) {
    trace.push(event);

    if (event.type === 'agent_error') {
      console.log('Error in eval:', event.error);
    }
  }

  // Analyze trace to understand failure
  const finalMessage = trace.find(e => e.type === 'agent_message');
  const passed = evalCase.validate(finalMessage?.content || '');

  if (!passed) {
    console.log('Eval failed. Trace:');
    trace.forEach(e => console.log(`  [${e.type}]`, e));
  }
}
Learn more: Traces Primitive
With Versioning
Test different versions systematically:
// Compare eval results across versions
async function compareVersions(testCases: EvalCase[], versions: string[]) {
  for (const version of versions) {
    console.log(`\nTesting version: ${version}`);

    const suite = new EvalSuite();

    for (const test of testCases) {
      const result = await agentbase.runAgent({
        message: test.input,
        version: version,
        mode: "base"
      });

      const passed = test.validate(result.message);
      suite.getResults().set(test.name, passed);
    }

    suite.printSummary();
  }
}
Learn more: Versioning Primitive
Eval Execution Time
Simple validation: < 100ms overhead
LLM-as-judge: 2-5 seconds per eval
Parallel execution: Run independent tests in parallel
Optimization
// Run evals in parallel
async function parallelEvals(testCases: EvalCase[]) {
  const results = await Promise.all(
    testCases.map(async (test) => {
      const result = await agentbase.runAgent({
        message: test.input,
        mode: "base"
      });

      return {
        name: test.name,
        passed: test.validate(result.message)
      };
    })
  );

  return results;
}
Troubleshooting
Problem: Tests pass/fail inconsistently.
Solutions:
Make validation more flexible
Account for acceptable variations
Use semantic matching instead of exact matching
// More flexible validation
validate: (output) => {
  // Accept variations
  return output.includes('345') ||
    output.includes('three hundred forty-five') ||
    output.match(/3\s*4\s*5/);
}
Problem: Tests fail on acceptable outputs.
Solutions:
Focus on essential criteria
Allow for reasonable variations
Use LLM-as-judge for nuanced evaluation
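For example, a validator can defer to the llmAsJudge helper defined earlier so that meaning is judged rather than exact wording; a minimal sketch (the criteria and the 7/10 threshold are illustrative choices):

// Sketch: nuanced validation via the llmAsJudge helper defined earlier.
// Criteria and threshold are illustrative, not fixed values.
async function validateWithJudge(input: string, output: string): Promise<boolean> {
  const evaluation = await llmAsJudge(input, output, [
    "Correctness: Does the response answer the question?",
    "Relevance: Does it stay on topic?"
  ]);
  return evaluation.overallScore >= 7;
}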
Remember: Evals are your safety net. Write tests for critical functionality, run them automatically, and let them catch issues before users do.