Voice enables agents to communicate through natural speech, supporting real-time voice conversations, voice commands, and audio-based interactions.

Overview

The Voice primitive transforms text-based agents into voice-enabled conversational AI systems. With built-in speech-to-text (STT) and text-to-speech (TTS) capabilities, agents can understand spoken language and respond with natural-sounding speech, enabling hands-free interactions and accessibility features. Voice is essential for:
  • Voice Assistants: Build Alexa/Siri-style voice interfaces
  • Phone Systems: Create AI-powered phone support and IVR systems
  • Accessibility: Make applications accessible to visually impaired users
  • Hands-Free Interactions: Enable voice commands for hands-free operation
  • Natural Conversations: Provide more natural, human-like interactions
  • Multilingual Support: Communicate in multiple languages with native accents

Real-Time STT

Convert speech to text in real-time with high accuracy and low latency

Natural TTS

Generate natural-sounding speech in multiple voices, languages, and accents

Voice Streaming

Stream audio input and output for low-latency conversational experiences

Voice Customization

Customize voice characteristics, speed, pitch, and speaking style

How Voice Works

When you enable voice for an agent:
  1. Audio Input: User speaks into microphone or phone
  2. Speech-to-Text: Audio is transcribed to text in real-time
  3. Agent Processing: Agent processes text and generates response
  4. Text-to-Speech: Response is converted to natural speech
  5. Audio Output: Generated speech is played to user
  6. Streaming: For real-time conversations, audio streams continuously
Low Latency: Voice streaming enables natural conversation flow with minimal delay between speaking and response.
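The six steps above can be sketched as a single conversational turn. This is an illustrative sketch, not the SDK's internal pipeline: `transcribe`, `runAgent`, and `synthesize` are hypothetical stand-ins for the real STT, agent, and TTS stages, stubbed here so the flow is self-contained.

```typescript
// One voice turn: audio in -> transcript -> agent reply -> audio out.
type AudioChunk = { samples: number[] };

function transcribe(audio: AudioChunk): string {
  // Stub: real STT would stream partial transcripts as the user speaks.
  return "what's the weather today";
}

function runAgent(text: string): string {
  // Stub: the agent processes the transcript and generates a reply.
  return `You asked: "${text}". It's sunny.`;
}

function synthesize(text: string): AudioChunk {
  // Stub: real TTS returns encoded audio; here, one "sample" per character.
  return { samples: Array.from(text, (c) => c.charCodeAt(0)) };
}

function voiceTurn(input: AudioChunk): AudioChunk {
  const transcript = transcribe(input); // step 2: speech-to-text
  const reply = runAgent(transcript);   // step 3: agent processing
  return synthesize(reply);             // step 4: text-to-speech
}
```

In a streaming deployment these stages overlap rather than run sequentially, which is where the latency savings come from.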

Voice Capabilities

Speech-to-Text (STT)

Convert spoken words to text:
{
  type: "speech-to-text",
  features: {
    languages: ["en-US", "es-ES", "fr-FR", "de-DE", "ja-JP", ...],
    accuracy: "high",
    realtime: true,
    streaming: true,
    punctuation: true,
    profanityFilter: true // optional
  }
}

Text-to-Speech (TTS)

Generate natural speech from text:
{
  type: "text-to-speech",
  voices: {
    languages: ["en-US", "es-ES", "fr-FR", ...],
    genders: ["male", "female", "neutral"],
    styles: ["conversational", "professional", "cheerful", "empathetic"],
    customization: {
      speed: 1.0,   // range: 0.5 to 2.0
      pitch: 0,     // range: -20 to +20
      volume: 100   // range: 0 to 100
    }
  }
}

Code Examples

Basic Voice Conversation

import { Agentbase } from '@agentbase/sdk';

const agentbase = new Agentbase({
  apiKey: process.env.AGENTBASE_API_KEY
});

// Enable voice for agent
const result = await agentbase.runAgent({
  message: "Hello, how can I help you today?",
  voice: {
    enabled: true,
    input: {
      language: "en-US"
    },
    output: {
      voice: "en-US-Neural2-C", // Natural female voice
      speed: 1.0,
      pitch: 0
    }
  }
});

// Response includes audio URL
console.log('Audio URL:', result.audioUrl);

Real-Time Voice Streaming

// Establish voice stream connection
const voiceStream = await agentbase.createVoiceStream({
  voice: {
    input: {
      language: "en-US",
      streaming: true
    },
    output: {
      voice: "en-US-Neural2-J",
      streaming: true
    }
  },
  system: "You are a helpful voice assistant. Keep responses concise and natural."
});

// Handle incoming audio
voiceStream.on('audio', (audioChunk) => {
  // Play audio chunk in real-time
  audioPlayer.play(audioChunk);
});

// Handle transcriptions
voiceStream.on('transcript', (text) => {
  console.log('User said:', text);
});

// Handle agent responses
voiceStream.on('response', (text) => {
  console.log('Agent said:', text);
});

// Send audio from microphone
microphone.on('data', (audioData) => {
  voiceStream.sendAudio(audioData);
});

// Close stream when done
voiceStream.close();

Voice with Multiple Languages

// Multilingual voice support
const result = await agentbase.runAgent({
  message: "Bonjour, comment allez-vous?",
  voice: {
    enabled: true,
    input: {
      language: "auto-detect", // Automatically detect language
      supportedLanguages: ["en-US", "fr-FR", "es-ES", "de-DE"]
    },
    output: {
      voice: "fr-FR-Neural2-A", // French voice
      speed: 1.0
    }
  },
  system: "You are a multilingual assistant. Respond in the same language as the user."
});

Custom Voice Characteristics

// Customize voice output
const result = await agentbase.runAgent({
  message: "Explain quantum computing",
  voice: {
    enabled: true,
    output: {
      voice: "en-US-Neural2-J",
      speed: 0.9, // Slightly slower for clarity
      pitch: -2, // Slightly lower pitch
      style: "professional", // Professional speaking style
      emphasis: {
        technical_terms: "strong" // Emphasize technical terms
      }
    }
  },
  system: "You are a technical educator. Explain complex topics clearly."
});

Phone Integration

// Integrate with phone systems (Twilio example)
import twilio from 'twilio';
import express from 'express';

const app = express();
const accountSid = process.env.TWILIO_ACCOUNT_SID;
const authToken = process.env.TWILIO_AUTH_TOKEN;
const client = twilio(accountSid, authToken);

// Handle incoming call
app.post('/voice/incoming', async (req, res) => {
  const twiml = new twilio.twiml.VoiceResponse();

  // Create voice stream for this call
  const voiceStream = await agentbase.createVoiceStream({
    voice: {
      input: { language: "en-US" },
      output: {
        voice: "en-US-Neural2-C",
        format: "mulaw", // Phone-compatible format
        sampleRate: 8000
      }
    },
    system: "You are a customer service agent. Help the caller with their inquiry."
  });

  // Connect Twilio stream to Agentbase
  twiml.connect().stream({
    url: `wss://your-server.com/voice-stream/${voiceStream.id}`
  });

  res.type('text/xml');
  res.send(twiml.toString());
});

Voice with Interruption Handling

// Handle user interruptions gracefully
const voiceStream = await agentbase.createVoiceStream({
  voice: {
    input: {
      language: "en-US",
      vadEnabled: true, // Voice Activity Detection
      interruptionHandling: "graceful"
    },
    output: {
      voice: "en-US-Neural2-A",
      interruptible: true // Allow agent to be interrupted
    }
  },
  system: "You are a conversational assistant. If interrupted, acknowledge and address the new question."
});

voiceStream.on('interruption', (context) => {
  console.log('User interrupted at:', context.interruptedAt);
  console.log('New input:', context.newInput);
  // Agent automatically handles interruption
});

Use Cases

1. Customer Service Phone System

AI-powered phone support:
const phoneAgent = await agentbase.createVoiceStream({
  voice: {
    input: {
      language: "en-US",
      profanityFilter: true
    },
    output: {
      voice: "en-US-Neural2-C",
      style: "friendly"
    }
  },
  system: `You are a customer service representative for TechCorp.

  Call Flow:
  1. Greet the customer warmly
  2. Ask how you can help
  3. Listen to their issue
  4. Provide solution or escalate if needed
  5. Confirm resolution
  6. Thank them for calling

  Guidelines:
  - Be empathetic and patient
  - Keep responses concise (2-3 sentences)
  - Confirm understanding before providing solution
  - Offer to escalate if you cannot help`,
  mcpServers: [
    {
      serverName: "crm-system",
      serverUrl: "https://api.company.com/crm"
    }
  ]
});

// Handle call flow
phoneAgent.on('transcript', async (text) => {
  if (text.toLowerCase().includes('speak to human')) {
    await phoneAgent.transfer({
      destination: "human-support",
      context: phoneAgent.getConversationContext()
    });
  }
});

2. Voice-Enabled Personal Assistant

Hands-free assistant for daily tasks:
const assistant = await agentbase.createVoiceStream({
  voice: {
    input: {
      language: "en-US",
      wakeWord: "Hey Assistant", // Activate on wake word
      vadEnabled: true
    },
    output: {
      voice: "en-US-Neural2-J",
      style: "conversational"
    }
  },
  memory: {
    namespace: `user_${userId}`,
    enabled: true
  },
  system: `You are a personal voice assistant.

  Capabilities:
  - Set reminders and alarms
  - Check calendar and schedule meetings
  - Send messages
  - Get weather and news
  - Control smart home devices
  - Answer questions

  Keep responses brief and natural.`,
  mcpServers: [
    { serverName: "calendar", serverUrl: "..." },
    { serverName: "messaging", serverUrl: "..." },
    { serverName: "smart-home", serverUrl: "..." }
  ]
});

3. Language Learning Tutor

Interactive language practice:
const languageTutor = await agentbase.createVoiceStream({
  voice: {
    input: {
      language: "es-ES", // Student speaks Spanish
      pronunciationFeedback: true
    },
    output: {
      voice: "es-ES-Neural2-A",
      speed: 0.8, // Slower for learning
      style: "conversational"
    }
  },
  system: `You are a Spanish language tutor.

  Teaching Approach:
  - Listen to student's pronunciation
  - Provide gentle corrections
  - Use simple vocabulary
  - Repeat phrases for practice
  - Encourage and praise progress
  - Adapt to student's level

  Focus on conversational fluency.`,
  config: {
    pronunciationAnalysis: true,
    vocabularyLevel: "beginner"
  }
});

languageTutor.on('pronunciation', (analysis) => {
  console.log('Pronunciation score:', analysis.score);
  console.log('Suggestions:', analysis.improvements);
});

4. Healthcare Voice Interface

Accessible medical information:
const healthcareAgent = await agentbase.createVoiceStream({
  voice: {
    input: {
      language: "en-US",
      medicalTerminology: true
    },
    output: {
      voice: "en-US-Neural2-C",
      style: "empathetic",
      speed: 0.9
    }
  },
  system: `You are a healthcare information assistant.

  Guidelines:
  - Use clear, simple language
  - Be empathetic and reassuring
  - Never provide medical diagnosis
  - Always recommend consulting healthcare provider
  - Maintain HIPAA compliance
  - Focus on general health information

  Remind users this is not medical advice.`,
  rules: [
    "Never provide specific medical diagnoses",
    "Always recommend consulting a healthcare provider for medical concerns",
    "Do not access or share patient health information without authorization"
  ]
});

5. Voice-Activated Navigation

Hands-free driving assistance:
const navigationAgent = await agentbase.createVoiceStream({
  voice: {
    input: {
      language: "en-US",
      noiseReduction: true, // Filter road noise
      vadEnabled: true
    },
    output: {
      voice: "en-US-Neural2-D",
      volume: "adaptive", // Adjust to ambient noise
      style: "clear"
    }
  },
  system: `You are a voice navigation assistant.

  Provide:
  - Clear, timely directions
  - Traffic updates
  - Alternate routes if needed
  - Nearby points of interest
  - Weather warnings

  Keep instructions brief and precise.
  Announce turns well in advance.`,
  mcpServers: [
    { serverName: "maps", serverUrl: "..." },
    { serverName: "traffic", serverUrl: "..." }
  ]
});

6. Conference Room Assistant

Meeting room voice control:
const conferenceAssistant = await agentbase.createVoiceStream({
  voice: {
    input: {
      language: "en-US",
      multiSpeaker: true, // Identify different speakers
      vadEnabled: true
    },
    output: {
      voice: "en-US-Neural2-F",
      style: "professional"
    }
  },
  system: `You are a conference room assistant.

  Functions:
  - Start/end meetings
  - Dial participants
  - Control AV equipment
  - Take meeting notes
  - Set reminders
  - Book follow-up meetings

  Commands should be clear and confirmed.`,
  mcpServers: [
    { serverName: "zoom", serverUrl: "..." },
    { serverName: "calendar", serverUrl: "..." },
    { serverName: "room-control", serverUrl: "..." }
  ]
});

Best Practices

Voice Design

// Good: Brief, conversational responses
system: `Keep responses to 1-2 sentences.
Be conversational and natural.
Avoid long explanations unless asked.`

// Avoid: Long, verbose responses
system: `Provide detailed, comprehensive explanations
covering all aspects of the topic...`

// Enable graceful interruption handling
voice: {
  output: {
    interruptible: true,
    pauseOnInterruption: true
  },
  input: {
    interruptionHandling: "graceful"
  }
}

system: `If interrupted:
1. Stop speaking immediately
2. Acknowledge the interruption
3. Address the new question
4. Offer to continue previous topic if relevant`

// Good: Conversational, natural
"Sure, I can help with that. Let me check..."
"Great question! The answer is..."

// Avoid: Robotic, formal
"Affirmative. Processing request. Please wait."
"Query acknowledged. Retrieving data."
// Use audio cues for better UX
voiceStream.on('listening', () => {
  playSound('listening-chime.mp3');
});

voiceStream.on('processing', () => {
  playSound('thinking.mp3');
});

voiceStream.on('error', () => {
  playSound('error-tone.mp3');
});

Voice Selection

Match Voice to Use Case: Choose voice characteristics that match your application’s personality and audience.
// Customer service: Friendly, warm
voice: {
  output: {
    voice: "en-US-Neural2-C",
    style: "friendly",
    pitch: 2
  }
}

// Professional assistant: Clear, professional
voice: {
  output: {
    voice: "en-US-Neural2-J",
    style: "professional",
    pitch: 0
  }
}

// Educational content: Patient, clear
voice: {
  output: {
    voice: "en-US-Neural2-A",
    style: "conversational",
    speed: 0.85
  }
}

Error Handling

Always Handle Audio Errors: Network issues, microphone problems, and audio format incompatibilities can disrupt voice interactions.
voiceStream.on('error', (error) => {
  switch (error.type) {
    case 'microphone_access_denied':
      speakText("I need microphone access to hear you. Please check your settings.");
      break;

    case 'audio_format_unsupported':
      console.error('Audio format not supported:', error.format);
      // Fallback to supported format
      break;

    case 'network_error':
      speakText("I'm having trouble connecting. Please check your internet connection.");
      break;

    case 'transcription_failed':
      speakText("Sorry, I didn't catch that. Could you repeat?");
      break;

    default:
      speakText("Sorry, I'm having technical difficulties. Please try again.");
  }
});
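For network errors in particular, retrying with exponential backoff is a common pattern. The helper below is a hypothetical sketch (the SDK may reconnect on its own); it only computes the capped delay schedule a reconnect loop would use before reopening the stream.

```typescript
// Compute capped exponential backoff delays for stream reconnection.
// baseMs and capMs are illustrative defaults, not SDK settings.
function backoffDelays(attempts: number, baseMs = 500, capMs = 8000): number[] {
  const delays: number[] = [];
  for (let i = 0; i < attempts; i++) {
    // Double the delay each attempt, but never exceed the cap.
    delays.push(Math.min(baseMs * 2 ** i, capMs));
  }
  return delays;
}
```

A reconnect loop would wait `delays[i]` milliseconds before attempt `i`, giving the network time to recover without hammering the service.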

Privacy and Security

// Implement privacy controls
const voiceConfig = {
  voice: {
    input: {
      recording: {
        enabled: userConsent, // Only record with consent
        retention: "session", // Delete after session
        encryption: true
      },
      profanityFilter: true,
      sensitiveDataRedaction: true // Redact SSN, credit cards, etc.
    }
  },
  privacy: {
    storeAudio: false, // Don't store audio recordings
    storeTranscripts: userConsent,
    anonymizeData: true
  }
};
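The `sensitiveDataRedaction` flag above handles redaction server-side. As a belt-and-braces complement, you might also redact transcripts client-side before logging or storing them. The helper below is a hypothetical sketch with illustrative (not exhaustive) patterns for US SSNs and payment card numbers.

```typescript
// Redact common sensitive patterns from a transcript before logging.
// These regexes are illustrative only; production redaction needs a
// vetted, locale-aware pattern set.
function redactSensitive(transcript: string): string {
  return transcript
    .replace(/\b\d{3}-\d{2}-\d{4}\b/g, "[REDACTED-SSN]")    // US SSN format
    .replace(/\b(?:\d[ -]?){13,16}\b/g, "[REDACTED-CARD]"); // 13-16 digit card numbers
}
```

Running transcripts through a redactor like this before they reach logs or analytics keeps accidental leakage out of downstream systems even if a server-side flag is misconfigured.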

Integration with Other Primitives

With Memory

Remember conversation context:
const voiceAgent = await agentbase.createVoiceStream({
  voice: {
    enabled: true,
    output: { voice: "en-US-Neural2-C" }
  },
  memory: {
    namespace: `user_${userId}`,
    enabled: true
  },
  system: `Remember user preferences and past conversations.
  Reference previous discussions when relevant.`
});

// Agent recalls: "Last time we spoke about your project deadline..."
Learn more: Memory Primitive

With Custom Tools

Access external services:
const voiceAgent = await agentbase.createVoiceStream({
  voice: {
    enabled: true
  },
  mcpServers: [
    {
      serverName: "calendar",
      serverUrl: "https://api.company.com/calendar"
    }
  ],
  system: `You can:
  - Check calendar
  - Schedule meetings
  - Set reminders

  Use tools to access user's calendar.`
});
Learn more: Custom Tools Primitive

With Multi-Agent

Transfer between voice agents:
const result = await agentbase.createVoiceStream({
  voice: { enabled: true },
  agents: [
    {
      name: "Main Assistant",
      description: "General assistance and routing"
    },
    {
      name: "Technical Support",
      description: "Technical troubleshooting",
      voice: {
        output: {
          voice: "en-US-Neural2-J", // Different voice
          style: "professional"
        }
      }
    },
    {
      name: "Billing Support",
      description: "Billing and payments",
      voice: {
        output: {
          voice: "en-US-Neural2-C",
          style: "friendly"
        }
      }
    }
  ]
});

// Voice changes when transferring between agents
Learn more: Multi-Agent Primitive

Performance Considerations

Latency Optimization

// Minimize latency for real-time conversations
const voiceStream = await agentbase.createVoiceStream({
  voice: {
    input: {
      streaming: true,
      vadEnabled: true, // Detect speech end faster
      endpointingDelay: 500 // 500ms after speech ends
    },
    output: {
      streaming: true,
      format: "opus", // Compressed for faster streaming
      quality: "high" // Balance quality and latency
    }
  },
  config: {
    responseMode: "streaming", // Stream response as generated
    thinkingTime: "minimal" // Start responding quickly
  }
});

Bandwidth Management

Optimize for Network: Choose appropriate audio formats and quality based on network conditions.
// Adaptive quality based on network
function getVoiceConfig(networkSpeed: 'fast' | 'medium' | 'slow') {
  const configs = {
    fast: {
      format: "opus",
      sampleRate: 48000,
      quality: "high"
    },
    medium: {
      format: "opus",
      sampleRate: 24000,
      quality: "medium"
    },
    slow: {
      format: "mulaw",
      sampleRate: 8000,
      quality: "low"
    }
  };

  return configs[networkSpeed];
}

const voiceConfig = getVoiceConfig(detectedNetworkSpeed);

Cost Management

// Optimize costs for voice
const voiceStream = await agentbase.createVoiceStream({
  voice: {
    input: {
      vadEnabled: true, // Only transcribe actual speech
      silenceDetection: true // Stop transcription during silence
    },
    output: {
      caching: true, // Cache common responses
      compression: true
    }
  },
  config: {
    sessionTimeout: 300000, // 5 minute timeout
    maxSessionDuration: 1800000 // 30 minute max
  }
});

Troubleshooting

Problem: Voice output sounds robotic or distorted

Solutions:
  • Increase sample rate (24kHz or 48kHz)
  • Use neural voices instead of standard
  • Check network bandwidth
  • Reduce speed/pitch modifications
  • Use an audio format appropriate for the delivery medium (e.g., mulaw for telephony)
voice: {
  output: {
    voice: "en-US-Neural2-C", // Neural voice
    sampleRate: 48000, // High quality
    format: "opus", // Good compression
    bitrate: 128000 // Higher bitrate
  }
}
Problem: Delay between speaking and response

Solutions:
  • Enable streaming for input and output
  • Reduce endpointing delay
  • Optimize agent response time
  • Use closer geographic region
  • Enable voice activity detection
voice: {
  input: {
    streaming: true,
    vadEnabled: true,
    endpointingDelay: 300 // 300ms
  },
  output: {
    streaming: true,
    prebuffer: true // Start generating before full input
  }
}
Problem: Speech not transcribed correctly

Solutions:
  • Specify correct language
  • Improve audio quality (reduce noise)
  • Use appropriate dialect
  • Add custom vocabulary
  • Enable noise reduction
voice: {
  input: {
    language: "en-US",
    noiseReduction: true,
    customVocabulary: ["Agentbase", "API", "webhook"],
    hints: ["technical terms", "product names"]
  }
}
Problem: Agent continues speaking when interrupted

Solutions:
  • Enable interruptible output
  • Configure interruption handling
  • Reduce VAD sensitivity
  • Check audio pipeline configuration
voice: {
  output: {
    interruptible: true,
    pauseOnInterruption: true,
    gracefulStop: true
  },
  input: {
    interruptionHandling: "immediate",
    vadSensitivity: "high"
  }
}

Advanced Features

Emotion Detection

Detect user emotion from voice:
const voiceStream = await agentbase.createVoiceStream({
  voice: {
    input: {
      emotionDetection: true,
      sentimentAnalysis: true
    }
  },
  system: `Adapt your tone based on user emotion.
  If user sounds frustrated, be more empathetic.
  If user sounds happy, match their energy.`
});

voiceStream.on('emotion', (emotion) => {
  console.log('Detected emotion:', emotion.type);
  console.log('Confidence:', emotion.confidence);
  // Adjust response accordingly
});

Voice Biometrics

Identify speakers by voice:
const voiceStream = await agentbase.createVoiceStream({
  voice: {
    input: {
      speakerIdentification: true,
      voiceprint: userVoiceprint // Pre-enrolled voiceprint
    }
  },
  security: {
    requireVoiceAuth: true
  }
});

voiceStream.on('speaker-identified', (speaker) => {
  if (speaker.verified) {
    console.log('Verified user:', speaker.userId);
  } else {
    console.log('Unknown speaker');
  }
});

Multi-Party Conversations

Handle multiple speakers:
const voiceStream = await agentbase.createVoiceStream({
  voice: {
    input: {
      multiSpeaker: true,
      speakerDiarization: true // Separate different speakers
    }
  },
  system: `You are moderating a group conversation.
  Address speakers by name when identified.
  Manage turn-taking and keep discussion on track.`
});

voiceStream.on('speaker-change', (event) => {
  console.log('Now speaking:', event.speakerId);
});

Additional Resources

API Reference

Complete voice API documentation

Voice Design Guide

Best practices for voice UX

Examples

Voice integration examples
Remember: Voice interfaces require different design principles than text interfaces. Keep responses concise, design for interruptions, and provide clear audio feedback.