Extract text and data from images, documents, and scanned files
OCR (Optical Character Recognition) enables agents to read and extract text from images, PDFs, scanned documents, screenshots, and handwritten notes, making visual content searchable and actionable.
The OCR primitive gives agents visual text recognition, so they can process and understand text embedded in images and documents. It turns unstructured visual content into structured, machine-readable data that agents can analyze, index, and act on.
OCR is essential for:
Document Digitization: Convert scanned documents and PDFs into editable text
Data Extraction: Extract structured data from invoices, receipts, and forms
Image Analysis: Read text from screenshots, photos, and diagrams
Accessibility: Make visual content accessible and searchable
Automation: Process documents automatically without manual transcription
Multi-language Support: Extract text in multiple languages and scripts
Image to Text
Extract text from images in common formats (PNG, JPG, HEIC, WebP)
PDF Processing
Process multi-page PDFs with text and image content
Handwriting Recognition
Recognize handwritten text with high accuracy
Structured Extraction
Extract tables, forms, and structured data from documents
import { Agentbase } from '@agentbase/sdk';
import fs from 'fs';

const agentbase = new Agentbase({
  apiKey: process.env.AGENTBASE_API_KEY
});

// Extract text from image
const result = await agentbase.runAgent({
  message: "Extract all text from this receipt image",
  files: [{
    name: "receipt.jpg",
    data: fs.readFileSync('./receipt.jpg')
  }],
  capabilities: {
    ocr: { enabled: true }
  }
});

console.log('Extracted text:', result.text);
// Process image from URL
const result = await agentbase.runAgent({
  message: "Extract text from this document",
  files: [{ url: "https://example.com/document.pdf" }],
  capabilities: {
    ocr: {
      enabled: true,
      language: "en" // Optional: specify language
    }
  }
});
// Extract text in multiple languages
const result = await agentbase.runAgent({
  message: "Extract text from this multilingual document",
  files: [{ url: "https://example.com/multilingual.pdf" }],
  capabilities: {
    ocr: {
      enabled: true,
      languages: ["en", "es", "fr", "zh", "ja"]
    }
  }
});

// Agent automatically detects and extracts text in all languages
console.log('Extracted text:', result.text);
console.log('Detected languages:', result.detectedLanguages);
const idVerification = await agentbase.runAgent({
  message: "Extract information from this driver's license and verify it",
  files: [{ url: licenseImageUrl }],
  capabilities: {
    ocr: {
      enabled: true,
      extractionSchema: {
        fullName: "string",
        licenseNumber: "string",
        dateOfBirth: "string",
        expirationDate: "string",
        address: "string",
        state: "string"
      }
    }
  },
  system: `Extract all information from the ID document.

    Validate:
    - Document is not expired
    - All text is clearly readable
    - Photo is present and clear
    - Format matches expected ID template

    Flag any concerns about document authenticity.`
});

// Agent extracts data and performs validation checks
const formProcessor = await agentbase.runAgent({
  message: "Extract all fields from this application form",
  files: [{ url: formPdfUrl }],
  capabilities: {
    ocr: {
      enabled: true,
      extractTables: true,
      preserveLayout: true
    }
  },
  system: `Extract all form fields and values.

    For each field:
    - Identify the field name/label
    - Extract the filled value
    - Note if field is empty or unclear

    Organize by form section.`
});

// Agent extracts structured form data
// Can validate completeness and format
const documentDigitizer = await agentbase.runAgent({
  message: "Digitize this scanned document archive and create searchable index",
  files: archivePDFs.map(url => ({ url })),
  capabilities: {
    ocr: {
      enabled: true,
      pageRange: [1, 1000], // Process up to 1000 pages
      preserveLayout: true
    }
  },
  datastores: [{
    id: "ds_document_archive",
    name: "Document Archive"
  }],
  system: `Extract text from all documents.

    For each document:
    - Extract full text content
    - Identify document type
    - Extract metadata (dates, references, etc.)
    - Index in datastore for searching

    Preserve document structure and formatting.`
});

// Agent processes documents and makes them searchable
const cardScanner = await agentbase.runAgent({
  message: "Extract contact information from this business card",
  files: [{ data: cardImageBuffer }],
  capabilities: {
    ocr: {
      enabled: true,
      extractionSchema: {
        name: "string",
        title: "string",
        company: "string",
        email: "string",
        phone: "string",
        website: "string",
        address: "string"
      }
    }
  },
  mcpServers: [{
    serverName: "crm",
    serverUrl: "https://api.company.com/crm"
  }],
  system: `Extract all contact information from the business card.

    Then:
    - Check if contact already exists in CRM
    - If new, create contact record
    - If exists, update information
    - Add note about when card was scanned`
});

// Agent extracts data and syncs with CRM
// Good: High-quality image with clear text
capabilities: {
  ocr: {
    enabled: true,
    minResolution: 300 // DPI
  }
}

// Images should be:
// - At least 300 DPI for best results
// - Clear and well-lit
// - Properly oriented
// - Free of blur or motion artifacts
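The 300 DPI guideline can be checked before upload when the physical size of the scanned page is known. The following is a minimal sketch under that assumption; `ScanSize`, `effectiveDpi`, and `meetsMinResolution` are hypothetical helpers for illustration and are not part of the Agentbase SDK.

```typescript
// Hypothetical helpers (not part of the Agentbase SDK).
interface ScanSize {
  widthPx: number;
  heightPx: number;
  widthIn: number;  // physical width of the scanned page, in inches
  heightIn: number; // physical height of the scanned page, in inches
}

// Effective DPI is the lower of the horizontal and vertical pixel densities.
function effectiveDpi(scan: ScanSize): number {
  return Math.min(scan.widthPx / scan.widthIn, scan.heightPx / scan.heightIn);
}

// True when the scan reaches the minimum resolution (300 DPI by default).
function meetsMinResolution(scan: ScanSize, minDpi: number = 300): boolean {
  return effectiveDpi(scan) >= minDpi;
}

// A US Letter page (8.5 x 11 in) scanned at 2550 x 3300 px works out to 300 DPI.
const letterScan: ScanSize = { widthPx: 2550, heightPx: 3300, widthIn: 8.5, heightIn: 11 };
```

Images that fail the check are candidates for rescanning rather than OCR, since upscaling a low-resolution image does not restore lost detail.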
Use Schemas for Structured Data: Define extraction schemas to get consistently formatted output and improve accuracy.
Define Clear Schemas
// Good: Specific schema with types
extractionSchema: {
  invoiceNumber: "string",
  date: "string", // YYYY-MM-DD format
  total: "number",
  currency: "string"
}

// Better: Include validation in prompt
message: `Extract invoice data.
- Date must be in YYYY-MM-DD format
- Total must be numeric only
- Currency as 3-letter ISO code (USD, EUR, etc.)`
Validate Extracted Data
// Implement validation layer
const result = await agentbase.runAgent({
  message: "Extract and validate invoice data",
  files: [{ url: invoiceUrl }],
  capabilities: {
    ocr: { enabled: true }
  },
  system: `Extract invoice data and validate:
    - Invoice number: alphanumeric, 6-12 characters
    - Date: valid date, not in future
    - Total: positive number, matches sum of line items
    - Vendor: not empty

    Flag any validation errors with specific messages.`
});

if (result.validation.errors.length > 0) {
  console.log('Validation errors:', result.validation.errors);
}
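A client-side check can complement the agent's own validation by re-applying the same four rules to the extracted values. This is a sketch only: the `ExtractedInvoice` shape, including its `lineItems` field, and the `validateInvoice` helper are assumptions made for illustration, not a documented Agentbase response format.

```typescript
// Hypothetical shape for extracted invoice data; adjust to your own schema.
interface ExtractedInvoice {
  invoiceNumber: string;
  date: string; // expected as YYYY-MM-DD
  total: number;
  vendor: string;
  lineItems: { description: string; amount: number }[];
}

// Returns human-readable validation errors (empty array if the invoice passes).
function validateInvoice(invoice: ExtractedInvoice, now: Date = new Date()): string[] {
  const errors: string[] = [];

  if (!/^[A-Za-z0-9]{6,12}$/.test(invoice.invoiceNumber)) {
    errors.push("Invoice number must be alphanumeric, 6-12 characters");
  }

  // Round-trip through toISOString so impossible dates such as 2024-02-30 are rejected.
  const parsed = new Date(invoice.date + "T00:00:00Z");
  const validDate =
    /^\d{4}-\d{2}-\d{2}$/.test(invoice.date) &&
    !isNaN(parsed.getTime()) &&
    parsed.toISOString().slice(0, 10) === invoice.date;
  if (!validDate) {
    errors.push("Date is not a valid YYYY-MM-DD date");
  } else if (parsed.getTime() > now.getTime()) {
    errors.push("Date is in the future");
  }

  // Compare in integer cents to avoid floating-point drift.
  const itemCents = invoice.lineItems.reduce(
    (sum, item) => sum + Math.round(item.amount * 100),
    0
  );
  if (!(invoice.total > 0)) {
    errors.push("Total must be a positive number");
  } else if (Math.round(invoice.total * 100) !== itemCents) {
    errors.push("Total does not match sum of line items");
  }

  if (invoice.vendor.trim() === "") {
    errors.push("Vendor must not be empty");
  }

  return errors;
}
```

Passing `now` as a parameter keeps the "not in future" rule deterministic in tests.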
Handle Ambiguous Cases
// Provide guidance for unclear cases
system: `When extracting data:
- If a field is unclear or illegible, mark as "UNCLEAR"
- If multiple interpretations possible, include all
- For dates, try multiple formats: MM/DD/YYYY, DD/MM/YYYY, YYYY-MM-DD
- For amounts, specify if unclear whether includes tax
- Note confidence level for each extracted field`
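The date guidance above can also be applied after extraction. The sketch below lists every plausible reading of a raw date string across the three formats named in the prompt, so ambiguous values such as 03/04/2024 surface as two candidates instead of being silently resolved. `interpretDate` is a hypothetical helper written for this illustration.

```typescript
// Hypothetical helper: returns every plausible ISO (YYYY-MM-DD) reading of a raw date string.
function interpretDate(raw: string): string[] {
  const pad = (n: number): string => (n < 10 ? "0" + n : String(n));

  // A (year, month, day) triple is real only if it survives a round trip through Date.
  const isReal = (y: number, m: number, d: number): boolean => {
    const dt = new Date(Date.UTC(y, m - 1, d));
    return dt.getUTCFullYear() === y && dt.getUTCMonth() === m - 1 && dt.getUTCDate() === d;
  };

  const candidates: string[] = [];
  const add = (y: number, m: number, d: number): void => {
    const iso = y + "-" + pad(m) + "-" + pad(d);
    if (isReal(y, m, d) && candidates.indexOf(iso) === -1) candidates.push(iso);
  };

  const text = raw.trim();

  const isoMatch = text.match(/^([1-9]\d{3})-(\d{2})-(\d{2})$/);
  if (isoMatch) add(Number(isoMatch[1]), Number(isoMatch[2]), Number(isoMatch[3]));

  const slashMatch = text.match(/^(\d{1,2})\/(\d{1,2})\/([1-9]\d{3})$/);
  if (slashMatch) {
    const a = Number(slashMatch[1]);
    const b = Number(slashMatch[2]);
    const y = Number(slashMatch[3]);
    add(y, a, b); // MM/DD/YYYY reading
    add(y, b, a); // DD/MM/YYYY reading
  }

  return candidates;
}
```

A result with more than one entry marks the field as ambiguous and worth flagging; an empty result corresponds to the "UNCLEAR" case.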
// Process multiple documents efficiently
const documents = [
  { url: "doc1.pdf" },
  { url: "doc2.pdf" },
  { url: "doc3.pdf" }
];

// Process in parallel for speed
const results = await Promise.all(
  documents.map(doc =>
    agentbase.runAgent({
      message: "Extract text from document",
      files: [doc],
      capabilities: {
        ocr: { enabled: true }
      }
    })
  )
);
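`Promise.all` starts every request at once, which may be too aggressive for large batches. One common alternative is to cap the number of requests in flight. The helper below is a generic sketch of that pattern; `mapWithConcurrency` is a hypothetical name, and whether Agentbase enforces rate limits that make this necessary is an assumption.

```typescript
// Hypothetical helper: runs `worker` over `items` with at most `limit` calls in flight,
// returning results in input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each lane repeatedly claims the next unprocessed index until none remain.
  async function runLane(): Promise<void> {
    while (next < items.length) {
      const index = next++; // claim the index before awaiting
      results[index] = await worker(items[index], index);
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  const lanes: Promise<void>[] = [];
  for (let i = 0; i < laneCount; i++) {
    lanes.push(runLane());
  }
  await Promise.all(lanes);
  return results;
}
```

With the `documents` array above, the worker would be the same `agentbase.runAgent` call, for example with a limit of 3.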
Cache Results
// Cache OCR results for frequently accessed documents
const cache = new Map();

async function extractWithCache(documentUrl: string) {
  if (cache.has(documentUrl)) {
    return cache.get(documentUrl);
  }

  const result = await agentbase.runAgent({
    files: [{ url: documentUrl }],
    capabilities: {
      ocr: { enabled: true }
    }
  });

  cache.set(documentUrl, result);
  return result;
}
Selective Processing
// Only process pages that need OCR
capabilities: {
  ocr: {
    enabled: true,
    skipTextPages: true // Skip pages with existing text layer
  }
}

// Or process specific regions only
capabilities: {
  ocr: {
    enabled: true,
    regions: [
      { x: 0, y: 0, width: 500, height: 200 } // Top section only
    ]
  }
}
const result = await agentbase.runAgent({
  message: "Extract medical record data and update patient file",
  files: [{ url: medicalRecordUrl }],
  capabilities: {
    ocr: {
      enabled: true,
      extractionSchema: medicalRecordSchema
    }
  },
  mcpServers: [{
    serverName: "healthcare-system",
    serverUrl: "https://api.hospital.com/ehr"
  }],
  system: `Extract patient information from medical record.

    Then use healthcare system tools to:
    - Verify patient identity
    - Update medical history
    - Flag any critical findings
    - Schedule follow-up if needed`
});
Vision Model Costs: OCR uses vision models, which are priced differently from text-only models. See the pricing page for details.
// Optimize costs by selective processing
capabilities: {
  ocr: {
    enabled: true,
    skipTextPages: true,   // Don't OCR pages with text layer
    pageRange: [1, 10],    // Limit pages if only excerpt needed
    useStandardModel: true // Use faster/cheaper model for simple text
  }
}
For handwriting, ensure handwriting mode is enabled
// Improve accuracy with preprocessing
capabilities: {
  ocr: {
    enabled: true,
    preprocessing: {
      autoRotate: true,
      deskew: true,
      enhanceContrast: true,
      denoise: true
    },
    model: "claude-3.5-sonnet" // Use best model
  }
}
Tables Not Extracted Correctly
Problem: Table structure is lost or malformed
Solutions:
Enable table extraction explicitly
Use preserve layout option
Provide clear instructions about table format
Consider post-processing to validate table structure
capabilities: {
  ocr: {
    enabled: true,
    extractTables: true,
    preserveLayout: true
  }
}

message: `Extract tables from this document.
For each table:
- Identify column headers
- Preserve row order
- Maintain cell alignment
- Note any merged cells or special formatting`
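For the post-processing step mentioned in the solutions, a simple structural check is to confirm that every row has as many cells as there are column headers. The sketch below assumes the extracted table has already been parsed into headers and rows; `ExtractedTable` and `checkTableStructure` are hypothetical names, and the actual OCR output format may differ.

```typescript
// Hypothetical shape for an extracted table; the actual OCR output format may differ.
interface ExtractedTable {
  headers: string[];
  rows: string[][];
}

// Returns a list of structural problems (empty array if the table looks well-formed).
function checkTableStructure(table: ExtractedTable): string[] {
  const problems: string[] = [];

  if (table.headers.length === 0) {
    problems.push("Table has no column headers");
  }

  table.headers.forEach((header, i) => {
    if (header.trim() === "") {
      problems.push("Column " + (i + 1) + " has an empty header");
    }
  });

  // Rows with a different cell count usually indicate merged or dropped cells.
  table.rows.forEach((row, i) => {
    if (row.length !== table.headers.length) {
      problems.push(
        "Row " + (i + 1) + " has " + row.length + " cells, expected " + table.headers.length
      );
    }
  });

  return problems;
}
```

Tables that fail the check can be re-requested with more explicit instructions or routed for manual review.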
Processing Timeout on Large Files
Problem: Large PDFs time out before completion
Solutions:
Process in smaller page ranges
Use parallel processing
Increase timeout limits
Consider breaking into separate jobs
// Process large document in batches
async function processLargeDocument(url: string) {
  const totalPages = 500;
  const batchSize = 50;
  const results = [];

  for (let i = 0; i < totalPages; i += batchSize) {
    const result = await agentbase.runAgent({
      files: [{ url }],
      capabilities: {
        ocr: {
          enabled: true,
          pageRange: [i + 1, i + batchSize],
          parallelProcessing: true
        }
      },
      timeout: 120000 // 2 minutes per batch
    });

    results.push(result);
  }

  return results;
}
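The batch loop above uses a page count that is an exact multiple of the batch size. For other page counts, the final range would run past the end of the document unless it is clamped. The sketch below computes clamped ranges up front; `pageRanges` is a hypothetical helper, and how the service treats an out-of-bounds `pageRange` is not stated here.

```typescript
// Hypothetical helper: splits pages 1..totalPages into inclusive [start, end] ranges
// of at most `batchSize` pages, clamping the last range to the document length.
function pageRanges(totalPages: number, batchSize: number): [number, number][] {
  if (batchSize < 1) {
    throw new Error("batchSize must be at least 1");
  }

  const ranges: [number, number][] = [];
  for (let start = 1; start <= totalPages; start += batchSize) {
    const end = Math.min(start + batchSize - 1, totalPages);
    ranges.push([start, end]);
  }
  return ranges;
}
```

Each returned pair can be passed as the `pageRange` value for one batch.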
Structured Data Extraction Inconsistent
Problem: Extracted data doesn’t match expected schema
Solutions:
Define very specific extraction schema
Provide examples in prompt
Add validation instructions
Use stricter typing and format requirements
const result = await agentbase.runAgent({
  message: `Extract invoice data following this exact schema.

    Example output:
    {
      "invoiceNumber": "INV-12345",
      "date": "2024-01-15",
      "total": 1250.00
    }

    Rules:
    - Date must be YYYY-MM-DD format
    - Total must be number with 2 decimal places
    - Invoice number must include "INV-" prefix`,
  files: [{ url: invoiceUrl }], // The document to extract from
  capabilities: {
    ocr: {
      enabled: true,
      extractionSchema: {
        invoiceNumber: "string",
        date: "string",
        total: "number"
      }
    }
  }
});
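Schema mismatches can also be caught on the client before the data is used. The following sketch checks an extracted object against a schema written in the same field-to-type style as the `extractionSchema` examples. `checkAgainstSchema` is a hypothetical helper, and it assumes the extracted data is available as a plain object.

```typescript
// Hypothetical schema format mirroring the extractionSchema examples:
// each field name maps to "string" or "number".
type FieldType = "string" | "number";
type Schema = { [field: string]: FieldType };

// Returns a list of mismatches between `data` and `schema` (empty array if they agree).
function checkAgainstSchema(data: { [field: string]: unknown }, schema: Schema): string[] {
  const problems: string[] = [];

  for (const field of Object.keys(schema)) {
    const value = data[field];
    const expected = schema[field];

    if (value === undefined || value === null) {
      problems.push("Missing field: " + field);
    } else if (typeof value !== expected) {
      problems.push("Field " + field + " should be " + expected + ", got " + typeof value);
    } else if (typeof value === "number" && !isFinite(value)) {
      problems.push("Field " + field + " is not a finite number");
    }
  }

  return problems;
}
```

A common failure this catches is a numeric field returned as a string, such as "1250.00" instead of 1250.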
const result = await agentbase.runAgent({
  message: "Identify document type and extract relevant data",
  files: [{ url: documentUrl }],
  capabilities: {
    ocr: { enabled: true }
  },
  system: `First, classify the document type:
    - Invoice
    - Receipt
    - Contract
    - Form
    - Letter
    - Other

    Based on type, extract appropriate fields using the correct schema.`
});

// Agent adapts extraction based on document type
const result = await agentbase.runAgent({
  message: "Extract data and provide confidence scores",
  files: [{ url: documentUrl }],
  capabilities: {
    ocr: { enabled: true }
  },
  system: `Extract data and for each field provide:
    - Extracted value
    - Confidence score (0-1)
    - Reasoning for low confidence if < 0.7

    Flag fields needing human review if confidence < 0.5`
});

// Review low-confidence extractions manually
if (result.lowConfidenceFields.length > 0) {
  // Send for human review
}
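The two thresholds in the prompt (0.7 and 0.5) can be applied in code once per-field scores are available. The sketch below sorts fields into three buckets; the `FieldResult` shape and `triageByConfidence` helper are assumptions for illustration, since the exact structure of the agent's response is not specified here.

```typescript
// Hypothetical per-field result shape; the agent's actual response format may differ.
interface FieldResult {
  name: string;
  value: string;
  confidence: number; // 0 to 1
}

interface Triage {
  accepted: FieldResult[];      // confidence >= lowBelow
  lowConfidence: FieldResult[]; // reviewBelow <= confidence < lowBelow
  needsReview: FieldResult[];   // confidence < reviewBelow
}

// Splits fields into buckets using the thresholds from the prompt (0.5 and 0.7 by default).
function triageByConfidence(
  fields: FieldResult[],
  reviewBelow: number = 0.5,
  lowBelow: number = 0.7
): Triage {
  const triage: Triage = { accepted: [], lowConfidence: [], needsReview: [] };

  for (const field of fields) {
    if (field.confidence < reviewBelow) {
      triage.needsReview.push(field);
    } else if (field.confidence < lowBelow) {
      triage.lowConfidence.push(field);
    } else {
      triage.accepted.push(field);
    }
  }

  return triage;
}
```

Fields in `needsReview` go to a human; fields in `lowConfidence` can be accepted with the agent's stated reasoning attached.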
const result = await agentbase.runAgent({
  message: "Extract data from all invoices and create summary report",
  files: invoiceFiles,
  capabilities: {
    ocr: { enabled: true }
  },
  system: `Extract data from all invoices.

    Then create summary:
    - Total amount across all invoices
    - Group by vendor
    - Identify any duplicates
    - Flag discrepancies or unusual amounts
    - Calculate payment due dates`
});
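Part of this summary is plain arithmetic that can be cross-checked locally against the agent's report. The sketch below totals invoices, groups them by vendor, and lists duplicate invoice numbers. `InvoiceRecord` and `summarizeInvoices` are hypothetical, and counting only the first copy of a duplicate is a design choice made for this example.

```typescript
// Hypothetical record shape for one extracted invoice.
interface InvoiceRecord {
  invoiceNumber: string;
  vendor: string;
  total: number;
}

interface InvoiceSummary {
  grandTotal: number;
  totalsByVendor: { [vendor: string]: number };
  duplicateInvoiceNumbers: string[];
}

// Totals invoices and groups them by vendor. An invoice counts as a duplicate when the
// same vendor and invoice number appear again; only the first copy is added to the totals.
function summarizeInvoices(invoices: InvoiceRecord[]): InvoiceSummary {
  const centsByVendor: { [vendor: string]: number } = {};
  const seen: { [key: string]: boolean } = {};
  const duplicates: string[] = [];
  let grandCents = 0;

  for (const invoice of invoices) {
    const key = invoice.vendor + "|" + invoice.invoiceNumber;
    if (seen[key]) {
      if (duplicates.indexOf(invoice.invoiceNumber) === -1) {
        duplicates.push(invoice.invoiceNumber);
      }
      continue;
    }
    seen[key] = true;

    // Sum in integer cents to avoid floating-point drift.
    const cents = Math.round(invoice.total * 100);
    centsByVendor[invoice.vendor] = (centsByVendor[invoice.vendor] || 0) + cents;
    grandCents += cents;
  }

  const totalsByVendor: { [vendor: string]: number } = {};
  for (const vendor of Object.keys(centsByVendor)) {
    totalsByVendor[vendor] = centsByVendor[vendor] / 100;
  }

  return {
    grandTotal: grandCents / 100,
    totalsByVendor: totalsByVendor,
    duplicateInvoiceNumbers: duplicates
  };
}
```

A mismatch between these figures and the agent's summary is a useful signal that an extraction needs a second look.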
Pro Tip: For best results, combine OCR with clear extraction instructions and validation logic. The agent can verify extracted data and flag inconsistencies automatically.