The Crawl & Scrape extension enables agents to systematically extract data from websites, navigate multi-page structures, and gather information at scale. Unlike simple web requests, this extension provides intelligent crawling with rate limiting, session management, and structured data extraction.
Intelligent Crawling
Navigate website structures automatically, following links and pagination
Data Extraction
Extract structured data from HTML, JSON, and APIs
Rate Limiting
Respect robots.txt and avoid overwhelming servers
Session Management
Handle cookies, authentication, and stateful navigation
import Agentbase from "@agentbase/sdk";const agentbase = new Agentbase({ apiKey: process.env.AGENTBASE_API_KEY});// Scrape product informationconst result = await agentbase.runAgent({ message: "Go to example.com/products and extract all product names and prices", mode: "base"});console.log('Extracted data:', result.content);
from agentbase import Agentbaseclient = Agentbase(api_key=os.environ.get("AGENTBASE_API_KEY"))# Scrape product informationresult = client.run_agent( message="Go to example.com/products and extract all product names and prices", mode="base")print("Extracted data:", result.content)
curl -X POST https://api.agentbase.sh/run-agent \ -H "Authorization: Bearer YOUR_API_KEY" \ -H "Content-Type: application/json" \ -d '{ "message": "Go to example.com/products and extract all product names and prices", "mode": "base" }'
// Crawl multiple pages with paginationconst result = await agentbase.runAgent({ message: `Visit example.com/blog and extract: - All article titles - Author names - Publication dates - Follow pagination to get all articles from pages 1-5`, mode: "base"});
// Extract data in specific formatconst result = await agentbase.runAgent({ message: `Scrape example.com/directory and return data as JSON: { "listings": [ { "name": "...", "address": "...", "phone": "...", "rating": "..." } ] }`, mode: "base"});const data = JSON.parse(result.content);
Extract business contact information from directories:
async function extractLeads(directoryUrl: string) { const result = await agentbase.runAgent({ message: `Scrape ${directoryUrl} and extract all business listings: - Business name - Industry/category - Contact email - Phone number - Website URL Follow pagination to get all listings. Return as CSV format.`, mode: "base" }); // Save to file fs.writeFileSync('leads.csv', result.content); return result.content;}
async function aggregateJobs(searchTerms: string[]) { const jobBoards = [ 'https://jobs.example.com', 'https://careers.example.org', 'https://opportunities.example.net' ]; const allJobs = []; for (const board of jobBoards) { for (const term of searchTerms) { const result = await agentbase.runAgent({ message: `Search for "${term}" on ${board} and extract: - Job title - Company name - Location - Salary (if available) - Posted date - Job URL Return as JSON array. Get the first 20 results.`, mode: "base" }); allJobs.push(...JSON.parse(result.content)); } } return allJobs;}
Why: Violating robots.txt can get your IP blocked and is considered bad practice.
Implement Rate Limiting
Avoid overwhelming servers with requests:
// ✅ Good: Stagger requestsasync function crawlWithDelay(urls: string[], delayMs: number = 2000) { const results = []; for (const url of urls) { const result = await agentbase.runAgent({ message: `Extract data from ${url}`, mode: "base" }); results.push(result); // Wait before next request if (urls.indexOf(url) < urls.length - 1) { await new Promise(resolve => setTimeout(resolve, delayMs)); } } return results;}
Handle Errors Gracefully
Expect and handle failures:
async function scrapeSafely(url: string) { try { const result = await agentbase.runAgent({ message: `Extract data from ${url}. If the page is not found or blocked, return an error message.`, mode: "base" }); return { success: true, data: result.content }; } catch (error) { console.error(`Failed to scrape ${url}:`, error); return { success: false, error: error.message, url }; }}
Be Specific About Data Format
Clearly specify desired output format:
// ✅ Good: Specify formatconst result = await agentbase.runAgent({ message: `Extract product data and return as JSON: { "products": [ { "name": "string", "price": "number", "inStock": "boolean", "url": "string" } ] }`, mode: "base"});// ❌ Bad: Vague formatconst result = await agentbase.runAgent({ message: "Get product data", mode: "base"});
Cache Results
Avoid re-scraping unchanged data:
import { LRUCache } from 'lru-cache';const cache = new LRUCache({ max: 1000, ttl: 1000 * 60 * 60 // 1 hour});async function scrapeWithCache(url: string) { // Check cache first const cached = cache.get(url); if (cached) { console.log('Cache hit:', url); return cached; } // Scrape if not cached const result = await agentbase.runAgent({ message: `Extract data from ${url}`, mode: "base" }); // Cache result cache.set(url, result.content); return result.content;}
Monitor for Changes
Detect when websites update structure:
async function detectChanges(url: string, expectedStructure: string[]) { const result = await agentbase.runAgent({ message: `Check if ${url} contains these elements: ${expectedStructure.join(', ')}. Return which are present and which are missing.`, mode: "flash" }); if (result.content.includes('missing')) { await alertTeam({ message: `Website structure changed: ${url}`, details: result.content }); }}