
Overview

The Crawl & Scrape extension enables agents to systematically extract data from websites, navigate multi-page structures, and gather information at scale. Unlike simple web requests, this extension provides intelligent crawling with rate limiting, session management, and structured data extraction.

Intelligent Crawling

Navigate website structures automatically, following links and pagination

Data Extraction

Extract structured data from HTML, JSON, and APIs

Rate Limiting

Respect robots.txt and avoid overwhelming servers

Session Management

Handle cookies, authentication, and stateful navigation

How It Works

Agents use the built-in web tool with crawl mode to systematically extract data from websites:
1. Target Identification: The agent identifies the target website and the data to extract
2. Page Navigation: The agent navigates to the target page using the browser environment
3. Data Extraction: Extracts the relevant data using DOM selectors, regular expressions, or page structure
4. Link Following: Optionally follows links to crawl multiple pages
5. Data Structuring: Organizes the extracted data into a structured format

Basic Usage

Simple Page Scraping

import Agentbase from "@agentbase/sdk";

const agentbase = new Agentbase({
  apiKey: process.env.AGENTBASE_API_KEY
});

// Scrape product information
const result = await agentbase.runAgent({
  message: "Go to example.com/products and extract all product names and prices",
  mode: "base"
});

console.log('Extracted data:', result.content);

Multi-Page Crawling

// Crawl multiple pages with pagination
const result = await agentbase.runAgent({
  message: `Visit example.com/blog and extract:
    - All article titles
    - Author names
    - Publication dates
    - Follow pagination to get all articles from pages 1-5`,
  mode: "base"
});

Structured Data Extraction

// Extract data in specific format
const result = await agentbase.runAgent({
  message: `Scrape example.com/directory and return data as JSON:
    {
      "listings": [
        {
          "name": "...",
          "address": "...",
          "phone": "...",
          "rating": "..."
        }
      ]
    }`,
  mode: "base"
});

const data = JSON.parse(result.content);

Use Cases

1. Competitive Intelligence

Monitor competitor websites for pricing, features, and updates:
async function monitorCompetitors(competitors: string[]) {
  const results = await Promise.all(
    competitors.map(async (url) => {
      return await agentbase.runAgent({
        message: `Visit ${url} and extract:
          - All product names and prices
          - Key features
          - Any promotional banners
          - Last updated date`,
        mode: "base"
      });
    })
  );

  return results.map((r, i) => ({
    competitor: competitors[i],
    data: r.content
  }));
}

// Usage
const intel = await monitorCompetitors([
  'https://competitor1.com/deploy/pricing',
  'https://competitor2.com/products',
  'https://competitor3.com/features'
]);

2. Lead Generation

Extract business contact information from directories:
import fs from "fs";

async function extractLeads(directoryUrl: string) {
  const result = await agentbase.runAgent({
    message: `Scrape ${directoryUrl} and extract all business listings:
      - Business name
      - Industry/category
      - Contact email
      - Phone number
      - Website URL

      Follow pagination to get all listings. Return the results in CSV format.`,
    mode: "base"
  });

  // Save to file
  fs.writeFileSync('leads.csv', result.content);
  return result.content;
}

3. Content Aggregation

Aggregate content from multiple sources:
async function aggregateNews(topics: string[]) {
  const articles = [];

  for (const topic of topics) {
    const result = await agentbase.runAgent({
      message: `Search for recent articles about "${topic}" and extract:
        - Article title
        - Source/publication
        - Publication date
        - Summary
        - URL

        Get the 10 most recent articles. Return as a JSON array.`,
      mode: "base"
    });

    articles.push({
      topic,
      articles: JSON.parse(result.content)
    });
  }

  return articles;
}

// Usage
const news = await aggregateNews([
  'AI developments',
  'Cloud computing trends',
  'Cybersecurity news'
]);

4. Price Monitoring

Track product prices across e-commerce sites:
interface PriceAlert {
  product: string;
  targetPrice: number;
  urls: string[];
}

async function monitorPrices(alerts: PriceAlert[]) {
  for (const alert of alerts) {
    for (const url of alert.urls) {
      const result = await agentbase.runAgent({
        message: `Visit ${url} and extract the current price for ${alert.product}`,
        mode: "flash"
      });

      // Parse price from response
      const priceMatch = result.content.match(/\$?([\d,]+\.?\d*)/);
      if (priceMatch) {
        const currentPrice = parseFloat(priceMatch[1].replace(/,/g, ''));

        if (currentPrice <= alert.targetPrice) {
          await sendAlert({
            product: alert.product,
            currentPrice,
            targetPrice: alert.targetPrice,
            url
          });
        }
      }
    }
  }
}

5. Job Listings Aggregation

Collect job postings from multiple boards:
async function aggregateJobs(searchTerms: string[]) {
  const jobBoards = [
    'https://jobs.example.com',
    'https://careers.example.org',
    'https://opportunities.example.net'
  ];

  const allJobs = [];

  for (const board of jobBoards) {
    for (const term of searchTerms) {
      const result = await agentbase.runAgent({
        message: `Search for "${term}" on ${board} and extract:
          - Job title
          - Company name
          - Location
          - Salary (if available)
          - Posted date
          - Job URL

          Return as JSON array. Get the first 20 results.`,
        mode: "base"
      });

      allJobs.push(...JSON.parse(result.content));
    }
  }

  return allJobs;
}

6. Market Research

Gather product reviews and ratings:
async function collectReviews(productUrl: string) {
  const result = await agentbase.runAgent({
    message: `Visit ${productUrl} and extract all customer reviews:
      - Reviewer name
      - Rating (stars)
      - Review text
      - Date
      - Verified purchase (if shown)

      Follow pagination to get all reviews. Return as JSON.`,
    mode: "base"
  });

  const reviews = JSON.parse(result.content);

  // Analyze sentiment
  const analysis = await agentbase.runAgent({
    message: `Analyze these reviews and provide:
      - Average rating
      - Common positive themes
      - Common negative themes
      - Overall sentiment

      Reviews: ${JSON.stringify(reviews.slice(0, 50))}`,
    mode: "base"
  });

  return {
    reviews,
    analysis: analysis.content
  };
}

Best Practices

Always respect website crawling policies:
// ✅ Good: Mention respecting robots.txt
const result = await agentbase.runAgent({
  message: "Crawl example.com (respecting robots.txt) and extract product data",
  mode: "base"
});
Why: Violating robots.txt can get your IP blocked and is considered bad practice.
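If you need to check paths yourself before crawling, a rough sketch along these lines can fetch and naively read robots.txt (assumes Node 18+ for the global fetch; a production crawler should use a dedicated robots.txt parser, since this ignores wildcards, Allow rules, and per-agent groups):
// Minimal, illustrative robots.txt check (naive prefix matching only)
async function isPathAllowed(siteUrl: string, path: string): Promise<boolean> {
  // Fetch the site's robots.txt; if it doesn't exist, assume crawling is allowed
  const response = await fetch(new URL('/robots.txt', siteUrl));
  if (!response.ok) return true;

  // Collect Disallow rules and check the target path against them
  const disallowed = (await response.text())
    .split('\n')
    .filter(line => line.trim().toLowerCase().startsWith('disallow:'))
    .map(line => line.slice(line.indexOf(':') + 1).trim())
    .filter(Boolean);

  return !disallowed.some(rule => path.startsWith(rule));
}

// Usage: skip disallowed paths before asking the agent to crawl
if (await isPathAllowed('https://example.com', '/products')) {
  const result = await agentbase.runAgent({
    message: "Crawl example.com/products (respecting robots.txt) and extract product data",
    mode: "base"
  });
}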
Avoid overwhelming servers with requests:
// ✅ Good: Stagger requests
async function crawlWithDelay(urls: string[], delayMs: number = 2000) {
  const results = [];

  for (let i = 0; i < urls.length; i++) {
    const result = await agentbase.runAgent({
      message: `Extract data from ${urls[i]}`,
      mode: "base"
    });

    results.push(result);

    // Wait before the next request (skip the delay after the last URL)
    if (i < urls.length - 1) {
      await new Promise(resolve => setTimeout(resolve, delayMs));
    }
  }

  return results;
}
Expect and handle failures:
async function scrapeSafely(url: string) {
  try {
    const result = await agentbase.runAgent({
      message: `Extract data from ${url}. If the page is not found or blocked, return an error message.`,
      mode: "base"
    });

    return {
      success: true,
      data: result.content
    };

  } catch (error) {
    console.error(`Failed to scrape ${url}:`, error);

    return {
      success: false,
      error: error instanceof Error ? error.message : String(error),
      url
    };
  }
}
Clearly specify desired output format:
// ✅ Good: Specify format
const result = await agentbase.runAgent({
  message: `Extract product data and return as JSON:
    {
      "products": [
        {
          "name": "string",
          "price": "number",
          "inStock": "boolean",
          "url": "string"
        }
      ]
    }`,
  mode: "base"
});

// ❌ Bad: Vague format
const result = await agentbase.runAgent({
  message: "Get product data",
  mode: "base"
});
Avoid re-scraping unchanged data:
import { LRUCache } from 'lru-cache';

const cache = new LRUCache<string, string>({
  max: 1000,
  ttl: 1000 * 60 * 60 // 1 hour
});

async function scrapeWithCache(url: string) {
  // Check cache first
  const cached = cache.get(url);
  if (cached) {
    console.log('Cache hit:', url);
    return cached;
  }

  // Scrape if not cached
  const result = await agentbase.runAgent({
    message: `Extract data from ${url}`,
    mode: "base"
  });

  // Cache result
  cache.set(url, result.content);
  return result.content;
}
Detect when websites update structure:
async function detectChanges(url: string, expectedStructure: string[]) {
  const result = await agentbase.runAgent({
    message: `Check if ${url} contains these elements: ${expectedStructure.join(', ')}. Return which are present and which are missing.`,
    mode: "flash"
  });

  if (result.content.includes('missing')) {
    await alertTeam({
      message: `Website structure changed: ${url}`,
      details: result.content
    });
  }
}

Advanced Patterns

Parallel Scraping

async function scrapeInParallel(urls: string[], concurrency: number = 5) {
  const results = [];

  for (let i = 0; i < urls.length; i += concurrency) {
    const batch = urls.slice(i, i + concurrency);

    const batchResults = await Promise.all(
      batch.map(url =>
        agentbase.runAgent({
          message: `Extract data from ${url}`,
          mode: "base"
        })
      )
    );

    results.push(...batchResults);

    // Rate limit between batches
    if (i + concurrency < urls.length) {
      await new Promise(resolve => setTimeout(resolve, 2000));
    }
  }

  return results;
}

Recursive Crawling

async function crawlSitemap(startUrl: string, maxDepth: number = 3) {
  const visited = new Set<string>();
  const results = [];

  async function crawlPage(url: string, depth: number) {
    if (depth > maxDepth || visited.has(url)) return;

    visited.add(url);

    const result = await agentbase.runAgent({
      message: `Visit ${url} and:
        1. Extract main content
        2. Find all internal links
        3. Return both as JSON with keys "content" and "links"`,
      mode: "base"
    });

    results.push({
      url,
      depth,
      content: result.content
    });

    // Crawl linked pages
    const data = JSON.parse(result.content);
    for (const link of data.links || []) {
      await crawlPage(link, depth + 1);
    }
  }

  await crawlPage(startUrl, 0);
  return results;
}

Performance Considerations

Use Flash Mode

For simple extraction tasks, use Flash mode:
// Fast, cheap extraction
mode: "flash"

Batch Requests

Combine multiple extractions when possible
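For example, several pages can often be covered in one agent run instead of one call per URL (the URLs below are placeholders):
// One agent run covering several pages instead of one call per page
const result = await agentbase.runAgent({
  message: `Visit each of these pages and extract the product name and price:
    - https://example.com/products/a
    - https://example.com/products/b
    - https://example.com/products/c

    Return a single JSON array with one entry per page.`,
  mode: "base"
});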

Cache Aggressively

Cache results to avoid re-scraping

Optimize Selectors

Be specific about what to extract to reduce processing
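For example, naming the exact fields and page regions keeps the agent from processing the entire page (the page layout referenced below is illustrative):
// ✅ Good: Point the agent at specific regions and fields
const result = await agentbase.runAgent({
  message: `On example.com/products, look only at the product cards in the main listing grid.
    For each card return: name, price, and the link in the "View details" button.
    Ignore navigation, footer, and recommended-products sections.`,
  mode: "base"
});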

Troubleshooting

Problem: Agent can’t access the page
Solutions:
  • Check that the URL is correct and accessible
  • Verify the website doesn’t block automated access
  • Try a different user agent
  • Check for CAPTCHA or bot detection
Problem: Agent returns no or incomplete data
Solutions:
  • Be more specific about what to extract
  • Check if the page structure has changed
  • Verify the data is visible on the rendered page (not hidden behind JavaScript interactions)
  • Try again with more detailed instructions (see the retry sketch below)
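One approach is to validate the result and retry once with more explicit instructions; a rough sketch (the parse check and retry prompt are illustrative):
async function extractWithRetry(url: string) {
  const first = await agentbase.runAgent({
    message: `Extract all product names and prices from ${url}. Return as a JSON array.`,
    mode: "base"
  });

  // Treat unparseable or empty output as an incomplete result and retry once
  try {
    const data = JSON.parse(first.content);
    if (Array.isArray(data) && data.length > 0) return data;
  } catch {
    // fall through to the retry below
  }

  const retry = await agentbase.runAgent({
    message: `Visit ${url}, wait for the product list to load, scroll to the bottom of the page,
      then extract every product name and price. Return only a JSON array, no extra text.`,
    mode: "base"
  });

  return JSON.parse(retry.content);
}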
Problem: Website blocks requests
Solutions:
  • Implement delays between requests (see the backoff sketch below)
  • Reduce concurrency
  • Respect robots.txt
  • Contact the website owner about API access
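If a site starts rejecting requests, one option is to retry with exponential backoff before giving up; a rough sketch (the delay values are arbitrary):
async function scrapeWithBackoff(url: string, maxRetries: number = 3) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await agentbase.runAgent({
        message: `Extract data from ${url}`,
        mode: "base"
      });
    } catch (error) {
      if (attempt === maxRetries) throw error;

      // Wait longer after each failed attempt: 2s, 4s, 8s, ...
      const delayMs = 2000 * Math.pow(2, attempt);
      console.warn(`Attempt ${attempt + 1} failed for ${url}, retrying in ${delayMs}ms`);
      await new Promise(resolve => setTimeout(resolve, delayMs));
    }
  }
}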

Integration with Other Primitives

Next Steps