Overview
The Crawl & Scrape extension enables agents to systematically extract data from websites, navigate multi-page structures, and gather information at scale. Unlike simple web requests, this extension provides intelligent crawling with rate limiting, session management, and structured data extraction.

- Intelligent Crawling: Navigate website structures automatically, following links and pagination
- Data Extraction: Extract structured data from HTML, JSON, and APIs
- Rate Limiting: Respect robots.txt and avoid overwhelming servers
- Session Management: Handle cookies, authentication, and stateful navigation
How It Works
Agents use the built-in web tool with crawl mode to systematically extract data from websites. The flow, sketched in code after the list, is:
1. Target Identification: Agent identifies the target website and data to extract
2. Page Navigation: Navigates to the target page using the browser environment
3. Data Extraction: Extracts relevant data using DOM selectors, regex, or page structure
4. Link Following: Optionally follows links to crawl multiple pages
5. Data Structuring: Organizes extracted data into a structured format
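In code, one pass of this flow looks roughly like the sketch below. It uses plain `fetch` and regex for illustration only; in practice the extension performs navigation through the browser environment and handles extraction for you.

```typescript
// Minimal sketch of one crawl pass using plain fetch and regex (illustrative only).
async function crawlOnce(startUrl: string) {
  // 1. Target identification: the caller supplies the URL and the data to look for.
  // 2. Page navigation: fetch the page (the extension uses the browser environment instead).
  const html = await (await fetch(startUrl)).text();

  // 3. Data extraction: here, just grab the page title as an example field.
  const title = html.match(/<title>([^<]*)<\/title>/i)?.[1] ?? "";

  // 4. Link following: collect candidate links for further crawling.
  const links = [...html.matchAll(/href="([^"]+)"/g)].map((m) => m[1]);

  // 5. Data structuring: return everything in a structured shape.
  return { url: startUrl, title, links };
}
```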
Basic Usage
Simple Page Scraping
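A hand-rolled equivalent in TypeScript might look like the following sketch, which uses `fetch` and `cheerio` for parsing; the URL and the `h2` selector are placeholders, not part of the extension's API.

```typescript
import * as cheerio from "cheerio";

// Scrape a single page and pull out headline text (the selector is a placeholder).
async function scrapePage(url: string): Promise<string[]> {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`Request failed: ${res.status}`);

  const $ = cheerio.load(await res.text());
  // Collect the text of every <h2> on the page as the "extracted data".
  return $("h2")
    .map((_, el) => $(el).text().trim())
    .get();
}

// Usage: scrapePage("https://example.com").then(console.log);
```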
Multi-Page Crawling
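For paginated listings, a simple loop over page numbers is often enough. The sketch below assumes a `?page=N` query parameter, which is an assumption about the target site rather than a general rule.

```typescript
// Follow numbered pagination up to a page limit (the ?page= pattern is an assumption).
async function crawlPages(baseUrl: string, maxPages = 5): Promise<string[]> {
  const results: string[] = [];

  for (let page = 1; page <= maxPages; page++) {
    const res = await fetch(`${baseUrl}?page=${page}`);
    if (!res.ok) break; // stop when a page no longer exists

    const html = await res.text();
    results.push(html); // per-page extraction would happen here

    // Be polite: wait between requests (see "Implement Rate Limiting" below).
    await new Promise((r) => setTimeout(r, 1000));
  }
  return results;
}
```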
Structured Data Extraction
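When the goal is structured records rather than raw HTML, it helps to define the target shape up front. The `Product` interface and the `.product`, `.name`, and `.price` selectors below are illustrative placeholders.

```typescript
import * as cheerio from "cheerio";

// Target shape for the extracted records.
interface Product {
  name: string;
  price: number;
}

// Extract structured records from a listing page (selectors are placeholders).
function extractProducts(html: string): Product[] {
  const $ = cheerio.load(html);
  return $(".product")
    .map((_, el) => ({
      name: $(el).find(".name").text().trim(),
      price: parseFloat($(el).find(".price").text().replace(/[^0-9.]/g, "")),
    }))
    .get();
}
```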
Use Cases
1. Competitive Intelligence

Monitor competitor websites for pricing, features, and updates.

2. Lead Generation

Extract business contact information from directories.

3. Content Aggregation

Aggregate content from multiple sources.

4. Price Monitoring

Track product prices across e-commerce sites (see the sketch below).

5. Job Listings Aggregation

Collect job postings from multiple boards.

6. Market Research
Gather product reviews and ratings.
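As a concrete illustration of the price-monitoring use case above, a sketch like this one compares a freshly scraped price against the last recorded value; the `class="price"` pattern and the notification logic are assumptions made for the example.

```typescript
// Compare the current price on a product page against a previously stored value.
// The price regex and the console "alert" are illustrative only.
async function checkPrice(url: string, lastPrice: number): Promise<number> {
  const html = await (await fetch(url)).text();
  const match = html.match(/class="price"[^>]*>\s*\$?([\d.,]+)/);
  if (!match) throw new Error("Price not found; the page structure may have changed");

  const current = parseFloat(match[1].replace(/,/g, ""));
  if (current !== lastPrice) {
    console.log(`Price changed on ${url}: ${lastPrice} -> ${current}`);
  }
  return current;
}
```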
Best Practices

Respect Robots.txt
Always respect website crawling policies.

Why: Violating robots.txt can get your IP blocked and is considered bad practice.
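A minimal pre-flight check might fetch robots.txt and skip any path matched by a Disallow rule. The sketch below is deliberately simplified (it ignores user-agent groups and wildcards); a real crawler should use a full robots.txt parser.

```typescript
// Minimal robots.txt check before crawling (simplified: applies every Disallow rule).
async function isAllowed(targetUrl: string): Promise<boolean> {
  const { origin, pathname } = new URL(targetUrl);
  const res = await fetch(`${origin}/robots.txt`);
  if (!res.ok) return true; // no robots.txt: assume crawling is allowed

  const disallowed = (await res.text())
    .split("\n")
    .filter((line) => line.toLowerCase().startsWith("disallow:"))
    .map((line) => line.slice("disallow:".length).trim())
    .filter(Boolean);

  return !disallowed.some((prefix) => pathname.startsWith(prefix));
}
```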
Implement Rate Limiting
Avoid overwhelming servers with requests:
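A fixed delay between requests is the simplest form of rate limiting. The 1.5-second default below is an arbitrary illustration; tune it to the target site's tolerance.

```typescript
// Simple fixed-delay rate limiter: at most one request per `delayMs` milliseconds.
async function politeFetch(urls: string[], delayMs = 1500): Promise<string[]> {
  const pages: string[] = [];
  for (const url of urls) {
    pages.push(await (await fetch(url)).text());
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return pages;
}
```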
Handle Errors Gracefully
Expect and handle failures:
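Network errors, timeouts, and transient 5xx responses are normal during a crawl. A small retry helper with exponential backoff, like the sketch below, keeps one bad response from aborting the whole run.

```typescript
// Retry a request a few times with exponential backoff before giving up.
async function fetchWithRetry(url: string, attempts = 3): Promise<string> {
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await fetch(url);
      if (!res.ok) throw new Error(`HTTP ${res.status}`);
      return await res.text();
    } catch (err) {
      if (i === attempts - 1) throw err; // out of retries
      await new Promise((r) => setTimeout(r, 2 ** i * 1000)); // 1s, 2s, 4s, ...
    }
  }
  throw new Error("unreachable");
}
```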
Be Specific About Data Format
Clearly specify desired output format:
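One way to be explicit is to declare the expected output shape and validate every record against it before saving. The `Listing` fields below are placeholders for whatever schema you actually need.

```typescript
// Declare the exact shape you expect back, and validate records against it.
interface Listing {
  title: string;
  url: string;
  postedAt: string; // ISO 8601 date string
}

function isListing(value: unknown): value is Listing {
  const v = value as Partial<Listing> | null;
  return (
    typeof v?.title === "string" &&
    typeof v?.url === "string" &&
    typeof v?.postedAt === "string"
  );
}
```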
Cache Results
Avoid re-scraping unchanged data:
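A small cache keyed by URL with a time-to-live avoids refetching pages that are unlikely to have changed. The in-memory map below is only a sketch; persist the cache if crawls span multiple runs.

```typescript
// In-memory cache with a TTL so unchanged pages are not re-fetched.
const cache = new Map<string, { body: string; fetchedAt: number }>();

async function cachedFetch(url: string, ttlMs = 60 * 60 * 1000): Promise<string> {
  const hit = cache.get(url);
  if (hit && Date.now() - hit.fetchedAt < ttlMs) return hit.body;

  const body = await (await fetch(url)).text();
  cache.set(url, { body, fetchedAt: Date.now() });
  return body;
}
```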
Monitor for Changes
Detect when websites update structure:
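Hashing the fetched page and comparing it with the previously stored hash is a cheap way to notice structural changes. The sketch below uses SHA-256 from Node's crypto module; where you store `lastHash` is up to you.

```typescript
import { createHash } from "node:crypto";

// Hash the page body and compare against the last known hash to detect changes.
async function hasPageChanged(
  url: string,
  lastHash: string | null,
): Promise<{ changed: boolean; hash: string }> {
  const body = await (await fetch(url)).text();
  const hash = createHash("sha256").update(body).digest("hex");
  return { changed: hash !== lastHash, hash };
}
```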
Advanced Patterns
Parallel Scraping
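Requests can be issued in small batches with `Promise.all` to speed up large crawls without flooding the target. The batch size below is an illustrative default.

```typescript
// Scrape a batch of URLs with a small concurrency limit instead of all at once.
async function scrapeInParallel(urls: string[], concurrency = 3): Promise<string[]> {
  const results: string[] = [];
  for (let i = 0; i < urls.length; i += concurrency) {
    const batch = urls.slice(i, i + concurrency);
    const pages = await Promise.all(batch.map(async (u) => (await fetch(u)).text()));
    results.push(...pages);
  }
  return results;
}
```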
Recursive Crawling
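A recursive (breadth-first) crawl needs a visited set and a depth limit to avoid loops and runaway growth. The sketch below only discovers links; extraction of page content would plug in where each page is fetched.

```typescript
// Breadth-first link discovery with a depth limit and a visited set to avoid loops.
// Only absolute links are followed; relative links would need resolving against the base URL.
async function crawl(startUrl: string, maxDepth = 2): Promise<string[]> {
  const visited = new Set<string>([startUrl]);
  let frontier = [startUrl];

  for (let depth = 0; depth < maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      const html = await (await fetch(url)).text();
      for (const match of html.matchAll(/href="(https?:\/\/[^"]+)"/g)) {
        const link = match[1];
        if (!visited.has(link)) {
          visited.add(link);
          next.push(link);
        }
      }
    }
    frontier = next;
  }
  return [...visited];
}
```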
Performance Considerations
- Use Flash Mode: For simple extraction tasks, use Flash mode
- Batch Requests: Combine multiple extractions when possible
- Cache Aggressively: Cache results to avoid re-scraping
- Optimize Selectors: Be specific about what to extract to reduce processing
Troubleshooting
Page Not Loading
Problem: Agent can’t access the page.

Solutions:
- Check if URL is correct and accessible
- Verify website doesn’t block automated access
- Try a different user agent
- Check for CAPTCHA or bot detection
Data Not Extracted
Problem: Agent returns no data or incomplete data.

Solutions:
- Be more specific about what to extract
- Check if page structure has changed
- Verify data is visible (not behind JavaScript)
- Try with more detailed instructions
Rate Limited
Problem: Website blocks requests.

Solutions:
- Implement delays between requests
- Reduce concurrency
- Respect robots.txt
- Contact website owner for API access
Integration with Other Primitives
- Browser: Uses the browser environment for navigation and rendering
- File System: Save scraped data to files
- Sessions: Maintain context across crawls
- Background: Run long crawls asynchronously
- Web Search: Find pages to scrape
- Data Connectors: Store scraped data in databases