docs: major docs updates (#2897)

docs/tools/web-scraping/overview.mdx (new file, 103 lines)
@@ -0,0 +1,103 @@
---
title: "Overview"
description: "Extract data from websites and automate browser interactions with powerful scraping tools"
icon: "face-smile"
---

These tools enable your agents to interact with the web, extract data from websites, and automate browser-based tasks. From simple web scraping to complex browser automation, these tools cover all your web interaction needs.

## **Available Tools**

<CardGroup cols={2}>
  <Card title="Scrape Website Tool" icon="globe" href="/tools/web-scraping/scrapewebsitetool">
    General-purpose web scraping tool for extracting content from any website.
  </Card>

  <Card title="Scrape Element Tool" icon="crosshairs" href="/tools/web-scraping/scrapeelementfromwebsitetool">
    Target specific elements on web pages with precision scraping capabilities.
  </Card>

  <Card title="Firecrawl Crawl Tool" icon="spider" href="/tools/web-scraping/firecrawlcrawlwebsitetool">
    Crawl entire websites systematically with Firecrawl's powerful engine.
  </Card>

  <Card title="Firecrawl Scrape Tool" icon="fire" href="/tools/web-scraping/firecrawlscrapewebsitetool">
    High-performance web scraping with Firecrawl's advanced capabilities.
  </Card>

  <Card title="Firecrawl Search Tool" icon="magnifying-glass" href="/tools/web-scraping/firecrawlsearchtool">
    Search and extract specific content using Firecrawl's search features.
  </Card>

  <Card title="Selenium Scraping Tool" icon="robot" href="/tools/web-scraping/seleniumscrapingtool">
    Browser automation and scraping with Selenium WebDriver capabilities.
  </Card>

  <Card title="ScrapFly Tool" icon="plane" href="/tools/web-scraping/scrapflyscrapetool">
    Professional web scraping with ScrapFly's premium scraping service.
  </Card>

<Card title="ScrapGraph Tool" icon="network-wired" href="/tools/web-scraping/scrapegraphscrapetool">
|
||||
Graph-based web scraping for complex data relationships.
|
||||
</Card>
|
||||
|
||||
<Card title="Spider Tool" icon="spider" href="/tools/web-scraping/spidertool">
|
||||
Comprehensive web crawling and data extraction capabilities.
|
||||
</Card>
|
||||
|
||||
<Card title="BrowserBase Tool" icon="browser" href="/tools/web-scraping/browserbaseloadtool">
|
||||
Cloud-based browser automation with BrowserBase infrastructure.
|
||||
</Card>
|
||||
|
||||
<Card title="HyperBrowser Tool" icon="window-maximize" href="/tools/web-scraping/hyperbrowserloadtool">
|
||||
Fast browser interactions with HyperBrowser's optimized engine.
|
||||
</Card>
|
||||
|
||||
<Card title="Stagehand Tool" icon="hand" href="/tools/web-scraping/stagehandtool">
|
||||
Intelligent browser automation with natural language commands.
|
||||
</Card>
|
||||
</CardGroup>
|
||||
|
||||
## **Common Use Cases**

- **Data Extraction**: Scrape product information, prices, and reviews
- **Content Monitoring**: Track changes on websites and news sources
- **Lead Generation**: Extract contact information and business data
- **Market Research**: Gather competitive intelligence and market data
- **Testing & QA**: Automate browser testing and validation workflows
- **Social Media**: Extract posts, comments, and social media analytics

## **Quick Start Example**

```python
from crewai import Agent
from crewai_tools import ScrapeWebsiteTool, FirecrawlScrapeWebsiteTool, SeleniumScrapingTool

# Create scraping tools
simple_scraper = ScrapeWebsiteTool()
advanced_scraper = FirecrawlScrapeWebsiteTool()
browser_automation = SeleniumScrapingTool()

# Add the tools to your agent
agent = Agent(
    role="Web Research Specialist",
    goal="Extract and analyze web data efficiently",
    backstory="An experienced researcher who gathers and structures information from the web.",
    tools=[simple_scraper, advanced_scraper, browser_automation],
)
```

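To run the agent end to end, it still needs a task and a crew. The snippet below is a minimal sketch; the task description, expected output, and URL are placeholders rather than part of the tool docs.

```python
from crewai import Crew, Task

# Hypothetical task for the agent defined above
research_task = Task(
    description="Scrape https://example.com and summarize the main content of the page.",
    expected_output="A short summary of the page's key points.",
    agent=agent,
)

crew = Crew(agents=[agent], tasks=[research_task])
result = crew.kickoff()
print(result)
```
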
## **Scraping Best Practices**

- **Respect robots.txt**: Always check and follow website scraping policies
- **Rate Limiting**: Implement delays between requests to avoid overwhelming servers (see the sketch after this list)
- **User Agents**: Use appropriate user agent strings to identify your bot
- **Legal Compliance**: Ensure your scraping activities comply with terms of service
- **Error Handling**: Implement robust error handling for network issues and blocked requests
- **Data Quality**: Validate and clean extracted data before processing

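As a rough illustration of the rate-limiting and error-handling points, the sketch below loops over a few placeholder URLs, pauses between requests, and records failures instead of aborting the batch. It assumes `ScrapeWebsiteTool` accepts a `website_url` argument and exposes a `run()` method, as shown on the tool's own page.

```python
import time

from crewai_tools import ScrapeWebsiteTool

# Placeholder URLs -- replace with the pages you actually need
urls = [
    "https://example.com/page-1",
    "https://example.com/page-2",
]

results = {}
for url in urls:
    tool = ScrapeWebsiteTool(website_url=url)
    try:
        # run() performs the scrape and returns the extracted page text
        results[url] = tool.run()
    except Exception as exc:
        # Record the failure and keep going instead of aborting the batch
        results[url] = f"scrape failed: {exc}"
    # Basic rate limiting: pause between requests
    time.sleep(2)
```
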
## **Tool Selection Guide**

- **Simple Tasks**: Use `ScrapeWebsiteTool` for basic content extraction
- **JavaScript-Heavy Sites**: Use `SeleniumScrapingTool` for dynamic content (see the sketch after this list)
- **Scale & Performance**: Use `FirecrawlScrapeWebsiteTool` for high-volume scraping
- **Cloud Infrastructure**: Use `BrowserBaseLoadTool` for scalable browser automation
- **Complex Workflows**: Use `StagehandTool` for intelligent browser interactions

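As a quick illustration of the first two rules, a static page can go straight to `ScrapeWebsiteTool`, while a JavaScript-rendered page is better handled by `SeleniumScrapingTool`. The URLs and the `css_element` selector below are placeholders; the constructor arguments follow the individual tool pages.

```python
from crewai_tools import ScrapeWebsiteTool, SeleniumScrapingTool

# Static page: plain HTTP scraping is enough
static_scraper = ScrapeWebsiteTool(website_url="https://example.com/docs")

# JavaScript-heavy page: render it in a real browser before extracting content
dynamic_scraper = SeleniumScrapingTool(
    website_url="https://example.com/dashboard",
    css_element=".main-content",
)
```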