mirror of
https://github.com/crewAIInc/crewAI.git
synced 2026-01-09 08:08:32 +00:00
ScrapflyScrapeWebsiteTool
Description
ScrapFly is a web scraping API with headless browser capabilities, proxies, and anti-bot bypass. It extracts web page data as LLM-accessible markdown or plain text.
Setup and Installation
- Install ScrapFly Python SDK: The scrapfly-sdk Python package is required to use the ScrapFly Web Loader. Install it via pip with the following command: pip install scrapfly-sdk
- API Key: Register for free at scrapfly.io/register to obtain your API key.
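Once you have a key, you may prefer not to hard-code it in your scripts. The following is a minimal sketch of one way to do that, assuming the key is stored in an environment variable. The variable name SCRAPFLY_API_KEY and the helper load_scrapfly_api_key are hypothetical conventions chosen for this illustration; they are not part of crewai_tools or the ScrapFly SDK.

```python
import os


def load_scrapfly_api_key(env_var: str = "SCRAPFLY_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    The default variable name is an assumption made for this sketch,
    not something the tool or the SDK requires.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        # Fail early with a clear message instead of sending an empty key
        raise RuntimeError(
            f"Set the {env_var} environment variable to your ScrapFly API key."
        )
    return api_key
```

The returned string can then be passed as the api_key argument when constructing the tool.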
Example Usage
Use the ScrapflyScrapeWebsiteTool as follows to retrieve a web page's data as text, markdown (LLM-accessible), or HTML:
from crewai_tools import ScrapflyScrapeWebsiteTool
tool = ScrapflyScrapeWebsiteTool(
    api_key="Your ScrapFly API key"
)
result = tool._run(
    url="https://web-scraping.dev/products",
    scrape_format="markdown",
    ignore_scrape_failures=True
)
Additional Arguments
The ScrapflyScrapeWebsiteTool also allows passing ScrapeConfig options, as a dictionary, to customize the scrape request. See the ScrapFly API parameters documentation for full details on each feature and its parameters:
from crewai_tools import ScrapflyScrapeWebsiteTool
tool = ScrapflyScrapeWebsiteTool(
    api_key="Your ScrapFly API key"
)
scrapfly_scrape_config = {
    "asp": True,  # Bypass anti-scraping protection and blocking solutions, such as Cloudflare
    "render_js": True,  # Enable JavaScript rendering with a cloud headless browser
    "proxy_pool": "public_residential_pool",  # Select a proxy pool (datacenter or residential)
    "country": "us",  # Select a proxy location
    "auto_scroll": True,  # Auto scroll the page
    "js": ""  # Execute custom JavaScript code in the headless browser
}
result = tool._run(
    url="https://web-scraping.dev/products",
    scrape_format="markdown",
    ignore_scrape_failures=True,
    scrape_config=scrapfly_scrape_config
)
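Because the scrape configuration is a plain dictionary, a misspelled key is easy to miss. The sketch below shows one possible way to catch such typos before making a request. The helper build_scrape_config and the KNOWN_OPTIONS set are hypothetical and exist only for this illustration; the set contains just the option names used in the example above and is not the full list of ScrapFly API parameters.

```python
from typing import Any, Dict

# Assumption: only the option names from the example above, not a complete list
KNOWN_OPTIONS = {"asp", "render_js", "proxy_pool", "country", "auto_scroll", "js"}


def build_scrape_config(**options: Any) -> Dict[str, Any]:
    """Build a scrape config dict, rejecting option names not in KNOWN_OPTIONS."""
    unknown = set(options) - KNOWN_OPTIONS
    if unknown:
        raise ValueError(f"Unrecognized scrape options: {sorted(unknown)}")
    # Omit options explicitly set to None so they are not sent at all
    return {name: value for name, value in options.items() if value is not None}
```

The resulting dictionary can be passed as the scrape_config argument in the same way as scrapfly_scrape_config above. If you rely on parameters beyond those shown here, extend KNOWN_OPTIONS after checking them against the ScrapFly API documentation.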