mirror of
https://github.com/crewAIInc/crewAI.git
synced 2026-01-07 23:28:30 +00:00
remove full tool, refined tool
@@ -1,55 +0,0 @@
# SpiderFullTool

## Description

This is the full-fledged Spider tool, with every available param exposed to the agent. This can eat up tokens and take a big chunk of your token limit; if that is a problem, check out the `SpiderTool`, which probably has most of the features you are looking for. But if you truly want to experience the full power of Spider...

[Spider](https://spider.cloud/?ref=crewai) is the [fastest](https://github.com/spider-rs/spider/blob/main/benches/BENCHMARKS.md#benchmark-results) open source scraper and crawler that returns LLM-ready data. It converts any website into pure HTML, markdown, metadata or text while enabling you to crawl with custom actions using AI.
## Installation

To use the Spider API you need to install the [Spider SDK](https://pypi.org/project/spider-client/) along with `crewai[tools]`:

```shell
pip install spider-client 'crewai[tools]'
```
## Example

This example shows how you can use the full Spider tool to enable your agent to scrape and crawl websites. The data returned from the Spider API is already LLM-ready, so no extra cleaning is needed.

```python
from crewai_tools import SpiderFullTool

tool = SpiderFullTool()
```
## Arguments

- `api_key` (string, optional): Specifies the Spider API key. If not specified, it looks for `SPIDER_API_KEY` in the environment variables.
- `params` (object, optional): Optional parameters for the request. Defaults to `{"return_format": "markdown"}` to return the website's content in a format that fits LLMs better.
- `request` (string): The request type to perform. Possible values are `http`, `chrome`, and `smart`. Use `smart` to perform an HTTP request by default and fall back to JavaScript rendering when the HTML requires it.
- `limit` (int): The maximum number of pages allowed to crawl per website. Remove the value or set it to `0` to crawl all pages.
- `depth` (int): The crawl limit for maximum depth. If `0`, no limit will be applied.
- `cache` (bool): Use HTTP caching for the crawl to speed up repeated runs. Default is `true`.
- `budget` (object): Object that maps paths to counters for limiting the number of pages crawled, e.g. `{"*": 1}` to crawl only the root page.
- `locale` (string): The locale to use for the request, e.g. `en-US`.
- `cookies` (string): HTTP cookies to add to the request.
- `stealth` (bool): Use stealth mode for headless Chrome requests to help prevent being blocked. The default is `true` on Chrome.
- `headers` (object): HTTP headers to forward with all requests. The object is expected to be a map of key-value pairs.
- `metadata` (bool): Store metadata about the pages and content found. This can help improve AI interop. Defaults to `false` unless you already have the website stored with this configuration enabled.
- `viewport` (object): Configure the viewport for Chrome. Defaults to `800x600`.
- `encoding` (string): The type of encoding to use, e.g. `UTF-8` or `SHIFT_JIS`.
- `subdomains` (bool): Allow subdomains to be included. Default is `false`.
- `user_agent` (string): Add a custom HTTP user agent to the request. By default this is set to a random agent.
- `store_data` (bool): Determines whether storage should be used. If set, this takes precedence over `storageless`. Defaults to `false`.
- `gpt_config` (object): Use AI to generate actions to perform during the crawl. You can pass an array for the `"prompt"` to chain steps.
- `fingerprint` (bool): Use advanced fingerprinting for Chrome.
- `storageless` (bool): Prevents storing any data for the request, including storage and AI vector embeddings. Defaults to `false` unless you already have the website stored.
- `readability` (bool): Use [readability](https://github.com/mozilla/readability) to pre-process the content for reading. This may drastically improve the content for LLM usage.
- `return_format` (string): The format to return the data in. Possible values are `markdown`, `raw`, `text`, and `html2text`. Use `raw` to return the default format of the page, e.g. HTML.
- `proxy_enabled` (bool): Enable high-performance premium proxies for the request to prevent being blocked at the network level.
- `query_selector` (string): The CSS query selector to use when extracting content from the markup.
- `full_resources` (bool): Crawl and download all the resources for a website.
- `request_timeout` (int): The timeout to use for requests, between `5` and `60` seconds. The default is `30` seconds.
- `run_in_background` (bool): Run the request in the background. Useful if storing data and wanting to trigger crawls to the dashboard. This has no effect if `storageless` is set.
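Taken together, here is a minimal sketch of driving the tool with a handful of these arguments. The `SpiderFullParams` import path follows the test module shown later in this commit, and the parameter values are illustrative only, not recommendations:

```python
from crewai_tools import SpiderFullTool
from crewai_tools.tools.spider_full_tool.spider_full_tool import SpiderFullParams

tool = SpiderFullTool(api_key="your_api_key")  # or rely on SPIDER_API_KEY in the environment

# Illustrative values only: smart requests, crawl up to 5 pages, keep metadata, return markdown.
params = SpiderFullParams(
    request="smart",
    limit=5,
    metadata=True,
    return_format="markdown",
)

docs = tool._run(url="https://spider.cloud", params=params, mode="crawl")
print(docs)
```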
@@ -1,85 +0,0 @@
from typing import Optional, Any, Type, Dict, Literal
from pydantic.v1 import BaseModel, Field
from crewai_tools.tools.base_tool import BaseTool
import requests

class SpiderFullParams(BaseModel):
    request: Optional[str] = Field(description="The request type to perform. Possible values are `http`, `chrome`, and `smart`.")
    limit: Optional[int] = Field(description="The maximum number of pages allowed to crawl per website. Remove the value or set it to `0` to crawl all pages.")
    depth: Optional[int] = Field(description="The crawl limit for maximum depth. If `0`, no limit will be applied.")
    cache: Optional[bool] = Field(default=True, description="Use HTTP caching for the crawl to speed up repeated runs.")
    budget: Optional[Dict[str, int]] = Field(description="Object that has paths with a counter for limiting the number of pages, e.g., `{'*':1}` for only crawling the root page.")
    locale: Optional[str] = Field(description="The locale to use for request, e.g., `en-US`.")
    cookies: Optional[str] = Field(description="Add HTTP cookies to use for request.")
    stealth: Optional[bool] = Field(default=True, description="Use stealth mode for headless chrome request to help prevent being blocked. Default is `true` on chrome.")
    headers: Optional[Dict[str, str]] = Field(description="Forward HTTP headers to use for all requests. The object is expected to be a map of key-value pairs.")
    metadata: Optional[bool] = Field(default=False, description="Boolean to store metadata about the pages and content found. Defaults to `false` unless enabled.")
    viewport: Optional[str] = Field(default="800x600", description="Configure the viewport for chrome. Defaults to `800x600`.")
    encoding: Optional[str] = Field(description="The type of encoding to use, e.g., `UTF-8`, `SHIFT_JIS`.")
    subdomains: Optional[bool] = Field(default=False, description="Allow subdomains to be included. Default is `false`.")
    user_agent: Optional[str] = Field(description="Add a custom HTTP user agent to the request. Default is a random agent.")
    store_data: Optional[bool] = Field(default=False, description="Boolean to determine if storage should be used. Defaults to `false`.")
    gpt_config: Optional[Dict[str, Any]] = Field(description="Use AI to generate actions to perform during the crawl. Can pass an array for the `prompt` to chain steps.")
    fingerprint: Optional[bool] = Field(description="Use advanced fingerprinting for chrome.")
    storageless: Optional[bool] = Field(default=False, description="Boolean to prevent storing any data for the request. Defaults to `false`.")
    readability: Optional[bool] = Field(description="Use readability to pre-process the content for reading.")
    return_format: Optional[str] = Field(default="markdown", description="The format to return the data in. Possible values are `markdown`, `raw`, `text`, and `html2text`.")
    proxy_enabled: Optional[bool] = Field(description="Enable high-performance premium proxies to prevent being blocked.")
    query_selector: Optional[str] = Field(description="The CSS query selector to use when extracting content from the markup.")
    full_resources: Optional[bool] = Field(description="Crawl and download all resources for a website.")
    request_timeout: Optional[int] = Field(default=30, description="The timeout for requests. Ranges from `5-60` seconds. Default is `30` seconds.")
    run_in_background: Optional[bool] = Field(description="Run the request in the background. Useful if storing data and triggering crawls to the dashboard.")

class SpiderFullToolSchema(BaseModel):
    url: str = Field(description="Website URL")
    params: Optional[SpiderFullParams] = Field(default=SpiderFullParams(), description="All the params available")
    mode: Optional[Literal["scrape", "crawl"]] = Field(default="scrape", description="Mode, either `scrape` or `crawl` the URL")


class SpiderFullTool(BaseTool):
    name: str = "Spider scrape & crawl tool"
    description: str = "Scrape & Crawl any URL and return LLM-ready data."
    args_schema: Type[BaseModel] = SpiderFullToolSchema
    api_key: Optional[str] = None
    spider: Optional[Any] = None

    def __init__(self, api_key: Optional[str] = None, **kwargs):
        super().__init__(**kwargs)
        try:
            from spider import Spider  # type: ignore
        except ImportError:
            raise ImportError(
                "`spider-client` package not found, please run `pip install spider-client`"
            )

        self.spider = Spider(api_key=api_key)

    def _run(
        self,
        url: str,
        params: Optional[SpiderFullParams] = None,
        mode: Optional[Literal["scrape", "crawl"]] = "scrape"
    ):
        if mode not in ["scrape", "crawl"]:
            raise ValueError(
                "Unknown mode in `mode` parameter, `scrape` or `crawl` are the allowed modes"
            )

        if params is None:
            # Fall back to the default params when none are provided
            print("PARAMS IS NONE")
            params = SpiderFullParams()
        print(params)

        action = self.spider.scrape_url if mode == "scrape" else self.spider.crawl_url
        response = action(url=url, params=params.dict())

        # Debugging: Print the response content
        print(f"Response status code: {response.status_code}")
        print(f"Response content: {response.text}")

        try:
            spider_docs = response.json()
        except requests.exceptions.JSONDecodeError as e:
            print(f"JSONDecodeError: {e}")
            spider_docs = {"error": "Failed to decode JSON response"}

        return spider_docs
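For completeness, a hedged sketch of wiring the full tool into an agent, following the same `Agent`/`Task`/`Crew` pattern as the `SpiderTool` test further down. The `crewai` imports and the role string are assumptions; the task wording is borrowed from that test:

```python
from crewai import Agent, Task, Crew
from crewai_tools import SpiderFullTool

spider_tool = SpiderFullTool()  # assumes SPIDER_API_KEY is set in the environment

searcher = Agent(
    role="Web Research Agent",  # assumed role, not taken from the repository
    goal="Find related information from specific URL's",
    backstory="An expert web researcher that uses the web extremely well",
    tools=[spider_tool],
    verbose=True,
)

summarize_spider = Task(
    description="Summarize the content of spider.cloud",
    expected_output="A summary that goes over what spider does",
    agent=searcher,
)

crew = Crew(agents=[searcher], tasks=[summarize_spider])
crew.kickoff()
```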
@@ -4,8 +4,17 @@ from crewai_tools.tools.base_tool import BaseTool
 
 class SpiderToolSchema(BaseModel):
     url: str = Field(description="Website URL")
-    params: Optional[Dict[str, Any]] = Field(default={"return_format": "markdown"}, description="Set additional params. Leave empty for this to return LLM-ready data")
-    mode: Optional[Literal["scrape", "crawl"]] = Field(defualt="scrape", description="Mode, the only two allowed modes are `scrape` or `crawl` the url")
+    params: Optional[Dict[str, Any]] = Field(
+        description="Set additional params. Options include:\n"
+        "- `limit`: Optional[int] - The maximum number of pages allowed to crawl per website. Remove the value or set it to `0` to crawl all pages.\n"
+        "- `depth`: Optional[int] - The crawl limit for maximum depth. If `0`, no limit will be applied.\n"
+        "- `metadata`: Optional[bool] - Boolean to include metadata or not. Defaults to `False` unless set to `True`. If the user wants metadata, include params.metadata = True.\n"
+        "- `query_selector`: Optional[str] - The CSS query selector to use when extracting content from the markup.\n"
+    )
+    mode: Literal["scrape", "crawl"] = Field(
+        default="scrape",
+        description="Mode, the only two allowed modes are `scrape` or `crawl`. `scrape` will only scrape the one page of the url provided, while `crawl` will crawl the website following all the subpages found."
+    )
 
 class SpiderTool(BaseTool):
     name: str = "Spider scrape & crawl tool"
@@ -28,7 +37,7 @@ class SpiderTool(BaseTool):
     def _run(
         self,
         url: str,
-        params: Optional[Dict[str, any]] = None,
+        params: Optional[Dict[str, Any]] = None,
         mode: Optional[Literal["scrape", "crawl"]] = "scrape"
     ):
         if mode not in ["scrape", "crawl"]:
@@ -36,7 +45,10 @@ class SpiderTool(BaseTool):
                 "Unknown mode in `mode` parameter, `scrape` or `crawl` are the allowed modes"
             )
 
-        if params is None or params == {}:
+        # Ensure 'return_format': 'markdown' is always included
+        if params:
+            params["return_format"] = "markdown"
+        else:
             params = {"return_format": "markdown"}
 
         action = (
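To show what the refined schema expects from the agent, here is a minimal sketch of calling `SpiderTool` directly with the params the new description advertises. The `crewai_tools` import mirrors the `SpiderFullTool` example above, the values are illustrative, and `"return_format": "markdown"` is injected by `_run` whenever it is missing:

```python
from crewai_tools import SpiderTool

tool = SpiderTool()  # assumes SPIDER_API_KEY is set in the environment

# Only options advertised in the schema description are passed here;
# `_run` adds `"return_format": "markdown"` because it is not supplied.
docs = tool._run(
    url="https://spider.cloud",
    params={"limit": 1, "metadata": True, "query_selector": "h1"},
    mode="scrape",
)
print(docs)
```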
@@ -1,30 +0,0 @@
|
||||
from crewai_tools.tools.spider_full_tool.spider_full_tool import SpiderFullTool, SpiderFullParams
|
||||
|
||||
def test_spider_full_tool():
|
||||
spider_tool = SpiderFullTool(api_key="your_api_key")
|
||||
url = "https://spider.cloud"
|
||||
params = SpiderFullParams(
|
||||
request="http",
|
||||
limit=1,
|
||||
depth=1,
|
||||
cache=True,
|
||||
locale="en-US",
|
||||
stealth=True,
|
||||
headers={"User-Agent": "test-agent"},
|
||||
metadata=False,
|
||||
viewport="800x600",
|
||||
encoding="UTF-8",
|
||||
subdomains=False,
|
||||
user_agent="test-agent",
|
||||
store_data=False,
|
||||
proxy_enabled=False,
|
||||
query_selector=None,
|
||||
full_resources=False,
|
||||
request_timeout=30,
|
||||
run_in_background=False
|
||||
)
|
||||
docs = spider_tool._run(url=url, params=params)
|
||||
print(docs)
|
||||
|
||||
if __name__ == "__main__":
|
||||
test_spider_full_tool()
|
||||
@@ -10,22 +10,39 @@ def test_spider_tool():
         goal="Find related information from specific URL's",
         backstory="An expert web researcher that uses the web extremely well",
         tools=[spider_tool],
-        verbose=True
+        verbose=True,
+        cache=False
     )
 
-    summarize_spider = Task(
-        description="Summarize the content of spider.cloud",
-        expected_output="A summary that goes over what spider does",
+    choose_between_scrape_crawl = Task(
+        description="Scrape the page of spider.cloud and return a summary of how fast it is",
+        expected_output="spider.cloud is a fast scraping and crawling tool",
         agent=searcher
     )
 
+    return_metadata = Task(
+        description="Scrape https://spider.cloud with a limit of 1 and enable metadata",
+        expected_output="Metadata and 10 word summary of spider.cloud",
+        agent=searcher
+    )
+
+    css_selector = Task(
+        description="Scrape one page of spider.cloud with the `body > div > main > section.grid.md\:grid-cols-2.gap-10.place-items-center.md\:max-w-screen-xl.mx-auto.pb-8.pt-20 > div:nth-child(1) > h1` CSS selector",
+        expected_output="The content of the element with the css selector body > div > main > section.grid.md\:grid-cols-2.gap-10.place-items-center.md\:max-w-screen-xl.mx-auto.pb-8.pt-20 > div:nth-child(1) > h1",
+        agent=searcher
+    )
+
     crew = Crew(
         agents=[searcher],
-        tasks=[summarize_spider],
+        tasks=[
+            choose_between_scrape_crawl,
+            return_metadata,
+            css_selector
+        ],
         verbose=2
     )
 
     crew.kickoff()
 
 if __name__ == "__main__":
     test_spider_tool()