---
title: Spider Scraper
description: The `SpiderTool` is designed to extract and read the content of a specified website using Spider.
icon: spider-web
mode: "wide"
---

# `SpiderTool`

## Description

[Spider](https://spider.cloud/?ref=crewai) is the [fastest](https://github.com/spider-rs/spider/blob/main/benches/BENCHMARKS.md#benchmark-results)
open-source scraper and crawler, and it returns LLM-ready data.
It converts any website into pure HTML, markdown, metadata or text while enabling you to crawl with custom actions using AI.

## Installation

To use the `SpiderTool`, install the [Spider SDK](https://pypi.org/project/spider-client/)
along with the `crewai[tools]` package:

```shell
pip install spider-client 'crewai[tools]'
```

## Example

This example shows how to use the `SpiderTool` so that your agent can scrape and crawl websites.
The data returned by the Spider API is already LLM-ready, so no additional cleaning is needed.

```python Code
from crewai import Agent, Crew, Task
from crewai_tools import SpiderTool

def main():
    # With no api_key argument, the tool reads SPIDER_API_KEY from the environment.
    spider_tool = SpiderTool()

    # Agent that is allowed to call the Spider tool.
    searcher = Agent(
        role="Web Research Expert",
        goal="Find related information from specific URLs",
        backstory="An expert web researcher that uses the web extremely well",
        tools=[spider_tool],
        verbose=True,
    )

    # Task that asks the agent to scrape a single page and return its metadata.
    return_metadata = Task(
        description="Scrape https://spider.cloud with a limit of 1 and enable metadata",
        expected_output="Metadata and 10 word summary of spider.cloud",
        agent=searcher
    )

    crew = Crew(
        agents=[searcher],
        tasks=[
            return_metadata,
        ],
        verbose=True
    )

    crew.kickoff()

if __name__ == "__main__":
    main()
```
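
The example constructs `SpiderTool()` without an `api_key`, so the key has to come from the `SPIDER_API_KEY` environment variable. The following is a minimal sketch of that lookup order, useful as a pre-flight check before starting a crew. The helper `resolve_spider_api_key` is hypothetical and is not part of `crewai_tools`; it only mirrors the behavior described for the `api_key` argument.

```python
import os
from typing import Optional


def resolve_spider_api_key(api_key: Optional[str] = None) -> str:
    """Hypothetical helper (not a crewai_tools API).

    Returns the explicit key if one is given, otherwise falls back to the
    SPIDER_API_KEY environment variable, and fails early if neither is set.
    """
    key = api_key or os.environ.get("SPIDER_API_KEY")
    if not key:
        raise RuntimeError(
            "No Spider API key found: pass api_key or set SPIDER_API_KEY."
        )
    return key
```

Calling `resolve_spider_api_key()` before `crew.kickoff()` surfaces a missing key as a clear error instead of a failed request in the middle of a run.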

## Arguments

| Argument          | Type     | Description                                                                                                                                           |
|:------------------|:---------|:-----------------------------------------------------------------------------------------------------------------------------------------------------|
| **api_key**         | `string`   | Specifies Spider API key. If not specified, it looks for `SPIDER_API_KEY` in environment variables.                                                   |
| **params**          | `object`   | Optional parameters for the request. Defaults to `{"return_format": "markdown"}` to optimize content for LLMs.                                        |
| **request**         | `string`   | Type of request to perform (`http`, `chrome`, `smart`). `smart` defaults to HTTP, switching to JavaScript rendering if needed.                        |
| **limit**           | `int`      | Max pages to crawl per website. Set to `0` or omit for unlimited.                                                                                     |
| **depth**           | `int`      | Max crawl depth. Set to `0` for no limit.                                                                                                             |
| **cache**           | `bool`     | Enables HTTP caching to speed up repeated runs. Default is `true`.                                                                                    |
| **budget**          | `object`   | Sets path-based limits for crawled pages, e.g., `{"*":1}` for root page only.                                                                          |
| **locale**          | `string`   | Locale for the request, e.g., `en-US`.                                                                                                                |
| **cookies**         | `string`   | HTTP cookies for the request.                                                                                                                         |
| **stealth**         | `bool`     | Enables stealth mode for Chrome requests to avoid detection. Default is `true`.                                                                       |
| **headers**         | `object`   | HTTP headers as a map of key-value pairs for all requests.                                                                                            |
| **metadata**        | `bool`     | Stores metadata about pages and content, aiding AI interoperability. Defaults to `false`.                                                             |
| **viewport**        | `object`   | Sets Chrome viewport dimensions. Default is `800x600`.                                                                                                |
| **encoding**        | `string`   | Specifies encoding type, e.g., `UTF-8`, `SHIFT_JIS`.                                                                                                  |
| **subdomains**      | `bool`     | Includes subdomains in the crawl. Default is `false`.                                                                                                 |
| **user_agent**      | `string`   | Custom HTTP user agent. Defaults to a random agent.                                                                                                   |
| **store_data**      | `bool`     | Enables data storage for the request. Overrides `storageless` when set. Default is `false`.                                                           |
| **gpt_config**      | `object`   | Allows AI to generate crawl actions, with optional chaining steps via an array for `"prompt"`.                                                        |
| **fingerprint**     | `bool`     | Enables advanced fingerprinting for Chrome.                                                                                                           |
| **storageless**     | `bool`     | Prevents all data storage, including AI embeddings. Default is `false`.                                                                               |
| **readability**     | `bool`     | Pre-processes content for reading via [Mozilla’s readability](https://github.com/mozilla/readability). Improves content for LLMs.                      |
| **return_format**   | `string`   | Format to return data: `markdown`, `raw`, `text`, `html2text`. Use `raw` for default page format.                                                     |
| **proxy_enabled**   | `bool`     | Enables high-performance proxies to avoid network-level blocking.                                                                                     |
| **query_selector**  | `string`   | CSS query selector for content extraction from markup.                                                                                                |
| **full_resources**  | `bool`     | Downloads all resources linked to the website.                                                                                                        |
| **request_timeout** | `int`      | Timeout in seconds for requests (5-60). Default is `30`.                                                                                              |
| **run_in_background** | `bool`   | Runs the request in the background, useful for data storage and triggering dashboard crawls. No effect if `storageless` is set.                        |
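
As an illustration of how the options in the table combine, here is a minimal sketch that builds a `params` dictionary on top of the documented default and checks the few value constraints the table states. The helper `build_spider_params` and the two constant names are hypothetical; they are not part of `crewai_tools` or the Spider SDK, and the validation covers only the constraints listed above.

```python
from typing import Any, Dict

# Allowed values copied from the Arguments table.
ALLOWED_REQUEST_TYPES = {"http", "chrome", "smart"}
ALLOWED_RETURN_FORMATS = {"markdown", "raw", "text", "html2text"}


def build_spider_params(**overrides: Any) -> Dict[str, Any]:
    """Hypothetical helper: merge overrides onto the documented default params."""
    # The table documents {"return_format": "markdown"} as the default.
    params: Dict[str, Any] = {"return_format": "markdown"}
    params.update(overrides)

    if params["return_format"] not in ALLOWED_RETURN_FORMATS:
        raise ValueError(f"Unsupported return_format: {params['return_format']!r}")

    request = params.get("request")
    if request is not None and request not in ALLOWED_REQUEST_TYPES:
        raise ValueError(f"Unsupported request type: {request!r}")

    # request_timeout is documented as 5-60 seconds.
    timeout = params.get("request_timeout")
    if timeout is not None and not 5 <= timeout <= 60:
        raise ValueError("request_timeout must be between 5 and 60 seconds")

    return params


# Single page, metadata enabled, smart request mode, default timeout.
params = build_spider_params(
    limit=1, metadata=True, request="smart", request_timeout=30
)
```

How this dictionary is passed to `SpiderTool` depends on the installed `crewai_tools` version, so check the tool's constructor signature before wiring it in.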