crewAI/src/crewai_tools/tools/selenium_scraping_tool/README.md
Devin AI c7c8cd0a3c feat: add URL validation and return_html examples
- Add comprehensive URL validation in schema and _create_driver
- Add URL format, length, and character validation
- Add meaningful error messages for validation failures
- Add return_html usage examples in README.md

Co-Authored-By: Joe Moura <joao@crewai.com>
2024-12-28 00:54:49 +00:00

SeleniumScrapingTool

Description

This tool uses Selenium to load a web page and extract its content. Pass a CSS selector to scrape only the matching elements, or omit it to scrape the whole page. It accepts any website URL the user provides, either when the tool is created or later, during execution.

Installation

Install the crewai_tools package through the tools extra of crewai:

pip install 'crewai[tools]'

Example

from crewai_tools import SeleniumScrapingTool

# Example 1: No URL fixed up front; the agent supplies the website URL during execution
tool = SeleniumScrapingTool()

# Example 2: Scrape the entire webpage
tool = SeleniumScrapingTool(website_url='https://example.com')

# Example 3: Scrape a specific CSS element from the webpage
tool = SeleniumScrapingTool(website_url='https://example.com', css_element='.main-content')

# Example 4: Scrape a specific CSS element and set a cookie before scraping
tool = SeleniumScrapingTool(website_url='https://example.com', css_element='.main-content', cookie={'name': 'user', 'value': 'John Doe'})

# Example 5: Scrape content in HTML format
tool = SeleniumScrapingTool(website_url='https://example.com', return_html=True)
result = tool._run()
# Returns HTML content like: ['<div class="content">Hello World</div>', '<div class="footer">Copyright 2024</div>']

# Example 6: Scrape content in text format (default)
tool = SeleniumScrapingTool(website_url='https://example.com', return_html=False)
result = tool._run()
# Returns text content like: ['Hello World', 'Copyright 2024']
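
Examples 5 and 6 return the same elements in two forms. The following standalone sketch only illustrates how the two outputs shown above relate to each other. It uses the standard library's html.parser, and html_to_text is a hypothetical helper, not part of crewai_tools. It does not reflect how the tool produces either form internally.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(fragment: str) -> str:
    """Hypothetical helper: return the text inside one HTML fragment."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.parts).strip()


# Sample values copied from the Example 5 comment above.
html_results = [
    '<div class="content">Hello World</div>',
    '<div class="footer">Copyright 2024</div>',
]
text_results = [html_to_text(fragment) for fragment in html_results]
print(text_results)  # ['Hello World', 'Copyright 2024']
```

Use return_html=True when the markup itself matters (attributes, nested tags); the default text form is usually easier to pass on to an agent.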

Arguments

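  • website_url: Mandatory. The URL of the website to scrape. It can be set when the tool is created (Examples 2 to 6) or left for the agent to supply during execution (Example 1).
  • css_element: Listed as mandatory. The CSS selector for a specific element to scrape from the website. Examples 2, 5, and 6 omit it when creating the tool in order to scrape the entire page.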
  • cookie: Optional. A dictionary containing cookie information. This parameter allows the tool to simulate a session with cookie information, providing access to content that may be restricted to logged-in users.
  • wait_time: Optional. The number of seconds the tool waits after loading the website and after setting a cookie, before scraping the content. This allows for dynamic content to load properly.
  • return_html: Optional. If True, the tool returns HTML content. If False (the default), the tool returns text content.
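
Because website_url is the one argument every scrape depends on, it can help to check it before handing it to the tool. The following is a minimal sketch of such a pre-flight check covering format, length, and characters. The function name validate_url, the MAX_URL_LENGTH limit of 2048, and the error messages are assumptions made for illustration; they are not taken from crewai_tools, whose own validation may differ.

```python
from urllib.parse import urlparse

MAX_URL_LENGTH = 2048  # assumed limit, not taken from crewai_tools


def validate_url(url: str) -> str:
    """Hypothetical pre-flight check; raises ValueError with a readable message."""
    if not isinstance(url, str) or not url.strip():
        raise ValueError("URL must be a non-empty string")
    url = url.strip()
    if len(url) > MAX_URL_LENGTH:
        raise ValueError(f"URL is longer than {MAX_URL_LENGTH} characters")
    # Reject embedded whitespace and control characters.
    if any(ch.isspace() or ord(ch) < 32 for ch in url):
        raise ValueError("URL contains whitespace or control characters")
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError("URL must start with http:// or https://")
    if not parsed.netloc:
        raise ValueError("URL has no host name")
    return url


if __name__ == "__main__":
    for candidate in ["https://example.com", "example.com", "ftp://example.com"]:
        try:
            print("ok:", validate_url(candidate))
        except ValueError as exc:
            print("rejected:", candidate, "->", exc)
```

The cleaned URL returned by validate_url could then be passed as website_url when creating the tool.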