mirror of https://github.com/crewAIInc/crewAI.git synced 2026-01-11 09:08:31 +00:00

Devin AI c7c8cd0a3c feat: add URL validation and return_html examples

- Add comprehensive URL validation in schema and _create_driver
- Add URL format, length, and character validation
- Add meaningful error messages for validation failures
- Add return_html usage examples in README.md

Co-Authored-By: Joe Moura <joao@crewai.com>

2024-12-28 00:54:49 +00:00

2.2 KiB

SeleniumScrapingTool

Description

This tool extracts content from web pages. It supports targeted scraping through a CSS selector that identifies the desired elements, and it works with any website URL the user provides.

Installation

Install the crewai_tools package through the tools extra of crewai:

pip install 'crewai[tools]'

Example

from crewai_tools import SeleniumScrapingTool

# Example 1: Scrape any website it finds during its execution
tool = SeleniumScrapingTool()

# Example 2: Scrape the entire webpage
tool = SeleniumScrapingTool(website_url='https://example.com')

# Example 3: Scrape a specific CSS element from the webpage
tool = SeleniumScrapingTool(website_url='https://example.com', css_element='.main-content')

# Example 4: Scrape using optional parameters for customized scraping
tool = SeleniumScrapingTool(website_url='https://example.com', css_element='.main-content', cookie={'name': 'user', 'value': 'John Doe'})

# Example 5: Scrape content in HTML format
tool = SeleniumScrapingTool(website_url='https://example.com', return_html=True)
result = tool._run()
# Returns HTML content like: ['<div class="content">Hello World</div>', '<div class="footer">Copyright 2024</div>']

# Example 6: Scrape content in text format (default)
tool = SeleniumScrapingTool(website_url='https://example.com', return_html=False)
result = tool._run()
# Returns text content like: ['Hello World', 'Copyright 2024']
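
The sample outputs in Examples 5 and 6 differ only in whether the markup is kept. The sketch below illustrates that relationship with the standard library alone: it strips the tags from the HTML-style results to arrive at the text-style results. It is not the tool's implementation; `html_to_text` and `_TextExtractor` are hypothetical helper names introduced here for illustration.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; tags themselves are ignored.
        self.parts.append(data)


def html_to_text(fragment: str) -> str:
    """Return the text content of an HTML fragment (illustrative only)."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.parts).strip()


# Sample values copied from the Example 5 comment.
html_results = [
    '<div class="content">Hello World</div>',
    '<div class="footer">Copyright 2024</div>',
]

text_results = [html_to_text(fragment) for fragment in html_results]
print(text_results)  # ['Hello World', 'Copyright 2024']
```

Choosing return_html=True is useful when attributes or nested markup matter downstream; the text form is easier to pass straight to a language model.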

Arguments

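website_url: Mandatory. The URL of the website to scrape. It can be set when the tool is created (Examples 2 to 6) or left unset so that the agent supplies it during execution (Example 1).
css_element: The CSS selector for a specific element to scrape from the website. Example 2 omits it to scrape the entire webpage.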
cookie: Optional. A dictionary containing cookie information. This parameter allows the tool to simulate a session with cookie information, providing access to content that may be restricted to logged-in users.
wait_time: Optional. The number of seconds the tool waits after loading the website and after setting a cookie, before scraping the content. This allows for dynamic content to load properly.
return_html: Optional. If True, the tool returns the HTML of the scraped content. If False (the default), the tool returns its text.
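
The change notes for this file mention format, length, and character validation for website_url, but the exact rules are not spelled out here. The following is a minimal sketch of what such checks might look like, written with the standard library only. The function name `validate_url`, the 2048-character limit, and the restriction to http and https are all assumptions made for illustration, not the tool's actual behavior.

```python
from urllib.parse import urlparse

MAX_URL_LENGTH = 2048                # assumed limit, not taken from the tool
ALLOWED_SCHEMES = {"http", "https"}  # assumed set of accepted schemes


def validate_url(url: str) -> str:
    """Hypothetical validator: return the stripped URL or raise ValueError."""
    if not isinstance(url, str) or not url.strip():
        raise ValueError("website_url must be a non-empty string")
    url = url.strip()

    # Length check.
    if len(url) > MAX_URL_LENGTH:
        raise ValueError(f"website_url exceeds {MAX_URL_LENGTH} characters")

    # Character check: reject embedded whitespace and control characters.
    if any(ch.isspace() or ord(ch) < 32 for ch in url):
        raise ValueError("website_url contains whitespace or control characters")

    # Format check: require a supported scheme and a host.
    parsed = urlparse(url)
    if parsed.scheme not in ALLOWED_SCHEMES:
        raise ValueError("website_url must start with http:// or https://")
    if not parsed.netloc:
        raise ValueError("website_url is missing a host name")
    return url
```

A caller could run `validate_url('https://example.com')` before constructing the tool and surface the ValueError message to the user when the check fails.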
