mirror of
https://github.com/crewAIInc/crewAI.git
synced 2026-01-09 16:18:30 +00:00
- Add selenium and webdriver-manager to installation instructions - Add prerequisites and system requirements - Add troubleshooting guidelines - Add basic usage example with error handling - Fixes #2153 Co-Authored-By: Joe Moura <joao@crewai.com>
116 lines
3.9 KiB
Plaintext
---
title: Selenium Scraper
description: The `SeleniumScrapingTool` is designed to extract and read the content of a specified website using Selenium.
icon: clipboard-user
---

# `SeleniumScrapingTool`

<Note>
This tool is currently in development. As we refine its capabilities, users may encounter unexpected behavior.
Your feedback is invaluable to us for making improvements.
</Note>

## Description

The `SeleniumScrapingTool` scrapes web pages through a Selenium-driven browser.
It uses CSS selectors to target specific elements, so you can extract exactly the part of a page you need.
It works with any website URL you provide.

## Prerequisites

- Python 3.7 or higher
- Chrome browser installed (for ChromeDriver)

## Installation

### Option 1: All-in-one installation
```shell
pip install 'crewai[tools]' 'selenium>=4.0.0' 'webdriver-manager>=3.8.0'
```

### Option 2: Step-by-step installation
```shell
pip install 'crewai[tools]'
pip install 'selenium>=4.0.0'
pip install 'webdriver-manager>=3.8.0'
```

### Common Installation Issues

1. If you encounter WebDriver issues, make sure your Chrome browser is up to date.
2. On Linux, you might need to install additional system packages:
```shell
sudo apt-get install chromium-chromedriver
```
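
Before running the tool, it can help to confirm that the packages from the installation commands are actually present. The following is a minimal sketch that uses only the standard library (`importlib.metadata`, so it assumes Python 3.8 or later). The helper names `parse_version`, `check_requirements`, and `find_chrome`, and the list of browser binary names, are assumptions made for illustration and are not part of `crewai_tools`.

```python
import shutil
from importlib import metadata

# Minimum versions taken from the installation commands above.
MINIMUMS = {"selenium": (4, 0, 0), "webdriver-manager": (3, 8, 0)}


def parse_version(text):
    """Turn '4.10.0' into (4, 10, 0); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    # Pad to three components so '4.0' compares equal to (4, 0, 0).
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def check_requirements():
    """Return {package name: status string} for each required package."""
    report = {}
    for name, minimum in MINIMUMS.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = parse_version(installed) >= minimum
        report[name] = f"{installed} ({'ok' if ok else 'too old'})"
    return report


def find_chrome():
    """Return the path of the first Chrome/Chromium binary on PATH, or None."""
    # Hypothetical candidate names; adjust them for your platform.
    for candidate in ("google-chrome", "chromium", "chromium-browser", "chrome"):
        path = shutil.which(candidate)
        if path:
            return path
    return None


if __name__ == "__main__":
    for package, status in check_requirements().items():
        print(f"{package}: {status}")
    print(f"Chrome binary: {find_chrome() or 'not found on PATH'}")
```

A `missing` or `too old` entry points back to the `pip install` commands above. A missing Chrome binary on PATH does not always mean Chrome is absent, since some platforms install it elsewhere.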

## Basic Usage

Here's a simple example with error handling to get you started:

```python
from crewai_tools import SeleniumScrapingTool

tool = None
try:
    # Initialize the tool with a specific website
    tool = SeleniumScrapingTool(website_url='https://example.com')

    # Extract content
    content = tool.run()
    print(content)
except Exception as e:
    print(f"Error during scraping: {e}")
    # Ensure proper cleanup in case of errors.
    # tool is still None if the constructor itself failed.
    if tool is not None:
        tool.cleanup()
```
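
Pages that load slowly or intermittently can make a single scraping attempt fail. One possible way to extend the error handling above is a small retry wrapper. This is a sketch, not part of `crewai_tools`: `run_with_retries` and its `attempts` and `delay` parameters are hypothetical names, and it accepts any zero-argument callable.

```python
import time


def run_with_retries(run, attempts=3, delay=2.0):
    """Call run() until it succeeds or the attempts are exhausted.

    `run` is any zero-argument callable, for example `tool.run`.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return run()
        except Exception as error:  # broad on purpose, like the example above
            last_error = error
            print(f"Attempt {attempt} of {attempts} failed: {error}")
            if attempt < attempts:
                time.sleep(delay)  # pause before the next attempt
    raise RuntimeError(f"Scraping failed after {attempts} attempts") from last_error
```

With the tool from the example above, this could be called as `content = run_with_retries(tool.run)`. The final `RuntimeError` keeps the last underlying exception as its cause, so the original failure stays visible in the traceback.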

## Usage Examples

Below are some scenarios where the `SeleniumScrapingTool` can be used:

```python Code
from crewai_tools import SeleniumScrapingTool

# Example 1:
# Initialize the tool without any parameters to scrape
# the current page it navigates to
tool = SeleniumScrapingTool()

# Example 2:
# Scrape the entire webpage of a given URL
tool = SeleniumScrapingTool(website_url='https://example.com')

# Example 3:
# Target and scrape a specific CSS element from a webpage
tool = SeleniumScrapingTool(
    website_url='https://example.com',
    css_element='.main-content'
)

# Example 4:
# Perform scraping with additional parameters for a customized experience
tool = SeleniumScrapingTool(
    website_url='https://example.com',
    css_element='.main-content',
    cookie={'name': 'user', 'value': 'John Doe'},
    wait_time=10
)
```
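
When the URL, selector, or cookie come from user input or configuration, it may be worth validating them before constructing the tool. The sketch below is one way to do that under stated assumptions: `build_tool_kwargs` is a hypothetical helper, not part of `crewai_tools`, and its checks (absolute http(s) URL, non-empty selector, cookie with `name` and `value` keys as in Example 4, non-negative wait time) are illustrative choices.

```python
from urllib.parse import urlparse


def build_tool_kwargs(website_url, css_element=None, cookie=None, wait_time=None):
    """Validate inputs and return keyword arguments for SeleniumScrapingTool."""
    parsed = urlparse(website_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Not an absolute http(s) URL: {website_url!r}")
    kwargs = {"website_url": website_url}

    if css_element is not None:
        if not css_element.strip():
            raise ValueError("css_element must not be empty")
        kwargs["css_element"] = css_element

    if cookie is not None:
        # Example 4 above uses a cookie with 'name' and 'value' keys.
        missing = {"name", "value"} - set(cookie)
        if missing:
            raise ValueError(f"cookie is missing keys: {sorted(missing)}")
        kwargs["cookie"] = dict(cookie)

    if wait_time is not None:
        if wait_time < 0:
            raise ValueError("wait_time must be non-negative")
        kwargs["wait_time"] = wait_time

    return kwargs
```

The result can then be unpacked into the constructor, for example `SeleniumScrapingTool(**build_tool_kwargs('https://example.com', css_element='.main-content'))`. Arguments left as `None` are omitted, so the tool's own defaults still apply.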

## Arguments

The following parameters can be used to customize the `SeleniumScrapingTool`'s scraping process:

| Argument        | Type     | Description                                                                                                                                     |
|:----------------|:---------|:------------------------------------------------------------------------------------------------------------------------------------------------|
| **website_url** | `string` | **Mandatory**. Specifies the URL of the website from which content is to be scraped.                                                            |
| **css_element** | `string` | **Mandatory**. The CSS selector for a specific element to target on the website, enabling focused scraping of a particular part of a webpage.  |
| **cookie**      | `object` | **Optional**. A dictionary containing cookie information, useful for simulating a logged-in session to access restricted content.              |
| **wait_time**   | `int`    | **Optional**. Specifies the delay (in seconds) before scraping, allowing the website and any dynamic content to fully load.                    |

<Warning>
Since the `SeleniumScrapingTool` is under active development, the parameters and functionality may evolve over time.
Users are encouraged to keep the tool updated and report any issues or suggestions for enhancements.
</Warning>
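
To picture how the four arguments in the table could interact, here is a minimal sketch written against a plain Selenium driver. It is not the tool's actual implementation: `scrape_text` is a hypothetical function, the default `wait_time` of 3 seconds is an assumption, and the reload after setting the cookie is one possible design. The `driver` parameter is expected to behave like a Selenium 4 WebDriver (`get`, `add_cookie`, `find_elements`).

```python
import time


def scrape_text(driver, website_url, css_element=None, cookie=None, wait_time=3):
    """Illustrative only: how the four arguments could drive a Selenium session."""
    driver.get(website_url)
    time.sleep(wait_time)  # give dynamic content time to load

    if cookie is not None:
        # Selenium only accepts cookies for the domain that is currently loaded,
        # so the cookie is set after the first load and the page is fetched again.
        driver.add_cookie(cookie)
        driver.get(website_url)
        time.sleep(wait_time)

    # "css selector" is the string value of Selenium's By.CSS_SELECTOR.
    # Falling back to "body" returns the text of the whole page.
    selector = css_element or "body"
    elements = driver.find_elements("css selector", selector)
    return "\n".join(element.text for element in elements)
```

With Selenium installed, a real driver such as `selenium.webdriver.Chrome()` could be passed as `driver`. Remember to call `driver.quit()` afterwards so the browser process does not linger.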