---
title: PDF RAG Search
description: The `PDFSearchTool` is designed to search PDF files and return the most relevant results.
icon: file-pdf
mode: "wide"
---

# `PDFSearchTool`

<Note>
We are still working on improving tools, so there might be unexpected behavior or changes in the future.
</Note>

## Description

The `PDFSearchTool` is a RAG tool for semantic search within PDF content. Given a search query and a PDF document, it finds the passages most relevant to the query.
This makes it especially useful for quickly extracting specific information from large PDF files.

## Installation

To get started with the `PDFSearchTool`, first make sure the `crewai_tools` package is installed:

```shell
pip install 'crewai[tools]'
```

## Example

Here's how to use the `PDFSearchTool` to search within a PDF document:

```python Code
from crewai_tools import PDFSearchTool

# Initialize the tool without a PDF; the path can then be supplied at run time
tool = PDFSearchTool()

# OR

# Initialize the tool with a specific PDF path to restrict searches to that document
tool = PDFSearchTool(pdf='path/to/your/document.pdf')
```

## Arguments

- `pdf`: **Optional.** Path to the PDF to search. It can be provided at initialization or in the `run` method's arguments. If provided at initialization, the tool confines its search to that document.

## Custom model and embeddings

By default, the tool uses OpenAI for both embeddings and summarization. To customize the model, pass a `config` dictionary as shown in the following example. Note that a vector database is required, because the generated embeddings must be stored in and queried from one.

```python Code
from crewai_tools import PDFSearchTool

# - embedding_model (required): choose provider + provider-specific config
# - vectordb (required): choose vector DB and pass its config

tool = PDFSearchTool(
    config={
        "embedding_model": {
            # Supported providers: "openai", "azure", "google-generativeai", "google-vertex",
            # "voyageai", "cohere", "huggingface", "jina", "sentence-transformer",
            # "text2vec", "ollama", "openclip", "instructor", "onnx", "roboflow", "watsonx", "custom"
            "provider": "openai",  # or: "google-generativeai", "cohere", "ollama", ...
            "config": {
                # Model identifier for the chosen provider. "model" will be auto-mapped to "model_name" internally.
                "model": "text-embedding-3-small",
                # Optional: API key. If omitted, the tool will use provider-specific env vars
                # (e.g., OPENAI_API_KEY or EMBEDDINGS_OPENAI_API_KEY for OpenAI).
                # "api_key": "sk-...",

                # Provider-specific examples:
                # --- Google Generative AI ---
                # (Set provider="google-generativeai" above)
                # "model_name": "gemini-embedding-001",
                # "task_type": "RETRIEVAL_DOCUMENT",
                # "title": "Embeddings",

                # --- Cohere ---
                # (Set provider="cohere" above)
                # "model": "embed-english-v3.0",

                # --- Ollama (local) ---
                # (Set provider="ollama" above)
                # "model": "nomic-embed-text",
            },
        },
        "vectordb": {
            "provider": "chromadb",  # or "qdrant"
            "config": {
                # For ChromaDB: pass "settings" (chromadb.config.Settings) or rely on defaults.
                # Example (uncomment and import):
                # from chromadb.config import Settings
                # "settings": Settings(
                #     persist_directory="/content/chroma",
                #     allow_reset=True,
                #     is_persistent=True,
                # ),

                # For Qdrant: pass "vectors_config" (qdrant_client.models.VectorParams).
                # Example (uncomment and import):
                # from qdrant_client.models import VectorParams, Distance
                # "vectors_config": VectorParams(size=384, distance=Distance.COSINE),

                # Note: collection name is controlled by the tool (default: "rag_tool_collection"), not set here.
            }
        },
    }
)
```

## Security

### Path Validation

File paths provided to this tool are validated against the current working directory. Symlinks and `../` segments are resolved first, and any path that resolves outside the working directory is rejected with a `ValueError`.

To allow paths outside the working directory (for example, in tests or trusted pipelines), set the environment variable:

```shell
CREWAI_TOOLS_ALLOW_UNSAFE_PATHS=true
```
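
The path check and its escape hatch can be pictured with a short standard-library sketch. This is an illustration of the idea, not the code that `crewai_tools` ships: `check_path` is a hypothetical helper, and treating only the value `true` (case-insensitive) as enabling the escape hatch is an assumption made for this example.

```python
import os
from typing import Optional

# Environment variable documented above as the escape hatch.
ALLOW_UNSAFE_ENV = "CREWAI_TOOLS_ALLOW_UNSAFE_PATHS"


def check_path(path: str, base_dir: Optional[str] = None) -> str:
    """Hypothetical helper: return the resolved path, or raise ValueError
    if it falls outside base_dir (defaults to the current working directory)."""
    base = os.path.realpath(base_dir or os.getcwd())
    # realpath() follows symlinks and collapses ".." segments, so the
    # comparison below sees the location the path really points to.
    resolved = os.path.realpath(os.path.join(base, path))

    # Assumption: only the value "true" switches the check off.
    if os.environ.get(ALLOW_UNSAFE_ENV, "").lower() == "true":
        return resolved

    # commonpath() compares whole path components, so "/data-other"
    # is not mistaken for a child of "/data".
    if os.path.commonpath([base, resolved]) != base:
        raise ValueError(f"{path!r} resolves outside {base!r}")
    return resolved
```

With the variable unset, a call such as `check_path("../secrets.pdf")` raises `ValueError`, while a path inside the working directory is returned in resolved form. Comparing resolved paths rather than raw strings is what keeps a symlink inside the working directory from pointing the tool at a file outside it.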

### URL Validation

URL inputs are also validated. `file://` URIs are blocked entirely, and HTTP/HTTPS URLs whose host resolves to a private or reserved IP range are rejected, to prevent server-side request forgery (SSRF) attacks.
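
The following standard-library sketch shows one way such a check can work. It is not the library's implementation: `check_url` and `is_blocked_ip` are hypothetical names, and the exact set of blocked ranges here is whatever Python's `ipaddress` module classifies as private, loopback, link-local, reserved, or unspecified.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_blocked_ip(address: str) -> bool:
    """Hypothetical helper: True for addresses a fetcher should not reach."""
    ip = ipaddress.ip_address(address)
    # Unwrap IPv4-mapped IPv6 (e.g. ::ffff:127.0.0.1) before classifying.
    mapped = getattr(ip, "ipv4_mapped", None)
    if mapped is not None:
        ip = mapped
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_unspecified
    )


def check_url(url: str) -> str:
    """Hypothetical helper: return the URL if it looks safe, else raise ValueError."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        # Rejects file:// along with every other non-HTTP scheme.
        raise ValueError(f"scheme {parsed.scheme!r} is not allowed")
    host = parsed.hostname
    if not host:
        raise ValueError("URL has no host")
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        raise ValueError(f"cannot resolve {host!r}") from exc
    # Every resolved address must be public; one private hit is enough to reject.
    for _family, _type, _proto, _canonname, sockaddr in infos:
        if is_blocked_ip(str(sockaddr[0])):
            raise ValueError(f"{host!r} resolves to a blocked address")
    return url
```

In this sketch, `check_url("http://169.254.169.254/latest/meta-data/")` and `check_url("file:///etc/passwd")` both raise `ValueError`. A check of this kind resolves the host name once, so a caller that resolves it again when actually connecting may see a different address; that gap is outside what the sketch covers.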