---
title: JSON RAG Search
description: The `JSONSearchTool` is designed to search JSON files and return the most relevant results.
icon: file-code
mode: "wide"
---

# `JSONSearchTool`

<Note>
  The JSONSearchTool is currently in an experimental phase. This means the tool
  is under active development, and users might encounter unexpected behavior or
  changes. We highly encourage feedback on any issues or suggestions for
  improvements.
</Note>

## Description

The JSONSearchTool performs semantic search over the contents of JSON files. It uses a RAG (Retrieval-Augmented Generation) search mechanism and lets users specify a JSON path to target a particular JSON file, which narrows the search and improves the relevance of the results.

## Installation

To install the JSONSearchTool, use the following pip command:

```shell
pip install 'crewai[tools]'
```

## Usage Examples

The examples below show the two ways to initialize the JSONSearchTool: for a general search, or scoped to a single JSON file.

```python Code
from crewai_tools import JSONSearchTool

# General JSON content search
# Use this when the JSON path is not fixed up front and is supplied when the tool runs.
tool = JSONSearchTool()

# Restricting search to a specific JSON file
# Use this initialization when you want to limit the search scope to one JSON file.
tool = JSONSearchTool(json_path='./path/to/your/file.json')
```

## Arguments

- `json_path` (str, optional): Path to the JSON file to search. Omit it to initialize the tool for a general search; when provided, the search is confined to the specified file.

## Configuration Options

The JSONSearchTool can be customized through a configuration dictionary, which lets users select different models for embeddings and summarization based on their requirements.

```python Code
from crewai_tools import JSONSearchTool

tool = JSONSearchTool(
    config={
        "llm": {
            "provider": "ollama",  # Other options include google, openai, anthropic, llama2, etc.
            "config": {
                "model": "llama2",
                # Additional optional settings can be specified here, for example:
                # "temperature": 0.5,
                # "top_p": 1,
                # "stream": True,
            },
        },
        "embedding_model": {
            "provider": "google-generativeai",  # or openai, ollama, ...
            "config": {
                "model_name": "gemini-embedding-001",
                "task_type": "RETRIEVAL_DOCUMENT",
                # Further customization options can be added here.
            },
        },
    }
)
```

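Because the configuration is a plain nested dictionary, it can be assembled in code before being passed as `config=`. The following is a minimal sketch of that idea; `build_rag_config` is a hypothetical helper written for this page, not part of `crewai_tools`, and it only reproduces the key layout used in the example above.

```python
from typing import Any, Dict


def build_rag_config(
    llm_provider: str,
    llm_model: str,
    embedder_provider: str,
    embedder_model: str,
    **llm_options: Any,
) -> Dict[str, Any]:
    """Assemble a config dict in the nested shape shown above.

    Hypothetical helper for illustration only.
    """
    return {
        "llm": {
            "provider": llm_provider,
            # Extra keyword arguments (e.g. temperature) land next to "model".
            "config": {"model": llm_model, **llm_options},
        },
        "embedding_model": {
            "provider": embedder_provider,
            "config": {"model_name": embedder_model},
        },
    }


config = build_rag_config(
    "ollama",
    "llama2",
    "google-generativeai",
    "gemini-embedding-001",
    temperature=0.5,
)
```

The resulting `config` could then be passed as `JSONSearchTool(config=config)`.
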
## Security

### Path Validation

File paths provided to this tool are validated against the current working directory. Paths that resolve outside the working directory are rejected with a `ValueError`.

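To make the rule concrete, here is a minimal sketch of this kind of containment check using only the standard library. It is an illustration under assumptions, not the tool's actual implementation: `resolve_within_base` and its `base_dir` parameter are hypothetical names.

```python
import os
from typing import Optional


def resolve_within_base(path: str, base_dir: Optional[str] = None) -> str:
    """Return the resolved path, or raise ValueError if it escapes base_dir.

    Hypothetical helper for illustration; not the crewai_tools API.
    """
    # Resolve symlinks and ".." segments in both the base and the candidate.
    base = os.path.realpath(base_dir if base_dir is not None else os.getcwd())
    candidate = os.path.realpath(os.path.join(base, path))
    # commonpath() compares whole path components, so "/data-other" is not
    # mistaken for a child of "/data".
    if os.path.commonpath([base, candidate]) != base:
        raise ValueError(f"Path {path!r} resolves outside {base!r}")
    return candidate
```

Checking the resolved path rather than the raw string is what catches `../` traversal and symlinks that point outside the base directory.
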
To allow paths outside the working directory (for example, in tests or trusted pipelines), set the following environment variable before starting the process:

```shell
export CREWAI_TOOLS_ALLOW_UNSAFE_PATHS=true
```

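In tests it is often safer to enable the override only for the code that needs it. The sketch below does that with a context manager; `allow_unsafe_paths` is a hypothetical helper, and it assumes the variable is read each time a path is validated rather than once at import time, which should be checked against the installed version.

```python
import os
from contextlib import contextmanager
from typing import Iterator

ENV_NAME = "CREWAI_TOOLS_ALLOW_UNSAFE_PATHS"


@contextmanager
def allow_unsafe_paths() -> Iterator[None]:
    """Temporarily set the override to "true", then restore the old value.

    Hypothetical helper for illustration only.
    """
    previous = os.environ.get(ENV_NAME)
    os.environ[ENV_NAME] = "true"
    try:
        yield
    finally:
        # Put the environment back exactly as it was before the block.
        if previous is None:
            os.environ.pop(ENV_NAME, None)
        else:
            os.environ[ENV_NAME] = previous
```

Wrapping only the relevant calls in `with allow_unsafe_paths():` keeps the rest of the process under the default validation.
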
### URL Validation

URL inputs are validated: `file://` URIs and requests targeting private or reserved IP ranges are blocked to prevent server-side request forgery (SSRF) attacks.

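The following is a minimal sketch of an SSRF check of this kind, written with the standard library only. It is not the tool's implementation: `check_url` and `_is_blocked` are hypothetical names, and the exact set of blocked ranges is an assumption based on the properties exposed by Python's `ipaddress` module.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def _is_blocked(ip) -> bool:
    """True for addresses that should not be reachable from a tool."""
    # Unwrap IPv4-mapped IPv6 (::ffff:a.b.c.d) so IPv4 rules apply to it.
    if isinstance(ip, ipaddress.IPv6Address) and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return (
        ip.is_private or ip.is_loopback or ip.is_link_local
        or ip.is_reserved or ip.is_multicast or ip.is_unspecified
    )


def _addresses_for(host: str) -> list:
    """Return the IP addresses for a host, without DNS for IP literals."""
    try:
        return [ipaddress.ip_address(host)]
    except ValueError:
        pass  # Not an IP literal; fall through to DNS resolution.
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror as exc:
        raise ValueError(f"Cannot resolve host {host!r}") from exc
    # sockaddr[0] is the address string; drop any IPv6 scope suffix.
    return [ipaddress.ip_address(str(info[4][0]).split("%")[0]) for info in infos]


def check_url(url: str) -> str:
    """Return url unchanged, or raise ValueError if it looks unsafe.

    Hypothetical helper for illustration; not the crewai_tools API.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"Scheme {parsed.scheme!r} is not allowed")
    if not parsed.hostname:
        raise ValueError("URL has no host")
    for ip in _addresses_for(parsed.hostname):
        if _is_blocked(ip):
            raise ValueError(f"{parsed.hostname!r} resolves to blocked address {ip}")
    return url
```

With this sketch, `check_url("http://169.254.169.254/latest/meta-data/")` raises a `ValueError` because the address is link-local. A check like this does not by itself defend against DNS rebinding, since the HTTP client resolves the host name again when it makes the request.
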