Compare commits

..

33 Commits

Author SHA1 Message Date
Devin AI
1bdce4cc76 Update uv.lock after environment fix
Co-Authored-By: João <joao@crewai.com>
2025-07-30 14:38:09 +00:00
Devin AI
22761d74ba Fix #3242: Add reasoning parameter to LLM class to disable reasoning mode
- Add reasoning parameter to LLM.__init__() method
- Implement logic in _prepare_completion_params to handle reasoning=False
- Add comprehensive tests for reasoning parameter functionality
- Ensure reasoning=False overrides reasoning_effort parameter
- Add integration test with Agent class

The fix ensures that when reasoning=False is set, the reasoning_effort
parameter is not included in the LLM completion call, effectively
disabling reasoning mode for models like Qwen and Cogito with Ollama.

Co-Authored-By: João <joao@crewai.com>
2025-07-30 14:37:53 +00:00
Vidit Ostwal
498e8dc6e8 Changed the import error to show missing module files (#2423)
Some checks failed
Notify Downstream / notify-downstream (push) Has been cancelled
* Fix issue #2421: Handle missing google.genai dependency gracefully

Co-Authored-By: Joe Moura <joao@crewai.com>

* Fix import sorting in test file

Co-Authored-By: Joe Moura <joao@crewai.com>

* Fix import sorting with ruff

Co-Authored-By: Joe Moura <joao@crewai.com>

* Removed unwatned test case

* Added dynamic catching for all the embedder function

* Dropped the comment

* Added test case

* Fixed Linting Issue

* Flaky test case in 3.13

* Test Case fixed

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Joe Moura <joao@crewai.com>
Co-authored-by: Lucas Gomide <lucaslg200@gmail.com>
2025-07-30 10:01:17 -04:00
Lorenze Jay
cb522cf500 Enhance Flow class to support custom flow names (#3234)
Some checks failed
Notify Downstream / notify-downstream (push) Has been cancelled
Mark stale issues and pull requests / stale (push) Has been cancelled
- Added an optional `name` attribute to the Flow class for better identification.
- Updated event emissions to utilize the new `name` attribute, ensuring accurate flow naming in events.
- Added tests to verify the correct flow name is set and emitted during flow execution.
2025-07-29 15:41:30 -07:00
Vini Brasil
017acc74f5 Add timezone to event timestamps (#3231)
Some checks failed
Notify Downstream / notify-downstream (push) Has been cancelled
Mark stale issues and pull requests / stale (push) Has been cancelled
Events were lacking timezone information, making them naive datetimes,
which can be ambiguous.
2025-07-28 17:09:06 -03:00
Greyson LaLonde
fab86d197a Refactor: Move RAG components to dedicated top-level module (#3222)
Some checks failed
Notify Downstream / notify-downstream (push) Has been cancelled
Mark stale issues and pull requests / stale (push) Has been cancelled
* Move RAG components to top-level module

- Create src/crewai/rag directory structure
- Move embeddings configurator from utilities to rag module
- Update imports across codebase and documentation
- Remove deprecated embedding files

* Remove empty knowledge/embedder directory
2025-07-25 10:55:31 -04:00
Vidit Ostwal
864e9bfb76 Changed the default value in Mem0 config (#3216)
Some checks failed
Notify Downstream / notify-downstream (push) Has been cancelled
Mark stale issues and pull requests / stale (push) Has been cancelled
* Changed the default value in Mem0 config

* Added regression test for this

* Fixed Linting issues
2025-07-24 13:20:18 -04:00
Lucas Gomide
d3b45d197c fix: remove crewai signup references, replaced by crewai login (#3213) 2025-07-24 07:47:35 -04:00
Manuka Yasas
579153b070 docs: fix incorrect model naming in Google Vertex AI documentation (#3189)
Some checks failed
Notify Downstream / notify-downstream (push) Has been cancelled
Mark stale issues and pull requests / stale (push) Has been cancelled
- Change model format from "gemini/gemini-1.5-pro-latest" to "gemini-1.5-pro-latest"
  in Vertex AI section examples
- Update both English and Portuguese documentation files
- Fixes incorrect provider prefix usage for Vertex AI models
- Ensures consistency with Vertex AI provider requirements

Files changed:
- docs/en/concepts/llms.mdx (line 272)
- docs/pt-BR/concepts/llms.mdx (line 270)

Co-authored-by: Tony Kipkemboi <iamtonykipkemboi@gmail.com>
2025-07-23 16:58:57 -04:00
Lorenze Jay
b1fdcdfa6e chore: update dependencies and version in project files (#3212)
Some checks failed
Notify Downstream / notify-downstream (push) Has been cancelled
- Updated `crewai-tools` dependency from `0.55.0` to `0.58.0` in `pyproject.toml` and `uv.lock`.
- Added new packages `anthropic`, `browserbase`, `playwright`, `pyee`, and `stagehand` with their respective versions in `uv.lock`.
- Bumped the version of the CrewAI library from `0.148.0` to `0.150.0` in `__init__.py`.
- Updated dependency versions in CLI templates for crew, flow, and tool projects to reflect the new CrewAI version.
2025-07-23 11:03:50 -07:00
Mike Plachta
18d76a270c docs: add SerperScrapeWebsiteTool documentation and reorganize SerperDevTool setup instructions (#3211) 2025-07-23 12:12:59 -04:00
Vidit Ostwal
30541239ad Changed Mem0 Storage v1.1 -> v2 (#2893)
* Changed v1.1 -> v2

* Fixed Test Cases:

* Fixed linting issues

* Changed docs

* Refractored the storage

* Fixed test cases

* Fixing run-time checks

* Fixed Test Case

* Updated docs and added test case for custom categories

* Add the TODO back

* Minor Changes

* Added output_format in search

* Minor changes

* Added output_format and version in both search and save

* Small change

* Minor bugs

* Fixed test cases

* Changed docs

---------

Co-authored-by: Lucas Gomide <lucaslg200@gmail.com>
2025-07-23 08:30:52 -04:00
Tony Kipkemboi
9a65573955 Feature/update docs (#3205)
Some checks failed
Notify Downstream / notify-downstream (push) Has been cancelled
Mark stale issues and pull requests / stale (push) Has been cancelled
* docs: add create_directory parameter

* docs: remove string guardrails to focus on function guardrails

* docs: remove get help from docs.json

* docs: update pt-BR docs.json changes
2025-07-22 13:55:27 -04:00
Lucas Gomide
27623a1d01 feat: remove duplicate print on LLM call error (#3183)
Some checks failed
Notify Downstream / notify-downstream (push) Has been cancelled
Mark stale issues and pull requests / stale (push) Has been cancelled
By improving litellm handler error / outputs

Co-authored-by: Lorenze Jay <63378463+lorenzejay@users.noreply.github.com>
2025-07-21 22:08:07 -04:00
João Moura
2593242234 Adding Support to adhoc tool calling using the internal LLM class (#3195)
Some checks failed
Notify Downstream / notify-downstream (push) Has been cancelled
* Adding Support to adhoc tool calling using the internal LLM class

* fix type
2025-07-21 19:36:48 -03:00
Greyson LaLonde
2ab6c31544 chore: add deprecation notices to UserMemory (#3201)
- Mark UserMemory and UserMemoryItem for removal in v0.156.0 or 2025-08-04
- Update all references with deprecation warnings
- Users should migrate to ExternalMemory
2025-07-21 15:26:34 -04:00
Lucas Gomide
3c55c8a22a fix: append user message when last message is from assistent when using Ollama models (#3200)
Some checks failed
Notify Downstream / notify-downstream (push) Has been cancelled
Ollama doesn't supports last message to be 'assistant'
We can drop this commit after merging https://github.com/BerriAI/litellm/pull/10917
2025-07-21 13:30:40 -04:00
Ranuga Disansa
424433ff58 docs: Add Tavily Search & Extractor tools to Search-Research suite (#3146)
* docs: Add Tavily Search and Extractor tools documentation

* docs: Add Tavily Search and Extractor tools to the documentation

---------

Co-authored-by: Tony Kipkemboi <iamtonykipkemboi@gmail.com>
2025-07-21 12:01:29 -04:00
Lucas Gomide
2fd99503ed build: upgrade LiteLLM to 1.74.3 (#3199) 2025-07-21 09:58:47 -04:00
Vidit Ostwal
942014962e fixed save method, changed the test cases (#3187)
Some checks failed
Notify Downstream / notify-downstream (push) Has been cancelled
Mark stale issues and pull requests / stale (push) Has been cancelled
* fixed save method, changed the test cases

* Linting fixed
2025-07-18 15:10:26 -04:00
Lucas Gomide
2ab79a7dd5 feat: drop unsupported stop parameter for LLM models automatically (#3184) 2025-07-18 13:54:28 -04:00
Lucas Gomide
27c449c9c4 test: remove workaround related to SQLite without FTS5 (#3179)
Some checks failed
Notify Downstream / notify-downstream (push) Has been cancelled
For more details check out [here](actions/runner-images#12576)
2025-07-18 09:37:15 -04:00
Vini Brasil
9737333ffd Use file lock around Chroma client initialization (#3181)
Some checks failed
Notify Downstream / notify-downstream (push) Has been cancelled
Mark stale issues and pull requests / stale (push) Has been cancelled
This commit fixes a bug with concurrent processess and Chroma where
`table collections already exists` (and similar) were raised.

https://cookbook.chromadb.dev/core/system_constraints/
2025-07-17 11:50:45 -03:00
Lucas Gomide
bf248d5118 docs: fix neatlogs documentation (#3171)
Some checks failed
Notify Downstream / notify-downstream (push) Has been cancelled
Mark stale issues and pull requests / stale (push) Has been cancelled
2025-07-16 21:18:04 -04:00
Lorenze Jay
2490e8cd46 Update CrewAI version to 0.148.0 in project templates and dependencies (#3172)
Some checks failed
Notify Downstream / notify-downstream (push) Has been cancelled
* Update CrewAI version to 0.148.0 in project templates and dependencies

* Update crewai-tools dependency to version 0.55.0 in pyproject.toml and uv.lock for improved functionality and performance.
2025-07-16 12:36:43 -07:00
Lucas Gomide
9b67e5a15f Emit events about Agent eval (#3168)
* feat: emit events abou Agent Eval

We are triggering events when an evaluation has started/completed/failed

* style: fix type checking issues
2025-07-16 13:18:59 -04:00
Lucas Gomide
6ebb6c9b63 Supporting eval single Agent/LiteAgent (#3167)
Some checks failed
Notify Downstream / notify-downstream (push) Has been cancelled
Mark stale issues and pull requests / stale (push) Has been cancelled
* refactor: rely on task completion event to evaluate agents

* feat: remove Crew dependency to evaluate agent

* feat: drop execution_context in AgentEvaluator

* chore: drop experimental Agent Eval feature from stable crew.test

* feat: support eval LiteAgent

* resolve linter issues
2025-07-15 09:22:41 -04:00
Lucas Gomide
53f674be60 chore: remove evaluation folder (#3159)
This folder was moved to `experimental` folder
2025-07-15 08:30:20 -04:00
Paras Sakarwal
11717a5213 docs: added integration with neatlogs (#3138)
Some checks failed
Notify Downstream / notify-downstream (push) Has been cancelled
Mark stale issues and pull requests / stale (push) Has been cancelled
2025-07-14 11:08:24 -04:00
Lucas Gomide
b6d699f764 Implement thread-safe AgentEvaluator (#3157)
Some checks failed
Notify Downstream / notify-downstream (push) Has been cancelled
* refactor: implement thread-safe AgentEvaluator with hybrid state management

* chore: remove useless comments
2025-07-14 10:05:42 -04:00
Lucas Gomide
5b15061b87 test: add test helper to assert Agent Experiments (#3156) 2025-07-14 09:24:49 -04:00
Lucas Gomide
1b6b2b36d9 Introduce Evaluator Experiment (#3133)
* feat: add exchanged messages in LLMCallCompletedEvent

* feat: add GoalAlignment metric for Agent evaluation

* feat: add SemanticQuality metric for Agent evaluation

* feat: add Tool Metrics for Agent evaluation

* feat: add Reasoning Metrics for Agent evaluation, still in progress

* feat: add AgentEvaluator class

This class will evaluate Agent' results and report to user

* fix: do not evaluate Agent by default

This is a experimental feature we still need refine it further

* test: add Agent eval tests

* fix: render all feedback per iteration

* style: resolve linter issues

* style: fix mypy issues

* fix: allow messages be empty on LLMCallCompletedEvent

* feat: add Experiment evaluation framework with baseline comparison

* fix: reset evaluator for each experiement iteraction

* fix: fix track of new test cases

* chore: split Experimental evaluation classes

* refactor: remove unused method

* refactor: isolate Console print in a dedicated class

* fix: make crew required to run an experiment

* fix: use time-aware to define experiment result

* test: add tests for Evaluator Experiment

* style: fix linter issues

* fix: encode string before hashing

* style: resolve linter issues

* feat: add experimental folder for beta features (#3141)

* test: move tests to experimental folder
2025-07-14 09:06:45 -04:00
devin-ai-integration[bot]
3ada4053bd Fix #3149: Add missing create_directory parameter to Task class (#3150)
* Fix #3149: Add missing create_directory parameter to Task class

- Add create_directory field with default value True for backward compatibility
- Update _save_file method to respect create_directory parameter
- Add comprehensive tests covering all scenarios
- Maintain existing behavior when create_directory=True (default)

The create_directory parameter was documented but missing from implementation.
Users can now control directory creation behavior:
- create_directory=True (default): Creates directories if they don't exist
- create_directory=False: Raises RuntimeError if directory doesn't exist

Fixes issue where users got TypeError when trying to use the documented
create_directory parameter.

Co-Authored-By: Jo\u00E3o <joao@crewai.com>

* Fix lint: Remove unused import os from test_create_directory_true

- Removes F401 lint error: 'os' imported but unused
- All lint checks should now pass

Co-Authored-By: Jo\u00E3o <joao@crewai.com>

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Jo\u00E3o <joao@crewai.com>
2025-07-14 08:15:41 -04:00
97 changed files with 7583 additions and 4378 deletions

View File

@@ -37,25 +37,9 @@ jobs:
- name: Install the project
run: uv sync --dev --all-extras
- name: Install SQLite with FTS5 support
run: |
# WORKAROUND: GitHub Actions' Ubuntu runner uses SQLite without FTS5 support compiled in.
# This is a temporary fix until the runner includes SQLite with FTS5 or Python's sqlite3
# module is compiled with FTS5 support by default.
# TODO: Remove this workaround once GitHub Actions runners include SQLite FTS5 support
# Install pysqlite3-binary which has FTS5 support
uv pip install pysqlite3-binary
# Create a sitecustomize.py to override sqlite3 with pysqlite3
mkdir -p .pytest_sqlite_override
echo "import sys; import pysqlite3; sys.modules['sqlite3'] = pysqlite3" > .pytest_sqlite_override/sitecustomize.py
# Test FTS5 availability
PYTHONPATH=.pytest_sqlite_override uv run python -c "import sqlite3; print(f'SQLite version: {sqlite3.sqlite_version}')"
PYTHONPATH=.pytest_sqlite_override uv run python -c "import sqlite3; conn = sqlite3.connect(':memory:'); conn.execute('CREATE VIRTUAL TABLE test USING fts5(content)'); print('FTS5 module available')"
- name: Run tests (group ${{ matrix.group }} of 8)
run: |
PYTHONPATH=.pytest_sqlite_override uv run pytest \
uv run pytest \
--block-network \
--timeout=30 \
-vv \

3
.gitignore vendored
View File

@@ -26,4 +26,5 @@ test_flow.html
crewairules.mdc
plan.md
conceptual_plan.md
build_image
build_image
chromadb-*.lock

View File

@@ -9,12 +9,7 @@
},
"favicon": "/images/favicon.svg",
"contextual": {
"options": [
"copy",
"view",
"chatgpt",
"claude"
]
"options": ["copy", "view", "chatgpt", "claude"]
},
"navigation": {
"languages": [
@@ -37,11 +32,6 @@
"href": "https://chatgpt.com/g/g-qqTuUWsBY-crewai-assistant",
"icon": "robot"
},
{
"anchor": "Get Help",
"href": "mailto:support@crewai.com",
"icon": "headset"
},
{
"anchor": "Releases",
"href": "https://github.com/crewAIInc/crewAI/releases",
@@ -55,32 +45,22 @@
"groups": [
{
"group": "Get Started",
"pages": [
"en/introduction",
"en/installation",
"en/quickstart"
]
"pages": ["en/introduction", "en/installation", "en/quickstart"]
},
{
"group": "Guides",
"pages": [
{
"group": "Strategy",
"pages": [
"en/guides/concepts/evaluating-use-cases"
]
"pages": ["en/guides/concepts/evaluating-use-cases"]
},
{
"group": "Agents",
"pages": [
"en/guides/agents/crafting-effective-agents"
]
"pages": ["en/guides/agents/crafting-effective-agents"]
},
{
"group": "Crews",
"pages": [
"en/guides/crews/first-crew"
]
"pages": ["en/guides/crews/first-crew"]
},
{
"group": "Flows",
@@ -94,7 +74,6 @@
"pages": [
"en/guides/advanced/customizing-prompts",
"en/guides/advanced/fingerprinting"
]
}
]
@@ -182,7 +161,9 @@
"en/tools/search-research/websitesearchtool",
"en/tools/search-research/codedocssearchtool",
"en/tools/search-research/youtubechannelsearchtool",
"en/tools/search-research/youtubevideosearchtool"
"en/tools/search-research/youtubevideosearchtool",
"en/tools/search-research/tavilysearchtool",
"en/tools/search-research/tavilyextractortool"
]
},
{
@@ -241,6 +222,7 @@
"en/observability/langtrace",
"en/observability/maxim",
"en/observability/mlflow",
"en/observability/neatlogs",
"en/observability/openlit",
"en/observability/opik",
"en/observability/patronus-evaluation",
@@ -274,9 +256,7 @@
},
{
"group": "Telemetry",
"pages": [
"en/telemetry"
]
"pages": ["en/telemetry"]
}
]
},
@@ -285,9 +265,7 @@
"groups": [
{
"group": "Getting Started",
"pages": [
"en/enterprise/introduction"
]
"pages": ["en/enterprise/introduction"]
},
{
"group": "Features",
@@ -342,9 +320,7 @@
},
{
"group": "Resources",
"pages": [
"en/enterprise/resources/frequently-asked-questions"
]
"pages": ["en/enterprise/resources/frequently-asked-questions"]
}
]
},
@@ -353,9 +329,7 @@
"groups": [
{
"group": "Getting Started",
"pages": [
"en/api-reference/introduction"
]
"pages": ["en/api-reference/introduction"]
},
{
"group": "Endpoints",
@@ -365,16 +339,13 @@
},
{
"tab": "Examples",
"groups": [
"groups": [
{
"group": "Examples",
"pages": [
"en/examples/example"
]
"pages": ["en/examples/example"]
}
]
}
]
},
{
@@ -396,11 +367,6 @@
"href": "https://chatgpt.com/g/g-qqTuUWsBY-crewai-assistant",
"icon": "robot"
},
{
"anchor": "Obter Ajuda",
"href": "mailto:support@crewai.com",
"icon": "headset"
},
{
"anchor": "Lançamentos",
"href": "https://github.com/crewAIInc/crewAI/releases",
@@ -425,21 +391,15 @@
"pages": [
{
"group": "Estratégia",
"pages": [
"pt-BR/guides/concepts/evaluating-use-cases"
]
"pages": ["pt-BR/guides/concepts/evaluating-use-cases"]
},
{
"group": "Agentes",
"pages": [
"pt-BR/guides/agents/crafting-effective-agents"
]
"pages": ["pt-BR/guides/agents/crafting-effective-agents"]
},
{
"group": "Crews",
"pages": [
"pt-BR/guides/crews/first-crew"
]
"pages": ["pt-BR/guides/crews/first-crew"]
},
{
"group": "Flows",
@@ -632,9 +592,7 @@
},
{
"group": "Telemetria",
"pages": [
"pt-BR/telemetry"
]
"pages": ["pt-BR/telemetry"]
}
]
},
@@ -643,9 +601,7 @@
"groups": [
{
"group": "Começando",
"pages": [
"pt-BR/enterprise/introduction"
]
"pages": ["pt-BR/enterprise/introduction"]
},
{
"group": "Funcionalidades",
@@ -710,9 +666,7 @@
"groups": [
{
"group": "Começando",
"pages": [
"pt-BR/api-reference/introduction"
]
"pages": ["pt-BR/api-reference/introduction"]
},
{
"group": "Endpoints",
@@ -722,16 +676,13 @@
},
{
"tab": "Exemplos",
"groups": [
"groups": [
{
"group": "Exemplos",
"pages": [
"pt-BR/examples/example"
]
"pages": ["pt-BR/examples/example"]
}
]
}
]
}
]

View File

@@ -270,7 +270,7 @@ In this section, you'll find detailed examples that help you select, configure,
from crewai import LLM
llm = LLM(
model="gemini/gemini-1.5-pro-latest",
model="gemini-1.5-pro-latest", # or vertex_ai/gemini-1.5-pro-latest
temperature=0.7,
vertex_credentials=vertex_credentials_json
)

View File

@@ -623,7 +623,7 @@ for provider in providers_to_test:
**Model not found errors:**
```python
# Verify model availability
from crewai.utilities.embedding_configurator import EmbeddingConfigurator
from crewai.rag.embeddings.configurator import EmbeddingConfigurator
configurator = EmbeddingConfigurator()
try:
@@ -712,7 +712,7 @@ crew = Crew(
memory_config={
"provider": "mem0",
"config": {"user_id": "john"},
"user_memory": {} # Required - triggers user memory initialization
"user_memory": {} # DEPRECATED: Will be removed in version 0.156.0 or on 2025-08-04, use external_memory instead
},
process=Process.sequential,
verbose=True
@@ -720,7 +720,16 @@ crew = Crew(
```
### Advanced Mem0 Configuration
When using Mem0 Client, you can customize the memory configuration further, by using parameters like 'includes', 'excludes', 'custom_categories', 'infer' and 'run_id' (this is only for short-term memory).
You can find more details in the [Mem0 documentation](https://docs.mem0.ai/).
```python
new_categories = [
{"lifestyle_management_concerns": "Tracks daily routines, habits, hobbies and interests including cooking, time management and work-life balance"},
{"seeking_structure": "Documents goals around creating routines, schedules, and organized systems in various life areas"},
{"personal_information": "Basic information about the user including name, preferences, and personality traits"}
]
crew = Crew(
agents=[...],
tasks=[...],
@@ -732,6 +741,11 @@ crew = Crew(
"org_id": "my_org_id", # Optional
"project_id": "my_project_id", # Optional
"api_key": "custom-api-key" # Optional - overrides env var
"run_id": "my_run_id", # Optional - for short-term memory
"includes": "include1", # Optional
"excludes": "exclude1", # Optional
"infer": True # Optional defaults to True
"custom_categories": new_categories # Optional - custom categories for user memory
},
"user_memory": {}
}
@@ -761,7 +775,8 @@ crew = Crew(
"provider": "openai",
"config": {"api_key": "your-api-key", "model": "text-embedding-3-small"}
}
}
},
"infer": True # Optional defaults to True
},
"user_memory": {}
}

View File

@@ -54,10 +54,11 @@ crew = Crew(
| **Markdown** _(optional)_ | `markdown` | `Optional[bool]` | Whether the task should instruct the agent to return the final answer formatted in Markdown. Defaults to False. |
| **Config** _(optional)_ | `config` | `Optional[Dict[str, Any]]` | Task-specific configuration parameters. |
| **Output File** _(optional)_ | `output_file` | `Optional[str]` | File path for storing the task output. |
| **Create Directory** _(optional)_ | `create_directory` | `Optional[bool]` | Whether to create the directory for output_file if it doesn't exist. Defaults to True. |
| **Output JSON** _(optional)_ | `output_json` | `Optional[Type[BaseModel]]` | A Pydantic model to structure the JSON output. |
| **Output Pydantic** _(optional)_ | `output_pydantic` | `Optional[Type[BaseModel]]` | A Pydantic model for task output. |
| **Callback** _(optional)_ | `callback` | `Optional[Any]` | Function/object to be executed after task completion. |
| **Guardrail** _(optional)_ | `guardrail` | `Optional[Union[Callable, str]]` | Function or string description to validate task output before proceeding to next task. |
| **Guardrail** _(optional)_ | `guardrail` | `Optional[Callable]` | Function to validate task output before proceeding to next task. |
## Creating Tasks
@@ -87,7 +88,6 @@ research_task:
expected_output: >
A list with 10 bullet points of the most relevant information about {topic}
agent: researcher
guardrail: ensure each bullet contains a minimum of 100 words
reporting_task:
description: >
@@ -334,9 +334,7 @@ Task guardrails provide a way to validate and transform task outputs before they
are passed to the next task. This feature helps ensure data quality and provides
feedback to agents when their output doesn't meet specific criteria.
**Guardrails can be defined in two ways:**
1. **Function-based guardrails**: Python functions that implement custom validation logic
2. **String-based guardrails**: Natural language descriptions that are automatically converted to LLM-powered validation
Guardrails are implemented as Python functions that contain custom validation logic, giving you complete control over the validation process and ensuring reliable, deterministic results.
### Function-Based Guardrails
@@ -378,82 +376,7 @@ blog_task = Task(
- On success: it returns a tuple of `(bool, Any)`. For example: `(True, validated_result)`
- On Failure: it returns a tuple of `(bool, str)`. For example: `(False, "Error message explain the failure")`
### String-Based Guardrails
String-based guardrails allow you to describe validation criteria in natural language. When you provide a string instead of a function, CrewAI automatically converts it to an `LLMGuardrail` that uses an AI agent to validate the task output.
#### Using String Guardrails in Python
```python Code
from crewai import Task
# Simple string-based guardrail
blog_task = Task(
description="Write a blog post about AI",
expected_output="A blog post under 200 words",
agent=blog_agent,
guardrail="Ensure the blog post is under 200 words and includes practical examples"
)
# More complex validation criteria
research_task = Task(
description="Research AI trends for 2025",
expected_output="A comprehensive research report",
agent=research_agent,
guardrail="Ensure each finding includes a credible source and is backed by recent data from 2024-2025"
)
```
#### Using String Guardrails in YAML
```yaml
research_task:
description: Research the latest AI developments
expected_output: A list of 10 bullet points about AI
agent: researcher
guardrail: ensure each bullet contains a minimum of 100 words
validation_task:
description: Validate the research findings
expected_output: A validation report
agent: validator
guardrail: confirm all sources are from reputable publications and published within the last 2 years
```
#### How String Guardrails Work
When you provide a string guardrail, CrewAI automatically:
1. Creates an `LLMGuardrail` instance using the string as validation criteria
2. Uses the task's agent LLM to power the validation
3. Creates a temporary validation agent that checks the output against your criteria
4. Returns detailed feedback if validation fails
This approach is ideal when you want to use natural language to describe validation rules without writing custom validation functions.
### LLMGuardrail Class
The `LLMGuardrail` class is the underlying mechanism that powers string-based guardrails. You can also use it directly for more advanced control:
```python Code
from crewai import Task
from crewai.tasks.llm_guardrail import LLMGuardrail
from crewai.llm import LLM
# Create a custom LLMGuardrail with specific LLM
custom_guardrail = LLMGuardrail(
description="Ensure the response contains exactly 5 bullet points with proper citations",
llm=LLM(model="gpt-4o-mini")
)
task = Task(
description="Research AI safety measures",
expected_output="A detailed analysis with bullet points",
agent=research_agent,
guardrail=custom_guardrail
)
```
**Note**: When you use a string guardrail, CrewAI automatically creates an `LLMGuardrail` instance using your task's agent LLM. Using `LLMGuardrail` directly gives you more control over the validation process and LLM selection.
### Error Handling Best Practices
@@ -881,21 +804,87 @@ These validations help in maintaining the consistency and reliability of task ex
## Creating Directories when Saving Files
You can now specify if a task should create directories when saving its output to a file. This is particularly useful for organizing outputs and ensuring that file paths are correctly structured.
The `create_directory` parameter controls whether CrewAI should automatically create directories when saving task outputs to files. This feature is particularly useful for organizing outputs and ensuring that file paths are correctly structured, especially when working with complex project hierarchies.
### Default Behavior
By default, `create_directory=True`, which means CrewAI will automatically create any missing directories in the output file path:
```python Code
# ...
save_output_task = Task(
description='Save the summarized AI news to a file',
expected_output='File saved successfully',
agent=research_agent,
tools=[file_save_tool],
output_file='outputs/ai_news_summary.txt',
create_directory=True
# Default behavior - directories are created automatically
report_task = Task(
description='Generate a comprehensive market analysis report',
expected_output='A detailed market analysis with charts and insights',
agent=analyst_agent,
output_file='reports/2025/market_analysis.md', # Creates 'reports/2025/' if it doesn't exist
markdown=True
)
```
#...
### Disabling Directory Creation
If you want to prevent automatic directory creation and ensure that the directory already exists, set `create_directory=False`:
```python Code
# Strict mode - directory must already exist
strict_output_task = Task(
description='Save critical data that requires existing infrastructure',
expected_output='Data saved to pre-configured location',
agent=data_agent,
output_file='secure/vault/critical_data.json',
create_directory=False # Will raise RuntimeError if 'secure/vault/' doesn't exist
)
```
### YAML Configuration
You can also configure this behavior in your YAML task definitions:
```yaml tasks.yaml
analysis_task:
description: >
Generate quarterly financial analysis
expected_output: >
A comprehensive financial report with quarterly insights
agent: financial_analyst
output_file: reports/quarterly/q4_2024_analysis.pdf
create_directory: true # Automatically create 'reports/quarterly/' directory
audit_task:
description: >
Perform compliance audit and save to existing audit directory
expected_output: >
A compliance audit report
agent: auditor
output_file: audit/compliance_report.md
create_directory: false # Directory must already exist
```
### Use Cases
**Automatic Directory Creation (`create_directory=True`):**
- Development and prototyping environments
- Dynamic report generation with date-based folders
- Automated workflows where directory structure may vary
- Multi-tenant applications with user-specific folders
**Manual Directory Management (`create_directory=False`):**
- Production environments with strict file system controls
- Security-sensitive applications where directories must be pre-configured
- Systems with specific permission requirements
- Compliance environments where directory creation is audited
### Error Handling
When `create_directory=False` and the directory doesn't exist, CrewAI will raise a `RuntimeError`:
```python Code
try:
result = crew.kickoff()
except RuntimeError as e:
# Handle missing directory error
print(f"Directory creation failed: {e}")
# Create directory manually or use fallback location
```
Check out the video below to see how to use structured outputs in CrewAI:

View File

@@ -0,0 +1,134 @@
---
title: Neatlogs Integration
description: Understand, debug, and share your CrewAI agent runs
icon: magnifying-glass-chart
---
# Introduction
Neatlogs helps you **see what your agent did**, **why**, and **share it**.
It captures every step: thoughts, tool calls, responses, evaluations. No raw logs. Just clear, structured traces. Great for debugging and collaboration.
## Why use Neatlogs?
CrewAI agents use multiple tools and reasoning steps. When something goes wrong, you need context — not just errors.
Neatlogs lets you:
- Follow the full decision path
- Add feedback directly on steps
- Chat with the trace using AI assistant
- Share runs publicly for feedback
- Turn insights into tasks
All in one place.
Manage your traces effortlessly
![Traces](/images/neatlogs-1.png)
![Trace Response](/images/neatlogs-2.png)
The best UX to view a CrewAI trace. Post comments anywhere you want. Use AI to debug.
![Trace Details](/images/neatlogs-3.png)
![Ai Chat Bot With A Trace](/images/neatlogs-4.png)
![Comments Drawer](/images/neatlogs-5.png)
## Core Features
- **Trace Viewer**: Track thoughts, tools, and decisions in sequence
- **Inline Comments**: Tag teammates on any trace step
- **Feedback & Evaluation**: Mark outputs as correct or incorrect
- **Error Highlighting**: Automatic flagging of API/tool failures
- **Task Conversion**: Convert comments into assigned tasks
- **Ask the Trace (AI)**: Chat with your trace using Neatlogs AI bot
- **Public Sharing**: Publish trace links to your community
## Quick Setup with CrewAI
<Steps>
<Step title="Sign Up & Get API Key">
Visit [neatlogs.com](https://neatlogs.com/?utm_source=crewAI-docs), create a project, copy the API key.
</Step>
<Step title="Install SDK">
```bash
pip install neatlogs
```
(Latest version 0.8.0, Python 3.8+; MIT license)
</Step>
<Step title="Initialize Neatlogs">
Before starting Crew agents, add:
```python
import neatlogs
neatlogs.init("YOUR_PROJECT_API_KEY")
```
Agents run as usual. Neatlogs captures everything automatically.
</Step>
</Steps>
## Under the Hood
According to GitHub, Neatlogs:
- Captures thoughts, tool calls, responses, errors, and token stats
- Supports AI-powered task generation and robust evaluation workflows
All with just two lines of code.
## Watch It Work
### 🔍 Full Demo (4min)
<iframe
width="100%"
height="315"
src="https://www.youtube.com/embed/8KDme9T2I7Q?si=b8oHteaBwFNs_Duk"
title="YouTube video player"
frameBorder="0"
allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
allowFullScreen
></iframe>
### ⚙️ CrewAI Integration (30s)
<iframe
className="w-full aspect-video rounded-xl"
src="https://www.loom.com/embed/9c78b552af43452bb3e4783cb8d91230?sid=e9d7d370-a91a-49b0-809e-2f375d9e801d"
title="Loom video player"
frameBorder="0"
allowFullScreen
></iframe>
## Links & Support
- 📘 [Neatlogs Docs](https://docs.neatlogs.com/)
- 🔐 [Dashboard & API Key](https://app.neatlogs.com/)
- 🐦 [Follow on Twitter](https://twitter.com/neatlogs)
- 📧 Contact: hello@neatlogs.com
- 🛠 [GitHub SDK](https://github.com/NeatLogs/neatlogs)
## TL;DR
With just:
```bash
pip install neatlogs
import neatlogs
neatlogs.init("YOUR_API_KEY")
You can now capture, understand, share, and act on your CrewAI agent runs in seconds.
No setup overhead. Full trace transparency. Full team collaboration.
```

View File

@@ -44,6 +44,14 @@ These tools enable your agents to search the web, research topics, and find info
<Card title="YouTube Video Search" icon="play" href="/en/tools/search-research/youtubevideosearchtool">
Find and analyze YouTube videos by topic, keyword, or criteria.
</Card>
<Card title="Tavily Search Tool" icon="magnifying-glass" href="/en/tools/search-research/tavilysearchtool">
Comprehensive web search using Tavily's AI-powered search API.
</Card>
<Card title="Tavily Extractor Tool" icon="file-text" href="/en/tools/search-research/tavilyextractortool">
Extract structured content from web pages using the Tavily API.
</Card>
</CardGroup>
## **Common Use Cases**
@@ -55,17 +63,19 @@ These tools enable your agents to search the web, research topics, and find info
- **Academic Research**: Find scholarly articles and technical papers
```python
from crewai_tools import SerperDevTool, GitHubSearchTool, YoutubeVideoSearchTool
from crewai_tools import SerperDevTool, GitHubSearchTool, YoutubeVideoSearchTool, TavilySearchTool, TavilyExtractorTool
# Create research tools
web_search = SerperDevTool()
code_search = GitHubSearchTool()
video_research = YoutubeVideoSearchTool()
tavily_search = TavilySearchTool()
content_extractor = TavilyExtractorTool()
# Add to your agent
agent = Agent(
role="Research Analyst",
tools=[web_search, code_search, video_research],
tools=[web_search, code_search, video_research, tavily_search, content_extractor],
goal="Gather comprehensive information on any topic"
)
```

View File

@@ -6,10 +6,6 @@ icon: google
# `SerperDevTool`
<Note>
We are still working on improving tools, so there might be unexpected behavior or changes in the future.
</Note>
## Description
This tool is designed to perform a semantic search for a specified query from a text's content across the internet. It utilizes the [serper.dev](https://serper.dev) API
@@ -17,6 +13,12 @@ to fetch and display the most relevant search results based on the query provide
## Installation
To effectively use the `SerperDevTool`, follow these steps:
1. **Package Installation**: Confirm that the `crewai[tools]` package is installed in your Python environment.
2. **API Key Acquisition**: Acquire a `serper.dev` API key by registering for a free account at `serper.dev`.
3. **Environment Configuration**: Store your obtained API key in an environment variable named `SERPER_API_KEY` to facilitate its use by the tool.
To incorporate this tool into your project, follow the installation instructions below:
```shell
@@ -34,14 +36,6 @@ from crewai_tools import SerperDevTool
tool = SerperDevTool()
```
## Steps to Get Started
To effectively use the `SerperDevTool`, follow these steps:
1. **Package Installation**: Confirm that the `crewai[tools]` package is installed in your Python environment.
2. **API Key Acquisition**: Acquire a `serper.dev` API key by registering for a free account at `serper.dev`.
3. **Environment Configuration**: Store your obtained API key in an environment variable named `SERPER_API_KEY` to facilitate its use by the tool.
## Parameters
The `SerperDevTool` comes with several parameters that will be passed to the API :

View File

@@ -0,0 +1,139 @@
---
title: "Tavily Extractor Tool"
description: "Extract structured content from web pages using the Tavily API"
icon: "file-text"
---
The `TavilyExtractorTool` allows CrewAI agents to extract structured content from web pages using the Tavily API. It can process single URLs or lists of URLs and provides options for controlling the extraction depth and including images.
## Installation
To use the `TavilyExtractorTool`, you need to install the `tavily-python` library:
```shell
pip install 'crewai[tools]' tavily-python
```
You also need to set your Tavily API key as an environment variable:
```bash
export TAVILY_API_KEY='your-tavily-api-key'
```
## Example Usage
Here's how to initialize and use the `TavilyExtractorTool` within a CrewAI agent:
```python
import os
from crewai import Agent, Task, Crew
from crewai_tools import TavilyExtractorTool
# Ensure TAVILY_API_KEY is set in your environment
# os.environ["TAVILY_API_KEY"] = "YOUR_API_KEY"
# Initialize the tool
tavily_tool = TavilyExtractorTool()
# Create an agent that uses the tool
extractor_agent = Agent(
role='Web Content Extractor',
goal='Extract key information from specified web pages',
backstory='You are an expert at extracting relevant content from websites using the Tavily API.',
tools=[tavily_tool],
verbose=True
)
# Define a task for the agent
extract_task = Task(
description='Extract the main content from the URL https://example.com using basic extraction depth.',
expected_output='A JSON string containing the extracted content from the URL.',
agent=extractor_agent
)
# Create and run the crew
crew = Crew(
agents=[extractor_agent],
tasks=[extract_task],
verbose=2
)
result = crew.kickoff()
print(result)
```
## Configuration Options
The `TavilyExtractorTool` accepts the following arguments:
- `urls` (Union[List[str], str]): **Required**. A single URL string or a list of URL strings to extract data from.
- `include_images` (Optional[bool]): Whether to include images in the extraction results. Defaults to `False`.
- `extract_depth` (Literal["basic", "advanced"]): The depth of extraction. Use `"basic"` for faster, surface-level extraction or `"advanced"` for more comprehensive extraction. Defaults to `"basic"`.
- `timeout` (int): The maximum time in seconds to wait for the extraction request to complete. Defaults to `60`.
## Advanced Usage
### Multiple URLs with Advanced Extraction
```python
# Example with multiple URLs and advanced extraction
multi_extract_task = Task(
description='Extract content from https://example.com and https://anotherexample.org using advanced extraction.',
expected_output='A JSON string containing the extracted content from both URLs.',
agent=extractor_agent
)
# Configure the tool with custom parameters
custom_extractor = TavilyExtractorTool(
extract_depth='advanced',
include_images=True,
timeout=120
)
agent_with_custom_tool = Agent(
role="Advanced Content Extractor",
goal="Extract comprehensive content with images",
tools=[custom_extractor]
)
```
### Tool Parameters
You can customize the tool's behavior by setting parameters during initialization:
```python
# Initialize with custom configuration
extractor_tool = TavilyExtractorTool(
extract_depth='advanced', # More comprehensive extraction
include_images=True, # Include image results
timeout=90 # Custom timeout
)
```
## Features
- **Single or Multiple URLs**: Extract content from one URL or process multiple URLs in a single request
- **Configurable Depth**: Choose between basic (fast) and advanced (comprehensive) extraction modes
- **Image Support**: Optionally include images in the extraction results
- **Structured Output**: Returns well-formatted JSON containing the extracted content
- **Error Handling**: Robust handling of network timeouts and extraction errors
## Response Format
The tool returns a JSON string representing the structured data extracted from the provided URL(s). The exact structure depends on the content of the pages and the `extract_depth` used.
Common response elements include:
- **Title**: The page title
- **Content**: Main text content of the page
- **Images**: Image URLs and metadata (when `include_images=True`)
- **Metadata**: Additional page information like author, description, etc.
## Use Cases
- **Content Analysis**: Extract and analyze content from competitor websites
- **Research**: Gather structured data from multiple sources for analysis
- **Content Migration**: Extract content from existing websites for migration
- **Monitoring**: Regular extraction of content for change detection
- **Data Collection**: Systematic extraction of information from web sources
Refer to the [Tavily API documentation](https://docs.tavily.com/docs/tavily-api/python-sdk#extract) for detailed information about the response structure and available options.

View File

@@ -0,0 +1,122 @@
---
title: "Tavily Search Tool"
description: "Perform comprehensive web searches using the Tavily Search API"
icon: "magnifying-glass"
---
The `TavilySearchTool` provides an interface to the Tavily Search API, enabling CrewAI agents to perform comprehensive web searches. It allows for specifying search depth, topics, time ranges, included/excluded domains, and whether to include direct answers, raw content, or images in the results.
## Installation
To use the `TavilySearchTool`, you need to install the `tavily-python` library:
```shell
pip install 'crewai[tools]' tavily-python
```
## Environment Variables
Ensure your Tavily API key is set as an environment variable:
```bash
export TAVILY_API_KEY='your_tavily_api_key'
```
## Example Usage
Here's how to initialize and use the `TavilySearchTool` within a CrewAI agent:
```python
import os
from crewai import Agent, Task, Crew
from crewai_tools import TavilySearchTool
# Ensure the TAVILY_API_KEY environment variable is set
# os.environ["TAVILY_API_KEY"] = "YOUR_TAVILY_API_KEY"
# Initialize the tool
tavily_tool = TavilySearchTool()
# Create an agent that uses the tool
researcher = Agent(
role='Market Researcher',
goal='Find information about the latest AI trends',
backstory='An expert market researcher specializing in technology.',
tools=[tavily_tool],
verbose=True
)
# Create a task for the agent
research_task = Task(
description='Search for the top 3 AI trends in 2024.',
expected_output='A JSON report summarizing the top 3 AI trends found.',
agent=researcher
)
# Form the crew and kick it off
crew = Crew(
agents=[researcher],
tasks=[research_task],
verbose=2
)
result = crew.kickoff()
print(result)
```
## Configuration Options
The `TavilySearchTool` accepts the following arguments during initialization or when calling the `run` method:
- `query` (str): **Required**. The search query string.
- `search_depth` (Literal["basic", "advanced"], optional): The depth of the search. Defaults to `"basic"`.
- `topic` (Literal["general", "news", "finance"], optional): The topic to focus the search on. Defaults to `"general"`.
- `time_range` (Literal["day", "week", "month", "year"], optional): The time range for the search. Defaults to `None`.
- `days` (int, optional): The number of days to search back. Relevant if `time_range` is not set. Defaults to `7`.
- `max_results` (int, optional): The maximum number of search results to return. Defaults to `5`.
- `include_domains` (Sequence[str], optional): A list of domains to prioritize in the search. Defaults to `None`.
- `exclude_domains` (Sequence[str], optional): A list of domains to exclude from the search. Defaults to `None`.
- `include_answer` (Union[bool, Literal["basic", "advanced"]], optional): Whether to include a direct answer synthesized from the search results. Defaults to `False`.
- `include_raw_content` (bool, optional): Whether to include the raw HTML content of the searched pages. Defaults to `False`.
- `include_images` (bool, optional): Whether to include image results. Defaults to `False`.
- `timeout` (int, optional): The request timeout in seconds. Defaults to `60`.
## Advanced Usage
You can configure the tool with custom parameters:
```python
# Example: Initialize with specific parameters
custom_tavily_tool = TavilySearchTool(
search_depth='advanced',
max_results=10,
include_answer=True
)
# The agent will use these defaults
agent_with_custom_tool = Agent(
role="Advanced Researcher",
goal="Conduct detailed research with comprehensive results",
tools=[custom_tavily_tool]
)
```
## Features
- **Comprehensive Search**: Access to Tavily's powerful search index
- **Configurable Depth**: Choose between basic and advanced search modes
- **Topic Filtering**: Focus searches on general, news, or finance topics
- **Time Range Control**: Limit results to specific time periods
- **Domain Control**: Include or exclude specific domains
- **Direct Answers**: Get synthesized answers from search results
- **Content Filtering**: Prevent context window issues with automatic content truncation
## Response Format
The tool returns search results as a JSON string containing:
- Search results with titles, URLs, and content snippets
- Optional direct answers to queries
- Optional image results
- Optional raw HTML content (when enabled)
Content for each result is automatically truncated to prevent context window issues while maintaining the most relevant information.

View File

@@ -0,0 +1,100 @@
---
title: Serper Scrape Website
description: The `SerperScrapeWebsiteTool` is designed to scrape websites and extract clean, readable content using Serper's scraping API.
icon: globe
---
# `SerperScrapeWebsiteTool`
## Description
This tool is designed to scrape website content and extract clean, readable text from any website URL. It utilizes the [serper.dev](https://serper.dev) scraping API to fetch and process web pages, optionally including markdown formatting for better structure and readability.
## Installation
To effectively use the `SerperScrapeWebsiteTool`, follow these steps:
1. **Package Installation**: Confirm that the `crewai[tools]` package is installed in your Python environment.
2. **API Key Acquisition**: Acquire a `serper.dev` API key by registering for an account at `serper.dev`.
3. **Environment Configuration**: Store your obtained API key in an environment variable named `SERPER_API_KEY` to facilitate its use by the tool.
To incorporate this tool into your project, follow the installation instructions below:
```shell
pip install 'crewai[tools]'
```
## Example
The following example demonstrates how to initialize the tool and scrape a website:
```python Code
from crewai_tools import SerperScrapeWebsiteTool
# Initialize the tool for website scraping capabilities
tool = SerperScrapeWebsiteTool()
# Scrape a website with markdown formatting
result = tool.run(url="https://example.com", include_markdown=True)
```
## Arguments
The `SerperScrapeWebsiteTool` accepts the following arguments:
- **url**: Required. The URL of the website to scrape.
- **include_markdown**: Optional. Whether to include markdown formatting in the scraped content. Defaults to `True`.
## Example with Parameters
Here is an example demonstrating how to use the tool with different parameters:
```python Code
from crewai_tools import SerperScrapeWebsiteTool
tool = SerperScrapeWebsiteTool()
# Scrape with markdown formatting (default)
markdown_result = tool.run(
url="https://docs.crewai.com",
include_markdown=True
)
# Scrape without markdown formatting for plain text
plain_result = tool.run(
url="https://docs.crewai.com",
include_markdown=False
)
print("Markdown formatted content:")
print(markdown_result)
print("\nPlain text content:")
print(plain_result)
```
## Use Cases
The `SerperScrapeWebsiteTool` is particularly useful for:
- **Content Analysis**: Extract and analyze website content for research purposes
- **Data Collection**: Gather structured information from web pages
- **Documentation Processing**: Convert web-based documentation into readable formats
- **Competitive Analysis**: Scrape competitor websites for market research
- **Content Migration**: Extract content from existing websites for migration purposes
## Error Handling
The tool includes comprehensive error handling for:
- **Network Issues**: Handles connection timeouts and network errors gracefully
- **API Errors**: Provides detailed error messages for API-related issues
- **Invalid URLs**: Validates and reports issues with malformed URLs
- **Authentication**: Clear error messages for missing or invalid API keys
## Security Considerations
- Always store your `SERPER_API_KEY` in environment variables, never hardcode it in your source code
- Be mindful of rate limits imposed by the Serper API
- Respect robots.txt and website terms of service when scraping content
- Consider implementing delays between requests for large-scale scraping operations

BIN
docs/images/neatlogs-1.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 222 KiB

BIN
docs/images/neatlogs-2.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 329 KiB

BIN
docs/images/neatlogs-3.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 590 KiB

BIN
docs/images/neatlogs-4.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 216 KiB

BIN
docs/images/neatlogs-5.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 277 KiB

View File

@@ -84,8 +84,8 @@ filename = "seu_modelo.pkl"
try:
SuaCrew().crew().train(
n_iterations=n_iterations,
inputs=inputs,
n_iterations=n_iterations,
inputs=inputs,
filename=filename
)
except Exception as e:
@@ -103,7 +103,7 @@ crewai replay [OPTIONS]
- `-t, --task_id TEXT`: Reexecuta o crew a partir deste task ID, incluindo todas as tarefas subsequentes
Exemplo:
```shell Terminal
```shell Terminal
crewai replay -t task_123456
```
@@ -149,7 +149,7 @@ crewai test [OPTIONS]
- `-m, --model TEXT`: Modelo LLM para executar os testes no Crew (padrão: "gpt-4o-mini")
Exemplo:
```shell Terminal
```shell Terminal
crewai test -n 5 -m gpt-3.5-turbo
```
@@ -203,10 +203,7 @@ def crew(self) -> Crew:
Implemente o crew ou flow no [CrewAI Enterprise](https://app.crewai.com).
- **Autenticação**: Você precisa estar autenticado para implementar no CrewAI Enterprise.
```shell Terminal
crewai signup
```
Caso já tenha uma conta, você pode fazer login com:
Você pode fazer login ou criar uma conta com:
```shell Terminal
crewai login
```
@@ -253,7 +250,7 @@ Você deve estar autenticado no CrewAI Enterprise para usar estes comandos de ge
- **Implantar o Crew**: Depois de autenticado, você pode implantar seu crew ou flow no CrewAI Enterprise.
```shell Terminal
crewai deploy push
```
```
- Inicia o processo de deployment na plataforma CrewAI Enterprise.
- Após a iniciação bem-sucedida, será exibida a mensagem Deployment created successfully! juntamente com o Nome do Deployment e um Deployment ID (UUID) único.
@@ -326,4 +323,4 @@ Ao escolher um provedor, o CLI solicitará que você informe o nome da chave e a
Veja o seguinte link para o nome de chave de cada provedor:
* [LiteLLM Providers](https://docs.litellm.ai/docs/providers)
* [LiteLLM Providers](https://docs.litellm.ai/docs/providers)

View File

@@ -268,7 +268,7 @@ Nesta seção, você encontrará exemplos detalhados que ajudam a selecionar, co
from crewai import LLM
llm = LLM(
model="gemini/gemini-1.5-pro-latest",
model="gemini-1.5-pro-latest", # or vertex_ai/gemini-1.5-pro-latest
temperature=0.7,
vertex_credentials=vertex_credentials_json
)

View File

@@ -623,7 +623,7 @@ for provider in providers_to_test:
**Erros de modelo não encontrado:**
```python
# Verifique disponibilidade do modelo
from crewai.utilities.embedding_configurator import EmbeddingConfigurator
from crewai.rag.embeddings.configurator import EmbeddingConfigurator
configurator = EmbeddingConfigurator()
try:

View File

@@ -54,10 +54,11 @@ crew = Crew(
| **Markdown** _(opcional)_ | `markdown` | `Optional[bool]` | Se a tarefa deve instruir o agente a retornar a resposta final formatada em Markdown. O padrão é False. |
| **Config** _(opcional)_ | `config` | `Optional[Dict[str, Any]]` | Parâmetros de configuração específicos da tarefa. |
| **Arquivo de Saída** _(opcional)_| `output_file` | `Optional[str]` | Caminho do arquivo para armazenar a saída da tarefa. |
| **Criar Diretório** _(opcional)_ | `create_directory` | `Optional[bool]` | Se deve criar o diretório para output_file caso não exista. O padrão é True. |
| **Saída JSON** _(opcional)_ | `output_json` | `Optional[Type[BaseModel]]` | Um modelo Pydantic para estruturar a saída em JSON. |
| **Output Pydantic** _(opcional)_ | `output_pydantic` | `Optional[Type[BaseModel]]` | Um modelo Pydantic para a saída da tarefa. |
| **Callback** _(opcional)_ | `callback` | `Optional[Any]` | Função/objeto a ser executado após a conclusão da tarefa. |
| **Guardrail** _(opcional)_ | `guardrail` | `Optional[Union[Callable, str]]` | Função ou descrição em string para validar a saída da tarefa antes de prosseguir para a próxima tarefa. |
| **Guardrail** _(opcional)_ | `guardrail` | `Optional[Callable]` | Função para validar a saída da tarefa antes de prosseguir para a próxima tarefa. |
## Criando Tarefas
@@ -87,7 +88,6 @@ research_task:
expected_output: >
Uma lista com 10 tópicos em bullet points das informações mais relevantes sobre {topic}
agent: researcher
guardrail: garanta que cada bullet point contenha no mínimo 100 palavras
reporting_task:
description: >
@@ -332,9 +332,7 @@ analysis_task = Task(
Guardrails (trilhas de proteção) de tarefas fornecem uma maneira de validar e transformar as saídas das tarefas antes que elas sejam passadas para a próxima tarefa. Esse recurso assegura a qualidade dos dados e oferece feedback aos agentes quando sua saída não atende a critérios específicos.
**Guardrails podem ser definidos de duas maneiras:**
1. **Guardrails baseados em função**: Funções Python que implementam lógica de validação customizada
2. **Guardrails baseados em string**: Descrições em linguagem natural que são automaticamente convertidas em validação baseada em LLM
Guardrails são implementados como funções Python que contêm lógica de validação customizada, proporcionando controle total sobre o processo de validação e garantindo resultados confiáveis e determinísticos.
### Guardrails Baseados em Função
@@ -376,82 +374,7 @@ blog_task = Task(
- Em caso de sucesso: retorna uma tupla `(True, resultado_validado)`
- Em caso de falha: retorna uma tupla `(False, "mensagem de erro explicando a falha")`
### Guardrails Baseados em String
Guardrails baseados em string permitem que você descreva critérios de validação em linguagem natural. Quando você fornece uma string em vez de uma função, o CrewAI automaticamente a converte em um `LLMGuardrail` que usa um agente de IA para validar a saída da tarefa.
#### Usando Guardrails de String em Python
```python Code
from crewai import Task
# Guardrail simples baseado em string
blog_task = Task(
description="Escreva um post de blog sobre IA",
expected_output="Um post de blog com menos de 200 palavras",
agent=blog_agent,
guardrail="Garanta que o post do blog tenha menos de 200 palavras e inclua exemplos práticos"
)
# Critérios de validação mais complexos
research_task = Task(
description="Pesquise tendências de IA para 2025",
expected_output="Um relatório abrangente de pesquisa",
agent=research_agent,
guardrail="Garanta que cada descoberta inclua uma fonte confiável e seja respaldada por dados recentes de 2024-2025"
)
```
#### Usando Guardrails de String em YAML
```yaml
research_task:
description: Pesquise os últimos desenvolvimentos em IA
expected_output: Uma lista de 10 bullet points sobre IA
agent: researcher
guardrail: garanta que cada bullet point contenha no mínimo 100 palavras
validation_task:
description: Valide os achados da pesquisa
expected_output: Um relatório de validação
agent: validator
guardrail: confirme que todas as fontes são de publicações respeitáveis e publicadas nos últimos 2 anos
```
#### Como Funcionam os Guardrails de String
Quando você fornece um guardrail de string, o CrewAI automaticamente:
1. Cria uma instância `LLMGuardrail` usando a string como critério de validação
2. Usa o LLM do agente da tarefa para alimentar a validação
3. Cria um agente temporário de validação que verifica a saída contra seus critérios
4. Retorna feedback detalhado se a validação falhar
Esta abordagem é ideal quando você quer usar linguagem natural para descrever regras de validação sem escrever funções de validação customizadas.
### Classe LLMGuardrail
A classe `LLMGuardrail` é o mecanismo subjacente que alimenta os guardrails baseados em string. Você também pode usá-la diretamente para maior controle avançado:
```python Code
from crewai import Task
from crewai.tasks.llm_guardrail import LLMGuardrail
from crewai.llm import LLM
# Crie um LLMGuardrail customizado com LLM específico
custom_guardrail = LLMGuardrail(
description="Garanta que a resposta contenha exatamente 5 bullet points com citações adequadas",
llm=LLM(model="gpt-4o-mini")
)
task = Task(
description="Pesquise medidas de segurança em IA",
expected_output="Uma análise detalhada com bullet points",
agent=research_agent,
guardrail=custom_guardrail
)
```
**Nota**: Quando você usa um guardrail de string, o CrewAI automaticamente cria uma instância `LLMGuardrail` usando o LLM do agente da sua tarefa. Usar `LLMGuardrail` diretamente lhe dá mais controle sobre o processo de validação e seleção de LLM.
### Melhores Práticas de Tratamento de Erros
@@ -902,26 +825,7 @@ task = Task(
)
```
#### Use uma abordagem no-code para validação
```python Code
from crewai import Task
task = Task(
description="Gerar dados em JSON",
expected_output="Objeto JSON válido",
guardrail="Garanta que a resposta é um objeto JSON válido"
)
```
#### Usando YAML
```yaml
research_task:
...
guardrail: garanta que cada bullet tenha no mínimo 100 palavras
...
```
```python Code
@CrewBase
@@ -1037,21 +941,87 @@ task = Task(
## Criando Diretórios ao Salvar Arquivos
Agora é possível especificar se uma tarefa deve criar diretórios ao salvar sua saída em arquivo. Isso é útil para organizar outputs e garantir que os caminhos estejam corretos.
O parâmetro `create_directory` controla se o CrewAI deve criar automaticamente diretórios ao salvar saídas de tarefas em arquivos. Este recurso é particularmente útil para organizar outputs e garantir que os caminhos de arquivos estejam estruturados corretamente, especialmente ao trabalhar com hierarquias de projetos complexas.
### Comportamento Padrão
Por padrão, `create_directory=True`, o que significa que o CrewAI criará automaticamente qualquer diretório ausente no caminho do arquivo de saída:
```python Code
# ...
save_output_task = Task(
description='Salve o resumo das notícias de IA em um arquivo',
expected_output='Arquivo salvo com sucesso',
agent=research_agent,
tools=[file_save_tool],
output_file='outputs/ai_news_summary.txt',
create_directory=True
# Comportamento padrão - diretórios são criados automaticamente
report_task = Task(
description='Gerar um relatório abrangente de análise de mercado',
expected_output='Uma análise detalhada de mercado com gráficos e insights',
agent=analyst_agent,
output_file='reports/2025/market_analysis.md', # Cria 'reports/2025/' se não existir
markdown=True
)
```
#...
### Desabilitando a Criação de Diretórios
Se você quiser evitar a criação automática de diretórios e garantir que o diretório já exista, defina `create_directory=False`:
```python Code
# Modo estrito - o diretório já deve existir
strict_output_task = Task(
description='Salvar dados críticos que requerem infraestrutura existente',
expected_output='Dados salvos em localização pré-configurada',
agent=data_agent,
output_file='secure/vault/critical_data.json',
create_directory=False # Gerará RuntimeError se 'secure/vault/' não existir
)
```
### Configuração YAML
Você também pode configurar este comportamento em suas definições de tarefas YAML:
```yaml tasks.yaml
analysis_task:
description: >
Gerar análise financeira trimestral
expected_output: >
Um relatório financeiro abrangente com insights trimestrais
agent: financial_analyst
output_file: reports/quarterly/q4_2024_analysis.pdf
create_directory: true # Criar automaticamente o diretório 'reports/quarterly/'
audit_task:
description: >
Realizar auditoria de conformidade e salvar no diretório de auditoria existente
expected_output: >
Um relatório de auditoria de conformidade
agent: auditor
output_file: audit/compliance_report.md
create_directory: false # O diretório já deve existir
```
### Casos de Uso
**Criação Automática de Diretórios (`create_directory=True`):**
- Ambientes de desenvolvimento e prototipagem
- Geração dinâmica de relatórios com pastas baseadas em datas
- Fluxos de trabalho automatizados onde a estrutura de diretórios pode variar
- Aplicações multi-tenant com pastas específicas do usuário
**Gerenciamento Manual de Diretórios (`create_directory=False`):**
- Ambientes de produção com controles rígidos do sistema de arquivos
- Aplicações sensíveis à segurança onde diretórios devem ser pré-configurados
- Sistemas com requisitos específicos de permissão
- Ambientes de conformidade onde a criação de diretórios é auditada
### Tratamento de Erros
Quando `create_directory=False` e o diretório não existe, o CrewAI gerará um `RuntimeError`:
```python Code
try:
result = crew.kickoff()
except RuntimeError as e:
# Tratar erro de diretório ausente
print(f"Falha na criação do diretório: {e}")
# Criar diretório manualmente ou usar local alternativo
```
Veja o vídeo abaixo para aprender como utilizar saídas estruturadas no CrewAI:

View File

@@ -11,7 +11,7 @@ dependencies = [
# Core Dependencies
"pydantic>=2.4.2",
"openai>=1.13.3",
"litellm==1.72.6",
"litellm==1.74.3",
"instructor>=1.3.3",
# Text Processing
"pdfplumber>=0.11.4",
@@ -39,6 +39,7 @@ dependencies = [
"tomli>=2.0.2",
"blinker>=1.9.0",
"json5>=0.10.0",
"portalocker==2.7.0",
]
[project.urls]
@@ -47,7 +48,7 @@ Documentation = "https://docs.crewai.com"
Repository = "https://github.com/crewAIInc/crewAI"
[project.optional-dependencies]
tools = ["crewai-tools~=0.51.0"]
tools = ["crewai-tools~=0.58.0"]
embeddings = [
"tiktoken~=0.8.0"
]

View File

@@ -54,7 +54,7 @@ def _track_install_async():
_track_install_async()
__version__ = "0.141.0"
__version__ = "0.150.0"
__all__ = [
"Agent",
"Crew",

View File

@@ -120,11 +120,8 @@ class CrewAgentExecutor(CrewAgentExecutorMixin):
raise
except Exception as e:
handle_unknown_error(self._printer, e)
if e.__class__.__module__.startswith("litellm"):
# Do not retry on litellm errors
raise e
else:
raise e
raise
if self.ask_for_human_input:
formatted_answer = self._handle_human_feedback(formatted_answer)

View File

@@ -26,7 +26,7 @@ class PlusAPIMixin:
"Please sign up/login to CrewAI+ before using the CLI.",
style="bold red",
)
console.print("Run 'crewai signup' to sign up/login.", style="bold green")
console.print("Run 'crewai login' to sign up/login.", style="bold green")
raise SystemExit
def _validate_response(self, response: requests.Response) -> None:

View File

@@ -5,7 +5,7 @@ description = "{{name}} using crewAI"
authors = [{ name = "Your Name", email = "you@example.com" }]
requires-python = ">=3.10,<3.14"
dependencies = [
"crewai[tools]>=0.141.0,<1.0.0"
"crewai[tools]>=0.150.0,<1.0.0"
]
[project.scripts]

View File

@@ -5,7 +5,7 @@ description = "{{name}} using crewAI"
authors = [{ name = "Your Name", email = "you@example.com" }]
requires-python = ">=3.10,<3.14"
dependencies = [
"crewai[tools]>=0.141.0,<1.0.0",
"crewai[tools]>=0.150.0,<1.0.0",
]
[project.scripts]

View File

@@ -5,7 +5,7 @@ description = "Power up your crews with {{folder_name}}"
readme = "README.md"
requires-python = ">=3.10,<3.14"
dependencies = [
"crewai[tools]>=0.141.0"
"crewai[tools]>=0.150.0"
]
[tool.crewai]

View File

@@ -161,7 +161,7 @@ class Crew(FlowTrackable, BaseModel):
)
user_memory: Optional[InstanceOf[UserMemory]] = Field(
default=None,
description="An instance of the UserMemory to be used by the Crew to store/fetch memories of a specific user.",
description="DEPRECATED: Will be removed in version 0.156.0 or on 2025-08-04, whichever comes first. Use external_memory instead.",
)
external_memory: Optional[InstanceOf[ExternalMemory]] = Field(
default=None,
@@ -327,7 +327,7 @@ class Crew(FlowTrackable, BaseModel):
self._short_term_memory = self.short_term_memory
self._entity_memory = self.entity_memory
# UserMemory is gonna to be deprecated in the future, but we have to initialize a default value for now
# UserMemory will be removed in version 0.156.0 or on 2025-08-04, whichever comes first
self._user_memory = None
if self.memory:
@@ -1255,6 +1255,7 @@ class Crew(FlowTrackable, BaseModel):
if self.external_memory:
copied_data["external_memory"] = self.external_memory.model_copy(deep=True)
if self.user_memory:
# DEPRECATED: UserMemory will be removed in version 0.156.0 or on 2025-08-04
copied_data["user_memory"] = self.user_memory.model_copy(deep=True)
copied_data.pop("agents", None)
@@ -1313,7 +1314,6 @@ class Crew(FlowTrackable, BaseModel):
n_iterations: int,
eval_llm: Union[str, InstanceOf[BaseLLM]],
inputs: Optional[Dict[str, Any]] = None,
include_agent_eval: Optional[bool] = False
) -> None:
"""Test and evaluate the Crew with the given inputs for n iterations concurrently using concurrent.futures."""
try:
@@ -1333,28 +1333,13 @@ class Crew(FlowTrackable, BaseModel):
)
test_crew = self.copy()
# TODO: Refator to use a single Evaluator Manage class
evaluator = CrewEvaluator(test_crew, llm_instance)
if include_agent_eval:
from crewai.evaluation import create_default_evaluator
agent_evaluator = create_default_evaluator(crew=test_crew)
for i in range(1, n_iterations + 1):
evaluator.set_iteration(i)
if include_agent_eval:
agent_evaluator.set_iteration(i)
test_crew.kickoff(inputs=inputs)
# TODO: Refactor to use ListenerEvents instead of trigger each iteration manually
if include_agent_eval:
agent_evaluator.evaluate_current_iteration()
evaluator.print_crew_evaluation_result()
if include_agent_eval:
agent_evaluator.get_agent_evaluation(include_evaluation_feedback=True)
crewai_event_bus.emit(
self,

View File

@@ -1,178 +0,0 @@
from crewai.evaluation.base_evaluator import AgentEvaluationResult, AggregationStrategy
from crewai.agent import Agent
from crewai.task import Task
from crewai.evaluation.evaluation_display import EvaluationDisplayFormatter
from typing import Any, Dict
from collections import defaultdict
from crewai.evaluation import BaseEvaluator, create_evaluation_callbacks
from collections.abc import Sequence
from crewai.crew import Crew
from crewai.utilities.events.crewai_event_bus import crewai_event_bus
from crewai.utilities.events.utils.console_formatter import ConsoleFormatter
class AgentEvaluator:
def __init__(
self,
evaluators: Sequence[BaseEvaluator] | None = None,
crew: Crew | None = None,
):
self.crew: Crew | None = crew
self.evaluators: Sequence[BaseEvaluator] | None = evaluators
self.agent_evaluators: dict[str, Sequence[BaseEvaluator] | None] = {}
if crew is not None:
assert crew and crew.agents is not None
for agent in crew.agents:
self.agent_evaluators[str(agent.id)] = self.evaluators
self.callback = create_evaluation_callbacks()
self.console_formatter = ConsoleFormatter()
self.display_formatter = EvaluationDisplayFormatter()
self.iteration = 1
self.iterations_results: dict[int, dict[str, list[AgentEvaluationResult]]] = {}
def set_iteration(self, iteration: int) -> None:
self.iteration = iteration
def evaluate_current_iteration(self) -> dict[str, list[AgentEvaluationResult]]:
if not self.crew:
raise ValueError("Cannot evaluate: no crew was provided to the evaluator.")
if not self.callback:
raise ValueError("Cannot evaluate: no callback was set. Use set_callback() method first.")
from rich.progress import Progress, SpinnerColumn, TextColumn, BarColumn
evaluation_results: defaultdict[str, list[AgentEvaluationResult]] = defaultdict(list)
total_evals = 0
for agent in self.crew.agents:
for task in self.crew.tasks:
if task.agent and task.agent.id == agent.id and self.agent_evaluators.get(str(agent.id)):
total_evals += 1
with Progress(
SpinnerColumn(),
TextColumn("[bold blue]{task.description}[/bold blue]"),
BarColumn(),
TextColumn("{task.percentage:.0f}% completed"),
console=self.console_formatter.console
) as progress:
eval_task = progress.add_task(f"Evaluating agents (iteration {self.iteration})...", total=total_evals)
for agent in self.crew.agents:
evaluator = self.agent_evaluators.get(str(agent.id))
if not evaluator:
continue
for task in self.crew.tasks:
if task.agent and str(task.agent.id) != str(agent.id):
continue
trace = self.callback.get_trace(str(agent.id), str(task.id))
if not trace:
self.console_formatter.print(f"[yellow]Warning: No trace found for agent {agent.role} on task {task.description[:30]}...[/yellow]")
progress.update(eval_task, advance=1)
continue
with crewai_event_bus.scoped_handlers():
result = self.evaluate(
agent=agent,
task=task,
execution_trace=trace,
final_output=task.output
)
evaluation_results[agent.role].append(result)
progress.update(eval_task, advance=1)
self.iterations_results[self.iteration] = evaluation_results
return evaluation_results
def get_evaluation_results(self):
if self.iteration in self.iterations_results:
return self.iterations_results[self.iteration]
return self.evaluate_current_iteration()
def display_results_with_iterations(self):
self.display_formatter.display_summary_results(self.iterations_results)
def get_agent_evaluation(self, strategy: AggregationStrategy = AggregationStrategy.SIMPLE_AVERAGE, include_evaluation_feedback: bool = False):
agent_results = {}
with crewai_event_bus.scoped_handlers():
task_results = self.get_evaluation_results()
for agent_role, results in task_results.items():
if not results:
continue
agent_id = results[0].agent_id
aggregated_result = self.display_formatter._aggregate_agent_results(
agent_id=agent_id,
agent_role=agent_role,
results=results,
strategy=strategy
)
agent_results[agent_role] = aggregated_result
if self.iteration == max(self.iterations_results.keys()):
self.display_results_with_iterations()
if include_evaluation_feedback:
self.display_evaluation_with_feedback()
return agent_results
def display_evaluation_with_feedback(self):
self.display_formatter.display_evaluation_with_feedback(self.iterations_results)
def evaluate(
self,
agent: Agent,
task: Task,
execution_trace: Dict[str, Any],
final_output: Any
) -> AgentEvaluationResult:
result = AgentEvaluationResult(
agent_id=str(agent.id),
task_id=str(task.id)
)
assert self.evaluators is not None
for evaluator in self.evaluators:
try:
score = evaluator.evaluate(
agent=agent,
task=task,
execution_trace=execution_trace,
final_output=final_output
)
result.metrics[evaluator.metric_category] = score
except Exception as e:
self.console_formatter.print(f"Error in {evaluator.metric_category.value} evaluator: {str(e)}")
return result
def create_default_evaluator(crew, llm=None):
from crewai.evaluation import (
GoalAlignmentEvaluator,
SemanticQualityEvaluator,
ToolSelectionEvaluator,
ParameterExtractionEvaluator,
ToolInvocationEvaluator,
ReasoningEfficiencyEvaluator
)
evaluators = [
GoalAlignmentEvaluator(llm=llm),
SemanticQualityEvaluator(llm=llm),
ToolSelectionEvaluator(llm=llm),
ParameterExtractionEvaluator(llm=llm),
ToolInvocationEvaluator(llm=llm),
ReasoningEfficiencyEvaluator(llm=llm),
]
return AgentEvaluator(evaluators=evaluators, crew=crew)

View File

@@ -0,0 +1,40 @@
from crewai.experimental.evaluation import (
BaseEvaluator,
EvaluationScore,
MetricCategory,
AgentEvaluationResult,
SemanticQualityEvaluator,
GoalAlignmentEvaluator,
ReasoningEfficiencyEvaluator,
ToolSelectionEvaluator,
ParameterExtractionEvaluator,
ToolInvocationEvaluator,
EvaluationTraceCallback,
create_evaluation_callbacks,
AgentEvaluator,
create_default_evaluator,
ExperimentRunner,
ExperimentResults,
ExperimentResult,
)
__all__ = [
"BaseEvaluator",
"EvaluationScore",
"MetricCategory",
"AgentEvaluationResult",
"SemanticQualityEvaluator",
"GoalAlignmentEvaluator",
"ReasoningEfficiencyEvaluator",
"ToolSelectionEvaluator",
"ParameterExtractionEvaluator",
"ToolInvocationEvaluator",
"EvaluationTraceCallback",
"create_evaluation_callbacks",
"AgentEvaluator",
"create_default_evaluator",
"ExperimentRunner",
"ExperimentResults",
"ExperimentResult"
]

View File

@@ -1,40 +1,35 @@
from crewai.evaluation.base_evaluator import (
from crewai.experimental.evaluation.base_evaluator import (
BaseEvaluator,
EvaluationScore,
MetricCategory,
AgentEvaluationResult
)
from crewai.evaluation.metrics.semantic_quality_metrics import (
SemanticQualityEvaluator
)
from crewai.evaluation.metrics.goal_metrics import (
GoalAlignmentEvaluator
)
from crewai.evaluation.metrics.reasoning_metrics import (
ReasoningEfficiencyEvaluator
)
from crewai.evaluation.metrics.tools_metrics import (
from crewai.experimental.evaluation.metrics import (
SemanticQualityEvaluator,
GoalAlignmentEvaluator,
ReasoningEfficiencyEvaluator,
ToolSelectionEvaluator,
ParameterExtractionEvaluator,
ToolInvocationEvaluator
)
from crewai.evaluation.evaluation_listener import (
from crewai.experimental.evaluation.evaluation_listener import (
EvaluationTraceCallback,
create_evaluation_callbacks
)
from crewai.evaluation.agent_evaluator import (
from crewai.experimental.evaluation.agent_evaluator import (
AgentEvaluator,
create_default_evaluator
)
from crewai.experimental.evaluation.experiment import (
ExperimentRunner,
ExperimentResults,
ExperimentResult
)
__all__ = [
"BaseEvaluator",
"EvaluationScore",
@@ -49,5 +44,8 @@ __all__ = [
"EvaluationTraceCallback",
"create_evaluation_callbacks",
"AgentEvaluator",
"create_default_evaluator"
]
"create_default_evaluator",
"ExperimentRunner",
"ExperimentResults",
"ExperimentResult"
]

View File

@@ -0,0 +1,245 @@
import threading
from typing import Any
from crewai.experimental.evaluation.base_evaluator import AgentEvaluationResult, AggregationStrategy
from crewai.agent import Agent
from crewai.task import Task
from crewai.experimental.evaluation.evaluation_display import EvaluationDisplayFormatter
from crewai.utilities.events.agent_events import AgentEvaluationStartedEvent, AgentEvaluationCompletedEvent, AgentEvaluationFailedEvent
from crewai.experimental.evaluation import BaseEvaluator, create_evaluation_callbacks
from collections.abc import Sequence
from crewai.utilities.events.crewai_event_bus import crewai_event_bus
from crewai.utilities.events.utils.console_formatter import ConsoleFormatter
from crewai.utilities.events.task_events import TaskCompletedEvent
from crewai.utilities.events.agent_events import LiteAgentExecutionCompletedEvent
from crewai.experimental.evaluation.base_evaluator import AgentAggregatedEvaluationResult, EvaluationScore, MetricCategory
class ExecutionState:
def __init__(self):
self.traces = {}
self.current_agent_id: str | None = None
self.current_task_id: str | None = None
self.iteration = 1
self.iterations_results = {}
self.agent_evaluators = {}
class AgentEvaluator:
def __init__(
self,
agents: list[Agent],
evaluators: Sequence[BaseEvaluator] | None = None,
):
self.agents: list[Agent] = agents
self.evaluators: Sequence[BaseEvaluator] | None = evaluators
self.callback = create_evaluation_callbacks()
self.console_formatter = ConsoleFormatter()
self.display_formatter = EvaluationDisplayFormatter()
self._thread_local: threading.local = threading.local()
for agent in self.agents:
self._execution_state.agent_evaluators[str(agent.id)] = self.evaluators
self._subscribe_to_events()
@property
def _execution_state(self) -> ExecutionState:
if not hasattr(self._thread_local, 'execution_state'):
self._thread_local.execution_state = ExecutionState()
return self._thread_local.execution_state
def _subscribe_to_events(self) -> None:
from typing import cast
crewai_event_bus.register_handler(TaskCompletedEvent, cast(Any, self._handle_task_completed))
crewai_event_bus.register_handler(LiteAgentExecutionCompletedEvent, cast(Any, self._handle_lite_agent_completed))
def _handle_task_completed(self, source: Any, event: TaskCompletedEvent) -> None:
assert event.task is not None
agent = event.task.agent
if agent and str(getattr(agent, 'id', 'unknown')) in self._execution_state.agent_evaluators:
self.emit_evaluation_started_event(agent_role=agent.role, agent_id=str(agent.id), task_id=str(event.task.id))
state = ExecutionState()
state.current_agent_id = str(agent.id)
state.current_task_id = str(event.task.id)
assert state.current_agent_id is not None and state.current_task_id is not None
trace = self.callback.get_trace(state.current_agent_id, state.current_task_id)
if not trace:
return
result = self.evaluate(
agent=agent,
task=event.task,
execution_trace=trace,
final_output=event.output,
state=state
)
current_iteration = self._execution_state.iteration
if current_iteration not in self._execution_state.iterations_results:
self._execution_state.iterations_results[current_iteration] = {}
if agent.role not in self._execution_state.iterations_results[current_iteration]:
self._execution_state.iterations_results[current_iteration][agent.role] = []
self._execution_state.iterations_results[current_iteration][agent.role].append(result)
def _handle_lite_agent_completed(self, source: object, event: LiteAgentExecutionCompletedEvent) -> None:
agent_info = event.agent_info
agent_id = str(agent_info["id"])
if agent_id in self._execution_state.agent_evaluators:
state = ExecutionState()
state.current_agent_id = agent_id
state.current_task_id = "lite_task"
target_agent = None
for agent in self.agents:
if str(agent.id) == agent_id:
target_agent = agent
break
if not target_agent:
return
assert state.current_agent_id is not None and state.current_task_id is not None
trace = self.callback.get_trace(state.current_agent_id, state.current_task_id)
if not trace:
return
result = self.evaluate(
agent=target_agent,
execution_trace=trace,
final_output=event.output,
state=state
)
current_iteration = self._execution_state.iteration
if current_iteration not in self._execution_state.iterations_results:
self._execution_state.iterations_results[current_iteration] = {}
agent_role = target_agent.role
if agent_role not in self._execution_state.iterations_results[current_iteration]:
self._execution_state.iterations_results[current_iteration][agent_role] = []
self._execution_state.iterations_results[current_iteration][agent_role].append(result)
def set_iteration(self, iteration: int) -> None:
self._execution_state.iteration = iteration
def reset_iterations_results(self) -> None:
self._execution_state.iterations_results = {}
def get_evaluation_results(self) -> dict[str, list[AgentEvaluationResult]]:
if self._execution_state.iterations_results and self._execution_state.iteration in self._execution_state.iterations_results:
return self._execution_state.iterations_results[self._execution_state.iteration]
return {}
def display_results_with_iterations(self) -> None:
self.display_formatter.display_summary_results(self._execution_state.iterations_results)
def get_agent_evaluation(self, strategy: AggregationStrategy = AggregationStrategy.SIMPLE_AVERAGE, include_evaluation_feedback: bool = True) -> dict[str, AgentAggregatedEvaluationResult]:
agent_results = {}
with crewai_event_bus.scoped_handlers():
task_results = self.get_evaluation_results()
for agent_role, results in task_results.items():
if not results:
continue
agent_id = results[0].agent_id
aggregated_result = self.display_formatter._aggregate_agent_results(
agent_id=agent_id,
agent_role=agent_role,
results=results,
strategy=strategy
)
agent_results[agent_role] = aggregated_result
if self._execution_state.iterations_results and self._execution_state.iteration == max(self._execution_state.iterations_results.keys(), default=0):
self.display_results_with_iterations()
if include_evaluation_feedback:
self.display_evaluation_with_feedback()
return agent_results
def display_evaluation_with_feedback(self) -> None:
self.display_formatter.display_evaluation_with_feedback(self._execution_state.iterations_results)
def evaluate(
self,
agent: Agent,
execution_trace: dict[str, Any],
final_output: Any,
state: ExecutionState,
task: Task | None = None,
) -> AgentEvaluationResult:
result = AgentEvaluationResult(
agent_id=state.current_agent_id or str(agent.id),
task_id=state.current_task_id or (str(task.id) if task else "unknown_task")
)
assert self.evaluators is not None
task_id = str(task.id) if task else None
for evaluator in self.evaluators:
try:
self.emit_evaluation_started_event(agent_role=agent.role, agent_id=str(agent.id), task_id=task_id)
score = evaluator.evaluate(
agent=agent,
task=task,
execution_trace=execution_trace,
final_output=final_output
)
result.metrics[evaluator.metric_category] = score
self.emit_evaluation_completed_event(agent_role=agent.role, agent_id=str(agent.id), task_id=task_id, metric_category=evaluator.metric_category, score=score)
except Exception as e:
self.emit_evaluation_failed_event(agent_role=agent.role, agent_id=str(agent.id), task_id=task_id, error=str(e))
self.console_formatter.print(f"Error in {evaluator.metric_category.value} evaluator: {str(e)}")
return result
def emit_evaluation_started_event(self, agent_role: str, agent_id: str, task_id: str | None = None):
crewai_event_bus.emit(
self,
AgentEvaluationStartedEvent(agent_role=agent_role, agent_id=agent_id, task_id=task_id, iteration=self._execution_state.iteration)
)
def emit_evaluation_completed_event(self, agent_role: str, agent_id: str, task_id: str | None = None, metric_category: MetricCategory | None = None, score: EvaluationScore | None = None):
crewai_event_bus.emit(
self,
AgentEvaluationCompletedEvent(agent_role=agent_role, agent_id=agent_id, task_id=task_id, iteration=self._execution_state.iteration, metric_category=metric_category, score=score)
)
def emit_evaluation_failed_event(self, agent_role: str, agent_id: str, error: str, task_id: str | None = None):
crewai_event_bus.emit(
self,
AgentEvaluationFailedEvent(agent_role=agent_role, agent_id=agent_id, task_id=task_id, iteration=self._execution_state.iteration, error=error)
)
def create_default_evaluator(agents: list[Agent], llm: None = None):
from crewai.experimental.evaluation import (
GoalAlignmentEvaluator,
SemanticQualityEvaluator,
ToolSelectionEvaluator,
ParameterExtractionEvaluator,
ToolInvocationEvaluator,
ReasoningEfficiencyEvaluator
)
evaluators = [
GoalAlignmentEvaluator(llm=llm),
SemanticQualityEvaluator(llm=llm),
ToolSelectionEvaluator(llm=llm),
ParameterExtractionEvaluator(llm=llm),
ToolInvocationEvaluator(llm=llm),
ReasoningEfficiencyEvaluator(llm=llm),
]
return AgentEvaluator(evaluators=evaluators, agents=agents)

View File

@@ -57,9 +57,9 @@ class BaseEvaluator(abc.ABC):
def evaluate(
self,
agent: Agent,
task: Task,
execution_trace: Dict[str, Any],
final_output: Any,
task: Task | None = None,
) -> EvaluationScore:
pass

View File

@@ -3,8 +3,8 @@ from typing import Dict, Any, List
from rich.table import Table
from rich.box import HEAVY_EDGE, ROUNDED
from collections.abc import Sequence
from crewai.evaluation.base_evaluator import AgentAggregatedEvaluationResult, AggregationStrategy, AgentEvaluationResult, MetricCategory
from crewai.evaluation import EvaluationScore
from crewai.experimental.evaluation.base_evaluator import AgentAggregatedEvaluationResult, AggregationStrategy, AgentEvaluationResult, MetricCategory
from crewai.experimental.evaluation import EvaluationScore
from crewai.utilities.events.utils.console_formatter import ConsoleFormatter
from crewai.utilities.llm_utils import create_llm
@@ -17,7 +17,6 @@ class EvaluationDisplayFormatter:
self.console_formatter.print("[yellow]No evaluation results to display[/yellow]")
return
# Get all agent roles across all iterations
all_agent_roles: set[str] = set()
for iter_results in iterations_results.values():
all_agent_roles.update(iter_results.keys())
@@ -25,7 +24,6 @@ class EvaluationDisplayFormatter:
for agent_role in sorted(all_agent_roles):
self.console_formatter.print(f"\n[bold cyan]Agent: {agent_role}[/bold cyan]")
# Process each iteration
for iter_num, results in sorted(iterations_results.items()):
if agent_role not in results or not results[agent_role]:
continue
@@ -33,23 +31,19 @@ class EvaluationDisplayFormatter:
agent_results = results[agent_role]
agent_id = agent_results[0].agent_id
# Aggregate results for this agent in this iteration
aggregated_result = self._aggregate_agent_results(
agent_id=agent_id,
agent_role=agent_role,
results=agent_results,
)
# Display iteration header
self.console_formatter.print(f"\n[bold]Iteration {iter_num}[/bold]")
# Create table for this iteration
table = Table(box=ROUNDED)
table.add_column("Metric", style="cyan")
table.add_column("Score (1-10)", justify="center")
table.add_column("Feedback", style="green")
# Add metrics to table
if aggregated_result.metrics:
for metric, evaluation_score in aggregated_result.metrics.items():
score = evaluation_score.score
@@ -91,7 +85,6 @@ class EvaluationDisplayFormatter:
"Overall agent evaluation score"
)
# Print the table for this iteration
self.console_formatter.print(table)
def display_summary_results(self, iterations_results: Dict[int, Dict[str, List[AgentAggregatedEvaluationResult]]]):
@@ -248,7 +241,6 @@ class EvaluationDisplayFormatter:
feedback_summary = None
if feedbacks:
if len(feedbacks) > 1:
# Use the summarization method for multiple feedbacks
feedback_summary = self._summarize_feedbacks(
agent_role=agent_role,
metric=category.title(),
@@ -307,7 +299,7 @@ class EvaluationDisplayFormatter:
strategy_guidance = "Focus on the highest-scoring aspects and strengths demonstrated."
elif strategy == AggregationStrategy.WORST_PERFORMANCE:
strategy_guidance = "Focus on areas that need improvement and common issues across tasks."
else: # Default/average strategies
else:
strategy_guidance = "Provide a balanced analysis of strengths and weaknesses across all tasks."
prompt = [

View File

@@ -9,7 +9,9 @@ from crewai.utilities.events.base_event_listener import BaseEventListener
from crewai.utilities.events.crewai_event_bus import CrewAIEventsBus
from crewai.utilities.events.agent_events import (
AgentExecutionStartedEvent,
AgentExecutionCompletedEvent
AgentExecutionCompletedEvent,
LiteAgentExecutionStartedEvent,
LiteAgentExecutionCompletedEvent
)
from crewai.utilities.events.tool_usage_events import (
ToolUsageFinishedEvent,
@@ -52,10 +54,18 @@ class EvaluationTraceCallback(BaseEventListener):
def on_agent_started(source, event: AgentExecutionStartedEvent):
self.on_agent_start(event.agent, event.task)
@event_bus.on(LiteAgentExecutionStartedEvent)
def on_lite_agent_started(source, event: LiteAgentExecutionStartedEvent):
self.on_lite_agent_start(event.agent_info)
@event_bus.on(AgentExecutionCompletedEvent)
def on_agent_completed(source, event: AgentExecutionCompletedEvent):
self.on_agent_finish(event.agent, event.task, event.output)
@event_bus.on(LiteAgentExecutionCompletedEvent)
def on_lite_agent_completed(source, event: LiteAgentExecutionCompletedEvent):
self.on_lite_agent_finish(event.output)
@event_bus.on(ToolUsageFinishedEvent)
def on_tool_completed(source, event: ToolUsageFinishedEvent):
self.on_tool_use(event.tool_name, event.tool_args, event.output, success=True)
@@ -88,19 +98,38 @@ class EvaluationTraceCallback(BaseEventListener):
def on_llm_call_completed(source, event: LLMCallCompletedEvent):
self.on_llm_call_end(event.messages, event.response)
def on_lite_agent_start(self, agent_info: dict[str, Any]):
self.current_agent_id = agent_info['id']
self.current_task_id = "lite_task"
trace_key = f"{self.current_agent_id}_{self.current_task_id}"
self._init_trace(
trace_key=trace_key,
agent_id=self.current_agent_id,
task_id=self.current_task_id,
tool_uses=[],
llm_calls=[],
start_time=datetime.now(),
final_output=None
)
def _init_trace(self, trace_key: str, **kwargs: Any):
self.traces[trace_key] = kwargs
def on_agent_start(self, agent: Agent, task: Task):
self.current_agent_id = agent.id
self.current_task_id = task.id
trace_key = f"{agent.id}_{task.id}"
self.traces[trace_key] = {
"agent_id": agent.id,
"task_id": task.id,
"tool_uses": [],
"llm_calls": [],
"start_time": datetime.now(),
"final_output": None
}
self._init_trace(
trace_key=trace_key,
agent_id=agent.id,
task_id=task.id,
tool_uses=[],
llm_calls=[],
start_time=datetime.now(),
final_output=None
)
def on_agent_finish(self, agent: Agent, task: Task, output: Any):
trace_key = f"{agent.id}_{task.id}"
@@ -108,9 +137,20 @@ class EvaluationTraceCallback(BaseEventListener):
self.traces[trace_key]["final_output"] = output
self.traces[trace_key]["end_time"] = datetime.now()
self._reset_current()
def _reset_current(self):
self.current_agent_id = None
self.current_task_id = None
def on_lite_agent_finish(self, output: Any):
trace_key = f"{self.current_agent_id}_lite_task"
if trace_key in self.traces:
self.traces[trace_key]["final_output"] = output
self.traces[trace_key]["end_time"] = datetime.now()
self._reset_current()
def on_tool_use(self, tool_name: str, tool_args: dict[str, Any] | str, result: Any,
success: bool = True, error_type: str | None = None):
if not self.current_agent_id or not self.current_task_id:
@@ -187,4 +227,8 @@ class EvaluationTraceCallback(BaseEventListener):
def create_evaluation_callbacks() -> EvaluationTraceCallback:
return EvaluationTraceCallback()
from crewai.utilities.events.crewai_event_bus import crewai_event_bus
callback = EvaluationTraceCallback()
callback.setup_listeners(crewai_event_bus)
return callback

View File

@@ -0,0 +1,8 @@
from crewai.experimental.evaluation.experiment.runner import ExperimentRunner
from crewai.experimental.evaluation.experiment.result import ExperimentResults, ExperimentResult
__all__ = [
"ExperimentRunner",
"ExperimentResults",
"ExperimentResult"
]

View File

@@ -0,0 +1,122 @@
import json
import os
from datetime import datetime, timezone
from typing import Any
from pydantic import BaseModel
class ExperimentResult(BaseModel):
identifier: str
inputs: dict[str, Any]
score: int | dict[str, int | float]
expected_score: int | dict[str, int | float]
passed: bool
agent_evaluations: dict[str, Any] | None = None
class ExperimentResults:
def __init__(self, results: list[ExperimentResult], metadata: dict[str, Any] | None = None):
self.results = results
self.metadata = metadata or {}
self.timestamp = datetime.now(timezone.utc)
from crewai.experimental.evaluation.experiment.result_display import ExperimentResultsDisplay
self.display = ExperimentResultsDisplay()
def to_json(self, filepath: str | None = None) -> dict[str, Any]:
data = {
"timestamp": self.timestamp.isoformat(),
"metadata": self.metadata,
"results": [r.model_dump(exclude={"agent_evaluations"}) for r in self.results]
}
if filepath:
with open(filepath, 'w') as f:
json.dump(data, f, indent=2)
self.display.console.print(f"[green]Results saved to {filepath}[/green]")
return data
def compare_with_baseline(self, baseline_filepath: str, save_current: bool = True, print_summary: bool = False) -> dict[str, Any]:
baseline_runs = []
if os.path.exists(baseline_filepath) and os.path.getsize(baseline_filepath) > 0:
try:
with open(baseline_filepath, 'r') as f:
baseline_data = json.load(f)
if isinstance(baseline_data, dict) and "timestamp" in baseline_data:
baseline_runs = [baseline_data]
elif isinstance(baseline_data, list):
baseline_runs = baseline_data
except (json.JSONDecodeError, FileNotFoundError) as e:
self.display.console.print(f"[yellow]Warning: Could not load baseline file: {str(e)}[/yellow]")
if not baseline_runs:
if save_current:
current_data = self.to_json()
with open(baseline_filepath, 'w') as f:
json.dump([current_data], f, indent=2)
self.display.console.print(f"[green]Saved current results as new baseline to {baseline_filepath}[/green]")
return {"is_baseline": True, "changes": {}}
baseline_runs.sort(key=lambda x: x.get("timestamp", ""), reverse=True)
latest_run = baseline_runs[0]
comparison = self._compare_with_run(latest_run)
if print_summary:
self.display.comparison_summary(comparison, latest_run["timestamp"])
if save_current:
current_data = self.to_json()
baseline_runs.append(current_data)
with open(baseline_filepath, 'w') as f:
json.dump(baseline_runs, f, indent=2)
self.display.console.print(f"[green]Added current results to baseline file {baseline_filepath}[/green]")
return comparison
def _compare_with_run(self, baseline_run: dict[str, Any]) -> dict[str, Any]:
baseline_results = baseline_run.get("results", [])
baseline_lookup = {}
for result in baseline_results:
test_identifier = result.get("identifier")
if test_identifier:
baseline_lookup[test_identifier] = result
improved = []
regressed = []
unchanged = []
new_tests = []
for result in self.results:
test_identifier = result.identifier
if not test_identifier or test_identifier not in baseline_lookup:
new_tests.append(test_identifier)
continue
baseline_result = baseline_lookup[test_identifier]
baseline_passed = baseline_result.get("passed", False)
if result.passed and not baseline_passed:
improved.append(test_identifier)
elif not result.passed and baseline_passed:
regressed.append(test_identifier)
else:
unchanged.append(test_identifier)
missing_tests = []
current_test_identifiers = {result.identifier for result in self.results}
for result in baseline_results:
test_identifier = result.get("identifier")
if test_identifier and test_identifier not in current_test_identifiers:
missing_tests.append(test_identifier)
return {
"improved": improved,
"regressed": regressed,
"unchanged": unchanged,
"new_tests": new_tests,
"missing_tests": missing_tests,
"total_compared": len(improved) + len(regressed) + len(unchanged),
"baseline_timestamp": baseline_run.get("timestamp", "unknown")
}

View File

@@ -0,0 +1,70 @@
from typing import Dict, Any
from rich.console import Console
from rich.table import Table
from rich.panel import Panel
from crewai.experimental.evaluation.experiment.result import ExperimentResults
class ExperimentResultsDisplay:
def __init__(self):
self.console = Console()
def summary(self, experiment_results: ExperimentResults):
total = len(experiment_results.results)
passed = sum(1 for r in experiment_results.results if r.passed)
table = Table(title="Experiment Summary")
table.add_column("Metric", style="cyan")
table.add_column("Value", style="green")
table.add_row("Total Test Cases", str(total))
table.add_row("Passed", str(passed))
table.add_row("Failed", str(total - passed))
table.add_row("Success Rate", f"{(passed / total * 100):.1f}%" if total > 0 else "N/A")
self.console.print(table)
def comparison_summary(self, comparison: Dict[str, Any], baseline_timestamp: str):
self.console.print(Panel(f"[bold]Comparison with baseline run from {baseline_timestamp}[/bold]",
expand=False))
table = Table(title="Results Comparison")
table.add_column("Metric", style="cyan")
table.add_column("Count", style="white")
table.add_column("Details", style="dim")
improved = comparison.get("improved", [])
if improved:
details = ", ".join([f"{test_identifier}" for test_identifier in improved[:3]])
if len(improved) > 3:
details += f" and {len(improved) - 3} more"
table.add_row("✅ Improved", str(len(improved)), details)
else:
table.add_row("✅ Improved", "0", "")
regressed = comparison.get("regressed", [])
if regressed:
details = ", ".join([f"{test_identifier}" for test_identifier in regressed[:3]])
if len(regressed) > 3:
details += f" and {len(regressed) - 3} more"
table.add_row("❌ Regressed", str(len(regressed)), details, style="red")
else:
table.add_row("❌ Regressed", "0", "")
unchanged = comparison.get("unchanged", [])
table.add_row("⏺ Unchanged", str(len(unchanged)), "")
new_tests = comparison.get("new_tests", [])
if new_tests:
details = ", ".join(new_tests[:3])
if len(new_tests) > 3:
details += f" and {len(new_tests) - 3} more"
table.add_row(" New Tests", str(len(new_tests)), details)
missing_tests = comparison.get("missing_tests", [])
if missing_tests:
details = ", ".join(missing_tests[:3])
if len(missing_tests) > 3:
details += f" and {len(missing_tests) - 3} more"
table.add_row(" Missing Tests", str(len(missing_tests)), details)
self.console.print(table)

View File

@@ -0,0 +1,125 @@
from collections import defaultdict
from hashlib import md5
from typing import Any
from crewai import Crew, Agent
from crewai.experimental.evaluation import AgentEvaluator, create_default_evaluator
from crewai.experimental.evaluation.experiment.result_display import ExperimentResultsDisplay
from crewai.experimental.evaluation.experiment.result import ExperimentResults, ExperimentResult
from crewai.experimental.evaluation.evaluation_display import AgentAggregatedEvaluationResult
class ExperimentRunner:
def __init__(self, dataset: list[dict[str, Any]]):
self.dataset = dataset or []
self.evaluator: AgentEvaluator | None = None
self.display = ExperimentResultsDisplay()
def run(self, crew: Crew | None = None, agents: list[Agent] | None = None, print_summary: bool = False) -> ExperimentResults:
if crew and not agents:
agents = crew.agents
assert agents is not None
self.evaluator = create_default_evaluator(agents=agents)
results = []
for test_case in self.dataset:
self.evaluator.reset_iterations_results()
result = self._run_test_case(test_case=test_case, crew=crew, agents=agents)
results.append(result)
experiment_results = ExperimentResults(results)
if print_summary:
self.display.summary(experiment_results)
return experiment_results
def _run_test_case(self, test_case: dict[str, Any], agents: list[Agent], crew: Crew | None = None) -> ExperimentResult:
inputs = test_case["inputs"]
expected_score = test_case["expected_score"]
identifier = test_case.get("identifier") or md5(str(test_case).encode(), usedforsecurity=False).hexdigest()
try:
self.display.console.print(f"[dim]Running crew with input: {str(inputs)[:50]}...[/dim]")
self.display.console.print("\n")
if crew:
crew.kickoff(inputs=inputs)
else:
for agent in agents:
agent.kickoff(**inputs)
assert self.evaluator is not None
agent_evaluations = self.evaluator.get_agent_evaluation()
actual_score = self._extract_scores(agent_evaluations)
passed = self._assert_scores(expected_score, actual_score)
return ExperimentResult(
identifier=identifier,
inputs=inputs,
score=actual_score,
expected_score=expected_score,
passed=passed,
agent_evaluations=agent_evaluations
)
except Exception as e:
self.display.console.print(f"[red]Error running test case: {str(e)}[/red]")
return ExperimentResult(
identifier=identifier,
inputs=inputs,
score=0,
expected_score=expected_score,
passed=False
)
def _extract_scores(self, agent_evaluations: dict[str, AgentAggregatedEvaluationResult]) -> float | dict[str, float]:
all_scores: dict[str, list[float]] = defaultdict(list)
for evaluation in agent_evaluations.values():
for metric_name, score in evaluation.metrics.items():
if score.score is not None:
all_scores[metric_name.value].append(score.score)
avg_scores = {m: sum(s)/len(s) for m, s in all_scores.items()}
if len(avg_scores) == 1:
return list(avg_scores.values())[0]
return avg_scores
def _assert_scores(self, expected: float | dict[str, float],
actual: float | dict[str, float]) -> bool:
"""
Compare expected and actual scores, and return whether the test case passed.
The rules for comparison are as follows:
- If both expected and actual scores are single numbers, the actual score must be >= expected.
- If expected is a single number and actual is a dict, compare against the average of actual values.
- If expected is a dict and actual is a single number, actual must be >= all expected values.
- If both are dicts, actual must have matching keys with values >= expected values.
"""
if isinstance(expected, (int, float)) and isinstance(actual, (int, float)):
return actual >= expected
if isinstance(expected, dict) and isinstance(actual, (int, float)):
return all(actual >= exp_score for exp_score in expected.values())
if isinstance(expected, (int, float)) and isinstance(actual, dict):
if not actual:
return False
avg_score = sum(actual.values()) / len(actual)
return avg_score >= expected
if isinstance(expected, dict) and isinstance(actual, dict):
if not expected:
return True
matching_keys = set(expected.keys()) & set(actual.keys())
if not matching_keys:
return False
# All matching keys must have actual >= expected
return all(actual[key] >= expected[key] for key in matching_keys)
return False

View File

@@ -0,0 +1,26 @@
from crewai.experimental.evaluation.metrics.reasoning_metrics import (
ReasoningEfficiencyEvaluator
)
from crewai.experimental.evaluation.metrics.tools_metrics import (
ToolSelectionEvaluator,
ParameterExtractionEvaluator,
ToolInvocationEvaluator
)
from crewai.experimental.evaluation.metrics.goal_metrics import (
GoalAlignmentEvaluator
)
from crewai.experimental.evaluation.metrics.semantic_quality_metrics import (
SemanticQualityEvaluator
)
__all__ = [
"ReasoningEfficiencyEvaluator",
"ToolSelectionEvaluator",
"ParameterExtractionEvaluator",
"ToolInvocationEvaluator",
"GoalAlignmentEvaluator",
"SemanticQualityEvaluator"
]

View File

@@ -3,8 +3,8 @@ from typing import Any, Dict
from crewai.agent import Agent
from crewai.task import Task
from crewai.evaluation.base_evaluator import BaseEvaluator, EvaluationScore, MetricCategory
from crewai.evaluation.json_parser import extract_json_from_llm_response
from crewai.experimental.evaluation.base_evaluator import BaseEvaluator, EvaluationScore, MetricCategory
from crewai.experimental.evaluation.json_parser import extract_json_from_llm_response
class GoalAlignmentEvaluator(BaseEvaluator):
@property
@@ -14,10 +14,14 @@ class GoalAlignmentEvaluator(BaseEvaluator):
def evaluate(
self,
agent: Agent,
task: Task,
execution_trace: Dict[str, Any],
final_output: Any,
task: Task | None = None,
) -> EvaluationScore:
task_context = ""
if task is not None:
task_context = f"Task description: {task.description}\nExpected output: {task.expected_output}\n"
prompt = [
{"role": "system", "content": """You are an expert evaluator assessing how well an AI agent's output aligns with its assigned task goal.
@@ -37,8 +41,7 @@ Return your evaluation as JSON with fields 'score' (number) and 'feedback' (stri
{"role": "user", "content": f"""
Agent role: {agent.role}
Agent goal: {agent.goal}
Task description: {task.description}
Expected output: {task.expected_output}
{task_context}
Agent's final output:
{final_output}

View File

@@ -16,8 +16,8 @@ from collections.abc import Sequence
from crewai.agent import Agent
from crewai.task import Task
from crewai.evaluation.base_evaluator import BaseEvaluator, EvaluationScore, MetricCategory
from crewai.evaluation.json_parser import extract_json_from_llm_response
from crewai.experimental.evaluation.base_evaluator import BaseEvaluator, EvaluationScore, MetricCategory
from crewai.experimental.evaluation.json_parser import extract_json_from_llm_response
from crewai.tasks.task_output import TaskOutput
class ReasoningPatternType(Enum):
@@ -36,10 +36,14 @@ class ReasoningEfficiencyEvaluator(BaseEvaluator):
def evaluate(
self,
agent: Agent,
task: Task,
execution_trace: Dict[str, Any],
final_output: TaskOutput,
final_output: TaskOutput | str,
task: Task | None = None,
) -> EvaluationScore:
task_context = ""
if task is not None:
task_context = f"Task description: {task.description}\nExpected output: {task.expected_output}\n"
llm_calls = execution_trace.get("llm_calls", [])
if not llm_calls or len(llm_calls) < 2:
@@ -83,6 +87,8 @@ class ReasoningEfficiencyEvaluator(BaseEvaluator):
call_samples = self._get_call_samples(llm_calls)
final_output = final_output.raw if isinstance(final_output, TaskOutput) else final_output
prompt = [
{"role": "system", "content": """You are an expert evaluator assessing the reasoning efficiency of an AI agent's thought process.
@@ -117,7 +123,7 @@ Return your evaluation as JSON with the following structure:
}"""},
{"role": "user", "content": f"""
Agent role: {agent.role}
Task description: {task.description}
{task_context}
Reasoning efficiency metrics:
- Total LLM calls: {efficiency_metrics["total_llm_calls"]}
@@ -130,7 +136,7 @@ Sample of agent reasoning flow (chronological sequence):
{call_samples}
Agent's final output:
{final_output.raw[:500]}... (truncated)
{final_output[:500]}... (truncated)
Evaluate the reasoning efficiency of this agent based on these interaction patterns.
Identify any inefficient reasoning patterns and provide specific suggestions for optimization.

View File

@@ -3,8 +3,8 @@ from typing import Any, Dict
from crewai.agent import Agent
from crewai.task import Task
from crewai.evaluation.base_evaluator import BaseEvaluator, EvaluationScore, MetricCategory
from crewai.evaluation.json_parser import extract_json_from_llm_response
from crewai.experimental.evaluation.base_evaluator import BaseEvaluator, EvaluationScore, MetricCategory
from crewai.experimental.evaluation.json_parser import extract_json_from_llm_response
class SemanticQualityEvaluator(BaseEvaluator):
@property
@@ -14,10 +14,13 @@ class SemanticQualityEvaluator(BaseEvaluator):
def evaluate(
self,
agent: Agent,
task: Task,
execution_trace: Dict[str, Any],
final_output: Any,
task: Task | None = None,
) -> EvaluationScore:
task_context = ""
if task is not None:
task_context = f"Task description: {task.description}"
prompt = [
{"role": "system", "content": """You are an expert evaluator assessing the semantic quality of an AI agent's output.
@@ -37,7 +40,7 @@ Return your evaluation as JSON with fields 'score' (number) and 'feedback' (stri
"""},
{"role": "user", "content": f"""
Agent role: {agent.role}
Task description: {task.description}
{task_context}
Agent's final output:
{final_output}

View File

@@ -1,8 +1,8 @@
import json
from typing import Dict, Any
from crewai.evaluation.base_evaluator import BaseEvaluator, EvaluationScore, MetricCategory
from crewai.evaluation.json_parser import extract_json_from_llm_response
from crewai.experimental.evaluation.base_evaluator import BaseEvaluator, EvaluationScore, MetricCategory
from crewai.experimental.evaluation.json_parser import extract_json_from_llm_response
from crewai.agent import Agent
from crewai.task import Task
@@ -16,10 +16,14 @@ class ToolSelectionEvaluator(BaseEvaluator):
def evaluate(
self,
agent: Agent,
task: Task,
execution_trace: Dict[str, Any],
final_output: str,
task: Task | None = None,
) -> EvaluationScore:
task_context = ""
if task is not None:
task_context = f"Task description: {task.description}"
tool_uses = execution_trace.get("tool_uses", [])
tool_count = len(tool_uses)
unique_tool_types = set([tool.get("tool", "Unknown tool") for tool in tool_uses])
@@ -72,7 +76,7 @@ Return your evaluation as JSON with these fields:
"""},
{"role": "user", "content": f"""
Agent role: {agent.role}
Task description: {task.description}
{task_context}
Available tools for this agent:
{available_tools_info}
@@ -128,10 +132,13 @@ class ParameterExtractionEvaluator(BaseEvaluator):
def evaluate(
self,
agent: Agent,
task: Task,
execution_trace: Dict[str, Any],
final_output: str,
task: Task | None = None,
) -> EvaluationScore:
task_context = ""
if task is not None:
task_context = f"Task description: {task.description}"
tool_uses = execution_trace.get("tool_uses", [])
tool_count = len(tool_uses)
@@ -212,7 +219,7 @@ Return your evaluation as JSON with these fields:
"""},
{"role": "user", "content": f"""
Agent role: {agent.role}
Task description: {task.description}
{task_context}
Parameter extraction examples:
{param_samples_text}
@@ -267,10 +274,13 @@ class ToolInvocationEvaluator(BaseEvaluator):
def evaluate(
self,
agent: Agent,
task: Task,
execution_trace: Dict[str, Any],
final_output: str,
task: Task | None = None,
) -> EvaluationScore:
task_context = ""
if task is not None:
task_context = f"Task description: {task.description}"
tool_uses = execution_trace.get("tool_uses", [])
tool_errors = []
tool_count = len(tool_uses)
@@ -352,7 +362,7 @@ Return your evaluation as JSON with these fields:
"""},
{"role": "user", "content": f"""
Agent role: {agent.role}
Task description: {task.description}
{task_context}
Tool invocation examples:
{invocation_samples_text}

View File

@@ -0,0 +1,52 @@
import inspect
from typing_extensions import Any
import warnings
from crewai.experimental.evaluation.experiment import ExperimentResults, ExperimentRunner
from crewai import Crew, Agent
def assert_experiment_successfully(experiment_results: ExperimentResults, baseline_filepath: str | None = None) -> None:
failed_tests = [result for result in experiment_results.results if not result.passed]
if failed_tests:
detailed_failures: list[str] = []
for result in failed_tests:
expected = result.expected_score
actual = result.score
detailed_failures.append(f"- {result.identifier}: expected {expected}, got {actual}")
failure_details = "\n".join(detailed_failures)
raise AssertionError(f"The following test cases failed:\n{failure_details}")
baseline_filepath = baseline_filepath or _get_baseline_filepath_fallback()
comparison = experiment_results.compare_with_baseline(baseline_filepath=baseline_filepath)
assert_experiment_no_regression(comparison)
def assert_experiment_no_regression(comparison_result: dict[str, list[str]]) -> None:
regressed = comparison_result.get("regressed", [])
if regressed:
raise AssertionError(f"Regression detected! The following tests that previously passed now fail: {regressed}")
missing_tests = comparison_result.get("missing_tests", [])
if missing_tests:
warnings.warn(
f"Warning: {len(missing_tests)} tests from the baseline are missing in the current run: {missing_tests}",
UserWarning
)
def run_experiment(dataset: list[dict[str, Any]], crew: Crew | None = None, agents: list[Agent] | None = None, verbose: bool = False) -> ExperimentResults:
runner = ExperimentRunner(dataset=dataset)
return runner.run(agents=agents, crew=crew, print_summary=verbose)
def _get_baseline_filepath_fallback() -> str:
test_func_name = "experiment_fallback"
try:
current_frame = inspect.currentframe()
if current_frame is not None:
test_func_name = current_frame.f_back.f_back.f_code.co_name # type: ignore[union-attr]
except Exception:
...
return f"{test_func_name}_results.json"

View File

@@ -436,6 +436,7 @@ class Flow(Generic[T], metaclass=FlowMeta):
_routers: Set[str] = set()
_router_paths: Dict[str, List[str]] = {}
initial_state: Union[Type[T], T, None] = None
name: Optional[str] = None
def __class_getitem__(cls: Type["Flow"], item: Type[T]) -> Type["Flow"]:
class _FlowGeneric(cls): # type: ignore
@@ -473,7 +474,7 @@ class Flow(Generic[T], metaclass=FlowMeta):
self,
FlowCreatedEvent(
type="flow_created",
flow_name=self.__class__.__name__,
flow_name=self.name or self.__class__.__name__,
),
)
@@ -769,7 +770,7 @@ class Flow(Generic[T], metaclass=FlowMeta):
self,
FlowStartedEvent(
type="flow_started",
flow_name=self.__class__.__name__,
flow_name=self.name or self.__class__.__name__,
inputs=inputs,
),
)
@@ -792,7 +793,7 @@ class Flow(Generic[T], metaclass=FlowMeta):
self,
FlowFinishedEvent(
type="flow_finished",
flow_name=self.__class__.__name__,
flow_name=self.name or self.__class__.__name__,
result=final_output,
),
)
@@ -834,7 +835,7 @@ class Flow(Generic[T], metaclass=FlowMeta):
MethodExecutionStartedEvent(
type="method_execution_started",
method_name=method_name,
flow_name=self.__class__.__name__,
flow_name=self.name or self.__class__.__name__,
params=dumped_params,
state=self._copy_state(),
),
@@ -856,7 +857,7 @@ class Flow(Generic[T], metaclass=FlowMeta):
MethodExecutionFinishedEvent(
type="method_execution_finished",
method_name=method_name,
flow_name=self.__class__.__name__,
flow_name=self.name or self.__class__.__name__,
state=self._copy_state(),
result=result,
),
@@ -869,7 +870,7 @@ class Flow(Generic[T], metaclass=FlowMeta):
MethodExecutionFailedEvent(
type="method_execution_failed",
method_name=method_name,
flow_name=self.__class__.__name__,
flow_name=self.name or self.__class__.__name__,
error=e,
),
)
@@ -1076,7 +1077,7 @@ class Flow(Generic[T], metaclass=FlowMeta):
self,
FlowPlotEvent(
type="flow_plot",
flow_name=self.__class__.__name__,
flow_name=self.name or self.__class__.__name__,
),
)
plot_flow(self, filename)

View File

@@ -1,55 +0,0 @@
from abc import ABC, abstractmethod
from typing import List
import numpy as np
class BaseEmbedder(ABC):
"""
Abstract base class for text embedding models
"""
@abstractmethod
def embed_chunks(self, chunks: List[str]) -> np.ndarray:
"""
Generate embeddings for a list of text chunks
Args:
chunks: List of text chunks to embed
Returns:
Array of embeddings
"""
pass
@abstractmethod
def embed_texts(self, texts: List[str]) -> np.ndarray:
"""
Generate embeddings for a list of texts
Args:
texts: List of texts to embed
Returns:
Array of embeddings
"""
pass
@abstractmethod
def embed_text(self, text: str) -> np.ndarray:
"""
Generate embedding for a single text
Args:
text: Text to embed
Returns:
Embedding array
"""
pass
@property
@abstractmethod
def dimension(self) -> int:
"""Get the dimension of the embeddings"""
pass

View File

@@ -13,11 +13,12 @@ from chromadb.api.types import OneOrMany
from chromadb.config import Settings
from crewai.knowledge.storage.base_knowledge_storage import BaseKnowledgeStorage
from crewai.utilities import EmbeddingConfigurator
from crewai.rag.embeddings.configurator import EmbeddingConfigurator
from crewai.utilities.chromadb import sanitize_collection_name
from crewai.utilities.constants import KNOWLEDGE_DIRECTORY
from crewai.utilities.logger import Logger
from crewai.utilities.paths import db_storage_path
from crewai.utilities.chromadb import create_persistent_client
@contextlib.contextmanager
@@ -84,14 +85,11 @@ class KnowledgeStorage(BaseKnowledgeStorage):
raise Exception("Collection not initialized")
def initialize_knowledge_storage(self):
base_path = os.path.join(db_storage_path(), "knowledge")
chroma_client = chromadb.PersistentClient(
path=base_path,
self.app = create_persistent_client(
path=os.path.join(db_storage_path(), "knowledge"),
settings=Settings(allow_reset=True),
)
self.app = chroma_client
try:
collection_name = (
f"knowledge_{self.collection_name}"
@@ -111,9 +109,8 @@ class KnowledgeStorage(BaseKnowledgeStorage):
def reset(self):
base_path = os.path.join(db_storage_path(), KNOWLEDGE_DIRECTORY)
if not self.app:
self.app = chromadb.PersistentClient(
path=base_path,
settings=Settings(allow_reset=True),
self.app = create_persistent_client(
path=base_path, settings=Settings(allow_reset=True)
)
self.app.reset()

View File

@@ -305,6 +305,7 @@ class LiteAgent(FlowTrackable, BaseModel):
"""
# Create agent info for event emission
agent_info = {
"id": self.id,
"role": self.role,
"goal": self.goal,
"backstory": self.backstory,

View File

@@ -59,6 +59,7 @@ from crewai.utilities.exceptions.context_window_exceeding_exception import (
load_dotenv()
litellm.suppress_debug_info = True
class FilteredStream(io.TextIOBase):
_lock = None
@@ -76,9 +77,7 @@ class FilteredStream(io.TextIOBase):
# Skip common noisy LiteLLM banners and any other lines that contain "litellm"
if (
"give feedback / get help" in lower_s
or "litellm.info:" in lower_s
or "litellm" in lower_s
"litellm.info:" in lower_s
or "Consider using a smaller input or implementing a text splitting strategy" in lower_s
):
return 0
@@ -309,6 +308,7 @@ class LLM(BaseLLM):
api_version: Optional[str] = None,
api_key: Optional[str] = None,
callbacks: List[Any] = [],
reasoning: Optional[bool] = None,
reasoning_effort: Optional[Literal["none", "low", "medium", "high"]] = None,
stream: bool = False,
**kwargs,
@@ -333,6 +333,7 @@ class LLM(BaseLLM):
self.api_key = api_key
self.callbacks = callbacks
self.context_window_size = 0
self.reasoning = reasoning
self.reasoning_effort = reasoning_effort
self.additional_params = kwargs
self.is_anthropic = self._is_anthropic_model(model)
@@ -407,10 +408,15 @@ class LLM(BaseLLM):
"api_key": self.api_key,
"stream": self.stream,
"tools": tools,
"reasoning_effort": self.reasoning_effort,
**self.additional_params,
}
if self.reasoning is False:
# When reasoning is explicitly disabled, don't include reasoning_effort
pass
elif self.reasoning is True or self.reasoning_effort is not None:
params["reasoning_effort"] = self.reasoning_effort
# Remove None values from params
return {k: v for k, v in params.items() if v is not None}
@@ -760,7 +766,7 @@ class LLM(BaseLLM):
available_functions: Optional[Dict[str, Any]] = None,
from_task: Optional[Any] = None,
from_agent: Optional[Any] = None,
) -> str:
) -> str | Any:
"""Handle a non-streaming response from the LLM.
Args:
@@ -784,13 +790,11 @@ class LLM(BaseLLM):
# Convert litellm's context window error to our own exception type
# for consistent handling in the rest of the codebase
raise LLMContextLengthExceededException(str(e))
# --- 2) Extract response message and content
response_message = cast(Choices, cast(ModelResponse, response).choices)[
0
].message
text_response = response_message.content or ""
# --- 3) Handle callbacks with usage info
if callbacks and len(callbacks) > 0:
for callback in callbacks:
@@ -803,21 +807,22 @@ class LLM(BaseLLM):
start_time=0,
end_time=0,
)
# --- 4) Check for tool calls
tool_calls = getattr(response_message, "tool_calls", [])
# --- 5) If no tool calls or no available functions, return the text response directly
if not tool_calls or not available_functions:
# --- 5) If no tool calls or no available functions, return the text response directly as long as there is a text response
if (not tool_calls or not available_functions) and text_response:
self._handle_emit_call_events(response=text_response, call_type=LLMCallType.LLM_CALL, from_task=from_task, from_agent=from_agent, messages=params["messages"])
return text_response
# --- 6) If there is no text response, no available functions, but there are tool calls, return the tool calls
elif tool_calls and not available_functions and not text_response:
return tool_calls
# --- 6) Handle tool calls if present
# --- 7) Handle tool calls if present
tool_result = self._handle_tool_call(tool_calls, available_functions)
if tool_result is not None:
return tool_result
# --- 7) If tool call handling didn't return a result, emit completion event and return text response
# --- 8) If tool call handling didn't return a result, emit completion event and return text response
self._handle_emit_call_events(response=text_response, call_type=LLMCallType.LLM_CALL, from_task=from_task, from_agent=from_agent, messages=params["messages"])
return text_response
@@ -952,22 +957,18 @@ class LLM(BaseLLM):
# --- 3) Convert string messages to proper format if needed
if isinstance(messages, str):
messages = [{"role": "user", "content": messages}]
# --- 4) Handle O1 model special case (system messages not supported)
if "o1" in self.model.lower():
for message in messages:
if message.get("role") == "system":
message["role"] = "assistant"
# --- 5) Set up callbacks if provided
with suppress_warnings():
if callbacks and len(callbacks) > 0:
self.set_callbacks(callbacks)
try:
# --- 6) Prepare parameters for the completion call
params = self._prepare_completion_params(messages, tools)
# --- 7) Make the completion call and handle response
if self.stream:
return self._handle_streaming_response(
@@ -984,12 +985,32 @@ class LLM(BaseLLM):
# whether to summarize the content or abort based on the respect_context_window flag
raise
except Exception as e:
unsupported_stop = "Unsupported parameter" in str(e) and "'stop'" in str(e)
if unsupported_stop:
if "additional_drop_params" in self.additional_params and isinstance(self.additional_params["additional_drop_params"], list):
self.additional_params["additional_drop_params"].append("stop")
else:
self.additional_params = {"additional_drop_params": ["stop"]}
logging.info(
"Retrying LLM call without the unsupported 'stop'"
)
return self.call(
messages,
tools=tools,
callbacks=callbacks,
available_functions=available_functions,
from_task=from_task,
from_agent=from_agent,
)
assert hasattr(crewai_event_bus, "emit")
crewai_event_bus.emit(
self,
event=LLMCallFailedEvent(error=str(e), from_task=from_task, from_agent=from_agent),
)
logging.error(f"LiteLLM call failed: {str(e)}")
raise
def _handle_emit_call_events(self, response: Any, call_type: LLMCallType, from_task: Optional[Any] = None, from_agent: Optional[Any] = None, messages: str | list[dict[str, Any]] | None = None):
@@ -1058,6 +1079,15 @@ class LLM(BaseLLM):
messages.append({"role": "user", "content": "Please continue."})
return messages
# TODO: Remove this code after merging PR https://github.com/BerriAI/litellm/pull/10917
# Ollama doesn't supports last message to be 'assistant'
if "ollama" in self.model.lower() and messages and messages[-1]["role"] == "assistant":
messages = messages.copy()
messages.append(
{"role": "user", "content": ""}
)
return messages
# Handle Anthropic models
if not self.is_anthropic:
return messages

View File

@@ -108,6 +108,7 @@ class ContextualMemory:
def _fetch_user_context(self, query: str) -> str:
"""
DEPRECATED: Will be removed in version 0.156.0 or on 2025-08-04, whichever comes first.
Fetches and formats relevant user information from User Memory.
Args:
query (str): The search query to find relevant user memories.

View File

@@ -4,7 +4,6 @@ from typing import Any, Dict, List
from mem0 import Memory, MemoryClient
from crewai.memory.storage.interface import Storage
from crewai.utilities.chromadb import sanitize_collection_name
MAX_AGENT_ID_LENGTH_MEM0 = 255
@@ -13,135 +12,150 @@ class Mem0Storage(Storage):
"""
Extends Storage to handle embedding and searching across entities using Mem0.
"""
def __init__(self, type, crew=None, config=None):
super().__init__()
supported_types = ["user", "short_term", "long_term", "entities", "external"]
if type not in supported_types:
raise ValueError(
f"Invalid type '{type}' for Mem0Storage. Must be one of: "
+ ", ".join(supported_types)
)
self._validate_type(type)
self.memory_type = type
self.crew = crew
self.config = config or {}
# TODO: Memory config will be removed in the future the config will be passed as a parameter
self.memory_config = self.config or getattr(crew, "memory_config", {}) or {}
# User ID is required for user memory type "user" since it's used as a unique identifier for the user.
user_id = self._get_user_id()
if type == "user" and not user_id:
# TODO: Memory config will be removed in the future the config will be passed as a parameter
self.config = config or getattr(crew, "memory_config", {}).get("config", {}) or {}
self._validate_user_id()
self._extract_config_values()
self._initialize_memory()
def _validate_type(self, type):
supported_types = {"user", "short_term", "long_term", "entities", "external"}
if type not in supported_types:
raise ValueError(
f"Invalid type '{type}' for Mem0Storage. Must be one of: {', '.join(supported_types)}"
)
def _validate_user_id(self):
if self.memory_type == "user" and not self.config.get("user_id", ""):
raise ValueError("User ID is required for user memory type")
# API key in memory config overrides the environment variable
config = self._get_config()
mem0_api_key = config.get("api_key") or os.getenv("MEM0_API_KEY")
mem0_org_id = config.get("org_id")
mem0_project_id = config.get("project_id")
mem0_local_config = config.get("local_mem0_config")
def _extract_config_values(self):
cfg = self.config
self.mem0_run_id = cfg.get("run_id")
self.includes = cfg.get("includes")
self.excludes = cfg.get("excludes")
self.custom_categories = cfg.get("custom_categories")
self.infer = cfg.get("infer", True)
# Initialize MemoryClient or Memory based on the presence of the mem0_api_key
if mem0_api_key:
if mem0_org_id and mem0_project_id:
self.memory = MemoryClient(
api_key=mem0_api_key, org_id=mem0_org_id, project_id=mem0_project_id
)
else:
self.memory = MemoryClient(api_key=mem0_api_key)
def _initialize_memory(self):
api_key = self.config.get("api_key") or os.getenv("MEM0_API_KEY")
org_id = self.config.get("org_id")
project_id = self.config.get("project_id")
local_config = self.config.get("local_mem0_config")
if api_key:
self.memory = (
MemoryClient(api_key=api_key, org_id=org_id, project_id=project_id)
if org_id and project_id
else MemoryClient(api_key=api_key)
)
if self.custom_categories:
self.memory.update_project(custom_categories=self.custom_categories)
else:
if mem0_local_config and len(mem0_local_config):
self.memory = Memory.from_config(mem0_local_config)
else:
self.memory = Memory()
self.memory = (
Memory.from_config(local_config)
if local_config and len(local_config)
else Memory()
)
def _sanitize_role(self, role: str) -> str:
def _create_filter_for_search(self):
"""
Sanitizes agent roles to ensure valid directory names.
Returns:
dict: A filter dictionary containing AND conditions for querying data.
- Includes user_id if memory_type is 'external'.
- Includes run_id if memory_type is 'short_term' and mem0_run_id is present.
"""
return role.replace("\n", "").replace(" ", "_").replace("/", "_")
filter = {
"AND": []
}
# Add user_id condition if the memory type is external
if self.memory_type == "external":
filter["AND"].append({"user_id": self.config.get("user_id", "")})
# Add run_id condition if the memory type is short_term and a run ID is set
if self.memory_type == "short_term" and self.mem0_run_id:
filter["AND"].append({"run_id": self.mem0_run_id})
return filter
def save(self, value: Any, metadata: Dict[str, Any]) -> None:
user_id = self._get_user_id()
agent_name = self._get_agent_name()
params = None
if self.memory_type == "short_term":
params = {
"agent_id": agent_name,
"infer": False,
"metadata": {"type": "short_term", **metadata},
}
elif self.memory_type == "long_term":
params = {
"agent_id": agent_name,
"infer": False,
"metadata": {"type": "long_term", **metadata},
}
elif self.memory_type == "entities":
params = {
"agent_id": agent_name,
"infer": False,
"metadata": {"type": "entity", **metadata},
}
elif self.memory_type == "external":
params = {
"user_id": user_id,
"agent_id": agent_name,
"metadata": {"type": "external", **metadata},
}
user_id = self.config.get("user_id", "")
assistant_message = [{"role" : "assistant","content" : value}]
if params:
if isinstance(self.memory, MemoryClient):
params["output_format"] = "v1.1"
self.memory.add(value, **params)
base_metadata = {
"short_term": "short_term",
"long_term": "long_term",
"entities": "entity",
"external": "external"
}
def search(
self,
query: str,
limit: int = 3,
score_threshold: float = 0.35,
) -> List[Any]:
params = {"query": query, "limit": limit, "output_format": "v1.1"}
if user_id := self._get_user_id():
# Shared base params
params: dict[str, Any] = {
"metadata": {"type": base_metadata[self.memory_type], **metadata},
"infer": self.infer
}
if self.memory_type == "external":
params["user_id"] = user_id
agent_name = self._get_agent_name()
if self.memory_type == "short_term":
params["agent_id"] = agent_name
params["metadata"] = {"type": "short_term"}
elif self.memory_type == "long_term":
params["agent_id"] = agent_name
params["metadata"] = {"type": "long_term"}
elif self.memory_type == "entities":
params["agent_id"] = agent_name
params["metadata"] = {"type": "entity"}
elif self.memory_type == "external":
params["agent_id"] = agent_name
params["metadata"] = {"type": "external"}
if params:
# MemoryClient-specific overrides
if isinstance(self.memory, MemoryClient):
params["includes"] = self.includes
params["excludes"] = self.excludes
params["output_format"] = "v1.1"
params["version"]="v2"
if self.memory_type == "short_term":
params["run_id"] = self.mem0_run_id
self.memory.add(assistant_message, **params)
def search(self,query: str,limit: int = 3,score_threshold: float = 0.35) -> List[Any]:
params = {
"query": query,
"limit": limit,
"version": "v2",
"output_format": "v1.1"
}
if user_id := self.config.get("user_id", ""):
params["user_id"] = user_id
memory_type_map = {
"short_term": {"type": "short_term"},
"long_term": {"type": "long_term"},
"entities": {"type": "entity"},
"external": {"type": "external"},
}
if self.memory_type in memory_type_map:
params["metadata"] = memory_type_map[self.memory_type]
if self.memory_type == "short_term":
params["run_id"] = self.mem0_run_id
# Discard the filters for now since we create the filters
# automatically when the crew is created.
params["filters"] = self._create_filter_for_search()
params['threshold'] = score_threshold
if isinstance(self.memory, Memory):
del params["metadata"], params["output_format"]
del params["metadata"], params["version"], params["run_id"], params['output_format']
results = self.memory.search(**params)
return [r for r in results["results"] if r["score"] >= score_threshold]
def _get_user_id(self) -> str:
return self._get_config().get("user_id", "")
def _get_agent_name(self) -> str:
if not self.crew:
return ""
agents = self.crew.agents
agents = [self._sanitize_role(agent.role) for agent in agents]
agents = "_".join(agents)
return sanitize_collection_name(name=agents,max_collection_length=MAX_AGENT_ID_LENGTH_MEM0)
def _get_config(self) -> Dict[str, Any]:
return self.config or getattr(self, "memory_config", {}).get("config", {}) or {}
return [r for r in results["results"]]
def reset(self):
if self.memory:
self.memory.reset()

View File

@@ -4,12 +4,12 @@ import logging
import os
import shutil
import uuid
from typing import Any, Dict, List, Optional
from chromadb.api import ClientAPI
from crewai.memory.storage.base_rag_storage import BaseRAGStorage
from crewai.utilities import EmbeddingConfigurator
from crewai.rag.storage.base_rag_storage import BaseRAGStorage
from crewai.rag.embeddings.configurator import EmbeddingConfigurator
from crewai.utilities.chromadb import create_persistent_client
from crewai.utilities.constants import MAX_FILE_NAME_LENGTH
from crewai.utilities.paths import db_storage_path
@@ -60,17 +60,15 @@ class RAGStorage(BaseRAGStorage):
self.embedder_config = configurator.configure_embedder(self.embedder_config)
def _initialize_app(self):
import chromadb
from chromadb.config import Settings
self._set_embedder_config()
chroma_client = chromadb.PersistentClient(
self.app = create_persistent_client(
path=self.path if self.path else self.storage_file_name,
settings=Settings(allow_reset=self.allow_reset),
)
self.app = chroma_client
self.collection = self.app.get_or_create_collection(
name=self.type, embedding_function=self.embedder_config
)

View File

@@ -14,7 +14,8 @@ class UserMemory(Memory):
def __init__(self, crew=None):
warnings.warn(
"UserMemory is deprecated and will be removed in a future version. "
"UserMemory is deprecated and will be removed in version 0.156.0 "
"or on 2025-08-04, whichever comes first. "
"Please use ExternalMemory instead.",
DeprecationWarning,
stacklevel=2,

View File

@@ -1,8 +1,16 @@
import warnings
from typing import Any, Dict, Optional
class UserMemoryItem:
def __init__(self, data: Any, user: str, metadata: Optional[Dict[str, Any]] = None):
warnings.warn(
"UserMemoryItem is deprecated and will be removed in version 0.156.0 "
"or on 2025-08-04, whichever comes first. "
"Please use ExternalMemory instead.",
DeprecationWarning,
stacklevel=2,
)
self.data = data
self.user = user
self.metadata = metadata if metadata is not None else {}

View File

@@ -0,0 +1 @@
"""RAG (Retrieval-Augmented Generation) infrastructure for CrewAI."""

View File

@@ -0,0 +1 @@
"""Embedding components for RAG infrastructure."""

View File

@@ -38,7 +38,14 @@ class EmbeddingConfigurator:
f"Unsupported embedding provider: {provider}, supported providers: {list(self.embedding_functions.keys())}"
)
embedding_function = self.embedding_functions[provider]
try:
embedding_function = self.embedding_functions[provider]
except ImportError as e:
missing_package = str(e).split()[-1]
raise ImportError(
f"{missing_package} is not installed. Please install it with: pip install {missing_package}"
)
return (
embedding_function(config)
if provider == "custom"

View File

@@ -0,0 +1 @@
"""Storage components for RAG infrastructure."""

View File

@@ -10,7 +10,6 @@ from .rpm_controller import RPMController
from .exceptions.context_window_exceeding_exception import (
LLMContextLengthExceededException,
)
from .embedding_configurator import EmbeddingConfigurator
__all__ = [
"Converter",
@@ -24,5 +23,4 @@ __all__ = [
"RPMController",
"YamlParser",
"LLMContextLengthExceededException",
"EmbeddingConfigurator",
]

View File

@@ -157,10 +157,6 @@ def get_llm_response(
from_agent=from_agent,
)
except Exception as e:
printer.print(
content=f"Error during LLM call: {e}",
color="red",
)
raise e
if not answer:
printer.print(
@@ -232,12 +228,17 @@ def handle_unknown_error(printer: Any, exception: Exception) -> None:
printer: Printer instance for output
exception: The exception that occurred
"""
error_message = str(exception)
if "litellm" in error_message:
return
printer.print(
content="An unknown error occurred. Please check the details below.",
color="red",
)
printer.print(
content=f"Error details: {exception}",
content=f"Error details: {error_message}",
color="red",
)

View File

@@ -1,6 +1,10 @@
import re
import portalocker
from chromadb import PersistentClient
from hashlib import md5
from typing import Optional
MIN_COLLECTION_LENGTH = 3
MAX_COLLECTION_LENGTH = 63
DEFAULT_COLLECTION = "default_collection"
@@ -60,3 +64,16 @@ def sanitize_collection_name(name: Optional[str], max_collection_length: int = M
sanitized = sanitized[:-1] + "z"
return sanitized
def create_persistent_client(path: str, **kwargs):
"""
Creates a persistent client for ChromaDB with a lock file to prevent
concurrent creations. Works for both multi-threads and multi-processes
environments.
"""
lockfile = f"chromadb-{md5(path.encode(), usedforsecurity=False).hexdigest()}.lock"
with portalocker.Lock(lockfile):
client = PersistentClient(path=path, **kwargs)
return client

View File

@@ -17,6 +17,9 @@ from .agent_events import (
AgentExecutionStartedEvent,
AgentExecutionCompletedEvent,
AgentExecutionErrorEvent,
AgentEvaluationStartedEvent,
AgentEvaluationCompletedEvent,
AgentEvaluationFailedEvent,
)
from .task_events import (
TaskStartedEvent,
@@ -74,6 +77,9 @@ __all__ = [
"AgentExecutionStartedEvent",
"AgentExecutionCompletedEvent",
"AgentExecutionErrorEvent",
"AgentEvaluationStartedEvent",
"AgentEvaluationCompletedEvent",
"AgentEvaluationFailedEvent",
"TaskStartedEvent",
"TaskCompletedEvent",
"TaskFailedEvent",

View File

@@ -123,3 +123,28 @@ class AgentLogsExecutionEvent(BaseEvent):
type: str = "agent_logs_execution"
model_config = {"arbitrary_types_allowed": True}
# Agent Eval events
class AgentEvaluationStartedEvent(BaseEvent):
agent_id: str
agent_role: str
task_id: str | None = None
iteration: int
type: str = "agent_evaluation_started"
class AgentEvaluationCompletedEvent(BaseEvent):
agent_id: str
agent_role: str
task_id: str | None = None
iteration: int
metric_category: Any
score: Any
type: str = "agent_evaluation_completed"
class AgentEvaluationFailedEvent(BaseEvent):
agent_id: str
agent_role: str
task_id: str | None = None
iteration: int
error: str
type: str = "agent_evaluation_failed"

View File

@@ -1,6 +1,5 @@
from datetime import datetime
from datetime import datetime, timezone
from typing import Any, Dict, Optional
from pydantic import BaseModel, Field
from crewai.utilities.serialization import to_serializable
@@ -9,7 +8,7 @@ from crewai.utilities.serialization import to_serializable
class BaseEvent(BaseModel):
"""Base class for all events"""
timestamp: datetime = Field(default_factory=datetime.now)
timestamp: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))
type: str
source_fingerprint: Optional[str] = None # UUID string of the source entity
source_type: Optional[str] = None # "agent", "task", "crew", "memory", "entity_memory", "short_term_memory", "long_term_memory", "external_memory"

View File

@@ -4,6 +4,7 @@ from .agent_events import (
AgentExecutionCompletedEvent,
AgentExecutionErrorEvent,
AgentExecutionStartedEvent,
LiteAgentExecutionCompletedEvent,
)
from .crew_events import (
CrewKickoffCompletedEvent,
@@ -80,6 +81,7 @@ EventTypes = Union[
CrewTrainFailedEvent,
AgentExecutionStartedEvent,
AgentExecutionCompletedEvent,
LiteAgentExecutionCompletedEvent,
TaskStartedEvent,
TaskCompletedEvent,
TaskFailedEvent,

View File

@@ -2,6 +2,7 @@
import json
import pytest
from unittest.mock import patch
from crewai import Agent, Task
from crewai.llm import LLM
@@ -259,3 +260,31 @@ def test_agent_with_function_calling_fallback():
assert result == "4"
assert "Reasoning Plan:" in task.description
assert "Invalid JSON that will trigger fallback" in task.description
def test_agent_with_llm_reasoning_disabled():
"""Test agent with LLM reasoning disabled."""
llm = LLM("gpt-3.5-turbo", reasoning=False)
agent = Agent(
role="Test Agent",
goal="To test the LLM reasoning parameter",
backstory="I am a test agent created to verify the LLM reasoning parameter works correctly.",
llm=llm,
reasoning=False,
verbose=True
)
task = Task(
description="Simple math task: What's 3+3?",
expected_output="The answer should be a number.",
agent=agent
)
with patch.object(agent.llm, 'call') as mock_call:
mock_call.return_value = "6"
result = agent.execute_task(task)
assert result == "6"
assert "Reasoning Plan:" not in task.description

View File

@@ -2010,7 +2010,6 @@ def test_crew_agent_executor_litellm_auth_error():
from litellm.exceptions import AuthenticationError
from crewai.agents.tools_handler import ToolsHandler
from crewai.utilities import Printer
# Create an agent and executor
agent = Agent(
@@ -2043,7 +2042,6 @@ def test_crew_agent_executor_litellm_auth_error():
# Mock the LLM call to raise AuthenticationError
with (
patch.object(LLM, "call") as mock_llm_call,
patch.object(Printer, "print") as mock_printer,
pytest.raises(AuthenticationError) as exc_info,
):
mock_llm_call.side_effect = AuthenticationError(
@@ -2057,13 +2055,6 @@ def test_crew_agent_executor_litellm_auth_error():
}
)
# Verify error handling messages
error_message = f"Error during LLM call: {str(mock_llm_call.side_effect)}"
mock_printer.assert_any_call(
content=error_message,
color="red",
)
# Verify the call was only made once (no retries)
mock_llm_call.assert_called_once()

View File

@@ -0,0 +1,237 @@
interactions:
- request:
body: '{"messages": [{"role": "system", "content": "You are Test Agent. An agent
created for testing purposes\nYour personal goal is: Complete test tasks successfully\n\nTo
give my best complete final answer to the task respond using the exact following
format:\n\nThought: I now can give a great answer\nFinal Answer: Your final
answer must be the great and the most complete as possible, it must be outcome
described.\n\nI MUST use these formats, my job depends on it!"}, {"role": "user",
"content": "Complete this task successfully"}], "model": "gpt-4o-mini", "stop":
["\nObservation:"]}'
headers:
accept:
- application/json
accept-encoding:
- gzip, deflate, zstd
connection:
- keep-alive
content-length:
- '583'
content-type:
- application/json
host:
- api.openai.com
user-agent:
- OpenAI/Python 1.93.0
x-stainless-arch:
- arm64
x-stainless-async:
- 'false'
x-stainless-lang:
- python
x-stainless-os:
- MacOS
x-stainless-package-version:
- 1.93.0
x-stainless-raw-response:
- 'true'
x-stainless-read-timeout:
- '600.0'
x-stainless-retry-count:
- '0'
x-stainless-runtime:
- CPython
x-stainless-runtime-version:
- 3.11.12
method: POST
uri: https://api.openai.com/v1/chat/completions
response:
body:
string: !!binary |
H4sIAAAAAAAAAwAAAP//jFNNb9swDL3nVxA6J0U+HKTNbd0woMAOw7Bu6LbCUCXa1iqLgkgnzYr8
98FKWqdbB+wiQHx81OMj9TgCUM6qNSjTaDFt9JNL+TZ7N/dfrusPN01NyV6vPk3f/mrl5vLrXI17
Bt39RCNPrDNDbfQojsIBNgm1YF91tlrOl+fzxXKWgZYs+p5WR5kUNGldcJP5dF5MpqvJ7PzIbsgZ
ZLWG7yMAgMd89jqDxQe1hun4KdIis65RrZ+TAFQi30eUZnYsOogaD6ChIBiy9M8NdXUja7iCQFsw
OkDtNgga6l4/6MBbTAA/wnsXtIc3+b6Gjx41I8REG2cRWoStkwakQeCIxlXOgEXRzjNQgvzigwBV
OUU038OOOgiIFhr0MdPHoIOFK9g67wEDdwlBCI7OIjgB7oxB5qrzfpeznxRokIZS3wwk5EiB8ey0
54RVx7r3PXTenwA6BBLdzy27fXtE9s/+eqpjojv+g6oqFxw3ZULNFHovWSiqjO5HALd5jt2L0aiY
qI1SCt1jfu7i4lBODdszgEVxBIVE+yE+KxbjV8qVR79PFkEZbRq0A3XYGt1ZRyfA6KTpv9W8VvvQ
uAv1/5QfAGMwCtoyJrTOvOx4SEvYf65/pT2bnAUrxrRxBktxmPpBWKx05w8rr3jHgm1ZuVBjiskd
9r6K5aLQy0LjxcKo0X70GwAA//8DAMz2wVUFBAAA
headers:
CF-RAY:
- 95f93ea9af627e0b-GRU
Connection:
- keep-alive
Content-Encoding:
- gzip
Content-Type:
- application/json
Date:
- Tue, 15 Jul 2025 12:25:54 GMT
Server:
- cloudflare
Set-Cookie:
- __cf_bm=GRZmZLrjW5ZRHNmUJa4ccrMcy20D1rmeqK6Ptlv0mRY-1752582354-1.0.1.1-xKd_yga48Eedech5TRlThlEpDgsB2whxkWHlCyAGOVMqMcvH1Ju9FdXYbuQ9NdUQcVxPLgiGM35lYhqSLVQiXDyK01dnyp2Gvm560FBN9DY;
path=/; expires=Tue, 15-Jul-25 12:55:54 GMT; domain=.api.openai.com; HttpOnly;
Secure; SameSite=None
- _cfuvid=MYqswpSR7sqr4kGp6qZVkaL7HDYwMiww49PeN9QBP.A-1752582354973-0.0.1.1-604800000;
path=/; domain=.api.openai.com; HttpOnly; Secure; SameSite=None
Transfer-Encoding:
- chunked
X-Content-Type-Options:
- nosniff
access-control-expose-headers:
- X-Request-ID
alt-svc:
- h3=":443"; ma=86400
cf-cache-status:
- DYNAMIC
openai-organization:
- crewai-iuxna1
openai-processing-ms:
- '4047'
openai-version:
- '2020-10-01'
strict-transport-security:
- max-age=31536000; includeSubDomains; preload
x-envoy-upstream-service-time:
- '4440'
x-ratelimit-limit-requests:
- '30000'
x-ratelimit-limit-tokens:
- '150000000'
x-ratelimit-remaining-requests:
- '29999'
x-ratelimit-remaining-tokens:
- '149999885'
x-ratelimit-reset-requests:
- 2ms
x-ratelimit-reset-tokens:
- 0s
x-request-id:
- req_5704c0f206a927ddc12aa1a19b612a75
status:
code: 200
message: OK
- request:
body: '{"messages": [{"role": "system", "content": "You are an expert evaluator
assessing how well an AI agent''s output aligns with its assigned task goal.\n\nScore
the agent''s goal alignment on a scale from 0-10 where:\n- 0: Complete misalignment,
agent did not understand or attempt the task goal\n- 5: Partial alignment, agent
attempted the task but missed key requirements\n- 10: Perfect alignment, agent
fully satisfied all task requirements\n\nConsider:\n1. Did the agent correctly
interpret the task goal?\n2. Did the final output directly address the requirements?\n3.
Did the agent focus on relevant aspects of the task?\n4. Did the agent provide
all requested information or deliverables?\n\nReturn your evaluation as JSON
with fields ''score'' (number) and ''feedback'' (string).\n"}, {"role": "user",
"content": "\nAgent role: Test Agent\nAgent goal: Complete test tasks successfully\n\n\nAgent''s
final output:\nPlease provide me with the specific details or context of the
task you need help with, and I will ensure to complete it successfully and provide
a thorough response.\n\nEvaluate how well the agent''s output aligns with the
assigned task goal.\n"}], "model": "gpt-4o-mini", "stop": []}'
headers:
accept:
- application/json
accept-encoding:
- gzip, deflate, zstd
connection:
- keep-alive
content-length:
- '1196'
content-type:
- application/json
cookie:
- __cf_bm=GRZmZLrjW5ZRHNmUJa4ccrMcy20D1rmeqK6Ptlv0mRY-1752582354-1.0.1.1-xKd_yga48Eedech5TRlThlEpDgsB2whxkWHlCyAGOVMqMcvH1Ju9FdXYbuQ9NdUQcVxPLgiGM35lYhqSLVQiXDyK01dnyp2Gvm560FBN9DY;
_cfuvid=MYqswpSR7sqr4kGp6qZVkaL7HDYwMiww49PeN9QBP.A-1752582354973-0.0.1.1-604800000
host:
- api.openai.com
user-agent:
- OpenAI/Python 1.93.0
x-stainless-arch:
- arm64
x-stainless-async:
- 'false'
x-stainless-lang:
- python
x-stainless-os:
- MacOS
x-stainless-package-version:
- 1.93.0
x-stainless-raw-response:
- 'true'
x-stainless-read-timeout:
- '600.0'
x-stainless-retry-count:
- '0'
x-stainless-runtime:
- CPython
x-stainless-runtime-version:
- 3.11.12
method: POST
uri: https://api.openai.com/v1/chat/completions
response:
body:
string: !!binary |
H4sIAAAAAAAAA4xUy27bQAy8+yuIPdtGbMdN4FvbSxM0QIsEKNA6MJhdSmK82hWWVFwj8L8XKz/k
9AH0ogOHnOFjVq8DAMPOLMDYCtXWjR990O+TT7dfZs/v5OtFy/ef7++mxfu7j83t/cONGeaK+PRM
Vo9VYxvrxpNyDHvYJkKlzDq5mk/n19PZfN4BdXTkc1nZ6OgyjmoOPJpeTC9HF1ejyfWhuopsScwC
fgwAAF67b+4zOPppFnAxPEZqEsGSzOKUBGBS9DliUIRFMagZ9qCNQSl0rb8uA8DSiI2JlmYB0+E+
UBC5J7TrHFuah4oASwoKjh2EqOCojkE0oRIgWE+YoA2OUhZzHEqIBWhFoChrKCP6IWwqthWwgEY4
bItASbRLEpDWWhIpWu+3Y7gJooRuCKyAsiYHRUxQx0TgSJG9DIGDY4ua5RA82nVW5cDKqPxCWYhC
iSXBhrU69TOGbxV7ysxSxY0Awoa951AGkq69/do67QLZk8vBJsUXdgQYtoBWW/SQSJoYpFPq2Ptp
MLjTttC51DFXVIPjRFb9drw0y7A7v0uiohXM3git92cAhhAVs7c6RzwekN3JAz6WTYpP8lupKTiw
VKtEKDHke4vGxnTobgDw2HmtfWMf06RYN7rSuKZObjo7eM30Fu/R6yOoUdH38dnkCLzhWx1ud+ZW
Y9FW5PrS3trYOo5nwOBs6j+7+Rv3fnIO5f/Q94C11Ci5VZPIsX07cZ+WKP8B/pV22nLXsBFKL2xp
pUwpX8JRga3fv0sjW1GqVwWHklKTuHuc+ZKD3eAXAAAA//8DADksFsafBAAA
headers:
CF-RAY:
- 95f93ec73a1c7e0b-GRU
Connection:
- keep-alive
Content-Encoding:
- gzip
Content-Type:
- application/json
Date:
- Tue, 15 Jul 2025 12:25:57 GMT
Server:
- cloudflare
Transfer-Encoding:
- chunked
X-Content-Type-Options:
- nosniff
access-control-expose-headers:
- X-Request-ID
alt-svc:
- h3=":443"; ma=86400
cf-cache-status:
- DYNAMIC
openai-organization:
- crewai-iuxna1
openai-processing-ms:
- '1544'
openai-version:
- '2020-10-01'
strict-transport-security:
- max-age=31536000; includeSubDomains; preload
x-envoy-upstream-service-time:
- '1546'
x-ratelimit-limit-requests:
- '30000'
x-ratelimit-limit-tokens:
- '150000000'
x-ratelimit-remaining-requests:
- '29999'
x-ratelimit-remaining-tokens:
- '149999732'
x-ratelimit-reset-requests:
- 2ms
x-ratelimit-reset-tokens:
- 0s
x-request-id:
- req_44930ba12ad8d1e3f0beed1d5e3d8b0c
status:
code: 200
message: OK
version: 1

File diff suppressed because one or more lines are too long

View File

@@ -427,4 +427,140 @@ interactions:
status:
code: 200
message: OK
- request:
body: '{"messages": [{"role": "system", "content": "You are an expert evaluator
assessing how well an AI agent''s output aligns with its assigned task goal.\n\nScore
the agent''s goal alignment on a scale from 0-10 where:\n- 0: Complete misalignment,
agent did not understand or attempt the task goal\n- 5: Partial alignment, agent
attempted the task but missed key requirements\n- 10: Perfect alignment, agent
fully satisfied all task requirements\n\nConsider:\n1. Did the agent correctly
interpret the task goal?\n2. Did the final output directly address the requirements?\n3.
Did the agent focus on relevant aspects of the task?\n4. Did the agent provide
all requested information or deliverables?\n\nReturn your evaluation as JSON
with fields ''score'' (number) and ''feedback'' (string).\n"}, {"role": "user",
"content": "\nAgent role: Test Agent\nAgent goal: Complete test tasks successfully\nTask
description: Test task description\nExpected output: Expected test output\n\nAgent''s
final output:\nThe expected test output is a comprehensive document that outlines
the specific parameters and criteria that define success for the task at hand.
It should include detailed descriptions of the tasks, the goals that need to
be achieved, and any specific formatting or structural requirements necessary
for the output. Each component of the task must be analyzed and addressed, providing
context as well as examples where applicable. Additionally, any tools or methodologies
that are relevant to executing the tasks successfully should be outlined, including
any potential risks or challenges that may arise during the process. This document
serves as a guiding framework to ensure that all aspects of the task are thoroughly
considered and executed to meet the high standards expected.\n\nEvaluate how
well the agent''s output aligns with the assigned task goal.\n"}], "model":
"gpt-4o-mini", "stop": []}'
headers:
accept:
- application/json
accept-encoding:
- gzip, deflate, zstd
connection:
- keep-alive
content-length:
- '1893'
content-type:
- application/json
cookie:
- _cfuvid=XwsgBfgvDGlKFQ4LiGYGIARIoSNTiwidqoo9UZcc.XY-1752087999227-0.0.1.1-604800000
host:
- api.openai.com
user-agent:
- OpenAI/Python 1.93.0
x-stainless-arch:
- arm64
x-stainless-async:
- 'false'
x-stainless-lang:
- python
x-stainless-os:
- MacOS
x-stainless-package-version:
- 1.93.0
x-stainless-raw-response:
- 'true'
x-stainless-read-timeout:
- '600.0'
x-stainless-retry-count:
- '0'
x-stainless-runtime:
- CPython
x-stainless-runtime-version:
- 3.11.12
method: POST
uri: https://api.openai.com/v1/chat/completions
response:
body:
string: !!binary |
H4sIAAAAAAAAAwAAAP//jFRNbxs5DL37VxA6jwPHddrUxxwWi2BRtEAPRevCYCSOh41GUkWOnTTI
fy8kf4zT5rCXOfCRT4+P5DxNAAw7swRjO1TbJz+90dvFxy//vX0za7dfr29+3eo/n75++Mh0O/za
maZUxLsfZPVYdWFjnzwpx7CHbSZUKqyX767mV/PL2eKqAn105EvZJul0Eac9B57OZ/PFdPZuenl9
qO4iWxKzhG8TAICn+i06g6MHs4RZc4z0JIIbMstTEoDJ0ZeIQREWxaCmGUEbg1Ko0p9WAWBlxMZM
K7OEq2YfaIncHdr7EluZzx0BbigopBy37MgBgiNF9uTAkdjMqbQOsYVdhwraEdBDIqvkIA6aBgXp
4uAdcLB+cNTArmPbAQfHFpUEJPYEQ3CUi2LHYVPoCpOi3EOmnwNn6imoXMC/cUdbyk3FWw7oj8+4
SAIhKkgiyy1b9P4RHHneUn4pTEn0WIYC6YDX5866aqDH+yKHFRJm5cqInjeB3AWM7vQsUgzhTFb9
48GtUlloSwMkZ4bEDMetOaSg1QH9XldVwSrk2wY4iBLWSs/hmG47zGiVMouylZP7WHkzdRSEtwQu
2qH4dhyBjcWKHWsXhzJTEgpVAwagByySirgzRSfLDrtzsTKr8Hy+VJnaQbAsdhi8PwMwhKhYfKzr
/P2APJ8W2MdNyvFO/ig1LQeWbp0JJYayrKIxmYo+TwC+10MZXuy+STn2Sdca76k+92ax2POZ8T5H
9P31AdSo6Mf4YjFvXuFb71dezk7NWLQdubF0vEscHMczYHLW9d9qXuPed85h83/oR8BaSkpunTI5
ti87HtMy/agTfT3t5HIVbITyli2tlSmXSThqcfD7n4qRR1Hq1y2HDeWUuf5ZyiQnz5PfAAAA//8D
AEfUP8BcBQAA
headers:
CF-RAY:
- 95f365f1bfc87ded-GRU
Connection:
- keep-alive
Content-Encoding:
- gzip
Content-Type:
- application/json
Date:
- Mon, 14 Jul 2025 19:24:07 GMT
Server:
- cloudflare
Set-Cookie:
- __cf_bm=PcC3_3T8.MK_WpZlQLdZfwpNv9Pe45AIYmrXOSgJ65E-1752521047-1.0.1.1-eyqwSWfQC7ZV6.JwTsTihK1ZWCrEmxd52CtNcfe.fw1UjjBN9rdTU4G7hRZiNqHQYo4sVZMmgRgqM9k7HRSzN2zln0bKmMiOuSQTZh6xF_I;
path=/; expires=Mon, 14-Jul-25 19:54:07 GMT; domain=.api.openai.com; HttpOnly;
Secure; SameSite=None
- _cfuvid=JvQ1c4qYZefNwOPoVNgAtX8ET7ObU.JKDvGc43LOR6g-1752521047741-0.0.1.1-604800000;
path=/; domain=.api.openai.com; HttpOnly; Secure; SameSite=None
Transfer-Encoding:
- chunked
X-Content-Type-Options:
- nosniff
access-control-expose-headers:
- X-Request-ID
alt-svc:
- h3=":443"; ma=86400
cf-cache-status:
- DYNAMIC
openai-organization:
- crewai-iuxna1
openai-processing-ms:
- '2729'
openai-version:
- '2020-10-01'
strict-transport-security:
- max-age=31536000; includeSubDomains; preload
x-envoy-upstream-service-time:
- '2789'
x-ratelimit-limit-requests:
- '30000'
x-ratelimit-limit-tokens:
- '150000000'
x-ratelimit-remaining-requests:
- '29999'
x-ratelimit-remaining-tokens:
- '149999559'
x-ratelimit-reset-requests:
- 2ms
x-ratelimit-reset-tokens:
- 0s
x-request-id:
- req_74f6e8ff49db25dbea3d3525cc149e8e
status:
code: 200
message: OK
version: 1

View File

@@ -0,0 +1,123 @@
interactions:
- request:
body: '{"messages": [{"role": "system", "content": "You are Test Agent. An agent
created for testing purposes\nYour personal goal is: Complete test tasks successfully\nTo
give my best complete final answer to the task respond using the exact following
format:\n\nThought: I now can give a great answer\nFinal Answer: Your final
answer must be the great and the most complete as possible, it must be outcome
described.\n\nI MUST use these formats, my job depends on it!"}, {"role": "user",
"content": "\nCurrent Task: Test task description\n\nThis is the expected criteria
for your final answer: Expected test output\nyou MUST return the actual complete
content as the final answer, not a summary.\n\nBegin! This is VERY important
to you, use the tools available and give your best Final Answer, your job depends
on it!\n\nThought:"}], "model": "gpt-4o-mini", "stop": ["\nObservation:"]}'
headers:
accept:
- application/json
accept-encoding:
- gzip, deflate, zstd
connection:
- keep-alive
content-length:
- '879'
content-type:
- application/json
host:
- api.openai.com
user-agent:
- OpenAI/Python 1.93.0
x-stainless-arch:
- arm64
x-stainless-async:
- 'false'
x-stainless-lang:
- python
x-stainless-os:
- MacOS
x-stainless-package-version:
- 1.93.0
x-stainless-raw-response:
- 'true'
x-stainless-read-timeout:
- '600.0'
x-stainless-retry-count:
- '0'
x-stainless-runtime:
- CPython
x-stainless-runtime-version:
- 3.11.12
method: POST
uri: https://api.openai.com/v1/chat/completions
response:
body:
string: !!binary |
H4sIAAAAAAAAAwAAAP//jFTBbhtHDL3rK4g5rwRbtaNYt9RoEaNoUaBODm0DgZnh7jKe5WyHXDmO
4X8vZiRLcupDLwvsPPLxPQ45jzMAx8GtwfkezQ9jnP9oeLv98N5+vfl9+4v89Mf76+XV7XDz8Yc/
r39T15SM9PkLeXvOWvg0jJGMk+xgnwmNCuv56nJ5+XZ1tbqswJACxZLWjTa/SPOBhefLs+XF/Gw1
P3+7z+4Te1K3hr9mAACP9Vt0SqCvbg1nzfPJQKrYkVsfggBcTrGcOFRlNRRzzRH0SYykSr8BSffg
UaDjLQFCV2QDit5TBvhbfmbBCO/q/xpue1ZgBesJ6OtI3iiAkRqkycbJGrjv2ffgk5S6CqkFhECG
HClAIPWZx9Kkgtz3aJVq37vChXoH2qcpBogp3UHkO1rAbU/QViW7Os8hLD5OgQBjBCFfOpEfgKVN
ecBSpoFAQxK1jMbSgY+Y2R6aWjJTT6K8JSHVBlACYOgpk3gCS4DyADqS55YpQDdxoMhCuoCbgwKf
tpSB0PeAJdaKseKpOsn0z8SZBhJrgESnXERY8S0JRsxWulkoilkKkDJ0JJQx8jcKi13DX3pWyuWm
FPDQN8jU7mW3KRfdSaj2r5ZLMEmgXOYg7K5OlcQYI1Cs4vSFavSVmLWnsDgdnEztpFiGV6YYTwAU
SVYbXkf20x55OgxpTN2Y02f9LtW1LKz9JhNqkjKQaml0FX2aAXyqyzC9mG835jSMtrF0R7Xc+Zvz
HZ877uARvXqzBy0ZxuP58nLVvMK32Q2rnqyT8+h7CsfU4+7hFDidALMT1/9V8xr3zjlL93/oj4D3
NBqFzZgpsH/p+BiW6Utd0dfDDl2ugl2ZK/a0MaZcbiJQi1PcPRxOH9Ro2LQsHeUxc309yk3Onmb/
AgAA//8DAAbYfvVABQAA
headers:
CF-RAY:
- 95f9c7ffa8331b11-GRU
Connection:
- keep-alive
Content-Encoding:
- gzip
Content-Type:
- application/json
Date:
- Tue, 15 Jul 2025 13:59:38 GMT
Server:
- cloudflare
Set-Cookie:
- __cf_bm=J_xe1AP.B5P6D2GVMCesyioeS5E9DnYT34rbwQUefFc-1752587978-1.0.1.1-5Dflk5cAj6YCsOSVbCFWWSpXpw_mXsczIdzWzs2h2OwDL01HQbduE5LAToy67sfjFjHeeO4xRrqPLUQpySy2QqyHXbI_fzX4UAt3.UdwHxU;
path=/; expires=Tue, 15-Jul-25 14:29:38 GMT; domain=.api.openai.com; HttpOnly;
Secure; SameSite=None
- _cfuvid=0rTD8RMpxBQQy42jzmum16_eoRtWNfaZMG_TJkhGS7I-1752587978437-0.0.1.1-604800000;
path=/; domain=.api.openai.com; HttpOnly; Secure; SameSite=None
Transfer-Encoding:
- chunked
X-Content-Type-Options:
- nosniff
access-control-expose-headers:
- X-Request-ID
alt-svc:
- h3=":443"; ma=86400
cf-cache-status:
- DYNAMIC
openai-organization:
- crewai-iuxna1
openai-processing-ms:
- '2623'
openai-version:
- '2020-10-01'
strict-transport-security:
- max-age=31536000; includeSubDomains; preload
x-envoy-upstream-service-time:
- '2626'
x-ratelimit-limit-requests:
- '30000'
x-ratelimit-limit-tokens:
- '150000000'
x-ratelimit-remaining-requests:
- '29999'
x-ratelimit-remaining-tokens:
- '149999813'
x-ratelimit-reset-requests:
- 2ms
x-ratelimit-reset-tokens:
- 0s
x-request-id:
- req_ccc347e91010713379c920aa0efd1f4f
status:
code: 200
message: OK
version: 1

View File

@@ -0,0 +1,209 @@
interactions:
- request:
body: '{"messages": [{"role": "user", "content": "What is the capital of France?"}],
"model": "o1-mini", "stop": ["stop"]}'
headers:
accept:
- application/json
accept-encoding:
- gzip, deflate, zstd
connection:
- keep-alive
content-length:
- '115'
content-type:
- application/json
host:
- api.openai.com
user-agent:
- OpenAI/Python 1.75.0
x-stainless-arch:
- arm64
x-stainless-async:
- 'false'
x-stainless-lang:
- python
x-stainless-os:
- MacOS
x-stainless-package-version:
- 1.75.0
x-stainless-raw-response:
- 'true'
x-stainless-read-timeout:
- '600.0'
x-stainless-retry-count:
- '0'
x-stainless-runtime:
- CPython
x-stainless-runtime-version:
- 3.11.12
method: POST
uri: https://api.openai.com/v1/chat/completions
response:
body:
string: "{\n \"error\": {\n \"message\": \"Unsupported parameter: 'stop'
is not supported with this model.\",\n \"type\": \"invalid_request_error\",\n
\ \"param\": \"stop\",\n \"code\": \"unsupported_parameter\"\n }\n}"
headers:
CF-RAY:
- 961215744c94cb45-GIG
Connection:
- keep-alive
Content-Length:
- '196'
Content-Type:
- application/json
Date:
- Fri, 18 Jul 2025 12:46:46 GMT
Server:
- cloudflare
Set-Cookie:
- __cf_bm=KwJ1K47OHX4n2TZN8bMW37yKzKyK__S4HbTiCfyWjXM-1752842806-1.0.1.1-lweHFR7Kv2v7hT5I6xxYVz_7Ruu6aBdEgpJrSWrMxi_ficAeWC0oDeQ.0w2Lr1WRejIjqqcwSgdl6RixF2qEkjJZfS0pz_Vjjqexe44ayp4;
path=/; expires=Fri, 18-Jul-25 13:16:46 GMT; domain=.api.openai.com; HttpOnly;
Secure; SameSite=None
- _cfuvid=zv09c6bwcgNsYU80ah3wXzqeaIKyt_h61EAh_XRA87I-1752842806652-0.0.1.1-604800000;
path=/; domain=.api.openai.com; HttpOnly; Secure; SameSite=None
X-Content-Type-Options:
- nosniff
access-control-expose-headers:
- X-Request-ID
alt-svc:
- h3=":443"; ma=86400
cf-cache-status:
- DYNAMIC
openai-organization:
- crewai-iuxna1
openai-processing-ms:
- '20'
openai-project:
- proj_xitITlrFeen7zjNSzML82h9x
openai-version:
- '2020-10-01'
strict-transport-security:
- max-age=31536000; includeSubDomains; preload
x-envoy-upstream-service-time:
- '32'
x-ratelimit-limit-requests:
- '30000'
x-ratelimit-limit-tokens:
- '150000000'
x-ratelimit-remaining-requests:
- '29999'
x-ratelimit-remaining-tokens:
- '149999990'
x-ratelimit-reset-requests:
- 2ms
x-ratelimit-reset-tokens:
- 0s
x-request-id:
- req_7be4715c3ee32aa406eacb68c7cc966e
status:
code: 400
message: Bad Request
- request:
body: '{"messages": [{"role": "user", "content": "What is the capital of France?"}],
"model": "o1-mini"}'
headers:
accept:
- application/json
accept-encoding:
- gzip, deflate, zstd
connection:
- keep-alive
content-length:
- '97'
content-type:
- application/json
cookie:
- __cf_bm=KwJ1K47OHX4n2TZN8bMW37yKzKyK__S4HbTiCfyWjXM-1752842806-1.0.1.1-lweHFR7Kv2v7hT5I6xxYVz_7Ruu6aBdEgpJrSWrMxi_ficAeWC0oDeQ.0w2Lr1WRejIjqqcwSgdl6RixF2qEkjJZfS0pz_Vjjqexe44ayp4;
_cfuvid=zv09c6bwcgNsYU80ah3wXzqeaIKyt_h61EAh_XRA87I-1752842806652-0.0.1.1-604800000
host:
- api.openai.com
user-agent:
- OpenAI/Python 1.75.0
x-stainless-arch:
- arm64
x-stainless-async:
- 'false'
x-stainless-lang:
- python
x-stainless-os:
- MacOS
x-stainless-package-version:
- 1.75.0
x-stainless-raw-response:
- 'true'
x-stainless-read-timeout:
- '600.0'
x-stainless-retry-count:
- '0'
x-stainless-runtime:
- CPython
x-stainless-runtime-version:
- 3.11.12
method: POST
uri: https://api.openai.com/v1/chat/completions
response:
body:
string: !!binary |
H4sIAAAAAAAAA3RSwU7jMBC95ytGPlYNakJhQ2/sgSsg7QUhFA32pJni2JHtwFao/76yC3XQwsWH
efOe35uZ9wJAsBIbELLHIIdRl78nGvaqOt/dPDxf71/fdg/9bXO3e5ETXt+LZWTY5x3J8Mk6k3YY
NQW25ghLRxgoqla/LupmXTeXqwQMVpGONFuVAxsu61W9LldXZVV/MHvLkrzYwGMBAPCe3ujRKPor
NpB0UmUg73FLYnNqAhDO6lgR6D37gCaIZQalNYFMsv2nJ5A4ckANtoMbh0YSsIfF4g4d+8XibM50
1E0eo3MzaT0D0BgbMCZPnp8+kMPJZceGfd86Qm9N/NkHO4qEHgqAp5R6+hJEjM4OY2iDfaEkW62P
ciLPOYPNJxhsQJ3rV83yG7VWUUDWfjY1IVH2pDIzjxgnxXYGFLNs/5v5TvuYm802q1yuf9TPgJQ0
BlLt6Eix/Jo4tzmKZ/hT22nIybHw5F5ZUhuYXFyEog4nfTwQ4fc+0NB2bLbkRsfpSuKui0PxDwAA
//8DAN7IUy8kAwAA
headers:
CF-RAY:
- 961216c3f9837e07-GRU
Connection:
- keep-alive
Content-Encoding:
- gzip
Content-Type:
- application/json
Date:
- Fri, 18 Jul 2025 12:47:41 GMT
Server:
- cloudflare
Transfer-Encoding:
- chunked
X-Content-Type-Options:
- nosniff
access-control-expose-headers:
- X-Request-ID
alt-svc:
- h3=":443"; ma=86400
cf-cache-status:
- DYNAMIC
openai-organization:
- crewai-iuxna1
openai-processing-ms:
- '1027'
openai-project:
- proj_xitITlrFeen7zjNSzML82h9x
openai-version:
- '2020-10-01'
strict-transport-security:
- max-age=31536000; includeSubDomains; preload
x-envoy-upstream-service-time:
- '1029'
x-ratelimit-limit-requests:
- '30000'
x-ratelimit-limit-tokens:
- '150000000'
x-ratelimit-remaining-requests:
- '29999'
x-ratelimit-remaining-tokens:
- '149999990'
x-ratelimit-reset-requests:
- 2ms
x-ratelimit-reset-tokens:
- 0s
x-request-id:
- req_19a0763b09f0410b9d09598078a04cd6
status:
code: 200
message: OK
version: 1

View File

@@ -0,0 +1,206 @@
interactions:
- request:
body: '{"messages": [{"role": "user", "content": "What is the capital of France?"}],
"model": "o1-mini", "stop": ["stop"]}'
headers:
accept:
- application/json
accept-encoding:
- gzip, deflate, zstd
connection:
- keep-alive
content-length:
- '115'
content-type:
- application/json
cookie:
- __cf_bm=KwJ1K47OHX4n2TZN8bMW37yKzKyK__S4HbTiCfyWjXM-1752842806-1.0.1.1-lweHFR7Kv2v7hT5I6xxYVz_7Ruu6aBdEgpJrSWrMxi_ficAeWC0oDeQ.0w2Lr1WRejIjqqcwSgdl6RixF2qEkjJZfS0pz_Vjjqexe44ayp4;
_cfuvid=zv09c6bwcgNsYU80ah3wXzqeaIKyt_h61EAh_XRA87I-1752842806652-0.0.1.1-604800000
host:
- api.openai.com
user-agent:
- OpenAI/Python 1.75.0
x-stainless-arch:
- arm64
x-stainless-async:
- 'false'
x-stainless-lang:
- python
x-stainless-os:
- MacOS
x-stainless-package-version:
- 1.75.0
x-stainless-raw-response:
- 'true'
x-stainless-read-timeout:
- '600.0'
x-stainless-retry-count:
- '0'
x-stainless-runtime:
- CPython
x-stainless-runtime-version:
- 3.11.12
method: POST
uri: https://api.openai.com/v1/chat/completions
response:
body:
string: "{\n \"error\": {\n \"message\": \"Unsupported parameter: 'stop'
is not supported with this model.\",\n \"type\": \"invalid_request_error\",\n
\ \"param\": \"stop\",\n \"code\": \"unsupported_parameter\"\n }\n}"
headers:
CF-RAY:
- 961220323a627e05-GRU
Connection:
- keep-alive
Content-Length:
- '196'
Content-Type:
- application/json
Date:
- Fri, 18 Jul 2025 12:54:06 GMT
Server:
- cloudflare
X-Content-Type-Options:
- nosniff
access-control-expose-headers:
- X-Request-ID
alt-svc:
- h3=":443"; ma=86400
cf-cache-status:
- DYNAMIC
openai-organization:
- crewai-iuxna1
openai-processing-ms:
- '9'
openai-project:
- proj_xitITlrFeen7zjNSzML82h9x
openai-version:
- '2020-10-01'
strict-transport-security:
- max-age=31536000; includeSubDomains; preload
x-envoy-upstream-service-time:
- '11'
x-ratelimit-limit-requests:
- '30000'
x-ratelimit-limit-tokens:
- '150000000'
x-ratelimit-remaining-requests:
- '29999'
x-ratelimit-remaining-tokens:
- '149999990'
x-ratelimit-reset-requests:
- 2ms
x-ratelimit-reset-tokens:
- 0s
x-request-id:
- req_e8d7880c5977029062d8487d215e5282
status:
code: 400
message: Bad Request
- request:
body: '{"messages": [{"role": "user", "content": "What is the capital of France?"}],
"model": "o1-mini"}'
headers:
accept:
- application/json
accept-encoding:
- gzip, deflate, zstd
connection:
- keep-alive
content-length:
- '97'
content-type:
- application/json
cookie:
- __cf_bm=KwJ1K47OHX4n2TZN8bMW37yKzKyK__S4HbTiCfyWjXM-1752842806-1.0.1.1-lweHFR7Kv2v7hT5I6xxYVz_7Ruu6aBdEgpJrSWrMxi_ficAeWC0oDeQ.0w2Lr1WRejIjqqcwSgdl6RixF2qEkjJZfS0pz_Vjjqexe44ayp4;
_cfuvid=zv09c6bwcgNsYU80ah3wXzqeaIKyt_h61EAh_XRA87I-1752842806652-0.0.1.1-604800000
host:
- api.openai.com
user-agent:
- OpenAI/Python 1.75.0
x-stainless-arch:
- arm64
x-stainless-async:
- 'false'
x-stainless-lang:
- python
x-stainless-os:
- MacOS
x-stainless-package-version:
- 1.75.0
x-stainless-raw-response:
- 'true'
x-stainless-read-timeout:
- '600.0'
x-stainless-retry-count:
- '0'
x-stainless-runtime:
- CPython
x-stainless-runtime-version:
- 3.11.12
method: POST
uri: https://api.openai.com/v1/chat/completions
response:
body:
string: !!binary |
H4sIAAAAAAAAA3SSQW/bMAyF7/4Vgo5BXCSeV6c5bkAPPTVbMaAYCoOT6JitLAkSPbQo8t8HKWns
Yu1FB3181HsUXwshJGm5FVL1wGrwpvw2In/fXY3Pcd/sftzf9ENvnurm569dc9/IZVK4P4+o+E11
odzgDTI5e8QqIDCmruvma7Wpv1T1ZQaD02iSzK3LgSyV1aqqy9VVua5Oyt6Rwii34nchhBCv+Uwe
rcZnuRWr5dvNgDHCHuX2XCSEDM6kGwkxUmSwLJcTVM4y2mz7rkehwBODEa4T1wGsQkFRLBa3ECgu
FhdzZcBujJCc29GYGQBrHUNKnj0/nMjh7LIjS7FvA0J0Nr0c2XmZ6aEQ4iGnHt8FkT64wXPL7glz
23V9bCenOc/h5kTZMZgZuKyWH/RrNTKQibO5SQWqRz1JpyHDqMnNQDFL97+dj3ofk5Pdz5xVm08f
mIBS6Bl16wNqUu9DT2UB0yZ+Vnaec7YsI4a/pLBlwpD+QmMHoznuiIwvkXFoO7J7DD5QXpT03cWh
+AcAAP//AwAo/zsSJwMAAA==
headers:
CF-RAY:
- 961220338bd47e05-GRU
Connection:
- keep-alive
Content-Encoding:
- gzip
Content-Type:
- application/json
Date:
- Fri, 18 Jul 2025 12:54:08 GMT
Server:
- cloudflare
Transfer-Encoding:
- chunked
X-Content-Type-Options:
- nosniff
access-control-expose-headers:
- X-Request-ID
alt-svc:
- h3=":443"; ma=86400
cf-cache-status:
- DYNAMIC
openai-organization:
- crewai-iuxna1
openai-processing-ms:
- '1280'
openai-project:
- proj_xitITlrFeen7zjNSzML82h9x
openai-version:
- '2020-10-01'
strict-transport-security:
- max-age=31536000; includeSubDomains; preload
x-envoy-upstream-service-time:
- '1286'
x-ratelimit-limit-requests:
- '30000'
x-ratelimit-limit-tokens:
- '150000000'
x-ratelimit-remaining-requests:
- '29999'
x-ratelimit-remaining-tokens:
- '149999990'
x-ratelimit-reset-requests:
- 2ms
x-ratelimit-reset-tokens:
- 0s
x-request-id:
- req_b7390d46fa4e14380d42162cb22045df
status:
code: 200
message: OK
version: 1

View File

@@ -1,95 +0,0 @@
import pytest
from crewai.agent import Agent
from crewai.task import Task
from crewai.crew import Crew
from crewai.evaluation.agent_evaluator import AgentEvaluator
from crewai.evaluation.base_evaluator import AgentEvaluationResult
from crewai.evaluation import (
GoalAlignmentEvaluator,
SemanticQualityEvaluator,
ToolSelectionEvaluator,
ParameterExtractionEvaluator,
ToolInvocationEvaluator,
ReasoningEfficiencyEvaluator
)
from crewai.evaluation import create_default_evaluator
class TestAgentEvaluator:
@pytest.fixture
def mock_crew(self):
agent = Agent(
role="Test Agent",
goal="Complete test tasks successfully",
backstory="An agent created for testing purposes",
allow_delegation=False,
verbose=False
)
task = Task(
description="Test task description",
agent=agent,
expected_output="Expected test output"
)
crew = Crew(
agents=[agent],
tasks=[task]
)
return crew
def test_set_iteration(self):
agent_evaluator = AgentEvaluator()
agent_evaluator.set_iteration(3)
assert agent_evaluator.iteration == 3
@pytest.mark.vcr(filter_headers=["authorization"])
def test_evaluate_current_iteration(self, mock_crew):
agent_evaluator = AgentEvaluator(crew=mock_crew, evaluators=[GoalAlignmentEvaluator()])
mock_crew.kickoff()
results = agent_evaluator.evaluate_current_iteration()
assert isinstance(results, dict)
agent, = mock_crew.agents
task, = mock_crew.tasks
assert len(mock_crew.agents) == 1
assert agent.role in results
assert len(results[agent.role]) == 1
result, = results[agent.role]
assert isinstance(result, AgentEvaluationResult)
assert result.agent_id == str(agent.id)
assert result.task_id == str(task.id)
goal_alignment, = result.metrics.values()
assert goal_alignment.score == 5.0
expected_feedback = "The agent's output demonstrates an understanding of the need for a comprehensive document"
assert expected_feedback in goal_alignment.feedback
assert goal_alignment.raw_response is not None
assert '"score": 5' in goal_alignment.raw_response
def test_create_default_evaluator(self, mock_crew):
agent_evaluator = create_default_evaluator(crew=mock_crew)
assert isinstance(agent_evaluator, AgentEvaluator)
assert agent_evaluator.crew == mock_crew
expected_types = [
GoalAlignmentEvaluator,
SemanticQualityEvaluator,
ToolSelectionEvaluator,
ParameterExtractionEvaluator,
ToolInvocationEvaluator,
ReasoningEfficiencyEvaluator
]
assert len(agent_evaluator.evaluators) == len(expected_types)
for evaluator, expected_type in zip(agent_evaluator.evaluators, expected_types):
assert isinstance(evaluator, expected_type)

View File

@@ -1,8 +1,8 @@
from unittest.mock import patch, MagicMock
from tests.evaluation.metrics.base_evaluation_metrics_test import BaseEvaluationMetricsTest
from tests.experimental.evaluation.metrics.base_evaluation_metrics_test import BaseEvaluationMetricsTest
from crewai.evaluation.base_evaluator import EvaluationScore
from crewai.evaluation.metrics.goal_metrics import GoalAlignmentEvaluator
from crewai.experimental.evaluation.base_evaluator import EvaluationScore
from crewai.experimental.evaluation.metrics.goal_metrics import GoalAlignmentEvaluator
from crewai.utilities.llm_utils import LLM

View File

@@ -3,12 +3,12 @@ from unittest.mock import patch, MagicMock
from typing import List, Dict, Any
from crewai.tasks.task_output import TaskOutput
from crewai.evaluation.metrics.reasoning_metrics import (
from crewai.experimental.evaluation.metrics.reasoning_metrics import (
ReasoningEfficiencyEvaluator,
)
from tests.evaluation.metrics.base_evaluation_metrics_test import BaseEvaluationMetricsTest
from tests.experimental.evaluation.metrics.base_evaluation_metrics_test import BaseEvaluationMetricsTest
from crewai.utilities.llm_utils import LLM
from crewai.evaluation.base_evaluator import EvaluationScore
from crewai.experimental.evaluation.base_evaluator import EvaluationScore
class TestReasoningEfficiencyEvaluator(BaseEvaluationMetricsTest):
@pytest.fixture

View File

@@ -1,8 +1,8 @@
from unittest.mock import patch, MagicMock
from crewai.evaluation.base_evaluator import EvaluationScore
from crewai.evaluation.metrics.semantic_quality_metrics import SemanticQualityEvaluator
from tests.evaluation.metrics.base_evaluation_metrics_test import BaseEvaluationMetricsTest
from crewai.experimental.evaluation.base_evaluator import EvaluationScore
from crewai.experimental.evaluation.metrics.semantic_quality_metrics import SemanticQualityEvaluator
from tests.experimental.evaluation.metrics.base_evaluation_metrics_test import BaseEvaluationMetricsTest
from crewai.utilities.llm_utils import LLM
class TestSemanticQualityEvaluator(BaseEvaluationMetricsTest):

View File

@@ -1,12 +1,12 @@
from unittest.mock import patch, MagicMock
from crewai.evaluation.metrics.tools_metrics import (
from crewai.experimental.evaluation.metrics.tools_metrics import (
ToolSelectionEvaluator,
ParameterExtractionEvaluator,
ToolInvocationEvaluator
)
from crewai.utilities.llm_utils import LLM
from tests.evaluation.metrics.base_evaluation_metrics_test import BaseEvaluationMetricsTest
from tests.experimental.evaluation.metrics.base_evaluation_metrics_test import BaseEvaluationMetricsTest
class TestToolSelectionEvaluator(BaseEvaluationMetricsTest):
def test_no_tools_available(self, mock_task, mock_agent):

View File

@@ -0,0 +1,278 @@
import pytest
from crewai.agent import Agent
from crewai.task import Task
from crewai.crew import Crew
from crewai.experimental.evaluation.agent_evaluator import AgentEvaluator
from crewai.experimental.evaluation.base_evaluator import AgentEvaluationResult
from crewai.experimental.evaluation import (
GoalAlignmentEvaluator,
SemanticQualityEvaluator,
ToolSelectionEvaluator,
ParameterExtractionEvaluator,
ToolInvocationEvaluator,
ReasoningEfficiencyEvaluator,
MetricCategory,
EvaluationScore
)
from crewai.utilities.events.agent_events import AgentEvaluationStartedEvent, AgentEvaluationCompletedEvent, AgentEvaluationFailedEvent
from crewai.utilities.events.crewai_event_bus import crewai_event_bus
from crewai.experimental.evaluation import create_default_evaluator
class TestAgentEvaluator:
@pytest.fixture
def mock_crew(self):
agent = Agent(
role="Test Agent",
goal="Complete test tasks successfully",
backstory="An agent created for testing purposes",
allow_delegation=False,
verbose=False
)
task = Task(
description="Test task description",
agent=agent,
expected_output="Expected test output"
)
crew = Crew(
agents=[agent],
tasks=[task]
)
return crew
def test_set_iteration(self):
agent_evaluator = AgentEvaluator(agents=[])
agent_evaluator.set_iteration(3)
assert agent_evaluator._execution_state.iteration == 3
@pytest.mark.vcr(filter_headers=["authorization"])
def test_evaluate_current_iteration(self, mock_crew):
agent_evaluator = AgentEvaluator(agents=mock_crew.agents, evaluators=[GoalAlignmentEvaluator()])
mock_crew.kickoff()
results = agent_evaluator.get_evaluation_results()
assert isinstance(results, dict)
agent, = mock_crew.agents
task, = mock_crew.tasks
assert len(mock_crew.agents) == 1
assert agent.role in results
assert len(results[agent.role]) == 1
result, = results[agent.role]
assert isinstance(result, AgentEvaluationResult)
assert result.agent_id == str(agent.id)
assert result.task_id == str(task.id)
goal_alignment, = result.metrics.values()
assert goal_alignment.score == 5.0
expected_feedback = "The agent's output demonstrates an understanding of the need for a comprehensive document outlining task"
assert expected_feedback in goal_alignment.feedback
assert goal_alignment.raw_response is not None
assert '"score": 5' in goal_alignment.raw_response
def test_create_default_evaluator(self, mock_crew):
agent_evaluator = create_default_evaluator(agents=mock_crew.agents)
assert isinstance(agent_evaluator, AgentEvaluator)
assert agent_evaluator.agents == mock_crew.agents
expected_types = [
GoalAlignmentEvaluator,
SemanticQualityEvaluator,
ToolSelectionEvaluator,
ParameterExtractionEvaluator,
ToolInvocationEvaluator,
ReasoningEfficiencyEvaluator
]
assert len(agent_evaluator.evaluators) == len(expected_types)
for evaluator, expected_type in zip(agent_evaluator.evaluators, expected_types):
assert isinstance(evaluator, expected_type)
@pytest.mark.vcr(filter_headers=["authorization"])
def test_eval_lite_agent(self):
agent = Agent(
role="Test Agent",
goal="Complete test tasks successfully",
backstory="An agent created for testing purposes",
)
with crewai_event_bus.scoped_handlers():
events = {}
@crewai_event_bus.on(AgentEvaluationStartedEvent)
def capture_started(source, event):
events["started"] = event
@crewai_event_bus.on(AgentEvaluationCompletedEvent)
def capture_completed(source, event):
events["completed"] = event
@crewai_event_bus.on(AgentEvaluationFailedEvent)
def capture_failed(source, event):
events["failed"] = event
agent_evaluator = AgentEvaluator(agents=[agent], evaluators=[GoalAlignmentEvaluator()])
agent.kickoff(messages="Complete this task successfully")
assert events.keys() == {"started", "completed"}
assert events["started"].agent_id == str(agent.id)
assert events["started"].agent_role == agent.role
assert events["started"].task_id is None
assert events["started"].iteration == 1
assert events["completed"].agent_id == str(agent.id)
assert events["completed"].agent_role == agent.role
assert events["completed"].task_id is None
assert events["completed"].iteration == 1
assert events["completed"].metric_category == MetricCategory.GOAL_ALIGNMENT
assert isinstance(events["completed"].score, EvaluationScore)
assert events["completed"].score.score == 2.0
results = agent_evaluator.get_evaluation_results()
assert isinstance(results, dict)
result, = results[agent.role]
assert isinstance(result, AgentEvaluationResult)
assert result.agent_id == str(agent.id)
assert result.task_id == "lite_task"
goal_alignment, = result.metrics.values()
assert goal_alignment.score == 2.0
expected_feedback = "The agent did not demonstrate a clear understanding of the task goal, which is to complete test tasks successfully"
assert expected_feedback in goal_alignment.feedback
assert goal_alignment.raw_response is not None
assert '"score": 2' in goal_alignment.raw_response
@pytest.mark.vcr(filter_headers=["authorization"])
def test_eval_specific_agents_from_crew(self, mock_crew):
agent = Agent(
role="Test Agent Eval",
goal="Complete test tasks successfully",
backstory="An agent created for testing purposes",
)
task = Task(
description="Test task description",
agent=agent,
expected_output="Expected test output"
)
mock_crew.agents.append(agent)
mock_crew.tasks.append(task)
with crewai_event_bus.scoped_handlers():
events = {}
@crewai_event_bus.on(AgentEvaluationStartedEvent)
def capture_started(source, event):
events["started"] = event
@crewai_event_bus.on(AgentEvaluationCompletedEvent)
def capture_completed(source, event):
events["completed"] = event
@crewai_event_bus.on(AgentEvaluationFailedEvent)
def capture_failed(source, event):
events["failed"] = event
agent_evaluator = AgentEvaluator(agents=[agent], evaluators=[GoalAlignmentEvaluator()])
mock_crew.kickoff()
assert events.keys() == {"started", "completed"}
assert events["started"].agent_id == str(agent.id)
assert events["started"].agent_role == agent.role
assert events["started"].task_id == str(task.id)
assert events["started"].iteration == 1
assert events["completed"].agent_id == str(agent.id)
assert events["completed"].agent_role == agent.role
assert events["completed"].task_id == str(task.id)
assert events["completed"].iteration == 1
assert events["completed"].metric_category == MetricCategory.GOAL_ALIGNMENT
assert isinstance(events["completed"].score, EvaluationScore)
assert events["completed"].score.score == 5.0
results = agent_evaluator.get_evaluation_results()
assert isinstance(results, dict)
assert len(results.keys()) == 1
result, = results[agent.role]
assert isinstance(result, AgentEvaluationResult)
assert result.agent_id == str(agent.id)
assert result.task_id == str(task.id)
goal_alignment, = result.metrics.values()
assert goal_alignment.score == 5.0
expected_feedback = "The agent provided a thorough guide on how to conduct a test task but failed to produce specific expected output"
assert expected_feedback in goal_alignment.feedback
assert goal_alignment.raw_response is not None
assert '"score": 5' in goal_alignment.raw_response
@pytest.mark.vcr(filter_headers=["authorization"])
def test_failed_evaluation(self, mock_crew):
agent, = mock_crew.agents
task, = mock_crew.tasks
with crewai_event_bus.scoped_handlers():
events = {}
@crewai_event_bus.on(AgentEvaluationStartedEvent)
def capture_started(source, event):
events["started"] = event
@crewai_event_bus.on(AgentEvaluationCompletedEvent)
def capture_completed(source, event):
events["completed"] = event
@crewai_event_bus.on(AgentEvaluationFailedEvent)
def capture_failed(source, event):
events["failed"] = event
# Create a mock evaluator that will raise an exception
from crewai.experimental.evaluation.base_evaluator import BaseEvaluator
from crewai.experimental.evaluation import MetricCategory
class FailingEvaluator(BaseEvaluator):
metric_category = MetricCategory.GOAL_ALIGNMENT
def evaluate(self, agent, task, execution_trace, final_output):
raise ValueError("Forced evaluation failure")
agent_evaluator = AgentEvaluator(agents=[agent], evaluators=[FailingEvaluator()])
mock_crew.kickoff()
assert events.keys() == {"started", "failed"}
assert events["started"].agent_id == str(agent.id)
assert events["started"].agent_role == agent.role
assert events["started"].task_id == str(task.id)
assert events["started"].iteration == 1
assert events["failed"].agent_id == str(agent.id)
assert events["failed"].agent_role == agent.role
assert events["failed"].task_id == str(task.id)
assert events["failed"].iteration == 1
assert events["failed"].error == "Forced evaluation failure"
results = agent_evaluator.get_evaluation_results()
result, = results[agent.role]
assert isinstance(result, AgentEvaluationResult)
assert result.agent_id == str(agent.id)
assert result.task_id == str(task.id)
assert result.metrics == {}

View File

@@ -0,0 +1,111 @@
import pytest
from unittest.mock import MagicMock, patch
from crewai.experimental.evaluation.experiment.result import ExperimentResult, ExperimentResults
class TestExperimentResult:
@pytest.fixture
def mock_results(self):
return [
ExperimentResult(
identifier="test-1",
inputs={"query": "What is the capital of France?"},
score=10,
expected_score=7,
passed=True
),
ExperimentResult(
identifier="test-2",
inputs={"query": "Who wrote Hamlet?"},
score={"relevance": 9, "factuality": 8},
expected_score={"relevance": 7, "factuality": 7},
passed=True,
agent_evaluations={"agent1": {"metrics": {"goal_alignment": {"score": 9}}}}
),
ExperimentResult(
identifier="test-3",
inputs={"query": "Any query"},
score={"relevance": 9, "factuality": 8},
expected_score={"relevance": 7, "factuality": 7},
passed=False,
agent_evaluations={"agent1": {"metrics": {"goal_alignment": {"score": 9}}}}
),
ExperimentResult(
identifier="test-4",
inputs={"query": "Another query"},
score={"relevance": 9, "factuality": 8},
expected_score={"relevance": 7, "factuality": 7},
passed=True,
agent_evaluations={"agent1": {"metrics": {"goal_alignment": {"score": 9}}}}
),
ExperimentResult(
identifier="test-6",
inputs={"query": "Yet another query"},
score={"relevance": 9, "factuality": 8},
expected_score={"relevance": 7, "factuality": 7},
passed=True,
agent_evaluations={"agent1": {"metrics": {"goal_alignment": {"score": 9}}}}
)
]
@patch('os.path.exists', return_value=True)
@patch('os.path.getsize', return_value=1)
@patch('json.load')
@patch('builtins.open', new_callable=MagicMock)
def test_experiment_results_compare_with_baseline(self, mock_open, mock_json_load, mock_path_getsize, mock_path_exists, mock_results):
baseline_data = {
"timestamp": "2023-01-01T00:00:00+00:00",
"results": [
{
"identifier": "test-1",
"inputs": {"query": "What is the capital of France?"},
"score": 7,
"expected_score": 7,
"passed": False
},
{
"identifier": "test-2",
"inputs": {"query": "Who wrote Hamlet?"},
"score": {"relevance": 8, "factuality": 7},
"expected_score": {"relevance": 7, "factuality": 7},
"passed": True
},
{
"identifier": "test-3",
"inputs": {"query": "Any query"},
"score": {"relevance": 8, "factuality": 7},
"expected_score": {"relevance": 7, "factuality": 7},
"passed": True
},
{
"identifier": "test-4",
"inputs": {"query": "Another query"},
"score": {"relevance": 8, "factuality": 7},
"expected_score": {"relevance": 7, "factuality": 7},
"passed": True
},
{
"identifier": "test-5",
"inputs": {"query": "Another query"},
"score": {"relevance": 8, "factuality": 7},
"expected_score": {"relevance": 7, "factuality": 7},
"passed": True
}
]
}
mock_json_load.return_value = baseline_data
results = ExperimentResults(results=mock_results)
results.display = MagicMock()
comparison = results.compare_with_baseline(baseline_filepath="baseline.json")
assert "baseline_timestamp" in comparison
assert comparison["baseline_timestamp"] == "2023-01-01T00:00:00+00:00"
assert comparison["improved"] == ["test-1"]
assert comparison["regressed"] == ["test-3"]
assert comparison["unchanged"] == ["test-2", "test-4"]
assert comparison["new_tests"] == ["test-6"]
assert comparison["missing_tests"] == ["test-5"]

View File

@@ -0,0 +1,197 @@
import pytest
from unittest.mock import MagicMock, patch
from crewai.crew import Crew
from crewai.experimental.evaluation.experiment.runner import ExperimentRunner
from crewai.experimental.evaluation.experiment.result import ExperimentResults
from crewai.experimental.evaluation.evaluation_display import AgentAggregatedEvaluationResult
from crewai.experimental.evaluation.base_evaluator import MetricCategory, EvaluationScore
class TestExperimentRunner:
@pytest.fixture
def mock_crew(self):
return MagicMock(llm=Crew)
@pytest.fixture
def mock_evaluator_results(self):
agent_evaluation = AgentAggregatedEvaluationResult(
agent_id="Test Agent",
agent_role="Test Agent Role",
metrics={
MetricCategory.GOAL_ALIGNMENT: EvaluationScore(
score=9,
feedback="Test feedback for goal alignment",
raw_response="Test raw response for goal alignment"
),
MetricCategory.REASONING_EFFICIENCY: EvaluationScore(
score=None,
feedback="Reasoning efficiency not applicable",
raw_response="Reasoning efficiency not applicable"
),
MetricCategory.PARAMETER_EXTRACTION: EvaluationScore(
score=7,
feedback="Test parameter extraction explanation",
raw_response="Test raw output"
),
MetricCategory.TOOL_SELECTION: EvaluationScore(
score=8,
feedback="Test tool selection explanation",
raw_response="Test raw output"
)
}
)
return {"Test Agent": agent_evaluation}
@patch('crewai.experimental.evaluation.experiment.runner.create_default_evaluator')
def test_run_success(self, mock_create_evaluator, mock_crew, mock_evaluator_results):
dataset = [
{
"identifier": "test-case-1",
"inputs": {"query": "Test query 1"},
"expected_score": 8
},
{
"identifier": "test-case-2",
"inputs": {"query": "Test query 2"},
"expected_score": {"goal_alignment": 7}
},
{
"inputs": {"query": "Test query 3"},
"expected_score": {"tool_selection": 9}
}
]
mock_evaluator = MagicMock()
mock_evaluator.get_agent_evaluation.return_value = mock_evaluator_results
mock_evaluator.reset_iterations_results = MagicMock()
mock_create_evaluator.return_value = mock_evaluator
runner = ExperimentRunner(dataset=dataset)
results = runner.run(crew=mock_crew)
assert isinstance(results, ExperimentResults)
result_1, result_2, result_3 = results.results
assert len(results.results) == 3
assert result_1.identifier == "test-case-1"
assert result_1.inputs == {"query": "Test query 1"}
assert result_1.expected_score == 8
assert result_1.passed is True
assert result_2.identifier == "test-case-2"
assert result_2.inputs == {"query": "Test query 2"}
assert isinstance(result_2.expected_score, dict)
assert "goal_alignment" in result_2.expected_score
assert result_2.passed is True
assert result_3.identifier == "c2ed49e63aa9a83af3ca382794134fd5"
assert result_3.inputs == {"query": "Test query 3"}
assert isinstance(result_3.expected_score, dict)
assert "tool_selection" in result_3.expected_score
assert result_3.passed is False
assert mock_crew.kickoff.call_count == 3
mock_crew.kickoff.assert_any_call(inputs={"query": "Test query 1"})
mock_crew.kickoff.assert_any_call(inputs={"query": "Test query 2"})
mock_crew.kickoff.assert_any_call(inputs={"query": "Test query 3"})
assert mock_evaluator.reset_iterations_results.call_count == 3
assert mock_evaluator.get_agent_evaluation.call_count == 3
@patch('crewai.experimental.evaluation.experiment.runner.create_default_evaluator')
def test_run_success_with_unknown_metric(self, mock_create_evaluator, mock_crew, mock_evaluator_results):
dataset = [
{
"identifier": "test-case-2",
"inputs": {"query": "Test query 2"},
"expected_score": {"goal_alignment": 7, "unknown_metric": 8}
}
]
mock_evaluator = MagicMock()
mock_evaluator.get_agent_evaluation.return_value = mock_evaluator_results
mock_evaluator.reset_iterations_results = MagicMock()
mock_create_evaluator.return_value = mock_evaluator
runner = ExperimentRunner(dataset=dataset)
results = runner.run(crew=mock_crew)
result, = results.results
assert result.identifier == "test-case-2"
assert result.inputs == {"query": "Test query 2"}
assert isinstance(result.expected_score, dict)
assert "goal_alignment" in result.expected_score.keys()
assert "unknown_metric" in result.expected_score.keys()
assert result.passed is True
@patch('crewai.experimental.evaluation.experiment.runner.create_default_evaluator')
def test_run_success_with_single_metric_evaluator_and_expected_specific_metric(self, mock_create_evaluator, mock_crew, mock_evaluator_results):
dataset = [
{
"identifier": "test-case-2",
"inputs": {"query": "Test query 2"},
"expected_score": {"goal_alignment": 7}
}
]
mock_evaluator = MagicMock()
mock_create_evaluator["Test Agent"].metrics = {
MetricCategory.GOAL_ALIGNMENT: EvaluationScore(
score=9,
feedback="Test feedback for goal alignment",
raw_response="Test raw response for goal alignment"
)
}
mock_evaluator.get_agent_evaluation.return_value = mock_evaluator_results
mock_evaluator.reset_iterations_results = MagicMock()
mock_create_evaluator.return_value = mock_evaluator
runner = ExperimentRunner(dataset=dataset)
results = runner.run(crew=mock_crew)
result, = results.results
assert result.identifier == "test-case-2"
assert result.inputs == {"query": "Test query 2"}
assert isinstance(result.expected_score, dict)
assert "goal_alignment" in result.expected_score.keys()
assert result.passed is True
@patch('crewai.experimental.evaluation.experiment.runner.create_default_evaluator')
def test_run_success_when_expected_metric_is_not_available(self, mock_create_evaluator, mock_crew, mock_evaluator_results):
dataset = [
{
"identifier": "test-case-2",
"inputs": {"query": "Test query 2"},
"expected_score": {"unknown_metric": 7}
}
]
mock_evaluator = MagicMock()
mock_create_evaluator["Test Agent"].metrics = {
MetricCategory.GOAL_ALIGNMENT: EvaluationScore(
score=5,
feedback="Test feedback for goal alignment",
raw_response="Test raw response for goal alignment"
)
}
mock_evaluator.get_agent_evaluation.return_value = mock_evaluator_results
mock_evaluator.reset_iterations_results = MagicMock()
mock_create_evaluator.return_value = mock_evaluator
runner = ExperimentRunner(dataset=dataset)
results = runner.run(crew=mock_crew)
result, = results.results
assert result.identifier == "test-case-2"
assert result.inputs == {"query": "Test query 2"}
assert isinstance(result.expected_score, dict)
assert "unknown_metric" in result.expected_score.keys()
assert result.passed is False

View File

@@ -755,3 +755,15 @@ def test_multiple_routers_from_same_trigger():
assert execution_order.index("anemia_analysis") > execution_order.index(
"anemia_router"
)
def test_flow_name():
class MyFlow(Flow):
name = "MyFlow"
@start()
def start(self):
return "Hello, world!"
flow = MyFlow()
assert flow.name == "MyFlow"

View File

@@ -1,3 +1,4 @@
import logging
import os
from time import sleep
from unittest.mock import MagicMock, patch
@@ -664,3 +665,145 @@ def test_handle_streaming_tool_calls_no_tools(mock_emit):
expected_completed_llm_call=1,
expected_final_chunk_result=response,
)
@pytest.mark.vcr(filter_headers=["authorization"])
def test_llm_call_when_stop_is_unsupported(caplog):
llm = LLM(model="o1-mini", stop=["stop"])
with caplog.at_level(logging.INFO):
result = llm.call("What is the capital of France?")
assert "Retrying LLM call without the unsupported 'stop'" in caplog.text
assert isinstance(result, str)
assert "Paris" in result
@pytest.mark.vcr(filter_headers=["authorization"])
def test_llm_call_when_stop_is_unsupported_when_additional_drop_params_is_provided(caplog):
llm = LLM(model="o1-mini", stop=["stop"], additional_drop_params=["another_param"])
with caplog.at_level(logging.INFO):
result = llm.call("What is the capital of France?")
assert "Retrying LLM call without the unsupported 'stop'" in caplog.text
assert isinstance(result, str)
assert "Paris" in result
@pytest.fixture
def ollama_llm():
return LLM(model="ollama/llama3.2:3b")
def test_ollama_appends_dummy_user_message_when_last_is_assistant(ollama_llm):
original_messages = [
{"role": "user", "content": "Hi there"},
{"role": "assistant", "content": "Hello!"},
]
formatted = ollama_llm._format_messages_for_provider(original_messages)
assert len(formatted) == len(original_messages) + 1
assert formatted[-1]["role"] == "user"
assert formatted[-1]["content"] == ""
def test_ollama_does_not_modify_when_last_is_user(ollama_llm):
original_messages = [
{"role": "user", "content": "Tell me a joke."},
]
formatted = ollama_llm._format_messages_for_provider(original_messages)
assert formatted == original_messages
def test_llm_reasoning_parameter_false():
"""Test that reasoning=False disables reasoning mode."""
llm = LLM(model="ollama/qwen", reasoning=False)
with patch("litellm.completion") as mock_completion:
mock_message = MagicMock()
mock_message.content = "Test response"
mock_choice = MagicMock()
mock_choice.message = mock_message
mock_response = MagicMock()
mock_response.choices = [mock_choice]
mock_response.usage = {"prompt_tokens": 5, "completion_tokens": 5, "total_tokens": 10}
mock_completion.return_value = mock_response
llm.call("Test message")
_, kwargs = mock_completion.call_args
assert "reasoning_effort" not in kwargs
def test_llm_reasoning_parameter_true():
"""Test that reasoning=True enables reasoning mode."""
llm = LLM(model="ollama/qwen", reasoning=True, reasoning_effort="medium")
with patch("litellm.completion") as mock_completion:
mock_message = MagicMock()
mock_message.content = "Test response"
mock_choice = MagicMock()
mock_choice.message = mock_message
mock_response = MagicMock()
mock_response.choices = [mock_choice]
mock_response.usage = {"prompt_tokens": 5, "completion_tokens": 5, "total_tokens": 10}
mock_completion.return_value = mock_response
llm.call("Test message")
_, kwargs = mock_completion.call_args
assert kwargs["reasoning_effort"] == "medium"
def test_llm_reasoning_parameter_none_with_reasoning_effort():
"""Test that reasoning=None with reasoning_effort still includes reasoning_effort."""
llm = LLM(model="ollama/qwen", reasoning=None, reasoning_effort="high")
with patch("litellm.completion") as mock_completion:
mock_message = MagicMock()
mock_message.content = "Test response"
mock_choice = MagicMock()
mock_choice.message = mock_message
mock_response = MagicMock()
mock_response.choices = [mock_choice]
mock_response.usage = {"prompt_tokens": 5, "completion_tokens": 5, "total_tokens": 10}
mock_completion.return_value = mock_response
llm.call("Test message")
_, kwargs = mock_completion.call_args
assert kwargs["reasoning_effort"] == "high"
def test_llm_reasoning_false_overrides_reasoning_effort():
"""Test that reasoning=False overrides reasoning_effort."""
llm = LLM(model="ollama/qwen", reasoning=False, reasoning_effort="high")
with patch("litellm.completion") as mock_completion:
mock_message = MagicMock()
mock_message.content = "Test response"
mock_choice = MagicMock()
mock_choice.message = mock_message
mock_response = MagicMock()
mock_response.choices = [mock_choice]
mock_response.usage = {"prompt_tokens": 5, "completion_tokens": 5, "total_tokens": 10}
mock_completion.return_value = mock_response
llm.call("Test message")
_, kwargs = mock_completion.call_args
assert "reasoning_effort" not in kwargs
@pytest.mark.vcr(filter_headers=["authorization"])
def test_ollama_qwen_with_reasoning_disabled():
"""Test Ollama Qwen model with reasoning disabled."""
if not os.getenv("OLLAMA_BASE_URL"):
pytest.skip("OLLAMA_BASE_URL not set; skipping test.")
llm = LLM(
model="ollama/qwen",
base_url=os.getenv("OLLAMA_BASE_URL", "http://localhost:11434"),
reasoning=False
)
result = llm.call("What is 2+2?")
assert isinstance(result, str)
assert len(result.strip()) > 0

View File

@@ -1,14 +1,10 @@
import os
from unittest.mock import MagicMock, patch
import pytest
from mem0.client.main import MemoryClient
from mem0.memory.main import Memory
from crewai.agent import Agent
from crewai.crew import Crew
from crewai.memory.storage.mem0_storage import Mem0Storage
from crewai.task import Task
# Define the class (if not already defined)
@@ -59,10 +55,11 @@ def mem0_storage_with_mocked_config(mock_mem0_memory):
}
# Instantiate the class with memory_config
# Parameters like run_id, includes, and excludes doesn't matter in Memory OSS
crew = MockCrew(
memory_config={
"provider": "mem0",
"config": {"user_id": "test_user", "local_mem0_config": config},
"config": {"user_id": "test_user", "local_mem0_config": config, "run_id": "my_run_id", "includes": "include1","excludes": "exclude1", "infer" : True},
}
)
@@ -99,6 +96,10 @@ def mem0_storage_with_memory_client_using_config_from_crew(mock_mem0_memory_clie
"api_key": "ABCDEFGH",
"org_id": "my_org_id",
"project_id": "my_project_id",
"run_id": "my_run_id",
"includes": "include1",
"excludes": "exclude1",
"infer": True
},
}
)
@@ -154,11 +155,37 @@ def test_mem0_storage_with_explict_config(
assert (
mem0_storage_with_memory_client_using_explictly_config.config == expected_config
)
assert (
mem0_storage_with_memory_client_using_explictly_config.memory_config
== expected_config
def test_mem0_storage_updates_project_with_custom_categories(mock_mem0_memory_client):
mock_mem0_memory_client.update_project = MagicMock()
new_categories = [
{"lifestyle_management_concerns": "Tracks daily routines, habits, hobbies and interests including cooking, time management and work-life balance"},
]
crew = MockCrew(
memory_config={
"provider": "mem0",
"config": {
"user_id": "test_user",
"api_key": "ABCDEFGH",
"org_id": "my_org_id",
"project_id": "my_project_id",
"custom_categories": new_categories,
},
}
)
with patch.object(MemoryClient, "__new__", return_value=mock_mem0_memory_client):
_ = Mem0Storage(type="short_term", crew=crew)
mock_mem0_memory_client.update_project.assert_called_once_with(
custom_categories=new_categories
)
def test_save_method_with_memory_oss(mem0_storage_with_mocked_config):
"""Test save method for different memory types"""
@@ -172,9 +199,8 @@ def test_save_method_with_memory_oss(mem0_storage_with_mocked_config):
mem0_storage.save(test_value, test_metadata)
mem0_storage.memory.add.assert_called_once_with(
test_value,
agent_id="Test_Agent",
infer=False,
[{'role': 'assistant' , 'content': test_value}],
infer=True,
metadata={"type": "short_term", "key": "value"},
)
@@ -191,11 +217,14 @@ def test_save_method_with_memory_client(mem0_storage_with_memory_client_using_co
mem0_storage.save(test_value, test_metadata)
mem0_storage.memory.add.assert_called_once_with(
test_value,
agent_id="Test_Agent",
infer=False,
[{'role': 'assistant' , 'content': test_value}],
infer=True,
metadata={"type": "short_term", "key": "value"},
output_format="v1.1"
version="v2",
run_id="my_run_id",
includes="include1",
excludes="exclude1",
output_format='v1.1'
)
@@ -210,11 +239,12 @@ def test_search_method_with_memory_oss(mem0_storage_with_mocked_config):
mem0_storage.memory.search.assert_called_once_with(
query="test query",
limit=5,
agent_id="Test_Agent",
user_id="test_user"
user_id="test_user",
filters={'AND': [{'run_id': 'my_run_id'}]},
threshold=0.5
)
assert len(results) == 1
assert len(results) == 2
assert results[0]["content"] == "Result 1"
@@ -229,11 +259,31 @@ def test_search_method_with_memory_client(mem0_storage_with_memory_client_using_
mem0_storage.memory.search.assert_called_once_with(
query="test query",
limit=5,
agent_id="Test_Agent",
metadata={"type": "short_term"},
user_id="test_user",
output_format='v1.1'
version='v2',
run_id="my_run_id",
output_format='v1.1',
filters={'AND': [{'run_id': 'my_run_id'}]},
threshold=0.5
)
assert len(results) == 1
assert len(results) == 2
assert results[0]["content"] == "Result 1"
def test_mem0_storage_default_infer_value(mock_mem0_memory_client):
"""Test that Mem0Storage sets infer=True by default for short_term memory."""
with patch.object(MemoryClient, "__new__", return_value=mock_mem0_memory_client):
crew = MockCrew(
memory_config={
"provider": "mem0",
"config": {
"user_id": "test_user",
"api_key": "ABCDEFGH"
},
}
)
mem0_storage = Mem0Storage(type="short_term", crew=crew)
assert mem0_storage.infer is True

View File

@@ -1,16 +1,27 @@
import multiprocessing
import tempfile
import unittest
from typing import Any, Dict, List, Union
import pytest
from chromadb.config import Settings
from unittest.mock import patch, MagicMock
from crewai.utilities.chromadb import (
MAX_COLLECTION_LENGTH,
MIN_COLLECTION_LENGTH,
is_ipv4_pattern,
sanitize_collection_name,
create_persistent_client,
)
def persistent_client_worker(path, queue):
try:
create_persistent_client(path=path)
queue.put(None)
except Exception as e:
queue.put(e)
class TestChromadbUtils(unittest.TestCase):
def test_sanitize_collection_name_long_name(self):
"""Test sanitizing a very long collection name."""
@@ -79,3 +90,34 @@ class TestChromadbUtils(unittest.TestCase):
self.assertLessEqual(len(sanitized), MAX_COLLECTION_LENGTH)
self.assertTrue(sanitized[0].isalnum())
self.assertTrue(sanitized[-1].isalnum())
def test_create_persistent_client_passes_args(self):
with patch(
"crewai.utilities.chromadb.PersistentClient"
) as mock_persistent_client, tempfile.TemporaryDirectory() as tmpdir:
mock_instance = MagicMock()
mock_persistent_client.return_value = mock_instance
settings = Settings(allow_reset=True)
client = create_persistent_client(path=tmpdir, settings=settings)
mock_persistent_client.assert_called_once_with(
path=tmpdir, settings=settings
)
self.assertIs(client, mock_instance)
def test_create_persistent_client_process_safe(self):
with tempfile.TemporaryDirectory() as tmpdir:
queue = multiprocessing.Queue()
processes = [
multiprocessing.Process(
target=persistent_client_worker, args=(tmpdir, queue)
)
for _ in range(5)
]
[p.start() for p in processes]
[p.join() for p in processes]
errors = [queue.get(timeout=5) for _ in processes]
self.assertTrue(all(err is None for err in errors))

View File

@@ -0,0 +1,25 @@
from unittest.mock import patch
import pytest
from crewai.rag.embeddings.configurator import EmbeddingConfigurator
def test_configure_embedder_importerror():
configurator = EmbeddingConfigurator()
embedder_config = {
'provider': 'openai',
'config': {
'model': 'text-embedding-ada-002',
}
}
with patch('chromadb.utils.embedding_functions.openai_embedding_function.OpenAIEmbeddingFunction') as mock_openai:
mock_openai.side_effect = ImportError("Module not found.")
with pytest.raises(ImportError) as exc_info:
configurator.configure_embedder(embedder_config)
assert str(exc_info.value) == "Module not found."
mock_openai.assert_called_once()

View File

@@ -64,7 +64,8 @@ def base_agent():
llm="gpt-4o-mini",
goal="Just say hi",
backstory="You are a helpful assistant that just says hi",
)
)
@pytest.fixture(scope="module")
def base_task(base_agent):
@@ -74,6 +75,7 @@ def base_task(base_agent):
agent=base_agent,
)
event_listener = EventListener()
@@ -448,6 +450,27 @@ def test_flow_emits_start_event():
assert received_events[0].type == "flow_started"
def test_flow_name_emitted_to_event_bus():
received_events = []
class MyFlowClass(Flow):
name = "PRODUCTION_FLOW"
@start()
def start(self):
return "Hello, world!"
@crewai_event_bus.on(FlowStartedEvent)
def handle_flow_start(source, event):
received_events.append(event)
flow = MyFlowClass()
flow.kickoff()
assert len(received_events) == 1
assert received_events[0].flow_name == "PRODUCTION_FLOW"
def test_flow_emits_finish_event():
received_events = []
@@ -756,6 +779,7 @@ def test_streaming_empty_response_handling():
received_chunks = []
with crewai_event_bus.scoped_handlers():
@crewai_event_bus.on(LLMStreamChunkEvent)
def handle_stream_chunk(source, event):
received_chunks.append(event.chunk)
@@ -793,6 +817,7 @@ def test_streaming_empty_response_handling():
# Restore the original method
llm.call = original_call
@pytest.mark.vcr(filter_headers=["authorization"])
def test_stream_llm_emits_event_with_task_and_agent_info():
completed_event = []
@@ -801,6 +826,7 @@ def test_stream_llm_emits_event_with_task_and_agent_info():
stream_event = []
with crewai_event_bus.scoped_handlers():
@crewai_event_bus.on(LLMCallFailedEvent)
def handle_llm_failed(source, event):
failed_event.append(event)
@@ -827,7 +853,7 @@ def test_stream_llm_emits_event_with_task_and_agent_info():
description="Just say hi",
expected_output="hi",
llm=LLM(model="gpt-4o-mini", stream=True),
agent=agent
agent=agent,
)
crew = Crew(agents=[agent], tasks=[task])
@@ -855,6 +881,7 @@ def test_stream_llm_emits_event_with_task_and_agent_info():
assert set(all_task_id) == {task.id}
assert set(all_task_name) == {task.name}
@pytest.mark.vcr(filter_headers=["authorization"])
def test_llm_emits_event_with_task_and_agent_info(base_agent, base_task):
completed_event = []
@@ -863,6 +890,7 @@ def test_llm_emits_event_with_task_and_agent_info(base_agent, base_task):
stream_event = []
with crewai_event_bus.scoped_handlers():
@crewai_event_bus.on(LLMCallFailedEvent)
def handle_llm_failed(source, event):
failed_event.append(event)
@@ -904,6 +932,7 @@ def test_llm_emits_event_with_task_and_agent_info(base_agent, base_task):
assert set(all_task_id) == {base_task.id}
assert set(all_task_name) == {base_task.name}
@pytest.mark.vcr(filter_headers=["authorization"])
def test_llm_emits_event_with_lite_agent():
completed_event = []
@@ -912,6 +941,7 @@ def test_llm_emits_event_with_lite_agent():
stream_event = []
with crewai_event_bus.scoped_handlers():
@crewai_event_bus.on(LLMCallFailedEvent)
def handle_llm_failed(source, event):
failed_event.append(event)
@@ -936,7 +966,6 @@ def test_llm_emits_event_with_lite_agent():
)
agent.kickoff(messages=[{"role": "user", "content": "say hi!"}])
assert len(completed_event) == 2
assert len(failed_event) == 0
assert len(started_event) == 2

6668
uv.lock generated

File diff suppressed because it is too large Load Diff