feat: change litellm dependency from strict pin to minimum version constraint

- Change litellm==1.74.3 to litellm>=1.74.3 in pyproject.toml
- Update uv.lock with new dependency constraint
- Add comprehensive tests to verify minimum version constraint works
- Allows users to install newer litellm versions for features like Claude 4 Sonnet

Fixes #3207

Co-Authored-By: João <joao@crewai.com>
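
As a rough illustration of what relaxing the pin changes, the sketch below compares which versions each constraint admits. The helper names `parse_version`, `satisfies_pin`, and `satisfies_minimum` are hypothetical and only handle plain dotted numeric versions; installers such as uv and pip apply the full PEP 440 rules, which this does not attempt to reproduce.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn a simple dotted version such as '1.74.3' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


# The version named in the commit message.
BASELINE = parse_version("1.74.3")


def satisfies_pin(candidate: str) -> bool:
    """Old constraint (litellm==1.74.3): only the exact version is accepted."""
    return parse_version(candidate) == BASELINE


def satisfies_minimum(candidate: str) -> bool:
    """New constraint (litellm>=1.74.3): the baseline or anything newer is accepted."""
    return parse_version(candidate) >= BASELINE


# Print, for each candidate, whether the old pin and the new minimum accept it.
for candidate in ("1.74.2", "1.74.3", "1.75.0", "2.0.0"):
    print(candidate, satisfies_pin(candidate), satisfies_minimum(candidate))
```

Under the old pin only 1.74.3 passes; under the minimum constraint every release from 1.74.3 onward passes, which is what lets users pick up newer litellm versions.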
Feature/update docs (#3205)
2026-01-06 06:38:29 +00:00 · 2025-07-22 23:50:14 +00:00 · 2025-07-22 13:55:27 -04:00 · 2025-07-21 22:08:07 -04:00 · 2025-07-21 19:36:48 -03:00 · 2025-07-21 15:26:34 -04:00
71 changed files with 6069 additions and 6366 deletions
--- a/.github/workflows/tests.yml
+++ b/.github/workflows/tests.yml
@@ -37,25 +37,9 @@ jobs:
      - name: Install the project
        run: uv sync --dev --all-extras

-      - name: Install SQLite with FTS5 support
-        run: |
-          # WORKAROUND: GitHub Actions' Ubuntu runner uses SQLite without FTS5 support compiled in.
-          # This is a temporary fix until the runner includes SQLite with FTS5 or Python's sqlite3
-          # module is compiled with FTS5 support by default.
-          # TODO: Remove this workaround once GitHub Actions runners include SQLite FTS5 support
-          
-          # Install pysqlite3-binary which has FTS5 support
-          uv pip install pysqlite3-binary
-          # Create a sitecustomize.py to override sqlite3 with pysqlite3
-          mkdir -p .pytest_sqlite_override
-          echo "import sys; import pysqlite3; sys.modules['sqlite3'] = pysqlite3" > .pytest_sqlite_override/sitecustomize.py
-          # Test FTS5 availability
-          PYTHONPATH=.pytest_sqlite_override uv run python -c "import sqlite3; print(f'SQLite version: {sqlite3.sqlite_version}')"
-          PYTHONPATH=.pytest_sqlite_override uv run python -c "import sqlite3; conn = sqlite3.connect(':memory:'); conn.execute('CREATE VIRTUAL TABLE test USING fts5(content)'); print('FTS5 module available')"
-
      - name: Run tests (group ${{ matrix.group }} of 8)
        run: |
-          PYTHONPATH=.pytest_sqlite_override uv run pytest \
+          uv run pytest \
            --block-network \
            --timeout=30 \
            -vv \
--- a/.gitignore
+++ b/.gitignore
@@ -26,4 +26,5 @@ test_flow.html
 crewairules.mdc
 plan.md
 conceptual_plan.md
-build_image
+build_image
+chromadb-*.lock
--- a/docs/docs.json
+++ b/docs/docs.json
@@ -9,12 +9,7 @@
  },
  "favicon": "/images/favicon.svg",
  "contextual": {
-    "options": [
-      "copy",
-      "view",
-      "chatgpt",
-      "claude"
-    ]
+    "options": ["copy", "view", "chatgpt", "claude"]
  },
  "navigation": {
    "languages": [
@@ -37,11 +32,6 @@
              "href": "https://chatgpt.com/g/g-qqTuUWsBY-crewai-assistant",
              "icon": "robot"
            },
-            {
-              "anchor": "Get Help",
-              "href": "mailto:support@crewai.com",
-              "icon": "headset"
-            },
            {
              "anchor": "Releases",
              "href": "https://github.com/crewAIInc/crewAI/releases",
@@ -55,32 +45,22 @@
            "groups": [
              {
                "group": "Get Started",
-                "pages": [
-                  "en/introduction",
-                  "en/installation",
-                  "en/quickstart"
-                ]
+                "pages": ["en/introduction", "en/installation", "en/quickstart"]
              },
              {
                "group": "Guides",
                "pages": [
                  {
                    "group": "Strategy",
-                    "pages": [
-                      "en/guides/concepts/evaluating-use-cases"
-                    ]
+                    "pages": ["en/guides/concepts/evaluating-use-cases"]
                  },
                  {
                    "group": "Agents",
-                    "pages": [
-                      "en/guides/agents/crafting-effective-agents"
-                    ]
+                    "pages": ["en/guides/agents/crafting-effective-agents"]
                  },
                  {
                    "group": "Crews",
-                    "pages": [
-                      "en/guides/crews/first-crew"
-                    ]
+                    "pages": ["en/guides/crews/first-crew"]
                  },
                  {
                    "group": "Flows",
@@ -94,7 +74,6 @@
                    "pages": [
                      "en/guides/advanced/customizing-prompts",
                      "en/guides/advanced/fingerprinting"
-
                    ]
                  }
                ]
@@ -182,7 +161,9 @@
                      "en/tools/search-research/websitesearchtool",
                      "en/tools/search-research/codedocssearchtool",
                      "en/tools/search-research/youtubechannelsearchtool",
-                      "en/tools/search-research/youtubevideosearchtool"
+                      "en/tools/search-research/youtubevideosearchtool",
+                      "en/tools/search-research/tavilysearchtool",
+                      "en/tools/search-research/tavilyextractortool"
                    ]
                  },
                  {
@@ -241,6 +222,7 @@
                  "en/observability/langtrace",
                  "en/observability/maxim",
                  "en/observability/mlflow",
+                  "en/observability/neatlogs",
                  "en/observability/openlit",
                  "en/observability/opik",
                  "en/observability/patronus-evaluation",
@@ -274,9 +256,7 @@
              },
              {
                "group": "Telemetry",
-                "pages": [
-                  "en/telemetry"
-                ]
+                "pages": ["en/telemetry"]
              }
            ]
          },
@@ -285,9 +265,7 @@
            "groups": [
              {
                "group": "Getting Started",
-                "pages": [
-                  "en/enterprise/introduction"
-                ]
+                "pages": ["en/enterprise/introduction"]
              },
              {
                "group": "Features",
@@ -342,9 +320,7 @@
              },
              {
                "group": "Resources",
-                "pages": [
-                  "en/enterprise/resources/frequently-asked-questions"
-                ]
+                "pages": ["en/enterprise/resources/frequently-asked-questions"]
              }
            ]
          },
@@ -353,9 +329,7 @@
            "groups": [
              {
                "group": "Getting Started",
-                "pages": [
-                  "en/api-reference/introduction"
-                ]
+                "pages": ["en/api-reference/introduction"]
              },
              {
                "group": "Endpoints",
@@ -365,16 +339,13 @@
          },
          {
            "tab": "Examples",
-                        "groups": [
+            "groups": [
              {
                "group": "Examples",
-                "pages": [
-                  "en/examples/example"
-                ]
+                "pages": ["en/examples/example"]
              }
            ]
          }
-
        ]
      },
      {
@@ -396,11 +367,6 @@
              "href": "https://chatgpt.com/g/g-qqTuUWsBY-crewai-assistant",
              "icon": "robot"
            },
-            {
-              "anchor": "Obter Ajuda",
-              "href": "mailto:support@crewai.com",
-              "icon": "headset"
-            },
            {
              "anchor": "Lançamentos",
              "href": "https://github.com/crewAIInc/crewAI/releases",
@@ -425,21 +391,15 @@
                "pages": [
                  {
                    "group": "Estratégia",
-                    "pages": [
-                      "pt-BR/guides/concepts/evaluating-use-cases"
-                    ]
+                    "pages": ["pt-BR/guides/concepts/evaluating-use-cases"]
                  },
                  {
                    "group": "Agentes",
-                    "pages": [
-                      "pt-BR/guides/agents/crafting-effective-agents"
-                    ]
+                    "pages": ["pt-BR/guides/agents/crafting-effective-agents"]
                  },
                  {
                    "group": "Crews",
-                    "pages": [
-                      "pt-BR/guides/crews/first-crew"
-                    ]
+                    "pages": ["pt-BR/guides/crews/first-crew"]
                  },
                  {
                    "group": "Flows",
@@ -632,9 +592,7 @@
              },
              {
                "group": "Telemetria",
-                "pages": [
-                  "pt-BR/telemetry"
-                ]
+                "pages": ["pt-BR/telemetry"]
              }
            ]
          },
@@ -643,9 +601,7 @@
            "groups": [
              {
                "group": "Começando",
-                "pages": [
-                  "pt-BR/enterprise/introduction"
-                ]
+                "pages": ["pt-BR/enterprise/introduction"]
              },
              {
                "group": "Funcionalidades",
@@ -710,9 +666,7 @@
            "groups": [
              {
                "group": "Começando",
-                "pages": [
-                  "pt-BR/api-reference/introduction"
-                ]
+                "pages": ["pt-BR/api-reference/introduction"]
              },
              {
                "group": "Endpoints",
@@ -722,16 +676,13 @@
          },
          {
            "tab": "Exemplos",
-                        "groups": [
+            "groups": [
              {
                "group": "Exemplos",
-                "pages": [
-                  "pt-BR/examples/example"
-                ]
+                "pages": ["pt-BR/examples/example"]
              }
            ]
          }
-
        ]
      }
    ]
--- a/docs/en/concepts/memory.mdx
+++ b/docs/en/concepts/memory.mdx
@@ -712,7 +712,7 @@ crew = Crew(
    memory_config={
        "provider": "mem0",
        "config": {"user_id": "john"},
-        "user_memory": {}  # Required - triggers user memory initialization
+        "user_memory": {}  # DEPRECATED: Will be removed in version 0.156.0 or on 2025-08-04, use external_memory instead
    },
    process=Process.sequential,
    verbose=True
--- a/docs/en/concepts/tasks.mdx
+++ b/docs/en/concepts/tasks.mdx
@@ -54,10 +54,11 @@ crew = Crew(
 | **Markdown** _(optional)_        | `markdown`        | `Optional[bool]`              | Whether the task should instruct the agent to return the final answer formatted in Markdown. Defaults to False.      |
 | **Config** _(optional)_          | `config`          | `Optional[Dict[str, Any]]`    | Task-specific configuration parameters.                                                                              |
 | **Output File** _(optional)_     | `output_file`     | `Optional[str]`               | File path for storing the task output.                                                                               |
+| **Create Directory** _(optional)_ | `create_directory` | `Optional[bool]`             | Whether to create the directory for output_file if it doesn't exist. Defaults to True.                               |
 | **Output JSON** _(optional)_     | `output_json`     | `Optional[Type[BaseModel]]`   | A Pydantic model to structure the JSON output.                                                                       |
 | **Output Pydantic** _(optional)_ | `output_pydantic` | `Optional[Type[BaseModel]]`   | A Pydantic model for task output.                                                                                    |
 | **Callback** _(optional)_        | `callback`        | `Optional[Any]`               | Function/object to be executed after task completion.                                                                |
-| **Guardrail** _(optional)_       | `guardrail`       | `Optional[Union[Callable, str]]` | Function or string description to validate task output before proceeding to next task.                            |
+| **Guardrail** _(optional)_       | `guardrail`       | `Optional[Callable]`             | Function to validate task output before proceeding to next task.                                                  |

 ## Creating Tasks

@@ -87,7 +88,6 @@ research_task:
  expected_output: >
    A list with 10 bullet points of the most relevant information about {topic}
  agent: researcher
-  guardrail: ensure each bullet contains a minimum of 100 words

 reporting_task:
  description: >
@@ -334,9 +334,7 @@ Task guardrails provide a way to validate and transform task outputs before they
 are passed to the next task. This feature helps ensure data quality and provides
 feedback to agents when their output doesn't meet specific criteria.

-**Guardrails can be defined in two ways:**
-1. **Function-based guardrails**: Python functions that implement custom validation logic
-2. **String-based guardrails**: Natural language descriptions that are automatically converted to LLM-powered validation
+Guardrails are implemented as Python functions that contain custom validation logic, giving you complete control over the validation process and ensuring reliable, deterministic results.

 ### Function-Based Guardrails

@@ -378,82 +376,7 @@ blog_task = Task(
   - On success: it returns a tuple of `(bool, Any)`. For example: `(True, validated_result)`
   - On failure: it returns a tuple of `(bool, str)`. For example: `(False, "Error message explaining the failure")`

-### String-Based Guardrails

-String-based guardrails allow you to describe validation criteria in natural language. When you provide a string instead of a function, CrewAI automatically converts it to an `LLMGuardrail` that uses an AI agent to validate the task output.
-
-#### Using String Guardrails in Python
-
-```python Code
-from crewai import Task
-
-# Simple string-based guardrail
-blog_task = Task(
-    description="Write a blog post about AI",
-    expected_output="A blog post under 200 words",
-    agent=blog_agent,
-    guardrail="Ensure the blog post is under 200 words and includes practical examples"
-)
-
-# More complex validation criteria
-research_task = Task(
-    description="Research AI trends for 2025",
-    expected_output="A comprehensive research report",
-    agent=research_agent,
-    guardrail="Ensure each finding includes a credible source and is backed by recent data from 2024-2025"
-)
-```
-
-#### Using String Guardrails in YAML
-
-```yaml
-research_task:
-  description: Research the latest AI developments
-  expected_output: A list of 10 bullet points about AI
-  agent: researcher
-  guardrail: ensure each bullet contains a minimum of 100 words
-
-validation_task:
-  description: Validate the research findings
-  expected_output: A validation report
-  agent: validator
-  guardrail: confirm all sources are from reputable publications and published within the last 2 years
-```
-
-#### How String Guardrails Work
-
-When you provide a string guardrail, CrewAI automatically:
-1. Creates an `LLMGuardrail` instance using the string as validation criteria
-2. Uses the task's agent LLM to power the validation
-3. Creates a temporary validation agent that checks the output against your criteria
-4. Returns detailed feedback if validation fails
-
-This approach is ideal when you want to use natural language to describe validation rules without writing custom validation functions.
-
-### LLMGuardrail Class
-
-The `LLMGuardrail` class is the underlying mechanism that powers string-based guardrails. You can also use it directly for more advanced control:
-
-```python Code
-from crewai import Task
-from crewai.tasks.llm_guardrail import LLMGuardrail
-from crewai.llm import LLM
-
-# Create a custom LLMGuardrail with specific LLM
-custom_guardrail = LLMGuardrail(
-    description="Ensure the response contains exactly 5 bullet points with proper citations",
-    llm=LLM(model="gpt-4o-mini")
-)
-
-task = Task(
-    description="Research AI safety measures",
-    expected_output="A detailed analysis with bullet points",
-    agent=research_agent,
-    guardrail=custom_guardrail
-)
-```
-
-**Note**: When you use a string guardrail, CrewAI automatically creates an `LLMGuardrail` instance using your task's agent LLM. Using `LLMGuardrail` directly gives you more control over the validation process and LLM selection.

 ### Error Handling Best Practices

@@ -881,21 +804,87 @@ These validations help in maintaining the consistency and reliability of task ex

 ## Creating Directories when Saving Files

-You can now specify if a task should create directories when saving its output to a file. This is particularly useful for organizing outputs and ensuring that file paths are correctly structured.
+The `create_directory` parameter controls whether CrewAI should automatically create directories when saving task outputs to files. This feature is particularly useful for organizing outputs and ensuring that file paths are correctly structured, especially when working with complex project hierarchies.
+
+### Default Behavior
+
+By default, `create_directory=True`, which means CrewAI will automatically create any missing directories in the output file path:

 ```python Code
-# ...
-
-save_output_task = Task(
-    description='Save the summarized AI news to a file',
-    expected_output='File saved successfully',
-    agent=research_agent,
-    tools=[file_save_tool],
-    output_file='outputs/ai_news_summary.txt',
-    create_directory=True
+# Default behavior - directories are created automatically
+report_task = Task(
+    description='Generate a comprehensive market analysis report',
+    expected_output='A detailed market analysis with charts and insights',
+    agent=analyst_agent,
+    output_file='reports/2025/market_analysis.md',  # Creates 'reports/2025/' if it doesn't exist
+    markdown=True
 )
+```

-#...
+### Disabling Directory Creation
+
+If you want to prevent automatic directory creation and ensure that the directory already exists, set `create_directory=False`:
+
+```python Code
+# Strict mode - directory must already exist
+strict_output_task = Task(
+    description='Save critical data that requires existing infrastructure',
+    expected_output='Data saved to pre-configured location',
+    agent=data_agent,
+    output_file='secure/vault/critical_data.json',
+    create_directory=False  # Will raise RuntimeError if 'secure/vault/' doesn't exist
+)
+```
+
+### YAML Configuration
+
+You can also configure this behavior in your YAML task definitions:
+
+```yaml tasks.yaml
+analysis_task:
+  description: >
+    Generate quarterly financial analysis
+  expected_output: >
+    A comprehensive financial report with quarterly insights
+  agent: financial_analyst
+  output_file: reports/quarterly/q4_2024_analysis.pdf
+  create_directory: true  # Automatically create 'reports/quarterly/' directory
+
+audit_task:
+  description: >
+    Perform compliance audit and save to existing audit directory
+  expected_output: >
+    A compliance audit report
+  agent: auditor
+  output_file: audit/compliance_report.md
+  create_directory: false  # Directory must already exist
+```
+
+### Use Cases
+
+**Automatic Directory Creation (`create_directory=True`):**
+- Development and prototyping environments
+- Dynamic report generation with date-based folders
+- Automated workflows where directory structure may vary
+- Multi-tenant applications with user-specific folders
+
+**Manual Directory Management (`create_directory=False`):**
+- Production environments with strict file system controls
+- Security-sensitive applications where directories must be pre-configured
+- Systems with specific permission requirements
+- Compliance environments where directory creation is audited
+
+### Error Handling
+
+When `create_directory=False` and the directory doesn't exist, CrewAI will raise a `RuntimeError`:
+
+```python Code
+try:
+    result = crew.kickoff()
+except RuntimeError as e:
+    # Handle missing directory error
+    print(f"Directory creation failed: {e}")
+    # Create directory manually or use fallback location
 ```

 Check out the video below to see how to use structured outputs in CrewAI:
--- a/docs/en/observability/neatlogs.mdx
+++ b/docs/en/observability/neatlogs.mdx
@@ -0,0 +1,134 @@
+---
+title: Neatlogs Integration
+description: Understand, debug, and share your CrewAI agent runs
+icon: magnifying-glass-chart
+---
+
+# Introduction
+
+Neatlogs helps you **see what your agent did**, **why**, and **share it**.
+
+It captures every step: thoughts, tool calls, responses, evaluations. No raw logs. Just clear, structured traces. Great for debugging and collaboration.
+
+## Why use Neatlogs?
+
+CrewAI agents use multiple tools and reasoning steps. When something goes wrong, you need context — not just errors.
+
+Neatlogs lets you:
+
+- Follow the full decision path
+- Add feedback directly on steps
+- Chat with the trace using AI assistant
+- Share runs publicly for feedback
+- Turn insights into tasks
+
+All in one place.
+
+Manage your traces effortlessly
+
+![Traces](/images/neatlogs-1.png)
+![Trace Response](/images/neatlogs-2.png)
+
+The best UX to view a CrewAI trace. Post comments anywhere you want. Use AI to debug.
+
+![Trace Details](/images/neatlogs-3.png)
+![Ai Chat Bot With A Trace](/images/neatlogs-4.png)
+![Comments Drawer](/images/neatlogs-5.png)
+
+## Core Features
+
+- **Trace Viewer**: Track thoughts, tools, and decisions in sequence
+- **Inline Comments**: Tag teammates on any trace step
+- **Feedback & Evaluation**: Mark outputs as correct or incorrect
+- **Error Highlighting**: Automatic flagging of API/tool failures
+- **Task Conversion**: Convert comments into assigned tasks
+- **Ask the Trace (AI)**: Chat with your trace using Neatlogs AI bot
+- **Public Sharing**: Publish trace links to your community
+
+## Quick Setup with CrewAI
+
+<Steps>
+  <Step title="Sign Up & Get API Key">
+    Visit [neatlogs.com](https://neatlogs.com/?utm_source=crewAI-docs), create a project, copy the API key.
+  </Step>
+  <Step title="Install SDK">
+    ```bash
+    pip install neatlogs
+    ```
+    (Latest version 0.8.0, Python 3.8+; MIT license)
+  </Step>
+  <Step title="Initialize Neatlogs">
+    Before starting Crew agents, add:
+
+    ```python
+    import neatlogs
+    neatlogs.init("YOUR_PROJECT_API_KEY")
+    ```
+
+    Agents run as usual. Neatlogs captures everything automatically.
+
+  </Step>
+</Steps>
+
+
+
+## Under the Hood
+
+According to GitHub, Neatlogs:
+
+- Captures thoughts, tool calls, responses, errors, and token stats
+- Supports AI-powered task generation and robust evaluation workflows
+
+All with just two lines of code.
+
+
+
+## Watch It Work
+
+### 🔍 Full Demo (4 min)
+
+<iframe
+  width="100%"
+  height="315"
+  src="https://www.youtube.com/embed/8KDme9T2I7Q?si=b8oHteaBwFNs_Duk"
+  title="YouTube video player"
+  frameBorder="0"
+  allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
+  allowFullScreen
+></iframe>
+
+### ⚙️ CrewAI Integration (30 s)
+
+<iframe
+  className="w-full aspect-video rounded-xl"
+  src="https://www.loom.com/embed/9c78b552af43452bb3e4783cb8d91230?sid=e9d7d370-a91a-49b0-809e-2f375d9e801d"
+  title="Loom video player"
+  frameBorder="0"
+  allowFullScreen
+></iframe>
+
+
+
+## Links & Support
+
+- 📘 [Neatlogs Docs](https://docs.neatlogs.com/)
+- 🔐 [Dashboard & API Key](https://app.neatlogs.com/)
+- 🐦 [Follow on Twitter](https://twitter.com/neatlogs)
+- 📧 Contact: hello@neatlogs.com
+- 🛠 [GitHub SDK](https://github.com/NeatLogs/neatlogs)
+
+
+
+## TL;DR
+
+With just:
+
+```bash
+pip install neatlogs
+```
+```python
+import neatlogs
+neatlogs.init("YOUR_API_KEY")
+```
+You can now capture, understand, share, and act on your CrewAI agent runs in seconds.
+No setup overhead. Full trace transparency. Full team collaboration.
--- a/docs/en/tools/search-research/overview.mdx
+++ b/docs/en/tools/search-research/overview.mdx
@@ -44,6 +44,14 @@ These tools enable your agents to search the web, research topics, and find info
  <Card title="YouTube Video Search" icon="play" href="/en/tools/search-research/youtubevideosearchtool">
    Find and analyze YouTube videos by topic, keyword, or criteria.
  </Card>
+
+  <Card title="Tavily Search Tool" icon="magnifying-glass" href="/en/tools/search-research/tavilysearchtool">
+    Comprehensive web search using Tavily's AI-powered search API.
+  </Card>
+
+  <Card title="Tavily Extractor Tool" icon="file-text" href="/en/tools/search-research/tavilyextractortool">
+    Extract structured content from web pages using the Tavily API.
+  </Card>
 </CardGroup>

 ## **Common Use Cases**
@@ -55,17 +63,19 @@ These tools enable your agents to search the web, research topics, and find info
 - **Academic Research**: Find scholarly articles and technical papers

 ```python
-from crewai_tools import SerperDevTool, GitHubSearchTool, YoutubeVideoSearchTool
+from crewai_tools import SerperDevTool, GitHubSearchTool, YoutubeVideoSearchTool, TavilySearchTool, TavilyExtractorTool

 # Create research tools
 web_search = SerperDevTool()
 code_search = GitHubSearchTool()
 video_research = YoutubeVideoSearchTool()
+tavily_search = TavilySearchTool()
+content_extractor = TavilyExtractorTool()

 # Add to your agent
 agent = Agent(
    role="Research Analyst",
-    tools=[web_search, code_search, video_research],
+    tools=[web_search, code_search, video_research, tavily_search, content_extractor],
    goal="Gather comprehensive information on any topic"
 )
 ```
--- a/docs/en/tools/search-research/tavilyextractortool.mdx
+++ b/docs/en/tools/search-research/tavilyextractortool.mdx
@@ -0,0 +1,139 @@
+---
+title: "Tavily Extractor Tool"
+description: "Extract structured content from web pages using the Tavily API"
+icon: "file-text"
+---
+
+The `TavilyExtractorTool` allows CrewAI agents to extract structured content from web pages using the Tavily API. It can process single URLs or lists of URLs and provides options for controlling the extraction depth and including images.
+
+## Installation
+
+To use the `TavilyExtractorTool`, you need to install the `tavily-python` library:
+
+```shell
+pip install 'crewai[tools]' tavily-python
+```
+
+You also need to set your Tavily API key as an environment variable:
+
+```bash
+export TAVILY_API_KEY='your-tavily-api-key'
+```
+
+## Example Usage
+
+Here's how to initialize and use the `TavilyExtractorTool` within a CrewAI agent:
+
+```python
+import os
+from crewai import Agent, Task, Crew
+from crewai_tools import TavilyExtractorTool
+
+# Ensure TAVILY_API_KEY is set in your environment
+# os.environ["TAVILY_API_KEY"] = "YOUR_API_KEY"
+
+# Initialize the tool
+tavily_tool = TavilyExtractorTool()
+
+# Create an agent that uses the tool
+extractor_agent = Agent(
+    role='Web Content Extractor',
+    goal='Extract key information from specified web pages',
+    backstory='You are an expert at extracting relevant content from websites using the Tavily API.',
+    tools=[tavily_tool],
+    verbose=True
+)
+
+# Define a task for the agent
+extract_task = Task(
+    description='Extract the main content from the URL https://example.com using basic extraction depth.',
+    expected_output='A JSON string containing the extracted content from the URL.',
+    agent=extractor_agent
+)
+
+# Create and run the crew
+crew = Crew(
+    agents=[extractor_agent],
+    tasks=[extract_task],
+    verbose=True
+)
+
+result = crew.kickoff()
+print(result)
+```
+
+## Configuration Options
+
+The `TavilyExtractorTool` accepts the following arguments:
+
+- `urls` (Union[List[str], str]): **Required**. A single URL string or a list of URL strings to extract data from.
+- `include_images` (Optional[bool]): Whether to include images in the extraction results. Defaults to `False`.
+- `extract_depth` (Literal["basic", "advanced"]): The depth of extraction. Use `"basic"` for faster, surface-level extraction or `"advanced"` for more comprehensive extraction. Defaults to `"basic"`.
+- `timeout` (int): The maximum time in seconds to wait for the extraction request to complete. Defaults to `60`.
+
+## Advanced Usage
+
+### Multiple URLs with Advanced Extraction
+
+```python
+# Example with multiple URLs and advanced extraction
+multi_extract_task = Task(
+    description='Extract content from https://example.com and https://anotherexample.org using advanced extraction.',
+    expected_output='A JSON string containing the extracted content from both URLs.',
+    agent=extractor_agent
+)
+
+# Configure the tool with custom parameters
+custom_extractor = TavilyExtractorTool(
+    extract_depth='advanced',
+    include_images=True,
+    timeout=120
+)
+
+agent_with_custom_tool = Agent(
+    role="Advanced Content Extractor",
+    goal="Extract comprehensive content with images",
+    tools=[custom_extractor]
+)
+```
+
+### Tool Parameters
+
+You can customize the tool's behavior by setting parameters during initialization:
+
+```python
+# Initialize with custom configuration
+extractor_tool = TavilyExtractorTool(
+    extract_depth='advanced',  # More comprehensive extraction
+    include_images=True,       # Include image results
+    timeout=90                 # Custom timeout
+)
+```
+
+## Features
+
+- **Single or Multiple URLs**: Extract content from one URL or process multiple URLs in a single request
+- **Configurable Depth**: Choose between basic (fast) and advanced (comprehensive) extraction modes
+- **Image Support**: Optionally include images in the extraction results
+- **Structured Output**: Returns well-formatted JSON containing the extracted content
+- **Error Handling**: Robust handling of network timeouts and extraction errors
+
+## Response Format
+
+The tool returns a JSON string representing the structured data extracted from the provided URL(s). The exact structure depends on the content of the pages and the `extract_depth` used.
+
+Common response elements include:
+- **Title**: The page title
+- **Content**: Main text content of the page
+- **Images**: Image URLs and metadata (when `include_images=True`)
+- **Metadata**: Additional page information like author, description, etc.
+
+## Use Cases
+
+- **Content Analysis**: Extract and analyze content from competitor websites
+- **Research**: Gather structured data from multiple sources for analysis
+- **Content Migration**: Extract content from existing websites for migration
+- **Monitoring**: Regular extraction of content for change detection
+- **Data Collection**: Systematic extraction of information from web sources
+
+Refer to the [Tavily API documentation](https://docs.tavily.com/docs/tavily-api/python-sdk#extract) for detailed information about the response structure and available options.
--- a/docs/en/tools/search-research/tavilysearchtool.mdx
+++ b/docs/en/tools/search-research/tavilysearchtool.mdx
@@ -0,0 +1,122 @@
+---
+title: "Tavily Search Tool"
+description: "Perform comprehensive web searches using the Tavily Search API"
+icon: "magnifying-glass"
+---
+
+The `TavilySearchTool` provides an interface to the Tavily Search API, enabling CrewAI agents to perform comprehensive web searches. It allows for specifying search depth, topics, time ranges, included/excluded domains, and whether to include direct answers, raw content, or images in the results.
+
+## Installation
+
+To use the `TavilySearchTool`, you need to install the `tavily-python` library:
+
+```shell
+pip install 'crewai[tools]' tavily-python
+```
+
+## Environment Variables
+
+Ensure your Tavily API key is set as an environment variable:
+
+```bash
+export TAVILY_API_KEY='your_tavily_api_key'
+```
+
+## Example Usage
+
+Here's how to initialize and use the `TavilySearchTool` within a CrewAI agent:
+
+```python
+import os
+from crewai import Agent, Task, Crew
+from crewai_tools import TavilySearchTool
+
+# Ensure the TAVILY_API_KEY environment variable is set
+# os.environ["TAVILY_API_KEY"] = "YOUR_TAVILY_API_KEY"
+
+# Initialize the tool
+tavily_tool = TavilySearchTool()
+
+# Create an agent that uses the tool
+researcher = Agent(
+    role='Market Researcher',
+    goal='Find information about the latest AI trends',
+    backstory='An expert market researcher specializing in technology.',
+    tools=[tavily_tool],
+    verbose=True
+)
+
+# Create a task for the agent
+research_task = Task(
+    description='Search for the top 3 AI trends in 2024.',
+    expected_output='A JSON report summarizing the top 3 AI trends found.',
+    agent=researcher
+)
+
+# Form the crew and kick it off
+crew = Crew(
+    agents=[researcher],
+    tasks=[research_task],
+    verbose=True
+)
+
+result = crew.kickoff()
+print(result)
+```
+
+## Configuration Options
+
+The `TavilySearchTool` accepts the following arguments during initialization or when calling the `run` method:
+
+- `query` (str): **Required**. The search query string.
+- `search_depth` (Literal["basic", "advanced"], optional): The depth of the search. Defaults to `"basic"`.
+- `topic` (Literal["general", "news", "finance"], optional): The topic to focus the search on. Defaults to `"general"`.
+- `time_range` (Literal["day", "week", "month", "year"], optional): The time range for the search. Defaults to `None`.
+- `days` (int, optional): The number of days to search back. Relevant if `time_range` is not set. Defaults to `7`.
+- `max_results` (int, optional): The maximum number of search results to return. Defaults to `5`.
+- `include_domains` (Sequence[str], optional): A list of domains to specifically include in the search results. Defaults to `None`.
+- `exclude_domains` (Sequence[str], optional): A list of domains to exclude from the search. Defaults to `None`.
+- `include_answer` (Union[bool, Literal["basic", "advanced"]], optional): Whether to include a direct answer synthesized from the search results. Defaults to `False`.
+- `include_raw_content` (bool, optional): Whether to include the cleaned and parsed HTML content of each search result. Defaults to `False`.
+- `include_images` (bool, optional): Whether to include image results. Defaults to `False`.
+- `timeout` (int, optional): The request timeout in seconds. Defaults to `60`.
+
+## Advanced Usage
+
+You can configure the tool with custom parameters:
+
+```python
+# Example: Initialize with specific parameters
+custom_tavily_tool = TavilySearchTool(
+    search_depth='advanced',
+    max_results=10,
+    include_answer=True
+)
+
+# The agent will use these defaults
+agent_with_custom_tool = Agent(
+    role="Advanced Researcher",
+    goal="Conduct detailed research with comprehensive results",
+    backstory="A thorough researcher who relies on advanced web search.",
+    tools=[custom_tavily_tool]
+)
+```
+
+## Features
+
+- **Comprehensive Search**: Access to Tavily's powerful search index
+- **Configurable Depth**: Choose between basic and advanced search modes
+- **Topic Filtering**: Focus searches on general, news, or finance topics
+- **Time Range Control**: Limit results to specific time periods
+- **Domain Control**: Include or exclude specific domains
+- **Direct Answers**: Get synthesized answers from search results
+- **Content Filtering**: Prevent context window issues with automatic content truncation
+
+## Response Format
+
+The tool returns search results as a JSON string containing:
+- Search results with titles, URLs, and content snippets
+- Optional direct answers to queries
+- Optional image results
+- Optional cleaned and parsed page content (when `include_raw_content` is enabled)
+
+Content for each result is automatically truncated to prevent context window issues while maintaining the most relevant information.
--- a/docs/images/neatlogs-1.png
+++ b/docs/images/neatlogs-1.png
--- a/docs/images/neatlogs-2.png
+++ b/docs/images/neatlogs-2.png
--- a/docs/images/neatlogs-3.png
+++ b/docs/images/neatlogs-3.png
--- a/docs/images/neatlogs-4.png
+++ b/docs/images/neatlogs-4.png
--- a/docs/images/neatlogs-5.png
+++ b/docs/images/neatlogs-5.png
--- a/docs/pt-BR/concepts/tasks.mdx
+++ b/docs/pt-BR/concepts/tasks.mdx
@@ -54,10 +54,11 @@ crew = Crew(
 | **Markdown** _(opcional)_        | `markdown`        | `Optional[bool]`             | Se a tarefa deve instruir o agente a retornar a resposta final formatada em Markdown. O padrão é False.            |
 | **Config** _(opcional)_          | `config`          | `Optional[Dict[str, Any]]`   | Parâmetros de configuração específicos da tarefa.                                                                  |
 | **Arquivo de Saída** _(opcional)_| `output_file`     | `Optional[str]`              | Caminho do arquivo para armazenar a saída da tarefa.                                                               |
+| **Criar Diretório** _(opcional)_ | `create_directory` | `Optional[bool]`            | Se deve criar o diretório para output_file caso não exista. O padrão é True.                                       |
 | **Saída JSON** _(opcional)_      | `output_json`     | `Optional[Type[BaseModel]]`  | Um modelo Pydantic para estruturar a saída em JSON.                                                                |
 | **Output Pydantic** _(opcional)_ | `output_pydantic` | `Optional[Type[BaseModel]]`  | Um modelo Pydantic para a saída da tarefa.                                                                         |
 | **Callback** _(opcional)_        | `callback`        | `Optional[Any]`              | Função/objeto a ser executado após a conclusão da tarefa.                                                          |
-| **Guardrail** _(opcional)_       | `guardrail`       | `Optional[Union[Callable, str]]` | Função ou descrição em string para validar a saída da tarefa antes de prosseguir para a próxima tarefa.        |
+| **Guardrail** _(opcional)_       | `guardrail`       | `Optional[Callable]`             | Função para validar a saída da tarefa antes de prosseguir para a próxima tarefa.                                |

 ## Criando Tarefas

@@ -87,7 +88,6 @@ research_task:
  expected_output: >
    Uma lista com 10 tópicos em bullet points das informações mais relevantes sobre {topic}
  agent: researcher
-  guardrail: garanta que cada bullet point contenha no mínimo 100 palavras

 reporting_task:
  description: >
@@ -332,9 +332,7 @@ analysis_task = Task(

 Guardrails (trilhas de proteção) de tarefas fornecem uma maneira de validar e transformar as saídas das tarefas antes que elas sejam passadas para a próxima tarefa. Esse recurso assegura a qualidade dos dados e oferece feedback aos agentes quando sua saída não atende a critérios específicos.

-**Guardrails podem ser definidos de duas maneiras:**
-1. **Guardrails baseados em função**: Funções Python que implementam lógica de validação customizada
-2. **Guardrails baseados em string**: Descrições em linguagem natural que são automaticamente convertidas em validação baseada em LLM
+Guardrails são implementados como funções Python que contêm lógica de validação customizada, proporcionando controle total sobre o processo de validação e garantindo resultados confiáveis e determinísticos.

 ### Guardrails Baseados em Função

@@ -376,82 +374,7 @@ blog_task = Task(
   - Em caso de sucesso: retorna uma tupla `(True, resultado_validado)`
   - Em caso de falha: retorna uma tupla `(False, "mensagem de erro explicando a falha")`

-### Guardrails Baseados em String

-Guardrails baseados em string permitem que você descreva critérios de validação em linguagem natural. Quando você fornece uma string em vez de uma função, o CrewAI automaticamente a converte em um `LLMGuardrail` que usa um agente de IA para validar a saída da tarefa.
-
-#### Usando Guardrails de String em Python
-
-```python Code
-from crewai import Task
-
-# Guardrail simples baseado em string
-blog_task = Task(
-    description="Escreva um post de blog sobre IA",
-    expected_output="Um post de blog com menos de 200 palavras",
-    agent=blog_agent,
-    guardrail="Garanta que o post do blog tenha menos de 200 palavras e inclua exemplos práticos"
-)
-
-# Critérios de validação mais complexos
-research_task = Task(
-    description="Pesquise tendências de IA para 2025",
-    expected_output="Um relatório abrangente de pesquisa",
-    agent=research_agent,
-    guardrail="Garanta que cada descoberta inclua uma fonte confiável e seja respaldada por dados recentes de 2024-2025"
-)
-```
-
-#### Usando Guardrails de String em YAML
-
-```yaml
-research_task:
-  description: Pesquise os últimos desenvolvimentos em IA
-  expected_output: Uma lista de 10 bullet points sobre IA
-  agent: researcher
-  guardrail: garanta que cada bullet point contenha no mínimo 100 palavras
-
-validation_task:
-  description: Valide os achados da pesquisa
-  expected_output: Um relatório de validação
-  agent: validator
-  guardrail: confirme que todas as fontes são de publicações respeitáveis e publicadas nos últimos 2 anos
-```
-
-#### Como Funcionam os Guardrails de String
-
-Quando você fornece um guardrail de string, o CrewAI automaticamente:
-1. Cria uma instância `LLMGuardrail` usando a string como critério de validação
-2. Usa o LLM do agente da tarefa para alimentar a validação
-3. Cria um agente temporário de validação que verifica a saída contra seus critérios
-4. Retorna feedback detalhado se a validação falhar
-
-Esta abordagem é ideal quando você quer usar linguagem natural para descrever regras de validação sem escrever funções de validação customizadas.
-
-### Classe LLMGuardrail
-
-A classe `LLMGuardrail` é o mecanismo subjacente que alimenta os guardrails baseados em string. Você também pode usá-la diretamente para maior controle avançado:
-
-```python Code
-from crewai import Task
-from crewai.tasks.llm_guardrail import LLMGuardrail
-from crewai.llm import LLM
-
-# Crie um LLMGuardrail customizado com LLM específico
-custom_guardrail = LLMGuardrail(
-    description="Garanta que a resposta contenha exatamente 5 bullet points com citações adequadas",
-    llm=LLM(model="gpt-4o-mini")
-)
-
-task = Task(
-    description="Pesquise medidas de segurança em IA",
-    expected_output="Uma análise detalhada com bullet points",
-    agent=research_agent,
-    guardrail=custom_guardrail
-)
-```
-
-**Nota**: Quando você usa um guardrail de string, o CrewAI automaticamente cria uma instância `LLMGuardrail` usando o LLM do agente da sua tarefa. Usar `LLMGuardrail` diretamente lhe dá mais controle sobre o processo de validação e seleção de LLM.

 ### Melhores Práticas de Tratamento de Erros

@@ -902,26 +825,7 @@ task = Task(
 )
 ```

-#### Use uma abordagem no-code para validação

-```python Code
-from crewai import Task
-
-task = Task(
-    description="Gerar dados em JSON",
-    expected_output="Objeto JSON válido",
-    guardrail="Garanta que a resposta é um objeto JSON válido"
-)
-```
-
-#### Usando YAML
-
-```yaml
-research_task:
-  ...
-  guardrail: garanta que cada bullet tenha no mínimo 100 palavras
-  ...
-```

 ```python Code
@CrewBase
@@ -1037,21 +941,87 @@ task = Task(

 ## Criando Diretórios ao Salvar Arquivos

-Agora é possível especificar se uma tarefa deve criar diretórios ao salvar sua saída em arquivo. Isso é útil para organizar outputs e garantir que os caminhos estejam corretos.
+O parâmetro `create_directory` controla se o CrewAI deve criar automaticamente diretórios ao salvar saídas de tarefas em arquivos. Este recurso é particularmente útil para organizar outputs e garantir que os caminhos de arquivos estejam estruturados corretamente, especialmente ao trabalhar com hierarquias de projetos complexas.
+
+### Comportamento Padrão
+
+Por padrão, `create_directory=True`, o que significa que o CrewAI criará automaticamente qualquer diretório ausente no caminho do arquivo de saída:

 ```python Code
-# ...
-
-save_output_task = Task(
-    description='Salve o resumo das notícias de IA em um arquivo',
-    expected_output='Arquivo salvo com sucesso',
-    agent=research_agent,
-    tools=[file_save_tool],
-    output_file='outputs/ai_news_summary.txt',
-    create_directory=True
+# Comportamento padrão - diretórios são criados automaticamente
+report_task = Task(
+    description='Gerar um relatório abrangente de análise de mercado',
+    expected_output='Uma análise detalhada de mercado com gráficos e insights',
+    agent=analyst_agent,
+    output_file='reports/2025/market_analysis.md',  # Cria 'reports/2025/' se não existir
+    markdown=True
 )
+```

-#...
+### Desabilitando a Criação de Diretórios
+
+Se você quiser evitar a criação automática de diretórios e garantir que o diretório já exista, defina `create_directory=False`:
+
+```python Code
+# Modo estrito - o diretório já deve existir
+strict_output_task = Task(
+    description='Salvar dados críticos que requerem infraestrutura existente',
+    expected_output='Dados salvos em localização pré-configurada',
+    agent=data_agent,
+    output_file='secure/vault/critical_data.json',
+    create_directory=False  # Gerará RuntimeError se 'secure/vault/' não existir
+)
+```
+
+### Configuração YAML
+
+Você também pode configurar este comportamento em suas definições de tarefas YAML:
+
+```yaml tasks.yaml
+analysis_task:
+  description: >
+    Gerar análise financeira trimestral
+  expected_output: >
+    Um relatório financeiro abrangente com insights trimestrais
+  agent: financial_analyst
+  output_file: reports/quarterly/q4_2024_analysis.pdf
+  create_directory: true  # Criar automaticamente o diretório 'reports/quarterly/'
+
+audit_task:
+  description: >
+    Realizar auditoria de conformidade e salvar no diretório de auditoria existente
+  expected_output: >
+    Um relatório de auditoria de conformidade
+  agent: auditor
+  output_file: audit/compliance_report.md
+  create_directory: false  # O diretório já deve existir
+```
+
+### Casos de Uso
+
+**Criação Automática de Diretórios (`create_directory=True`):**
+- Ambientes de desenvolvimento e prototipagem
+- Geração dinâmica de relatórios com pastas baseadas em datas
+- Fluxos de trabalho automatizados onde a estrutura de diretórios pode variar
+- Aplicações multi-tenant com pastas específicas do usuário
+
+**Gerenciamento Manual de Diretórios (`create_directory=False`):**
+- Ambientes de produção com controles rígidos do sistema de arquivos
+- Aplicações sensíveis à segurança onde diretórios devem ser pré-configurados
+- Sistemas com requisitos específicos de permissão
+- Ambientes de conformidade onde a criação de diretórios é auditada
+
+### Tratamento de Erros
+
+Quando `create_directory=False` e o diretório não existe, o CrewAI gerará um `RuntimeError`:
+
+```python Code
+try:
+    result = crew.kickoff()
+except RuntimeError as e:
+    # Tratar erro de diretório ausente
+    print(f"Falha na criação do diretório: {e}")
+    # Criar diretório manualmente ou usar local alternativo
 ```

 Veja o vídeo abaixo para aprender como utilizar saídas estruturadas no CrewAI:
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -11,7 +11,7 @@ dependencies = [
    # Core Dependencies
    "pydantic>=2.4.2",
    "openai>=1.13.3",
-    "litellm==1.72.6",
+    "litellm>=1.74.3",
    "instructor>=1.3.3",
    # Text Processing
    "pdfplumber>=0.11.4",
@@ -39,6 +39,7 @@ dependencies = [
    "tomli>=2.0.2",
    "blinker>=1.9.0",
    "json5>=0.10.0",
+    "portalocker==2.7.0",
 ]

 [project.urls]
@@ -47,7 +48,7 @@ Documentation = "https://docs.crewai.com"
 Repository = "https://github.com/crewAIInc/crewAI"

 [project.optional-dependencies]
-tools = ["crewai-tools~=0.51.0"]
+tools = ["crewai-tools~=0.55.0"]
 embeddings = [
    "tiktoken~=0.8.0"
 ]
--- a/src/crewai/__init__.py
+++ b/src/crewai/__init__.py
@@ -54,7 +54,7 @@ def _track_install_async():

 _track_install_async()

-__version__ = "0.141.0"
+__version__ = "0.148.0"
 __all__ = [
    "Agent",
    "Crew",
--- a/src/crewai/agents/crew_agent_executor.py
+++ b/src/crewai/agents/crew_agent_executor.py
@@ -120,11 +120,8 @@ class CrewAgentExecutor(CrewAgentExecutorMixin):
            raise
        except Exception as e:
            handle_unknown_error(self._printer, e)
-            if e.__class__.__module__.startswith("litellm"):
-                # Do not retry on litellm errors
-                raise e
-            else:
-                raise e
+            raise
+

        if self.ask_for_human_input:
            formatted_answer = self._handle_human_feedback(formatted_answer)
--- a/src/crewai/cli/templates/crew/pyproject.toml
+++ b/src/crewai/cli/templates/crew/pyproject.toml
@@ -5,7 +5,7 @@ description = "{{name}} using crewAI"
 authors = [{ name = "Your Name", email = "you@example.com" }]
 requires-python = ">=3.10,<3.14"
 dependencies = [
-    "crewai[tools]>=0.141.0,<1.0.0"
+    "crewai[tools]>=0.148.0,<1.0.0"
 ]

 [project.scripts]
--- a/src/crewai/cli/templates/flow/pyproject.toml
+++ b/src/crewai/cli/templates/flow/pyproject.toml
@@ -5,7 +5,7 @@ description = "{{name}} using crewAI"
 authors = [{ name = "Your Name", email = "you@example.com" }]
 requires-python = ">=3.10,<3.14"
 dependencies = [
-    "crewai[tools]>=0.141.0,<1.0.0",
+    "crewai[tools]>=0.148.0,<1.0.0",
 ]

 [project.scripts]
--- a/src/crewai/cli/templates/tool/pyproject.toml
+++ b/src/crewai/cli/templates/tool/pyproject.toml
@@ -5,7 +5,7 @@ description = "Power up your crews with {{folder_name}}"
 readme = "README.md"
 requires-python = ">=3.10,<3.14"
 dependencies = [
-    "crewai[tools]>=0.141.0"
+    "crewai[tools]>=0.148.0"
 ]

 [tool.crewai]
--- a/src/crewai/crew.py
+++ b/src/crewai/crew.py
@@ -161,7 +161,7 @@ class Crew(FlowTrackable, BaseModel):
    )
    user_memory: Optional[InstanceOf[UserMemory]] = Field(
        default=None,
-        description="An instance of the UserMemory to be used by the Crew to store/fetch memories of a specific user.",
+        description="DEPRECATED: Will be removed in version 0.156.0 or on 2025-08-04, whichever comes first. Use external_memory instead.",
    )
    external_memory: Optional[InstanceOf[ExternalMemory]] = Field(
        default=None,
@@ -327,7 +327,7 @@ class Crew(FlowTrackable, BaseModel):
        self._short_term_memory = self.short_term_memory
        self._entity_memory = self.entity_memory

-        # UserMemory is gonna to be deprecated in the future, but we have to initialize a default value for now
+        # UserMemory will be removed in version 0.156.0 or on 2025-08-04, whichever comes first
        self._user_memory = None

        if self.memory:
@@ -1255,6 +1255,7 @@ class Crew(FlowTrackable, BaseModel):
        if self.external_memory:
            copied_data["external_memory"] = self.external_memory.model_copy(deep=True)
        if self.user_memory:
+            # DEPRECATED: UserMemory will be removed in version 0.156.0 or on 2025-08-04
            copied_data["user_memory"] = self.user_memory.model_copy(deep=True)

        copied_data.pop("agents", None)
@@ -1313,7 +1314,6 @@ class Crew(FlowTrackable, BaseModel):
        n_iterations: int,
        eval_llm: Union[str, InstanceOf[BaseLLM]],
        inputs: Optional[Dict[str, Any]] = None,
-        include_agent_eval: Optional[bool] = False
    ) -> None:
        """Test and evaluate the Crew with the given inputs for n iterations concurrently using concurrent.futures."""
        try:
@@ -1333,28 +1333,13 @@ class Crew(FlowTrackable, BaseModel):
            )
            test_crew = self.copy()

-            # TODO: Refator to use a single Evaluator Manage class
            evaluator = CrewEvaluator(test_crew, llm_instance)

-            if include_agent_eval:
-                from crewai.experimental.evaluation import create_default_evaluator
-                agent_evaluator = create_default_evaluator(crew=test_crew)
-
            for i in range(1, n_iterations + 1):
                evaluator.set_iteration(i)
-
-                if include_agent_eval:
-                    agent_evaluator.set_iteration(i)
-
                test_crew.kickoff(inputs=inputs)

-                # TODO: Refactor to use ListenerEvents instead of trigger each iteration manually
-                if include_agent_eval:
-                    agent_evaluator.evaluate_current_iteration()
-
            evaluator.print_crew_evaluation_result()
-            if include_agent_eval:
-                agent_evaluator.get_agent_evaluation(include_evaluation_feedback=True)

            crewai_event_bus.emit(
                self,
--- a/src/crewai/evaluation/__init__.py
+++ b/src/crewai/evaluation/__init__.py
@@ -1,53 +0,0 @@
-from crewai.evaluation.base_evaluator import (
-    BaseEvaluator,
-    EvaluationScore,
-    MetricCategory,
-    AgentEvaluationResult
-)
-
-from crewai.evaluation.metrics.semantic_quality_metrics import (
-    SemanticQualityEvaluator
-)
-
-from crewai.evaluation.metrics.goal_metrics import (
-    GoalAlignmentEvaluator
-)
-
-from crewai.evaluation.metrics.reasoning_metrics import (
-    ReasoningEfficiencyEvaluator
-)
-
-
-from crewai.evaluation.metrics.tools_metrics import (
-    ToolSelectionEvaluator,
-    ParameterExtractionEvaluator,
-    ToolInvocationEvaluator
-)
-
-from crewai.evaluation.evaluation_listener import (
-    EvaluationTraceCallback,
-    create_evaluation_callbacks
-)
-
-
-from crewai.evaluation.agent_evaluator import (
-    AgentEvaluator,
-    create_default_evaluator
-)
-
-__all__ = [
-    "BaseEvaluator",
-    "EvaluationScore",
-    "MetricCategory",
-    "AgentEvaluationResult",
-    "SemanticQualityEvaluator",
-    "GoalAlignmentEvaluator",
-    "ReasoningEfficiencyEvaluator",
-    "ToolSelectionEvaluator",
-    "ParameterExtractionEvaluator",
-    "ToolInvocationEvaluator",
-    "EvaluationTraceCallback",
-    "create_evaluation_callbacks",
-    "AgentEvaluator",
-    "create_default_evaluator"
-]
--- a/src/crewai/evaluation/agent_evaluator.py
+++ b/src/crewai/evaluation/agent_evaluator.py
@@ -1,178 +0,0 @@
-from crewai.evaluation.base_evaluator import AgentEvaluationResult, AggregationStrategy
-from crewai.agent import Agent
-from crewai.task import Task
-from crewai.evaluation.evaluation_display import EvaluationDisplayFormatter
-
-from typing import Any, Dict
-from collections import defaultdict
-from crewai.evaluation import BaseEvaluator, create_evaluation_callbacks
-from collections.abc import Sequence
-from crewai.crew import Crew
-from crewai.utilities.events.crewai_event_bus import crewai_event_bus
-from crewai.utilities.events.utils.console_formatter import ConsoleFormatter
-
-class AgentEvaluator:
-    def __init__(
-        self,
-        evaluators: Sequence[BaseEvaluator] | None = None,
-        crew: Crew | None = None,
-    ):
-        self.crew: Crew | None = crew
-        self.evaluators: Sequence[BaseEvaluator] | None = evaluators
-
-        self.agent_evaluators: dict[str, Sequence[BaseEvaluator] | None] = {}
-        if crew is not None:
-            assert crew and crew.agents is not None
-            for agent in crew.agents:
-                self.agent_evaluators[str(agent.id)] = self.evaluators
-
-        self.callback = create_evaluation_callbacks()
-        self.console_formatter = ConsoleFormatter()
-        self.display_formatter = EvaluationDisplayFormatter()
-
-        self.iteration = 1
-        self.iterations_results: dict[int, dict[str, list[AgentEvaluationResult]]] = {}
-
-    def set_iteration(self, iteration: int) -> None:
-        self.iteration = iteration
-
-    def evaluate_current_iteration(self) -> dict[str, list[AgentEvaluationResult]]:
-        if not self.crew:
-            raise ValueError("Cannot evaluate: no crew was provided to the evaluator.")
-
-        if not self.callback:
-            raise ValueError("Cannot evaluate: no callback was set. Use set_callback() method first.")
-
-        from rich.progress import Progress, SpinnerColumn, TextColumn, BarColumn
-        evaluation_results: defaultdict[str, list[AgentEvaluationResult]] = defaultdict(list)
-
-        total_evals = 0
-        for agent in self.crew.agents:
-            for task in self.crew.tasks:
-                if task.agent and task.agent.id == agent.id and self.agent_evaluators.get(str(agent.id)):
-                    total_evals += 1
-
-        with Progress(
-            SpinnerColumn(),
-            TextColumn("[bold blue]{task.description}[/bold blue]"),
-            BarColumn(),
-            TextColumn("{task.percentage:.0f}% completed"),
-            console=self.console_formatter.console
-        ) as progress:
-            eval_task = progress.add_task(f"Evaluating agents (iteration {self.iteration})...", total=total_evals)
-
-            for agent in self.crew.agents:
-                evaluator = self.agent_evaluators.get(str(agent.id))
-                if not evaluator:
-                    continue
-
-                for task in self.crew.tasks:
-
-                    if task.agent and str(task.agent.id) != str(agent.id):
-                        continue
-
-                    trace = self.callback.get_trace(str(agent.id), str(task.id))
-                    if not trace:
-                        self.console_formatter.print(f"[yellow]Warning: No trace found for agent {agent.role} on task {task.description[:30]}...[/yellow]")
-                        progress.update(eval_task, advance=1)
-                        continue
-
-                    with crewai_event_bus.scoped_handlers():
-                        result = self.evaluate(
-                            agent=agent,
-                            task=task,
-                            execution_trace=trace,
-                            final_output=task.output
-                        )
-                        evaluation_results[agent.role].append(result)
-                        progress.update(eval_task, advance=1)
-
-        self.iterations_results[self.iteration] = evaluation_results
-        return evaluation_results
-
-    def get_evaluation_results(self):
-        if self.iteration in self.iterations_results:
-            return self.iterations_results[self.iteration]
-
-        return self.evaluate_current_iteration()
-
-    def display_results_with_iterations(self):
-        self.display_formatter.display_summary_results(self.iterations_results)
-
-    def get_agent_evaluation(self, strategy: AggregationStrategy = AggregationStrategy.SIMPLE_AVERAGE, include_evaluation_feedback: bool = False):
-        agent_results = {}
-        with crewai_event_bus.scoped_handlers():
-            task_results = self.get_evaluation_results()
-            for agent_role, results in task_results.items():
-                if not results:
-                    continue
-
-                agent_id = results[0].agent_id
-
-                aggregated_result = self.display_formatter._aggregate_agent_results(
-                    agent_id=agent_id,
-                    agent_role=agent_role,
-                    results=results,
-                    strategy=strategy
-                )
-
-                agent_results[agent_role] = aggregated_result
-
-
-            if self.iteration == max(self.iterations_results.keys()):
-                self.display_results_with_iterations()
-
-            if include_evaluation_feedback:
-                self.display_evaluation_with_feedback()
-
-        return agent_results
-
-    def display_evaluation_with_feedback(self):
-        self.display_formatter.display_evaluation_with_feedback(self.iterations_results)
-
-    def evaluate(
-        self,
-        agent: Agent,
-        task: Task,
-        execution_trace: Dict[str, Any],
-        final_output: Any
-    ) -> AgentEvaluationResult:
-        result = AgentEvaluationResult(
-            agent_id=str(agent.id),
-            task_id=str(task.id)
-        )
-        assert self.evaluators is not None
-        for evaluator in self.evaluators:
-            try:
-                score = evaluator.evaluate(
-                    agent=agent,
-                    task=task,
-                    execution_trace=execution_trace,
-                    final_output=final_output
-                )
-                result.metrics[evaluator.metric_category] = score
-            except Exception as e:
-                self.console_formatter.print(f"Error in {evaluator.metric_category.value} evaluator: {str(e)}")
-
-        return result
-
-def create_default_evaluator(crew, llm=None):
-    from crewai.evaluation import (
-        GoalAlignmentEvaluator,
-        SemanticQualityEvaluator,
-        ToolSelectionEvaluator,
-        ParameterExtractionEvaluator,
-        ToolInvocationEvaluator,
-        ReasoningEfficiencyEvaluator
-    )
-
-    evaluators = [
-        GoalAlignmentEvaluator(llm=llm),
-        SemanticQualityEvaluator(llm=llm),
-        ToolSelectionEvaluator(llm=llm),
-        ParameterExtractionEvaluator(llm=llm),
-        ToolInvocationEvaluator(llm=llm),
-        ReasoningEfficiencyEvaluator(llm=llm),
-    ]
-
-    return AgentEvaluator(evaluators=evaluators, crew=crew)
--- a/src/crewai/evaluation/base_evaluator.py
+++ b/src/crewai/evaluation/base_evaluator.py
@@ -1,125 +0,0 @@
-import abc
-import enum
-from enum import Enum
-from typing import Any, Dict, List, Optional
-
-from pydantic import BaseModel, Field
-
-from crewai.agent import Agent
-from crewai.task import Task
-from crewai.llm import BaseLLM
-from crewai.utilities.llm_utils import create_llm
-
-class MetricCategory(enum.Enum):
-    GOAL_ALIGNMENT = "goal_alignment"
-    SEMANTIC_QUALITY = "semantic_quality"
-    REASONING_EFFICIENCY = "reasoning_efficiency"
-    TOOL_SELECTION = "tool_selection"
-    PARAMETER_EXTRACTION = "parameter_extraction"
-    TOOL_INVOCATION = "tool_invocation"
-
-    def title(self):
-        return self.value.replace('_', ' ').title()
-
-
-class EvaluationScore(BaseModel):
-    score: float | None = Field(
-        default=5.0,
-        description="Numeric score from 0-10 where 0 is worst and 10 is best, None if not applicable",
-        ge=0.0,
-        le=10.0
-    )
-    feedback: str = Field(
-        default="",
-        description="Detailed feedback explaining the evaluation score"
-    )
-    raw_response: str | None = Field(
-        default=None,
-        description="Raw response from the evaluator (e.g., LLM)"
-    )
-
-    def __str__(self) -> str:
-        if self.score is None:
-            return f"Score: N/A - {self.feedback}"
-        return f"Score: {self.score:.1f}/10 - {self.feedback}"
-
-
-class BaseEvaluator(abc.ABC):
-    def __init__(self, llm: BaseLLM | None = None):
-        self.llm: BaseLLM | None = create_llm(llm)
-
-    @property
-    @abc.abstractmethod
-    def metric_category(self) -> MetricCategory:
-        pass
-
-    @abc.abstractmethod
-    def evaluate(
-        self,
-        agent: Agent,
-        task: Task,
-        execution_trace: Dict[str, Any],
-        final_output: Any,
-    ) -> EvaluationScore:
-        pass
-
-
-class AgentEvaluationResult(BaseModel):
-    agent_id: str = Field(description="ID of the evaluated agent")
-    task_id: str = Field(description="ID of the task that was executed")
-    metrics: Dict[MetricCategory, EvaluationScore] = Field(
-        default_factory=dict,
-        description="Evaluation scores for each metric category"
-    )
-
-
-class AggregationStrategy(Enum):
-    SIMPLE_AVERAGE = "simple_average"  # Equal weight to all tasks
-    WEIGHTED_BY_COMPLEXITY = "weighted_by_complexity"  # Weight by task complexity
-    BEST_PERFORMANCE = "best_performance"  # Use best scores across tasks
-    WORST_PERFORMANCE = "worst_performance"  # Use worst scores across tasks
-
-
-class AgentAggregatedEvaluationResult(BaseModel):
-    agent_id: str = Field(
-        default="",
-        description="ID of the agent"
-    )
-    agent_role: str = Field(
-        default="",
-        description="Role of the agent"
-    )
-    task_count: int = Field(
-        default=0,
-        description="Number of tasks included in this aggregation"
-    )
-    aggregation_strategy: AggregationStrategy = Field(
-        default=AggregationStrategy.SIMPLE_AVERAGE,
-        description="Strategy used for aggregation"
-    )
-    metrics: Dict[MetricCategory, EvaluationScore] = Field(
-        default_factory=dict,
-        description="Aggregated metrics across all tasks"
-    )
-    task_results: List[str] = Field(
-        default_factory=list,
-        description="IDs of tasks included in this aggregation"
-    )
-    overall_score: Optional[float] = Field(
-        default=None,
-        description="Overall score for this agent"
-    )
-
-    def __str__(self) -> str:
-        result = f"Agent Evaluation: {self.agent_role}\n"
-        result += f"Strategy: {self.aggregation_strategy.value}\n"
-        result += f"Tasks evaluated: {self.task_count}\n"
-
-        for category, score in self.metrics.items():
-            result += f"\n\n- {category.value.upper()}: {score.score}/10\n"
-
-            if score.feedback:
-                detailed_feedback = "\n  ".join(score.feedback.split('\n'))
-                result += f"  {detailed_feedback}\n"
-
-        return result
--- a/src/crewai/evaluation/evaluation_display.py
+++ b/src/crewai/evaluation/evaluation_display.py
@@ -1,341 +0,0 @@
-from collections import defaultdict
-from typing import Dict, Any, List
-from rich.table import Table
-from rich.box import HEAVY_EDGE, ROUNDED
-from collections.abc import Sequence
-from crewai.evaluation.base_evaluator import AgentAggregatedEvaluationResult, AggregationStrategy, AgentEvaluationResult, MetricCategory
-from crewai.evaluation import EvaluationScore
-from crewai.utilities.events.utils.console_formatter import ConsoleFormatter
-from crewai.utilities.llm_utils import create_llm
-
-class EvaluationDisplayFormatter:
-    def __init__(self):
-        self.console_formatter = ConsoleFormatter()
-
-    def display_evaluation_with_feedback(self, iterations_results: Dict[int, Dict[str, List[Any]]]):
-        if not iterations_results:
-            self.console_formatter.print("[yellow]No evaluation results to display[/yellow]")
-            return
-
-        # Get all agent roles across all iterations
-        all_agent_roles: set[str] = set()
-        for iter_results in iterations_results.values():
-            all_agent_roles.update(iter_results.keys())
-
-        for agent_role in sorted(all_agent_roles):
-            self.console_formatter.print(f"\n[bold cyan]Agent: {agent_role}[/bold cyan]")
-
-            # Process each iteration
-            for iter_num, results in sorted(iterations_results.items()):
-                if agent_role not in results or not results[agent_role]:
-                    continue
-
-                agent_results = results[agent_role]
-                agent_id = agent_results[0].agent_id
-
-                # Aggregate results for this agent in this iteration
-                aggregated_result = self._aggregate_agent_results(
-                    agent_id=agent_id,
-                    agent_role=agent_role,
-                    results=agent_results,
-                )
-
-                # Display iteration header
-                self.console_formatter.print(f"\n[bold]Iteration {iter_num}[/bold]")
-
-                # Create table for this iteration
-                table = Table(box=ROUNDED)
-                table.add_column("Metric", style="cyan")
-                table.add_column("Score (1-10)", justify="center")
-                table.add_column("Feedback", style="green")
-
-                # Add metrics to table
-                if aggregated_result.metrics:
-                    for metric, evaluation_score in aggregated_result.metrics.items():
-                        score = evaluation_score.score
-
-                        if isinstance(score, (int, float)):
-                            if score >= 8.0:
-                                score_text = f"[green]{score:.1f}[/green]"
-                            elif score >= 6.0:
-                                score_text = f"[cyan]{score:.1f}[/cyan]"
-                            elif score >= 4.0:
-                                score_text = f"[yellow]{score:.1f}[/yellow]"
-                            else:
-                                score_text = f"[red]{score:.1f}[/red]"
-                        else:
-                            score_text = "[dim]N/A[/dim]"
-
-                        table.add_section()
-                        table.add_row(
-                            metric.title(),
-                            score_text,
-                            evaluation_score.feedback or ""
-                        )
-
-                if aggregated_result.overall_score is not None:
-                    overall_score = aggregated_result.overall_score
-                    if overall_score >= 8.0:
-                        overall_color = "green"
-                    elif overall_score >= 6.0:
-                        overall_color = "cyan"
-                    elif overall_score >= 4.0:
-                        overall_color = "yellow"
-                    else:
-                        overall_color = "red"
-
-                    table.add_section()
-                    table.add_row(
-                        "Overall Score",
-                        f"[{overall_color}]{overall_score:.1f}[/]",
-                        "Overall agent evaluation score"
-                    )
-
-                # Print the table for this iteration
-                self.console_formatter.print(table)
-
-    def display_summary_results(self, iterations_results: Dict[int, Dict[str, List[AgentAggregatedEvaluationResult]]]):
-        if not iterations_results:
-            self.console_formatter.print("[yellow]No evaluation results to display[/yellow]")
-            return
-
-        self.console_formatter.print("\n")
-
-        table = Table(title="Agent Performance Scores \n (1-10 Higher is better)", box=HEAVY_EDGE)
-
-        table.add_column("Agent/Metric", style="cyan")
-
-        for iter_num in sorted(iterations_results.keys()):
-            run_label = f"Run {iter_num}"
-            table.add_column(run_label, justify="center")
-
-        table.add_column("Avg. Total", justify="center")
-
-        all_agent_roles: set[str] = set()
-        for results in iterations_results.values():
-            all_agent_roles.update(results.keys())
-
-        for agent_role in sorted(all_agent_roles):
-            agent_scores_by_iteration = {}
-            agent_metrics_by_iteration = {}
-
-            for iter_num, results in sorted(iterations_results.items()):
-                if agent_role not in results or not results[agent_role]:
-                    continue
-
-                agent_results = results[agent_role]
-                agent_id = agent_results[0].agent_id
-
-                aggregated_result = self._aggregate_agent_results(
-                    agent_id=agent_id,
-                    agent_role=agent_role,
-                    results=agent_results,
-                    strategy=AggregationStrategy.SIMPLE_AVERAGE
-                )
-
-                valid_scores = [score.score for score in aggregated_result.metrics.values()
-                               if score.score is not None]
-                if valid_scores:
-                    avg_score = sum(valid_scores) / len(valid_scores)
-                    agent_scores_by_iteration[iter_num] = avg_score
-
-                agent_metrics_by_iteration[iter_num] = aggregated_result.metrics
-
-            if not agent_scores_by_iteration:
-                continue
-
-            avg_across_iterations = sum(agent_scores_by_iteration.values()) / len(agent_scores_by_iteration)
-
-            row = [f"[bold]{agent_role}[/bold]"]
-
-            for iter_num in sorted(iterations_results.keys()):
-                if iter_num in agent_scores_by_iteration:
-                    score = agent_scores_by_iteration[iter_num]
-                    if score >= 8.0:
-                        color = "green"
-                    elif score >= 6.0:
-                        color = "cyan"
-                    elif score >= 4.0:
-                        color = "yellow"
-                    else:
-                        color = "red"
-                    row.append(f"[bold {color}]{score:.1f}[/]")
-                else:
-                    row.append("-")
-
-            if avg_across_iterations >= 8.0:
-                color = "green"
-            elif avg_across_iterations >= 6.0:
-                color = "cyan"
-            elif avg_across_iterations >= 4.0:
-                color = "yellow"
-            else:
-                color = "red"
-            row.append(f"[bold {color}]{avg_across_iterations:.1f}[/]")
-
-            table.add_row(*row)
-
-            all_metrics: set[Any] = set()
-            for metrics in agent_metrics_by_iteration.values():
-                all_metrics.update(metrics.keys())
-
-            for metric in sorted(all_metrics, key=lambda x: x.value):
-                metric_scores = []
-
-                row = [f"  - {metric.title()}"]
-
-                for iter_num in sorted(iterations_results.keys()):
-                    if (iter_num in agent_metrics_by_iteration and
-                            metric in agent_metrics_by_iteration[iter_num]):
-                        metric_score = agent_metrics_by_iteration[iter_num][metric].score
-                        if metric_score is not None:
-                            metric_scores.append(metric_score)
-                            if metric_score >= 8.0:
-                                color = "green"
-                            elif metric_score >= 6.0:
-                                color = "cyan"
-                            elif metric_score >= 4.0:
-                                color = "yellow"
-                            else:
-                                color = "red"
-                            row.append(f"[{color}]{metric_score:.1f}[/]")
-                        else:
-                            row.append("[dim]N/A[/dim]")
-                    else:
-                        row.append("-")
-
-                if metric_scores:
-                    avg = sum(metric_scores) / len(metric_scores)
-                    if avg >= 8.0:
-                        color = "green"
-                    elif avg >= 6.0:
-                        color = "cyan"
-                    elif avg >= 4.0:
-                        color = "yellow"
-                    else:
-                        color = "red"
-                    row.append(f"[{color}]{avg:.1f}[/]")
-                else:
-                    row.append("-")
-
-                table.add_row(*row)
-
-            table.add_row(*[""] * (len(sorted(iterations_results.keys())) + 2))
-
-        self.console_formatter.print(table)
-        self.console_formatter.print("\n")
-
-    def _aggregate_agent_results(
-        self,
-        agent_id: str,
-        agent_role: str,
-        results: Sequence[AgentEvaluationResult],
-        strategy: AggregationStrategy = AggregationStrategy.SIMPLE_AVERAGE,
-    ) -> AgentAggregatedEvaluationResult:
-        metrics_by_category: dict[MetricCategory, list[EvaluationScore]] = defaultdict(list)
-
-        for result in results:
-            for metric_name, evaluation_score in result.metrics.items():
-                metrics_by_category[metric_name].append(evaluation_score)
-
-        aggregated_metrics: dict[MetricCategory, EvaluationScore] = {}
-        for category, scores in metrics_by_category.items():
-            valid_scores = [s.score for s in scores if s.score is not None]
-            avg_score = sum(valid_scores) / len(valid_scores) if valid_scores else None
-
-            feedbacks = [s.feedback for s in scores if s.feedback]
-
-            feedback_summary = None
-            if feedbacks:
-                if len(feedbacks) > 1:
-                    # Use the summarization method for multiple feedbacks
-                    feedback_summary = self._summarize_feedbacks(
-                        agent_role=agent_role,
-                        metric=category.title(),
-                        feedbacks=feedbacks,
-                        scores=[s.score for s in scores],
-                        strategy=strategy
-                    )
-                else:
-                    feedback_summary = feedbacks[0]
-
-            aggregated_metrics[category] = EvaluationScore(
-                score=avg_score,
-                feedback=feedback_summary
-            )
-
-        overall_score = None
-        if aggregated_metrics:
-            valid_scores = [m.score for m in aggregated_metrics.values() if m.score is not None]
-            if valid_scores:
-                overall_score = sum(valid_scores) / len(valid_scores)
-
-        return AgentAggregatedEvaluationResult(
-            agent_id=agent_id,
-            agent_role=agent_role,
-            metrics=aggregated_metrics,
-            overall_score=overall_score,
-            task_count=len(results),
-            aggregation_strategy=strategy
-        )
-
-    def _summarize_feedbacks(
-        self,
-        agent_role: str,
-        metric: str,
-        feedbacks: List[str],
-        scores: List[float | None],
-        strategy: AggregationStrategy
-    ) -> str:
-        if len(feedbacks) <= 2 and all(len(fb) < 200 for fb in feedbacks):
-            return "\n\n".join([f"Feedback {i+1}: {fb}" for i, fb in enumerate(feedbacks)])
-
-        try:
-            llm = create_llm()
-
-            formatted_feedbacks = []
-            for i, (feedback, score) in enumerate(zip(feedbacks, scores)):
-                if len(feedback) > 500:
-                    feedback = feedback[:500] + "..."
-                score_text = f"{score:.1f}" if score is not None else "N/A"
-                formatted_feedbacks.append(f"Feedback #{i+1} (Score: {score_text}):\n{feedback}")
-
-            all_feedbacks = "\n\n" + "\n\n---\n\n".join(formatted_feedbacks)
-
-            strategy_guidance = ""
-            if strategy == AggregationStrategy.BEST_PERFORMANCE:
-                strategy_guidance = "Focus on the highest-scoring aspects and strengths demonstrated."
-            elif strategy == AggregationStrategy.WORST_PERFORMANCE:
-                strategy_guidance = "Focus on areas that need improvement and common issues across tasks."
-            else:  # Default/average strategies
-                strategy_guidance = "Provide a balanced analysis of strengths and weaknesses across all tasks."
-
-            prompt = [
-                {"role": "system", "content": f"""You are an expert evaluator creating a comprehensive summary of agent performance feedback.
-                Your job is to synthesize multiple feedback points about the same metric across different tasks.
-
-                Create a concise, insightful summary that captures the key patterns and themes from all feedback.
-                {strategy_guidance}
-
-                Your summary should be:
-                1. Specific and concrete (not vague or general)
-                2. Focused on actionable insights
-                3. Highlighting patterns across tasks
-                4. 150-250 words in length
-
-                The summary should be directly usable as final feedback for the agent's performance on this metric."""},
-                {"role": "user", "content": f"""I need a synthesized summary of the following feedback for:
-
-                Agent Role: {agent_role}
-                Metric: {metric.title()}
-
-                {all_feedbacks}
-                """}
-            ]
-            assert llm is not None
-            response = llm.call(prompt)
-
-            return response
-
-        except Exception:
-            return "Synthesized from multiple tasks: " + "\n\n".join([f"- {fb[:500]}..." for fb in feedbacks])
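Review note: the removed formatter repeats the same 8.0 / 6.0 / 4.0 threshold ladder five times (per-metric, overall, per-run, run average, metric average). A minimal sketch of how that mapping could be factored out, assuming the same thresholds and Rich markup as the removed code; `score_color` and `format_score` are hypothetical names, not part of crewAI:

```python
from typing import Optional


def score_color(score: float) -> str:
    """Map a score to the color name used by the thresholds in the removed code."""
    if score >= 8.0:
        return "green"
    if score >= 6.0:
        return "cyan"
    if score >= 4.0:
        return "yellow"
    return "red"


def format_score(score: Optional[float], bold: bool = False) -> str:
    """Render a score as Rich markup, falling back to a dim N/A for missing scores."""
    if score is None:
        return "[dim]N/A[/dim]"
    color = score_color(score)
    style = f"bold {color}" if bold else color
    return f"[{style}]{score:.1f}[/]"


if __name__ == "__main__":
    for value in (9.0, 6.5, 4.0, 1.0, None):
        print(format_score(value))
```

Keeping the thresholds in one function would mean a future change to the color bands touches a single place rather than five.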
--- a/src/crewai/evaluation/evaluation_listener.py
+++ b/src/crewai/evaluation/evaluation_listener.py
@@ -1,190 +0,0 @@
-from datetime import datetime
-from typing import Any, Dict, Optional
-
-from collections.abc import Sequence
-
-from crewai.agent import Agent
-from crewai.task import Task
-from crewai.utilities.events.base_event_listener import BaseEventListener
-from crewai.utilities.events.crewai_event_bus import CrewAIEventsBus
-from crewai.utilities.events.agent_events import (
-    AgentExecutionStartedEvent,
-    AgentExecutionCompletedEvent
-)
-from crewai.utilities.events.tool_usage_events import (
-    ToolUsageFinishedEvent,
-    ToolUsageErrorEvent,
-    ToolExecutionErrorEvent,
-    ToolSelectionErrorEvent,
-    ToolValidateInputErrorEvent
-)
-from crewai.utilities.events.llm_events import (
-    LLMCallStartedEvent,
-    LLMCallCompletedEvent
-)
-
-class EvaluationTraceCallback(BaseEventListener):
-    """Event listener for collecting execution traces for evaluation.
-
-    This listener attaches to the event bus to collect detailed information
-    about the execution process, including agent steps, tool uses, knowledge
-    retrievals, and final output - all for use in agent evaluation.
-    """
-
-    _instance = None
-
-    def __new__(cls):
-        if cls._instance is None:
-            cls._instance = super().__new__(cls)
-            cls._instance._initialized = False
-        return cls._instance
-
-    def __init__(self):
-        if not hasattr(self, "_initialized") or not self._initialized:
-            super().__init__()
-            self.traces = {}
-            self.current_agent_id = None
-            self.current_task_id = None
-            self._initialized = True
-
-    def setup_listeners(self, event_bus: CrewAIEventsBus):
-        @event_bus.on(AgentExecutionStartedEvent)
-        def on_agent_started(source, event: AgentExecutionStartedEvent):
-            self.on_agent_start(event.agent, event.task)
-
-        @event_bus.on(AgentExecutionCompletedEvent)
-        def on_agent_completed(source, event: AgentExecutionCompletedEvent):
-            self.on_agent_finish(event.agent, event.task, event.output)
-
-        @event_bus.on(ToolUsageFinishedEvent)
-        def on_tool_completed(source, event: ToolUsageFinishedEvent):
-            self.on_tool_use(event.tool_name, event.tool_args, event.output, success=True)
-
-        @event_bus.on(ToolUsageErrorEvent)
-        def on_tool_usage_error(source, event: ToolUsageErrorEvent):
-            self.on_tool_use(event.tool_name, event.tool_args, event.error,
-                           success=False, error_type="usage_error")
-
-        @event_bus.on(ToolExecutionErrorEvent)
-        def on_tool_execution_error(source, event: ToolExecutionErrorEvent):
-            self.on_tool_use(event.tool_name, event.tool_args, event.error,
-                           success=False, error_type="execution_error")
-
-        @event_bus.on(ToolSelectionErrorEvent)
-        def on_tool_selection_error(source, event: ToolSelectionErrorEvent):
-            self.on_tool_use(event.tool_name, event.tool_args, event.error,
-                           success=False, error_type="selection_error")
-
-        @event_bus.on(ToolValidateInputErrorEvent)
-        def on_tool_validate_input_error(source, event: ToolValidateInputErrorEvent):
-            self.on_tool_use(event.tool_name, event.tool_args, event.error,
-                           success=False, error_type="validation_error")
-
-        @event_bus.on(LLMCallStartedEvent)
-        def on_llm_call_started(source, event: LLMCallStartedEvent):
-            self.on_llm_call_start(event.messages, event.tools)
-
-        @event_bus.on(LLMCallCompletedEvent)
-        def on_llm_call_completed(source, event: LLMCallCompletedEvent):
-            self.on_llm_call_end(event.messages, event.response)
-
-    def on_agent_start(self, agent: Agent, task: Task):
-        self.current_agent_id = agent.id
-        self.current_task_id = task.id
-
-        trace_key = f"{agent.id}_{task.id}"
-        self.traces[trace_key] = {
-            "agent_id": agent.id,
-            "task_id": task.id,
-            "tool_uses": [],
-            "llm_calls": [],
-            "start_time": datetime.now(),
-            "final_output": None
-        }
-
-    def on_agent_finish(self, agent: Agent, task: Task, output: Any):
-        trace_key = f"{agent.id}_{task.id}"
-        if trace_key in self.traces:
-            self.traces[trace_key]["final_output"] = output
-            self.traces[trace_key]["end_time"] = datetime.now()
-
-        self.current_agent_id = None
-        self.current_task_id = None
-
-    def on_tool_use(self, tool_name: str, tool_args: dict[str, Any] | str, result: Any,
-                   success: bool = True, error_type: str | None = None):
-        if not self.current_agent_id or not self.current_task_id:
-            return
-
-        trace_key = f"{self.current_agent_id}_{self.current_task_id}"
-        if trace_key in self.traces:
-            tool_use = {
-                "tool": tool_name,
-                "args": tool_args,
-                "result": result,
-                "success": success,
-                "timestamp": datetime.now()
-            }
-
-            # Add error information if applicable
-            if not success and error_type:
-                tool_use["error"] = True
-                tool_use["error_type"] = error_type
-
-            self.traces[trace_key]["tool_uses"].append(tool_use)
-
-    def on_llm_call_start(self, messages: str | Sequence[dict[str, Any]] | None, tools: Sequence[dict[str, Any]] | None = None):
-        if not self.current_agent_id or not self.current_task_id:
-            return
-
-        trace_key = f"{self.current_agent_id}_{self.current_task_id}"
-        if trace_key not in self.traces:
-            return
-
-        self.current_llm_call = {
-            "messages": messages,
-            "tools": tools,
-            "start_time": datetime.now(),
-            "response": None,
-            "end_time": None
-        }
-
-    def on_llm_call_end(self, messages: str | list[dict[str, Any]] | None, response: Any):
-        if not self.current_agent_id or not self.current_task_id:
-            return
-
-        trace_key = f"{self.current_agent_id}_{self.current_task_id}"
-        if trace_key not in self.traces:
-            return
-
-        total_tokens = 0
-        if hasattr(response, "usage") and hasattr(response.usage, "total_tokens"):
-            total_tokens = response.usage.total_tokens
-
-        current_time = datetime.now()
-        start_time = None
-        if hasattr(self, "current_llm_call") and self.current_llm_call:
-            start_time = self.current_llm_call.get("start_time")
-
-        if not start_time:
-            start_time = current_time
-        llm_call = {
-            "messages": messages,
-            "response": response,
-            "start_time": start_time,
-            "end_time": current_time,
-            "total_tokens": total_tokens
-        }
-
-        self.traces[trace_key]["llm_calls"].append(llm_call)
-
-        if hasattr(self, "current_llm_call"):
-            self.current_llm_call = {}
-
-    def get_trace(self, agent_id: str, task_id: str) -> Optional[Dict[str, Any]]:
-        trace_key = f"{agent_id}_{task_id}"
-        return self.traces.get(trace_key)
-
-
-def create_evaluation_callbacks() -> EvaluationTraceCallback:
-    return EvaluationTraceCallback()
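Review note: the removed listener is a process-wide singleton (`__new__` returns a cached instance and `__init__` is guarded by `_initialized`), so traces are shared by every caller of `create_evaluation_callbacks()`. A stripped-down sketch of that pattern, with a hypothetical `TraceCollector` class standing in for `EvaluationTraceCallback` and no event bus involved:

```python
from typing import Any, Dict, Optional


class TraceCollector:
    """Hypothetical stand-in showing the singleton guard used by the removed listener."""

    _instance = None

    def __new__(cls):
        # Create the instance once; later calls return the same object.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._initialized = False
        return cls._instance

    def __init__(self):
        # __init__ runs on every TraceCollector() call, so guard the state setup.
        if not self._initialized:
            self.traces: Dict[str, Dict[str, Any]] = {}
            self._initialized = True

    def record(self, agent_id: str, task_id: str, **data: Any) -> None:
        # Same key scheme as the removed code: "<agent_id>_<task_id>".
        self.traces[f"{agent_id}_{task_id}"] = data

    def get_trace(self, agent_id: str, task_id: str) -> Optional[Dict[str, Any]]:
        return self.traces.get(f"{agent_id}_{task_id}")


if __name__ == "__main__":
    first = TraceCollector()
    first.record("agent-1", "task-1", tool_uses=[])
    second = TraceCollector()
    print(first is second, second.get_trace("agent-1", "task-1"))
```

One consequence worth keeping in mind: because state lives on the shared instance, traces from an earlier run stay visible until they are overwritten or cleared.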
--- a/src/crewai/evaluation/experiment/testing.py
+++ b/src/crewai/evaluation/experiment/testing.py
@@ -1,49 +0,0 @@
-import warnings
-from crewai.experimental.evaluation import ExperimentResults
-
-def assert_experiment_successfully(experiment_results: ExperimentResults) -> None:
-    """
-    Assert that all experiment results passed successfully.
-
-    Args:
-        experiment_results: The experiment results to check
-
-    Raises:
-        AssertionError: If any test case failed
-    """
-    failed_tests = [result for result in experiment_results.results if not result.passed]
-
-    if failed_tests:
-        detailed_failures: list[str] = []
-
-        for result in failed_tests:
-            expected = result.expected_score
-            actual = result.score
-            detailed_failures.append(f"- {result.identifier}: expected {expected}, got {actual}")
-
-        failure_details = "\n".join(detailed_failures)
-        raise AssertionError(f"The following test cases failed:\n{failure_details}")
-
-def assert_experiment_no_regression(comparison_result: dict[str, list[str]]) -> None:
-    """
-    Assert that there are no regressions in the experiment results compared to baseline.
-    Also warns if there are missing tests.
-
-    Args:
-        comparison_result: The result from compare_with_baseline()
-
-    Raises:
-        AssertionError: If there are regressions
-    """
-    # Check for regressions
-    regressed = comparison_result.get("regressed", [])
-    if regressed:
-        raise AssertionError(f"Regression detected! The following tests that previously passed now fail: {regressed}")
-
-    # Check for missing tests and warn
-    missing_tests = comparison_result.get("missing_tests", [])
-    if missing_tests:
-        warnings.warn(
-            f"Warning: {len(missing_tests)} tests from the baseline are missing in the current run: {missing_tests}",
-            UserWarning
-        )
--- a/src/crewai/evaluation/json_parser.py
+++ b/src/crewai/evaluation/json_parser.py
@@ -1,30 +0,0 @@
-"""Robust JSON parsing utilities for evaluation responses."""
-
-import json
-import re
-from typing import Any
-
-
-def extract_json_from_llm_response(text: str) -> dict[str, Any]:
-    try:
-        return json.loads(text)
-    except json.JSONDecodeError:
-        pass
-
-    json_patterns = [
-        # Standard markdown code blocks with json
-        r'```json\s*([\s\S]*?)\s*```',
-        # Code blocks without language specifier
-        r'```\s*([\s\S]*?)\s*```',
-        # Inline code with JSON
-        r'`([{\\[].*[}\]])`',
-    ]
-
-    for pattern in json_patterns:
-        matches = re.findall(pattern, text, re.IGNORECASE | re.DOTALL)
-        for match in matches:
-            try:
-                return json.loads(match.strip())
-            except json.JSONDecodeError:
-                continue
-    raise ValueError("No valid JSON found in the response")
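Review note: the removed parser tries the raw text first, then fenced `json` blocks, then unlabeled fenced blocks, then inline code. The sketch below restates that fallback order in a standalone form so it can be exercised without crewAI; `extract_json` and `FENCE` are illustrative names, and the fence marker is built programmatically only to keep this example readable.

```python
import json
import re
from typing import Any

FENCE = "`" * 3  # three backticks, as used by markdown code fences


def extract_json(text: str) -> Any:
    """Return the first JSON value found in an LLM reply, trying stricter forms first."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass

    patterns = [
        FENCE + r"json\s*([\s\S]*?)\s*" + FENCE,  # fenced block labeled json
        FENCE + r"\s*([\s\S]*?)\s*" + FENCE,      # fenced block without a label
        r"`([{\[].*[}\]])`",                      # inline code holding an object or array
    ]
    for pattern in patterns:
        for match in re.findall(pattern, text, re.IGNORECASE | re.DOTALL):
            try:
                return json.loads(match.strip())
            except json.JSONDecodeError:
                continue
    raise ValueError("No valid JSON found in the response")


if __name__ == "__main__":
    reply = "Here is my evaluation:\n" + FENCE + 'json\n{"score": 7}\n' + FENCE
    print(extract_json(reply))
```

Callers in the removed evaluators wrap this call in `try/except` and fall back to a `score=None` result, so a `ValueError` here surfaces as "Failed to parse evaluation" feedback rather than a crash.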
--- a/src/crewai/evaluation/metrics/__init__.py
+++ b/src/crewai/evaluation/metrics/__init__.py
--- a/src/crewai/evaluation/metrics/goal_metrics.py
+++ b/src/crewai/evaluation/metrics/goal_metrics.py
@@ -1,66 +0,0 @@
-from typing import Any, Dict
-
-from crewai.agent import Agent
-from crewai.task import Task
-
-from crewai.evaluation.base_evaluator import BaseEvaluator, EvaluationScore, MetricCategory
-from crewai.evaluation.json_parser import extract_json_from_llm_response
-
-class GoalAlignmentEvaluator(BaseEvaluator):
-    @property
-    def metric_category(self) -> MetricCategory:
-        return MetricCategory.GOAL_ALIGNMENT
-
-    def evaluate(
-        self,
-        agent: Agent,
-        task: Task,
-        execution_trace: Dict[str, Any],
-        final_output: Any,
-    ) -> EvaluationScore:
-        prompt = [
-            {"role": "system", "content": """You are an expert evaluator assessing how well an AI agent's output aligns with its assigned task goal.
-
-Score the agent's goal alignment on a scale from 0-10 where:
-- 0: Complete misalignment, agent did not understand or attempt the task goal
-- 5: Partial alignment, agent attempted the task but missed key requirements
-- 10: Perfect alignment, agent fully satisfied all task requirements
-
-Consider:
-1. Did the agent correctly interpret the task goal?
-2. Did the final output directly address the requirements?
-3. Did the agent focus on relevant aspects of the task?
-4. Did the agent provide all requested information or deliverables?
-
-Return your evaluation as JSON with fields 'score' (number) and 'feedback' (string).
-"""},
-            {"role": "user", "content": f"""
-Agent role: {agent.role}
-Agent goal: {agent.goal}
-Task description: {task.description}
-Expected output: {task.expected_output}
-
-Agent's final output:
-{final_output}
-
-Evaluate how well the agent's output aligns with the assigned task goal.
-"""}
-        ]
-        assert self.llm is not None
-        response = self.llm.call(prompt)
-
-        try:
-            evaluation_data: dict[str, Any] = extract_json_from_llm_response(response)
-            assert evaluation_data is not None
-
-            return EvaluationScore(
-                score=evaluation_data.get("score", 0),
-                feedback=evaluation_data.get("feedback", response),
-                raw_response=response
-            )
-        except Exception:
-            return EvaluationScore(
-                score=None,
-                feedback=f"Failed to parse evaluation. Raw response: {response}",
-                raw_response=response
-            )
--- a/src/crewai/evaluation/metrics/reasoning_metrics.py
+++ b/src/crewai/evaluation/metrics/reasoning_metrics.py
@@ -1,355 +0,0 @@
-"""Agent reasoning efficiency evaluators.
-
-This module provides evaluator implementations for:
-- Reasoning efficiency
-- Loop detection
-- Thinking-to-action ratio
-"""
-
-import logging
-import re
-from enum import Enum
-from typing import Any, Dict, List, Tuple
-import numpy as np
-from collections.abc import Sequence
-
-from crewai.agent import Agent
-from crewai.task import Task
-
-from crewai.evaluation.base_evaluator import BaseEvaluator, EvaluationScore, MetricCategory
-from crewai.evaluation.json_parser import extract_json_from_llm_response
-from crewai.tasks.task_output import TaskOutput
-
-class ReasoningPatternType(Enum):
-    EFFICIENT = "efficient"  # Good reasoning flow
-    LOOP = "loop"  # Agent is stuck in a loop
-    VERBOSE = "verbose"  # Agent is unnecessarily verbose
-    INDECISIVE = "indecisive"  # Agent struggles to make decisions
-    SCATTERED = "scattered"  # Agent jumps between topics without focus
-
-
-class ReasoningEfficiencyEvaluator(BaseEvaluator):
-    @property
-    def metric_category(self) -> MetricCategory:
-        return MetricCategory.REASONING_EFFICIENCY
-
-    def evaluate(
-        self,
-        agent: Agent,
-        task: Task,
-        execution_trace: Dict[str, Any],
-        final_output: TaskOutput,
-    ) -> EvaluationScore:
-        llm_calls = execution_trace.get("llm_calls", [])
-
-        if not llm_calls or len(llm_calls) < 2:
-            return EvaluationScore(
-                score=None,
-                feedback="Insufficient LLM calls to evaluate reasoning efficiency."
-            )
-
-        total_calls = len(llm_calls)
-        total_tokens = sum(call.get("total_tokens", 0) for call in llm_calls)
-        avg_tokens_per_call = total_tokens / total_calls if total_calls > 0 else 0
-        time_intervals = []
-        has_reliable_timing = True
-        for i in range(1, len(llm_calls)):
-            start_time = llm_calls[i-1].get("end_time")
-            end_time = llm_calls[i].get("start_time")
-            if start_time and end_time and start_time != end_time:
-                try:
-                    interval = end_time - start_time
-                    time_intervals.append(interval.total_seconds() if hasattr(interval, 'total_seconds') else 0)
-                except Exception:
-                    has_reliable_timing = False
-            else:
-                has_reliable_timing = False
-
-        loop_detected, loop_details = self._detect_loops(llm_calls)
-        pattern_analysis = self._analyze_reasoning_patterns(llm_calls)
-
-        efficiency_metrics = {
-            "total_llm_calls": total_calls,
-            "total_tokens": total_tokens,
-            "avg_tokens_per_call": avg_tokens_per_call,
-            "reasoning_pattern": pattern_analysis["primary_pattern"].value,
-            "loops_detected": loop_detected,
-        }
-
-        if has_reliable_timing and time_intervals:
-            efficiency_metrics["avg_time_between_calls"] = np.mean(time_intervals)
-
-        loop_info = f"Detected {len(loop_details)} potential reasoning loops." if loop_detected else "No significant reasoning loops detected."
-
-        call_samples = self._get_call_samples(llm_calls)
-
-        prompt = [
-            {"role": "system", "content": """You are an expert evaluator assessing the reasoning efficiency of an AI agent's thought process.
-
-Evaluate the agent's reasoning efficiency across these five key subcategories:
-
-1. Focus (0-10): How well the agent stays on topic and avoids unnecessary tangents
-2. Progression (0-10): How effectively the agent builds on previous thoughts rather than repeating or circling
-3. Decision Quality (0-10): How decisively and appropriately the agent makes decisions
-4. Conciseness (0-10): How efficiently the agent communicates without unnecessary verbosity
-5. Loop Avoidance (0-10): How well the agent avoids getting stuck in repetitive thinking patterns
-
-For each subcategory, provide a score from 0-10 where:
-- 0: Completely inefficient
-- 5: Moderately efficient
-- 10: Highly efficient
-
-The overall score should be a weighted average of these subcategories.
-
-Return your evaluation as JSON with the following structure:
-{
-    "overall_score": float,
-    "scores": {
-        "focus": float,
-        "progression": float,
-        "decision_quality": float,
-        "conciseness": float,
-        "loop_avoidance": float
-    },
-    "feedback": string (general feedback about overall reasoning efficiency),
-    "optimization_suggestions": string (concrete suggestions for improving reasoning efficiency),
-    "detected_patterns": string (describe any inefficient reasoning patterns you observe)
-}"""},
-            {"role": "user", "content": f"""
-Agent role: {agent.role}
-Task description: {task.description}
-
-Reasoning efficiency metrics:
-- Total LLM calls: {efficiency_metrics["total_llm_calls"]}
-- Average tokens per call: {efficiency_metrics["avg_tokens_per_call"]:.1f}
-- Primary reasoning pattern: {efficiency_metrics["reasoning_pattern"]}
-- {loop_info}
-{"- Average time between calls: {:.2f} seconds".format(efficiency_metrics.get("avg_time_between_calls", 0)) if "avg_time_between_calls" in efficiency_metrics else ""}
-
-Sample of agent reasoning flow (chronological sequence):
-{call_samples}
-
-Agent's final output:
-{final_output.raw[:500]}... (truncated)
-
-Evaluate the reasoning efficiency of this agent based on these interaction patterns.
-Identify any inefficient reasoning patterns and provide specific suggestions for optimization.
-"""}
-        ]
-
-        assert self.llm is not None
-        response = self.llm.call(prompt)
-
-        try:
-            evaluation_data = extract_json_from_llm_response(response)
-
-            scores = evaluation_data.get("scores", {})
-            focus = scores.get("focus", 5.0)
-            progression = scores.get("progression", 5.0)
-            decision_quality = scores.get("decision_quality", 5.0)
-            conciseness = scores.get("conciseness", 5.0)
-            loop_avoidance = scores.get("loop_avoidance", 5.0)
-
-            overall_score = evaluation_data.get("overall_score", evaluation_data.get("score", 5.0))
-            feedback = evaluation_data.get("feedback", "No detailed feedback provided.")
-            optimization_suggestions = evaluation_data.get("optimization_suggestions", "No specific suggestions provided.")
-
-            detailed_feedback = "Reasoning Efficiency Evaluation:\n"
-            detailed_feedback += f"• Focus: {focus}/10 - Staying on topic without tangents\n"
-            detailed_feedback += f"• Progression: {progression}/10 - Building on previous thinking\n"
-            detailed_feedback += f"• Decision Quality: {decision_quality}/10 - Making appropriate decisions\n"
-            detailed_feedback += f"• Conciseness: {conciseness}/10 - Communicating efficiently\n"
-            detailed_feedback += f"• Loop Avoidance: {loop_avoidance}/10 - Avoiding repetitive patterns\n\n"
-
-            detailed_feedback += f"Feedback:\n{feedback}\n\n"
-            detailed_feedback += f"Optimization Suggestions:\n{optimization_suggestions}"
-
-            return EvaluationScore(
-                score=float(overall_score),
-                feedback=detailed_feedback,
-                raw_response=response
-            )
-        except Exception as e:
-            logging.warning(f"Failed to parse reasoning efficiency evaluation: {e}")
-            return EvaluationScore(
-                score=None,
-                feedback=f"Failed to parse reasoning efficiency evaluation. Raw response: {response[:200]}...",
-                raw_response=response
-            )
-
-    def _detect_loops(self, llm_calls: List[Dict]) -> Tuple[bool, List[Dict]]:
-        loop_details = []
-
-        messages = []
-        for call in llm_calls:
-            content = call.get("response", "")
-            if isinstance(content, str):
-                messages.append(content)
-            elif isinstance(content, list) and len(content) > 0:
-                # Handle message list format
-                for msg in content:
-                    if isinstance(msg, dict) and "content" in msg:
-                        messages.append(msg["content"])
-
-        # Simple n-gram based similarity detection
-        # For a more robust implementation, consider using embedding-based similarity
-        for i in range(len(messages) - 2):
-            for j in range(i + 1, len(messages) - 1):
-                # Check for repeated patterns (simplistic approach)
-                # A more sophisticated approach would use semantic similarity
-                similarity = self._calculate_text_similarity(messages[i], messages[j])
-                if similarity > 0.7:  # Arbitrary threshold
-                    loop_details.append({
-                        "first_occurrence": i,
-                        "second_occurrence": j,
-                        "similarity": similarity,
-                        "snippet": messages[i][:100] + "..."
-                    })
-
-        return len(loop_details) > 0, loop_details
-
-    def _calculate_text_similarity(self, text1: str, text2: str) -> float:
-        text1 = re.sub(r'\s+', ' ', text1.lower()).strip()
-        text2 = re.sub(r'\s+', ' ', text2.lower()).strip()
-
-        # Simple Jaccard similarity on word sets
-        words1 = set(text1.split())
-        words2 = set(text2.split())
-
-        intersection = len(words1.intersection(words2))
-        union = len(words1.union(words2))
-
-        return intersection / union if union > 0 else 0.0
-
-    def _analyze_reasoning_patterns(self, llm_calls: List[Dict]) -> Dict[str, Any]:
-        call_lengths = []
-        response_times = []
-
-        for call in llm_calls:
-            content = call.get("response", "")
-            if isinstance(content, str):
-                call_lengths.append(len(content))
-            elif isinstance(content, list) and len(content) > 0:
-                # Handle message list format
-                total_length = 0
-                for msg in content:
-                    if isinstance(msg, dict) and "content" in msg:
-                        total_length += len(msg["content"])
-                call_lengths.append(total_length)
-
-            start_time = call.get("start_time")
-            end_time = call.get("end_time")
-            if start_time and end_time:
-                try:
-                    response_times.append(end_time - start_time)
-                except Exception:
-                    pass
-
-        avg_length = np.mean(call_lengths) if call_lengths else 0
-        std_length = np.std(call_lengths) if call_lengths else 0
-        length_trend = self._calculate_trend(call_lengths)
-
-        primary_pattern = ReasoningPatternType.EFFICIENT
-        details = "Agent demonstrates efficient reasoning patterns."
-
-        loop_score = self._calculate_loop_likelihood(call_lengths, response_times)
-        if loop_score > 0.7:
-            primary_pattern = ReasoningPatternType.LOOP
-            details = "Agent appears to be stuck in repetitive thinking patterns."
-        elif avg_length > 1000 and std_length / avg_length < 0.3:
-            primary_pattern = ReasoningPatternType.VERBOSE
-            details = "Agent is consistently verbose across interactions."
-        elif len(llm_calls) > 10 and length_trend > 0.5:
-            primary_pattern = ReasoningPatternType.INDECISIVE
-            details = "Agent shows signs of indecisiveness with increasing message lengths."
-        elif avg_length > 0 and std_length / avg_length > 0.8:
-            primary_pattern = ReasoningPatternType.SCATTERED
-            details = "Agent shows inconsistent reasoning flow with highly variable responses."
-
-        return {
-            "primary_pattern": primary_pattern,
-            "details": details,
-            "metrics": {
-                "avg_length": avg_length,
-                "std_length": std_length,
-                "length_trend": length_trend,
-                "loop_score": loop_score
-            }
-        }
-
-    def _calculate_trend(self, values: Sequence[float | int]) -> float:
-        if not values or len(values) < 2:
-            return 0.0
-
-        try:
-            x = np.arange(len(values))
-            y = np.array(values)
-
-            # Simple linear regression
-            slope = np.polyfit(x, y, 1)[0]
-
-            # Normalize slope to -1 to 1 range
-            max_possible_slope = max(values) - min(values)
-            if max_possible_slope > 0:
-                normalized_slope = slope / max_possible_slope
-                return max(min(normalized_slope, 1.0), -1.0)
-            return 0.0
-        except Exception:
-            return 0.0
-
-    def _calculate_loop_likelihood(self, call_lengths: Sequence[float], response_times: Sequence[float]) -> float:
-        if not call_lengths or len(call_lengths) < 3:
-            return 0.0
-
-        indicators = []
-
-        if len(call_lengths) >= 4:
-            repeated_lengths = 0
-            for i in range(len(call_lengths) - 2):
-                ratio = call_lengths[i] / call_lengths[i + 2] if call_lengths[i + 2] > 0 else 0
-                if 0.85 <= ratio <= 1.15:
-                    repeated_lengths += 1
-
-            length_repetition_score = repeated_lengths / (len(call_lengths) - 2)
-            indicators.append(length_repetition_score)
-
-        if response_times and len(response_times) >= 3:
-            try:
-                std_time = np.std(response_times)
-                mean_time = np.mean(response_times)
-                if mean_time > 0:
-                    time_consistency = 1.0 - (std_time / mean_time)
-                    indicators.append(max(0, time_consistency - 0.3) * 1.5)
-            except Exception:
-                pass
-
-        return np.mean(indicators) if indicators else 0.0
-
-    def _get_call_samples(self, llm_calls: List[Dict]) -> str:
-        samples = []
-
-        if len(llm_calls) <= 6:
-            sample_indices = list(range(len(llm_calls)))
-        else:
-            sample_indices = [0, 1, len(llm_calls) // 2 - 1, len(llm_calls) // 2,
-                             len(llm_calls) - 2, len(llm_calls) - 1]
-
-        for idx in sample_indices:
-            call = llm_calls[idx]
-            content = call.get("response", "")
-
-            if isinstance(content, str):
-                sample = content
-            elif isinstance(content, list) and len(content) > 0:
-                sample_parts = []
-                for msg in content:
-                    if isinstance(msg, dict) and "content" in msg:
-                        sample_parts.append(msg["content"])
-                sample = "\n".join(sample_parts)
-            else:
-                sample = str(content)
-
-            truncated = sample[:200] + "..." if len(sample) > 200 else sample
-            samples.append(f"Call {idx + 1}:\n{truncated}\n")
-
-        return "\n".join(samples)
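The loop check in the removed `_detect_loops` / `_calculate_text_similarity` pair comes down to Jaccard similarity over word sets with a fixed cutoff. A minimal standalone sketch of that idea, assuming plain string responses and the same 0.7 threshold; `find_similar_pairs` is a hypothetical name, and unlike the removed code it compares every pair, including the last message:

```python
import re
from itertools import combinations
from typing import List, Tuple


def jaccard_similarity(text1: str, text2: str) -> float:
    """Jaccard similarity on lowercased, whitespace-normalized word sets."""
    words1 = set(re.sub(r"\s+", " ", text1.lower()).strip().split())
    words2 = set(re.sub(r"\s+", " ", text2.lower()).strip().split())
    union = words1 | words2
    # Two empty texts share nothing measurable, so report 0.0 rather than divide by zero.
    return len(words1 & words2) / len(union) if union else 0.0


def find_similar_pairs(
    messages: List[str], threshold: float = 0.7
) -> List[Tuple[int, int, float]]:
    """Return (i, j, similarity) for every pair of messages above the threshold."""
    pairs = []
    for i, j in combinations(range(len(messages)), 2):
        similarity = jaccard_similarity(messages[i], messages[j])
        if similarity > threshold:
            pairs.append((i, j, similarity))
    return pairs


if __name__ == "__main__":
    responses = ["search the docs", "Search  the docs", "write summary"]
    print(find_similar_pairs(responses))
```

Word-set overlap ignores word order and repetition, so it only flags near-verbatim repeats; the removed comments themselves suggest embedding-based similarity as a more robust alternative.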
--- a/src/crewai/evaluation/metrics/semantic_quality_metrics.py
+++ b/src/crewai/evaluation/metrics/semantic_quality_metrics.py
@@ -1,65 +0,0 @@
-from typing import Any, Dict
-
-from crewai.agent import Agent
-from crewai.task import Task
-
-from crewai.evaluation.base_evaluator import BaseEvaluator, EvaluationScore, MetricCategory
-from crewai.evaluation.json_parser import extract_json_from_llm_response
-
-class SemanticQualityEvaluator(BaseEvaluator):
-    @property
-    def metric_category(self) -> MetricCategory:
-        return MetricCategory.SEMANTIC_QUALITY
-
-    def evaluate(
-        self,
-        agent: Agent,
-        task: Task,
-        execution_trace: Dict[str, Any],
-        final_output: Any,
-    ) -> EvaluationScore:
-        prompt = [
-            {"role": "system", "content": """You are an expert evaluator assessing the semantic quality of an AI agent's output.
-
-Score the semantic quality on a scale from 0-10 where:
-- 0: Completely incoherent, confusing, or logically flawed output
-- 5: Moderately clear and logical output with some issues
-- 10: Exceptionally clear, coherent, and logically sound output
-
-Consider:
-1. Is the output well-structured and organized?
-2. Is the reasoning logical and well-supported?
-3. Is the language clear, precise, and appropriate for the task?
-4. Are claims supported by evidence when appropriate?
-5. Is the output free from contradictions and logical fallacies?
-
-Return your evaluation as JSON with fields 'score' (number) and 'feedback' (string).
-"""},
-            {"role": "user", "content": f"""
-Agent role: {agent.role}
-Task description: {task.description}
-
-Agent's final output:
-{final_output}
-
-Evaluate the semantic quality and reasoning of this output.
-"""}
-        ]
-
-        assert self.llm is not None
-        response = self.llm.call(prompt)
-
-        try:
-            evaluation_data: dict[str, Any] = extract_json_from_llm_response(response)
-            assert evaluation_data is not None
-            return EvaluationScore(
-                score=float(evaluation_data["score"]) if evaluation_data.get("score") is not None else None,
-                feedback=evaluation_data.get("feedback", response),
-                raw_response=response
-            )
-        except Exception:
-            return EvaluationScore(
-                score=None,
-                feedback=f"Failed to parse evaluation. Raw response: {response}",
-                raw_response=response
-            )
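The removed `SemanticQualityEvaluator.evaluate` treats a missing or unparseable score as `None` instead of substituting a default. A rough sketch of that fallback, assuming the judge returns a bare JSON object and using the standard `json` module in place of `extract_json_from_llm_response`; `parse_score` is a hypothetical helper, not part of crewai:

```python
import json
from typing import Optional, Tuple


def parse_score(response: str) -> Tuple[Optional[float], str]:
    """Return (score, feedback); score is None when it is absent or unparseable."""
    try:
        data = json.loads(response)
        raw_score = data.get("score")
        score = float(raw_score) if raw_score is not None else None
        # Fall back to the raw response when the judge gave no feedback field.
        return score, data.get("feedback", response)
    except (ValueError, TypeError, AttributeError):
        # Covers invalid JSON, non-numeric scores, and non-object payloads.
        return None, f"Failed to parse evaluation. Raw response: {response}"


if __name__ == "__main__":
    print(parse_score('{"score": 8, "feedback": "clear"}'))
    print(parse_score("not json"))
```

Returning `None` keeps a failed parse distinguishable from a genuinely middling score, which a default such as 5.0 would hide.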
--- a/src/crewai/evaluation/metrics/tools_metrics.py
+++ b/src/crewai/evaluation/metrics/tools_metrics.py
@@ -1,400 +0,0 @@
-import json
-from typing import Dict, Any
-
-from crewai.evaluation.base_evaluator import BaseEvaluator, EvaluationScore, MetricCategory
-from crewai.evaluation.json_parser import extract_json_from_llm_response
-from crewai.agent import Agent
-from crewai.task import Task
-
-
-class ToolSelectionEvaluator(BaseEvaluator):
-
-    @property
-    def metric_category(self) -> MetricCategory:
-        return MetricCategory.TOOL_SELECTION
-
-    def evaluate(
-        self,
-        agent: Agent,
-        task: Task,
-        execution_trace: Dict[str, Any],
-        final_output: str,
-    ) -> EvaluationScore:
-        tool_uses = execution_trace.get("tool_uses", [])
-        tool_count = len(tool_uses)
-        unique_tool_types = set([tool.get("tool", "Unknown tool") for tool in tool_uses])
-
-        if tool_count == 0:
-            if not agent.tools:
-                return EvaluationScore(
-                    score=None,
-                    feedback="Agent had no tools available to use."
-                )
-            else:
-                return EvaluationScore(
-                    score=None,
-                    feedback="Agent had tools available but didn't use any."
-                )
-
-        available_tools_info = ""
-        if agent.tools:
-            for tool in agent.tools:
-                available_tools_info += f"- {tool.name}: {tool.description}\n"
-        else:
-            available_tools_info = "No tools available"
-
-        tool_types_summary = "Tools selected by the agent:\n"
-        for tool_type in sorted(unique_tool_types):
-            tool_types_summary += f"- {tool_type}\n"
-
-        prompt = [
-            {"role": "system", "content": """You are an expert evaluator assessing if an AI agent selected the most appropriate tools for a given task.
-
-You must evaluate based on these 2 criteria:
-1. Relevance (0-10): Were the tools chosen directly aligned with the task's goals?
-2. Coverage (0-10): Did the agent select ALL appropriate tools from the AVAILABLE tools?
-
-IMPORTANT:
-- ONLY consider tools that are listed as available to the agent
-- DO NOT suggest tools that aren't in the 'Available tools' list
-- DO NOT evaluate the quality or accuracy of tool outputs/results
-- DO NOT evaluate how many times each tool was used
-- DO NOT evaluate how the agent used the parameters
-- DO NOT evaluate whether the agent interpreted the task correctly
-
-Focus ONLY on whether the correct CATEGORIES of tools were selected from what was available.
-
-Return your evaluation as JSON with these fields:
-- scores: {"relevance": number, "coverage": number}
-- overall_score: number (average of all scores, 0-10)
-- feedback: string (focused ONLY on tool selection decisions from available tools)
-- improvement_suggestions: string (ONLY suggest better selection from the AVAILABLE tools list, NOT new tools)
-"""},
-            {"role": "user", "content": f"""
-Agent role: {agent.role}
-Task description: {task.description}
-
-Available tools for this agent:
-{available_tools_info}
-
-{tool_types_summary}
-
-Based ONLY on the task description and comparing the AVAILABLE tools with those that were selected (listed above), evaluate if the agent selected the appropriate tool types for this task.
-
-IMPORTANT:
-- ONLY evaluate selection from tools listed as available
-- DO NOT suggest new tools that aren't in the available tools list
-- DO NOT evaluate tool usage or results
-"""}
-        ]
-        assert self.llm is not None
-        response = self.llm.call(prompt)
-
-        try:
-            evaluation_data = extract_json_from_llm_response(response)
-            assert evaluation_data is not None
-
-            scores = evaluation_data.get("scores", {})
-            relevance = scores.get("relevance", 5.0)
-            coverage = scores.get("coverage", 5.0)
-            overall_score = float(evaluation_data.get("overall_score", 5.0))
-
-            feedback = "Tool Selection Evaluation:\n"
-            feedback += f"• Relevance: {relevance}/10 - Selection of appropriate tool types for the task\n"
-            feedback += f"• Coverage: {coverage}/10 - Selection of all necessary tool types\n"
-            if "improvement_suggestions" in evaluation_data:
-                feedback += f"Improvement Suggestions:\n{evaluation_data['improvement_suggestions']}"
-            else:
-                feedback += evaluation_data.get("feedback", "No detailed feedback available.")
-
-            return EvaluationScore(
-                score=overall_score,
-                feedback=feedback,
-                raw_response=response
-            )
-        except Exception as e:
-            return EvaluationScore(
-                score=None,
-                feedback=f"Error evaluating tool selection: {e}",
-                raw_response=response
-            )
-
-
-class ParameterExtractionEvaluator(BaseEvaluator):
-    @property
-    def metric_category(self) -> MetricCategory:
-        return MetricCategory.PARAMETER_EXTRACTION
-
-    def evaluate(
-        self,
-        agent: Agent,
-        task: Task,
-        execution_trace: Dict[str, Any],
-        final_output: str,
-    ) -> EvaluationScore:
-        tool_uses = execution_trace.get("tool_uses", [])
-        tool_count = len(tool_uses)
-
-        if tool_count == 0:
-            return EvaluationScore(
-                score=None,
-                feedback="No tool usage detected. Cannot evaluate parameter extraction."
-            )
-
-        validation_errors = []
-        for tool_use in tool_uses:
-            if not tool_use.get("success", True) and tool_use.get("error_type") == "validation_error":
-                validation_errors.append({
-                    "tool": tool_use.get("tool", "Unknown tool"),
-                    "error": tool_use.get("result"),
-                    "args": tool_use.get("args", {})
-                })
-
-        validation_error_rate = len(validation_errors) / tool_count if tool_count > 0 else 0
-
-        param_samples = []
-        for i, tool_use in enumerate(tool_uses[:5]):
-            tool_name = tool_use.get("tool", "Unknown tool")
-            tool_args = tool_use.get("args", {})
-            success = tool_use.get("success", True) and not tool_use.get("error", False)
-            error_type = tool_use.get("error_type", "") if not success else ""
-
-            is_validation_error = error_type == "validation_error"
-
-            sample = f"Tool use #{i+1} - {tool_name}:\n"
-            sample += f"- Parameters: {json.dumps(tool_args, indent=2)}\n"
-            sample += f"- Success: {'No' if not success else 'Yes'}"
-
-            if is_validation_error:
-                sample += " (PARAMETER VALIDATION ERROR)\n"
-                sample += f"- Error: {tool_use.get('result', 'Unknown error')}"
-            elif not success:
-                sample += f" (Other error: {error_type})\n"
-
-            param_samples.append(sample)
-
-        validation_errors_info = ""
-        if validation_errors:
-            validation_errors_info = f"\nParameter validation errors detected: {len(validation_errors)} ({validation_error_rate:.1%} of tool uses)\n"
-            for i, err in enumerate(validation_errors[:3]):
-                tool_name = err.get("tool", "Unknown tool")
-                error_msg = err.get("error", "Unknown error")
-                args = err.get("args", {})
-                validation_errors_info += f"\nValidation Error #{i+1}:\n- Tool: {tool_name}\n- Args: {json.dumps(args, indent=2)}\n- Error: {error_msg}"
-
-            if len(validation_errors) > 3:
-                validation_errors_info += f"\n...and {len(validation_errors) - 3} more validation errors."
-        param_samples_text = "\n\n".join(param_samples)
-        prompt = [
-            {"role": "system", "content": """You are an expert evaluator assessing how well an AI agent extracts and formats PARAMETER VALUES for tool calls.
-
-Your job is to evaluate ONLY whether the agent used the correct parameter VALUES, not whether the right tools were selected or how the tools were invoked.
-
-Evaluate parameter extraction based on these criteria:
-1. Accuracy (0-10): Are parameter values correctly identified from the context/task?
-2. Formatting (0-10): Are values formatted correctly for each tool's requirements?
-3. Completeness (0-10): Are all required parameter values provided, with no missing information?
-
-IMPORTANT: DO NOT evaluate:
-- Whether the right tool was chosen (that's the ToolSelectionEvaluator's job)
-- How the tools were structurally invoked (that's the ToolInvocationEvaluator's job)
-- The quality of results from tools
-
-Focus ONLY on the PARAMETER VALUES - whether they were correctly extracted from the context, properly formatted, and complete.
-
-Validation errors are important signals that parameter values weren't properly extracted or formatted.
-
-Return your evaluation as JSON with these fields:
-- scores: {"accuracy": number, "formatting": number, "completeness": number}
-- overall_score: number (average of all scores, 0-10)
-- feedback: string (focused ONLY on parameter value extraction quality)
-- improvement_suggestions: string (concrete suggestions for better parameter VALUE extraction)
-"""},
-            {"role": "user", "content": f"""
-Agent role: {agent.role}
-Task description: {task.description}
-
-Parameter extraction examples:
-{param_samples_text}
-{validation_errors_info}
-
-Evaluate the quality of the agent's parameter extraction for this task.
-"""}
-        ]
-
-        assert self.llm is not None
-        response = self.llm.call(prompt)
-
-        try:
-            evaluation_data = extract_json_from_llm_response(response)
-            assert evaluation_data is not None
-
-            scores = evaluation_data.get("scores", {})
-            accuracy = scores.get("accuracy", 5.0)
-            formatting = scores.get("formatting", 5.0)
-            completeness = scores.get("completeness", 5.0)
-
-            overall_score = float(evaluation_data.get("overall_score", 5.0))
-
-            feedback = "Parameter Extraction Evaluation:\n"
-            feedback += f"• Accuracy: {accuracy}/10 - Correctly identifying required parameters\n"
-            feedback += f"• Formatting: {formatting}/10 - Properly formatting parameters for tools\n"
-            feedback += f"• Completeness: {completeness}/10 - Including all necessary information\n\n"
-
-            if "improvement_suggestions" in evaluation_data:
-                feedback += f"Improvement Suggestions:\n{evaluation_data['improvement_suggestions']}"
-            else:
-                feedback += evaluation_data.get("feedback", "No detailed feedback available.")
-
-            return EvaluationScore(
-                score=overall_score,
-                feedback=feedback,
-                raw_response=response
-            )
-        except Exception as e:
-            return EvaluationScore(
-                score=None,
-                feedback=f"Error evaluating parameter extraction: {e}",
-                raw_response=response
-            )
-
-
-class ToolInvocationEvaluator(BaseEvaluator):
-    @property
-    def metric_category(self) -> MetricCategory:
-        return MetricCategory.TOOL_INVOCATION
-
-    def evaluate(
-        self,
-        agent: Agent,
-        task: Task,
-        execution_trace: Dict[str, Any],
-        final_output: str,
-    ) -> EvaluationScore:
-        tool_uses = execution_trace.get("tool_uses", [])
-        tool_errors = []
-        tool_count = len(tool_uses)
-
-        if tool_count == 0:
-            return EvaluationScore(
-                score=None,
-                feedback="No tool usage detected. Cannot evaluate tool invocation."
-            )
-
-        for tool_use in tool_uses:
-            if not tool_use.get("success", True) or tool_use.get("error", False):
-                error_info = {
-                    "tool": tool_use.get("tool", "Unknown tool"),
-                    "error": tool_use.get("result"),
-                    "error_type": tool_use.get("error_type", "unknown_error")
-                }
-                tool_errors.append(error_info)
-
-        error_rate = len(tool_errors) / tool_count if tool_count > 0 else 0
-
-        error_types = {}
-        for error in tool_errors:
-            error_type = error.get("error_type", "unknown_error")
-            if error_type not in error_types:
-                error_types[error_type] = 0
-            error_types[error_type] += 1
-
-        invocation_samples = []
-        for i, tool_use in enumerate(tool_uses[:5]):
-            tool_name = tool_use.get("tool", "Unknown tool")
-            tool_args = tool_use.get("args", {})
-            success = tool_use.get("success", True) and not tool_use.get("error", False)
-            error_type = tool_use.get("error_type", "") if not success else ""
-            error_msg = tool_use.get("result", "No error") if not success else "No error"
-
-            sample = f"Tool invocation #{i+1}:\n"
-            sample += f"- Tool: {tool_name}\n"
-            sample += f"- Parameters: {json.dumps(tool_args, indent=2)}\n"
-            sample += f"- Success: {'No' if not success else 'Yes'}\n"
-            if not success:
-                sample += f"- Error type: {error_type}\n"
-                sample += f"- Error: {error_msg}"
-            invocation_samples.append(sample)
-
-        error_type_summary = ""
-        if error_types:
-            error_type_summary = "Error type breakdown:\n"
-            for error_type, count in error_types.items():
-                error_type_summary += f"- {error_type}: {count} occurrences ({(count/tool_count):.1%})\n"
-
-        invocation_samples_text = "\n\n".join(invocation_samples)
-        prompt = [
-            {"role": "system", "content": """You are an expert evaluator assessing how correctly an AI agent's tool invocations are STRUCTURED.
-
-Your job is to evaluate ONLY the structural and syntactical aspects of how the agent called tools, NOT which tools were selected or what parameter values were used.
-
-Evaluate the agent's tool invocation based on these criteria:
-1. Structure (0-10): Does the tool call follow the expected syntax and format?
-2. Error Handling (0-10): Does the agent handle tool errors appropriately?
-3. Invocation Patterns (0-10): Are tool calls properly sequenced, batched, or managed?
-
-Error types that indicate invocation issues:
-- execution_error: The tool was called correctly but failed during execution
-- usage_error: General errors in how the tool was used structurally
-
-IMPORTANT: DO NOT evaluate:
-- Whether the right tool was chosen (that's the ToolSelectionEvaluator's job)
-- Whether the parameter values are correct (that's the ParameterExtractionEvaluator's job)
-- The quality of results from tools
-
-Focus ONLY on HOW tools were invoked - the structure, format, and handling of the invocation process.
-
-Return your evaluation as JSON with these fields:
-- scores: {"structure": number, "error_handling": number, "invocation_patterns": number}
-- overall_score: number (average of all scores, 0-10)
-- feedback: string (focused ONLY on structural aspects of tool invocation)
-- improvement_suggestions: string (concrete suggestions for better structuring of tool calls)
-"""},
-            {"role": "user", "content": f"""
-Agent role: {agent.role}
-Task description: {task.description}
-
-Tool invocation examples:
-{invocation_samples_text}
-
-Tool error rate: {error_rate:.2%} ({len(tool_errors)} errors out of {tool_count} invocations)
-{error_type_summary}
-
-Evaluate the quality of the agent's tool invocation structure during this task.
-"""}
-        ]
-
-        assert self.llm is not None
-        response = self.llm.call(prompt)
-
-        try:
-            evaluation_data = extract_json_from_llm_response(response)
-            assert evaluation_data is not None
-            scores = evaluation_data.get("scores", {})
-            structure = scores.get("structure", 5.0)
-            error_handling = scores.get("error_handling", 5.0)
-            invocation_patterns = scores.get("invocation_patterns", 5.0)
-
-            overall_score = float(evaluation_data.get("overall_score", 5.0))
-
-            feedback = "Tool Invocation Evaluation:\n"
-            feedback += f"• Structure: {structure}/10 - Following proper syntax and format\n"
-            feedback += f"• Error Handling: {error_handling}/10 - Appropriately handling tool errors\n"
-            feedback += f"• Invocation Patterns: {invocation_patterns}/10 - Proper sequencing and management of calls\n\n"
-
-            if "improvement_suggestions" in evaluation_data:
-                feedback += f"Improvement Suggestions:\n{evaluation_data['improvement_suggestions']}"
-            else:
-                feedback += evaluation_data.get("feedback", "No detailed feedback available.")
-
-            return EvaluationScore(
-                score=overall_score,
-                feedback=feedback,
-                raw_response=response
-            )
-        except Exception as e:
-            return EvaluationScore(
-                score=None,
-                feedback=f"Error evaluating tool invocation: {e}",
-                raw_response=response
-            )
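The error bookkeeping in the removed `ToolInvocationEvaluator.evaluate` (failure filter, error rate, per-type counts) can be written more compactly with `collections.Counter`. A sketch under the assumption that each trace entry is a dict with optional `success`, `error`, and `error_type` keys, as in the removed code; `summarize_tool_errors` is a hypothetical name:

```python
from collections import Counter
from typing import Any, Dict, List, Tuple


def summarize_tool_errors(
    tool_uses: List[Dict[str, Any]],
) -> Tuple[float, Dict[str, int]]:
    """Return (error_rate, counts per error_type) for a list of tool-use records."""
    # A record counts as failed if success is explicitly false or an error flag is set.
    failed = [
        use for use in tool_uses
        if not use.get("success", True) or use.get("error", False)
    ]
    error_rate = len(failed) / len(tool_uses) if tool_uses else 0.0
    error_types = Counter(use.get("error_type", "unknown_error") for use in failed)
    return error_rate, dict(error_types)


if __name__ == "__main__":
    uses = [
        {"tool": "search", "success": True},
        {"tool": "search", "success": False, "error_type": "validation_error"},
        {"tool": "calc", "error": True},
        {"tool": "calc"},
    ]
    print(summarize_tool_errors(uses))
```

Records with no `success` key are treated as successful, matching the `get("success", True)` default used throughout the removed evaluators.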
--- a/src/crewai/experimental/evaluation/agent_evaluator.py
+++ b/src/crewai/experimental/evaluation/agent_evaluator.py
@@ -1,34 +1,35 @@
+import threading
+from typing import Any
+
 from crewai.experimental.evaluation.base_evaluator import AgentEvaluationResult, AggregationStrategy
 from crewai.agent import Agent
 from crewai.task import Task
 from crewai.experimental.evaluation.evaluation_display import EvaluationDisplayFormatter
-
-from typing import Any, Dict
-from collections import defaultdict
+from crewai.utilities.events.agent_events import AgentEvaluationStartedEvent, AgentEvaluationCompletedEvent, AgentEvaluationFailedEvent
 from crewai.experimental.evaluation import BaseEvaluator, create_evaluation_callbacks
 from collections.abc import Sequence
-from crewai.crew import Crew
 from crewai.utilities.events.crewai_event_bus import crewai_event_bus
 from crewai.utilities.events.utils.console_formatter import ConsoleFormatter
-from crewai.experimental.evaluation.evaluation_display import AgentAggregatedEvaluationResult
-from contextlib import contextmanager
-import threading
+from crewai.utilities.events.task_events import TaskCompletedEvent
+from crewai.utilities.events.agent_events import LiteAgentExecutionCompletedEvent
+from crewai.experimental.evaluation.base_evaluator import AgentAggregatedEvaluationResult, EvaluationScore, MetricCategory

 class ExecutionState:
    def __init__(self):
-        self.traces: dict[str, Any] = {}
+        self.traces = {}
        self.current_agent_id: str | None = None
        self.current_task_id: str | None = None
-        self.iteration: int = 1
-        self.iterations_results: dict[int, dict[str, list[AgentEvaluationResult]]] = {}
+        self.iteration = 1
+        self.iterations_results = {}
+        self.agent_evaluators = {}

 class AgentEvaluator:
    def __init__(
        self,
+        agents: list[Agent],
        evaluators: Sequence[BaseEvaluator] | None = None,
-        crew: Crew | None = None,
    ):
-        self.crew: Crew | None = crew
+        self.agents: list[Agent] = agents
        self.evaluators: Sequence[BaseEvaluator] | None = evaluators

        self.callback = create_evaluation_callbacks()
@@ -37,19 +38,10 @@ class AgentEvaluator:

        self._thread_local: threading.local = threading.local()

-        self.agent_evaluators: dict[str, Sequence[BaseEvaluator] | None] = {}
-        if crew is not None:
-            assert crew and crew.agents is not None
-            for agent in crew.agents:
-                self.agent_evaluators[str(agent.id)] = self.evaluators
+        for agent in self.agents:
+            self._execution_state.agent_evaluators[str(agent.id)] = self.evaluators

-    @contextmanager
-    def execution_context(self):
-        state = ExecutionState()
-        try:
-            yield state
-        finally:
-            pass
+        self._subscribe_to_events()

    @property
    def _execution_state(self) -> ExecutionState:
@@ -57,81 +49,100 @@ class AgentEvaluator:
            self._thread_local.execution_state = ExecutionState()
        return self._thread_local.execution_state

+    def _subscribe_to_events(self) -> None:
+        from typing import cast
+        crewai_event_bus.register_handler(TaskCompletedEvent, cast(Any, self._handle_task_completed))
+        crewai_event_bus.register_handler(LiteAgentExecutionCompletedEvent, cast(Any, self._handle_lite_agent_completed))
+
+    def _handle_task_completed(self, source: Any, event: TaskCompletedEvent) -> None:
+        assert event.task is not None
+        agent = event.task.agent
+        if agent and str(getattr(agent, 'id', 'unknown')) in self._execution_state.agent_evaluators:
+            self.emit_evaluation_started_event(agent_role=agent.role, agent_id=str(agent.id), task_id=str(event.task.id))
+
+            state = ExecutionState()
+            state.current_agent_id = str(agent.id)
+            state.current_task_id = str(event.task.id)
+
+            assert state.current_agent_id is not None and state.current_task_id is not None
+            trace = self.callback.get_trace(state.current_agent_id, state.current_task_id)
+
+            if not trace:
+                return
+
+            result = self.evaluate(
+                agent=agent,
+                task=event.task,
+                execution_trace=trace,
+                final_output=event.output,
+                state=state
+            )
+
+            current_iteration = self._execution_state.iteration
+            if current_iteration not in self._execution_state.iterations_results:
+                self._execution_state.iterations_results[current_iteration] = {}
+
+            if agent.role not in self._execution_state.iterations_results[current_iteration]:
+                self._execution_state.iterations_results[current_iteration][agent.role] = []
+
+            self._execution_state.iterations_results[current_iteration][agent.role].append(result)
+
+    def _handle_lite_agent_completed(self, source: object, event: LiteAgentExecutionCompletedEvent) -> None:
+        agent_info = event.agent_info
+        agent_id = str(agent_info["id"])
+
+        if agent_id in self._execution_state.agent_evaluators:
+            state = ExecutionState()
+            state.current_agent_id = agent_id
+            state.current_task_id = "lite_task"
+
+            target_agent = None
+            for agent in self.agents:
+                if str(agent.id) == agent_id:
+                    target_agent = agent
+                    break
+
+            if not target_agent:
+                return
+
+            assert state.current_agent_id is not None and state.current_task_id is not None
+            trace = self.callback.get_trace(state.current_agent_id, state.current_task_id)
+
+            if not trace:
+                return
+
+            result = self.evaluate(
+                agent=target_agent,
+                execution_trace=trace,
+                final_output=event.output,
+                state=state
+            )
+
+            current_iteration = self._execution_state.iteration
+            if current_iteration not in self._execution_state.iterations_results:
+                self._execution_state.iterations_results[current_iteration] = {}
+
+            agent_role = target_agent.role
+            if agent_role not in self._execution_state.iterations_results[current_iteration]:
+                self._execution_state.iterations_results[current_iteration][agent_role] = []
+
+            self._execution_state.iterations_results[current_iteration][agent_role].append(result)
+
    def set_iteration(self, iteration: int) -> None:
        self._execution_state.iteration = iteration

    def reset_iterations_results(self) -> None:
        self._execution_state.iterations_results = {}

-    def evaluate_current_iteration(self) -> dict[str, list[AgentEvaluationResult]]:
-        if not self.crew:
-            raise ValueError("Cannot evaluate: no crew was provided to the evaluator.")
-
-        if not self.callback:
-            raise ValueError("Cannot evaluate: no callback was set. Use set_callback() method first.")
-
-        from rich.progress import Progress, SpinnerColumn, TextColumn, BarColumn
-        evaluation_results: defaultdict[str, list[AgentEvaluationResult]] = defaultdict(list)
-
-        total_evals = 0
-        for agent in self.crew.agents:
-            for task in self.crew.tasks:
-                if task.agent and task.agent.id == agent.id and self.agent_evaluators.get(str(agent.id)):
-                    total_evals += 1
-
-        with Progress(
-            SpinnerColumn(),
-            TextColumn("[bold blue]{task.description}[/bold blue]"),
-            BarColumn(),
-            TextColumn("{task.percentage:.0f}% completed"),
-            console=self.console_formatter.console
-        ) as progress:
-            eval_task = progress.add_task(f"Evaluating agents (iteration {self._execution_state.iteration})...", total=total_evals)
-
-            with self.execution_context() as state:
-                state.iteration = self._execution_state.iteration
-
-                for agent in self.crew.agents:
-                    evaluator = self.agent_evaluators.get(str(agent.id))
-                    if not evaluator:
-                        continue
-
-                    for task in self.crew.tasks:
-                        if task.agent and str(task.agent.id) != str(agent.id):
-                            continue
-
-                        trace = self.callback.get_trace(str(agent.id), str(task.id))
-                        if not trace:
-                            self.console_formatter.print(f"[yellow]Warning: No trace found for agent {agent.role} on task {task.description[:30]}...[/yellow]")
-                            progress.update(eval_task, advance=1)
-                            continue
-
-                        state.current_agent_id = str(agent.id)
-                        state.current_task_id = str(task.id)
-
-                        with crewai_event_bus.scoped_handlers():
-                            result = self.evaluate(
-                                agent=agent,
-                                task=task,
-                                execution_trace=trace,
-                                final_output=task.output,
-                                state=state
-                            )
-                            evaluation_results[agent.role].append(result)
-                            progress.update(eval_task, advance=1)
-
-        self._execution_state.iterations_results[self._execution_state.iteration] = evaluation_results
-        return evaluation_results
-
    def get_evaluation_results(self) -> dict[str, list[AgentEvaluationResult]]:
-        if self._execution_state.iteration in self._execution_state.iterations_results:
+        if self._execution_state.iterations_results and self._execution_state.iteration in self._execution_state.iterations_results:
            return self._execution_state.iterations_results[self._execution_state.iteration]
-        return self.evaluate_current_iteration()
+        return {}

    def display_results_with_iterations(self) -> None:
        self.display_formatter.display_summary_results(self._execution_state.iterations_results)

-    def get_agent_evaluation(self, strategy: AggregationStrategy = AggregationStrategy.SIMPLE_AVERAGE, include_evaluation_feedback: bool = False) -> Dict[str, AgentAggregatedEvaluationResult]:
+    def get_agent_evaluation(self, strategy: AggregationStrategy = AggregationStrategy.SIMPLE_AVERAGE, include_evaluation_feedback: bool = True) -> dict[str, AgentAggregatedEvaluationResult]:
        agent_results = {}
        with crewai_event_bus.scoped_handlers():
            task_results = self.get_evaluation_results()
@@ -165,19 +176,21 @@ class AgentEvaluator:
    def evaluate(
        self,
        agent: Agent,
-        task: Task,
        execution_trace: dict[str, Any],
        final_output: Any,
-        state: ExecutionState
+        state: ExecutionState,
+        task: Task | None = None,
    ) -> AgentEvaluationResult:
        result = AgentEvaluationResult(
            agent_id=state.current_agent_id or str(agent.id),
-            task_id=state.current_task_id or str(task.id)
+            task_id=state.current_task_id or (str(task.id) if task else "unknown_task")
        )

        assert self.evaluators is not None
+        task_id = str(task.id) if task else None
        for evaluator in self.evaluators:
            try:
+                self.emit_evaluation_started_event(agent_role=agent.role, agent_id=str(agent.id), task_id=task_id)
                score = evaluator.evaluate(
                    agent=agent,
                    task=task,
@@ -185,12 +198,32 @@ class AgentEvaluator:
                    final_output=final_output
                )
                result.metrics[evaluator.metric_category] = score
+                self.emit_evaluation_completed_event(agent_role=agent.role, agent_id=str(agent.id), task_id=task_id, metric_category=evaluator.metric_category, score=score)
            except Exception as e:
+                self.emit_evaluation_failed_event(agent_role=agent.role, agent_id=str(agent.id), task_id=task_id, error=str(e))
                self.console_formatter.print(f"Error in {evaluator.metric_category.value} evaluator: {str(e)}")

        return result

-def create_default_evaluator(crew, llm=None):
+    def emit_evaluation_started_event(self, agent_role: str, agent_id: str, task_id: str | None = None):
+        crewai_event_bus.emit(
+            self,
+            AgentEvaluationStartedEvent(agent_role=agent_role, agent_id=agent_id, task_id=task_id, iteration=self._execution_state.iteration)
+        )
+
+    def emit_evaluation_completed_event(self, agent_role: str, agent_id: str, task_id: str | None = None, metric_category: MetricCategory | None = None, score: EvaluationScore | None = None):
+        crewai_event_bus.emit(
+            self,
+            AgentEvaluationCompletedEvent(agent_role=agent_role, agent_id=agent_id, task_id=task_id, iteration=self._execution_state.iteration, metric_category=metric_category, score=score)
+        )
+
+    def emit_evaluation_failed_event(self, agent_role: str, agent_id: str, error: str, task_id: str | None = None):
+        crewai_event_bus.emit(
+            self,
+            AgentEvaluationFailedEvent(agent_role=agent_role, agent_id=agent_id, task_id=task_id, iteration=self._execution_state.iteration, error=error)
+        )
+
+def create_default_evaluator(agents: list[Agent], llm: Any | None = None):
    from crewai.experimental.evaluation import (
        GoalAlignmentEvaluator,
        SemanticQualityEvaluator,
@@ -209,4 +242,4 @@ def create_default_evaluator(crew, llm=None):
        ReasoningEfficiencyEvaluator(llm=llm),
    ]

-    return AgentEvaluator(evaluators=evaluators, crew=crew)
+    return AgentEvaluator(evaluators=evaluators, agents=agents)
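
Review note: the change above replaces the pull-style `evaluate_current_iteration()` with handlers registered on the event bus, while results still live in thread-local state. The following is a minimal sketch of that pattern, not the crewai implementation: `TinyEventBus`, `TaskDone` and `SketchEvaluator` are hypothetical stand-ins, and the "score" is just the output length. It also illustrates one consequence that seems worth checking in the real code: state written by a handler on one thread is not visible from another.

```python
from __future__ import annotations

import threading
from collections import defaultdict
from typing import Any, Callable


class TinyEventBus:
    """Hypothetical stand-in for an event bus (not crewai_event_bus)."""

    def __init__(self) -> None:
        self._handlers: dict[type, list[Callable[[Any, Any], None]]] = defaultdict(list)

    def register_handler(self, event_type: type, handler: Callable[[Any, Any], None]) -> None:
        self._handlers[event_type].append(handler)

    def emit(self, source: Any, event: Any) -> None:
        # Dispatch to every handler registered for the event's exact type.
        for handler in self._handlers[type(event)]:
            handler(source, event)


class TaskDone:
    """Hypothetical event carrying just what the sketch needs."""

    def __init__(self, agent_id: str, role: str, output: str) -> None:
        self.agent_id = agent_id
        self.role = role
        self.output = output


class SketchEvaluator:
    def __init__(self, bus: TinyEventBus, agent_ids: list[str]) -> None:
        self._local = threading.local()
        self.agent_ids = set(agent_ids)
        # Subscribe once at construction time, as the diff does.
        bus.register_handler(TaskDone, self._on_task_done)

    @property
    def _state(self) -> dict[str, Any]:
        # Lazily create one state object per thread.
        if not hasattr(self._local, "state"):
            self._local.state = {"iteration": 1, "results": {}}
        return self._local.state

    def _on_task_done(self, source: Any, event: TaskDone) -> None:
        if event.agent_id not in self.agent_ids:
            return  # not an agent we were asked to evaluate
        per_iteration = self._state["results"].setdefault(self._state["iteration"], {})
        # Placeholder "score": the length of the output.
        per_iteration.setdefault(event.role, []).append(len(event.output))

    def results(self) -> dict[str, list[int]]:
        return self._state["results"].get(self._state["iteration"], {})


bus = TinyEventBus()
evaluator = SketchEvaluator(bus, ["a1"])
bus.emit(None, TaskDone("a1", "researcher", "hello"))
bus.emit(None, TaskDone("zz", "writer", "ignored"))

# Emitting from a worker thread records into that thread's own state.
worker = threading.Thread(target=bus.emit, args=(None, TaskDone("a1", "researcher", "hi")))
worker.start()
worker.join()
main_thread_results = evaluator.results()
```

In the diff, `agent_evaluators` is filled in `__init__` through the thread-local `_execution_state`, so if a handler ever runs on a different thread than the one that built the evaluator, it would presumably see an empty mapping and skip the evaluation.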
--- a/src/crewai/experimental/evaluation/base_evaluator.py
+++ b/src/crewai/experimental/evaluation/base_evaluator.py
@@ -57,9 +57,9 @@ class BaseEvaluator(abc.ABC):
    def evaluate(
        self,
        agent: Agent,
-        task: Task,
        execution_trace: Dict[str, Any],
        final_output: Any,
+        task: Task | None = None,
    ) -> EvaluationScore:
        pass

--- a/src/crewai/experimental/evaluation/evaluation_listener.py
+++ b/src/crewai/experimental/evaluation/evaluation_listener.py
@@ -9,7 +9,9 @@ from crewai.utilities.events.base_event_listener import BaseEventListener
 from crewai.utilities.events.crewai_event_bus import CrewAIEventsBus
 from crewai.utilities.events.agent_events import (
    AgentExecutionStartedEvent,
-    AgentExecutionCompletedEvent
+    AgentExecutionCompletedEvent,
+    LiteAgentExecutionStartedEvent,
+    LiteAgentExecutionCompletedEvent
 )
 from crewai.utilities.events.tool_usage_events import (
    ToolUsageFinishedEvent,
@@ -52,10 +54,18 @@ class EvaluationTraceCallback(BaseEventListener):
        def on_agent_started(source, event: AgentExecutionStartedEvent):
            self.on_agent_start(event.agent, event.task)

+        @event_bus.on(LiteAgentExecutionStartedEvent)
+        def on_lite_agent_started(source, event: LiteAgentExecutionStartedEvent):
+            self.on_lite_agent_start(event.agent_info)
+
        @event_bus.on(AgentExecutionCompletedEvent)
        def on_agent_completed(source, event: AgentExecutionCompletedEvent):
            self.on_agent_finish(event.agent, event.task, event.output)

+        @event_bus.on(LiteAgentExecutionCompletedEvent)
+        def on_lite_agent_completed(source, event: LiteAgentExecutionCompletedEvent):
+            self.on_lite_agent_finish(event.output)
+
        @event_bus.on(ToolUsageFinishedEvent)
        def on_tool_completed(source, event: ToolUsageFinishedEvent):
            self.on_tool_use(event.tool_name, event.tool_args, event.output, success=True)
@@ -88,19 +98,38 @@ class EvaluationTraceCallback(BaseEventListener):
        def on_llm_call_completed(source, event: LLMCallCompletedEvent):
            self.on_llm_call_end(event.messages, event.response)

+    def on_lite_agent_start(self, agent_info: dict[str, Any]):
+        self.current_agent_id = agent_info['id']
+        self.current_task_id = "lite_task"
+
+        trace_key = f"{self.current_agent_id}_{self.current_task_id}"
+        self._init_trace(
+            trace_key=trace_key,
+            agent_id=self.current_agent_id,
+            task_id=self.current_task_id,
+            tool_uses=[],
+            llm_calls=[],
+            start_time=datetime.now(),
+            final_output=None
+        )
+
+    def _init_trace(self, trace_key: str, **kwargs: Any):
+        self.traces[trace_key] = kwargs
+
    def on_agent_start(self, agent: Agent, task: Task):
        self.current_agent_id = agent.id
        self.current_task_id = task.id

        trace_key = f"{agent.id}_{task.id}"
-        self.traces[trace_key] = {
-            "agent_id": agent.id,
-            "task_id": task.id,
-            "tool_uses": [],
-            "llm_calls": [],
-            "start_time": datetime.now(),
-            "final_output": None
-        }
+        self._init_trace(
+            trace_key=trace_key,
+            agent_id=agent.id,
+            task_id=task.id,
+            tool_uses=[],
+            llm_calls=[],
+            start_time=datetime.now(),
+            final_output=None
+        )

    def on_agent_finish(self, agent: Agent, task: Task, output: Any):
        trace_key = f"{agent.id}_{task.id}"
@@ -108,9 +137,20 @@ class EvaluationTraceCallback(BaseEventListener):
            self.traces[trace_key]["final_output"] = output
            self.traces[trace_key]["end_time"] = datetime.now()

+        self._reset_current()
+
+    def _reset_current(self):
        self.current_agent_id = None
        self.current_task_id = None

+    def on_lite_agent_finish(self, output: Any):
+        trace_key = f"{self.current_agent_id}_lite_task"
+        if trace_key in self.traces:
+            self.traces[trace_key]["final_output"] = output
+            self.traces[trace_key]["end_time"] = datetime.now()
+
+        self._reset_current()
+
    def on_tool_use(self, tool_name: str, tool_args: dict[str, Any] | str, result: Any,
                   success: bool = True, error_type: str | None = None):
        if not self.current_agent_id or not self.current_task_id:
@@ -187,4 +227,8 @@ class EvaluationTraceCallback(BaseEventListener):


 def create_evaluation_callbacks() -> EvaluationTraceCallback:
-    return EvaluationTraceCallback()
+    from crewai.utilities.events.crewai_event_bus import crewai_event_bus
+
+    callback = EvaluationTraceCallback()
+    callback.setup_listeners(crewai_event_bus)
+    return callback
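
Review note: the listener keys each trace by `f"{agent_id}_{task_id}"` and uses the fixed task id `"lite_task"` for lite agents. Below is a small sketch of that lifecycle, assuming a hypothetical `TraceStore` class that mirrors only the start/finish/reset steps shown in the hunk. It suggests that a second lite run of the same agent overwrites the first trace, since both runs share one key.

```python
from __future__ import annotations

from datetime import datetime
from typing import Any

LITE_TASK_ID = "lite_task"  # fixed placeholder id used for task-less runs in the diff


class TraceStore:
    """Hypothetical, trimmed-down mirror of the trace bookkeeping in the listener."""

    def __init__(self) -> None:
        self.traces: dict[str, dict[str, Any]] = {}
        self.current_agent_id: str | None = None
        self.current_task_id: str | None = None

    def start(self, agent_id: str, task_id: str = LITE_TASK_ID) -> None:
        self.current_agent_id = agent_id
        self.current_task_id = task_id
        # Plain assignment: an existing trace under the same key is replaced.
        self.traces[f"{agent_id}_{task_id}"] = {
            "agent_id": agent_id,
            "task_id": task_id,
            "tool_uses": [],
            "llm_calls": [],
            "start_time": datetime.now(),
            "final_output": None,
        }

    def finish(self, output: Any) -> None:
        key = f"{self.current_agent_id}_{self.current_task_id}"
        if key in self.traces:
            self.traces[key]["final_output"] = output
            self.traces[key]["end_time"] = datetime.now()
        # Clear the "current" pointers whether or not a trace was found.
        self.current_agent_id = None
        self.current_task_id = None


store = TraceStore()
store.start("a1")
store.finish("first run")
store.start("a1")
store.finish("second run")
```

If per-run traces are needed for lite agents, a per-run identifier in the key would be one option; that is a suggestion, not something this diff does.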
--- a/src/crewai/experimental/evaluation/experiment/runner.py
+++ b/src/crewai/experimental/evaluation/experiment/runner.py
@@ -2,7 +2,7 @@ from collections import defaultdict
 from hashlib import md5
 from typing import Any

-from crewai import Crew
+from crewai import Crew, Agent
 from crewai.experimental.evaluation import AgentEvaluator, create_default_evaluator
 from crewai.experimental.evaluation.experiment.result_display import ExperimentResultsDisplay
 from crewai.experimental.evaluation.experiment.result import ExperimentResults, ExperimentResult
@@ -14,14 +14,18 @@ class ExperimentRunner:
        self.evaluator: AgentEvaluator | None = None
        self.display = ExperimentResultsDisplay()

-    def run(self, crew: Crew, print_summary: bool = False) -> ExperimentResults:
-        self.evaluator = create_default_evaluator(crew=crew)
+    def run(self, crew: Crew | None = None, agents: list[Agent] | None = None, print_summary: bool = False) -> ExperimentResults:
+        if crew and not agents:
+            agents = crew.agents
+
+        assert agents is not None
+        self.evaluator = create_default_evaluator(agents=agents)

        results = []

        for test_case in self.dataset:
            self.evaluator.reset_iterations_results()
-            result = self._run_test_case(test_case, crew)
+            result = self._run_test_case(test_case=test_case, crew=crew, agents=agents)
            results.append(result)

        experiment_results = ExperimentResults(results)
@@ -31,7 +35,7 @@ class ExperimentRunner:

        return experiment_results

-    def _run_test_case(self, test_case: dict[str, Any], crew: Crew) -> ExperimentResult:
+    def _run_test_case(self, test_case: dict[str, Any], agents: list[Agent], crew: Crew | None = None) -> ExperimentResult:
        inputs = test_case["inputs"]
        expected_score = test_case["expected_score"]
        identifier = test_case.get("identifier") or md5(str(test_case).encode(), usedforsecurity=False).hexdigest()
@@ -39,7 +43,11 @@ class ExperimentRunner:
        try:
            self.display.console.print(f"[dim]Running crew with input: {str(inputs)[:50]}...[/dim]")
            self.display.console.print("\n")
-            crew.kickoff(inputs=inputs)
+            if crew:
+                crew.kickoff(inputs=inputs)
+            else:
+                for agent in agents:
+                    agent.kickoff(**inputs)

            assert self.evaluator is not None
            agent_evaluations = self.evaluator.get_agent_evaluation()
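
Review note: `run()` falls back to `crew.agents` when no agents are passed and then guards with `assert agents is not None`. Here is a sketch of the same resolution as a standalone helper, with the assert swapped for an explicit `ValueError` (asserts are skipped under `python -O`). `resolve_agents` is a hypothetical name, and it is deliberately stricter than the diff: an empty agent list with no crew is rejected here, whereas the assert in the diff would let it through.

```python
from __future__ import annotations

from types import SimpleNamespace
from typing import Any, Sequence


def resolve_agents(crew: Any | None = None, agents: Sequence[Any] | None = None) -> list[Any]:
    """Pick the agents to evaluate: explicit agents win, else the crew's agents."""
    if agents:
        return list(agents)
    if crew is not None:
        return list(crew.agents)
    raise ValueError("Provide either a crew or a non-empty list of agents.")


# SimpleNamespace stands in for a crew object exposing an `agents` attribute.
fake_crew = SimpleNamespace(agents=["researcher", "writer"])
from_crew = resolve_agents(crew=fake_crew)
explicit = resolve_agents(crew=fake_crew, agents=["reviewer"])
```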
--- a/src/crewai/experimental/evaluation/metrics/goal_metrics.py
+++ b/src/crewai/experimental/evaluation/metrics/goal_metrics.py
@@ -14,10 +14,14 @@ class GoalAlignmentEvaluator(BaseEvaluator):
    def evaluate(
        self,
        agent: Agent,
-        task: Task,
        execution_trace: Dict[str, Any],
        final_output: Any,
+        task: Task | None = None,
    ) -> EvaluationScore:
+        task_context = ""
+        if task is not None:
+            task_context = f"Task description: {task.description}\nExpected output: {task.expected_output}\n"
+
        prompt = [
            {"role": "system", "content": """You are an expert evaluator assessing how well an AI agent's output aligns with its assigned task goal.

@@ -37,8 +41,7 @@ Return your evaluation as JSON with fields 'score' (number) and 'feedback' (stri
            {"role": "user", "content": f"""
 Agent role: {agent.role}
 Agent goal: {agent.goal}
-Task description: {task.description}
-Expected output: {task.expected_output}
+{task_context}

 Agent's final output:
 {final_output}
--- a/src/crewai/experimental/evaluation/metrics/reasoning_metrics.py
+++ b/src/crewai/experimental/evaluation/metrics/reasoning_metrics.py
@@ -36,10 +36,14 @@ class ReasoningEfficiencyEvaluator(BaseEvaluator):
    def evaluate(
        self,
        agent: Agent,
-        task: Task,
        execution_trace: Dict[str, Any],
-        final_output: TaskOutput,
+        final_output: TaskOutput | str,
+        task: Task | None = None,
    ) -> EvaluationScore:
+        task_context = ""
+        if task is not None:
+            task_context = f"Task description: {task.description}\nExpected output: {task.expected_output}\n"
+
        llm_calls = execution_trace.get("llm_calls", [])

        if not llm_calls or len(llm_calls) < 2:
@@ -83,6 +87,8 @@ class ReasoningEfficiencyEvaluator(BaseEvaluator):

        call_samples = self._get_call_samples(llm_calls)

+        final_output = final_output.raw if isinstance(final_output, TaskOutput) else final_output
+
        prompt = [
            {"role": "system", "content": """You are an expert evaluator assessing the reasoning efficiency of an AI agent's thought process.

@@ -117,7 +123,7 @@ Return your evaluation as JSON with the following structure:
 }"""},
            {"role": "user", "content": f"""
 Agent role: {agent.role}
-Task description: {task.description}
+{task_context}

 Reasoning efficiency metrics:
 - Total LLM calls: {efficiency_metrics["total_llm_calls"]}
@@ -130,7 +136,7 @@ Sample of agent reasoning flow (chronological sequence):
 {call_samples}

 Agent's final output:
-{final_output.raw[:500]}... (truncated)
+{final_output[:500]}... (truncated)

 Evaluate the reasoning efficiency of this agent based on these interaction patterns.
 Identify any inefficient reasoning patterns and provide specific suggestions for optimization.
--- a/src/crewai/experimental/evaluation/metrics/semantic_quality_metrics.py
+++ b/src/crewai/experimental/evaluation/metrics/semantic_quality_metrics.py
@@ -14,10 +14,13 @@ class SemanticQualityEvaluator(BaseEvaluator):
    def evaluate(
        self,
        agent: Agent,
-        task: Task,
        execution_trace: Dict[str, Any],
        final_output: Any,
+        task: Task | None = None,
    ) -> EvaluationScore:
+        task_context = ""
+        if task is not None:
+            task_context = f"Task description: {task.description}"
        prompt = [
            {"role": "system", "content": """You are an expert evaluator assessing the semantic quality of an AI agent's output.

@@ -37,7 +40,7 @@ Return your evaluation as JSON with fields 'score' (number) and 'feedback' (stri
 """},
            {"role": "user", "content": f"""
 Agent role: {agent.role}
-Task description: {task.description}
+{task_context}

 Agent's final output:
 {final_output}
--- a/src/crewai/experimental/evaluation/metrics/tools_metrics.py
+++ b/src/crewai/experimental/evaluation/metrics/tools_metrics.py
@@ -16,10 +16,14 @@ class ToolSelectionEvaluator(BaseEvaluator):
    def evaluate(
        self,
        agent: Agent,
-        task: Task,
        execution_trace: Dict[str, Any],
        final_output: str,
+        task: Task | None = None,
    ) -> EvaluationScore:
+        task_context = ""
+        if task is not None:
+            task_context = f"Task description: {task.description}"
+
        tool_uses = execution_trace.get("tool_uses", [])
        tool_count = len(tool_uses)
        unique_tool_types = set([tool.get("tool", "Unknown tool") for tool in tool_uses])
@@ -72,7 +76,7 @@ Return your evaluation as JSON with these fields:
 """},
            {"role": "user", "content": f"""
 Agent role: {agent.role}
-Task description: {task.description}
+{task_context}

 Available tools for this agent:
 {available_tools_info}
@@ -128,10 +132,13 @@ class ParameterExtractionEvaluator(BaseEvaluator):
    def evaluate(
        self,
        agent: Agent,
-        task: Task,
        execution_trace: Dict[str, Any],
        final_output: str,
+        task: Task | None = None,
    ) -> EvaluationScore:
+        task_context = ""
+        if task is not None:
+            task_context = f"Task description: {task.description}"
        tool_uses = execution_trace.get("tool_uses", [])
        tool_count = len(tool_uses)

@@ -212,7 +219,7 @@ Return your evaluation as JSON with these fields:
 """},
            {"role": "user", "content": f"""
 Agent role: {agent.role}
-Task description: {task.description}
+{task_context}

 Parameter extraction examples:
 {param_samples_text}
@@ -267,10 +274,13 @@ class ToolInvocationEvaluator(BaseEvaluator):
    def evaluate(
        self,
        agent: Agent,
-        task: Task,
        execution_trace: Dict[str, Any],
        final_output: str,
+        task: Task | None = None,
    ) -> EvaluationScore:
+        task_context = ""
+        if task is not None:
+            task_context = f"Task description: {task.description}"
        tool_uses = execution_trace.get("tool_uses", [])
        tool_errors = []
        tool_count = len(tool_uses)
@@ -352,7 +362,7 @@ Return your evaluation as JSON with these fields:
 """},
            {"role": "user", "content": f"""
 Agent role: {agent.role}
-Task description: {task.description}
+{task_context}

 Tool invocation examples:
 {invocation_samples_text}
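
Review note: the same `task_context` block is repeated in each evaluator touched above. One possible consolidation is a shared helper; the sketch below assumes a hypothetical `build_task_context` function and a `FakeTask` stand-in with only the two attributes the prompts read. Unlike some of the variants in the diff, it always ends the context with a newline.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class FakeTask:
    """Stand-in exposing only the attributes the prompts use."""

    description: str
    expected_output: str


def build_task_context(task: FakeTask | None, include_expected_output: bool = False) -> str:
    """Return the prompt fragment describing the task, or "" for task-less runs."""
    if task is None:
        return ""
    lines = [f"Task description: {task.description}"]
    if include_expected_output:
        lines.append(f"Expected output: {task.expected_output}")
    return "\n".join(lines) + "\n"


no_task = build_task_context(None)
short = build_task_context(FakeTask("Summarize the report", "Three bullet points"))
full = build_task_context(
    FakeTask("Summarize the report", "Three bullet points"),
    include_expected_output=True,
)
```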
--- a/src/crewai/experimental/evaluation/testing.py
+++ b/src/crewai/experimental/evaluation/testing.py
@@ -3,7 +3,7 @@ import inspect
 from typing_extensions import Any
 import warnings
 from crewai.experimental.evaluation.experiment import ExperimentResults, ExperimentRunner
-from crewai import Crew
+from crewai import Crew, Agent

 def assert_experiment_successfully(experiment_results: ExperimentResults, baseline_filepath: str | None = None) -> None:
    failed_tests = [result for result in experiment_results.results if not result.passed]
@@ -35,10 +35,10 @@ def assert_experiment_no_regression(comparison_result: dict[str, list[str]]) ->
            UserWarning
        )

-def run_experiment(dataset: list[dict[str, Any]], crew: Crew, verbose: bool = False) -> ExperimentResults:
+def run_experiment(dataset: list[dict[str, Any]], crew: Crew | None = None, agents: list[Agent] | None = None, verbose: bool = False) -> ExperimentResults:
    runner = ExperimentRunner(dataset=dataset)

-    return runner.run(crew=crew, print_summary=verbose)
+    return runner.run(agents=agents, crew=crew, print_summary=verbose)

 def _get_baseline_filepath_fallback() -> str:
    test_func_name = "experiment_fallback"
--- a/src/crewai/knowledge/storage/knowledge_storage.py
+++ b/src/crewai/knowledge/storage/knowledge_storage.py
@@ -18,6 +18,7 @@ from crewai.utilities.chromadb import sanitize_collection_name
 from crewai.utilities.constants import KNOWLEDGE_DIRECTORY
 from crewai.utilities.logger import Logger
 from crewai.utilities.paths import db_storage_path
+from crewai.utilities.chromadb import create_persistent_client


@contextlib.contextmanager
@@ -84,14 +85,11 @@ class KnowledgeStorage(BaseKnowledgeStorage):
                raise Exception("Collection not initialized")

    def initialize_knowledge_storage(self):
-        base_path = os.path.join(db_storage_path(), "knowledge")
-        chroma_client = chromadb.PersistentClient(
-            path=base_path,
+        self.app = create_persistent_client(
+            path=os.path.join(db_storage_path(), "knowledge"),
            settings=Settings(allow_reset=True),
        )

-        self.app = chroma_client
-
        try:
            collection_name = (
                f"knowledge_{self.collection_name}"
@@ -111,9 +109,8 @@ class KnowledgeStorage(BaseKnowledgeStorage):
    def reset(self):
        base_path = os.path.join(db_storage_path(), KNOWLEDGE_DIRECTORY)
        if not self.app:
-            self.app = chromadb.PersistentClient(
-                path=base_path,
-                settings=Settings(allow_reset=True),
+            self.app = create_persistent_client(
+                path=base_path, settings=Settings(allow_reset=True)
            )

        self.app.reset()
--- a/src/crewai/lite_agent.py
+++ b/src/crewai/lite_agent.py
@@ -305,6 +305,7 @@ class LiteAgent(FlowTrackable, BaseModel):
        """
        # Create agent info for event emission
        agent_info = {
+            "id": self.id,
            "role": self.role,
            "goal": self.goal,
            "backstory": self.backstory,
--- a/src/crewai/llm.py
+++ b/src/crewai/llm.py
@@ -59,6 +59,7 @@ from crewai.utilities.exceptions.context_window_exceeding_exception import (

 load_dotenv()

+litellm.suppress_debug_info = True

 class FilteredStream(io.TextIOBase):
    _lock = None
@@ -76,9 +77,7 @@ class FilteredStream(io.TextIOBase):

            # Skip common noisy LiteLLM banners and any other lines that contain "litellm"
            if (
-                "give feedback / get help" in lower_s
-                or "litellm.info:" in lower_s
-                or "litellm" in lower_s
+                "litellm.info:" in lower_s
                or "Consider using a smaller input or implementing a text splitting strategy" in lower_s
            ):
                return 0
@@ -760,7 +759,7 @@ class LLM(BaseLLM):
        available_functions: Optional[Dict[str, Any]] = None,
        from_task: Optional[Any] = None,
        from_agent: Optional[Any] = None,
-    ) -> str:
+    ) -> str | Any:
        """Handle a non-streaming response from the LLM.

        Args:
@@ -784,13 +783,11 @@ class LLM(BaseLLM):
            # Convert litellm's context window error to our own exception type
            # for consistent handling in the rest of the codebase
            raise LLMContextLengthExceededException(str(e))
-
        # --- 2) Extract response message and content
        response_message = cast(Choices, cast(ModelResponse, response).choices)[
            0
        ].message
        text_response = response_message.content or ""
-
        # --- 3) Handle callbacks with usage info
        if callbacks and len(callbacks) > 0:
            for callback in callbacks:
@@ -803,21 +800,22 @@ class LLM(BaseLLM):
                            start_time=0,
                            end_time=0,
                        )
-
        # --- 4) Check for tool calls
        tool_calls = getattr(response_message, "tool_calls", [])

-        # --- 5) If no tool calls or no available functions, return the text response directly
-        if not tool_calls or not available_functions:
+        # --- 5) If no tool calls or no available functions, return the text response directly as long as there is a text response
+        if (not tool_calls or not available_functions) and text_response:
            self._handle_emit_call_events(response=text_response, call_type=LLMCallType.LLM_CALL, from_task=from_task, from_agent=from_agent, messages=params["messages"])
            return text_response
+        # --- 6) If there is no text response, no available functions, but there are tool calls, return the tool calls
+        elif tool_calls and not available_functions and not text_response:
+            return tool_calls

-        # --- 6) Handle tool calls if present
+        # --- 7) Handle tool calls if present
        tool_result = self._handle_tool_call(tool_calls, available_functions)
        if tool_result is not None:
            return tool_result
-
-        # --- 7) If tool call handling didn't return a result, emit completion event and return text response
+        # --- 8) If tool call handling didn't return a result, emit completion event and return text response
        self._handle_emit_call_events(response=text_response, call_type=LLMCallType.LLM_CALL, from_task=from_task, from_agent=from_agent, messages=params["messages"])
        return text_response

@@ -952,22 +950,18 @@ class LLM(BaseLLM):
        # --- 3) Convert string messages to proper format if needed
        if isinstance(messages, str):
            messages = [{"role": "user", "content": messages}]
-
        # --- 4) Handle O1 model special case (system messages not supported)
        if "o1" in self.model.lower():
            for message in messages:
                if message.get("role") == "system":
                    message["role"] = "assistant"
-
        # --- 5) Set up callbacks if provided
        with suppress_warnings():
            if callbacks and len(callbacks) > 0:
                self.set_callbacks(callbacks)
-
            try:
                # --- 6) Prepare parameters for the completion call
                params = self._prepare_completion_params(messages, tools)
-
                # --- 7) Make the completion call and handle response
                if self.stream:
                    return self._handle_streaming_response(
@@ -984,12 +978,32 @@ class LLM(BaseLLM):
                # whether to summarize the content or abort based on the respect_context_window flag
                raise
            except Exception as e:
+                unsupported_stop = "Unsupported parameter" in str(e) and "'stop'" in str(e)
+
+                if unsupported_stop:
+                    if "additional_drop_params" in self.additional_params and isinstance(self.additional_params["additional_drop_params"], list):
+                        self.additional_params["additional_drop_params"].append("stop")
+                    else:
+                        self.additional_params = {"additional_drop_params": ["stop"]}
+
+                    logging.info(
+                        "Retrying LLM call without the unsupported 'stop' parameter"
+                    )
+
+                    return self.call(
+                        messages,
+                        tools=tools,
+                        callbacks=callbacks,
+                        available_functions=available_functions,
+                        from_task=from_task,
+                        from_agent=from_agent,
+                    )
+
                assert hasattr(crewai_event_bus, "emit")
                crewai_event_bus.emit(
                    self,
                    event=LLMCallFailedEvent(error=str(e), from_task=from_task, from_agent=from_agent),
                )
-                logging.error(f"LiteLLM call failed: {str(e)}")
                raise

    def _handle_emit_call_events(self, response: Any, call_type: LLMCallType, from_task: Optional[Any] = None, from_agent: Optional[Any] = None, messages: str | list[dict[str, Any]] | None = None):
@@ -1058,6 +1072,15 @@ class LLM(BaseLLM):
                messages.append({"role": "user", "content": "Please continue."})
            return messages

+        # TODO: Remove this code after merging PR https://github.com/BerriAI/litellm/pull/10917
+        # Ollama doesn't support the last message being 'assistant'
+        if "ollama" in self.model.lower() and messages and messages[-1]["role"] == "assistant":
+            messages = messages.copy()
+            messages.append(
+                {"role": "user", "content": ""}
+            )
+            return messages
+
        # Handle Anthropic models
        if not self.is_anthropic:
            return messages
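Review note: the renumbered steps 5 to 8 in the non-streaming response handler above can be read as one small decision function. The sketch below restates that branching outside the class, assuming the same truthiness checks as the diff; `resolve_response` and the `handle_tool_call` callback are hypothetical names used only for illustration and are not part of `crewai`.

```python
from typing import Any, Callable, Dict, List, Optional


def resolve_response(
    text_response: str,
    tool_calls: List[Any],
    available_functions: Optional[Dict[str, Any]],
    handle_tool_call: Callable[[List[Any], Optional[Dict[str, Any]]], Optional[Any]],
) -> Any:
    """Hypothetical stand-alone restatement of steps 5-8 from the diff."""
    # Step 5: nothing to execute, but there is text, so return the text.
    if (not tool_calls or not available_functions) and text_response:
        return text_response
    # Step 6: tool calls exist, nothing can run them, and there is no text,
    # so hand the raw tool calls back to the caller.
    if tool_calls and not available_functions and not text_response:
        return tool_calls
    # Step 7: try to execute the tool calls.
    tool_result = handle_tool_call(tool_calls, available_functions)
    if tool_result is not None:
        return tool_result
    # Step 8: fall back to the (possibly empty) text response.
    return text_response
```

Under these assumptions, the new step 6 is the only path that returns something other than a string or a tool result, which appears to be why the return annotation was widened to `str | Any`.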
--- a/src/crewai/memory/contextual/contextual_memory.py
+++ b/src/crewai/memory/contextual/contextual_memory.py
@@ -108,6 +108,7 @@ class ContextualMemory:

    def _fetch_user_context(self, query: str) -> str:
        """
+        DEPRECATED: Will be removed in version 0.156.0 or on 2025-08-04, whichever comes first.
        Fetches and formats relevant user information from User Memory.
        Args:
            query (str): The search query to find relevant user memories.
--- a/src/crewai/memory/storage/mem0_storage.py
+++ b/src/crewai/memory/storage/mem0_storage.py
@@ -64,6 +64,7 @@ class Mem0Storage(Storage):
    def save(self, value: Any, metadata: Dict[str, Any]) -> None:
        user_id = self._get_user_id()
        agent_name = self._get_agent_name()
+        assistant_message = [{"role": "assistant", "content": value}]
        params = None
        if self.memory_type == "short_term":
            params = {
@@ -93,7 +94,8 @@ class Mem0Storage(Storage):
        if params:
            if isinstance(self.memory, MemoryClient):
                params["output_format"] = "v1.1"
-            self.memory.add(value, **params)
+            
+            self.memory.add(assistant_message, **params)

    def search(
        self,
--- a/src/crewai/memory/storage/rag_storage.py
+++ b/src/crewai/memory/storage/rag_storage.py
@@ -4,12 +4,12 @@ import logging
 import os
 import shutil
 import uuid
+
 from typing import Any, Dict, List, Optional
-
 from chromadb.api import ClientAPI
-
 from crewai.memory.storage.base_rag_storage import BaseRAGStorage
 from crewai.utilities import EmbeddingConfigurator
+from crewai.utilities.chromadb import create_persistent_client
 from crewai.utilities.constants import MAX_FILE_NAME_LENGTH
 from crewai.utilities.paths import db_storage_path

@@ -60,17 +60,15 @@ class RAGStorage(BaseRAGStorage):
        self.embedder_config = configurator.configure_embedder(self.embedder_config)

    def _initialize_app(self):
-        import chromadb
        from chromadb.config import Settings

        self._set_embedder_config()
-        chroma_client = chromadb.PersistentClient(
+
+        self.app = create_persistent_client(
            path=self.path if self.path else self.storage_file_name,
            settings=Settings(allow_reset=self.allow_reset),
        )

-        self.app = chroma_client
-
        self.collection = self.app.get_or_create_collection(
            name=self.type, embedding_function=self.embedder_config
        )
--- a/src/crewai/memory/user/user_memory.py
+++ b/src/crewai/memory/user/user_memory.py
@@ -14,7 +14,8 @@ class UserMemory(Memory):

    def __init__(self, crew=None):
        warnings.warn(
-            "UserMemory is deprecated and will be removed in a future version. "
+            "UserMemory is deprecated and will be removed in version 0.156.0 "
+            "or on 2025-08-04, whichever comes first. "
            "Please use ExternalMemory instead.",
            DeprecationWarning,
            stacklevel=2,
--- a/src/crewai/memory/user/user_memory_item.py
+++ b/src/crewai/memory/user/user_memory_item.py
@@ -1,8 +1,16 @@
+import warnings
 from typing import Any, Dict, Optional


 class UserMemoryItem:
    def __init__(self, data: Any, user: str, metadata: Optional[Dict[str, Any]] = None):
+        warnings.warn(
+            "UserMemoryItem is deprecated and will be removed in version 0.156.0 "
+            "or on 2025-08-04, whichever comes first. "
+            "Please use ExternalMemory instead.",
+            DeprecationWarning,
+            stacklevel=2,
+        )
        self.data = data
        self.user = user
        self.metadata = metadata if metadata is not None else {}
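Review note: the deprecation pattern added to `UserMemory` and `UserMemoryItem` can be exercised in isolation. The sketch below is a minimal illustration using only the standard library; `LegacyItem` is a hypothetical stand-in class, not part of `crewai`, and the message text is shortened.

```python
import warnings


class LegacyItem:
    """Hypothetical class that warns on construction, mirroring the diff."""

    def __init__(self, data):
        # stacklevel=2 attributes the warning to the caller's line
        # rather than to this constructor.
        warnings.warn(
            "LegacyItem is deprecated. Please use a replacement instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        self.data = data


# DeprecationWarning is hidden by default in many contexts, so record it explicitly.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    item = LegacyItem("example")

print(len(caught), caught[0].category.__name__)
```

A test suite could use the same `catch_warnings(record=True)` approach to confirm the warning fires before the class is removed.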
--- a/src/crewai/utilities/agent_utils.py
+++ b/src/crewai/utilities/agent_utils.py
@@ -157,10 +157,6 @@ def get_llm_response(
            from_agent=from_agent,
        )
    except Exception as e:
-        printer.print(
-            content=f"Error during LLM call: {e}",
-            color="red",
-        )
        raise e
    if not answer:
        printer.print(
@@ -232,12 +228,17 @@ def handle_unknown_error(printer: Any, exception: Exception) -> None:
        printer: Printer instance for output
        exception: The exception that occurred
    """
+    error_message = str(exception)
+
+    if "litellm" in error_message:
+        return
+
    printer.print(
        content="An unknown error occurred. Please check the details below.",
        color="red",
    )
    printer.print(
-        content=f"Error details: {exception}",
+        content=f"Error details: {error_message}",
        color="red",
    )

--- a/src/crewai/utilities/chromadb.py
+++ b/src/crewai/utilities/chromadb.py
@@ -1,6 +1,10 @@
 import re
+import portalocker
+from chromadb import PersistentClient
+from hashlib import md5
 from typing import Optional

+
 MIN_COLLECTION_LENGTH = 3
 MAX_COLLECTION_LENGTH = 63
 DEFAULT_COLLECTION = "default_collection"
@@ -60,3 +64,16 @@ def sanitize_collection_name(name: Optional[str], max_collection_length: int = M
            sanitized = sanitized[:-1] + "z"

    return sanitized
+
+
+def create_persistent_client(path: str, **kwargs):
+    """
+    Creates a persistent client for ChromaDB with a lock file to prevent
+    concurrent creations. Works in both multi-threaded and multi-process
+    environments.
+    """
+    lockfile = f"chromadb-{md5(path.encode(), usedforsecurity=False).hexdigest()}.lock"
+    with portalocker.Lock(lockfile):
+        client = PersistentClient(path=path, **kwargs)
+
+    return client
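Review note: the lock file name in `create_persistent_client` is derived from an MD5 digest of the storage path, so every caller pointing at the same path contends for the same lock. The sketch below isolates only that derivation with the standard library; `lockfile_name` is a hypothetical helper introduced for illustration, and the actual locking via `portalocker` is deliberately left out.

```python
from hashlib import md5


def lockfile_name(path: str) -> str:
    """Hypothetical helper reproducing the lock file naming from the diff."""
    # usedforsecurity=False (Python 3.9+) marks the hash as a plain
    # fingerprint, not a security primitive.
    digest = md5(path.encode(), usedforsecurity=False).hexdigest()
    return f"chromadb-{digest}.lock"


same_a = lockfile_name("/tmp/crew/short_term")
same_b = lockfile_name("/tmp/crew/short_term")
other = lockfile_name("/tmp/crew/long_term")
print(same_a == same_b, same_a == other)
```

Because the name has no directory component, the lock file is created relative to the current working directory. Processes started from different directories would therefore use different lock files for the same storage path, which may be worth confirming against the intended behavior.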
--- a/src/crewai/utilities/events/__init__.py
+++ b/src/crewai/utilities/events/__init__.py
@@ -17,6 +17,9 @@ from .agent_events import (
    AgentExecutionStartedEvent,
    AgentExecutionCompletedEvent,
    AgentExecutionErrorEvent,
+    AgentEvaluationStartedEvent,
+    AgentEvaluationCompletedEvent,
+    AgentEvaluationFailedEvent,
 )
 from .task_events import (
    TaskStartedEvent,
@@ -74,6 +77,9 @@ __all__ = [
    "AgentExecutionStartedEvent",
    "AgentExecutionCompletedEvent",
    "AgentExecutionErrorEvent",
+    "AgentEvaluationStartedEvent",
+    "AgentEvaluationCompletedEvent",
+    "AgentEvaluationFailedEvent",
    "TaskStartedEvent",
    "TaskCompletedEvent",
    "TaskFailedEvent",
--- a/src/crewai/utilities/events/agent_events.py
+++ b/src/crewai/utilities/events/agent_events.py
@@ -123,3 +123,28 @@ class AgentLogsExecutionEvent(BaseEvent):
    type: str = "agent_logs_execution"

    model_config = {"arbitrary_types_allowed": True}
+
+# Agent Eval events
+class AgentEvaluationStartedEvent(BaseEvent):
+    agent_id: str
+    agent_role: str
+    task_id: str | None = None
+    iteration: int
+    type: str = "agent_evaluation_started"
+
+class AgentEvaluationCompletedEvent(BaseEvent):
+    agent_id: str
+    agent_role: str
+    task_id: str | None = None
+    iteration: int
+    metric_category: Any
+    score: Any
+    type: str = "agent_evaluation_completed"
+
+class AgentEvaluationFailedEvent(BaseEvent):
+    agent_id: str
+    agent_role: str
+    task_id: str | None = None
+    iteration: int
+    error: str
+    type: str = "agent_evaluation_failed"
--- a/src/crewai/utilities/events/crewai_event_bus.py
+++ b/src/crewai/utilities/events/crewai_event_bus.py
@@ -1,6 +1,6 @@
 import threading
 from contextlib import contextmanager
-from typing import Any, Callable, Type, TypeVar, cast
+from typing import Any, Callable, Dict, List, Type, TypeVar, cast

 from blinker import Signal

@@ -14,13 +14,10 @@ class CrewAIEventsBus:
    """
    A singleton event bus that uses blinker signals for event handling.
    Allows both internal (Flow/Crew) and external event handling.
-    Handlers are global by default for cross-thread communication,
-    with optional thread-local isolation for testing scenarios.
    """

    _instance = None
    _lock = threading.Lock()
-    _thread_local: threading.local = threading.local()

    def __new__(cls):
        if cls._instance is None:
@@ -33,46 +30,7 @@ class CrewAIEventsBus:
    def _initialize(self) -> None:
        """Initialize the event bus internal state"""
        self._signal = Signal("crewai_event_bus")
-        self._global_handlers: dict[type[BaseEvent], list[Callable]] = {}
-
-    @property
-    def _handlers(self) -> dict[type[BaseEvent], list[Callable]]:
-        if not hasattr(CrewAIEventsBus._thread_local, "handlers"):
-            CrewAIEventsBus._thread_local.handlers = {}
-        return CrewAIEventsBus._thread_local.handlers
-
-    @_handlers.setter
-    def _handlers(self, value: dict[type[BaseEvent], list[Callable]]) -> None:
-        if not hasattr(CrewAIEventsBus._thread_local, "handlers"):
-            CrewAIEventsBus._thread_local.handlers = {}
-        CrewAIEventsBus._thread_local.handlers = value
-
-    def _add_handler_with_deduplication(
-        self, handlers_dict: dict, event_type: Type[BaseEvent], handler: Callable
-    ) -> bool:
-        """
-        Add a handler to the specified handlers dictionary with deduplication.
-
-        Args:
-            handlers_dict: The dictionary to add the handler to
-            event_type: The event type
-            handler: The handler function to add
-
-        Returns:
-            bool: True if handler was added, False if it was already present
-        """
-        if event_type not in handlers_dict:
-            handlers_dict[event_type] = []
-
-        # Check if handler is already registered
-        for existing_handler in handlers_dict[event_type]:
-            if existing_handler is handler:
-                # Handler already exists, don't add duplicate
-                return False
-
-        # Add the handler
-        handlers_dict[event_type].append(handler)
-        return True
+        self._handlers: Dict[Type[BaseEvent], List[Callable]] = {}

    def on(
        self, event_type: Type[EventT]
@@ -80,13 +38,6 @@ class CrewAIEventsBus:
        """
        Decorator to register an event handler for a specific event type.

-        Handlers registered with this decorator are global by default,
-        allowing cross-thread event communication. Use scoped_handlers()
-        for thread-local isolation in testing scenarios.
-
-        Duplicate handlers are automatically prevented - the same handler
-        function will only be registered once per event type.
-
        Usage:
            @crewai_event_bus.on(AgentExecutionCompletedEvent)
            def on_agent_execution_completed(
@@ -99,38 +50,23 @@ class CrewAIEventsBus:
        def decorator(
            handler: Callable[[Any, EventT], None],
        ) -> Callable[[Any, EventT], None]:
-            was_added = self._add_handler_with_deduplication(
-                self._global_handlers, event_type, handler
+            if event_type not in self._handlers:
+                self._handlers[event_type] = []
+            self._handlers[event_type].append(
+                cast(Callable[[Any, EventT], None], handler)
            )
-            if not was_added:
-                # Log that duplicate was prevented (optional)
-                print(
-                    f"[EventBus Info] Handler '{handler.__name__}' already registered for {event_type.__name__}"
-                )
            return handler

        return decorator

    def emit(self, source: Any, event: BaseEvent) -> None:
        """
-        Emit an event to all registered handlers (both global and thread-local)
+        Emit an event to all registered handlers

        Args:
            source: The object emitting the event
            event: The event instance to emit
        """
-        # Call global handlers (default behavior, cross-thread)
-        for event_type, handlers in self._global_handlers.items():
-            if isinstance(event, event_type):
-                for handler in handlers:
-                    try:
-                        handler(source, event)
-                    except Exception as e:
-                        print(
-                            f"[EventBus Error] Global handler '{handler.__name__}' failed for event '{event_type.__name__}': {e}"
-                        )
-
-        # Call thread-local handlers (for testing isolation)
        for event_type, handlers in self._handlers.items():
            if isinstance(event, event_type):
                for handler in handlers:
@@ -138,76 +74,32 @@ class CrewAIEventsBus:
                        handler(source, event)
                    except Exception as e:
                        print(
-                            f"[EventBus Error] Thread-local handler '{handler.__name__}' failed for event '{event_type.__name__}': {e}"
+                            f"[EventBus Error] Handler '{handler.__name__}' failed for event '{event_type.__name__}': {e}"
                        )

-        # Send to blinker signal (existing mechanism)
        self._signal.send(source, event=event)

    def register_handler(
-        self, event_type: Type[BaseEvent], handler: Callable[[Any, BaseEvent], None]
-    ) -> bool:
-        """
-        Register an event handler for a specific event type (global)
-
-        Args:
-            event_type: The event type to handle
-            handler: The handler function to register
-
-        Returns:
-            bool: True if handler was added, False if it was already present
-        """
-        return self._add_handler_with_deduplication(
-            self._global_handlers, event_type, handler
+        self, event_type: Type[EventTypes], handler: Callable[[Any, EventTypes], None]
+    ) -> None:
+        """Register an event handler for a specific event type"""
+        if event_type not in self._handlers:
+            self._handlers[event_type] = []
+        self._handlers[event_type].append(
+            cast(Callable[[Any, EventTypes], None], handler)
        )

-    def unregister_handler(
-        self, event_type: Type[BaseEvent], handler: Callable[[Any, BaseEvent], None]
-    ) -> bool:
-        """
-        Unregister an event handler for a specific event type (global)
-
-        Args:
-            event_type: The event type
-            handler: The handler function to unregister
-
-        Returns:
-            bool: True if handler was removed, False if it wasn't found
-        """
-        if event_type in self._global_handlers:
-            try:
-                self._global_handlers[event_type].remove(handler)
-                return True
-            except ValueError:
-                return False
-        return False
-
-    def get_handler_count(self, event_type: Type[BaseEvent]) -> int:
-        """
-        Get the number of handlers registered for a specific event type
-
-        Args:
-            event_type: The event type to check
-
-        Returns:
-            int: Number of handlers registered for this event type
-        """
-        return len(self._global_handlers.get(event_type, []))
-
    @contextmanager
    def scoped_handlers(self):
        """
-        Context manager for temporary thread-local event handling scope.
-        Useful for testing or temporary event handling with thread isolation.
-
-        This creates thread-local handlers that are isolated from global handlers,
-        making it useful for testing scenarios where you want to avoid interference.
+        Context manager for temporary event handling scope.
+        Useful for testing or temporary event handling.

        Usage:
            with crewai_event_bus.scoped_handlers():
                @crewai_event_bus.on(CrewKickoffStarted)
                def temp_handler(source, event):
-                    print("Temporary thread-local handler")
+                    print("Temporary handler")
                # Do stuff...
            # Handlers are cleared after the context
        """
@@ -218,25 +110,6 @@ class CrewAIEventsBus:
        finally:
            self._handlers = previous_handlers

-    @contextmanager
-    def scoped_global_handlers(self):
-        """
-        Context manager for temporary global event handling scope.
-        Useful for testing or temporary global event handling.
-
-        Usage:
-            with crewai_event_bus.scoped_global_handlers():
-                crewai_event_bus.register_handler(CrewKickoffStarted, temp_handler)
-                # Do stuff...
-            # Global handlers are cleared after the context
-        """
-        previous_global_handlers = self._global_handlers.copy()
-        self._global_handlers.clear()
-        try:
-            yield
-        finally:
-            self._global_handlers = previous_global_handlers
-

 # Global instance
 crewai_event_bus = CrewAIEventsBus()
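Review note: after this change the bus keeps a single handler dictionary, with no deduplication and no thread-local layer, and `scoped_handlers` swaps that dictionary out temporarily. The sketch below is a simplified model of that behavior under those assumptions; `MiniEventBus` and `Ping` are hypothetical names, and the blinker signal and per-handler error handling from the real class are omitted.

```python
from contextlib import contextmanager
from typing import Any, Callable, Dict, List


class MiniEventBus:
    """Hypothetical, simplified model of the single-registry event bus."""

    def __init__(self) -> None:
        self._handlers: Dict[type, List[Callable[[Any, Any], None]]] = {}

    def on(self, event_type: type):
        def decorator(handler):
            # No deduplication: registering the same handler twice
            # means it is called twice per event.
            self._handlers.setdefault(event_type, []).append(handler)
            return handler

        return decorator

    def emit(self, source: Any, event: Any) -> None:
        for event_type, handlers in self._handlers.items():
            if isinstance(event, event_type):
                for handler in handlers:
                    handler(source, event)

    @contextmanager
    def scoped_handlers(self):
        # Start from an empty registry inside the scope, then restore.
        previous = self._handlers
        self._handlers = {}
        try:
            yield
        finally:
            self._handlers = previous


class Ping:
    """Hypothetical event type used only for this sketch."""


bus = MiniEventBus()
calls: List[str] = []


@bus.on(Ping)
def record(source, event):
    calls.append("global")


bus.emit(None, Ping())
with bus.scoped_handlers():

    @bus.on(Ping)
    def temp(source, event):
        calls.append("temp")

    bus.emit(None, Ping())
bus.emit(None, Ping())
```

In this model, handlers registered inside the scope disappear when it exits, and handlers registered before it are silent while it is active. That matches the docstring's description of a temporary scope for tests.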
--- a/src/crewai/utilities/events/event_types.py
+++ b/src/crewai/utilities/events/event_types.py
@@ -4,6 +4,7 @@ from .agent_events import (
    AgentExecutionCompletedEvent,
    AgentExecutionErrorEvent,
    AgentExecutionStartedEvent,
+    LiteAgentExecutionCompletedEvent,
 )
 from .crew_events import (
    CrewKickoffCompletedEvent,
@@ -80,6 +81,7 @@ EventTypes = Union[
    CrewTrainFailedEvent,
    AgentExecutionStartedEvent,
    AgentExecutionCompletedEvent,
+    LiteAgentExecutionCompletedEvent,
    TaskStartedEvent,
    TaskCompletedEvent,
    TaskFailedEvent,
--- a/tests/agent_test.py
+++ b/tests/agent_test.py
@@ -2010,7 +2010,6 @@ def test_crew_agent_executor_litellm_auth_error():
    from litellm.exceptions import AuthenticationError

    from crewai.agents.tools_handler import ToolsHandler
-    from crewai.utilities import Printer

    # Create an agent and executor
    agent = Agent(
@@ -2043,7 +2042,6 @@ def test_crew_agent_executor_litellm_auth_error():
    # Mock the LLM call to raise AuthenticationError
    with (
        patch.object(LLM, "call") as mock_llm_call,
-        patch.object(Printer, "print") as mock_printer,
        pytest.raises(AuthenticationError) as exc_info,
    ):
        mock_llm_call.side_effect = AuthenticationError(
@@ -2057,13 +2055,6 @@ def test_crew_agent_executor_litellm_auth_error():
            }
        )

-    # Verify error handling messages
-    error_message = f"Error during LLM call: {str(mock_llm_call.side_effect)}"
-    mock_printer.assert_any_call(
-        content=error_message,
-        color="red",
-    )
-
    # Verify the call was only made once (no retries)
    mock_llm_call.assert_called_once()

--- a/tests/cassettes/TestAgentEvaluator.test_eval_lite_agent.yaml
+++ b/tests/cassettes/TestAgentEvaluator.test_eval_lite_agent.yaml
@@ -0,0 +1,237 @@
+interactions:
+- request:
+    body: '{"messages": [{"role": "system", "content": "You are Test Agent. An agent
+      created for testing purposes\nYour personal goal is: Complete test tasks successfully\n\nTo
+      give my best complete final answer to the task respond using the exact following
+      format:\n\nThought: I now can give a great answer\nFinal Answer: Your final
+      answer must be the great and the most complete as possible, it must be outcome
+      described.\n\nI MUST use these formats, my job depends on it!"}, {"role": "user",
+      "content": "Complete this task successfully"}], "model": "gpt-4o-mini", "stop":
+      ["\nObservation:"]}'
+    headers:
+      accept:
+      - application/json
+      accept-encoding:
+      - gzip, deflate, zstd
+      connection:
+      - keep-alive
+      content-length:
+      - '583'
+      content-type:
+      - application/json
+      host:
+      - api.openai.com
+      user-agent:
+      - OpenAI/Python 1.93.0
+      x-stainless-arch:
+      - arm64
+      x-stainless-async:
+      - 'false'
+      x-stainless-lang:
+      - python
+      x-stainless-os:
+      - MacOS
+      x-stainless-package-version:
+      - 1.93.0
+      x-stainless-raw-response:
+      - 'true'
+      x-stainless-read-timeout:
+      - '600.0'
+      x-stainless-retry-count:
+      - '0'
+      x-stainless-runtime:
+      - CPython
+      x-stainless-runtime-version:
+      - 3.11.12
+    method: POST
+    uri: https://api.openai.com/v1/chat/completions
+  response:
+    body:
+      string: !!binary |
+        H4sIAAAAAAAAAwAAAP//jFNNb9swDL3nVxA6J0U+HKTNbd0woMAOw7Bu6LbCUCXa1iqLgkgnzYr8
+        98FKWqdbB+wiQHx81OMj9TgCUM6qNSjTaDFt9JNL+TZ7N/dfrusPN01NyV6vPk3f/mrl5vLrXI17
+        Bt39RCNPrDNDbfQojsIBNgm1YF91tlrOl+fzxXKWgZYs+p5WR5kUNGldcJP5dF5MpqvJ7PzIbsgZ
+        ZLWG7yMAgMd89jqDxQe1hun4KdIis65RrZ+TAFQi30eUZnYsOogaD6ChIBiy9M8NdXUja7iCQFsw
+        OkDtNgga6l4/6MBbTAA/wnsXtIc3+b6Gjx41I8REG2cRWoStkwakQeCIxlXOgEXRzjNQgvzigwBV
+        OUU038OOOgiIFhr0MdPHoIOFK9g67wEDdwlBCI7OIjgB7oxB5qrzfpeznxRokIZS3wwk5EiB8ey0
+        54RVx7r3PXTenwA6BBLdzy27fXtE9s/+eqpjojv+g6oqFxw3ZULNFHovWSiqjO5HALd5jt2L0aiY
+        qI1SCt1jfu7i4lBODdszgEVxBIVE+yE+KxbjV8qVR79PFkEZbRq0A3XYGt1ZRyfA6KTpv9W8VvvQ
+        uAv1/5QfAGMwCtoyJrTOvOx4SEvYf65/pT2bnAUrxrRxBktxmPpBWKx05w8rr3jHgm1ZuVBjiskd
+        9r6K5aLQy0LjxcKo0X70GwAA//8DAMz2wVUFBAAA
+    headers:
+      CF-RAY:
+      - 95f93ea9af627e0b-GRU
+      Connection:
+      - keep-alive
+      Content-Encoding:
+      - gzip
+      Content-Type:
+      - application/json
+      Date:
+      - Tue, 15 Jul 2025 12:25:54 GMT
+      Server:
+      - cloudflare
+      Set-Cookie:
+      - __cf_bm=GRZmZLrjW5ZRHNmUJa4ccrMcy20D1rmeqK6Ptlv0mRY-1752582354-1.0.1.1-xKd_yga48Eedech5TRlThlEpDgsB2whxkWHlCyAGOVMqMcvH1Ju9FdXYbuQ9NdUQcVxPLgiGM35lYhqSLVQiXDyK01dnyp2Gvm560FBN9DY;
+        path=/; expires=Tue, 15-Jul-25 12:55:54 GMT; domain=.api.openai.com; HttpOnly;
+        Secure; SameSite=None
+      - _cfuvid=MYqswpSR7sqr4kGp6qZVkaL7HDYwMiww49PeN9QBP.A-1752582354973-0.0.1.1-604800000;
+        path=/; domain=.api.openai.com; HttpOnly; Secure; SameSite=None
+      Transfer-Encoding:
+      - chunked
+      X-Content-Type-Options:
+      - nosniff
+      access-control-expose-headers:
+      - X-Request-ID
+      alt-svc:
+      - h3=":443"; ma=86400
+      cf-cache-status:
+      - DYNAMIC
+      openai-organization:
+      - crewai-iuxna1
+      openai-processing-ms:
+      - '4047'
+      openai-version:
+      - '2020-10-01'
+      strict-transport-security:
+      - max-age=31536000; includeSubDomains; preload
+      x-envoy-upstream-service-time:
+      - '4440'
+      x-ratelimit-limit-requests:
+      - '30000'
+      x-ratelimit-limit-tokens:
+      - '150000000'
+      x-ratelimit-remaining-requests:
+      - '29999'
+      x-ratelimit-remaining-tokens:
+      - '149999885'
+      x-ratelimit-reset-requests:
+      - 2ms
+      x-ratelimit-reset-tokens:
+      - 0s
+      x-request-id:
+      - req_5704c0f206a927ddc12aa1a19b612a75
+    status:
+      code: 200
+      message: OK
+- request:
+    body: '{"messages": [{"role": "system", "content": "You are an expert evaluator
+      assessing how well an AI agent''s output aligns with its assigned task goal.\n\nScore
+      the agent''s goal alignment on a scale from 0-10 where:\n- 0: Complete misalignment,
+      agent did not understand or attempt the task goal\n- 5: Partial alignment, agent
+      attempted the task but missed key requirements\n- 10: Perfect alignment, agent
+      fully satisfied all task requirements\n\nConsider:\n1. Did the agent correctly
+      interpret the task goal?\n2. Did the final output directly address the requirements?\n3.
+      Did the agent focus on relevant aspects of the task?\n4. Did the agent provide
+      all requested information or deliverables?\n\nReturn your evaluation as JSON
+      with fields ''score'' (number) and ''feedback'' (string).\n"}, {"role": "user",
+      "content": "\nAgent role: Test Agent\nAgent goal: Complete test tasks successfully\n\n\nAgent''s
+      final output:\nPlease provide me with the specific details or context of the
+      task you need help with, and I will ensure to complete it successfully and provide
+      a thorough response.\n\nEvaluate how well the agent''s output aligns with the
+      assigned task goal.\n"}], "model": "gpt-4o-mini", "stop": []}'
+    headers:
+      accept:
+      - application/json
+      accept-encoding:
+      - gzip, deflate, zstd
+      connection:
+      - keep-alive
+      content-length:
+      - '1196'
+      content-type:
+      - application/json
+      cookie:
+      - __cf_bm=GRZmZLrjW5ZRHNmUJa4ccrMcy20D1rmeqK6Ptlv0mRY-1752582354-1.0.1.1-xKd_yga48Eedech5TRlThlEpDgsB2whxkWHlCyAGOVMqMcvH1Ju9FdXYbuQ9NdUQcVxPLgiGM35lYhqSLVQiXDyK01dnyp2Gvm560FBN9DY;
+        _cfuvid=MYqswpSR7sqr4kGp6qZVkaL7HDYwMiww49PeN9QBP.A-1752582354973-0.0.1.1-604800000
+      host:
+      - api.openai.com
+      user-agent:
+      - OpenAI/Python 1.93.0
+      x-stainless-arch:
+      - arm64
+      x-stainless-async:
+      - 'false'
+      x-stainless-lang:
+      - python
+      x-stainless-os:
+      - MacOS
+      x-stainless-package-version:
+      - 1.93.0
+      x-stainless-raw-response:
+      - 'true'
+      x-stainless-read-timeout:
+      - '600.0'
+      x-stainless-retry-count:
+      - '0'
+      x-stainless-runtime:
+      - CPython
+      x-stainless-runtime-version:
+      - 3.11.12
+    method: POST
+    uri: https://api.openai.com/v1/chat/completions
+  response:
+    body:
+      string: !!binary |
+        H4sIAAAAAAAAA4xUy27bQAy8+yuIPdtGbMdN4FvbSxM0QIsEKNA6MJhdSmK82hWWVFwj8L8XKz/k
+        9AH0ogOHnOFjVq8DAMPOLMDYCtXWjR990O+TT7dfZs/v5OtFy/ef7++mxfu7j83t/cONGeaK+PRM
+        Vo9VYxvrxpNyDHvYJkKlzDq5mk/n19PZfN4BdXTkc1nZ6OgyjmoOPJpeTC9HF1ejyfWhuopsScwC
+        fgwAAF67b+4zOPppFnAxPEZqEsGSzOKUBGBS9DliUIRFMagZ9qCNQSl0rb8uA8DSiI2JlmYB0+E+
+        UBC5J7TrHFuah4oASwoKjh2EqOCojkE0oRIgWE+YoA2OUhZzHEqIBWhFoChrKCP6IWwqthWwgEY4
+        bItASbRLEpDWWhIpWu+3Y7gJooRuCKyAsiYHRUxQx0TgSJG9DIGDY4ua5RA82nVW5cDKqPxCWYhC
+        iSXBhrU69TOGbxV7ysxSxY0Awoa951AGkq69/do67QLZk8vBJsUXdgQYtoBWW/SQSJoYpFPq2Ptp
+        MLjTttC51DFXVIPjRFb9drw0y7A7v0uiohXM3git92cAhhAVs7c6RzwekN3JAz6WTYpP8lupKTiw
+        VKtEKDHke4vGxnTobgDw2HmtfWMf06RYN7rSuKZObjo7eM30Fu/R6yOoUdH38dnkCLzhWx1ud+ZW
+        Y9FW5PrS3trYOo5nwOBs6j+7+Rv3fnIO5f/Q94C11Ci5VZPIsX07cZ+WKP8B/pV22nLXsBFKL2xp
+        pUwpX8JRga3fv0sjW1GqVwWHklKTuHuc+ZKD3eAXAAAA//8DADksFsafBAAA
+    headers:
+      CF-RAY:
+      - 95f93ec73a1c7e0b-GRU
+      Connection:
+      - keep-alive
+      Content-Encoding:
+      - gzip
+      Content-Type:
+      - application/json
+      Date:
+      - Tue, 15 Jul 2025 12:25:57 GMT
+      Server:
+      - cloudflare
+      Transfer-Encoding:
+      - chunked
+      X-Content-Type-Options:
+      - nosniff
+      access-control-expose-headers:
+      - X-Request-ID
+      alt-svc:
+      - h3=":443"; ma=86400
+      cf-cache-status:
+      - DYNAMIC
+      openai-organization:
+      - crewai-iuxna1
+      openai-processing-ms:
+      - '1544'
+      openai-version:
+      - '2020-10-01'
+      strict-transport-security:
+      - max-age=31536000; includeSubDomains; preload
+      x-envoy-upstream-service-time:
+      - '1546'
+      x-ratelimit-limit-requests:
+      - '30000'
+      x-ratelimit-limit-tokens:
+      - '150000000'
+      x-ratelimit-remaining-requests:
+      - '29999'
+      x-ratelimit-remaining-tokens:
+      - '149999732'
+      x-ratelimit-reset-requests:
+      - 2ms
+      x-ratelimit-reset-tokens:
+      - 0s
+      x-request-id:
+      - req_44930ba12ad8d1e3f0beed1d5e3d8b0c
+    status:
+      code: 200
+      message: OK
+version: 1
--- a/tests/cassettes/TestAgentEvaluator.test_eval_specific_agents_from_crew.yaml
+++ b/tests/cassettes/TestAgentEvaluator.test_eval_specific_agents_from_crew.yaml
--- a/tests/cassettes/TestAgentEvaluator.test_evaluate_current_iteration.yaml
+++ b/tests/cassettes/TestAgentEvaluator.test_evaluate_current_iteration.yaml
@@ -427,4 +427,140 @@ interactions:
    status:
      code: 200
      message: OK
+- request:
+    body: '{"messages": [{"role": "system", "content": "You are an expert evaluator
+      assessing how well an AI agent''s output aligns with its assigned task goal.\n\nScore
+      the agent''s goal alignment on a scale from 0-10 where:\n- 0: Complete misalignment,
+      agent did not understand or attempt the task goal\n- 5: Partial alignment, agent
+      attempted the task but missed key requirements\n- 10: Perfect alignment, agent
+      fully satisfied all task requirements\n\nConsider:\n1. Did the agent correctly
+      interpret the task goal?\n2. Did the final output directly address the requirements?\n3.
+      Did the agent focus on relevant aspects of the task?\n4. Did the agent provide
+      all requested information or deliverables?\n\nReturn your evaluation as JSON
+      with fields ''score'' (number) and ''feedback'' (string).\n"}, {"role": "user",
+      "content": "\nAgent role: Test Agent\nAgent goal: Complete test tasks successfully\nTask
+      description: Test task description\nExpected output: Expected test output\n\nAgent''s
+      final output:\nThe expected test output is a comprehensive document that outlines
+      the specific parameters and criteria that define success for the task at hand.
+      It should include detailed descriptions of the tasks, the goals that need to
+      be achieved, and any specific formatting or structural requirements necessary
+      for the output. Each component of the task must be analyzed and addressed, providing
+      context as well as examples where applicable. Additionally, any tools or methodologies
+      that are relevant to executing the tasks successfully should be outlined, including
+      any potential risks or challenges that may arise during the process. This document
+      serves as a guiding framework to ensure that all aspects of the task are thoroughly
+      considered and executed to meet the high standards expected.\n\nEvaluate how
+      well the agent''s output aligns with the assigned task goal.\n"}], "model":
+      "gpt-4o-mini", "stop": []}'
+    headers:
+      accept:
+      - application/json
+      accept-encoding:
+      - gzip, deflate, zstd
+      connection:
+      - keep-alive
+      content-length:
+      - '1893'
+      content-type:
+      - application/json
+      cookie:
+      - _cfuvid=XwsgBfgvDGlKFQ4LiGYGIARIoSNTiwidqoo9UZcc.XY-1752087999227-0.0.1.1-604800000
+      host:
+      - api.openai.com
+      user-agent:
+      - OpenAI/Python 1.93.0
+      x-stainless-arch:
+      - arm64
+      x-stainless-async:
+      - 'false'
+      x-stainless-lang:
+      - python
+      x-stainless-os:
+      - MacOS
+      x-stainless-package-version:
+      - 1.93.0
+      x-stainless-raw-response:
+      - 'true'
+      x-stainless-read-timeout:
+      - '600.0'
+      x-stainless-retry-count:
+      - '0'
+      x-stainless-runtime:
+      - CPython
+      x-stainless-runtime-version:
+      - 3.11.12
+    method: POST
+    uri: https://api.openai.com/v1/chat/completions
+  response:
+    body:
+      string: !!binary |
+        H4sIAAAAAAAAAwAAAP//jFRNbxs5DL37VxA6jwPHddrUxxwWi2BRtEAPRevCYCSOh41GUkWOnTTI
+        fy8kf4zT5rCXOfCRT4+P5DxNAAw7swRjO1TbJz+90dvFxy//vX0za7dfr29+3eo/n75++Mh0O/za
+        maZUxLsfZPVYdWFjnzwpx7CHbSZUKqyX767mV/PL2eKqAn105EvZJul0Eac9B57OZ/PFdPZuenl9
+        qO4iWxKzhG8TAICn+i06g6MHs4RZc4z0JIIbMstTEoDJ0ZeIQREWxaCmGUEbg1Ko0p9WAWBlxMZM
+        K7OEq2YfaIncHdr7EluZzx0BbigopBy37MgBgiNF9uTAkdjMqbQOsYVdhwraEdBDIqvkIA6aBgXp
+        4uAdcLB+cNTArmPbAQfHFpUEJPYEQ3CUi2LHYVPoCpOi3EOmnwNn6imoXMC/cUdbyk3FWw7oj8+4
+        SAIhKkgiyy1b9P4RHHneUn4pTEn0WIYC6YDX5866aqDH+yKHFRJm5cqInjeB3AWM7vQsUgzhTFb9
+        48GtUlloSwMkZ4bEDMetOaSg1QH9XldVwSrk2wY4iBLWSs/hmG47zGiVMouylZP7WHkzdRSEtwQu
+        2qH4dhyBjcWKHWsXhzJTEgpVAwagByySirgzRSfLDrtzsTKr8Hy+VJnaQbAsdhi8PwMwhKhYfKzr
+        /P2APJ8W2MdNyvFO/ig1LQeWbp0JJYayrKIxmYo+TwC+10MZXuy+STn2Sdca76k+92ax2POZ8T5H
+        9P31AdSo6Mf4YjFvXuFb71dezk7NWLQdubF0vEscHMczYHLW9d9qXuPed85h83/oR8BaSkpunTI5
+        ti87HtMy/agTfT3t5HIVbITyli2tlSmXSThqcfD7n4qRR1Hq1y2HDeWUuf5ZyiQnz5PfAAAA//8D
+        AEfUP8BcBQAA
+    headers:
+      CF-RAY:
+      - 95f365f1bfc87ded-GRU
+      Connection:
+      - keep-alive
+      Content-Encoding:
+      - gzip
+      Content-Type:
+      - application/json
+      Date:
+      - Mon, 14 Jul 2025 19:24:07 GMT
+      Server:
+      - cloudflare
+      Set-Cookie:
+      - __cf_bm=PcC3_3T8.MK_WpZlQLdZfwpNv9Pe45AIYmrXOSgJ65E-1752521047-1.0.1.1-eyqwSWfQC7ZV6.JwTsTihK1ZWCrEmxd52CtNcfe.fw1UjjBN9rdTU4G7hRZiNqHQYo4sVZMmgRgqM9k7HRSzN2zln0bKmMiOuSQTZh6xF_I;
+        path=/; expires=Mon, 14-Jul-25 19:54:07 GMT; domain=.api.openai.com; HttpOnly;
+        Secure; SameSite=None
+      - _cfuvid=JvQ1c4qYZefNwOPoVNgAtX8ET7ObU.JKDvGc43LOR6g-1752521047741-0.0.1.1-604800000;
+        path=/; domain=.api.openai.com; HttpOnly; Secure; SameSite=None
+      Transfer-Encoding:
+      - chunked
+      X-Content-Type-Options:
+      - nosniff
+      access-control-expose-headers:
+      - X-Request-ID
+      alt-svc:
+      - h3=":443"; ma=86400
+      cf-cache-status:
+      - DYNAMIC
+      openai-organization:
+      - crewai-iuxna1
+      openai-processing-ms:
+      - '2729'
+      openai-version:
+      - '2020-10-01'
+      strict-transport-security:
+      - max-age=31536000; includeSubDomains; preload
+      x-envoy-upstream-service-time:
+      - '2789'
+      x-ratelimit-limit-requests:
+      - '30000'
+      x-ratelimit-limit-tokens:
+      - '150000000'
+      x-ratelimit-remaining-requests:
+      - '29999'
+      x-ratelimit-remaining-tokens:
+      - '149999559'
+      x-ratelimit-reset-requests:
+      - 2ms
+      x-ratelimit-reset-tokens:
+      - 0s
+      x-request-id:
+      - req_74f6e8ff49db25dbea3d3525cc149e8e
+    status:
+      code: 200
+      message: OK
 version: 1
--- a/tests/cassettes/TestAgentEvaluator.test_failed_evaluation.yaml
+++ b/tests/cassettes/TestAgentEvaluator.test_failed_evaluation.yaml
@@ -0,0 +1,123 @@
+interactions:
+- request:
+    body: '{"messages": [{"role": "system", "content": "You are Test Agent. An agent
+      created for testing purposes\nYour personal goal is: Complete test tasks successfully\nTo
+      give my best complete final answer to the task respond using the exact following
+      format:\n\nThought: I now can give a great answer\nFinal Answer: Your final
+      answer must be the great and the most complete as possible, it must be outcome
+      described.\n\nI MUST use these formats, my job depends on it!"}, {"role": "user",
+      "content": "\nCurrent Task: Test task description\n\nThis is the expected criteria
+      for your final answer: Expected test output\nyou MUST return the actual complete
+      content as the final answer, not a summary.\n\nBegin! This is VERY important
+      to you, use the tools available and give your best Final Answer, your job depends
+      on it!\n\nThought:"}], "model": "gpt-4o-mini", "stop": ["\nObservation:"]}'
+    headers:
+      accept:
+      - application/json
+      accept-encoding:
+      - gzip, deflate, zstd
+      connection:
+      - keep-alive
+      content-length:
+      - '879'
+      content-type:
+      - application/json
+      host:
+      - api.openai.com
+      user-agent:
+      - OpenAI/Python 1.93.0
+      x-stainless-arch:
+      - arm64
+      x-stainless-async:
+      - 'false'
+      x-stainless-lang:
+      - python
+      x-stainless-os:
+      - MacOS
+      x-stainless-package-version:
+      - 1.93.0
+      x-stainless-raw-response:
+      - 'true'
+      x-stainless-read-timeout:
+      - '600.0'
+      x-stainless-retry-count:
+      - '0'
+      x-stainless-runtime:
+      - CPython
+      x-stainless-runtime-version:
+      - 3.11.12
+    method: POST
+    uri: https://api.openai.com/v1/chat/completions
+  response:
+    body:
+      string: !!binary |
+        H4sIAAAAAAAAAwAAAP//jFTBbhtHDL3rK4g5rwRbtaNYt9RoEaNoUaBODm0DgZnh7jKe5WyHXDmO
+        4X8vZiRLcupDLwvsPPLxPQ45jzMAx8GtwfkezQ9jnP9oeLv98N5+vfl9+4v89Mf76+XV7XDz8Yc/
+        r39T15SM9PkLeXvOWvg0jJGMk+xgnwmNCuv56nJ5+XZ1tbqswJACxZLWjTa/SPOBhefLs+XF/Gw1
+        P3+7z+4Te1K3hr9mAACP9Vt0SqCvbg1nzfPJQKrYkVsfggBcTrGcOFRlNRRzzRH0SYykSr8BSffg
+        UaDjLQFCV2QDit5TBvhbfmbBCO/q/xpue1ZgBesJ6OtI3iiAkRqkycbJGrjv2ffgk5S6CqkFhECG
+        HClAIPWZx9Kkgtz3aJVq37vChXoH2qcpBogp3UHkO1rAbU/QViW7Os8hLD5OgQBjBCFfOpEfgKVN
+        ecBSpoFAQxK1jMbSgY+Y2R6aWjJTT6K8JSHVBlACYOgpk3gCS4DyADqS55YpQDdxoMhCuoCbgwKf
+        tpSB0PeAJdaKseKpOsn0z8SZBhJrgESnXERY8S0JRsxWulkoilkKkDJ0JJQx8jcKi13DX3pWyuWm
+        FPDQN8jU7mW3KRfdSaj2r5ZLMEmgXOYg7K5OlcQYI1Cs4vSFavSVmLWnsDgdnEztpFiGV6YYTwAU
+        SVYbXkf20x55OgxpTN2Y02f9LtW1LKz9JhNqkjKQaml0FX2aAXyqyzC9mG835jSMtrF0R7Xc+Zvz
+        HZ877uARvXqzBy0ZxuP58nLVvMK32Q2rnqyT8+h7CsfU4+7hFDidALMT1/9V8xr3zjlL93/oj4D3
+        NBqFzZgpsH/p+BiW6Utd0dfDDl2ugl2ZK/a0MaZcbiJQi1PcPRxOH9Ro2LQsHeUxc309yk3Onmb/
+        AgAA//8DAAbYfvVABQAA
+    headers:
+      CF-RAY:
+      - 95f9c7ffa8331b11-GRU
+      Connection:
+      - keep-alive
+      Content-Encoding:
+      - gzip
+      Content-Type:
+      - application/json
+      Date:
+      - Tue, 15 Jul 2025 13:59:38 GMT
+      Server:
+      - cloudflare
+      Set-Cookie:
+      - __cf_bm=J_xe1AP.B5P6D2GVMCesyioeS5E9DnYT34rbwQUefFc-1752587978-1.0.1.1-5Dflk5cAj6YCsOSVbCFWWSpXpw_mXsczIdzWzs2h2OwDL01HQbduE5LAToy67sfjFjHeeO4xRrqPLUQpySy2QqyHXbI_fzX4UAt3.UdwHxU;
+        path=/; expires=Tue, 15-Jul-25 14:29:38 GMT; domain=.api.openai.com; HttpOnly;
+        Secure; SameSite=None
+      - _cfuvid=0rTD8RMpxBQQy42jzmum16_eoRtWNfaZMG_TJkhGS7I-1752587978437-0.0.1.1-604800000;
+        path=/; domain=.api.openai.com; HttpOnly; Secure; SameSite=None
+      Transfer-Encoding:
+      - chunked
+      X-Content-Type-Options:
+      - nosniff
+      access-control-expose-headers:
+      - X-Request-ID
+      alt-svc:
+      - h3=":443"; ma=86400
+      cf-cache-status:
+      - DYNAMIC
+      openai-organization:
+      - crewai-iuxna1
+      openai-processing-ms:
+      - '2623'
+      openai-version:
+      - '2020-10-01'
+      strict-transport-security:
+      - max-age=31536000; includeSubDomains; preload
+      x-envoy-upstream-service-time:
+      - '2626'
+      x-ratelimit-limit-requests:
+      - '30000'
+      x-ratelimit-limit-tokens:
+      - '150000000'
+      x-ratelimit-remaining-requests:
+      - '29999'
+      x-ratelimit-remaining-tokens:
+      - '149999813'
+      x-ratelimit-reset-requests:
+      - 2ms
+      x-ratelimit-reset-tokens:
+      - 0s
+      x-request-id:
+      - req_ccc347e91010713379c920aa0efd1f4f
+    status:
+      code: 200
+      message: OK
+version: 1
--- a/tests/cassettes/test_llm_call_when_stop_is_unsupported.yaml
+++ b/tests/cassettes/test_llm_call_when_stop_is_unsupported.yaml
@@ -0,0 +1,209 @@
+interactions:
+- request:
+    body: '{"messages": [{"role": "user", "content": "What is the capital of France?"}],
+      "model": "o1-mini", "stop": ["stop"]}'
+    headers:
+      accept:
+      - application/json
+      accept-encoding:
+      - gzip, deflate, zstd
+      connection:
+      - keep-alive
+      content-length:
+      - '115'
+      content-type:
+      - application/json
+      host:
+      - api.openai.com
+      user-agent:
+      - OpenAI/Python 1.75.0
+      x-stainless-arch:
+      - arm64
+      x-stainless-async:
+      - 'false'
+      x-stainless-lang:
+      - python
+      x-stainless-os:
+      - MacOS
+      x-stainless-package-version:
+      - 1.75.0
+      x-stainless-raw-response:
+      - 'true'
+      x-stainless-read-timeout:
+      - '600.0'
+      x-stainless-retry-count:
+      - '0'
+      x-stainless-runtime:
+      - CPython
+      x-stainless-runtime-version:
+      - 3.11.12
+    method: POST
+    uri: https://api.openai.com/v1/chat/completions
+  response:
+    body:
+      string: "{\n  \"error\": {\n    \"message\": \"Unsupported parameter: 'stop'
+        is not supported with this model.\",\n    \"type\": \"invalid_request_error\",\n
+        \   \"param\": \"stop\",\n    \"code\": \"unsupported_parameter\"\n  }\n}"
+    headers:
+      CF-RAY:
+      - 961215744c94cb45-GIG
+      Connection:
+      - keep-alive
+      Content-Length:
+      - '196'
+      Content-Type:
+      - application/json
+      Date:
+      - Fri, 18 Jul 2025 12:46:46 GMT
+      Server:
+      - cloudflare
+      Set-Cookie:
+      - __cf_bm=KwJ1K47OHX4n2TZN8bMW37yKzKyK__S4HbTiCfyWjXM-1752842806-1.0.1.1-lweHFR7Kv2v7hT5I6xxYVz_7Ruu6aBdEgpJrSWrMxi_ficAeWC0oDeQ.0w2Lr1WRejIjqqcwSgdl6RixF2qEkjJZfS0pz_Vjjqexe44ayp4;
+        path=/; expires=Fri, 18-Jul-25 13:16:46 GMT; domain=.api.openai.com; HttpOnly;
+        Secure; SameSite=None
+      - _cfuvid=zv09c6bwcgNsYU80ah3wXzqeaIKyt_h61EAh_XRA87I-1752842806652-0.0.1.1-604800000;
+        path=/; domain=.api.openai.com; HttpOnly; Secure; SameSite=None
+      X-Content-Type-Options:
+      - nosniff
+      access-control-expose-headers:
+      - X-Request-ID
+      alt-svc:
+      - h3=":443"; ma=86400
+      cf-cache-status:
+      - DYNAMIC
+      openai-organization:
+      - crewai-iuxna1
+      openai-processing-ms:
+      - '20'
+      openai-project:
+      - proj_xitITlrFeen7zjNSzML82h9x
+      openai-version:
+      - '2020-10-01'
+      strict-transport-security:
+      - max-age=31536000; includeSubDomains; preload
+      x-envoy-upstream-service-time:
+      - '32'
+      x-ratelimit-limit-requests:
+      - '30000'
+      x-ratelimit-limit-tokens:
+      - '150000000'
+      x-ratelimit-remaining-requests:
+      - '29999'
+      x-ratelimit-remaining-tokens:
+      - '149999990'
+      x-ratelimit-reset-requests:
+      - 2ms
+      x-ratelimit-reset-tokens:
+      - 0s
+      x-request-id:
+      - req_7be4715c3ee32aa406eacb68c7cc966e
+    status:
+      code: 400
+      message: Bad Request
+- request:
+    body: '{"messages": [{"role": "user", "content": "What is the capital of France?"}],
+      "model": "o1-mini"}'
+    headers:
+      accept:
+      - application/json
+      accept-encoding:
+      - gzip, deflate, zstd
+      connection:
+      - keep-alive
+      content-length:
+      - '97'
+      content-type:
+      - application/json
+      cookie:
+      - __cf_bm=KwJ1K47OHX4n2TZN8bMW37yKzKyK__S4HbTiCfyWjXM-1752842806-1.0.1.1-lweHFR7Kv2v7hT5I6xxYVz_7Ruu6aBdEgpJrSWrMxi_ficAeWC0oDeQ.0w2Lr1WRejIjqqcwSgdl6RixF2qEkjJZfS0pz_Vjjqexe44ayp4;
+        _cfuvid=zv09c6bwcgNsYU80ah3wXzqeaIKyt_h61EAh_XRA87I-1752842806652-0.0.1.1-604800000
+      host:
+      - api.openai.com
+      user-agent:
+      - OpenAI/Python 1.75.0
+      x-stainless-arch:
+      - arm64
+      x-stainless-async:
+      - 'false'
+      x-stainless-lang:
+      - python
+      x-stainless-os:
+      - MacOS
+      x-stainless-package-version:
+      - 1.75.0
+      x-stainless-raw-response:
+      - 'true'
+      x-stainless-read-timeout:
+      - '600.0'
+      x-stainless-retry-count:
+      - '0'
+      x-stainless-runtime:
+      - CPython
+      x-stainless-runtime-version:
+      - 3.11.12
+    method: POST
+    uri: https://api.openai.com/v1/chat/completions
+  response:
+    body:
+      string: !!binary |
+        H4sIAAAAAAAAA3RSwU7jMBC95ytGPlYNakJhQ2/sgSsg7QUhFA32pJni2JHtwFao/76yC3XQwsWH
+        efOe35uZ9wJAsBIbELLHIIdRl78nGvaqOt/dPDxf71/fdg/9bXO3e5ETXt+LZWTY5x3J8Mk6k3YY
+        NQW25ghLRxgoqla/LupmXTeXqwQMVpGONFuVAxsu61W9LldXZVV/MHvLkrzYwGMBAPCe3ujRKPor
+        NpB0UmUg73FLYnNqAhDO6lgR6D37gCaIZQalNYFMsv2nJ5A4ckANtoMbh0YSsIfF4g4d+8XibM50
+        1E0eo3MzaT0D0BgbMCZPnp8+kMPJZceGfd86Qm9N/NkHO4qEHgqAp5R6+hJEjM4OY2iDfaEkW62P
+        ciLPOYPNJxhsQJ3rV83yG7VWUUDWfjY1IVH2pDIzjxgnxXYGFLNs/5v5TvuYm802q1yuf9TPgJQ0
+        BlLt6Eix/Jo4tzmKZ/hT22nIybHw5F5ZUhuYXFyEog4nfTwQ4fc+0NB2bLbkRsfpSuKui0PxDwAA
+        //8DAN7IUy8kAwAA
+    headers:
+      CF-RAY:
+      - 961216c3f9837e07-GRU
+      Connection:
+      - keep-alive
+      Content-Encoding:
+      - gzip
+      Content-Type:
+      - application/json
+      Date:
+      - Fri, 18 Jul 2025 12:47:41 GMT
+      Server:
+      - cloudflare
+      Transfer-Encoding:
+      - chunked
+      X-Content-Type-Options:
+      - nosniff
+      access-control-expose-headers:
+      - X-Request-ID
+      alt-svc:
+      - h3=":443"; ma=86400
+      cf-cache-status:
+      - DYNAMIC
+      openai-organization:
+      - crewai-iuxna1
+      openai-processing-ms:
+      - '1027'
+      openai-project:
+      - proj_xitITlrFeen7zjNSzML82h9x
+      openai-version:
+      - '2020-10-01'
+      strict-transport-security:
+      - max-age=31536000; includeSubDomains; preload
+      x-envoy-upstream-service-time:
+      - '1029'
+      x-ratelimit-limit-requests:
+      - '30000'
+      x-ratelimit-limit-tokens:
+      - '150000000'
+      x-ratelimit-remaining-requests:
+      - '29999'
+      x-ratelimit-remaining-tokens:
+      - '149999990'
+      x-ratelimit-reset-requests:
+      - 2ms
+      x-ratelimit-reset-tokens:
+      - 0s
+      x-request-id:
+      - req_19a0763b09f0410b9d09598078a04cd6
+    status:
+      code: 200
+      message: OK
+version: 1
--- a/tests/cassettes/test_llm_call_when_stop_is_unsupported_when_additional_drop_params_is_provided.yaml
+++ b/tests/cassettes/test_llm_call_when_stop_is_unsupported_when_additional_drop_params_is_provided.yaml
@@ -0,0 +1,206 @@
+interactions:
+- request:
+    body: '{"messages": [{"role": "user", "content": "What is the capital of France?"}],
+      "model": "o1-mini", "stop": ["stop"]}'
+    headers:
+      accept:
+      - application/json
+      accept-encoding:
+      - gzip, deflate, zstd
+      connection:
+      - keep-alive
+      content-length:
+      - '115'
+      content-type:
+      - application/json
+      cookie:
+      - __cf_bm=KwJ1K47OHX4n2TZN8bMW37yKzKyK__S4HbTiCfyWjXM-1752842806-1.0.1.1-lweHFR7Kv2v7hT5I6xxYVz_7Ruu6aBdEgpJrSWrMxi_ficAeWC0oDeQ.0w2Lr1WRejIjqqcwSgdl6RixF2qEkjJZfS0pz_Vjjqexe44ayp4;
+        _cfuvid=zv09c6bwcgNsYU80ah3wXzqeaIKyt_h61EAh_XRA87I-1752842806652-0.0.1.1-604800000
+      host:
+      - api.openai.com
+      user-agent:
+      - OpenAI/Python 1.75.0
+      x-stainless-arch:
+      - arm64
+      x-stainless-async:
+      - 'false'
+      x-stainless-lang:
+      - python
+      x-stainless-os:
+      - MacOS
+      x-stainless-package-version:
+      - 1.75.0
+      x-stainless-raw-response:
+      - 'true'
+      x-stainless-read-timeout:
+      - '600.0'
+      x-stainless-retry-count:
+      - '0'
+      x-stainless-runtime:
+      - CPython
+      x-stainless-runtime-version:
+      - 3.11.12
+    method: POST
+    uri: https://api.openai.com/v1/chat/completions
+  response:
+    body:
+      string: "{\n  \"error\": {\n    \"message\": \"Unsupported parameter: 'stop'
+        is not supported with this model.\",\n    \"type\": \"invalid_request_error\",\n
+        \   \"param\": \"stop\",\n    \"code\": \"unsupported_parameter\"\n  }\n}"
+    headers:
+      CF-RAY:
+      - 961220323a627e05-GRU
+      Connection:
+      - keep-alive
+      Content-Length:
+      - '196'
+      Content-Type:
+      - application/json
+      Date:
+      - Fri, 18 Jul 2025 12:54:06 GMT
+      Server:
+      - cloudflare
+      X-Content-Type-Options:
+      - nosniff
+      access-control-expose-headers:
+      - X-Request-ID
+      alt-svc:
+      - h3=":443"; ma=86400
+      cf-cache-status:
+      - DYNAMIC
+      openai-organization:
+      - crewai-iuxna1
+      openai-processing-ms:
+      - '9'
+      openai-project:
+      - proj_xitITlrFeen7zjNSzML82h9x
+      openai-version:
+      - '2020-10-01'
+      strict-transport-security:
+      - max-age=31536000; includeSubDomains; preload
+      x-envoy-upstream-service-time:
+      - '11'
+      x-ratelimit-limit-requests:
+      - '30000'
+      x-ratelimit-limit-tokens:
+      - '150000000'
+      x-ratelimit-remaining-requests:
+      - '29999'
+      x-ratelimit-remaining-tokens:
+      - '149999990'
+      x-ratelimit-reset-requests:
+      - 2ms
+      x-ratelimit-reset-tokens:
+      - 0s
+      x-request-id:
+      - req_e8d7880c5977029062d8487d215e5282
+    status:
+      code: 400
+      message: Bad Request
+- request:
+    body: '{"messages": [{"role": "user", "content": "What is the capital of France?"}],
+      "model": "o1-mini"}'
+    headers:
+      accept:
+      - application/json
+      accept-encoding:
+      - gzip, deflate, zstd
+      connection:
+      - keep-alive
+      content-length:
+      - '97'
+      content-type:
+      - application/json
+      cookie:
+      - __cf_bm=KwJ1K47OHX4n2TZN8bMW37yKzKyK__S4HbTiCfyWjXM-1752842806-1.0.1.1-lweHFR7Kv2v7hT5I6xxYVz_7Ruu6aBdEgpJrSWrMxi_ficAeWC0oDeQ.0w2Lr1WRejIjqqcwSgdl6RixF2qEkjJZfS0pz_Vjjqexe44ayp4;
+        _cfuvid=zv09c6bwcgNsYU80ah3wXzqeaIKyt_h61EAh_XRA87I-1752842806652-0.0.1.1-604800000
+      host:
+      - api.openai.com
+      user-agent:
+      - OpenAI/Python 1.75.0
+      x-stainless-arch:
+      - arm64
+      x-stainless-async:
+      - 'false'
+      x-stainless-lang:
+      - python
+      x-stainless-os:
+      - MacOS
+      x-stainless-package-version:
+      - 1.75.0
+      x-stainless-raw-response:
+      - 'true'
+      x-stainless-read-timeout:
+      - '600.0'
+      x-stainless-retry-count:
+      - '0'
+      x-stainless-runtime:
+      - CPython
+      x-stainless-runtime-version:
+      - 3.11.12
+    method: POST
+    uri: https://api.openai.com/v1/chat/completions
+  response:
+    body:
+      string: !!binary |
+        H4sIAAAAAAAAA3SSQW/bMAyF7/4Vgo5BXCSeV6c5bkAPPTVbMaAYCoOT6JitLAkSPbQo8t8HKWns
+        Yu1FB3181HsUXwshJGm5FVL1wGrwpvw2In/fXY3Pcd/sftzf9ENvnurm569dc9/IZVK4P4+o+E11
+        odzgDTI5e8QqIDCmruvma7Wpv1T1ZQaD02iSzK3LgSyV1aqqy9VVua5Oyt6Rwii34nchhBCv+Uwe
+        rcZnuRWr5dvNgDHCHuX2XCSEDM6kGwkxUmSwLJcTVM4y2mz7rkehwBODEa4T1wGsQkFRLBa3ECgu
+        FhdzZcBujJCc29GYGQBrHUNKnj0/nMjh7LIjS7FvA0J0Nr0c2XmZ6aEQ4iGnHt8FkT64wXPL7glz
+        23V9bCenOc/h5kTZMZgZuKyWH/RrNTKQibO5SQWqRz1JpyHDqMnNQDFL97+dj3ofk5Pdz5xVm08f
+        mIBS6Bl16wNqUu9DT2UB0yZ+Vnaec7YsI4a/pLBlwpD+QmMHoznuiIwvkXFoO7J7DD5QXpT03cWh
+        +AcAAP//AwAo/zsSJwMAAA==
+    headers:
+      CF-RAY:
+      - 961220338bd47e05-GRU
+      Connection:
+      - keep-alive
+      Content-Encoding:
+      - gzip
+      Content-Type:
+      - application/json
+      Date:
+      - Fri, 18 Jul 2025 12:54:08 GMT
+      Server:
+      - cloudflare
+      Transfer-Encoding:
+      - chunked
+      X-Content-Type-Options:
+      - nosniff
+      access-control-expose-headers:
+      - X-Request-ID
+      alt-svc:
+      - h3=":443"; ma=86400
+      cf-cache-status:
+      - DYNAMIC
+      openai-organization:
+      - crewai-iuxna1
+      openai-processing-ms:
+      - '1280'
+      openai-project:
+      - proj_xitITlrFeen7zjNSzML82h9x
+      openai-version:
+      - '2020-10-01'
+      strict-transport-security:
+      - max-age=31536000; includeSubDomains; preload
+      x-envoy-upstream-service-time:
+      - '1286'
+      x-ratelimit-limit-requests:
+      - '30000'
+      x-ratelimit-limit-tokens:
+      - '150000000'
+      x-ratelimit-remaining-requests:
+      - '29999'
+      x-ratelimit-remaining-tokens:
+      - '149999990'
+      x-ratelimit-reset-requests:
+      - 2ms
+      x-ratelimit-reset-tokens:
+      - 0s
+      x-request-id:
+      - req_b7390d46fa4e14380d42162cb22045df
+    status:
+      code: 200
+      message: OK
+version: 1
--- a/tests/experimental/evaluation/test_agent_evaluator.py
+++ b/tests/experimental/evaluation/test_agent_evaluator.py
@@ -11,10 +11,15 @@ from crewai.experimental.evaluation import (
    ToolSelectionEvaluator,
    ParameterExtractionEvaluator,
    ToolInvocationEvaluator,
-    ReasoningEfficiencyEvaluator
+    ReasoningEfficiencyEvaluator,
+    MetricCategory,
+    EvaluationScore
 )

+from crewai.utilities.events.agent_events import AgentEvaluationStartedEvent, AgentEvaluationCompletedEvent, AgentEvaluationFailedEvent
+from crewai.utilities.events.crewai_event_bus import crewai_event_bus
 from crewai.experimental.evaluation import create_default_evaluator
+
 class TestAgentEvaluator:
    @pytest.fixture
    def mock_crew(self):
@@ -39,18 +44,18 @@ class TestAgentEvaluator:
        return crew

    def test_set_iteration(self):
-        agent_evaluator = AgentEvaluator()
+        agent_evaluator = AgentEvaluator(agents=[])

        agent_evaluator.set_iteration(3)
        assert agent_evaluator._execution_state.iteration == 3

    @pytest.mark.vcr(filter_headers=["authorization"])
    def test_evaluate_current_iteration(self, mock_crew):
-        agent_evaluator = AgentEvaluator(crew=mock_crew, evaluators=[GoalAlignmentEvaluator()])
+        agent_evaluator = AgentEvaluator(agents=mock_crew.agents, evaluators=[GoalAlignmentEvaluator()])

        mock_crew.kickoff()

-        results = agent_evaluator.evaluate_current_iteration()
+        results = agent_evaluator.get_evaluation_results()

        assert isinstance(results, dict)

@@ -70,16 +75,16 @@ class TestAgentEvaluator:
        goal_alignment, = result.metrics.values()
        assert goal_alignment.score == 5.0

-        expected_feedback = "The agent's output demonstrates an understanding of the need for a comprehensive document"
+        expected_feedback = "The agent's output demonstrates an understanding of the need for a comprehensive document outlining task"
        assert expected_feedback in goal_alignment.feedback

        assert goal_alignment.raw_response is not None
        assert '"score": 5' in goal_alignment.raw_response

    def test_create_default_evaluator(self, mock_crew):
-        agent_evaluator = create_default_evaluator(crew=mock_crew)
+        agent_evaluator = create_default_evaluator(agents=mock_crew.agents)
        assert isinstance(agent_evaluator, AgentEvaluator)
-        assert agent_evaluator.crew == mock_crew
+        assert agent_evaluator.agents == mock_crew.agents

        expected_types = [
            GoalAlignmentEvaluator,
@@ -93,3 +98,181 @@ class TestAgentEvaluator:
        assert len(agent_evaluator.evaluators) == len(expected_types)
        for evaluator, expected_type in zip(agent_evaluator.evaluators, expected_types):
            assert isinstance(evaluator, expected_type)
+
+    @pytest.mark.vcr(filter_headers=["authorization"])
+    def test_eval_lite_agent(self):
+        agent = Agent(
+            role="Test Agent",
+            goal="Complete test tasks successfully",
+            backstory="An agent created for testing purposes",
+        )
+
+        with crewai_event_bus.scoped_handlers():
+            events = {}
+            @crewai_event_bus.on(AgentEvaluationStartedEvent)
+            def capture_started(source, event):
+                events["started"] = event
+
+            @crewai_event_bus.on(AgentEvaluationCompletedEvent)
+            def capture_completed(source, event):
+                events["completed"] = event
+
+            @crewai_event_bus.on(AgentEvaluationFailedEvent)
+            def capture_failed(source, event):
+                events["failed"] = event
+
+            agent_evaluator = AgentEvaluator(agents=[agent], evaluators=[GoalAlignmentEvaluator()])
+
+            agent.kickoff(messages="Complete this task successfully")
+
+            assert events.keys() == {"started", "completed"}
+            assert events["started"].agent_id == str(agent.id)
+            assert events["started"].agent_role == agent.role
+            assert events["started"].task_id is None
+            assert events["started"].iteration == 1
+
+            assert events["completed"].agent_id == str(agent.id)
+            assert events["completed"].agent_role == agent.role
+            assert events["completed"].task_id is None
+            assert events["completed"].iteration == 1
+            assert events["completed"].metric_category == MetricCategory.GOAL_ALIGNMENT
+            assert isinstance(events["completed"].score, EvaluationScore)
+            assert events["completed"].score.score == 2.0
+
+            results = agent_evaluator.get_evaluation_results()
+
+            assert isinstance(results, dict)
+
+            result, = results[agent.role]
+            assert isinstance(result, AgentEvaluationResult)
+
+            assert result.agent_id == str(agent.id)
+            assert result.task_id == "lite_task"
+
+            goal_alignment, = result.metrics.values()
+            assert goal_alignment.score == 2.0
+
+            expected_feedback = "The agent did not demonstrate a clear understanding of the task goal, which is to complete test tasks successfully"
+            assert expected_feedback in goal_alignment.feedback
+
+            assert goal_alignment.raw_response is not None
+            assert '"score": 2' in goal_alignment.raw_response
+
+    @pytest.mark.vcr(filter_headers=["authorization"])
+    def test_eval_specific_agents_from_crew(self, mock_crew):
+        agent = Agent(
+            role="Test Agent Eval",
+            goal="Complete test tasks successfully",
+            backstory="An agent created for testing purposes",
+        )
+        task = Task(
+            description="Test task description",
+            agent=agent,
+            expected_output="Expected test output"
+        )
+        mock_crew.agents.append(agent)
+        mock_crew.tasks.append(task)
+
+        with crewai_event_bus.scoped_handlers():
+            events = {}
+            @crewai_event_bus.on(AgentEvaluationStartedEvent)
+            def capture_started(source, event):
+                events["started"] = event
+
+            @crewai_event_bus.on(AgentEvaluationCompletedEvent)
+            def capture_completed(source, event):
+                events["completed"] = event
+
+            @crewai_event_bus.on(AgentEvaluationFailedEvent)
+            def capture_failed(source, event):
+                events["failed"] = event
+
+            agent_evaluator = AgentEvaluator(agents=[agent], evaluators=[GoalAlignmentEvaluator()])
+            mock_crew.kickoff()
+
+            assert events.keys() == {"started", "completed"}
+            assert events["started"].agent_id == str(agent.id)
+            assert events["started"].agent_role == agent.role
+            assert events["started"].task_id == str(task.id)
+            assert events["started"].iteration == 1
+
+            assert events["completed"].agent_id == str(agent.id)
+            assert events["completed"].agent_role == agent.role
+            assert events["completed"].task_id == str(task.id)
+            assert events["completed"].iteration == 1
+            assert events["completed"].metric_category == MetricCategory.GOAL_ALIGNMENT
+            assert isinstance(events["completed"].score, EvaluationScore)
+            assert events["completed"].score.score == 5.0
+
+            results = agent_evaluator.get_evaluation_results()
+
+            assert isinstance(results, dict)
+            assert len(results.keys()) == 1
+            result, = results[agent.role]
+            assert isinstance(result, AgentEvaluationResult)
+
+            assert result.agent_id == str(agent.id)
+            assert result.task_id == str(task.id)
+
+            goal_alignment, = result.metrics.values()
+            assert goal_alignment.score == 5.0
+
+            expected_feedback = "The agent provided a thorough guide on how to conduct a test task but failed to produce specific expected output"
+            assert expected_feedback in goal_alignment.feedback
+
+            assert goal_alignment.raw_response is not None
+            assert '"score": 5' in goal_alignment.raw_response
+
+
+    @pytest.mark.vcr(filter_headers=["authorization"])
+    def test_failed_evaluation(self, mock_crew):
+        agent, = mock_crew.agents
+        task, = mock_crew.tasks
+
+        with crewai_event_bus.scoped_handlers():
+            events = {}
+
+            @crewai_event_bus.on(AgentEvaluationStartedEvent)
+            def capture_started(source, event):
+                events["started"] = event
+
+            @crewai_event_bus.on(AgentEvaluationCompletedEvent)
+            def capture_completed(source, event):
+                events["completed"] = event
+
+            @crewai_event_bus.on(AgentEvaluationFailedEvent)
+            def capture_failed(source, event):
+                events["failed"] = event
+
+            # Create a mock evaluator that will raise an exception
+            from crewai.experimental.evaluation.base_evaluator import BaseEvaluator
+            from crewai.experimental.evaluation import MetricCategory
+            class FailingEvaluator(BaseEvaluator):
+                metric_category = MetricCategory.GOAL_ALIGNMENT
+
+                def evaluate(self, agent, task, execution_trace, final_output):
+                    raise ValueError("Forced evaluation failure")
+
+            agent_evaluator = AgentEvaluator(agents=[agent], evaluators=[FailingEvaluator()])
+            mock_crew.kickoff()
+
+            assert events.keys() == {"started", "failed"}
+            assert events["started"].agent_id == str(agent.id)
+            assert events["started"].agent_role == agent.role
+            assert events["started"].task_id == str(task.id)
+            assert events["started"].iteration == 1
+
+            assert events["failed"].agent_id == str(agent.id)
+            assert events["failed"].agent_role == agent.role
+            assert events["failed"].task_id == str(task.id)
+            assert events["failed"].iteration == 1
+            assert events["failed"].error == "Forced evaluation failure"
+
+            results = agent_evaluator.get_evaluation_results()
+            result, = results[agent.role]
+            assert isinstance(result, AgentEvaluationResult)
+
+            assert result.agent_id == str(agent.id)
+            assert result.task_id == str(task.id)
+
+            assert result.metrics == {}
--- a/tests/llm_test.py
+++ b/tests/llm_test.py
@@ -1,3 +1,4 @@
+import logging
 import os
 from time import sleep
 from unittest.mock import MagicMock, patch
@@ -664,3 +665,49 @@ def test_handle_streaming_tool_calls_no_tools(mock_emit):
        expected_completed_llm_call=1,
        expected_final_chunk_result=response,
    )
+
+
+@pytest.mark.vcr(filter_headers=["authorization"])
+def test_llm_call_when_stop_is_unsupported(caplog):
+    llm = LLM(model="o1-mini", stop=["stop"])
+    with caplog.at_level(logging.INFO):
+        result = llm.call("What is the capital of France?")
+        assert "Retrying LLM call without the unsupported 'stop'" in caplog.text
+    assert isinstance(result, str)
+    assert "Paris" in result
+
+@pytest.mark.vcr(filter_headers=["authorization"])
+def test_llm_call_when_stop_is_unsupported_when_additional_drop_params_is_provided(caplog):
+    llm = LLM(model="o1-mini", stop=["stop"], additional_drop_params=["another_param"])
+    with caplog.at_level(logging.INFO):
+        result = llm.call("What is the capital of France?")
+        assert "Retrying LLM call without the unsupported 'stop'" in caplog.text
+    assert isinstance(result, str)
+    assert "Paris" in result
+
+
+@pytest.fixture
+def ollama_llm():
+    return LLM(model="ollama/llama3.2:3b")
+
+def test_ollama_appends_dummy_user_message_when_last_is_assistant(ollama_llm):
+    original_messages = [
+        {"role": "user", "content": "Hi there"},
+        {"role": "assistant", "content": "Hello!"},
+    ]
+
+    formatted = ollama_llm._format_messages_for_provider(original_messages)
+
+    assert len(formatted) == len(original_messages) + 1
+    assert formatted[-1]["role"] == "user"
+    assert formatted[-1]["content"] == ""
+
+
+def test_ollama_does_not_modify_when_last_is_user(ollama_llm):
+    original_messages = [
+        {"role": "user", "content": "Tell me a joke."},
+    ]
+
+    formatted = ollama_llm._format_messages_for_provider(original_messages)
+
+    assert formatted == original_messages
--- a/tests/storage/test_mem0_storage.py
+++ b/tests/storage/test_mem0_storage.py
@@ -1,14 +1,10 @@
-import os
 from unittest.mock import MagicMock, patch

 import pytest
 from mem0.client.main import MemoryClient
 from mem0.memory.main import Memory

-from crewai.agent import Agent
-from crewai.crew import Crew
 from crewai.memory.storage.mem0_storage import Mem0Storage
-from crewai.task import Task


 # Define the class (if not already defined)
@@ -172,7 +168,7 @@ def test_save_method_with_memory_oss(mem0_storage_with_mocked_config):
    mem0_storage.save(test_value, test_metadata)
    
    mem0_storage.memory.add.assert_called_once_with(
-        test_value,
+        [{'role': 'assistant' , 'content': test_value}],
        agent_id="Test_Agent",
        infer=False,
        metadata={"type": "short_term", "key": "value"},
@@ -191,7 +187,7 @@ def test_save_method_with_memory_client(mem0_storage_with_memory_client_using_co
    mem0_storage.save(test_value, test_metadata)
    
    mem0_storage.memory.add.assert_called_once_with(
-        test_value,
+        [{'role': 'assistant' , 'content': test_value}],
        agent_id="Test_Agent",
        infer=False,
        metadata={"type": "short_term", "key": "value"},
--- a/tests/test_litellm_version_constraint.py
+++ b/tests/test_litellm_version_constraint.py
@@ -0,0 +1,116 @@
+import pytest
+import importlib.metadata
+from packaging import version
+
+from crewai.llm import LLM
+from crewai.agent import Agent
+from crewai.task import Task
+from crewai.crew import Crew
+
+
+def test_litellm_minimum_version_constraint():
+    """Test that litellm meets the minimum version requirement."""
+    try:
+        litellm_version = importlib.metadata.version("litellm")
+        minimum_version = "1.74.3"
+        
+        assert version.parse(litellm_version) >= version.parse(minimum_version), (
+            f"litellm version {litellm_version} is below minimum required version {minimum_version}"
+        )
+    except importlib.metadata.PackageNotFoundError:
+        pytest.fail("litellm package is not installed")
+
+
+def test_llm_creation_with_relaxed_litellm_constraint():
+    """Test that LLM can be created successfully with the relaxed litellm constraint."""
+    llm = LLM(model="gpt-4o-mini")
+    assert llm is not None
+    assert llm.model == "gpt-4o-mini"
+
+
+def test_basic_llm_functionality_with_relaxed_constraint():
+    """Test that basic LLM functionality works with the relaxed litellm constraint."""
+    llm = LLM(model="gpt-4o-mini", temperature=0.7, max_tokens=100)
+    
+    assert llm.model == "gpt-4o-mini"
+    assert llm.temperature == 0.7
+    assert llm.max_tokens == 100
+
+
+def test_agent_creation_with_relaxed_litellm_constraint():
+    """Test that Agent can be created with LLM using relaxed litellm constraint."""
+    llm = LLM(model="gpt-4o-mini")
+    agent = Agent(
+        role="Test Agent",
+        goal="Test goal",
+        backstory="Test backstory",
+        llm=llm
+    )
+    
+    assert agent is not None
+    assert agent.llm == llm
+    assert agent.role == "Test Agent"
+
+
+def test_crew_functionality_with_relaxed_litellm_constraint():
+    """Test that Crew functionality works with the relaxed litellm constraint."""
+    llm = LLM(model="gpt-4o-mini")
+    agent = Agent(
+        role="Test Agent",
+        goal="Test goal", 
+        backstory="Test backstory",
+        llm=llm
+    )
+    
+    task = Task(
+        description="Test task description",
+        expected_output="Test output",
+        agent=agent
+    )
+    
+    crew = Crew(
+        agents=[agent],
+        tasks=[task]
+    )
+    
+    assert crew is not None
+    assert len(crew.agents) == 1
+    assert len(crew.tasks) == 1
+    assert crew.agents[0] == agent
+    assert crew.tasks[0] == task
+
+
+def test_litellm_import_functionality():
+    """Test that litellm can be imported and basic functionality works."""
+    import litellm
+    from litellm.exceptions import ContextWindowExceededError, AuthenticationError
+    
+    assert hasattr(litellm, 'completion')
+    assert ContextWindowExceededError is not None
+    assert AuthenticationError is not None
+
+
+def test_llm_supports_function_calling():
+    """Test that LLM function calling support detection works with relaxed constraint."""
+    llm = LLM(model="gpt-4o-mini")
+    
+    supports_functions = llm.supports_function_calling()
+    assert isinstance(supports_functions, bool)
+
+
+def test_llm_context_window_size():
+    """Test that LLM context window size detection works with relaxed constraint."""
+    llm = LLM(model="gpt-4o-mini")
+    
+    context_window = llm.get_context_window_size()
+    assert isinstance(context_window, int)
+    assert context_window > 0
+
+
+def test_llm_anthropic_model_detection():
+    """Test that Anthropic model detection works with relaxed constraint."""
+    anthropic_llm = LLM(model="anthropic/claude-3-sonnet")
+    openai_llm = LLM(model="gpt-4o-mini")
+    
+    assert anthropic_llm._is_anthropic_model() is True
+    assert openai_llm._is_anthropic_model() is False
--- a/tests/utilities/events/test_crewai_event_bus.py
+++ b/tests/utilities/events/test_crewai_event_bus.py
@@ -1,31 +1,13 @@
-import threading
-from typing import Any, Callable, cast
 from unittest.mock import Mock

-import pytest
-
 from crewai.utilities.events.base_events import BaseEvent
 from crewai.utilities.events.crewai_event_bus import crewai_event_bus


-@pytest.fixture(autouse=True)
-def scoped_event_handlers():
-    with crewai_event_bus.scoped_handlers():
-        yield
-
-
 class TestEvent(BaseEvent):
    pass


-class AnotherThreadTestEvent(BaseEvent):
-    pass
-
-
-class CrossThreadTestEvent(BaseEvent):
-    pass
-
-
 def test_specific_event_handler():
    mock_handler = Mock()

@@ -62,444 +44,4 @@ def test_event_bus_error_handling(capfd):

    out, err = capfd.readouterr()
    assert "Simulated handler failure" in out
-    assert "Global handler 'broken_handler' failed" in out
-
-
-def test_singleton_pattern_across_threads():
-    instances = []
-
-    def get_instance():
-        instances.append(crewai_event_bus)
-
-    threads = []
-    for _ in range(10):
-        thread = threading.Thread(target=get_instance)
-        threads.append(thread)
-        thread.start()
-
-    for thread in threads:
-        thread.join()
-    assert len(instances) == 10
-    for instance in instances:
-        assert instance is crewai_event_bus
-        assert instance is instances[0]
-
-
-def test_default_handlers_are_global():
-    """Test that handlers registered with @crewai_event_bus.on() are global by default."""
-    received_events = []
-    mock_handler = Mock()
-
-    @crewai_event_bus.on(CrossThreadTestEvent)
-    def global_handler(source, event):
-        received_events.append((source, event))
-        mock_handler(source, event)
-
-    def thread_worker(thread_id):
-        # Emit event from a different thread
-        event = CrossThreadTestEvent(type=f"cross_thread_event_{thread_id}")
-        crewai_event_bus.emit(f"thread_source_{thread_id}", event)
-
-    # Start multiple threads that emit events
-    threads = []
-    for i in range(3):
-        thread = threading.Thread(target=thread_worker, args=(i,))
-        threads.append(thread)
-        thread.start()
-
-    for thread in threads:
-        thread.join()
-
-    # Verify that the global handler received all events from different threads
-    assert len(received_events) == 3
-    assert mock_handler.call_count == 3
-
-    # Check that events from different threads were received
-    for i in range(3):
-        source, event = received_events[i]
-        assert source == f"thread_source_{i}"
-        assert event.type == f"cross_thread_event_{i}"
-
-
-def test_scoped_handlers_thread_isolation():
-    """Test that scoped_handlers() provides thread-local isolation for testing."""
-    global_events = []
-    scoped_events = []
-
-    # Register a global handler
-    @crewai_event_bus.on(CrossThreadTestEvent)
-    def global_handler(source, event):
-        global_events.append((source, event))
-
-    # Emit an event - should be received by global handler
-    event1 = CrossThreadTestEvent(type="event_1")
-    crewai_event_bus.emit("source_1", event1)
-    assert len(global_events) == 1
-
-    # Use scoped handlers for testing isolation
-    with crewai_event_bus.scoped_handlers():
-        # Register a handler in the scoped context (thread-local)
-        @crewai_event_bus.on(CrossThreadTestEvent)
-        def scoped_handler(source, event):
-            scoped_events.append((source, event))
-
-        # Emit event - should be received by scoped handler only
-        event2 = CrossThreadTestEvent(type="event_2")
-        crewai_event_bus.emit("source_2", event2)
-
-    # After scope, emit another event - should be received by global handler only
-    event3 = CrossThreadTestEvent(type="event_3")
-    crewai_event_bus.emit("source_3", event3)
-
-    # Verify events
-    assert len(global_events) == 2  # event_1 and event_3
-    assert len(scoped_events) == 1  # only event_2
-    assert global_events[0] == ("source_1", event1)
-    assert scoped_events[0] == ("source_2", event2)
-    assert global_events[1] == ("source_3", event3)
-
-
-def test_scoped_handlers_thread_safety():
-    """Test that scoped handlers work correctly across multiple threads."""
-    thread_results = {}
-
-    def thread_worker(thread_id):
-        with crewai_event_bus.scoped_handlers():
-            mock_handler = Mock()
-
-            @crewai_event_bus.on(AnotherThreadTestEvent)
-            def scoped_handler(source, event):
-                mock_handler(f"scoped_thread_{thread_id}", event)
-
-            scoped_event = AnotherThreadTestEvent(type=f"scoped_event_{thread_id}")
-            crewai_event_bus.emit(f"scoped_source_{thread_id}", scoped_event)
-
-            thread_results[thread_id] = {
-                "mock_handler": mock_handler,
-                "scoped_event": scoped_event,
-            }
-
-        # After scope, emit event - should not be received by scoped handler
-        post_scoped_event = AnotherThreadTestEvent(type=f"post_scoped_{thread_id}")
-        crewai_event_bus.emit(f"post_source_{thread_id}", post_scoped_event)
-
-    threads = []
-    for i in range(5):
-        thread = threading.Thread(target=thread_worker, args=(i,))
-        threads.append(thread)
-        thread.start()
-
-    for thread in threads:
-        thread.join()
-
-    for thread_id, result in thread_results.items():
-        result["mock_handler"].assert_called_once_with(
-            f"scoped_thread_{thread_id}", result["scoped_event"]
-        )
-
-
-def test_register_handler_method():
-    """Test the register_handler method works with global handlers."""
-    received_events = []
-
-    def handler(source, event):
-        received_events.append((source, event))
-
-    # Register handler using the method
-    crewai_event_bus.register_handler(CrossThreadTestEvent, handler)
-
-    # Emit event from different thread
-    def thread_worker():
-        event = CrossThreadTestEvent(type="test_event")
-        crewai_event_bus.emit("thread_source", event)
-
-    thread = threading.Thread(target=thread_worker)
-    thread.start()
-    thread.join()
-
-    # Verify handler received the event
-    assert len(received_events) == 1
-    assert received_events[0] == (
-        "thread_source",
-        CrossThreadTestEvent(type="test_event"),
-    )
-
-
-def test_scoped_global_handlers():
-    """Test the scoped_global_handlers context manager."""
-    global_events = []
-
-    def global_handler(source, event):
-        global_events.append((source, event))
-
-    # Register a global handler
-    crewai_event_bus.register_handler(CrossThreadTestEvent, global_handler)
-
-    # Emit an event - should be received
-    event1 = CrossThreadTestEvent(type="event_1")
-    crewai_event_bus.emit("source_1", event1)
-    assert len(global_events) == 1
-
-    # Use scoped global handlers
-    with crewai_event_bus.scoped_global_handlers():
-        # Register a different handler in scope
-        def scoped_handler(source, event):
-            global_events.append(("scoped", source, event))
-
-        crewai_event_bus.register_handler(CrossThreadTestEvent, scoped_handler)
-
-        # Emit event - should be received by scoped handler
-        event2 = CrossThreadTestEvent(type="event_2")
-        crewai_event_bus.emit("source_2", event2)
-
-    # After scope, original handler should be restored
-    event3 = CrossThreadTestEvent(type="event_3")
-    crewai_event_bus.emit("source_3", event3)
-
-    # Verify events
-    assert len(global_events) == 3
-    assert global_events[0] == ("source_1", event1)
-    assert global_events[1] == ("scoped", "source_2", event2)
-    assert global_events[2] == ("source_3", event3)
-
-
-def test_handler_duplication_scenarios():
-    """Test various scenarios where handler duplication can occur."""
-    call_counts = []
-
-    def handler(source, event):
-        call_counts.append(1)
-
-    # Scenario 1: Register the same handler multiple times
-    crewai_event_bus.register_handler(TestEvent, handler)
-    crewai_event_bus.register_handler(TestEvent, handler)  # Duplicate registration
-
-    # Scenario 2: Use decorator multiple times on the same function
-    @crewai_event_bus.on(TestEvent)
-    def decorated_handler1(source, event):
-        call_counts.append(1)
-
-    @crewai_event_bus.on(TestEvent)
-    def decorated_handler2(source, event):  # Same function name, different instance
-        call_counts.append(1)
-
-    # Emit an event
-    event = TestEvent(type="test_event")
-    crewai_event_bus.emit("source", event)
-
-    # Currently, all handlers are called (including duplicates)
-    # This shows the current behavior - handlers can be duplicated
-    assert len(call_counts) >= 4  # At least 4 calls (2 direct + 2 decorated)
-
-
-def test_module_reload_duplication():
-    """Test duplication that could occur from module reloading."""
-    call_counts = []
-
-    def create_handler():
-        def handler(source, event):
-            call_counts.append(1)
-
-        return handler
-
-    # Simulate module reload scenario
-    handler1 = create_handler()
-    handler2 = create_handler()  # Same function, different instance
-
-    crewai_event_bus.register_handler(TestEvent, handler1)
-    crewai_event_bus.register_handler(TestEvent, handler2)
-
-    event = TestEvent(type="test_event")
-    crewai_event_bus.emit("source", event)
-
-    # Both handlers are called (duplication)
-    assert len(call_counts) == 2
-
-
-def test_listener_class_duplication():
-    """Test duplication from multiple listener class instances."""
-    call_counts = []
-
-    class TestListener:
-        def __init__(self):
-            @crewai_event_bus.on(TestEvent)
-            def handler(source, event):
-                call_counts.append(1)
-
-    # Create multiple instances (simulating multiple imports)
-    listener1 = TestListener()
-    listener2 = TestListener()
-
-    event = TestEvent(type="test_event")
-    crewai_event_bus.emit("source", event)
-
-    # Both instances register handlers (duplication)
-    assert len(call_counts) == 2
-
-
-def test_handler_deduplication():
-    """Test that duplicate handlers are automatically prevented."""
-    call_counts = []
-
-    def handler(source, event):
-        call_counts.append(1)
-
-    # Register the same handler multiple times
-    result1 = crewai_event_bus.register_handler(TestEvent, handler)
-    result2 = crewai_event_bus.register_handler(
-        TestEvent, handler
-    )  # Duplicate registration
-
-    # First registration should succeed, second should fail
-    assert result1 is True
-    assert result2 is False
-
-    # Emit an event
-    event = TestEvent(type="test_event")
-    crewai_event_bus.emit("source", event)
-
-    # Handler should only be called once (no duplication)
-    assert len(call_counts) == 1
-
-
-def test_decorator_deduplication():
-    """Test that decorator prevents duplicate registrations."""
-    call_counts = []
-
-    # Define the same handler function
-    def handler(source, event):
-        call_counts.append(1)
-
-    # Register using decorator
-    @crewai_event_bus.on(TestEvent)
-    def decorated_handler(source, event):
-        call_counts.append(1)
-
-    # Try to register the same function again using register_handler
-    result = crewai_event_bus.register_handler(
-        TestEvent, cast(Callable[[Any, BaseEvent], None], decorated_handler)
-    )
-
-    # Should fail because it's already registered
-    assert result is False
-
-    # Emit an event
-    event = TestEvent(type="test_event")
-    crewai_event_bus.emit("source", event)
-
-    # Should only be called once
-    assert len(call_counts) == 1
-
-
-def test_handler_unregistration():
-    """Test that handlers can be unregistered."""
-    call_counts = []
-
-    def handler(source, event):
-        call_counts.append(1)
-
-    # Register handler
-    crewai_event_bus.register_handler(TestEvent, handler)
-
-    # Verify it's registered
-    assert crewai_event_bus.get_handler_count(TestEvent) == 1
-
-    # Emit event - should be called
-    event = TestEvent(type="test_event")
-    crewai_event_bus.emit("source", event)
-    assert len(call_counts) == 1
-
-    # Unregister handler
-    result = crewai_event_bus.unregister_handler(TestEvent, handler)
-    assert result is True
-    assert crewai_event_bus.get_handler_count(TestEvent) == 0
-
-    # Emit event again - should not be called
-    crewai_event_bus.emit("source", event)
-    assert len(call_counts) == 1  # Still only 1 call
-
-
-def test_handler_count_tracking():
-    """Test that handler counts are tracked correctly."""
-
-    def handler1(source, event):
-        pass
-
-    def handler2(source, event):
-        pass
-
-    # Initially no handlers
-    assert crewai_event_bus.get_handler_count(TestEvent) == 0
-
-    # Register first handler
-    crewai_event_bus.register_handler(TestEvent, handler1)
-    assert crewai_event_bus.get_handler_count(TestEvent) == 1
-
-    # Register second handler
-    crewai_event_bus.register_handler(TestEvent, handler2)
-    assert crewai_event_bus.get_handler_count(TestEvent) == 2
-
-    # Try to register first handler again (should fail)
-    crewai_event_bus.register_handler(TestEvent, handler1)
-    assert crewai_event_bus.get_handler_count(TestEvent) == 2  # Count unchanged
-
-    # Unregister first handler
-    crewai_event_bus.unregister_handler(TestEvent, handler1)
-    assert crewai_event_bus.get_handler_count(TestEvent) == 1
-
-    # Unregister second handler
-    crewai_event_bus.unregister_handler(TestEvent, handler2)
-    assert crewai_event_bus.get_handler_count(TestEvent) == 0
-
-
-def test_different_event_types_dont_conflict():
-    """Test that handlers for different event types don't interfere."""
-    test_event_calls = []
-    cross_thread_calls = []
-
-    def test_event_handler(source, event):
-        test_event_calls.append(1)
-
-    def cross_thread_handler(source, event):
-        cross_thread_calls.append(1)
-
-    # Register handlers for different event types
-    crewai_event_bus.register_handler(TestEvent, test_event_handler)
-    crewai_event_bus.register_handler(CrossThreadTestEvent, cross_thread_handler)
-
-    # Emit TestEvent
-    test_event = TestEvent(type="test")
-    crewai_event_bus.emit("source", test_event)
-    assert len(test_event_calls) == 1
-    assert len(cross_thread_calls) == 0
-
-    # Emit CrossThreadTestEvent
-    cross_thread_event = CrossThreadTestEvent(type="cross_thread")
-    crewai_event_bus.emit("source", cross_thread_event)
-    assert len(test_event_calls) == 1  # Unchanged
-    assert len(cross_thread_calls) == 1
-
-
-def test_scoped_handlers_with_deduplication():
-    """Test that deduplication works within scoped handlers."""
-    call_counts = []
-
-    def handler(source, event):
-        call_counts.append(1)
-
-    # Register global handler
-    crewai_event_bus.register_handler(TestEvent, handler)
-
-    # Use scoped handlers
-    with crewai_event_bus.scoped_handlers():
-        # Try to register the same handler in scoped context
-        @crewai_event_bus.on(TestEvent)
-        def scoped_handler(source, event):
-            call_counts.append(1)
-
-        # Emit event - should be called by both global and scoped handlers
-        event = TestEvent(type="test_event")
-        crewai_event_bus.emit("source", event)
-
-    # Should have 2 calls (1 global + 1 scoped)
-    assert len(call_counts) == 2
+    assert "Handler 'broken_handler' failed" in out
--- a/tests/utilities/test_chromadb_utils.py
+++ b/tests/utilities/test_chromadb_utils.py
@@ -1,16 +1,27 @@
+import multiprocessing
+import tempfile
 import unittest
-from typing import Any, Dict, List, Union

-import pytest
+from chromadb.config import Settings
+from unittest.mock import patch, MagicMock

 from crewai.utilities.chromadb import (
    MAX_COLLECTION_LENGTH,
    MIN_COLLECTION_LENGTH,
    is_ipv4_pattern,
    sanitize_collection_name,
+    create_persistent_client,
 )


+def persistent_client_worker(path, queue):
+    try:
+        create_persistent_client(path=path)
+        queue.put(None)
+    except Exception as e:
+        queue.put(e)
+
+
 class TestChromadbUtils(unittest.TestCase):
    def test_sanitize_collection_name_long_name(self):
        """Test sanitizing a very long collection name."""
@@ -79,3 +90,34 @@ class TestChromadbUtils(unittest.TestCase):
            self.assertLessEqual(len(sanitized), MAX_COLLECTION_LENGTH)
            self.assertTrue(sanitized[0].isalnum())
            self.assertTrue(sanitized[-1].isalnum())
+
+    def test_create_persistent_client_passes_args(self):
+        with patch(
+            "crewai.utilities.chromadb.PersistentClient"
+        ) as mock_persistent_client, tempfile.TemporaryDirectory() as tmpdir:
+            mock_instance = MagicMock()
+            mock_persistent_client.return_value = mock_instance
+
+            settings = Settings(allow_reset=True)
+            client = create_persistent_client(path=tmpdir, settings=settings)
+
+            mock_persistent_client.assert_called_once_with(
+                path=tmpdir, settings=settings
+            )
+            self.assertIs(client, mock_instance)
+
+    def test_create_persistent_client_process_safe(self):
+        with tempfile.TemporaryDirectory() as tmpdir:
+            queue = multiprocessing.Queue()
+            processes = [
+                multiprocessing.Process(
+                    target=persistent_client_worker, args=(tmpdir, queue)
+                )
+                for _ in range(5)
+            ]
+
+            [p.start() for p in processes]
+            [p.join() for p in processes]
+
+            errors = [queue.get(timeout=5) for _ in processes]
+            self.assertTrue(all(err is None for err in errors))
--- a/uv.lock
+++ b/uv.lock
Author	SHA1	Message	Date
Devin AI	cc08e36d32	feat: change litellm dependency from strict pin to minimum version constraint - Change litellm==1.74.3 to litellm>=1.74.3 in pyproject.toml - Update uv.lock with new dependency constraint - Add comprehensive tests to verify minimum version constraint works - Allows users to install newer litellm versions for features like Claude 4 Sonnet Fixes #3207 Co-Authored-By: João <joao@crewai.com>	2025-07-22 23:50:14 +00:00
Tony Kipkemboi	9a65573955	Feature/update docs (#3205) * docs: add create_directory parameter * docs: remove string guardrails to focus on function guardrails * docs: remove get help from docs.json * docs: update pt-BR docs.json changes	2025-07-22 13:55:27 -04:00
Lucas Gomide	27623a1d01	feat: remove duplicate print on LLM call error (#3183) By improving litellm handler error / outputs Co-authored-by: Lorenze Jay <63378463+lorenzejay@users.noreply.github.com>	2025-07-21 22:08:07 -04:00
João Moura	2593242234	Adding Support to adhoc tool calling using the internal LLM class (#3195) * Adding Support to adhoc tool calling using the internal LLM class * fix type	2025-07-21 19:36:48 -03:00
Greyson LaLonde	2ab6c31544	chore: add deprecation notices to UserMemory (#3201) - Mark UserMemory and UserMemoryItem for removal in v0.156.0 or 2025-08-04 - Update all references with deprecation warnings - Users should migrate to ExternalMemory	2025-07-21 15:26:34 -04:00
Lucas Gomide	3c55c8a22a	fix: append user message when last message is from assistent when using Ollama models (#3200) Ollama doesn't supports last message to be 'assistant' We can drop this commit after merging https://github.com/BerriAI/litellm/pull/10917	2025-07-21 13:30:40 -04:00
Ranuga Disansa	424433ff58	docs: Add Tavily Search & Extractor tools to Search-Research suite (#3146) * docs: Add Tavily Search and Extractor tools documentation * docs: Add Tavily Search and Extractor tools to the documentation --------- Co-authored-by: Tony Kipkemboi <iamtonykipkemboi@gmail.com>	2025-07-21 12:01:29 -04:00
Lucas Gomide	2fd99503ed	build: upgrade LiteLLM to 1.74.3 (#3199)	2025-07-21 09:58:47 -04:00
Vidit Ostwal	942014962e	fixed save method, changed the test cases (#3187) * fixed save method, changed the test cases * Linting fixed	2025-07-18 15:10:26 -04:00
Lucas Gomide	2ab79a7dd5	feat: drop unsupported stop parameter for LLM models automatically (#3184)	2025-07-18 13:54:28 -04:00
Lucas Gomide	27c449c9c4	test: remove workaround related to SQLite without FTS5 (#3179) For more details check out [here](actions/runner-images#12576)	2025-07-18 09:37:15 -04:00
Vini Brasil	9737333ffd	Use file lock around Chroma client initialization (#3181) This commit fixes a bug with concurrent processess and Chroma where `table collections already exists` (and similar) were raised. https://cookbook.chromadb.dev/core/system_constraints/	2025-07-17 11:50:45 -03:00
Lucas Gomide	bf248d5118	docs: fix neatlogs documentation (#3171)	2025-07-16 21:18:04 -04:00
Lorenze Jay	2490e8cd46	Update CrewAI version to 0.148.0 in project templates and dependencies (#3172) * Update CrewAI version to 0.148.0 in project templates and dependencies * Update crewai-tools dependency to version 0.55.0 in pyproject.toml and uv.lock for improved functionality and performance.	2025-07-16 12:36:43 -07:00
Lucas Gomide	9b67e5a15f	Emit events about Agent eval (#3168) * feat: emit events abou Agent Eval We are triggering events when an evaluation has started/completed/failed * style: fix type checking issues	2025-07-16 13:18:59 -04:00
Lucas Gomide	6ebb6c9b63	Supporting eval single Agent/LiteAgent (#3167) * refactor: rely on task completion event to evaluate agents * feat: remove Crew dependency to evaluate agent * feat: drop execution_context in AgentEvaluator * chore: drop experimental Agent Eval feature from stable crew.test * feat: support eval LiteAgent * resolve linter issues	2025-07-15 09:22:41 -04:00
Lucas Gomide	53f674be60	chore: remove evaluation folder (#3159) This folder was moved to `experimental` folder	2025-07-15 08:30:20 -04:00
Paras Sakarwal	11717a5213	docs: added integration with neatlogs (#3138)	2025-07-14 11:08:24 -04:00