fix: cast dict values to str in _format_prompt

- Add str() casts for type safety
- These values are always strings when called from invoke
fix: update CrewAgentExecutor.invoke type signature
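
For reviewers, a minimal standalone sketch of what the two fixes above amount to. It assumes the inputs dict may carry non-string entries next to the three prompt placeholders. `format_prompt`, `PromptInputs`, and the `ask_for_human_input` key are illustrative names, not the code in this change.

```python
from typing import Dict, Union

# Widened value type, mirroring the new invoke/_format_prompt annotations.
PromptInputs = Dict[str, Union[str, bool, None]]


def format_prompt(prompt: str, inputs: PromptInputs) -> str:
    """Illustrative stand-in for CrewAgentExecutor._format_prompt."""
    # str.replace only accepts str arguments, so each value is cast explicitly.
    # For values that are already strings the cast changes nothing.
    for key in ("input", "tool_names", "tools"):
        prompt = prompt.replace("{" + key + "}", str(inputs[key]))
    return prompt


inputs: PromptInputs = {
    "input": "Summarize the report",
    "tool_names": "search, calculator",
    "tools": "search: web lookup",
    "ask_for_human_input": False,  # hypothetical non-string entry
}

formatted = format_prompt("Task: {input}\nTools: {tool_names}", inputs)
print(formatted)
```

One consequence worth keeping in mind: if a placeholder value were ever `None`, the cast would render it as the literal text `None` rather than raising, so the "always strings" assumption is not enforced at runtime.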
2026-03-18 01:38:13 +00:00 · 2025-07-22 10:34:10 -04:00 · 2025-07-22 10:27:58 -04:00 · 2025-07-22 10:21:31 -04:00 · 2025-07-22 10:16:53 -04:00 · 2025-07-21 22:08:07 -04:00
75 changed files with 3682 additions and 536 deletions
--- a/.github/workflows/tests.yml
+++ b/.github/workflows/tests.yml
@@ -37,25 +37,9 @@ jobs:
      - name: Install the project
        run: uv sync --dev --all-extras

-      - name: Install SQLite with FTS5 support
-        run: |
-          # WORKAROUND: GitHub Actions' Ubuntu runner uses SQLite without FTS5 support compiled in.
-          # This is a temporary fix until the runner includes SQLite with FTS5 or Python's sqlite3
-          # module is compiled with FTS5 support by default.
-          # TODO: Remove this workaround once GitHub Actions runners include SQLite FTS5 support
-          
-          # Install pysqlite3-binary which has FTS5 support
-          uv pip install pysqlite3-binary
-          # Create a sitecustomize.py to override sqlite3 with pysqlite3
-          mkdir -p .pytest_sqlite_override
-          echo "import sys; import pysqlite3; sys.modules['sqlite3'] = pysqlite3" > .pytest_sqlite_override/sitecustomize.py
-          # Test FTS5 availability
-          PYTHONPATH=.pytest_sqlite_override uv run python -c "import sqlite3; print(f'SQLite version: {sqlite3.sqlite_version}')"
-          PYTHONPATH=.pytest_sqlite_override uv run python -c "import sqlite3; conn = sqlite3.connect(':memory:'); conn.execute('CREATE VIRTUAL TABLE test USING fts5(content)'); print('FTS5 module available')"
-
      - name: Run tests (group ${{ matrix.group }} of 8)
        run: |
-          PYTHONPATH=.pytest_sqlite_override uv run pytest \
+          uv run pytest \
            --block-network \
            --timeout=30 \
            -vv \
--- a/.gitignore
+++ b/.gitignore
@@ -26,4 +26,5 @@ test_flow.html
 crewairules.mdc
 plan.md
 conceptual_plan.md
-build_image
+build_image
+chromadb-*.lock
--- a/docs/docs.json
+++ b/docs/docs.json
@@ -9,12 +9,7 @@
  },
  "favicon": "/images/favicon.svg",
  "contextual": {
-    "options": [
-      "copy",
-      "view",
-      "chatgpt",
-      "claude"
-    ]
+    "options": ["copy", "view", "chatgpt", "claude"]
  },
  "navigation": {
    "languages": [
@@ -55,32 +50,22 @@
            "groups": [
              {
                "group": "Get Started",
-                "pages": [
-                  "en/introduction",
-                  "en/installation",
-                  "en/quickstart"
-                ]
+                "pages": ["en/introduction", "en/installation", "en/quickstart"]
              },
              {
                "group": "Guides",
                "pages": [
                  {
                    "group": "Strategy",
-                    "pages": [
-                      "en/guides/concepts/evaluating-use-cases"
-                    ]
+                    "pages": ["en/guides/concepts/evaluating-use-cases"]
                  },
                  {
                    "group": "Agents",
-                    "pages": [
-                      "en/guides/agents/crafting-effective-agents"
-                    ]
+                    "pages": ["en/guides/agents/crafting-effective-agents"]
                  },
                  {
                    "group": "Crews",
-                    "pages": [
-                      "en/guides/crews/first-crew"
-                    ]
+                    "pages": ["en/guides/crews/first-crew"]
                  },
                  {
                    "group": "Flows",
@@ -94,7 +79,6 @@
                    "pages": [
                      "en/guides/advanced/customizing-prompts",
                      "en/guides/advanced/fingerprinting"
-
                    ]
                  }
                ]
@@ -182,7 +166,9 @@
                      "en/tools/search-research/websitesearchtool",
                      "en/tools/search-research/codedocssearchtool",
                      "en/tools/search-research/youtubechannelsearchtool",
-                      "en/tools/search-research/youtubevideosearchtool"
+                      "en/tools/search-research/youtubevideosearchtool",
+                      "en/tools/search-research/tavilysearchtool",
+                      "en/tools/search-research/tavilyextractortool"
                    ]
                  },
                  {
@@ -241,6 +227,7 @@
                  "en/observability/langtrace",
                  "en/observability/maxim",
                  "en/observability/mlflow",
+                  "en/observability/neatlogs",
                  "en/observability/openlit",
                  "en/observability/opik",
                  "en/observability/patronus-evaluation",
@@ -274,9 +261,7 @@
              },
              {
                "group": "Telemetry",
-                "pages": [
-                  "en/telemetry"
-                ]
+                "pages": ["en/telemetry"]
              }
            ]
          },
@@ -285,9 +270,7 @@
            "groups": [
              {
                "group": "Getting Started",
-                "pages": [
-                  "en/enterprise/introduction"
-                ]
+                "pages": ["en/enterprise/introduction"]
              },
              {
                "group": "Features",
@@ -342,9 +325,7 @@
              },
              {
                "group": "Resources",
-                "pages": [
-                  "en/enterprise/resources/frequently-asked-questions"
-                ]
+                "pages": ["en/enterprise/resources/frequently-asked-questions"]
              }
            ]
          },
@@ -353,9 +334,7 @@
            "groups": [
              {
                "group": "Getting Started",
-                "pages": [
-                  "en/api-reference/introduction"
-                ]
+                "pages": ["en/api-reference/introduction"]
              },
              {
                "group": "Endpoints",
@@ -365,16 +344,13 @@
          },
          {
            "tab": "Examples",
-                        "groups": [
+            "groups": [
              {
                "group": "Examples",
-                "pages": [
-                  "en/examples/example"
-                ]
+                "pages": ["en/examples/example"]
              }
            ]
          }
-
        ]
      },
      {
@@ -425,21 +401,15 @@
                "pages": [
                  {
                    "group": "Estratégia",
-                    "pages": [
-                      "pt-BR/guides/concepts/evaluating-use-cases"
-                    ]
+                    "pages": ["pt-BR/guides/concepts/evaluating-use-cases"]
                  },
                  {
                    "group": "Agentes",
-                    "pages": [
-                      "pt-BR/guides/agents/crafting-effective-agents"
-                    ]
+                    "pages": ["pt-BR/guides/agents/crafting-effective-agents"]
                  },
                  {
                    "group": "Crews",
-                    "pages": [
-                      "pt-BR/guides/crews/first-crew"
-                    ]
+                    "pages": ["pt-BR/guides/crews/first-crew"]
                  },
                  {
                    "group": "Flows",
@@ -632,9 +602,7 @@
              },
              {
                "group": "Telemetria",
-                "pages": [
-                  "pt-BR/telemetry"
-                ]
+                "pages": ["pt-BR/telemetry"]
              }
            ]
          },
@@ -643,9 +611,7 @@
            "groups": [
              {
                "group": "Começando",
-                "pages": [
-                  "pt-BR/enterprise/introduction"
-                ]
+                "pages": ["pt-BR/enterprise/introduction"]
              },
              {
                "group": "Funcionalidades",
@@ -710,9 +676,7 @@
            "groups": [
              {
                "group": "Começando",
-                "pages": [
-                  "pt-BR/api-reference/introduction"
-                ]
+                "pages": ["pt-BR/api-reference/introduction"]
              },
              {
                "group": "Endpoints",
@@ -722,16 +686,13 @@
          },
          {
            "tab": "Exemplos",
-                        "groups": [
+            "groups": [
              {
                "group": "Exemplos",
-                "pages": [
-                  "pt-BR/examples/example"
-                ]
+                "pages": ["pt-BR/examples/example"]
              }
            ]
          }
-
        ]
      }
    ]
--- a/docs/en/concepts/memory.mdx
+++ b/docs/en/concepts/memory.mdx
@@ -712,7 +712,7 @@ crew = Crew(
    memory_config={
        "provider": "mem0",
        "config": {"user_id": "john"},
-        "user_memory": {}  # Required - triggers user memory initialization
+        "user_memory": {}  # DEPRECATED: Will be removed in version 0.156.0 or on 2025-08-04, use external_memory instead
    },
    process=Process.sequential,
    verbose=True
--- a/docs/en/observability/neatlogs.mdx
+++ b/docs/en/observability/neatlogs.mdx
@@ -0,0 +1,134 @@
+---
+title: Neatlogs Integration
+description: Understand, debug, and share your CrewAI agent runs
+icon: magnifying-glass-chart
+---
+
+# Introduction
+
+Neatlogs helps you **see what your agent did**, **why**, and **share it**.
+
+It captures every step: thoughts, tool calls, responses, evaluations. No raw logs. Just clear, structured traces. Great for debugging and collaboration.
+
+## Why use Neatlogs?
+
+CrewAI agents use multiple tools and reasoning steps. When something goes wrong, you need context — not just errors.
+
+Neatlogs lets you:
+
+- Follow the full decision path
+- Add feedback directly on steps
+- Chat with the trace using an AI assistant
+- Share runs publicly for feedback
+- Turn insights into tasks
+
+All in one place.
+
+Manage your traces effortlessly
+
+![Traces](/images/neatlogs-1.png)
+![Trace Response](/images/neatlogs-2.png)
+
+The best UX to view a CrewAI trace. Post comments anywhere you want. Use AI to debug.
+
+![Trace Details](/images/neatlogs-3.png)
+![AI Chat Bot With A Trace](/images/neatlogs-4.png)
+![Comments Drawer](/images/neatlogs-5.png)
+
+## Core Features
+
+- **Trace Viewer**: Track thoughts, tools, and decisions in sequence
+- **Inline Comments**: Tag teammates on any trace step
+- **Feedback & Evaluation**: Mark outputs as correct or incorrect
+- **Error Highlighting**: Automatic flagging of API/tool failures
+- **Task Conversion**: Convert comments into assigned tasks
+- **Ask the Trace (AI)**: Chat with your trace using the Neatlogs AI bot
+- **Public Sharing**: Publish trace links to your community
+
+## Quick Setup with CrewAI
+
+<Steps>
+  <Step title="Sign Up & Get API Key">
+    Visit [neatlogs.com](https://neatlogs.com/?utm_source=crewAI-docs), create a project, copy the API key.
+  </Step>
+  <Step title="Install SDK">
+    ```bash
+    pip install neatlogs
+    ```
+    (Latest version 0.8.0, Python 3.8+; MIT license)
+  </Step>
+  <Step title="Initialize Neatlogs">
+    Before starting Crew agents, add:
+
+    ```python
+    import neatlogs
+    neatlogs.init("YOUR_PROJECT_API_KEY")
+    ```
+
+    Agents run as usual. Neatlogs captures everything automatically.
+
+  </Step>
+</Steps>
+
+
+
+## Under the Hood
+
+According to its GitHub repository, Neatlogs:
+
+- Captures thoughts, tool calls, responses, errors, and token stats
+- Supports AI-powered task generation and robust evaluation workflows
+
+All with just two lines of code.
+
+
+
+## Watch It Work
+
+### 🔍 Full Demo (4 min)
+
+<iframe
+  width="100%"
+  height="315"
+  src="https://www.youtube.com/embed/8KDme9T2I7Q?si=b8oHteaBwFNs_Duk"
+  title="YouTube video player"
+  frameBorder="0"
+  allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
+  allowFullScreen
+></iframe>
+
+### ⚙️ CrewAI Integration (30 s)
+
+<iframe
+  className="w-full aspect-video rounded-xl"
+  src="https://www.loom.com/embed/9c78b552af43452bb3e4783cb8d91230?sid=e9d7d370-a91a-49b0-809e-2f375d9e801d"
+  title="Loom video player"
+  frameBorder="0"
+  allowFullScreen
+></iframe>
+
+
+
+## Links & Support
+
+- 📘 [Neatlogs Docs](https://docs.neatlogs.com/)
+- 🔐 [Dashboard & API Key](https://app.neatlogs.com/)
+- 🐦 [Follow on Twitter](https://twitter.com/neatlogs)
+- 📧 Contact: hello@neatlogs.com
+- 🛠 [GitHub SDK](https://github.com/NeatLogs/neatlogs)
+
+
+
+## TL;DR
+
+With just:
+
+```bash
+pip install neatlogs
+```
+
+```python
+import neatlogs
+neatlogs.init("YOUR_API_KEY")
+```
+You can now capture, understand, share, and act on your CrewAI agent runs in seconds. No setup overhead. Full trace transparency. Full team collaboration.
--- a/docs/en/tools/search-research/overview.mdx
+++ b/docs/en/tools/search-research/overview.mdx
@@ -44,6 +44,14 @@ These tools enable your agents to search the web, research topics, and find info
  <Card title="YouTube Video Search" icon="play" href="/en/tools/search-research/youtubevideosearchtool">
    Find and analyze YouTube videos by topic, keyword, or criteria.
  </Card>
+
+  <Card title="Tavily Search Tool" icon="magnifying-glass" href="/en/tools/search-research/tavilysearchtool">
+    Comprehensive web search using Tavily's AI-powered search API.
+  </Card>
+
+  <Card title="Tavily Extractor Tool" icon="file-text" href="/en/tools/search-research/tavilyextractortool">
+    Extract structured content from web pages using the Tavily API.
+  </Card>
 </CardGroup>

 ## **Common Use Cases**
@@ -55,17 +63,19 @@ These tools enable your agents to search the web, research topics, and find info
 - **Academic Research**: Find scholarly articles and technical papers

 ```python
-from crewai_tools import SerperDevTool, GitHubSearchTool, YoutubeVideoSearchTool
+from crewai_tools import SerperDevTool, GitHubSearchTool, YoutubeVideoSearchTool, TavilySearchTool, TavilyExtractorTool

 # Create research tools
 web_search = SerperDevTool()
 code_search = GitHubSearchTool()
 video_research = YoutubeVideoSearchTool()
+tavily_search = TavilySearchTool()
+content_extractor = TavilyExtractorTool()

 # Add to your agent
 agent = Agent(
    role="Research Analyst",
-    tools=[web_search, code_search, video_research],
+    tools=[web_search, code_search, video_research, tavily_search, content_extractor],
    goal="Gather comprehensive information on any topic"
 )
 ```
--- a/docs/en/tools/search-research/tavilyextractortool.mdx
+++ b/docs/en/tools/search-research/tavilyextractortool.mdx
@@ -0,0 +1,139 @@
+---
+title: "Tavily Extractor Tool"
+description: "Extract structured content from web pages using the Tavily API"
+icon: "file-text"
+---
+
+The `TavilyExtractorTool` allows CrewAI agents to extract structured content from web pages using the Tavily API. It can process single URLs or lists of URLs and provides options for controlling the extraction depth and including images.
+
+## Installation
+
+To use the `TavilyExtractorTool`, you need to install the `tavily-python` library:
+
+```shell
+pip install 'crewai[tools]' tavily-python
+```
+
+You also need to set your Tavily API key as an environment variable:
+
+```bash
+export TAVILY_API_KEY='your-tavily-api-key'
+```
+
+## Example Usage
+
+Here's how to initialize and use the `TavilyExtractorTool` within a CrewAI agent:
+
+```python
+import os
+from crewai import Agent, Task, Crew
+from crewai_tools import TavilyExtractorTool
+
+# Ensure TAVILY_API_KEY is set in your environment
+# os.environ["TAVILY_API_KEY"] = "YOUR_API_KEY"
+
+# Initialize the tool
+tavily_tool = TavilyExtractorTool()
+
+# Create an agent that uses the tool
+extractor_agent = Agent(
+    role='Web Content Extractor',
+    goal='Extract key information from specified web pages',
+    backstory='You are an expert at extracting relevant content from websites using the Tavily API.',
+    tools=[tavily_tool],
+    verbose=True
+)
+
+# Define a task for the agent
+extract_task = Task(
+    description='Extract the main content from the URL https://example.com using basic extraction depth.',
+    expected_output='A JSON string containing the extracted content from the URL.',
+    agent=extractor_agent
+)
+
+# Create and run the crew
+crew = Crew(
+    agents=[extractor_agent],
+    tasks=[extract_task],
+    verbose=True
+)
+
+result = crew.kickoff()
+print(result)
+```
+
+## Configuration Options
+
+The `TavilyExtractorTool` accepts the following arguments:
+
+- `urls` (Union[List[str], str]): **Required**. A single URL string or a list of URL strings to extract data from.
+- `include_images` (Optional[bool]): Whether to include images in the extraction results. Defaults to `False`.
+- `extract_depth` (Literal["basic", "advanced"]): The depth of extraction. Use `"basic"` for faster, surface-level extraction or `"advanced"` for more comprehensive extraction. Defaults to `"basic"`.
+- `timeout` (int): The maximum time in seconds to wait for the extraction request to complete. Defaults to `60`.
+
+## Advanced Usage
+
+### Multiple URLs with Advanced Extraction
+
+```python
+# Example with multiple URLs and advanced extraction
+multi_extract_task = Task(
+    description='Extract content from https://example.com and https://anotherexample.org using advanced extraction.',
+    expected_output='A JSON string containing the extracted content from both URLs.',
+    agent=extractor_agent
+)
+
+# Configure the tool with custom parameters
+custom_extractor = TavilyExtractorTool(
+    extract_depth='advanced',
+    include_images=True,
+    timeout=120
+)
+
+agent_with_custom_tool = Agent(
+    role="Advanced Content Extractor",
+    goal="Extract comprehensive content with images",
+    tools=[custom_extractor]
+)
+```
+
+### Tool Parameters
+
+You can customize the tool's behavior by setting parameters during initialization:
+
+```python
+# Initialize with custom configuration
+extractor_tool = TavilyExtractorTool(
+    extract_depth='advanced',  # More comprehensive extraction
+    include_images=True,       # Include image results
+    timeout=90                 # Custom timeout
+)
+```
+
+## Features
+
+- **Single or Multiple URLs**: Extract content from one URL or process multiple URLs in a single request
+- **Configurable Depth**: Choose between basic (fast) and advanced (comprehensive) extraction modes
+- **Image Support**: Optionally include images in the extraction results
+- **Structured Output**: Returns well-formatted JSON containing the extracted content
+- **Error Handling**: Robust handling of network timeouts and extraction errors
+
+## Response Format
+
+The tool returns a JSON string representing the structured data extracted from the provided URL(s). The exact structure depends on the content of the pages and the `extract_depth` used.
+
+Common response elements include:
+- **Title**: The page title
+- **Content**: Main text content of the page
+- **Images**: Image URLs and metadata (when `include_images=True`)
+- **Metadata**: Additional page information like author, description, etc.
+
+## Use Cases
+
+- **Content Analysis**: Extract and analyze content from competitor websites
+- **Research**: Gather structured data from multiple sources for analysis
+- **Content Migration**: Extract content from existing websites for migration
+- **Monitoring**: Regular extraction of content for change detection
+- **Data Collection**: Systematic extraction of information from web sources
+
+Refer to the [Tavily API documentation](https://docs.tavily.com/docs/tavily-api/python-sdk#extract) for detailed information about the response structure and available options.
--- a/docs/en/tools/search-research/tavilysearchtool.mdx
+++ b/docs/en/tools/search-research/tavilysearchtool.mdx
@@ -0,0 +1,122 @@
+---
+title: "Tavily Search Tool"
+description: "Perform comprehensive web searches using the Tavily Search API"
+icon: "magnifying-glass"
+---
+
+The `TavilySearchTool` provides an interface to the Tavily Search API, enabling CrewAI agents to perform comprehensive web searches. It allows for specifying search depth, topics, time ranges, included/excluded domains, and whether to include direct answers, raw content, or images in the results.
+
+## Installation
+
+To use the `TavilySearchTool`, you need to install the `tavily-python` library:
+
+```shell
+pip install 'crewai[tools]' tavily-python
+```
+
+## Environment Variables
+
+Ensure your Tavily API key is set as an environment variable:
+
+```bash
+export TAVILY_API_KEY='your_tavily_api_key'
+```
+
+## Example Usage
+
+Here's how to initialize and use the `TavilySearchTool` within a CrewAI agent:
+
+```python
+import os
+from crewai import Agent, Task, Crew
+from crewai_tools import TavilySearchTool
+
+# Ensure the TAVILY_API_KEY environment variable is set
+# os.environ["TAVILY_API_KEY"] = "YOUR_TAVILY_API_KEY"
+
+# Initialize the tool
+tavily_tool = TavilySearchTool()
+
+# Create an agent that uses the tool
+researcher = Agent(
+    role='Market Researcher',
+    goal='Find information about the latest AI trends',
+    backstory='An expert market researcher specializing in technology.',
+    tools=[tavily_tool],
+    verbose=True
+)
+
+# Create a task for the agent
+research_task = Task(
+    description='Search for the top 3 AI trends in 2024.',
+    expected_output='A JSON report summarizing the top 3 AI trends found.',
+    agent=researcher
+)
+
+# Form the crew and kick it off
+crew = Crew(
+    agents=[researcher],
+    tasks=[research_task],
+    verbose=True
+)
+
+result = crew.kickoff()
+print(result)
+```
+
+## Configuration Options
+
+The `TavilySearchTool` accepts the following arguments during initialization or when calling the `run` method:
+
+- `query` (str): **Required**. The search query string.
+- `search_depth` (Literal["basic", "advanced"], optional): The depth of the search. Defaults to `"basic"`.
+- `topic` (Literal["general", "news", "finance"], optional): The topic to focus the search on. Defaults to `"general"`.
+- `time_range` (Literal["day", "week", "month", "year"], optional): The time range for the search. Defaults to `None`.
+- `days` (int, optional): The number of days to search back. Relevant if `time_range` is not set. Defaults to `7`.
+- `max_results` (int, optional): The maximum number of search results to return. Defaults to `5`.
+- `include_domains` (Sequence[str], optional): A list of domains to prioritize in the search. Defaults to `None`.
+- `exclude_domains` (Sequence[str], optional): A list of domains to exclude from the search. Defaults to `None`.
+- `include_answer` (Union[bool, Literal["basic", "advanced"]], optional): Whether to include a direct answer synthesized from the search results. Defaults to `False`.
+- `include_raw_content` (bool, optional): Whether to include the raw HTML content of the searched pages. Defaults to `False`.
+- `include_images` (bool, optional): Whether to include image results. Defaults to `False`.
+- `timeout` (int, optional): The request timeout in seconds. Defaults to `60`.
+
+## Advanced Usage
+
+You can configure the tool with custom parameters:
+
+```python
+# Example: Initialize with specific parameters
+custom_tavily_tool = TavilySearchTool(
+    search_depth='advanced',
+    max_results=10,
+    include_answer=True
+)
+
+# The agent will use these defaults
+agent_with_custom_tool = Agent(
+    role="Advanced Researcher",
+    goal="Conduct detailed research with comprehensive results",
+    tools=[custom_tavily_tool]
+)
+```
+
+## Features
+
+- **Comprehensive Search**: Access to Tavily's powerful search index
+- **Configurable Depth**: Choose between basic and advanced search modes
+- **Topic Filtering**: Focus searches on general, news, or finance topics
+- **Time Range Control**: Limit results to specific time periods
+- **Domain Control**: Include or exclude specific domains
+- **Direct Answers**: Get synthesized answers from search results
+- **Content Filtering**: Prevent context window issues with automatic content truncation
+
+## Response Format
+
+The tool returns search results as a JSON string containing:
+- Search results with titles, URLs, and content snippets
+- Optional direct answers to queries
+- Optional image results
+- Optional raw HTML content (when enabled)
+
+Content for each result is automatically truncated to prevent context window issues while maintaining the most relevant information.
--- a/docs/images/neatlogs-1.png
+++ b/docs/images/neatlogs-1.png
--- a/docs/images/neatlogs-2.png
+++ b/docs/images/neatlogs-2.png
--- a/docs/images/neatlogs-3.png
+++ b/docs/images/neatlogs-3.png
--- a/docs/images/neatlogs-4.png
+++ b/docs/images/neatlogs-4.png
--- a/docs/images/neatlogs-5.png
+++ b/docs/images/neatlogs-5.png
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -11,7 +11,7 @@ dependencies = [
    # Core Dependencies
    "pydantic>=2.4.2",
    "openai>=1.13.3",
-    "litellm==1.72.6",
+    "litellm==1.74.3",
    "instructor>=1.3.3",
    # Text Processing
    "pdfplumber>=0.11.4",
@@ -39,6 +39,7 @@ dependencies = [
    "tomli>=2.0.2",
    "blinker>=1.9.0",
    "json5>=0.10.0",
+    "portalocker==2.7.0",
 ]

 [project.urls]
@@ -47,7 +48,7 @@ Documentation = "https://docs.crewai.com"
 Repository = "https://github.com/crewAIInc/crewAI"

 [project.optional-dependencies]
-tools = ["crewai-tools~=0.51.0"]
+tools = ["crewai-tools~=0.55.0"]
 embeddings = [
    "tiktoken~=0.8.0"
 ]
--- a/src/crewai/__init__.py
+++ b/src/crewai/__init__.py
@@ -54,7 +54,7 @@ def _track_install_async():

 _track_install_async()

-__version__ = "0.141.0"
+__version__ = "0.148.0"
 __all__ = [
    "Agent",
    "Crew",
--- a/src/crewai/agent.py
+++ b/src/crewai/agent.py
@@ -1,7 +1,18 @@
 import shutil
 import subprocess
 import time
-from typing import Any, Callable, Dict, List, Literal, Optional, Sequence, Tuple, Type, Union
+from typing import (
+    Any,
+    Callable,
+    Dict,
+    List,
+    Literal,
+    Optional,
+    Sequence,
+    Tuple,
+    Type,
+    Union,
+)

 from pydantic import Field, InstanceOf, PrivateAttr, model_validator

@@ -76,6 +87,12 @@ class Agent(BaseAgent):
    """

    _times_executed: int = PrivateAttr(default=0)
+    agent_executor: Optional[CrewAgentExecutor] = Field(
+        default=None,
+        init=False,  # Not included in __init__ as it's created dynamically in create_agent_executor()
+        exclude=True,  # Excluded from serialization to avoid circular references
+        description="The agent executor instance for running tasks. Created dynamically when needed.",
+    )
    max_execution_time: Optional[int] = Field(
        default=None,
        description="Maximum execution time for an agent to execute a task",
@@ -162,7 +179,7 @@ class Agent(BaseAgent):
    )
    guardrail: Optional[Union[Callable[[Any], Tuple[bool, Any]], str]] = Field(
        default=None,
-        description="Function or string description of a guardrail to validate agent output"
+        description="Function or string description of a guardrail to validate agent output",
    )
    guardrail_max_retries: int = Field(
        default=3, description="Maximum number of retries when guardrail fails"
@@ -340,7 +357,6 @@ class Agent(BaseAgent):
            self.knowledge_config.model_dump() if self.knowledge_config else {}
        )

-
        if self.knowledge or (self.crew and self.crew.knowledge):
            crewai_event_bus.emit(
                self,
@@ -531,6 +547,11 @@ class Agent(BaseAgent):
        Returns:
            The output of the agent.
        """
+        if not self.agent_executor:
+            raise ValueError(
+                "Agent executor not initialized. Call create_agent_executor() first."
+            )
+
        return self.agent_executor.invoke(
            {
                "input": task_prompt,
--- a/src/crewai/agents/crew_agent_executor.py
+++ b/src/crewai/agents/crew_agent_executor.py
@@ -96,7 +96,7 @@ class CrewAgentExecutor(CrewAgentExecutorMixin):
            )
        )

-    def invoke(self, inputs: Dict[str, str]) -> Dict[str, Any]:
+    def invoke(self, inputs: Dict[str, Union[str, bool, None]]) -> Dict[str, Any]:
        if "system" in self.prompt:
            system_prompt = self._format_prompt(self.prompt.get("system", ""), inputs)
            user_prompt = self._format_prompt(self.prompt.get("user", ""), inputs)
@@ -120,11 +120,7 @@ class CrewAgentExecutor(CrewAgentExecutorMixin):
            raise
        except Exception as e:
            handle_unknown_error(self._printer, e)
-            if e.__class__.__module__.startswith("litellm"):
-                # Do not retry on litellm errors
-                raise e
-            else:
-                raise e
+            raise

        if self.ask_for_human_input:
            formatted_answer = self._handle_human_feedback(formatted_answer)
@@ -159,7 +155,7 @@ class CrewAgentExecutor(CrewAgentExecutorMixin):
                    messages=self.messages,
                    callbacks=self.callbacks,
                    printer=self._printer,
-                    from_task=self.task
+                    from_task=self.task,
                )
                formatted_answer = process_llm_response(answer, self.use_stop_words)

@@ -375,10 +371,13 @@ class CrewAgentExecutor(CrewAgentExecutorMixin):
        training_data[agent_id] = agent_training_data
        training_handler.save(training_data)

-    def _format_prompt(self, prompt: str, inputs: Dict[str, str]) -> str:
-        prompt = prompt.replace("{input}", inputs["input"])
-        prompt = prompt.replace("{tool_names}", inputs["tool_names"])
-        prompt = prompt.replace("{tools}", inputs["tools"])
+    def _format_prompt(
+        self, prompt: str, inputs: Dict[str, Union[str, bool, None]]
+    ) -> str:
+        # Cast to str to satisfy type checker - these are always strings when called
+        prompt = prompt.replace("{input}", str(inputs["input"]))
+        prompt = prompt.replace("{tool_names}", str(inputs["tool_names"]))
+        prompt = prompt.replace("{tools}", str(inputs["tools"]))
        return prompt

    def _handle_human_feedback(self, formatted_answer: AgentFinish) -> AgentFinish:
--- a/src/crewai/cli/templates/crew/pyproject.toml
+++ b/src/crewai/cli/templates/crew/pyproject.toml
@@ -5,7 +5,7 @@ description = "{{name}} using crewAI"
 authors = [{ name = "Your Name", email = "you@example.com" }]
 requires-python = ">=3.10,<3.14"
 dependencies = [
-    "crewai[tools]>=0.141.0,<1.0.0"
+    "crewai[tools]>=0.148.0,<1.0.0"
 ]

 [project.scripts]
--- a/src/crewai/cli/templates/flow/pyproject.toml
+++ b/src/crewai/cli/templates/flow/pyproject.toml
@@ -5,7 +5,7 @@ description = "{{name}} using crewAI"
 authors = [{ name = "Your Name", email = "you@example.com" }]
 requires-python = ">=3.10,<3.14"
 dependencies = [
-    "crewai[tools]>=0.141.0,<1.0.0",
+    "crewai[tools]>=0.148.0,<1.0.0",
 ]

 [project.scripts]
--- a/src/crewai/cli/templates/tool/pyproject.toml
+++ b/src/crewai/cli/templates/tool/pyproject.toml
@@ -5,7 +5,7 @@ description = "Power up your crews with {{folder_name}}"
 readme = "README.md"
 requires-python = ">=3.10,<3.14"
 dependencies = [
-    "crewai[tools]>=0.141.0"
+    "crewai[tools]>=0.148.0"
 ]

 [tool.crewai]
--- a/src/crewai/crew.py
+++ b/src/crewai/crew.py
@@ -161,7 +161,7 @@ class Crew(FlowTrackable, BaseModel):
    )
    user_memory: Optional[InstanceOf[UserMemory]] = Field(
        default=None,
-        description="An instance of the UserMemory to be used by the Crew to store/fetch memories of a specific user.",
+        description="DEPRECATED: Will be removed in version 0.156.0 or on 2025-08-04, whichever comes first. Use external_memory instead.",
    )
    external_memory: Optional[InstanceOf[ExternalMemory]] = Field(
        default=None,
@@ -327,7 +327,7 @@ class Crew(FlowTrackable, BaseModel):
        self._short_term_memory = self.short_term_memory
        self._entity_memory = self.entity_memory

-        # UserMemory is gonna to be deprecated in the future, but we have to initialize a default value for now
+        # UserMemory will be removed in version 0.156.0 or on 2025-08-04, whichever comes first
        self._user_memory = None

        if self.memory:
@@ -1255,6 +1255,7 @@ class Crew(FlowTrackable, BaseModel):
        if self.external_memory:
            copied_data["external_memory"] = self.external_memory.model_copy(deep=True)
        if self.user_memory:
+            # DEPRECATED: UserMemory will be removed in version 0.156.0 or on 2025-08-04
            copied_data["user_memory"] = self.user_memory.model_copy(deep=True)

        copied_data.pop("agents", None)
@@ -1313,7 +1314,6 @@ class Crew(FlowTrackable, BaseModel):
        n_iterations: int,
        eval_llm: Union[str, InstanceOf[BaseLLM]],
        inputs: Optional[Dict[str, Any]] = None,
-        include_agent_eval: Optional[bool] = False
    ) -> None:
        """Test and evaluate the Crew with the given inputs for n iterations concurrently using concurrent.futures."""
        try:
@@ -1333,28 +1333,13 @@ class Crew(FlowTrackable, BaseModel):
            )
            test_crew = self.copy()

-            # TODO: Refator to use a single Evaluator Manage class
            evaluator = CrewEvaluator(test_crew, llm_instance)

-            if include_agent_eval:
-                from crewai.evaluation import create_default_evaluator
-                agent_evaluator = create_default_evaluator(crew=test_crew)
-
            for i in range(1, n_iterations + 1):
                evaluator.set_iteration(i)
-
-                if include_agent_eval:
-                    agent_evaluator.set_iteration(i)
-
                test_crew.kickoff(inputs=inputs)

-                # TODO: Refactor to use ListenerEvents instead of trigger each iteration manually
-                if include_agent_eval:
-                    agent_evaluator.evaluate_current_iteration()
-
            evaluator.print_crew_evaluation_result()
-            if include_agent_eval:
-                agent_evaluator.get_agent_evaluation(include_evaluation_feedback=True)

            crewai_event_bus.emit(
                self,
--- a/src/crewai/evaluation/agent_evaluator.py
+++ b/src/crewai/evaluation/agent_evaluator.py
@@ -1,178 +0,0 @@
-from crewai.evaluation.base_evaluator import AgentEvaluationResult, AggregationStrategy
-from crewai.agent import Agent
-from crewai.task import Task
-from crewai.evaluation.evaluation_display import EvaluationDisplayFormatter
-
-from typing import Any, Dict
-from collections import defaultdict
-from crewai.evaluation import BaseEvaluator, create_evaluation_callbacks
-from collections.abc import Sequence
-from crewai.crew import Crew
-from crewai.utilities.events.crewai_event_bus import crewai_event_bus
-from crewai.utilities.events.utils.console_formatter import ConsoleFormatter
-
-class AgentEvaluator:
-    def __init__(
-        self,
-        evaluators: Sequence[BaseEvaluator] | None = None,
-        crew: Crew | None = None,
-    ):
-        self.crew: Crew | None = crew
-        self.evaluators: Sequence[BaseEvaluator] | None = evaluators
-
-        self.agent_evaluators: dict[str, Sequence[BaseEvaluator] | None] = {}
-        if crew is not None:
-            assert crew and crew.agents is not None
-            for agent in crew.agents:
-                self.agent_evaluators[str(agent.id)] = self.evaluators
-
-        self.callback = create_evaluation_callbacks()
-        self.console_formatter = ConsoleFormatter()
-        self.display_formatter = EvaluationDisplayFormatter()
-
-        self.iteration = 1
-        self.iterations_results: dict[int, dict[str, list[AgentEvaluationResult]]] = {}
-
-    def set_iteration(self, iteration: int) -> None:
-        self.iteration = iteration
-
-    def evaluate_current_iteration(self) -> dict[str, list[AgentEvaluationResult]]:
-        if not self.crew:
-            raise ValueError("Cannot evaluate: no crew was provided to the evaluator.")
-
-        if not self.callback:
-            raise ValueError("Cannot evaluate: no callback was set. Use set_callback() method first.")
-
-        from rich.progress import Progress, SpinnerColumn, TextColumn, BarColumn
-        evaluation_results: defaultdict[str, list[AgentEvaluationResult]] = defaultdict(list)
-
-        total_evals = 0
-        for agent in self.crew.agents:
-            for task in self.crew.tasks:
-                if task.agent and task.agent.id == agent.id and self.agent_evaluators.get(str(agent.id)):
-                    total_evals += 1
-
-        with Progress(
-            SpinnerColumn(),
-            TextColumn("[bold blue]{task.description}[/bold blue]"),
-            BarColumn(),
-            TextColumn("{task.percentage:.0f}% completed"),
-            console=self.console_formatter.console
-        ) as progress:
-            eval_task = progress.add_task(f"Evaluating agents (iteration {self.iteration})...", total=total_evals)
-
-            for agent in self.crew.agents:
-                evaluator = self.agent_evaluators.get(str(agent.id))
-                if not evaluator:
-                    continue
-
-                for task in self.crew.tasks:
-
-                    if task.agent and str(task.agent.id) != str(agent.id):
-                        continue
-
-                    trace = self.callback.get_trace(str(agent.id), str(task.id))
-                    if not trace:
-                        self.console_formatter.print(f"[yellow]Warning: No trace found for agent {agent.role} on task {task.description[:30]}...[/yellow]")
-                        progress.update(eval_task, advance=1)
-                        continue
-
-                    with crewai_event_bus.scoped_handlers():
-                        result = self.evaluate(
-                            agent=agent,
-                            task=task,
-                            execution_trace=trace,
-                            final_output=task.output
-                        )
-                        evaluation_results[agent.role].append(result)
-                        progress.update(eval_task, advance=1)
-
-        self.iterations_results[self.iteration] = evaluation_results
-        return evaluation_results
-
-    def get_evaluation_results(self):
-        if self.iteration in self.iterations_results:
-            return self.iterations_results[self.iteration]
-
-        return self.evaluate_current_iteration()
-
-    def display_results_with_iterations(self):
-        self.display_formatter.display_summary_results(self.iterations_results)
-
-    def get_agent_evaluation(self, strategy: AggregationStrategy = AggregationStrategy.SIMPLE_AVERAGE, include_evaluation_feedback: bool = False):
-        agent_results = {}
-        with crewai_event_bus.scoped_handlers():
-            task_results = self.get_evaluation_results()
-            for agent_role, results in task_results.items():
-                if not results:
-                    continue
-
-                agent_id = results[0].agent_id
-
-                aggregated_result = self.display_formatter._aggregate_agent_results(
-                    agent_id=agent_id,
-                    agent_role=agent_role,
-                    results=results,
-                    strategy=strategy
-                )
-
-                agent_results[agent_role] = aggregated_result
-
-
-            if self.iteration == max(self.iterations_results.keys()):
-                self.display_results_with_iterations()
-
-            if include_evaluation_feedback:
-                self.display_evaluation_with_feedback()
-
-        return agent_results
-
-    def display_evaluation_with_feedback(self):
-        self.display_formatter.display_evaluation_with_feedback(self.iterations_results)
-
-    def evaluate(
-        self,
-        agent: Agent,
-        task: Task,
-        execution_trace: Dict[str, Any],
-        final_output: Any
-    ) -> AgentEvaluationResult:
-        result = AgentEvaluationResult(
-            agent_id=str(agent.id),
-            task_id=str(task.id)
-        )
-        assert self.evaluators is not None
-        for evaluator in self.evaluators:
-            try:
-                score = evaluator.evaluate(
-                    agent=agent,
-                    task=task,
-                    execution_trace=execution_trace,
-                    final_output=final_output
-                )
-                result.metrics[evaluator.metric_category] = score
-            except Exception as e:
-                self.console_formatter.print(f"Error in {evaluator.metric_category.value} evaluator: {str(e)}")
-
-        return result
-
-def create_default_evaluator(crew, llm=None):
-    from crewai.evaluation import (
-        GoalAlignmentEvaluator,
-        SemanticQualityEvaluator,
-        ToolSelectionEvaluator,
-        ParameterExtractionEvaluator,
-        ToolInvocationEvaluator,
-        ReasoningEfficiencyEvaluator
-    )
-
-    evaluators = [
-        GoalAlignmentEvaluator(llm=llm),
-        SemanticQualityEvaluator(llm=llm),
-        ToolSelectionEvaluator(llm=llm),
-        ParameterExtractionEvaluator(llm=llm),
-        ToolInvocationEvaluator(llm=llm),
-        ReasoningEfficiencyEvaluator(llm=llm),
-    ]
-
-    return AgentEvaluator(evaluators=evaluators, crew=crew)
--- a/src/crewai/experimental/__init__.py
+++ b/src/crewai/experimental/__init__.py
@@ -0,0 +1,40 @@
+from crewai.experimental.evaluation import (
+    BaseEvaluator,
+    EvaluationScore,
+    MetricCategory,
+    AgentEvaluationResult,
+    SemanticQualityEvaluator,
+    GoalAlignmentEvaluator,
+    ReasoningEfficiencyEvaluator,
+    ToolSelectionEvaluator,
+    ParameterExtractionEvaluator,
+    ToolInvocationEvaluator,
+    EvaluationTraceCallback,
+    create_evaluation_callbacks,
+    AgentEvaluator,
+    create_default_evaluator,
+    ExperimentRunner,
+    ExperimentResults,
+    ExperimentResult,
+)
+
+
+__all__ = [
+    "BaseEvaluator",
+    "EvaluationScore",
+    "MetricCategory",
+    "AgentEvaluationResult",
+    "SemanticQualityEvaluator",
+    "GoalAlignmentEvaluator",
+    "ReasoningEfficiencyEvaluator",
+    "ToolSelectionEvaluator",
+    "ParameterExtractionEvaluator",
+    "ToolInvocationEvaluator",
+    "EvaluationTraceCallback",
+    "create_evaluation_callbacks",
+    "AgentEvaluator",
+    "create_default_evaluator",
+    "ExperimentRunner",
+    "ExperimentResults",
+    "ExperimentResult"
+]
--- a/src/crewai/experimental/evaluation/__init__.py
+++ b/src/crewai/experimental/evaluation/__init__.py
@@ -1,40 +1,35 @@
-from crewai.evaluation.base_evaluator import (
+from crewai.experimental.evaluation.base_evaluator import (
    BaseEvaluator,
    EvaluationScore,
    MetricCategory,
    AgentEvaluationResult
 )

-from crewai.evaluation.metrics.semantic_quality_metrics import (
-    SemanticQualityEvaluator
-)
-
-from crewai.evaluation.metrics.goal_metrics import (
-    GoalAlignmentEvaluator
-)
-
-from crewai.evaluation.metrics.reasoning_metrics import (
-    ReasoningEfficiencyEvaluator
-)
-
-
-from crewai.evaluation.metrics.tools_metrics import (
+from crewai.experimental.evaluation.metrics import (
+    SemanticQualityEvaluator,
+    GoalAlignmentEvaluator,
+    ReasoningEfficiencyEvaluator,
    ToolSelectionEvaluator,
    ParameterExtractionEvaluator,
    ToolInvocationEvaluator
 )

-from crewai.evaluation.evaluation_listener import (
+from crewai.experimental.evaluation.evaluation_listener import (
    EvaluationTraceCallback,
    create_evaluation_callbacks
 )

-
-from crewai.evaluation.agent_evaluator import (
+from crewai.experimental.evaluation.agent_evaluator import (
    AgentEvaluator,
    create_default_evaluator
 )

+from crewai.experimental.evaluation.experiment import (
+    ExperimentRunner,
+    ExperimentResults,
+    ExperimentResult
+)
+
 __all__ = [
    "BaseEvaluator",
    "EvaluationScore",
@@ -49,5 +44,8 @@ __all__ = [
    "EvaluationTraceCallback",
    "create_evaluation_callbacks",
    "AgentEvaluator",
-    "create_default_evaluator"
-]
+    "create_default_evaluator",
+    "ExperimentRunner",
+    "ExperimentResults",
+    "ExperimentResult"
+]
--- a/src/crewai/experimental/evaluation/agent_evaluator.py
+++ b/src/crewai/experimental/evaluation/agent_evaluator.py
@@ -0,0 +1,245 @@
+import threading
+from typing import Any
+
+from crewai.experimental.evaluation.base_evaluator import AgentEvaluationResult, AggregationStrategy
+from crewai.agent import Agent
+from crewai.task import Task
+from crewai.experimental.evaluation.evaluation_display import EvaluationDisplayFormatter
+from crewai.utilities.events.agent_events import AgentEvaluationStartedEvent, AgentEvaluationCompletedEvent, AgentEvaluationFailedEvent
+from crewai.experimental.evaluation import BaseEvaluator, create_evaluation_callbacks
+from collections.abc import Sequence
+from crewai.utilities.events.crewai_event_bus import crewai_event_bus
+from crewai.utilities.events.utils.console_formatter import ConsoleFormatter
+from crewai.utilities.events.task_events import TaskCompletedEvent
+from crewai.utilities.events.agent_events import LiteAgentExecutionCompletedEvent
+from crewai.experimental.evaluation.base_evaluator import AgentAggregatedEvaluationResult, EvaluationScore, MetricCategory
+
+class ExecutionState:
+    def __init__(self):
+        self.traces = {}
+        self.current_agent_id: str | None = None
+        self.current_task_id: str | None = None
+        self.iteration = 1
+        self.iterations_results = {}
+        self.agent_evaluators = {}
+
+class AgentEvaluator:
+    def __init__(
+        self,
+        agents: list[Agent],
+        evaluators: Sequence[BaseEvaluator] | None = None,
+    ):
+        self.agents: list[Agent] = agents
+        self.evaluators: Sequence[BaseEvaluator] | None = evaluators
+
+        self.callback = create_evaluation_callbacks()
+        self.console_formatter = ConsoleFormatter()
+        self.display_formatter = EvaluationDisplayFormatter()
+
+        self._thread_local: threading.local = threading.local()
+
+        for agent in self.agents:
+            self._execution_state.agent_evaluators[str(agent.id)] = self.evaluators
+
+        self._subscribe_to_events()
+
+    @property
+    def _execution_state(self) -> ExecutionState:
+        if not hasattr(self._thread_local, 'execution_state'):
+            self._thread_local.execution_state = ExecutionState()
+        return self._thread_local.execution_state
+
+    def _subscribe_to_events(self) -> None:
+        from typing import cast
+        crewai_event_bus.register_handler(TaskCompletedEvent, cast(Any, self._handle_task_completed))
+        crewai_event_bus.register_handler(LiteAgentExecutionCompletedEvent, cast(Any, self._handle_lite_agent_completed))
+
+    def _handle_task_completed(self, source: Any, event: TaskCompletedEvent) -> None:
+        assert event.task is not None
+        agent = event.task.agent
+        if agent and str(getattr(agent, 'id', 'unknown')) in self._execution_state.agent_evaluators:
+            self.emit_evaluation_started_event(agent_role=agent.role, agent_id=str(agent.id), task_id=str(event.task.id))
+
+            state = ExecutionState()
+            state.current_agent_id = str(agent.id)
+            state.current_task_id = str(event.task.id)
+
+            assert state.current_agent_id is not None and state.current_task_id is not None
+            trace = self.callback.get_trace(state.current_agent_id, state.current_task_id)
+
+            if not trace:
+                return
+
+            result = self.evaluate(
+                agent=agent,
+                task=event.task,
+                execution_trace=trace,
+                final_output=event.output,
+                state=state
+            )
+
+            current_iteration = self._execution_state.iteration
+            if current_iteration not in self._execution_state.iterations_results:
+                self._execution_state.iterations_results[current_iteration] = {}
+
+            if agent.role not in self._execution_state.iterations_results[current_iteration]:
+                self._execution_state.iterations_results[current_iteration][agent.role] = []
+
+            self._execution_state.iterations_results[current_iteration][agent.role].append(result)
+
+    def _handle_lite_agent_completed(self, source: object, event: LiteAgentExecutionCompletedEvent) -> None:
+        agent_info = event.agent_info
+        agent_id = str(agent_info["id"])
+
+        if agent_id in self._execution_state.agent_evaluators:
+            state = ExecutionState()
+            state.current_agent_id = agent_id
+            state.current_task_id = "lite_task"
+
+            target_agent = None
+            for agent in self.agents:
+                if str(agent.id) == agent_id:
+                    target_agent = agent
+                    break
+
+            if not target_agent:
+                return
+
+            assert state.current_agent_id is not None and state.current_task_id is not None
+            trace = self.callback.get_trace(state.current_agent_id, state.current_task_id)
+
+            if not trace:
+                return
+
+            result = self.evaluate(
+                agent=target_agent,
+                execution_trace=trace,
+                final_output=event.output,
+                state=state
+            )
+
+            current_iteration = self._execution_state.iteration
+            if current_iteration not in self._execution_state.iterations_results:
+                self._execution_state.iterations_results[current_iteration] = {}
+
+            agent_role = target_agent.role
+            if agent_role not in self._execution_state.iterations_results[current_iteration]:
+                self._execution_state.iterations_results[current_iteration][agent_role] = []
+
+            self._execution_state.iterations_results[current_iteration][agent_role].append(result)
+
+    def set_iteration(self, iteration: int) -> None:
+        self._execution_state.iteration = iteration
+
+    def reset_iterations_results(self) -> None:
+        self._execution_state.iterations_results = {}
+
+    def get_evaluation_results(self) -> dict[str, list[AgentEvaluationResult]]:
+        if self._execution_state.iterations_results and self._execution_state.iteration in self._execution_state.iterations_results:
+            return self._execution_state.iterations_results[self._execution_state.iteration]
+        return {}
+
+    def display_results_with_iterations(self) -> None:
+        self.display_formatter.display_summary_results(self._execution_state.iterations_results)
+
+    def get_agent_evaluation(self, strategy: AggregationStrategy = AggregationStrategy.SIMPLE_AVERAGE, include_evaluation_feedback: bool = True) -> dict[str, AgentAggregatedEvaluationResult]:
+        agent_results = {}
+        with crewai_event_bus.scoped_handlers():
+            task_results = self.get_evaluation_results()
+            for agent_role, results in task_results.items():
+                if not results:
+                    continue
+
+                agent_id = results[0].agent_id
+
+                aggregated_result = self.display_formatter._aggregate_agent_results(
+                    agent_id=agent_id,
+                    agent_role=agent_role,
+                    results=results,
+                    strategy=strategy
+                )
+
+                agent_results[agent_role] = aggregated_result
+
+
+            if self._execution_state.iterations_results and self._execution_state.iteration == max(self._execution_state.iterations_results.keys(), default=0):
+                self.display_results_with_iterations()
+
+            if include_evaluation_feedback:
+                self.display_evaluation_with_feedback()
+
+        return agent_results
+
+    def display_evaluation_with_feedback(self) -> None:
+        self.display_formatter.display_evaluation_with_feedback(self._execution_state.iterations_results)
+
+    def evaluate(
+        self,
+        agent: Agent,
+        execution_trace: dict[str, Any],
+        final_output: Any,
+        state: ExecutionState,
+        task: Task | None = None,
+    ) -> AgentEvaluationResult:
+        result = AgentEvaluationResult(
+            agent_id=state.current_agent_id or str(agent.id),
+            task_id=state.current_task_id or (str(task.id) if task else "unknown_task")
+        )
+
+        assert self.evaluators is not None
+        task_id = str(task.id) if task else None
+        for evaluator in self.evaluators:
+            try:
+                self.emit_evaluation_started_event(agent_role=agent.role, agent_id=str(agent.id), task_id=task_id)
+                score = evaluator.evaluate(
+                    agent=agent,
+                    task=task,
+                    execution_trace=execution_trace,
+                    final_output=final_output
+                )
+                result.metrics[evaluator.metric_category] = score
+                self.emit_evaluation_completed_event(agent_role=agent.role, agent_id=str(agent.id), task_id=task_id, metric_category=evaluator.metric_category, score=score)
+            except Exception as e:
+                self.emit_evaluation_failed_event(agent_role=agent.role, agent_id=str(agent.id), task_id=task_id, error=str(e))
+                self.console_formatter.print(f"Error in {evaluator.metric_category.value} evaluator: {str(e)}")
+
+        return result
+
+    def emit_evaluation_started_event(self, agent_role: str, agent_id: str, task_id: str | None = None):
+        crewai_event_bus.emit(
+            self,
+            AgentEvaluationStartedEvent(agent_role=agent_role, agent_id=agent_id, task_id=task_id, iteration=self._execution_state.iteration)
+        )
+
+    def emit_evaluation_completed_event(self, agent_role: str, agent_id: str, task_id: str | None = None, metric_category: MetricCategory | None = None, score: EvaluationScore | None = None):
+        crewai_event_bus.emit(
+            self,
+            AgentEvaluationCompletedEvent(agent_role=agent_role, agent_id=agent_id, task_id=task_id, iteration=self._execution_state.iteration, metric_category=metric_category, score=score)
+        )
+
+    def emit_evaluation_failed_event(self, agent_role: str, agent_id: str, error: str, task_id: str | None = None):
+        crewai_event_bus.emit(
+            self,
+            AgentEvaluationFailedEvent(agent_role=agent_role, agent_id=agent_id, task_id=task_id, iteration=self._execution_state.iteration, error=error)
+        )
+
+def create_default_evaluator(agents: list[Agent], llm: Any = None):
+    from crewai.experimental.evaluation import (
+        GoalAlignmentEvaluator,
+        SemanticQualityEvaluator,
+        ToolSelectionEvaluator,
+        ParameterExtractionEvaluator,
+        ToolInvocationEvaluator,
+        ReasoningEfficiencyEvaluator
+    )
+
+    evaluators = [
+        GoalAlignmentEvaluator(llm=llm),
+        SemanticQualityEvaluator(llm=llm),
+        ToolSelectionEvaluator(llm=llm),
+        ParameterExtractionEvaluator(llm=llm),
+        ToolInvocationEvaluator(llm=llm),
+        ReasoningEfficiencyEvaluator(llm=llm),
+    ]
+
+    return AgentEvaluator(evaluators=evaluators, agents=agents)
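Note on the file above: the `_execution_state` property lazily creates one `ExecutionState` per thread through `threading.local`. The sketch below reproduces only that pattern in isolation. The `Evaluator` class and the `iteration_seen_in_worker` helper are hypothetical names for this note, and the state object is reduced to two fields.

```python
import threading


class ExecutionState:
    """Reduced stand-in for the per-thread state in the diff."""

    def __init__(self) -> None:
        self.iteration = 1
        self.iterations_results: dict = {}


class Evaluator:
    """Hypothetical class showing only the thread-local property."""

    def __init__(self) -> None:
        self._thread_local = threading.local()

    @property
    def _execution_state(self) -> ExecutionState:
        # Created lazily, once for each thread that touches the property.
        if not hasattr(self._thread_local, "execution_state"):
            self._thread_local.execution_state = ExecutionState()
        return self._thread_local.execution_state


def iteration_seen_in_worker(evaluator: Evaluator) -> int:
    """Return the iteration value observed from a freshly started thread."""
    seen: list = []
    worker = threading.Thread(
        target=lambda: seen.append(evaluator._execution_state.iteration)
    )
    worker.start()
    worker.join()
    return seen[0]
```

In this sketch, a value set on the constructing thread is not visible from a worker thread, which gets a fresh state with the default `iteration`. Whether that matters for the real class depends on which thread the event handlers run on, which this diff does not show.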
--- a/src/crewai/experimental/evaluation/base_evaluator.py
+++ b/src/crewai/experimental/evaluation/base_evaluator.py
@@ -57,9 +57,9 @@ class BaseEvaluator(abc.ABC):
    def evaluate(
        self,
        agent: Agent,
-        task: Task,
        execution_trace: Dict[str, Any],
        final_output: Any,
+        task: Task | None = None,
    ) -> EvaluationScore:
        pass

--- a/src/crewai/experimental/evaluation/evaluation_display.py
+++ b/src/crewai/experimental/evaluation/evaluation_display.py
@@ -3,8 +3,8 @@ from typing import Dict, Any, List
 from rich.table import Table
 from rich.box import HEAVY_EDGE, ROUNDED
 from collections.abc import Sequence
-from crewai.evaluation.base_evaluator import AgentAggregatedEvaluationResult, AggregationStrategy, AgentEvaluationResult, MetricCategory
-from crewai.evaluation import EvaluationScore
+from crewai.experimental.evaluation.base_evaluator import AgentAggregatedEvaluationResult, AggregationStrategy, AgentEvaluationResult, MetricCategory
+from crewai.experimental.evaluation import EvaluationScore
 from crewai.utilities.events.utils.console_formatter import ConsoleFormatter
 from crewai.utilities.llm_utils import create_llm

@@ -17,7 +17,6 @@ class EvaluationDisplayFormatter:
            self.console_formatter.print("[yellow]No evaluation results to display[/yellow]")
            return

-        # Get all agent roles across all iterations
        all_agent_roles: set[str] = set()
        for iter_results in iterations_results.values():
            all_agent_roles.update(iter_results.keys())
@@ -25,7 +24,6 @@ class EvaluationDisplayFormatter:
        for agent_role in sorted(all_agent_roles):
            self.console_formatter.print(f"\n[bold cyan]Agent: {agent_role}[/bold cyan]")

-            # Process each iteration
            for iter_num, results in sorted(iterations_results.items()):
                if agent_role not in results or not results[agent_role]:
                    continue
@@ -33,23 +31,19 @@ class EvaluationDisplayFormatter:
                agent_results = results[agent_role]
                agent_id = agent_results[0].agent_id

-                # Aggregate results for this agent in this iteration
                aggregated_result = self._aggregate_agent_results(
                    agent_id=agent_id,
                    agent_role=agent_role,
                    results=agent_results,
                )

-                # Display iteration header
                self.console_formatter.print(f"\n[bold]Iteration {iter_num}[/bold]")

-                # Create table for this iteration
                table = Table(box=ROUNDED)
                table.add_column("Metric", style="cyan")
                table.add_column("Score (1-10)", justify="center")
                table.add_column("Feedback", style="green")

-                # Add metrics to table
                if aggregated_result.metrics:
                    for metric, evaluation_score in aggregated_result.metrics.items():
                        score = evaluation_score.score
@@ -91,7 +85,6 @@ class EvaluationDisplayFormatter:
                        "Overall agent evaluation score"
                    )

-                # Print the table for this iteration
                self.console_formatter.print(table)

    def display_summary_results(self, iterations_results: Dict[int, Dict[str, List[AgentAggregatedEvaluationResult]]]):
@@ -248,7 +241,6 @@ class EvaluationDisplayFormatter:
            feedback_summary = None
            if feedbacks:
                if len(feedbacks) > 1:
-                    # Use the summarization method for multiple feedbacks
                    feedback_summary = self._summarize_feedbacks(
                        agent_role=agent_role,
                        metric=category.title(),
@@ -307,7 +299,7 @@ class EvaluationDisplayFormatter:
                strategy_guidance = "Focus on the highest-scoring aspects and strengths demonstrated."
            elif strategy == AggregationStrategy.WORST_PERFORMANCE:
                strategy_guidance = "Focus on areas that need improvement and common issues across tasks."
-            else:  # Default/average strategies
+            else:
                strategy_guidance = "Provide a balanced analysis of strengths and weaknesses across all tasks."

            prompt = [
--- a/src/crewai/experimental/evaluation/evaluation_listener.py
+++ b/src/crewai/experimental/evaluation/evaluation_listener.py
@@ -9,7 +9,9 @@ from crewai.utilities.events.base_event_listener import BaseEventListener
 from crewai.utilities.events.crewai_event_bus import CrewAIEventsBus
 from crewai.utilities.events.agent_events import (
    AgentExecutionStartedEvent,
-    AgentExecutionCompletedEvent
+    AgentExecutionCompletedEvent,
+    LiteAgentExecutionStartedEvent,
+    LiteAgentExecutionCompletedEvent
 )
 from crewai.utilities.events.tool_usage_events import (
    ToolUsageFinishedEvent,
@@ -52,10 +54,18 @@ class EvaluationTraceCallback(BaseEventListener):
        def on_agent_started(source, event: AgentExecutionStartedEvent):
            self.on_agent_start(event.agent, event.task)

+        @event_bus.on(LiteAgentExecutionStartedEvent)
+        def on_lite_agent_started(source, event: LiteAgentExecutionStartedEvent):
+            self.on_lite_agent_start(event.agent_info)
+
        @event_bus.on(AgentExecutionCompletedEvent)
        def on_agent_completed(source, event: AgentExecutionCompletedEvent):
            self.on_agent_finish(event.agent, event.task, event.output)

+        @event_bus.on(LiteAgentExecutionCompletedEvent)
+        def on_lite_agent_completed(source, event: LiteAgentExecutionCompletedEvent):
+            self.on_lite_agent_finish(event.output)
+
        @event_bus.on(ToolUsageFinishedEvent)
        def on_tool_completed(source, event: ToolUsageFinishedEvent):
            self.on_tool_use(event.tool_name, event.tool_args, event.output, success=True)
@@ -88,19 +98,38 @@ class EvaluationTraceCallback(BaseEventListener):
        def on_llm_call_completed(source, event: LLMCallCompletedEvent):
            self.on_llm_call_end(event.messages, event.response)

+    def on_lite_agent_start(self, agent_info: dict[str, Any]):
+        self.current_agent_id = agent_info['id']
+        self.current_task_id = "lite_task"
+
+        trace_key = f"{self.current_agent_id}_{self.current_task_id}"
+        self._init_trace(
+            trace_key=trace_key,
+            agent_id=self.current_agent_id,
+            task_id=self.current_task_id,
+            tool_uses=[],
+            llm_calls=[],
+            start_time=datetime.now(),
+            final_output=None
+        )
+
+    def _init_trace(self, trace_key: str, **kwargs: Any):
+        self.traces[trace_key] = kwargs
+
    def on_agent_start(self, agent: Agent, task: Task):
        self.current_agent_id = agent.id
        self.current_task_id = task.id

        trace_key = f"{agent.id}_{task.id}"
-        self.traces[trace_key] = {
-            "agent_id": agent.id,
-            "task_id": task.id,
-            "tool_uses": [],
-            "llm_calls": [],
-            "start_time": datetime.now(),
-            "final_output": None
-        }
+        self._init_trace(
+            trace_key=trace_key,
+            agent_id=agent.id,
+            task_id=task.id,
+            tool_uses=[],
+            llm_calls=[],
+            start_time=datetime.now(),
+            final_output=None
+        )

    def on_agent_finish(self, agent: Agent, task: Task, output: Any):
        trace_key = f"{agent.id}_{task.id}"
@@ -108,9 +137,20 @@ class EvaluationTraceCallback(BaseEventListener):
            self.traces[trace_key]["final_output"] = output
            self.traces[trace_key]["end_time"] = datetime.now()

+        self._reset_current()
+
+    def _reset_current(self):
        self.current_agent_id = None
        self.current_task_id = None

+    def on_lite_agent_finish(self, output: Any):
+        trace_key = f"{self.current_agent_id}_lite_task"
+        if trace_key in self.traces:
+            self.traces[trace_key]["final_output"] = output
+            self.traces[trace_key]["end_time"] = datetime.now()
+
+        self._reset_current()
+
    def on_tool_use(self, tool_name: str, tool_args: dict[str, Any] | str, result: Any,
                   success: bool = True, error_type: str | None = None):
        if not self.current_agent_id or not self.current_task_id:
@@ -187,4 +227,8 @@ class EvaluationTraceCallback(BaseEventListener):


 def create_evaluation_callbacks() -> EvaluationTraceCallback:
-    return EvaluationTraceCallback()
+    from crewai.utilities.events.crewai_event_bus import crewai_event_bus
+
+    callback = EvaluationTraceCallback()
+    callback.setup_listeners(crewai_event_bus)
+    return callback
--- a/src/crewai/experimental/evaluation/experiment/__init__.py
+++ b/src/crewai/experimental/evaluation/experiment/__init__.py
@@ -0,0 +1,8 @@
+from crewai.experimental.evaluation.experiment.runner import ExperimentRunner
+from crewai.experimental.evaluation.experiment.result import ExperimentResults, ExperimentResult
+
+__all__ = [
+    "ExperimentRunner",
+    "ExperimentResults",
+    "ExperimentResult"
+]
--- a/src/crewai/experimental/evaluation/experiment/result.py
+++ b/src/crewai/experimental/evaluation/experiment/result.py
@@ -0,0 +1,122 @@
+import json
+import os
+from datetime import datetime, timezone
+from typing import Any
+from pydantic import BaseModel
+
+class ExperimentResult(BaseModel):
+    identifier: str
+    inputs: dict[str, Any]
+    score: int | float | dict[str, int | float]
+    expected_score: int | float | dict[str, int | float]
+    passed: bool
+    agent_evaluations: dict[str, Any] | None = None
+
+class ExperimentResults:
+    def __init__(self, results: list[ExperimentResult], metadata: dict[str, Any] | None = None):
+        self.results = results
+        self.metadata = metadata or {}
+        self.timestamp = datetime.now(timezone.utc)
+
+        from crewai.experimental.evaluation.experiment.result_display import ExperimentResultsDisplay
+        self.display = ExperimentResultsDisplay()
+
+    def to_json(self, filepath: str | None = None) -> dict[str, Any]:
+        data = {
+            "timestamp": self.timestamp.isoformat(),
+            "metadata": self.metadata,
+            "results": [r.model_dump(exclude={"agent_evaluations"}) for r in self.results]
+        }
+
+        if filepath:
+            with open(filepath, 'w') as f:
+                json.dump(data, f, indent=2)
+            self.display.console.print(f"[green]Results saved to {filepath}[/green]")
+
+        return data
+
+    def compare_with_baseline(self, baseline_filepath: str, save_current: bool = True, print_summary: bool = False) -> dict[str, Any]:
+        baseline_runs = []
+
+        if os.path.exists(baseline_filepath) and os.path.getsize(baseline_filepath) > 0:
+            try:
+                with open(baseline_filepath, 'r') as f:
+                    baseline_data = json.load(f)
+
+                if isinstance(baseline_data, dict) and "timestamp" in baseline_data:
+                    baseline_runs = [baseline_data]
+                elif isinstance(baseline_data, list):
+                    baseline_runs = baseline_data
+            except (json.JSONDecodeError, FileNotFoundError) as e:
+                self.display.console.print(f"[yellow]Warning: Could not load baseline file: {str(e)}[/yellow]")
+
+        if not baseline_runs:
+            if save_current:
+                current_data = self.to_json()
+                with open(baseline_filepath, 'w') as f:
+                    json.dump([current_data], f, indent=2)
+                self.display.console.print(f"[green]Saved current results as new baseline to {baseline_filepath}[/green]")
+            return {"is_baseline": True, "changes": {}}
+
+        baseline_runs.sort(key=lambda x: x.get("timestamp", ""), reverse=True)
+        latest_run = baseline_runs[0]
+
+        comparison = self._compare_with_run(latest_run)
+
+        if print_summary:
+            self.display.comparison_summary(comparison, latest_run["timestamp"])
+
+        if save_current:
+            current_data = self.to_json()
+            baseline_runs.append(current_data)
+            with open(baseline_filepath, 'w') as f:
+                json.dump(baseline_runs, f, indent=2)
+            self.display.console.print(f"[green]Added current results to baseline file {baseline_filepath}[/green]")
+
+        return comparison
+
+    def _compare_with_run(self, baseline_run: dict[str, Any]) -> dict[str, Any]:
+        baseline_results = baseline_run.get("results", [])
+
+        baseline_lookup = {}
+        for result in baseline_results:
+            test_identifier = result.get("identifier")
+            if test_identifier:
+                baseline_lookup[test_identifier] = result
+
+        improved = []
+        regressed = []
+        unchanged = []
+        new_tests = []
+
+        for result in self.results:
+            test_identifier = result.identifier
+            if not test_identifier or test_identifier not in baseline_lookup:
+                new_tests.append(test_identifier)
+                continue
+
+            baseline_result = baseline_lookup[test_identifier]
+            baseline_passed = baseline_result.get("passed", False)
+            if result.passed and not baseline_passed:
+                improved.append(test_identifier)
+            elif not result.passed and baseline_passed:
+                regressed.append(test_identifier)
+            else:
+                unchanged.append(test_identifier)
+
+        missing_tests = []
+        current_test_identifiers = {result.identifier for result in self.results}
+        for result in baseline_results:
+            test_identifier = result.get("identifier")
+            if test_identifier and test_identifier not in current_test_identifiers:
+                missing_tests.append(test_identifier)
+
+        return {
+            "improved": improved,
+            "regressed": regressed,
+            "unchanged": unchanged,
+            "new_tests": new_tests,
+            "missing_tests": missing_tests,
+            "total_compared": len(improved) + len(regressed) + len(unchanged),
+            "baseline_timestamp": baseline_run.get("timestamp", "unknown")
+        }
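
Reviewer note: the bucketing done by `_compare_with_run` can be restated on plain dictionaries, which may help when reading the comparison output. This is a minimal sketch, not the code in this change; the function name `classify_against_baseline` and the `{identifier: passed}` input shape are assumptions made for illustration.

```python
from __future__ import annotations


def classify_against_baseline(
    current: dict[str, bool], baseline: dict[str, bool]
) -> dict[str, list[str]]:
    """Bucket test identifiers by how their pass/fail status changed.

    Hypothetical helper: both arguments map a test identifier to its
    `passed` flag, a simplification of the result objects in the diff.
    """
    buckets = {
        "improved": [],
        "regressed": [],
        "unchanged": [],
        "new_tests": [],
        "missing_tests": [],
    }
    for identifier, passed in current.items():
        if identifier not in baseline:
            # Not present in the baseline run, so nothing to compare against.
            buckets["new_tests"].append(identifier)
        elif passed and not baseline[identifier]:
            buckets["improved"].append(identifier)
        elif not passed and baseline[identifier]:
            buckets["regressed"].append(identifier)
        else:
            buckets["unchanged"].append(identifier)
    # Baseline tests that did not run this time.
    buckets["missing_tests"] = [i for i in baseline if i not in current]
    return buckets


current = {"t1": True, "t2": False, "t3": True, "t4": True}
baseline = {"t1": False, "t2": True, "t3": True, "t5": True}
report = classify_against_baseline(current, baseline)
```

Only `improved`, `regressed` and `unchanged` contribute to `total_compared` in the diff; new and missing tests are reported separately.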
--- a/src/crewai/experimental/evaluation/experiment/result_display.py
+++ b/src/crewai/experimental/evaluation/experiment/result_display.py
@@ -0,0 +1,70 @@
+from typing import Dict, Any
+from rich.console import Console
+from rich.table import Table
+from rich.panel import Panel
+from crewai.experimental.evaluation.experiment.result import ExperimentResults
+
+class ExperimentResultsDisplay:
+    def __init__(self):
+        self.console = Console()
+
+    def summary(self, experiment_results: ExperimentResults):
+        total = len(experiment_results.results)
+        passed = sum(1 for r in experiment_results.results if r.passed)
+
+        table = Table(title="Experiment Summary")
+        table.add_column("Metric", style="cyan")
+        table.add_column("Value", style="green")
+
+        table.add_row("Total Test Cases", str(total))
+        table.add_row("Passed", str(passed))
+        table.add_row("Failed", str(total - passed))
+        table.add_row("Success Rate", f"{(passed / total * 100):.1f}%" if total > 0 else "N/A")
+
+        self.console.print(table)
+
+    def comparison_summary(self, comparison: Dict[str, Any], baseline_timestamp: str):
+        self.console.print(Panel(f"[bold]Comparison with baseline run from {baseline_timestamp}[/bold]",
+                                 expand=False))
+
+        table = Table(title="Results Comparison")
+        table.add_column("Metric", style="cyan")
+        table.add_column("Count", style="white")
+        table.add_column("Details", style="dim")
+
+        improved = comparison.get("improved", [])
+        if improved:
+            details = ", ".join([f"{test_identifier}" for test_identifier in improved[:3]])
+            if len(improved) > 3:
+                details += f" and {len(improved) - 3} more"
+            table.add_row("✅ Improved", str(len(improved)), details)
+        else:
+            table.add_row("✅ Improved", "0", "")
+
+        regressed = comparison.get("regressed", [])
+        if regressed:
+            details = ", ".join([f"{test_identifier}" for test_identifier in regressed[:3]])
+            if len(regressed) > 3:
+                details += f" and {len(regressed) - 3} more"
+            table.add_row("❌ Regressed", str(len(regressed)), details, style="red")
+        else:
+            table.add_row("❌ Regressed", "0", "")
+
+        unchanged = comparison.get("unchanged", [])
+        table.add_row("⏺ Unchanged", str(len(unchanged)), "")
+
+        new_tests = comparison.get("new_tests", [])
+        if new_tests:
+            details = ", ".join(new_tests[:3])
+            if len(new_tests) > 3:
+                details += f" and {len(new_tests) - 3} more"
+            table.add_row("➕ New Tests", str(len(new_tests)), details)
+
+        missing_tests = comparison.get("missing_tests", [])
+        if missing_tests:
+            details = ", ".join(missing_tests[:3])
+            if len(missing_tests) > 3:
+                details += f" and {len(missing_tests) - 3} more"
+            table.add_row("➖ Missing Tests", str(len(missing_tests)), details)
+
+        self.console.print(table)
--- a/src/crewai/experimental/evaluation/experiment/runner.py
+++ b/src/crewai/experimental/evaluation/experiment/runner.py
@@ -0,0 +1,125 @@
+from collections import defaultdict
+from hashlib import md5
+from typing import Any
+
+from crewai import Crew, Agent
+from crewai.experimental.evaluation import AgentEvaluator, create_default_evaluator
+from crewai.experimental.evaluation.experiment.result_display import ExperimentResultsDisplay
+from crewai.experimental.evaluation.experiment.result import ExperimentResults, ExperimentResult
+from crewai.experimental.evaluation.evaluation_display import AgentAggregatedEvaluationResult
+
+class ExperimentRunner:
+    def __init__(self, dataset: list[dict[str, Any]]):
+        self.dataset = dataset or []
+        self.evaluator: AgentEvaluator | None = None
+        self.display = ExperimentResultsDisplay()
+
+    def run(self, crew: Crew | None = None, agents: list[Agent] | None = None, print_summary: bool = False) -> ExperimentResults:
+        if crew and not agents:
+            agents = crew.agents
+
+        assert agents is not None
+        self.evaluator = create_default_evaluator(agents=agents)
+
+        results = []
+
+        for test_case in self.dataset:
+            self.evaluator.reset_iterations_results()
+            result = self._run_test_case(test_case=test_case, crew=crew, agents=agents)
+            results.append(result)
+
+        experiment_results = ExperimentResults(results)
+
+        if print_summary:
+            self.display.summary(experiment_results)
+
+        return experiment_results
+
+    def _run_test_case(self, test_case: dict[str, Any], agents: list[Agent], crew: Crew | None = None) -> ExperimentResult:
+        inputs = test_case["inputs"]
+        expected_score = test_case["expected_score"]
+        identifier = test_case.get("identifier") or md5(str(test_case).encode(), usedforsecurity=False).hexdigest()
+
+        try:
+            self.display.console.print(f"[dim]Running crew with input: {str(inputs)[:50]}...[/dim]")
+            self.display.console.print("\n")
+            if crew:
+                crew.kickoff(inputs=inputs)
+            else:
+                for agent in agents:
+                    agent.kickoff(**inputs)
+
+            assert self.evaluator is not None
+            agent_evaluations = self.evaluator.get_agent_evaluation()
+
+            actual_score = self._extract_scores(agent_evaluations)
+
+            passed = self._assert_scores(expected_score, actual_score)
+            return ExperimentResult(
+                identifier=identifier,
+                inputs=inputs,
+                score=actual_score,
+                expected_score=expected_score,
+                passed=passed,
+                agent_evaluations=agent_evaluations
+            )
+
+        except Exception as e:
+            self.display.console.print(f"[red]Error running test case: {str(e)}[/red]")
+            return ExperimentResult(
+                identifier=identifier,
+                inputs=inputs,
+                score=0,
+                expected_score=expected_score,
+                passed=False
+            )
+
+    def _extract_scores(self, agent_evaluations: dict[str, AgentAggregatedEvaluationResult]) -> float | dict[str, float]:
+        all_scores: dict[str, list[float]] = defaultdict(list)
+        for evaluation in agent_evaluations.values():
+            for metric_name, score in evaluation.metrics.items():
+                if score.score is not None:
+                    all_scores[metric_name.value].append(score.score)
+
+        avg_scores = {m: sum(s)/len(s) for m, s in all_scores.items()}
+
+        if len(avg_scores) == 1:
+            return list(avg_scores.values())[0]
+
+        return avg_scores
+
+    def _assert_scores(self, expected: float | dict[str, float],
+                        actual: float | dict[str, float]) -> bool:
+        """
+        Compare expected and actual scores, and return whether the test case passed.
+
+        The rules for comparison are as follows:
+        - If both expected and actual scores are single numbers, the actual score must be >= expected.
+        - If expected is a single number and actual is a dict, compare against the average of actual values.
+        - If expected is a dict and actual is a single number, actual must be >= all expected values.
+        - If both are dicts, every key present in both must have actual >= expected; no shared keys fails, and an empty expected dict passes.
+        """
+
+        if isinstance(expected, (int, float)) and isinstance(actual, (int, float)):
+            return actual >= expected
+
+        if isinstance(expected, dict) and isinstance(actual, (int, float)):
+            return all(actual >= exp_score for exp_score in expected.values())
+
+        if isinstance(expected, (int, float)) and isinstance(actual, dict):
+            if not actual:
+                return False
+            avg_score = sum(actual.values()) / len(actual)
+            return avg_score >= expected
+
+        if isinstance(expected, dict) and isinstance(actual, dict):
+            if not expected:
+                return True
+            matching_keys = set(expected.keys()) & set(actual.keys())
+            if not matching_keys:
+                return False
+
+            # All matching keys must have actual >= expected
+            return all(actual[key] >= expected[key] for key in matching_keys)
+
+        return False
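
Reviewer note: the four comparison rules in the `_assert_scores` docstring are easier to check against concrete values. The following is a standalone sketch that mirrors those rules outside the class; the name `scores_pass` and the metric names in the examples are illustrative assumptions, not identifiers from this change.

```python
from __future__ import annotations

from typing import Union

Score = Union[int, float, dict]


def scores_pass(expected: Score, actual: Score) -> bool:
    """Standalone restatement of the expected-vs-actual comparison rules."""
    expected_is_num = isinstance(expected, (int, float))
    actual_is_num = isinstance(actual, (int, float))

    if expected_is_num and actual_is_num:
        return actual >= expected
    if isinstance(expected, dict) and actual_is_num:
        # A single actual score must clear every expected threshold.
        return all(actual >= value for value in expected.values())
    if expected_is_num and isinstance(actual, dict):
        if not actual:
            return False
        # A single expected threshold is compared with the mean of actual scores.
        return sum(actual.values()) / len(actual) >= expected
    if isinstance(expected, dict) and isinstance(actual, dict):
        if not expected:
            return True
        # Only keys present on both sides are compared.
        shared = expected.keys() & actual.keys()
        return bool(shared) and all(actual[k] >= expected[k] for k in shared)
    return False


# Metric names below are made up for the example.
single_ok = scores_pass(7, 8.5)
averaged_ok = scores_pass(7, {"goal_alignment": 6, "semantic_quality": 9})
partial_overlap = scores_pass({"goal_alignment": 5, "unmeasured": 10}, {"goal_alignment": 6})
```

Note the consequence of the last rule: an expected key that never appears in the actual scores (here `unmeasured`) is skipped rather than treated as a failure.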
--- a/src/crewai/experimental/evaluation/json_parser.py
+++ b/src/crewai/experimental/evaluation/json_parser.py
--- a/src/crewai/experimental/evaluation/metrics/__init__.py
+++ b/src/crewai/experimental/evaluation/metrics/__init__.py
@@ -0,0 +1,26 @@
+from crewai.experimental.evaluation.metrics.reasoning_metrics import (
+    ReasoningEfficiencyEvaluator
+)
+
+from crewai.experimental.evaluation.metrics.tools_metrics import (
+    ToolSelectionEvaluator,
+    ParameterExtractionEvaluator,
+    ToolInvocationEvaluator
+)
+
+from crewai.experimental.evaluation.metrics.goal_metrics import (
+    GoalAlignmentEvaluator
+)
+
+from crewai.experimental.evaluation.metrics.semantic_quality_metrics import (
+    SemanticQualityEvaluator
+)
+
+__all__ = [
+    "ReasoningEfficiencyEvaluator",
+    "ToolSelectionEvaluator",
+    "ParameterExtractionEvaluator",
+    "ToolInvocationEvaluator",
+    "GoalAlignmentEvaluator",
+    "SemanticQualityEvaluator"
+]
--- a/src/crewai/experimental/evaluation/metrics/goal_metrics.py
+++ b/src/crewai/experimental/evaluation/metrics/goal_metrics.py
@@ -3,8 +3,8 @@ from typing import Any, Dict
 from crewai.agent import Agent
 from crewai.task import Task

-from crewai.evaluation.base_evaluator import BaseEvaluator, EvaluationScore, MetricCategory
-from crewai.evaluation.json_parser import extract_json_from_llm_response
+from crewai.experimental.evaluation.base_evaluator import BaseEvaluator, EvaluationScore, MetricCategory
+from crewai.experimental.evaluation.json_parser import extract_json_from_llm_response

 class GoalAlignmentEvaluator(BaseEvaluator):
    @property
@@ -14,10 +14,14 @@ class GoalAlignmentEvaluator(BaseEvaluator):
    def evaluate(
        self,
        agent: Agent,
-        task: Task,
        execution_trace: Dict[str, Any],
        final_output: Any,
+        task: Task | None = None,
    ) -> EvaluationScore:
+        task_context = ""
+        if task is not None:
+            task_context = f"Task description: {task.description}\nExpected output: {task.expected_output}\n"
+
        prompt = [
            {"role": "system", "content": """You are an expert evaluator assessing how well an AI agent's output aligns with its assigned task goal.

@@ -37,8 +41,7 @@ Return your evaluation as JSON with fields 'score' (number) and 'feedback' (stri
            {"role": "user", "content": f"""
 Agent role: {agent.role}
 Agent goal: {agent.goal}
-Task description: {task.description}
-Expected output: {task.expected_output}
+{task_context}

 Agent's final output:
 {final_output}
--- a/src/crewai/experimental/evaluation/metrics/reasoning_metrics.py
+++ b/src/crewai/experimental/evaluation/metrics/reasoning_metrics.py
@@ -16,8 +16,8 @@ from collections.abc import Sequence
 from crewai.agent import Agent
 from crewai.task import Task

-from crewai.evaluation.base_evaluator import BaseEvaluator, EvaluationScore, MetricCategory
-from crewai.evaluation.json_parser import extract_json_from_llm_response
+from crewai.experimental.evaluation.base_evaluator import BaseEvaluator, EvaluationScore, MetricCategory
+from crewai.experimental.evaluation.json_parser import extract_json_from_llm_response
 from crewai.tasks.task_output import TaskOutput

 class ReasoningPatternType(Enum):
@@ -36,10 +36,14 @@ class ReasoningEfficiencyEvaluator(BaseEvaluator):
    def evaluate(
        self,
        agent: Agent,
-        task: Task,
        execution_trace: Dict[str, Any],
-        final_output: TaskOutput,
+        final_output: TaskOutput | str,
+        task: Task | None = None,
    ) -> EvaluationScore:
+        task_context = ""
+        if task is not None:
+            task_context = f"Task description: {task.description}\nExpected output: {task.expected_output}\n"
+
        llm_calls = execution_trace.get("llm_calls", [])

        if not llm_calls or len(llm_calls) < 2:
@@ -83,6 +87,8 @@ class ReasoningEfficiencyEvaluator(BaseEvaluator):

        call_samples = self._get_call_samples(llm_calls)

+        final_output = final_output.raw if isinstance(final_output, TaskOutput) else final_output
+
        prompt = [
            {"role": "system", "content": """You are an expert evaluator assessing the reasoning efficiency of an AI agent's thought process.

@@ -117,7 +123,7 @@ Return your evaluation as JSON with the following structure:
 }"""},
            {"role": "user", "content": f"""
 Agent role: {agent.role}
-Task description: {task.description}
+{task_context}

 Reasoning efficiency metrics:
 - Total LLM calls: {efficiency_metrics["total_llm_calls"]}
@@ -130,7 +136,7 @@ Sample of agent reasoning flow (chronological sequence):
 {call_samples}

 Agent's final output:
-{final_output.raw[:500]}... (truncated)
+{final_output[:500]}... (truncated)

 Evaluate the reasoning efficiency of this agent based on these interaction patterns.
 Identify any inefficient reasoning patterns and provide specific suggestions for optimization.
--- a/src/crewai/experimental/evaluation/metrics/semantic_quality_metrics.py
+++ b/src/crewai/experimental/evaluation/metrics/semantic_quality_metrics.py
@@ -3,8 +3,8 @@ from typing import Any, Dict
 from crewai.agent import Agent
 from crewai.task import Task

-from crewai.evaluation.base_evaluator import BaseEvaluator, EvaluationScore, MetricCategory
-from crewai.evaluation.json_parser import extract_json_from_llm_response
+from crewai.experimental.evaluation.base_evaluator import BaseEvaluator, EvaluationScore, MetricCategory
+from crewai.experimental.evaluation.json_parser import extract_json_from_llm_response

 class SemanticQualityEvaluator(BaseEvaluator):
    @property
@@ -14,10 +14,13 @@ class SemanticQualityEvaluator(BaseEvaluator):
    def evaluate(
        self,
        agent: Agent,
-        task: Task,
        execution_trace: Dict[str, Any],
        final_output: Any,
+        task: Task | None = None,
    ) -> EvaluationScore:
+        task_context = ""
+        if task is not None:
+            task_context = f"Task description: {task.description}"
        prompt = [
            {"role": "system", "content": """You are an expert evaluator assessing the semantic quality of an AI agent's output.

@@ -37,7 +40,7 @@ Return your evaluation as JSON with fields 'score' (number) and 'feedback' (stri
 """},
            {"role": "user", "content": f"""
 Agent role: {agent.role}
-Task description: {task.description}
+{task_context}

 Agent's final output:
 {final_output}
--- a/src/crewai/experimental/evaluation/metrics/tools_metrics.py
+++ b/src/crewai/experimental/evaluation/metrics/tools_metrics.py
@@ -1,8 +1,8 @@
 import json
 from typing import Dict, Any

-from crewai.evaluation.base_evaluator import BaseEvaluator, EvaluationScore, MetricCategory
-from crewai.evaluation.json_parser import extract_json_from_llm_response
+from crewai.experimental.evaluation.base_evaluator import BaseEvaluator, EvaluationScore, MetricCategory
+from crewai.experimental.evaluation.json_parser import extract_json_from_llm_response
 from crewai.agent import Agent
 from crewai.task import Task

@@ -16,10 +16,14 @@ class ToolSelectionEvaluator(BaseEvaluator):
    def evaluate(
        self,
        agent: Agent,
-        task: Task,
        execution_trace: Dict[str, Any],
        final_output: str,
+        task: Task | None = None,
    ) -> EvaluationScore:
+        task_context = ""
+        if task is not None:
+            task_context = f"Task description: {task.description}"
+
        tool_uses = execution_trace.get("tool_uses", [])
        tool_count = len(tool_uses)
        unique_tool_types = set([tool.get("tool", "Unknown tool") for tool in tool_uses])
@@ -72,7 +76,7 @@ Return your evaluation as JSON with these fields:
 """},
            {"role": "user", "content": f"""
 Agent role: {agent.role}
-Task description: {task.description}
+{task_context}

 Available tools for this agent:
 {available_tools_info}
@@ -128,10 +132,13 @@ class ParameterExtractionEvaluator(BaseEvaluator):
    def evaluate(
        self,
        agent: Agent,
-        task: Task,
        execution_trace: Dict[str, Any],
        final_output: str,
+        task: Task | None = None,
    ) -> EvaluationScore:
+        task_context = ""
+        if task is not None:
+            task_context = f"Task description: {task.description}"
        tool_uses = execution_trace.get("tool_uses", [])
        tool_count = len(tool_uses)

@@ -212,7 +219,7 @@ Return your evaluation as JSON with these fields:
 """},
            {"role": "user", "content": f"""
 Agent role: {agent.role}
-Task description: {task.description}
+{task_context}

 Parameter extraction examples:
 {param_samples_text}
@@ -267,10 +274,13 @@ class ToolInvocationEvaluator(BaseEvaluator):
    def evaluate(
        self,
        agent: Agent,
-        task: Task,
        execution_trace: Dict[str, Any],
        final_output: str,
+        task: Task | None = None,
    ) -> EvaluationScore:
+        task_context = ""
+        if task is not None:
+            task_context = f"Task description: {task.description}"
        tool_uses = execution_trace.get("tool_uses", [])
        tool_errors = []
        tool_count = len(tool_uses)
@@ -352,7 +362,7 @@ Return your evaluation as JSON with these fields:
 """},
            {"role": "user", "content": f"""
 Agent role: {agent.role}
-Task description: {task.description}
+{task_context}

 Tool invocation examples:
 {invocation_samples_text}
--- a/src/crewai/experimental/evaluation/testing.py
+++ b/src/crewai/experimental/evaluation/testing.py
@@ -0,0 +1,52 @@
+import inspect
+
+from typing import Any
+import warnings
+from crewai.experimental.evaluation.experiment import ExperimentResults, ExperimentRunner
+from crewai import Crew, Agent
+
+def assert_experiment_successfully(experiment_results: ExperimentResults, baseline_filepath: str | None = None) -> None:
+    failed_tests = [result for result in experiment_results.results if not result.passed]
+
+    if failed_tests:
+        detailed_failures: list[str] = []
+
+        for result in failed_tests:
+            expected = result.expected_score
+            actual = result.score
+            detailed_failures.append(f"- {result.identifier}: expected {expected}, got {actual}")
+
+        failure_details = "\n".join(detailed_failures)
+        raise AssertionError(f"The following test cases failed:\n{failure_details}")
+
+    baseline_filepath = baseline_filepath or _get_baseline_filepath_fallback()
+    comparison = experiment_results.compare_with_baseline(baseline_filepath=baseline_filepath)
+    assert_experiment_no_regression(comparison)
+
+def assert_experiment_no_regression(comparison_result: dict[str, list[str]]) -> None:
+    regressed = comparison_result.get("regressed", [])
+    if regressed:
+        raise AssertionError(f"Regression detected! The following tests that previously passed now fail: {regressed}")
+
+    missing_tests = comparison_result.get("missing_tests", [])
+    if missing_tests:
+        warnings.warn(
+            f"Warning: {len(missing_tests)} tests from the baseline are missing in the current run: {missing_tests}",
+            UserWarning
+        )
+
+def run_experiment(dataset: list[dict[str, Any]], crew: Crew | None = None, agents: list[Agent] | None = None, verbose: bool = False) -> ExperimentResults:
+    runner = ExperimentRunner(dataset=dataset)
+
+    return runner.run(agents=agents, crew=crew, print_summary=verbose)
+
+def _get_baseline_filepath_fallback() -> str:
+    test_func_name = "experiment_fallback"
+
+    try:
+        current_frame = inspect.currentframe()
+        if current_frame is not None:
+            test_func_name = current_frame.f_back.f_back.f_code.co_name # type: ignore[union-attr]
+    except Exception:
+        ...
+    return f"{test_func_name}_results.json"
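
Reviewer note: `_get_baseline_filepath_fallback` walks two frames up the call stack, so the default file name depends on call depth. Here is a small sketch of that behaviour under CPython; all three function names are hypothetical stand-ins, not functions from this change.

```python
import inspect


def baseline_filename_for_caller() -> str:
    """Name a results file after the function two frames up the stack."""
    name = "experiment_fallback"
    frame = inspect.currentframe()
    caller = frame.f_back if frame is not None else None
    grand_caller = caller.f_back if caller is not None else None
    if grand_caller is not None:
        name = grand_caller.f_code.co_name
    return f"{name}_results.json"


def assert_helper() -> str:
    # Stands in for the assertion helper: exactly one frame sits
    # between the test function and the lookup.
    return baseline_filename_for_caller()


def test_history_teacher() -> str:
    # Stands in for a user's test function.
    return assert_helper()


filename = test_history_teacher()
```

If the assertion helper is wrapped in another function, the file would be named after that wrapper instead of the test, so passing `baseline_filepath` explicitly is the safer choice in that case.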
--- a/src/crewai/knowledge/storage/knowledge_storage.py
+++ b/src/crewai/knowledge/storage/knowledge_storage.py
@@ -18,6 +18,7 @@ from crewai.utilities.chromadb import sanitize_collection_name
 from crewai.utilities.constants import KNOWLEDGE_DIRECTORY
 from crewai.utilities.logger import Logger
 from crewai.utilities.paths import db_storage_path
+from crewai.utilities.chromadb import create_persistent_client


 @contextlib.contextmanager
@@ -84,14 +85,11 @@ class KnowledgeStorage(BaseKnowledgeStorage):
                raise Exception("Collection not initialized")

    def initialize_knowledge_storage(self):
-        base_path = os.path.join(db_storage_path(), "knowledge")
-        chroma_client = chromadb.PersistentClient(
-            path=base_path,
+        self.app = create_persistent_client(
+            path=os.path.join(db_storage_path(), "knowledge"),
            settings=Settings(allow_reset=True),
        )

-        self.app = chroma_client
-
        try:
            collection_name = (
                f"knowledge_{self.collection_name}"
@@ -111,9 +109,8 @@ class KnowledgeStorage(BaseKnowledgeStorage):
    def reset(self):
        base_path = os.path.join(db_storage_path(), KNOWLEDGE_DIRECTORY)
        if not self.app:
-            self.app = chromadb.PersistentClient(
-                path=base_path,
-                settings=Settings(allow_reset=True),
+            self.app = create_persistent_client(
+                path=base_path, settings=Settings(allow_reset=True)
            )

        self.app.reset()
--- a/src/crewai/lite_agent.py
+++ b/src/crewai/lite_agent.py
@@ -305,6 +305,7 @@ class LiteAgent(FlowTrackable, BaseModel):
        """
        # Create agent info for event emission
        agent_info = {
+            "id": self.id,
            "role": self.role,
            "goal": self.goal,
            "backstory": self.backstory,
--- a/src/crewai/llm.py
+++ b/src/crewai/llm.py
@@ -59,6 +59,7 @@ from crewai.utilities.exceptions.context_window_exceeding_exception import (

 load_dotenv()

+litellm.suppress_debug_info = True

 class FilteredStream(io.TextIOBase):
    _lock = None
@@ -76,9 +77,7 @@ class FilteredStream(io.TextIOBase):

            # Skip common noisy LiteLLM banners and any other lines that contain "litellm"
            if (
-                "give feedback / get help" in lower_s
-                or "litellm.info:" in lower_s
-                or "litellm" in lower_s
+                "litellm.info:" in lower_s
                or "Consider using a smaller input or implementing a text splitting strategy" in lower_s
            ):
                return 0
@@ -760,7 +759,7 @@ class LLM(BaseLLM):
        available_functions: Optional[Dict[str, Any]] = None,
        from_task: Optional[Any] = None,
        from_agent: Optional[Any] = None,
-    ) -> str:
+    ) -> str | Any:
        """Handle a non-streaming response from the LLM.

        Args:
@@ -784,13 +783,11 @@ class LLM(BaseLLM):
            # Convert litellm's context window error to our own exception type
            # for consistent handling in the rest of the codebase
            raise LLMContextLengthExceededException(str(e))
-
        # --- 2) Extract response message and content
        response_message = cast(Choices, cast(ModelResponse, response).choices)[
            0
        ].message
        text_response = response_message.content or ""
-
        # --- 3) Handle callbacks with usage info
        if callbacks and len(callbacks) > 0:
            for callback in callbacks:
@@ -803,21 +800,22 @@ class LLM(BaseLLM):
                            start_time=0,
                            end_time=0,
                        )
-
        # --- 4) Check for tool calls
        tool_calls = getattr(response_message, "tool_calls", [])

-        # --- 5) If no tool calls or no available functions, return the text response directly
-        if not tool_calls or not available_functions:
+        # --- 5) If no tool calls or no available functions, return the text response directly as long as there is a text response
+        if (not tool_calls or not available_functions) and text_response:
            self._handle_emit_call_events(response=text_response, call_type=LLMCallType.LLM_CALL, from_task=from_task, from_agent=from_agent, messages=params["messages"])
            return text_response
+        # --- 6) If there is no text response, no available functions, but there are tool calls, return the tool calls
+        elif tool_calls and not available_functions and not text_response:
+            return tool_calls

-        # --- 6) Handle tool calls if present
+        # --- 7) Handle tool calls if present
        tool_result = self._handle_tool_call(tool_calls, available_functions)
        if tool_result is not None:
            return tool_result
-
-        # --- 7) If tool call handling didn't return a result, emit completion event and return text response
+        # --- 8) If tool call handling didn't return a result, emit completion event and return text response
        self._handle_emit_call_events(response=text_response, call_type=LLMCallType.LLM_CALL, from_task=from_task, from_agent=from_agent, messages=params["messages"])
        return text_response

@@ -952,22 +950,18 @@ class LLM(BaseLLM):
        # --- 3) Convert string messages to proper format if needed
        if isinstance(messages, str):
            messages = [{"role": "user", "content": messages}]
-
        # --- 4) Handle O1 model special case (system messages not supported)
        if "o1" in self.model.lower():
            for message in messages:
                if message.get("role") == "system":
                    message["role"] = "assistant"
-
        # --- 5) Set up callbacks if provided
        with suppress_warnings():
            if callbacks and len(callbacks) > 0:
                self.set_callbacks(callbacks)
-
            try:
                # --- 6) Prepare parameters for the completion call
                params = self._prepare_completion_params(messages, tools)
-
                # --- 7) Make the completion call and handle response
                if self.stream:
                    return self._handle_streaming_response(
@@ -984,12 +978,32 @@ class LLM(BaseLLM):
                # whether to summarize the content or abort based on the respect_context_window flag
                raise
            except Exception as e:
+                unsupported_stop = "Unsupported parameter" in str(e) and "'stop'" in str(e)
+
+                if unsupported_stop:
+                    if "additional_drop_params" in self.additional_params and isinstance(self.additional_params["additional_drop_params"], list):
+                        self.additional_params["additional_drop_params"].append("stop")
+                    else:
+                        self.additional_params = {"additional_drop_params": ["stop"]}
+
+                    logging.info(
+                        "Retrying LLM call without the unsupported 'stop'"
+                    )
+
+                    return self.call(
+                        messages,
+                        tools=tools,
+                        callbacks=callbacks,
+                        available_functions=available_functions,
+                        from_task=from_task,
+                        from_agent=from_agent,
+                    )
+
                assert hasattr(crewai_event_bus, "emit")
                crewai_event_bus.emit(
                    self,
                    event=LLMCallFailedEvent(error=str(e), from_task=from_task, from_agent=from_agent),
                )
-                logging.error(f"LiteLLM call failed: {str(e)}")
                raise

    def _handle_emit_call_events(self, response: Any, call_type: LLMCallType, from_task: Optional[Any] = None, from_agent: Optional[Any] = None, messages: str | list[dict[str, Any]] | None = None):
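
Review note: in the retry hunk above, the `else` branch assigns a fresh dict to `self.additional_params`, so any other keys already stored there would be discarded before the retry. The sketch below shows one possible variation that keeps them. It is not the committed code; `is_unsupported_stop` and `add_stop_to_drop_params` are hypothetical helper names, and only the string checks and the `additional_drop_params` key come from the diff.

```python
from typing import Any, Dict


def is_unsupported_stop(error: Exception) -> bool:
    """Same string test as the diff: both fragments must appear in the error text."""
    message = str(error)
    return "Unsupported parameter" in message and "'stop'" in message


def add_stop_to_drop_params(additional_params: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy with 'stop' listed under 'additional_drop_params'.

    Hypothetical helper: unlike the diff's else branch, other keys are kept,
    and 'stop' is not appended twice.
    """
    updated = dict(additional_params)
    existing = updated.get("additional_drop_params")
    drop = list(existing) if isinstance(existing, list) else []
    if "stop" not in drop:
        drop.append("stop")
    updated["additional_drop_params"] = drop
    return updated
```

Skipping the append when `"stop"` is already present also gives a natural place to stop retrying, since the recursive `self.call(...)` in the hunk has no explicit retry limit.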
@@ -1058,6 +1072,15 @@ class LLM(BaseLLM):
                messages.append({"role": "user", "content": "Please continue."})
            return messages

+        # TODO: Remove this code after merging PR https://github.com/BerriAI/litellm/pull/10917
+        # Ollama doesn't support an 'assistant' message as the last message
+        if "ollama" in self.model.lower() and messages and messages[-1]["role"] == "assistant":
+            messages = messages.copy()
+            messages.append(
+                {"role": "user", "content": ""}
+            )
+            return messages
+
        # Handle Anthropic models
        if not self.is_anthropic:
            return messages
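
Review note: the Ollama workaround above can be exercised in isolation. This is a minimal sketch under the assumption that messages are plain dicts with `role` and `content` keys; `pad_trailing_assistant` is a hypothetical name, and the condition mirrors the one in the hunk.

```python
from typing import Dict, List


def pad_trailing_assistant(
    model: str, messages: List[Dict[str, str]]
) -> List[Dict[str, str]]:
    """Append an empty user message when an Ollama conversation ends on 'assistant'.

    Hypothetical helper: returns a new list in that case and leaves the
    caller's list untouched, like the messages.copy() in the diff.
    """
    if "ollama" in model.lower() and messages and messages[-1]["role"] == "assistant":
        return messages + [{"role": "user", "content": ""}]
    return messages
```

For any other model, or when the last message is not from the assistant, the input list is returned unchanged.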
--- a/src/crewai/memory/contextual/contextual_memory.py
+++ b/src/crewai/memory/contextual/contextual_memory.py
@@ -108,6 +108,7 @@ class ContextualMemory:

    def _fetch_user_context(self, query: str) -> str:
        """
+        DEPRECATED: Will be removed in version 0.156.0 or on 2025-08-04, whichever comes first.
        Fetches and formats relevant user information from User Memory.
        Args:
            query (str): The search query to find relevant user memories.
--- a/src/crewai/memory/storage/mem0_storage.py
+++ b/src/crewai/memory/storage/mem0_storage.py
@@ -64,6 +64,7 @@ class Mem0Storage(Storage):
    def save(self, value: Any, metadata: Dict[str, Any]) -> None:
        user_id = self._get_user_id()
        agent_name = self._get_agent_name()
+        assistant_message = [{"role": "assistant", "content": value}]
        params = None
        if self.memory_type == "short_term":
            params = {
@@ -93,7 +94,8 @@ class Mem0Storage(Storage):
        if params:
            if isinstance(self.memory, MemoryClient):
                params["output_format"] = "v1.1"
-            self.memory.add(value, **params)
+            
+            self.memory.add(assistant_message, **params)

    def search(
        self,
--- a/src/crewai/memory/storage/rag_storage.py
+++ b/src/crewai/memory/storage/rag_storage.py
@@ -4,12 +4,12 @@ import logging
 import os
 import shutil
 import uuid
+
 from typing import Any, Dict, List, Optional
-
 from chromadb.api import ClientAPI
-
 from crewai.memory.storage.base_rag_storage import BaseRAGStorage
 from crewai.utilities import EmbeddingConfigurator
+from crewai.utilities.chromadb import create_persistent_client
 from crewai.utilities.constants import MAX_FILE_NAME_LENGTH
 from crewai.utilities.paths import db_storage_path

@@ -60,17 +60,15 @@ class RAGStorage(BaseRAGStorage):
        self.embedder_config = configurator.configure_embedder(self.embedder_config)

    def _initialize_app(self):
-        import chromadb
        from chromadb.config import Settings

        self._set_embedder_config()
-        chroma_client = chromadb.PersistentClient(
+
+        self.app = create_persistent_client(
            path=self.path if self.path else self.storage_file_name,
            settings=Settings(allow_reset=self.allow_reset),
        )

-        self.app = chroma_client
-
        self.collection = self.app.get_or_create_collection(
            name=self.type, embedding_function=self.embedder_config
        )
--- a/src/crewai/memory/user/user_memory.py
+++ b/src/crewai/memory/user/user_memory.py
@@ -14,7 +14,8 @@ class UserMemory(Memory):

    def __init__(self, crew=None):
        warnings.warn(
-            "UserMemory is deprecated and will be removed in a future version. "
+            "UserMemory is deprecated and will be removed in version 0.156.0 "
+            "or on 2025-08-04, whichever comes first. "
            "Please use ExternalMemory instead.",
            DeprecationWarning,
            stacklevel=2,
--- a/src/crewai/memory/user/user_memory_item.py
+++ b/src/crewai/memory/user/user_memory_item.py
@@ -1,8 +1,16 @@
+import warnings
 from typing import Any, Dict, Optional


 class UserMemoryItem:
    def __init__(self, data: Any, user: str, metadata: Optional[Dict[str, Any]] = None):
+        warnings.warn(
+            "UserMemoryItem is deprecated and will be removed in version 0.156.0 "
+            "or on 2025-08-04, whichever comes first. "
+            "Please use ExternalMemory instead.",
+            DeprecationWarning,
+            stacklevel=2,
+        )
        self.data = data
        self.user = user
        self.metadata = metadata if metadata is not None else {}
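
Review note: the deprecation warnings added in this change can be verified with the standard `warnings` module. The sketch below uses `LegacyItem`, a hypothetical stand-in class, rather than the real `UserMemoryItem`, so it runs without crewAI installed.

```python
import warnings


class LegacyItem:
    """Hypothetical stand-in for a class that warns on construction."""

    def __init__(self, data):
        # stacklevel=2 attributes the warning to the caller of LegacyItem(...),
        # not to this __init__ body.
        warnings.warn(
            "LegacyItem is deprecated. Please use a replacement instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        self.data = data


def build_and_capture():
    """Construct a LegacyItem while recording any warnings it emits."""
    with warnings.catch_warnings(record=True) as caught:
        # DeprecationWarning is often filtered by default, so force it on.
        warnings.simplefilter("always")
        item = LegacyItem("x")
    return item, caught
```

In a pytest suite, `pytest.warns(DeprecationWarning)` would be the more usual way to assert the same thing.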
--- a/src/crewai/utilities/agent_utils.py
+++ b/src/crewai/utilities/agent_utils.py
@@ -157,10 +157,6 @@ def get_llm_response(
            from_agent=from_agent,
        )
    except Exception as e:
-        printer.print(
-            content=f"Error during LLM call: {e}",
-            color="red",
-        )
        raise e
    if not answer:
        printer.print(
@@ -232,12 +228,17 @@ def handle_unknown_error(printer: Any, exception: Exception) -> None:
        printer: Printer instance for output
        exception: The exception that occurred
    """
+    error_message = str(exception)
+
+    if "litellm" in error_message:
+        return
+
    printer.print(
        content="An unknown error occurred. Please check the details below.",
        color="red",
    )
    printer.print(
-        content=f"Error details: {exception}",
+        content=f"Error details: {error_message}",
        color="red",
    )
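
Review note: the early return added to `handle_unknown_error` suppresses output for any exception whose message contains the lowercase substring `litellm`. The sketch below reproduces that logic with a hypothetical `RecordingPrinter` stub so the behavior can be checked without crewAI's `Printer`.

```python
class RecordingPrinter:
    """Hypothetical stub that records print calls instead of writing to a terminal."""

    def __init__(self):
        self.lines = []

    def print(self, content: str, color: str = "") -> None:
        self.lines.append((content, color))


def handle_unknown_error(printer, exception: Exception) -> None:
    """Mirror of the function body in the hunk above."""
    error_message = str(exception)

    # Case-sensitive substring check, exactly as in the diff.
    if "litellm" in error_message:
        return

    printer.print(
        content="An unknown error occurred. Please check the details below.",
        color="red",
    )
    printer.print(content=f"Error details: {error_message}", color="red")
```

Because the check is case-sensitive, a message that only mentions `LiteLLM` in mixed case would still be printed.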

--- a/src/crewai/utilities/chromadb.py
+++ b/src/crewai/utilities/chromadb.py
@@ -1,6 +1,10 @@
 import re
+import portalocker
+from chromadb import PersistentClient
+from hashlib import md5
 from typing import Optional

+
 MIN_COLLECTION_LENGTH = 3
 MAX_COLLECTION_LENGTH = 63
 DEFAULT_COLLECTION = "default_collection"
@@ -60,3 +64,16 @@ def sanitize_collection_name(name: Optional[str], max_collection_length: int = M
            sanitized = sanitized[:-1] + "z"

    return sanitized
+
+
+def create_persistent_client(path: str, **kwargs):
+    """
+    Creates a persistent ChromaDB client while holding a lock file, which
+    prevents concurrent client creation. Works in both multi-threaded and
+    multi-process environments.
+    """
+    lockfile = f"chromadb-{md5(path.encode(), usedforsecurity=False).hexdigest()}.lock"
+    with portalocker.Lock(lockfile):
+        client = PersistentClient(path=path, **kwargs)
+
+    return client
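
Review note: the lock file name in `create_persistent_client` is derived only from the `path` string. The sketch below isolates that derivation using the standard library; `lockfile_name` is a hypothetical helper name, and the `portalocker` locking itself is left out.

```python
from hashlib import md5


def lockfile_name(path: str) -> str:
    """Build the lock file name the same way the diff does.

    usedforsecurity=False (Python 3.9+) marks the hash as a non-cryptographic
    fingerprint of the path string.
    """
    digest = md5(path.encode(), usedforsecurity=False).hexdigest()
    return f"chromadb-{digest}.lock"
```

Two details follow from this scheme. The name is relative, so the lock file is created in the process's current working directory. The hash is taken over the raw string, so `db` and `./db` produce different lock files even though they refer to the same directory.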
--- a/src/crewai/utilities/events/__init__.py
+++ b/src/crewai/utilities/events/__init__.py
@@ -17,6 +17,9 @@ from .agent_events import (
    AgentExecutionStartedEvent,
    AgentExecutionCompletedEvent,
    AgentExecutionErrorEvent,
+    AgentEvaluationStartedEvent,
+    AgentEvaluationCompletedEvent,
+    AgentEvaluationFailedEvent,
 )
 from .task_events import (
    TaskStartedEvent,
@@ -74,6 +77,9 @@ __all__ = [
    "AgentExecutionStartedEvent",
    "AgentExecutionCompletedEvent",
    "AgentExecutionErrorEvent",
+    "AgentEvaluationStartedEvent",
+    "AgentEvaluationCompletedEvent",
+    "AgentEvaluationFailedEvent",
    "TaskStartedEvent",
    "TaskCompletedEvent",
    "TaskFailedEvent",
--- a/src/crewai/utilities/events/agent_events.py
+++ b/src/crewai/utilities/events/agent_events.py
@@ -123,3 +123,28 @@ class AgentLogsExecutionEvent(BaseEvent):
    type: str = "agent_logs_execution"

    model_config = {"arbitrary_types_allowed": True}
+
+# Agent Eval events
+class AgentEvaluationStartedEvent(BaseEvent):
+    agent_id: str
+    agent_role: str
+    task_id: str | None = None
+    iteration: int
+    type: str = "agent_evaluation_started"
+
+class AgentEvaluationCompletedEvent(BaseEvent):
+    agent_id: str
+    agent_role: str
+    task_id: str | None = None
+    iteration: int
+    metric_category: Any
+    score: Any
+    type: str = "agent_evaluation_completed"
+
+class AgentEvaluationFailedEvent(BaseEvent):
+    agent_id: str
+    agent_role: str
+    task_id: str | None = None
+    iteration: int
+    error: str
+    type: str = "agent_evaluation_failed"
--- a/src/crewai/utilities/events/event_types.py
+++ b/src/crewai/utilities/events/event_types.py
@@ -4,6 +4,7 @@ from .agent_events import (
    AgentExecutionCompletedEvent,
    AgentExecutionErrorEvent,
    AgentExecutionStartedEvent,
+    LiteAgentExecutionCompletedEvent,
 )
 from .crew_events import (
    CrewKickoffCompletedEvent,
@@ -80,6 +81,7 @@ EventTypes = Union[
    CrewTrainFailedEvent,
    AgentExecutionStartedEvent,
    AgentExecutionCompletedEvent,
+    LiteAgentExecutionCompletedEvent,
    TaskStartedEvent,
    TaskCompletedEvent,
    TaskFailedEvent,
--- a/tests/agent_test.py
+++ b/tests/agent_test.py
@@ -2010,7 +2010,6 @@ def test_crew_agent_executor_litellm_auth_error():
    from litellm.exceptions import AuthenticationError

    from crewai.agents.tools_handler import ToolsHandler
-    from crewai.utilities import Printer

    # Create an agent and executor
    agent = Agent(
@@ -2043,7 +2042,6 @@ def test_crew_agent_executor_litellm_auth_error():
    # Mock the LLM call to raise AuthenticationError
    with (
        patch.object(LLM, "call") as mock_llm_call,
-        patch.object(Printer, "print") as mock_printer,
        pytest.raises(AuthenticationError) as exc_info,
    ):
        mock_llm_call.side_effect = AuthenticationError(
@@ -2057,13 +2055,6 @@ def test_crew_agent_executor_litellm_auth_error():
            }
        )

-    # Verify error handling messages
-    error_message = f"Error during LLM call: {str(mock_llm_call.side_effect)}"
-    mock_printer.assert_any_call(
-        content=error_message,
-        color="red",
-    )
-
    # Verify the call was only made once (no retries)
    mock_llm_call.assert_called_once()

--- a/tests/cassettes/TestAgentEvaluator.test_eval_lite_agent.yaml
+++ b/tests/cassettes/TestAgentEvaluator.test_eval_lite_agent.yaml
@@ -0,0 +1,237 @@
+interactions:
+- request:
+    body: '{"messages": [{"role": "system", "content": "You are Test Agent. An agent
+      created for testing purposes\nYour personal goal is: Complete test tasks successfully\n\nTo
+      give my best complete final answer to the task respond using the exact following
+      format:\n\nThought: I now can give a great answer\nFinal Answer: Your final
+      answer must be the great and the most complete as possible, it must be outcome
+      described.\n\nI MUST use these formats, my job depends on it!"}, {"role": "user",
+      "content": "Complete this task successfully"}], "model": "gpt-4o-mini", "stop":
+      ["\nObservation:"]}'
+    headers:
+      accept:
+      - application/json
+      accept-encoding:
+      - gzip, deflate, zstd
+      connection:
+      - keep-alive
+      content-length:
+      - '583'
+      content-type:
+      - application/json
+      host:
+      - api.openai.com
+      user-agent:
+      - OpenAI/Python 1.93.0
+      x-stainless-arch:
+      - arm64
+      x-stainless-async:
+      - 'false'
+      x-stainless-lang:
+      - python
+      x-stainless-os:
+      - MacOS
+      x-stainless-package-version:
+      - 1.93.0
+      x-stainless-raw-response:
+      - 'true'
+      x-stainless-read-timeout:
+      - '600.0'
+      x-stainless-retry-count:
+      - '0'
+      x-stainless-runtime:
+      - CPython
+      x-stainless-runtime-version:
+      - 3.11.12
+    method: POST
+    uri: https://api.openai.com/v1/chat/completions
+  response:
+    body:
+      string: !!binary |
+        H4sIAAAAAAAAAwAAAP//jFNNb9swDL3nVxA6J0U+HKTNbd0woMAOw7Bu6LbCUCXa1iqLgkgnzYr8
+        98FKWqdbB+wiQHx81OMj9TgCUM6qNSjTaDFt9JNL+TZ7N/dfrusPN01NyV6vPk3f/mrl5vLrXI17
+        Bt39RCNPrDNDbfQojsIBNgm1YF91tlrOl+fzxXKWgZYs+p5WR5kUNGldcJP5dF5MpqvJ7PzIbsgZ
+        ZLWG7yMAgMd89jqDxQe1hun4KdIis65RrZ+TAFQi30eUZnYsOogaD6ChIBiy9M8NdXUja7iCQFsw
+        OkDtNgga6l4/6MBbTAA/wnsXtIc3+b6Gjx41I8REG2cRWoStkwakQeCIxlXOgEXRzjNQgvzigwBV
+        OUU038OOOgiIFhr0MdPHoIOFK9g67wEDdwlBCI7OIjgB7oxB5qrzfpeznxRokIZS3wwk5EiB8ey0
+        54RVx7r3PXTenwA6BBLdzy27fXtE9s/+eqpjojv+g6oqFxw3ZULNFHovWSiqjO5HALd5jt2L0aiY
+        qI1SCt1jfu7i4lBODdszgEVxBIVE+yE+KxbjV8qVR79PFkEZbRq0A3XYGt1ZRyfA6KTpv9W8VvvQ
+        uAv1/5QfAGMwCtoyJrTOvOx4SEvYf65/pT2bnAUrxrRxBktxmPpBWKx05w8rr3jHgm1ZuVBjiskd
+        9r6K5aLQy0LjxcKo0X70GwAA//8DAMz2wVUFBAAA
+    headers:
+      CF-RAY:
+      - 95f93ea9af627e0b-GRU
+      Connection:
+      - keep-alive
+      Content-Encoding:
+      - gzip
+      Content-Type:
+      - application/json
+      Date:
+      - Tue, 15 Jul 2025 12:25:54 GMT
+      Server:
+      - cloudflare
+      Set-Cookie:
+      - __cf_bm=GRZmZLrjW5ZRHNmUJa4ccrMcy20D1rmeqK6Ptlv0mRY-1752582354-1.0.1.1-xKd_yga48Eedech5TRlThlEpDgsB2whxkWHlCyAGOVMqMcvH1Ju9FdXYbuQ9NdUQcVxPLgiGM35lYhqSLVQiXDyK01dnyp2Gvm560FBN9DY;
+        path=/; expires=Tue, 15-Jul-25 12:55:54 GMT; domain=.api.openai.com; HttpOnly;
+        Secure; SameSite=None
+      - _cfuvid=MYqswpSR7sqr4kGp6qZVkaL7HDYwMiww49PeN9QBP.A-1752582354973-0.0.1.1-604800000;
+        path=/; domain=.api.openai.com; HttpOnly; Secure; SameSite=None
+      Transfer-Encoding:
+      - chunked
+      X-Content-Type-Options:
+      - nosniff
+      access-control-expose-headers:
+      - X-Request-ID
+      alt-svc:
+      - h3=":443"; ma=86400
+      cf-cache-status:
+      - DYNAMIC
+      openai-organization:
+      - crewai-iuxna1
+      openai-processing-ms:
+      - '4047'
+      openai-version:
+      - '2020-10-01'
+      strict-transport-security:
+      - max-age=31536000; includeSubDomains; preload
+      x-envoy-upstream-service-time:
+      - '4440'
+      x-ratelimit-limit-requests:
+      - '30000'
+      x-ratelimit-limit-tokens:
+      - '150000000'
+      x-ratelimit-remaining-requests:
+      - '29999'
+      x-ratelimit-remaining-tokens:
+      - '149999885'
+      x-ratelimit-reset-requests:
+      - 2ms
+      x-ratelimit-reset-tokens:
+      - 0s
+      x-request-id:
+      - req_5704c0f206a927ddc12aa1a19b612a75
+    status:
+      code: 200
+      message: OK
+- request:
+    body: '{"messages": [{"role": "system", "content": "You are an expert evaluator
+      assessing how well an AI agent''s output aligns with its assigned task goal.\n\nScore
+      the agent''s goal alignment on a scale from 0-10 where:\n- 0: Complete misalignment,
+      agent did not understand or attempt the task goal\n- 5: Partial alignment, agent
+      attempted the task but missed key requirements\n- 10: Perfect alignment, agent
+      fully satisfied all task requirements\n\nConsider:\n1. Did the agent correctly
+      interpret the task goal?\n2. Did the final output directly address the requirements?\n3.
+      Did the agent focus on relevant aspects of the task?\n4. Did the agent provide
+      all requested information or deliverables?\n\nReturn your evaluation as JSON
+      with fields ''score'' (number) and ''feedback'' (string).\n"}, {"role": "user",
+      "content": "\nAgent role: Test Agent\nAgent goal: Complete test tasks successfully\n\n\nAgent''s
+      final output:\nPlease provide me with the specific details or context of the
+      task you need help with, and I will ensure to complete it successfully and provide
+      a thorough response.\n\nEvaluate how well the agent''s output aligns with the
+      assigned task goal.\n"}], "model": "gpt-4o-mini", "stop": []}'
+    headers:
+      accept:
+      - application/json
+      accept-encoding:
+      - gzip, deflate, zstd
+      connection:
+      - keep-alive
+      content-length:
+      - '1196'
+      content-type:
+      - application/json
+      cookie:
+      - __cf_bm=GRZmZLrjW5ZRHNmUJa4ccrMcy20D1rmeqK6Ptlv0mRY-1752582354-1.0.1.1-xKd_yga48Eedech5TRlThlEpDgsB2whxkWHlCyAGOVMqMcvH1Ju9FdXYbuQ9NdUQcVxPLgiGM35lYhqSLVQiXDyK01dnyp2Gvm560FBN9DY;
+        _cfuvid=MYqswpSR7sqr4kGp6qZVkaL7HDYwMiww49PeN9QBP.A-1752582354973-0.0.1.1-604800000
+      host:
+      - api.openai.com
+      user-agent:
+      - OpenAI/Python 1.93.0
+      x-stainless-arch:
+      - arm64
+      x-stainless-async:
+      - 'false'
+      x-stainless-lang:
+      - python
+      x-stainless-os:
+      - MacOS
+      x-stainless-package-version:
+      - 1.93.0
+      x-stainless-raw-response:
+      - 'true'
+      x-stainless-read-timeout:
+      - '600.0'
+      x-stainless-retry-count:
+      - '0'
+      x-stainless-runtime:
+      - CPython
+      x-stainless-runtime-version:
+      - 3.11.12
+    method: POST
+    uri: https://api.openai.com/v1/chat/completions
+  response:
+    body:
+      string: !!binary |
+        H4sIAAAAAAAAA4xUy27bQAy8+yuIPdtGbMdN4FvbSxM0QIsEKNA6MJhdSmK82hWWVFwj8L8XKz/k
+        9AH0ogOHnOFjVq8DAMPOLMDYCtXWjR990O+TT7dfZs/v5OtFy/ef7++mxfu7j83t/cONGeaK+PRM
+        Vo9VYxvrxpNyDHvYJkKlzDq5mk/n19PZfN4BdXTkc1nZ6OgyjmoOPJpeTC9HF1ejyfWhuopsScwC
+        fgwAAF67b+4zOPppFnAxPEZqEsGSzOKUBGBS9DliUIRFMagZ9qCNQSl0rb8uA8DSiI2JlmYB0+E+
+        UBC5J7TrHFuah4oASwoKjh2EqOCojkE0oRIgWE+YoA2OUhZzHEqIBWhFoChrKCP6IWwqthWwgEY4
+        bItASbRLEpDWWhIpWu+3Y7gJooRuCKyAsiYHRUxQx0TgSJG9DIGDY4ua5RA82nVW5cDKqPxCWYhC
+        iSXBhrU69TOGbxV7ysxSxY0Awoa951AGkq69/do67QLZk8vBJsUXdgQYtoBWW/SQSJoYpFPq2Ptp
+        MLjTttC51DFXVIPjRFb9drw0y7A7v0uiohXM3git92cAhhAVs7c6RzwekN3JAz6WTYpP8lupKTiw
+        VKtEKDHke4vGxnTobgDw2HmtfWMf06RYN7rSuKZObjo7eM30Fu/R6yOoUdH38dnkCLzhWx1ud+ZW
+        Y9FW5PrS3trYOo5nwOBs6j+7+Rv3fnIO5f/Q94C11Ci5VZPIsX07cZ+WKP8B/pV22nLXsBFKL2xp
+        pUwpX8JRga3fv0sjW1GqVwWHklKTuHuc+ZKD3eAXAAAA//8DADksFsafBAAA
+    headers:
+      CF-RAY:
+      - 95f93ec73a1c7e0b-GRU
+      Connection:
+      - keep-alive
+      Content-Encoding:
+      - gzip
+      Content-Type:
+      - application/json
+      Date:
+      - Tue, 15 Jul 2025 12:25:57 GMT
+      Server:
+      - cloudflare
+      Transfer-Encoding:
+      - chunked
+      X-Content-Type-Options:
+      - nosniff
+      access-control-expose-headers:
+      - X-Request-ID
+      alt-svc:
+      - h3=":443"; ma=86400
+      cf-cache-status:
+      - DYNAMIC
+      openai-organization:
+      - crewai-iuxna1
+      openai-processing-ms:
+      - '1544'
+      openai-version:
+      - '2020-10-01'
+      strict-transport-security:
+      - max-age=31536000; includeSubDomains; preload
+      x-envoy-upstream-service-time:
+      - '1546'
+      x-ratelimit-limit-requests:
+      - '30000'
+      x-ratelimit-limit-tokens:
+      - '150000000'
+      x-ratelimit-remaining-requests:
+      - '29999'
+      x-ratelimit-remaining-tokens:
+      - '149999732'
+      x-ratelimit-reset-requests:
+      - 2ms
+      x-ratelimit-reset-tokens:
+      - 0s
+      x-request-id:
+      - req_44930ba12ad8d1e3f0beed1d5e3d8b0c
+    status:
+      code: 200
+      message: OK
+version: 1
--- a/tests/cassettes/TestAgentEvaluator.test_eval_specific_agents_from_crew.yaml
+++ b/tests/cassettes/TestAgentEvaluator.test_eval_specific_agents_from_crew.yaml
--- a/tests/cassettes/TestAgentEvaluator.test_evaluate_current_iteration.yaml
+++ b/tests/cassettes/TestAgentEvaluator.test_evaluate_current_iteration.yaml
@@ -427,4 +427,140 @@ interactions:
    status:
      code: 200
      message: OK
+- request:
+    body: '{"messages": [{"role": "system", "content": "You are an expert evaluator
+      assessing how well an AI agent''s output aligns with its assigned task goal.\n\nScore
+      the agent''s goal alignment on a scale from 0-10 where:\n- 0: Complete misalignment,
+      agent did not understand or attempt the task goal\n- 5: Partial alignment, agent
+      attempted the task but missed key requirements\n- 10: Perfect alignment, agent
+      fully satisfied all task requirements\n\nConsider:\n1. Did the agent correctly
+      interpret the task goal?\n2. Did the final output directly address the requirements?\n3.
+      Did the agent focus on relevant aspects of the task?\n4. Did the agent provide
+      all requested information or deliverables?\n\nReturn your evaluation as JSON
+      with fields ''score'' (number) and ''feedback'' (string).\n"}, {"role": "user",
+      "content": "\nAgent role: Test Agent\nAgent goal: Complete test tasks successfully\nTask
+      description: Test task description\nExpected output: Expected test output\n\nAgent''s
+      final output:\nThe expected test output is a comprehensive document that outlines
+      the specific parameters and criteria that define success for the task at hand.
+      It should include detailed descriptions of the tasks, the goals that need to
+      be achieved, and any specific formatting or structural requirements necessary
+      for the output. Each component of the task must be analyzed and addressed, providing
+      context as well as examples where applicable. Additionally, any tools or methodologies
+      that are relevant to executing the tasks successfully should be outlined, including
+      any potential risks or challenges that may arise during the process. This document
+      serves as a guiding framework to ensure that all aspects of the task are thoroughly
+      considered and executed to meet the high standards expected.\n\nEvaluate how
+      well the agent''s output aligns with the assigned task goal.\n"}], "model":
+      "gpt-4o-mini", "stop": []}'
+    headers:
+      accept:
+      - application/json
+      accept-encoding:
+      - gzip, deflate, zstd
+      connection:
+      - keep-alive
+      content-length:
+      - '1893'
+      content-type:
+      - application/json
+      cookie:
+      - _cfuvid=XwsgBfgvDGlKFQ4LiGYGIARIoSNTiwidqoo9UZcc.XY-1752087999227-0.0.1.1-604800000
+      host:
+      - api.openai.com
+      user-agent:
+      - OpenAI/Python 1.93.0
+      x-stainless-arch:
+      - arm64
+      x-stainless-async:
+      - 'false'
+      x-stainless-lang:
+      - python
+      x-stainless-os:
+      - MacOS
+      x-stainless-package-version:
+      - 1.93.0
+      x-stainless-raw-response:
+      - 'true'
+      x-stainless-read-timeout:
+      - '600.0'
+      x-stainless-retry-count:
+      - '0'
+      x-stainless-runtime:
+      - CPython
+      x-stainless-runtime-version:
+      - 3.11.12
+    method: POST
+    uri: https://api.openai.com/v1/chat/completions
+  response:
+    body:
+      string: !!binary |
+        H4sIAAAAAAAAAwAAAP//jFRNbxs5DL37VxA6jwPHddrUxxwWi2BRtEAPRevCYCSOh41GUkWOnTTI
+        fy8kf4zT5rCXOfCRT4+P5DxNAAw7swRjO1TbJz+90dvFxy//vX0za7dfr29+3eo/n75++Mh0O/za
+        maZUxLsfZPVYdWFjnzwpx7CHbSZUKqyX767mV/PL2eKqAn105EvZJul0Eac9B57OZ/PFdPZuenl9
+        qO4iWxKzhG8TAICn+i06g6MHs4RZc4z0JIIbMstTEoDJ0ZeIQREWxaCmGUEbg1Ko0p9WAWBlxMZM
+        K7OEq2YfaIncHdr7EluZzx0BbigopBy37MgBgiNF9uTAkdjMqbQOsYVdhwraEdBDIqvkIA6aBgXp
+        4uAdcLB+cNTArmPbAQfHFpUEJPYEQ3CUi2LHYVPoCpOi3EOmnwNn6imoXMC/cUdbyk3FWw7oj8+4
+        SAIhKkgiyy1b9P4RHHneUn4pTEn0WIYC6YDX5866aqDH+yKHFRJm5cqInjeB3AWM7vQsUgzhTFb9
+        48GtUlloSwMkZ4bEDMetOaSg1QH9XldVwSrk2wY4iBLWSs/hmG47zGiVMouylZP7WHkzdRSEtwQu
+        2qH4dhyBjcWKHWsXhzJTEgpVAwagByySirgzRSfLDrtzsTKr8Hy+VJnaQbAsdhi8PwMwhKhYfKzr
+        /P2APJ8W2MdNyvFO/ig1LQeWbp0JJYayrKIxmYo+TwC+10MZXuy+STn2Sdca76k+92ax2POZ8T5H
+        9P31AdSo6Mf4YjFvXuFb71dezk7NWLQdubF0vEscHMczYHLW9d9qXuPed85h83/oR8BaSkpunTI5
+        ti87HtMy/agTfT3t5HIVbITyli2tlSmXSThqcfD7n4qRR1Hq1y2HDeWUuf5ZyiQnz5PfAAAA//8D
+        AEfUP8BcBQAA
+    headers:
+      CF-RAY:
+      - 95f365f1bfc87ded-GRU
+      Connection:
+      - keep-alive
+      Content-Encoding:
+      - gzip
+      Content-Type:
+      - application/json
+      Date:
+      - Mon, 14 Jul 2025 19:24:07 GMT
+      Server:
+      - cloudflare
+      Set-Cookie:
+      - __cf_bm=PcC3_3T8.MK_WpZlQLdZfwpNv9Pe45AIYmrXOSgJ65E-1752521047-1.0.1.1-eyqwSWfQC7ZV6.JwTsTihK1ZWCrEmxd52CtNcfe.fw1UjjBN9rdTU4G7hRZiNqHQYo4sVZMmgRgqM9k7HRSzN2zln0bKmMiOuSQTZh6xF_I;
+        path=/; expires=Mon, 14-Jul-25 19:54:07 GMT; domain=.api.openai.com; HttpOnly;
+        Secure; SameSite=None
+      - _cfuvid=JvQ1c4qYZefNwOPoVNgAtX8ET7ObU.JKDvGc43LOR6g-1752521047741-0.0.1.1-604800000;
+        path=/; domain=.api.openai.com; HttpOnly; Secure; SameSite=None
+      Transfer-Encoding:
+      - chunked
+      X-Content-Type-Options:
+      - nosniff
+      access-control-expose-headers:
+      - X-Request-ID
+      alt-svc:
+      - h3=":443"; ma=86400
+      cf-cache-status:
+      - DYNAMIC
+      openai-organization:
+      - crewai-iuxna1
+      openai-processing-ms:
+      - '2729'
+      openai-version:
+      - '2020-10-01'
+      strict-transport-security:
+      - max-age=31536000; includeSubDomains; preload
+      x-envoy-upstream-service-time:
+      - '2789'
+      x-ratelimit-limit-requests:
+      - '30000'
+      x-ratelimit-limit-tokens:
+      - '150000000'
+      x-ratelimit-remaining-requests:
+      - '29999'
+      x-ratelimit-remaining-tokens:
+      - '149999559'
+      x-ratelimit-reset-requests:
+      - 2ms
+      x-ratelimit-reset-tokens:
+      - 0s
+      x-request-id:
+      - req_74f6e8ff49db25dbea3d3525cc149e8e
+    status:
+      code: 200
+      message: OK
 version: 1
--- a/tests/cassettes/TestAgentEvaluator.test_failed_evaluation.yaml
+++ b/tests/cassettes/TestAgentEvaluator.test_failed_evaluation.yaml
@@ -0,0 +1,123 @@
+interactions:
+- request:
+    body: '{"messages": [{"role": "system", "content": "You are Test Agent. An agent
+      created for testing purposes\nYour personal goal is: Complete test tasks successfully\nTo
+      give my best complete final answer to the task respond using the exact following
+      format:\n\nThought: I now can give a great answer\nFinal Answer: Your final
+      answer must be the great and the most complete as possible, it must be outcome
+      described.\n\nI MUST use these formats, my job depends on it!"}, {"role": "user",
+      "content": "\nCurrent Task: Test task description\n\nThis is the expected criteria
+      for your final answer: Expected test output\nyou MUST return the actual complete
+      content as the final answer, not a summary.\n\nBegin! This is VERY important
+      to you, use the tools available and give your best Final Answer, your job depends
+      on it!\n\nThought:"}], "model": "gpt-4o-mini", "stop": ["\nObservation:"]}'
+    headers:
+      accept:
+      - application/json
+      accept-encoding:
+      - gzip, deflate, zstd
+      connection:
+      - keep-alive
+      content-length:
+      - '879'
+      content-type:
+      - application/json
+      host:
+      - api.openai.com
+      user-agent:
+      - OpenAI/Python 1.93.0
+      x-stainless-arch:
+      - arm64
+      x-stainless-async:
+      - 'false'
+      x-stainless-lang:
+      - python
+      x-stainless-os:
+      - MacOS
+      x-stainless-package-version:
+      - 1.93.0
+      x-stainless-raw-response:
+      - 'true'
+      x-stainless-read-timeout:
+      - '600.0'
+      x-stainless-retry-count:
+      - '0'
+      x-stainless-runtime:
+      - CPython
+      x-stainless-runtime-version:
+      - 3.11.12
+    method: POST
+    uri: https://api.openai.com/v1/chat/completions
+  response:
+    body:
+      string: !!binary |
+        H4sIAAAAAAAAAwAAAP//jFTBbhtHDL3rK4g5rwRbtaNYt9RoEaNoUaBODm0DgZnh7jKe5WyHXDmO
+        4X8vZiRLcupDLwvsPPLxPQ45jzMAx8GtwfkezQ9jnP9oeLv98N5+vfl9+4v89Mf76+XV7XDz8Yc/
+        r39T15SM9PkLeXvOWvg0jJGMk+xgnwmNCuv56nJ5+XZ1tbqswJACxZLWjTa/SPOBhefLs+XF/Gw1
+        P3+7z+4Te1K3hr9mAACP9Vt0SqCvbg1nzfPJQKrYkVsfggBcTrGcOFRlNRRzzRH0SYykSr8BSffg
+        UaDjLQFCV2QDit5TBvhbfmbBCO/q/xpue1ZgBesJ6OtI3iiAkRqkycbJGrjv2ffgk5S6CqkFhECG
+        HClAIPWZx9Kkgtz3aJVq37vChXoH2qcpBogp3UHkO1rAbU/QViW7Os8hLD5OgQBjBCFfOpEfgKVN
+        ecBSpoFAQxK1jMbSgY+Y2R6aWjJTT6K8JSHVBlACYOgpk3gCS4DyADqS55YpQDdxoMhCuoCbgwKf
+        tpSB0PeAJdaKseKpOsn0z8SZBhJrgESnXERY8S0JRsxWulkoilkKkDJ0JJQx8jcKi13DX3pWyuWm
+        FPDQN8jU7mW3KRfdSaj2r5ZLMEmgXOYg7K5OlcQYI1Cs4vSFavSVmLWnsDgdnEztpFiGV6YYTwAU
+        SVYbXkf20x55OgxpTN2Y02f9LtW1LKz9JhNqkjKQaml0FX2aAXyqyzC9mG835jSMtrF0R7Xc+Zvz
+        HZ877uARvXqzBy0ZxuP58nLVvMK32Q2rnqyT8+h7CsfU4+7hFDidALMT1/9V8xr3zjlL93/oj4D3
+        NBqFzZgpsH/p+BiW6Utd0dfDDl2ugl2ZK/a0MaZcbiJQi1PcPRxOH9Ro2LQsHeUxc309yk3Onmb/
+        AgAA//8DAAbYfvVABQAA
+    headers:
+      CF-RAY:
+      - 95f9c7ffa8331b11-GRU
+      Connection:
+      - keep-alive
+      Content-Encoding:
+      - gzip
+      Content-Type:
+      - application/json
+      Date:
+      - Tue, 15 Jul 2025 13:59:38 GMT
+      Server:
+      - cloudflare
+      Set-Cookie:
+      - __cf_bm=J_xe1AP.B5P6D2GVMCesyioeS5E9DnYT34rbwQUefFc-1752587978-1.0.1.1-5Dflk5cAj6YCsOSVbCFWWSpXpw_mXsczIdzWzs2h2OwDL01HQbduE5LAToy67sfjFjHeeO4xRrqPLUQpySy2QqyHXbI_fzX4UAt3.UdwHxU;
+        path=/; expires=Tue, 15-Jul-25 14:29:38 GMT; domain=.api.openai.com; HttpOnly;
+        Secure; SameSite=None
+      - _cfuvid=0rTD8RMpxBQQy42jzmum16_eoRtWNfaZMG_TJkhGS7I-1752587978437-0.0.1.1-604800000;
+        path=/; domain=.api.openai.com; HttpOnly; Secure; SameSite=None
+      Transfer-Encoding:
+      - chunked
+      X-Content-Type-Options:
+      - nosniff
+      access-control-expose-headers:
+      - X-Request-ID
+      alt-svc:
+      - h3=":443"; ma=86400
+      cf-cache-status:
+      - DYNAMIC
+      openai-organization:
+      - crewai-iuxna1
+      openai-processing-ms:
+      - '2623'
+      openai-version:
+      - '2020-10-01'
+      strict-transport-security:
+      - max-age=31536000; includeSubDomains; preload
+      x-envoy-upstream-service-time:
+      - '2626'
+      x-ratelimit-limit-requests:
+      - '30000'
+      x-ratelimit-limit-tokens:
+      - '150000000'
+      x-ratelimit-remaining-requests:
+      - '29999'
+      x-ratelimit-remaining-tokens:
+      - '149999813'
+      x-ratelimit-reset-requests:
+      - 2ms
+      x-ratelimit-reset-tokens:
+      - 0s
+      x-request-id:
+      - req_ccc347e91010713379c920aa0efd1f4f
+    status:
+      code: 200
+      message: OK
+version: 1
--- a/tests/cassettes/test_llm_call_when_stop_is_unsupported.yaml
+++ b/tests/cassettes/test_llm_call_when_stop_is_unsupported.yaml
@@ -0,0 +1,209 @@
+interactions:
+- request:
+    body: '{"messages": [{"role": "user", "content": "What is the capital of France?"}],
+      "model": "o1-mini", "stop": ["stop"]}'
+    headers:
+      accept:
+      - application/json
+      accept-encoding:
+      - gzip, deflate, zstd
+      connection:
+      - keep-alive
+      content-length:
+      - '115'
+      content-type:
+      - application/json
+      host:
+      - api.openai.com
+      user-agent:
+      - OpenAI/Python 1.75.0
+      x-stainless-arch:
+      - arm64
+      x-stainless-async:
+      - 'false'
+      x-stainless-lang:
+      - python
+      x-stainless-os:
+      - MacOS
+      x-stainless-package-version:
+      - 1.75.0
+      x-stainless-raw-response:
+      - 'true'
+      x-stainless-read-timeout:
+      - '600.0'
+      x-stainless-retry-count:
+      - '0'
+      x-stainless-runtime:
+      - CPython
+      x-stainless-runtime-version:
+      - 3.11.12
+    method: POST
+    uri: https://api.openai.com/v1/chat/completions
+  response:
+    body:
+      string: "{\n  \"error\": {\n    \"message\": \"Unsupported parameter: 'stop'
+        is not supported with this model.\",\n    \"type\": \"invalid_request_error\",\n
+        \   \"param\": \"stop\",\n    \"code\": \"unsupported_parameter\"\n  }\n}"
+    headers:
+      CF-RAY:
+      - 961215744c94cb45-GIG
+      Connection:
+      - keep-alive
+      Content-Length:
+      - '196'
+      Content-Type:
+      - application/json
+      Date:
+      - Fri, 18 Jul 2025 12:46:46 GMT
+      Server:
+      - cloudflare
+      Set-Cookie:
+      - __cf_bm=KwJ1K47OHX4n2TZN8bMW37yKzKyK__S4HbTiCfyWjXM-1752842806-1.0.1.1-lweHFR7Kv2v7hT5I6xxYVz_7Ruu6aBdEgpJrSWrMxi_ficAeWC0oDeQ.0w2Lr1WRejIjqqcwSgdl6RixF2qEkjJZfS0pz_Vjjqexe44ayp4;
+        path=/; expires=Fri, 18-Jul-25 13:16:46 GMT; domain=.api.openai.com; HttpOnly;
+        Secure; SameSite=None
+      - _cfuvid=zv09c6bwcgNsYU80ah3wXzqeaIKyt_h61EAh_XRA87I-1752842806652-0.0.1.1-604800000;
+        path=/; domain=.api.openai.com; HttpOnly; Secure; SameSite=None
+      X-Content-Type-Options:
+      - nosniff
+      access-control-expose-headers:
+      - X-Request-ID
+      alt-svc:
+      - h3=":443"; ma=86400
+      cf-cache-status:
+      - DYNAMIC
+      openai-organization:
+      - crewai-iuxna1
+      openai-processing-ms:
+      - '20'
+      openai-project:
+      - proj_xitITlrFeen7zjNSzML82h9x
+      openai-version:
+      - '2020-10-01'
+      strict-transport-security:
+      - max-age=31536000; includeSubDomains; preload
+      x-envoy-upstream-service-time:
+      - '32'
+      x-ratelimit-limit-requests:
+      - '30000'
+      x-ratelimit-limit-tokens:
+      - '150000000'
+      x-ratelimit-remaining-requests:
+      - '29999'
+      x-ratelimit-remaining-tokens:
+      - '149999990'
+      x-ratelimit-reset-requests:
+      - 2ms
+      x-ratelimit-reset-tokens:
+      - 0s
+      x-request-id:
+      - req_7be4715c3ee32aa406eacb68c7cc966e
+    status:
+      code: 400
+      message: Bad Request
+- request:
+    body: '{"messages": [{"role": "user", "content": "What is the capital of France?"}],
+      "model": "o1-mini"}'
+    headers:
+      accept:
+      - application/json
+      accept-encoding:
+      - gzip, deflate, zstd
+      connection:
+      - keep-alive
+      content-length:
+      - '97'
+      content-type:
+      - application/json
+      cookie:
+      - __cf_bm=KwJ1K47OHX4n2TZN8bMW37yKzKyK__S4HbTiCfyWjXM-1752842806-1.0.1.1-lweHFR7Kv2v7hT5I6xxYVz_7Ruu6aBdEgpJrSWrMxi_ficAeWC0oDeQ.0w2Lr1WRejIjqqcwSgdl6RixF2qEkjJZfS0pz_Vjjqexe44ayp4;
+        _cfuvid=zv09c6bwcgNsYU80ah3wXzqeaIKyt_h61EAh_XRA87I-1752842806652-0.0.1.1-604800000
+      host:
+      - api.openai.com
+      user-agent:
+      - OpenAI/Python 1.75.0
+      x-stainless-arch:
+      - arm64
+      x-stainless-async:
+      - 'false'
+      x-stainless-lang:
+      - python
+      x-stainless-os:
+      - MacOS
+      x-stainless-package-version:
+      - 1.75.0
+      x-stainless-raw-response:
+      - 'true'
+      x-stainless-read-timeout:
+      - '600.0'
+      x-stainless-retry-count:
+      - '0'
+      x-stainless-runtime:
+      - CPython
+      x-stainless-runtime-version:
+      - 3.11.12
+    method: POST
+    uri: https://api.openai.com/v1/chat/completions
+  response:
+    body:
+      string: !!binary |
+        H4sIAAAAAAAAA3RSwU7jMBC95ytGPlYNakJhQ2/sgSsg7QUhFA32pJni2JHtwFao/76yC3XQwsWH
+        efOe35uZ9wJAsBIbELLHIIdRl78nGvaqOt/dPDxf71/fdg/9bXO3e5ETXt+LZWTY5x3J8Mk6k3YY
+        NQW25ghLRxgoqla/LupmXTeXqwQMVpGONFuVAxsu61W9LldXZVV/MHvLkrzYwGMBAPCe3ujRKPor
+        NpB0UmUg73FLYnNqAhDO6lgR6D37gCaIZQalNYFMsv2nJ5A4ckANtoMbh0YSsIfF4g4d+8XibM50
+        1E0eo3MzaT0D0BgbMCZPnp8+kMPJZceGfd86Qm9N/NkHO4qEHgqAp5R6+hJEjM4OY2iDfaEkW62P
+        ciLPOYPNJxhsQJ3rV83yG7VWUUDWfjY1IVH2pDIzjxgnxXYGFLNs/5v5TvuYm802q1yuf9TPgJQ0
+        BlLt6Eix/Jo4tzmKZ/hT22nIybHw5F5ZUhuYXFyEog4nfTwQ4fc+0NB2bLbkRsfpSuKui0PxDwAA
+        //8DAN7IUy8kAwAA
+    headers:
+      CF-RAY:
+      - 961216c3f9837e07-GRU
+      Connection:
+      - keep-alive
+      Content-Encoding:
+      - gzip
+      Content-Type:
+      - application/json
+      Date:
+      - Fri, 18 Jul 2025 12:47:41 GMT
+      Server:
+      - cloudflare
+      Transfer-Encoding:
+      - chunked
+      X-Content-Type-Options:
+      - nosniff
+      access-control-expose-headers:
+      - X-Request-ID
+      alt-svc:
+      - h3=":443"; ma=86400
+      cf-cache-status:
+      - DYNAMIC
+      openai-organization:
+      - crewai-iuxna1
+      openai-processing-ms:
+      - '1027'
+      openai-project:
+      - proj_xitITlrFeen7zjNSzML82h9x
+      openai-version:
+      - '2020-10-01'
+      strict-transport-security:
+      - max-age=31536000; includeSubDomains; preload
+      x-envoy-upstream-service-time:
+      - '1029'
+      x-ratelimit-limit-requests:
+      - '30000'
+      x-ratelimit-limit-tokens:
+      - '150000000'
+      x-ratelimit-remaining-requests:
+      - '29999'
+      x-ratelimit-remaining-tokens:
+      - '149999990'
+      x-ratelimit-reset-requests:
+      - 2ms
+      x-ratelimit-reset-tokens:
+      - 0s
+      x-request-id:
+      - req_19a0763b09f0410b9d09598078a04cd6
+    status:
+      code: 200
+      message: OK
+version: 1
--- a/tests/cassettes/test_llm_call_when_stop_is_unsupported_when_additional_drop_params_is_provided.yaml
+++ b/tests/cassettes/test_llm_call_when_stop_is_unsupported_when_additional_drop_params_is_provided.yaml
@@ -0,0 +1,206 @@
+interactions:
+- request:
+    body: '{"messages": [{"role": "user", "content": "What is the capital of France?"}],
+      "model": "o1-mini", "stop": ["stop"]}'
+    headers:
+      accept:
+      - application/json
+      accept-encoding:
+      - gzip, deflate, zstd
+      connection:
+      - keep-alive
+      content-length:
+      - '115'
+      content-type:
+      - application/json
+      cookie:
+      - __cf_bm=KwJ1K47OHX4n2TZN8bMW37yKzKyK__S4HbTiCfyWjXM-1752842806-1.0.1.1-lweHFR7Kv2v7hT5I6xxYVz_7Ruu6aBdEgpJrSWrMxi_ficAeWC0oDeQ.0w2Lr1WRejIjqqcwSgdl6RixF2qEkjJZfS0pz_Vjjqexe44ayp4;
+        _cfuvid=zv09c6bwcgNsYU80ah3wXzqeaIKyt_h61EAh_XRA87I-1752842806652-0.0.1.1-604800000
+      host:
+      - api.openai.com
+      user-agent:
+      - OpenAI/Python 1.75.0
+      x-stainless-arch:
+      - arm64
+      x-stainless-async:
+      - 'false'
+      x-stainless-lang:
+      - python
+      x-stainless-os:
+      - MacOS
+      x-stainless-package-version:
+      - 1.75.0
+      x-stainless-raw-response:
+      - 'true'
+      x-stainless-read-timeout:
+      - '600.0'
+      x-stainless-retry-count:
+      - '0'
+      x-stainless-runtime:
+      - CPython
+      x-stainless-runtime-version:
+      - 3.11.12
+    method: POST
+    uri: https://api.openai.com/v1/chat/completions
+  response:
+    body:
+      string: "{\n  \"error\": {\n    \"message\": \"Unsupported parameter: 'stop'
+        is not supported with this model.\",\n    \"type\": \"invalid_request_error\",\n
+        \   \"param\": \"stop\",\n    \"code\": \"unsupported_parameter\"\n  }\n}"
+    headers:
+      CF-RAY:
+      - 961220323a627e05-GRU
+      Connection:
+      - keep-alive
+      Content-Length:
+      - '196'
+      Content-Type:
+      - application/json
+      Date:
+      - Fri, 18 Jul 2025 12:54:06 GMT
+      Server:
+      - cloudflare
+      X-Content-Type-Options:
+      - nosniff
+      access-control-expose-headers:
+      - X-Request-ID
+      alt-svc:
+      - h3=":443"; ma=86400
+      cf-cache-status:
+      - DYNAMIC
+      openai-organization:
+      - crewai-iuxna1
+      openai-processing-ms:
+      - '9'
+      openai-project:
+      - proj_xitITlrFeen7zjNSzML82h9x
+      openai-version:
+      - '2020-10-01'
+      strict-transport-security:
+      - max-age=31536000; includeSubDomains; preload
+      x-envoy-upstream-service-time:
+      - '11'
+      x-ratelimit-limit-requests:
+      - '30000'
+      x-ratelimit-limit-tokens:
+      - '150000000'
+      x-ratelimit-remaining-requests:
+      - '29999'
+      x-ratelimit-remaining-tokens:
+      - '149999990'
+      x-ratelimit-reset-requests:
+      - 2ms
+      x-ratelimit-reset-tokens:
+      - 0s
+      x-request-id:
+      - req_e8d7880c5977029062d8487d215e5282
+    status:
+      code: 400
+      message: Bad Request
+- request:
+    body: '{"messages": [{"role": "user", "content": "What is the capital of France?"}],
+      "model": "o1-mini"}'
+    headers:
+      accept:
+      - application/json
+      accept-encoding:
+      - gzip, deflate, zstd
+      connection:
+      - keep-alive
+      content-length:
+      - '97'
+      content-type:
+      - application/json
+      cookie:
+      - __cf_bm=KwJ1K47OHX4n2TZN8bMW37yKzKyK__S4HbTiCfyWjXM-1752842806-1.0.1.1-lweHFR7Kv2v7hT5I6xxYVz_7Ruu6aBdEgpJrSWrMxi_ficAeWC0oDeQ.0w2Lr1WRejIjqqcwSgdl6RixF2qEkjJZfS0pz_Vjjqexe44ayp4;
+        _cfuvid=zv09c6bwcgNsYU80ah3wXzqeaIKyt_h61EAh_XRA87I-1752842806652-0.0.1.1-604800000
+      host:
+      - api.openai.com
+      user-agent:
+      - OpenAI/Python 1.75.0
+      x-stainless-arch:
+      - arm64
+      x-stainless-async:
+      - 'false'
+      x-stainless-lang:
+      - python
+      x-stainless-os:
+      - MacOS
+      x-stainless-package-version:
+      - 1.75.0
+      x-stainless-raw-response:
+      - 'true'
+      x-stainless-read-timeout:
+      - '600.0'
+      x-stainless-retry-count:
+      - '0'
+      x-stainless-runtime:
+      - CPython
+      x-stainless-runtime-version:
+      - 3.11.12
+    method: POST
+    uri: https://api.openai.com/v1/chat/completions
+  response:
+    body:
+      string: !!binary |
+        H4sIAAAAAAAAA3SSQW/bMAyF7/4Vgo5BXCSeV6c5bkAPPTVbMaAYCoOT6JitLAkSPbQo8t8HKWns
+        Yu1FB3181HsUXwshJGm5FVL1wGrwpvw2In/fXY3Pcd/sftzf9ENvnurm569dc9/IZVK4P4+o+E11
+        odzgDTI5e8QqIDCmruvma7Wpv1T1ZQaD02iSzK3LgSyV1aqqy9VVua5Oyt6Rwii34nchhBCv+Uwe
+        rcZnuRWr5dvNgDHCHuX2XCSEDM6kGwkxUmSwLJcTVM4y2mz7rkehwBODEa4T1wGsQkFRLBa3ECgu
+        FhdzZcBujJCc29GYGQBrHUNKnj0/nMjh7LIjS7FvA0J0Nr0c2XmZ6aEQ4iGnHt8FkT64wXPL7glz
+        23V9bCenOc/h5kTZMZgZuKyWH/RrNTKQibO5SQWqRz1JpyHDqMnNQDFL97+dj3ofk5Pdz5xVm08f
+        mIBS6Bl16wNqUu9DT2UB0yZ+Vnaec7YsI4a/pLBlwpD+QmMHoznuiIwvkXFoO7J7DD5QXpT03cWh
+        +AcAAP//AwAo/zsSJwMAAA==
+    headers:
+      CF-RAY:
+      - 961220338bd47e05-GRU
+      Connection:
+      - keep-alive
+      Content-Encoding:
+      - gzip
+      Content-Type:
+      - application/json
+      Date:
+      - Fri, 18 Jul 2025 12:54:08 GMT
+      Server:
+      - cloudflare
+      Transfer-Encoding:
+      - chunked
+      X-Content-Type-Options:
+      - nosniff
+      access-control-expose-headers:
+      - X-Request-ID
+      alt-svc:
+      - h3=":443"; ma=86400
+      cf-cache-status:
+      - DYNAMIC
+      openai-organization:
+      - crewai-iuxna1
+      openai-processing-ms:
+      - '1280'
+      openai-project:
+      - proj_xitITlrFeen7zjNSzML82h9x
+      openai-version:
+      - '2020-10-01'
+      strict-transport-security:
+      - max-age=31536000; includeSubDomains; preload
+      x-envoy-upstream-service-time:
+      - '1286'
+      x-ratelimit-limit-requests:
+      - '30000'
+      x-ratelimit-limit-tokens:
+      - '150000000'
+      x-ratelimit-remaining-requests:
+      - '29999'
+      x-ratelimit-remaining-tokens:
+      - '149999990'
+      x-ratelimit-reset-requests:
+      - 2ms
+      x-ratelimit-reset-tokens:
+      - 0s
+      x-request-id:
+      - req_b7390d46fa4e14380d42162cb22045df
+    status:
+      code: 200
+      message: OK
+version: 1
--- a/tests/evaluation/metrics/__init__.py
+++ b/tests/evaluation/metrics/__init__.py
--- a/tests/evaluation/test_agent_evaluator.py
+++ b/tests/evaluation/test_agent_evaluator.py
@@ -1,95 +0,0 @@
-import pytest
-
-from crewai.agent import Agent
-from crewai.task import Task
-from crewai.crew import Crew
-from crewai.evaluation.agent_evaluator import AgentEvaluator
-from crewai.evaluation.base_evaluator import AgentEvaluationResult
-from crewai.evaluation import (
-    GoalAlignmentEvaluator,
-    SemanticQualityEvaluator,
-    ToolSelectionEvaluator,
-    ParameterExtractionEvaluator,
-    ToolInvocationEvaluator,
-    ReasoningEfficiencyEvaluator
-)
-
-from crewai.evaluation import create_default_evaluator
-class TestAgentEvaluator:
-    @pytest.fixture
-    def mock_crew(self):
-        agent = Agent(
-            role="Test Agent",
-            goal="Complete test tasks successfully",
-            backstory="An agent created for testing purposes",
-            allow_delegation=False,
-            verbose=False
-        )
-
-        task = Task(
-            description="Test task description",
-            agent=agent,
-            expected_output="Expected test output"
-        )
-
-        crew = Crew(
-            agents=[agent],
-            tasks=[task]
-        )
-        return crew
-
-    def test_set_iteration(self):
-        agent_evaluator = AgentEvaluator()
-
-        agent_evaluator.set_iteration(3)
-        assert agent_evaluator.iteration == 3
-
-    @pytest.mark.vcr(filter_headers=["authorization"])
-    def test_evaluate_current_iteration(self, mock_crew):
-        agent_evaluator = AgentEvaluator(crew=mock_crew, evaluators=[GoalAlignmentEvaluator()])
-
-        mock_crew.kickoff()
-
-        results = agent_evaluator.evaluate_current_iteration()
-
-        assert isinstance(results, dict)
-
-        agent, = mock_crew.agents
-        task, = mock_crew.tasks
-
-        assert len(mock_crew.agents) == 1
-        assert agent.role in results
-        assert len(results[agent.role]) == 1
-
-        result, = results[agent.role]
-        assert isinstance(result, AgentEvaluationResult)
-
-        assert result.agent_id == str(agent.id)
-        assert result.task_id == str(task.id)
-
-        goal_alignment, = result.metrics.values()
-        assert goal_alignment.score == 5.0
-
-        expected_feedback = "The agent's output demonstrates an understanding of the need for a comprehensive document"
-        assert expected_feedback in goal_alignment.feedback
-
-        assert goal_alignment.raw_response is not None
-        assert '"score": 5' in goal_alignment.raw_response
-
-    def test_create_default_evaluator(self, mock_crew):
-        agent_evaluator = create_default_evaluator(crew=mock_crew)
-        assert isinstance(agent_evaluator, AgentEvaluator)
-        assert agent_evaluator.crew == mock_crew
-
-        expected_types = [
-            GoalAlignmentEvaluator,
-            SemanticQualityEvaluator,
-            ToolSelectionEvaluator,
-            ParameterExtractionEvaluator,
-            ToolInvocationEvaluator,
-            ReasoningEfficiencyEvaluator
-        ]
-
-        assert len(agent_evaluator.evaluators) == len(expected_types)
-        for evaluator, expected_type in zip(agent_evaluator.evaluators, expected_types):
-            assert isinstance(evaluator, expected_type)
--- a/tests/experimental/evaluation/__init__.py
+++ b/tests/experimental/evaluation/__init__.py
--- a/tests/experimental/evaluation/metrics/__init__.py
+++ b/tests/experimental/evaluation/metrics/__init__.py
--- a/tests/experimental/evaluation/metrics/base_evaluation_metrics_test.py
+++ b/tests/experimental/evaluation/metrics/base_evaluation_metrics_test.py
--- a/tests/experimental/evaluation/metrics/test_goal_metrics.py
+++ b/tests/experimental/evaluation/metrics/test_goal_metrics.py
@@ -1,8 +1,8 @@
 from unittest.mock import patch, MagicMock
-from tests.evaluation.metrics.base_evaluation_metrics_test import BaseEvaluationMetricsTest
+from tests.experimental.evaluation.metrics.base_evaluation_metrics_test import BaseEvaluationMetricsTest

-from crewai.evaluation.base_evaluator import EvaluationScore
-from crewai.evaluation.metrics.goal_metrics import GoalAlignmentEvaluator
+from crewai.experimental.evaluation.base_evaluator import EvaluationScore
+from crewai.experimental.evaluation.metrics.goal_metrics import GoalAlignmentEvaluator
 from crewai.utilities.llm_utils import LLM


--- a/tests/experimental/evaluation/metrics/test_reasoning_metrics.py
+++ b/tests/experimental/evaluation/metrics/test_reasoning_metrics.py
@@ -3,12 +3,12 @@ from unittest.mock import patch, MagicMock
 from typing import List, Dict, Any

 from crewai.tasks.task_output import TaskOutput
-from crewai.evaluation.metrics.reasoning_metrics import (
+from crewai.experimental.evaluation.metrics.reasoning_metrics import (
    ReasoningEfficiencyEvaluator,
 )
-from tests.evaluation.metrics.base_evaluation_metrics_test import BaseEvaluationMetricsTest
+from tests.experimental.evaluation.metrics.base_evaluation_metrics_test import BaseEvaluationMetricsTest
 from crewai.utilities.llm_utils import LLM
-from crewai.evaluation.base_evaluator import EvaluationScore
+from crewai.experimental.evaluation.base_evaluator import EvaluationScore

 class TestReasoningEfficiencyEvaluator(BaseEvaluationMetricsTest):
    @pytest.fixture
--- a/tests/experimental/evaluation/metrics/test_semantic_quality_metrics.py
+++ b/tests/experimental/evaluation/metrics/test_semantic_quality_metrics.py
@@ -1,8 +1,8 @@
 from unittest.mock import patch, MagicMock

-from crewai.evaluation.base_evaluator import EvaluationScore
-from crewai.evaluation.metrics.semantic_quality_metrics import SemanticQualityEvaluator
-from tests.evaluation.metrics.base_evaluation_metrics_test import BaseEvaluationMetricsTest
+from crewai.experimental.evaluation.base_evaluator import EvaluationScore
+from crewai.experimental.evaluation.metrics.semantic_quality_metrics import SemanticQualityEvaluator
+from tests.experimental.evaluation.metrics.base_evaluation_metrics_test import BaseEvaluationMetricsTest
 from crewai.utilities.llm_utils import LLM

 class TestSemanticQualityEvaluator(BaseEvaluationMetricsTest):
--- a/tests/experimental/evaluation/metrics/test_tools_metrics.py
+++ b/tests/experimental/evaluation/metrics/test_tools_metrics.py
@@ -1,12 +1,12 @@
 from unittest.mock import patch, MagicMock

-from crewai.evaluation.metrics.tools_metrics import (
+from crewai.experimental.evaluation.metrics.tools_metrics import (
    ToolSelectionEvaluator,
    ParameterExtractionEvaluator,
    ToolInvocationEvaluator
 )
 from crewai.utilities.llm_utils import LLM
-from tests.evaluation.metrics.base_evaluation_metrics_test import BaseEvaluationMetricsTest
+from tests.experimental.evaluation.metrics.base_evaluation_metrics_test import BaseEvaluationMetricsTest

 class TestToolSelectionEvaluator(BaseEvaluationMetricsTest):
    def test_no_tools_available(self, mock_task, mock_agent):
--- a/tests/experimental/evaluation/test_agent_evaluator.py
+++ b/tests/experimental/evaluation/test_agent_evaluator.py
@@ -0,0 +1,278 @@
+import pytest
+
+from crewai.agent import Agent
+from crewai.task import Task
+from crewai.crew import Crew
+from crewai.experimental.evaluation.agent_evaluator import AgentEvaluator
+from crewai.experimental.evaluation.base_evaluator import AgentEvaluationResult
+from crewai.experimental.evaluation import (
+    GoalAlignmentEvaluator,
+    SemanticQualityEvaluator,
+    ToolSelectionEvaluator,
+    ParameterExtractionEvaluator,
+    ToolInvocationEvaluator,
+    ReasoningEfficiencyEvaluator,
+    MetricCategory,
+    EvaluationScore
+)
+
+from crewai.utilities.events.agent_events import AgentEvaluationStartedEvent, AgentEvaluationCompletedEvent, AgentEvaluationFailedEvent
+from crewai.utilities.events.crewai_event_bus import crewai_event_bus
+from crewai.experimental.evaluation import create_default_evaluator
+
+class TestAgentEvaluator:
+    @pytest.fixture
+    def mock_crew(self):
+        agent = Agent(
+            role="Test Agent",
+            goal="Complete test tasks successfully",
+            backstory="An agent created for testing purposes",
+            allow_delegation=False,
+            verbose=False
+        )
+
+        task = Task(
+            description="Test task description",
+            agent=agent,
+            expected_output="Expected test output"
+        )
+
+        crew = Crew(
+            agents=[agent],
+            tasks=[task]
+        )
+        return crew
+
+    def test_set_iteration(self):
+        agent_evaluator = AgentEvaluator(agents=[])
+
+        agent_evaluator.set_iteration(3)
+        assert agent_evaluator._execution_state.iteration == 3
+
+    @pytest.mark.vcr(filter_headers=["authorization"])
+    def test_evaluate_current_iteration(self, mock_crew):
+        agent_evaluator = AgentEvaluator(agents=mock_crew.agents, evaluators=[GoalAlignmentEvaluator()])
+
+        mock_crew.kickoff()
+
+        results = agent_evaluator.get_evaluation_results()
+
+        assert isinstance(results, dict)
+
+        agent, = mock_crew.agents
+        task, = mock_crew.tasks
+
+        assert len(mock_crew.agents) == 1
+        assert agent.role in results
+        assert len(results[agent.role]) == 1
+
+        result, = results[agent.role]
+        assert isinstance(result, AgentEvaluationResult)
+
+        assert result.agent_id == str(agent.id)
+        assert result.task_id == str(task.id)
+
+        goal_alignment, = result.metrics.values()
+        assert goal_alignment.score == 5.0
+
+        expected_feedback = "The agent's output demonstrates an understanding of the need for a comprehensive document outlining task"
+        assert expected_feedback in goal_alignment.feedback
+
+        assert goal_alignment.raw_response is not None
+        assert '"score": 5' in goal_alignment.raw_response
+
+    def test_create_default_evaluator(self, mock_crew):
+        agent_evaluator = create_default_evaluator(agents=mock_crew.agents)
+        assert isinstance(agent_evaluator, AgentEvaluator)
+        assert agent_evaluator.agents == mock_crew.agents
+
+        expected_types = [
+            GoalAlignmentEvaluator,
+            SemanticQualityEvaluator,
+            ToolSelectionEvaluator,
+            ParameterExtractionEvaluator,
+            ToolInvocationEvaluator,
+            ReasoningEfficiencyEvaluator
+        ]
+
+        assert len(agent_evaluator.evaluators) == len(expected_types)
+        for evaluator, expected_type in zip(agent_evaluator.evaluators, expected_types):
+            assert isinstance(evaluator, expected_type)
+
+    @pytest.mark.vcr(filter_headers=["authorization"])
+    def test_eval_lite_agent(self):
+        agent = Agent(
+            role="Test Agent",
+            goal="Complete test tasks successfully",
+            backstory="An agent created for testing purposes",
+        )
+
+        with crewai_event_bus.scoped_handlers():
+            events = {}
+            @crewai_event_bus.on(AgentEvaluationStartedEvent)
+            def capture_started(source, event):
+                events["started"] = event
+
+            @crewai_event_bus.on(AgentEvaluationCompletedEvent)
+            def capture_completed(source, event):
+                events["completed"] = event
+
+            @crewai_event_bus.on(AgentEvaluationFailedEvent)
+            def capture_failed(source, event):
+                events["failed"] = event
+
+            agent_evaluator = AgentEvaluator(agents=[agent], evaluators=[GoalAlignmentEvaluator()])
+
+            agent.kickoff(messages="Complete this task successfully")
+
+            assert events.keys() == {"started", "completed"}
+            assert events["started"].agent_id == str(agent.id)
+            assert events["started"].agent_role == agent.role
+            assert events["started"].task_id is None
+            assert events["started"].iteration == 1
+
+            assert events["completed"].agent_id == str(agent.id)
+            assert events["completed"].agent_role == agent.role
+            assert events["completed"].task_id is None
+            assert events["completed"].iteration == 1
+            assert events["completed"].metric_category == MetricCategory.GOAL_ALIGNMENT
+            assert isinstance(events["completed"].score, EvaluationScore)
+            assert events["completed"].score.score == 2.0
+
+            results = agent_evaluator.get_evaluation_results()
+
+            assert isinstance(results, dict)
+
+            result, = results[agent.role]
+            assert isinstance(result, AgentEvaluationResult)
+
+            assert result.agent_id == str(agent.id)
+            assert result.task_id == "lite_task"
+
+            goal_alignment, = result.metrics.values()
+            assert goal_alignment.score == 2.0
+
+            expected_feedback = "The agent did not demonstrate a clear understanding of the task goal, which is to complete test tasks successfully"
+            assert expected_feedback in goal_alignment.feedback
+
+            assert goal_alignment.raw_response is not None
+            assert '"score": 2' in goal_alignment.raw_response
+
+    @pytest.mark.vcr(filter_headers=["authorization"])
+    def test_eval_specific_agents_from_crew(self, mock_crew):
+        agent = Agent(
+            role="Test Agent Eval",
+            goal="Complete test tasks successfully",
+            backstory="An agent created for testing purposes",
+        )
+        task = Task(
+            description="Test task description",
+            agent=agent,
+            expected_output="Expected test output"
+        )
+        mock_crew.agents.append(agent)
+        mock_crew.tasks.append(task)
+
+        with crewai_event_bus.scoped_handlers():
+            events = {}
+            @crewai_event_bus.on(AgentEvaluationStartedEvent)
+            def capture_started(source, event):
+                events["started"] = event
+
+            @crewai_event_bus.on(AgentEvaluationCompletedEvent)
+            def capture_completed(source, event):
+                events["completed"] = event
+
+            @crewai_event_bus.on(AgentEvaluationFailedEvent)
+            def capture_failed(source, event):
+                events["failed"] = event
+
+            agent_evaluator = AgentEvaluator(agents=[agent], evaluators=[GoalAlignmentEvaluator()])
+            mock_crew.kickoff()
+
+            assert events.keys() == {"started", "completed"}
+            assert events["started"].agent_id == str(agent.id)
+            assert events["started"].agent_role == agent.role
+            assert events["started"].task_id == str(task.id)
+            assert events["started"].iteration == 1
+
+            assert events["completed"].agent_id == str(agent.id)
+            assert events["completed"].agent_role == agent.role
+            assert events["completed"].task_id == str(task.id)
+            assert events["completed"].iteration == 1
+            assert events["completed"].metric_category == MetricCategory.GOAL_ALIGNMENT
+            assert isinstance(events["completed"].score, EvaluationScore)
+            assert events["completed"].score.score == 5.0
+
+            results = agent_evaluator.get_evaluation_results()
+
+            assert isinstance(results, dict)
+            assert len(results.keys()) == 1
+            result, = results[agent.role]
+            assert isinstance(result, AgentEvaluationResult)
+
+            assert result.agent_id == str(agent.id)
+            assert result.task_id == str(task.id)
+
+            goal_alignment, = result.metrics.values()
+            assert goal_alignment.score == 5.0
+
+            expected_feedback = "The agent provided a thorough guide on how to conduct a test task but failed to produce specific expected output"
+            assert expected_feedback in goal_alignment.feedback
+
+            assert goal_alignment.raw_response is not None
+            assert '"score": 5' in goal_alignment.raw_response
+
+
+    @pytest.mark.vcr(filter_headers=["authorization"])
+    def test_failed_evaluation(self, mock_crew):
+        agent, = mock_crew.agents
+        task, = mock_crew.tasks
+
+        with crewai_event_bus.scoped_handlers():
+            events = {}
+
+            @crewai_event_bus.on(AgentEvaluationStartedEvent)
+            def capture_started(source, event):
+                events["started"] = event
+
+            @crewai_event_bus.on(AgentEvaluationCompletedEvent)
+            def capture_completed(source, event):
+                events["completed"] = event
+
+            @crewai_event_bus.on(AgentEvaluationFailedEvent)
+            def capture_failed(source, event):
+                events["failed"] = event
+
+            # Define a stub evaluator whose evaluate() always raises, to force the failure path.
+            # MetricCategory is already imported at module level, so only BaseEvaluator is needed here.
+            from crewai.experimental.evaluation.base_evaluator import BaseEvaluator
+            class FailingEvaluator(BaseEvaluator):
+                metric_category = MetricCategory.GOAL_ALIGNMENT
+
+                def evaluate(self, agent, task, execution_trace, final_output):
+                    raise ValueError("Forced evaluation failure")
+
+            agent_evaluator = AgentEvaluator(agents=[agent], evaluators=[FailingEvaluator()])
+            mock_crew.kickoff()
+
+            assert events.keys() == {"started", "failed"}
+            assert events["started"].agent_id == str(agent.id)
+            assert events["started"].agent_role == agent.role
+            assert events["started"].task_id == str(task.id)
+            assert events["started"].iteration == 1
+
+            assert events["failed"].agent_id == str(agent.id)
+            assert events["failed"].agent_role == agent.role
+            assert events["failed"].task_id == str(task.id)
+            assert events["failed"].iteration == 1
+            assert events["failed"].error == "Forced evaluation failure"
+
+            results = agent_evaluator.get_evaluation_results()
+            result, = results[agent.role]
+            assert isinstance(result, AgentEvaluationResult)
+
+            assert result.agent_id == str(agent.id)
+            assert result.task_id == str(task.id)
+
+            assert result.metrics == {}
--- a/tests/experimental/evaluation/test_experiment_result.py
+++ b/tests/experimental/evaluation/test_experiment_result.py
@@ -0,0 +1,111 @@
+import pytest
+from unittest.mock import MagicMock, patch
+
+from crewai.experimental.evaluation.experiment.result import ExperimentResult, ExperimentResults
+
+
+class TestExperimentResult:
+    @pytest.fixture
+    def mock_results(self):
+        return [
+            ExperimentResult(
+                identifier="test-1",
+                inputs={"query": "What is the capital of France?"},
+                score=10,
+                expected_score=7,
+                passed=True
+            ),
+            ExperimentResult(
+                identifier="test-2",
+                inputs={"query": "Who wrote Hamlet?"},
+                score={"relevance": 9, "factuality": 8},
+                expected_score={"relevance": 7, "factuality": 7},
+                passed=True,
+                agent_evaluations={"agent1": {"metrics": {"goal_alignment": {"score": 9}}}}
+            ),
+            ExperimentResult(
+                identifier="test-3",
+                inputs={"query": "Any query"},
+                score={"relevance": 9, "factuality": 8},
+                expected_score={"relevance": 7, "factuality": 7},
+                passed=False,
+                agent_evaluations={"agent1": {"metrics": {"goal_alignment": {"score": 9}}}}
+            ),
+            ExperimentResult(
+                identifier="test-4",
+                inputs={"query": "Another query"},
+                score={"relevance": 9, "factuality": 8},
+                expected_score={"relevance": 7, "factuality": 7},
+                passed=True,
+                agent_evaluations={"agent1": {"metrics": {"goal_alignment": {"score": 9}}}}
+            ),
+            ExperimentResult(
+                identifier="test-6",
+                inputs={"query": "Yet another query"},
+                score={"relevance": 9, "factuality": 8},
+                expected_score={"relevance": 7, "factuality": 7},
+                passed=True,
+                agent_evaluations={"agent1": {"metrics": {"goal_alignment": {"score": 9}}}}
+            )
+        ]
+
+    @patch('os.path.exists', return_value=True)
+    @patch('os.path.getsize', return_value=1)
+    @patch('json.load')
+    @patch('builtins.open', new_callable=MagicMock)
+    def test_experiment_results_compare_with_baseline(self, mock_open, mock_json_load, mock_path_getsize, mock_path_exists, mock_results):
+        baseline_data = {
+            "timestamp": "2023-01-01T00:00:00+00:00",
+            "results": [
+                {
+                    "identifier": "test-1",
+                    "inputs": {"query": "What is the capital of France?"},
+                    "score": 7,
+                    "expected_score": 7,
+                    "passed": False
+                },
+                {
+                    "identifier": "test-2",
+                    "inputs": {"query": "Who wrote Hamlet?"},
+                    "score": {"relevance": 8, "factuality": 7},
+                    "expected_score": {"relevance": 7, "factuality": 7},
+                    "passed": True
+                },
+                {
+                    "identifier": "test-3",
+                    "inputs": {"query": "Any query"},
+                    "score": {"relevance": 8, "factuality": 7},
+                    "expected_score": {"relevance": 7, "factuality": 7},
+                    "passed": True
+                },
+                {
+                    "identifier": "test-4",
+                    "inputs": {"query": "Another query"},
+                    "score": {"relevance": 8, "factuality": 7},
+                    "expected_score": {"relevance": 7, "factuality": 7},
+                    "passed": True
+                },
+                {
+                    "identifier": "test-5",
+                    "inputs": {"query": "Another query"},
+                    "score": {"relevance": 8, "factuality": 7},
+                    "expected_score": {"relevance": 7, "factuality": 7},
+                    "passed": True
+                }
+            ]
+        }
+
+        mock_json_load.return_value = baseline_data
+
+        results = ExperimentResults(results=mock_results)
+        results.display = MagicMock()
+
+        comparison = results.compare_with_baseline(baseline_filepath="baseline.json")
+
+        assert "baseline_timestamp" in comparison
+        assert comparison["baseline_timestamp"] == "2023-01-01T00:00:00+00:00"
+        assert comparison["improved"] == ["test-1"]
+        assert comparison["regressed"] == ["test-3"]
+        assert comparison["unchanged"] == ["test-2", "test-4"]
+        assert comparison["new_tests"] == ["test-6"]
+        assert comparison["missing_tests"] == ["test-5"]
--- a/tests/experimental/evaluation/test_experiment_runner.py
+++ b/tests/experimental/evaluation/test_experiment_runner.py
@@ -0,0 +1,197 @@
+import pytest
+from unittest.mock import MagicMock, patch
+
+from crewai.crew import Crew
+from crewai.experimental.evaluation.experiment.runner import ExperimentRunner
+from crewai.experimental.evaluation.experiment.result import ExperimentResults
+from crewai.experimental.evaluation.evaluation_display import AgentAggregatedEvaluationResult
+from crewai.experimental.evaluation.base_evaluator import MetricCategory, EvaluationScore
+
+
+class TestExperimentRunner:
+    @pytest.fixture
+    def mock_crew(self):
+        return MagicMock(llm=Crew)
+
+    @pytest.fixture
+    def mock_evaluator_results(self):
+        agent_evaluation = AgentAggregatedEvaluationResult(
+            agent_id="Test Agent",
+            agent_role="Test Agent Role",
+            metrics={
+                MetricCategory.GOAL_ALIGNMENT: EvaluationScore(
+                    score=9,
+                    feedback="Test feedback for goal alignment",
+                    raw_response="Test raw response for goal alignment"
+                ),
+                MetricCategory.REASONING_EFFICIENCY: EvaluationScore(
+                    score=None,
+                    feedback="Reasoning efficiency not applicable",
+                    raw_response="Reasoning efficiency not applicable"
+                ),
+                MetricCategory.PARAMETER_EXTRACTION: EvaluationScore(
+                    score=7,
+                    feedback="Test parameter extraction explanation",
+                    raw_response="Test raw output"
+                ),
+                MetricCategory.TOOL_SELECTION: EvaluationScore(
+                    score=8,
+                    feedback="Test tool selection explanation",
+                    raw_response="Test raw output"
+                )
+            }
+        )
+
+        return {"Test Agent": agent_evaluation}
+
+    @patch('crewai.experimental.evaluation.experiment.runner.create_default_evaluator')
+    def test_run_success(self, mock_create_evaluator, mock_crew, mock_evaluator_results):
+        dataset = [
+            {
+                "identifier": "test-case-1",
+                "inputs": {"query": "Test query 1"},
+                "expected_score": 8
+            },
+            {
+                "identifier": "test-case-2",
+                "inputs": {"query": "Test query 2"},
+                "expected_score": {"goal_alignment": 7}
+            },
+            {
+                "inputs": {"query": "Test query 3"},
+                "expected_score": {"tool_selection": 9}
+            }
+        ]
+
+        mock_evaluator = MagicMock()
+        mock_evaluator.get_agent_evaluation.return_value = mock_evaluator_results
+        mock_evaluator.reset_iterations_results = MagicMock()
+        mock_create_evaluator.return_value = mock_evaluator
+
+        runner = ExperimentRunner(dataset=dataset)
+
+        results = runner.run(crew=mock_crew)
+
+        assert isinstance(results, ExperimentResults)
+        assert len(results.results) == 3
+        result_1, result_2, result_3 = results.results
+
+        assert result_1.identifier == "test-case-1"
+        assert result_1.inputs == {"query": "Test query 1"}
+        assert result_1.expected_score == 8
+        assert result_1.passed is True
+
+        assert result_2.identifier == "test-case-2"
+        assert result_2.inputs == {"query": "Test query 2"}
+        assert isinstance(result_2.expected_score, dict)
+        assert "goal_alignment" in result_2.expected_score
+        assert result_2.passed is True
+
+        assert result_3.identifier == "c2ed49e63aa9a83af3ca382794134fd5"
+        assert result_3.inputs == {"query": "Test query 3"}
+        assert isinstance(result_3.expected_score, dict)
+        assert "tool_selection" in result_3.expected_score
+        assert result_3.passed is False
+
+        assert mock_crew.kickoff.call_count == 3
+        mock_crew.kickoff.assert_any_call(inputs={"query": "Test query 1"})
+        mock_crew.kickoff.assert_any_call(inputs={"query": "Test query 2"})
+        mock_crew.kickoff.assert_any_call(inputs={"query": "Test query 3"})
+
+        assert mock_evaluator.reset_iterations_results.call_count == 3
+        assert mock_evaluator.get_agent_evaluation.call_count == 3
+
+
+    @patch('crewai.experimental.evaluation.experiment.runner.create_default_evaluator')
+    def test_run_success_with_unknown_metric(self, mock_create_evaluator, mock_crew, mock_evaluator_results):
+        dataset = [
+            {
+                "identifier": "test-case-2",
+                "inputs": {"query": "Test query 2"},
+                "expected_score": {"goal_alignment": 7, "unknown_metric": 8}
+            }
+        ]
+
+        mock_evaluator = MagicMock()
+        mock_evaluator.get_agent_evaluation.return_value = mock_evaluator_results
+        mock_evaluator.reset_iterations_results = MagicMock()
+        mock_create_evaluator.return_value = mock_evaluator
+
+        runner = ExperimentRunner(dataset=dataset)
+
+        results = runner.run(crew=mock_crew)
+
+        result, = results.results
+
+        assert result.identifier == "test-case-2"
+        assert result.inputs == {"query": "Test query 2"}
+        assert isinstance(result.expected_score, dict)
+        assert "goal_alignment" in result.expected_score.keys()
+        assert "unknown_metric" in result.expected_score.keys()
+        assert result.passed is True
+
+    @patch('crewai.experimental.evaluation.experiment.runner.create_default_evaluator')
+    def test_run_success_with_single_metric_evaluator_and_expected_specific_metric(self, mock_create_evaluator, mock_crew, mock_evaluator_results):
+        dataset = [
+            {
+                "identifier": "test-case-2",
+                "inputs": {"query": "Test query 2"},
+                "expected_score": {"goal_alignment": 7}
+            }
+        ]
+
+        mock_evaluator = MagicMock()
+        mock_evaluator_results["Test Agent"].metrics = {
+            MetricCategory.GOAL_ALIGNMENT: EvaluationScore(
+                    score=9,
+                    feedback="Test feedback for goal alignment",
+                    raw_response="Test raw response for goal alignment"
+                )
+        }
+        mock_evaluator.get_agent_evaluation.return_value = mock_evaluator_results
+        mock_evaluator.reset_iterations_results = MagicMock()
+        mock_create_evaluator.return_value = mock_evaluator
+
+        runner = ExperimentRunner(dataset=dataset)
+
+        results = runner.run(crew=mock_crew)
+        result, = results.results
+
+        assert result.identifier == "test-case-2"
+        assert result.inputs == {"query": "Test query 2"}
+        assert isinstance(result.expected_score, dict)
+        assert "goal_alignment" in result.expected_score.keys()
+        assert result.passed is True
+
+    @patch('crewai.experimental.evaluation.experiment.runner.create_default_evaluator')
+    def test_run_success_when_expected_metric_is_not_available(self, mock_create_evaluator, mock_crew, mock_evaluator_results):
+        dataset = [
+            {
+                "identifier": "test-case-2",
+                "inputs": {"query": "Test query 2"},
+                "expected_score": {"unknown_metric": 7}
+            }
+        ]
+
+        mock_evaluator = MagicMock()
+        mock_evaluator_results["Test Agent"].metrics = {
+            MetricCategory.GOAL_ALIGNMENT: EvaluationScore(
+                score=5,
+                feedback="Test feedback for goal alignment",
+                raw_response="Test raw response for goal alignment"
+            )
+        }
+        mock_evaluator.get_agent_evaluation.return_value = mock_evaluator_results
+        mock_evaluator.reset_iterations_results = MagicMock()
+        mock_create_evaluator.return_value = mock_evaluator
+
+        runner = ExperimentRunner(dataset=dataset)
+
+        results = runner.run(crew=mock_crew)
+        result, = results.results
+
+        assert result.identifier == "test-case-2"
+        assert result.inputs == {"query": "Test query 2"}
+        assert isinstance(result.expected_score, dict)
+        assert "unknown_metric" in result.expected_score.keys()
+        assert result.passed is False
--- a/tests/llm_test.py
+++ b/tests/llm_test.py
@@ -1,3 +1,4 @@
+import logging
 import os
 from time import sleep
 from unittest.mock import MagicMock, patch
@@ -664,3 +665,49 @@ def test_handle_streaming_tool_calls_no_tools(mock_emit):
        expected_completed_llm_call=1,
        expected_final_chunk_result=response,
    )
+
+
+@pytest.mark.vcr(filter_headers=["authorization"])
+def test_llm_call_when_stop_is_unsupported(caplog):
+    llm = LLM(model="o1-mini", stop=["stop"])
+    with caplog.at_level(logging.INFO):
+        result = llm.call("What is the capital of France?")
+        assert "Retrying LLM call without the unsupported 'stop'" in caplog.text
+    assert isinstance(result, str)
+    assert "Paris" in result
+
+@pytest.mark.vcr(filter_headers=["authorization"])
+def test_llm_call_when_stop_is_unsupported_when_additional_drop_params_is_provided(caplog):
+    llm = LLM(model="o1-mini", stop=["stop"], additional_drop_params=["another_param"])
+    with caplog.at_level(logging.INFO):
+        result = llm.call("What is the capital of France?")
+        assert "Retrying LLM call without the unsupported 'stop'" in caplog.text
+    assert isinstance(result, str)
+    assert "Paris" in result
+
+
+@pytest.fixture
+def ollama_llm():
+    return LLM(model="ollama/llama3.2:3b")
+
+def test_ollama_appends_dummy_user_message_when_last_is_assistant(ollama_llm):
+    original_messages = [
+        {"role": "user", "content": "Hi there"},
+        {"role": "assistant", "content": "Hello!"},
+    ]
+
+    formatted = ollama_llm._format_messages_for_provider(original_messages)
+
+    assert len(formatted) == len(original_messages) + 1
+    assert formatted[-1]["role"] == "user"
+    assert formatted[-1]["content"] == ""
+
+
+def test_ollama_does_not_modify_when_last_is_user(ollama_llm):
+    original_messages = [
+        {"role": "user", "content": "Tell me a joke."},
+    ]
+
+    formatted = ollama_llm._format_messages_for_provider(original_messages)
+
+    assert formatted == original_messages
--- a/tests/storage/test_mem0_storage.py
+++ b/tests/storage/test_mem0_storage.py
@@ -1,14 +1,10 @@
-import os
 from unittest.mock import MagicMock, patch

 import pytest
 from mem0.client.main import MemoryClient
 from mem0.memory.main import Memory

-from crewai.agent import Agent
-from crewai.crew import Crew
 from crewai.memory.storage.mem0_storage import Mem0Storage
-from crewai.task import Task


 # Define the class (if not already defined)
@@ -172,7 +168,7 @@ def test_save_method_with_memory_oss(mem0_storage_with_mocked_config):
    mem0_storage.save(test_value, test_metadata)
    
    mem0_storage.memory.add.assert_called_once_with(
-        test_value,
+        [{'role': 'assistant', 'content': test_value}],
        agent_id="Test_Agent",
        infer=False,
        metadata={"type": "short_term", "key": "value"},
@@ -191,7 +187,7 @@ def test_save_method_with_memory_client(mem0_storage_with_memory_client_using_co
    mem0_storage.save(test_value, test_metadata)
    
    mem0_storage.memory.add.assert_called_once_with(
-        test_value,
+        [{'role': 'assistant', 'content': test_value}],
        agent_id="Test_Agent",
        infer=False,
        metadata={"type": "short_term", "key": "value"},
--- a/tests/utilities/test_chromadb_utils.py
+++ b/tests/utilities/test_chromadb_utils.py
@@ -1,16 +1,27 @@
+import multiprocessing
+import tempfile
 import unittest
-from typing import Any, Dict, List, Union

-import pytest
+from chromadb.config import Settings
+from unittest.mock import patch, MagicMock

 from crewai.utilities.chromadb import (
    MAX_COLLECTION_LENGTH,
    MIN_COLLECTION_LENGTH,
    is_ipv4_pattern,
    sanitize_collection_name,
+    create_persistent_client,
 )


+def persistent_client_worker(path, queue):
+    try:
+        create_persistent_client(path=path)
+        queue.put(None)
+    except Exception as e:
+        queue.put(e)
+
+
 class TestChromadbUtils(unittest.TestCase):
    def test_sanitize_collection_name_long_name(self):
        """Test sanitizing a very long collection name."""
@@ -79,3 +90,34 @@ class TestChromadbUtils(unittest.TestCase):
            self.assertLessEqual(len(sanitized), MAX_COLLECTION_LENGTH)
            self.assertTrue(sanitized[0].isalnum())
            self.assertTrue(sanitized[-1].isalnum())
+
+    def test_create_persistent_client_passes_args(self):
+        with patch(
+            "crewai.utilities.chromadb.PersistentClient"
+        ) as mock_persistent_client, tempfile.TemporaryDirectory() as tmpdir:
+            mock_instance = MagicMock()
+            mock_persistent_client.return_value = mock_instance
+
+            settings = Settings(allow_reset=True)
+            client = create_persistent_client(path=tmpdir, settings=settings)
+
+            mock_persistent_client.assert_called_once_with(
+                path=tmpdir, settings=settings
+            )
+            self.assertIs(client, mock_instance)
+
+    def test_create_persistent_client_process_safe(self):
+        with tempfile.TemporaryDirectory() as tmpdir:
+            queue = multiprocessing.Queue()
+            processes = [
+                multiprocessing.Process(
+                    target=persistent_client_worker, args=(tmpdir, queue)
+                )
+                for _ in range(5)
+            ]
+
+            [p.start() for p in processes]
+            [p.join() for p in processes]
+
+            errors = [queue.get(timeout=5) for _ in processes]
+            self.assertTrue(all(err is None for err in errors))
--- a/uv.lock
+++ b/uv.lock
@@ -696,6 +696,7 @@ dependencies = [
    { name = "opentelemetry-exporter-otlp-proto-http" },
    { name = "opentelemetry-sdk" },
    { name = "pdfplumber" },
+    { name = "portalocker" },
    { name = "pydantic" },
    { name = "pyjwt" },
    { name = "python-dotenv" },
@@ -762,13 +763,13 @@ requires-dist = [
    { name = "blinker", specifier = ">=1.9.0" },
    { name = "chromadb", specifier = ">=0.5.23" },
    { name = "click", specifier = ">=8.1.7" },
-    { name = "crewai-tools", marker = "extra == 'tools'", specifier = "~=0.51.0" },
+    { name = "crewai-tools", marker = "extra == 'tools'", specifier = "~=0.55.0" },
    { name = "docling", marker = "extra == 'docling'", specifier = ">=2.12.0" },
    { name = "instructor", specifier = ">=1.3.3" },
    { name = "json-repair", specifier = "==0.25.2" },
    { name = "json5", specifier = ">=0.10.0" },
    { name = "jsonref", specifier = ">=1.1.0" },
-    { name = "litellm", specifier = "==1.72.6" },
+    { name = "litellm", specifier = "==1.74.3" },
    { name = "mem0ai", marker = "extra == 'mem0'", specifier = ">=0.1.94" },
    { name = "onnxruntime", specifier = "==1.22.0" },
    { name = "openai", specifier = ">=1.13.3" },
@@ -780,6 +781,7 @@ requires-dist = [
    { name = "pandas", marker = "extra == 'pandas'", specifier = ">=2.2.3" },
    { name = "pdfplumber", specifier = ">=0.11.4" },
    { name = "pdfplumber", marker = "extra == 'pdfplumber'", specifier = ">=0.11.4" },
+    { name = "portalocker", specifier = "==2.7.0" },
    { name = "pydantic", specifier = ">=2.4.2" },
    { name = "pyjwt", specifier = ">=2.9.0" },
    { name = "python-dotenv", specifier = ">=1.0.0" },
@@ -813,7 +815,7 @@ dev = [

 [[package]]
 name = "crewai-tools"
-version = "0.51.0"
+version = "0.55.0"
 source = { registry = "https://pypi.org/simple" }
 dependencies = [
    { name = "chromadb" },
@@ -829,9 +831,9 @@ dependencies = [
    { name = "requests" },
    { name = "tiktoken" },
 ]
-sdist = { url = "https://files.pythonhosted.org/packages/a1/ef/3426aebf495a887898466d38d6b78b09317d4c102a89493699d6af5aa823/crewai_tools-0.51.0.tar.gz", hash = "sha256:a5d73f344b740b13ffef8d189d6d210da143227982edf619e4de77896e2fd017", size = 1011735, upload-time = "2025-07-09T16:39:24.179Z" }
+sdist = { url = "https://files.pythonhosted.org/packages/f6/75/d8cae7f84e78a93210f91a4580aec8eb72dc1f33368655a8ad4e381d575b/crewai_tools-0.55.0.tar.gz", hash = "sha256:0961821128b07148197b89b1827b6c0a548424fa8a01674991528a56fd03fe81", size = 1015820, upload-time = "2025-07-16T19:16:36.648Z" }
 wheels = [
-    { url = "https://files.pythonhosted.org/packages/61/ea/9931f130dae5910a1b2e9d1fc6347d991100538e1faf3ece37ec8380ec96/crewai_tools-0.51.0-py3-none-any.whl", hash = "sha256:ba67e6bed6134e374c96fe9038bce6045600ff3b5358f6a6d75ff8f316defd06", size = 633012, upload-time = "2025-07-09T16:39:22.239Z" },
+    { url = "https://files.pythonhosted.org/packages/c3/98/da76dff3b814f5a6c9cbce7dacc09462669174083fd872b21c9373cdd412/crewai_tools-0.55.0-py3-none-any.whl", hash = "sha256:f69967394a9b5c85cab8722dfbae320e0a80d6124a3f36063c5864fe3516ee06", size = 634456, upload-time = "2025-07-16T19:16:35.259Z" },
 ]

 [[package]]
@@ -2266,7 +2268,7 @@ wheels = [

 [[package]]
 name = "litellm"
-version = "1.72.6"
+version = "1.74.3"
 source = { registry = "https://pypi.org/simple" }
 dependencies = [
    { name = "aiohttp" },
@@ -2281,9 +2283,9 @@ dependencies = [
    { name = "tiktoken" },
    { name = "tokenizers" },
 ]
-sdist = { url = "https://files.pythonhosted.org/packages/8d/15/df75f278fd998f6d6900f692b9de2fba2814b316c123c99072a813668aac/litellm-1.72.6.tar.gz", hash = "sha256:4e5c7e4273b09b765302d2faaec30f77b42255c0055b427b55ea02b8092b8582", size = 8393603, upload-time = "2025-06-14T21:43:11.023Z" }
+sdist = { url = "https://files.pythonhosted.org/packages/cd/e3/3091066f6682016840e9a36111560656b609b95de04b2ec7b19ad2580eaa/litellm-1.74.3.tar.gz", hash = "sha256:a9e87ebe78947ceec67e75f830f1c956cc653b84563574241acea9c84e7e3ca1", size = 9256457, upload-time = "2025-07-12T20:06:06.128Z" }
 wheels = [
-    { url = "https://files.pythonhosted.org/packages/96/c9/4aae0b77632279eef9716dbcb98edd8b36c08a9da070e2470ca9c410c0f8/litellm-1.72.6-py3-none-any.whl", hash = "sha256:e0ae98d25db4910e78b1a0a604f24c0d6875f6cdea02426b264a45d4fbdb8c46", size = 8302810, upload-time = "2025-06-14T21:43:08.628Z" },
+    { url = "https://files.pythonhosted.org/packages/14/6f/07735b5178f32e28daf8a30ed6ad3e2c8c06ac374dc06aecde007110470f/litellm-1.74.3-py3-none-any.whl", hash = "sha256:638ec73633c6f2cf78a7343723d8f3bc13c192558fcbaa29f3ba6bc7802e8663", size = 8618899, upload-time = "2025-07-12T20:06:03.609Z" },
 ]

 [[package]]
@@ -3797,14 +3799,14 @@ wheels = [

 [[package]]
 name = "portalocker"
-version = "2.10.1"
+version = "2.7.0"
 source = { registry = "https://pypi.org/simple" }
 dependencies = [
    { name = "pywin32", marker = "sys_platform == 'win32'" },
 ]
-sdist = { url = "https://files.pythonhosted.org/packages/ed/d3/c6c64067759e87af98cc668c1cc75171347d0f1577fab7ca3749134e3cd4/portalocker-2.10.1.tar.gz", hash = "sha256:ef1bf844e878ab08aee7e40184156e1151f228f103aa5c6bd0724cc330960f8f", size = 40891, upload-time = "2024-07-13T23:15:34.86Z" }
+sdist = { url = "https://files.pythonhosted.org/packages/1f/f8/969e6f280201b40b31bcb62843c619f343dcc351dff83a5891530c9dd60e/portalocker-2.7.0.tar.gz", hash = "sha256:032e81d534a88ec1736d03f780ba073f047a06c478b06e2937486f334e955c51", size = 20183, upload-time = "2023-01-18T23:36:14.436Z" }
 wheels = [
-    { url = "https://files.pythonhosted.org/packages/9b/fb/a70a4214956182e0d7a9099ab17d50bfcba1056188e9b14f35b9e2b62a0d/portalocker-2.10.1-py3-none-any.whl", hash = "sha256:53a5984ebc86a025552264b459b46a2086e269b21823cb572f8f28ee759e45bf", size = 18423, upload-time = "2024-07-13T23:15:32.602Z" },
+    { url = "https://files.pythonhosted.org/packages/8c/df/d4f711d168524f5aebd7fb30969eaa31e3048cf8979688cde3b08f6e5eb8/portalocker-2.7.0-py2.py3-none-any.whl", hash = "sha256:a07c5b4f3985c3cf4798369631fb7011adb498e2a46d8440efc75a8f29a0f983", size = 15502, upload-time = "2023-01-18T23:36:12.849Z" },
 ]

 [[package]]
Author	SHA1	Message	Date
Greyson LaLonde	4315f33e88	fix: cast dict values to str in _format_prompt - Add str() casts for type safety - These values are always strings when called from invoke	2025-07-22 10:34:10 -04:00
Greyson LaLonde	cf0a17f099	fix: update CrewAgentExecutor.invoke type signature - Change inputs parameter from Dict[str, str] to Dict[str, Union[str, bool, None]] - Matches actual usage where ask_for_human_input can be bool or None	2025-07-22 10:27:58 -04:00
Greyson LaLonde	a893e6030b	fix: handle None agent_executor and type mismatch - Add None check before accessing agent_executor attributes - Convert task.human_input to bool for type compatibility	2025-07-22 10:21:31 -04:00
Greyson LaLonde	767bbd693d	fix: add type annotation for agent_executor field - Fixes 'Unresolved attribute reference' IDE warning	2025-07-22 10:16:53 -04:00
Lucas Gomide	27623a1d01	feat: remove duplicate print on LLM call error (#3183) By improving litellm handler error / outputs Co-authored-by: Lorenze Jay <63378463+lorenzejay@users.noreply.github.com>	2025-07-21 22:08:07 -04:00
João Moura	2593242234	Adding Support to adhoc tool calling using the internal LLM class (#3195) * Adding Support to adhoc tool calling using the internal LLM class * fix type	2025-07-21 19:36:48 -03:00
Greyson LaLonde	2ab6c31544	chore: add deprecation notices to UserMemory (#3201) - Mark UserMemory and UserMemoryItem for removal in v0.156.0 or 2025-08-04 - Update all references with deprecation warnings - Users should migrate to ExternalMemory	2025-07-21 15:26:34 -04:00
Lucas Gomide	3c55c8a22a	fix: append user message when last message is from assistent when using Ollama models (#3200) Ollama doesn't supports last message to be 'assistant' We can drop this commit after merging https://github.com/BerriAI/litellm/pull/10917	2025-07-21 13:30:40 -04:00
Ranuga Disansa	424433ff58	docs: Add Tavily Search & Extractor tools to Search-Research suite (#3146) * docs: Add Tavily Search and Extractor tools documentation * docs: Add Tavily Search and Extractor tools to the documentation --------- Co-authored-by: Tony Kipkemboi <iamtonykipkemboi@gmail.com>	2025-07-21 12:01:29 -04:00
Lucas Gomide	2fd99503ed	build: upgrade LiteLLM to 1.74.3 (#3199)	2025-07-21 09:58:47 -04:00
Vidit Ostwal	942014962e	fixed save method, changed the test cases (#3187) * fixed save method, changed the test cases * Linting fixed	2025-07-18 15:10:26 -04:00
Lucas Gomide	2ab79a7dd5	feat: drop unsupported stop parameter for LLM models automatically (#3184)	2025-07-18 13:54:28 -04:00
Lucas Gomide	27c449c9c4	test: remove workaround related to SQLite without FTS5 (#3179) For more details check out [here](actions/runner-images#12576)	2025-07-18 09:37:15 -04:00
Vini Brasil	9737333ffd	Use file lock around Chroma client initialization (#3181) This commit fixes a bug with concurrent processess and Chroma where `table collections already exists` (and similar) were raised. https://cookbook.chromadb.dev/core/system_constraints/	2025-07-17 11:50:45 -03:00
Lucas Gomide	bf248d5118	docs: fix neatlogs documentation (#3171)	2025-07-16 21:18:04 -04:00
Lorenze Jay	2490e8cd46	Update CrewAI version to 0.148.0 in project templates and dependencies (#3172) * Update CrewAI version to 0.148.0 in project templates and dependencies * Update crewai-tools dependency to version 0.55.0 in pyproject.toml and uv.lock for improved functionality and performance.	2025-07-16 12:36:43 -07:00
Lucas Gomide	9b67e5a15f	Emit events about Agent eval (#3168) * feat: emit events abou Agent Eval We are triggering events when an evaluation has started/completed/failed * style: fix type checking issues	2025-07-16 13:18:59 -04:00
Lucas Gomide	6ebb6c9b63	Supporting eval single Agent/LiteAgent (#3167) * refactor: rely on task completion event to evaluate agents * feat: remove Crew dependency to evaluate agent * feat: drop execution_context in AgentEvaluator * chore: drop experimental Agent Eval feature from stable crew.test * feat: support eval LiteAgent * resolve linter issues	2025-07-15 09:22:41 -04:00
Lucas Gomide	53f674be60	chore: remove evaluation folder (#3159) This folder was moved to `experimental` folder	2025-07-15 08:30:20 -04:00
Paras Sakarwal	11717a5213	docs: added integration with neatlogs (#3138)	2025-07-14 11:08:24 -04:00
Lucas Gomide	b6d699f764	Implement thread-safe AgentEvaluator (#3157) * refactor: implement thread-safe AgentEvaluator with hybrid state management * chore: remove useless comments	2025-07-14 10:05:42 -04:00
Lucas Gomide	5b15061b87	test: add test helper to assert Agent Experiments (#3156)	2025-07-14 09:24:49 -04:00
Lucas Gomide	1b6b2b36d9	Introduce Evaluator Experiment (#3133) * feat: add exchanged messages in LLMCallCompletedEvent * feat: add GoalAlignment metric for Agent evaluation * feat: add SemanticQuality metric for Agent evaluation * feat: add Tool Metrics for Agent evaluation * feat: add Reasoning Metrics for Agent evaluation, still in progress * feat: add AgentEvaluator class This class will evaluate Agent' results and report to user * fix: do not evaluate Agent by default This is a experimental feature we still need refine it further * test: add Agent eval tests * fix: render all feedback per iteration * style: resolve linter issues * style: fix mypy issues * fix: allow messages be empty on LLMCallCompletedEvent * feat: add Experiment evaluation framework with baseline comparison * fix: reset evaluator for each experiement iteraction * fix: fix track of new test cases * chore: split Experimental evaluation classes * refactor: remove unused method * refactor: isolate Console print in a dedicated class * fix: make crew required to run an experiment * fix: use time-aware to define experiment result * test: add tests for Evaluator Experiment * style: fix linter issues * fix: encode string before hashing * style: resolve linter issues * feat: add experimental folder for beta features (#3141) * test: move tests to experimental folder	2025-07-14 09:06:45 -04:00
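
Commit 3c55c8a22a and the two `test_ollama_*` tests in `tests/llm_test.py` describe the Ollama workaround only by its observable effect: when the last message comes from the assistant, an empty user message is appended; otherwise the list is returned as-is. A minimal sketch of that behaviour follows, assuming a standalone helper. The name `ensure_last_message_is_user` is hypothetical and is not crewAI's `_format_messages_for_provider`, whose actual implementation is not shown in this diff.

```python
from typing import Dict, List

Message = Dict[str, str]


def ensure_last_message_is_user(messages: List[Message]) -> List[Message]:
    """Hypothetical helper mirroring the behaviour asserted by the tests.

    If the conversation ends with an assistant message, return a new list
    with an empty user message appended. Otherwise return the input list
    unchanged. The caller's list is never mutated.
    """
    if messages and messages[-1].get("role") == "assistant":
        # Build a new list so the original messages stay untouched.
        return [*messages, {"role": "user", "content": ""}]
    return messages


if __name__ == "__main__":
    conversation = [
        {"role": "user", "content": "Hi there"},
        {"role": "assistant", "content": "Hello!"},
    ]
    formatted = ensure_last_message_is_user(conversation)
    print(len(conversation), len(formatted))
```

Returning a new list rather than appending in place keeps the caller's history intact, which matches the test's comparison of `len(formatted)` against `len(original_messages) + 1`.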