feat: fix test

feat: change openai model
feat: back to sync
2026-06-29 20:18:11 +00:00 · 2024-07-25 15:30:54 -03:00 · 2024-07-25 13:44:32 -03:00 · 2024-07-25 13:43:54 -03:00 · 2024-07-25 12:58:55 -03:00 · 2024-07-25 12:09:02 -03:00
15 changed files with 359 additions and 491 deletions
--- a/docs/core-concepts/Testing.md
+++ b/docs/core-concepts/Testing.md
@@ -0,0 +1,41 @@
+---
+title: crewAI Testing
+description: Learn how to test your crewAI Crew and evaluate its performance.
+---
+
+## Introduction
+
+Testing is a crucial part of the development process: it helps ensure that your crew is performing as expected. With crewAI, you can test your crew and evaluate its performance using the built-in testing capabilities.
+
+### Using the Testing Feature
+
+The `crewai test` CLI command runs your crew for a specified number of iterations and reports performance metrics.
+Both parameters, `n_iterations` and `model`, are optional and default to 2 and `gpt-4o-mini` respectively. For now, OpenAI is the only supported provider.
+
+```bash
+crewai test
+```
+
+If you want to run more iterations or use a different model, you can specify the parameters like this:
+
+```bash
+crewai test --n_iterations 5 --model gpt-4o
+```
+
+When you run `crewai test`, the crew is executed for the specified number of iterations, and the performance metrics are displayed at the end of the run.
+
+A table at the end shows a score from 1 to 10 for each task and for the crew as a whole, per run and on average:
+```
+                Task Scores
+          (1-10 Higher is better)
+┏━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━┓
+┃ Tasks/Crew ┃ Run 1 ┃ Run 2 ┃ Avg. Total ┃
+┡━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━┩
+│ Task 1     │ 10.0  │ 9.0   │ 9.5        │
+│ Task 2     │ 9.0   │ 9.0   │ 9.0        │
+│ Crew       │ 9.5   │ 9.0   │ 9.2        │
+└────────────┴───────┴───────┴────────────┘
+```
+
+The example above shows the test results for two runs of a crew with two tasks, along with the average score for each task and for the crew as a whole.
+
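A minimal sketch of how the scores in the table above could be aggregated, assuming scores are stored per run as a list with one entry per task. The function name `summarize_scores` and its return shape are hypothetical, chosen for illustration; this is not the library's API.

```python
from statistics import mean


def summarize_scores(tasks_scores):
    """Aggregate scores stored as {run number: [score for each task]}.

    Hypothetical helper for illustration only.
    """
    runs = sorted(tasks_scores)
    # Transpose run-major scores into one tuple of scores per task.
    per_task = list(zip(*(tasks_scores[run] for run in runs)))
    # "Avg. Total" column for each task row.
    task_averages = [mean(scores) for scores in per_task]
    # "Crew" row: the mean of all task scores within each run.
    crew_per_run = [mean(tasks_scores[run]) for run in runs]
    # Bottom-right cell: the mean of the per-task averages.
    crew_average = mean(task_averages)
    return task_averages, crew_per_run, crew_average


# Same numbers as the example table: two runs, two tasks.
scores = {1: [10.0, 9.0], 2: [9.0, 9.0]}
task_averages, crew_per_run, crew_average = summarize_scores(scores)

print(task_averages)          # [9.5, 9.0]
print(crew_per_run)           # [9.5, 9.0]
print(f"{crew_average:.1f}")  # 9.2 (9.25 formatted to one decimal place)
```

The crew average is 9.25 before formatting; it appears as 9.2 because Python's fixed-point formatting rounds an exact tie to the even digit.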
--- a/docs/how-to/Start-a-New-CrewAI-Project.md
+++ b/docs/how-to/Start-a-New-CrewAI-Project.md
@@ -16,7 +16,7 @@ We assume you have already installed CrewAI. If not, please refer to the [instal
 To create a new project, run the following CLI command:

 ```shell
-$ crewai create <project_name>
+$ crewai create my_project
 ```

 This command will create a new project folder with the following structure:
@@ -79,77 +79,8 @@ research_candidates_task:
    {job_requirements}
  expected_output: >
    A list of 10 potential candidates with their contact information and brief profiles highlighting their suitability.
-  agent: researcher # THIS NEEDS TO MATCH THE AGENT NAME IN THE AGENTS.YAML FILE AND THE AGENT DEFINED IN THE Crew.PY FILE
-  context: # THESE NEED TO MATCH THE TASK NAMES DEFINED ABOVE AND THE TASKS.YAML FILE AND THE TASK DEFINED IN THE Crew.PY FILE
-    - researcher
 ```

-### Referencing Variables:
-Your defined functions with the same name will be used. For example, you can reference the agent for specific tasks from task.yaml file. Ensure your annotated agent and function name is the same otherwise your task wont recognize the reference properly.
-
-#### Example References
-agent.yaml
-```yaml
-email_summarizer:
-    role: >
-      Email Summarizer
-    goal: >
-      Summarize emails into a concise and clear summary
-    backstory: >
-      You will create a 5 bullet point summary of the report
-    llm: mixtal_llm
-```
-
-task.yaml
-```yaml
-email_summarizer_task:
-    description: >
-      Summarize the email into a 5 bullet point summary
-    expected_output: >
-      A 5 bullet point summary of the email
-    agent: email_summarizer
-    context:
-      - reporting_task
-      - research_task
-```
-
-Use the annotations are used to properly reference the agent and task in the crew.py file.
-
-### Annotations include:
-* @agent
-* @task
-* @crew
-* @llm
-* @tool
-* @callback
-* @output_json
-* @output_pydantic
-* @cache_handler
-
-
-crew.py
-```py
-...
-    @llm
-    def mixtal_llm(self):
-        return ChatGroq(temperature=0, model_name="mixtral-8x7b-32768")
-
-    @agent
-    def email_summarizer(self) -> Agent:
-        return Agent(
-            config=self.agents_config["email_summarizer"],
-        )
-    ## ...other tasks defined
-    @task
-    def email_summarizer_task(self) -> Task:
-        return Task(
-            config=self.tasks_config["email_summarizer_task"],
-        )
-...
-```
-
-
-
 ## Installing Dependencies

 To install the dependencies for your project, you can use Poetry. First, navigate to your project directory:
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -129,6 +129,7 @@ nav:
    - Training: 'core-concepts/Training-Crew.md'
    - Memory: 'core-concepts/Memory.md'
    - Planning: 'core-concepts/Planning.md'
+    - Testing: 'core-concepts/Testing.md'
    - Using LangChain Tools: 'core-concepts/Using-LangChain-Tools.md'
    - Using LlamaIndex Tools: 'core-concepts/Using-LlamaIndex-Tools.md'
  - How to Guides:
--- a/src/crewai/cli/cli.py
+++ b/src/crewai/cli/cli.py
@@ -10,7 +10,6 @@ from .replay_from_task import replay_task_command
 from .reset_memories_command import reset_memories_command
 from .test_crew import test_crew
 from .train_crew import train_crew
-from .doc_generator import generate_documentation


@click.group()
@@ -147,18 +146,6 @@ def test(n_iterations: int, model: str):
    click.echo(f"Testing the crew for {n_iterations} iterations with model {model}")
    test_crew(n_iterations, model)

-@crewai.command()
-@click.option('--output', '-o', default='crew_documentation.md', help='Output file for the documentation')
-@click.option('--format', '-f', default='markdown', help='Output format')
-def generate_docs(output, format):
-    """Generate documentation for the current project setup."""
-    try:
-        click.echo(f"Generating documentation in {format} format...")
-        generate_documentation(output, format)
-        click.echo(f"Documentation generated and saved to {output}")
-    except ValueError as e:
-        click.echo(f"Error: {str(e)}", err=True)
-        click.echo("Please ensure you are in the root directory of your CrewAI project.")

 if __name__ == "__main__":
    crewai()
--- a/src/crewai/cli/doc_generator.py
+++ b/src/crewai/cli/doc_generator.py
@@ -1,204 +0,0 @@
-import os
-import yaml
-import logging
-
-
-def is_project_root():
-    """
-    Check if the current directory is the root of a CrewAI project.
-
-    Returns:
-        bool: True if in project root, False otherwise.
-    """
-    # Check for key indicators of a CrewAI project root
-    indicators = ["pyproject.toml", "poetry.lock", "src"]
-    return all(os.path.exists(indicator) for indicator in indicators)
-
-
-def generate_documentation(output_file, format):
-    """
-    Generate documentation for the current CrewAI project setup.
-
-    Args:
-        output_file (str): The path and filename where the generated documentation
-                           will be saved.
-        format (str): The desired output format for the documentation.
-                      Supported values currently 'markdown'.
-
-    Returns:
-        None: The function writes the generated documentation to the specified
-              output file and doesn't return any value.
-
-    Raises:
-        ValueError: If not in the project root or if an unsupported output format is specified.
-    """
-    if not is_project_root():
-        raise ValueError(
-            "Not in the root of a CrewAI project."
-        )
-
-    # Load the current project configuration
-    config = load_crew_configuration()
-
-    if config is None:
-        logging.error("Failed to load crew configuration. Exiting.")
-        return
-
-    if format == "markdown":
-        content = generate_markdown(config)
-    else:
-        raise ValueError(f"Unsupported output format: {format}")
-
-    with open(output_file, "w") as f:
-        f.write(content)
-
-    logging.info(f"Documentation generated and saved to {output_file}")
-
-
-def find_config_dir():
-    """
-    Find the configuration directory based on the project structure.
-
-    This function attempts to locate the configuration directory for a CrewAI project
-    by assuming a standard project structure. It starts from the current working
-    directory and constructs an expected path to the config directory.
-
-    Returns:
-        str or None: The path to the configuration directory if found, None otherwise.
-
-    The function performs the following steps:
-    1. Gets the current working directory.
-    2. Extracts the project name from the current directory path.
-    3. Constructs the expected config path using the project structure convention.
-    4. Checks if the expected config directory exists.
-    5. Returns the path if found, or None if not found.
-
-    Logging:
-        - Logs debug information about the search process.
-        - Logs the starting directory, the checked path, and the result of the search.
-
-    Note:
-        This function assumes a specific project structure where the config
-        directory is located at 'src/<project_name>/config' relative to the
-        project root.
-    """
-    current_dir = os.getcwd()
-    logging.debug(f"Starting search from: {current_dir}")
-
-    # Split the path to get the project name
-    path_parts = current_dir.split(os.path.sep)
-    project_name = path_parts[-1]
-
-    # Construct the expected config path
-    expected_config_path = os.path.join(current_dir, "src", project_name, "config")
-
-    logging.debug(f"Checking for config directory: {expected_config_path}")
-
-    if os.path.isdir(expected_config_path):
-        logging.debug(f"Found config directory: {expected_config_path}")
-        return expected_config_path
-
-    logging.debug("Config directory not found in the expected location")
-    return None
-
-
-def load_crew_configuration():
-    """
-    Load the crew configuration from YAML files.
-
-    This function attempts to find the configuration directory and load the agents
-    and tasks configurations from their respective YAML files.
-
-    Returns:
-        dict or None: A dictionary containing 'agents' and 'tasks' configurations
-                      if successful, None if there was an error.
-
-    The function performs the following steps:
-    1. Finds the configuration directory using find_config_dir().
-    2. Constructs paths to agents.yaml and tasks.yaml files.
-    3. Checks if both files exist.
-    4. Loads and parses the YAML content of both files.
-    5. Returns a dictionary with the parsed configurations.
-
-    Logging:
-        - Logs an error if the configuration directory is not found.
-        - Logs an error if either agents.yaml or tasks.yaml is not found.
-
-    Note:
-        This function assumes that the configuration files are named 'agents.yaml'
-        and 'tasks.yaml' and are located in the directory returned by find_config_dir().
-    """
-    config_dir = find_config_dir()
-    if not config_dir:
-        logging.error(
-            "Configuration directory not found. Make sure you're in the root of your CrewAI project."
-        )
-        return None
-
-    agents_file = os.path.join(config_dir, "agents.yaml")
-    tasks_file = os.path.join(config_dir, "tasks.yaml")
-
-    if not os.path.exists(agents_file) or not os.path.exists(tasks_file):
-        logging.error(f"agents.yaml or tasks.yaml not found in {config_dir}")
-        return None
-
-    with open(agents_file, "r") as f:
-        agents_config = yaml.safe_load(f)
-
-    with open(tasks_file, "r") as f:
-        tasks_config = yaml.safe_load(f)
-
-    return {"agents": agents_config, "tasks": tasks_config}
-
-
-def generate_markdown(config):
-    """
-    Generate Markdown documentation for the CrewAI project configuration.
-
-    This function takes the parsed configuration dictionary and generates
-    a formatted Markdown string containing documentation for the project's
-    agents and tasks.
-
-    Args:
-        config (dict): A dictionary containing the parsed configuration
-                       with 'agents' and 'tasks' keys.
-
-    Returns:
-        str: A formatted Markdown string containing the project documentation.
-             If the input config is None, it returns an error message.
-
-    The generated Markdown includes:
-    1. A title for the project documentation.
-    2. A section for Agents, listing each agent's name, role, goal, and backstory.
-    3. A section for Tasks, listing each task's name, description, expected output,
-       and assigned agent.
-
-    Each piece of information is wrapped in code blocks for better readability
-    in rendered Markdown.
-
-    Note:
-        This function assumes that the config dictionary has the correct structure
-        with 'agents' and 'tasks' keys, each containing nested dictionaries of
-        agent and task information respectively.
-    """
-    if config is None:
-        return "# Error: No crew configuration available"
-
-    md = "# CrewAI Project Documentation\n\n"
-
-    md += "## Agents\n\n"
-    for agent_name, agent_data in config["agents"].items():
-        md += f"### \n```\n{agent_name}\n```\n"
-        md += f"Role: \n```\n{agent_data.get('role', 'Not specified')}\n```\n"
-        md += f"Goal: \n```\n{agent_data.get('goal', 'Not specified')}\n```\n"
-        md += f"Backstory: \n```\n{agent_data.get('backstory', 'Not specified')}\n```\n"
-        md += f""
-
-    md += "## Tasks\n\n"
-    for task_name, task_data in config["tasks"].items():
-        md += f"### {task_name}\n"
-        md += f"Description: \n```\n{task_data.get('description', 'Not specified')}\n```\n"
-        md += f"Expected Output: \n```\n{task_data.get('expected_output', 'Not specified')}\n```\n"
-        md += f"Assigned Agent: \n```\n{task_data.get('agent', 'Not assigned')}\n```\n"
-
-    return md
--- a/src/crewai/cli/templates/config/tasks.yaml
+++ b/src/crewai/cli/templates/config/tasks.yaml
@@ -5,7 +5,6 @@ research_task:
    the current year is 2024.
  expected_output: >
    A list with 10 bullet points of the most relevant information about {topic}
-  agent: researcher

 reporting_task:
  description: >
@@ -14,4 +13,3 @@ reporting_task:
  expected_output: >
    A fully fledge reports with the mains topics, each with a full section of information.
    Formatted as markdown without '```'
-  agent: reporting_analyst
--- a/src/crewai/cli/templates/crew.py
+++ b/src/crewai/cli/templates/crew.py
@@ -32,12 +32,14 @@ class {{crew_name}}Crew():
 	def research_task(self) -> Task:
 		return Task(
 			config=self.tasks_config['research_task'],
+			agent=self.researcher()
 		)

 	@task
 	def reporting_task(self) -> Task:
 		return Task(
 			config=self.tasks_config['reporting_task'],
+			agent=self.reporting_analyst(),
 			output_file='report.md'
 		)

--- a/src/crewai/cli/templates/main.py
+++ b/src/crewai/cli/templates/main.py
@@ -48,7 +48,7 @@ def test():
        "topic": "AI LLMs"
    }
    try:
-        {{crew_name}}Crew().crew().test(n_iterations=int(sys.argv[1]), model=sys.argv[2], inputs=inputs)
+        {{crew_name}}Crew().crew().test(n_iterations=int(sys.argv[1]), openai_model_name=sys.argv[2], inputs=inputs)

    except Exception as e:
        raise Exception(f"An error occurred while replaying the crew: {e}")
--- a/src/crewai/crew.py
+++ b/src/crewai/crew.py
@@ -37,6 +37,7 @@ from crewai.utilities.constants import (
    TRAINED_AGENTS_DATA_FILE,
    TRAINING_DATA_FILE,
 )
+from crewai.utilities.evaluators.crew_evaluator_handler import CrewEvaluator
 from crewai.utilities.evaluators.task_evaluator import TaskEvaluator
 from crewai.utilities.formatter import (
    aggregate_raw_outputs_from_task_outputs,
@@ -967,10 +968,19 @@ class Crew(BaseModel):
        return total_usage_metrics

    def test(
-        self, n_iterations: int, model: str, inputs: Optional[Dict[str, Any]] = None
+        self,
+        n_iterations: int,
+        openai_model_name: str,
+        inputs: Optional[Dict[str, Any]] = None,
    ) -> None:
-        """Test the crew with the given inputs."""
-        pass
+        """Test and evaluate the Crew with the given inputs for n iterations."""
+        evaluator = CrewEvaluator(self, openai_model_name)
+
+        for i in range(1, n_iterations + 1):
+            evaluator.set_iteration(i)
+            self.kickoff(inputs=inputs)
+
+        evaluator.print_crew_evaluation_result()

    def __repr__(self):
        return f"Crew(id={self.id}, process={self.process}, number_of_agents={len(self.agents)}, number_of_tasks={len(self.tasks)})"
--- a/src/crewai/project/__init__.py
+++ b/src/crewai/project/__init__.py
@@ -1,25 +1,2 @@
-from .annotations import (
-    agent,
-    crew,
-    task,
-    output_json,
-    output_pydantic,
-    tool,
-    callback,
-    llm,
-    cache_handler,
-)
+from .annotations import agent, crew, task
 from .crew_base import CrewBase
-
-__all__ = [
-    "agent",
-    "crew",
-    "task",
-    "output_json",
-    "output_pydantic",
-    "tool",
-    "callback",
-    "CrewBase",
-    "llm",
-    "cache_handler",
-]
--- a/src/crewai/project/annotations.py
+++ b/src/crewai/project/annotations.py
@@ -30,37 +30,6 @@ def agent(func):
    return func


-def llm(func):
-    func.is_llm = True
-    func = memoize(func)
-    return func
-
-
-def output_json(cls):
-    cls.is_output_json = True
-    return cls
-
-
-def output_pydantic(cls):
-    cls.is_output_pydantic = True
-    return cls
-
-
-def tool(func):
-    func.is_tool = True
-    return memoize(func)
-
-
-def callback(func):
-    func.is_callback = True
-    return memoize(func)
-
-
-def cache_handler(func):
-    func.is_cache_handler = True
-    return memoize(func)
-
-
 def crew(func):
    def wrapper(self, *args, **kwargs):
        instantiated_tasks = []
--- a/src/crewai/project/crew_base.py
+++ b/src/crewai/project/crew_base.py
@@ -1,7 +1,6 @@
 import inspect
 import os
 from pathlib import Path
-from typing import Any, Callable, Dict

 import yaml
 from dotenv import load_dotenv
@@ -21,6 +20,11 @@ def CrewBase(cls):
                base_directory = Path(frame_info.filename).parent.resolve()
                break

+        if base_directory is None:
+            raise Exception(
+                "Unable to dynamically determine the project's base directory, you must run it from the project's root directory."
+            )
+
        original_agents_config_path = getattr(
            cls, "agents_config", "config/agents.yaml"
        )
@@ -28,20 +32,12 @@ def CrewBase(cls):

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
-
-            if self.base_directory is None:
-                raise Exception(
-                    "Unable to dynamically determine the project's base directory, you must run it from the project's root directory."
-                )
-
            self.agents_config = self.load_yaml(
                os.path.join(self.base_directory, self.original_agents_config_path)
            )
            self.tasks_config = self.load_yaml(
                os.path.join(self.base_directory, self.original_tasks_config_path)
            )
-            self.map_all_agent_variables()
-            self.map_all_task_variables()

        @staticmethod
        def load_yaml(config_path: str):
@@ -49,138 +45,4 @@ def CrewBase(cls):
                # parsedContent = YamlParser.parse(file)  # type: ignore # Argument 1 to "parse" has incompatible type "TextIOWrapper"; expected "YamlParser"
                return yaml.safe_load(file)

-        def _get_all_functions(self):
-            return {
-                name: getattr(self, name)
-                for name in dir(self)
-                if callable(getattr(self, name))
-            }
-
-        def _filter_functions(
-            self, functions: Dict[str, Callable], attribute: str
-        ) -> Dict[str, Callable]:
-            return {
-                name: func
-                for name, func in functions.items()
-                if hasattr(func, attribute)
-            }
-
-        def map_all_agent_variables(self) -> None:
-            all_functions = self._get_all_functions()
-            llms = self._filter_functions(all_functions, "is_llm")
-            tool_functions = self._filter_functions(all_functions, "is_tool")
-            cache_handler_functions = self._filter_functions(
-                all_functions, "is_cache_handler"
-            )
-            callbacks = self._filter_functions(all_functions, "is_callback")
-            agents = self._filter_functions(all_functions, "is_agent")
-
-            for agent_name, agent_info in self.agents_config.items():
-                self._map_agent_variables(
-                    agent_name,
-                    agent_info,
-                    agents,
-                    llms,
-                    tool_functions,
-                    cache_handler_functions,
-                    callbacks,
-                )
-
-        def _map_agent_variables(
-            self,
-            agent_name: str,
-            agent_info: Dict[str, Any],
-            agents: Dict[str, Callable],
-            llms: Dict[str, Callable],
-            tool_functions: Dict[str, Callable],
-            cache_handler_functions: Dict[str, Callable],
-            callbacks: Dict[str, Callable],
-        ) -> None:
-            if llm := agent_info.get("llm"):
-                self.agents_config[agent_name]["llm"] = llms[llm]()
-
-            if tools := agent_info.get("tools"):
-                self.agents_config[agent_name]["tools"] = [
-                    tool_functions[tool]() for tool in tools
-                ]
-
-            if function_calling_llm := agent_info.get("function_calling_llm"):
-                self.agents_config[agent_name]["function_calling_llm"] = agents[
-                    function_calling_llm
-                ]()
-
-            if step_callback := agent_info.get("step_callback"):
-                self.agents_config[agent_name]["step_callback"] = callbacks[
-                    step_callback
-                ]()
-
-            if cache_handler := agent_info.get("cache_handler"):
-                self.agents_config[agent_name]["cache_handler"] = (
-                    cache_handler_functions[cache_handler]()
-                )
-
-        def map_all_task_variables(self) -> None:
-            all_functions = self._get_all_functions()
-            agents = self._filter_functions(all_functions, "is_agent")
-            tasks = self._filter_functions(all_functions, "is_task")
-            output_json_functions = self._filter_functions(
-                all_functions, "is_output_json"
-            )
-            tool_functions = self._filter_functions(all_functions, "is_tool")
-            callback_functions = self._filter_functions(all_functions, "is_callback")
-            output_pydantic_functions = self._filter_functions(
-                all_functions, "is_output_pydantic"
-            )
-
-            for task_name, task_info in self.tasks_config.items():
-                self._map_task_variables(
-                    task_name,
-                    task_info,
-                    agents,
-                    tasks,
-                    output_json_functions,
-                    tool_functions,
-                    callback_functions,
-                    output_pydantic_functions,
-                )
-
-        def _map_task_variables(
-            self,
-            task_name: str,
-            task_info: Dict[str, Any],
-            agents: Dict[str, Callable],
-            tasks: Dict[str, Callable],
-            output_json_functions: Dict[str, Callable],
-            tool_functions: Dict[str, Callable],
-            callback_functions: Dict[str, Callable],
-            output_pydantic_functions: Dict[str, Callable],
-        ) -> None:
-            if context_list := task_info.get("context"):
-                self.tasks_config[task_name]["context"] = [
-                    tasks[context_task_name]() for context_task_name in context_list
-                ]
-
-            if tools := task_info.get("tools"):
-                self.tasks_config[task_name]["tools"] = [
-                    tool_functions[tool]() for tool in tools
-                ]
-
-            if agent_name := task_info.get("agent"):
-                self.tasks_config[task_name]["agent"] = agents[agent_name]()
-
-            if output_json := task_info.get("output_json"):
-                self.tasks_config[task_name]["output_json"] = output_json_functions[
-                    output_json
-                ]
-
-            if output_pydantic := task_info.get("output_pydantic"):
-                self.tasks_config[task_name]["output_pydantic"] = (
-                    output_pydantic_functions[output_pydantic]
-                )
-
-            if callbacks := task_info.get("callbacks"):
-                self.tasks_config[task_name]["callbacks"] = [
-                    callback_functions[callback]() for callback in callbacks
-                ]
-
    return WrappedClass
--- a/src/crewai/utilities/evaluators/crew_evaluator_handler.py
+++ b/src/crewai/utilities/evaluators/crew_evaluator_handler.py
@@ -0,0 +1,149 @@
+from collections import defaultdict
+
+from langchain_openai import ChatOpenAI
+from pydantic import BaseModel, Field
+from rich.console import Console
+from rich.table import Table
+
+from crewai.agent import Agent
+from crewai.task import Task
+from crewai.tasks.task_output import TaskOutput
+
+
+class TaskEvaluationPydanticOutput(BaseModel):
+    quality: float = Field(
+        description="A score from 1 to 10 evaluating on completion, quality, and overall performance from the task_description and task_expected_output to the actual Task Output."
+    )
+
+
+class CrewEvaluator:
+    """
+    A class to evaluate the performance of the agents in the crew based on the tasks they have performed.
+
+    Attributes:
+        crew (Crew): The crew of agents to evaluate.
+        openai_model_name (str): The model to use for evaluating the performance of the agents (for now ONLY OpenAI accepted).
+        tasks_scores (defaultdict): A dictionary to store the scores of the agents for each task.
+        iteration (int): The current iteration of the evaluation.
+    """
+
+    def __init__(self, crew, openai_model_name: str):
+        self.crew = crew
+        self.openai_model_name = openai_model_name
+        # Per-instance state: a class-level defaultdict would be shared by every evaluator.
+        self.tasks_scores: defaultdict = defaultdict(list)
+        self.iteration: int = 0
+        self._setup_for_evaluating()
+
+    def _setup_for_evaluating(self) -> None:
+        """Sets up the crew for evaluating."""
+        for task in self.crew.tasks:
+            task.callback = self.evaluate
+
+    def set_iteration(self, iteration: int) -> None:
+        self.iteration = iteration
+
+    def _evaluator_agent(self):
+        return Agent(
+            role="Task Execution Evaluator",
+            goal=(
+                "Your goal is to evaluate the performance of the agents in the crew based on the tasks they have performed using score from 1 to 10 evaluating on completion, quality, and overall performance."
+            ),
+            backstory="Evaluator agent for crew evaluation with precise capabilities to evaluate the performance of the agents in the crew based on the tasks they have performed",
+            verbose=False,
+            llm=ChatOpenAI(model=self.openai_model_name),
+        )
+
+    def _evaluation_task(
+        self, evaluator_agent: Agent, task_to_evaluate: Task, task_output: str
+    ) -> Task:
+        return Task(
+            description=(
+                "Based on the task description and the expected output, compare and evaluate the performance of the agents in the crew based on the Task Output they have performed using score from 1 to 10 evaluating on completion, quality, and overall performance. "
+                f"task_description: {task_to_evaluate.description} "
+                f"task_expected_output: {task_to_evaluate.expected_output} "
+                f"agent: {task_to_evaluate.agent.role if task_to_evaluate.agent else None} "
+                f"agent_goal: {task_to_evaluate.agent.goal if task_to_evaluate.agent else None} "
+                f"Task Output: {task_output}"
+            ),
+            expected_output="Evaluation Score from 1 to 10 based on the performance of the agents on the tasks",
+            agent=evaluator_agent,
+            output_pydantic=TaskEvaluationPydanticOutput,
+        )
+
+    def print_crew_evaluation_result(self) -> None:
+        """
+        Prints the evaluation result of the crew in a table.
+        A Crew with 2 tasks using the command crewai test -n 2
+        will output the following table:
+
+                        Task Scores
+                    (1-10 Higher is better)
+            ┏━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━┓
+            ┃ Tasks/Crew ┃ Run 1 ┃ Run 2 ┃ Avg. Total ┃
+            ┡━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━┩
+            │ Task 1     │ 10.0  │ 9.0   │ 9.5        │
+            │ Task 2     │ 9.0   │ 9.0   │ 9.0        │
+            │ Crew       │ 9.5   │ 9.0   │ 9.2        │
+            └────────────┴───────┴───────┴────────────┘
+        """
+        task_averages = [
+            sum(scores) / len(scores) for scores in zip(*self.tasks_scores.values())
+        ]
+        crew_average = sum(task_averages) / len(task_averages)
+
+        # Create a table
+        table = Table(title="Tasks Scores \n (1-10 Higher is better)")
+
+        # Add columns for the table
+        table.add_column("Tasks/Crew")
+        for run in range(1, len(self.tasks_scores) + 1):
+            table.add_column(f"Run {run}")
+        table.add_column("Avg. Total")
+
+        # Add rows for each task
+        for task_index in range(len(task_averages)):
+            task_scores = [
+                self.tasks_scores[run][task_index]
+                for run in range(1, len(self.tasks_scores) + 1)
+            ]
+            avg_score = task_averages[task_index]
+            table.add_row(
+                f"Task {task_index + 1}", *map(str, task_scores), f"{avg_score:.1f}"
+            )
+
+        # Add a row for the crew average
+        crew_scores = [
+            sum(self.tasks_scores[run]) / len(self.tasks_scores[run])
+            for run in range(1, len(self.tasks_scores) + 1)
+        ]
+        table.add_row("Crew", *map(str, crew_scores), f"{crew_average:.1f}")
+
+        # Display the table in the terminal
+        console = Console()
+        console.print(table)
+
+    def evaluate(self, task_output: TaskOutput):
+        """Evaluates the performance of the agents in the crew based on the tasks they have performed."""
+        current_task = None
+        for task in self.crew.tasks:
+            if task.description == task_output.description:
+                current_task = task
+                break
+
+        if not current_task or not task_output:
+            raise ValueError(
+                "Task to evaluate and task output are required for evaluation"
+            )
+
+        evaluator_agent = self._evaluator_agent()
+        evaluation_task = self._evaluation_task(
+            evaluator_agent, current_task, task_output.raw
+        )
+
+        evaluation_result = evaluation_task.execute_sync()
+
+        if isinstance(evaluation_result.pydantic, TaskEvaluationPydanticOutput):
+            self.tasks_scores[self.iteration].append(evaluation_result.pydantic.quality)
+        else:
+            raise ValueError("Evaluation result is not in the expected format")
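`CrewEvaluator` accumulates scores in a `defaultdict`, and where that container is created matters. A mutable object assigned in a class body is created once and shared by every instance, while one assigned in `__init__` is created per instance. A minimal sketch of the difference, using hypothetical class names that are not part of crewAI:

```python
from collections import defaultdict


class SharedScores:
    # Hypothetical class: the defaultdict is created once, on the class itself,
    # so every instance reads and writes the same object.
    scores = defaultdict(list)


class PerInstanceScores:
    # Hypothetical class: each instance gets its own defaultdict.
    def __init__(self):
        self.scores = defaultdict(list)


a, b = SharedScores(), SharedScores()
a.scores[1].append(9.0)
print(dict(b.scores))  # {1: [9.0]}  (b sees the score appended through a)

c, d = PerInstanceScores(), PerInstanceScores()
c.scores[1].append(9.0)
print(dict(d.scores))  # {}  (d is unaffected)
```

With shared state, scores from one evaluation would carry over into any later evaluator created in the same process, which would skew the averages in the results table.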
--- a/tests/crew_test.py
+++ b/tests/crew_test.py
@@ -8,6 +8,7 @@ from unittest.mock import MagicMock, patch

 import pydantic_core
 import pytest
+
 from crewai.agent import Agent
 from crewai.agents.cache import CacheHandler
 from crewai.crew import Crew
@@ -2499,3 +2500,34 @@ def test_conditional_should_execute():
        assert condition_mock.call_count == 1
        assert condition_mock() is True
        assert mock_execute_sync.call_count == 2
+
+
+@mock.patch("crewai.crew.CrewEvaluator")
+@mock.patch("crewai.crew.Crew.kickoff")
+def test_crew_testing_function(mock_kickoff, crew_evaluator):
+    task = Task(
+        description="Come up with a list of 5 interesting ideas to explore for an article, then write one amazing paragraph highlight for each idea that showcases how good an article about this topic could be. Return the list of ideas with their paragraph and your notes.",
+        expected_output="5 bullet points with a paragraph for each idea.",
+        agent=researcher,
+    )
+
+    crew = Crew(
+        agents=[researcher],
+        tasks=[task],
+    )
+    n_iterations = 2
+    crew.test(n_iterations, openai_model_name="gpt-4o-mini", inputs={"topic": "AI"})
+
+    assert mock_kickoff.call_count == n_iterations
+    mock_kickoff.assert_has_calls(
+        [mock.call(inputs={"topic": "AI"}), mock.call(inputs={"topic": "AI"})]
+    )
+
+    crew_evaluator.assert_has_calls(
+        [
+            mock.call(crew, "gpt-4o-mini"),
+            mock.call().set_iteration(1),
+            mock.call().set_iteration(2),
+            mock.call().print_crew_evaluation_result(),
+        ]
+    )
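Review note: the `mock.call().set_iteration(1)` entries in the assertion above rely on how `unittest.mock` records calls made on the return value of a mocked class. The sketch below shows that mechanism in isolation. `run_iterations` is a hypothetical stand-in for whatever `Crew.test` does internally, which is not visible in this hunk, so treat the call order as an assumption inferred from the test.

```python
from unittest import mock


def run_iterations(evaluator_cls, crew, model, n_iterations):
    """Hypothetical driver: build one evaluator, tag each run, print once."""
    evaluator = evaluator_cls(crew, model)
    for iteration in range(1, n_iterations + 1):
        evaluator.set_iteration(iteration)
    evaluator.print_crew_evaluation_result()


# A MagicMock standing in for the patched evaluator class.
evaluator_cls = mock.MagicMock(name="CrewEvaluator")
run_iterations(evaluator_cls, "crew", "gpt-4o-mini", 2)

# Calls on the instance are recorded on the class mock as `call().method(...)`.
expected_calls = [
    mock.call("crew", "gpt-4o-mini"),
    mock.call().set_iteration(1),
    mock.call().set_iteration(2),
    mock.call().print_crew_evaluation_result(),
]
evaluator_cls.assert_has_calls(expected_calls)
```

`assert_has_calls` only checks that the expected sequence appears within `mock_calls`, so extra calls before or after it would still pass. Comparing `mock_calls` to the list directly is the stricter alternative.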
--- a/tests/utilities/evaluators/test_crew_evaluator_handler.py
+++ b/tests/utilities/evaluators/test_crew_evaluator_handler.py
@@ -0,0 +1,113 @@
+from unittest import mock
+
+import pytest
+
+from crewai.agent import Agent
+from crewai.crew import Crew
+from crewai.task import Task
+from crewai.tasks.task_output import TaskOutput
+from crewai.utilities.evaluators.crew_evaluator_handler import (
+    CrewEvaluator,
+    TaskEvaluationPydanticOutput,
+)
+
+
+class TestCrewEvaluator:
+    @pytest.fixture
+    def crew_planner(self):
+        agent = Agent(role="Agent 1", goal="Goal 1", backstory="Backstory 1")
+        task = Task(
+            description="Task 1",
+            expected_output="Output 1",
+            agent=agent,
+        )
+        crew = Crew(agents=[agent], tasks=[task])
+
+        return CrewEvaluator(crew, openai_model_name="gpt-4o-mini")
+
+    def test_setup_for_evaluating(self, crew_planner):
+        crew_planner._setup_for_evaluating()
+        assert crew_planner.crew.tasks[0].callback == crew_planner.evaluate
+
+    def test_set_iteration(self, crew_planner):
+        crew_planner.set_iteration(1)
+        assert crew_planner.iteration == 1
+
+    def test_evaluator_agent(self, crew_planner):
+        agent = crew_planner._evaluator_agent()
+        assert agent.role == "Task Execution Evaluator"
+        assert (
+            agent.goal
+            == "Your goal is to evaluate the performance of the agents in the crew based on the tasks they have performed using score from 1 to 10 evaluating on completion, quality, and overall performance."
+        )
+        assert (
+            agent.backstory
+            == "Evaluator agent for crew evaluation with precise capabilities to evaluate the performance of the agents in the crew based on the tasks they have performed"
+        )
+        assert agent.verbose is False
+        assert agent.llm.model_name == "gpt-4o-mini"
+
+    def test_evaluation_task(self, crew_planner):
+        evaluator_agent = Agent(
+            role="Evaluator Agent",
+            goal="Evaluate the performance of the agents in the crew",
+            backstory="Master in Evaluation",
+        )
+        task_to_evaluate = Task(
+            description="Task 1",
+            expected_output="Output 1",
+            agent=Agent(role="Agent 1", goal="Goal 1", backstory="Backstory 1"),
+        )
+        task_output = "Task Output 1"
+        task = crew_planner._evaluation_task(
+            evaluator_agent, task_to_evaluate, task_output
+        )
+
+        assert task.description.startswith(
+            "Based on the task description and the expected output, compare and evaluate the performance of the agents in the crew based on the Task Output they have performed using score from 1 to 10 evaluating on completion, quality, and overall performance."
+        )
+
+        assert task.agent == evaluator_agent
+        assert (
+            task.description
+            == "Based on the task description and the expected output, compare and evaluate "
+            "the performance of the agents in the crew based on the Task Output they have "
+            "performed using score from 1 to 10 evaluating on completion, quality, and overall "
+            "performance.task_description: Task 1 task_expected_output: Output 1 "
+            "agent: Agent 1 agent_goal: Goal 1 Task Output: Task Output 1"
+        )
+
+    @mock.patch("crewai.utilities.evaluators.crew_evaluator_handler.Console")
+    @mock.patch("crewai.utilities.evaluators.crew_evaluator_handler.Table")
+    def test_print_crew_evaluation_result(self, table, console, crew_planner):
+        crew_planner.tasks_scores = {
+            1: [10, 9, 8],
+            2: [9, 8, 7],
+        }
+
+        crew_planner.print_crew_evaluation_result()
+
+        table.assert_has_calls(
+            [
+                mock.call(title="Tasks Scores \n (1-10 Higher is better)"),
+                mock.call().add_column("Tasks/Crew"),
+                mock.call().add_column("Run 1"),
+                mock.call().add_column("Run 2"),
+                mock.call().add_column("Avg. Total"),
+                mock.call().add_row("Task 1", "10", "9", "9.5"),
+                mock.call().add_row("Task 2", "9", "8", "8.5"),
+                mock.call().add_row("Task 3", "8", "7", "7.5"),
+                mock.call().add_row("Crew", "9.0", "8.0", "8.5"),
+            ]
+        )
+        console.assert_has_calls([mock.call(), mock.call().print(table())])
+
+    def test_evaluate(self, crew_planner):
+        task_output = TaskOutput(
+            description="Task 1", agent=str(crew_planner.crew.agents[0])
+        )
+
+        with mock.patch.object(Task, "execute_sync") as execute:
+            execute().pydantic = TaskEvaluationPydanticOutput(quality=9.5)
+            crew_planner.evaluate(task_output)
+            assert crew_planner.tasks_scores[0] == [9.5]
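Review note: `test_evaluate` configures the patched method by calling it once (`execute().pydantic = ...`). That works because every call to a `MagicMock` returns the same `return_value` object, so an attribute set during setup is visible to the code under test. A minimal sketch of the pattern follows. `Grader` and `score_once` are hypothetical names used only for illustration.

```python
from unittest import mock


class Grader:
    def run(self):
        # Stands in for a method that would call a real LLM.
        raise RuntimeError("not available in unit tests")


def score_once(grader):
    """Code under test: reads an attribute off the method's result."""
    return grader.run().quality


with mock.patch.object(Grader, "run") as run:
    # Calling the mock returns its shared return_value; set the attribute there.
    run().quality = 9.5
    result = score_once(Grader())
```

One side effect of this idiom is that the setup call is itself recorded, so `run.call_count` is 2 here. Writing `run.return_value.quality = 9.5` instead avoids the extra recorded call if a test later asserts on call counts.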
Author	SHA1	Message	Date
Eduardo Chiarotti	616ffe2aba	feat: fix test	2024-07-25 15:30:54 -03:00
Eduardo Chiarotti	a6bce1089a	feat: change opdeai model	2024-07-25 13:44:32 -03:00
Eduardo Chiarotti	cb8fbf61de	feat: back to sync	2024-07-25 13:43:54 -03:00
Eduardo Chiarotti	4d2cdc3d96	feat: improve tests and fix some issue	2024-07-25 12:58:55 -03:00
Eduardo Chiarotti	890c03a0a6	docs: add docs for Testing	2024-07-25 12:09:02 -03:00
Eduardo Chiarotti	e4b419d5be	feat: add raise ValueError when testing if output is not the expected	2024-07-24 13:35:29 -03:00
Eduardo Chiarotti	8ffc4f79fa	feat: fix type checking issue	2024-07-24 13:34:59 -03:00
Eduardo Chiarotti	c05ef3c8cf	feat: add tests	2024-07-24 13:14:20 -03:00
Eduardo Chiarotti	cf600c1a43	feat: improve testing output table	2024-07-24 11:39:43 -03:00
Eduardo Chiarotti	2a88d1d462	feat: add docs and add unit test	2024-07-24 11:05:09 -03:00
Eduardo Chiarotti	660a2ae837	feat: add crew Testing/evalauting feature	2024-07-24 09:14:09 -03:00