Compare commits

..

11 Commits

Author SHA1 Message Date
Eduardo Chiarotti
616ffe2aba feat: fix test 2024-07-25 15:30:54 -03:00
Eduardo Chiarotti
a6bce1089a feat: change opdeai model 2024-07-25 13:44:32 -03:00
Eduardo Chiarotti
cb8fbf61de feat: back to sync 2024-07-25 13:43:54 -03:00
Eduardo Chiarotti
4d2cdc3d96 feat: improve tests and fix some issue 2024-07-25 12:58:55 -03:00
Eduardo Chiarotti
890c03a0a6 docs: add docs for Testing 2024-07-25 12:09:02 -03:00
Eduardo Chiarotti
e4b419d5be feat: add raise ValueError when testing if output is not the expected 2024-07-24 13:35:29 -03:00
Eduardo Chiarotti
8ffc4f79fa feat: fix type checking issue 2024-07-24 13:34:59 -03:00
Eduardo Chiarotti
c05ef3c8cf feat: add tests 2024-07-24 13:14:20 -03:00
Eduardo Chiarotti
cf600c1a43 feat: improve testing output table 2024-07-24 11:39:43 -03:00
Eduardo Chiarotti
2a88d1d462 feat: add docs and add unit test 2024-07-24 11:05:09 -03:00
Eduardo Chiarotti
660a2ae837 feat: add crew Testing/evalauting feature 2024-07-24 09:14:09 -03:00
17 changed files with 374 additions and 354 deletions

View File

@@ -33,7 +33,6 @@ A crew in crewAI represents a collaborative group of agents working together to
| **Manager Callbacks** _(optional)_ | `manager_callbacks` | `manager_callbacks` takes a list of callback handlers to be executed by the manager agent when a hierarchical process is used. |
| **Prompt File** _(optional)_ | `prompt_file` | Path to the prompt JSON file to be used for the crew. |
| **Planning** *(optional)* | `planning` | Adds planning ability to the Crew. When activated before each Crew iteration, all Crew data is sent to an AgentPlanner that will plan the tasks and this plan will be added to each task description.
| **Planning LLM** *(optional)* | `planning_llm` | The language model used by the AgentPlanner in a planning process. |
!!! note "Crew Max RPM"
The `max_rpm` attribute sets the maximum number of requests per minute the crew can perform to avoid rate limits and will override individual agents' `max_rpm` settings if you set it.

View File

@@ -23,25 +23,6 @@ my_crew = Crew(
From this point on, your crew will have planning enabled, and the tasks will be planned before each iteration.
#### Planning LLM
Now you can define the LLM that will be used to plan the tasks. You can use any ChatOpenAI LLM model available.
```python
from crewai import Crew, Agent, Task, Process
from langchain_openai import ChatOpenAI
# Assemble your crew with planning capabilities and custom LLM
my_crew = Crew(
agents=self.agents,
tasks=self.tasks,
process=Process.sequential,
planning=True,
planning_llm=ChatOpenAI(model="gpt-4o")
)
```
### Example
When running the base case example, you will see something like the following output, which represents the output of the AgentPlanner responsible for creating the step-by-step logic to add to the Agents tasks.

View File

@@ -0,0 +1,41 @@
---
title: crewAI Testing
description: Learn how to test your crewAI Crew and evaluate their performance.
---
## Introduction
Testing is a crucial part of the development process, and it is essential to ensure that your crew is performing as expected. And with crewAI, you can easily test your crew and evaluate its performance using the built-in testing capabilities.
### Using the Testing Feature
We added the CLI command `crewai test` to make it easy to test your crew. This command will run your crew for a specified number of iterations and provide detailed performance metrics.
The parameters are `n_iterations` and `model` which are optional and default to 2 and `gpt-4o-mini` respectively. For now the only provider available is OpenAI.
```bash
crewai test
```
If you want to run more iterations or use a different model, you can specify the parameters like this:
```bash
crewai test --n_iterations 5 --model gpt-4o
```
What happens when you run the `crewai test` command is that the crew will be executed for the specified number of iterations, and the performance metrics will be displayed at the end of the run.
A table of scores at the end will show the performance of the crew in terms of the following metrics:
```
Task Scores
(1-10 Higher is better)
┏━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━┓
┃ Tasks/Crew ┃ Run 1 ┃ Run 2 ┃ Avg. Total ┃
┡━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━┩
│ Task 1 │ 10.0 │ 9.0 │ 9.5 │
│ Task 2 │ 9.0 │ 9.0 │ 9.0 │
│ Crew │ 9.5 │ 9.0 │ 9.2 │
└────────────┴───────┴───────┴────────────┘
```
The example above shows the test results for two runs of the crew with two tasks, with the average total score for each task and the crew as a whole.

View File

@@ -16,7 +16,7 @@ We assume you have already installed CrewAI. If not, please refer to the [instal
To create a new project, run the following CLI command:
```shell
$ crewai create <project_name>
$ crewai create my_project
```
This command will create a new project folder with the following structure:
@@ -79,77 +79,8 @@ research_candidates_task:
{job_requirements}
expected_output: >
A list of 10 potential candidates with their contact information and brief profiles highlighting their suitability.
agent: researcher # THIS NEEDS TO MATCH THE AGENT NAME IN THE AGENTS.YAML FILE AND THE AGENT DEFINED IN THE Crew.PY FILE
context: # THESE NEED TO MATCH THE TASK NAMES DEFINED ABOVE AND THE TASKS.YAML FILE AND THE TASK DEFINED IN THE Crew.PY FILE
- researcher
```
### Referencing Variables:
Your defined functions with the same name will be used. For example, you can reference the agent for specific tasks from task.yaml file. Ensure your annotated agent and function name is the same otherwise your task wont recognize the reference properly.
#### Example References
agent.yaml
```yaml
email_summarizer:
role: >
Email Summarizer
goal: >
Summarize emails into a concise and clear summary
backstory: >
You will create a 5 bullet point summary of the report
llm: mixtal_llm
```
task.yaml
```yaml
email_summarizer_task:
description: >
Summarize the email into a 5 bullet point summary
expected_output: >
A 5 bullet point summary of the email
agent: email_summarizer
context:
- reporting_task
- research_task
```
Use the annotations are used to properly reference the agent and task in the crew.py file.
### Annotations include:
* @agent
* @task
* @crew
* @llm
* @tool
* @callback
* @output_json
* @output_pydantic
* @cache_handler
crew.py
```py
...
@llm
def mixtal_llm(self):
return ChatGroq(temperature=0, model_name="mixtral-8x7b-32768")
@agent
def email_summarizer(self) -> Agent:
return Agent(
config=self.agents_config["email_summarizer"],
)
## ...other tasks defined
@task
def email_summarizer_task(self) -> Task:
return Task(
config=self.tasks_config["email_summarizer_task"],
)
...
```
## Installing Dependencies
To install the dependencies for your project, you can use Poetry. First, navigate to your project directory:

View File

@@ -129,6 +129,7 @@ nav:
- Training: 'core-concepts/Training-Crew.md'
- Memory: 'core-concepts/Memory.md'
- Planning: 'core-concepts/Planning.md'
- Testing: 'core-concepts/Testing.md'
- Using LangChain Tools: 'core-concepts/Using-LangChain-Tools.md'
- Using LlamaIndex Tools: 'core-concepts/Using-LlamaIndex-Tools.md'
- How to Guides:

View File

@@ -5,7 +5,6 @@ research_task:
the current year is 2024.
expected_output: >
A list with 10 bullet points of the most relevant information about {topic}
agent: researcher
reporting_task:
description: >
@@ -14,4 +13,3 @@ reporting_task:
expected_output: >
A fully fledge reports with the mains topics, each with a full section of information.
Formatted as markdown without '```'
agent: reporting_analyst

View File

@@ -32,12 +32,14 @@ class {{crew_name}}Crew():
def research_task(self) -> Task:
return Task(
config=self.tasks_config['research_task'],
agent=self.researcher()
)
@task
def reporting_task(self) -> Task:
return Task(
config=self.tasks_config['reporting_task'],
agent=self.reporting_analyst(),
output_file='report.md'
)

View File

@@ -48,7 +48,7 @@ def test():
"topic": "AI LLMs"
}
try:
{{crew_name}}Crew().crew().test(n_iterations=int(sys.argv[1]), model=sys.argv[2], inputs=inputs)
{{crew_name}}Crew().crew().test(n_iterations=int(sys.argv[1]), openai_model_name=sys.argv[2], inputs=inputs)
except Exception as e:
raise Exception(f"An error occurred while replaying the crew: {e}")

View File

@@ -37,6 +37,7 @@ from crewai.utilities.constants import (
TRAINED_AGENTS_DATA_FILE,
TRAINING_DATA_FILE,
)
from crewai.utilities.evaluators.crew_evaluator_handler import CrewEvaluator
from crewai.utilities.evaluators.task_evaluator import TaskEvaluator
from crewai.utilities.formatter import (
aggregate_raw_outputs_from_task_outputs,
@@ -154,10 +155,6 @@ class Crew(BaseModel):
default=False,
description="Plan the crew execution and add the plan to the crew.",
)
planning_llm: Optional[Any] = Field(
default=None,
description="Language model that will run the AgentPlanner if planning is True.",
)
task_execution_output_json_files: Optional[List[str]] = Field(
default=None,
description="List of file paths for task execution JSON files.",
@@ -563,12 +560,15 @@ class Crew(BaseModel):
def _handle_crew_planning(self):
"""Handles the Crew planning."""
self._logger.log("info", "Planning the crew execution")
result = CrewPlanner(
tasks=self.tasks, planning_agent_llm=self.planning_llm
)._handle_crew_planning()
result = CrewPlanner(self.tasks)._handle_crew_planning()
for task, step_plan in zip(self.tasks, result.list_of_plans_per_task):
task.description += step_plan
if result is not None and hasattr(result, "list_of_plans_per_task"):
for task, step_plan in zip(self.tasks, result.list_of_plans_per_task):
task.description += step_plan
else:
self._logger.log(
"info", "Something went wrong with the planning process of the Crew"
)
def _store_execution_log(
self,
@@ -968,10 +968,19 @@ class Crew(BaseModel):
return total_usage_metrics
def test(
self, n_iterations: int, model: str, inputs: Optional[Dict[str, Any]] = None
self,
n_iterations: int,
openai_model_name: str,
inputs: Optional[Dict[str, Any]] = None,
) -> None:
"""Test the crew with the given inputs."""
pass
"""Test and evaluate the Crew with the given inputs for n iterations."""
evaluator = CrewEvaluator(self, openai_model_name)
for i in range(1, n_iterations + 1):
evaluator.set_iteration(i)
self.kickoff(inputs=inputs)
evaluator.print_crew_evaluation_result()
def __repr__(self):
return f"Crew(id={self.id}, process={self.process}, number_of_agents={len(self.agents)}, number_of_tasks={len(self.tasks)})"

View File

@@ -1,25 +1,2 @@
from .annotations import (
agent,
crew,
task,
output_json,
output_pydantic,
tool,
callback,
llm,
cache_handler,
)
from .annotations import agent, crew, task
from .crew_base import CrewBase
__all__ = [
"agent",
"crew",
"task",
"output_json",
"output_pydantic",
"tool",
"callback",
"CrewBase",
"llm",
"cache_handler",
]

View File

@@ -30,37 +30,6 @@ def agent(func):
return func
def llm(func):
func.is_llm = True
func = memoize(func)
return func
def output_json(cls):
cls.is_output_json = True
return cls
def output_pydantic(cls):
cls.is_output_pydantic = True
return cls
def tool(func):
func.is_tool = True
return memoize(func)
def callback(func):
func.is_callback = True
return memoize(func)
def cache_handler(func):
func.is_cache_handler = True
return memoize(func)
def crew(func):
def wrapper(self, *args, **kwargs):
instantiated_tasks = []

View File

@@ -1,7 +1,6 @@
import inspect
import os
from pathlib import Path
from typing import Any, Callable, Dict
import yaml
from dotenv import load_dotenv
@@ -21,6 +20,11 @@ def CrewBase(cls):
base_directory = Path(frame_info.filename).parent.resolve()
break
if base_directory is None:
raise Exception(
"Unable to dynamically determine the project's base directory, you must run it from the project's root directory."
)
original_agents_config_path = getattr(
cls, "agents_config", "config/agents.yaml"
)
@@ -28,20 +32,12 @@ def CrewBase(cls):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
if self.base_directory is None:
raise Exception(
"Unable to dynamically determine the project's base directory, you must run it from the project's root directory."
)
self.agents_config = self.load_yaml(
os.path.join(self.base_directory, self.original_agents_config_path)
)
self.tasks_config = self.load_yaml(
os.path.join(self.base_directory, self.original_tasks_config_path)
)
self.map_all_agent_variables()
self.map_all_task_variables()
@staticmethod
def load_yaml(config_path: str):
@@ -49,138 +45,4 @@ def CrewBase(cls):
# parsedContent = YamlParser.parse(file) # type: ignore # Argument 1 to "parse" has incompatible type "TextIOWrapper"; expected "YamlParser"
return yaml.safe_load(file)
def _get_all_functions(self):
return {
name: getattr(self, name)
for name in dir(self)
if callable(getattr(self, name))
}
def _filter_functions(
self, functions: Dict[str, Callable], attribute: str
) -> Dict[str, Callable]:
return {
name: func
for name, func in functions.items()
if hasattr(func, attribute)
}
def map_all_agent_variables(self) -> None:
all_functions = self._get_all_functions()
llms = self._filter_functions(all_functions, "is_llm")
tool_functions = self._filter_functions(all_functions, "is_tool")
cache_handler_functions = self._filter_functions(
all_functions, "is_cache_handler"
)
callbacks = self._filter_functions(all_functions, "is_callback")
agents = self._filter_functions(all_functions, "is_agent")
for agent_name, agent_info in self.agents_config.items():
self._map_agent_variables(
agent_name,
agent_info,
agents,
llms,
tool_functions,
cache_handler_functions,
callbacks,
)
def _map_agent_variables(
self,
agent_name: str,
agent_info: Dict[str, Any],
agents: Dict[str, Callable],
llms: Dict[str, Callable],
tool_functions: Dict[str, Callable],
cache_handler_functions: Dict[str, Callable],
callbacks: Dict[str, Callable],
) -> None:
if llm := agent_info.get("llm"):
self.agents_config[agent_name]["llm"] = llms[llm]()
if tools := agent_info.get("tools"):
self.agents_config[agent_name]["tools"] = [
tool_functions[tool]() for tool in tools
]
if function_calling_llm := agent_info.get("function_calling_llm"):
self.agents_config[agent_name]["function_calling_llm"] = agents[
function_calling_llm
]()
if step_callback := agent_info.get("step_callback"):
self.agents_config[agent_name]["step_callback"] = callbacks[
step_callback
]()
if cache_handler := agent_info.get("cache_handler"):
self.agents_config[agent_name]["cache_handler"] = (
cache_handler_functions[cache_handler]()
)
def map_all_task_variables(self) -> None:
all_functions = self._get_all_functions()
agents = self._filter_functions(all_functions, "is_agent")
tasks = self._filter_functions(all_functions, "is_task")
output_json_functions = self._filter_functions(
all_functions, "is_output_json"
)
tool_functions = self._filter_functions(all_functions, "is_tool")
callback_functions = self._filter_functions(all_functions, "is_callback")
output_pydantic_functions = self._filter_functions(
all_functions, "is_output_pydantic"
)
for task_name, task_info in self.tasks_config.items():
self._map_task_variables(
task_name,
task_info,
agents,
tasks,
output_json_functions,
tool_functions,
callback_functions,
output_pydantic_functions,
)
def _map_task_variables(
self,
task_name: str,
task_info: Dict[str, Any],
agents: Dict[str, Callable],
tasks: Dict[str, Callable],
output_json_functions: Dict[str, Callable],
tool_functions: Dict[str, Callable],
callback_functions: Dict[str, Callable],
output_pydantic_functions: Dict[str, Callable],
) -> None:
if context_list := task_info.get("context"):
self.tasks_config[task_name]["context"] = [
tasks[context_task_name]() for context_task_name in context_list
]
if tools := task_info.get("tools"):
self.tasks_config[task_name]["tools"] = [
tool_functions[tool]() for tool in tools
]
if agent_name := task_info.get("agent"):
self.tasks_config[task_name]["agent"] = agents[agent_name]()
if output_json := task_info.get("output_json"):
self.tasks_config[task_name]["output_json"] = output_json_functions[
output_json
]
if output_pydantic := task_info.get("output_pydantic"):
self.tasks_config[task_name]["output_pydantic"] = (
output_pydantic_functions[output_pydantic]
)
if callbacks := task_info.get("callbacks"):
self.tasks_config[task_name]["callbacks"] = [
callback_functions[callback]() for callback in callbacks
]
return WrappedClass

View File

@@ -0,0 +1,149 @@
from collections import defaultdict
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field
from rich.console import Console
from rich.table import Table
from crewai.agent import Agent
from crewai.task import Task
from crewai.tasks.task_output import TaskOutput
class TaskEvaluationPydanticOutput(BaseModel):
quality: float = Field(
description="A score from 1 to 10 evaluating on completion, quality, and overall performance from the task_description and task_expected_output to the actual Task Output."
)
class CrewEvaluator:
"""
A class to evaluate the performance of the agents in the crew based on the tasks they have performed.
Attributes:
crew (Crew): The crew of agents to evaluate.
openai_model_name (str): The model to use for evaluating the performance of the agents (for now ONLY OpenAI accepted).
tasks_scores (defaultdict): A dictionary to store the scores of the agents for each task.
iteration (int): The current iteration of the evaluation.
"""
tasks_scores: defaultdict = defaultdict(list)
iteration: int = 0
def __init__(self, crew, openai_model_name: str):
self.crew = crew
self.openai_model_name = openai_model_name
self._setup_for_evaluating()
def _setup_for_evaluating(self) -> None:
"""Sets up the crew for evaluating."""
for task in self.crew.tasks:
task.callback = self.evaluate
def set_iteration(self, iteration: int) -> None:
self.iteration = iteration
def _evaluator_agent(self):
return Agent(
role="Task Execution Evaluator",
goal=(
"Your goal is to evaluate the performance of the agents in the crew based on the tasks they have performed using score from 1 to 10 evaluating on completion, quality, and overall performance."
),
backstory="Evaluator agent for crew evaluation with precise capabilities to evaluate the performance of the agents in the crew based on the tasks they have performed",
verbose=False,
llm=ChatOpenAI(model=self.openai_model_name),
)
def _evaluation_task(
self, evaluator_agent: Agent, task_to_evaluate: Task, task_output: str
) -> Task:
return Task(
description=(
"Based on the task description and the expected output, compare and evaluate the performance of the agents in the crew based on the Task Output they have performed using score from 1 to 10 evaluating on completion, quality, and overall performance."
f"task_description: {task_to_evaluate.description} "
f"task_expected_output: {task_to_evaluate.expected_output} "
f"agent: {task_to_evaluate.agent.role if task_to_evaluate.agent else None} "
f"agent_goal: {task_to_evaluate.agent.goal if task_to_evaluate.agent else None} "
f"Task Output: {task_output}"
),
expected_output="Evaluation Score from 1 to 10 based on the performance of the agents on the tasks",
agent=evaluator_agent,
output_pydantic=TaskEvaluationPydanticOutput,
)
def print_crew_evaluation_result(self) -> None:
"""
Prints the evaluation result of the crew in a table.
A Crew with 2 tasks using the command crewai test -n 2
will output the following table:
Task Scores
(1-10 Higher is better)
┏━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━┓
┃ Tasks/Crew ┃ Run 1 ┃ Run 2 ┃ Avg. Total ┃
┡━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━┩
│ Task 1 │ 10.0 │ 9.0 │ 9.5 │
│ Task 2 │ 9.0 │ 9.0 │ 9.0 │
│ Crew │ 9.5 │ 9.0 │ 9.2 │
└────────────┴───────┴───────┴────────────┘
"""
task_averages = [
sum(scores) / len(scores) for scores in zip(*self.tasks_scores.values())
]
crew_average = sum(task_averages) / len(task_averages)
# Create a table
table = Table(title="Tasks Scores \n (1-10 Higher is better)")
# Add columns for the table
table.add_column("Tasks/Crew")
for run in range(1, len(self.tasks_scores) + 1):
table.add_column(f"Run {run}")
table.add_column("Avg. Total")
# Add rows for each task
for task_index in range(len(task_averages)):
task_scores = [
self.tasks_scores[run][task_index]
for run in range(1, len(self.tasks_scores) + 1)
]
avg_score = task_averages[task_index]
table.add_row(
f"Task {task_index + 1}", *map(str, task_scores), f"{avg_score:.1f}"
)
# Add a row for the crew average
crew_scores = [
sum(self.tasks_scores[run]) / len(self.tasks_scores[run])
for run in range(1, len(self.tasks_scores) + 1)
]
table.add_row("Crew", *map(str, crew_scores), f"{crew_average:.1f}")
# Display the table in the terminal
console = Console()
console.print(table)
def evaluate(self, task_output: TaskOutput):
"""Evaluates the performance of the agents in the crew based on the tasks they have performed."""
current_task = None
for task in self.crew.tasks:
if task.description == task_output.description:
current_task = task
break
if not current_task or not task_output:
raise ValueError(
"Task to evaluate and task output are required for evaluation"
)
evaluator_agent = self._evaluator_agent()
evaluation_task = self._evaluation_task(
evaluator_agent, current_task, task_output.raw
)
evaluation_result = evaluation_task.execute_sync()
if isinstance(evaluation_result.pydantic, TaskEvaluationPydanticOutput):
self.tasks_scores[self.iteration].append(evaluation_result.pydantic.quality)
else:
raise ValueError("Evaluation result is not in the expected format")

View File

@@ -1,6 +1,5 @@
from typing import Any, List, Optional
from typing import List, Optional
from langchain_openai import ChatOpenAI
from pydantic import BaseModel
from crewai.agent import Agent
@@ -12,27 +11,17 @@ class PlannerTaskPydanticOutput(BaseModel):
class CrewPlanner:
def __init__(self, tasks: List[Task], planning_agent_llm: Optional[Any] = None):
def __init__(self, tasks: List[Task]):
self.tasks = tasks
if planning_agent_llm is None:
self.planning_agent_llm = ChatOpenAI(model="gpt-4o-mini")
else:
self.planning_agent_llm = planning_agent_llm
def _handle_crew_planning(self) -> PlannerTaskPydanticOutput:
def _handle_crew_planning(self) -> Optional[BaseModel]:
"""Handles the Crew planning by creating detailed step-by-step plans for each task."""
planning_agent = self._create_planning_agent()
tasks_summary = self._create_tasks_summary()
planner_task = self._create_planner_task(planning_agent, tasks_summary)
result = planner_task.execute_sync()
if isinstance(result.pydantic, PlannerTaskPydanticOutput):
return result.pydantic
raise ValueError("Failed to get the Planning output")
return planner_task.execute_sync().pydantic
def _create_planning_agent(self) -> Agent:
"""Creates the planning agent for the crew planning."""
@@ -43,7 +32,6 @@ class CrewPlanner:
"available to each agent so that they can perform the tasks in an exemplary manner"
),
backstory="Planner agent for crew planning",
llm=self.planning_agent_llm,
)
def _create_planner_task(self, planning_agent: Agent, tasks_summary: str) -> Task:

View File

@@ -8,6 +8,7 @@ from unittest.mock import MagicMock, patch
import pydantic_core
import pytest
from crewai.agent import Agent
from crewai.agents.cache import CacheHandler
from crewai.crew import Crew
@@ -2499,3 +2500,34 @@ def test_conditional_should_execute():
assert condition_mock.call_count == 1
assert condition_mock() is True
assert mock_execute_sync.call_count == 2
@mock.patch("crewai.crew.CrewEvaluator")
@mock.patch("crewai.crew.Crew.kickoff")
def test_crew_testing_function(mock_kickoff, crew_evaluator):
task = Task(
description="Come up with a list of 5 interesting ideas to explore for an article, then write one amazing paragraph highlight for each idea that showcases how good an article about this topic could be. Return the list of ideas with their paragraph and your notes.",
expected_output="5 bullet points with a paragraph for each idea.",
agent=researcher,
)
crew = Crew(
agents=[researcher],
tasks=[task],
)
n_iterations = 2
crew.test(n_iterations, openai_model_name="gpt-4o-mini", inputs={"topic": "AI"})
assert len(mock_kickoff.mock_calls) == n_iterations
mock_kickoff.assert_has_calls(
[mock.call(inputs={"topic": "AI"}), mock.call(inputs={"topic": "AI"})]
)
crew_evaluator.assert_has_calls(
[
mock.call(crew, "gpt-4o-mini"),
mock.call().set_iteration(1),
mock.call().set_iteration(2),
mock.call().print_crew_evaluation_result(),
]
)

View File

@@ -0,0 +1,113 @@
from unittest import mock
import pytest
from crewai.agent import Agent
from crewai.crew import Crew
from crewai.task import Task
from crewai.tasks.task_output import TaskOutput
from crewai.utilities.evaluators.crew_evaluator_handler import (
CrewEvaluator,
TaskEvaluationPydanticOutput,
)
class TestCrewEvaluator:
@pytest.fixture
def crew_planner(self):
agent = Agent(role="Agent 1", goal="Goal 1", backstory="Backstory 1")
task = Task(
description="Task 1",
expected_output="Output 1",
agent=agent,
)
crew = Crew(agents=[agent], tasks=[task])
return CrewEvaluator(crew, openai_model_name="gpt-4o-mini")
def test_setup_for_evaluating(self, crew_planner):
crew_planner._setup_for_evaluating()
assert crew_planner.crew.tasks[0].callback == crew_planner.evaluate
def test_set_iteration(self, crew_planner):
crew_planner.set_iteration(1)
assert crew_planner.iteration == 1
def test_evaluator_agent(self, crew_planner):
agent = crew_planner._evaluator_agent()
assert agent.role == "Task Execution Evaluator"
assert (
agent.goal
== "Your goal is to evaluate the performance of the agents in the crew based on the tasks they have performed using score from 1 to 10 evaluating on completion, quality, and overall performance."
)
assert (
agent.backstory
== "Evaluator agent for crew evaluation with precise capabilities to evaluate the performance of the agents in the crew based on the tasks they have performed"
)
assert agent.verbose is False
assert agent.llm.model_name == "gpt-4o-mini"
def test_evaluation_task(self, crew_planner):
evaluator_agent = Agent(
role="Evaluator Agent",
goal="Evaluate the performance of the agents in the crew",
backstory="Master in Evaluation",
)
task_to_evaluate = Task(
description="Task 1",
expected_output="Output 1",
agent=Agent(role="Agent 1", goal="Goal 1", backstory="Backstory 1"),
)
task_output = "Task Output 1"
task = crew_planner._evaluation_task(
evaluator_agent, task_to_evaluate, task_output
)
assert task.description.startswith(
"Based on the task description and the expected output, compare and evaluate the performance of the agents in the crew based on the Task Output they have performed using score from 1 to 10 evaluating on completion, quality, and overall performance."
)
assert task.agent == evaluator_agent
assert (
task.description
== "Based on the task description and the expected output, compare and evaluate "
"the performance of the agents in the crew based on the Task Output they have "
"performed using score from 1 to 10 evaluating on completion, quality, and overall "
"performance.task_description: Task 1 task_expected_output: Output 1 "
"agent: Agent 1 agent_goal: Goal 1 Task Output: Task Output 1"
)
@mock.patch("crewai.utilities.evaluators.crew_evaluator_handler.Console")
@mock.patch("crewai.utilities.evaluators.crew_evaluator_handler.Table")
def test_print_crew_evaluation_result(self, table, console, crew_planner):
crew_planner.tasks_scores = {
1: [10, 9, 8],
2: [9, 8, 7],
}
crew_planner.print_crew_evaluation_result()
table.assert_has_calls(
[
mock.call(title="Tasks Scores \n (1-10 Higher is better)"),
mock.call().add_column("Tasks/Crew"),
mock.call().add_column("Run 1"),
mock.call().add_column("Run 2"),
mock.call().add_column("Avg. Total"),
mock.call().add_row("Task 1", "10", "9", "9.5"),
mock.call().add_row("Task 2", "9", "8", "8.5"),
mock.call().add_row("Task 3", "8", "7", "7.5"),
mock.call().add_row("Crew", "9.0", "8.0", "8.5"),
]
)
console.assert_has_calls([mock.call(), mock.call().print(table())])
def test_evaluate(self, crew_planner):
task_output = TaskOutput(
description="Task 1", agent=str(crew_planner.crew.agents[0])
)
with mock.patch.object(Task, "execute_sync") as execute:
execute().pydantic = TaskEvaluationPydanticOutput(quality=9.5)
crew_planner.evaluate(task_output)
assert crew_planner.tasks_scores[0] == [9.5]

View File

@@ -1,11 +1,10 @@
from unittest.mock import patch
from crewai.tasks.task_output import TaskOutput
import pytest
from langchain_openai import ChatOpenAI
from crewai.agent import Agent
from crewai.task import Task
from crewai.tasks.task_output import TaskOutput
from crewai.utilities.planning_handler import CrewPlanner, PlannerTaskPydanticOutput
@@ -29,19 +28,7 @@ class TestCrewPlanner:
agent=Agent(role="Agent 3", goal="Goal 3", backstory="Backstory 3"),
),
]
return CrewPlanner(tasks, None)
@pytest.fixture
def crew_planner_different_llm(self):
tasks = [
Task(
description="Task 1",
expected_output="Output 1",
agent=Agent(role="Agent 1", goal="Goal 1", backstory="Backstory 1"),
)
]
planning_agent_llm = ChatOpenAI(model="gpt-3.5-turbo")
return CrewPlanner(tasks, planning_agent_llm)
return CrewPlanner(tasks)
def test_handle_crew_planning(self, crew_planner):
with patch.object(Task, "execute_sync") as execute:
@@ -53,7 +40,7 @@ class TestCrewPlanner:
),
)
result = crew_planner._handle_crew_planning()
assert crew_planner.planning_agent_llm.model_name == "gpt-4o-mini"
assert isinstance(result, PlannerTaskPydanticOutput)
assert len(result.list_of_plans_per_task) == len(crew_planner.tasks)
execute.assert_called_once()
@@ -85,22 +72,3 @@ class TestCrewPlanner:
assert isinstance(tasks_summary, str)
assert tasks_summary.startswith("\n Task Number 1 - Task 1")
assert tasks_summary.endswith('"agent_tools": []\n ')
def test_handle_crew_planning_different_llm(self, crew_planner_different_llm):
with patch.object(Task, "execute_sync") as execute:
execute.return_value = TaskOutput(
description="Description",
agent="agent",
pydantic=PlannerTaskPydanticOutput(list_of_plans_per_task=["Plan 1"]),
)
result = crew_planner_different_llm._handle_crew_planning()
assert (
crew_planner_different_llm.planning_agent_llm.model_name
== "gpt-3.5-turbo"
)
assert isinstance(result, PlannerTaskPydanticOutput)
assert len(result.list_of_plans_per_task) == len(
crew_planner_different_llm.tasks
)
execute.assert_called_once()