crewAI/src/crewai/utilities/evaluators/crew_evaluator_handler.py
devin-ai-integration[bot] 807c13e144 Add support for custom LLM implementations (#2277)
* Add support for custom LLM implementations

* Fix import sorting and type annotations

* Fix linting issues with import sorting

* Fix type errors in crew.py by updating tool-related methods to return List[BaseTool]

* Enhance custom LLM implementation with better error handling, documentation, and test coverage

* Refactor LLM module by extracting BaseLLM to a separate file

This commit moves the BaseLLM abstract base class from llm.py to a new file llms/base_llm.py to improve code organization. The changes include:

- Creating a new file src/crewai/llms/base_llm.py
- Moving the BaseLLM class to the new file
- Updating imports in __init__.py and llm.py to reflect the new location
- Updating test cases to use the new import path

The refactoring maintains the existing functionality while improving the project's module structure.
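To illustrate the new layout, a minimal sketch of a custom LLM built on the relocated base class is shown below. It assumes the abstract interface is a single call() method and that the constructor accepts the model and stop parameters mentioned later in this changelog; the EchoLLM class itself is purely hypothetical.

from typing import Any, Dict, List, Optional, Union

from crewai.llms.base_llm import BaseLLM  # new location introduced by this refactor


class EchoLLM(BaseLLM):
    """Hypothetical custom LLM that simply echoes the last user message."""

    def __init__(self, model: str = "echo", stop: Optional[List[str]] = None):
        # Assumes BaseLLM.__init__ accepts `model` and an optional `stop` list.
        super().__init__(model=model, stop=stop)

    def call(
        self,
        messages: Union[str, List[Dict[str, str]]],
        tools: Optional[List[dict]] = None,
        callbacks: Optional[List[Any]] = None,
        available_functions: Optional[Dict[str, Any]] = None,
    ) -> str:
        # A real implementation would forward `messages` to an actual model API.
        if isinstance(messages, str):
            return messages
        return messages[-1].get("content", "") if messages else ""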

* Add AISuite LLM support and update dependencies

- Integrate AISuite as a new third-party LLM option
- Update pyproject.toml and uv.lock to include aisuite package
- Modify BaseLLM to support more flexible initialization
- Remove unnecessary LLM imports across multiple files
- Implement AISuiteLLM with basic chat completion functionality
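
As a rough illustration of the bullets above, the chat-completion path could look something like the sketch below; the class shape and the aisuite client calls are assumptions based on aisuite's OpenAI-style interface, not a copy of the shipped AISuiteLLM.

from typing import Dict, List, Optional, Union

import aisuite

from crewai.llms.base_llm import BaseLLM


class AISuiteLLMSketch(BaseLLM):
    """Illustrative wrapper only; module path and signatures are assumptions."""

    def __init__(self, model: str, stop: Optional[List[str]] = None):
        super().__init__(model=model, stop=stop)
        self.client = aisuite.Client()

    def call(
        self,
        messages: Union[str, List[Dict[str, str]]],
        tools: Optional[List[dict]] = None,
        callbacks=None,
        available_functions=None,
    ) -> str:
        # Accept either a bare prompt string or an OpenAI-style message list.
        if isinstance(messages, str):
            messages = [{"role": "user", "content": messages}]
        params: Dict[str, object] = {"model": self.model, "messages": messages}
        if tools:
            params["tools"] = tools
        response = self.client.chat.completions.create(**params)
        return response.choices[0].message.content

With aisuite, model identifiers are typically namespaced as "provider:model" (for example "openai:gpt-4o"), which would be passed straight through as the model argument here.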

* Update AISuiteLLM and LLM utility type handling

- Modify AISuiteLLM to support more flexible input types for messages
- Update type hints in AISuiteLLM to allow string or list of message dictionaries
- Enhance LLM utility function to support broader LLM type annotations
- Remove default `self.stop` attribute from BaseLLM initialization

* Update LLM imports and type hints across multiple files

- Modify imports in crew_chat.py to use LLM instead of BaseLLM
- Update type hints in llm_utils.py to use LLM type
- Add optional `stop` parameter to BaseLLM initialization
- Refactor type handling for LLM creation and usage

* Improve stop words handling in CrewAgentExecutor

- Add support for handling existing stop words in LLM configuration
- Ensure stop words are correctly merged and deduplicated
- Update type hints to support both LLM and BaseLLM types
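
The merge-and-deduplicate step described above amounts to something like the following, where llm and agent_stop_words are stand-ins for whatever the executor actually holds:

# Illustrative merge of the executor's stop words with ones already configured on the LLM.
existing_stop = getattr(llm, "stop", None) or []
if isinstance(existing_stop, str):
    existing_stop = [existing_stop]

# dict.fromkeys deduplicates while preserving the original ordering.
llm.stop = list(dict.fromkeys([*existing_stop, *agent_stop_words]))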

* Remove abstract method set_callbacks from BaseLLM class

* Enhance CustomLLM and JWTAuthLLM initialization with model parameter

- Update CustomLLM to accept a model parameter during initialization
- Modify test cases to include the new model argument
- Ensure JWTAuthLLM and TimeoutHandlingLLM also utilize the model parameter in their constructors
- Update type hints in create_llm function to support both LLM and BaseLLM types

* Enhance create_llm function to support BaseLLM type

- Update the create_llm function to accept both LLM and BaseLLM instances
- Ensure compatibility with existing LLM handling logic

* Update type hint for initialize_chat_llm to support BaseLLM

- Modify the return type of initialize_chat_llm function to allow for both LLM and BaseLLM instances
- Ensure compatibility with recent changes in create_llm function
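
In practice these two commits boil down to letting already-constructed LLM objects flow straight through the factory. A simplified sketch, with the signature and branches reduced from what llm_utils.py actually handles:

from typing import Optional, Union

from crewai.llm import LLM
from crewai.llms.base_llm import BaseLLM


def create_llm(llm_value: Union[str, LLM, BaseLLM, None] = None) -> Optional[Union[LLM, BaseLLM]]:
    # Custom BaseLLM implementations (and stock LLM instances) pass through untouched,
    # so user-provided models keep their own call() behaviour.
    if isinstance(llm_value, (LLM, BaseLLM)):
        return llm_value
    # A plain model string still produces the default LLM wrapper.
    if isinstance(llm_value, str):
        return LLM(model=llm_value)
    return None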

* Refactor AISuiteLLM to include tools parameter in completion methods

- Update the _prepare_completion_params method to accept an optional tools parameter
- Modify the chat completion method to utilize the new tools parameter for enhanced functionality
- Clean up print statements for better code clarity

* Remove unused tool_calls handling in AISuiteLLM chat completion method for cleaner code.

* Refactor Crew class and LLM hierarchy for improved type handling and code clarity

- Update Crew class methods to enhance readability with consistent formatting and type hints.
- Change LLM class to inherit from BaseLLM for better structure.
- Remove unnecessary type checks and streamline tool handling in CrewAgentExecutor.
- Adjust BaseLLM to provide default implementations for stop words and context window size methods.
- Clean up AISuiteLLM by removing unused methods related to stop words and context window size.
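
The default implementations mentioned in the last two bullets presumably look something like this; the method names match the LLM interface, but the fallback values are guesses:

class BaseLLMDefaultsSketch:
    """Illustrative defaults so simple subclasses only need to implement call()."""

    def supports_stop_words(self) -> bool:
        # Assume stop sequences are honoured unless a subclass says otherwise.
        return True

    def get_context_window_size(self) -> int:
        # Conservative fallback when a subclass does not report its real window.
        return 4096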

* Remove unused `stream` method from `BaseLLM` class to enhance code clarity and maintainability.

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Joe Moura <joao@crewai.com>
Co-authored-by: Lorenze Jay <lorenzejaytech@gmail.com>
Co-authored-by: João Moura <joaomdmoura@gmail.com>
Co-authored-by: Brandon Hancock (bhancock_ai) <109994880+bhancockio@users.noreply.github.com>
2025-03-25 12:39:08 -04:00

193 lines · 8.9 KiB · Python

from collections import defaultdict

from pydantic import BaseModel, Field, InstanceOf
from rich.box import HEAVY_EDGE
from rich.console import Console
from rich.table import Table

from crewai.agent import Agent
from crewai.llm import BaseLLM
from crewai.task import Task
from crewai.tasks.task_output import TaskOutput
from crewai.telemetry import Telemetry


class TaskEvaluationPydanticOutput(BaseModel):
    quality: float = Field(
        description="A score from 1 to 10 evaluating on completion, quality, and overall performance from the task_description and task_expected_output to the actual Task Output."
    )


class CrewEvaluator:
    """
    A class to evaluate the performance of the agents in the crew based on the tasks they have performed.

    Attributes:
        crew (Crew): The crew of agents to evaluate.
        eval_llm (BaseLLM): Language model instance to use for evaluations
        tasks_scores (defaultdict): A dictionary to store the scores of the agents for each task.
        iteration (int): The current iteration of the evaluation.
    """

    tasks_scores: defaultdict = defaultdict(list)
    run_execution_times: defaultdict = defaultdict(list)
    iteration: int = 0

    def __init__(self, crew, eval_llm: InstanceOf[BaseLLM]):
        self.crew = crew
        self.llm = eval_llm
        self._telemetry = Telemetry()
        self._setup_for_evaluating()

    def _setup_for_evaluating(self) -> None:
        """Sets up the crew for evaluating."""
        for task in self.crew.tasks:
            task.callback = self.evaluate

    def _evaluator_agent(self):
        return Agent(
            role="Task Execution Evaluator",
            goal=(
                "Your goal is to evaluate the performance of the agents in the crew based on the tasks they have performed using score from 1 to 10 evaluating on completion, quality, and overall performance."
            ),
            backstory="Evaluator agent for crew evaluation with precise capabilities to evaluate the performance of the agents in the crew based on the tasks they have performed",
            verbose=False,
            llm=self.llm,
        )

    def _evaluation_task(
        self, evaluator_agent: Agent, task_to_evaluate: Task, task_output: str
    ) -> Task:
        return Task(
            description=(
                "Based on the task description and the expected output, compare and evaluate the performance of the agents in the crew based on the Task Output they have performed using score from 1 to 10 evaluating on completion, quality, and overall performance."
                f"task_description: {task_to_evaluate.description} "
                f"task_expected_output: {task_to_evaluate.expected_output} "
                f"agent: {task_to_evaluate.agent.role if task_to_evaluate.agent else None} "
                f"agent_goal: {task_to_evaluate.agent.goal if task_to_evaluate.agent else None} "
                f"Task Output: {task_output}"
            ),
            expected_output="Evaluation Score from 1 to 10 based on the performance of the agents on the tasks",
            agent=evaluator_agent,
            output_pydantic=TaskEvaluationPydanticOutput,
        )

    def set_iteration(self, iteration: int) -> None:
        self.iteration = iteration

    def print_crew_evaluation_result(self) -> None:
        """
        Prints the evaluation result of the crew in a table.

        A Crew with 2 tasks using the command crewai test -n 3
        will output the following table:

                            Tasks Scores
                      (1-10 Higher is better)
        ┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
        ┃ Tasks/Crew/Agents  ┃ Run 1 ┃ Run 2 ┃ Run 3 ┃ Avg. Total ┃ Agents                       ┃
        ┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
        │ Task 1             │ 9.0   │ 10.0  │ 9.0   │ 9.3        │ - AI LLMs Senior Researcher  │
        │                    │       │       │       │            │ - AI LLMs Reporting Analyst  │
        │                    │       │       │       │            │                              │
        │ Task 2             │ 9.0   │ 9.0   │ 9.0   │ 9.0        │ - AI LLMs Senior Researcher  │
        │                    │       │       │       │            │ - AI LLMs Reporting Analyst  │
        │                    │       │       │       │            │                              │
        │ Crew               │ 9.0   │ 9.5   │ 9.0   │ 9.2        │                              │
        │ Execution Time (s) │ 42    │ 79    │ 52    │ 57         │                              │
        └────────────────────┴───────┴───────┴───────┴────────────┴──────────────────────────────┘
        """
        task_averages = [
            sum(scores) / len(scores) for scores in zip(*self.tasks_scores.values())
        ]
        crew_average = sum(task_averages) / len(task_averages)

        table = Table(title="Tasks Scores \n (1-10 Higher is better)", box=HEAVY_EDGE)

        table.add_column("Tasks/Crew/Agents", style="cyan")
        for run in range(1, len(self.tasks_scores) + 1):
            table.add_column(f"Run {run}", justify="center")
        table.add_column("Avg. Total", justify="center")
        table.add_column("Agents", style="green")

        for task_index, task in enumerate(self.crew.tasks):
            task_scores = [
                self.tasks_scores[run][task_index]
                for run in range(1, len(self.tasks_scores) + 1)
            ]
            avg_score = task_averages[task_index]
            agents = list(task.processed_by_agents)

            # Add the task row with the first agent
            table.add_row(
                f"Task {task_index + 1}",
                *[f"{score:.1f}" for score in task_scores],
                f"{avg_score:.1f}",
                f"- {agents[0]}" if agents else "",
            )

            # Add rows for additional agents
            for agent in agents[1:]:
                table.add_row("", "", "", "", "", f"- {agent}")

            # Add a blank separator row if it's not the last task
            if task_index < len(self.crew.tasks) - 1:
                table.add_row("", "", "", "", "", "")

        # Add Crew and Execution Time rows
        crew_scores = [
            sum(self.tasks_scores[run]) / len(self.tasks_scores[run])
            for run in range(1, len(self.tasks_scores) + 1)
        ]
        table.add_row(
            "Crew",
            *[f"{score:.2f}" for score in crew_scores],
            f"{crew_average:.1f}",
            "",
        )
        run_exec_times = [
            int(sum(tasks_exec_times))
            for _, tasks_exec_times in self.run_execution_times.items()
        ]
        execution_time_avg = int(sum(run_exec_times) / len(run_exec_times))
        table.add_row(
            "Execution Time (s)", *map(str, run_exec_times), f"{execution_time_avg}", ""
        )

        console = Console()
        console.print(table)

    def evaluate(self, task_output: TaskOutput):
        """Evaluates the performance of the agents in the crew based on the tasks they have performed."""
        current_task = None
        for task in self.crew.tasks:
            if task.description == task_output.description:
                current_task = task
                break

        if not current_task or not task_output:
            raise ValueError(
                "Task to evaluate and task output are required for evaluation"
            )

        evaluator_agent = self._evaluator_agent()
        evaluation_task = self._evaluation_task(
            evaluator_agent, current_task, task_output.raw
        )

        evaluation_result = evaluation_task.execute_sync()

        if isinstance(evaluation_result.pydantic, TaskEvaluationPydanticOutput):
            self._test_result_span = self._telemetry.individual_test_result_span(
                self.crew,
                evaluation_result.pydantic.quality,
                current_task.execution_duration,
                self.llm.model,
            )
            self.tasks_scores[self.iteration].append(evaluation_result.pydantic.quality)
            self.run_execution_times[self.iteration].append(
                current_task.execution_duration
            )
        else:
            raise ValueError("Evaluation result is not in the expected format")
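
For context (this is not part of the file above), a rough sketch of how the evaluator is typically driven, roughly what crewai test -n 3 does around it; crew, eval_llm, and inputs are placeholders for whatever the caller already has:

from crewai.utilities.evaluators.crew_evaluator_handler import CrewEvaluator

evaluator = CrewEvaluator(crew, eval_llm)  # eval_llm: any LLM/BaseLLM instance
for iteration in range(1, 4):
    evaluator.set_iteration(iteration)
    crew.kickoff(inputs=inputs)  # task callbacks feed scores back into the evaluator
evaluator.print_crew_evaluation_result()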