Introduce Evaluator Experiment (#3133)

* feat: add exchanged messages to LLMCallCompletedEvent

* feat: add GoalAlignment metric for Agent evaluation

* feat: add SemanticQuality metric for Agent evaluation

* feat: add Tool Metrics for Agent evaluation

* feat: add Reasoning Metrics for Agent evaluation, still in progress

* feat: add AgentEvaluator class

This class will evaluate an Agent's results and report them to the user (a usage sketch follows this list)

* fix: do not evaluate Agent by default

This is an experimental feature; we still need to refine it further

* test: add Agent eval tests

* fix: render all feedback per iteration

* style: resolve linter issues

* style: fix mypy issues

* fix: allow messages to be empty on LLMCallCompletedEvent

* feat: add Experiment evaluation framework with baseline comparison (see the sketch after this list)

* fix: reset evaluator for each experiment iteration

* fix: fix tracking of new test cases

* chore: split Experimental evaluation classes

* refactor: remove unused method

* refactor: isolate Console print in a dedicated class

* fix: make crew required to run an experiment

* fix: use timezone-aware timestamps to define experiment results

* test: add tests for Evaluator Experiment

* style: fix linter issues

* fix: encode string before hashing (see the snippet after this list)

* style: resolve linter issues

* feat: add experimental folder for beta features (#3141)

* test: move tests to experimental folder
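
To make the shape of the new feature concrete, here is a minimal sketch of how the evaluator and experiment pieces could be wired together. The crewai.experimental.evaluation module comes from this PR, but every constructor argument and runner name below is an illustrative assumption, not the confirmed API.

# Sketch only -- AgentEvaluator's constructor and the experiment runner
# names are assumptions for illustration, not the confirmed API.
from crewai import Agent, Crew, Task
from crewai.experimental.evaluation import AgentEvaluator  # module added by this PR

agent = Agent(
    role="Researcher",
    goal="Summarize recent papers on LLM evaluation",
    backstory="A meticulous analyst.",
)
task = Task(
    description="Summarize three recent papers on LLM evaluation.",
    expected_output="A three-paragraph summary.",
    agent=agent,
)
crew = Crew(agents=[agent], tasks=[task])

# Evaluation is opt-in: agents are not evaluated by default.
evaluator = AgentEvaluator(agents=[agent])  # hypothetical constructor
crew.kickoff()
# The evaluator scores each iteration on goal alignment, semantic
# quality, tool use, and reasoning, and renders the feedback through
# a dedicated console-printing class.

# The Experiment framework runs a suite of test cases against the crew
# (a crew is required) and compares scores to a stored baseline:
# runner = ExperimentRunner(crew=crew, test_cases=[...])  # hypothetical
# results = runner.run()
# results.compare_to_baseline()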
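
The "encode string before hashing" fix above reflects a plain Python constraint: hashlib digests accept bytes, not str, so test-case identifiers have to be encoded first. A quick illustration (the helper name is mine):

import hashlib

def test_case_key(description: str) -> str:
    # hashlib.sha256 raises TypeError on a str; encode to bytes first.
    return hashlib.sha256(description.encode("utf-8")).hexdigest()

print(test_case_key("baseline case #1"))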
Author: Lucas Gomide
Date: 2025-07-14 10:06:45 -03:00
Committed by: GitHub
Parent: 3ada4053bd
Commit: 1b6b2b36d9
27 changed files with 2512 additions and 16 deletions


@@ -0,0 +1,66 @@
from typing import Any, Dict

from crewai.agent import Agent
from crewai.task import Task
from crewai.experimental.evaluation.base_evaluator import (
    BaseEvaluator,
    EvaluationScore,
    MetricCategory,
)
from crewai.experimental.evaluation.json_parser import extract_json_from_llm_response


class GoalAlignmentEvaluator(BaseEvaluator):
    """Scores how well an agent's final output aligns with its assigned task goal."""

    @property
    def metric_category(self) -> MetricCategory:
        return MetricCategory.GOAL_ALIGNMENT

    def evaluate(
        self,
        agent: Agent,
        task: Task,
        execution_trace: Dict[str, Any],
        final_output: Any,
    ) -> EvaluationScore:
        # Ask the evaluation LLM to grade goal alignment on a 0-10 scale
        # and return a structured JSON verdict.
        prompt = [
            {"role": "system", "content": """You are an expert evaluator assessing how well an AI agent's output aligns with its assigned task goal.

Score the agent's goal alignment on a scale from 0-10 where:
- 0: Complete misalignment, agent did not understand or attempt the task goal
- 5: Partial alignment, agent attempted the task but missed key requirements
- 10: Perfect alignment, agent fully satisfied all task requirements

Consider:
1. Did the agent correctly interpret the task goal?
2. Did the final output directly address the requirements?
3. Did the agent focus on relevant aspects of the task?
4. Did the agent provide all requested information or deliverables?

Return your evaluation as JSON with fields 'score' (number) and 'feedback' (string).
"""},
            {"role": "user", "content": f"""
Agent role: {agent.role}
Agent goal: {agent.goal}
Task description: {task.description}
Expected output: {task.expected_output}

Agent's final output:
{final_output}

Evaluate how well the agent's output aligns with the assigned task goal.
"""},
        ]

        assert self.llm is not None
        response = self.llm.call(prompt)

        try:
            evaluation_data: dict[str, Any] = extract_json_from_llm_response(response)
            assert evaluation_data is not None

            return EvaluationScore(
                score=evaluation_data.get("score", 0),
                feedback=evaluation_data.get("feedback", response),
                raw_response=response,
            )
        except Exception:
            # Fall back to an unscored result if the LLM reply is not valid JSON.
            return EvaluationScore(
                score=None,
                feedback=f"Failed to parse evaluation. Raw response: {response}",
                raw_response=response,
            )
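
For completeness, a sketch of invoking this evaluator directly. The evaluate signature and the EvaluationScore fields are taken from the diff above; the import path and the llm= constructor argument are assumptions.

# Sketch only: the import path and constructor are assumptions;
# evaluate()'s signature matches the diff above.
from crewai import LLM, Agent, Task
from crewai.experimental.evaluation import GoalAlignmentEvaluator  # path assumed

agent = Agent(
    role="Researcher",
    goal="Summarize recent papers on LLM evaluation",
    backstory="A meticulous analyst.",
)
task = Task(
    description="Summarize three recent papers on LLM evaluation.",
    expected_output="A three-paragraph summary.",
    agent=agent,
)

evaluator = GoalAlignmentEvaluator(llm=LLM(model="gpt-4o-mini"))  # constructor assumed
score = evaluator.evaluate(
    agent=agent,
    task=task,
    execution_trace={},
    final_output="A three-paragraph summary of the selected papers...",
)
print(score.score, score.feedback)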