Files
crewAI/lib/crewai/tests/agents/test_agent_reasoning.py
Lorenze Jay 32d7b4a8d4 Lorenze/feat/plan execute pattern (#4817)
* feat: introduce PlanningConfig for enhanced agent planning capabilities (#4344)

* feat: introduce PlanningConfig for enhanced agent planning capabilities

This update adds a new PlanningConfig class to manage agent planning configurations, allowing for customizable planning behavior before task execution. The existing reasoning parameter is deprecated in favor of this new configuration, ensuring backward compatibility while enhancing the planning process. Additionally, the Agent class has been updated to utilize this new configuration, and relevant utility functions have been adjusted accordingly. Tests have been added to validate the new planning functionality and ensure proper integration with existing agent workflows.

* dropping redundancy

* fix test

* revert handle_reasoning here

* refactor: update reasoning handling in Agent class

This commit modifies the Agent class to conditionally call the handle_reasoning function based on the executor class being used. The legacy CrewAgentExecutor will continue to utilize handle_reasoning, while the new AgentExecutor will manage planning internally. Additionally, the PlanningConfig class has been referenced in the documentation to clarify its role in enabling or disabling planning. Tests have been updated to reflect these changes and ensure proper functionality.

* improve planning prompts

* matching

* refactor: remove default enabled flag from PlanningConfig in Agent class

* more cassettes

* fix test

* refactor: update planning prompt and remove deprecated methods in reasoning handler

* improve planning prompt

* Lorenze/feat planning pt 2 todo list gen (#4449)

* feat: introduce PlanningConfig for enhanced agent planning capabilities

This update adds a new PlanningConfig class to manage agent planning configurations, allowing for customizable planning behavior before task execution. The existing reasoning parameter is deprecated in favor of this new configuration, ensuring backward compatibility while enhancing the planning process. Additionally, the Agent class has been updated to utilize this new configuration, and relevant utility functions have been adjusted accordingly. Tests have been added to validate the new planning functionality and ensure proper integration with existing agent workflows.

* dropping redundancy

* fix test

* revert handle_reasoning here

* refactor: update reasoning handling in Agent class

This commit modifies the Agent class to conditionally call the handle_reasoning function based on the executor class being used. The legacy CrewAgentExecutor will continue to utilize handle_reasoning, while the new AgentExecutor will manage planning internally. Additionally, the PlanningConfig class has been referenced in the documentation to clarify its role in enabling or disabling planning. Tests have been updated to reflect these changes and ensure proper functionality.

* improve planning prompts

* matching

* refactor: remove default enabled flag from PlanningConfig in Agent class

* more cassettes

* fix test

* feat: enhance agent planning with structured todo management

This commit introduces a new planning system within the AgentExecutor class, allowing for the creation of structured todo items from planning steps. The TodoList and TodoItem models have been added to facilitate tracking of plan execution. The reasoning plan now includes a list of steps, improving the clarity and organization of agent tasks. Additionally, tests have been added to validate the new planning functionality and ensure proper integration with existing workflows.

* refactor: update planning prompt and remove deprecated methods in reasoning handler

* improve planning prompt

* improve handler

* linted

* linted

* Lorenze/feat/planning pt 3 todo list execution (#4450)

* feat: introduce PlanningConfig for enhanced agent planning capabilities

This update adds a new PlanningConfig class to manage agent planning configurations, allowing for customizable planning behavior before task execution. The existing reasoning parameter is deprecated in favor of this new configuration, ensuring backward compatibility while enhancing the planning process. Additionally, the Agent class has been updated to utilize this new configuration, and relevant utility functions have been adjusted accordingly. Tests have been added to validate the new planning functionality and ensure proper integration with existing agent workflows.

* dropping redundancy

* fix test

* revert handle_reasoning here

* refactor: update reasoning handling in Agent class

This commit modifies the Agent class to conditionally call the handle_reasoning function based on the executor class being used. The legacy CrewAgentExecutor will continue to utilize handle_reasoning, while the new AgentExecutor will manage planning internally. Additionally, the PlanningConfig class has been referenced in the documentation to clarify its role in enabling or disabling planning. Tests have been updated to reflect these changes and ensure proper functionality.

* improve planning prompts

* matching

* refactor: remove default enabled flag from PlanningConfig in Agent class

* more cassettes

* fix test

* feat: enhance agent planning with structured todo management

This commit introduces a new planning system within the AgentExecutor class, allowing for the creation of structured todo items from planning steps. The TodoList and TodoItem models have been added to facilitate tracking of plan execution. The reasoning plan now includes a list of steps, improving the clarity and organization of agent tasks. Additionally, tests have been added to validate the new planning functionality and ensure proper integration with existing workflows.

* refactor: update planning prompt and remove deprecated methods in reasoning handler

* improve planning prompt

* improve handler

* execute todos and be able to track them

* feat: introduce PlannerObserver and StepExecutor for enhanced plan execution

This commit adds the PlannerObserver and StepExecutor classes to the CrewAI framework, implementing the observation phase of the Plan-and-Execute architecture. The PlannerObserver analyzes step execution results, determines plan validity, and suggests refinements, while the StepExecutor executes individual todo items in isolation. These additions improve the overall planning and execution process, allowing for more dynamic and responsive agent behavior. Additionally, new observation events have been defined to facilitate monitoring and logging of the planning process.

* refactor: enhance final answer synthesis in AgentExecutor

This commit improves the synthesis of final answers in the AgentExecutor class by implementing a more coherent approach to combining results from multiple todo items. The method now utilizes a single LLM call to generate a polished response, falling back to concatenation if the synthesis fails. Additionally, the test cases have been updated to reflect the changes in planning and execution, ensuring that the results are properly validated and that the plan-and-execute architecture is functioning as intended.

* refactor: enhance final answer synthesis in AgentExecutor

This commit improves the synthesis of final answers in the AgentExecutor class by implementing a more coherent approach to combining results from multiple todo items. The method now utilizes a single LLM call to generate a polished response, falling back to concatenation if the synthesis fails. Additionally, the test cases have been updated to reflect the changes in planning and execution, ensuring that the results are properly validated and that the plan-and-execute architecture is functioning as intended.

* refactor: implement structured output handling in final answer synthesis

This commit enhances the final answer synthesis process in the AgentExecutor class by introducing support for structured outputs when a response model is specified. The synthesis method now utilizes the response model to produce outputs that conform to the expected schema, while still falling back to concatenation in case of synthesis failures. This change ensures that intermediate steps yield free-text results, but the final output can be structured, improving the overall coherence and usability of the synthesized answers.

* regen tests

* linted

* fix

* Enhance PlanningConfig and AgentExecutor with Reasoning Effort Levels

This update introduces a new  attribute in the  class, allowing users to customize the observation and replanning behavior during task execution. The  class has been modified to utilize this new attribute, routing step observations based on the specified reasoning effort level: low, medium, or high.

Additionally, tests have been added to validate the functionality of the reasoning effort levels, ensuring that the agent behaves as expected under different configurations. This enhancement improves the adaptability and efficiency of the planning process in agent execution.

* regen cassettes for test and fix test

* cassette regen

* fixing tests

* dry

* Refactor PlannerObserver and StepExecutor to Utilize I18N for Prompts

This update enhances the PlannerObserver and StepExecutor classes by integrating the I18N utility for managing prompts and messages. The system and user prompts are now retrieved from the I18N module, allowing for better localization and maintainability. Additionally, the code has been cleaned up to remove hardcoded strings, improving readability and consistency across the planning and execution processes.

* Refactor PlannerObserver and StepExecutor to Utilize I18N for Prompts

This update enhances the PlannerObserver and StepExecutor classes by integrating the I18N utility for managing prompts and messages. The system and user prompts are now retrieved from the I18N module, allowing for better localization and maintainability. Additionally, the code has been cleaned up to remove hardcoded strings, improving readability and consistency across the planning and execution processes.

* consolidate agent logic

* fix datetime

* improving step executor

* refactor: streamline observation and refinement process in PlannerObserver

- Updated the PlannerObserver to apply structured refinements directly from observations without requiring a second LLM call.
- Renamed  method to  for clarity.
- Enhanced documentation to reflect changes in how refinements are handled.
- Removed unnecessary LLM message building and parsing logic, simplifying the refinement process.
- Updated event emissions to include summaries of refinements instead of raw data.

* enhance step executor with tool usage events and validation

- Added event emissions for tool usage, including started and finished events, to track tool execution.
- Implemented validation to ensure expected tools are called during step execution, raising errors when not.
- Refactored the  method to handle tool execution with event logging.
- Introduced a new method  for parsing tool input into a structured format.
- Updated tests to cover new functionality and ensure correct behavior of tool usage events.

* refactor: enhance final answer synthesis logic in AgentExecutor

- Updated the finalization process to conditionally skip synthesis when the last todo result is sufficient as a complete answer.
- Introduced a new method to determine if the last todo result can be used directly, improving efficiency.
- Added tests to verify the new behavior, ensuring synthesis is skipped when appropriate and maintained when a response model is set.

* fix: update observation handling in PlannerObserver for LLM errors

- Modified the error handling in the PlannerObserver to default to a conservative replan when an LLM call fails.
- Updated the return values to indicate that the step was not completed successfully and that a full replan is needed.
- Added a new test to verify the behavior of the observer when an LLM error occurs, ensuring the correct replan logic is triggered.

* refactor: enhance planning and execution flow in agents

- Updated the PlannerObserver to accept a kickoff input for standalone task execution, improving flexibility in task handling.
- Refined the step execution process in StepExecutor to support multi-turn action loops, allowing for iterative tool execution and observation.
- Introduced a method to extract relevant task sections from descriptions, ensuring clarity in task requirements.
- Enhanced the AgentExecutor to manage step failures more effectively, triggering replans only when necessary and preserving completed task history.
- Updated translations to reflect changes in planning principles and execution prompts, emphasizing concrete and executable steps.

* refactor: update setup_native_tools to include tool_name_mapping

- Modified the setup_native_tools function to return an additional mapping of tool names.
- Updated StepExecutor and AgentExecutor classes to accommodate the new return value from setup_native_tools.

* fix tests

* linted

* linted

* feat: enhance image block handling in Anthropic provider and update AgentExecutor logic

- Added a method to convert OpenAI-style image_url blocks to Anthropic's required format.
- Updated AgentExecutor to handle cases where no todos are ready, introducing a needs_replan return state.
- Improved fallback answer generation in AgentExecutor to prevent RuntimeErrors when no final output is produced.

* lint

* lint

* 1. Added failed to TodoStatus (planning_types.py)

  - TodoStatus now includes failed as a valid state: Literal[pending, running, completed, failed]
  - Added mark_failed(step_number, result) method to TodoList
  - Added get_failed_todos() method to TodoList
  - Updated is_complete to treat both completed and failed as terminal states
  - Updated replace_pending_todos docstring to mention failed items are preserved

  2. Mark running todos as failed before replan (agent_executor.py)

  All three effort-level handlers now call mark_failed() on the current todo before routing to replan_now:

  - Low effort (handle_step_observed_low): hard-failure branch
  - Medium effort (handle_step_observed_medium): needs_full_replan branch
  - High effort (decide_next_action): both needs_full_replan and step_completed_successfully=False branches

  3. Updated _should_replan to use get_failed_todos()

  Previously filtered on todo.status == failed which was dead code. Now uses the proper accessor method that will actually find failed items.

  What this fixes: Before these changes, a step that triggered a replan would stay in running status permanently, causing is_complete to never
  return True and next_pending to skip it — leading to stuck execution states. Now failed steps are properly tracked, replanning context correctly
  reports them, and LiteAgentOutput.failed_todos will actually return results.

* fix test

* imp on failed states

* adjusted the var name from AgentReActState to AgentExecutorState

* addressed p0 bugs

* more improvements

* linted

* regen cassette

* addressing crictical comments

* ensure configurable timeouts, max_replans and max step iterations

* adjusted tools

* dropping debug statements

* addressed comment

* fix  linter

* lints and test fixes

* fix: default observation parse fallback to failure and clean up plan-execute types

When _parse_observation_response fails all parse attempts, default to
step_completed_successfully=False instead of True to avoid silently
masking failures. Extract duplicate _extract_task_section into a shared
utility in agent_utils. Type PlanningConfig.llm as str | BaseLLM | None
instead of str | Any | None. Make StepResult a frozen dataclass for
immutability consistency with StepExecutionContext.

* fix: remove Any from function_calling_llm union type in step_executor

* fix: make BaseTool usage count thread-safe for parallel step execution

Add _usage_lock and _claim_usage() to BaseTool for atomic
check-and-increment of current_usage_count. This prevents race
conditions when parallel plan steps invoke the same tool concurrently
via execute_todos_parallel. Remove the racy pre-check from
execute_single_native_tool_call since the limit is now enforced
atomically inside tool.run().

---------

Co-authored-by: Greyson LaLonde <greyson.r.lalonde@gmail.com>
Co-authored-by: Greyson LaLonde <greyson@crewai.com>
2026-03-15 18:33:17 -07:00

346 lines
9.7 KiB
Python

"""Tests for planning/reasoning in agents."""
import warnings
import pytest
from crewai import Agent, PlanningConfig, Task
from crewai.llm import LLM
# =============================================================================
# Tests for PlanningConfig configuration (no LLM calls needed)
# =============================================================================
def test_planning_config_default_values():
"""Test PlanningConfig default values."""
config = PlanningConfig()
assert config.max_attempts is None
assert config.max_steps == 20
assert config.system_prompt is None
assert config.plan_prompt is None
assert config.refine_prompt is None
assert config.llm is None
def test_planning_config_custom_values():
"""Test PlanningConfig with custom values."""
config = PlanningConfig(
max_attempts=5,
max_steps=15,
system_prompt="Custom system",
plan_prompt="Custom plan: {description}",
refine_prompt="Custom refine: {current_plan}",
llm="gpt-4",
)
assert config.max_attempts == 5
assert config.max_steps == 15
assert config.system_prompt == "Custom system"
assert config.plan_prompt == "Custom plan: {description}"
assert config.refine_prompt == "Custom refine: {current_plan}"
assert config.llm == "gpt-4"
def test_agent_with_planning_config_custom_prompts():
"""Test agent with PlanningConfig using custom prompts."""
llm = LLM("gpt-4o-mini")
custom_system_prompt = "You are a specialized planner."
custom_plan_prompt = "Plan this task: {description}"
agent = Agent(
role="Test Agent",
goal="To test custom prompts",
backstory="I am a test agent.",
llm=llm,
planning_config=PlanningConfig(
system_prompt=custom_system_prompt,
plan_prompt=custom_plan_prompt,
max_steps=10,
),
verbose=False,
)
# Just test that the agent is created properly
assert agent.planning_config is not None
assert agent.planning_config.system_prompt == custom_system_prompt
assert agent.planning_config.plan_prompt == custom_plan_prompt
assert agent.planning_config.max_steps == 10
def test_agent_with_planning_config_disabled():
"""Test agent with PlanningConfig disabled."""
llm = LLM("gpt-4o-mini")
agent = Agent(
role="Test Agent",
goal="To test disabled planning",
backstory="I am a test agent.",
llm=llm,
planning=False,
verbose=False,
)
# Planning should be disabled
assert agent.planning_enabled is False
def test_planning_enabled_property():
"""Test the planning_enabled property on Agent."""
llm = LLM("gpt-4o-mini")
# With planning_config enabled
agent_with_planning = Agent(
role="Test Agent",
goal="Test",
backstory="Test",
llm=llm,
planning=True,
)
assert agent_with_planning.planning_enabled is True
# With planning_config disabled
agent_disabled = Agent(
role="Test Agent",
goal="Test",
backstory="Test",
llm=llm,
planning=False,
)
assert agent_disabled.planning_enabled is False
# Without planning_config
agent_no_planning = Agent(
role="Test Agent",
goal="Test",
backstory="Test",
llm=llm,
)
assert agent_no_planning.planning_enabled is False
# =============================================================================
# Tests for backward compatibility with reasoning=True (no LLM calls)
# =============================================================================
def test_agent_with_reasoning_backward_compat():
"""Test agent with reasoning=True (backward compatibility)."""
llm = LLM("gpt-4o-mini")
# This should emit a deprecation warning
with warnings.catch_warnings(record=True):
warnings.simplefilter("always")
agent = Agent(
role="Test Agent",
goal="To test the reasoning feature",
backstory="I am a test agent created to verify the reasoning feature works correctly.",
llm=llm,
reasoning=True,
verbose=False,
)
# Should have created a PlanningConfig internally
assert agent.planning_config is not None
assert agent.planning_enabled is True
def test_agent_with_reasoning_and_max_attempts_backward_compat():
"""Test agent with reasoning=True and max_reasoning_attempts (backward compatibility)."""
llm = LLM("gpt-4o-mini")
agent = Agent(
role="Test Agent",
goal="To test the reasoning feature",
backstory="I am a test agent.",
llm=llm,
reasoning=True,
max_reasoning_attempts=5,
verbose=False,
)
# Should have created a PlanningConfig with max_attempts
assert agent.planning_config is not None
assert agent.planning_config.max_attempts == 5
# =============================================================================
# Tests for Agent.kickoff() with planning (uses AgentExecutor)
# =============================================================================
@pytest.mark.vcr()
def test_agent_kickoff_with_planning():
"""Test Agent.kickoff() with planning enabled generates a plan."""
llm = LLM("gpt-4o-mini")
agent = Agent(
role="Math Assistant",
goal="Help solve math problems step by step",
backstory="A helpful math tutor",
llm=llm,
planning_config=PlanningConfig(max_attempts=1),
verbose=False,
)
result = agent.kickoff("What is 15 + 27?")
assert result is not None
assert "42" in str(result)
@pytest.mark.vcr()
def test_agent_kickoff_without_planning():
"""Test Agent.kickoff() without planning skips plan generation."""
llm = LLM("gpt-4o-mini")
agent = Agent(
role="Math Assistant",
goal="Help solve math problems",
backstory="A helpful assistant",
llm=llm,
# No planning_config = no planning
verbose=False,
)
result = agent.kickoff("What is 8 * 7?")
assert result is not None
assert "56" in str(result)
@pytest.mark.vcr()
def test_agent_kickoff_with_planning_disabled():
"""Test Agent.kickoff() with planning explicitly disabled via planning=False."""
llm = LLM("gpt-4o-mini")
agent = Agent(
role="Math Assistant",
goal="Help solve math problems",
backstory="A helpful assistant",
llm=llm,
planning=False, # Explicitly disable planning
verbose=False,
)
result = agent.kickoff("What is 100 / 4?")
assert result is not None
assert "25" in str(result)
@pytest.mark.vcr()
def test_agent_kickoff_multi_step_task_with_planning():
"""Test Agent.kickoff() with a multi-step task that benefits from planning."""
llm = LLM("gpt-4o-mini")
agent = Agent(
role="Math Tutor",
goal="Solve multi-step math problems",
backstory="An expert tutor who explains step by step",
llm=llm,
planning_config=PlanningConfig(max_attempts=1, max_steps=5),
verbose=False,
)
# Task requires: find primes, sum them, then double
result = agent.kickoff(
"Find the first 3 prime numbers, add them together, then multiply by 2."
)
assert result is not None
# First 3 primes: 2, 3, 5 -> sum = 10 -> doubled = 20
assert "20" in str(result)
# =============================================================================
# Tests for Agent.execute_task() with planning (uses CrewAgentExecutor)
# These test the legacy path via handle_reasoning()
# =============================================================================
@pytest.mark.vcr()
def test_agent_execute_task_with_planning():
"""Test Agent.execute_task() with planning via CrewAgentExecutor."""
llm = LLM("gpt-4o-mini")
agent = Agent(
role="Math Assistant",
goal="Help solve math problems",
backstory="A helpful math tutor",
llm=llm,
planning_config=PlanningConfig(max_attempts=1),
verbose=False,
)
task = Task(
description="What is 9 + 11?",
expected_output="A number",
agent=agent,
)
result = agent.execute_task(task)
assert result is not None
assert "20" in str(result)
# Planning should be appended to task description
assert "Planning:" in task.description
@pytest.mark.vcr()
def test_agent_execute_task_without_planning():
"""Test Agent.execute_task() without planning."""
llm = LLM("gpt-4o-mini")
agent = Agent(
role="Math Assistant",
goal="Help solve math problems",
backstory="A helpful assistant",
llm=llm,
verbose=False,
)
task = Task(
description="What is 12 * 3?",
expected_output="A number",
agent=agent,
)
result = agent.execute_task(task)
assert result is not None
assert "36" in str(result)
# No planning should be added
assert "Planning:" not in task.description
@pytest.mark.vcr()
def test_agent_execute_task_with_planning_refine():
"""Test Agent.execute_task() with planning that requires refinement."""
llm = LLM("gpt-4o-mini")
agent = Agent(
role="Math Tutor",
goal="Solve complex math problems step by step",
backstory="An expert tutor",
llm=llm,
planning_config=PlanningConfig(max_attempts=2),
verbose=False,
)
task = Task(
description="Calculate the area of a circle with radius 5 (use pi = 3.14)",
expected_output="The area as a number",
agent=agent,
)
result = agent.execute_task(task)
assert result is not None
# Area = pi * r^2 = 3.14 * 25 = 78.5
assert "78" in str(result) or "79" in str(result)
assert "Planning:" in task.description