adjust aop to amp docs lang (#4179)
Some checks failed
CodeQL Advanced / Analyze (actions) (push) Has been cancelled
CodeQL Advanced / Analyze (python) (push) Has been cancelled
Check Documentation Broken Links / Check broken links (push) Has been cancelled
Notify Downstream / notify-downstream (push) Has been cancelled
Mark stale issues and pull requests / stale (push) Has been cancelled

* adjust aop to amp docs lang

* whoop no print
This commit is contained in:
Lorenze Jay
2026-01-05 15:30:21 -08:00
committed by GitHub
parent f8deb0fd18
commit 25c0c030ce
203 changed files with 5176 additions and 2715 deletions

View File

@@ -1,7 +1,7 @@
---
title: 'Strategic LLM Selection Guide'
description: 'Strategic framework for choosing the right LLM for your CrewAI AI agents and writing effective task and agent definitions'
icon: 'brain-circuit'
title: "Strategic LLM Selection Guide"
description: "Strategic framework for choosing the right LLM for your CrewAI AI agents and writing effective task and agent definitions"
icon: "brain-circuit"
mode: "wide"
---
@@ -10,23 +10,35 @@ mode: "wide"
Rather than prescriptive model recommendations, we advocate for a **thinking framework** that helps you make informed decisions based on your specific use case, constraints, and requirements. The LLM landscape evolves rapidly, with new models emerging regularly and existing ones being updated frequently. What matters most is developing a systematic approach to evaluation that remains relevant regardless of which specific models are available.
<Note>
This guide focuses on strategic thinking rather than specific model recommendations, as the LLM landscape evolves rapidly.
This guide focuses on strategic thinking rather than specific model
recommendations, as the LLM landscape evolves rapidly.
</Note>
## Quick Decision Framework
<Steps>
<Step title="Analyze Your Tasks">
Begin by deeply understanding what your tasks actually require. Consider the cognitive complexity involved, the depth of reasoning needed, the format of expected outputs, and the amount of context the model will need to process. This foundational analysis will guide every subsequent decision.
Begin by deeply understanding what your tasks actually require. Consider the
cognitive complexity involved, the depth of reasoning needed, the format of
expected outputs, and the amount of context the model will need to process.
This foundational analysis will guide every subsequent decision.
</Step>
<Step title="Map Model Capabilities">
Once you understand your requirements, map them to model strengths. Different model families excel at different types of work; some are optimized for reasoning and analysis, others for creativity and content generation, and others for speed and efficiency.
Once you understand your requirements, map them to model strengths.
Different model families excel at different types of work; some are
optimized for reasoning and analysis, others for creativity and content
generation, and others for speed and efficiency.
</Step>
<Step title="Consider Constraints">
Factor in your real-world operational constraints including budget limitations, latency requirements, data privacy needs, and infrastructure capabilities. The theoretically best model may not be the practically best choice for your situation.
Factor in your real-world operational constraints including budget
limitations, latency requirements, data privacy needs, and infrastructure
capabilities. The theoretically best model may not be the practically best
choice for your situation.
</Step>
<Step title="Test and Iterate">
Start with reliable, well-understood models and optimize based on actual performance in your specific use case. Real-world results often differ from theoretical benchmarks, so empirical testing is crucial.
Start with reliable, well-understood models and optimize based on actual
performance in your specific use case. Real-world results often differ from
theoretical benchmarks, so empirical testing is crucial.
</Step>
</Steps>
@@ -43,6 +55,7 @@ The most critical step in LLM selection is understanding what your task actually
- **Complex Tasks** require multi-step reasoning, strategic thinking, and the ability to handle ambiguous or incomplete information. These might involve analyzing multiple data sources, developing comprehensive strategies, or solving problems that require breaking down into smaller components. The model needs to maintain context across multiple reasoning steps and often must make inferences that aren't explicitly stated.
- **Creative Tasks** demand a different type of cognitive capability focused on generating novel, engaging, and contextually appropriate content. This includes storytelling, marketing copy creation, and creative problem-solving. The model needs to understand nuance, tone, and audience while producing content that feels authentic and engaging rather than formulaic.
</Tab>
<Tab title="Output Requirements">
@@ -51,6 +64,7 @@ The most critical step in LLM selection is understanding what your task actually
- **Creative Content** outputs demand a balance of technical competence and creative flair. The model needs to understand audience, tone, and brand voice while producing content that engages readers and achieves specific communication goals. Quality here is often subjective and requires models that can adapt their writing style to different contexts and purposes.
- **Technical Content** sits between structured data and creative content, requiring both precision and clarity. Documentation, code generation, and technical analysis need to be accurate and comprehensive while remaining accessible to the intended audience. The model must understand complex technical concepts and communicate them effectively.
</Tab>
<Tab title="Context Needs">
@@ -59,6 +73,7 @@ The most critical step in LLM selection is understanding what your task actually
- **Long Context** requirements emerge when working with substantial documents, extended conversations, or complex multi-part tasks. The model needs to maintain coherence across thousands of tokens while referencing earlier information accurately. This capability becomes crucial for document analysis, comprehensive research, and sophisticated dialogue systems.
- **Very Long Context** scenarios push the boundaries of what's currently possible, involving massive document processing, extensive research synthesis, or complex multi-session interactions. These use cases require models specifically designed for extended context handling and often involve trade-offs between context length and processing speed.
</Tab>
</Tabs>
@@ -73,6 +88,7 @@ Understanding model capabilities requires looking beyond marketing claims and be
The strength of reasoning models lies in their ability to maintain logical consistency across extended reasoning chains and to break down complex problems into manageable components. They're particularly valuable for strategic planning, complex analysis, and situations where the quality of reasoning matters more than speed of response.
However, reasoning models often come with trade-offs in terms of speed and cost. They may also be less suitable for creative tasks or simple operations where their sophisticated reasoning capabilities aren't needed. Consider these models when your tasks involve genuine complexity that benefits from systematic, step-by-step analysis.
</Accordion>
<Accordion title="General Purpose Models" icon="microchip">
@@ -81,6 +97,7 @@ Understanding model capabilities requires looking beyond marketing claims and be
The primary advantage of general purpose models is their reliability and predictability across different types of work. They handle most standard business tasks competently, from research and analysis to content creation and data processing. This makes them excellent choices for teams that need consistent performance across varied workflows.
While general purpose models may not achieve the peak performance of specialized alternatives in specific domains, they offer operational simplicity and reduced complexity in model management. They're often the best starting point for new projects, allowing teams to understand their specific needs before potentially optimizing with more specialized models.
</Accordion>
<Accordion title="Fast & Efficient Models" icon="bolt">
@@ -89,6 +106,7 @@ Understanding model capabilities requires looking beyond marketing claims and be
These models excel in scenarios involving routine operations, simple data processing, function calling, and high-volume tasks where the cognitive requirements are relatively straightforward. They're particularly valuable for applications that need to process many requests quickly or operate within tight budget constraints.
The key consideration with efficient models is ensuring that their capabilities align with your task requirements. While they can handle many routine operations effectively, they may struggle with tasks requiring nuanced understanding, complex reasoning, or sophisticated content generation. They're best used for well-defined, routine operations where speed and cost matter more than sophistication.
</Accordion>
<Accordion title="Creative Models" icon="pen">
@@ -97,6 +115,7 @@ Understanding model capabilities requires looking beyond marketing claims and be
The strength of creative models lies in their ability to adapt writing style to different audiences, maintain consistent voice and tone, and generate content that engages readers effectively. They often perform better on tasks involving storytelling, marketing copy, brand communications, and other content where creativity and engagement are primary goals.
When selecting creative models, consider not just their ability to generate text, but their understanding of audience, context, and purpose. The best creative models can adapt their output to match specific brand voices, target different audience segments, and maintain consistency across extended content pieces.
</Accordion>
<Accordion title="Open Source Models" icon="code">
@@ -105,6 +124,7 @@ Understanding model capabilities requires looking beyond marketing claims and be
The primary benefits of open source models include elimination of per-token costs, ability to fine-tune for specific use cases, complete data privacy, and independence from external API providers. They're particularly valuable for organizations with strict data privacy requirements, budget constraints, or specific customization needs.
However, open source models require more technical expertise to deploy and maintain effectively. Teams need to consider infrastructure costs, model management complexity, and the ongoing effort required to keep models updated and optimized. The total cost of ownership may be higher than cloud-based alternatives when factoring in technical overhead.
</Accordion>
</AccordionGroup>
@@ -113,7 +133,8 @@ Understanding model capabilities requires looking beyond marketing claims and be
### a. Multi-Model Approach
<Tip>
Use different models for different purposes within the same crew to optimize both performance and cost.
Use different models for different purposes within the same crew to optimize
both performance and cost.
</Tip>
The most sophisticated CrewAI implementations often employ multiple models strategically, assigning different models to different agents based on their specific roles and requirements. This approach allows teams to optimize for both performance and cost by using the most appropriate model for each type of work.
@@ -177,6 +198,7 @@ The key to successful multi-model implementation is understanding how different
Effective manager LLMs require strong reasoning capabilities to make good delegation decisions, consistent performance to ensure predictable coordination, and excellent context management to track the state of multiple agents simultaneously. The model needs to understand the capabilities and limitations of different agents while optimizing task allocation for efficiency and quality.
Cost considerations are particularly important for manager LLMs since they're involved in every operation. The model needs to provide sufficient capability for effective coordination while remaining cost-effective for frequent use. This often means finding models that offer good reasoning capabilities without the premium pricing of the most sophisticated options.
</Tab>
<Tab title="Function Calling LLM">
@@ -185,6 +207,7 @@ The key to successful multi-model implementation is understanding how different
The most important characteristics for function calling LLMs are precision and reliability rather than creativity or sophisticated reasoning. The model needs to consistently extract the correct parameters from natural language requests and handle tool responses appropriately. Speed is also important since tool usage often involves multiple round trips that can impact overall performance.
Many teams find that specialized function calling models or general purpose models with strong tool support work better than creative or reasoning-focused models for this role. The key is ensuring that the model can reliably bridge the gap between natural language instructions and structured tool calls.
</Tab>
<Tab title="Agent-Specific Overrides">
@@ -193,6 +216,7 @@ The key to successful multi-model implementation is understanding how different
Consider agent-specific overrides when an agent's role requires capabilities that differ substantially from other crew members. For example, a creative writing agent might benefit from a model optimized for content generation, while a data analysis agent might perform better with a reasoning-focused model.
The challenge with agent-specific overrides is balancing optimization with operational complexity. Each additional model adds complexity to deployment, monitoring, and cost management. Teams should focus overrides on agents where the performance improvement justifies the additional complexity.
</Tab>
</Tabs>
@@ -209,6 +233,7 @@ Effective task definition is often more important than model selection in determ
Effective task descriptions include relevant context and constraints that help the agent understand the broader purpose and any limitations they need to work within. They break complex work into focused steps that can be executed systematically, rather than presenting overwhelming, multi-faceted objectives that are difficult to approach systematically.
Common mistakes include being too vague about objectives, failing to provide necessary context, setting unclear success criteria, or combining multiple unrelated tasks into a single description. The goal is to provide enough information for the agent to succeed while maintaining focus on a single, clear objective.
</Accordion>
<Accordion title="Expected Output Guidelines" icon="bullseye">
@@ -217,6 +242,7 @@ Effective task definition is often more important than model selection in determ
The best output guidelines provide concrete examples of quality indicators and define completion criteria clearly enough that both the agent and human reviewers can assess whether the task has been completed successfully. This reduces ambiguity and helps ensure consistent results across multiple task executions.
Avoid generic output descriptions that could apply to any task, missing format specifications that leave agents guessing about structure, unclear quality standards that make evaluation difficult, or failing to provide examples or templates that help agents understand expectations.
</Accordion>
</AccordionGroup>
@@ -229,6 +255,7 @@ Effective task definition is often more important than model selection in determ
Implementing sequential dependencies effectively requires using the context parameter to chain related tasks, building complexity gradually through task progression, and ensuring that each task produces outputs that serve as meaningful inputs for subsequent tasks. The goal is to maintain logical flow between dependent tasks while avoiding unnecessary bottlenecks.
Sequential dependencies work best when there's a clear logical progression from one task to another and when the output of one task genuinely improves the quality or feasibility of subsequent tasks. However, they can create bottlenecks if not managed carefully, so it's important to identify which dependencies are truly necessary versus those that are merely convenient.
</Tab>
<Tab title="Parallel Execution">
@@ -237,6 +264,7 @@ Effective task definition is often more important than model selection in determ
Successful parallel execution requires identifying tasks that can truly run independently, grouping related but separate work streams effectively, and planning for result integration when parallel tasks need to be combined into a final deliverable. The key is ensuring that parallel tasks don't create conflicts or redundancies that reduce overall quality.
Consider parallel execution when you have multiple independent research streams, different types of analysis that don't depend on each other, or content creation tasks that can be developed simultaneously. However, be mindful of resource allocation and ensure that parallel execution doesn't overwhelm your available model capacity or budget.
</Tab>
</Tabs>
@@ -245,7 +273,8 @@ Effective task definition is often more important than model selection in determ
### a. Role-Driven LLM Selection
<Warning>
Generic agent roles make it impossible to select the right LLM. Specific roles enable targeted model optimization.
Generic agent roles make it impossible to select the right LLM. Specific roles
enable targeted model optimization.
</Warning>
The specificity of your agent roles directly determines which LLM capabilities matter most for optimal performance. This creates a strategic opportunity to match precise model strengths with agent responsibilities.
@@ -253,6 +282,7 @@ The specificity of your agent roles directly determines which LLM capabilities m
**Generic vs. Specific Role Impact on LLM Choice:**
When defining roles, think about the specific domain knowledge, working style, and decision-making frameworks that would be most valuable for the tasks the agent will handle. The more specific and contextual the role definition, the better the model can embody that role effectively.
```python
# ✅ Specific role - clear LLM requirements
specific_agent = Agent(
@@ -273,7 +303,8 @@ specific_agent = Agent(
### b. Backstory as Model Context Amplifier
<Info>
Strategic backstories multiply your chosen LLM's effectiveness by providing domain-specific context that generic prompting cannot achieve.
Strategic backstories multiply your chosen LLM's effectiveness by providing
domain-specific context that generic prompting cannot achieve.
</Info>
A well-crafted backstory transforms your LLM choice from generic capability to specialized expertise. This is especially crucial for cost optimization - a well-contextualized efficient model can outperform a premium model without proper context.
@@ -300,6 +331,7 @@ domain_expert = Agent(
```
**Backstory Elements That Enhance LLM Performance:**
- **Domain Experience**: "10+ years in enterprise SaaS sales"
- **Specific Expertise**: "Specializes in technical due diligence for Series B+ rounds"
- **Working Style**: "Prefers data-driven decisions with clear documentation"
@@ -332,6 +364,7 @@ tech_writer = Agent(
```
**Alignment Checklist:**
- ✅ **Role Specificity**: Clear domain and responsibilities
- ✅ **LLM Match**: Model strengths align with role requirements
- ✅ **Backstory Depth**: Provides domain context the LLM can leverage
@@ -353,6 +386,7 @@ Rather than repeating the strategic framework, here's a tactical checklist for i
- Are any agents heavily tool-dependent?
**Action**: Document current agent roles and identify optimization opportunities.
</Step>
<Step title="Implement Crew-Level Strategy" icon="users-gear">
@@ -369,6 +403,7 @@ Rather than repeating the strategic framework, here's a tactical checklist for i
```
**Action**: Establish your crew's default LLM before optimizing individual agents.
</Step>
<Step title="Optimize High-Impact Agents" icon="star">
@@ -390,16 +425,18 @@ Rather than repeating the strategic framework, here's a tactical checklist for i
```
**Action**: Upgrade 20% of your agents that handle 80% of the complexity.
</Step>
<Step title="Validate with Enterprise Testing" icon="test-tube">
**Once you deploy your agents to production:**
- Use [CrewAI AOP platform](https://app.crewai.com) to A/B test your model selections
- Use [CrewAI AMP platform](https://app.crewai.com) to A/B test your model selections
- Run multiple iterations with real inputs to measure consistency and performance
- Compare cost vs. performance across your optimized setup
- Share results with your team for collaborative decision-making
**Action**: Replace guesswork with data-driven validation using the testing platform.
</Step>
</Steps>
@@ -412,6 +449,7 @@ Rather than repeating the strategic framework, here's a tactical checklist for i
Consider reasoning models for business strategy development, complex data analysis that requires drawing insights from multiple sources, multi-step problem solving where each step depends on previous analysis, and strategic planning tasks that require considering multiple variables and their interactions.
However, reasoning models often come with higher costs and slower response times, so they're best reserved for tasks where their sophisticated capabilities provide genuine value rather than being used for simple operations that don't require complex reasoning.
</Tab>
<Tab title="Creative Models">
@@ -420,6 +458,7 @@ Rather than repeating the strategic framework, here's a tactical checklist for i
Use creative models for blog post writing and article creation, marketing copy that needs to engage and persuade, creative storytelling and narrative development, and brand communications where voice and tone are crucial. These models often understand nuance and context better than general purpose alternatives.
Creative models may be less suitable for technical or analytical tasks where precision and factual accuracy are more important than engagement and style. They're best used when the creative and communicative aspects of the output are primary success factors.
</Tab>
<Tab title="Efficient Models">
@@ -428,6 +467,7 @@ Rather than repeating the strategic framework, here's a tactical checklist for i
Consider efficient models for data processing and transformation tasks, simple formatting and organization operations, function calling and tool usage where precision matters more than sophistication, and high-volume operations where cost per operation is a significant factor.
The key with efficient models is ensuring that their capabilities align with task requirements. They can handle many routine operations effectively but may struggle with tasks requiring nuanced understanding, complex reasoning, or sophisticated content generation.
</Tab>
<Tab title="Open Source Models">
@@ -436,6 +476,7 @@ Rather than repeating the strategic framework, here's a tactical checklist for i
Consider open source models for internal company tools where data privacy is paramount, privacy-sensitive applications that can't use external APIs, cost-optimized deployments where per-token pricing is prohibitive, and situations requiring custom model modifications or fine-tuning.
However, open source models require more technical expertise to deploy and maintain effectively. Consider the total cost of ownership including infrastructure, technical overhead, and ongoing maintenance when evaluating open source options.
</Tab>
</Tabs>
@@ -455,6 +496,7 @@ Rather than repeating the strategic framework, here's a tactical checklist for i
# Processing agent gets efficient model
processor = Agent(role="Data Processor", llm=LLM(model="gpt-4o-mini"))
```
</Accordion>
<Accordion title="Ignoring Crew-Level vs Agent-Level LLM Hierarchy" icon="shuffle">
@@ -474,6 +516,7 @@ Rather than repeating the strategic framework, here's a tactical checklist for i
# Agents inherit crew LLM unless specifically overridden
agent1 = Agent(llm=LLM(model="claude-3-5-sonnet")) # Override for specific needs
```
</Accordion>
<Accordion title="Function Calling Model Mismatch" icon="screwdriver-wrench">
@@ -492,6 +535,7 @@ Rather than repeating the strategic framework, here's a tactical checklist for i
llm=LLM(model="claude-3-5-sonnet") # Also strong with tools
)
```
</Accordion>
<Accordion title="Premature Optimization Without Testing" icon="gear">
@@ -507,6 +551,7 @@ Rather than repeating the strategic framework, here's a tactical checklist for i
# Test performance, then optimize specific agents as needed
# Use Enterprise platform testing to validate improvements
```
</Accordion>
<Accordion title="Overlooking Context and Memory Limitations" icon="brain">
@@ -515,6 +560,7 @@ Rather than repeating the strategic framework, here's a tactical checklist for i
**Real Example**: Using a short-context model for agents that need to maintain conversation history across multiple task iterations, or in crews with extensive agent-to-agent communication.
**CrewAI Solution**: Match context capabilities to crew communication patterns.
</Accordion>
</AccordionGroup>
@@ -522,26 +568,40 @@ Rather than repeating the strategic framework, here's a tactical checklist for i
<Steps>
<Step title="Start Simple" icon="play">
Begin with reliable, general-purpose models that are well-understood and widely supported. This provides a stable foundation for understanding your specific requirements and performance expectations before optimizing for specialized needs.
Begin with reliable, general-purpose models that are well-understood and
widely supported. This provides a stable foundation for understanding your
specific requirements and performance expectations before optimizing for
specialized needs.
</Step>
<Step title="Measure What Matters" icon="chart-line">
Develop metrics that align with your specific use case and business requirements rather than relying solely on general benchmarks. Focus on measuring outcomes that directly impact your success rather than theoretical performance indicators.
Develop metrics that align with your specific use case and business
requirements rather than relying solely on general benchmarks. Focus on
measuring outcomes that directly impact your success rather than theoretical
performance indicators.
</Step>
<Step title="Iterate Based on Results" icon="arrows-rotate">
Make model changes based on observed performance in your specific context rather than theoretical considerations or general recommendations. Real-world performance often differs significantly from benchmark results or general reputation.
Make model changes based on observed performance in your specific context
rather than theoretical considerations or general recommendations.
Real-world performance often differs significantly from benchmark results or
general reputation.
</Step>
<Step title="Consider Total Cost" icon="calculator">
Evaluate the complete cost of ownership including model costs, development time, maintenance overhead, and operational complexity. The cheapest model per token may not be the most cost-effective choice when considering all factors.
Evaluate the complete cost of ownership including model costs, development
time, maintenance overhead, and operational complexity. The cheapest model
per token may not be the most cost-effective choice when considering all
factors.
</Step>
</Steps>
<Tip>
Focus on understanding your requirements first, then select models that best match those needs. The best LLM choice is the one that consistently delivers the results you need within your operational constraints.
Focus on understanding your requirements first, then select models that best
match those needs. The best LLM choice is the one that consistently delivers
the results you need within your operational constraints.
</Tip>
### Enterprise-Grade Model Validation
For teams serious about optimizing their LLM selection, the **CrewAI AOP platform** provides sophisticated testing capabilities that go far beyond basic CLI testing. The platform enables comprehensive model evaluation that helps you make data-driven decisions about your LLM strategy.
For teams serious about optimizing their LLM selection, the **CrewAI AMP platform** provides sophisticated testing capabilities that go far beyond basic CLI testing. The platform enables comprehensive model evaluation that helps you make data-driven decisions about your LLM strategy.
<Frame>
![Enterprise Testing Interface](/images/enterprise/enterprise-testing.png)
@@ -562,7 +622,9 @@ For teams serious about optimizing their LLM selection, the **CrewAI AOP platfor
Go to [app.crewai.com](https://app.crewai.com) to get started!
<Info>
The Enterprise platform transforms model selection from guesswork into a data-driven process, enabling you to validate the principles in this guide with your actual use cases and requirements.
The Enterprise platform transforms model selection from guesswork into a
data-driven process, enabling you to validate the principles in this guide
with your actual use cases and requirements.
</Info>
## Key Principles Summary
@@ -572,21 +634,27 @@ The Enterprise platform transforms model selection from guesswork into a data-dr
Choose models based on what the task actually requires, not theoretical capabilities or general reputation.
</Card>
<Card title="Capability Matching" icon="puzzle-piece">
Align model strengths with agent roles and responsibilities for optimal performance.
</Card>
{" "}
<Card title="Capability Matching" icon="puzzle-piece">
Align model strengths with agent roles and responsibilities for optimal
performance.
</Card>
<Card title="Strategic Consistency" icon="link">
Maintain coherent model selection strategy across related components and workflows.
</Card>
{" "}
<Card title="Strategic Consistency" icon="link">
Maintain coherent model selection strategy across related components and
workflows.
</Card>
<Card title="Practical Testing" icon="flask">
Validate choices through real-world usage rather than benchmarks alone.
</Card>
{" "}
<Card title="Practical Testing" icon="flask">
Validate choices through real-world usage rather than benchmarks alone.
</Card>
<Card title="Iterative Improvement" icon="arrow-up">
Start simple and optimize based on actual performance and needs.
</Card>
{" "}
<Card title="Iterative Improvement" icon="arrow-up">
Start simple and optimize based on actual performance and needs.
</Card>
<Card title="Operational Balance" icon="scale-balanced">
Balance performance requirements with cost and complexity constraints.
@@ -594,13 +662,20 @@ The Enterprise platform transforms model selection from guesswork into a data-dr
</CardGroup>
<Check>
Remember: The best LLM choice is the one that consistently delivers the results you need within your operational constraints. Focus on understanding your requirements first, then select models that best match those needs.
Remember: The best LLM choice is the one that consistently delivers the
results you need within your operational constraints. Focus on understanding
your requirements first, then select models that best match those needs.
</Check>
## Current Model Landscape (June 2025)
<Warning>
**Snapshot in Time**: The following model rankings represent current leaderboard standings as of June 2025, compiled from [LMSys Arena](https://arena.lmsys.org/), [Artificial Analysis](https://artificialanalysis.ai/), and other leading benchmarks. LLM performance, availability, and pricing change rapidly. Always conduct your own evaluations with your specific use cases and data.
**Snapshot in Time**: The following model rankings represent current
leaderboard standings as of June 2025, compiled from [LMSys
Arena](https://arena.lmsys.org/), [Artificial
Analysis](https://artificialanalysis.ai/), and other leading benchmarks. LLM
performance, availability, and pricing change rapidly. Always conduct your own
evaluations with your specific use cases and data.
</Warning>
### Leading Models by Category
@@ -608,7 +683,10 @@ Remember: The best LLM choice is the one that consistently delivers the results
The tables below show a representative sample of current top-performing models across different categories, with guidance on their suitability for CrewAI agents:
<Note>
These tables/metrics showcase selected leading models in each category and are not exhaustive. Many excellent models exist beyond those listed here. The goal is to illustrate the types of capabilities to look for rather than provide a complete catalog.
These tables/metrics showcase selected leading models in each category and are
not exhaustive. Many excellent models exist beyond those listed here. The goal
is to illustrate the types of capabilities to look for rather than provide a
complete catalog.
</Note>
<Tabs>
@@ -624,6 +702,7 @@ These tables/metrics showcase selected leading models in each category and are n
| **Qwen3 235B (Reasoning)** | 62 | $2.63 | Moderate | Open-source alternative for reasoning tasks |
These models excel at multi-step reasoning and are ideal for agents that need to develop strategies, coordinate other agents, or analyze complex information.
</Tab>
<Tab title="Coding & Technical">
@@ -638,6 +717,7 @@ These tables/metrics showcase selected leading models in each category and are n
| **Llama 3.1 405B** | Good | 81.1% | $3.50 | Function calling LLM for tool-heavy workflows |
These models are optimized for code generation, debugging, and technical problem-solving, making them ideal for development-focused crews.
</Tab>
<Tab title="Speed & Efficiency">
@@ -652,6 +732,7 @@ These tables/metrics showcase selected leading models in each category and are n
| **Nova Micro** | High | 0.30s | $0.04 | Simple, fast task execution |
These models prioritize speed and efficiency, perfect for agents handling routine operations or requiring quick responses. **Pro tip**: Pairing these models with fast inference providers like Groq can achieve even better performance, especially for open-source models like Llama.
</Tab>
<Tab title="Balanced Performance">
@@ -666,6 +747,7 @@ These tables/metrics showcase selected leading models in each category and are n
| **Qwen3 32B** | 44 | Good | $1.23 | Budget-friendly versatility |
These models offer good performance across multiple dimensions, suitable for crews with diverse task requirements.
</Tab>
</Tabs>
@@ -676,24 +758,28 @@ These tables/metrics showcase selected leading models in each category and are n
**When performance is the priority**: Use top-tier models like **o3**, **Gemini 2.5 Pro**, or **Claude 4 Sonnet** for manager LLMs and critical agents. These models excel at complex reasoning and coordination but come with higher costs.
**Strategy**: Implement a multi-model approach where premium models handle strategic thinking while efficient models handle routine operations.
</Accordion>
<Accordion title="Cost-Conscious Crews" icon="dollar-sign">
**When budget is a primary constraint**: Focus on models like **DeepSeek R1**, **Llama 4 Scout**, or **Gemini 2.0 Flash**. These provide strong performance at significantly lower costs.
**Strategy**: Use cost-effective models for most agents, reserving premium models only for the most critical decision-making roles.
</Accordion>
<Accordion title="Specialized Workflows" icon="screwdriver-wrench">
**For specific domain expertise**: Choose models optimized for your primary use case. **Claude 4** series for coding, **Gemini 2.5 Pro** for research, **Llama 405B** for function calling.
**Strategy**: Select models based on your crew's primary function, ensuring the core capability aligns with model strengths.
</Accordion>
<Accordion title="Enterprise & Privacy" icon="shield">
**For data-sensitive operations**: Consider open-source models like **Llama 4** series, **DeepSeek V3**, or **Qwen3** that can be deployed locally while maintaining competitive performance.
**Strategy**: Deploy open-source models on private infrastructure, accepting potential performance trade-offs for data control.
</Accordion>
</AccordionGroup>
@@ -706,7 +792,10 @@ These tables/metrics showcase selected leading models in each category and are n
- **Open Source Viability**: The gap between open-source and proprietary models continues to narrow, with models like Llama 4 Maverick and DeepSeek V3 offering competitive performance at attractive price points. Fast inference providers particularly shine with open-source models, often delivering better speed-to-cost ratios than proprietary alternatives.
<Info>
**Testing is Essential**: Leaderboard rankings provide general guidance, but your specific use case, prompting style, and evaluation criteria may produce different results. Always test candidate models with your actual tasks and data before making final decisions.
**Testing is Essential**: Leaderboard rankings provide general guidance, but
your specific use case, prompting style, and evaluation criteria may produce
different results. Always test candidate models with your actual tasks and
data before making final decisions.
</Info>
### Practical Implementation Strategy
@@ -716,13 +805,20 @@ These tables/metrics showcase selected leading models in each category and are n
Begin with well-established models like **GPT-4.1**, **Claude 3.7 Sonnet**, or **Gemini 2.0 Flash** that offer good performance across multiple dimensions and have extensive real-world validation.
</Step>
<Step title="Identify Specialized Needs">
Determine if your crew has specific requirements (coding, reasoning, speed) that would benefit from specialized models like **Claude 4 Sonnet** for development or **o3** for complex analysis. For speed-critical applications, consider fast inference providers like **Groq** alongside model selection.
</Step>
{" "}
<Step title="Identify Specialized Needs">
Determine if your crew has specific requirements (coding, reasoning, speed)
that would benefit from specialized models like **Claude 4 Sonnet** for
development or **o3** for complex analysis. For speed-critical applications,
consider fast inference providers like **Groq** alongside model selection.
</Step>
<Step title="Implement Multi-Model Strategy">
Use different models for different agents based on their roles. High-capability models for managers and complex tasks, efficient models for routine operations.
</Step>
{" "}
<Step title="Implement Multi-Model Strategy">
Use different models for different agents based on their roles.
High-capability models for managers and complex tasks, efficient models for
routine operations.
</Step>
<Step title="Monitor and Optimize">
Track performance metrics relevant to your use case and be prepared to adjust model selections as new models are released or pricing changes.