crewAI/docs/en/concepts/llms.mdx

---
title: 'LLMs'
description: 'A comprehensive guide to configuring and using Large Language Models (LLMs) in your CrewAI projects'
icon: 'microchip-ai'
mode: "wide"
---

## Overview

CrewAI integrates with multiple LLM providers through providers native sdks, giving you the flexibility to choose the right model for your specific use case. This guide will help you understand how to configure and use different LLM providers in your CrewAI projects.

## When to Use Advanced LLM Configuration

- You need strict control of latency, cost, and output format.
- You need model routing by task type.
- You need reproducible, policy-sensitive behavior in production.

## When Not to Over-Configure

- You are in early prototyping with one simple task path.
- You do not yet need structured outputs or model routing.


## What are LLMs?

Large Language Models (LLMs) are the core intelligence behind CrewAI agents. They enable agents to understand context, make decisions, and generate human-like responses. Here's what you need to know:

<CardGroup cols={2}>
  <Card title="LLM Basics" icon="brain">
    Large Language Models are AI systems trained on vast amounts of text data. They power the intelligence of your CrewAI agents, enabling them to understand and generate human-like text.
  </Card>
  <Card title="Context Window" icon="window">
    The context window determines how much text an LLM can process at once. Larger windows (e.g., 128K tokens) allow for more context but may be more expensive and slower.
  </Card>
  <Card title="Temperature" icon="temperature-three-quarters">
    Temperature (0.0 to 1.0) controls response randomness. Lower values (e.g., 0.2) produce more focused, deterministic outputs, while higher values (e.g., 0.8) increase creativity and variability.
  </Card>
  <Card title="Provider Selection" icon="server">
    Each LLM provider (e.g., OpenAI, Anthropic, Google) offers different models with varying capabilities, pricing, and features. Choose based on your needs for accuracy, speed, and cost.
  </Card>
</CardGroup>

## Setting up your LLM

There are different places in CrewAI code where you can specify the model to use. Once you specify the model you are using, you will need to provide the configuration (like an API key) for each of the model providers you use. See the [provider configuration examples](#provider-configuration-examples) section for your provider.

<Tabs>
  <Tab title="1. Environment Variables">
    The simplest way to get started. Set the model in your environment directly, through an `.env` file or in your app code. If you used `crewai create` to bootstrap your project, it will be set already.

    ```bash .env
    MODEL=model-id  # e.g. gpt-4o, gemini-2.0-flash, claude-3-sonnet-...

    # Be sure to set your API keys here too. See the Provider
    # section below.
    ```

    <Warning>
      Never commit API keys to version control. Use environment files (.env) or your system's secret management.
    </Warning>
  </Tab>
  <Tab title="2. YAML Configuration">
    Create a YAML file to define your agent configurations. This method is great for version control and team collaboration:

    ```yaml agents.yaml {6}
    researcher:
        role: Research Specialist
        goal: Conduct comprehensive research and analysis
        backstory: A dedicated research professional with years of experience
        verbose: true
        llm: provider/model-id  # e.g. openai/gpt-4o, google/gemini-2.0-flash, anthropic/claude...
        # (see provider configuration examples below for more)
    ```

    <Info>
      The YAML configuration allows you to:
      - Version control your agent settings
      - Easily switch between different models
      - Share configurations across team members
      - Document model choices and their purposes
    </Info>
  </Tab>
  <Tab title="3. Direct Code">
    For maximum flexibility, configure LLMs directly in your Python code:

    ```python {4,8}
    from crewai import LLM

    # Basic configuration
    llm = LLM(model="model-id-here")  # gpt-4o, gemini-2.0-flash, anthropic/claude...

    # Advanced configuration with detailed parameters
    llm = LLM(
        model="model-id-here",  # gpt-4o, gemini-2.0-flash, anthropic/claude...
        temperature=0.7,        # Higher for more creative outputs
        timeout=120,            # Seconds to wait for response
        max_tokens=4000,        # Maximum length of response
        top_p=0.9,              # Nucleus sampling parameter
        frequency_penalty=0.1 , # Reduce repetition
        presence_penalty=0.1,   # Encourage topic diversity
        response_format={"type": "json"},  # For structured outputs
        seed=42                 # For reproducible results
    )
    ```

    <Info>
      Parameter explanations:
      - `temperature`: Controls randomness (0.0-1.0)
      - `timeout`: Maximum wait time for response
      - `max_tokens`: Limits response length
      - `top_p`: Alternative to temperature for sampling
      - `frequency_penalty`: Reduces word repetition
      - `presence_penalty`: Encourages new topics
      - `response_format`: Specifies output structure
      - `seed`: Ensures consistent outputs
    </Info>
  </Tab>
</Tabs>

## Production LLM Patterns

The basics above show how to configure one model. In real systems, you usually combine several LLM patterns for cost, quality, and reliability.

### Pattern 1: Route models by agent role

Use faster/cheaper models for extraction and heavier models for synthesis or critical decisions.

```python Code
from crewai import Agent, Crew, Process, Task

researcher = Agent(
    role="Researcher",
    goal="Collect factual inputs quickly",
    backstory="Fast information-gathering specialist",
    llm="openai/gpt-4o-mini",
)

reviewer = Agent(
    role="Reviewer",
    goal="Validate claims and produce final answer",
    backstory="Careful editor focused on correctness",
    llm="provider/model-id",
)

crew = Crew(
    agents=[researcher, reviewer],
    tasks=[
        Task(
            description="Find the latest policy changes and list the key points",
            expected_output="Bullet list of validated policy changes",
            agent=researcher,
        ),
        Task(
            description="Review findings and produce a final executive summary",
            expected_output="Concise, decision-ready summary",
            agent=reviewer,
        ),
    ],
    process=Process.sequential,
)
```

### Pattern 2: Set reliability defaults once

Configure retry, timeout, and deterministic sampling in one reusable `LLM` object.

```python Code
from crewai import LLM

reliable_llm = LLM(
    model="openai/gpt-4o-mini",
    temperature=0.1,
    timeout=45,
    max_retries=3,
    max_tokens=1200,
    seed=7,
)
```

Use this for extraction, classification, and policy-sensitive tasks where variance should be low.

### Pattern 3: Use structured outputs for machine-readable responses

For downstream automation, force JSON-shaped outputs rather than free-form prose.

```python Code
from crewai import LLM

json_llm = LLM(
    model="openai/gpt-4o",
    response_format={"type": "json"},
    temperature=0.0,
)
```

This reduces parser fragility in pipelines that feed APIs, databases, or workflow routers.

### Pattern 4: Use OpenAI Responses API for multi-turn reasoning flows

When you need built-in tools, response chaining, or reasoning-model workflows, enable the Responses API explicitly.

```python Code
from crewai import LLM

reasoning_llm = LLM(
    model="openai/o4-mini",
    api="responses",
    auto_chain=True,
    store=True,
    reasoning_effort="medium",
)
```

This is especially useful in long-running assistants where you want conversation continuity and controllable reasoning depth.

## Provider Configuration

For concept-level usage, keep provider setup minimal and explicit:

1. Set provider credentials via environment variables.
2. Pin model IDs explicitly in code or YAML.
3. Set reliability defaults (`timeout`, `max_retries`, low `temperature`) for production.

Use these pages for deeper provider setup and runtime decisions:
- Connections and provider setup: [/en/learn/llm-connections](/en/learn/llm-connections)
- Custom provider integration: [/en/learn/custom-llm](/en/learn/custom-llm)
- Production routing and reliability patterns: [/en/ai/llms/patterns](/en/ai/llms/patterns)
- Parameter contract reference: [/en/ai/llms/reference](/en/ai/llms/reference)

## Streaming Responses

CrewAI supports streaming responses from LLMs, allowing your application to receive and process outputs in real-time as they're generated.

<Tabs>
  <Tab title="Basic Setup">
    Enable streaming by setting the `stream` parameter to `True` when initializing your LLM:

    ```python
    from crewai import LLM

    # Create an LLM with streaming enabled
    llm = LLM(
        model="openai/gpt-4o",
        stream=True  # Enable streaming
    )
    ```

    When streaming is enabled, responses are delivered in chunks as they're generated, creating a more responsive user experience.
  </Tab>

  <Tab title="Event Handling">
    CrewAI emits events for each chunk received during streaming:

    ```python
    from crewai.events import (
      LLMStreamChunkEvent
    )
    from crewai.events import BaseEventListener

    class MyCustomListener(BaseEventListener):
        def setup_listeners(self, crewai_event_bus):
            @crewai_event_bus.on(LLMStreamChunkEvent)
            def on_llm_stream_chunk(self, event: LLMStreamChunkEvent):
              # Process each chunk as it arrives
              print(f"Received chunk: {event.chunk}")

    my_listener = MyCustomListener()
    ```

    <Tip>
      [Click here](/en/concepts/event-listener#event-listeners) for more details
    </Tip>
  </Tab>

  <Tab title="Agent & Task Tracking">
    All LLM events in CrewAI include agent and task information, allowing you to track and filter LLM interactions by specific agents or tasks:

    ```python
    from crewai import LLM, Agent, Task, Crew
    from crewai.events import LLMStreamChunkEvent
    from crewai.events import BaseEventListener

    class MyCustomListener(BaseEventListener):
        def setup_listeners(self, crewai_event_bus):
            @crewai_event_bus.on(LLMStreamChunkEvent)
            def on_llm_stream_chunk(source, event):
                if researcher.id == event.agent_id:
                    print("\n==============\n Got event:", event, "\n==============\n")


    my_listener = MyCustomListener()

    llm = LLM(model="gpt-4o-mini", temperature=0, stream=True)

    researcher = Agent(
        role="About User",
        goal="You know everything about the user.",
        backstory="""You are a master at understanding people and their preferences.""",
        llm=llm,
    )

    search = Task(
        description="Answer the following questions about the user: {question}",
        expected_output="An answer to the question.",
        agent=researcher,
    )

    crew = Crew(agents=[researcher], tasks=[search])

    result = crew.kickoff(
        inputs={"question": "..."}
    )
    ```

    <Info>
      This feature is particularly useful for:
      - Debugging specific agent behaviors
      - Logging LLM usage by task type
      - Auditing which agents are making what types of LLM calls
      - Performance monitoring of specific tasks
    </Info>
  </Tab>
</Tabs>

## Async LLM Calls

CrewAI supports asynchronous LLM calls for improved performance and concurrency in your AI workflows. Async calls allow you to run multiple LLM requests concurrently without blocking, making them ideal for high-throughput applications and parallel agent operations.

<Tabs>
  <Tab title="Basic Usage">
    Use the `acall` method for asynchronous LLM requests:

    ```python
    import asyncio
    from crewai import LLM

    async def main():
        llm = LLM(model="openai/gpt-4o")

        # Single async call
        response = await llm.acall("What is the capital of France?")
        print(response)

    asyncio.run(main())
    ```

    The `acall` method supports all the same parameters as the synchronous `call` method, including messages, tools, and callbacks.
  </Tab>

  <Tab title="With Streaming">
    Combine async calls with streaming for real-time concurrent responses:

    ```python
    import asyncio
    from crewai import LLM

    async def stream_async():
        llm = LLM(model="openai/gpt-4o", stream=True)

        response = await llm.acall("Write a short story about AI")

        print(response)

    asyncio.run(stream_async())
    ```
  </Tab>
</Tabs>

## Structured LLM Calls

CrewAI supports structured responses from LLM calls by allowing you to define a `response_format` using a Pydantic model. This enables the framework to automatically parse and validate the output, making it easier to integrate the response into your application without manual post-processing.

For example, you can define a Pydantic model to represent the expected response structure and pass it as the `response_format` when instantiating the LLM. The model will then be used to convert the LLM output into a structured Python object.

```python Code
from crewai import LLM

class Dog(BaseModel):
    name: str
    age: int
    breed: str


llm = LLM(model="gpt-4o", response_format=Dog)

response = llm.call(
    "Analyze the following messages and return the name, age, and breed. "
    "Meet Kona! She is 3 years old and is a black german shepherd."
)
print(response)

# Output:
# Dog(name='Kona', age=3, breed='black german shepherd')
```

## Advanced Features and Optimization

Learn how to get the most out of your LLM configuration:

<AccordionGroup>
  <Accordion title="Context Window Management">
    CrewAI includes smart context management features:

    ```python
    from crewai import LLM

    # CrewAI automatically handles:
    # 1. Token counting and tracking
    # 2. Content summarization when needed
    # 3. Task splitting for large contexts

    llm = LLM(
        model="gpt-4",
        max_tokens=4000,  # Limit response length
    )
    ```

    <Info>
      Best practices for context management:
      1. Choose models with appropriate context windows
      2. Pre-process long inputs when possible
      3. Use chunking for large documents
      4. Monitor token usage to optimize costs
    </Info>
  </Accordion>

  <Accordion title="Performance Optimization">
    <Steps>
      <Step title="Token Usage Optimization">
        Choose the right context window for your task:
        - Small tasks (up to 4K tokens): Standard models
        - Medium tasks (between 4K-32K): Enhanced models
        - Large tasks (over 32K): Large context models

        ```python
        # Configure model with appropriate settings
        llm = LLM(
            model="openai/gpt-4-turbo-preview",
            temperature=0.7,    # Adjust based on task
            max_tokens=4096,    # Set based on output needs
            timeout=300        # Longer timeout for complex tasks
        )
        ```
        <Tip>
          - Lower temperature (0.1 to 0.3) for factual responses
          - Higher temperature (0.7 to 0.9) for creative tasks
        </Tip>
      </Step>

      <Step title="Best Practices">
        1. Monitor token usage
        2. Implement rate limiting
        3. Use caching when possible
        4. Set appropriate max_tokens limits
      </Step>
    </Steps>

    <Info>
      Remember to regularly monitor your token usage and adjust your configuration as needed to optimize costs and performance.
    </Info>
  </Accordion>

  <Accordion title="Drop Additional Parameters">
    CrewAI internally uses native sdks for LLM calls, which allows you to drop additional parameters that are not needed for your specific use case. This can help simplify your code and reduce the complexity of your LLM configuration.
    For example, if you don't need to send the <code>stop</code> parameter, you can simply omit it from your LLM call:

    ```python
    from crewai import LLM
    import os

    os.environ["OPENAI_API_KEY"] = "<api-key>"

    o3_llm = LLM(
        model="o3",
        drop_params=True,
        additional_drop_params=["stop"]
    )
    ```
  </Accordion>

  <Accordion title="Transport Interceptors">
    CrewAI provides message interceptors for several providers, allowing you to hook into request/response cycles at the transport layer.

    **Supported Providers:**
    - ✅ OpenAI
    - ✅ Anthropic

    **Basic Usage:**
    ```python
import httpx
from crewai import LLM
from crewai.llms.hooks import BaseInterceptor

class CustomInterceptor(BaseInterceptor[httpx.Request, httpx.Response]):
    """Custom interceptor to modify requests and responses."""

    def on_outbound(self, request: httpx.Request) -> httpx.Request:
        """Print request before sending to the LLM provider."""
        print(request)
        return request

    def on_inbound(self, response: httpx.Response) -> httpx.Response:
        """Process response after receiving from the LLM provider."""
        print(f"Status: {response.status_code}")
        print(f"Response time: {response.elapsed}")
        return response

# Use the interceptor with an LLM
llm = LLM(
    model="openai/gpt-4o",
    interceptor=CustomInterceptor()
)
    ```

    **Important Notes:**
    - Both methods must return the received object or type of object.
    - Modifying received objects may result in unexpected behavior or application crashes.
    - Not all providers support interceptors - check the supported providers list above

    <Info>
      Interceptors operate at the transport layer. This is particularly useful for:
      - Message transformation and filtering
      - Debugging API interactions
    </Info>
  </Accordion>
</AccordionGroup>

## Common Issues and Solutions

<Tabs>
  <Tab title="Authentication">
    <Warning>
      Most authentication issues can be resolved by checking API key format and environment variable names.
    </Warning>

    ```bash
    # OpenAI
    OPENAI_API_KEY=sk-...

    # Anthropic
    ANTHROPIC_API_KEY=sk-ant-...
    ```
  </Tab>
  <Tab title="Model Names">
    <Check>
      Always include the provider prefix in model names
    </Check>

    ```python
    # Correct
    llm = LLM(model="openai/gpt-4")

    # Incorrect
    llm = LLM(model="gpt-4")
    ```
  </Tab>
  <Tab title="Context Length">
    <Tip>
      Use larger context models for extensive tasks
    </Tip>

    ```python
    # Large context model
    llm = LLM(model="openai/gpt-4o")  # 128K tokens
    ```
  </Tab>
</Tabs>