mirror of
https://github.com/crewAIInc/crewAI.git
synced 2026-07-04 06:29:22 +00:00
390 lines
13 KiB
Plaintext
---
title: Checkpointing
description: Automatically save execution state so crews, flows, and agents can resume after failures.
icon: floppy-disk
mode: "wide"
---

Checkpointing saves a snapshot of execution state during a run so a crew, flow, or agent can resume after a failure or be forked into an alternate branch.

<CardGroup cols={2}>
<Card title="Explanation" icon="lightbulb" href="#explanation">
How checkpointing works: events, storage, and inheritance.
</Card>
<Card title="Tutorial" icon="graduation-cap" href="#tutorial-resume-a-failing-crew">
A 5-minute walkthrough: run, interrupt, resume.
</Card>
<Card title="How-to guides" icon="screwdriver-wrench" href="#how-to-guides">
Task-focused recipes for common workflows.
</Card>
<Card title="Reference" icon="book" href="#reference">
`CheckpointConfig`, events, providers, and CLI.
</Card>
</CardGroup>

## Explanation

### What a checkpoint is

A checkpoint is a serialized snapshot of `RuntimeState` written at a point in execution. It records which tasks have completed, their outputs, the current inputs, and a lineage ID that identifies the run.

When you restore from a checkpoint, CrewAI rebuilds that state, skips already-completed work, and continues. When you fork from one, CrewAI restores the state under a new lineage so the new branch and the original run do not overwrite each other.
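
The restore and fork semantics can be pictured with a small, library-independent sketch. The `Snapshot` dataclass and the `run` and `fork` helpers are hypothetical names invented for illustration; they are not CrewAI's `RuntimeState` or its API, and only model the idea of skipping completed work and assigning a new lineage ID.

```python
import uuid
from dataclasses import dataclass, field, replace


@dataclass
class Snapshot:
    """Hypothetical stand-in for a checkpoint; not CrewAI's RuntimeState."""
    lineage_id: str
    inputs: dict
    outputs: dict = field(default_factory=dict)  # task name -> output


def run(tasks, snapshot):
    """Run tasks in order, skipping any whose output is already recorded."""
    executed = []
    for name, fn in tasks:
        if name in snapshot.outputs:
            continue  # completed before the checkpoint, so skip it
        snapshot.outputs[name] = fn(snapshot)
        executed.append(name)
    return executed


def fork(snapshot):
    """Copy the snapshot under a fresh lineage so branches stay separate."""
    return replace(
        snapshot,
        lineage_id=uuid.uuid4().hex,
        inputs=dict(snapshot.inputs),
        outputs=dict(snapshot.outputs),
    )


tasks = [
    ("research", lambda s: "bullets"),
    ("write", lambda s: f"summary of {s.outputs['research']}"),
]

# Pretend the first task finished before the run was interrupted.
saved = Snapshot(lineage_id="run-1", inputs={}, outputs={"research": "bullets"})
branch = fork(saved)        # independent copy with its own lineage ID
resumed = run(tasks, saved)  # only the unfinished task executes
```

In this sketch the fork holds copies of the inputs and outputs, so work done on the resumed run does not leak into the branch.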

### When checkpoints are written

Checkpointing is event-driven. The runtime subscribes to the events you select via `on_events` and writes a checkpoint each time one fires. The default, `task_completed`, writes one checkpoint per finished task, which is a reasonable tradeoff between granularity and disk use. Higher-frequency events such as `llm_call_completed` allow finer-grained recovery but write far more checkpoints.
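
As a rough mental model of event-driven writes, the following sketch wires a checkpoint writer to a selected list of event names. The `EventBus` class and `enable_checkpointing` function are assumptions made up for this example and do not reflect CrewAI's event bus internals.

```python
from collections import defaultdict


class EventBus:
    """Tiny illustrative event bus; not CrewAI's event bus."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_name, handler):
        self._handlers[event_name].append(handler)

    def emit(self, event_name, payload):
        for handler in self._handlers[event_name]:
            handler(event_name, payload)


def enable_checkpointing(bus, on_events, written):
    """Subscribe one writer to each selected event name."""

    def write_checkpoint(event_name, payload):
        # A real writer would persist state; here we only record the trigger.
        written.append((event_name, payload))

    for name in on_events:
        bus.subscribe(name, write_checkpoint)


bus = EventBus()
written = []
enable_checkpointing(bus, on_events=["task_completed"], written=written)

bus.emit("llm_call_completed", {"call": 1})  # not selected, nothing written
bus.emit("task_completed", {"task": "research"})
bus.emit("llm_call_completed", {"call": 2})
bus.emit("task_completed", {"task": "write"})
```

Only the two `task_completed` events reach the writer, which is why adding a high-frequency event to the list multiplies the number of checkpoints.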

### Storage

Two providers ship with CrewAI:

- `JsonProvider` writes one file per checkpoint. Human-readable and easy to inspect.
- `SqliteProvider` writes to a single SQLite database. Better for high-frequency checkpointing.

When `max_checkpoints` is set, both providers prune the oldest checkpoints after each write.

<Note>
Checkpoint writes are best-effort. A failed checkpoint is logged but does not interrupt the run.
</Note>
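
The pruning and best-effort behavior described above might look roughly like this for a file-per-checkpoint layout. This is a sketch under stated assumptions, not `JsonProvider`'s code: the `write_checkpoint` helper, the counter-based file names, and the logger name are all hypothetical.

```python
import json
import logging
import tempfile
from pathlib import Path

logger = logging.getLogger("checkpoint_sketch")


def write_checkpoint(directory, name, state, max_checkpoints=None):
    """Write one JSON file, then prune the oldest files beyond the limit."""
    try:
        directory.mkdir(parents=True, exist_ok=True)
        (directory / f"{name}.json").write_text(json.dumps(state))
        if max_checkpoints is not None:
            # Names start with a sortable counter here, so name order is age order.
            files = sorted(directory.glob("*.json"))
            excess = max(len(files) - max_checkpoints, 0)
            for old in files[:excess]:
                old.unlink()
    except OSError as exc:
        # Best-effort: log the failure and let the run continue.
        logger.warning("checkpoint write failed: %s", exc)


with tempfile.TemporaryDirectory() as tmp:
    location = Path(tmp) / "checkpoints"
    for i in range(5):
        write_checkpoint(location, f"{i:04d}_demo", {"step": i}, max_checkpoints=2)
    remaining = sorted(p.name for p in location.glob("*.json"))
```

After five writes with a limit of two, only the two newest files remain in the directory.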

### Inheritance model

`Crew`, `Flow`, and `Agent` all accept a `checkpoint` argument. Children inherit from their parent unless they set their own value or pass `False` to opt out. Enable checkpointing once on the crew and every agent participates; set `checkpoint=False` on an individual agent to exclude it.
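
One way to reason about the inheritance rules is as a small resolution function. The `resolve` helper and the plain-dict settings are hypothetical and only illustrate the documented rules (`None` inherits, `True` uses defaults, `False` opts out, a config overrides); CrewAI's actual resolution logic may differ.

```python
# Defaults taken from the CheckpointConfig reference on this page.
DEFAULTS = {"location": "./.checkpoints", "on_events": ["task_completed"]}


def resolve(own, parent):
    """Return the effective settings dict, or None when checkpointing is off.

    own:    None (inherit), True (defaults), False (opt out), or a dict.
    parent: the parent's already-resolved settings, or None.
    """
    if own is None:
        return parent          # inherit whatever the parent resolved to
    if own is False:
        return None            # explicit opt-out stops inheritance
    if own is True:
        return dict(DEFAULTS)  # enable with defaults
    return dict(own)           # custom configuration


crew_cfg = resolve(True, parent=None)
researcher_cfg = resolve(None, parent=crew_cfg)  # inherits from the crew
writer_cfg = resolve(False, parent=crew_cfg)     # opted out
custom_cfg = resolve({"location": "./agent_cp"}, parent=crew_cfg)
```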

## Tutorial: Resume a failing crew

This walkthrough takes about 5 minutes. You will run a two-task crew, interrupt it midway, and resume from the saved checkpoint.

<Steps>
<Step title="Create the crew with checkpointing enabled">
```python
from crewai import Agent, Crew, Task

researcher = Agent(role="Researcher", goal="Research", backstory="Expert")
writer = Agent(role="Writer", goal="Write", backstory="Expert")

crew = Crew(
    agents=[researcher, writer],
    tasks=[
        Task(description="Research AI trends", agent=researcher, expected_output="bullets"),
        Task(description="Write a summary", agent=writer, expected_output="paragraph"),
    ],
    checkpoint=True,
)
```
</Step>
<Step title="Run it and interrupt after the first task">
```python
result = crew.kickoff()
```

Press `Ctrl+C` after the first task finishes, then look in `./.checkpoints/`. The checkpoint is a file named `<timestamp>_<uuid>.json`.
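
If several checkpoint files have accumulated, a short helper can pick the newest one. This is a convenience sketch rather than part of CrewAI: `latest_checkpoint` is a hypothetical function, and it relies on file modification time instead of parsing the timestamp in the name.

```python
from pathlib import Path


def latest_checkpoint(directory="./.checkpoints"):
    """Return the most recently modified .json file, or None if there is none."""
    files = list(Path(directory).glob("*.json"))
    if not files:
        return None
    return max(files, key=lambda p: p.stat().st_mtime)


newest = latest_checkpoint()
```

The returned path can then be used wherever a checkpoint file path is expected.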
</Step>
<Step title="Resume from the checkpoint">
```python
from crewai import CheckpointConfig

result = crew.kickoff(
    from_checkpoint=CheckpointConfig(
        restore_from="./.checkpoints/<timestamp>_<uuid>.json",
    ),
)
```

The research task is skipped, the writer runs against the saved research output, and the crew finishes.
</Step>
</Steps>

## How-to guides

<AccordionGroup>
<Accordion title="Enable checkpointing with defaults" icon="play">
```python
crew = Crew(agents=[...], tasks=[...], checkpoint=True)
```

Writes to `./.checkpoints/` on every `task_completed`.
</Accordion>

<Accordion title="Customize storage and frequency" icon="sliders">
```python
from crewai import Crew, CheckpointConfig

crew = Crew(
    agents=[...],
    tasks=[...],
    checkpoint=CheckpointConfig(
        location="./my_checkpoints",
        on_events=["task_completed", "crew_kickoff_completed"],
        max_checkpoints=5,
    ),
)
```
</Accordion>

<Accordion title="Choose a storage provider" icon="database">
<CodeGroup>
```python JsonProvider
from crewai import Crew, CheckpointConfig
from crewai.state import JsonProvider

crew = Crew(
    agents=[...],
    tasks=[...],
    checkpoint=CheckpointConfig(
        location="./my_checkpoints",
        provider=JsonProvider(),
        max_checkpoints=5,
    ),
)
```
```python SqliteProvider
from crewai import Crew, CheckpointConfig
from crewai.state import SqliteProvider

crew = Crew(
    agents=[...],
    tasks=[...],
    checkpoint=CheckpointConfig(
        location="./.checkpoints.db",
        provider=SqliteProvider(),
        max_checkpoints=50,
    ),
)
```
</CodeGroup>

<Tip>
SQLite enables WAL journal mode for concurrent reads. Prefer it for high-frequency checkpointing.
</Tip>
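
For readers unfamiliar with WAL mode, this standalone snippet shows how it is switched on with Python's built-in `sqlite3` module. The table name and columns are invented for the example and say nothing about the schema `SqliteProvider` actually uses.

```python
import sqlite3
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    db_path = Path(tmp) / "checkpoints.db"
    conn = sqlite3.connect(db_path)
    # The pragma returns the journal mode that is now in effect.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    # Hypothetical schema, for illustration only.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints (id TEXT PRIMARY KEY, data TEXT)"
    )
    conn.execute("INSERT INTO checkpoints VALUES (?, ?)", ("cp-1", "{}"))
    conn.commit()
    count = conn.execute("SELECT COUNT(*) FROM checkpoints").fetchone()[0]
    conn.close()
```

WAL mode requires a file-backed database; an in-memory database reports a different journal mode.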
</Accordion>

<Accordion title="Opt one agent out" icon="user-slash">
```python
crew = Crew(
    agents=[
        Agent(role="Researcher", ...),
        Agent(role="Writer", ..., checkpoint=False),
    ],
    tasks=[...],
    checkpoint=True,
)
```
</Accordion>

<Accordion title="Resume via the classmethod" icon="rotate-left">
```python
from crewai import Crew, CheckpointConfig

config = CheckpointConfig(restore_from="./my_checkpoints/<file>.json")
crew = Crew.from_checkpoint(config)
result = crew.kickoff()
```
</Accordion>

<Accordion title="Fork into a new branch" icon="code-branch">
`fork()` restores a checkpoint under a fresh lineage so the new run does not collide with the original.

```python
from crewai import Crew, CheckpointConfig

config = CheckpointConfig(restore_from="./my_checkpoints/<file>.json")
crew = Crew.fork(config, branch="experiment-a")
result = crew.kickoff(inputs={"strategy": "aggressive"})
```

The `branch` label is optional; one is generated if omitted.
</Accordion>

<Accordion title="Checkpoint a Crew, Flow, or Agent" icon="cubes">
<Tabs>
<Tab title="Crew">
```python
crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, write_task, review_task],
    checkpoint=CheckpointConfig(location="./crew_cp"),
)
```

Default trigger: `task_completed`.
</Tab>
<Tab title="Flow">
```python
from crewai.flow.flow import Flow, start, listen
from crewai import CheckpointConfig

class MyFlow(Flow):
    @start()
    def step_one(self):
        return "data"

    @listen(step_one)
    def step_two(self, data):
        # process() is a placeholder for your own logic
        return process(data)

# Checkpoint after every flow method finishes
flow = MyFlow(
    checkpoint=CheckpointConfig(
        location="./flow_cp",
        on_events=["method_execution_finished"],
    ),
)
result = flow.kickoff()

# Later: rebuild the flow from a saved checkpoint and continue
config = CheckpointConfig(restore_from="./flow_cp/<file>.json")
flow = MyFlow.from_checkpoint(config)
result = flow.kickoff()
```
</Tab>
<Tab title="Agent">
```python
from crewai import Agent, CheckpointConfig

agent = Agent(
    role="Researcher",
    goal="Research topics",
    backstory="Expert researcher",
    checkpoint=CheckpointConfig(
        location="./agent_cp",
        on_events=["lite_agent_execution_completed"],
    ),
)
result = agent.kickoff(messages=[{"role": "user", "content": "Research AI trends"}])
```
</Tab>
</Tabs>
</Accordion>

<Accordion title="Write a checkpoint manually" icon="code">
Register a handler on any event and call `state.checkpoint()`.

<CodeGroup>
```python Sync
from crewai.events.event_bus import crewai_event_bus
from crewai.events.types.llm_events import LLMCallCompletedEvent

@crewai_event_bus.on(LLMCallCompletedEvent)
def on_llm_done(source, event, state):
    path = state.checkpoint("./my_checkpoints")
    print(f"Saved checkpoint: {path}")
```
```python Async
from crewai.events.event_bus import crewai_event_bus
from crewai.events.types.llm_events import LLMCallCompletedEvent

@crewai_event_bus.on(LLMCallCompletedEvent)
async def on_llm_done_async(source, event, state):
    path = await state.acheckpoint("./my_checkpoints")
    print(f"Saved checkpoint: {path}")
```
</CodeGroup>

A `state` argument is supplied automatically when the handler takes three parameters. See [Event Listeners](/en/concepts/event-listener) for the full event catalog.
</Accordion>

<Accordion title="Browse, resume, and fork from the CLI" icon="terminal">
```bash
crewai checkpoint  # auto-detects .checkpoints/ or .checkpoints.db
crewai checkpoint --location ./my_checkpoints
crewai checkpoint --location ./.checkpoints.db
```

<Frame>
<img src="/images/checkpointing.png" alt="Checkpoint TUI" />
</Frame>

The left panel groups checkpoints by branch; forks nest under their parent. Selecting a checkpoint shows its metadata, entity state, and task progress. **Resume** continues the run; **Fork** starts a new branch.

The detail panel exposes two editable areas:

- **Inputs**: the original kickoff inputs, pre-filled and editable.
- **Task outputs**: the outputs of completed tasks. Editing an output and selecting **Fork** invalidates downstream tasks so they re-run against the modified context.
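
The invalidation step can be sketched for the simple case of a strictly sequential task list, where every later task counts as downstream. `invalidate_downstream` is a hypothetical helper written for this illustration; how the TUI determines downstream tasks internally is not specified here.

```python
def invalidate_downstream(order, outputs, edited_task, new_output):
    """Return outputs with one task edited and all later tasks dropped."""
    position = order.index(edited_task)
    # Keep only outputs of tasks that ran before the edited one.
    kept = {name: outputs[name] for name in order[:position] if name in outputs}
    kept[edited_task] = new_output
    return kept


order = ["research", "write", "review"]
outputs = {"research": "old bullets", "write": "old summary"}

forked = invalidate_downstream(order, outputs, "research", "new bullets")
pending = [name for name in order if name not in forked]  # tasks that re-run
```

Here the edited research output is kept, while the stale summary is discarded so the write and review tasks run again.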

<Tip>
Useful for "what if" exploration: fork, tweak, observe.
</Tip>
</Accordion>

<Accordion title="Inspect checkpoints without the TUI" icon="magnifying-glass">
```bash
crewai checkpoint list ./my_checkpoints
crewai checkpoint info ./my_checkpoints/<file>.json
crewai checkpoint info ./.checkpoints.db
```
</Accordion>
</AccordionGroup>

## Reference

### `CheckpointConfig`

<ParamField path="location" type="str" default='"./.checkpoints"'>
Storage destination. A directory for `JsonProvider`, a database file path for `SqliteProvider`.
</ParamField>

<ParamField path="on_events" type="list[str]" default='["task_completed"]'>
Event types that trigger a checkpoint. See [event types](#event-types).
</ParamField>

<ParamField path="provider" type="BaseProvider" default="JsonProvider()">
Storage backend. Either `JsonProvider` or `SqliteProvider`.
</ParamField>

<ParamField path="max_checkpoints" type="int | None" default="None">
Maximum number of checkpoints to retain. The oldest are pruned after each write.
</ParamField>

<ParamField path="restore_from" type="Path | str | None" default="None">
Checkpoint to restore from. Used when the config is passed via `from_checkpoint`.
</ParamField>

### `checkpoint` field values

Accepted by `Crew`, `Flow`, and `Agent`.

<ParamField path="None" type="default">
Inherit from parent.
</ParamField>

<ParamField path="True" type="bool">
Enable with defaults.
</ParamField>

<ParamField path="False" type="bool">
Explicit opt-out. Stops inheritance.
</ParamField>

<ParamField path="CheckpointConfig(...)" type="CheckpointConfig">
Custom configuration.
</ParamField>

### Event types

Common values for `on_events`:

| Use case | Events |
|:---------|:-------|
| After each task | `["task_completed"]` |
| After each flow method | `["method_execution_finished"]` |
| After agent execution | `["agent_execution_completed"]`, `["lite_agent_execution_completed"]` |
| On crew completion only | `["crew_kickoff_completed"]` |
| After every LLM call | `["llm_call_completed"]` |
| Everything | `["*"]` |

<Warning>
`["*"]` and high-frequency events like `llm_call_completed` write many checkpoints and can degrade performance. Pair them with `max_checkpoints`.
</Warning>

### Storage providers

<ParamField path="JsonProvider" type="provider">
One file per checkpoint, named `<timestamp>_<uuid>.json` inside `location`.
</ParamField>

<ParamField path="SqliteProvider" type="provider">
Single database file at `location` with WAL journaling.
</ParamField>

### CLI

| Command | Purpose |
|:--------|:--------|
| `crewai checkpoint` | Launch the TUI; auto-detect storage. |
| `crewai checkpoint --location <path>` | Launch the TUI against a specific location. |
| `crewai checkpoint list <path>` | List checkpoints. |
| `crewai checkpoint info <path>` | Inspect a checkpoint file or the latest entry in a SQLite database. |