Commit Graph

2170 Commits

Author SHA1 Message Date
Greyson LaLonde
bba48ec9df refactor: pin crewai-a2a version and move a2a tests to standalone package
Pin crewai-a2a to 1.13.0a6 to match workspace versioning convention.
Move all a2a tests and cassettes from lib/crewai to lib/crewai-a2a,
add crewai-a2a to devtools bump tooling, and update pytest/ruff/mypy configs.
2026-04-02 04:10:13 +08:00
Greyson LaLonde
2ada20e9c6 refactor: extract a2a module into standalone workspace package
Move lib/crewai/src/crewai/a2a to lib/crewai-a2a/src/crewai_a2a as a
separate workspace member. crewai's `[a2a]` optional extra now pulls in
crewai-a2a instead of listing raw deps. All imports updated from
crewai.a2a.* to crewai_a2a.*. Handle PydanticUndefinedAnnotation in
crewai/__init__.py model_rebuild to break circular import at load time.
2026-04-02 02:32:14 +08:00
João Moura
258f31d44c docs: update changelog and version for v1.13.0a6 (#5214) 1.13.0a6 2026-04-01 14:26:07 -03:00
João Moura
68720fd4e5 feat: bump versions to 1.13.0a6 (#5213) 2026-04-01 14:23:44 -03:00
alex-clawd
3132910084 perf: reduce framework overhead — lazy event bus, skip tracing when disabled (#5187)
* perf: reduce framework overhead for NVIDIA benchmarks

- Lazy initialize event bus thread pool and event loop on first emit()
  instead of at import time (~200ms savings)
- Skip trace listener registration (50+ handlers) when tracing disabled
- Skip trace prompt in non-interactive contexts (isatty check) to avoid
  20s timeout in CI/Docker/API servers
- Skip flush() when no events were emitted (avoids 30s timeout waste)
- Add _has_pending_events flag to track if any events were emitted
- Add _executor_initialized flag for lazy init double-checked locking

All existing behavior preserved when tracing IS enabled. No public APIs
changed - only conditional guards added.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: address PR review comments — tracing override, executor init order, stdin guard, unused import

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* style: fix ruff formatting in trace_listener.py and utils.py

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Iris Clawd <iris@crewai.com>
Co-authored-by: Greyson LaLonde <greyson.r.lalonde@gmail.com>
2026-04-01 14:17:57 -03:00
Lucas Gomide
c8f3a96779 docs: fix RBAC permission levels to match actual UI options (#5210)
Some checks failed
CodeQL Advanced / Analyze (actions) (push) Has been cancelled
CodeQL Advanced / Analyze (python) (push) Has been cancelled
Check Documentation Broken Links / Check broken links (push) Has been cancelled
2026-04-01 10:35:06 -04:00
João Moura
18ada25f01 docs: update changelog and version for v1.13.0a5 (#5200)
Some checks failed
CodeQL Advanced / Analyze (actions) (push) Has been cancelled
CodeQL Advanced / Analyze (python) (push) Has been cancelled
Check Documentation Broken Links / Check broken links (push) Has been cancelled
Mark stale issues and pull requests / stale (push) Has been cancelled
1.13.0a5
2026-04-01 04:00:09 -03:00
João Moura
146da8d73a feat: bump versions to 1.13.0a5 (#5199) 2026-04-01 03:59:07 -03:00
Greyson LaLonde
98c6109214 docs: update changelog and version for v1.13.0a4
Some checks failed
CodeQL Advanced / Analyze (actions) (push) Has been cancelled
CodeQL Advanced / Analyze (python) (push) Has been cancelled
Check Documentation Broken Links / Check broken links (push) Has been cancelled
Build uv cache / build-cache (3.10) (push) Has been cancelled
Build uv cache / build-cache (3.11) (push) Has been cancelled
Build uv cache / build-cache (3.12) (push) Has been cancelled
Build uv cache / build-cache (3.13) (push) Has been cancelled
Nightly Canary Release / Check for new commits (push) Has been cancelled
Nightly Canary Release / Build nightly packages (push) Has been cancelled
Nightly Canary Release / Publish nightly to PyPI (push) Has been cancelled
1.13.0a4
2026-04-01 05:08:12 +08:00
Greyson LaLonde
54a9174c12 feat: bump versions to 1.13.0a4 2026-04-01 05:01:29 +08:00
Greyson LaLonde
c26ae969b3 docs: update changelog and version for v1.13.0a3 1.13.0a3 2026-04-01 04:16:25 +08:00
Greyson LaLonde
205555b786 feat: bump versions to 1.13.0a3 2026-04-01 04:02:29 +08:00
Greyson LaLonde
d6714a0e60 refactor: convert Flow to Pydantic BaseModel 2026-04-01 03:48:41 +08:00
dependabot[bot]
107bc7f7be chore(deps): bump the security-updates group across 1 directory with 2 updates (#5088)
Bumps the security-updates group with 2 updates in the / directory: [nltk](https://github.com/nltk/nltk) and [pypdf](https://github.com/py-pdf/pypdf).


Updates `nltk` from 3.9.3 to 3.9.4
- [Changelog](https://github.com/nltk/nltk/blob/develop/ChangeLog)
- [Commits](https://github.com/nltk/nltk/compare/3.9.3...3.9.4)

Updates `pypdf` from 6.9.1 to 6.9.2
- [Release notes](https://github.com/py-pdf/pypdf/releases)
- [Changelog](https://github.com/py-pdf/pypdf/blob/main/CHANGELOG.md)
- [Commits](https://github.com/py-pdf/pypdf/compare/6.9.1...6.9.2)

---
updated-dependencies:
- dependency-name: nltk
  dependency-version: 3.9.4
  dependency-type: indirect
  dependency-group: security-updates
- dependency-name: pypdf
  dependency-version: 6.9.2
  dependency-type: indirect
  dependency-group: security-updates
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-31 14:03:42 -05:00
iris-clawd
b1f49b1356 docs: fix inaccuracies in agent-capabilities across all languages (#5191)
- Apps run locally (with CREWAI_PLATFORM_INTEGRATION_TOKEN env var), not remotely
- Apps auth is an integration token, not OAuth
- Updated comparison tables and card descriptions in en, pt-BR, ko, ar
2026-03-31 15:00:00 -03:00
iris-clawd
accae5ca43 docs: Add Agent Capabilities overview and improve Skills documentation (#5189)
* docs: add Agent Capabilities overview page and improve Skills docs

- New 'Agent Capabilities' page explaining all 5 extension types (Tools, MCPs, Apps, Skills, Knowledge) with comparison table and decision guide
- Rewrite Skills page with practical examples showing Skills + Tools patterns, common FAQ, and Skills vs Knowledge comparison
- Add cross-reference callout on Tools page linking to the capabilities overview
- Add agent-capabilities to Core Concepts navigation (after agents)

* docs: add pt-BR and ko translations for agent-capabilities and updated skills/tools

* docs: add Arabic (ar) translations for agent-capabilities and updated skills/tools
2026-03-31 14:47:38 -03:00
Lucas Gomide
68e943be68 feat: emit token usage data in LLMCallCompletedEvent 2026-04-01 00:18:36 +08:00
Greyson LaLonde
3283a00e31 fix(deps): cap lancedb below 0.30.1 for Windows compatibility
Some checks failed
Build uv cache / build-cache (3.10) (push) Has been cancelled
Build uv cache / build-cache (3.11) (push) Has been cancelled
Build uv cache / build-cache (3.12) (push) Has been cancelled
Build uv cache / build-cache (3.13) (push) Has been cancelled
CodeQL Advanced / Analyze (actions) (push) Has been cancelled
CodeQL Advanced / Analyze (python) (push) Has been cancelled
lancedb 0.30.1 dropped the win_amd64 wheel, breaking installation on
Windows. Pin to <0.30.1 so uv resolves to a version that still ships
Windows binaries.
2026-03-31 16:59:45 +08:00
Greyson LaLonde
dfc0f9a317 refactor: replace InstanceOf[T] with plain type annotations
Some checks failed
CodeQL Advanced / Analyze (actions) (push) Has been cancelled
CodeQL Advanced / Analyze (python) (push) Has been cancelled
Mark stale issues and pull requests / stale (push) Has been cancelled
* refactor: replace InstanceOf[T] with plain type annotations

InstanceOf[] is a Pydantic validation wrapper that adds runtime
isinstance checks. Plain type annotations are sufficient here since
the models already use arbitrary_types_allowed or the types are
BaseModel subclasses.

* refactor: convert BaseKnowledgeStorage to BaseModel

* fix: update tests for BaseKnowledgeStorage BaseModel conversion

* fix: correct embedder config structure in test
2026-03-31 08:11:21 +08:00
Greyson LaLonde
ef79456968 chore: remove unused third_party LLM directory
Some checks failed
CodeQL Advanced / Analyze (actions) (push) Has been cancelled
CodeQL Advanced / Analyze (python) (push) Has been cancelled
Build uv cache / build-cache (3.10) (push) Has been cancelled
Build uv cache / build-cache (3.11) (push) Has been cancelled
Build uv cache / build-cache (3.12) (push) Has been cancelled
Build uv cache / build-cache (3.13) (push) Has been cancelled
Nightly Canary Release / Check for new commits (push) Has been cancelled
Nightly Canary Release / Build nightly packages (push) Has been cancelled
Nightly Canary Release / Publish nightly to PyPI (push) Has been cancelled
2026-03-31 07:33:56 +08:00
Greyson LaLonde
6c7ea422e7 refactor: convert LLM classes to Pydantic BaseModel 2026-03-31 07:07:11 +08:00
Lorenze Jay
bb9bcd6823 refactor: remove unused and methods from (#5172)
This commit cleans up the  class by removing the  and  methods, which are no longer needed. The changes help streamline the code and improve maintainability.
2026-03-30 15:01:58 -07:00
Lucas Gomide
ac14b9127e fix: handle GPT-5.x models not supporting the stop API parameter (#5144)
Some checks failed
CodeQL Advanced / Analyze (actions) (push) Has been cancelled
CodeQL Advanced / Analyze (python) (push) Has been cancelled
GPT-5.x models reject the `stop` parameter at the API level with "Unsupported parameter: 'stop' is not supported with this model". This breaks CrewAI executions when routing through LiteLLM (e.g. via
OpenAI-compatible gateways like Asimov), because the LiteLLM fallback path always includes `stop` in the API request params.

The native OpenAI provider was unaffected because it never sends `stop` to the API — it applies stop words client-side via `_apply_stop_words()`. However, when the request goes through LiteLLM (custom endpoints, proxy gateways),
`stop` is sent as an API parameter and GPT-5.x rejects it.

Additionally, the existing retry logic that catches this error only matched the OpenAI API error format ("Unsupported parameter") but missed
LiteLLM's own pre-validation error format ("does not support parameters"), so the self-healing retry never triggered for LiteLLM-routed calls.
2026-03-30 11:36:51 -04:00
Thiago Moretto
98b7626784 feat: extract and publish tool metadata to AMP (#4298)
* Exporting tool's metadata to AMP - initial work

* Fix payload (nest under `tools` key)

* Remove debug message + code simplification

* Priting out detected tools

* Extract module name

* fix: address PR review feedback for tool metadata extraction

- Use sha256 instead of md5 for module name hashing (lint S324)
- Filter required list to match filtered properties in JSON schema

* fix: Use sha256 instead of md5 for module name hashing (lint S324)

- Add missing mocks to metadata extraction failure test

* style: fix ruff formatting

* fix: resolve mypy type errors in utils.py

* fix: address bot review feedback on tool metadata

- Use `is not None` instead of truthiness check so empty tools list
  is sent to the API rather than being silently dropped as None
- Strip __init__ suffix from module path for tools in __init__.py files
- Extend _unwrap_schema to handle function-before, function-wrap, and
  definitions wrapper types

* fix: capture env_vars declared with Field(default_factory=...)

When env_vars uses default_factory, pydantic stores a callable in the
schema instead of a static default value. Fall back to calling the
factory when no static default is present.

---------

Co-authored-by: Greyson LaLonde <greyson.r.lalonde@gmail.com>
2026-03-30 09:21:53 -04:00
iris-clawd
e21c506214 docs: Add comprehensive SSO configuration guide (#5152)
Some checks failed
CodeQL Advanced / Analyze (actions) (push) Has been cancelled
CodeQL Advanced / Analyze (python) (push) Has been cancelled
Check Documentation Broken Links / Check broken links (push) Has been cancelled
Mark stale issues and pull requests / stale (push) Has been cancelled
Nightly Canary Release / Check for new commits (push) Has been cancelled
Nightly Canary Release / Build nightly packages (push) Has been cancelled
Nightly Canary Release / Publish nightly to PyPI (push) Has been cancelled
* docs: add comprehensive SSO configuration guide

Add SSO documentation page covering all supported identity providers
for both SaaS (AMP) and Factory deployments.

Includes:
- Provider overview (WorkOS, Entra ID, Okta, Auth0, Keycloak)
- SaaS vs Factory SSO availability
- Step-by-step setup guides per provider with env vars
- CLI authentication via Device Authorization Grant
- RBAC integration overview
- Troubleshooting common SSO issues
- Complete environment variables reference

Placed in the Manage nav group alongside RBAC.

* fix: add key icon to SSO docs page

* fix: broken links in SSO docs (installation, configuration)
2026-03-28 13:15:34 +08:00
Greyson LaLonde
9fe0c15549 docs: update changelog and version for v1.13.0rc1
Some checks failed
CodeQL Advanced / Analyze (actions) (push) Has been cancelled
CodeQL Advanced / Analyze (python) (push) Has been cancelled
Check Documentation Broken Links / Check broken links (push) Has been cancelled
Mark stale issues and pull requests / stale (push) Has been cancelled
Nightly Canary Release / Build nightly packages (push) Has been cancelled
Nightly Canary Release / Publish nightly to PyPI (push) Has been cancelled
Nightly Canary Release / Check for new commits (push) Has been cancelled
1.13.0rc1
2026-03-27 11:30:45 +08:00
Greyson LaLonde
78d8ddb649 feat: bump versions to 1.13.0rc1 2026-03-27 11:26:04 +08:00
Greyson LaLonde
1b2062009a docs: update changelog and version for v1.13.0a2
Some checks failed
CodeQL Advanced / Analyze (actions) (push) Has been cancelled
CodeQL Advanced / Analyze (python) (push) Has been cancelled
Check Documentation Broken Links / Check broken links (push) Has been cancelled
Nightly Canary Release / Check for new commits (push) Has been cancelled
Nightly Canary Release / Build nightly packages (push) Has been cancelled
Nightly Canary Release / Publish nightly to PyPI (push) Has been cancelled
1.13.0a2
2026-03-27 04:05:32 +08:00
Greyson LaLonde
886aa4ba8f feat: bump versions to 1.13.0a2 2026-03-27 04:00:59 +08:00
Greyson LaLonde
5bec000b21 feat: auto-update deployment test repo during release
After PyPI publish, clones crewAIInc/crew_deployment_test, bumps the
crewai[tools] pin to the new version, regenerates uv.lock, and pushes
to main. Includes retry logic for CDN propagation delays.
2026-03-27 03:54:10 +08:00
Greyson LaLonde
2965384907 feat: improve enterprise release resilience and UX
- Add --skip-to-enterprise flag to resume just Phase 3 after a failure
- Add --prerelease=allow to uv sync for alpha/beta/rc versions
- Retry uv sync up to 10 times to handle PyPI CDN propagation delay
- Update pyproject.toml [project] version field (fixes apps/api version)
- Print PR URL after creating enterprise bump PR
2026-03-27 03:36:56 +08:00
Greyson LaLonde
032ef06ef6 docs: update changelog and version for v1.13.0a1 1.13.0a1 2026-03-27 03:07:26 +08:00
Greyson LaLonde
0ce9567cfc feat: bump versions to 1.13.0a1 2026-03-27 03:00:29 +08:00
Greyson LaLonde
d7252bfee7 fix: pin Node to LTS 22 in docs broken links workflow
Mintlify doesn't support Node 25+, and `node-version: latest` was
pulling 25.8.2 causing the workflow to fail.
2026-03-27 02:36:11 +08:00
Greyson LaLonde
10fc3796bb fix: bust uv cache for freshly published packages in enterprise release 2026-03-27 02:21:31 +08:00
iris-clawd
52249683a7 docs: comprehensive RBAC permissions matrix and deployment guide (#5112)
- Add full feature permissions matrix (11 features × permission levels)
- Document Owner vs Member default permissions
- Add deployment guide: what permissions are needed to deploy from GitHub or Zip
- Document entity-level permissions (deployment permission types: run, traces, manage_settings, HITL, full_access)
- Document entity RBAC for env vars, LLM connections, and Git repositories
- Add common role patterns: Developer, Viewer/Stakeholder, Ops/Platform Admin
- Add quick-reference table for minimum deployment permissions

Addresses user feedback that RBAC was too restrictive and unclear:
members didn't know which permissions to configure for a developer profile.
2026-03-26 12:30:17 -04:00
João Moura
6193e082e1 docs: update changelog and version for v1.12.2 (#5103)
Some checks failed
CodeQL Advanced / Analyze (actions) (push) Has been cancelled
CodeQL Advanced / Analyze (python) (push) Has been cancelled
Check Documentation Broken Links / Check broken links (push) Has been cancelled
1.12.2
2026-03-26 03:54:26 -03:00
João Moura
33f33c6fcc feat: bump versions to 1.12.2 (#5101) 2026-03-26 03:33:10 -03:00
alex-clawd
74976b157d fix: preserve method return value as flow output for @human_feedback with emit (#5099)
* fix: preserve method return value as flow output for @human_feedback with emit

When a @human_feedback decorated method with emit= is the final method in a
flow (no downstream listeners triggered), the flow's final output was
incorrectly set to the collapsed outcome string (e.g., 'approved') instead
of the method's actual return value (e.g., a state dict).

Root cause: _process_feedback() returns the collapsed_outcome string when
emit is set, and this string was being stored as the method's result in
_method_outputs.

The fix:
1. In human_feedback.py: After _process_feedback, stash the real method_output
   on the flow instance as _human_feedback_method_output when emit is set.

2. In flow.py: After appending a method result to _method_outputs, check if
   _human_feedback_method_output is set. If so, replace the last entry with
   the stashed real output and clear the stash.

This ensures:
- Routing still works correctly (collapsed outcome used for @listen matching)
- The flow's final result is the actual method return value
- If downstream listeners execute, their results become the final output

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* style: ruff format flow.py

* fix: use per-method dict stash for concurrency safety and None returns

Addresses review comments:
- Replace single flow-level slot with dict keyed by method name,
  safe under concurrent @human_feedback+emit execution
- Dict key presence (not value) indicates stashed output,
  correctly preserving None return values
- Added test for None return value preservation

---------

Co-authored-by: Joao Moura <joao@crewai.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-26 03:28:17 -03:00
Greyson LaLonde
bd03f6cf64 feat: add enterprise release phase to devtools release
Some checks failed
CodeQL Advanced / Analyze (actions) (push) Has been cancelled
CodeQL Advanced / Analyze (python) (push) Has been cancelled
Check Documentation Broken Links / Check broken links (push) Has been cancelled
Nightly Canary Release / Check for new commits (push) Has been cancelled
Nightly Canary Release / Build nightly packages (push) Has been cancelled
Nightly Canary Release / Publish nightly to PyPI (push) Has been cancelled
Mark stale issues and pull requests / stale (push) Has been cancelled
2026-03-26 12:22:37 +08:00
Rip&Tear
a91cd1a7d7 Revise security policy and reporting instructions (#5096)
* Revise security policy and reporting instructions

Updated the security reporting process and contact details.

* Update .github/security.md
---------
2026-03-26 10:50:21 +08:00
João Moura
66dee3195f docs: update changelog and version for v1.12.1 (#5095) 1.12.1 2026-03-25 22:52:11 -03:00
João Moura
034f576dc0 feat: bump versions to 1.12.1 (#5094)
* chore: bump version to 1.12.1 across all modules

* feat: bump versions to 1.12.1
2026-03-25 22:45:33 -03:00
Lucas Gomide
918654318b feat: add request_id to HumanFeedbackRequestedEvent (#5092)
* feat: add request_id to HumanFeedbackRequestedEvent

Allow platforms to attach a correlation identifier to human feedback requests so downstream consumers can deterministically match spans to their corresponding feedback records

* feat: add request_id to HumanFeedbackReceivedEvent for correlation

Without request_id on the received event, consumers cannot correlate
a feedback response back to its originating request. Both sides of the
request/response pair need the correlation identifier.

---------

Co-authored-by: Alex <alex@crewai.com>
2026-03-25 22:43:24 -03:00
João Moura
371e6cfd11 docs: update changelog and version for v1.12.0 (#5091) 1.12.0 2026-03-25 22:07:28 -03:00
João Moura
6fd70ce6e5 chore: bump version to 1.14.0 across all modules (#5090)
* chore: bump version to 1.14.0 across all modules

* chore: downgrade version to 1.12.0 across all modules
2026-03-25 22:03:37 -03:00
alex-clawd
c183b77991 fix: address Copilot review on OpenAI-compatible providers (#5042) (#5089)
Some checks failed
CodeQL Advanced / Analyze (actions) (push) Has been cancelled
CodeQL Advanced / Analyze (python) (push) Has been cancelled
Check Documentation Broken Links / Check broken links (push) Has been cancelled
Build uv cache / build-cache (3.10) (push) Has been cancelled
Build uv cache / build-cache (3.11) (push) Has been cancelled
Build uv cache / build-cache (3.12) (push) Has been cancelled
Build uv cache / build-cache (3.13) (push) Has been cancelled
- Delegate supports_function_calling() to parent (handles o1 models via OpenRouter)
- Guard empty env vars in base_url resolution
- Fix misleading comment about model validation rules
- Remove unused MagicMock import
- Use 'is not None' for env var restoration in tests

Co-authored-by: Joao Moura <joao@crewai.com>
2026-03-25 18:22:13 -03:00
Greyson LaLonde
b5a0d6e709 docs: update changelog and version for v1.12.0a3 1.12.0a3 2026-03-26 04:17:37 +08:00
Greyson LaLonde
454156cff9 feat: bump versions to 1.12.0a3 2026-03-26 04:12:49 +08:00
Tiago Freire
d86707da3d Fix: bad credentials for traces batch push (404) (#4947)
## Summary

### Core fixes

<details>
<summary><b>Fix silent 404 cascade on trace event send</b></summary>

When `_initialize_backend_batch` failed, `trace_batch_id` was left populated with a client-generated UUID never registered server-side. All subsequent event sends hit a non-existent batch endpoint and returned 404. Now all three failure paths (None response, non-2xx status, exception) clear `trace_batch_id`.
</details>

<details>
<summary><b>Fix first-time deferred batch init silently skipped</b></summary>

First-time users have `is_tracing_enabled_in_context() = False` by design. This caused `_initialize_backend_batch` to return early without creating the batch, and `finalize_batch` to skip finalization (same guard). The first-time handler now passes `skip_context_check=True` to bypass both guards, calls `_finalize_backend_batch` directly, gates `backend_initialized` on actual success, checks `_send_events_to_backend` return status (marking batch as failed on 500), captures event count/duration/batch ID before they're consumed by send/finalize, and cleans up all singleton state via `_reset_batch_state()` on every exit path.
</details>

<details>
<summary><b>Sync <code>is_current_batch_ephemeral</code> on batch creation success</b></summary>

When the batch is successfully created on the server, `is_current_batch_ephemeral` is now synced with the actual `use_ephemeral` value used. This prevents endpoint mismatches where the batch was created on one endpoint but events and finalization were sent to a different one, resulting in 404.
</details>

<details>
<summary><b>Route <code>mark_trace_batch_as_failed</code> to correct endpoint for ephemeral batches</b></summary>

`mark_trace_batch_as_failed` always routed to the non-ephemeral endpoint (`/tracing/batches/{id}`), causing 404s when called on ephemeral batches — the same class of endpoint mismatch this PR aims to fix. Added `mark_ephemeral_trace_batch_as_failed` to `PlusAPI` and a `_mark_batch_as_failed` helper on `TraceBatchManager` that routes based on `is_current_batch_ephemeral`.
</details>

<details>
<summary><b>Gate <code>backend_initialized</code> on actual init success (non-first-time path)</b></summary>

On the non-first-time path, `backend_initialized` was set to `True` unconditionally after `_initialize_backend_batch` returned. With the new failure-path cleanup that clears `trace_batch_id`, this created an inconsistent state: `backend_initialized=True` + `trace_batch_id=None`. Now set via `self.trace_batch_id is not None`.
</details>

### Resilience improvements

<details>
<summary><b>Retry transient failures on batch creation</b></summary>

`_initialize_backend_batch` now retries up to 2 times with 200ms backoff on transient failures (None response, 5xx, network errors). Non-transient 4xx errors are not retried. The short backoff minimizes lock hold time on the non-first-time path where `_batch_ready_cv` is held.
</details>

<details>
<summary><b>Fall back to ephemeral on server auth rejection</b></summary>

When the non-ephemeral endpoint returns 401/403 (expired token, revoked credentials, key rotation), the client automatically switches to ephemeral tracing instead of losing traces. The fallback forwards `skip_context_check` and is guarded against infinite recursion — if ephemeral also fails, `trace_batch_id` is cleared normally.
</details>

<details>
<summary><b>Fix action-event race initializing batch as non-ephemeral</b></summary>

`_handle_action_event` called `batch_manager.initialize_batch()` directly, defaulting `use_ephemeral=False`. When a `DefaultEnvEvent` or `LLMCallStartedEvent` fired before `CrewKickoffStartedEvent` in the thread pool, the batch was locked in as non-ephemeral. Now routes through `_initialize_batch()` which computes `use_ephemeral` from `_check_authenticated()`.
</details>

<details>
<summary><b>Guard <code>_mark_batch_as_failed</code> against cascading network errors</b></summary>

When `_finalize_backend_batch` failed with a network error (e.g. `[Errno 54] Connection reset by peer`), the exception handler called `_mark_batch_as_failed` — which also makes an HTTP request on the same dead connection. That second failure was unhandled. Now wrapped in a try/except so it logs at debug level instead of propagating.
</details>

<details>
<summary><b>Design decision: first-time users always use ephemeral</b></summary>

First-time trace collection **always creates ephemeral batches**, regardless of authentication status. This is intentional:

1. **The first-time handler UX is built around ephemeral traces** — it displays an access code, a 24-hour expiry link, and opens the browser to the ephemeral trace viewer. Non-ephemeral batches don't produce these artifacts, so the handler would fall through to the "Local Traces Collected" fallback even when traces were successfully sent.

2. **The server handles account linking automatically** — `LinkEphemeralTracesJob` runs on user signup and migrates ephemeral traces to permanent records. Logged-in users can access their traces via their dashboard regardless.

3. **Checking auth during batch setup broke event collection** — moving `_check_authenticated()` into `_initialize_batch` caused the batch initialization to fail silently during the flow/crew start event handler, preventing all event collection. Keeping the first-time path fast and side-effect-free preserves event collection.

The auth check is deferred to the non-first-time path (second run onwards), where `is_tracing_enabled_in_context()` is `True` and the normal tracing pipeline handles everything — including the 401/403 ephemeral fallback.
</details>


### Manual tests


<details>
<summary><b>Matrix</b></summary>

| Scenario | First run | Second run |
|----------|-----------|------------|
| Logged out, fresh `.crewai_user.json` | Ephemeral trace created, URL returned | Ephemeral trace created, URL returned |
| Logged in, fresh `.crewai_user.json` | Ephemeral trace created, URL returned | Trace batch finalized, URL returned |
| Flow execution | Tested with `poem_flow` | Tested with `poem_flow` |
| Crew execution | Tested with `hitl_crew` | Tested with `hitl_crew` |
</details>
2026-03-25 16:00:05 -04:00