From 58e0f69e86576f139b49dfb44f0aeebbc81ab2a7 Mon Sep 17 00:00:00 2001 From: Lucas Gomide Date: Wed, 17 Jun 2026 19:02:11 -0300 Subject: [PATCH] docs(enterprise): consolidate Datadog and structured logs into single guide MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Merges the standalone structured_logs guide into a dedicated Datadog Integration page. The stdout JSON schema is Datadog-Agent-path-specific in practice (OTLP path uses OpenTelemetry attribute names), so a vendor-neutral structured-logs page was misleading. Now Datadog customers have one canonical page covering both ingestion paths plus the dashboard import, and non-Datadog customers land on the OpenTelemetry Export page without being buried in Datadog content. - Delete docs/edge/en/enterprise/guides/structured_logs.mdx; the schema reference moves verbatim into the new datadog.mdx as an anchor-linkable section. - Rename datadog_dashboard.mdx to datadog.mdx (preserved via git mv). New structure: choose-a-path tabs (Datadog Agent recommended / Datadog OTLP intake) → log schema reference (with explicit Info callout that it's the Agent-path schema, not OTLP) → dashboard import → verify ingestion → customize → troubleshooting. - Move the Datadog OTLP UI walkthrough (site domain, API key, /v1/traces vs /v1/logs paths) onto the Datadog page so it lives in exactly one place. Datadog dashboard JSON artifact path stays at datadog_dashboard.json — the file name is artifact-specific. - Reframe capture_telemetry_logs.mdx: add a lead Tip recommending OTel as the vendor-neutral first option, and shrink the Datadog tab to a pointer to the new Datadog Integration guide. - Update docs/docs.json en edge sidebar: drop structured_logs, replace datadog_dashboard with datadog. Co-authored-by: Cursor --- docs/docs.json | 3 +- .../guides/capture_telemetry_logs.mdx | 18 +- docs/edge/en/enterprise/guides/datadog.mdx | 289 ++++++++++++++++++ .../enterprise/guides/datadog_dashboard.mdx | 136 --------- .../en/enterprise/guides/structured_logs.mdx | 142 --------- 5 files changed, 295 insertions(+), 293 deletions(-) create mode 100644 docs/edge/en/enterprise/guides/datadog.mdx delete mode 100644 docs/edge/en/enterprise/guides/datadog_dashboard.mdx delete mode 100644 docs/edge/en/enterprise/guides/structured_logs.mdx diff --git a/docs/docs.json b/docs/docs.json index 2558adf34..a870f687b 100644 --- a/docs/docs.json +++ b/docs/docs.json @@ -515,8 +515,7 @@ "edge/en/enterprise/guides/update-crew", "edge/en/enterprise/guides/enable-crew-studio", "edge/en/enterprise/guides/capture_telemetry_logs", - "edge/en/enterprise/guides/structured_logs", - "edge/en/enterprise/guides/datadog_dashboard", + "edge/en/enterprise/guides/datadog", "edge/en/enterprise/guides/azure-openai-setup", "edge/en/enterprise/guides/vertex-ai-workload-identity-setup", "edge/en/enterprise/guides/tool-repository", diff --git a/docs/edge/en/enterprise/guides/capture_telemetry_logs.mdx b/docs/edge/en/enterprise/guides/capture_telemetry_logs.mdx index 94260d6f5..f043fa0c5 100644 --- a/docs/edge/en/enterprise/guides/capture_telemetry_logs.mdx +++ b/docs/edge/en/enterprise/guides/capture_telemetry_logs.mdx @@ -9,6 +9,10 @@ CrewAI AMP can export OpenTelemetry **traces** and **logs** from your deployment Telemetry data follows the [OpenTelemetry GenAI semantic conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/) plus additional CrewAI-specific attributes. + +OpenTelemetry is the **recommended observability path** — vendor-neutral, works with any OTLP-compatible backend (Grafana, Honeycomb, NewRelic, your own collector). If you specifically use Datadog, see the dedicated [Datadog Integration](/en/enterprise/guides/datadog) guide which covers both the Datadog Agent path and Datadog's OTLP intake. + + ## Prerequisites @@ -41,19 +45,7 @@ Telemetry data follows the [OpenTelemetry GenAI semantic conventions](https://op ![OpenTelemetry collector configuration](/images/crewai-otel-collector-opentelemetry.png) - - **Datadog Site Domain** — Your Datadog site's OTLP host only, with no protocol or path. CrewAI builds the full HTTPS OTLP endpoint for you. Use the host that matches your [Datadog site](https://docs.datadoghq.com/getting_started/site/): - - `otlp.datadoghq.com` (US1) - - `otlp.us3.datadoghq.com` (US3) - - `otlp.us5.datadoghq.com` (US5) - - `otlp.datadoghq.eu` (EU1) - - `otlp.ap1.datadoghq.com` (AP1) - - **API Key** — Your Datadog API key. See [how to create one](https://docs.datadoghq.com/account_management/api-app-keys/#api-keys). - - The default Datadog template ships **traces** to the `/v1/traces` path. To export **logs** via OTLP instead, add an **OpenTelemetry Logs** collector pointed at the same Datadog OTLP host with the path set to `/v1/logs` — both signals can run side by side. - - For stdout-based log shipping (the Datadog Agent path) rather than OTLP, see [Structured JSON Logs](/en/enterprise/guides/structured_logs) and [Datadog Dashboard for crewAI](/en/enterprise/guides/datadog_dashboard). - - ![Datadog collector configuration](/images/crewai-otel-collector-datadog.png) + For Datadog setup, see the dedicated [Datadog Integration](/en/enterprise/guides/datadog) guide — it covers both the Datadog Agent path (recommended, cheaper for log volume) and Datadog's OTLP intake with full collector configuration steps. diff --git a/docs/edge/en/enterprise/guides/datadog.mdx b/docs/edge/en/enterprise/guides/datadog.mdx new file mode 100644 index 000000000..067741c53 --- /dev/null +++ b/docs/edge/en/enterprise/guides/datadog.mdx @@ -0,0 +1,289 @@ +--- +title: "Datadog Integration" +description: "Monitor self-hosted CrewAI AMP deployments in Datadog — pick the Datadog Agent path for cost-efficient log shipping or Datadog's OTLP intake for agentless setup, then import the ready-made operations dashboard." +icon: "dog" +mode: "wide" +--- + +CrewAI ships first-class support for Datadog: two log-ingestion paths, a JSON log schema designed for cheap indexing, and a ready-made operations dashboard you can import in under five minutes. + + +For vendor-neutral observability via any OTLP backend (Grafana, Honeycomb, your own collector), see [OpenTelemetry Export](/en/enterprise/guides/capture_telemetry_logs). + + +## Choose a path + + + + The Datadog Agent runs alongside your CrewAI containers (typically as a DaemonSet on Kubernetes) and tails their stdout. **Recommended** for log-heavy workloads — single-line JSON logs are cheaper to ingest than multi-line tracebacks, and every event ships with structured attributes. + + **Setup:** + 1. Run the Datadog Agent next to your CrewAI containers — see [Datadog's deployment docs](https://docs.datadoghq.com/agent/) for Kubernetes, ECS, or VM setup. Enable log collection (`logs_enabled: true`) and container log collection (`logs_config.container_collect_all: true`). + 2. Set `CREWAI_LOG_FORMAT=json` on every CrewAI container (API + workers) so each log event is a single billable line instead of a multi-line traceback. See the [log schema reference](#log-schema-reference) below for the full field contract. + 3. Confirm logs arrive in Datadog Logs with the JSON fields parsed — see [Verify ingestion](#verify-ingestion). + + **When to pick this path:** you already run the Datadog Agent for infrastructure metrics, you want logs without configuring an OTel collector in AMP, or your log volume makes per-event ingestion cost a concern. + + + Datadog accepts OTLP traffic directly at its intake endpoint, no Agent required. Easier setup if you can't run an Agent in your environment, but Datadog meters OTLP intake separately — check pricing before adopting at scale. + + **Setup:** + 1. In CrewAI AMP, go to **Settings → OpenTelemetry Collectors → Add Collector** and pick **Datadog**. + 2. Configure the connection: + - **Datadog Site Domain** — your Datadog site's OTLP host only, no protocol or path. CrewAI builds the full HTTPS OTLP endpoint for you. Use the host that matches your [Datadog site](https://docs.datadoghq.com/getting_started/site/): + - `otlp.datadoghq.com` (US1) + - `otlp.us3.datadoghq.com` (US3) + - `otlp.us5.datadoghq.com` (US5) + - `otlp.datadoghq.eu` (EU1) + - `otlp.ap1.datadoghq.com` (AP1) + - **API Key** — your Datadog API key. See [how to create one](https://docs.datadoghq.com/account_management/api-app-keys/#api-keys). + 3. The default Datadog template ships **traces** to the `/v1/traces` path. To export **logs** via OTLP, add a second **OpenTelemetry Logs** collector pointed at the same Datadog OTLP host with the path set to `/v1/logs`. Both signals can run side by side. + 4. *(optional)* Click **Test Connection** to verify CrewAI can reach the endpoint with the credentials you provided. Then click **Save**. + + ![Datadog collector configuration](/images/crewai-otel-collector-datadog.png) + + **When to pick this path:** you can't or don't want to run the Datadog Agent, or you're already using OTLP for traces and want a single export pipeline. + + + +Either path lands the same structured facets in Datadog (`@automation_id`, `@kickoff_id`, `@execution_id`, `@automation_name`, `@crewai_version`, `@exception.type`, `@gen_ai.*`), so the dashboard works identically with either choice. + +## Log schema reference + + +This schema applies to the **Datadog Agent path** — stdout JSON logs produced when `CREWAI_LOG_FORMAT=json` is set. Logs delivered via the **Datadog OTLP intake** use OpenTelemetry attribute names and may differ; see [OpenTelemetry Export](/en/enterprise/guides/capture_telemetry_logs). + + +When `CREWAI_LOG_FORMAT=json` is set, every log event is emitted as a **single JSON object per line** to stdout, with internal newlines escaped. The format is plain JSON — Datadog parses it natively, and the same payload is also consumable by Splunk, Loki, Elasticsearch, and CloudWatch without custom log pipelines. + +### Why JSON output + + + + Most managed log backends bill per event. A Python traceback in text format is counted as one event per line — 30+ events for a single error. JSON output collapses each traceback into a single event with the stack trace as an escaped string field. + + + Search by `@automation_id`, `@exception.type`, `@kickoff_id` instead of grepping free-text. Build dashboards on typed facets without parser configuration. + + + Every event carries `trace_id` and `span_id` when fired inside a recording span, so backends auto-link logs to traces. + + + The `schema` field gates compatibility — within `v1`, fields are added but never renamed or removed. + + + +### Enabling JSON output + +Set the `CREWAI_LOG_FORMAT` environment variable to `json` on every container that runs your deployment (API + workers). + +```shell +CREWAI_LOG_FORMAT=json +``` + +Restart the deployment to pick up the change. Every log line on stdout from that point on is a single JSON object. + + + The default value is `text`, which preserves the legacy human-readable line format byte-for-byte. Setting any value other than `json` falls back to text mode. There is no migration step — the variable is read at process start and the format switches immediately. + + +### Example events + +A single info-level log inside an active automation kickoff: + +```json +{ + "schema": "v1", + "ts": "2026-06-17T16:14:23.482914Z", + "level": "INFO", + "logger": "crewai_enterprise.utilities.pii_redaction", + "crewai_version": "1.14.7", + "msg": "PII tracking state reset (engines preserved)", + "automation_id": "12", + "task_id": "0843a930-b306-464b-89c8-bfafa78cc711", + "kickoff_id": "0843a930-b306-464b-89c8-bfafa78cc711", + "execution_id": "0843a930-b306-464b-89c8-bfafa78cc711", + "automation_name": "research_flow" +} +``` + +An error with a Python exception is collapsed into a single event with the traceback as a string: + +```json +{ + "schema": "v1", + "ts": "2026-06-17T16:14:31.218450Z", + "level": "ERROR", + "logger": "api.tasks.flow_run_task", + "crewai_version": "1.14.7", + "msg": "Flow execution failed", + "automation_id": "12", + "kickoff_id": "0843a930-b306-464b-89c8-bfafa78cc711", + "execution_id": "0843a930-b306-464b-89c8-bfafa78cc711", + "automation_name": "research_flow", + "exception": { + "type": "ValueError", + "message": "Topic cannot be empty", + "stacktrace": "Traceback (most recent call last):\n File \"/app/flow.py\", line 42, in summarize\n ...\nValueError: Topic cannot be empty\n" + } +} +``` + +The same error in legacy text mode would have produced ~25 separate log events (one per traceback line) — all of which the backend would bill and index individually. + +### Schema v1 fields + +Within the `v1` schema, fields are only added, never renamed or removed. New fields will appear as soon as a deployment is upgraded. + +| Field | Type | Always present | Source | +|-------|------|----------------|--------| +| `schema` | string | Yes | Constant `"v1"`. Increment indicates a breaking schema change. | +| `ts` | string (ISO-8601 UTC, microseconds) | Yes | Record creation time, e.g. `2026-06-17T16:14:23.482914Z`. | +| `level` | string | Yes | Python log level name: `DEBUG` / `INFO` / `WARNING` / `ERROR` / `CRITICAL`. | +| `logger` | string | Yes | Dotted logger name, e.g. `api.tasks.flow_run_task`. | +| `crewai_version` | string | Yes (when `crewai` package metadata is resolvable) | Installed `crewai` package version, e.g. `"1.14.7"`. | +| `msg` | string | Yes | Rendered log message (after `%`-formatting / `{}`-formatting). | +| `automation_id` | string | When `CREWAI_PLUS_ID` env var is set | Numeric deployment ID (AMP provisions this on every container). | +| `task_id` | string | On Celery worker logs | Celery task UUID, or `"no-task"` for non-task contexts. | +| `kickoff_id` | string | Inside an automation kickoff | UUID of the current kickoff. | +| `execution_id` | string | Inside an automation kickoff | UUID of the current sub-execution. Equal to `kickoff_id` at the top level; differs for nested flow methods that spawn sub-executions. | +| `automation_name` | string | Inside an automation kickoff | Human-readable automation/flow name, e.g. `"research_flow"`. | +| `trace_id` | string (32-hex) | Inside a recording OpenTelemetry span | Hex trace ID. Omitted when no span is active. | +| `span_id` | string (16-hex) | Inside a recording OpenTelemetry span | Hex span ID. Omitted when no span is active. | +| `exception` | object | When the log record has `exc_info` | `{type, message, stacktrace}` — full traceback as a single escaped string. | + + + Any additional `extra={...}` kwargs passed to a logger call appear as top-level JSON fields verbatim. Reserved field names above always win to keep the schema stable. + + +### Stability promise + +The `schema` field declares the contract. Within `v1`, CrewAI commits to: + +- **Never removing a field** that customers may have built queries or dashboards against. +- **Never renaming a field** in place — renames happen via a schema bump (e.g. `v2`), with the old name kept as a deprecated alias for at least one release cycle. +- **Adding new fields** at any time. Consumers should ignore unknown top-level keys. + +When a `v2` is introduced, both the `schema` field and the migration guide will be published in advance, and `v1` will continue to be emitted for one release cycle so dashboards and queries have time to migrate. + +## Prerequisite: promote facets + +Datadog auto-discovers fields the first time it sees them but doesn't make them queryable in widgets until they're promoted to **facets**. This is a one-time setup in your Datadog account. + + + + Open [Logs Explorer](https://app.datadoghq.com/logs) and search `service:crewai*`. You should see at least one log event. + + + Click any log entry to open the right-hand details panel. For each field below, hover the field name → click the gear icon → **Create facet**. + + - `automation_id`, `automation_name`, `execution_id`, `kickoff_id`, `task_id` + - `crewai_version`, `model_id` + - `exception.type`, `exception.message` + + Skip any field that already shows a star icon next to its name — that means it's already a facet. The `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`, and `gen_ai.request.model` facets are typically promoted automatically by Datadog's LLM Observability auto-discovery, but verify they exist before importing the dashboard. + + + +## Import the dashboard + + + + Save [`datadog_dashboard.json`](https://raw.githubusercontent.com/crewAIInc/crewAI/main/docs/edge/en/enterprise/guides/datadog_dashboard.json) to your machine. + + + Navigate to **Dashboards → New Dashboard**. Click the **gear icon** in the top right of the empty dashboard and select **Import Dashboard JSON**. + + + Paste the contents of `datadog_dashboard.json` into the import dialog (or drag the file in). Click **Import**. + + Datadog creates the dashboard immediately and lands you on it. The first load may show empty widgets for a few seconds while queries execute against the time range. + + + + + Datadog's [Dashboard API](https://docs.datadoghq.com/api/latest/dashboards/#create-a-new-dashboard) accepts the same JSON via `POST /api/v1/dashboard`. Use it if you manage dashboards through Terraform, Pulumi, or CI. + + +## What you get + +The dashboard is organized into four sections plus a placeholder for a custom drill-down widget: + +| Section | Widgets | Useful for | +|---------|---------|------------| +| **Header** | Total Executions · Error Rate (%) · Active Automations · CrewAI Versions in Use | At-a-glance health for the last hour. Error Rate is conditionally formatted (green ≤ 5%, yellow ≤ 10%, red > 10%). | +| **Throughput** | Executions per Hour by Automation (top 10, stacked bars) | Spotting traffic shifts, surfacing busy automations, validating that a rollout didn't change baseline volume. | +| **Errors** | Errors by Exception Type (top 5, stacked bars) · Top Exception Types by Count (toplist) | Triaging failures — which exception types are spiking, which automations they're hitting. | +| **Cost** | Total Tokens per Hour by Model (input + output, stacked area) | Tracking LLM token spend by model. Useful for catching cost regressions when an automation switches model or starts looping. | +| **Drill-Down** | _(empty placeholder)_ | See [Customization](#customize) for adding a recent-errors log stream here. | + +Three template variables at the top of the dashboard re-scope every widget at once: + +- **`$automation`** — filter to a single automation by name. +- **`$version`** — filter to a single `crewai` SDK version (useful for comparing pre- and post-upgrade behavior). +- **`$service`** — filter to a specific Datadog `service` tag (useful when multiple CrewAI deployments share one Datadog account). + +## Verify ingestion + +Open [Logs Explorer](https://app.datadoghq.com/logs) and run a query that matches your ingestion path: + + + + Search `service:crewai* @schema:v1`. You should see structured logs with the JSON fields parsed into Datadog facets. Pick a recent event and verify it has `@automation_id`, `@kickoff_id`, `@execution_id`, `@crewai_version`, and (when running inside a span) `@trace_id` / `@span_id` populated. + + If nothing appears, confirm `CREWAI_LOG_FORMAT=json` is set on the running container, the deployment was restarted after the change, and the Datadog Agent is tailing container stdout. + + + Search `source:otlp service:crewai*`. OTLP attributes land with their OpenTelemetry names (`automation_id`, `crewai.kickoff.id`, etc.) rather than the stdout JSON keys, but they map to the same dashboard facets after [facet promotion](#prerequisite-promote-facets). + + If nothing appears, verify the collector endpoint is correct (`/v1/logs` for logs, `/v1/traces` for traces) and **Test Connection** succeeded when the collector was saved. + + + +## Customize + +The dashboard ships with deliberate gaps so you can extend it without uninstalling and re-importing. + +### Add a Recent Errors log stream + +The **Drill-Down** section is intentionally empty. Add a Log Stream widget to it for an inline view of recent failures: + +1. Edit the dashboard and click **+ Add Widgets** inside the Drill-Down group. +2. Drag in a **Log Stream** widget. +3. Set the filter query to `status:error $automation $version $service`. +4. Choose columns: `@timestamp`, `@automation_name`, `@exception.type`, `@exception.message`, `@execution_id`. +5. Sort by most recent, limit to 25 entries. + +Clicking any row jumps to Logs Explorer with the same filter pre-applied. + +### Add p95 latency + +Logs don't include execution duration by default. Two ways to add a latency widget: + +- **From APM traces** — if you also export OTLP traces to Datadog, add a Timeseries widget with data source **Traces**, query `service:crewai*`, aggregation `p95 of @duration`. Datadog APM auto-tracks span duration. +- **From metric extraction** — extract a `flow.duration_ms` metric from logs via [Datadog's log-to-metric pipeline](https://docs.datadoghq.com/logs/log_configuration/logs_to_metrics/), then chart it like any other metric. Useful if you don't run APM. + +### Re-scope to multiple deployments + +The `$service` template variable defaults to `*` and will catch every CrewAI deployment in your Datadog account. Change the default to a specific service name in **Configure → Template Variables** if you want the dashboard to focus on one deployment by default. + +## Troubleshooting + +| Symptom | Likely cause | Fix | +|---------|--------------|-----| +| All widgets show "No data" | Facets aren't promoted | Re-do the [Promote facets](#prerequisite-promote-facets) step. Datadog won't query against an un-promoted field. | +| Error Rate widget shows `NaN` | No executions in the time window | Either no traffic, or `@execution_id` isn't faceted. Expand the time range and re-check facets. | +| Throughput chart is flat at the same value | Logs aren't reaching Datadog | Search `service:crewai*` in Logs Explorer. If nothing shows, verify the Datadog Agent is running (Agent path) or the OTel collector endpoint is correct (OTLP path). | +| `crewai_version` shows fewer values than expected | Some containers predate the structured-logs work | The `crewai_version` field was added alongside JSON output. Older deployments running text mode (or older AMP builds) won't emit it. Upgrade those deployments to pick up the field. See the [log schema reference](#log-schema-reference) for the full field contract. | +| Template variables don't filter widgets | The widget's filter line doesn't reference the template variable | Edit the widget and confirm the search includes `$automation $version $service`. | + +## Next steps + + + + Vendor-neutral observability for non-Datadog stacks (Grafana, Honeycomb, your own collector) — or as a Datadog complement when you want to fan out telemetry to multiple backends. + + + Reference for customizing widget queries against the structured facets above. + + diff --git a/docs/edge/en/enterprise/guides/datadog_dashboard.mdx b/docs/edge/en/enterprise/guides/datadog_dashboard.mdx deleted file mode 100644 index ec7fa7597..000000000 --- a/docs/edge/en/enterprise/guides/datadog_dashboard.mdx +++ /dev/null @@ -1,136 +0,0 @@ ---- -title: "Datadog Dashboard for crewAI" -description: "Import a ready-made Datadog dashboard for monitoring self-hosted CrewAI AMP deployments — executions, errors, token cost, and version distribution. Works with both the Datadog Agent and Datadog's OTLP intake." -icon: "dog" -mode: "wide" ---- - -CrewAI ships a ready-made Datadog dashboard for self-hosted AMP deployments. Once your logs are flowing into Datadog, you can import the dashboard JSON and have an operations view live in your account in under five minutes. - -The dashboard works with either of Datadog's two log-ingestion paths — pick whichever fits your infrastructure: - - - - The Datadog Agent runs alongside your CrewAI containers (typically as a DaemonSet on Kubernetes) and tails their stdout. This path requires enabling [Structured JSON Logs](/en/enterprise/guides/structured_logs) so each log event is a single billable line instead of a multi-line traceback. - - **Setup:** - 1. Set `CREWAI_LOG_FORMAT=json` on every CrewAI container — see [Structured JSON Logs](/en/enterprise/guides/structured_logs) for full details. - 2. Install the Datadog Agent in your cluster following [Datadog's Kubernetes setup guide](https://docs.datadoghq.com/containers/kubernetes/installation/). Enable log collection (`logs_enabled: true`) and container log collection (`logs_config.container_collect_all: true`). - 3. Confirm logs are landing in Datadog by searching `service:crewai*` in the [Logs Explorer](https://app.datadoghq.com/logs). - - **When to pick this path:** you already run the Datadog Agent for infrastructure metrics, or you want logs without configuring an OTel collector in AMP. - - - Datadog accepts OTLP traffic directly at its intake endpoint, no agent required. Configure CrewAI AMP's built-in OTel collector to point at Datadog's OTLP host. - - **Setup:** - 1. In CrewAI AMP: **Settings → OpenTelemetry Collectors → Add Collector → Datadog**. See [OpenTelemetry Export](/en/enterprise/guides/capture_telemetry_logs) for the full collector setup. - 2. The default Datadog template ships **traces** to `/v1/traces`. For log export, switch the endpoint path to `/v1/logs` on the OpenTelemetry Logs collector (use the same Datadog OTLP host). - 3. Confirm logs are landing by searching `source:otlp service:crewai*` in the [Logs Explorer](https://app.datadoghq.com/logs). - - **When to pick this path:** you can't or don't want to run the Datadog Agent, or you're already using OTLP for traces and want a single export pipeline. - - - -Either path lands the same structured facets in Datadog (`@automation_id`, `@kickoff_id`, `@execution_id`, `@automation_name`, `@crewai_version`, `@exception.type`, `@gen_ai.*`), so the dashboard works identically with either choice. - -## Prerequisite: promote facets - -Datadog auto-discovers fields the first time it sees them but doesn't make them queryable in widgets until they're promoted to **facets**. This is a one-time setup in your Datadog account. - - - - Open [Logs Explorer](https://app.datadoghq.com/logs) and search `service:crewai*`. You should see at least one log event. - - - Click any log entry to open the right-hand details panel. For each field below, hover the field name → click the gear icon → **Create facet**. - - - `automation_id`, `automation_name`, `execution_id`, `kickoff_id`, `task_id` - - `crewai_version`, `model_id` - - `exception.type`, `exception.message` - - Skip any field that already shows a star icon next to its name — that means it's already a facet. The `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`, and `gen_ai.request.model` facets are typically promoted automatically by Datadog's LLM Observability auto-discovery, but verify they exist before importing the dashboard. - - - -## Import the dashboard - - - - Save [`datadog_dashboard.json`](https://raw.githubusercontent.com/crewAIInc/crewAI/main/docs/edge/en/enterprise/guides/datadog_dashboard.json) to your machine. - - - Navigate to **Dashboards → New Dashboard**. Click the **gear icon** in the top right of the empty dashboard and select **Import Dashboard JSON**. - - - Paste the contents of `datadog_dashboard.json` into the import dialog (or drag the file in). Click **Import**. - - Datadog creates the dashboard immediately and lands you on it. The first load may show empty widgets for a few seconds while queries execute against the time range. - - - - - Datadog's [Dashboard API](https://docs.datadoghq.com/api/latest/dashboards/#create-a-new-dashboard) accepts the same JSON via `POST /api/v1/dashboard`. Use it if you manage dashboards through Terraform, Pulumi, or CI. - - -## What you get - -The dashboard is organized into four sections plus a placeholder for a custom drill-down widget: - -| Section | Widgets | Useful for | -|---------|---------|------------| -| **Header** | Total Executions · Error Rate (%) · Active Automations · CrewAI Versions in Use | At-a-glance health for the last hour. Error Rate is conditionally formatted (green ≤ 5%, yellow ≤ 10%, red > 10%). | -| **Throughput** | Executions per Hour by Automation (top 10, stacked bars) | Spotting traffic shifts, surfacing busy automations, validating that a rollout didn't change baseline volume. | -| **Errors** | Errors by Exception Type (top 5, stacked bars) · Top Exception Types by Count (toplist) | Triaging failures — which exception types are spiking, which automations they're hitting. | -| **Cost** | Total Tokens per Hour by Model (input + output, stacked area) | Tracking LLM token spend by model. Useful for catching cost regressions when an automation switches model or starts looping. | -| **Drill-Down** | _(empty placeholder)_ | See [Customization](#customization) for adding a recent-errors log stream here. | - -Three template variables at the top of the dashboard re-scope every widget at once: - -- **`$automation`** — filter to a single automation by name. -- **`$version`** — filter to a single `crewai` SDK version (useful for comparing pre- and post-upgrade behavior). -- **`$service`** — filter to a specific Datadog `service` tag (useful when multiple CrewAI deployments share one Datadog account). - -## Customization - -The dashboard ships with deliberate gaps so you can extend it without uninstalling and re-importing. - -### Add a Recent Errors log stream - -The **Drill-Down** section is intentionally empty. Add a Log Stream widget to it for an inline view of recent failures: - -1. Edit the dashboard and click **+ Add Widgets** inside the Drill-Down group. -2. Drag in a **Log Stream** widget. -3. Set the filter query to `status:error $automation $version $service`. -4. Choose columns: `@timestamp`, `@automation_name`, `@exception.type`, `@exception.message`, `@execution_id`. -5. Sort by most recent, limit to 25 entries. - -Clicking any row jumps to Logs Explorer with the same filter pre-applied. - -### Add p95 latency - -Logs don't include execution duration by default. Two ways to add a latency widget: - -- **From APM traces** — if you also export OTLP traces to Datadog, add a Timeseries widget with data source **Traces**, query `service:crewai*`, aggregation `p95 of @duration`. Datadog APM auto-tracks span duration. -- **From metric extraction** — extract a `flow.duration_ms` metric from logs via [Datadog's log-to-metric pipeline](https://docs.datadoghq.com/logs/log_configuration/logs_to_metrics/), then chart it like any other metric. Useful if you don't run APM. - -### Re-scope to multiple deployments - -The `$service` template variable defaults to `*` and will catch every CrewAI deployment in your Datadog account. Change the default to a specific service name in **Configure → Template Variables** if you want the dashboard to focus on one deployment by default. - -## Troubleshooting - -| Symptom | Likely cause | Fix | -|---------|--------------|-----| -| All widgets show "No data" | Facets aren't promoted | Re-do the [Promote facets](#prerequisite-promote-facets) step. Datadog won't query against an un-promoted field. | -| Error Rate widget shows `NaN` | No executions in the time window | Either no traffic, or `@execution_id` isn't faceted. Expand the time range and re-check facets. | -| Throughput chart is flat at the same value | Logs aren't reaching Datadog | Search `service:crewai*` in Logs Explorer. If nothing shows, verify the Datadog Agent is running (Agent path) or the OTel collector endpoint is correct (OTLP path). | -| `crewai_version` shows fewer values than expected | Some containers predate the structured-logs work | The `crewai_version` field was added alongside JSON output. Older deployments running text mode (or older AMP builds) won't emit it. Upgrade those deployments to pick up the field. | -| Template variables don't filter widgets | The widget's filter line doesn't reference the template variable | Edit the widget and confirm the search includes `$automation $version $service`. | - -## References - -- [Structured JSON Logs](/en/enterprise/guides/structured_logs) — the underlying log format the dashboard queries against. -- [OpenTelemetry Export](/en/enterprise/guides/capture_telemetry_logs) — set up the OTLP path if you're not using the Datadog Agent. -- [Datadog Log Search Syntax](https://docs.datadoghq.com/logs/explorer/search_syntax/) — reference for customizing widget queries. -- [Datadog Dashboard JSON Schema](https://docs.datadoghq.com/dashboards/graphing_json/) — full reference for the dashboard file format if you want to script changes. diff --git a/docs/edge/en/enterprise/guides/structured_logs.mdx b/docs/edge/en/enterprise/guides/structured_logs.mdx deleted file mode 100644 index b6871bee4..000000000 --- a/docs/edge/en/enterprise/guides/structured_logs.mdx +++ /dev/null @@ -1,142 +0,0 @@ ---- -title: "Structured JSON Logs" -description: "Emit single-line JSON log events from CrewAI AMP deployments for cheaper, structured ingestion in Datadog, Splunk, Loki, and other log backends." -icon: "brackets-curly" -mode: "wide" ---- - -CrewAI AMP can emit one JSON object per log event on stdout instead of the default multi-line text format. Each event ships with typed context fields (automation, kickoff, execution, trace IDs, exception details), making logs cheaper to index, easier to search, and trivially correlatable with traces. - -This page describes the JSON schema, how to enable it, and how to verify it's working. For a ready-made Datadog dashboard built on top of these fields, see [Datadog Dashboard for crewAI](/en/enterprise/guides/datadog_dashboard). - -## Why use JSON output - - - - Most managed log backends bill per event. A Python traceback in text format is counted as one event per line — 30+ events for a single error. JSON output collapses each traceback into a single event with the stack trace as an escaped string field. - - - Search by `@automation_id`, `@exception.type`, `@kickoff_id` instead of grepping free-text. Build dashboards on typed facets without parser configuration. - - - Every event carries `trace_id` and `span_id` when fired inside a recording span, so backends auto-link logs to traces. - - - The format is plain JSON — Datadog, Splunk, Loki, Elasticsearch, and CloudWatch all parse it natively without custom log pipelines. - - - -## Enabling JSON output - -Set the `CREWAI_LOG_FORMAT` environment variable to `json` on every container that runs your deployment (API + workers). - -```shell -CREWAI_LOG_FORMAT=json -``` - -Restart the deployment to pick up the change. Every log line on stdout from that point on is a single JSON object. - - - The default value is `text`, which preserves the legacy human-readable line format byte-for-byte. Setting any value other than `json` falls back to text mode. There is no migration step — the variable is read at process start and the format switches immediately. - - -## What a log event looks like - -A single info-level log inside an active automation kickoff: - -```json -{ - "schema": "v1", - "ts": "2026-06-17T16:14:23.482914Z", - "level": "INFO", - "logger": "crewai_enterprise.utilities.pii_redaction", - "crewai_version": "1.14.7", - "msg": "PII tracking state reset (engines preserved)", - "automation_id": "12", - "task_id": "0843a930-b306-464b-89c8-bfafa78cc711", - "kickoff_id": "0843a930-b306-464b-89c8-bfafa78cc711", - "execution_id": "0843a930-b306-464b-89c8-bfafa78cc711", - "automation_name": "research_flow" -} -``` - -An error with a Python exception is collapsed into a single event with the traceback as a string: - -```json -{ - "schema": "v1", - "ts": "2026-06-17T16:14:31.218450Z", - "level": "ERROR", - "logger": "api.tasks.flow_run_task", - "crewai_version": "1.14.7", - "msg": "Flow execution failed", - "automation_id": "12", - "kickoff_id": "0843a930-b306-464b-89c8-bfafa78cc711", - "execution_id": "0843a930-b306-464b-89c8-bfafa78cc711", - "automation_name": "research_flow", - "exception": { - "type": "ValueError", - "message": "Topic cannot be empty", - "stacktrace": "Traceback (most recent call last):\n File \"/app/flow.py\", line 42, in summarize\n ...\nValueError: Topic cannot be empty\n" - } -} -``` - -The same error in legacy text mode would have produced ~25 separate log events (one per traceback line) — all of which the backend would bill and index individually. - -## Schema v1 field reference - -Within the `v1` schema, fields are only added, never renamed or removed. New fields will appear as soon as a deployment is upgraded. - -| Field | Type | Always present | Source | -|-------|------|----------------|--------| -| `schema` | string | Yes | Constant `"v1"`. Increment indicates a breaking schema change. | -| `ts` | string (ISO-8601 UTC, microseconds) | Yes | Record creation time, e.g. `2026-06-17T16:14:23.482914Z`. | -| `level` | string | Yes | Python log level name: `DEBUG` / `INFO` / `WARNING` / `ERROR` / `CRITICAL`. | -| `logger` | string | Yes | Dotted logger name, e.g. `api.tasks.flow_run_task`. | -| `crewai_version` | string | Yes (when `crewai` package metadata is resolvable) | Installed `crewai` package version, e.g. `"1.14.7"`. | -| `msg` | string | Yes | Rendered log message (after `%`-formatting / `{}`-formatting). | -| `automation_id` | string | When `CREWAI_PLUS_ID` env var is set | Numeric deployment ID (AMP provisions this on every container). | -| `task_id` | string | On Celery worker logs | Celery task UUID, or `"no-task"` for non-task contexts. | -| `kickoff_id` | string | Inside an automation kickoff | UUID of the current kickoff. | -| `execution_id` | string | Inside an automation kickoff | UUID of the current sub-execution. Equal to `kickoff_id` at the top level; differs for nested flow methods that spawn sub-executions. | -| `automation_name` | string | Inside an automation kickoff | Human-readable automation/flow name, e.g. `"research_flow"`. | -| `trace_id` | string (32-hex) | Inside a recording OpenTelemetry span | Hex trace ID. Omitted when no span is active. | -| `span_id` | string (16-hex) | Inside a recording OpenTelemetry span | Hex span ID. Omitted when no span is active. | -| `exception` | object | When the log record has `exc_info` | `{type, message, stacktrace}` — full traceback as a single escaped string. | - - - Any additional `extra={...}` kwargs passed to a logger call appear as top-level JSON fields verbatim. Reserved field names above always win to keep the schema stable. - - -## Verifying it's working - -After enabling the env var and restarting, fetch the latest container logs and confirm each line is a single JSON object: - -```shell -# Example: docker logs --tail 10 -docker logs $(docker ps -qf name=crewai-api) --tail 10 | jq -r '.msg' -``` - -If the output is JSON, each line will parse successfully and `jq` will print only the `msg` field. If you see "parse error", the env var didn't take effect — confirm it's set in the running container and that the deployment was restarted after the change. - -## Compatibility and versioning - -The `schema` field declares the contract. Within `v1`, CrewAI commits to: - -- **Never removing a field** that customers may have built queries or dashboards against. -- **Never renaming a field** in place — renames happen via a schema bump (e.g. `v2`), with the old name kept as a deprecated alias for at least one release cycle. -- **Adding new fields** at any time. Consumers should ignore unknown top-level keys. - -When a `v2` is introduced, both the `schema` field and the migration guide will be published in advance, and `v1` will continue to be emitted for one release cycle so dashboards and queries have time to migrate. - -## What's next - - - - Import a ready-made operations dashboard built on these facets — executions, errors, token cost, version distribution. Works with both the Datadog Agent and Datadog's OTLP intake. - - - Ship logs and traces to your own OTel collector or directly to a backend's OTLP intake. The same context fields land as OTLP attributes, so the dashboard works regardless of which path you use. - -