1 change: 1 addition & 0 deletions content/en/continuous_integration/pipelines/_index.md
@@ -47,6 +47,7 @@ Select your CI provider to set up CI Visibility in Datadog:
| {{< ci-details title="Infrastructure correlation" >}}Correlation of host-level information for the Datadog Agent, CI pipelines, or job runners to CI pipeline execution data.{{< /ci-details >}} | | | {{< X >}} | | | {{< X >}} | {{< X >}} | {{< X >}} | | |
| {{< ci-details title="Running pipelines" >}}Identification of pipeline executions that are running, with associated tracing.{{< /ci-details >}} | {{< X >}} | | | | | {{< X >}} | {{< X >}} | {{< X >}} | | {{< X >}} |
| {{< ci-details title="Partial retries" >}}Identification of partial retries (for example, when only a subset of jobs was retried).{{< /ci-details >}} | {{< X >}} | {{< X >}} | {{< X >}} | | {{< X >}} | {{< X >}} | {{< X >}} | | {{< X >}} | {{< X >}} |
| {{< ci-details title="Automatic job retries" >}}Preview. Datadog retries failed jobs classified as transient by its AI error model.{{< /ci-details >}} | | | | | | {{< X >}} | {{< X >}} | | | |
| {{< ci-details title="Step granularity" >}}Step-level spans are available for more granular visibility.{{< /ci-details >}} | | | | | {{< X >}} | {{< X >}} | | {{< X >}} <br /> (_Presented as job spans_) | | {{< X >}} |
| {{< ci-details title="Manual steps" >}}Identification of when there is a job with a manual approval phase in the overall pipeline.{{< /ci-details >}} | {{< X >}} | {{< X >}} | {{< X >}} | | {{< X >}} | {{< X >}} | {{< X >}} | {{< X >}} | | {{< X >}} |

39 changes: 39 additions & 0 deletions content/en/continuous_integration/pipelines/github.md
@@ -30,6 +30,7 @@ Set up CI Visibility for GitHub Actions to track the execution of your workflows
| [Running pipelines][2] | Running pipelines | View pipeline executions that are running. Queued or waiting pipelines are shown with the status "Running" in Datadog. |
| [CI jobs failure analysis][23] | CI jobs failure analysis | Uses LLMs on relevant logs to analyze the root cause of failed CI jobs. |
| [Partial retries][3] | Partial pipelines | View partially retried pipeline executions. |
| [Automatic job retries](#automatic-job-retries) | Automatic job retries | Preview. Datadog retries failed jobs classified as transient by its AI error model. |
| Logs correlation | Logs correlation | Correlate pipeline and job spans to logs and enable [job log collection](#collect-job-logs). |
| Infrastructure metric correlation | Infrastructure metric correlation | Correlate jobs to [infrastructure host metrics][4] for GitHub jobs. |
| [Custom tags][5] [and measures at runtime][6] | Custom tags and measures at runtime | Configure [custom tags and measures][7] at runtime. |
@@ -122,6 +123,43 @@ You can also add job failure analysis to a PR comment. See the guide on [using P

For a full explanation, see the guide on [using CI jobs failure analysis][23].

## Automatic job retries

<div class="alert alert-info">Automatic job retries are in Preview. To request access, contact your Datadog account team.</div>

Automatic job retries save developer time by re-running failures that are likely transient, such as network timeouts, infrastructure failures, or flaky tests. Genuine code defects are not retried. Datadog runs each failed job through an AI-powered error classifier. When the failure is identified as retriable, Datadog triggers a retry through the GitHub Actions API without manual intervention.

### How it works

1. A job fails in your workflow.
2. Datadog's AI error classifier inspects the job's logs and error context to determine whether the failure is transient.
3. If the failure is classified as retriable, Datadog requests a retry through the GitHub Actions API.
4. Datadog retries each job up to a maximum number of attempts to prevent infinite retry loops.
5. Datadog records the retry outcome on the original pipeline in CI Visibility.
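The flow above can be sketched as follows. This is a minimal illustration, not Datadog's actual implementation: the keyword classifier stands in for Datadog's AI error model, and the marker list and attempt limit are illustrative values.

```python
# Minimal sketch of the classify-then-retry flow described above.
# The keyword check below is a stand-in for Datadog's AI error
# classifier; the markers and attempt cap are illustrative only.

MAX_ATTEMPTS = 1  # each logical job is retried at most once

TRANSIENT_MARKERS = ("timeout", "connection reset", "runner lost")


def is_transient(log_excerpt: str) -> bool:
    """Stand-in classifier: a keyword scan instead of an AI model."""
    text = log_excerpt.lower()
    return any(marker in text for marker in TRANSIENT_MARKERS)


def decide_retry(log_excerpt: str, attempts_so_far: int) -> bool:
    """Retry only transient failures, and only up to the attempt cap."""
    return attempts_so_far < MAX_ATTEMPTS and is_transient(log_excerpt)
```

For example, `decide_retry("timeout waiting for runner", 0)` returns `True`, while a compilation error or an already-retried job returns `False`, matching the limitations listed below.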

### Requirements

- CI Visibility enabled for your GitHub Actions integration (see [Configure the Datadog integration](#configure-the-datadog-integration)).
- [Datadog Source Code Integration][27] configured for the repositories where you want automatic retries.
- Automatic job retries enabled for your organization (see the banner above for how to request access).

### GitHub-specific behavior

GitHub Actions imposes two provider-level limitations that shape how retries work:

- **Retries happen after the workflow finishes.** The GitHub API does not allow retrying an individual job while the rest of the workflow is still running. Datadog waits for the workflow to reach a final state before issuing retries.
- **All failed jobs are retried together.** The GitHub API does not support retrying a single job when other jobs in the workflow have also failed. Datadog reruns every failed job in the workflow through a single GitHub API call. This may increase your GitHub Actions compute usage.
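Concretely, GitHub exposes a single `rerun-failed-jobs` REST endpoint that re-runs every failed job in a workflow run together. The endpoint path below is GitHub's; the helper function is illustrative.

```python
# GitHub's REST API re-runs all failed jobs in a run through one
# endpoint; there is no way to retry a single job while other jobs
# in the same run have also failed.


def rerun_failed_jobs_endpoint(owner: str, repo: str, run_id: int) -> str:
    # POST https://api.github.com/{path} with a token that has
    # write access to Actions triggers the rerun of all failed jobs.
    return f"/repos/{owner}/{repo}/actions/runs/{run_id}/rerun-failed-jobs"
```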

### Protected branches

The Datadog GitHub App's default permissions do not allow retries on protected branches. To enable automatic retries on a protected branch (for example, your default branch), grant the app Maintainer-level access. Review your organization's policies before expanding permissions.

### Limitations

- Each logical job is retried at most one time.
- Jobs classified as non-retriable (for example, compilation errors or asserted test failures) are never retried.
- If a job has already been retried manually or by provider-native retry rules, Datadog does not issue an additional retry.

## Visualize pipeline data in Datadog

The [**CI Pipeline List**][17] and [**Executions**][18] pages populate with data after the pipelines finish.
@@ -158,3 +196,4 @@ The **CI Pipeline List** page shows data for only the default branch of each rep
[24]: /continuous_integration/guides/identify_highest_impact_jobs_with_critical_path/
[25]: /glossary/#pipeline-execution-time
[26]: /continuous_integration/guides/use_ci_jobs_failure_analysis/#using-pr-comments
[27]: /integrations/guide/source-code-integration/
31 changes: 31 additions & 0 deletions content/en/continuous_integration/pipelines/gitlab.md
@@ -28,6 +28,7 @@ Set up CI Visibility for GitLab to collect data on your pipeline executions, ana
| [CI jobs failure analysis][28] | CI jobs failure analysis | Uses LLMs on relevant logs to analyze the root cause of failed CI jobs. |
| [Filter CI Jobs on the critical path][29] | Filter CI Jobs on the critical path | Filter by jobs on the critical path. |
| [Partial retries][19] | Partial pipelines | View partially retried pipeline executions. |
| [Automatic job retries](#automatic-job-retries) | Automatic job retries | Preview. Datadog retries failed jobs classified as transient by its AI error model. |
| [Manual steps][20] | Manual steps | View manually triggered pipelines. |
| [Queue time][21] | Queue time | View the amount of time pipeline jobs sit in the queue before processing. |
| Logs correlation | Logs correlation | Correlate pipeline spans to logs and enable [job log collection][12]. |
@@ -426,6 +427,35 @@ You can also apply these filters using the facet panel on the left hand side of

{{< img src="ci/partial_retries_facet_panel.png" alt="The facet panel with Partial Pipeline facet expanded and the value Retry selected, the Partial Retry facet expanded and the value true selected" style="width:20%;">}}

## Automatic job retries

<div class="alert alert-info">Automatic job retries are in Preview. To request access, contact your Datadog account team.</div>

Automatic job retries save developer time by re-running failures that are likely transient, such as network timeouts, infrastructure failures, or flaky tests. Genuine code defects are not retried. Datadog runs each failed job through an AI-powered error classifier. When the failure is identified as retriable, Datadog triggers a retry through the GitLab API without manual intervention.

On GitLab, Datadog performs **smart retries**: only the specific job classified as retriable is re-run. Other failed jobs (that aren't classified as retriable) and passing jobs aren't affected.

### How it works

1. A job fails in your pipeline.
2. Datadog's AI error classifier inspects the job's logs and error context to determine whether the failure is transient.
3. If the failure is classified as retriable, Datadog requests a retry through the GitLab API as soon as the job fails. Retries are dispatched per job.
4. Datadog retries each job up to a maximum number of attempts to prevent infinite retry loops.
5. Datadog records the retry outcome on the original pipeline in CI Visibility.
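Because GitLab supports per-job retries, the dispatch step can be sketched as below. The endpoint path is GitLab's REST API for retrying a single job; the selection helper is an illustrative simplification of the smart-retry behavior described above.

```python
# On GitLab, retries are dispatched per job: only jobs the
# classifier marked as retriable are re-run. The endpoint path is
# GitLab's; the selection logic here is a simplified sketch.


def retry_job_endpoint(project_id: int, job_id: int) -> str:
    # POST https://gitlab.example.com/api/v4{path} retries one job.
    return f"/projects/{project_id}/jobs/{job_id}/retry"


def jobs_to_retry(failed_jobs):
    """failed_jobs maps job ID -> whether it was classified retriable."""
    return [job_id for job_id, retriable in failed_jobs.items() if retriable]
```

In this sketch, a pipeline with one retriable failure and one genuine defect yields a single retry call, leaving the other failed job and all passing jobs untouched.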

### Requirements

- CI Visibility enabled for your GitLab integration (see [Configure the Datadog integration](#configure-the-datadog-integration)).
- [Datadog Source Code Integration][31] configured for the repositories where you want automatic retries.
- Smart retries work with GitLab.com (SaaS) and self-hosted GitLab instances reachable by the Source Code Integration.
- Automatic job retries enabled for your organization (see the banner above for how to request access).

### Limitations

- Each logical job is retried at most one time.
- Jobs classified as non-retriable (for example, compilation errors or asserted test failures) are never retried.
- If a job has already been retried manually or by provider-native retry rules, Datadog does not issue an additional retry.

## Visualize pipeline data in Datadog

Once the integration is successfully configured, the [**CI Pipeline List**][4] and [**Executions**][5] pages populate with data after the pipelines finish.
@@ -466,3 +496,4 @@ The **CI Pipeline List** page shows data for only the default branch of each rep
[28]: /continuous_integration/guides/use_ci_jobs_failure_analysis/
[29]: /continuous_integration/guides/identify_highest_impact_jobs_with_critical_path/
[30]: /continuous_integration/guides/use_ci_jobs_failure_analysis/#using-pr-comments
[31]: /integrations/guide/source-code-integration/