diff --git a/content/en/continuous_integration/pipelines/_index.md b/content/en/continuous_integration/pipelines/_index.md index 3a760a89f1a..3cdb3f058a0 100644 --- a/content/en/continuous_integration/pipelines/_index.md +++ b/content/en/continuous_integration/pipelines/_index.md @@ -47,6 +47,7 @@ Select your CI provider to set up CI Visibility in Datadog: | {{< ci-details title="Infrastructure correlation" >}}Correlation of host-level information for the Datadog Agent, CI pipelines, or job runners to CI pipeline execution data.{{< /ci-details >}} | | | {{< X >}} | | | {{< X >}} | {{< X >}} | {{< X >}} | | | | {{< ci-details title="Running pipelines" >}}Identification of pipelines executions that are running with associated tracing.{{< /ci-details >}} | {{< X >}} | | | | | {{< X >}} | {{< X >}} | {{< X >}} | | {{< X >}} | | {{< ci-details title="Partial retries" >}}Identification of partial retries (for example, when only a subset of jobs were retried).{{< /ci-details >}} | {{< X >}} | {{< X >}} | {{< X >}} | | {{< X >}} | {{< X >}} | {{< X >}} | | {{< X >}} | {{< X >}} | +| {{< ci-details title="Automatic job retries" >}}Preview. Datadog retries failed jobs classified as transient by its AI error model.{{< /ci-details >}} | | | | | | {{< X >}} | {{< X >}} | | | | | {{< ci-details title="Step granularity" >}}Step level spans are available for more granular visibility.{{< /ci-details >}} | | | | | {{< X >}} | {{< X >}} | | {{< X >}}
(_Presented as job spans_) | | {{< X >}} | | {{< ci-details title="Manual steps" >}}Identification of when there is a job with a manual approval phase in the overall pipeline.{{< /ci-details >}} | {{< X >}} | {{< X >}} | {{< X >}} | | {{< X >}} | {{< X >}} | {{< X >}} | {{< X >}} | | {{< X >}} | diff --git a/content/en/continuous_integration/pipelines/github.md b/content/en/continuous_integration/pipelines/github.md index b6320e0ca6b..fec3e2e30b8 100644 --- a/content/en/continuous_integration/pipelines/github.md +++ b/content/en/continuous_integration/pipelines/github.md @@ -30,6 +30,7 @@ Set up CI Visibility for GitHub Actions to track the execution of your workflows | [Running pipelines][2] | Running pipelines | View pipeline executions that are running. Queued or waiting pipelines show with status "Running" on Datadog. | | [CI jobs failure analysis][23] | CI jobs failure analysis | Uses LLM models on relevant logs to analyze the root cause of failed CI jobs. | | [Partial retries][3] | Partial pipelines | View partially retried pipeline executions. | +| [Automatic job retries](#automatic-job-retries) | Automatic job retries | Preview. Datadog retries failed jobs classified as transient by its AI error model. | | Logs correlation | Logs correlation | Correlate pipeline and job spans to logs and enable [job log collection](#collect-job-logs). | | Infrastructure metric correlation | Infrastructure metric correlation | Correlate jobs to [infrastructure host metrics][4] for GitHub jobs. | | [Custom tags][5] [and measures at runtime][6] | Custom tags and measures at runtime | Configure [custom tags and measures][7] at runtime. | @@ -122,6 +123,43 @@ You can also add job failure analysis to a PR comment. See the guide on [using P For a full explanation, see the guide on [using CI jobs failure analysis][23]. +## Automatic job retries + +
Automatic job retries are in Preview. To request access, contact your Datadog account team.
+ +Automatic job retries save developer time by re-running failures that are likely transient, such as network timeouts, infrastructure failures, or flaky tests. Genuine code defects are not retried. Datadog runs each failed job through an AI-powered error classifier. When the failure is identified as retriable, Datadog triggers a retry through the GitHub Actions API without manual intervention. + +### How it works + +1. A job fails in your workflow. +2. Datadog's AI error classifier inspects the job's logs and error context to determine whether the failure is transient. +3. If the failure is classified as retriable, Datadog requests a retry through the GitHub Actions API. +4. Datadog retries each job at most once to prevent infinite retry loops. +5. Datadog records the retry outcome on the original pipeline in CI Visibility. + +### Requirements + +- CI Visibility enabled for your GitHub Actions integration (see [Configure the Datadog integration](#configure-the-datadog-integration)). +- [Datadog Source Code Integration][27] configured for the repositories where you want automatic retries. +- Automatic job retries enabled for your organization (see the banner above for how to request access). + +### GitHub-specific behavior + +GitHub Actions imposes two provider-level limitations that shape how retries work: + +- **Retries happen after the workflow finishes.** The GitHub API does not allow retrying an individual job while the rest of the workflow is still running. Datadog waits for the workflow to reach a final state before issuing retries. +- **All failed jobs are retried together.** The GitHub API does not support retrying a single job when other jobs in the workflow have also failed. Datadog reruns every failed job in the workflow through a single GitHub API call. This may increase your GitHub Actions compute usage. + +### Protected branches + +The Datadog GitHub App's default permissions do not allow retries on protected branches. 
To enable automatic retries on a protected branch (for example, your default branch), grant the app Maintainer-level access. Review your organization's policies before expanding permissions. + +### Limitations + +- Each logical job is retried at most one time. +- Jobs classified as non-retriable (for example, compilation errors or asserted test failures) are never retried. +- If a job has already been retried manually or by provider-native retry rules, Datadog does not issue an additional retry. + ## Visualize pipeline data in Datadog The [**CI Pipeline List**][17] and [**Executions**][18] pages populate with data after the pipelines finish. @@ -158,3 +196,4 @@ The **CI Pipeline List** page shows data for only the default branch of each rep [24]: /continuous_integration/guides/identify_highest_impact_jobs_with_critical_path/ [25]: /glossary/#pipeline-execution-time [26]: /continuous_integration/guides/use_ci_jobs_failure_analysis/#using-pr-comments +[27]: /integrations/guide/source-code-integration/ diff --git a/content/en/continuous_integration/pipelines/gitlab.md b/content/en/continuous_integration/pipelines/gitlab.md index df397f6232b..03cc47c5fb7 100644 --- a/content/en/continuous_integration/pipelines/gitlab.md +++ b/content/en/continuous_integration/pipelines/gitlab.md @@ -28,6 +28,7 @@ Set up CI Visibility for GitLab to collect data on your pipeline executions, ana | [CI jobs failure analysis][28] | CI jobs failure analysis | Uses LLM models on relevant logs to analyze the root cause of failed CI jobs. | | [Filter CI Jobs on the critical path][29] | Filter CI Jobs on the critical path | Filter by jobs on the critical path. | | [Partial retries][19] | Partial pipelines | View partially retried pipeline executions. | +| [Automatic job retries](#automatic-job-retries) | Automatic job retries | Preview. Datadog retries failed jobs classified as transient by its AI error model. | | [Manual steps][20] | Manual steps | View manually triggered pipelines. 
| | [Queue time][21] | Queue time | View the amount of time pipeline jobs sit in the queue before processing. | | Logs correlation | Logs correlation | Correlate pipeline spans to logs and enable [job log collection][12]. | @@ -426,6 +427,35 @@ You can also apply these filters using the facet panel on the left hand side of {{< img src="ci/partial_retries_facet_panel.png" alt="The facet panel with Partial Pipeline facet expanded and the value Retry selected, the Partial Retry facet expanded and the value true selected" style="width:20%;">}} +## Automatic job retries + +
Automatic job retries are in Preview. To request access, contact your Datadog account team.
+ +Automatic job retries save developer time by re-running failures that are likely transient, such as network timeouts, infrastructure failures, or flaky tests. Genuine code defects are not retried. Datadog runs each failed job through an AI-powered error classifier. When the failure is identified as retriable, Datadog triggers a retry through the GitLab API without manual intervention. + +On GitLab, Datadog performs **smart retries**: only the specific job classified as retriable is re-run. Other failed jobs (that aren't classified as retriable) and passing jobs aren't affected. + +### How it works + +1. A job fails in your pipeline. +2. Datadog's AI error classifier inspects the job's logs and error context to determine whether the failure is transient. +3. If the failure is classified as retriable, Datadog requests a retry through the GitLab API as soon as the job fails. Retries are dispatched per job. +4. Datadog retries each job at most once to prevent infinite retry loops. +5. Datadog records the retry outcome on the original pipeline in CI Visibility. + +### Requirements + +- CI Visibility enabled for your GitLab integration (see [Configure the Datadog integration](#configure-the-datadog-integration)). +- [Datadog Source Code Integration][31] configured for the repositories where you want automatic retries. +- Smart retries work with GitLab.com (SaaS) and self-hosted GitLab instances reachable by the Source Code Integration. +- Automatic job retries enabled for your organization (see the banner above for how to request access). + +### Limitations + +- Each logical job is retried at most one time. +- Jobs classified as non-retriable (for example, compilation errors or asserted test failures) are never retried. +- If a job has already been retried manually or by provider-native retry rules, Datadog does not issue an additional retry. 
+ ## Visualize pipeline data in Datadog Once the integration is successfully configured, the [**CI Pipeline List**][4] and [**Executions**][5] pages populate with data after the pipelines finish. @@ -466,3 +496,4 @@ The **CI Pipeline List** page shows data for only the default branch of each rep [28]: /continuous_integration/guides/use_ci_jobs_failure_analysis/ [29]: /continuous_integration/guides/identify_highest_impact_jobs_with_critical_path/ [30]: /continuous_integration/guides/use_ci_jobs_failure_analysis/#using-pr-comments +[31]: /integrations/guide/source-code-integration/
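The eligibility rules in the "Limitations" lists added by this diff can be read as a small per-job decision function. The sketch below is illustrative only and is not Datadog's implementation: `FailedJob`, its fields, `MAX_DATADOG_RETRIES`, and `should_retry` are hypothetical names, and the classifier verdict is assumed to be available as a boolean.

```python
from dataclasses import dataclass


@dataclass
class FailedJob:
    """Hypothetical record of a failed CI job; field names are illustrative."""
    name: str
    classified_transient: bool        # assumed verdict of the AI error classifier
    datadog_retries: int = 0          # retries already issued by Datadog
    retried_elsewhere: bool = False   # retried manually or by provider-native rules


# Assumption taken from "Each logical job is retried at most one time."
MAX_DATADOG_RETRIES = 1


def should_retry(job: FailedJob) -> bool:
    """Apply the three documented limitations, in order."""
    if not job.classified_transient:
        # Non-retriable failures (for example, compilation errors) are never retried.
        return False
    if job.retried_elsewhere:
        # A manual or provider-native retry already happened; do not add another.
        return False
    # Cap Datadog-issued retries so a job cannot loop.
    return job.datadog_retries < MAX_DATADOG_RETRIES


if __name__ == "__main__":
    jobs = [
        FailedJob("network-timeout", classified_transient=True),
        FailedJob("compile-error", classified_transient=False),
        FailedJob("already-retried", classified_transient=True, datadog_retries=1),
    ]
    for job in jobs:
        print(job.name, should_retry(job))
```

This check matches the per-job dispatch described for GitLab. On GitHub, the diff states that every failed job is rerun together after the workflow finishes, so a per-job check like this one only approximates the behavior there.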