diff --git a/content/en/continuous_integration/pipelines/_index.md b/content/en/continuous_integration/pipelines/_index.md
index 3a760a89f1a..4155c8b33dc 100644
--- a/content/en/continuous_integration/pipelines/_index.md
+++ b/content/en/continuous_integration/pipelines/_index.md
@@ -47,6 +47,7 @@ Select your CI provider to set up CI Visibility in Datadog:
 | {{< ci-details title="Infrastructure correlation" >}}Correlation of host-level information for the Datadog Agent, CI pipelines, or job runners to CI pipeline execution data.{{< /ci-details >}} | | | {{< X >}} | | | {{< X >}} | {{< X >}} | {{< X >}} | | |
 | {{< ci-details title="Running pipelines" >}}Identification of pipelines executions that are running with associated tracing.{{< /ci-details >}} | {{< X >}} | | | | | {{< X >}} | {{< X >}} | {{< X >}} | | {{< X >}} |
 | {{< ci-details title="Partial retries" >}}Identification of partial retries (for example, when only a subset of jobs were retried).{{< /ci-details >}} | {{< X >}} | {{< X >}} | {{< X >}} | | {{< X >}} | {{< X >}} | {{< X >}} | | {{< X >}} | {{< X >}} |
+| {{< ci-details title="Automatic job retries" >}}Preview. Datadog retries failed jobs classified as transient by its AI error model. More info.{{< /ci-details >}} | | | | | | {{< X >}} | {{< X >}} | | | |
 | {{< ci-details title="Step granularity" >}}Step level spans are available for more granular visibility.{{< /ci-details >}} | | | | | {{< X >}} | {{< X >}} | | {{< X >}} (_Presented as job spans_) | | {{< X >}} |
 | {{< ci-details title="Manual steps" >}}Identification of when there is a job with a manual approval phase in the overall pipeline.{{< /ci-details >}} | {{< X >}} | {{< X >}} | {{< X >}} | | {{< X >}} | {{< X >}} | {{< X >}} | {{< X >}} | | {{< X >}} |
diff --git a/content/en/continuous_integration/pipelines/automatic_retries.md b/content/en/continuous_integration/pipelines/automatic_retries.md
new file mode 100644
index 00000000000..09e94b80045
--- /dev/null
+++ b/content/en/continuous_integration/pipelines/automatic_retries.md
@@ -0,0 +1,78 @@
+---
+title: Automatic Job Retries
+further_reading:
+  - link: "/continuous_integration/pipelines"
+    tag: "Documentation"
+    text: "Explore Pipeline Execution Results and Performance"
+  - link: "/continuous_integration/pipelines/github/"
+    tag: "Documentation"
+    text: "Set up CI Visibility for GitHub Actions"
+  - link: "/continuous_integration/pipelines/gitlab/"
+    tag: "Documentation"
+    text: "Set up CI Visibility for GitLab"
+  - link: "/continuous_integration/troubleshooting/"
+    tag: "Documentation"
+    text: "Troubleshooting CI Visibility"
+---
+Automatic job retries are in Preview. To request access, contact your Datadog account team.
+
+## Overview
+
+Automatic job retries save developer time by re-running jobs whose failures are likely transient, such as network timeouts, infrastructure failures, or flaky tests. Genuine code defects are not retried. Datadog runs each failed job through an AI-powered error classifier. When the failure is identified as retriable, Datadog triggers a retry through the CI provider's API without manual intervention.
+
+Automatic retries reduce the number of pipelines that developers re-run by hand, shorten feedback loops, and keep pipeline success metrics focused on non-transient failures.
+
+## How it works
+
+1. A CI job fails in your pipeline.
+2. Datadog's AI error classifier inspects the job's logs and error context to determine whether the failure is transient.
+3. If the failure is classified as retriable, Datadog requests a retry through the provider's API.
+4. Datadog retries each job at most once, which prevents retry loops.
+5. Datadog records the retry outcome on the original pipeline in CI Visibility.
+
+## Requirements
+
+- CI Visibility enabled for your [GitHub Actions][1] or [GitLab][2] integration.
+- [Datadog Source Code Integration][3] configured for the repositories where you want automatic retries.
+- Automatic job retries enabled for your organization (see the banner above for how to request access).
+
+## Provider-specific behavior
+
+{{< tabs >}}
+{{% tab "GitLab" %}}
+
+Datadog performs **smart retries** on GitLab: only the specific job classified as retriable is re-run. Passing jobs and failed jobs that aren't classified as retriable aren't affected.
+
+- Retries are triggered per job, as soon as the job fails.
+- Smart retries work with GitLab.com (SaaS) and self-hosted GitLab instances reachable by the Datadog Source Code Integration.
+- There is no additional CI cost beyond the retried job.
+
+{{% /tab %}}
+{{% tab "GitHub Actions" %}}
+
+GitHub Actions imposes two provider-level limitations that shape how retries work:
+
+- **Retries happen after the workflow finishes.** The GitHub API does not allow retrying an individual job while the rest of the workflow is still running. Datadog waits for the workflow to reach a final state before issuing retries.
+- **All failed jobs are retried together.** The GitHub API does not support retrying a single job when other jobs in the workflow have also failed. Datadog reruns every failed job in the workflow through a single GitHub API call. This may increase your GitHub Actions compute usage.
+
+### Protected branches
+
+The Datadog GitHub App's default permissions do not allow retries on protected branches. To enable automatic retries on a protected branch (for example, your default branch), grant the app Maintainer-level access. Review your organization's policies before expanding permissions.
+
+{{% /tab %}}
+{{< /tabs >}}
+
+## Limitations
+
+- Each logical job is retried at most once.
+- Jobs classified as non-retriable (for example, compilation errors or asserted test failures) are never retried.
+- If a job has already been retried manually or by provider-native retry rules, Datadog does not issue an additional retry.
+
+## Further reading
+
+{{< partial name="whats-next/whats-next.html" >}}
+
+[1]: /continuous_integration/pipelines/github/
+[2]: /continuous_integration/pipelines/gitlab/
+[3]: /integrations/guide/source-code-integration/