diff --git a/config/_default/menus/main.en.yaml b/config/_default/menus/main.en.yaml
index 7f6d2515649..3c1afb6fe39 100644
--- a/config/_default/menus/main.en.yaml
+++ b/config/_default/menus/main.en.yaml
@@ -5416,6 +5416,27 @@ menu:
         parent: llm_obs
         identifier: llm_obs_guide
         weight: 9
+      - name: GPU Monitoring
+        url: gpu_monitoring/
+        pre: gpu-monitoring-wui
+        identifier: gpu_monitoring
+        parent: ai_observability_heading
+        weight: 20000
+      - name: Setup
+        url: gpu_monitoring/setup/
+        parent: gpu_monitoring
+        identifier: gpu_monitoring_setup
+        weight: 1
+      - name: Summary Page
+        url: gpu_monitoring/summary/
+        parent: gpu_monitoring
+        identifier: gpu_monitoring_summary
+        weight: 2
+      - name: Fleet Page
+        url: gpu_monitoring/fleet/
+        parent: gpu_monitoring
+        identifier: gpu_monitoring_fleet
+        weight: 3
       - name: CI Visibility
         url: continuous_integration/
         pre: ci
diff --git a/content/en/gpu_monitoring/_index.md b/content/en/gpu_monitoring/_index.md
index f0c4504a043..8f6a557df38 100644
--- a/content/en/gpu_monitoring/_index.md
+++ b/content/en/gpu_monitoring/_index.md
@@ -1,6 +1,5 @@
 ---
 title: GPU Monitoring
-private: true
 further_reading:
 - link: "/gpu_monitoring/setup"
   tag: "Documentation"
@@ -16,10 +15,6 @@ further_reading:
 {{% /site-region %}}
 
-{{< callout url="https://www.datadoghq.com/product-preview/gpu-monitoring/" >}}
-GPU Monitoring is in Preview. To join the preview, click Request Access and complete the form.
-{{< /callout >}}
-
 ## Overview
 
 Datadog's [GPU Monitoring][1] provides a centralized view into your GPU fleet's health, cost, and performance. It enables teams to make better provisioning decisions, optimize and troubleshoot AI workload performance, and eliminate idle GPU costs without having to manually set up individual vendor tools (like NVIDIA's DCGM).
 GPU Monitoring supports fleets deployed across the major cloud providers (AWS, GCP, Azure, Oracle Cloud), hosted on-premises, or provisioned through GPU-as-a-Service platforms like CoreWeave and Lambda Labs.
diff --git a/content/en/gpu_monitoring/fleet.md b/content/en/gpu_monitoring/fleet.md
index a00de9d253a..97b70abd42d 100644
--- a/content/en/gpu_monitoring/fleet.md
+++ b/content/en/gpu_monitoring/fleet.md
@@ -1,7 +1,6 @@
 ---
 title: GPU Monitoring Fleet Page
 description: "An inventory of all your GPU-accelerated hosts that helps you diagnose performance issues."
-private: true
 further_reading:
 - link: "https://www.datadoghq.com/blog/datadog-gpu-monitoring/"
   tag: "Blog"
diff --git a/content/en/gpu_monitoring/setup.md b/content/en/gpu_monitoring/setup.md
index 5d0847d1928..fe2077aa226 100644
--- a/content/en/gpu_monitoring/setup.md
+++ b/content/en/gpu_monitoring/setup.md
@@ -1,6 +1,15 @@
 ---
 title: Set up GPU Monitoring
-private: true
+further_reading:
+- link: "/gpu_monitoring/summary"
+  tag: "Documentation"
+  text: "GPU Monitoring Summary Page"
+- link: "/gpu_monitoring/fleet"
+  tag: "Documentation"
+  text: "GPU Monitoring Fleet Page"
+- link: "https://www.datadoghq.com/blog/datadog-gpu-monitoring/"
+  tag: "Blog"
+  text: "Optimize and troubleshoot AI infrastructure with Datadog GPU Monitoring"
 ---
 
 This page provides instructions on setting up Datadog's GPU Monitoring on your infrastructure. Follow the configuration instructions that match your operating environment below.
@@ -28,7 +37,14 @@
 If using Kubernetes, the following additional requirements must be met:
 - [**Datadog Operator**][5]: version 1.18, _or_ [**Datadog Helm chart**][6]: version 3.137.3
 - **Kubernetes**: 1.22 with PodResources API active
 
-## Set up GPU Monitoring on a non-Kubernetes environment or uniform Kubernetes cluster
+## Setting up GPU Monitoring
+
+Configuring GPU Monitoring does not require DCGM. You must opt in to the collection of GPU Monitoring metrics in the Agent.
+Setup depends on your environment: a non-Kubernetes environment, a uniform Kubernetes cluster (where every node has GPUs), or a mixed Kubernetes cluster.
+
+After you've enabled the collection of GPU Monitoring metrics, you can enable several integrations for more advanced insights:
+- For cloud cost and cloud instance-type information, enable the [AWS][9], [Google Cloud][10], [Azure][11], or [Oracle][12] cloud integrations in your Datadog UI.
+- For process-level insights, set up Datadog's [Live Processes][13].
+
+### Set up GPU Monitoring on a non-Kubernetes environment or uniform Kubernetes cluster
 
 The following instructions are the basic steps to set up GPU Monitoring in the following environments:
 - In a Kubernetes cluster where **all** nodes have GPU devices
@@ -36,7 +52,7 @@ The following instructions are the basic steps to set up GPU Monitoring in the f
 {{< tabs >}}
 {{% tab "Datadog Operator" %}}
 
-1. Ensure that the [latest version of the Datadog Agent][2] is [installed and deployed][1] on every GPU host you wish to monitor.
+1. Make sure that the [latest version of the Datadog Agent][2] is [installed and deployed][1] on every GPU host you wish to monitor.
 
 2. Modify your `DatadogAgent` resource with the following parameters:
 
    `gpu.enabled: true`
@@ -46,7 +62,7 @@ The following instructions are the basic steps to set up GPU Monitoring in the f
    : _Optional_. Enables advanced eBPF metrics, such as GPU core utilization (`gpu.process.core.usage`).
 
    `gpu.patchCgroupPermissions: true`
-   : _Only for GKE_. Enables a code path in `system-probe` that ensures the Agent can access GPU devices.
+   : _Only for GKE_. Enables a code path in `system-probe` that helps the Agent access GPU devices.
 
    `gpu.requiredRuntimeClassName:`
    : _Optional_. Specifies the container runtime for pods that need access to GPU devices, for example: `nvidia`, `nvidia-cdi`, `nvidia-legacy`. The default value is `nvidia`, as that is the default runtime defined by the NVIDIA GPU Operator.
In EKS and Oracle Cloud, this value should be set to the empty string as the default runtime class already allows GPU device access.
@@ -75,7 +91,7 @@ The following instructions are the basic steps to set up GPU Monitoring in the f
 {{% /tab %}}
 {{% tab "Helm" %}}
 
-1. Ensure that the [latest version of the Datadog Agent][2] is [installed and deployed][1] on every GPU host you wish to monitor.
+1. Make sure that the [latest version of the Datadog Agent][2] is [installed and deployed][1] on every GPU host you wish to monitor.
 
 2. Modify your `datadog-values.yaml` configuration file with the following parameters:
 
@@ -86,7 +102,7 @@ The following instructions are the basic steps to set up GPU Monitoring in the f
    : _Optional_. Enables advanced eBPF metrics, such as GPU core utilization (`gpu.process.core.usage`).
 
    `gpuMonitoring.configureCgroupPerms: true`
-   : _Only for GKE_. Enables a code path in `system-probe` that ensures the Agent can access GPU devices.
+   : _Only for GKE_. Enables a code path in `system-probe` that helps the Agent access GPU devices.
 
    `gpuMonitoring.runtimeClassName:`
    : _Optional_. Specifies the container runtime for pods that need access to GPU devices, for example: `nvidia`, `nvidia-cdi`, `nvidia-legacy`. The default value is `nvidia`, as that is the default runtime defined by the NVIDIA GPU Operator. In EKS and Oracle Cloud, this value should be set to the empty string as the default runtime class already allows GPU device access.
@@ -262,14 +278,14 @@ Additionally, to enable advanced eBPF-based metrics such as GPU core utilization
 ## Set up GPU Monitoring on a mixed Kubernetes cluster
 
-In a mixed Kubernetes cluster, some nodes have GPU devices while other nodes do not. Two separate DaemonSets are required (one for the runtime class in GPU nodes, and another for non-GPU nodes) due to runtime class requirements for the NVIDIA device plugin for Kubernetes.
+In a mixed Kubernetes cluster, some nodes have GPU devices while others do not.
+Two separate DaemonSets are required: one for the runtime class on GPU nodes, and another for non-GPU nodes. This split is due to runtime class requirements for the NVIDIA device plugin for Kubernetes.
 
-The recommended method to set up the Agent in this case is using the Datadog Operator, version 1.20 or greater, which provides features to make this setup easier. However, for compatibility reasons instructions are also provided for Helm installations or for older versions of the Datadog Operator.
+The recommended method is the Datadog Operator, version 1.20 or greater, which provides features that make this setup easier. For compatibility, instructions are also provided for Helm installations and older Datadog Operator versions.
 
 {{< tabs >}}
 {{% tab "Datadog Operator (1.20 or greater)" %}}
 
-To set up GPU Monitoring on a mixed cluster with the Datadog Operator, use the Operator's [Agent Profiles][2] feature to selectively enable GPU Monitoring only on nodes with GPUs.
+To set up GPU Monitoring on a mixed cluster, use the Operator's [Agent Profiles][2] feature. This selectively enables GPU Monitoring only on nodes with GPUs.
 
 1. Configure the Datadog Operator to enable the Datadog Agent Profile feature in the DatadogAgentInternal mode.
 
@@ -299,7 +315,7 @@ To set up GPU Monitoring on a mixed cluster with the Datadog Operator, use the O
 2. Modify your `DatadogAgent` resource with the following changes:
 
    1. Add the `agent.datadoghq.com/update-metadata` annotation to the `DatadogAgent` resource.
-   2. If advanced eBPF metrics are wanted, ensure at least one system-probe feature is enabled. Examples of system-probe features are `npm`, `cws`, `usm`. If none is enabled, the `oomKill` feature can be enabled.
+   2. If you want advanced eBPF metrics, verify that at least one system-probe feature (such as `npm`, `cws`, or `usm`) is enabled. If none is enabled, you can enable the `oomKill` feature.
      The additions to the `datadog-agent.yaml` file should look like this:
@@ -353,9 +369,9 @@ To set up GPU Monitoring on a mixed cluster with the Datadog Operator, use the O
 
 {{% /tab %}}
 {{% tab "Datadog Operator (1.18 or 1.19)" %}}
 
-To set up GPU Monitoring on a mixed cluster with the Datadog Operator, use the Operator's [Agent Profiles][2] feature to selectively enable GPU Monitoring only on nodes with GPUs.
+To set up GPU Monitoring on a mixed cluster, use the Operator's [Agent Profiles][2] feature. This selectively enables GPU Monitoring only on nodes with GPUs.
 
-1. Ensure that the [latest version of the Datadog Agent][4] is [installed and deployed][1] on every GPU host you wish to monitor.
+1. Make sure that the [latest version of the Datadog Agent][4] is [installed and deployed][1] on every GPU host you wish to monitor.
 
 2. Modify your `DatadogAgent` resource with the following changes:
 
@@ -449,9 +465,9 @@ To set up GPU Monitoring on a mixed cluster with the Datadog Operator, use the O
 
 {{% tab "Helm" %}}
 
-To set up GPU Monitoring on a mixed cluster with Helm, create two different Helm deployments: one deployment for GPU nodes, and one deployment for non-GPU nodes.
+To set up GPU Monitoring on a mixed cluster with Helm, create two Helm deployments. One deployment targets GPU nodes, and the other targets non-GPU nodes.
 
-1. Ensure that the [latest version of the Datadog Agent][3] is [installed and deployed][1] on every GPU host you wish to monitor.
+1. Make sure that the [latest version of the Datadog Agent][3] is [installed and deployed][1] on every GPU host you wish to monitor.
 
 2. Modify your `datadog-values.yaml` configuration file to target non-GPU nodes.
 
@@ -470,7 +486,7 @@ To set up GPU Monitoring on a mixed cluster with Helm, create two different Helm
            - "true"
    ```
 
-3. Create a new values file, `datadog-gpu-values.yaml`. Configure this file to:
+3. Create a values file, `datadog-gpu-values.yaml`.
+   Configure this file to:
   - Target only GPU nodes
   - [Join existing Datadog Cluster Agents][2]
   - Enable GPU Monitoring with the following parameters:
 
@@ -482,7 +498,7 @@ To set up GPU Monitoring on a mixed cluster with Helm, create two different Helm
    : _Optional_. Enables advanced eBPF metrics, such as GPU core utilization (`gpu.process.core.usage`).
 
    `gpuMonitoring.configureCgroupPerms: true`
-   : _Only for GKE_. Enables a code path in `system-probe` that ensures the Agent can access GPU devices.
+   : _Only for GKE_. Enables a code path in `system-probe` that helps the Agent access GPU devices.
 
    `gpuMonitoring.runtimeClassName:`
    : _Optional_. Specifies the container runtime for pods that need access to GPU devices, for example: `nvidia`, `nvidia-cdi`, `nvidia-legacy`. The default value is `nvidia`, as that is the default runtime defined by the NVIDIA GPU Operator. In EKS and Oracle Cloud, this value should be set to the empty string as the default runtime class already allows GPU device access.
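Review note: the `gpuMonitoring.*` parameters listed for `datadog-gpu-values.yaml` could be illustrated with a short sketch. The key nesting under `datadog:`, the `nvidia.com/gpu.present` node label, and the `existingClusterAgent` join settings below are assumptions for illustration, not part of this PR — verify them against the chart version in use before adopting anything like this in the docs.

```yaml
# datadog-gpu-values.yaml -- illustrative sketch only; key nesting, the node
# label, and the Cluster Agent join settings are assumptions to verify against
# your Helm chart version (3.137.3 or later) and your cluster's GPU node labels.
datadog:
  gpuMonitoring:
    enabled: true               # opt in to GPU Monitoring metrics
    configureCgroupPerms: true  # GKE only: lets system-probe reach GPU devices
    runtimeClassName: "nvidia"  # default; use "" on EKS and Oracle Cloud
agents:
  nodeSelector:
    nvidia.com/gpu.present: "true"  # assumed label: run this DaemonSet on GPU nodes only
existingClusterAgent:
  join: true                    # join the Cluster Agent from the non-GPU deployment
```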
@@ -550,3 +566,8 @@ To set up GPU Monitoring on a mixed cluster with Helm, create two different Helm
 [6]: https://github.com/DataDog/helm-charts/blob/main/charts/datadog/README.md
 [7]: /containers/docker/
 [8]: /agent/supported_platforms/linux/
+[9]: /getting_started/integrations/aws/
+[10]: /getting_started/integrations/google_cloud/?tab=orglevel
+[11]: /getting_started/integrations/azure/?tab=createanappregistration
+[12]: /getting_started/integrations/oci/
+[13]: /infrastructure/process?tab=linuxwindows#installation
diff --git a/content/en/gpu_monitoring/summary.md b/content/en/gpu_monitoring/summary.md
index 837735b8ce4..cdcc3066616 100644
--- a/content/en/gpu_monitoring/summary.md
+++ b/content/en/gpu_monitoring/summary.md
@@ -1,6 +1,5 @@
 ---
 title: GPU Monitoring Summary Page
-private: true
 description: "Real-time insights across your entire GPU fleet for better provisioning and cost optimization"
 further_reading:
 - link: "https://www.datadoghq.com/blog/datadog-gpu-monitoring/"
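Review note: for the Operator path, the `gpu.*` parameters described in the setup page sit on the `DatadogAgent` resource. A minimal sketch follows; the `spec.features.gpu` nesting is an assumption based on how other Agent features are configured on the v2alpha1 CRD, so verify it against the Operator version (1.18 or later) before citing it in the docs.

```yaml
# datadog-agent.yaml -- illustrative sketch only; the spec.features.gpu
# location is an assumption to verify against your Datadog Operator version.
apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  features:
    gpu:
      enabled: true                       # opt in to GPU Monitoring metrics
      patchCgroupPermissions: true        # GKE only: lets system-probe reach GPU devices
      requiredRuntimeClassName: "nvidia"  # default; use "" on EKS and Oracle Cloud
```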