5 changes: 0 additions & 5 deletions content/en/gpu_monitoring/_index.md
@@ -1,6 +1,5 @@
---
title: GPU Monitoring
private: true
further_reading:
- link: "/gpu_monitoring/setup"
tag: "Documentation"
@@ -16,10 +15,6 @@ further_reading:
</div>
{{% /site-region %}}

{{< callout url="https://www.datadoghq.com/product-preview/gpu-monitoring/" >}}
GPU Monitoring is in Preview. To join the preview, click <strong>Request Access</strong> and complete the form.
{{< /callout >}}

## Overview
Datadog's [GPU Monitoring][1] provides a centralized view into your GPU fleet's health, cost, and performance. It enables teams to make better provisioning decisions, optimize and troubleshoot AI workload performance, and eliminate idle GPU costs without having to manually set up individual vendor tools (like NVIDIA's DCGM). GPU Monitoring supports fleets deployed across the major cloud providers (AWS, GCP, Azure, Oracle Cloud), hosted on-premises, or provisioned through GPU-as-a-Service platforms like Coreweave and Lambda Labs.

1 change: 0 additions & 1 deletion content/en/gpu_monitoring/fleet.md
@@ -1,7 +1,6 @@
---
title: GPU Monitoring Fleet Page
description: "An inventory of all your GPU-accelerated hosts that helps you diagnose performance issues."
private: true
further_reading:
- link: "https://www.datadoghq.com/blog/datadog-gpu-monitoring/"
tag: "Blog"
15 changes: 13 additions & 2 deletions content/en/gpu_monitoring/setup.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,5 @@
---
title: Set up GPU Monitoring
private: true
---
This page provides instructions on setting up Datadog's GPU Monitoring on your infrastructure. Follow the configuration instructions that match your operating environment below.

@@ -28,7 +27,14 @@
- [**Datadog Operator**][5]: version 1.18, _or_ [**Datadog Helm chart**][6]: version 3.137.3
- **Kubernetes**: 1.22 with PodResources API active

## Set up GPU Monitoring on a non-Kubernetes environment or uniform Kubernetes cluster
## Setting up GPU Monitoring
Setting up GPU Monitoring does not require DCGM at all. You will need to opt-in to the collection of GPU Monitoring metrics at the Agent depending on your environment: non-Kubernetes/uniform Kubernetes cluster or mixed cluster.

Check warning on line 31 in content/en/gpu_monitoring/setup.md
GitHub Actions / vale: Datadog.tense
Avoid temporal words like 'will'.

Check notice on line 31 in content/en/gpu_monitoring/setup.md
GitHub Actions / vale: Datadog.sentencelength
Suggestion: Try to keep your sentence length to 25 words or fewer.

Contributor

Suggested change
Setting up GPU Monitoring does not require DCGM at all. You will need to opt-in to the collection of GPU Monitoring metrics at the Agent depending on your environment: non-Kubernetes/uniform Kubernetes cluster or mixed cluster.
Configuring GPU Monitoring does not require DCGM. You need to opt in to the collection of GPU Monitoring metrics at the Agent, depending on your environment: a non-Kubernetes environment or uniform Kubernetes cluster, or a mixed cluster.


Once you've enabled the collection of GPU Monitoring metrics, you can opt-in to enabling several integrations for more advanced insights:

Check warning on line 33 in content/en/gpu_monitoring/setup.md
GitHub Actions / vale: Datadog.words_case_sensitive
Use 'After' instead of 'Once'.

Check warning on line 33 in content/en/gpu_monitoring/setup.md
GitHub Actions / vale: Datadog.words_case_insensitive
Use 'after you' instead of 'Once you'.

Contributor

Suggested change
Once you've enabled the collection of GPU Monitoring metrics, you can opt-in to enabling several integrations for more advanced insights:
After you've enabled the collection of GPU Monitoring metrics, you can opt in to several integrations for more advanced insights:

- For cloud costs and cloud instance-type information, enable the [AWS][9], [Google Cloud][10], [Azure][11], or [Oracle][12] cloud integrations in your Datadog UI.
- For process-level insights, set up Datadog's [Live Processes][13].

### Set up GPU Monitoring on a non-Kubernetes environment or uniform Kubernetes cluster

The following instructions are the basic steps to set up GPU Monitoring in the following environments:
- In a Kubernetes cluster where **all** nodes have GPU devices
@@ -550,3 +556,8 @@
[6]: https://github.com/DataDog/helm-charts/blob/main/charts/datadog/README.md
[7]: /containers/docker/
[8]: /agent/supported_platforms/linux/
[9]: https://docs.datadoghq.com/getting_started/integrations/aws/
Contributor

Suggested change
[9]: https://docs.datadoghq.com/getting_started/integrations/aws/
[9]: /getting_started/integrations/aws/

[10]:https://docs.datadoghq.com/getting_started/integrations/google_cloud/?tab=orglevel
Contributor

Suggested change
[10]:https://docs.datadoghq.com/getting_started/integrations/google_cloud/?tab=orglevel
[10]: /getting_started/integrations/google_cloud/?tab=orglevel

[11]:https://docs.datadoghq.com/getting_started/integrations/azure/?tab=createanappregistration
Contributor

Suggested change
[11]:https://docs.datadoghq.com/getting_started/integrations/azure/?tab=createanappregistration
[11]: /getting_started/integrations/azure/?tab=createanappregistration

[12]:https://docs.datadoghq.com/getting_started/integrations/oci/
Contributor

Suggested change
[12]:https://docs.datadoghq.com/getting_started/integrations/oci/
[12]: /getting_started/integrations/oci/

[13]:https://docs.datadoghq.com/infrastructure/process?tab=linuxwindows#installation
Contributor

Suggested change
[13]:https://docs.datadoghq.com/infrastructure/process?tab=linuxwindows#installation
[13]: /infrastructure/process?tab=linuxwindows#installation
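
The five suggested changes on references [9] through [13] apply the same rewrite: drop the `https://docs.datadoghq.com` prefix and leave one space after the colon. The following is a minimal sketch of that rewrite, not a tool that is part of this PR or of the docs repository. The function name `relativize_ref` and the regular expression are hypothetical, and the sketch assumes only reference definitions pointing at `docs.datadoghq.com` should change.

```python
import re

# Matches a Markdown reference-link definition that points at the docs site,
# for example "[10]:https://docs.datadoghq.com/some/path?tab=x".
# Assumption: links to any other host must be left exactly as written.
REF_DEF = re.compile(
    r"^\[(?P<label>[^\]]+)\]:\s*https://docs\.datadoghq\.com(?P<path>/\S*)\s*$"
)


def relativize_ref(line: str) -> str:
    """Rewrite an absolute docs reference definition to a site-relative one.

    Lines that are not docs.datadoghq.com reference definitions are returned
    unchanged. A matching line is returned without trailing whitespace.
    """
    match = REF_DEF.match(line)
    if match is None:
        return line
    return f"[{match.group('label')}]: {match.group('path')}"


if __name__ == "__main__":
    for original in (
        "[9]: https://docs.datadoghq.com/getting_started/integrations/aws/",
        "[12]:https://docs.datadoghq.com/getting_started/integrations/oci/",
        "[7]: /containers/docker/",
    ):
        print(relativize_ref(original))
```

Query strings and anchors (such as `?tab=orglevel` or `#installation`) are kept because they are part of the captured path, which matches what the suggested changes do.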

1 change: 0 additions & 1 deletion content/en/gpu_monitoring/summary.md
@@ -1,6 +1,5 @@
---
title: GPU Monitoring Summary Page
private: true
description: "Real-time insights across your entire GPU fleet for better provisioning and cost optimization"
further_reading:
- link: "https://www.datadoghq.com/blog/datadog-gpu-monitoring/"