@mresvanis mresvanis commented Jan 15, 2026

Description

This PR adds Fabric Manager configuration support for vm-passthrough workloads with Shared NVSwitch virtualization mode.

It enables users to configure Fabric Manager modes through the ClusterPolicy CRD, providing better support for NVIDIA multi-GPU systems in virtualized environments.

Changes

  • API Extensions: Added FabricManagerSpec to the ClusterPolicy CRD with support for two modes:
    • full-passthrough (FABRIC_MODE=0) - default mode.
    • shared-nvswitch (FABRIC_MODE=1) - shared NVSwitch virtualization mode.
  • Controller Logic: Implemented validation and state management for Fabric Manager configurations:
    • Added validation to ensure the driver is enabled when using vm-passthrough with shared NVSwitch mode.
    • Integrated Fabric Manager configuration checks into the state manager workflow.
  • Driver State Management: Enhanced driver state handling to support Fabric Manager configuration:
    • Added logic to detect and handle Fabric Manager shared NVSwitch mode.
    • Updated driver startup behavior for vm-passthrough scenarios.
  • Configuration Updates: Adjusted the driver startup probe to accommodate Fabric Manager requirements in vm-passthrough with shared NVSwitch mode.
  • CRD Updates: Updated all CRD manifests across bundle, config, and deployment directories to include the new Fabric Manager configuration fields.

Checklist

  • No secrets, sensitive information, or unrelated changes
  • Lint checks passing (make lint)
  • Generated assets in-sync (make validate-generated-assets)
  • Go mod artifacts in-sync (make validate-modules)
  • Test cases are added for new code paths

Testing

TBD

copy-pr-bot bot commented Jan 15, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@LandonTClipp

How coincidental that I resolved to implement something like this and 2 hours ago you submitted this draft!

I want to ask what the plan is for the CDI side. Ideally, Fabric Manager could be spawned as a Kata container, which means we need to inject the NVSwitch VFIO cdevs just as we do for passthrough GPUs. When I tried to use the GPU Operator a few months ago, this was simply not possible, so I used libvirt instead. Does the GPU Operator's CDI support already expose the NVSwitches to k8s now? I apologize if my knowledge is a little out of date.
