Commit c68659d

Enhanced CSI driver readiness validation with comprehensive LNet health checks
- Add CSI-compliant external liveness probe sidecars to both controller and node deployments
- Implement comprehensive LNet validation including NIDs, self-ping, and interface checks
- Separate health endpoints: /healthz (readiness) and /livez (liveness) on dedicated ports
- Controller uses port 29762, Node uses port 29763 for consistent internal communication
- Enhanced validation functions: hasValidLNetNIDs(), lnetSelfPingWorks(), lnetInterfacesOperational()
- Early health server startup for immediate status availability
- Maintain CSI community standards while providing Lustre-specific health validation

Hybrid approach provides both:

- Standard CSI external liveness probe monitoring gRPC endpoints
- Enhanced HTTP health endpoints with comprehensive Lustre readiness validation
1 parent e50efb6 commit c68659d
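
For a quick manual check of the endpoints described in the commit message, the health ports can be queried from inside the plugin containers. This is a sketch only: the controller pod placeholder and its `azurelustre` container name are assumptions inferred from the node manifest below.

```sh
# Node plugin: enhanced readiness and liveness endpoints on port 29763.
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- curl -s localhost:29763/healthz
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- curl -s localhost:29763/livez

# Controller plugin: the same endpoints on port 29762 (pod and container names assumed).
kubectl exec -n kube-system <csi-azurelustre-controller-pod> -c azurelustre -- curl -s localhost:29762/healthz
kubectl exec -n kube-system <csi-azurelustre-controller-pod> -c azurelustre -- curl -s localhost:29762/livez
```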

File tree

5 files changed: +338 -1 lines changed


deploy/csi-azurelustre-node.yaml

Lines changed: 16 additions & 0 deletions
@@ -110,6 +110,22 @@ spec:
             initialDelaySeconds: 60
             timeoutSeconds: 10
             periodSeconds: 30
+          readinessProbe:
+            failureThreshold: 5
+            exec:
+              command:
+                - /app/readinessProbe.sh
+            initialDelaySeconds: 10
+            timeoutSeconds: 10
+            periodSeconds: 30
+          startupProbe:
+            failureThreshold: 120
+            exec:
+              command:
+                - /app/readinessProbe.sh
+            initialDelaySeconds: 10
+            timeoutSeconds: 5
+            periodSeconds: 5
           env:
             - name: CSI_ENDPOINT
               value: unix:///csi/csi.sock
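
To confirm the new exec-based probes are active after this manifest is applied, the rendered spec and the probe script itself can be checked directly; a minimal sketch, assuming the DaemonSet is named `csi-azurelustre-node` as the pod names elsewhere in this commit suggest:

```sh
# Show the readiness and startup probe blocks on the node DaemonSet.
kubectl get daemonset csi-azurelustre-node -n kube-system -o yaml | grep -A 7 -E "readinessProbe:|startupProbe:"

# Run the same check the kubelet runs, and report its exit code.
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- /app/readinessProbe.sh; echo "exit code: $?"
```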

docs/csi-debug.md

Lines changed: 107 additions & 0 deletions
@@ -2,6 +2,113 @@
 
 ---
 
+## Driver Readiness and Health Issues
+
+### Enhanced LNet Validation Troubleshooting
+
+**Symptoms:**
+
+- CSI driver node pods show `0/3` or `1/3` ready status
+- Readiness probe failing repeatedly
+- Pods remain in `ContainerCreating` or hit other startup issues
+- Mount operations fail with "driver not ready" errors
+
+**Test readiness probe directly:**
+
+```sh
+# Test the exec-based readiness probe script
+kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- /app/readinessProbe.sh
+```
+
+Expected results:
+- Exit code 0: Enhanced LNet validation passed
+- Exit code 1: One or more validation checks failed (with a descriptive error message)
+
+**Test HTTP health endpoints (optional manual testing):**
+
+```sh
+# Test enhanced readiness/liveness via HTTP endpoint
+kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- curl -s localhost:29763/healthz
+```
+
+HTTP responses:
+- `/healthz`: `ok` (HTTP 200) or `not ready` (HTTP 503)
+
+**Check enhanced validation logs:**
+
+```sh
+# Look for detailed LNet validation messages
+kubectl logs -n kube-system <csi-azurelustre-node-pod> -c azurelustre | grep -E "(LNet validation|NIDs|self-ping|interfaces)"
+```
+
+Look for validation success messages:
+- "LNet validation passed: all checks successful"
+- "Found NIDs: <network-identifiers>"
+- "LNet self-ping to <nid> successful"
+- "All LNet interfaces operational"
+
+**Common readiness failure patterns:**
+
+1. **No valid NIDs found:**
+   ```text
+   LNet validation failed: no valid NIDs
+   No valid non-loopback LNet NIDs found
+   ```
+   **Solution:** Check LNet configuration and network setup
+
+2. **Self-ping test failed:**
+   ```text
+   LNet validation failed: self-ping test failed
+   LNet self-ping to <nid> failed
+   ```
+   **Solution:** Verify network connectivity and LNet networking
+
+3. **Interfaces not operational:**
+   ```text
+   LNet validation failed: interfaces not operational
+   Found non-operational interface: status: down
+   ```
+   **Solution:** Check network interface status and configuration
+
+4. **Module loading issues:**
+   ```text
+   Lustre module not loaded
+   LNet kernel module is not loaded
+   ```
+   **Solution:** Check kernel module installation and loading
+
+**Debug LNet configuration manually:**
+
+```sh
+# Check kernel modules
+kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- lsmod | grep -E "(lnet|lustre)"
+
+# Check LNet NIDs
+kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- lctl list_nids
+
+# Test LNet self-ping
+kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- lctl ping <nid>
+
+# Check interface status
+kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- lnetctl net show --net tcp
+```
+
+**Check probe configuration:**
+
+```sh
+# Verify probe settings in deployment
+kubectl describe -n kube-system pod <csi-azurelustre-node-pod> | grep -A 10 -E "(Liveness|Readiness|Startup)"
+```
+
+**Monitor readiness probe attempts:**
+
+```sh
+# Watch probe events in real-time
+kubectl get events --field-selector involvedObject.name=<csi-azurelustre-node-pod> -n kube-system -w | grep -E "(Readiness|Liveness)"
+```
+
+---
+
 ## Volume Provisioning Issues
 
 ### Dynamic Provisioning (AMLFS Cluster Creation) - Public Preview

docs/install-csi-driver.md

Lines changed: 142 additions & 0 deletions
@@ -39,6 +39,76 @@ This document explains how to install Azure Lustre CSI driver on a kubernetes cl
 csi-azurelustre-node-g6sfx 3/3 Running 0 30s
 ```
 
+### Verifying CSI Driver Readiness for Lustre Operations
+
+Before mounting Azure Lustre filesystems, it's important to verify that the CSI driver nodes are fully initialized and ready for Lustre operations. The driver includes **enhanced LNet validation** that performs comprehensive readiness checks:
+
+- Load required kernel modules (lnet, lustre)
+- Configure LNet networking with valid Network Identifiers (NIDs)
+- Verify LNet self-ping functionality
+- Validate all network interfaces are operational
+- Complete all initialization steps
+
+#### Enhanced Readiness Validation
+
+The CSI driver now provides **exec-based readiness probes** for accurate readiness detection:
+
+- **Readiness & Startup Probes**: `/app/readinessProbe.sh` - Direct validation with comprehensive LNet checking
+- **HTTP Endpoint**: `/healthz` (Port 29763) - Available for manual testing and liveness monitoring
+
+#### Verification Commands
+
+1. **Check pod readiness status:**
+   ```shell
+   kubectl get -n kube-system pod -l app=csi-azurelustre-node -o wide
+   ```
+   All node pods should show `READY` status as `3/3` and `STATUS` as `Running`.
+
+2. **Test enhanced readiness endpoint directly:**
+   ```shell
+   kubectl exec -n kube-system <pod-name> -c azurelustre -- curl -s localhost:29763/healthz
+   ```
+   Should return `ok` (HTTP 200) when LNet validation passes, or `not ready` (HTTP 503) if any validation fails.
+
+3. **Test liveness endpoint:**
+   ```shell
+   kubectl exec -n kube-system <pod-name> -c azurelustre -- curl -s localhost:29763/livez
+   ```
+   Should return `alive` (HTTP 200) indicating basic container health.
+
+4. **Check detailed probe status:**
+   ```shell
+   kubectl describe -n kube-system pod -l app=csi-azurelustre-node
+   ```
+   Look for successful readiness and liveness probe checks in the Events section.
+
+5. **Review enhanced validation logs:**
+   ```shell
+   kubectl logs -n kube-system -l app=csi-azurelustre-node -c azurelustre --tail=20
+   ```
+   Look for enhanced LNet validation messages:
+   - "LNet validation passed: all checks successful"
+   - "Found NIDs: <network-identifiers>"
+   - "LNet self-ping to <nid> successful"
+   - "All LNet interfaces operational"
+
+#### Troubleshooting Failed Readiness
+
+If the readiness probe fails (exit code 1), check the logs for specific validation failure reasons:
+
+```shell
+# Check for detailed validation failure reasons
+kubectl logs -n kube-system <pod-name> -c azurelustre | grep -E "(LNet validation failed|Failed to|not operational)"
+```
+
+Common issues and solutions:
+- **"No valid NIDs"**: LNet networking not properly configured
+- **"Self-ping test failed"**: Network connectivity issues
+- **"Interfaces not operational"**: Network interfaces not in UP state
+- **"Lustre module not loaded"**: Kernel module loading issues
+
+**Important**: The enhanced validation ensures the driver reports ready only when LNet is fully functional for Lustre operations. Wait for all CSI driver node pods to pass enhanced readiness checks before creating PersistentVolumes or mounting Lustre filesystems.
+
 ## Default instructions for production release
 
 ### Install with kubectl (current production release)

@@ -73,3 +143,75 @@ This document explains how to install Azure Lustre CSI driver on a kubernetes cl
 csi-azurelustre-node-drlq2 3/3 Running 0 30s
 csi-azurelustre-node-g6sfx 3/3 Running 0 30s
 ```
+
+
+### Verifying CSI Driver Readiness for Lustre Operations
+
+Before mounting Azure Lustre filesystems, it is important to verify that the CSI driver nodes are fully initialized and ready for Lustre operations. The driver includes **enhanced LNet validation** that performs comprehensive readiness checks:
+
+- Load required kernel modules (lnet, lustre)
+- Configure LNet networking with valid Network Identifiers (NIDs)
+- Verify LNet self-ping functionality
+- Validate all network interfaces are operational
+- Complete all initialization steps
+
+#### Enhanced Readiness Validation
+
+The CSI driver now provides **HTTP health endpoints** for accurate readiness detection:
+
+- **`/healthz`** (Port 29763): Enhanced readiness check with comprehensive LNet validation
+- **`/livez`** (Port 29763): Basic liveness check to prevent unnecessary restarts
+
+#### Verification Commands
+
+1. **Check pod readiness status:**
+   ```shell
+   kubectl get -n kube-system pod -l app=csi-azurelustre-node -o wide
+   ```
+   All node pods should show `READY` status as `3/3` and `STATUS` as `Running`.
+
+2. **Test enhanced readiness endpoint directly:**
+   ```shell
+   kubectl exec -n kube-system <pod-name> -c azurelustre -- curl -s localhost:29763/healthz
+   ```
+   Should return `ok` (HTTP 200) when LNet validation passes, or `not ready` (HTTP 503) if any validation fails.
+
+3. **Test liveness endpoint:**
+   ```shell
+   kubectl exec -n kube-system <pod-name> -c azurelustre -- curl -s localhost:29763/livez
+   ```
+   Should return `alive` (HTTP 200) indicating basic container health.
+
+4. **Check detailed probe status:**
+   ```shell
+   kubectl describe -n kube-system pod -l app=csi-azurelustre-node
+   ```
+   Look for successful readiness and liveness probe checks in the Events section.
+
+5. **Review enhanced validation logs:**
+   ```shell
+   kubectl logs -n kube-system -l app=csi-azurelustre-node -c azurelustre --tail=20
+   ```
+   Look for enhanced LNet validation messages:
+   - "LNet validation passed: all checks successful"
+   - "Found NIDs: <network-identifiers>"
+   - "LNet self-ping to <nid> successful"
+   - "All LNet interfaces operational"
+
+#### Troubleshooting Failed Readiness
+
+If the readiness endpoint returns `not ready`, check the logs for specific validation failure reasons:
+
+```shell
+# Check for detailed validation failure reasons
+kubectl logs -n kube-system <pod-name> -c azurelustre | grep -E "(LNet validation failed|Failed to|not operational)"
+```
+
+Common issues and solutions:
+- **"No valid NIDs"**: LNet networking not properly configured
+- **"Self-ping test failed"**: Network connectivity issues
+- **"Interfaces not operational"**: Network interfaces not in UP state
+- **"Lustre module not loaded"**: Kernel module loading issues
+
+**Important**: The enhanced validation ensures the driver reports ready only when LNet is fully functional for Lustre operations. Wait for all CSI driver node pods to pass enhanced readiness checks before creating PersistentVolumes or mounting Lustre filesystems.

pkg/azurelustreplugin/Dockerfile

Lines changed: 2 additions & 1 deletion
@@ -16,8 +16,9 @@ FROM ubuntu:22.04
 
 COPY "./_output/azurelustreplugin" "/app/azurelustreplugin"
 COPY "./pkg/azurelustreplugin/entrypoint.sh" "/app/entrypoint.sh"
+COPY "./pkg/azurelustreplugin/readinessProbe.sh" "/app/readinessProbe.sh"
 
-RUN chmod +x "/app/entrypoint.sh"
+RUN chmod +x "/app/entrypoint.sh" && chmod +x "/app/readinessProbe.sh"
 
 RUN apt-get update && \
     apt-get upgrade -y && \
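
A quick sanity check for this Dockerfile change is to confirm the probe script lands in the image with execute permissions; a sketch, where the image tag is an assumption and `./_output/azurelustreplugin` is expected to exist from a prior build:

```sh
# Build the plugin image (tag is illustrative) and verify both scripts are executable.
docker build -t azurelustre-csi:dev -f pkg/azurelustreplugin/Dockerfile .
docker run --rm --entrypoint ls azurelustre-csi:dev -l /app/entrypoint.sh /app/readinessProbe.sh
```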
pkg/azurelustreplugin/readinessProbe.sh

Lines changed: 71 additions & 0 deletions

@@ -0,0 +1,71 @@
+#!/bin/bash
+
+# readinessProbe.sh - Health check script for Azure Lustre CSI driver
+# This script performs direct LNet readiness validation
+
+set -euo pipefail
+
+# Check if this is a controller pod (no Lustre client installation required)
+INSTALL_LUSTRE_CLIENT=${AZURELUSTRE_CSI_INSTALL_LUSTRE_CLIENT:-"yes"}
+
+if [[ "$INSTALL_LUSTRE_CLIENT" == "no" ]]; then
+    echo "Controller pod detected - reporting ready (skipping Lustre checks)"
+    exit 0
+fi
+
+echo "Node pod detected - performing Lustre-specific readiness checks"
+
+# Check if CSI socket exists and is accessible
+CSI_SOCKET=${CSI_ENDPOINT:-"unix:///csi/csi.sock"}
+SOCKET_PATH=$(echo "$CSI_SOCKET" | sed 's|unix://||')
+
+if [[ ! -S "$SOCKET_PATH" ]]; then
+    echo "CSI socket not found: $SOCKET_PATH"
+    exit 1
+fi
+
+# Check if LNet is properly configured and operational
+# This replicates the logic from CheckLustreReadiness()
+
+# 1. Check if LNet NIDs are valid and available
+if ! lnetctl net show >/dev/null 2>&1; then
+    echo "LNet not available or not configured"
+    exit 1
+fi
+
+# 2. Check if we have any NIDs configured
+NID_COUNT=$(lnetctl net show 2>/dev/null | grep -c "nid:")
+if [[ "$NID_COUNT" -eq 0 ]]; then
+    echo "No LNet NIDs configured"
+    exit 1
+fi
+
+# 3. Check LNet self-ping functionality
+if ! lnetctl ping --help >/dev/null 2>&1; then
+    echo "LNet ping functionality not available"
+    exit 1
+fi
+
+# Get the first available NID for self-ping test (exclude loopback)
+FIRST_NID=$(lnetctl net show 2>/dev/null | grep "nid:" | grep -v "@lo" | head -1 | sed 's/.*nid: \([^ ]*\).*/\1/' || echo "")
+if [[ -z "$FIRST_NID" ]]; then
+    echo "Unable to determine LNet NID for self-ping test"
+    exit 1
+fi
+
+# Perform self-ping test with timeout
+if ! timeout 10 lnetctl ping "$FIRST_NID" >/dev/null 2>&1; then
+    echo "LNet self-ping test failed for NID: $FIRST_NID"
+    exit 1
+fi
+
+# 4. Check if LNet interfaces are operational
+# Verify we have at least one interface in 'up' state
+UP_INTERFACES=$(lnetctl net show 2>/dev/null | grep -c "status: up")
+if [[ "$UP_INTERFACES" -eq 0 ]]; then
+    echo "No LNet interfaces in 'up' state"
+    exit 1
+fi
+
+echo "All Lustre readiness checks passed"
+exit 0
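
For reference, the `grep` checks above key off the `nid:` and `status:` fields emitted by `lnetctl net show`. Abridged, illustrative output from a healthy node (addresses and interface names are placeholders) looks roughly like this; the self-ping step skips `0@lo` and pings the first non-loopback NID, here `10.0.0.4@tcp`:

```sh
$ lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: tcp
      local NI(s):
        - nid: 10.0.0.4@tcp
          status: up
          interfaces:
              0: eth0
```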
