Enhanced CSI driver readiness validation with comprehensive LNet health checks
- Add CSI-compliant external liveness probe sidecars to both controller and node deployments
- Implement comprehensive LNet validation including NIDs, self-ping, and interface checks
- Separate health endpoints: /healthz (readiness) and /livez (liveness) on dedicated ports
- Controller uses port 29762, Node uses port 29763 for consistent internal communication
- Enhanced validation functions: hasValidLNetNIDs(), lnetSelfPingWorks(), lnetInterfacesOperational()
- Early health server startup for immediate status availability
- Maintain CSI community standards while providing Lustre-specific health validation
Hybrid approach provides both:
- Standard CSI external liveness probe monitoring gRPC endpoints
- Enhanced HTTP health endpoints with comprehensive Lustre readiness validation
docs/install-csi-driver.md (142 additions, 0 deletions)
@@ -39,6 +39,76 @@ This document explains how to install Azure Lustre CSI driver on a kubernetes cl
csi-azurelustre-node-g6sfx 3/3 Running 0 30s
```

### Verifying CSI Driver Readiness for Lustre Operations

Before mounting Azure Lustre filesystems, verify that the CSI driver nodes are fully initialized and ready for Lustre operations. The driver includes **enhanced LNet validation**, which reports a node as ready only after the node has completed the following steps:

- Load required kernel modules (lnet, lustre)
- Configure LNet networking with valid Network Identifiers (NIDs)
- Verify LNet self-ping functionality
- Validate all network interfaces are operational
- Complete all initialization steps

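The checks listed above can be approximated by hand on a node. The following is a minimal sketch, not the driver's own code: the helper names (`module_loaded`, `has_valid_nids`, `interfaces_up`) are hypothetical, and it assumes that `lctl list_nids` prints one NID per line and that `lnetctl net show` prints a `status:` field for each interface. Each helper reads command output from stdin so it can be tried without a Lustre client.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers, not the CSI driver's implementation.

# Succeeds if the named kernel module appears in `lsmod` output on stdin.
module_loaded() {
  awk -v mod="$1" '$1 == mod { found = 1 } END { exit !found }'
}

# Succeeds if stdin (assumed `lctl list_nids` output) contains at least one
# NID of the form <address>@<network> that is not the loopback NID (…@lo).
has_valid_nids() {
  awk '$1 !~ /@lo$/ && $1 ~ /^[^@]+@[a-z0-9]+$/ { ok = 1 } END { exit !ok }'
}

# Succeeds if stdin (assumed `lnetctl net show` output) reports at least one
# interface as up and none as down.
interfaces_up() {
  awk '/status: *up/ { up = 1 }
       /status: *down/ { down = 1 }
       END { exit !(up && !down) }'
}
```

On a node with the Lustre client installed, the helpers would be fed real output, for example `lsmod | module_loaded lnet`, `lctl list_nids | has_valid_nids`, and `lnetctl net show | interfaces_up`. A self-ping could be tried separately with `lctl ping <nid>` against one of the node's own NIDs.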
#### Enhanced Readiness Validation

The CSI driver now provides **exec-based readiness probes** for accurate readiness detection:

- **Readiness & Startup Probes**: `/app/readinessProbe.sh` - Direct validation with comprehensive LNet checking
- **HTTP Endpoint**: `/healthz` (Port 29763) - Available for manual testing and liveness monitoring

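The HTTP endpoint can be queried manually once the port is reachable. The following is a minimal sketch, assuming the port has been exposed on localhost (for example through `kubectl port-forward`). `health_url` is a hypothetical helper; the controller port 29762 and the `/livez` path come from the commit description rather than from this document.

```shell
#!/bin/sh
# Sketch only: builds the health URL for a component ("controller" or "node").
# Usage: health_url COMPONENT [PATH]   (PATH defaults to healthz)
health_url() {
  case "$1" in
    controller) port=29762 ;;
    node)       port=29763 ;;
    *) echo "unknown component: $1" >&2; return 1 ;;
  esac
  echo "http://localhost:${port}/${2:-healthz}"
}

# Example (requires the port to be reachable on localhost):
#   curl -fsS "$(health_url node healthz)"
```

With `curl -f`, a non-2xx response produces a non-zero exit status, which makes the call usable in scripts.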
#### Verification Commands

1. **Check pod readiness status:**

   ```shell
   kubectl get -n kube-system pod -l app=csi-azurelustre-node -o wide
   ```

   All node pods should show `READY` status as `3/3` and `STATUS` as `Running`.
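
The same check can be scripted. The following is a minimal sketch, assuming the usual `kubectl get pod` column order (NAME, READY, STATUS, ...); `all_pods_ready` is a hypothetical helper that reads the listing from stdin and fails if any pod is not fully ready or if no pods are listed.

```shell
#!/bin/sh
# Sketch only: reads `kubectl get pod --no-headers` output from stdin.
# Fails if any pod has READY x/y with x != y, a STATUS other than Running,
# or if the listing is empty.
all_pods_ready() {
  awk '
    {
      split($2, ready, "/")
      if (ready[1] != ready[2] || $3 != "Running") {
        print "not ready: " $1
        bad = 1
      }
    }
    END { exit (bad || NR == 0) }
  '
}

# Example:
#   kubectl get -n kube-system pod -l app=csi-azurelustre-node --no-headers | all_pods_ready
```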
- **"No valid NIDs"**: LNet networking not properly configured
- **"Self-ping test failed"**: Network connectivity issues
- **"Interfaces not operational"**: Network interfaces not in UP state
- **"Lustre module not loaded"**: Kernel module loading issues

**Important**: The enhanced validation ensures the driver reports ready only when LNet is fully functional for Lustre operations. Wait for all CSI driver node pods to pass enhanced readiness checks before creating PersistentVolumes or mounting Lustre filesystems.

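The failure messages above can be matched against pod output when troubleshooting. The following is a minimal sketch, assuming the quoted strings appear verbatim in the text being scanned (where exactly the driver emits them is not confirmed here); `diagnose` is a hypothetical helper that prints the likely cause for each matching line.

```shell
#!/bin/sh
# Sketch only: reads log or event text from stdin and prints the cause
# listed in this document for each known failure message it finds.
diagnose() {
  while IFS= read -r line; do
    case "$line" in
      *"No valid NIDs"*)              echo "LNet networking not properly configured" ;;
      *"Self-ping test failed"*)      echo "Network connectivity issues" ;;
      *"Interfaces not operational"*) echo "Network interfaces not in UP state" ;;
      *"Lustre module not loaded"*)   echo "Kernel module loading issues" ;;
    esac
  done
}

# Example (<node-pod> is a placeholder for a real pod name):
#   kubectl logs -n kube-system <node-pod> --all-containers | diagnose
```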
## Default instructions for production release

### Install with kubectl (current production release)
@@ -73,3 +143,75 @@ This document explains how to install Azure Lustre CSI driver on a kubernetes cl
csi-azurelustre-node-drlq2 3/3 Running 0 30s
csi-azurelustre-node-g6sfx 3/3 Running 0 30s
```

### Verifying CSI Driver Readiness for Lustre Operations

Before mounting Azure Lustre filesystems, verify that the CSI driver nodes are fully initialized and ready for Lustre operations. The driver includes **enhanced LNet validation**, which reports a node as ready only after the node has completed the following steps:

- Load required kernel modules (lnet, lustre)
- Configure LNet networking with valid Network Identifiers (NIDs)
- Verify LNet self-ping functionality
- Validate all network interfaces are operational
- Complete all initialization steps

#### Enhanced Readiness Validation

The CSI driver now provides **HTTP health endpoints** for accurate readiness detection:
- **"No valid NIDs"**: LNet networking not properly configured
- **"Self-ping test failed"**: Network connectivity issues
- **"Interfaces not operational"**: Network interfaces not in UP state
- **"Lustre module not loaded"**: Kernel module loading issues

**Important**: The enhanced validation ensures the driver reports ready only when LNet is fully functional for Lustre operations. Wait for all CSI driver node pods to pass enhanced readiness checks before creating PersistentVolumes or mounting Lustre filesystems.
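
Waiting for readiness can be automated before any PersistentVolume is created. The following is a minimal sketch of a generic polling loop; `wait_until` is a hypothetical helper, and the timeout and interval values in the example are arbitrary assumptions rather than recommended settings.

```shell
#!/bin/sh
# Sketch only: retries a command until it succeeds or the timeout elapses.
# Usage: wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
wait_until() {
  timeout=$1
  interval=$2
  shift 2
  elapsed=0
  until "$@"; do
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}

# kubectl also has a built-in wait that blocks until pods report Ready:
#   kubectl wait --for=condition=Ready pod -l app=csi-azurelustre-node \
#     -n kube-system --timeout=300s
```

The polling helper is useful when the condition is something other than pod readiness, such as a health endpoint returning success; for pod readiness alone, `kubectl wait` is usually simpler.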