
Commit 83bf1a7

Enhanced CSI driver readiness validation with comprehensive LNet health checks
- Add CSI-compliant external liveness probe sidecars to both controller and node deployments
- Implement comprehensive LNet validation including NIDs, self-ping, and interface checks
- Separate health endpoints: /healthz (readiness) and /livez (liveness) on dedicated ports
- Controller uses port 29762, Node uses port 29763 for consistent internal communication
- Enhanced validation functions: hasValidLNetNIDs(), lnetSelfPingWorks(), lnetInterfacesOperational()
- Early health server startup for immediate status availability
- Maintain CSI community standards while providing Lustre-specific health validation

Hybrid approach provides both:

- Standard CSI external liveness probe monitoring gRPC endpoints
- Enhanced HTTP health endpoints with comprehensive Lustre readiness validation
1 parent e50efb6 commit 83bf1a7
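The commit message names the split endpoints, the ports, and the validation helpers; below is a minimal sketch of how that wiring could look in the node plugin. The helper names (`hasValidLNetNIDs()`, `lnetSelfPingWorks()`, `lnetInterfacesOperational()`) and port 29763 come from this commit, but the signatures, handler bodies, and the `startHealthServer` function are assumptions for illustration, not the driver's actual code.

```go
package main

import (
	"log"
	"net/http"
)

// Stand-ins for the validation helpers named in the commit message;
// the real implementations would shell out to lctl/lnetctl.
func hasValidLNetNIDs() bool          { return true } // e.g. `lctl list_nids`, ignore 0@lo
func lnetSelfPingWorks() bool         { return true } // e.g. `lctl ping <nid>`
func lnetInterfacesOperational() bool { return true } // e.g. `lnetctl net show`, expect "status: up"

func startHealthServer(addr string) {
	mux := http.NewServeMux()

	// Liveness: only says "the container is running"; deliberately not gated
	// on LNet state, so slow Lustre setup never triggers a restart.
	mux.HandleFunc("/livez", func(w http.ResponseWriter, _ *http.Request) {
		w.Write([]byte("alive"))
	})

	// Readiness: reports ok only once every LNet check passes,
	// otherwise 503 so the pod stays unready.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		if hasValidLNetNIDs() && lnetSelfPingWorks() && lnetInterfacesOperational() {
			w.Write([]byte("ok"))
			return
		}
		http.Error(w, "not ready", http.StatusServiceUnavailable)
	})

	// Started early, before the rest of driver init, so probes get an
	// immediate answer ("early health server startup" in the commit message).
	go func() {
		log.Fatal(http.ListenAndServe(addr, mux))
	}()
}

func main() {
	startHealthServer(":29763") // node port; the controller would use :29762
	select {}                   // the real driver would run its gRPC services here
}
```

With this split, the readiness and startup probes added in the manifests below point at `/healthz`, while the liveness probe points at `/livez`, so a failing LNet check marks the pod unready without restarting it.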

File tree

6 files changed: +596, -3 lines changed


deploy/csi-azurelustre-controller.yaml

Lines changed: 18 additions & 1 deletion
@@ -84,11 +84,28 @@ spec:
           livenessProbe:
             failureThreshold: 5
             httpGet:
-              path: /healthz
+              path: /livez
               port: healthz
             initialDelaySeconds: 60
+            timeoutSeconds: 5
+            periodSeconds: 30
+          readinessProbe:
+            failureThreshold: 5
+            httpGet:
+              path: /healthz
+              port: healthz
+            initialDelaySeconds: 10
             timeoutSeconds: 10
             periodSeconds: 30
+          startupProbe:
+            failureThreshold: 20
+            httpGet:
+              path: /healthz
+              port: healthz
+            initialDelaySeconds: 5
+            timeoutSeconds: 5
+            periodSeconds: 5
+          # Removed PostStartHook coordination to eliminate timing issues
           env:
             - name: CSI_ENDPOINT
               value: unix:///csi/csi.sock

deploy/csi-azurelustre-node.yaml

Lines changed: 17 additions & 1 deletion
@@ -105,11 +105,27 @@ spec:
           livenessProbe:
             failureThreshold: 5
             httpGet:
-              path: /healthz
+              path: /livez
               port: healthz
             initialDelaySeconds: 60
+            timeoutSeconds: 5
+            periodSeconds: 30
+          readinessProbe:
+            failureThreshold: 5
+            httpGet:
+              path: /healthz
+              port: healthz
+            initialDelaySeconds: 10
             timeoutSeconds: 10
             periodSeconds: 30
+          startupProbe:
+            failureThreshold: 120
+            httpGet:
+              path: /healthz
+              port: healthz
+            initialDelaySeconds: 10
+            timeoutSeconds: 5
+            periodSeconds: 5
           env:
             - name: CSI_ENDPOINT
               value: unix:///csi/csi.sock

docs/csi-debug.md

Lines changed: 106 additions & 0 deletions
@@ -2,6 +2,112 @@

---

## Driver Readiness and Health Issues

### Enhanced LNet Validation Troubleshooting

**Symptoms:**

- CSI driver node pods show `0/3` or `1/3` ready status
- Readiness probe failing repeatedly
- Pods remain in `ContainerCreating` or startup issues
- Mount operations fail with "driver not ready" errors

**Check enhanced readiness endpoint:**

```sh
# Test readiness endpoint directly
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- curl -s localhost:29763/healthz
```

Expected responses:

- `ok` (HTTP 200): Enhanced LNet validation passed
- `not ready` (HTTP 503): One or more validation checks failed

**Test liveness endpoint:**

```sh
# Test basic container health
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- curl -s localhost:29763/livez
```

Should return `alive` (HTTP 200) if the container is healthy.

**Check enhanced validation logs:**

```sh
# Look for detailed LNet validation messages
kubectl logs -n kube-system <csi-azurelustre-node-pod> -c azurelustre | grep -E "(LNet validation|NIDs|self-ping|interfaces)"
```

Look for validation success messages:

- `"LNet validation passed: all checks successful"`
- `"Found NIDs: <network-identifiers>"`
- `"LNet self-ping to <nid> successful"`
- `"All LNet interfaces operational"`

**Common readiness failure patterns:**

1. **No valid NIDs found:**

   ```text
   LNet validation failed: no valid NIDs
   No valid non-loopback LNet NIDs found
   ```

   **Solution:** Check LNet configuration and network setup

2. **Self-ping test failed:**

   ```text
   LNet validation failed: self-ping test failed
   LNet self-ping to <nid> failed
   ```

   **Solution:** Verify network connectivity and LNet networking

3. **Interfaces not operational:**

   ```text
   LNet validation failed: interfaces not operational
   Found non-operational interface: status: down
   ```

   **Solution:** Check network interface status and configuration

4. **Module loading issues:**

   ```text
   Lustre module not loaded
   LNet kernel module is not loaded
   ```

   **Solution:** Check kernel module installation and loading

**Debug LNet configuration manually:**

```sh
# Check kernel modules
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- lsmod | grep -E "(lnet|lustre)"

# Check LNet NIDs
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- lctl list_nids

# Test LNet self-ping
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- lctl ping <nid>

# Check interface status
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- lnetctl net show --net tcp
```

**Check probe configuration:**

```sh
# Verify probe settings in deployment
kubectl describe -n kube-system pod <csi-azurelustre-node-pod> | grep -A 10 -E "(Liveness|Readiness|Startup)"
```

**Monitor readiness probe attempts:**

```sh
# Watch probe events in real-time
kubectl get events --field-selector involvedObject.name=<csi-azurelustre-node-pod> -n kube-system -w | grep -E "(Readiness|Liveness)"
```

---

## Volume Provisioning Issues

### Dynamic Provisioning (AMLFS Cluster Creation) - Public Preview

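As a companion to the "No valid NIDs" failure pattern in the csi-debug.md additions above, here is a hedged sketch of how a non-loopback NID check along the lines of the `hasValidLNetNIDs()` helper named in this commit might parse `lctl list_nids` output. The parsing logic below is an assumption for illustration, not the driver's actual implementation.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// hasValidLNetNIDs reports whether `lctl list_nids` shows at least one
// NID other than the loopback NID 0@lo (hypothetical sketch).
func hasValidLNetNIDs() (bool, error) {
	out, err := exec.Command("lctl", "list_nids").CombinedOutput()
	if err != nil {
		return false, fmt.Errorf("lctl list_nids failed: %w", err)
	}
	for _, line := range strings.Split(string(out), "\n") {
		nid := strings.TrimSpace(line)
		// Skip blanks and the loopback NID; anything else (e.g. 10.0.0.4@tcp) counts.
		if nid == "" || nid == "0@lo" {
			continue
		}
		return true, nil
	}
	return false, nil
}

func main() {
	ok, err := hasValidLNetNIDs()
	fmt.Printf("valid non-loopback NIDs: %v (err: %v)\n", ok, err)
}
```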
docs/install-csi-driver.md

Lines changed: 142 additions & 0 deletions
@@ -39,6 +39,76 @@ This document explains how to install Azure Lustre CSI driver on a kubernetes cl
csi-azurelustre-node-g6sfx 3/3 Running 0 30s
```

### Verifying CSI Driver Readiness for Lustre Operations

Before mounting Azure Lustre filesystems, it's important to verify that the CSI driver nodes are fully initialized and ready for Lustre operations. The driver includes **enhanced LNet validation** that performs comprehensive readiness checks:

- Load required kernel modules (lnet, lustre)
- Configure LNet networking with valid Network Identifiers (NIDs)
- Verify LNet self-ping functionality
- Validate all network interfaces are operational
- Complete all initialization steps

#### Enhanced Readiness Validation

The CSI driver now provides **HTTP health endpoints** for accurate readiness detection:

- **`/healthz`** (Port 29763): Enhanced readiness check with comprehensive LNet validation
- **`/livez`** (Port 29763): Basic liveness check to prevent unnecessary restarts

#### Verification Commands

1. **Check pod readiness status:**

   ```shell
   kubectl get -n kube-system pod -l app=csi-azurelustre-node -o wide
   ```

   All node pods should show `READY` status as `3/3` and `STATUS` as `Running`.

2. **Test enhanced readiness endpoint directly:**

   ```shell
   kubectl exec -n kube-system <pod-name> -c azurelustre -- curl -s localhost:29763/healthz
   ```

   Should return `ok` (HTTP 200) when LNet validation passes, or `not ready` (HTTP 503) if any validation fails.

3. **Test liveness endpoint:**

   ```shell
   kubectl exec -n kube-system <pod-name> -c azurelustre -- curl -s localhost:29763/livez
   ```

   Should return `alive` (HTTP 200) indicating basic container health.

4. **Check detailed probe status:**

   ```shell
   kubectl describe -n kube-system pod -l app=csi-azurelustre-node
   ```

   Look for successful readiness and liveness probe checks in the Events section.

5. **Review enhanced validation logs:**

   ```shell
   kubectl logs -n kube-system -l app=csi-azurelustre-node -c azurelustre --tail=20
   ```

   Look for enhanced LNet validation messages:

   - `"LNet validation passed: all checks successful"`
   - `"Found NIDs: <network-identifiers>"`
   - `"LNet self-ping to <nid> successful"`
   - `"All LNet interfaces operational"`

#### Troubleshooting Failed Readiness

If the readiness endpoint returns `not ready`, check the logs for specific validation failures:

```shell
# Check for detailed validation failure reasons
kubectl logs -n kube-system <pod-name> -c azurelustre | grep -E "(LNet validation failed|Failed to|not operational)"
```

Common issues and solutions:

- **"No valid NIDs"**: LNet networking not properly configured
- **"Self-ping test failed"**: Network connectivity issues
- **"Interfaces not operational"**: Network interfaces not in UP state
- **"Lustre module not loaded"**: Kernel module loading issues

**Important**: The enhanced validation ensures the driver reports ready only when LNet is fully functional for Lustre operations. Wait for all CSI driver node pods to pass enhanced readiness checks before creating PersistentVolumes or mounting Lustre filesystems.

## Default instructions for production release

### Install with kubectl (current production release)
@@ -73,3 +143,75 @@ This document explains how to install Azure Lustre CSI driver on a kubernetes cl
csi-azurelustre-node-drlq2 3/3 Running 0 30s
csi-azurelustre-node-g6sfx 3/3 Running 0 30s
```

### Verifying CSI Driver Readiness for Lustre Operations

Before mounting Azure Lustre filesystems, it is important to verify that the CSI driver nodes are fully initialized and ready for Lustre operations. The driver includes **enhanced LNet validation** that performs comprehensive readiness checks:

- Load required kernel modules (lnet, lustre)
- Configure LNet networking with valid Network Identifiers (NIDs)
- Verify LNet self-ping functionality
- Validate all network interfaces are operational
- Complete all initialization steps

#### Enhanced Readiness Validation

The CSI driver now provides **HTTP health endpoints** for accurate readiness detection:

- **`/healthz`** (Port 29763): Enhanced readiness check with comprehensive LNet validation
- **`/livez`** (Port 29763): Basic liveness check to prevent unnecessary restarts

#### Verification Commands

1. **Check pod readiness status:**

   ```shell
   kubectl get -n kube-system pod -l app=csi-azurelustre-node -o wide
   ```

   All node pods should show `READY` status as `3/3` and `STATUS` as `Running`.

2. **Test enhanced readiness endpoint directly:**

   ```shell
   kubectl exec -n kube-system <pod-name> -c azurelustre -- curl -s localhost:29763/healthz
   ```

   Should return `ok` (HTTP 200) when LNet validation passes, or `not ready` (HTTP 503) if any validation fails.

3. **Test liveness endpoint:**

   ```shell
   kubectl exec -n kube-system <pod-name> -c azurelustre -- curl -s localhost:29763/livez
   ```

   Should return `alive` (HTTP 200) indicating basic container health.

4. **Check detailed probe status:**

   ```shell
   kubectl describe -n kube-system pod -l app=csi-azurelustre-node
   ```

   Look for successful readiness and liveness probe checks in the Events section.

5. **Review enhanced validation logs:**

   ```shell
   kubectl logs -n kube-system -l app=csi-azurelustre-node -c azurelustre --tail=20
   ```

   Look for enhanced LNet validation messages:

   - `"LNet validation passed: all checks successful"`
   - `"Found NIDs: <network-identifiers>"`
   - `"LNet self-ping to <nid> successful"`
   - `"All LNet interfaces operational"`

#### Troubleshooting Failed Readiness

If the readiness endpoint returns `not ready`, check the logs for specific validation failure reasons:

```shell
# Check for detailed validation failure reasons
kubectl logs -n kube-system <pod-name> -c azurelustre | grep -E "(LNet validation failed|Failed to|not operational)"
```

Common issues and solutions:

- **"No valid NIDs"**: LNet networking not properly configured
- **"Self-ping test failed"**: Network connectivity issues
- **"Interfaces not operational"**: Network interfaces not in UP state
- **"Lustre module not loaded"**: Kernel module loading issues

**Important**: The enhanced validation ensures the driver reports ready only when LNet is fully functional for Lustre operations. Wait for all CSI driver node pods to pass enhanced readiness checks before creating PersistentVolumes or mounting Lustre filesystems.

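If automation needs to wait for the enhanced readiness check rather than a human polling `curl`, a small sketch like the one below could poll the endpoint until it reports `ok`. It assumes the port (29763) and response bodies documented in this commit, and that `localhost:29763` is reachable from wherever it runs (for example inside the node pod); it is not part of the driver.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// waitForReady polls the readiness endpoint until it returns HTTP 200 with
// body "ok", or gives up after the timeout.
func waitForReady(url string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if resp, err := http.Get(url); err == nil {
			body, _ := io.ReadAll(resp.Body)
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK && string(body) == "ok" {
				return nil // enhanced LNet validation passed
			}
		}
		time.Sleep(5 * time.Second) // poll interval; the startup probe also checks every 5s
	}
	return fmt.Errorf("CSI driver not ready within %s", timeout)
}

func main() {
	if err := waitForReady("http://localhost:29763/healthz", 10*time.Minute); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("CSI driver node is ready for Lustre operations")
}
```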