Commit c68659d

Enhanced CSI driver readiness validation with comprehensive LNet health checks
- Add CSI-compliant external liveness probe sidecars to both controller and node deployments
- Implement comprehensive LNet validation including NIDs, self-ping, and interface checks
- Separate health endpoints: /healthz (readiness) and /livez (liveness) on dedicated ports
- Controller uses port 29762, Node uses port 29763 for consistent internal communication
- Enhanced validation functions: hasValidLNetNIDs(), lnetSelfPingWorks(), lnetInterfacesOperational()
- Early health server startup for immediate status availability
- Maintain CSI community standards while providing Lustre-specific health validation

Hybrid approach provides both:

- Standard CSI external liveness probe monitoring gRPC endpoints
- Enhanced HTTP health endpoints with comprehensive Lustre readiness validation
1 parent e50efb6 commit c68659d
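
For a quick manual check of the endpoints described in the commit message, the health ports can be queried from inside the plugin containers. This is a sketch only: the controller pod placeholder and its `azurelustre` container name are assumptions inferred from the node manifest below.

```sh
# Node plugin: enhanced readiness and liveness endpoints on port 29763.
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- curl -s localhost:29763/healthz
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- curl -s localhost:29763/livez

# Controller plugin: the same endpoints on port 29762 (pod and container names assumed).
kubectl exec -n kube-system <csi-azurelustre-controller-pod> -c azurelustre -- curl -s localhost:29762/healthz
kubectl exec -n kube-system <csi-azurelustre-controller-pod> -c azurelustre -- curl -s localhost:29762/livez
```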

File tree

5 files changed: +338 -1 lines changed


deploy/csi-azurelustre-node.yaml

Lines changed: 16 additions & 0 deletions
@@ -110,6 +110,22 @@ spec:
             initialDelaySeconds: 60
             timeoutSeconds: 10
             periodSeconds: 30
+          readinessProbe:
+            failureThreshold: 5
+            exec:
+              command:
+                - /app/readinessProbe.sh
+            initialDelaySeconds: 10
+            timeoutSeconds: 10
+            periodSeconds: 30
+          startupProbe:
+            failureThreshold: 120
+            exec:
+              command:
+                - /app/readinessProbe.sh
+            initialDelaySeconds: 10
+            timeoutSeconds: 5
+            periodSeconds: 5
           env:
             - name: CSI_ENDPOINT
               value: unix:///csi/csi.sock
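
To confirm the new exec-based probes are active after this manifest is applied, the rendered spec and the probe script itself can be checked directly; a minimal sketch, assuming the DaemonSet is named `csi-azurelustre-node` as the pod names elsewhere in this commit suggest:

```sh
# Show the readiness and startup probe blocks on the node DaemonSet.
kubectl get daemonset csi-azurelustre-node -n kube-system -o yaml | grep -A 7 -E "readinessProbe:|startupProbe:"

# Run the same check the kubelet runs, and report its exit code.
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- /app/readinessProbe.sh; echo "exit code: $?"
```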

docs/csi-debug.md

Lines changed: 107 additions & 0 deletions
@@ -2,6 +2,113 @@
 
 ---
 
+## Driver Readiness and Health Issues
+
+### Enhanced LNet Validation Troubleshooting
+
+**Symptoms:**
+
+- CSI driver node pods show `0/3` or `1/3` ready status
+- Readiness probe failing repeatedly
+- Pods remain in `ContainerCreating` or hit other startup issues
+- Mount operations fail with "driver not ready" errors
+
+**Test readiness probe directly:**
+
+```sh
+# Test the exec-based readiness probe script
+kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- /app/readinessProbe.sh
+```
+
+Expected results:
+- Exit code 0: Enhanced LNet validation passed
+- Exit code 1: One or more validation checks failed (with a descriptive error message)
+
+**Test HTTP health endpoints (optional manual testing):**
+
+```sh
+# Test enhanced readiness/liveness via HTTP endpoint
+kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- curl -s localhost:29763/healthz
+```
+
+HTTP responses:
+- `/healthz`: `ok` (HTTP 200) or `not ready` (HTTP 503)
+
+**Check enhanced validation logs:**
+
+```sh
+# Look for detailed LNet validation messages
+kubectl logs -n kube-system <csi-azurelustre-node-pod> -c azurelustre | grep -E "(LNet validation|NIDs|self-ping|interfaces)"
+```
+
+Look for validation success messages:
+- "LNet validation passed: all checks successful"
+- "Found NIDs: <network-identifiers>"
+- "LNet self-ping to <nid> successful"
+- "All LNet interfaces operational"
+
+**Common readiness failure patterns:**
+
+1. **No valid NIDs found:**
+   ```text
+   LNet validation failed: no valid NIDs
+   No valid non-loopback LNet NIDs found
+   ```
+   **Solution:** Check LNet configuration and network setup
+
+2. **Self-ping test failed:**
+   ```text
+   LNet validation failed: self-ping test failed
+   LNet self-ping to <nid> failed
+   ```
+   **Solution:** Verify network connectivity and LNet networking
+
+3. **Interfaces not operational:**
+   ```text
+   LNet validation failed: interfaces not operational
+   Found non-operational interface: status: down
+   ```
+   **Solution:** Check network interface status and configuration
+
+4. **Module loading issues:**
+   ```text
+   Lustre module not loaded
+   LNet kernel module is not loaded
+   ```
+   **Solution:** Check kernel module installation and loading
+
+**Debug LNet configuration manually:**
+
+```sh
+# Check kernel modules
+kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- lsmod | grep -E "(lnet|lustre)"
+
+# Check LNet NIDs
+kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- lctl list_nids
+
+# Test LNet self-ping
+kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- lctl ping <nid>
+
+# Check interface status
+kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- lnetctl net show --net tcp
+```
+
+**Check probe configuration:**
+
+```sh
+# Verify probe settings in deployment
+kubectl describe -n kube-system pod <csi-azurelustre-node-pod> | grep -A 10 -E "(Liveness|Readiness|Startup)"
+```
+
+**Monitor readiness probe attempts:**
+
+```sh
+# Watch probe events in real-time
+kubectl get events --field-selector involvedObject.name=<csi-azurelustre-node-pod> -n kube-system -w | grep -E "(Readiness|Liveness)"
+```
+
+---
+
 ## Volume Provisioning Issues
 
 ### Dynamic Provisioning (AMLFS Cluster Creation) - Public Preview

docs/install-csi-driver.md

Lines changed: 142 additions & 0 deletions
@@ -39,6 +39,76 @@ This document explains how to install Azure Lustre CSI driver on a kubernetes cl
 csi-azurelustre-node-g6sfx 3/3 Running 0 30s
 ```
 
+### Verifying CSI Driver Readiness for Lustre Operations
+
+Before mounting Azure Lustre filesystems, it's important to verify that the CSI driver nodes are fully initialized and ready for Lustre operations. The driver includes **enhanced LNet validation** that performs comprehensive readiness checks:
+
+- Load required kernel modules (lnet, lustre)
+- Configure LNet networking with valid Network Identifiers (NIDs)
+- Verify LNet self-ping functionality
+- Validate all network interfaces are operational
+- Complete all initialization steps
+
+#### Enhanced Readiness Validation
+
+The CSI driver now provides **exec-based readiness probes** for accurate readiness detection:
+
+- **Readiness & Startup Probes**: `/app/readinessProbe.sh` - Direct validation with comprehensive LNet checking
+- **HTTP Endpoint**: `/healthz` (Port 29763) - Available for manual testing and liveness monitoring
+
+#### Verification Commands
+
+1. **Check pod readiness status:**
+   ```shell
+   kubectl get -n kube-system pod -l app=csi-azurelustre-node -o wide
+   ```
+   All node pods should show `READY` status as `3/3` and `STATUS` as `Running`.
+
+2. **Test enhanced readiness endpoint directly:**
+   ```shell
+   kubectl exec -n kube-system <pod-name> -c azurelustre -- curl -s localhost:29763/healthz
+   ```
+   Should return `ok` (HTTP 200) when LNet validation passes, or `not ready` (HTTP 503) if any validation fails.
+
+3. **Test liveness endpoint:**
+   ```shell
+   kubectl exec -n kube-system <pod-name> -c azurelustre -- curl -s localhost:29763/livez
+   ```
+   Should return `alive` (HTTP 200) indicating basic container health.
+
+4. **Check detailed probe status:**
+   ```shell
+   kubectl describe -n kube-system pod -l app=csi-azurelustre-node
+   ```
+   Look for successful readiness and liveness probe checks in the Events section.
+
+5. **Review enhanced validation logs:**
+   ```shell
+   kubectl logs -n kube-system -l app=csi-azurelustre-node -c azurelustre --tail=20
+   ```
+   Look for enhanced LNet validation messages:
+   - "LNet validation passed: all checks successful"
+   - "Found NIDs: <network-identifiers>"
+   - "LNet self-ping to <nid> successful"
+   - "All LNet interfaces operational"
+
+#### Troubleshooting Failed Readiness
+
+If the readiness probe fails (exit code 1), check the logs for specific validation failure reasons:
+
+```shell
+# Check for detailed validation failure reasons
+kubectl logs -n kube-system <pod-name> -c azurelustre | grep -E "(LNet validation failed|Failed to|not operational)"
+```
+
+Common issues and solutions:
+- **"No valid NIDs"**: LNet networking not properly configured
+- **"Self-ping test failed"**: Network connectivity issues
+- **"Interfaces not operational"**: Network interfaces not in UP state
+- **"Lustre module not loaded"**: Kernel module loading issues
+
+**Important**: The enhanced validation ensures the driver reports ready only when LNet is fully functional for Lustre operations. Wait for all CSI driver node pods to pass enhanced readiness checks before creating PersistentVolumes or mounting Lustre filesystems.
+
 ## Default instructions for production release
 
 ### Install with kubectl (current production release)

@@ -73,3 +143,75 @@ This document explains how to install Azure Lustre CSI driver on a kubernetes cl
 csi-azurelustre-node-drlq2 3/3 Running 0 30s
 csi-azurelustre-node-g6sfx 3/3 Running 0 30s
 ```
+
+
+### Verifying CSI Driver Readiness for Lustre Operations
+
+Before mounting Azure Lustre filesystems, it is important to verify that the CSI driver nodes are fully initialized and ready for Lustre operations. The driver includes **enhanced LNet validation** that performs comprehensive readiness checks:
+
+- Load required kernel modules (lnet, lustre)
+- Configure LNet networking with valid Network Identifiers (NIDs)
+- Verify LNet self-ping functionality
+- Validate all network interfaces are operational
+- Complete all initialization steps
+
+#### Enhanced Readiness Validation
+
+The CSI driver now provides **HTTP health endpoints** for accurate readiness detection:
+
+- **`/healthz`** (Port 29763): Enhanced readiness check with comprehensive LNet validation
+- **`/livez`** (Port 29763): Basic liveness check to prevent unnecessary restarts
+
+#### Verification Commands
+
+1. **Check pod readiness status:**
+   ```shell
+   kubectl get -n kube-system pod -l app=csi-azurelustre-node -o wide
+   ```
+   All node pods should show `READY` status as `3/3` and `STATUS` as `Running`.
+
+2. **Test enhanced readiness endpoint directly:**
+   ```shell
+   kubectl exec -n kube-system <pod-name> -c azurelustre -- curl -s localhost:29763/healthz
+   ```
+   Should return `ok` (HTTP 200) when LNet validation passes, or `not ready` (HTTP 503) if any validation fails.
+
+3. **Test liveness endpoint:**
+   ```shell
+   kubectl exec -n kube-system <pod-name> -c azurelustre -- curl -s localhost:29763/livez
+   ```
+   Should return `alive` (HTTP 200) indicating basic container health.
+
+4. **Check detailed probe status:**
+   ```shell
+   kubectl describe -n kube-system pod -l app=csi-azurelustre-node
+   ```
+   Look for successful readiness and liveness probe checks in the Events section.
+
+5. **Review enhanced validation logs:**
+   ```shell
+   kubectl logs -n kube-system -l app=csi-azurelustre-node -c azurelustre --tail=20
+   ```
+   Look for enhanced LNet validation messages:
+   - "LNet validation passed: all checks successful"
+   - "Found NIDs: <network-identifiers>"
+   - "LNet self-ping to <nid> successful"
+   - "All LNet interfaces operational"
+
+#### Troubleshooting Failed Readiness
+
+If the readiness endpoint returns `not ready`, check the logs for specific validation failure reasons:
+
+```shell
+# Check for detailed validation failure reasons
+kubectl logs -n kube-system <pod-name> -c azurelustre | grep -E "(LNet validation failed|Failed to|not operational)"
+```
+
+Common issues and solutions:
+- **"No valid NIDs"**: LNet networking not properly configured
+- **"Self-ping test failed"**: Network connectivity issues
+- **"Interfaces not operational"**: Network interfaces not in UP state
+- **"Lustre module not loaded"**: Kernel module loading issues
+
+**Important**: The enhanced validation ensures the driver reports ready only when LNet is fully functional for Lustre operations. Wait for all CSI driver node pods to pass enhanced readiness checks before creating PersistentVolumes or mounting Lustre filesystems.

pkg/azurelustreplugin/Dockerfile

Lines changed: 2 additions & 1 deletion
@@ -16,8 +16,9 @@ FROM ubuntu:22.04
 
 COPY "./_output/azurelustreplugin" "/app/azurelustreplugin"
 COPY "./pkg/azurelustreplugin/entrypoint.sh" "/app/entrypoint.sh"
+COPY "./pkg/azurelustreplugin/readinessProbe.sh" "/app/readinessProbe.sh"
 
-RUN chmod +x "/app/entrypoint.sh"
+RUN chmod +x "/app/entrypoint.sh" && chmod +x "/app/readinessProbe.sh"
 
 RUN apt-get update && \
     apt-get upgrade -y && \
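
A quick sanity check for this Dockerfile change is to confirm the probe script lands in the image with execute permissions; a sketch, where the image tag is an assumption and `./_output/azurelustreplugin` is expected to exist from a prior build:

```sh
# Build the plugin image (tag is illustrative) and verify both scripts are executable.
docker build -t azurelustre-csi:dev -f pkg/azurelustreplugin/Dockerfile .
docker run --rm --entrypoint ls azurelustre-csi:dev -l /app/entrypoint.sh /app/readinessProbe.sh
```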
pkg/azurelustreplugin/readinessProbe.sh

Lines changed: 71 additions & 0 deletions

@@ -0,0 +1,71 @@
+#!/bin/bash
+
+# readinessProbe.sh - Health check script for Azure Lustre CSI driver
+# This script performs direct LNet readiness validation
+
+set -euo pipefail
+
+# Check if this is a controller pod (no Lustre client installation required)
+INSTALL_LUSTRE_CLIENT=${AZURELUSTRE_CSI_INSTALL_LUSTRE_CLIENT:-"yes"}
+
+if [[ "$INSTALL_LUSTRE_CLIENT" == "no" ]]; then
+    echo "Controller pod detected - reporting ready (skipping Lustre checks)"
+    exit 0
+fi
+
+echo "Node pod detected - performing Lustre-specific readiness checks"
+
+# Check if CSI socket exists and is accessible
+CSI_SOCKET=${CSI_ENDPOINT:-"unix:///csi/csi.sock"}
+SOCKET_PATH=$(echo "$CSI_SOCKET" | sed 's|unix://||')
+
+if [[ ! -S "$SOCKET_PATH" ]]; then
+    echo "CSI socket not found: $SOCKET_PATH"
+    exit 1
+fi
+
+# Check if LNet is properly configured and operational
+# This replicates the logic from CheckLustreReadiness()
+
+# 1. Check if LNet NIDs are valid and available
+if ! lnetctl net show >/dev/null 2>&1; then
+    echo "LNet not available or not configured"
+    exit 1
+fi
+
+# 2. Check if we have any NIDs configured
+NID_COUNT=$(lnetctl net show 2>/dev/null | grep -c "nid:")
+if [[ "$NID_COUNT" -eq 0 ]]; then
+    echo "No LNet NIDs configured"
+    exit 1
+fi
+
+# 3. Check LNet self-ping functionality
+if ! lnetctl ping --help >/dev/null 2>&1; then
+    echo "LNet ping functionality not available"
+    exit 1
+fi
+
+# Get the first available NID for self-ping test (exclude loopback)
+FIRST_NID=$(lnetctl net show 2>/dev/null | grep "nid:" | grep -v "@lo" | head -1 | sed 's/.*nid: \([^ ]*\).*/\1/' || echo "")
+if [[ -z "$FIRST_NID" ]]; then
+    echo "Unable to determine LNet NID for self-ping test"
+    exit 1
+fi
+
+# Perform self-ping test with timeout
+if ! timeout 10 lnetctl ping "$FIRST_NID" >/dev/null 2>&1; then
+    echo "LNet self-ping test failed for NID: $FIRST_NID"
+    exit 1
+fi
+
+# 4. Check if LNet interfaces are operational
+# Verify we have at least one interface in 'up' state
+UP_INTERFACES=$(lnetctl net show 2>/dev/null | grep -c "status: up")
+if [[ "$UP_INTERFACES" -eq 0 ]]; then
+    echo "No LNet interfaces in 'up' state"
+    exit 1
+fi
+
+echo "All Lustre readiness checks passed"
+exit 0
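
For reference, the `grep` checks above key off the `nid:` and `status:` fields emitted by `lnetctl net show`. Abridged, illustrative output from a healthy node (addresses and interface names are placeholders) looks roughly like this; the self-ping step skips `0@lo` and pings the first non-loopback NID, here `10.0.0.4@tcp`:

```sh
$ lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: tcp
      local NI(s):
        - nid: 10.0.0.4@tcp
          status: up
          interfaces:
              0: eth0
```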
