4 changes: 2 additions & 2 deletions README.md
@@ -3,7 +3,7 @@
[![Coverage Status](https://coveralls.io/repos/github/kubernetes-sigs/azurelustre-csi-driver/badge.svg?branch=main)](https://coveralls.io/github/kubernetes-sigs/azurelustre-csi-driver?branch=main)
[![FOSSA Status](https://app.fossa.com/api/projects/git%2Bgithub.com%2Fkubernetes-sigs%2Fazurelustre-csi-driver.svg?type=shield)](https://app.fossa.com/projects/git%2Bgithub.com%2Fkubernetes-sigs%2Fazurelustre-csi-driver?ref=badge_shield)

### About
## About

This driver allows Kubernetes to access Azure Lustre file system.

@@ -12,7 +12,7 @@ This driver allows Kubernetes to access Azure Lustre file system.

 

### Container Images & Kubernetes Compatibility:
### Container Images & Kubernetes Compatibility

| Driver version | Image | Supported k8s version | Lustre client version |
|-----------------|-----------------------------------------------------------------|-----------------------|-----------------------|
3 changes: 3 additions & 0 deletions deploy/rbac-csi-azurelustre-node.yaml
@@ -14,6 +14,9 @@ rules:
- apiGroups: [""]
resources: ["secrets"]
verbs: ["get", "list"]
- apiGroups: [""]
resources: ["nodes"]
verbs: ["get", "patch"]

---
kind: ClusterRoleBinding
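The added `nodes` get/patch permission is what lets the node plugin remove its own startup taint. A quick way to confirm the permissions are effective after deployment is a `kubectl auth can-i` check; the service account name below is an assumption, so substitute the one bound by this ClusterRole:

```sh
# Confirm the node service account can read and patch Node objects
# (service account name is assumed; check the ClusterRoleBinding for the real one)
kubectl auth can-i get nodes \
  --as=system:serviceaccount:kube-system:csi-azurelustre-node-sa
kubectl auth can-i patch nodes \
  --as=system:serviceaccount:kube-system:csi-azurelustre-node-sa
```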
69 changes: 69 additions & 0 deletions docs/csi-debug.md
@@ -448,6 +448,75 @@ Check for solutions in [Resolving Common Errors](errors.md)

---

## Pod Scheduling and Node Readiness Issues

### Pods Stuck in Pending Status with Taint-Related Errors

**Symptoms:**

- Pods requiring Azure Lustre storage remain in `Pending` status
- Pod events show taint-related scheduling failures
- Error messages mentioning `azurelustre.csi.azure.com/agent-not-ready` taint

**Check pod scheduling status:**

```sh
kubectl describe pod <pod-name>
```

Look for events such as:

- `Warning FailedScheduling ... node(s) had taint {azurelustre.csi.azure.com/agent-not-ready: }, that the pod didn't tolerate`
- `0/X nodes are available: X node(s) had taint {azurelustre.csi.azure.com/agent-not-ready}`

**Check node taints:**

```sh
kubectl describe nodes | grep -A5 -B5 "azurelustre.csi.azure.com/agent-not-ready"
```

**Check CSI driver readiness on nodes:**

```sh
# Check if CSI driver pods are running on all nodes
kubectl get pods -n kube-system -l app=csi-azurelustre-node -o wide

# Check CSI driver logs for startup issues
kubectl logs -n kube-system -l app=csi-azurelustre-node -c azurelustre --tail=100 | grep -i "taint\|ready\|error"
```

**Common causes and solutions:**

1. **CSI Driver Still Starting**: Wait for CSI driver pods to reach `Running` status

```sh
kubectl wait --for=condition=ready pod -l app=csi-azurelustre-node -n kube-system --timeout=300s
```

2. **Lustre Module Loading Issues**: Check if Lustre kernel modules are properly loaded

```sh
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- lsmod | grep lustre
```

3. **Manual Taint Removal** (Emergency only - not recommended for production):

```sh
kubectl taint nodes <node-name> azurelustre.csi.azure.com/agent-not-ready:NoSchedule-
```

**Verify taint removal functionality:**

Check that startup taint removal is enabled in the CSI driver:

```sh
kubectl logs -n kube-system -l app=csi-azurelustre-node -c azurelustre | grep -i "remove.*taint"
```

Expected log output should show taint removal activity when the driver becomes ready.

---

## Get Azure Lustre Driver Version

23 changes: 23 additions & 0 deletions docs/driver-parameters.md
@@ -4,6 +4,29 @@ These are the parameters to be passed into the custom StorageClass that users mu

For more information, see the [Azure Managed Lustre Filesystem (AMLFS) service documentation](https://learn.microsoft.com/en-us/azure/azure-managed-lustre/) and the [AMLFS CSI documentation](https://learn.microsoft.com/en-us/azure/azure-managed-lustre/use-csi-driver-kubernetes).

## CSI Driver Configuration Parameters

These parameters control the behavior of the Azure Lustre CSI driver itself and are typically configured during driver installation rather than in StorageClass definitions.

### Node Startup Taint Management

Name | Meaning | Available Value | Default Value | Configuration Method
--- | --- | --- | --- | ---
remove-not-ready-taint | Controls whether the CSI driver automatically removes startup taints from nodes when the driver becomes ready. This ensures pods are only scheduled to nodes where the CSI driver is fully operational and Lustre filesystem capacity is available. Nodes should have a taint of the form: `azurelustre.csi.azure.com/agent-not-ready:NoSchedule` | `true`, `false` | `true` | Command-line flag `--remove-not-ready-taint` in driver deployment

#### Startup Taint Details

When enabled (default), the Azure Lustre CSI driver will:

1. **Monitor Node Readiness**: Check if the CSI driver is fully initialized on the node
2. **Remove Blocking Taint**: Automatically remove the `azurelustre.csi.azure.com/agent-not-ready:NoSchedule` taint when ready

This mechanism prevents pods requiring Azure Lustre storage from being scheduled to nodes where:

- Lustre kernel modules are not yet loaded
- CSI driver components are not fully initialized
- Network connectivity to Lustre filesystems is not established
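As an illustration of how a node enters this gated state, the startup taint can be applied when a node or node pool is registered (cluster provisioning tooling normally does this); the sketch below adds it to a single node by hand, with `<node-name>` as a placeholder:

```sh
# Apply the startup taint manually; the CSI driver removes it from the
# node once it reports fully ready for Lustre operations
kubectl taint nodes <node-name> azurelustre.csi.azure.com/agent-not-ready=:NoSchedule
```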

## Dynamic Provisioning (Create an AMLFS Cluster through AKS) - Public Preview

> **Public Preview Notice**: Dynamic provisioning functionality is currently in public preview. Some features may not be supported or may have constrained capabilities.
86 changes: 85 additions & 1 deletion docs/errors.md
@@ -11,6 +11,9 @@ This document describes common errors that can occur during volume creation and
- [Error: Resource not found](#error-resource-not-found)
- [Error: Cannot create AMLFS cluster, not enough IP addresses available](#error-cannot-create-amlfs-cluster-not-enough-ip-addresses-available)
- [Error: Reached Azure Subscription Quota Limit for AMLFS Clusters](#error-reached-azure-subscription-quota-limit-for-amlfs-clusters)
- [Pod Scheduling Errors](#pod-scheduling-errors)
- [Node Readiness and Taint Errors](#node-readiness-and-taint-errors)
- [Error: Node had taint azurelustre.csi.azure.com/agent-not-ready](#error-node-had-taint-azurelustrecsiazurecomagent-not-ready)
- [Volume Mounting Errors](#volume-mounting-errors)
- [Node Mount Errors](#node-mount-errors)
- [Error: Could not mount target](#error-could-not-mount-target)
@@ -31,7 +34,7 @@ This document describes common errors that can occur during volume creation and
- [Controller Logs](#controller-logs)
- [Node Logs](#node-logs)
- [Comprehensive Log Collection](#comprehensive-log-collection)

---

## Volume Creation Errors
@@ -211,6 +214,87 @@ There is not enough room in the /subscriptions/<sub-id>/resourceGroups/<rg>/prov

---

## Pod Scheduling Errors

### Node Readiness and Taint Errors

#### Error: Node had taint azurelustre.csi.azure.com/agent-not-ready

**Symptoms:**

- Pods requiring Azure Lustre storage remain stuck in `Pending` status
- Pod events show taint-related scheduling failures:
- `Warning FailedScheduling ... node(s) had taint {azurelustre.csi.azure.com/agent-not-ready: }, that the pod didn't tolerate`
- `0/X nodes are available: X node(s) had taint {azurelustre.csi.azure.com/agent-not-ready}`
- `kubectl describe pod` shows scheduling failures due to taints

**Possible Causes:**

- CSI driver is still initializing on nodes
- Lustre kernel modules are not yet loaded
- CSI driver failed to start properly on affected nodes
- Node is not ready to handle Azure Lustre volume allocations
- CSI driver startup taint removal is disabled

**Debugging Steps:**

```bash
# Check pod scheduling status
kubectl describe pod <pod-name> | grep -A10 Events

# Check which nodes have the taint
kubectl describe nodes | grep -A5 -B5 "azurelustre.csi.azure.com/agent-not-ready"

# Verify CSI driver pod status on nodes
kubectl get pods -n kube-system -l app=csi-azurelustre-node -o wide

# Check CSI driver startup logs
kubectl logs -n kube-system -l app=csi-azurelustre-node -c azurelustre --tail=100 | grep -i "taint\|ready\|error"

# Verify taint removal is enabled (should be true by default)
kubectl logs -n kube-system -l app=csi-azurelustre-node -c azurelustre | grep -i "remove.*taint"
```

**Resolution:**

1. **Wait for CSI Driver Readiness** (most common case):

```bash
# Wait for CSI driver pods to reach Running status
kubectl wait --for=condition=ready pod -l app=csi-azurelustre-node -n kube-system --timeout=300s
```

The taint should be automatically removed once the CSI driver is fully operational.

2. **Check Lustre Module Loading**:

```bash
# Verify Lustre modules are loaded on nodes
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- lsmod | grep lustre
```

3. **Verify CSI Driver Configuration**:

```bash
# Check if taint removal is enabled (default: true)
kubectl get deployment csi-azurelustre-node -n kube-system -o yaml | grep "remove-not-ready-taint"
```

4. **Emergency Manual Taint Removal** (not recommended for production):

```bash
# Only use if CSI driver is confirmed working but taint persists
kubectl taint nodes <node-name> azurelustre.csi.azure.com/agent-not-ready:NoSchedule-
```

**Prevention:**

- Ensure CSI driver has sufficient time to initialize during cluster updates
- Monitor CSI driver health during node scaling operations
- Use pod disruption budgets to prevent scheduling issues during maintenance (see the sketch below)
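
As a sketch of the last point, a pod disruption budget can keep a minimum number of Lustre-backed replicas available during node drains; the workload name and label selector here are illustrative placeholders:

```bash
# Keep at least one replica available during voluntary disruptions
# (name and selector are placeholders for your workload)
kubectl create poddisruptionbudget lustre-app-pdb \
  --selector=app=lustre-app \
  --min-available=1
```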

---

## Volume Mounting Errors

### Node Mount Errors
79 changes: 28 additions & 51 deletions docs/install-csi-driver.md
@@ -39,54 +39,6 @@ This document explains how to install Azure Lustre CSI driver on a kubernetes cl
csi-azurelustre-node-g6sfx 3/3 Running 0 30s
```

### Verifying CSI Driver Readiness for Lustre Operations

Before mounting Azure Lustre filesystems, it's important to verify that the CSI driver nodes are fully initialized and ready for Lustre operations. The driver includes **enhanced LNet validation** that performs comprehensive readiness checks:

- Load required kernel modules (lnet, lustre)
- Configure LNet networking with valid Network Identifiers (NIDs)
- Verify LNet self-ping functionality
- Validate all network interfaces are operational
- Complete all initialization steps

#### Readiness Validation

The CSI driver deployment includes automated probes for container health monitoring:

- **Liveness Probe**: `/healthz` (Port 29763) - HTTP endpoint for basic container health
- **Container Status**: Kubernetes readiness based on container startup and basic health checks

#### Verification Steps

1. **Check pod readiness status:**
```shell
kubectl get -n kube-system pod -l app=csi-azurelustre-node -o wide
```
All node pods should show `READY` status as `3/3` and `STATUS` as `Running`.

2. **Verify probe configuration:**
```shell
kubectl describe -n kube-system pod -l app=csi-azurelustre-node
```
Look for exec-based readiness and startup probe configuration in the pod description:
- `Readiness: exec [/app/readinessProbe.sh]`
- `Startup: exec [/app/readinessProbe.sh]`

In the Events section, you may see initial startup probe failures during LNet initialization:
- `Warning Unhealthy ... Startup probe failed: Node pod detected - performing Lustre-specific readiness checks`

This is normal during the initialization phase. Once LNet is fully operational, the probes will succeed and no more failure events will appear.

3. **Monitor validation logs:**
```shell
kubectl logs -n kube-system -l app=csi-azurelustre-node -c azurelustre --tail=20
```
Look for CSI driver startup and successful GRPC operation logs indicating driver initialization is complete.

> **Note**: If you encounter readiness or initialization issues, see the [CSI Driver Troubleshooting Guide](csi-debug.md#enhanced-lnet-validation-troubleshooting) for detailed debugging steps.

**Important**: The enhanced validation ensures the driver reports ready only when LNet is fully functional for Lustre operations. Wait for all CSI driver node pods to pass enhanced readiness checks before creating PersistentVolumes or mounting Lustre filesystems.

## Default instructions for production release

### Install with kubectl (current production release)
@@ -122,8 +74,7 @@ The CSI driver deployment includes automated probes for container health monitor
csi-azurelustre-node-g6sfx 3/3 Running 0 30s
```


### Verifying CSI Driver Readiness for Lustre Operations
## Verifying CSI Driver Readiness for Lustre Operations

Before mounting Azure Lustre filesystems, it is important to verify that the CSI driver nodes are fully initialized and ready for Lustre operations. The driver includes enhanced LNet validation that performs comprehensive readiness checks:

Expand All @@ -133,7 +84,7 @@ Before mounting Azure Lustre filesystems, it is important to verify that the CSI
- Validate all network interfaces are operational
- Complete all initialization steps

#### Enhanced Readiness Validation
### Enhanced Readiness Validation

The CSI driver deployment includes automated **exec-based readiness probes** for accurate readiness detection:

@@ -143,24 +94,50 @@ The CSI driver deployment includes automated **exec-based readiness probes** for
#### Verification Steps

1. **Check pod readiness status:**

```shell
kubectl get -n kube-system pod -l app=csi-azurelustre-node -o wide
```

All node pods should show `READY` status as `3/3` and `STATUS` as `Running`.

2. **Verify probe configuration:**

```shell
kubectl describe -n kube-system pod -l app=csi-azurelustre-node
```

Look for exec-based readiness and startup probe configuration and check that no recent probe failures appear in the Events section.

3. **Monitor validation logs:**

```shell
kubectl logs -n kube-system -l app=csi-azurelustre-node -c azurelustre --tail=20
```

Look for CSI driver startup and successful GRPC operation logs indicating driver initialization is complete.

> **Note**: If you encounter readiness or initialization issues, see the [CSI Driver Troubleshooting Guide](csi-debug.md#enhanced-lnet-validation-troubleshooting) for detailed debugging steps.

**Important**: The enhanced validation ensures the driver reports ready only when LNet is fully functional for Lustre operations. Wait for all CSI driver node pods to pass enhanced readiness checks before creating PersistentVolumes or mounting Lustre filesystems.
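
If a node pod stays unready for an extended period, the same script the probes invoke can be run by hand to inspect its result; this assumes the probe script path shown in the pod description (`/app/readinessProbe.sh`) is executable in the container, and uses a placeholder pod name:

```shell
# Run the readiness probe script directly and inspect its exit code
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- /app/readinessProbe.sh
echo $?   # non-zero exit indicates the node is not yet ready
```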

## Startup Taints

When the CSI driver starts on each node, it automatically removes the following taint if present:

- **Taint Key**: `azurelustre.csi.azure.com/agent-not-ready`
- **Taint Effect**: `NoSchedule`

This ensures that:

1. **Node Readiness**: Pods requiring Azure Lustre storage are only scheduled to nodes where the CSI driver is fully initialized
2. **Lustre Client Ready**: The node has successfully loaded Lustre kernel modules and networking components

### Configuring Startup Taint Behavior

The startup taint functionality is enabled by default but can be configured during installation:

- **Default Behavior**: Startup taint removal is **enabled** by default
- **Disable Taint Removal**: To disable, set `--remove-not-ready-taint=false` in the driver deployment

For most AKS users, the default behavior provides optimal pod scheduling and should not be changed.
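
To check whether the startup taint is still present anywhere in the cluster, a read-only listing such as the following can help:

```shell
# List every node alongside the taint keys currently set on it
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
```

Nodes still showing `azurelustre.csi.azure.com/agent-not-ready` have not yet been marked ready by the driver.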
4 changes: 2 additions & 2 deletions go.mod
@@ -19,7 +19,9 @@ require (
golang.org/x/net v0.44.0
google.golang.org/grpc v1.75.1
google.golang.org/protobuf v1.36.9
k8s.io/api v0.31.13
k8s.io/apimachinery v0.31.13
k8s.io/client-go v1.5.2
k8s.io/klog/v2 v2.130.1
k8s.io/kubernetes v1.31.13
k8s.io/mount-utils v0.31.6
@@ -122,10 +124,8 @@ require (
gopkg.in/inf.v0 v0.9.1 // indirect
gopkg.in/yaml.v2 v2.4.0 // indirect
gopkg.in/yaml.v3 v3.0.1 // indirect
k8s.io/api v0.31.13 // indirect
k8s.io/apiextensions-apiserver v0.31.1 // indirect
k8s.io/apiserver v0.31.13 // indirect
k8s.io/client-go v1.5.2 // indirect
k8s.io/cloud-provider v0.31.1 // indirect
k8s.io/component-base v0.31.13 // indirect
k8s.io/component-helpers v0.31.13 // indirect