Agentic AI + MCP is changing how we troubleshoot Kubernetes. Built a quick Go prototype that uses the MCP protocol to run tools like kubectl and k8sgpt on live clusters.
This project showcases an AI-powered Kubernetes troubleshooting assistant built on the Model Context Protocol (MCP). Implemented in Go with the mark3labs/mcp-go SDK, the prototype connects large language models (LLMs) to standard Kubernetes tools such as kubectl and k8sgpt, providing a natural-language, conversational interface for managing and troubleshooting clusters. Key features include real-time cluster monitoring, automated issue detection and resolution, intelligent command execution, and verification of applied fixes.
The Model Context Protocol (MCP) is a standardized framework for robust, flexible communication between clients and servers. It enables tools to be discovered and exposed as callable functions, supports both synchronous and asynchronous operations, ensures type-safe communication, allows bidirectional data exchange, and permits language-agnostic tool implementations, making it adaptable across diverse systems and environments.
In this project, MCP is used to:
- Expose `kubectl` and `k8sgpt` as callable tools
- Handle tool discovery and execution
- Manage communication between the AI client/agent and Kubernetes tools using Server-Sent Events (SSE). SSE enables lightweight, real-time streaming from server to client, which is ideal for delivering tool outputs and logs in MCP-based AI workflows.
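To make the transport concrete, here is a minimal stdlib-only sketch of an SSE endpoint, independent of the mcp-go SDK. The `/sse` path, port, and payload lines are illustrative assumptions, not the project's actual wiring:

```go
package main

import (
	"fmt"
	"net/http"
)

// formatSSE wraps a payload in the SSE wire format: "data: <payload>\n\n".
func formatSSE(payload string) string {
	return fmt.Sprintf("data: %s\n\n", payload)
}

// sseHandler streams tool output to the client as Server-Sent Events.
func sseHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/event-stream")
	w.Header().Set("Cache-Control", "no-cache")

	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
		return
	}

	// In the real server these would be kubectl/k8sgpt output lines.
	for _, line := range []string{"analyzing cluster...", "no problems detected"} {
		fmt.Fprint(w, formatSSE(line))
		flusher.Flush()
	}
}

func main() {
	// Wiring it into a server would look like this; commented out so the
	// sketch does not block when run directly:
	//   http.HandleFunc("/sse", sseHandler)
	//   http.ListenAndServe("localhost:8090", nil)
	fmt.Print(formatSSE("hello"))
}
```

The `data:` prefix and blank-line terminator are what make a chunk a valid SSE event; each `Flush` pushes one event to the client immediately instead of buffering the whole response.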
The end-to-end communication flow:

1. User Input → MCP Client (natural language query)
2. MCP Client → OpenAI LLM (query interpretation)
3. OpenAI LLM → MCP Client (tool selection and arguments)
4. MCP Client → MCP Server (tool execution request)
5. MCP Server → MCP Tools (kubectl/k8sgpt execution)
6. MCP Tools → Kubernetes Cluster (command execution)
7. Results flow back up through the same path
8. MCP Client → User (formatted response)
A typical troubleshooting session looks like this:

1. User: "check to see if there are problems with any pods"
2. LLM interprets the query and selects `k8sgpt analyze`
3. MCP Server executes `k8sgpt analyze`
4. Results flow back to user
5. If issues found, LLM selects kubectl commands
6. MCP Server executes kubectl commands
7. Results flow back to user
The project consists of two main components:
```go
// Create MCP server with basic capabilities
mcpServer := server.NewMCPServer(
	"kubernetes-troubleshooter",
	"1.0.0",
)

// Create and add the k8sgpt tool
k8sGptTool := mcp.NewTool(
	"k8sgpt",
	mcp.WithDescription(
		"Execute 'k8sgpt' command to interact with a Kubernetes cluster...",
	),
)

// Create and add the kubectl tool
kubectlTool := mcp.NewTool(
	"kubectl",
	mcp.WithDescription(
		"Use 'kubectl' command to check if there are any issues...",
	),
)
```
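The handlers behind these tools ultimately shell out to the CLI and return its output. A stdlib sketch of that execution step; the `runTool` helper is my name for it, not part of the project or the mcp-go SDK:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// runTool executes a CLI tool (e.g. kubectl or k8sgpt) with the given
// arguments and returns its combined stdout/stderr as a trimmed string.
func runTool(name string, args ...string) (string, error) {
	out, err := exec.Command(name, args...).CombinedOutput()
	return strings.TrimSpace(string(out)), err
}

func main() {
	// "echo" stands in for kubectl/k8sgpt so the sketch runs anywhere.
	out, err := runTool("echo", "pods are healthy")
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // pods are healthy
}
```

Returning combined output matters here: error details from kubectl and k8sgpt usually land on stderr, and the LLM needs to see them to choose a fix.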
```go
// Create a new MCP client using the SSE transport
mcpClient, err := client.NewSSEMCPClient(MCPEndpoint)
if err != nil {
	log.Fatalf("failed to create MCP client: %v", err)
}

// Initialize the OpenAI client
openaiClient := openai.NewClient()
```

When a user asks to check for pod issues:
> check to see if there are problems with any pods running in my cluster
System detects:
```
0: Pod default/nginx-5545cbc86d-wszvv(Deployment/nginx)
- Error: Back-off pulling image "nginx007": ErrImagePull: failed to pull and unpack image "docker.io/library/nginx007:latest"
```
The system automatically:
- Deletes the problematic pod
- Updates the deployment with the correct image
- Monitors the rollout
- Verifies the fix
- Go 1.16 or later installed
- A running Kubernetes cluster
- `kubectl` configured with access to your cluster
- `k8sgpt` CLI tool installed from here. K8sGPT is an AI-powered tool designed to scan your Kubernetes clusters and identify issues, translating them into easy-to-understand explanations. It can use AI mode to summarize its findings, but we are not using that mode here.
- OpenAI API key
```shell
git clone <repository-url>
cd using-ai-mcp-tools-for-troubleshooting-kubernetes
```
- Set your OpenAI API key:
```shell
export OPENAI_API_KEY="your-openai-api-key"
```
- Verify your Kubernetes cluster access:
```shell
kubectl cluster-info
```
- Verify that `k8sgpt` is working. Note that `k8sgpt` can use AI mode to summarize its findings, but we are not using that mode here.
```shell
k8sgpt analyze
```
```
AI Provider: AI not used; --explain not set

No problems detected
```
In a new terminal window:
```shell
cd server
go run server.go
```
You should see:
```
2025/06/02 07:58:51 Starting SSE server on localhost:8090
```
In another terminal window:
```shell
cd client
go run client.go
```
You should see:
```
Initializing server...
Initialized with server: kubernetes-troubleshooter 1.0.0

Available tools:
- k8sgpt: Execute 'k8sgpt' command to interact with a Kubernetes cluster...
- kubectl: Use 'kubectl' command to check if there are any issues...
```
Let's walk through a real example of troubleshooting an image pull error:
- First, let's check for any issues in the cluster:
```
Enter your question about the Kubernetes cluster (type 'quit' to exit):
check to see if there are problems with any pods running in my cluster
```
The system will respond with any detected issues:
```
0: Pod default/nginx-5545cbc86d-wszvv(Deployment/nginx)
- Error: Back-off pulling image "nginx007": ErrImagePull: failed to pull and unpack image "docker.io/library/nginx007:latest"
```
- To fix the issues, ask the system to take action:
```
Enter your question about the Kubernetes cluster (type 'quit' to exit):
check to see if there are problems with any pods running in my cluster and fix them
```
The system will:
- Delete the problematic pod
- Update the deployment with the correct image
- Monitor the rollout
- Verify the fix
- To verify the fix and wait for confirmation:
```
Enter your question about the Kubernetes cluster (type 'quit' to exit):
check to see if there are problems with any pods running in my cluster and fix them, wait for 10 seconds and confirm that the problem is resolved before completing the task
```
- Check kube-apiserver logs:
```
Enter your question about the Kubernetes cluster (type 'quit' to exit):
get me the last 5 log lines of the pod which is running kube api server in kube-system namespace
```
- Check kube-proxy logs:
```
Enter your question about the Kubernetes cluster (type 'quit' to exit):
get me the last 5 log lines of the pod running kube proxy in kube-system namespace
```
Here are some useful queries you can try:
- "Get the status of all pods in the default namespace"
- "Check if there are any authentication issues in the cluster"
- If the server fails to start:
  - Check if port 8090 is available
  - Verify your Go installation
  - Check server logs for errors
- If the client fails to connect:
  - Verify the server is running
  - Check your OpenAI API key
  - Ensure your kubeconfig is properly configured
- If commands fail:
  - Check your cluster permissions
  - Verify the tools (kubectl, k8sgpt) are installed
  - Look for error messages in the client output
To exit the client:
```
Enter your question about the Kubernetes cluster (type 'quit' to exit):
quit
```
