Add gh devlake diagnose — AI-powered troubleshooting #64

@ewega

Problem

When something goes wrong with a DevLake deployment, users must manually:

  1. Run gh devlake status and read the output
  2. Test each connection individually
  3. Check pipeline logs in the Config UI
  4. Correlate error messages across services

There's no single command that inspects the entire stack, identifies problems, and explains what's wrong with actionable remediation steps.

Proposed Solution

Add gh devlake diagnose — an AI-powered diagnostic command that runs all health checks, connection tests, and pipeline inspections, then synthesizes a diagnosis with remediation commands.

Command surface

# Full diagnostic
gh devlake diagnose

# Focus on a specific area
gh devlake diagnose --scope connections
gh devlake diagnose --scope pipelines
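
One way the --scope value could be validated before any checks run is sketched below. The Scope type, its constants, and ParseScope are hypothetical names, not part of the existing CLI; the sketch assumes an omitted flag means a full diagnostic and that only the two values listed above are accepted.

```go
package main

import "fmt"

// Scope names the area a diagnostic run covers. The type and constants are
// illustrative; the proposal only specifies the two flag values.
type Scope string

const (
	ScopeAll         Scope = "" // assumed: flag omitted means full diagnostic
	ScopeConnections Scope = "connections"
	ScopePipelines   Scope = "pipelines"
)

// ParseScope validates the raw --scope value and rejects anything unknown.
func ParseScope(raw string) (Scope, error) {
	switch s := Scope(raw); s {
	case ScopeAll, ScopeConnections, ScopePipelines:
		return s, nil
	default:
		return "", fmt.Errorf("invalid --scope %q: expected %q or %q",
			raw, ScopeConnections, ScopePipelines)
	}
}

// Includes reports whether checks belonging to area should run under s.
func (s Scope) Includes(area Scope) bool {
	return s == ScopeAll || s == area
}
```

A caller would then guard each group of checks with something like scope.Includes(ScopeConnections), so the full run and the scoped runs share one code path.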

How it works

  1. Gather data — run all checks programmatically (no user interaction needed):

    • Ping all endpoints (backend, Config UI, Grafana)
    • Test all saved connections across all plugins
    • Fetch recent pipeline runs and their error messages
    • Check DB connectivity
    • Read state file for deployment context
  2. Send to Copilot SDK: package all results into a structured context and send it to the LLM with a diagnostic prompt

  3. Render diagnosis: the LLM synthesizes the findings into a plain-language explanation with actionable gh devlake commands; the full response is collected first and then formatted (see Output mode)

Example output

$ gh devlake diagnose

🔍 Running diagnostics...
   ✅ Backend API: http://localhost:8080 (healthy)
   ✅ Config UI: http://localhost:4000 (healthy)
   ✅ Grafana: http://localhost:3002 (healthy)
   ❌ Connection "GitHub - my-org" (github, id=1): 401 Unauthorized
   ✅ Connection "Copilot - my-ent" (gh-copilot, id=2): healthy
   ⚠️  Pipeline #12: FAILED (2 hours ago)

📋 Diagnosis:

  Your GitHub connection "GitHub - my-org" is returning 401 Unauthorized.
  This typically means the PAT has expired or been revoked.

  To fix:
    1. Generate a new PAT with scopes: repo, read:org, read:user
    2. Update the connection:
       gh devlake configure connection update --plugin github --id 1 --token ghp_NEW_TOKEN

  Pipeline #12 failed because it depends on this connection.
  After updating the token, re-trigger collection:
    gh devlake configure project add --project-name my-team

Architecture

Reuses the internal/copilot/ package from #63. Adds diagnostic-specific tools:

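// Tool: test_all_connections
// Batch-tests every connection across all plugins and returns results.
// Errors are recorded per plugin/connection instead of being discarded,
// so one failing plugin still yields partial results.
var testAllConnectionsTool = copilot.DefineTool("test_all_connections",
    "Test all saved DevLake connections and return pass/fail status for each",
    func(params struct{}, inv copilot.ToolInvocation) (any, error) {
        client := devlake.NewClient(apiURL)
        var results []ConnectionTestResult
        for _, def := range connectionRegistry {
            conns, err := client.ListConnections(def.Plugin)
            if err != nil {
                // Could not list this plugin's connections; report and move on.
                results = append(results, ConnectionTestResult{
                    Plugin: def.Plugin, Message: "list connections: " + err.Error(),
                })
                continue
            }
            for _, conn := range conns {
                res := ConnectionTestResult{Plugin: def.Plugin, ID: conn.ID, Name: conn.Name}
                if test, err := client.TestSavedConnection(def.Plugin, conn.ID); err != nil {
                    res.Message = err.Error() // Healthy stays false
                } else {
                    res.Healthy, res.Message = test.Success, test.Message
                }
                results = append(results, res)
            }
        }
        return results, nil
    })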

// Tool: get_recent_pipeline_errors
// Fetches recent failed pipelines with error details
var getRecentPipelineErrorsTool = copilot.DefineTool("get_recent_pipeline_errors",
    "Get recent failed DevLake pipeline runs with error messages and timestamps",
    func(params struct{ Limit int `json:"limit,omitempty"` }, inv copilot.ToolInvocation) (any, error) {
        // ... fetch pipelines, filter for failures, include error details ...
    })

// Tool: check_all_endpoints
// Pings backend, Config UI, Grafana and returns status for each
var checkEndpointsTool = copilot.DefineTool("check_all_endpoints",
    "Check health of all DevLake endpoints (backend API, Config UI, Grafana)",
    func(params struct{}, inv copilot.ToolInvocation) (any, error) {
        // ... ping each endpoint from state file or discovery ...
    })

Output mode

Unlike insights (which streams), diagnose uses batch mode: collect the full response, then render with the CLI's standard emoji/box-drawing formatting. This ensures the diagnostic output has consistent visual structure.

// Wait for full response instead of streaming
response, err := session.SendAndWait(ctx, copilot.MessageOptions{
    Prompt: diagnosticPrompt,
})
if err != nil {
    return fmt.Errorf("diagnosis request failed: %w", err)
}
// Format and print response with standard CLI output conventions
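
Once the full response is in hand, the formatting step could be as simple as the sketch below. Line, Status, and RenderReport are hypothetical names, and the layout only approximates the example output above; the CLI's real formatting helpers may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// Status is a hypothetical tri-state for a check line.
type Status int

const (
	StatusOK Status = iota
	StatusWarn
	StatusFail
)

// Line is one row of the check summary.
type Line struct {
	Status Status
	Label  string
	Detail string
}

func icon(s Status) string {
	switch s {
	case StatusOK:
		return "✅"
	case StatusWarn:
		return "⚠️ "
	default:
		return "❌"
	}
}

// RenderReport formats the check summary followed by the LLM's diagnosis,
// indenting each non-empty diagnosis line by two spaces.
func RenderReport(checks []Line, diagnosis string) string {
	var b strings.Builder
	for _, c := range checks {
		fmt.Fprintf(&b, "   %s %s: %s\n", icon(c.Status), c.Label, c.Detail)
	}
	b.WriteString("\n📋 Diagnosis:\n\n")
	for _, line := range strings.Split(strings.TrimSpace(diagnosis), "\n") {
		if line == "" {
			b.WriteString("\n")
			continue
		}
		b.WriteString("  " + line + "\n")
	}
	return b.String()
}
```

Building the whole report as a string before printing is what batch mode buys: the output can be tested as a pure function and never shows a half-rendered diagnosis.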

System prompt for diagnosis

The system message includes:

  • DevLake architecture context (three-layer model, plugin structure)
  • Available gh devlake commands for remediation
  • Common failure patterns and their fixes
  • The user's deployment type (local vs Azure) from the state file

Files to create/modify

File                          Change
cmd/diagnose.go               NEW: gh devlake diagnose command
internal/copilot/tools.go     ADD: diagnostic-specific tools (test_all_connections, get_recent_pipeline_errors, check_all_endpoints)
internal/copilot/system.go    ADD: diagnostic system prompt variant

Acceptance Criteria

  • gh devlake diagnose gathers all health/connection/pipeline data and produces a synthesis
  • --scope connections limits diagnosis to connection health only
  • --scope pipelines limits diagnosis to pipeline failures only
  • Diagnosis includes actionable gh devlake commands for remediation
  • Graceful error if Copilot CLI is not installed (same as insights)
  • Diagnostic data gathering works even if some endpoints are down (partial results)
  • Output uses batch mode with standard CLI formatting (not streaming)
  • go build ./... and go test ./... pass
  • README updated

Target Version

v0.4.x — AI-powered operations.

Dependencies

References
