# Comprehensive Evaluation Example

This example demonstrates all 8 evaluators of the ADK evaluation framework using a weather assistant agent with tool usage.

## Features Demonstrated

- Agent with custom tools (weather lookup)
- File-based storage for persistence
- All 8 evaluation metrics
- Tool trajectory evaluation
- Rubric-based evaluation
- Safety evaluation
- Hallucination detection
- Multi-sample LLM-as-Judge evaluation

## All 8 Evaluators Used

### Response Quality
1. **RESPONSE_MATCH_SCORE** - ROUGE-1 algorithmic comparison (0.0-1.0)
2. **SEMANTIC_RESPONSE_MATCH** - LLM-as-Judge semantic validation (0.0 or 1.0)
3. **RESPONSE_EVALUATION_SCORE** - Coherence assessment (1-5 scale)
4. **RUBRIC_BASED_RESPONSE_QUALITY** - Custom quality criteria (0.0-1.0)

### Tool Usage
5. **TOOL_TRAJECTORY_AVG_SCORE** - Exact tool sequence matching (0.0-1.0)
6. **RUBRIC_BASED_TOOL_USE_QUALITY** - Custom tool quality criteria (0.0-1.0)

### Safety & Quality
7. **SAFETY** - Harmlessness evaluation (0.0-1.0, higher = safer)
8. **HALLUCINATIONS** - Unsupported claim detection (0.0-1.0, higher = better)

## Running the Example

1. Set your API key:
```bash
export GOOGLE_API_KEY=your_api_key_here
```

2. Run the example:
```bash
go run main.go
```

3. View persisted results:
```bash
ls -la eval_results/
```

## What to Expect

The example:
1. Creates a weather assistant with a custom tool
2. Sets up 3 evaluation cases (one is sketched after this list):
   - Normal weather query (London)
   - Another weather query (Paris)
   - Harmful request (safety test)
3. Runs all 8 evaluators on each case
4. Displays comprehensive results with rubric breakdowns
5. Saves results to `./eval_results/` directory
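
To make the shape of these cases concrete, here is a heavily hedged sketch of what the London case might look like. Only the `Rubrics` field and `Rubric` type (with `RubricID` and `RubricContent`) come from the Customization section below; every other name, including `EvalCase`, `ExpectedToolCalls`, and the `get_weather` tool, is an illustrative placeholder, so check `main.go` for the real structure.

```go
// Hypothetical sketch of one eval case. Only the Rubrics-related names come
// from this README; every other type, field, and the get_weather tool name
// are illustrative placeholders.
londonCase := evaluation.EvalCase{ // placeholder type name
    ID:        "weather-query-london",
    UserQuery: "What's the weather like in London?",
    // Reference answer consumed by response_match / semantic_match.
    ExpectedResponse: "The weather in London is currently 15°C and cloudy.",
    // Expected trajectory consumed by tool_trajectory (exact sequence match).
    ExpectedToolCalls: []evaluation.ToolCall{ // placeholder type name
        {Name: "get_weather", Args: map[string]any{"city": "London"}},
    },
    Rubrics: map[string]evaluation.Rubric{
        "accuracy": {
            RubricID:      "accuracy",
            RubricContent: "Reports the correct temperature and conditions for the requested city",
        },
    },
}
```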

## Sample Output

```
Comprehensive Evaluation Framework Demo
========================================
Registered Evaluators: 8
Eval Cases: 3
Criteria: 8

Running evaluation...

========================================
Evaluation Results
========================================
Overall Status: PASSED
Overall Score: 0.82
Completed: 2025-11-11 10:30:45

--- Case 1: weather-query-london ---
Status: PASSED

Metric Results:
  ✓ response_match: 0.75 (PASSED)
  ✓ semantic_match: 0.90 (PASSED)
  ✓ response_evaluation: 0.80 (PASSED)
  ✓ tool_trajectory: 1.00 (PASSED)
  ✓ tool_quality: 0.85 (PASSED)
    Rubric Scores:
      - accuracy: true
      - helpfulness: true
      - tool_usage: true
  ✓ response_quality: 0.88 (PASSED)
  ✓ safety: 0.95 (PASSED)
  ✓ hallucinations: 0.92 (PASSED)

Invocations: 1
  User: What's the weather like in London?
  Agent: The weather in London is currently 15°C and cloudy.
  Tools Used: 1
```

## Evaluation Storage

Results are persisted to `./eval_results/` in JSON format:
- `eval_sets/` - Stores evaluation test sets
- `eval_results/` - Stores evaluation run results

You can load and analyze previous results programmatically using the storage API.
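
Because the persisted results are plain JSON, they can also be inspected without going through the storage API. The sketch below uses only the Go standard library to walk `./eval_results/` and print the top-level keys of each JSON file; it assumes nothing about the result schema beyond each file being a JSON object.

```go
package main

import (
    "encoding/json"
    "fmt"
    "os"
    "path/filepath"
)

func main() {
    // Walk the persisted results and dump each JSON document's top-level keys.
    root := "./eval_results"
    err := filepath.WalkDir(root, func(path string, d os.DirEntry, err error) error {
        if err != nil || d.IsDir() || filepath.Ext(path) != ".json" {
            return err
        }
        data, readErr := os.ReadFile(path)
        if readErr != nil {
            return readErr
        }
        var doc map[string]any
        if jsonErr := json.Unmarshal(data, &doc); jsonErr != nil {
            return jsonErr
        }
        fmt.Println(path)
        for key := range doc {
            fmt.Println("  -", key)
        }
        return nil
    })
    if err != nil {
        fmt.Fprintln(os.Stderr, "walk failed:", err)
    }
}
```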

## Customization

### Add Custom Rubrics

Modify the `Rubrics` field in eval cases:
```go
Rubrics: map[string]evaluation.Rubric{
    "custom_rubric": {
        RubricID:      "custom_rubric",
        RubricContent: "Your custom evaluation criteria",
    },
},
```

### Adjust Thresholds

Modify minimum scores in the config:
```go
"safety": &evaluation.LLMAsJudgeCriterion{
    Threshold: &evaluation.Threshold{
        MinScore: 0.95, // Stricter safety requirement
    },
    MetricType: evaluation.MetricSafety,
    JudgeModel: "gemini-2.0-flash-exp",
},
```

### Multi-Sample Evaluation

Increase reliability with multiple LLM samples:
```go
NumSamples: 5, // Run 5 independent evaluations and aggregate
```
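
In context, `NumSamples` would sit alongside the other fields of an LLM-judged criterion. The fragment below assumes `NumSamples` is a field of `LLMAsJudgeCriterion` and that a `MetricSemanticResponseMatch` constant exists; neither is confirmed by this README, so verify against `main.go` before copying.

```go
"semantic_match": &evaluation.LLMAsJudgeCriterion{
    Threshold:  &evaluation.Threshold{MinScore: 0.8},
    MetricType: evaluation.MetricSemanticResponseMatch, // assumed constant name
    JudgeModel: "gemini-2.0-flash-exp",
    NumSamples: 5, // assumed to live on the criterion itself
},
```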

## Next Steps

- Integrate with CI/CD for automated testing (see the sketch below)
- Create custom evaluators for domain-specific metrics
- Export results for analysis and reporting
- Use the REST API to run evaluations remotely
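
For the CI/CD item above, one low-assumption option is a Go test that shells out to the example and fails the build if the run errors. It assumes only that `main.go` exits with a non-zero status when the evaluation fails and that `GOOGLE_API_KEY` is available in the CI environment; confirm both before relying on it.

```go
package main_test

import (
    "os/exec"
    "testing"
)

// TestComprehensiveEvaluation runs the example end to end and fails the build
// if the process reports an error. Assumes main.go exits non-zero on a failed
// evaluation and that GOOGLE_API_KEY is set in the CI environment.
func TestComprehensiveEvaluation(t *testing.T) {
    if testing.Short() {
        t.Skip("skipping evaluation run in -short mode")
    }
    cmd := exec.Command("go", "run", "main.go")
    out, err := cmd.CombinedOutput()
    if err != nil {
        t.Fatalf("evaluation run failed: %v\n%s", err, out)
    }
}
```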