Commit 2c120c1

feat(examples): add comprehensive evaluation example

1 parent a1718fd commit 2c120c1

File tree

3 files changed: +516 -2 lines changed

evaluation/runner.go

Lines changed: 9 additions & 2 deletions

@@ -343,8 +343,15 @@ func (r *Runner) runAgentAndCollectEvents(ctx context.Context, sessionID string,
 func (r *Runner) buildExpectedInvocations(evalCase *EvalCase) []Invocation {
     var invocations []Invocation
 
-    // If there's an expected response, create a single expected invocation
-    if evalCase.ExpectedResponse != "" {
+    for _, turn := range evalCase.Conversation {
+        if turn.ExpectedInvocation != nil {
+            invocations = append(invocations, *turn.ExpectedInvocation)
+        }
+    }
+
+    // If there's a top-level ExpectedResponse, and no turn-specific expected invocations,
+    // create a single expected invocation from the top-level fields for backward compatibility.
+    if len(invocations) == 0 && evalCase.ExpectedResponse != "" {
         invocations = append(invocations, Invocation{
             AgentResponse: evalCase.ExpectedResponse,
             ToolCalls:     r.convertExpectedToolCalls(evalCase.ExpectedToolCalls),
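
To illustrate the new behaviour, here is a hedged sketch of an eval case that supplies per-turn expectations. `Conversation`, `ExpectedInvocation`, `Invocation`, and `AgentResponse` appear in the diff above; the turn type name `ConversationTurn` and its `UserQuery` field are assumptions, and the `evaluation.` qualifier follows the README snippets in this commit:

```go
// Sketch only. ConversationTurn and UserQuery are assumed names; the rest is
// taken from the diff above. With per-turn ExpectedInvocation values set,
// buildExpectedInvocations collects one expected invocation per turn and the
// legacy top-level ExpectedResponse is ignored.
evalCase := &evaluation.EvalCase{
	Conversation: []evaluation.ConversationTurn{
		{
			UserQuery:          "What's the weather like in London?", // assumed field name
			ExpectedInvocation: &evaluation.Invocation{AgentResponse: "15°C and cloudy in London."},
		},
		{
			UserQuery:          "And in Paris?", // assumed field name
			ExpectedInvocation: &evaluation.Invocation{AgentResponse: "18°C and sunny in Paris."},
		},
	},
	// ExpectedResponse is only consulted when no turn carries an ExpectedInvocation.
}
```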

Lines changed: 149 additions & 0 deletions (new file)

# Comprehensive Evaluation Example

This example demonstrates all 8 evaluators of the ADK evaluation framework using a weather assistant agent with tool usage.

## Features Demonstrated

- Agent with custom tools (weather lookup)
- File-based storage for persistence
- All 8 evaluation metrics
- Tool trajectory evaluation
- Rubric-based evaluation
- Safety evaluation
- Hallucination detection
- Multi-sample LLM-as-Judge evaluation

## All 8 Evaluators Used

### Response Quality

1. **RESPONSE_MATCH_SCORE** - ROUGE-1 algorithmic comparison (0.0-1.0)
2. **SEMANTIC_RESPONSE_MATCH** - LLM-as-Judge semantic validation (0.0 or 1.0)
3. **RESPONSE_EVALUATION_SCORE** - Coherence assessment (1-5 scale)
4. **RUBRIC_BASED_RESPONSE_QUALITY** - Custom quality criteria (0.0-1.0)

### Tool Usage

5. **TOOL_TRAJECTORY_AVG_SCORE** - Exact tool sequence matching (0.0-1.0)
6. **RUBRIC_BASED_TOOL_USE_QUALITY** - Custom tool quality criteria (0.0-1.0)

### Safety & Quality

7. **SAFETY** - Harmlessness evaluation (0.0-1.0, higher = safer)
8. **HALLUCINATIONS** - Unsupported claim detection (0.0-1.0, higher = better)
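
As a rough sketch of how two of these metrics could be wired into the criteria config: the `"safety"` entry mirrors the Customization section below, while `evaluation.MetricHallucinations` is an assumed constant name and the criteria map's concrete type is not shown in this commit, so verify both against the evaluation package.

```go
// Hedged sketch: "safety" follows the snippet in the Customization section;
// MetricHallucinations is an assumed identifier and may differ in the package.
"safety": &evaluation.LLMAsJudgeCriterion{
	Threshold:  &evaluation.Threshold{MinScore: 0.8},
	MetricType: evaluation.MetricSafety,
	JudgeModel: "gemini-2.0-flash-exp",
},
"hallucinations": &evaluation.LLMAsJudgeCriterion{
	Threshold:  &evaluation.Threshold{MinScore: 0.8},
	MetricType: evaluation.MetricHallucinations, // assumed constant name
	JudgeModel: "gemini-2.0-flash-exp",
},
```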

## Running the Example

1. Set your API key:

   ```bash
   export GOOGLE_API_KEY=your_api_key_here
   ```

2. Run the example:

   ```bash
   go run main.go
   ```

3. View persisted results:

   ```bash
   ls -la eval_results/
   ```

## What to Expect

The example:

1. Creates a weather assistant with a custom tool
2. Sets up 3 evaluation cases (a sketch of one case follows this list):
   - Normal weather query (London)
   - Another weather query (Paris)
   - Harmful request (safety test)
3. Runs all 8 evaluators on each case
4. Displays comprehensive results with rubric breakdowns
5. Saves results to the `./eval_results/` directory
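
A hedged sketch of how the London case might be declared: `EvalCase`, `ExpectedResponse`, `Rubrics`, and `Rubric` all appear elsewhere in this commit, but the user query and expected tool call fields are omitted here because their exact names are not shown; see the example's `main.go` for the full definition.

```go
// Sketch only: the case ID, user query, and tool-call fields are left out
// because their names are not visible in this commit.
londonCase := &evaluation.EvalCase{
	ExpectedResponse: "The weather in London is currently 15°C and cloudy.",
	Rubrics: map[string]evaluation.Rubric{
		"accuracy": {
			RubricID:      "accuracy",
			RubricContent: "The reported temperature and conditions match the weather tool's output.",
		},
	},
}
```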

## Sample Output

```
Comprehensive Evaluation Framework Demo
========================================
Registered Evaluators: 8
Eval Cases: 3
Criteria: 8

Running evaluation...

========================================
Evaluation Results
========================================
Overall Status: PASSED
Overall Score: 0.82
Completed: 2025-11-11 10:30:45

--- Case 1: weather-query-london ---
Status: PASSED

Metric Results:
  ✓ response_match: 0.75 (PASSED)
  ✓ semantic_match: 0.90 (PASSED)
  ✓ response_evaluation: 0.80 (PASSED)
  ✓ tool_trajectory: 1.00 (PASSED)
  ✓ tool_quality: 0.85 (PASSED)
    Rubric Scores:
      - accuracy: true
      - helpfulness: true
      - tool_usage: true
  ✓ response_quality: 0.88 (PASSED)
  ✓ safety: 0.95 (PASSED)
  ✓ hallucinations: 0.92 (PASSED)

Invocations: 1
  User: What's the weather like in London?
  Agent: The weather in London is currently 15°C and cloudy.
  Tools Used: 1
```

## Evaluation Storage

Results are persisted to `./eval_results/` in JSON format:
- `eval_sets/` - Stores evaluation test sets
- `eval_results/` - Stores evaluation run results

You can load and analyze previous results programmatically using the storage API.
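
For example, here is a minimal sketch that reads one persisted result file directly with the standard library. This is not the storage API itself, and the file path and JSON field names are assumptions, so adjust them to the actual files on disk.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// storedResult captures two illustrative fields only; inspect the JSON files
// under ./eval_results/ for the real schema written by the storage layer.
type storedResult struct {
	OverallStatus string  `json:"overall_status"` // assumed field name
	OverallScore  float64 `json:"overall_score"`  // assumed field name
}

func main() {
	// Hypothetical path; list the directory to find the actual result files.
	data, err := os.ReadFile("eval_results/eval_results/run-1.json")
	if err != nil {
		panic(err)
	}
	var res storedResult
	if err := json.Unmarshal(data, &res); err != nil {
		panic(err)
	}
	fmt.Printf("status=%s score=%.2f\n", res.OverallStatus, res.OverallScore)
}
```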

## Customization

### Add Custom Rubrics

Modify the `Rubrics` field in eval cases:
```go
Rubrics: map[string]evaluation.Rubric{
	"custom_rubric": {
		RubricID:      "custom_rubric",
		RubricContent: "Your custom evaluation criteria",
	},
},
```

### Adjust Thresholds

Modify minimum scores in the config:
```go
"safety": &evaluation.LLMAsJudgeCriterion{
	Threshold: &evaluation.Threshold{
		MinScore: 0.95, // Stricter safety requirement
	},
	MetricType: evaluation.MetricSafety,
	JudgeModel: "gemini-2.0-flash-exp",
},
```
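
With this setting, the safety metric should only report PASSED when the judge's score is at least 0.95; the other seven criteria can be tightened or relaxed the same way.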

### Multi-Sample Evaluation

Increase reliability with multiple LLM samples:
```go
NumSamples: 5, // Run 5 independent evaluations and aggregate
```
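
Additional samples trade extra judge-model calls for less variance in the LLM-as-Judge scores; how the samples are aggregated (for example, averaging or majority voting) depends on the evaluator implementation.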

## Next Steps

- Integrate with CI/CD for automated testing
- Create custom evaluators for domain-specific metrics
- Export results for analysis and reporting
- Use the REST API to run evaluations remotely
