
MLflow integration with GEPA optimize_anything#228

Merged
calreynolds merged 7 commits into databricks-solutions:main from auschoi96:main
Mar 9, 2026

Conversation


@auschoi96 auschoi96 commented Mar 8, 2026

Summary

The initial PR for using GEPA did not use Claude Code as an agentic test. Evaluation was a plain LLM call with the loaded skill, after which the scorers reviewed whether that LLM's output was good or not. This was also difficult to debug without an integrated tracing capability. There were also issues with how the tools were being optimized: the code looked across all the skills and tested a single tool against every instance of a skill, which, combined with multiple iterations and passes, is very expensive.

What's in the PR

This PR addresses the following:

  1. Adds an --agent-eval flag to ensure Claude Code kicks off when evaluating the performance of a skill. The Claude Code Agent SDK is now used with the skill to determine how well the optimized skill performs.
  2. Adds an MLflow integration, with updates to the CLI commands so that you can pass in your own MLflow experiment. Traces are also tagged with the skill name for easier referencing.
  3. Adds a way to provide an experiment that may already have MLflow traces with judge feedback. The feedback is used to help GEPA optimize more efficiently by pulling relevant traces via the tags set in the point above.
  4. Fixes tool call optimization.
  5. Updates the .test README.
  6. Includes a manifest.yaml file for every single skill to configure how the scorers/judges evaluate it. This can be configured to your liking.
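The trace tagging in point 2 and the feedback lookup in point 3 imply searching an experiment for traces that carry a given skill's tag. A minimal sketch of building such a filter string, assuming a tag key of "skill" and a hypothetical helper name (the PR only says traces are tagged to the skill name; the actual tag key may differ):

```python
def skill_trace_filter(skill_name: str) -> str:
    """Build an MLflow trace-search filter matching traces tagged with a skill.

    The tag key "skill" is an assumption; adjust it to whatever key the
    optimizer actually sets when tagging traces to the skill name.
    """
    escaped = skill_name.replace("'", "\\'")  # escape quotes for the filter DSL
    return f"tags.skill = '{escaped}'"

# The resulting string could be passed as filter_string to
# mlflow.search_traces(...) to pull only that skill's traces.
print(skill_trace_filter("databricks-metric-views"))
```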

Test Plan

You can run the following commands to test the new flags and optimizations. You will need to set the correct env variables according to the .test/README.md:

This one will optimize the SQL tools:
uv run python .test/scripts/optimize.py --tools-only --tool-module sql --reflection-lm databricks/gepa-fallbacks --judge-model databricks/gepa-fallbacks --preset quick --agent-eval --mlflow-experiment "/Users/austin.choi@databricks.com/GenAI/mlflow updates/AC updates dc-assistant-agent_experiment" --max-per-skill 2

This one will optimize the databricks-metric-views skill:

uv run python .test/scripts/optimize.py databricks-metric-views --reflection-lm databricks/gepa-fallbacks --judge-model databricks/gepa-fallbacks --preset quick --agent-eval --mlflow-experiment "/Users/austin.choi@databricks.com/GenAI/mlflow updates/AC updates dc-assistant-agent_experiment" --mlflow-assessments "/Users/austin.choi@databricks.com/GenAI/mlflow updates/AC updates dc-assistant-agent_experiment"

Test plan

  • Ensured calls to LLMs are traced with MLflow
  • Ensured Claude Code is running by checking MLflow traces in the experiment
  • Ensured the CLI commands above work

…ent-eval where an instance of claude code is run to properly assess tool selection as well
…ing for per skill and tool call. Adjust this per eval.

MLflow integration improvements

@calreynolds calreynolds left a comment


beautiful 👍

@calreynolds calreynolds merged commit c1ec9ea into databricks-solutions:main Mar 9, 2026
2 checks passed
