
MLflow integration with GEPA optimize_anything#228

Merged
calreynolds merged 7 commits into databricks-solutions:main from auschoi96:main
Mar 9, 2026

Conversation


@auschoi96 auschoi96 commented Mar 8, 2026

Summary

The initial PR for using GEPA did not use Claude Code as an agentic test. Evaluation was a plain LLM call with the loaded skill, after which the scorers reviewed whether that LLM's output was good or not. This was also difficult to debug without an integrated tracing capability. There were also issues with how the tools were being optimized: the code looked across all the skills and tested a single tool against every instance of a skill, which, combined with multiple iterations and passes, is very expensive.

What's in the PR

This PR addresses the following:

  1. Adds an --agent-eval flag to ensure Claude Code kicks off when evaluating the performance of a skill. The Claude Code Agent SDK is now used with the skill to determine how well the optimized skill performs.
  2. Adds an MLflow integration, with updates to the CLI commands so that you can pass in your own MLflow experiment. Traces are also tagged with the skill name for easier referencing.
  3. Adds a way to provide an experiment that may already have MLflow traces with judge feedback. The feedback is used to help GEPA optimize more efficiently by pulling relevant traces via the tags set in the point above.
  4. Fixes tool call optimization.
  5. Updates the .test README.
  6. Includes a manifest.yaml file for every single skill to configure how the scorers/judges evaluate it. This can be configured to your liking.
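The trace tagging in point 2 and the feedback lookup in point 3 imply searching an experiment for traces that carry a given skill's tag. A minimal sketch of building such a filter string, assuming a tag key of "skill" and a hypothetical helper name (the PR only says traces are tagged to the skill name; the actual tag key may differ):

```python
def skill_trace_filter(skill_name: str) -> str:
    """Build an MLflow trace-search filter matching traces tagged with a skill.

    The tag key "skill" is an assumption; adjust it to whatever key the
    optimizer actually sets when tagging traces to the skill name.
    """
    escaped = skill_name.replace("'", "\\'")  # escape quotes for the filter DSL
    return f"tags.skill = '{escaped}'"

# The resulting string could be passed as filter_string to
# mlflow.search_traces(...) to pull only that skill's traces.
print(skill_trace_filter("databricks-metric-views"))
```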

Test Plan

You can run the following commands to test the new flags and optimizations. You will need to set the correct env variables according to the .test/README.md:

This one will optimize the SQL tools:
uv run python .test/scripts/optimize.py --tools-only --tool-module sql --reflection-lm databricks/gepa-fallbacks --judge-model databricks/gepa-fallbacks --preset quick --agent-eval --mlflow-experiment "/Users/austin.choi@databricks.com/GenAI/mlflow updates/AC updates dc-assistant-agent_experiment" --max-per-skill 2

This one will optimize the databricks-metric-views skill:

uv run python .test/scripts/optimize.py databricks-metric-views --reflection-lm databricks/gepa-fallbacks --judge-model databricks/gepa-fallbacks --preset quick --agent-eval --mlflow-experiment "/Users/austin.choi@databricks.com/GenAI/mlflow updates/AC updates dc-assistant-agent_experiment" --mlflow-assessments "/Users/austin.choi@databricks.com/GenAI/mlflow updates/AC updates dc-assistant-agent_experiment"

Test plan

  • Ensured calls to LLMs are traced with MLflow
  • Ensured Claude Code is running by checking MLflow traces in the experiment
  • Ensured the CLI commands above work

…ent-eval where an instance of claude code is run to properly assess tool selection as well
…ing for per skill and tool call. Adjust this per eval.

MLflow integration improvements

@calreynolds calreynolds left a comment


beautiful 👍

@calreynolds calreynolds merged commit c1ec9ea into databricks-solutions:main Mar 9, 2026
2 checks passed
