[issue-3764] [FE] [BE] [Docs] Introduce experiment scores #3989
base: main
Conversation
📋 PR Linter Failed ❌ Invalid Title Format. Your PR title must include a ticket/issue number and may optionally include component tags.
Pull Request Overview
This PR introduces experiment scores functionality, allowing users to log aggregate metrics (like f1-score, recall, or custom statistics) at the experiment level. These scores are computed from test results and stored separately from per-trace feedback scores, enabling better experiment-level analytics.
Key changes include:
- Python SDK support for computing and logging experiment scores via the `evaluate()` function
- Backend storage and retrieval of experiment scores in ClickHouse
- Frontend display of experiment scores alongside feedback scores in experiment lists, comparison views, and charts
- TypeScript SDK type definitions for experiment scores
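For orientation, here is a hedged sketch of how the new parameter might be used from the Python SDK, pieced together from the signatures visible in this diff. The `ScoreResult` import path matches the SDK, but the `result.score_results` attribute access, the metric names, and the `dataset`/`llm_task` placeholders are assumptions for illustration:

```python
from typing import List

from opik.evaluation import evaluate
from opik.evaluation.metrics import score_result


def f1_score(test_results: List) -> List[score_result.ScoreResult]:
    # Aggregate per-trace precision/recall feedback scores into a single
    # experiment-level F1 value (score names here are illustrative).
    precisions, recalls = [], []
    for result in test_results:
        for score in result.score_results:
            if score.name == "precision":
                precisions.append(score.value)
            elif score.name == "recall":
                recalls.append(score.value)
    p = sum(precisions) / len(precisions) if precisions else 0.0
    r = sum(recalls) / len(recalls) if recalls else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return [score_result.ScoreResult(name="f1-score", value=f1)]


evaluation = evaluate(
    dataset=dataset,  # placeholder: an existing opik Dataset
    task=llm_task,    # placeholder: the task under evaluation
    experiment_scores=[f1_score],
)
```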
Reviewed Changes
Copilot reviewed 68 out of 68 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| sdks/python/src/opik/evaluation/evaluator.py | Added experiment_scores parameter to evaluation functions and logic to compute/log scores |
| sdks/python/src/opik/evaluation/types.py | Defined ExperimentScoreFunction type for score computation functions |
| sdks/python/src/opik/rest_api/types/experiment_score*.py | Auto-generated Pydantic models for experiment scores |
| apps/opik-backend/src/main/java/com/comet/opik/api/ExperimentScore.java | Java model for experiment scores with validation |
| apps/opik-backend/src/main/java/com/comet/opik/domain/ExperimentDAO.java | Database operations for storing/retrieving experiment scores |
| apps/opik-backend/src/main/resources/liquibase/db-app-analytics/migrations/000046_add_experiment_scores_to_experiments.sql | Database migration adding experiment_scores column |
| apps/opik-frontend/src/components/pages/ExperimentsPage/ExperimentsPage.tsx | Frontend logic to merge and display feedback scores and experiment scores |
| sdks/typescript/src/opik/rest_api/api/types/ExperimentScore*.ts | Auto-generated TypeScript types for experiment scores |
| apps/opik-frontend/src/lib/sorting.ts | Sorting support for experiment_scores columns |
| sdks/python/tests/unit/evaluation/test_evaluate.py | Unit tests for experiment scores functionality |
| apps/opik-backend/src/test/java/com/comet/opik/api/resources/v1/priv/ExperimentsResourceTest.java | Backend integration tests for experiment scores CRUD operations |
```python
    test_results: List[test_result.TestResult],
) -> List[score_result.ScoreResult]:
    """Compute experiment-level scores from test results."""
    if not experiment_scores or not test_results:
```
Copilot AI · Nov 7, 2025
The early return when `test_results` is empty prevents experiment score functions from executing. However, some experiment score functions may want to return a default score even with no test results (e.g., a baseline metric). Consider removing the `not test_results` check and letting individual score functions decide how to handle empty results.
Suggested change:
```diff
-if not experiment_scores or not test_results:
+if not experiment_scores:
```
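To make the trade-off concrete, here is a hypothetical score function that would be silently skipped by the current early return, even though it can report a meaningful default on an empty experiment (the function and score names are illustrative, not from this PR):

```python
from typing import List

from opik.evaluation.metrics import score_result


def pass_rate(test_results: List) -> List[score_result.ScoreResult]:
    # Fraction of test results whose scores all cleared a 0.5 threshold;
    # with no results this still reports an explicit 0.0 baseline.
    passed = sum(
        1
        for result in test_results
        if all(s.value >= 0.5 for s in result.score_results)
    )
    return [
        score_result.ScoreResult(
            name="pass-rate",
            value=passed / len(test_results) if test_results else 0.0,
        )
    ]
```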
```tsx
const feedbackScores = (
  get(row, "feedback_scores", []) as AggregatedFeedbackScore[]
).map((score) => ({
  ...score,
  name: `${score.name} (avg)`,
  value: formatNumericData(score.value),
}));
const experimentScores = (
  get(row, "experiment_scores", []) as AggregatedFeedbackScore[]
).map((score) => ({
  ...score,
  value: formatNumericData(score.value),
}));
return [...feedbackScores, ...experimentScores];
```
Copilot AI · Nov 7, 2025
The logic for transforming feedback scores and experiment scores is duplicated in multiple files (ExperimentsPage.tsx and ExperimentsTab.tsx). Consider extracting this into a shared utility function to avoid code duplication and ensure consistency.
```java
private String getDbField(SortingField sortingField) {
    // Handle experiment_scores.* fields - extract from JSON array
    if (sortingField.field().startsWith(EXPERIMENT_METRICS_PREFIX) && sortingField.isDynamic()) {
        String bindKey = sortingField.bindKey();
        // Extract value from experiment_scores JSON array where name matches the key
        // experiment_scores is stored as a JSON string array: [{"name": "metric1", "value": 0.5}, ...]
        // We filter the array to find the object with matching name, then extract its value
        return String.format(
                "JSONExtractFloat(arrayFirst(x -> JSONExtractString(x, 'name') == :%s, JSONExtractArrayRaw(e.experiment_scores)), 'value')",
                bindKey);
    }
    return sortingField.dbField();
}
```
Copilot AI · Nov 7, 2025
arrayFirst can return NULL if no matching element is found in the array, which would cause JSONExtractFloat to fail or return unexpected results when sorting. Consider wrapping the expression with coalesce() or ifNull() to handle cases where an experiment doesn't have a particular score.
```java
return Mono.zip(feedbackScoresNames, experimentScoresNames)
        .map(tuple -> {
            Set<ScoreNameWithType> allScores = new java.util.HashSet<>(tuple.getT1());
            allScores.addAll(tuple.getT2());
            return allScores.stream()
                    .sorted((a, b) -> a.name().compareTo(b.name()))
                    .collect(Collectors.toList());
        });
```
Copilot AI · Nov 7, 2025
Using java.util.HashSet with the fully qualified name is inconsistent with other imports in the file. Consider adding import java.util.HashSet; at the top and using HashSet directly for consistency.
```java
private List<FeedbackScoreAverage> calculateExperimentScoreAverages(List<Experiment> experiments) {
    Map<String, List<BigDecimal>> scoresByName = new HashMap<>();

    for (Experiment experiment : experiments) {
        if (experiment.experimentScores() != null) {
            for (ExperimentScore score : experiment.experimentScores()) {
                if (score.name() != null && score.value() != null) {
                    scoresByName.computeIfAbsent(score.name(), k -> new ArrayList<>())
                            .add(score.value());
                }
            }
        }
    }

    return scoresByName.entrySet().stream()
            .map(entry -> {
                BigDecimal average = entry.getValue().stream()
                        .reduce(BigDecimal.ZERO, BigDecimal::add)
                        .divide(BigDecimal.valueOf(entry.getValue().size()),
                                ValidationUtils.SCALE, RoundingMode.HALF_UP);
                return FeedbackScoreAverage.builder()
                        .name(entry.getKey())
                        .value(average)
                        .build();
            })
            .sorted((a, b) -> a.name().compareTo(b.name()))
            .toList();
}
```
Copilot AI · Nov 7, 2025
The logic for calculating average scores is nearly identical between calculateFeedbackScoreAverages (line 312-340) and calculateExperimentScoreAverages (line 346-373). Consider extracting a common helper method that takes a function to extract scores from items, reducing code duplication.
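The suggested extraction is language-agnostic; sketched here in Python rather than the PR's Java for brevity (all names are illustrative): one helper that averages values grouped by name, parameterized by a score-extractor function that each caller supplies.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def average_scores_by_name(
    items: Iterable,
    extract_scores: Callable[[object], List[Tuple[str, float]]],
) -> List[Tuple[str, float]]:
    # Group raw values by score name, then average each group; callers
    # pass an extractor for feedback scores or for experiment scores.
    grouped: Dict[str, List[float]] = {}
    for item in items:
        for name, value in extract_scores(item):
            grouped.setdefault(name, []).append(value)
    return sorted(
        (name, sum(values) / len(values)) for name, values in grouped.items()
    )
```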
SDK E2E Tests Results: 0 tests, 0 ✅, 0s ⏱️. Results for commit 54510fa. ♻️ This comment has been updated with latest results.
Backend Tests Results: 283 files, 283 suites, 50m 57s ⏱️. Results for commit 54510fa. ♻️ This comment has been updated with latest results.
🌿 Preview your docs: https://opik-preview-daeca528-ddb3-48b8-80ae-a709540fb7c8.docs.buildwithfern.com/docs/opik

The following broken links were found:
- https://opik-preview-daeca528-ddb3-48b8-80ae-a709540fb7c8.docs.buildwithfern.com/docs/opik/integrations/gretel
- https://opik-preview-daeca528-ddb3-48b8-80ae-a709540fb7c8.docs.buildwithfern.com/docs/opik/integrations/gretel/
```python
    dataset_item_ids: Optional[List[str]] = None,
    dataset_sampler: Optional[samplers.BaseDatasetSampler] = None,
    trial_count: int = 1,
    experiment_scores: Optional[List[ExperimentScoreFunction]] = None,
```
- `experiment_scoring_functions` name is more consistent with the existing names.
- As soon as the `evaluate` function starts executing, let's do the following:

```python
experiment_scoring_functions = [] if experiment_scoring_functions is None else experiment_scoring_functions
```

After this small transformation we can simplify the following code by omitting `Optional` in multiple places and also get rid of default values.
(This is relevant for the other `evaluate_*` functions too.)
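A minimal sketch of the proposed shape, with the real `evaluate` signature reduced to the one parameter under discussion (the `ExperimentScoreFunction` alias is stubbed here; the actual SDK definition differs):

```python
from typing import Callable, List, Optional

# Stand-in for the SDK's ExperimentScoreFunction alias (assumption).
ExperimentScoreFunction = Callable[..., list]


def evaluate(
    experiment_scoring_functions: Optional[List[ExperimentScoreFunction]] = None,
) -> None:
    # Normalize once at the function boundary so downstream code can
    # treat the value as a plain list and drop Optional everywhere else.
    experiment_scoring_functions = (
        [] if experiment_scoring_functions is None else experiment_scoring_functions
    )
    for scoring_function in experiment_scoring_functions:
        ...  # no None checks needed past this point
```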
```python
    )


def log_experiment_scores(
```
I'd suggest moving it to the `opik.api_objects.Experiment` class and calling it directly from there, without putting a new function into the rest_operations helper module (it might also be useful for other users).
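A hypothetical sketch of that move; the class is a simplified stand-in, and the REST call at the end is an assumption about the shape, not the PR's actual implementation:

```python
from typing import Any, List


class Experiment:
    """Simplified stand-in for opik.api_objects.experiment.Experiment."""

    def __init__(self, id: str, rest_client: Any) -> None:
        self.id = id
        self._rest_client = rest_client

    def log_experiment_scores(self, scores: List[Any]) -> None:
        # Living on Experiment, the helper is discoverable by other SDK
        # users instead of being tucked away in rest_operations.
        # The endpoint call below is assumed, not taken from the PR.
        self._rest_client.experiments.update_experiment(
            id=self.id, experiment_scores=scores
        )
```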
alexkuzmik left a comment
One more comment - we need at least one e2e test for this feature in the SDK.
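A hedged sketch of what such an e2e test might look like; the `get_experiment_by_id` helper and the `experiment_scores` attribute on the returned experiment are assumptions about the final API surface:

```python
from typing import List

import opik
from opik.evaluation import evaluate
from opik.evaluation.metrics import score_result


def constant_score(test_results: List) -> List[score_result.ScoreResult]:
    return [score_result.ScoreResult(name="e2e-score", value=1.0)]


def test_evaluate__with_experiment_scores__scores_logged_to_backend():
    client = opik.Opik()
    dataset = client.get_or_create_dataset("e2e-experiment-scores")
    dataset.insert([{"input": "hello"}])

    result = evaluate(
        dataset=dataset,
        task=lambda item: {"output": item["input"]},
        scoring_metrics=[],
        experiment_scores=[constant_score],
    )

    # Read the experiment back and assert the score round-tripped
    # (retrieval helper and attribute names are assumptions).
    experiment = client.get_experiment_by_id(result.experiment_id)
    assert any(s.name == "e2e-score" for s in experiment.experiment_scores)
```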
Details
Introduces the concept of experiment scores, which allow you to log experiment-level scores based on experiment results. This allows you to log metrics like `f1-score`, `recall`, or `last`, for example.

In the FE, the following places have been updated: experiment lists, comparison views, and charts.
The documentation was also updated to include this feature.
Change checklist
Issues
issue-3764
Testing
SDK and BE tests were added. Manual testing was also completed.
Documentation
Documentation was updated