
Conversation

@jverre (Collaborator) commented Nov 7, 2025

Introduces the concept of experiment scores, which let you log experiment-level scores computed from experiment results. This makes it possible to log aggregate metrics such as F1-score, recall, or loss.

from typing import List
from opik.evaluation import evaluate, test_result
from opik.evaluation.metrics import Hallucination, score_result

# Define an experiment score function
def compute_hallucination_max(
    test_results: List[test_result.TestResult],
) -> List[score_result.ScoreResult]:
    """Compute the maximum hallucination score across all test results."""
    hallucination_scores = [
        result.score_results[0].value 
        for result in test_results 
        if result.score_results and len(result.score_results) > 0
    ]
    
    if not hallucination_scores:
        return []
    
    return [
        score_result.ScoreResult(
            name="hallucination_metric (max)",
            value=max(hallucination_scores),
            reason=f"Maximum hallucination score across {len(hallucination_scores)} test cases"
        )
    ]

# Run evaluation with experiment scores
evaluation = evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[Hallucination()],
    experiment_scores=[compute_hallucination_max],
    experiment_name="My experiment"
)

# Access experiment scores from the result
print(f"Experiment scores: {evaluation.experiment_scores}")
[Screenshot: experiment scores displayed in the Opik UI]

In the FE, the following places have been updated:

  1. Evaluation table on the home page
  2. Experiment list page: the chart and table were updated, with special care taken to support groups and sorting
  3. Single experiment page: the tags at the top of the page and the feedback scores table were updated

The documentation was also updated to include this feature.

Change checklist

  • User facing
  • Documentation update

Issues

Testing

SDK and BE tests were added. Manual testing was also completed.

Documentation

Documentation was updated

Copilot AI review requested due to automatic review settings November 7, 2025 18:20
@jverre jverre requested review from a team as code owners November 7, 2025 18:20
github-actions bot (Contributor) commented Nov 7, 2025

📋 PR Linter Failed

Invalid Title Format. Your PR title must include a ticket/issue number and may optionally include component tags ([FE], [BE], etc.).

  • Internal contributors: Open a JIRA ticket and link to it: [OPIK-xxxx] or [CUST-xxxx] or [DEV-xxxx] [COMPONENT] Your change
  • External contributors: Open a Github Issue and link to it via its number: [issue-xxxx] [COMPONENT] Your change
  • No ticket: Use [NA] [COMPONENT] Your change (Issues section not required)

Example: [issue-3108] [BE] [FE] Fix authentication bug or [OPIK-1234] Fix bug or [NA] Update README

@jverre jverre changed the title from [#3764] [FE] [BE] [Docs] Introduce experiment scores to [issue-3764] [FE] [BE] [Docs] Introduce experiment scores Nov 7, 2025
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR introduces experiment scores functionality, allowing users to log aggregate metrics (like f1-score, recall, or custom statistics) at the experiment level. These scores are computed from test results and stored separately from per-trace feedback scores, enabling better experiment-level analytics.

Key changes include:

  • Python SDK support for computing and logging experiment scores via the evaluate() function
  • Backend storage and retrieval of experiment scores in ClickHouse
  • Frontend display of experiment scores alongside feedback scores in experiment lists, comparison views, and charts
  • TypeScript SDK type definitions for experiment scores

Reviewed Changes

Copilot reviewed 68 out of 68 changed files in this pull request and generated 6 comments.

Summary per file:

  • sdks/python/src/opik/evaluation/evaluator.py: Added experiment_scores parameter to evaluation functions and logic to compute/log scores
  • sdks/python/src/opik/evaluation/types.py: Defined ExperimentScoreFunction type for score computation functions
  • sdks/python/src/opik/rest_api/types/experiment_score*.py: Auto-generated Pydantic models for experiment scores
  • apps/opik-backend/src/main/java/com/comet/opik/api/ExperimentScore.java: Java model for experiment scores with validation
  • apps/opik-backend/src/main/java/com/comet/opik/domain/ExperimentDAO.java: Database operations for storing/retrieving experiment scores
  • apps/opik-backend/src/main/resources/liquibase/db-app-analytics/migrations/000046_add_experiment_scores_to_experiments.sql: Database migration adding experiment_scores column
  • apps/opik-frontend/src/components/pages/ExperimentsPage/ExperimentsPage.tsx: Frontend logic to merge and display feedback scores and experiment scores
  • sdks/typescript/src/opik/rest_api/api/types/ExperimentScore*.ts: Auto-generated TypeScript types for experiment scores
  • apps/opik-frontend/src/lib/sorting.ts: Sorting support for experiment_scores columns
  • sdks/python/tests/unit/evaluation/test_evaluate.py: Unit tests for experiment scores functionality
  • apps/opik-backend/src/test/java/com/comet/opik/api/resources/v1/priv/ExperimentsResourceTest.java: Backend integration tests for experiment scores CRUD operations

    test_results: List[test_result.TestResult],
) -> List[score_result.ScoreResult]:
    """Compute experiment-level scores from test results."""
    if not experiment_scores or not test_results:
Copilot AI commented Nov 7, 2025

The early return when test_results is empty prevents experiment score functions from executing. However, some experiment score functions may want to return a default score even with no test results (e.g., a baseline metric). Consider removing the not test_results check and letting individual score functions decide how to handle empty results, as illustrated in the sketch below the suggested change.

Suggested change:
- if not experiment_scores or not test_results:
+ if not experiment_scores:

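To make the point concrete, here is a hypothetical score function (a sketch, not code from this PR) that reports a baseline when there are no test results; the current early return would prevent it from ever running on empty input:

from typing import List
from opik.evaluation import test_result
from opik.evaluation.metrics import score_result

def baseline_or_max(
    test_results: List[test_result.TestResult],
) -> List[score_result.ScoreResult]:
    # Hypothetical fallback: only reachable if the framework still invokes
    # score functions when test_results is empty.
    if not test_results:
        return [
            score_result.ScoreResult(
                name="baseline", value=0.0, reason="No test results"
            )
        ]
    values = [r.score_results[0].value for r in test_results if r.score_results]
    if not values:
        return []
    return [
        score_result.ScoreResult(
            name="baseline",
            value=max(values),
            reason=f"Max over {len(values)} test cases",
        )
    ]
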
Comment on lines +198 to +211
const feedbackScores = (
  get(row, "feedback_scores", []) as AggregatedFeedbackScore[]
).map((score) => ({
  ...score,
  name: `${score.name} (avg)`,
  value: formatNumericData(score.value),
}));
const experimentScores = (
  get(row, "experiment_scores", []) as AggregatedFeedbackScore[]
).map((score) => ({
  ...score,
  value: formatNumericData(score.value),
}));
return [...feedbackScores, ...experimentScores];
Copilot AI commented Nov 7, 2025

The logic for transforming feedback scores and experiment scores is duplicated in multiple files (ExperimentsPage.tsx and ExperimentsTab.tsx). Consider extracting this into a shared utility function to avoid code duplication and ensure consistency.

Comment on lines +52 to +64
private String getDbField(SortingField sortingField) {
    // Handle experiment_scores.* fields - extract from JSON array
    if (sortingField.field().startsWith(EXPERIMENT_METRICS_PREFIX) && sortingField.isDynamic()) {
        String bindKey = sortingField.bindKey();
        // Extract value from experiment_scores JSON array where name matches the key
        // experiment_scores is stored as a JSON string array: [{"name": "metric1", "value": 0.5}, ...]
        // We filter the array to find the object with matching name, then extract its value
        return String.format(
                "JSONExtractFloat(arrayFirst(x -> JSONExtractString(x, 'name') == :%s, JSONExtractArrayRaw(e.experiment_scores)), 'value')",
                bindKey);
    }
    return sortingField.dbField();
}
Copilot AI commented Nov 7, 2025

arrayFirst can return NULL if no matching element is found in the array, which would cause JSONExtractFloat to fail or return unexpected results when sorting. Consider wrapping the expression with coalesce() or ifNull() to handle cases where an experiment doesn't have a particular score.

Comment on lines +647 to +654
return Mono.zip(feedbackScoresNames, experimentScoresNames)
        .map(tuple -> {
            Set<ScoreNameWithType> allScores = new java.util.HashSet<>(tuple.getT1());
            allScores.addAll(tuple.getT2());
            return allScores.stream()
                    .sorted((a, b) -> a.name().compareTo(b.name()))
                    .collect(Collectors.toList());
        });
Copilot AI commented Nov 7, 2025

Using java.util.HashSet with the fully qualified name is inconsistent with other imports in the file. Consider adding import java.util.HashSet; at the top and using HashSet directly for consistency.

Comment on lines +346 to 373
private List<FeedbackScoreAverage> calculateExperimentScoreAverages(List<Experiment> experiments) {
    Map<String, List<BigDecimal>> scoresByName = new HashMap<>();

    for (Experiment experiment : experiments) {
        if (experiment.experimentScores() != null) {
            for (ExperimentScore score : experiment.experimentScores()) {
                if (score.name() != null && score.value() != null) {
                    scoresByName.computeIfAbsent(score.name(), k -> new ArrayList<>())
                            .add(score.value());
                }
            }
        }
    }

    return scoresByName.entrySet().stream()
            .map(entry -> {
                BigDecimal average = entry.getValue().stream()
                        .reduce(BigDecimal.ZERO, BigDecimal::add)
                        .divide(BigDecimal.valueOf(entry.getValue().size()),
                                ValidationUtils.SCALE, RoundingMode.HALF_UP);
                return FeedbackScoreAverage.builder()
                        .name(entry.getKey())
                        .value(average)
                        .build();
            })
            .sorted((a, b) -> a.name().compareTo(b.name()))
            .toList();
}
Copilot AI commented Nov 7, 2025

The logic for calculating average scores is nearly identical between calculateFeedbackScoreAverages (line 312-340) and calculateExperimentScoreAverages (line 346-373). Consider extracting a common helper method that takes a function to extract scores from items, reducing code duplication.

@github-actions

This comment was marked as outdated.

github-actions bot (Contributor) commented Nov 7, 2025

SDK E2E Tests Results

0 tests   0 ✅  0s ⏱️
0 suites  0 💤
0 files    0 ❌

Results for commit 54510fa.

♻️ This comment has been updated with latest results.

github-actions bot (Contributor) commented Nov 7, 2025

Backend Tests Results

  283 files    283 suites   50m 57s ⏱️
5 434 tests 5 426 ✅ 8 💤 0 ❌
5 426 runs  5 418 ✅ 8 💤 0 ❌

Results for commit 54510fa.

♻️ This comment has been updated with latest results.

    dataset_item_ids: Optional[List[str]] = None,
    dataset_sampler: Optional[samplers.BaseDatasetSampler] = None,
    trial_count: int = 1,
    experiment_scores: Optional[List[ExperimentScoreFunction]] = None,
Collaborator

  1. The name experiment_scoring_functions is more consistent with the existing names.
  2. As soon as the evaluate function starts executing, let's do the following:
experiment_scoring_functions = [] if experiment_scoring_functions is None else experiment_scoring_functions

After this small transformation we can simplify the downstream code by omitting the "Optional" annotations used in multiple places and also get rid of the default values.

(relevant for the other evaluate_* functions too; see the sketch below)
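
A minimal sketch of that normalization, using the proposed experiment_scoring_functions name (evaluate_stub below is a hypothetical stand-in for the real evaluate signature):

from typing import List, Optional

from opik.evaluation.types import ExperimentScoreFunction  # type added in this PR

def evaluate_stub(
    experiment_scoring_functions: Optional[List[ExperimentScoreFunction]] = None,
) -> None:
    # Normalize once at the entry point; downstream helpers can then accept a
    # plain List[ExperimentScoreFunction] without Optional or default values.
    experiment_scoring_functions = (
        [] if experiment_scoring_functions is None else experiment_scoring_functions
    )
    for score_function in experiment_scoring_functions:
        ...  # compute and log experiment scores as before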

)


def log_experiment_scores(
Collaborator

I'd suggest moving it to the opik.api_objects.Experiment class and calling it directly from there, instead of adding a new function to the rest_operations helper module (it might also be useful for other users).
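
For illustration, a rough sketch of that shape; the attribute names and the REST call are assumptions, not the PR's actual implementation:

from typing import List

from opik.evaluation.metrics import score_result

class Experiment:  # sketch of opik.api_objects.Experiment
    def log_experiment_scores(
        self, scores: List[score_result.ScoreResult]
    ) -> None:
        # Hypothetical body: reuse the REST client the experiment already
        # holds instead of adding a free function to rest_operations.
        self._rest_client.experiments.log_experiment_scores(  # assumed endpoint
            experiment_id=self.id,
            scores=[{"name": s.name, "value": s.value} for s in scores],
        )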

@alexkuzmik (Collaborator) left a comment

One more comment - we need at least one e2e test for this feature in the SDK.
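
For reference, such a test could look roughly like the sketch below, assuming the e2e suite's usual setup plus hypothetical dataset and evaluation_task fixtures (Equals is used only as a cheap deterministic metric):

from typing import List

from opik.evaluation import evaluate, test_result
from opik.evaluation.metrics import Equals, score_result

def _max_score(
    test_results: List[test_result.TestResult],
) -> List[score_result.ScoreResult]:
    values = [r.score_results[0].value for r in test_results if r.score_results]
    if not values:
        return []
    return [
        score_result.ScoreResult(
            name="max_score", value=max(values), reason="Max across test cases"
        )
    ]

def test_evaluate__experiment_scores_are_logged(dataset, evaluation_task):
    evaluation = evaluate(
        dataset=dataset,
        task=evaluation_task,
        scoring_metrics=[Equals()],
        experiment_scores=[_max_score],
        experiment_name="e2e-experiment-scores",
    )
    # Assumes the result exposes scores with the ScoreResult shape shown above
    assert evaluation.experiment_scores
    assert evaluation.experiment_scores[0].name == "max_score"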



Development

Successfully merging this pull request may close these issues.

[FR]: Support for population based metrics
