
Conversation

@jverre (Collaborator) commented Nov 7, 2025

Introduces the concept of experiment scores, which let you log experiment-level scores computed from experiment results. This makes it possible to log aggregate metrics such as F1-score, recall, or loss.

from typing import List
from opik.evaluation import evaluate, test_result
from opik.evaluation.metrics import Hallucination, score_result

# Define an experiment score function
def compute_hallucination_max(
    test_results: List[test_result.TestResult],
) -> List[score_result.ScoreResult]:
    """Compute the maximum hallucination score across all test results."""
    hallucination_scores = [
        result.score_results[0].value 
        for result in test_results 
        if result.score_results and len(result.score_results) > 0
    ]
    
    if not hallucination_scores:
        return []
    
    return [
        score_result.ScoreResult(
            name="hallucination_metric (max)",
            value=max(hallucination_scores),
            reason=f"Maximum hallucination score across {len(hallucination_scores)} test cases"
        )
    ]

# Run evaluation with experiment scores
evaluation = evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[Hallucination()],
    experiment_scores=[compute_hallucination_max],
    experiment_name="My experiment"
)

# Access experiment scores from the result
print(f"Experiment scores: {evaluation.experiment_scores}")
[Screenshot: experiment scores displayed in the Opik UI]

In the FE, the following places have been updated:

  1. Evaluation table on the home page
  2. Experiment list page: the chart and table were updated, with special care taken to support groups and sorting
  3. Single experiment page: the tags at the top of the page and the feedback scores table were updated

The documentation was also updated to include this feature.

Change checklist

  • User facing
  • Documentation update

Issues

Testing

SDK and BE tests were added. Manual testing was also completed.

Documentation

Documentation was updated

Copilot AI review requested due to automatic review settings November 7, 2025 18:20
@jverre jverre requested review from a team as code owners November 7, 2025 18:20
github-actions bot (Contributor) commented Nov 7, 2025

📋 PR Linter Failed

Invalid Title Format. Your PR title must include a ticket/issue number and may optionally include component tags ([FE], [BE], etc.).

  • Internal contributors: Open a JIRA ticket and link to it: [OPIK-xxxx] or [CUST-xxxx] or [DEV-xxxx] [COMPONENT] Your change
  • External contributors: Open a Github Issue and link to it via its number: [issue-xxxx] [COMPONENT] Your change
  • No ticket: Use [NA] [COMPONENT] Your change (Issues section not required)

Example: [issue-3108] [BE] [FE] Fix authentication bug or [OPIK-1234] Fix bug or [NA] Update README

@jverre jverre changed the title from [#3764] [FE] [BE] [Docs] Introduce experiment scores to [issue-3764] [FE] [BE] [Docs] Introduce experiment scores Nov 7, 2025
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR introduces experiment scores functionality, allowing users to log aggregate metrics (like f1-score, recall, or custom statistics) at the experiment level. These scores are computed from test results and stored separately from per-trace feedback scores, enabling better experiment-level analytics.

Key changes include:

  • Python SDK support for computing and logging experiment scores via the evaluate() function
  • Backend storage and retrieval of experiment scores in ClickHouse
  • Frontend display of experiment scores alongside feedback scores in experiment lists, comparison views, and charts
  • TypeScript SDK type definitions for experiment scores

Reviewed Changes

Copilot reviewed 68 out of 68 changed files in this pull request and generated 6 comments.

Summary per file:

  • sdks/python/src/opik/evaluation/evaluator.py: Added experiment_scores parameter to evaluation functions and logic to compute/log scores
  • sdks/python/src/opik/evaluation/types.py: Defined ExperimentScoreFunction type for score computation functions
  • sdks/python/src/opik/rest_api/types/experiment_score*.py: Auto-generated Pydantic models for experiment scores
  • apps/opik-backend/src/main/java/com/comet/opik/api/ExperimentScore.java: Java model for experiment scores with validation
  • apps/opik-backend/src/main/java/com/comet/opik/domain/ExperimentDAO.java: Database operations for storing/retrieving experiment scores
  • apps/opik-backend/src/main/resources/liquibase/db-app-analytics/migrations/000046_add_experiment_scores_to_experiments.sql: Database migration adding experiment_scores column
  • apps/opik-frontend/src/components/pages/ExperimentsPage/ExperimentsPage.tsx: Frontend logic to merge and display feedback scores and experiment scores
  • sdks/typescript/src/opik/rest_api/api/types/ExperimentScore*.ts: Auto-generated TypeScript types for experiment scores
  • apps/opik-frontend/src/lib/sorting.ts: Sorting support for experiment_scores columns
  • sdks/python/tests/unit/evaluation/test_evaluate.py: Unit tests for experiment scores functionality
  • apps/opik-backend/src/test/java/com/comet/opik/api/resources/v1/priv/ExperimentsResourceTest.java: Backend integration tests for experiment scores CRUD operations

    test_results: List[test_result.TestResult],
) -> List[score_result.ScoreResult]:
    """Compute experiment-level scores from test results."""
    if not experiment_scores or not test_results:
Copilot AI commented Nov 7, 2025

The early return when test_results is empty prevents experiment score functions from executing. However, some experiment score functions may want to return a default score even with no test results (e.g., a baseline metric). Consider removing the not test_results check and letting individual score functions decide how to handle empty results, as illustrated in the sketch below the suggested change.

Suggested change:
- if not experiment_scores or not test_results:
+ if not experiment_scores:

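To make the point concrete, here is a hypothetical score function (a sketch, not code from this PR) that reports a baseline when there are no test results; the current early return would prevent it from ever running on empty input:

from typing import List
from opik.evaluation import test_result
from opik.evaluation.metrics import score_result

def baseline_or_max(
    test_results: List[test_result.TestResult],
) -> List[score_result.ScoreResult]:
    # Hypothetical fallback: only reachable if the framework still invokes
    # score functions when test_results is empty.
    if not test_results:
        return [
            score_result.ScoreResult(
                name="baseline", value=0.0, reason="No test results"
            )
        ]
    values = [r.score_results[0].value for r in test_results if r.score_results]
    if not values:
        return []
    return [
        score_result.ScoreResult(
            name="baseline",
            value=max(values),
            reason=f"Max over {len(values)} test cases",
        )
    ]
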
Comment on lines +198 to +211
const feedbackScores = (
  get(row, "feedback_scores", []) as AggregatedFeedbackScore[]
).map((score) => ({
  ...score,
  name: `${score.name} (avg)`,
  value: formatNumericData(score.value),
}));
const experimentScores = (
  get(row, "experiment_scores", []) as AggregatedFeedbackScore[]
).map((score) => ({
  ...score,
  value: formatNumericData(score.value),
}));
return [...feedbackScores, ...experimentScores];
Copilot AI commented Nov 7, 2025

The logic for transforming feedback scores and experiment scores is duplicated in multiple files (ExperimentsPage.tsx and ExperimentsTab.tsx). Consider extracting this into a shared utility function to avoid code duplication and ensure consistency.

Comment on lines +52 to +64
private String getDbField(SortingField sortingField) {
    // Handle experiment_scores.* fields - extract from JSON array
    if (sortingField.field().startsWith(EXPERIMENT_METRICS_PREFIX) && sortingField.isDynamic()) {
        String bindKey = sortingField.bindKey();
        // Extract value from experiment_scores JSON array where name matches the key
        // experiment_scores is stored as a JSON string array: [{"name": "metric1", "value": 0.5}, ...]
        // We filter the array to find the object with matching name, then extract its value
        return String.format(
                "JSONExtractFloat(arrayFirst(x -> JSONExtractString(x, 'name') == :%s, JSONExtractArrayRaw(e.experiment_scores)), 'value')",
                bindKey);
    }
    return sortingField.dbField();
}
Copilot AI commented Nov 7, 2025

arrayFirst can return NULL if no matching element is found in the array, which would cause JSONExtractFloat to fail or return unexpected results when sorting. Consider wrapping the expression with coalesce() or ifNull() to handle cases where an experiment doesn't have a particular score.

Comment on lines +647 to +654
return Mono.zip(feedbackScoresNames, experimentScoresNames)
        .map(tuple -> {
            Set<ScoreNameWithType> allScores = new java.util.HashSet<>(tuple.getT1());
            allScores.addAll(tuple.getT2());
            return allScores.stream()
                    .sorted((a, b) -> a.name().compareTo(b.name()))
                    .collect(Collectors.toList());
        });
Copilot AI commented Nov 7, 2025

Using java.util.HashSet with the fully qualified name is inconsistent with other imports in the file. Consider adding import java.util.HashSet; at the top and using HashSet directly for consistency.

Comment on lines +346 to 373
private List<FeedbackScoreAverage> calculateExperimentScoreAverages(List<Experiment> experiments) {
    Map<String, List<BigDecimal>> scoresByName = new HashMap<>();

    for (Experiment experiment : experiments) {
        if (experiment.experimentScores() != null) {
            for (ExperimentScore score : experiment.experimentScores()) {
                if (score.name() != null && score.value() != null) {
                    scoresByName.computeIfAbsent(score.name(), k -> new ArrayList<>())
                            .add(score.value());
                }
            }
        }
    }

    return scoresByName.entrySet().stream()
            .map(entry -> {
                BigDecimal average = entry.getValue().stream()
                        .reduce(BigDecimal.ZERO, BigDecimal::add)
                        .divide(BigDecimal.valueOf(entry.getValue().size()),
                                ValidationUtils.SCALE, RoundingMode.HALF_UP);
                return FeedbackScoreAverage.builder()
                        .name(entry.getKey())
                        .value(average)
                        .build();
            })
            .sorted((a, b) -> a.name().compareTo(b.name()))
            .toList();
}
Copilot AI commented Nov 7, 2025

The logic for calculating average scores is nearly identical between calculateFeedbackScoreAverages (line 312-340) and calculateExperimentScoreAverages (line 346-373). Consider extracting a common helper method that takes a function to extract scores from items, reducing code duplication.

@github-actions

This comment was marked as outdated.

github-actions bot (Contributor) commented Nov 7, 2025

SDK E2E Tests Results

0 tests   0 ✅  0s ⏱️
0 suites  0 💤
0 files    0 ❌

Results for commit 54510fa.

♻️ This comment has been updated with latest results.

github-actions bot (Contributor) commented Nov 7, 2025

Backend Tests Results

  283 files    283 suites   50m 57s ⏱️
5 434 tests 5 426 ✅ 8 💤 0 ❌
5 426 runs  5 418 ✅ 8 💤 0 ❌

Results for commit 54510fa.

♻️ This comment has been updated with latest results.

    dataset_item_ids: Optional[List[str]] = None,
    dataset_sampler: Optional[samplers.BaseDatasetSampler] = None,
    trial_count: int = 1,
    experiment_scores: Optional[List[ExperimentScoreFunction]] = None,
Collaborator

  1. The name experiment_scoring_functions is more consistent with the existing names.
  2. As soon as the evaluate function starts executing, let's do the following:
experiment_scoring_functions = [] if experiment_scoring_functions is None else experiment_scoring_functions

After this small transformation we can simplify the downstream code by omitting the "Optional" annotations used in multiple places and also get rid of the default values.

(relevant for the other evaluate_* functions too; see the sketch below)
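
A minimal sketch of that normalization, using the proposed experiment_scoring_functions name (evaluate_stub below is a hypothetical stand-in for the real evaluate signature):

from typing import List, Optional

from opik.evaluation.types import ExperimentScoreFunction  # type added in this PR

def evaluate_stub(
    experiment_scoring_functions: Optional[List[ExperimentScoreFunction]] = None,
) -> None:
    # Normalize once at the entry point; downstream helpers can then accept a
    # plain List[ExperimentScoreFunction] without Optional or default values.
    experiment_scoring_functions = (
        [] if experiment_scoring_functions is None else experiment_scoring_functions
    )
    for score_function in experiment_scoring_functions:
        ...  # compute and log experiment scores as before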

)


def log_experiment_scores(
Collaborator

I'd suggest moving it to the opik.api_objects.Experiment class and calling it directly from there, instead of adding a new function to the rest_operations helper module (it might also be useful for other users).
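
For illustration, a rough sketch of that shape; the attribute names and the REST call are assumptions, not the PR's actual implementation:

from typing import List

from opik.evaluation.metrics import score_result

class Experiment:  # sketch of opik.api_objects.Experiment
    def log_experiment_scores(
        self, scores: List[score_result.ScoreResult]
    ) -> None:
        # Hypothetical body: reuse the REST client the experiment already
        # holds instead of adding a free function to rest_operations.
        self._rest_client.experiments.log_experiment_scores(  # assumed endpoint
            experiment_id=self.id,
            scores=[{"name": s.name, "value": s.value} for s in scores],
        )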

@alexkuzmik (Collaborator) left a comment

One more comment - we need at least one e2e test for this feature in the SDK.
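
For reference, such a test could look roughly like the sketch below, assuming the e2e suite's usual setup plus hypothetical dataset and evaluation_task fixtures (Equals is used only as a cheap deterministic metric):

from typing import List

from opik.evaluation import evaluate, test_result
from opik.evaluation.metrics import Equals, score_result

def _max_score(
    test_results: List[test_result.TestResult],
) -> List[score_result.ScoreResult]:
    values = [r.score_results[0].value for r in test_results if r.score_results]
    if not values:
        return []
    return [
        score_result.ScoreResult(
            name="max_score", value=max(values), reason="Max across test cases"
        )
    ]

def test_evaluate__experiment_scores_are_logged(dataset, evaluation_task):
    evaluation = evaluate(
        dataset=dataset,
        task=evaluation_task,
        scoring_metrics=[Equals()],
        experiment_scores=[_max_score],
        experiment_name="e2e-experiment-scores",
    )
    # Assumes the result exposes scores with the ScoreResult shape shown above
    assert evaluation.experiment_scores
    assert evaluation.experiment_scores[0].name == "max_score"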



Development

Successfully merging this pull request may close these issues.

[FR]: Support for population based metrics
