Conversation

@jverre jverre commented Nov 3, 2025

Details

This PR adds comprehensive multimodal content support across the Opik Optimizer SDK, enabling optimization of prompts that include both text and images. The changes ensure that multimodal message structures are preserved throughout the optimization process.

Key Changes:

  • OptimizationResult Model: Updated to support multimodal content by changing prompt and initial_prompt fields from list[dict[str, str]] to list[MessageDict], which properly supports content as either a string or a list of text/image parts.
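The new message shape can be sketched with TypedDicts. The names below mirror the PR description, but the SDK's actual MessageDict definition may differ:

```python
from typing import Literal, TypedDict, Union

# Hypothetical part types (names assumed, not taken from the SDK source):
class TextPart(TypedDict):
    type: Literal["text"]
    text: str

class ImageURL(TypedDict):
    url: str

class ImagePart(TypedDict):
    type: Literal["image_url"]
    image_url: ImageURL

class MessageDict(TypedDict):
    role: str
    # content is either a plain string or a list of text/image parts
    content: Union[str, list[Union[TextPart, ImagePart]]]

# Both a plain-text and a multimodal message satisfy the same type:
text_msg: MessageDict = {"role": "user", "content": "Describe the scene."}
multi_msg: MessageDict = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What hazard is visible?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/road.jpg"}},
    ],
}
```

The old list[dict[str, str]] type could not represent multi_msg at all, since its content value is a list rather than a string.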

  • Hierarchical Reflective Optimizer:

    • Fixed an issue where multimodal content was converted to a string representation after the first optimization run
    • Updated PromptMessage model to use MessageDict type for consistency with existing codebase
    • Implemented structured outputs using Pydantic models for more robust LLM responses
    • Changed template formatting to use JSON serialization to preserve multimodal structure
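The motivation for JSON serialization can be illustrated in isolation (this is not the optimizer's actual template code): formatting multimodal content with str() yields a Python repr that downstream parsing cannot reliably round-trip, while json.dumps keeps the structure recoverable.

```python
import json

content = [
    {"type": "text", "text": "Identify the hazard."},
    {"type": "image_url", "image_url": {"url": "https://example.com/img.jpg"}},
]

# str() produces a Python repr (single quotes), which is not valid JSON and
# loses the guarantee that the structure can be parsed back:
flattened = str(content)

# json.dumps keeps the multimodal structure round-trippable:
serialized = json.dumps(content)
restored = json.loads(serialized)
assert restored == content
```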
  • Meta Prompt Optimizer:

    • Replaced manual JSON parsing with structured outputs using Pydantic models
    • Created dedicated types.py file with PromptCandidate, CandidatePromptsResponse, ToolDescriptionCandidate, and ToolDescriptionsResponse models
    • Removed brittle regex-based JSON extraction logic
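The structured-output pattern can be sketched with Pydantic v2. The class names below follow the PR description's types.py, but the fields are illustrative assumptions:

```python
from pydantic import BaseModel

# Field names here are assumptions; only the class names come from the PR's
# description of types.py.
class PromptCandidate(BaseModel):
    prompt: str
    rationale: str

class CandidatePromptsResponse(BaseModel):
    candidates: list[PromptCandidate]

# With structured outputs, the raw LLM response is validated directly against
# the schema instead of being scraped out of free-form text with regexes:
raw = '{"candidates": [{"prompt": "Describe the image.", "rationale": "Shorter."}]}'
response = CandidatePromptsResponse.model_validate_json(raw)
```

When the response does not match the schema, model_validate_json raises pydantic.ValidationError immediately, replacing the silent failure modes of regex-based JSON extraction.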
  • Reporting Utilities:

    • Updated display_optimized_prompt_diff to handle multimodal content in prompt diffs
    • Added a _content_to_string helper for converting multimodal content to a string representation for diffing
    • Enhanced display methods to properly format multimodal content using existing _format_message_content utility
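A helper along these lines can flatten multimodal content for difflib; this is a guess at the behavior of _content_to_string, not the PR's implementation:

```python
import difflib

def content_to_string(content):
    """Flatten message content to plain text for diffing. Hypothetical
    stand-in for the PR's _content_to_string helper; image parts become
    bracketed placeholders so text changes still diff cleanly."""
    if isinstance(content, str):
        return content
    lines = []
    for part in content:
        if part.get("type") == "text":
            lines.append(part.get("text", ""))
        else:
            lines.append(f"[{part.get('type', 'non-text')}]")
    return "\n".join(lines)

before = [
    {"type": "text", "text": "Describe the scene."},
    {"type": "image_url", "image_url": {"url": "https://example.com/a.jpg"}},
]
after = [
    {"type": "text", "text": "Describe the road hazard."},
    {"type": "image_url", "image_url": {"url": "https://example.com/a.jpg"}},
]

diff_lines = list(difflib.unified_diff(
    content_to_string(before).splitlines(),
    content_to_string(after).splitlines(),
    lineterm="",
))
```

Because the unchanged image part flattens to the same placeholder line on both sides, the diff highlights only the text edit.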
  • New Dataset: Added driving_hazard_50 dataset for multimodal evaluation scenarios

  • Example Script: Added multimodal_example.py demonstrating multimodal prompt optimization

Change checklist

  • User facing
  • Documentation update

Issues

  • Resolves #
  • OPIK-000

Testing

  • Updated test_optimization_result.py to reflect new type definitions
  • Manual testing with multimodal example script confirms:
    • Multimodal content structure is preserved through optimization rounds
    • Display methods correctly format multimodal content
    • Structured outputs work correctly with Pydantic validation

Documentation

  • Added multimodal example script demonstrating usage
  • Updated type hints and docstrings to reflect multimodal support
  • No breaking changes to the public API; existing string-based prompts continue to work

@vincentkoc vincentkoc changed the base branch from main to feat/imagebased-optimizer November 5, 2025 15:46
@vincentkoc vincentkoc changed the base branch from feat/imagebased-optimizer to main November 8, 2025 02:55
@vincentkoc (Member) commented:

Cherry-picked commits into #3926; closing this branch and #3488. The biggest issue is that the Pydantic type changes are extensive and would need a large-scale refactor, so I will adjust and combine my EO changes into a new PR. I will raise additional issues/tickets for the missing parts.

@vincentkoc vincentkoc closed this Nov 10, 2025