Conversation

@mohi-devhub (Contributor) commented Dec 8, 2025

Summary

This PR replaces the rule-based HeuristicEvaluationEngine with an LLM-driven evaluation pipeline and updates the UIElement model to match real OmniParser output. The system now evaluates H1–H5 using GPT-4 with structured, measurable criteria.

Key Changes

1. OmniParser Alignment (Commit: 7255ae1)

Updated UIElement to align with the actual OmniParser format (a sketch follows the list):

  • Includes type, bbox: [x1, y1, x2, y2], interactivity, content.
  • Removed previously hallucinated fields (hover_state, confirmation, etc.).
  • Added computed fields: width, height.
  • Added text alias for backward compatibility.
  • Implemented from_dict() supporting both old and new formats.
  • Added infer_heading_level() for text hierarchy calculation.
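
A minimal sketch of the updated model: the field names follow the PR description and the example OmniParser output later in this thread, but the method bodies and defaults here are assumptions, not the actual implementation (infer_heading_level() is sketched later in the thread):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UIElement:
    """One parsed element, mirroring real OmniParser output."""
    type: str            # 'text' or 'icon'
    bbox: List[float]    # [x1, y1, x2, y2], normalized to [0, 1]
    interactivity: bool
    content: str

    @property
    def width(self) -> float:
        return self.bbox[2] - self.bbox[0]

    @property
    def height(self) -> float:
        return self.bbox[3] - self.bbox[1]

    @property
    def text(self) -> str:
        # Backward-compatibility alias for callers that used .text
        return self.content

    @classmethod
    def from_dict(cls, d: dict) -> "UIElement":
        # Accept both the old mock format and the real OmniParser format
        return cls(
            type=d.get("type", "text"),
            bbox=d.get("bbox", [0.0, 0.0, 0.0, 0.0]),
            interactivity=bool(d.get("interactivity", d.get("interactive", False))),
            content=d.get("content", d.get("text", "")),
        )
```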

2. Measurable Criteria for H4 & H5 (Commit: 8957369)

Added concrete, structured criteria for LLM evaluation of higher-level heuristics (an illustrative schema follows the list):

  • H4 – Consistency & Standards (e.g., button dimensions, typography, terminology).
  • H5 – Error Prevention (e.g., input constraints, validation, confirmation prompts).
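
The PR does not show the criteria schema, so the shape below is purely illustrative of how structured, measurable checks for H4/H5 might be passed to the LLM; the actual field names may differ:

```python
# Hypothetical schema; the real criteria structure in the PR may differ.
H4_CONSISTENCY = {
    "id": "H4",
    "name": "Consistency & Standards",
    "checks": [
        "Buttons with the same role share similar bbox dimensions",
        "Text sizes (inferred from bbox heights) form a consistent hierarchy",
        "Labels use consistent terminology for the same action",
    ],
}

H5_ERROR_PREVENTION = {
    "id": "H5",
    "name": "Error Prevention",
    "checks": [
        "Inputs advertise constraints (required fields, formats, ranges)",
        "Destructive actions are guarded by confirmation prompts",
        "Validation feedback appears near the offending input",
    ],
}
```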

3. LLM-Based Evaluation Engine (Commit: ec31f70)

The core evaluation logic is now LLM-driven (see the sketch after this list):

  • New methods: _serialize_elements_for_llm(), _evaluate_with_llm(), and _llm_explain_heuristic().
  • evaluate_heuristic() is now fully LLM-based, supporting H1–H5.
  • Removed legacy rule-based evaluation for H1–H3 (the legacy H4 path is kept but deprecated for now).
  • Updated scoring model fields to track LLM usage:
    • llm_explanation
    • evaluation_version="2.0.0-llm"
    • evaluation_method="llm-based"
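
The method names below come from the PR description; their bodies are a rough sketch assuming the OpenAI v1 Python client and the hypothetical criteria schema shown earlier, not the actual implementation:

```python
import json
from openai import OpenAI


def _serialize_elements_for_llm(elements) -> str:
    """Compact JSON the LLM can reason over."""
    return json.dumps([
        {
            "type": e.type,
            "bbox": [round(v, 3) for v in e.bbox],
            "interactivity": e.interactivity,
            "content": e.content,
        }
        for e in elements
    ])


def _evaluate_with_llm(client: OpenAI, model: str, heuristic_id: str,
                       criteria: dict, elements) -> str:
    prompt = (
        f"Evaluate heuristic {heuristic_id} ({criteria['name']}) against "
        f"these UI elements:\n{_serialize_elements_for_llm(elements)}\n"
        f"Checks: {json.dumps(criteria['checks'])}"
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```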

4. Other Fixes & Integrations

  • Fixed an unterminated docstring bug.
  • Merged upstream changes from main.
  • Added configurable OPENAI_BASE_URL and improved LLM client initialization.
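
A sketch of the configurable client initialization; the settings module path is an assumption:

```python
from openai import OpenAI
from app.core.config import settings  # hypothetical module path

# Honor a custom gateway/proxy when OPENAI_BASE_URL is set;
# fall back to the default OpenAI endpoint otherwise.
client = OpenAI(
    api_key=settings.OPENAI_API_KEY,
    base_url=settings.OPENAI_BASE_URL or None,
)
```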

Impact Summary

| Metric | Before | After |
| --- | --- | --- |
| Evaluation Method | Rule-based | LLM-based |
| OmniParser Model | Hallucinated attrs | Real format |
| Heuristics Supported | H1–H3 | H1–H5 |
| LOC Change | n/a | +355 net |

Validation

  • All Python modules compile cleanly.
  • UIElement serialization confirmed to match real OmniParser output.
  • Backward compatibility for element access maintained through property aliases.

@mohi-devhub mohi-devhub changed the title feat: add H4/H5 criteria and H4 rule-based evaluation Add H4/H5 criteria and H4 rule-based evaluation Dec 8, 2025
@tqmsh tqmsh self-requested a review December 8, 2025 13:45
@latishab latishab self-requested a review December 9, 2025 08:19
@latishab (Collaborator) commented Dec 9, 2025

@mohi-devhub

Actually, I realized the mock data (omniparser_client.py) was incorrect: it included an attributes={'level': 'h1'} field that the real OmniParser does not output.

Since the real model only returns bounding boxes and text, we cannot rely on font_size or level attributes. Please update the logic to infer hierarchy from the element's height (bounds['height']) instead.

Here is example output from OmniParser:

```
icon 0: {'type': 'text', 'bbox': [0.31953126192092896, 0.10987482964992523, 0.41015625, 0.13212795555591583], 'interactivity': False, 'content': 'Type here to search'}
icon 1: {'type': 'text', 'bbox': [0.3101562559604645, 0.19332405924797058, 0.3460937440395355, 0.21279555559158325], 'interactivity': False, 'content': 'Pinned'}
icon 2: {'type': 'text', 'bbox': [0.6585937738418579, 0.2990264296531677, 0.694531261920929, 0.3184979259967804], 'interactivity': False, 'content': 'Settings'}
icon 6: {'type': 'icon', 'bbox': [0.4970785975456238, 0.23429809510707855, 0.5733658671379089, 0.34724709391593933], 'interactivity': True, 'content': 'Microsoft '}
icon 7: {'type': 'icon', 'bbox': [0.5727160573005676, 0.4605804979801178, 0.6388594508171082, 0.5722821354866028], 'interactivity': True, 'content': 'Photoshop Express '}
```
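
A minimal sketch of height-based hierarchy inference consistent with this request, assuming normalized bbox coordinates as in the output above; the thresholds are illustrative only:

```python
def infer_heading_level(element) -> int:
    """Infer a heading level (1 = largest) from normalized bbox height.

    Thresholds are illustrative and would need tuning against real
    OmniParser output; returns 0 for body-sized text.
    """
    height = element.bbox[3] - element.bbox[1]  # normalized to [0, 1]
    if height >= 0.06:
        return 1
    if height >= 0.04:
        return 2
    if height >= 0.025:
        return 3
    return 0  # body text
```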

@mohi-devhub (Contributor, Author) commented

@latishab
Thanks for the clarification, makes sense. I’ll remove the mock level/font_size fields and update the logic to infer hierarchy purely from the element’s bbox height. I’ll also fix the mock data to match real OmniParser output. Let me know if anything else looks off.

@mohi-devhub mohi-devhub changed the title Add H4/H5 criteria and H4 rule-based evaluation Refactor Heuristic Evaluation Engine to LLM-Based Analysis Dec 9, 2025
@latishab latishab requested a review from tqmsh December 9, 2025 17:21
@latishab (Collaborator) commented Dec 9, 2025

@mohi-devhub

I tested your code and hit two issues:

  1. The model name is currently hardcoded in app/services/heuristic_engine.py; it should come from the configurable settings instead (i.e., settings.OPENAI_MODEL).
  2. The method _llm_explain_heuristic() accesses properties that don't exist in the new UIElement format: e.attributes.keys() no longer exists, and e.interactive should be e.interactivity to match the OmniParser output.

Please make these changes, and consider committing example raw OmniParser outputs for testing so you no longer have to rely on mock data.
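
A sketch of both fixes together; everything except settings.OPENAI_MODEL and the interactivity field is illustrative, including the module path and prompt:

```python
from app.core.config import settings  # hypothetical module path


def _llm_explain_heuristic(client, heuristic_id: str, elements) -> str:
    # Fix 2: use e.interactivity (the real OmniParser field), not
    # e.interactive, and drop the nonexistent e.attributes.keys() access.
    interactive = [e for e in elements if e.interactivity]
    prompt = (
        f"Explain findings for {heuristic_id}: "
        f"{len(interactive)} of {len(elements)} elements are interactive."
    )
    resp = client.chat.completions.create(
        model=settings.OPENAI_MODEL,  # Fix 1: no hardcoded model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```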
