Replies: 3 comments
-
You need to make sure to return a “combined_score” in your evaluator, which will be used as the fitness score. Otherwise it uses an average of all the metrics returned. There may be a condition in the evaluator where it is not returning a combined_score field.
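A minimal sketch of such an evaluator, assuming the usual evaluate(program_path) -> dict-of-metrics shape used by OpenEvolve evaluators; run_program_and_score and the error/timeout metric names here are placeholders, not part of any required API:

```python
# Hypothetical evaluator sketch: every code path returns a combined_score.
def evaluate(program_path):
    try:
        score = run_program_and_score(program_path)  # placeholder helper
        return {
            "error": 0.0,
            "timeout": 0.0,
            "combined_score": score,  # fitness used for evolution guidance
        }
    except Exception:
        # Still return combined_score so the "average of all metrics" fallback never kicks in.
        return {"error": 1.0, "timeout": 0.0, "combined_score": -1e9}
```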
-
Don't all paths in my code return a “combined_score”? I must be missing something.
-
I think the problem is the timeout from openevolve itself. That sets combined_score to 0.5, and because I am maximizing a negative score, the 0.5 openevolve assigns is larger than any value the evolved code can achieve. I can get around it by setting the timeout inside evaluate to be less than the timeout in config.yaml.
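A rough sketch of that workaround, assuming the config.yaml evaluator timeout is 120s (as the log suggests) and that the evolved program can be run as a script; parse_score, the 60s inner timeout, and the -1e9 penalty are placeholders:

```python
import subprocess
import sys

INNER_TIMEOUT = 60  # seconds; keep this below the evaluator timeout in config.yaml

def evaluate(program_path):
    try:
        result = subprocess.run(
            [sys.executable, program_path],
            capture_output=True, text=True, timeout=INNER_TIMEOUT,
        )
        score = parse_score(result.stdout)  # placeholder: extract the (negative) score
        return {"error": 0.0, "timeout": 0.0, "combined_score": score}
    except subprocess.TimeoutExpired:
        # Penalize timeouts in the evaluator itself, so the framework's 0.5 default
        # never gets a chance to outrank the real (negative) scores.
        return {"error": 0.0, "timeout": 1.0, "combined_score": -1e9}
```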
-
After running for a few minutes I see:
2025-11-24 16:09:43,290 - INFO - Sampled model: openai/gpt-oss-120b
⚠️ No 'combined_score' metric found in evaluation results. Using average of all numeric metrics (0.5000) for evolution guidance. For better evolution results, please modify your evaluator to return a 'combined_score' metric that properly weights different aspects of program performance.
2025-11-24 16:09:43,294 - WARNING - Iteration 82 error: Generated code exceeds maximum length (21825 > 20000)
2025-11-24 16:09:43,637 - WARNING - Evaluation timed out after 120s
2025-11-24 16:09:43,646 - INFO - Sampled model: openai/gpt-oss-120b
2025-11-24 16:09:43,651 - INFO - New MAP-Elites cell occupied in island 1: {'complexity': 7, 'diversity': 6}
2025-11-24 16:09:43,651 - INFO - Population size (71) exceeds limit (70), removing 1 programs
2025-11-24 16:09:43,652 - INFO - Population size after cleanup: 70
2025-11-24 16:09:43,652 - INFO - New best program da964b6f-ef89-4656-98d5-5a538784ee41 replaces c40678cd-cdb8-4990-a522-af194ee8b106
2025-11-24 16:09:43,652 - INFO - Iteration 73: Program da964b6f-ef89-4656-98d5-5a538784ee41 (parent: 21bc16c1-d859-4db0-a716-d4dd346d22bf) completed in 134.98s
2025-11-24 16:09:43,652 - INFO - Metrics: error=0.0000, timeout=1.0000
2025-11-24 16:09:43,652 - WARNING -
2025-11-24 16:09:43,652 - INFO - 🌟 New best solution found at iteration 73: da964b6f-ef89-4656-98d5-5a538784ee41
2025-11-24 16:09:43,653 - INFO - Checkpoint interval reached at iteration 73
2025-11-24 16:09:43,663 - INFO - Island Status:
2025-11-24 16:09:43,663 - INFO - Island 0: 21 programs, best=-58.0000, avg=-3398.4286, diversity=1570.22, gen=16 (best: 7ec4faea-9b2b-4d5c-ba10-bd41ff5ee922)
2025-11-24 16:09:43,664 - INFO - Island 1: 13 programs, best=0.5000, avg=-301.4231, diversity=746.05, gen=16 (best: da964b6f-ef89-4656-98d5-5a538784ee41)
My evaluator.py looks like:
How does it ever return a value of 0.5? It seems this might be the average of the error=0.0000 and timeout=1.0000 metrics, but I am not sure how to stop it happening.
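For reference, the fallback described in the warning above would produce exactly that value from the metrics in the log; a minimal illustration (not OpenEvolve's actual code):

```python
# Fallback averaging when no "combined_score" is returned.
metrics = {"error": 0.0, "timeout": 1.0}        # no "combined_score" present
fitness = sum(metrics.values()) / len(metrics)  # (0.0 + 1.0) / 2 = 0.5
print(fitness)                                  # 0.5
```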