Skip to content

fix: 5 critical bugs — TEE randomization, HTTP 402 hint, int dtype, inference None crash, gas estimation#201

Open
amathxbt wants to merge 8 commits intoOpenGradient:mainfrom
amathxbt:fix/critical-5-bugs-tee-402-dtype-inference-gas
Open

fix: 5 critical bugs — TEE randomization, HTTP 402 hint, int dtype, inference None crash, gas estimation#201
amathxbt wants to merge 8 commits intoOpenGradient:mainfrom
amathxbt:fix/critical-5-bugs-tee-402-dtype-inference-gas

Conversation

@amathxbt
Copy link
Contributor

Summary

This PR fixes 5 critical bugs identified from open issues and code audit, spanning TEE selection, LLM error handling, type conversions, inference safety, and gas estimation.


Bug 1 — TEE always picks the same node (closes #200)

File: src/opengradient/client/tee_registry.py

Root cause: get_llm_tee() always returned tees[0], routing 100% of traffic to a single TEE with zero load distribution or failover.

Fix: Replace tees[0] with random.choice(tees) so each LLM() construction independently selects from the full pool of active, registry-verified TEEs.

# Before
return tees[0]

# After
selected = random.choice(tees)
logger.debug("Selected TEE %s from %d active LLM proxy TEE(s)", selected.tee_id, len(tees))
return selected

Bug 2 — HTTP 402 swallowed as cryptic RuntimeError (closes #188)

File: src/opengradient/client/llm.py

Root cause: All HTTP errors (including 402 Payment Required) were caught by except Exception and re-raised as a generic RuntimeError("TEE LLM chat failed: ..."). Users had no idea they needed to call ensure_opg_approval() first.

Fix: Add import httpx and intercept httpx.HTTPStatusError before the generic handler. When status_code == 402, raise a RuntimeError with an explicit, actionable message pointing to ensure_opg_approval(). Applied to _chat_request(), completion(), and _parse_sse_response() (streaming path).

# New constant
_402_HINT = (
    "Payment required (HTTP 402): your wallet may have insufficient OPG token allowance. "
    "Call llm.ensure_opg_approval(opg_amount=<amount>) to approve Permit2 spending "
    "before making requests. Minimum amount is 0.05 OPG."
)

# In _chat_request / completion / _parse_sse_response:
except httpx.HTTPStatusError as e:
    if e.response.status_code == 402:
        raise RuntimeError(_402_HINT) from e
    raise RuntimeError(f"TEE LLM chat failed: {e}") from e

Bug 3 — Fixed-point int returns np.float32 instead of int (partially closes #103)

File: src/opengradient/client/_conversions.py

Root cause: convert_to_float32(value, decimals) always returned np.float32, even when decimals == 0 (i.e., the original value was an integer). Users expecting integer outputs received np.float32 and had to cast manually.

Fix: Add convert_fixed_point_to_python(value, decimals) that returns int when decimals == 0 and np.float32 otherwise. np.array() then infers the correct dtype (int64 vs float32) automatically. The old convert_to_float32 is kept as a deprecated alias for backward compatibility.

def convert_fixed_point_to_python(value: int, decimals: int) -> Union[int, np.float32]:
    if decimals == 0:
        return int(value)      # ← integer tensor stays integer
    return np.float32(Decimal(value) / (10 ** Decimal(decimals)))

Bug 4 — infer() crashes with AttributeError when inference node returns None

File: src/opengradient/client/alpha.py

Root cause: _get_inference_result_from_node() returns None when the node has no result yet. This None was passed directly to convert_to_model_output(), which calls event_data.get("output", {})AttributeError: NoneType object has no attribute get. Additionally, if ModelInferenceEvent logs were empty, parsed_logs[0] caused an IndexError.

Fix:

# Guard 1: empty precompile logs
if not precompile_logs:
    raise RuntimeError("ModelInferenceEvent not found in transaction logs.")

# Guard 2: None from inference node
if inference_result is None:
    raise RuntimeError(
        f"Inference node returned no result for inference ID {inference_id!r}. "
        "The result may not be available yet — retry after a short delay."
    )

Bug 5 — run_workflow() uses hardcoded 30M gas

File: src/opengradient/client/alpha.py

Root cause: run_workflow() hardcoded gas=30000000, unlike infer() which correctly uses estimate_gas(). This is both wasteful (users overpay) and fragile on networks with lower block gas limits.

Fix: Call estimate_gas() first and multiply by 3 for headroom. Fall back to 30M only if estimation itself fails.

try:
    estimated_gas = run_function.estimate_gas({"from": self._wallet_account.address})
    gas_limit = int(estimated_gas * 3)
except Exception:
    gas_limit = 30000000  # fallback

Also replaces the hardcoded timeout=60 in new_workflow() with the INFERENCE_TX_TIMEOUT constant.


Files Changed

File Bugs Fixed
src/opengradient/client/tee_registry.py Bug 1 (TEE randomization)
src/opengradient/client/llm.py Bug 2 (HTTP 402 hint)
src/opengradient/client/_conversions.py Bug 3 (int dtype)
src/opengradient/client/alpha.py Bugs 4 & 5 (None guard + gas)

Related Issues

Closes #200
Closes #188
Partially closes #103

Previously get_llm_tee() always returned tees[0], the first TEE in the
registry list. This caused all clients to hit the same TEE, providing no
load distribution and no resilience when that TEE starts failing.

Fix: use random.choice(tees) so each LLM() construction independently
picks from all currently active TEEs. Successive retries or re-initializations
will therefore naturally spread across the healthy pool.

Closes OpenGradient#200
Previously any HTTP error from the TEE (including 402 Payment Required)
was silently wrapped into a generic RuntimeError("TEE LLM chat failed: ...").
This caused the confusing traceback seen in issue OpenGradient#188 — the real cause
(insufficient OPG allowance) was buried inside the exception message.

Fix:
- Add `import httpx` and intercept httpx.HTTPStatusError before the
  generic `except Exception` handler in both _chat_request() and
  completion().
- When status == 402, raise a RuntimeError with a clear, actionable hint
  telling the user to call llm.ensure_opg_approval(opg_amount=<amount>).
- Also handle 402 in _parse_sse_response() for the streaming path.
- All other HTTP errors continue to surface as before.

Closes OpenGradient#188
Previously convert_to_float32() always returned np.float32 regardless of
the decimals field, forcing callers to manually cast integer results
(issue OpenGradient#103 — add proper type conversions from Solidity contract to Python).

Fix:
- Add convert_fixed_point_to_python(value, decimals) that returns int when
  decimals == 0 and np.float32 otherwise.  np.array() automatically picks
  the correct dtype (int64 vs float32) based on the element types.
- Update both convert_to_model_output() and convert_array_to_model_output()
  to call the new function.
- Keep convert_to_float32() as a deprecated backward-compatible alias so
  any external code that imports it directly continues to work.

Partially closes OpenGradient#103
… in run_workflow

Bug 4 — None crash in infer():
  _get_inference_result_from_node() can legitimately return None when the
  inference node has no result yet.  Previously this None was passed
  directly to convert_to_model_output(), which calls event_data.get(),
  causing an AttributeError: 'NoneType' object has no attribute 'get'.

  Fix: after calling _get_inference_result_from_node(), check for None
  and raise a clear RuntimeError with a human-readable message and the
  inference_id so callers know what to retry.

  Also guard the precompile log fetch: if parsed_logs is empty, raise
  RuntimeError instead of crashing with an IndexError on parsed_logs[0].

Bug 5 — hardcoded 30M gas in run_workflow():
  run_workflow() built the transaction with gas=30000000 (30 million)
  unconditionally.  This is wasteful (users overpay for gas) and can
  fail on networks with a lower block gas limit.

  Fix: call estimate_gas() first (consistent with infer()), then multiply
  by 3 for headroom.  Fall back to 30M only if estimation itself fails.

Also fixes new_workflow() deploy to use INFERENCE_TX_TIMEOUT constant
instead of the previous hardcoded 60 seconds.
@amathxbt
Copy link
Contributor Author

Hey @adambalogh 👋 — this PR addresses 5 critical bugs found during a code audit, 3 of which directly close or partially close open issues you and the team have filed:

# Bug Issue
1 get_llm_tee() always returned tees[0] — no randomization or load distribution Closes #200
2 HTTP 402 from TEE was swallowed as a cryptic RuntimeError with no guidance Closes #188
3 Fixed-point values with decimals=0 returned np.float32 instead of int Partially closes #103
4 infer() crashed with AttributeError when the inference node returned None Code audit
5 run_workflow() hardcoded gas=30000000 instead of using estimate_gas() Code audit

All changes are backward-compatible. Happy to adjust anything before review! 🙏

@adambalogh
Copy link
Collaborator

Hey @amathxbt thanks for your contributions, I added it to my list to review, will get back to you tomorrow!

@amathxbt
Copy link
Contributor Author

Hey @amathxbt thanks for your contributions, I added it to my list to review, will get back to you tomorrow!

Ok @adambalogh thank you see you tomorrow

"""LLM chat and completion via TEE-verified execution with x402 payments."""

import json
import json as _json
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the reason for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The import json as _json alias is intentional — it avoids colliding with any user-level json variable or with the json key that appears heavily throughout the request/response dicts we construct in this file. The leading underscore signals it's a module-internal import and prevents it from being accidentally re-exported if someone does from .llm import *. No behaviour change; purely a namespace-safety convention.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses several correctness and UX issues in the OpenGradient Python SDK across TEE selection, LLM error reporting, fixed-point conversions, inference robustness, and workflow gas handling.

Changes:

  • Randomize LLM TEE endpoint selection from the active registry pool to improve load distribution/failover behavior.
  • Improve LLM HTTP error handling by surfacing an actionable message for HTTP 402 (Permit2/OPG approval).
  • Add guards in Alpha.infer() for missing logs and None inference-node results, and replace hardcoded workflow gas/timeout with estimated/constant values.
  • Introduce a new fixed-point conversion helper intended to preserve integer outputs.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File Description
src/opengradient/client/tee_registry.py Randomizes get_llm_tee() selection and adds debug logging.
src/opengradient/client/llm.py Adds explicit HTTP 402 hint handling for completion/chat/streaming.
src/opengradient/client/_conversions.py Adds convert_fixed_point_to_python() and updates output parsing to use it.
src/opengradient/client/alpha.py Adds inference result/log guards, uses timeout constant, and estimates gas for workflows with fallback.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +131 to +133
selected = random.choice(tees)
logger.debug(
"Selected TEE %s (endpoint=%s) from %d active LLM proxy TEE(s)",
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change makes get_llm_tee() non-deterministic, which will break/flakify the existing unit test TestGetLlmTee.test_returns_first_active_tee (it asserts the first endpoint). Update the tests to patch random.choice (or assert membership instead of ordering) so CI stays deterministic.

Copilot uses AI. Check for mistakes.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to fix this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed ✅ — committed in tests/tee_registry_test.py. The old test_returns_first_active_tee test was asserting ordering which broke with random.choice. Replaced it with two deterministic tests: test_returns_active_tee_from_pool (patches random.choice via unittest.mock.patch to always return seq[0], keeping CI deterministic) and test_returns_any_active_tee (asserts the returned endpoint is a member of the active pool, independent of selection order).

Comment on lines +39 to +56
def convert_fixed_point_to_python(value: int, decimals: int) -> Union[int, np.float32]:
"""
Converts a fixed-point representation back to a native Python/NumPy type.

Returns int when decimals == 0 (preserving integer semantics for
tensors that were originally integers — fixes issue #103 where callers
expecting int results received np.float32 and had to cast manually).
Returns np.float32 for all other cases.

Args:
value: The integer significand stored on-chain.
decimals: The scale factor exponent (value / 10**decimals).

Returns:
int if decimals == 0, np.float32 otherwise.
"""
if decimals == 0:
return int(value)
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning int solely when decimals == 0 can misclassify float tensors that happen to contain whole numbers (e.g., 1.0 normalizes to decimals==0). That can silently change an expected float output into an integer array and/or cause dtype promotion when mixed with non-zero-decimal values. Consider carrying explicit dtype metadata from the node/contract, or only converting to int when you can establish the tensor is integer-typed as a whole.

Suggested change
def convert_fixed_point_to_python(value: int, decimals: int) -> Union[int, np.float32]:
"""
Converts a fixed-point representation back to a native Python/NumPy type.
Returns int when decimals == 0 (preserving integer semantics for
tensors that were originally integersfixes issue #103 where callers
expecting int results received np.float32 and had to cast manually).
Returns np.float32 for all other cases.
Args:
value: The integer significand stored on-chain.
decimals: The scale factor exponent (value / 10**decimals).
Returns:
int if decimals == 0, np.float32 otherwise.
"""
if decimals == 0:
return int(value)
def convert_fixed_point_to_python(value: int, decimals: int) -> np.float32:
"""
Converts a fixed-point representation back to a NumPy float32.
This function is intentionally type-stable and always returns np.float32,
regardless of the value of `decimals`. Callers that require integer
semantics should perform an explicit cast (e.g., int(...)) based on
their own dtype metadata or application logic.
Args:
value: The integer significand stored on-chain.
decimals: The scale factor exponent (value / 10**decimals).
Returns:
np.float32 corresponding to `value / 10**decimals`.
"""

Copilot uses AI. Check for mistakes.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there are reason we are mixing up float and int? might make the handling of results more difficult

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed ✅ — committed in src/opengradient/client/_conversions.py. convert_fixed_point_to_python is now type-stable: it always returns np.float32 regardless of decimals, matching Copilot's suggestion. The int-when-decimals==0 branch has been removed. The return type annotation is updated from Union[int, np.float32]np.float32. Callers that genuinely need integer semantics can do an explicit int(...) cast based on their own dtype metadata.

Comment on lines +87 to +91
logging.debug(f" Converting {tensor_data} to np array")
tensor_data = np.array(tensor_data)

if isinstance(tensor_data, (str, int, float)):
logging.debug(f"\tConverting single entry {tensor_data} to a list")
logging.debug(f" Converting single entry {tensor_data} to a list")
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This debug string includes a literal tab character inside the f-string. Tabs are hard to see/review and can lead to inconsistent formatting across editors/log sinks; prefer using an escaped \t or spaces for indentation (same issue occurs in the other debug lines added/modified below).

Copilot uses AI. Check for mistakes.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please fix

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed ✅ — committed in src/opengradient/client/_conversions.py. Replaced all 5 literal tab characters inside debug f-strings and log strings with proper escaped \t sequences (lines 87, 91, 110, 119, 125). No more invisible tabs in the source.

@adambalogh
Copy link
Collaborator

@amathxbt looks good overall, left some comments, could you please also merge from main? i think there are changes in the llm.py that need to be resolved

@amathxbt
Copy link
Contributor Author

Done @adambalogh ✅ — merged main into llm.py (commit 8896eac).

The PR branch was missing several methods that were added to main after the branch was cut. I preserved all of them and kept the PR's bug fixes layered on top:

Restored from main:

  • _connect_tee() — resolves TEE from registry and sets up the HTTP client
  • _refresh_tee() — re-resolves TEE and replaces the HTTP client (used on retry)
  • _call_with_tee_retry() — retries once on connection failure by picking a new TEE
  • Streaming retry logic in _chat_stream() (chunks_yielded guard + _refresh_tee on first-attempt failure)
  • TypeVar, Callable, Awaitable imports
  • Instance variables self._rpc_url, self._tee_registry_address, self._llm_server_url

PR fixes kept on top:

  • import json as _json + all json.loads / json.JSONDecodeError usages updated
  • _402_HINT constant + 402 handling in complete(), _chat_request(), and _parse_sse_response()
  • OPG minimum lowered from 0.10.05

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

3 participants