fix: 5 critical bugs — TEE randomization, HTTP 402 hint, int dtype, inference None crash, gas estimation by amathxbt · Pull Request #201 · OpenGradient/OpenGradient-SDK

amathxbt · 2026-03-23T23:49:53Z

Summary

This PR fixes 5 critical bugs identified from open issues and code audit, spanning TEE selection, LLM error handling, type conversions, inference safety, and gas estimation.

Bug 1 — TEE always picks the same node (closes #200)

File: src/opengradient/client/tee_registry.py

Root cause: get_llm_tee() always returned tees[0], routing 100% of traffic to a single TEE with zero load distribution or failover.

Fix: Replace tees[0] with random.choice(tees) so each LLM() construction independently selects from the full pool of active, registry-verified TEEs.

# Before
return tees[0]

# After
selected = random.choice(tees)
logger.debug("Selected TEE %s from %d active LLM proxy TEE(s)", selected.tee_id, len(tees))
return selected

Bug 2 — HTTP 402 swallowed as cryptic RuntimeError (closes #188)

File: src/opengradient/client/llm.py

Root cause: All HTTP errors (including 402 Payment Required) were caught by except Exception and re-raised as a generic RuntimeError("TEE LLM chat failed: ..."). Users had no idea they needed to call ensure_opg_approval() first.

Fix: Add import httpx and intercept httpx.HTTPStatusError before the generic handler. When status_code == 402, raise a RuntimeError with an explicit, actionable message pointing to ensure_opg_approval(). Applied to _chat_request(), completion(), and _parse_sse_response() (streaming path).

# New constant
_402_HINT = (
    "Payment required (HTTP 402): your wallet may have insufficient OPG token allowance. "
    "Call llm.ensure_opg_approval(opg_amount=<amount>) to approve Permit2 spending "
    "before making requests. Minimum amount is 0.05 OPG."
)

# In _chat_request / completion / _parse_sse_response:
except httpx.HTTPStatusError as e:
    if e.response.status_code == 402:
        raise RuntimeError(_402_HINT) from e
    raise RuntimeError(f"TEE LLM chat failed: {e}") from e

Bug 3 — Fixed-point int returns np.float32 instead of int (partially closes #103)

File: src/opengradient/client/_conversions.py

Root cause: convert_to_float32(value, decimals) always returned np.float32, even when decimals == 0 (i.e., the original value was an integer). Users expecting integer outputs received np.float32 and had to cast manually.

Fix: Add convert_fixed_point_to_python(value, decimals) that returns int when decimals == 0 and np.float32 otherwise. np.array() then infers the correct dtype (int64 vs float32) automatically. The old convert_to_float32 is kept as a deprecated alias for backward compatibility.

def convert_fixed_point_to_python(value: int, decimals: int) -> Union[int, np.float32]:
    if decimals == 0:
        return int(value)      # ← integer tensor stays integer
    return np.float32(Decimal(value) / (10 ** Decimal(decimals)))

Bug 4 — `infer()` crashes with AttributeError when inference node returns None

File: src/opengradient/client/alpha.py

Root cause: _get_inference_result_from_node() returns None when the node has no result yet. This None was passed directly to convert_to_model_output(), which calls event_data.get("output", {}) → AttributeError: NoneType object has no attribute get. Additionally, if ModelInferenceEvent logs were empty, parsed_logs[0] caused an IndexError.

Fix:

# Guard 1: empty precompile logs
if not precompile_logs:
    raise RuntimeError("ModelInferenceEvent not found in transaction logs.")

# Guard 2: None from inference node
if inference_result is None:
    raise RuntimeError(
        f"Inference node returned no result for inference ID {inference_id!r}. "
        "The result may not be available yet — retry after a short delay."
    )

Bug 5 — `run_workflow()` uses hardcoded 30M gas

File: src/opengradient/client/alpha.py

Root cause: run_workflow() hardcoded gas=30000000, unlike infer() which correctly uses estimate_gas(). This is both wasteful (users overpay) and fragile on networks with lower block gas limits.

Fix: Call estimate_gas() first and multiply by 3 for headroom. Fall back to 30M only if estimation itself fails.

try:
    estimated_gas = run_function.estimate_gas({"from": self._wallet_account.address})
    gas_limit = int(estimated_gas * 3)
except Exception:
    gas_limit = 30000000  # fallback

Also replaces the hardcoded timeout=60 in new_workflow() with the INFERENCE_TX_TIMEOUT constant.

Files Changed

File	Bugs Fixed
`src/opengradient/client/tee_registry.py`	Bug 1 (TEE randomization)
`src/opengradient/client/llm.py`	Bug 2 (HTTP 402 hint)
`src/opengradient/client/_conversions.py`	Bug 3 (int dtype)
`src/opengradient/client/alpha.py`	Bugs 4 & 5 (None guard + gas)

Related Issues

Closes #200
Closes #188
Partially closes #103

Previously get_llm_tee() always returned tees[0], the first TEE in the registry list. This caused all clients to hit the same TEE, providing no load distribution and no resilience when that TEE starts failing. Fix: use random.choice(tees) so each LLM() construction independently picks from all currently active TEEs. Successive retries or re-initializations will therefore naturally spread across the healthy pool. Closes OpenGradient#200

Previously any HTTP error from the TEE (including 402 Payment Required) was silently wrapped into a generic RuntimeError("TEE LLM chat failed: ..."). This caused the confusing traceback seen in issue OpenGradient#188 — the real cause (insufficient OPG allowance) was buried inside the exception message. Fix: - Add `import httpx` and intercept httpx.HTTPStatusError before the generic `except Exception` handler in both _chat_request() and completion(). - When status == 402, raise a RuntimeError with a clear, actionable hint telling the user to call llm.ensure_opg_approval(opg_amount=<amount>). - Also handle 402 in _parse_sse_response() for the streaming path. - All other HTTP errors continue to surface as before. Closes OpenGradient#188

Previously convert_to_float32() always returned np.float32 regardless of the decimals field, forcing callers to manually cast integer results (issue OpenGradient#103 — add proper type conversions from Solidity contract to Python). Fix: - Add convert_fixed_point_to_python(value, decimals) that returns int when decimals == 0 and np.float32 otherwise. np.array() automatically picks the correct dtype (int64 vs float32) based on the element types. - Update both convert_to_model_output() and convert_array_to_model_output() to call the new function. - Keep convert_to_float32() as a deprecated backward-compatible alias so any external code that imports it directly continues to work. Partially closes OpenGradient#103

… in run_workflow Bug 4 — None crash in infer(): _get_inference_result_from_node() can legitimately return None when the inference node has no result yet. Previously this None was passed directly to convert_to_model_output(), which calls event_data.get(), causing an AttributeError: 'NoneType' object has no attribute 'get'. Fix: after calling _get_inference_result_from_node(), check for None and raise a clear RuntimeError with a human-readable message and the inference_id so callers know what to retry. Also guard the precompile log fetch: if parsed_logs is empty, raise RuntimeError instead of crashing with an IndexError on parsed_logs[0]. Bug 5 — hardcoded 30M gas in run_workflow(): run_workflow() built the transaction with gas=30000000 (30 million) unconditionally. This is wasteful (users overpay for gas) and can fail on networks with a lower block gas limit. Fix: call estimate_gas() first (consistent with infer()), then multiply by 3 for headroom. Fall back to 30M only if estimation itself fails. Also fixes new_workflow() deploy to use INFERENCE_TX_TIMEOUT constant instead of the previous hardcoded 60 seconds.

amathxbt · 2026-03-23T23:50:06Z

Hey @adambalogh 👋 — this PR addresses 5 critical bugs found during a code audit, 3 of which directly close or partially close open issues you and the team have filed:

#	Bug	Issue
1	`get_llm_tee()` always returned `tees[0]` — no randomization or load distribution	Closes #200
2	HTTP 402 from TEE was swallowed as a cryptic `RuntimeError` with no guidance	Closes #188
3	Fixed-point values with `decimals=0` returned `np.float32` instead of `int`	Partially closes #103
4	`infer()` crashed with `AttributeError` when the inference node returned `None`	Code audit
5	`run_workflow()` hardcoded `gas=30000000` instead of using `estimate_gas()`	Code audit

All changes are backward-compatible. Happy to adjust anything before review! 🙏

adambalogh · 2026-03-26T02:02:47Z

Hey @amathxbt thanks for your contributions, I added it to my list to review, will get back to you tomorrow!

amathxbt · 2026-03-26T23:28:38Z

Hey @amathxbt thanks for your contributions, I added it to my list to review, will get back to you tomorrow!

Ok @adambalogh thank you see you tomorrow

adambalogh · 2026-03-26T23:32:12Z

src/opengradient/client/llm.py

 """LLM chat and completion via TEE-verified execution with x402 payments."""

-import json
+import json as _json


what is the reason for this?

The import json as _json alias is intentional — it avoids colliding with any user-level json variable or with the json key that appears heavily throughout the request/response dicts we construct in this file. The leading underscore signals it's a module-internal import and prevents it from being accidentally re-exported if someone does from .llm import *. No behaviour change; purely a namespace-safety convention.

Copilot

Pull request overview

This PR addresses several correctness and UX issues in the OpenGradient Python SDK across TEE selection, LLM error reporting, fixed-point conversions, inference robustness, and workflow gas handling.

Changes:

Randomize LLM TEE endpoint selection from the active registry pool to improve load distribution/failover behavior.
Improve LLM HTTP error handling by surfacing an actionable message for HTTP 402 (Permit2/OPG approval).
Add guards in Alpha.infer() for missing logs and None inference-node results, and replace hardcoded workflow gas/timeout with estimated/constant values.
Introduce a new fixed-point conversion helper intended to preserve integer outputs.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File	Description
`src/opengradient/client/tee_registry.py`	Randomizes `get_llm_tee()` selection and adds debug logging.
`src/opengradient/client/llm.py`	Adds explicit HTTP 402 hint handling for completion/chat/streaming.
`src/opengradient/client/_conversions.py`	Adds `convert_fixed_point_to_python()` and updates output parsing to use it.
`src/opengradient/client/alpha.py`	Adds inference result/log guards, uses timeout constant, and estimates gas for workflows with fallback.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-26T23:34:41Z

src/opengradient/client/tee_registry.py

+        selected = random.choice(tees)
+        logger.debug(
+            "Selected TEE %s (endpoint=%s) from %d active LLM proxy TEE(s)",


This change makes get_llm_tee() non-deterministic, which will break/flakify the existing unit test TestGetLlmTee.test_returns_first_active_tee (it asserts the first endpoint). Update the tests to patch random.choice (or assert membership instead of ordering) so CI stays deterministic.

Would be good to fix this

Fixed ✅ — committed in tests/tee_registry_test.py. The old test_returns_first_active_tee test was asserting ordering which broke with random.choice. Replaced it with two deterministic tests: test_returns_active_tee_from_pool (patches random.choice via unittest.mock.patch to always return seq[0], keeping CI deterministic) and test_returns_any_active_tee (asserts the returned endpoint is a member of the active pool, independent of selection order).

Copilot · 2026-03-26T23:34:41Z

src/opengradient/client/_conversions.py

+def convert_fixed_point_to_python(value: int, decimals: int) -> Union[int, np.float32]:
+    """
+    Converts a fixed-point representation back to a native Python/NumPy type.
+
+    Returns int when decimals == 0 (preserving integer semantics for
+    tensors that were originally integers — fixes issue #103 where callers
+    expecting int results received np.float32 and had to cast manually).
+    Returns np.float32 for all other cases.
+
+    Args:
+        value:    The integer significand stored on-chain.
+        decimals: The scale factor exponent (value / 10**decimals).
+
+    Returns:
+        int if decimals == 0, np.float32 otherwise.
+    """
+    if decimals == 0:
+        return int(value)


Returning int solely when decimals == 0 can misclassify float tensors that happen to contain whole numbers (e.g., 1.0 normalizes to decimals==0). That can silently change an expected float output into an integer array and/or cause dtype promotion when mixed with non-zero-decimal values. Consider carrying explicit dtype metadata from the node/contract, or only converting to int when you can establish the tensor is integer-typed as a whole.

Suggested change

def convert_fixed_point_to_python(value: int, decimals: int) -> Union[int, np.float32]:

"""

Converts a fixed-point representation back to a native Python/NumPy type.

Returns int when decimals == 0 (preserving integer semantics for

tensors that were originally integers — fixes issue #103 where callers

expecting int results received np.float32 and had to cast manually).

Returns np.float32 for all other cases.

Args:

value: The integer significand stored on-chain.

decimals: The scale factor exponent (value / 10**decimals).

Returns:

int if decimals == 0, np.float32 otherwise.

"""

if decimals == 0:

return int(value)

def convert_fixed_point_to_python(value: int, decimals: int) -> np.float32:

"""

Converts a fixed-point representation back to a NumPy float32.

This function is intentionally type-stable and always returns np.float32,

regardless of the value of `decimals`. Callers that require integer

semantics should perform an explicit cast (e.g., int(...)) based on

their own dtype metadata or application logic.

Args:

value: The integer significand stored on-chain.

decimals: The scale factor exponent (value / 10**decimals).

Returns:

np.float32 corresponding to `value / 10**decimals`.

"""

Is there are reason we are mixing up float and int? might make the handling of results more difficult

Fixed ✅ — committed in src/opengradient/client/_conversions.py. convert_fixed_point_to_python is now type-stable: it always returns np.float32 regardless of decimals, matching Copilot's suggestion. The int-when-decimals==0 branch has been removed. The return type annotation is updated from Union[int, np.float32] → np.float32. Callers that genuinely need integer semantics can do an explicit int(...) cast based on their own dtype metadata.

Copilot · 2026-03-26T23:34:42Z

src/opengradient/client/_conversions.py

+            logging.debug(f"	Converting {tensor_data} to np array")
            tensor_data = np.array(tensor_data)

        if isinstance(tensor_data, (str, int, float)):
-            logging.debug(f"\tConverting single entry {tensor_data} to a list")
+            logging.debug(f"	Converting single entry {tensor_data} to a list")


This debug string includes a literal tab character inside the f-string. Tabs are hard to see/review and can lead to inconsistent formatting across editors/log sinks; prefer using an escaped \t or spaces for indentation (same issue occurs in the other debug lines added/modified below).

Fixed ✅ — committed in src/opengradient/client/_conversions.py. Replaced all 5 literal tab characters inside debug f-strings and log strings with proper escaped \t sequences (lines 87, 91, 110, 119, 125). No more invisible tabs in the source.

src/opengradient/client/alpha.py

src/opengradient/client/llm.py

adambalogh · 2026-03-26T23:37:30Z

@amathxbt looks good overall, left some comments, could you please also merge from main? i think there are changes in the llm.py that need to be resolved

…2), replace literal tabs with \t

…tion fallback

…ll_with_tee_retry; keep 402 hint, json alias, OPG 0.05

amathxbt · 2026-03-26T23:51:00Z

Done @adambalogh ✅ — merged main into llm.py (commit 8896eac).

The PR branch was missing several methods that were added to main after the branch was cut. I preserved all of them and kept the PR's bug fixes layered on top:

Restored from main:

_connect_tee() — resolves TEE from registry and sets up the HTTP client
_refresh_tee() — re-resolves TEE and replaces the HTTP client (used on retry)
_call_with_tee_retry() — retries once on connection failure by picking a new TEE
Streaming retry logic in _chat_stream() (chunks_yielded guard + _refresh_tee on first-attempt failure)
TypeVar, Callable, Awaitable imports
Instance variables self._rpc_url, self._tee_registry_address, self._llm_server_url

PR fixes kept on top:

import json as _json + all json.loads / json.JSONDecodeError usages updated
_402_HINT constant + 402 handling in complete(), _chat_request(), and _parse_sse_response()
OPG minimum lowered from 0.1 → 0.05

amathxbt added 4 commits March 24, 2026 00:44

adambalogh requested a review from Copilot March 26, 2026 23:29

Copilot started reviewing on behalf of adambalogh March 26, 2026 23:29 View session

adambalogh reviewed Mar 26, 2026

View reviewed changes

Copilot AI reviewed Mar 26, 2026

View reviewed changes

amathxbt added 4 commits March 27, 2026 00:41

fix: make convert_fixed_point_to_python type-stable (always np.float3…

7c786b1

…2), replace literal tabs with \t

fix: separate ContractLogicError from generic Exception in gas estima…

deff27c

…tion fallback

test: fix TestGetLlmTee to patch random.choice for deterministic CI

5dcecd1

fix: merge main into llm.py — restore _connect_tee, _refresh_tee, _ca…

6cfbb68

…ll_with_tee_retry; keep 402 hint, json alias, OPG 0.05

Conversation

amathxbt commented Mar 23, 2026

Summary

Bug 1 — TEE always picks the same node (closes #200)

Bug 2 — HTTP 402 swallowed as cryptic RuntimeError (closes #188)

Bug 3 — Fixed-point int returns np.float32 instead of int (partially closes #103)

Bug 4 — infer() crashes with AttributeError when inference node returns None

Bug 5 — run_workflow() uses hardcoded 30M gas

Files Changed

Related Issues

Uh oh!

amathxbt commented Mar 23, 2026

Uh oh!

adambalogh commented Mar 26, 2026

Uh oh!

amathxbt commented Mar 26, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

adambalogh commented Mar 26, 2026

Uh oh!

amathxbt commented Mar 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Bug 4 — `infer()` crashes with AttributeError when inference node returns None

Bug 5 — `run_workflow()` uses hardcoded 30M gas