SupChain-Bench has been accepted by ACL 2026 Findings. If you use it in your research, please cite our paper:
@misc{guan2026supchainbenchbenchmarkinglargelanguage,
title={SupChain-Bench: Benchmarking Large Language Models for Real-World Supply Chain Management},
author={Shengyue Guan and Yihao Liu and Lang Cao},
year={2026},
eprint={2602.07342},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2602.07342},
}

A comprehensive benchmark for evaluating LLM tool-use and multi-step reasoning capabilities in supply chain order management scenarios.
SupChain-Bench simulates a realistic three-tier supply chain system (Trade → Fulfillment → Warehouse) where models must navigate complex hierarchical relationships to answer natural language queries. Models are evaluated on their ability to make strategic function calls, chain multiple tool invocations with conditional logic, and produce accurate structured answers.
Key Features:
- Realistic Supply Chain Simulation: Three-tier order management with authentic business logic including cancellations, errors, and status tracking
- Tool-Grounded Evaluation: Models must make strategic function calls to retrieve information from a simulated database
- Multi-Step Reasoning: Questions demand chaining multiple tool calls with conditional logic based on intermediate results
.
├── README.md
├── config/
│   ├── prompt_tool_sessions.yaml  # Standard system/user prompts
│   ├── prompt_react_reasoning.yaml  # ReAct-style prompts
│   ├── prompt_tool_sessions_with_sop.yaml  # SOP-guided prompts
│   └── prompts/
│       ├── generate_question_prompt.md  # Question generation prompt template
│       ├── generate_answer_prompt.md  # Answer generation prompt template
│       └── prompt_tool_sessions_with_sop.md
├── data/
│   ├── TradeOrders.csv  # Trade order data
│   ├── FulfillmentOrders.csv  # Fulfillment order data
│   ├── WarehouseOrders.csv  # Warehouse order data
│   ├── ErrorLogs.csv  # Error logs
│   ├── CancellationContext.csv  # Cancellation metadata
│   ├── tool_use_question.jsonl  # Tool-use benchmark questions
│   ├── tool_use_answers.jsonl  # Tool-use benchmark ground-truth answers
│   ├── multiple_choices_clean.jsonl  # Multiple-choice questions
│   ├── single_choices_clean.jsonl  # Single-choice questions
│   └── true_false_clean.jsonl  # True/false questions
├── src/
│   └── tool.py  # Tool definitions, schemas, and registry
├── scripts/
│   └── evaluation.py  # Evaluation script
├── generate_data.py  # Synthetic dataset generation
└── get_results.py  # Deterministic result orchestration (ground truth)
The benchmark uses five CSV tables representing a hierarchical order management system:
| Table | Key Fields | Description |
|---|---|---|
| TradeOrders | trade_order_id, buyer_id | Top-level customer orders |
| FulfillmentOrders | fulfillment_order_id, trade_order_id, biz_status | Fulfillment units within trade orders |
| WarehouseOrders | warehouse_order_id, fulfillment_order_id, status, error_code | Warehouse-level execution units |
| ErrorLogs | entity_type, warehouse_order_id, fulfillment_order_id, code, text | Detailed error information |
| CancellationContext | entity_type, entity_id, cancel_type, reason_code, reason_text | Cancellation metadata |
Each trade order contains 1–5 fulfillment orders, and each fulfillment contains 1–3 warehouse orders.
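The nesting described above can be illustrated with a minimal sketch. The column names follow the table, but the sample rows, the `build_hierarchy` helper, and the nested output shape are hypothetical and are not taken from the repository's code.

```python
from collections import defaultdict

# Hypothetical sample rows; only the column names come from the table above.
trade_orders = [{"trade_order_id": "T1", "buyer_id": 9001}]
fulfillment_orders = [
    {"fulfillment_order_id": "FO1", "trade_order_id": "T1", "biz_status": "shipped"},
    {"fulfillment_order_id": "FO2", "trade_order_id": "T1", "biz_status": "cancelled"},
]
warehouse_orders = [
    {"warehouse_order_id": "WO1", "fulfillment_order_id": "FO1", "status": "shipped", "error_code": None},
    {"warehouse_order_id": "WO2", "fulfillment_order_id": "FO1", "status": "error", "error_code": "E42"},
    {"warehouse_order_id": "WO3", "fulfillment_order_id": "FO2", "status": "cancelled", "error_code": None},
]


def build_hierarchy(trades, fulfillments, warehouses):
    """Nest warehouse orders under fulfillments, and fulfillments under trades."""
    wo_by_fo = defaultdict(list)
    for wo in warehouses:
        wo_by_fo[wo["fulfillment_order_id"]].append(wo)

    fo_by_trade = defaultdict(list)
    for fo in fulfillments:
        nested = {**fo, "warehouse_orders": wo_by_fo[fo["fulfillment_order_id"]]}
        fo_by_trade[fo["trade_order_id"]].append(nested)

    return {
        t["trade_order_id"]: {**t, "fulfillment_orders": fo_by_trade[t["trade_order_id"]]}
        for t in trades
    }


hierarchy = build_hierarchy(trade_orders, fulfillment_orders, warehouse_orders)
```

The same grouping could be done with pandas merges on the real CSV files; plain dictionaries are used here only to keep the sketch dependency-free.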
The benchmark provides 8 function tools (defined in src/tool.py) that models can call:
| # | Tool | Purpose | Key Output |
|---|---|---|---|
| 1 | query_buyer_and_related | Entry point: get buyer info + related order IDs | buyer_id, related_item[] |
| 2 | get_fulfillment_status | Get aggregated fulfillment status | status |
| 3 | get_cancel_scenes | Get cancellation initiator | cancelType (BUYER/SELLER) |
| 4 | get_cancel_error_code | Get cancellation reason | cancelErrorCode, cancelErrorMsg |
| 5 | get_error_reason | Get fulfillment-level error details | code, text |
| 6 | check_fake_shipping | Check for fake shipping flags | exceptionFlag |
| 7 | get_warehouse_status | Get warehouse order status | status, error |
| 8 | get_warehouse_error_details | Get detailed warehouse error info | code, text |
Conditional Tool-Calling Flow:
query_buyer_and_related(order_id)
└─ for each fulfillment_id:
   get_fulfillment_status(fulfillment_id)
   ├─ if "cancelled" → get_cancel_scenes → get_cancel_error_code
   │  (optional: check_fake_shipping)
   ├─ if "error" → get_error_reason
   └─ for each warehouse_order_id:
      get_warehouse_status → get_warehouse_error_details
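One possible reading of this flow is sketched below against a stand-in registry of stub tools. Several details are assumptions rather than facts from the repository: the `run_flow` helper, the argument names for the cancellation and error tools, the choice to check warehouse orders for every fulfillment, and the rule that `get_warehouse_error_details` is called only when `get_warehouse_status` returns a non-null `error`. The optional `check_fake_shipping` call is omitted.

```python
def run_flow(order_id, registry):
    """Walk the conditional flow for one trade order and return the call trace."""
    trace = []

    def call(name, **arguments):
        # Execute one tool and record it in the same shape as a tool_trace entry.
        output = registry[name](**arguments)
        trace.append({"step": len(trace) + 1, "name": name,
                      "arguments": arguments, "output": output})
        return output

    root = call("query_buyer_and_related", order_id=order_id)

    # Group warehouse order IDs by fulfillment ID, preserving first-seen order.
    by_fulfillment = {}
    for item in root["related_item"]:
        by_fulfillment.setdefault(item["fulfillment_id"], []).append(item["warehouse_order_id"])

    for fid, wids in by_fulfillment.items():
        status = call("get_fulfillment_status", fulfillment_id=fid)["status"]
        if status == "cancelled":
            call("get_cancel_scenes", fulfillment_id=fid)        # assumed argument name
            call("get_cancel_error_code", fulfillment_id=fid)    # assumed argument name
        elif status == "error":
            call("get_error_reason", fulfillment_id=fid)         # assumed argument name
        for wid in wids:
            wh = call("get_warehouse_status", fulfillment_id=fid, warehouse_order_id=wid)
            if wh.get("error"):  # assumption: fetch details only when an error is reported
                call("get_warehouse_error_details", fulfillment_id=fid, warehouse_order_id=wid)
    return trace


# Stub tools with made-up return values, standing in for the real TOOL_REGISTRY.
stub_registry = {
    "query_buyer_and_related": lambda order_id: {
        "buyer_id": {"id": 1},
        "related_item": [{"fulfillment_id": "FO1", "warehouse_order_id": "WO1"}],
    },
    "get_fulfillment_status": lambda fulfillment_id: {"status": "cancelled"},
    "get_cancel_scenes": lambda fulfillment_id: {"cancelType": "BUYER"},
    "get_cancel_error_code": lambda fulfillment_id: {"cancelErrorCode": "C1", "cancelErrorMsg": "demo"},
    "get_error_reason": lambda fulfillment_id: {"code": "E1", "text": "demo"},
    "get_warehouse_status": lambda fulfillment_id, warehouse_order_id: {"status": "cancelled", "error": None},
    "get_warehouse_error_details": lambda fulfillment_id, warehouse_order_id: {"code": "W1", "text": "demo"},
}

trace = run_flow("T1", stub_registry)
names = [step["name"] for step in trace]
```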
The OpenAI-compatible function tool schemas are exported as the tools list and TOOL_REGISTRY dict in src/tool.py, ready to be plugged into any LLM API client.
Prerequisites: Python 3.8+
Install dependencies:
pip install pandas numpy

Generate the synthetic dataset:
python generate_data.py

Configurable parameters: num_trades (default: 100), cancel_rate (default: 0.15), error_rate (default: 0.10), seed (default: 42).
get_results.py deterministically orchestrates tool calls for a given question and produces the expected structured answer:
python get_results.py "What is the status of order T1030?"

Import the tool definitions and registry into your own inference code:
from src.tool import tools, TOOL_REGISTRY
# `tools`: OpenAI-compatible function schemas; pass them to your API call
# `TOOL_REGISTRY`: name → callable mapping; use it to execute tool calls locally

Prompt templates are available in config/ for different strategies (standard, ReAct, SOP-guided).
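As a rough illustration of executing a tool call locally, the sketch below dispatches one call through a name → callable mapping. In OpenAI-compatible APIs, a tool call carries a function name and a JSON-encoded arguments string, which is what the helper expects. The `execute_tool_call` helper, its error-reporting format, and the `demo_registry` stand-in are hypothetical; in practice you would pass the real `TOOL_REGISTRY`.

```python
import json


def execute_tool_call(name, arguments_json, registry):
    """Run one model-requested tool call and return a JSON string for the tool reply."""
    if name not in registry:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        arguments = json.loads(arguments_json or "{}")
        result = registry[name](**arguments)
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON or wrong/missing arguments: report back instead of crashing.
        # Note: this also catches a TypeError raised inside the tool itself.
        return json.dumps({"error": str(exc)})
    return json.dumps(result)


# Stand-in for TOOL_REGISTRY with one stub tool and a made-up return value.
demo_registry = {
    "get_fulfillment_status": lambda fulfillment_id: {"status": "packing_in_progress"},
}

reply = execute_tool_call(
    "get_fulfillment_status", '{"fulfillment_id": "FO2080"}', demo_registry
)
unknown = execute_tool_call("no_such_tool", "{}", demo_registry)
```

Returning errors as JSON, rather than raising, lets the model see the failure and retry with corrected arguments; whether that is desirable depends on your inference loop.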
python scripts/evaluation.py \
-g data/tool_use_answers.jsonl \
-p your_model_predictions.jsonl \
-o results/evaluation/eval.json

Optional flags:
- -q/--questions: Path to a questions JSONL file, used to include question text in mismatch reports
- --details-limit: Maximum number of mismatches shown in the stdout preview (default: 50)
Entity-Level Precision/Recall:
- Trade Order Level: trade_order_id, buyer_id
- Fulfillment Order Level: fulfillment_id, status, cancel_type, reason_code, reason_text, errorCode, errorText
- Warehouse Order Level: warehouse_order_id, status, errorCode, errorText
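To make the metric concrete, here is a minimal sketch of precision and recall over field values at one entity level. It assumes each field is flattened into an `(entity_id, field, value)` tuple and compared by exact match; the actual matching rules in scripts/evaluation.py may differ, and the sample values are made up.

```python
def precision_recall(predicted, expected):
    """Precision and recall of predicted field tuples against expected ones."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical fulfillment-level fields for one cancelled fulfillment order.
expected = {
    ("FO2080", "status", "cancelled"),
    ("FO2080", "cancel_type", "BUYER"),
    ("FO2080", "reason_code", "C01"),
    ("FO2080", "reason_text", "Buyer changed mind"),
}
predicted = {
    ("FO2080", "status", "cancelled"),
    ("FO2080", "cancel_type", "SELLER"),  # wrong value, so it counts against precision
    ("FO2080", "reason_code", "C01"),
    # reason_text is missing, so it counts against recall
}

precision, recall = precision_recall(predicted, expected)
```

Here two of the three predicted tuples are correct and two of the four expected tuples are recovered.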
Conditional Logic Evaluation covers three flows:
- Normal Flow: Status tracking without errors/cancellations
- Cancellation Flow: Requires cancel_type, reason_code, reason_text
- Error Flow: Requires errorCode and errorText
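The three flows can be read as per-status field requirements. The sketch below checks a fulfillment-level record against them. It assumes the flow is selected by the status strings "cancelled" and "error" used in the tool-calling flow, and the `REQUIRED_FIELDS` mapping and helper names are hypothetical rather than taken from the evaluator.

```python
# Assumed mapping from flow to the fields a fulfillment-level record should carry.
REQUIRED_FIELDS = {
    "normal": {"fulfillment_id", "status"},
    "cancellation": {"fulfillment_id", "status", "cancel_type", "reason_code", "reason_text"},
    "error": {"fulfillment_id", "status", "errorCode", "errorText"},
}


def classify_flow(status):
    """Pick the flow from the fulfillment status (assumed status strings)."""
    if status == "cancelled":
        return "cancellation"
    if status == "error":
        return "error"
    return "normal"


def missing_fields(record):
    """Return the flow for a record and the sorted list of required fields it lacks."""
    flow = classify_flow(record.get("status"))
    present = {key for key, value in record.items() if value is not None}
    return flow, sorted(REQUIRED_FIELDS[flow] - present)


flow, missing = missing_fields(
    {"fulfillment_id": "FO1", "status": "cancelled", "cancel_type": "BUYER"}
)
```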
Each line in the predictions JSONL file must contain a tool_trace array. The evaluator parses this trace to reconstruct the structured answer and compare it against ground truth.
{
"tool_trace": [
{
"step": 1,
"name": "query_buyer_and_related",
"arguments": {"order_id": "T1030"},
"output": {
"buyer_id": {"id": 90029},
"related_item": [
{"fulfillment_id": "FO2080", "warehouse_order_id": "WO3170"}
]
}
},
{
"step": 2,
"name": "get_fulfillment_status",
"arguments": {"fulfillment_id": "FO2080"},
"output": {"status": "packing_in_progress"}
},
{
"step": 3,
"name": "get_warehouse_status",
"arguments": {"fulfillment_id": "FO2080", "warehouse_order_id": "WO3170"},
"output": {"status": "packing_in_progress", "error": null}
}
]
}

Each tool_trace entry must have:
- name: One of the 8 tool names (e.g. query_buyer_and_related)
- arguments: The arguments passed to the tool (dict)
- output: The tool's return value (dict)
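Before running the evaluator, it can help to sanity-check each predictions line against these requirements. The following is a minimal sketch; the `validate_prediction_line` helper and its messages are hypothetical, and the evaluator itself may be stricter or more lenient.

```python
import json

# The 8 tool names listed in the tools table.
TOOL_NAMES = {
    "query_buyer_and_related", "get_fulfillment_status", "get_cancel_scenes",
    "get_cancel_error_code", "get_error_reason", "check_fake_shipping",
    "get_warehouse_status", "get_warehouse_error_details",
}


def validate_prediction_line(line):
    """Return a list of problems found in one predictions JSONL line (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(record, dict):
        return ["line is not a JSON object"]
    trace = record.get("tool_trace")
    if not isinstance(trace, list):
        return ["tool_trace is missing or not an array"]

    problems = []
    for index, entry in enumerate(trace):
        if not isinstance(entry, dict):
            problems.append(f"entry {index}: not an object")
            continue
        name = entry.get("name")
        if not isinstance(name, str) or name not in TOOL_NAMES:
            problems.append(f"entry {index}: unknown tool name {name!r}")
        if not isinstance(entry.get("arguments"), dict):
            problems.append(f"entry {index}: arguments must be a dict")
        if not isinstance(entry.get("output"), dict):
            problems.append(f"entry {index}: output must be a dict")
    return problems


good_line = json.dumps({"tool_trace": [{
    "step": 1,
    "name": "get_fulfillment_status",
    "arguments": {"fulfillment_id": "FO2080"},
    "output": {"status": "packing_in_progress"},
}]})
bad_line = json.dumps({"tool_trace": [{"name": "lookup_order", "arguments": {}}]})

good_problems = validate_prediction_line(good_line)
bad_problems = validate_prediction_line(bad_line)
```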
The evaluator automatically reconstructs trade/fulfillment/warehouse structures from the trace and compares field-by-field against ground truth.
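A simplified picture of this reconstruction is sketched below. It handles only the three tools that appear in the example trace, and the `reconstruct_answer` helper and its nested output shape are assumptions for illustration, not the structure used by scripts/evaluation.py.

```python
def reconstruct_answer(tool_trace):
    """Rebuild a nested trade/fulfillment/warehouse answer from a tool trace."""
    answer = {"trade_order_id": None, "buyer_id": None, "fulfillments": {}}

    def fulfillment(fid):
        # Create the fulfillment record on first sight, then reuse it.
        return answer["fulfillments"].setdefault(fid, {"warehouse_orders": {}})

    for entry in tool_trace:
        name, args, out = entry["name"], entry["arguments"], entry["output"]
        if name == "query_buyer_and_related":
            answer["trade_order_id"] = args.get("order_id")
            answer["buyer_id"] = out.get("buyer_id", {}).get("id")
            for item in out.get("related_item", []):
                fo = fulfillment(item["fulfillment_id"])
                fo["warehouse_orders"].setdefault(item["warehouse_order_id"], {})
        elif name == "get_fulfillment_status":
            fulfillment(args["fulfillment_id"])["status"] = out.get("status")
        elif name == "get_warehouse_status":
            fo = fulfillment(args["fulfillment_id"])
            wo = fo["warehouse_orders"].setdefault(args["warehouse_order_id"], {})
            wo["status"] = out.get("status")
        # Cancellation and error tools are omitted from this sketch.
    return answer


# The example trace from above, written as Python literals (JSON null becomes None).
example_trace = [
    {"step": 1, "name": "query_buyer_and_related",
     "arguments": {"order_id": "T1030"},
     "output": {"buyer_id": {"id": 90029},
                "related_item": [{"fulfillment_id": "FO2080", "warehouse_order_id": "WO3170"}]}},
    {"step": 2, "name": "get_fulfillment_status",
     "arguments": {"fulfillment_id": "FO2080"},
     "output": {"status": "packing_in_progress"}},
    {"step": 3, "name": "get_warehouse_status",
     "arguments": {"fulfillment_id": "FO2080", "warehouse_order_id": "WO3170"},
     "output": {"status": "packing_in_progress", "error": None}},
]

answer = reconstruct_answer(example_trace)
```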
Copyright (c) 2025 AIDC-SupplyChain-AI
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this project except in compliance with the License. You may obtain a copy of the License at:
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the LICENSE file for the specific language governing permissions and limitations under the License.
Third-party software notices and additional attributions can be found in THIRD-PARTY-NOTICES.