[GPT-OSS-120B] Reference implementation #2395
Open
v-shobhit wants to merge 172 commits into mlcommons:master from v-shobhit:gptoss-loadgen
Changes from all commits (172 commits):
- `9b59d2e` [Automated Commit] Format Codebase (mlcommons-bot)
- `fe51c12` Merge branch 'mlcommons:master' into master (v-shobhit)
- `1f2666c` [Automated Commit] Format Codebase (github-actions[bot])
- `f9c4f61` initial (v-shobhit)
- `db9d25e` [Automated Commit] Format Codebase (github-actions[bot])
- `9daa72c` json fixes (v-shobhit)
- `2d0a179` updates, tokenizer (v-shobhit)
- `c8e679d` fix padding (v-shobhit)
- `50453c6` concurrent requests (v-shobhit)
- `e76b68d` increase timeout (v-shobhit)
- `b6d5671` [Automated Commit] Format Codebase (github-actions[bot])
- `4fd4f56` add refactor changes (v-shobhit)
- `1df0885` [Automated Commit] Format Codebase (github-actions[bot])
- `eb2f48c` rm truncation (v-shobhit)
- `4f35b8a` rm truncation, wait for server ready (v-shobhit)
- `354cb62` left padding (v-shobhit)
- `75f4307` [Automated Commit] Format Codebase (github-actions[bot])
- `065bf7c` fixes (v-shobhit)
- `36aa581` add failure check (v-shobhit)
- `c75e629` change opts (v-shobhit)
- `040e986` organize files (v-shobhit)
- `991ff7e` rm submodule (v-shobhit)
- `492847a` add infer stuff (v-shobhit)
- `f3a3282` add harmonize-tokens.py (v-shobhit)
- `8596b13` move things (v-shobhit)
- `1db2f96` [Automated Commit] Format Codebase (github-actions[bot])
- `8c08778` add README (v-shobhit)
- `168f210` fix name (v-shobhit)
- `13775b1` add commands (v-shobhit)
- `37c646d` update README (v-shobhit)
- `9cd0c61` accepts output_ids and detokenize (v-shobhit)
- `6abf8fa` [Automated Commit] Format Codebase (github-actions[bot])
- `8fe8712` add plotter (v-shobhit)
- `889db8c` fix sampling (v-shobhit)
- `381cf60` add reasoning effort option (v-shobhit)
- `209e923` [Automated Commit] Format Codebase (github-actions[bot])
- `fdc518d` doc option (v-shobhit)
- `f73da58` draw input histograms (v-shobhit)
- `4219a77` [Automated Commit] Format Codebase (github-actions[bot])
- `782f066` updates (v-shobhit)
- `326a5fa` [Automated Commit] Format Codebase (github-actions[bot])
- `1b99263` add more opts (v-shobhit)
- `579ef1c` move (v-shobhit)
- `ec8d12e` [Automated Commit] Format Codebase (github-actions[bot])
- `a9de6f4` refactor (v-shobhit)
- `387779c` updates (v-shobhit)
- `bde219f` rename opt (v-shobhit)
- `24ef1e0` updates (v-shobhit)
- `141dd8d` [Automated Commit] Format Codebase (github-actions[bot])
- `684e27a` add healthbench prompt creation (v-shobhit)
- `f5b04db` [Automated Commit] Format Codebase (github-actions[bot])
- `d44256e` add healthbench eval (v-shobhit)
- `ff9133b` [Automated Commit] Format Codebase (github-actions[bot])
- `5f1fd8e` add scripts to fetch datasets (v-shobhit)
- `ca654c8` [Automated Commit] Format Codebase (github-actions[bot])
- `8c59f03` update requirements (v-shobhit)
- `108ea9d` add setup enroot script (v-shobhit)
- `f302c7c` add changes (v-shobhit)
- `534390c` [Automated Commit] Format Codebase (github-actions[bot])
- `78bf971` add symlinks to gitmodules (v-shobhit)
- `115f498` add fetch_lcb.py (v-shobhit)
- `9571c11` [Automated Commit] Format Codebase (github-actions[bot])
- `f8244ee` updates (v-shobhit)
- `531c37c` add pass@k; add spec decode option (v-shobhit)
- `5e86d65` add openai client; add pass@k (v-shobhit)
- `8c02839` [Automated Commit] Format Codebase (github-actions[bot])
- `be4109e` restrict lcb to v5 (v-shobhit)
- `c0b9ef3` [Automated Commit] Format Codebase (github-actions[bot])
- `dc54e98` lcb optimizations (v-shobhit)
- `41cac5a` remove openai client (v-shobhit)
- `06a5387` [Automated Commit] Format Codebase (github-actions[bot])
- `d927fb3` rename (v-shobhit)
- `c2fda5e` remove mmlu, healthbench (v-shobhit)
- `c5c389b` add fetch_all.py (v-shobhit)
- `341c750` updates (v-shobhit)
- `ef142a6` add preprocess (v-shobhit)
- `834175f` update README (v-shobhit)
- `e447081` [Automated Commit] Format Codebase (github-actions[bot])
- `a1e668a` add top-p option (v-shobhit)
- `3db5bc2` add summarize_eval (v-shobhit)
- `3ca2cc1` [Automated Commit] Format Codebase (github-actions[bot])
- `09cd3f7` add trtllm infer script (v-shobhit)
- `1aad396` fixes (v-shobhit)
- `a6e05f2` add round-robin for multi-dp (v-shobhit)
- `a2936a6` fix timeout issues (v-shobhit)
- `188411c` [Automated Commit] Format Codebase (github-actions[bot])
- `b324d7d` add anthropic (v-shobhit)
- `f8a9f43` add stuff (v-shobhit)
- `8429244` optimize lcb multi-pass (v-shobhit)
- `d1f2794` [Automated Commit] Format Codebase (github-actions[bot])
- `ee33969` rm healthbench (v-shobhit)
- `20f8916` [Automated Commit] Format Codebase (github-actions[bot])
- `1e67e0c` lcb bug fixes (v-shobhit)
- `96c90e1` [Automated Commit] Format Codebase (github-actions[bot])
- `e6d9c67` omit top-k if 0 (v-shobhit)
- `ff892bb` [Automated Commit] Format Codebase (github-actions[bot])
- `e4043d2` add changes and plotting scripts (v-shobhit)
- `1714c09` [Automated Commit] Format Codebase (github-actions[bot])
- `19fcc80` add overall number (v-shobhit)
- `6cb7698` [Automated Commit] Format Codebase (github-actions[bot])
- `b281727` add glob matching (v-shobhit)
- `9a0c45a` rename (v-shobhit)
- `35ea0e4` add pubmed tokenization (v-shobhit)
- `1c73423` updates (v-shobhit)
- `9a9194f` add tentative gpt-oss fields (v-shobhit)
- `1812c04` remove data dir (v-shobhit)
- `16af1bf` create preprocess module (v-shobhit)
- `9b4a84c` move things to archive (v-shobhit)
- `9450d46` [Automated Commit] Format Codebase (github-actions[bot])
- `319e5f7` rm unused scripts (v-shobhit)
- `4d89b98` rm unused (v-shobhit)
- `be02519` mv things (v-shobhit)
- `18a8444` add mlperf artifacts (v-shobhit)
- `199f476` add mlperf artifacts (v-shobhit)
- `425ce75` [Automated Commit] Format Codebase (github-actions[bot])
- `7cdc7cb` add utils, gitignore (v-shobhit)
- `ab90695` [Automated Commit] Format Codebase (github-actions[bot])
- `62e4d47` update README (v-shobhit)
- `292c49d` fix request pool size (v-shobhit)
- `98585b8` [Automated Commit] Format Codebase (github-actions[bot])
- `21a8034` add setup (v-shobhit)
- `734d8f4` updates (v-shobhit)
- `e3e22b8` server scenario fix; gpt-oss -> gpt-oss-120b (v-shobhit)
- `e40a7da` add fixes (v-shobhit)
- `382fc9e` add accuracy eval script for mlperf (v-shobhit)
- `31f435a` finishing touches (v-shobhit)
- `63592a3` [Automated Commit] Format Codebase (github-actions[bot])
- `a41f882` refactor mode -> scenario (v-shobhit)
- `b2bc9e0` Merge branch 'mlcommons:master' into gptoss-loadgen (v-shobhit)
- `81f6ca5` add eval_perf script (v-shobhit)
- `f780189` [Automated Commit] Format Codebase (github-actions[bot])
- `7f47e5e` add pass@k to acc eval (v-shobhit)
- `d3a7b58` add repeats_per_sample option to loadgen (v-shobhit)
- `60976f2` Merge branch 'loadgen-repeat-samples' into gptoss-loadgen (v-shobhit)
- `50051f2` [Automated Commit] Format Codebase (github-actions[bot])
- `bee73b2` fix harmonize tokens -> text (v-shobhit)
- `5039fd6` [Automated Commit] Format Codebase (github-actions[bot])
- `db4d290` remove file (v-shobhit)
- `dbb0fd9` fix prompt of summarization (v-shobhit)
- `57c6dae` move stuff to sglang (v-shobhit)
- `da35468` allow use of parquet (v-shobhit)
- `72cd475` [Automated Commit] Format Codebase (github-actions[bot])
- `724502b` Merge branch 'master' into gptoss-loadgen (anandhu-eng)
- `7923249` fix scores for pass@1 with k repeats (v-shobhit)
- `957c53d` add extra-args option (v-shobhit)
- `44f662b` [Automated Commit] Format Codebase (github-actions[bot])
- `a45344b` Update user.conf (v-shobhit)
- `100903b` updates to use v4 (v-shobhit)
- `a6041d8` [Automated Commit] Format Codebase (github-actions[bot])
- `dc699a2` Merge branch 'mlcommons:master' into gptoss-loadgen (v-shobhit)
- `bebb328` remove loadgen changes for repeats (v-shobhit)
- `373d57d` gpt-oss -> gpt-oss-120b (v-shobhit)
- `1a5fda6` [Automated Commit] Format Codebase (github-actions[bot])
- `999fc89` update README (v-shobhit)
- `4983645` remove archive (v-shobhit)
- `34a8c74` update frozen requirements (v-shobhit)
- `02206f7` rm harmonize script + fix score calculation (v-shobhit)
- `de831b3` add percentage (v-shobhit)
- `65b71ad` [Automated Commit] Format Codebase (github-actions[bot])
- `8bebe02` empty commit to trigger CLA (v-shobhit)
- `6346113` remove comments (v-shobhit)
- `71f2a83` add gptoss placeholder values (v-shobhit)
- `8e83c68` rm gpt-oss fields (v-shobhit)
- `9dfb9a5` update user.conf (v-shobhit)
- `715f063` add generation_config.json (v-shobhit)
- `7319e82` add docker command (v-shobhit)
- `0a95e40` add better parsing and check for harmony tokens (v-shobhit)
- `9a4414e` Merge pull request #4 from v-shobhit/gptoss-fix-eval (v-shobhit)
- `6fa49dc` [Automated Commit] Format Codebase (github-actions[bot])
- `1950b13` add exact_match log for submission_checker (v-shobhit)
- `06b681a` Merge branch 'master' into gptoss-loadgen (arjunsuresh)
- `b286618` empty commit to trigger test (v-shobhit)
New file:
```
*venv*
*.pkl
*.csv
```
# MLPerf Inference reference implementation for GPT-OSS-120B

This is the reference implementation for GPT-OSS-120B. It is a proposal and a work in progress.

## Model and Dataset download

#### TODO: Replace this with mlc download link when available

* Model: `openai/gpt-oss-120b`, commit id: [`b5c939d`](https://huggingface.co/openai/gpt-oss-120b/tree/b5c939de8f754692c1647ca79fbf85e8c1e70f8a)
* Dataset: please request access at [this link](https://drive.google.com/drive/folders/1DCfEXHqe69okrqKbSyV-8VUw413JqpPY?usp=drive_link) - **this is a tentative dataset**

Datasets are provided in **Parquet format** (recommended), which loads faster and produces files roughly 50% smaller than pickle. Pickle format is still supported for backward compatibility.
## Environment setup

Work on the reference implementation is done using the SGLang containers at [https://hub.docker.com/r/lmsysorg/sglang/tags](https://hub.docker.com/r/lmsysorg/sglang/tags). For enroot setup, a script is provided under [`setup_enroot.sh`](./setup_enroot.sh). All sections below assume this environment has been instantiated.

Once in the environment, install additional requirements using [`setup.sh`](./setup.sh):
```bash
./setup.sh
```

## Running the reference implementation: SGLang

Use [`./sglang/run_server.sh`](./sglang/run_server.sh) to launch an SGLang server hosting `gpt-oss-120b`.

### Run the server
```bash
./run_server.sh \
    --model_path path/to/gpt-oss-120b/model \
    --dp N \
    --stream_interval 100 \
    --eagle_path optional/path/to/eagle/head
```
The script uses `python3 -m sglang.launch_server` to instantiate the model, with `tp=pp=ep=1` and `dp` as specified.
You may also use Docker:
```bash
docker run --runtime nvidia --gpus all --net host \
    -v ${HF_HOME}:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
    --ipc=host lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path ${MODEL_NAME} \
    --host 0.0.0.0 --port 3000 --data-parallel-size=1 --max-running-requests 512 \
    --mem-fraction-static 0.85 --chunked-prefill-size 16384 --ep-size=1 \
    --enable-metrics --stream-interval 500
```

Then, run a benchmark script that uses the client to send and receive requests.

### Run the inference

**Note:** All scripts support both Parquet (`.parquet`) and Pickle (`.pkl`) dataset files. Parquet is recommended as it offers:
- 50% smaller file size
- Faster loading times
- Cross-language compatibility
- Type-safe schema preservation
Example usage:
```bash
# first, install loadgen
pip install $(git rev-parse --show-toplevel)/loadgen

# Using Parquet format (recommended)
python3 run_mlperf.py \
    --scenario offline \
    --input-file /path/to/dataset.parquet \
    --accuracy

# Using Pickle format (backward compatible)
python3 run_mlperf.py \
    --scenario offline \
    --input-file /path/to/dataset.pkl \
    --accuracy
```
Full command-line options:
```bash
python3 run_mlperf.py --help
usage: run_mlperf.py [-h] [--scenario {offline,server}] --input-file INPUT_FILE [--max-samples MAX_SAMPLES] [--mlperf-conf MLPERF_CONF]
                     [--user-conf USER_CONF] [--accuracy] [--output-dir OUTPUT_DIR] [--backend {sglang}] [--server-url SERVER_URL]
                     [--generation-config GENERATION_CONFIG] [--max-new-tokens MAX_NEW_TOKENS] [--num-workers NUM_WORKERS]
                     [--max-concurrency MAX_CONCURRENCY]

Run MLPerf inference benchmarks for gpt-oss

options:
  -h, --help            show this help message and exit
  --scenario {offline,server}
                        MLPerf scenario mode
  --input-file INPUT_FILE
                        Path to tokenized dataset (parquet or pickle file)
  --max-samples MAX_SAMPLES
                        Maximum number of samples to use (None for all)
  --mlperf-conf MLPERF_CONF
                        Path to MLPerf configuration file
  --user-conf USER_CONF
                        Path to user configuration file
  --accuracy            Run accuracy mode instead of performance
  --output-dir OUTPUT_DIR
                        Directory for MLPerf output logs
  --backend {sglang}    Backend to use for inference
  --server-url SERVER_URL
                        Server URL for backend (SGLang)
  --generation-config GENERATION_CONFIG
                        Path to generation configuration JSON file
  --max-new-tokens MAX_NEW_TOKENS
                        Override max_new_tokens from generation config (default: use value from config)
  --num-workers NUM_WORKERS
                        Number of worker threads (for server scenario)
  --max-concurrency MAX_CONCURRENCY
                        Maximum concurrent requests to backend (SGLang handles batching internally)
```
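The `--user-conf` flag points LoadGen at a user configuration file. As a hedged illustration only (the keys follow LoadGen's `Model.Scenario.key = value` convention, but every value below is a placeholder, not an official setting for this benchmark), such a file might look like:

```
# Illustrative user.conf sketch -- values are placeholders, not official targets
gpt-oss-120b.Offline.target_qps = 1.0
gpt-oss-120b.Server.target_qps = 0.5
gpt-oss-120b.*.min_duration = 600000
```

Settings in `user.conf` override the defaults from `mlperf.conf` for the named model and scenario.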
### Evaluate the accuracy
Run `run_mlperf.py` with `--accuracy`, then use the generated `mlperf_log_accuracy.json` to evaluate the accuracy of the run.

Example usage:
```bash
# Using Parquet format (recommended)
python3 eval_mlperf_accuracy.py \
    --mlperf-log mlperf_results/offline/accuracy/mlperf_log_accuracy.json \
    --reference-data /path/to/acc_eval_inputs.parquet \
    --tokenizer openai/gpt-oss-120b

# Using Pickle format (backward compatible)
python3 eval_mlperf_accuracy.py \
    --mlperf-log mlperf_results/offline/accuracy/mlperf_log_accuracy.json \
    --reference-data /path/to/acc_eval_inputs.pkl \
    --tokenizer openai/gpt-oss-120b
```

Full command-line options:
```bash
python3 eval_mlperf_accuracy.py --help
usage: eval_mlperf_accuracy.py [-h] --mlperf-log MLPERF_LOG --reference-data REFERENCE_DATA [--tokenizer TOKENIZER] [--output-file OUTPUT_FILE]
                               [--save-outputs SAVE_OUTPUTS] [--num-lcb-workers NUM_LCB_WORKERS] [--verbose]

Evaluate MLPerf accuracy logs for gpt-oss-120b

options:
  -h, --help            show this help message and exit
  --mlperf-log MLPERF_LOG
                        Path to mlperf_log_accuracy.json
  --reference-data REFERENCE_DATA
                        Path to reference parquet or pickle file (DataFrame with dataset, ground_truth, etc.)
  --tokenizer TOKENIZER
                        HuggingFace tokenizer name or path
  --output-file OUTPUT_FILE
                        Output JSON file for results (optional)
  --save-outputs SAVE_OUTPUTS
                        Save detokenized outputs to pickle file (ordered by qsl_idx) for debugging
  --num-lcb-workers NUM_LCB_WORKERS
                        Number of parallel workers for LiveCodeBench evaluation (default: 64)
  --verbose             Verbose logging
```
New file:
```python
#!/usr/bin/env python3
"""Backend implementations for gpt-oss inference."""

from .base_backend import BaseBackend
from .sglang_backend import SGLangBackend

__all__ = [
    "BaseBackend",
    "SGLangBackend",
]
```
New file:
```python
#!/usr/bin/env python3
"""Base backend class for gpt-oss inference."""

import abc
import logging
from typing import List, Dict, Any, Optional

logger = logging.getLogger(__name__)


class BaseBackend(abc.ABC):
    """Abstract base class for inference backends.

    All backends must implement this interface to work with the MLPerf SUT.
    """

    def __init__(self, config: Optional[Dict[str, Any]] = None):
        """Initialize the backend.

        Args:
            config: Optional configuration dictionary
        """
        self.config = config or {}
        self.initialized = False
        logger.info(f"Initializing {self.__class__.__name__}")

    @abc.abstractmethod
    def initialize(self) -> None:
        """Initialize the backend (load model, connect to server, etc.)."""
        raise NotImplementedError("Subclasses must implement initialize()")

    @abc.abstractmethod
    def generate(
        self,
        prompts: List[List[int]],
        max_tokens: int = 100,
        temperature: float = 0.001,
        top_k: int = 1,
        top_p: float = 1.0,
        **kwargs
    ) -> List[Dict[str, Any]]:
        """Generate responses for a batch of prompts.

        Args:
            prompts: List of token ID sequences
            max_tokens: Maximum tokens to generate per prompt
            temperature: Sampling temperature
            top_k: Top-k sampling parameter
            top_p: Top-p (nucleus) sampling parameter
            **kwargs: Additional backend-specific parameters

        Returns:
            List of response dictionaries with keys:
                - output_ids: List of generated token IDs
                - output_text: Generated text (optional)
                - metadata: Additional metadata (latencies, etc.)
        """
        raise NotImplementedError("Subclasses must implement generate()")

    @abc.abstractmethod
    def cleanup(self) -> None:
        """Clean up backend resources."""
        raise NotImplementedError("Subclasses must implement cleanup()")

    def __enter__(self):
        """Context manager entry."""
        self.initialize()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        """Context manager exit."""
        self.cleanup()

    @property
    def is_initialized(self) -> bool:
        """Check if backend is initialized."""
        return self.initialized
```
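To show how the `BaseBackend` interface is meant to be consumed, here is a self-contained sketch with a trimmed copy of the abstract class and a hypothetical `EchoBackend` that returns each prompt unchanged; it is useful for exercising SUT wiring without a model server and is not part of this PR:

```python
# Hypothetical dummy backend implementing the BaseBackend interface,
# with a trimmed inline copy of the ABC so this sketch runs standalone.
import abc
from typing import Any, Dict, List, Optional


class BaseBackend(abc.ABC):
    def __init__(self, config: Optional[Dict[str, Any]] = None):
        self.config = config or {}
        self.initialized = False

    @abc.abstractmethod
    def initialize(self) -> None: ...

    @abc.abstractmethod
    def generate(self, prompts: List[List[int]], **kwargs) -> List[Dict[str, Any]]: ...

    @abc.abstractmethod
    def cleanup(self) -> None: ...

    def __enter__(self):
        self.initialize()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.cleanup()


class EchoBackend(BaseBackend):
    """Echoes each prompt's token IDs back (truncated to max_tokens)."""

    def initialize(self) -> None:
        self.initialized = True

    def generate(self, prompts, max_tokens: int = 100, **kwargs):
        return [
            {"output_ids": p[:max_tokens], "output_text": None, "metadata": {}}
            for p in prompts
        ]

    def cleanup(self) -> None:
        self.initialized = False


# Context-manager usage, as supported by __enter__/__exit__:
with EchoBackend() as backend:
    results = backend.generate([[1, 2, 3]], max_tokens=2)
```

The context-manager protocol guarantees `cleanup()` runs even if `generate()` raises, which is the main reason the base class defines `__enter__`/`__exit__`.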