
Conversation

@LinjianMa

This PR introduces the reference implementation for DLRMv3.
Instructions for running the benchmark are provided in the README. Please note that the dataset setup is still pending; the README will be updated with download instructions once the dataset becomes available.

@LinjianMa requested a review from a team as a code owner December 12, 2025 00:55
@github-actions
Contributor

github-actions bot commented Dec 12, 2025

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

@nvzhihanj
Contributor

@tanvi-mlcommons @pgmpablo157321, please take a look (you can ignore the HSTU submodules) and help resolve the CLA issues for Linjian.
@zihaok, please help review as well.

@LinjianMa
Author

Hi @mrmhodak, could you help review this PR? All of @nvzhihanj's comments have been addressed.

```
python accuracy.py --path path/to/mlperf_log_accuracy.json
```
We use normalized entropy (NE), accuracy, and AUC as the metrics to evaluate model quality. The results for the reference implementation, evaluated on 34,996 requests across 10 inference timestamps, are listed below:
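For illustration only (this is not the accuracy.py from this PR), a minimal sketch of how NE, accuracy, and AUC can be computed from per-request predicted probabilities and binary labels; the function and variable names here are hypothetical:

```python
# Hypothetical sketch; the accuracy.py in this PR is the source of truth.
import numpy as np
from sklearn.metrics import roc_auc_score

def compute_quality_metrics(probs: np.ndarray, labels: np.ndarray) -> dict:
    """Compute NE, accuracy, and AUC for binary predictions."""
    eps = 1e-7
    probs = np.clip(probs, eps, 1 - eps)
    # Mean log loss of the model's predictions.
    logloss = -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))
    # Normalized entropy: log loss divided by the entropy of the label base rate,
    # so a model that only predicts the base rate scores 1.0.
    p = labels.mean()
    base_entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))
    return {
        "NE": logloss / base_entropy,
        "accuracy": float(np.mean((probs > 0.5) == labels)),
        "AUC": roc_auc_score(labels, probs),
    }
```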
Contributor

@nvzhihanj Dec 17, 2025

Are we using all 3 scores for the 99% target, or just AUC? (If it's just AUC, please point that out.)
Also, it seems that loadgen/mlperf.conf is missing (for the latency threshold and sample count).

Author

Yes, we need all 3 scores to be within 99% of the reference; I will clarify that in the README. I will also add the latency threshold to mlperf.conf.

Regarding the sample count, could you explain when we would use it? I'm a bit confused, as I think the performance_sample_count_override entries

# Set performance_sample_count for each model.
# User can optionally set this to higher values in user.conf.
resnet50.*.performance_sample_count_override = 1024
ssd-mobilenet.*.performance_sample_count_override = 256
retinanet.*.performance_sample_count_override = 64
bert.*.performance_sample_count_override = 10833
dlrm.*.performance_sample_count_override = 204800
dlrm-v2.*.performance_sample_count_override = 204800
rnnt.*.performance_sample_count_override = 2513
gptj.*.performance_sample_count_override = 13368
mixtral-8x7b.*.performance_sample_count_override = 15000
llama2-70b.*.performance_sample_count_override = 24576
llama2-70b-interactive.*.performance_sample_count_override = 24576
llama3_1-405b.*.performance_sample_count_override = 8313
llama3_1-405b-interactive.*.performance_sample_count_override = 8313
llama3_1-8b.*.performance_sample_count_override = 13368
llama3_1-8b-edge.*.performance_sample_count_override = 5000
llama3_1-8b-interactive.*.performance_sample_count_override = 13368
stable-diffusion-xl.*.performance_sample_count_override = 5000
rgat.*.performance_sample_count_override = 788379
pointpainting.*.performance_sample_count_override = 1024
deepseek-r1.*.performance_sample_count_override = 4388
whisper.*.performance_sample_count_override = 1633
# set to 0 to let entire sample set to be performance sample
3d-unet.*.performance_sample_count_override = 0
are never used. In particular, performance_sample_count at https://github.com/mlcommons/inference/blob/8999c4d686f6e4a180da14597c97063fce7c9f33/loadgen/test_settings_internal.cc#L123C3-L123C27 is only used when performance_issue_unique is True, and I don't see a case in the existing benchmarks (or in DLRMv3) where this flag needs to be True.
My understanding is that submitters are allowed to change the sample count, and as long as the minimum duration requirement for the benchmark is met, this should be acceptable.
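For context on where a performance sample count does get consumed, here is a minimal, hypothetical LoadGen harness fragment (not code from this PR); the counts and callback names are illustrative:

```python
# Hypothetical harness fragment; counts and callbacks are placeholders.
import mlperf_loadgen as lg

def load_samples_to_ram(sample_indices):
    pass  # load the requested samples into memory

def unload_samples_from_ram(sample_indices):
    pass  # release them

total_sample_count = 34996        # size of the full query sample library (illustrative)
performance_sample_count = 34996  # samples LoadGen may keep resident during the run

# The harness reports its performance sample count to LoadGen when the QSL is
# constructed; the performance run draws queries from this loaded set.
qsl = lg.ConstructQSL(
    total_sample_count,
    performance_sample_count,
    load_samples_to_ram,
    unload_samples_from_ram,
)
```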

DLRMv3SyntheticStreamingDataset,
{
    "ratings_file_prefix": os.path.join(
        new_path_prefix, "data/streaming-100b/sampled_data/"

Suggest changing this to "sampled_data/": --dataset-path-prefix already points to the location that contains sampled_data, so data/streaming-100b/ is no longer a valid path unless the user modifies it by hand.

"--scenario-name", default="Server", choices={"Server", "Offline"}, help="inference benchmark scenario"
)
parser.add_argument(
"--batchsize", default=10, help="batch size used in the benchmark"

I did a test run of this reference implementation, and it throws an error because there is no type enforcement in the argument parser; Python parses all of these values as strings instead of their actual data types. Can you add the types?
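A sketch of the kind of fix being asked for, with an explicit type= argument so numeric flags are not parsed as strings (the argument set shown is illustrative):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--scenario-name", default="Server", choices={"Server", "Offline"},
    help="inference benchmark scenario",
)
# Without type=int, a value passed on the command line comes back as a string.
parser.add_argument(
    "--batchsize", type=int, default=10,
    help="batch size used in the benchmark",
)
args = parser.parse_args()
```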


Besides this, the accuracy results were successfully reproduced. Thanks!
