Add nvfp4 dual_gemm example #76
Open
Test results on a B200 GPU with Python 3.13.8 and CuTe DSL 4.3.0.dev0:
(env13_8) nvfp4_dual_gemm$ python3 eval.py test task.yml
compile: start
compile: pass
test-count: 10
test.0.spec: m: 128; n: 256; k: 256; l: 1; seed: 1111
test.0.status: pass
test.1.spec: m: 128; n: 1536; k: 7168; l: 1; seed: 1111
test.1.status: pass
test.2.spec: m: 128; n: 3072; k: 1536; l: 1; seed: 1111
test.2.status: pass
test.3.spec: m: 256; n: 7168; k: 256; l: 1; seed: 1111
test.3.status: pass
test.4.spec: m: 256; n: 7168; k: 2048; l: 1; seed: 1111
test.4.status: pass
test.5.spec: m: 2304; n: 4608; k: 7168; l: 1; seed: 1111
test.5.status: pass
test.6.spec: m: 384; n: 7168; k: 2304; l: 1; seed: 1111
test.6.status: pass
test.7.spec: m: 512; n: 512; k: 7168; l: 1; seed: 1111
test.7.status: pass
test.8.spec: m: 512; n: 4096; k: 512; l: 1; seed: 1111
test.8.status: pass
test.9.spec: m: 512; n: 1536; k: 7168; l: 1; seed: 1111
test.9.status: pass
check: pass
(env13_8) nvfp4_dual_gemm$ python3 eval.py benchmark task.yml
compile: start
compile: pass
benchmark-count: 3
benchmark.0.spec: m: 7168; n: 128; k: 16384; l: 1; seed: 1111
benchmark.0.runs: 200
benchmark.0.mean: 160051.9973784685
benchmark.0.std: 23031.866455664996
benchmark.0.err: 1628.5988954183692
benchmark.0.best: 152575.9994983673
benchmark.0.worst: 472128.0038356781
benchmark.1.spec: m: 4096; n: 128; k: 7168; l: 1; seed: 1111
benchmark.1.runs: 200
benchmark.1.mean: 99979.84111309052
benchmark.1.std: 20095.203008511555
benchmark.1.err: 1420.945431663883
benchmark.1.best: 93184.00174379349
benchmark.1.worst: 378879.99415397644
benchmark.2.spec: m: 7168; n: 128; k: 2048; l: 1; seed: 1111
benchmark.2.runs: 200
benchmark.2.mean: 74724.80170428753
benchmark.2.std: 21870.720279291818
benchmark.2.err: 1546.4934618921386
benchmark.2.best: 69632.00122117996
benchmark.2.worst: 374783.992767334
check: pass
(env13_8) nvfp4_dual_gemm$ python3 eval.py leaderboard task.yml
compile: start
compile: pass
benchmark-count: 3
benchmark.0.spec: m: 7168; n: 128; k: 16384; l: 1; seed: 1111
benchmark.0.runs: 200
benchmark.0.mean: 253803.03986370564
benchmark.0.std: 9263.37772232849
benchmark.0.err: 655.019720415087
benchmark.0.best: 230399.9960422516
benchmark.0.worst: 283648.0140686035
benchmark.1.spec: m: 4096; n: 128; k: 7168; l: 1; seed: 1111
benchmark.1.runs: 200
benchmark.1.mean: 143465.91904759407
benchmark.1.std: 26430.637287532198
benchmark.1.err: 1868.9282857096032
benchmark.1.best: 136191.99395179749
benchmark.1.worst: 509952.00872421265
benchmark.2.spec: m: 7168; n: 128; k: 2048; l: 1; seed: 1111
benchmark.2.runs: 200
benchmark.2.mean: 114432.32048302889
benchmark.2.std: 28682.384716506524
benchmark.2.err: 2028.1508733643152
benchmark.2.best: 107519.99914646149
benchmark.2.worst: 506911.9930267334
check: pass
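
A note for reading the `err` fields: in the logs above, each reported `err` appears consistent with the standard error of the mean, i.e. `std / sqrt(runs)`. This is inferred from the numbers only, not from reading `eval.py`, so treat it as an assumption. A minimal sketch that checks the relation against the first `benchmark` entry (the helper name `standard_error` is hypothetical):

```python
import math


def standard_error(std: float, runs: int) -> float:
    """Standard error of the mean for `runs` samples with spread `std`."""
    return std / math.sqrt(runs)


# Values copied from benchmark.0 of the `benchmark` run above.
reported_std = 23031.866455664996
reported_err = 1628.5988954183692
runs = 200

computed_err = standard_error(reported_std, runs)

# Assumption: eval.py derives err this way; we only check that the numbers agree.
matches = math.isclose(computed_err, reported_err, rel_tol=1e-6)
print(f"computed err = {computed_err:.4f}, matches log: {matches}")
```

If the relation holds, a large gap between `mean` and `best` alongside a small `err` points to a few slow outlier runs (see the `worst` values) rather than uniformly noisy timing.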