[AArch64] Update zero latency instructions in Neoverse scheduling tables #165690

simonwallis2 · 2025-10-30T10:47:14Z

NeoverseZeroMove was introduced for Neoverse-V2 and was added to V3 and V3AE.
Use NeoverseZeroMove for Neoverse-V1, N2, N3 in the same way, including these instructions:
MOV Xd|Wd, #0|XZR|WZR

For all the above Neoverse targets, the following instructions are also decoded as not utilizing the scheduling and execution resources of the machine:
MOV Wd,Wn
MOV Xd,Xn

For Neoverse-N3 only, these instructions also have zero latency:
FMOV Dd, Dn
FMOV Sd, Sn
MOV Vd, Vn (vector)
MOV Zd.D, Zn.D
PTRUE
PFALSE

llvmbot · 2025-10-30T10:47:48Z

@llvm/pr-subscribers-backend-aarch64

Author: Simon Wallis (simonwallis2)

Changes

NeoverseZeroMove was introduced for Neoverse-V2 and was added to V3 and V3AE.
Use NeoverseZeroMove for Neoverse-V1, N2, N3 in the same way, including these instructions:
MOV Xd|Wd, #0|XZR|WZR

For all the above Neoverse targets, the following instructions are also decoded as not utilizing the scheduling and execution resources of the machine:
MOV Wd,Wn
MOV Xd,Xn

For Neoverse-N3 only, these instructions also have zero latency:
FMOV Dd, Dn
FMOV Sd, Sn

Patch is 43.61 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/165690.diff

16 Files Affected:

(modified) llvm/lib/Target/AArch64/AArch64SchedNeoverseN2.td (+28-2)
(modified) llvm/lib/Target/AArch64/AArch64SchedNeoverseN3.td (+23-3)
(modified) llvm/lib/Target/AArch64/AArch64SchedNeoverseV1.td (+20-2)
(modified) llvm/lib/Target/AArch64/AArch64SchedNeoverseV2.td (+4-1)
(modified) llvm/lib/Target/AArch64/AArch64SchedNeoverseV3.td (+4-1)
(modified) llvm/lib/Target/AArch64/AArch64SchedNeoverseV3AE.td (+4-1)
(modified) llvm/test/tools/llvm-mca/AArch64/Neoverse/N2-basic-instructions.s (+11-11)
(modified) llvm/test/tools/llvm-mca/AArch64/Neoverse/N3-basic-instructions.s (+15-15)
(modified) llvm/test/tools/llvm-mca/AArch64/Neoverse/V1-basic-instructions.s (+11-11)
(modified) llvm/test/tools/llvm-mca/AArch64/Neoverse/V1-zero-dependency.s (+21-21)
(modified) llvm/test/tools/llvm-mca/AArch64/Neoverse/V2-basic-instructions.s (+5-5)
(modified) llvm/test/tools/llvm-mca/AArch64/Neoverse/V2-zero-lat-movs.s (+12-12)
(modified) llvm/test/tools/llvm-mca/AArch64/Neoverse/V3-basic-instructions.s (+5-5)
(modified) llvm/test/tools/llvm-mca/AArch64/Neoverse/V3-zero-lat-movs.s (+12-12)
(modified) llvm/test/tools/llvm-mca/AArch64/Neoverse/V3AE-basic-instructions.s (+5-5)
(modified) llvm/test/tools/llvm-mca/AArch64/Neoverse/V3AE-zero-lat-movs.s (+12-12)

diff --git a/llvm/lib/Target/AArch64/AArch64SchedNeoverseN2.td b/llvm/lib/Target/AArch64/AArch64SchedNeoverseN2.td
index 50f10114989d0..d1ce5a13d0510 100644
--- a/llvm/lib/Target/AArch64/AArch64SchedNeoverseN2.td
+++ b/llvm/lib/Target/AArch64/AArch64SchedNeoverseN2.td
@@ -72,6 +72,13 @@ def : WriteRes<WriteLDHi,    []> { let Latency = 4; }
 // Define customized scheduler read/write types specific to the Neoverse N2.
 
 //===----------------------------------------------------------------------===//
+
+// Define generic 0 micro-op types
+def N2Write_0c : SchedWriteRes<[]> {
+    let Latency = 0;
+    let NumMicroOps = 0;
+}
+
 // Define generic 1 micro-op types
 
 def N2Write_1c_1B   : SchedWriteRes<[N2UnitB]>   { let Latency = 1; }
@@ -645,6 +652,21 @@ def N2Write_11c_9L01_9S_9V : SchedWriteRes<[N2UnitL01, N2UnitL01, N2UnitL01,
   let NumMicroOps = 27;
 }
 
+//===----------------------------------------------------------------------===//
+// Define predicate-controlled types
+
+def N2Write_0or1c_1I : SchedWriteVariant<[
+                      SchedVar<NeoverseZeroMove, [N2Write_0c]>,
+                      SchedVar<NoSchedPred,      [N2Write_1c_1I]>]>;
+
+def N2Write_0or2c_1V : SchedWriteVariant<[
+                      SchedVar<NeoverseZeroMove, [N2Write_0c]>,
+                      SchedVar<NoSchedPred,      [N2Write_2c_1V]>]>;
+
+def N2Write_0or3c_1M0 : SchedWriteVariant<[
+                      SchedVar<NeoverseZeroMove, [N2Write_0c]>,
+                      SchedVar<NoSchedPred,      [N2Write_3c_1M0]>]>;
+
 //===----------------------------------------------------------------------===//
 // Define types for arithmetic and logical ops with short shifts
 def N2Write_Arith : SchedWriteVariant<[
@@ -680,6 +702,7 @@ def : InstRW<[N2Write_1c_1B_1S], (instrs BL, BLR)>;
 // ALU, basic
 // ALU, basic, flagset
 def : SchedAlias<WriteI,     N2Write_1c_1I>;
+def : InstRW<[N2Write_0or1c_1I], (instregex "^MOVZ[WX]i$")>;
 
 // ALU, extend and shift
 def : SchedAlias<WriteIEReg, N2Write_2c_1M>;
@@ -691,7 +714,8 @@ def : SchedAlias<WriteISReg, N2Write_Arith>;
 
 // Logical, shift, no flagset
 def : InstRW<[N2Write_1c_1I],
-             (instregex "^(AND|BIC|EON|EOR|ORN|ORR)[WX]rs$")>;
+             (instregex "^(AND|BIC|EON|EOR|ORN)[WX]rs$")>;
+def : InstRW<[N2Write_0or1c_1I], (instregex "^ORR[WX]rs$")>;
 
 // Logical, shift, flagset
 def : InstRW<[N2Write_Logical], (instregex "^(AND|BIC)S[WX]rs$")>;
@@ -882,7 +906,7 @@ def : SchedAlias<WriteFImm, N2Write_2c_1V>;
 def : InstRW<[N2Write_2c_1V], (instrs FMOVHr, FMOVSr, FMOVDr)>;
 
 // FP transfer, from gen to low half of vec reg
-def : InstRW<[N2Write_3c_1M0], (instrs FMOVWHr, FMOVXHr, FMOVWSr, FMOVXDr,
+def : InstRW<[N2Write_0or3c_1M0], (instrs FMOVWHr, FMOVXHr, FMOVWSr, FMOVXDr,
                                         FMOVHWr, FMOVHXr, FMOVSWr, FMOVDXr)>;
 
 // FP transfer, from gen to high half of vec reg
@@ -1225,6 +1249,8 @@ def : InstRW<[N2Write_3c_1V0], (instrs BFCVT)>;
 // ASIMD unzip/zip
 // Handled by SchedAlias<WriteV[dq], ...>
 
+def : InstRW<[N2Write_0or2c_1V], (instrs MOVID, MOVIv2d_ns)>;
+
 // ASIMD duplicate, gen reg
 def : InstRW<[N2Write_3c_1M0], (instregex "^DUPv.+gpr")>;
 
diff --git a/llvm/lib/Target/AArch64/AArch64SchedNeoverseN3.td b/llvm/lib/Target/AArch64/AArch64SchedNeoverseN3.td
index 411b372a3f533..32d48ca66ee2d 100644
--- a/llvm/lib/Target/AArch64/AArch64SchedNeoverseN3.td
+++ b/llvm/lib/Target/AArch64/AArch64SchedNeoverseN3.td
@@ -553,6 +553,22 @@ def N3Write_16c_16V0 : SchedWriteRes<[N3UnitV0, N3UnitV0, N3UnitV0, N3UnitV0,
     let NumMicroOps = 16;
 }
 
+
+//===----------------------------------------------------------------------===//
+// Define predicate-controlled types
+
+def N3Write_0or1c_1I : SchedWriteVariant<[
+                      SchedVar<NeoverseZeroMove, [N3Write_0c]>,
+                      SchedVar<NoSchedPred,      [N3Write_1c_1I]>]>;
+
+def N3Write_0or2c_1V : SchedWriteVariant<[
+                      SchedVar<NeoverseZeroMove, [N3Write_0c]>,
+                      SchedVar<NoSchedPred,      [N3Write_2c_1V]>]>;
+
+def N3Write_0or3c_1M0 : SchedWriteVariant<[
+                      SchedVar<NeoverseZeroMove, [N3Write_0c]>,
+                      SchedVar<NoSchedPred,      [N3Write_3c_1M0]>]>;
+
 // Miscellaneous
 // -----------------------------------------------------------------------------
 
@@ -581,6 +597,7 @@ def : InstRW<[N3Write_1c_1B_1S], (instrs BL, BLR)>;
 // Conditional compare
 // Conditional select
 def : SchedAlias<WriteI, N3Write_1c_1I>;
+def : InstRW<[N3Write_0or1c_1I], (instregex "^MOVZ[WX]i$")>;
 
 // ALU, extend and shift
 def : SchedAlias<WriteIEReg, N3Write_2c_1M>;
@@ -610,7 +627,8 @@ def : InstRW<[N3Write_1c_1I], (instrs GMI, SUBP, SUBPS)>;
 
 // Logical, shift, no flagset
 def : InstRW<[N3Write_1c_1I],
-             (instregex "^(AND|BIC|EON|EOR|ORN|ORR)[WX]rs$")>;
+             (instregex "^(AND|BIC|EON|EOR|ORN)[WX]rs$")>;
+def : InstRW<[N3Write_0or1c_1I], (instregex "^ORR[WX]rs$")>;
 
 // Logical, shift, flagset
 def : InstRW<[N3Write_2c_1M], (instregex "^(AND|BIC)S[WX]rs$")>;
@@ -855,10 +873,11 @@ def : SchedAlias<WriteFCvt, N3Write_3c_1V0>;
 def : SchedAlias<WriteFImm, N3Write_2c_1V>;
 
 // FP move, register
-def : InstRW<[N3Write_2c_1V], (instrs FMOVHr, FMOVSr, FMOVDr)>;
+def : InstRW<[N3Write_2c_1V], (instrs FMOVHr)>;
+def : InstRW<[N3Write_0c], (instrs FMOVSr, FMOVDr)>;
 
 // FP transfer, from gen to low half of vec reg
-def : InstRW<[N3Write_3c_1M0], (instrs FMOVWHr, FMOVXHr, FMOVWSr, FMOVXDr)>;
+def : InstRW<[N3Write_0or3c_1M0], (instrs FMOVWHr, FMOVXHr, FMOVWSr, FMOVXDr)>;
 
 // FP transfer, from gen to high half of vec reg
 def : InstRW<[N3Write_5c_1M0_1V], (instrs FMOVXDHighr)>;
@@ -1186,6 +1205,7 @@ def : InstRW<[N3Write_3c_1V0], (instrs BFCVT)>;
 // ASIMD transpose
 // ASIMD unzip/zip
 // Covered by WriteV[dq]
+def : InstRW<[N3Write_0or2c_1V], (instrs MOVID, MOVIv2d_ns)>;
 
 // ASIMD duplicate, gen reg
 def : InstRW<[N3Write_3c_1M0], (instregex "^DUPv.+gpr")>;
diff --git a/llvm/lib/Target/AArch64/AArch64SchedNeoverseV1.td b/llvm/lib/Target/AArch64/AArch64SchedNeoverseV1.td
index 3cbfc59423c9a..8d33ca22616c2 100644
--- a/llvm/lib/Target/AArch64/AArch64SchedNeoverseV1.td
+++ b/llvm/lib/Target/AArch64/AArch64SchedNeoverseV1.td
@@ -472,6 +472,21 @@ def V1Write_11c_9L01_9S_9V : SchedWriteRes<[V1UnitL01, V1UnitL01, V1UnitL01,
                                             V1UnitV, V1UnitV, V1UnitV,
                                             V1UnitV, V1UnitV, V1UnitV]>;
 
+//===----------------------------------------------------------------------===//
+// Define predicate-controlled types
+
+def V1Write_0or1c_1I : SchedWriteVariant<[
+                      SchedVar<NeoverseZeroMove, [V1Write_0c_0Z]>,
+                      SchedVar<NoSchedPred,      [V1Write_1c_1I]>]>;
+
+def V1Write_0or2c_1V : SchedWriteVariant<[
+                      SchedVar<NeoverseZeroMove, [V1Write_0c_0Z]>,
+                      SchedVar<NoSchedPred,      [V1Write_2c_1V]>]>;
+
+def V1Write_0or3c_1M0 : SchedWriteVariant<[
+                      SchedVar<NeoverseZeroMove, [V1Write_0c_0Z]>,
+                      SchedVar<NoSchedPred,      [V1Write_3c_1M0]>]>;
+
 //===----------------------------------------------------------------------===//
 // Define forwarded types
 
@@ -603,6 +618,7 @@ def : InstRW<[V1Write_1c_1I_1Flg],
                         "^(ADC|SBC)S[WX]r$",
                         "^ANDS[WX]ri$",
                         "^(AND|BIC)S[WX]rr$")>;
+def : InstRW<[V1Write_0or1c_1I], (instregex "^MOVZ[WX]i$")>;
 
 // ALU, extend and shift
 def : SchedAlias<WriteIEReg, V1Write_2c_1M>;
@@ -623,7 +639,8 @@ def               : InstRW<[V1WriteISRegS],
                            (instregex "^(ADD|SUB)S(([WX]r[sx])|Xrx64)$")>;
 
 // Logical, shift, no flagset
-def : InstRW<[V1Write_1c_1I], (instregex "^(AND|BIC|EON|EOR|ORN|ORR)[WX]rs$")>;
+def : InstRW<[V1Write_1c_1I], (instregex "^(AND|BIC|EON|EOR|ORN)[WX]rs$")>;
+def : InstRW<[V1Write_0or1c_1I], (instregex "^ORR[WX]rs$")>;
 
 // Logical, shift, flagset
 def : InstRW<[V1Write_2c_1M_1Flg], (instregex "^(AND|BIC)S[WX]rs$")>;
@@ -805,7 +822,7 @@ def : SchedAlias<WriteFImm, V1Write_2c_1V>;
 def : InstRW<[V1Write_2c_1V], (instrs FMOVHr, FMOVSr, FMOVDr)>;
 
 // FP transfer, from gen to low half of vec reg
-def : InstRW<[V1Write_3c_1M0], (instrs FMOVWHr, FMOVXHr, FMOVWSr, FMOVXDr)>;
+def : InstRW<[V1Write_0or3c_1M0], (instrs FMOVWHr, FMOVXHr, FMOVWSr, FMOVXDr)>;
 
 // FP transfer, from gen to high half of vec reg
 def : InstRW<[V1Write_5c_1M0_1V], (instrs FMOVXDHighr)>;
@@ -1122,6 +1139,7 @@ def : InstRW<[V1Write_3c_1V02], (instrs BFCVT)>;
 // ASIMD transpose
 // ASIMD unzip/zip
 // Covered by "SchedAlias (WriteV[dq]...)" above
+def : InstRW<[V1Write_0or2c_1V], (instrs MOVID, MOVIv2d_ns)>;
 
 // ASIMD duplicate, gen reg
 def : InstRW<[V1Write_3c_1M0],
diff --git a/llvm/lib/Target/AArch64/AArch64SchedNeoverseV2.td b/llvm/lib/Target/AArch64/AArch64SchedNeoverseV2.td
index 2387f176f3051..1ef087f07022d 100644
--- a/llvm/lib/Target/AArch64/AArch64SchedNeoverseV2.td
+++ b/llvm/lib/Target/AArch64/AArch64SchedNeoverseV2.td
@@ -94,7 +94,10 @@ def : WriteRes<WriteLDHi,    []> { let Latency = 4; }
 //===----------------------------------------------------------------------===//
 
 // Define generic 0 micro-op types
-def V2Write_0c : SchedWriteRes<[]> { let Latency = 0; }
+def V2Write_0c : SchedWriteRes<[]> {
+    let Latency = 0;
+    let NumMicroOps = 0;
+}
 
 // Define generic 1 micro-op types
 
diff --git a/llvm/lib/Target/AArch64/AArch64SchedNeoverseV3.td b/llvm/lib/Target/AArch64/AArch64SchedNeoverseV3.td
index e23576a20d277..3dd2988088f0b 100644
--- a/llvm/lib/Target/AArch64/AArch64SchedNeoverseV3.td
+++ b/llvm/lib/Target/AArch64/AArch64SchedNeoverseV3.td
@@ -94,7 +94,10 @@ def : WriteRes<WriteLDHi,    []> { let Latency = 4; }
 //===----------------------------------------------------------------------===//
 
 // Define generic 0 micro-op types
-def V3Write_0c : SchedWriteRes<[]> { let Latency = 0; }
+def V3Write_0c : SchedWriteRes<[]> {
+    let Latency = 0;
+    let NumMicroOps = 0;
+}
 
 // Define generic 1 micro-op types
 
diff --git a/llvm/lib/Target/AArch64/AArch64SchedNeoverseV3AE.td b/llvm/lib/Target/AArch64/AArch64SchedNeoverseV3AE.td
index 0f1ec669a4e5e..19b56260387e1 100644
--- a/llvm/lib/Target/AArch64/AArch64SchedNeoverseV3AE.td
+++ b/llvm/lib/Target/AArch64/AArch64SchedNeoverseV3AE.td
@@ -89,7 +89,10 @@ def : WriteRes<WriteLDHi,    []> { let Latency = 4; }
 //===----------------------------------------------------------------------===//
 
 // Define generic 0 micro-op types
-def V3AEWrite_0c : SchedWriteRes<[]> { let Latency = 0; }
+def V3AEWrite_0c : SchedWriteRes<[]> {
+    let Latency = 0;
+    let NumMicroOps = 0;
+}
 
 // Define generic 1 micro-op types
 
diff --git a/llvm/test/tools/llvm-mca/AArch64/Neoverse/N2-basic-instructions.s b/llvm/test/tools/llvm-mca/AArch64/Neoverse/N2-basic-instructions.s
index cf1cf0e98c801..d3343ab055887 100644
--- a/llvm/test/tools/llvm-mca/AArch64/Neoverse/N2-basic-instructions.s
+++ b/llvm/test/tools/llvm-mca/AArch64/Neoverse/N2-basic-instructions.s
@@ -2508,14 +2508,14 @@ drps
 # CHECK-NEXT:  1      2     0.50                        bics	x3, xzr, x3, lsl #1
 # CHECK-NEXT:  1      2     0.50                        tst	w3, w7, lsl #31
 # CHECK-NEXT:  1      2     0.50                        tst	x2, x20, asr #2
-# CHECK-NEXT:  1      1     0.25                        mov	x3, x6
-# CHECK-NEXT:  1      1     0.25                        mov	x3, xzr
-# CHECK-NEXT:  1      1     0.25                        mov	wzr, w2
-# CHECK-NEXT:  1      1     0.25                        mov	w3, w5
+# CHECK-NEXT:  0      0     0.00                        mov	x3, x6
+# CHECK-NEXT:  0      0     0.00                        mov	x3, xzr
+# CHECK-NEXT:  0      0     0.00                        mov	wzr, w2
+# CHECK-NEXT:  0      0     0.00                        mov	w3, w5
 # CHECK-NEXT:  1      1     0.25                        movz	w2, #0, lsl #16
 # CHECK-NEXT:  1      1     0.25                        mov	w2, #-1235
 # CHECK-NEXT:  1      1     0.25                        mov	x2, #5299989643264
-# CHECK-NEXT:  1      1     0.25                        mov	x2, #0
+# CHECK-NEXT:  0      0     0.00                        mov	x2, #0
 # CHECK-NEXT:  1      1     0.25                        movk	w3, #0
 # CHECK-NEXT:  1      1     0.25                        movz	x4, #0, lsl #16
 # CHECK-NEXT:  1      1     0.25                        movk	w5, #0, lsl #16
@@ -2557,7 +2557,7 @@ drps
 
 # CHECK:      Resource pressure per iteration:
 # CHECK-NEXT: [0.0]  [0.1]  [1.0]  [1.1]  [2]    [3.0]  [3.1]  [4]    [5]    [6.0]  [6.1]  [7]    [8]
-# CHECK-NEXT: 11.00  11.00  33.00  33.00  87.33  151.33 151.33 517.00 251.00 162.50 162.50 215.50 85.50
+# CHECK-NEXT: 11.00  11.00  33.00  33.00  87.33  151.33 151.33 515.75 249.75 161.25 161.25 215.50 85.50
 
 # CHECK:      Resource pressure by instruction:
 # CHECK-NEXT: [0.0]  [0.1]  [1.0]  [1.1]  [2]    [3.0]  [3.1]  [4]    [5]    [6.0]  [6.1]  [7]    [8]    Instructions:
@@ -3692,14 +3692,14 @@ drps
 # CHECK-NEXT:  -      -      -      -      -      -      -     0.50   0.50    -      -      -      -     bics	x3, xzr, x3, lsl #1
 # CHECK-NEXT:  -      -      -      -      -      -      -     0.50   0.50    -      -      -      -     tst	w3, w7, lsl #31
 # CHECK-NEXT:  -      -      -      -      -      -      -     0.50   0.50    -      -      -      -     tst	x2, x20, asr #2
-# CHECK-NEXT:  -      -      -      -      -      -      -     0.25   0.25   0.25   0.25    -      -     mov	x3, x6
-# CHECK-NEXT:  -      -      -      -      -      -      -     0.25   0.25   0.25   0.25    -      -     mov	x3, xzr
-# CHECK-NEXT:  -      -      -      -      -      -      -     0.25   0.25   0.25   0.25    -      -     mov	wzr, w2
-# CHECK-NEXT:  -      -      -      -      -      -      -     0.25   0.25   0.25   0.25    -      -     mov	w3, w5
+# CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -      -      -     mov	x3, x6
+# CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -      -      -     mov	x3, xzr
+# CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -      -      -     mov	wzr, w2
+# CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -      -      -     mov	w3, w5
 # CHECK-NEXT:  -      -      -      -      -      -      -     0.25   0.25   0.25   0.25    -      -     movz	w2, #0, lsl #16
 # CHECK-NEXT:  -      -      -      -      -      -      -     0.25   0.25   0.25   0.25    -      -     mov	w2, #-1235
 # CHECK-NEXT:  -      -      -      -      -      -      -     0.25   0.25   0.25   0.25    -      -     mov	x2, #5299989643264
-# CHECK-NEXT:  -      -      -      -      -      -      -     0.25   0.25   0.25   0.25    -      -     mov	x2, #0
+# CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -      -      -     mov	x2, #0
 # CHECK-NEXT:  -      -      -      -      -      -      -     0.25   0.25   0.25   0.25    -      -     movk	w3, #0
 # CHECK-NEXT:  -      -      -      -      -      -      -     0.25   0.25   0.25   0.25    -      -     movz	x4, #0, lsl #16
 # CHECK-NEXT:  -      -      -      -      -      -      -     0.25   0.25   0.25   0.25    -      -     movk	w5, #0, lsl #16
diff --git a/llvm/test/tools/llvm-mca/AArch64/Neoverse/N3-basic-instructions.s b/llvm/test/tools/llvm-mca/AArch64/Neoverse/N3-basic-instructions.s
index b9758280e2491..f7311b5e41b2e 100644
--- a/llvm/test/tools/llvm-mca/AArch64/Neoverse/N3-basic-instructions.s
+++ b/llvm/test/tools/llvm-mca/AArch64/Neoverse/N3-basic-instructions.s
@@ -1888,7 +1888,7 @@ drps
 # CHECK-NEXT:  1      2     0.50                        fccmpe	d31, d5, #7, ne
 # CHECK-NEXT:  1      2     0.50                        fcsel	s3, s20, s9, pl
 # CHECK-NEXT:  1      2     0.50                        fcsel	d9, d10, d11, mi
-# CHECK-NEXT:  1      2     0.50                        fmov	s0, s1
+# CHECK-NEXT:  0      0     0.00                        fmov	s0, s1
 # CHECK-NEXT:  1      2     0.50                        fabs	s2, s3
 # CHECK-NEXT:  1      2     0.50                        fneg	s4, s5
 # CHECK-NEXT:  1      7     1.00                        fsqrt	s6, s7
@@ -1901,7 +1901,7 @@ drps
 # CHECK-NEXT:  1      3     1.00                        frinta	s20, s21
 # CHECK-NEXT:  1      3     1.00                        frintx	s22, s23
 # CHECK-NEXT:  1      3     1.00                        frinti	s24, s25
-# CHECK-NEXT:  1      2     0.50                        fmov	d0, d1
+# CHECK-NEXT:  0      0     0.00                        fmov	d0, d1
 # CHECK-NEXT:  1      2     0.50                        fabs	d2, d3
 # CHECK-NEXT:  1      2     0.50                        fneg	d4, d5
 # CHECK-NEXT:  1      12    1.00                        fsqrt	d6, d7
@@ -2508,14 +2508,14 @@ drps
 # CHECK-NEXT:  1      2     0.50                        bics	x3, xzr, x3, lsl #1
 # CHECK-NEXT:  1      2     0.50                        tst	w3, w7, lsl #31
 # CHECK-NEXT:  1      2     0.50                        tst	x2, x20, asr #2
-# CHECK-NEXT:  1      1     0.25                        mov	x3, x6
-# CHECK-NEXT:  1      1     0.25                        mov	x3, xzr
-# CHECK-NEXT:  1      1     0.25                        mov	wzr, w2
-# CHECK-NEXT:  1      1     0.25                        mov	w3, w5
+# CHECK-NEXT:  0      0     0.00                        mov	x3, x6
+# CHECK-NEXT:  0      0     0.00                        mov	x3, xzr
+# CHECK-NEXT:  0      0     0.00                        mov	wzr, w2
+# CHECK-NEXT:  0      0     0.00                        mov	w3, w5
 # CHECK-NEXT:  1      1     0.25                        movz	w2, #0, lsl #16
 # CHECK-NEXT:  1      1     0.25                        mov	w2, #-1235
 # CHECK-NEXT:  1      1     0.25                        mov	x2, #5299989643264
-# CHECK-NEXT:  1      1     0.25                        mov	x2, #0
+# CHECK-NEXT:  0      0     0.00                        mov	x2, #0
 # CHECK-NEXT:  1      1     0.25                        movk	w3, #0
 # CHECK-NEXT:  1      1     0.25                        movz	x4, #0, lsl #16
 # CHECK-NEXT:  1      1     0.25                        movk	w5, #0, lsl #16
@@ -2557,7 +2557,7 @@ drps
 
 # CHECK:      Resource pressure per iteration:
 # CHECK-NEXT: [0.0]  [0.1]  [1.0]  [1.1]  [2]    [3.0]  [3.1]  [4]    [5]    [6.0]  [6.1]  [7]    [8]
-# CHECK-NEXT: 11.00  11.00  33.00  33.00  99.33  163.33 163.33 357.75 212.75 156.25 156.25 184.50 64.50
+# CHECK-NEXT: 11.00  11.00  33.00  33.00  99.33  163.33 163.33 356.50 211.50 155.00 155.00 183.50 63.50
 
 # CHECK:      Resource pressure by instruction:
 # CHECK-NEXT: [0.0]  [0.1]  [1.0]  [1.1]  [2]    [3.0]  [3.1]  [4]    [5]    [6.0]  [6.1]  [7]    [8]    Instructions:
@@ -3072,7 +3072,7 @@ drps
 # CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -     0.50   0.50   fccmpe	d31, d5, #7, ne
 # CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -     0.50   0.50   fcsel	s3, s20, s9, pl
 # CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -     0.50   0.50   fcsel	d9, d10, d11, mi
-# CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -     0.50   0.50   fmov	s0, s1
+# CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -      -      -     fmov	s0, s1
 # CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -     0.50   0.50   fabs	s2, s3
 # CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -     0.50   0.50   fneg	s4, s5
 # CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -     1.00    -     fsqrt	s6, s7
@@ -3085,7 +3085,7 @@ drps
 # CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -     1.00    -     frinta	s20, s21
 # CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -     1.00    -     frintx	s22, s23
 # CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -     1.00    -     frinti	s24, s25
-# CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -     0.50   0.50   fmov	d0, d1
+# CHECK-NEXT:  -     ...
[truncated]

rj-jesus

Hi, the changes generally look good, but I'm not sure we should be modelling zero-latency moves with zero micro-ops? AFAIU these instructions still count as a micro-op (towards decode bandwidth, for example) despite not using execution resources. If you can share any references to the contrary, that would be much appreciated.

simonwallis2 · 2025-11-07T09:44:16Z

The categories of instructions with zero micro ops were previously added to the .td files for V1 (in 2023) and for N3 (in 2024). This patch extends this to N2, V2, V3, V3AE.

I based this patch solely on my own reading of the Neoverse SWOGs, in particular section 4.15 (variously 4.12, 4.11) about zero latency instructions not using the scheduling and execution resources, and section 2.1 about macro ops proceeding through register renaming and dispatch.

It looks like these zero-latency instructions still take up decode resources, which we don’t currently describe explicitly.
But the work done is largely register renaming, and there are no micro-ops dispatched to the individual pipelines.
So I think zero micro-ops is the correct way to represent this in the .td file.

rj-jesus · 2025-11-07T11:41:26Z

> It looks like these zero-latency instructions still take up decode resources, which we don’t currently describe explicitly.
> But the work done is largely register renaming, and there are no micro-ops dispatched to the individual pipelines.

I believe we implicitly model decode constraints in the "IssueWidth". My main concern with modelling these instructions with zero micro-ops is that it might trick the machine scheduler into assuming they can be scheduled freely, which AFAIU isn't true. Also, OP_RETIRED suggests these instructions do count as a micro-op.

Do you have any performance data that suggests this is preferable for performance or at least neutral?

simonwallis2 · 2025-11-07T12:59:48Z

> Do you have any performance data that suggests this is preferable for performance or at least neutral?

On Neoverse-V2 the benchmarks that I ran reported zero change in performance.

rj-jesus · 2025-11-07T17:09:38Z

Consider this example, which issues 20 `mov x9, 0` instructions per iteration for 10^9 iterations. This is what I see:

$ perf stat -e cycles,instructions,op_retired ./a.out 

 Performance counter stats for './a.out':

     3,503,570,565      cycles                                                                
    22,003,024,649      instructions                     #    6.28  insn per cycle            
    22,003,034,412      op_retired                                                            

       1.133565071 seconds time elapsed

       1.050723000 seconds user
       0.003995000 seconds sys

If you normalise the IPC that perf computes by 21/22 (because the CMP+B pair is fused), you get almost exactly 6 (6.28 × 21/22 ≈ 5.99).
LLVM-MCA currently works out something similar: https://godbolt.org/z/7az3YEEY9.
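
The normalisation above can be reproduced from the perf counters. This is a minimal sketch, assuming 22 architectural instructions per loop iteration (20 moves plus a compare and a branch, which matches the roughly 22 × 10^9 instructions reported) and assuming the fused CMP+B pair counts as a single macro-op; the variable names are illustrative only:

```python
# Counter values copied from the perf stat output above.
cycles = 3_503_570_565
instructions = 22_003_024_649

# Assumption: 20 movs + compare + branch per loop iteration.
insns_per_iter = 22
# Assumption: the CMP+B pair is fused into a single macro-op.
mops_per_iter = 21

raw_ipc = instructions / cycles
normalised_ipc = raw_ipc * mops_per_iter / insns_per_iter

print(f"raw IPC:        {raw_ipc:.2f}")
print(f"normalised IPC: {normalised_ipc:.2f}")
```

Under these assumptions the normalised figure comes out at about 5.99, which is consistent with the zero-latency moves each occupying one slot of a 6-wide machine.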

If we make zero-latency moves zero micro-ops, then we'll have:

Iterations:        100000
Instructions:      2000000
Total Cycles:      6251
Total uOps:        0

Dispatch Width:    6
uOps Per Cycle:    0.00
IPC:               319.95
Block RThroughput: 0.0

(The IPC becomes bottlenecked on the MicroOpBufferSize.)
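
The figures in this report can be cross-checked by hand. This is a rough sketch, assuming llvm-mca derives IPC as instructions divided by total cycles, and assuming (for comparison only) that a one-micro-op model would be limited by the dispatch width of 6; the helper name `ipc` is hypothetical:

```python
import math

# Values copied from the llvm-mca summary above.
instructions = 2_000_000
total_cycles = 6251
dispatch_width = 6


def ipc(insns: int, cycles: int) -> float:
    """Instructions per cycle (hypothetical helper)."""
    return insns / cycles


zero_uop_ipc = ipc(instructions, total_cycles)

# Assumption: with one micro-op per instruction, dispatch needs at least
# ceil(uops / dispatch_width) cycles, so IPC cannot exceed the width.
one_uop_min_cycles = math.ceil(instructions / dispatch_width)
one_uop_max_ipc = ipc(instructions, one_uop_min_cycles)

print(f"zero-uop IPC:          {zero_uop_ipc:.2f}")
print(f"one-uop cycles (min):  {one_uop_min_cycles}")
print(f"one-uop IPC (ceiling): {one_uop_max_ipc:.2f}")
```

An IPC of roughly 320 on a core that dispatches 6 per cycle is the unrealistic result at issue here: with zero micro-ops the moves no longer count against the dispatch width at all.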

It does seem to me that these instructions should be modelled as one micro-op... Unless I'm missing something or unless there's a compelling reason for us to make this change, I believe it would be better if we left it as is. What do you think?

simonwallis2 · 2025-11-13T08:52:01Z

> It does seem to me that these instructions should be modelled as one micro-op... Unless I'm missing something or unless there's a compelling reason for us to make this change, I believe it would be better if we left it as is. What do you think?

I see that if making this zero micro-op change in isolation leads to llvm-mca reporting unrealistic cycle counts then that would be unhelpful. We still do not model all of the dispatch constraints described in the SWOG Special Considerations section 4.1. Modelling the MOPS per cycle limit would be key to reporting sensible numbers for an instruction with zero uOPs.

Thanks for supplying these examples, very useful. I tried modifying them to clarify to myself how the IPC is affected by MOPS limits and uOPs limits but was not able to provide a definite answer.

I propose to update the patch and remove the zero-micro-op change.
So the zero-latency instructions stay as one micro-op, but we still extend the zero-latency modelling to those targets where it is not currently modelled.

simonwallis2 · 2025-11-13T10:58:02Z

I reverted my changes for V2, V3 and V3AE and their corresponding tests,
restoring the old behaviour in which zero latency does not imply zero micro-ops.

I updated the changes and tests for N2, N3 and V1 so that they match V2, V3 and V3AE.
I made it explicit in N3.td and V1.td that NumMicroOps is 1.
Note that for N3, this removes the zero-micro-ops count from the setffr instruction (only).

rj-jesus · 2025-11-13T12:56:08Z

> I see that if making this zero micro-op change in isolation leads to llvm-mca reporting unrealistic cycle counts then that would be unhelpful. We still do not model all of the dispatch constraints described in the SWOG Special Considerations section 4.1. Modelling the MOPS per cycle limit would be key to reporting sensible numbers for an instruction with zero uOPs.

It would certainly be good to model those dispatch constraints (FYI, in case you haven't noticed them, this patch mentions slightly different dispatch constraints for the Neoverse V2 than those in the latest SWOG). There are a few other attributes that are currently not modelled in the Neoverse scheduling tables that could be worth looking at, for example BufferSize values, which should help LLVM-MCA be slightly more accurate.

rj-jesus

A few more suggestions but otherwise this is almost good to go.
Thanks for the patience. :)

rj-jesus · 2025-11-14T14:23:14Z

llvm/lib/Target/AArch64/AArch64SchedNeoverseN3.td


 // Predicate set/initialize, set flags
-def : InstRW<[N3Write_2c_1M], (instregex "^PTRUES_[BHSD]")>;
+def : InstRW<[N3Write_0or2c_1M], (instregex "^PTRUES_[BHSD]")>;


PTRUES isn't listed in Section 4.11, but it is mentioned in Table 2-23, so I'll assume the latter to be correct (matching what you've implemented).

rj-jesus

LGTM, cheers. Just please fix the conflict in llvm/lib/Target/AArch64/AArch64SchedNeoverseN3.td and let the CI tests run. :)

github-actions · 2025-11-19T12:04:47Z

🐧 Linux x64 Test Results

186368 tests passed
4859 tests skipped

llvmbot added the backend:AArch64 label Oct 30, 2025

simonwallis2 requested review from Asher8118, c-rhodes, davemgreen and rj-jesus October 30, 2025 12:50

rj-jesus reviewed Nov 3, 2025

simonwallis2 added 2 commits November 13, 2025 10:45

rj-jesus reviewed Nov 13, 2025

rj-jesus reviewed Nov 14, 2025

rj-jesus approved these changes Nov 18, 2025

Merge branch 'main' into sched-zero-latency

f0c2a71

simonwallis2 merged commit 6fc48de into llvm:main Nov 19, 2025
10 checks passed
