Skip to content

Conversation

@simonwallis2
Copy link
Contributor

@simonwallis2 simonwallis2 commented Oct 30, 2025

NeoverseZeroMove was introduced for Neoverse-V2 and was added to V3 and V3AE.
Use NeoverseZeroMove for Neoverse-V1, N2, N3 in the same way, including these instructions:
MOV Xd|Wd, #0|XZR|WZR

For all the above Neoverse targets, the following instructions are also decoded as not utilizing the scheduling and execution resources of the machine:
MOV Wd,Wn
MOV Xd,Xn

For Neoverse-N3 only, these instructions also have zero latency
FMOV Dd, Dn
FMOV Sd, Sn
MOV Vd, Vn (vector)
MOV Zd.D, Zn.D
PTRUE
PFALSE

…eoverse cores

NeoverseZeroMove was introduced for Neoverse-V2 and was added to V3 and V3AE.
Use NeoverseZeroMove for Neoverse-V1, N2, N3 in the same way, including these instructions:
MOV Xd|Wd, #0|XZR|WZR

For all Neoverse targets, the following instructions are also decoded as not utilizing the scheduling and execution resources of the machine:
MOV Wd,Wn
MOV Xd,Xn

For Neoverse-N3 only, these instructions also have zero latency
FMOV Dd, Dn
FMOV Sd, Sn

Change-Id: I1a5f86e049798582d33d96ba99389e4b2ffb210e
@llvmbot
Copy link
Member

llvmbot commented Oct 30, 2025

@llvm/pr-subscribers-backend-aarch64

Author: Simon Wallis (simonwallis2)

Changes

NeoverseZeroMove was introduced for Neoverse-V2 and was added to V3 and V3AE.
Use NeoverseZeroMove for Neoverse-V1, N2, N3 in the same way, including these instructions:
MOV Xd|Wd, #0|XZR|WZR

For all the above Neoverse targets, the following instructions are also decoded as not utilizing the scheduling and execution resources of the machine:
MOV Wd,Wn
MOV Xd,Xn

For Neoverse-N3 only, these instructions also have zero latency
FMOV Dd, Dn
FMOV Sd, Sn


Patch is 43.61 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/165690.diff

16 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64SchedNeoverseN2.td (+28-2)
  • (modified) llvm/lib/Target/AArch64/AArch64SchedNeoverseN3.td (+23-3)
  • (modified) llvm/lib/Target/AArch64/AArch64SchedNeoverseV1.td (+20-2)
  • (modified) llvm/lib/Target/AArch64/AArch64SchedNeoverseV2.td (+4-1)
  • (modified) llvm/lib/Target/AArch64/AArch64SchedNeoverseV3.td (+4-1)
  • (modified) llvm/lib/Target/AArch64/AArch64SchedNeoverseV3AE.td (+4-1)
  • (modified) llvm/test/tools/llvm-mca/AArch64/Neoverse/N2-basic-instructions.s (+11-11)
  • (modified) llvm/test/tools/llvm-mca/AArch64/Neoverse/N3-basic-instructions.s (+15-15)
  • (modified) llvm/test/tools/llvm-mca/AArch64/Neoverse/V1-basic-instructions.s (+11-11)
  • (modified) llvm/test/tools/llvm-mca/AArch64/Neoverse/V1-zero-dependency.s (+21-21)
  • (modified) llvm/test/tools/llvm-mca/AArch64/Neoverse/V2-basic-instructions.s (+5-5)
  • (modified) llvm/test/tools/llvm-mca/AArch64/Neoverse/V2-zero-lat-movs.s (+12-12)
  • (modified) llvm/test/tools/llvm-mca/AArch64/Neoverse/V3-basic-instructions.s (+5-5)
  • (modified) llvm/test/tools/llvm-mca/AArch64/Neoverse/V3-zero-lat-movs.s (+12-12)
  • (modified) llvm/test/tools/llvm-mca/AArch64/Neoverse/V3AE-basic-instructions.s (+5-5)
  • (modified) llvm/test/tools/llvm-mca/AArch64/Neoverse/V3AE-zero-lat-movs.s (+12-12)
diff --git a/llvm/lib/Target/AArch64/AArch64SchedNeoverseN2.td b/llvm/lib/Target/AArch64/AArch64SchedNeoverseN2.td
index 50f10114989d0..d1ce5a13d0510 100644
--- a/llvm/lib/Target/AArch64/AArch64SchedNeoverseN2.td
+++ b/llvm/lib/Target/AArch64/AArch64SchedNeoverseN2.td
@@ -72,6 +72,13 @@ def : WriteRes<WriteLDHi,    []> { let Latency = 4; }
 // Define customized scheduler read/write types specific to the Neoverse N2.
 
 //===----------------------------------------------------------------------===//
+
+// Define generic 0 micro-op types
+def N2Write_0c : SchedWriteRes<[]> {
+    let Latency = 0;
+    let NumMicroOps = 0;
+}
+
 // Define generic 1 micro-op types
 
 def N2Write_1c_1B   : SchedWriteRes<[N2UnitB]>   { let Latency = 1; }
@@ -645,6 +652,21 @@ def N2Write_11c_9L01_9S_9V : SchedWriteRes<[N2UnitL01, N2UnitL01, N2UnitL01,
   let NumMicroOps = 27;
 }
 
+//===----------------------------------------------------------------------===//
+// Define predicate-controlled types
+
+def N2Write_0or1c_1I : SchedWriteVariant<[
+                      SchedVar<NeoverseZeroMove, [N2Write_0c]>,
+                      SchedVar<NoSchedPred,      [N2Write_1c_1I]>]>;
+
+def N2Write_0or2c_1V : SchedWriteVariant<[
+                      SchedVar<NeoverseZeroMove, [N2Write_0c]>,
+                      SchedVar<NoSchedPred,      [N2Write_2c_1V]>]>;
+
+def N2Write_0or3c_1M0 : SchedWriteVariant<[
+                      SchedVar<NeoverseZeroMove, [N2Write_0c]>,
+                      SchedVar<NoSchedPred,      [N2Write_3c_1M0]>]>;
+
 //===----------------------------------------------------------------------===//
 // Define types for arithmetic and logical ops with short shifts
 def N2Write_Arith : SchedWriteVariant<[
@@ -680,6 +702,7 @@ def : InstRW<[N2Write_1c_1B_1S], (instrs BL, BLR)>;
 // ALU, basic
 // ALU, basic, flagset
 def : SchedAlias<WriteI,     N2Write_1c_1I>;
+def : InstRW<[N2Write_0or1c_1I], (instregex "^MOVZ[WX]i$")>;
 
 // ALU, extend and shift
 def : SchedAlias<WriteIEReg, N2Write_2c_1M>;
@@ -691,7 +714,8 @@ def : SchedAlias<WriteISReg, N2Write_Arith>;
 
 // Logical, shift, no flagset
 def : InstRW<[N2Write_1c_1I],
-             (instregex "^(AND|BIC|EON|EOR|ORN|ORR)[WX]rs$")>;
+             (instregex "^(AND|BIC|EON|EOR|ORN)[WX]rs$")>;
+def : InstRW<[N2Write_0or1c_1I], (instregex "^ORR[WX]rs$")>;
 
 // Logical, shift, flagset
 def : InstRW<[N2Write_Logical], (instregex "^(AND|BIC)S[WX]rs$")>;
@@ -882,7 +906,7 @@ def : SchedAlias<WriteFImm, N2Write_2c_1V>;
 def : InstRW<[N2Write_2c_1V], (instrs FMOVHr, FMOVSr, FMOVDr)>;
 
 // FP transfer, from gen to low half of vec reg
-def : InstRW<[N2Write_3c_1M0], (instrs FMOVWHr, FMOVXHr, FMOVWSr, FMOVXDr,
+def : InstRW<[N2Write_0or3c_1M0], (instrs FMOVWHr, FMOVXHr, FMOVWSr, FMOVXDr,
                                         FMOVHWr, FMOVHXr, FMOVSWr, FMOVDXr)>;
 
 // FP transfer, from gen to high half of vec reg
@@ -1225,6 +1249,8 @@ def : InstRW<[N2Write_3c_1V0], (instrs BFCVT)>;
 // ASIMD unzip/zip
 // Handled by SchedAlias<WriteV[dq], ...>
 
+def : InstRW<[N2Write_0or2c_1V], (instrs MOVID, MOVIv2d_ns)>;
+
 // ASIMD duplicate, gen reg
 def : InstRW<[N2Write_3c_1M0], (instregex "^DUPv.+gpr")>;
 
diff --git a/llvm/lib/Target/AArch64/AArch64SchedNeoverseN3.td b/llvm/lib/Target/AArch64/AArch64SchedNeoverseN3.td
index 411b372a3f533..32d48ca66ee2d 100644
--- a/llvm/lib/Target/AArch64/AArch64SchedNeoverseN3.td
+++ b/llvm/lib/Target/AArch64/AArch64SchedNeoverseN3.td
@@ -553,6 +553,22 @@ def N3Write_16c_16V0 : SchedWriteRes<[N3UnitV0, N3UnitV0, N3UnitV0, N3UnitV0,
     let NumMicroOps = 16;
 }
 
+
+//===----------------------------------------------------------------------===//
+// Define predicate-controlled types
+
+def N3Write_0or1c_1I : SchedWriteVariant<[
+                      SchedVar<NeoverseZeroMove, [N3Write_0c]>,
+                      SchedVar<NoSchedPred,      [N3Write_1c_1I]>]>;
+
+def N3Write_0or2c_1V : SchedWriteVariant<[
+                      SchedVar<NeoverseZeroMove, [N3Write_0c]>,
+                      SchedVar<NoSchedPred,      [N3Write_2c_1V]>]>;
+
+def N3Write_0or3c_1M0 : SchedWriteVariant<[
+                      SchedVar<NeoverseZeroMove, [N3Write_0c]>,
+                      SchedVar<NoSchedPred,      [N3Write_3c_1M0]>]>;
+
 // Miscellaneous
 // -----------------------------------------------------------------------------
 
@@ -581,6 +597,7 @@ def : InstRW<[N3Write_1c_1B_1S], (instrs BL, BLR)>;
 // Conditional compare
 // Conditional select
 def : SchedAlias<WriteI, N3Write_1c_1I>;
+def : InstRW<[N3Write_0or1c_1I], (instregex "^MOVZ[WX]i$")>;
 
 // ALU, extend and shift
 def : SchedAlias<WriteIEReg, N3Write_2c_1M>;
@@ -610,7 +627,8 @@ def : InstRW<[N3Write_1c_1I], (instrs GMI, SUBP, SUBPS)>;
 
 // Logical, shift, no flagset
 def : InstRW<[N3Write_1c_1I],
-             (instregex "^(AND|BIC|EON|EOR|ORN|ORR)[WX]rs$")>;
+             (instregex "^(AND|BIC|EON|EOR|ORN)[WX]rs$")>;
+def : InstRW<[N3Write_0or1c_1I], (instregex "^ORR[WX]rs$")>;
 
 // Logical, shift, flagset
 def : InstRW<[N3Write_2c_1M], (instregex "^(AND|BIC)S[WX]rs$")>;
@@ -855,10 +873,11 @@ def : SchedAlias<WriteFCvt, N3Write_3c_1V0>;
 def : SchedAlias<WriteFImm, N3Write_2c_1V>;
 
 // FP move, register
-def : InstRW<[N3Write_2c_1V], (instrs FMOVHr, FMOVSr, FMOVDr)>;
+def : InstRW<[N3Write_2c_1V], (instrs FMOVHr)>;
+def : InstRW<[N3Write_0c], (instrs FMOVSr, FMOVDr)>;
 
 // FP transfer, from gen to low half of vec reg
-def : InstRW<[N3Write_3c_1M0], (instrs FMOVWHr, FMOVXHr, FMOVWSr, FMOVXDr)>;
+def : InstRW<[N3Write_0or3c_1M0], (instrs FMOVWHr, FMOVXHr, FMOVWSr, FMOVXDr)>;
 
 // FP transfer, from gen to high half of vec reg
 def : InstRW<[N3Write_5c_1M0_1V], (instrs FMOVXDHighr)>;
@@ -1186,6 +1205,7 @@ def : InstRW<[N3Write_3c_1V0], (instrs BFCVT)>;
 // ASIMD transpose
 // ASIMD unzip/zip
 // Covered by WriteV[dq]
+def : InstRW<[N3Write_0or2c_1V], (instrs MOVID, MOVIv2d_ns)>;
 
 // ASIMD duplicate, gen reg
 def : InstRW<[N3Write_3c_1M0], (instregex "^DUPv.+gpr")>;
diff --git a/llvm/lib/Target/AArch64/AArch64SchedNeoverseV1.td b/llvm/lib/Target/AArch64/AArch64SchedNeoverseV1.td
index 3cbfc59423c9a..8d33ca22616c2 100644
--- a/llvm/lib/Target/AArch64/AArch64SchedNeoverseV1.td
+++ b/llvm/lib/Target/AArch64/AArch64SchedNeoverseV1.td
@@ -472,6 +472,21 @@ def V1Write_11c_9L01_9S_9V : SchedWriteRes<[V1UnitL01, V1UnitL01, V1UnitL01,
                                             V1UnitV, V1UnitV, V1UnitV,
                                             V1UnitV, V1UnitV, V1UnitV]>;
 
+//===----------------------------------------------------------------------===//
+// Define predicate-controlled types
+
+def V1Write_0or1c_1I : SchedWriteVariant<[
+                      SchedVar<NeoverseZeroMove, [V1Write_0c_0Z]>,
+                      SchedVar<NoSchedPred,      [V1Write_1c_1I]>]>;
+
+def V1Write_0or2c_1V : SchedWriteVariant<[
+                      SchedVar<NeoverseZeroMove, [V1Write_0c_0Z]>,
+                      SchedVar<NoSchedPred,      [V1Write_2c_1V]>]>;
+
+def V1Write_0or3c_1M0 : SchedWriteVariant<[
+                      SchedVar<NeoverseZeroMove, [V1Write_0c_0Z]>,
+                      SchedVar<NoSchedPred,      [V1Write_3c_1M0]>]>;
+
 //===----------------------------------------------------------------------===//
 // Define forwarded types
 
@@ -603,6 +618,7 @@ def : InstRW<[V1Write_1c_1I_1Flg],
                         "^(ADC|SBC)S[WX]r$",
                         "^ANDS[WX]ri$",
                         "^(AND|BIC)S[WX]rr$")>;
+def : InstRW<[V1Write_0or1c_1I], (instregex "^MOVZ[WX]i$")>;
 
 // ALU, extend and shift
 def : SchedAlias<WriteIEReg, V1Write_2c_1M>;
@@ -623,7 +639,8 @@ def               : InstRW<[V1WriteISRegS],
                            (instregex "^(ADD|SUB)S(([WX]r[sx])|Xrx64)$")>;
 
 // Logical, shift, no flagset
-def : InstRW<[V1Write_1c_1I], (instregex "^(AND|BIC|EON|EOR|ORN|ORR)[WX]rs$")>;
+def : InstRW<[V1Write_1c_1I], (instregex "^(AND|BIC|EON|EOR|ORN)[WX]rs$")>;
+def : InstRW<[V1Write_0or1c_1I], (instregex "^ORR[WX]rs$")>;
 
 // Logical, shift, flagset
 def : InstRW<[V1Write_2c_1M_1Flg], (instregex "^(AND|BIC)S[WX]rs$")>;
@@ -805,7 +822,7 @@ def : SchedAlias<WriteFImm, V1Write_2c_1V>;
 def : InstRW<[V1Write_2c_1V], (instrs FMOVHr, FMOVSr, FMOVDr)>;
 
 // FP transfer, from gen to low half of vec reg
-def : InstRW<[V1Write_3c_1M0], (instrs FMOVWHr, FMOVXHr, FMOVWSr, FMOVXDr)>;
+def : InstRW<[V1Write_0or3c_1M0], (instrs FMOVWHr, FMOVXHr, FMOVWSr, FMOVXDr)>;
 
 // FP transfer, from gen to high half of vec reg
 def : InstRW<[V1Write_5c_1M0_1V], (instrs FMOVXDHighr)>;
@@ -1122,6 +1139,7 @@ def : InstRW<[V1Write_3c_1V02], (instrs BFCVT)>;
 // ASIMD transpose
 // ASIMD unzip/zip
 // Covered by "SchedAlias (WriteV[dq]...)" above
+def : InstRW<[V1Write_0or2c_1V], (instrs MOVID, MOVIv2d_ns)>;
 
 // ASIMD duplicate, gen reg
 def : InstRW<[V1Write_3c_1M0],
diff --git a/llvm/lib/Target/AArch64/AArch64SchedNeoverseV2.td b/llvm/lib/Target/AArch64/AArch64SchedNeoverseV2.td
index 2387f176f3051..1ef087f07022d 100644
--- a/llvm/lib/Target/AArch64/AArch64SchedNeoverseV2.td
+++ b/llvm/lib/Target/AArch64/AArch64SchedNeoverseV2.td
@@ -94,7 +94,10 @@ def : WriteRes<WriteLDHi,    []> { let Latency = 4; }
 //===----------------------------------------------------------------------===//
 
 // Define generic 0 micro-op types
-def V2Write_0c : SchedWriteRes<[]> { let Latency = 0; }
+def V2Write_0c : SchedWriteRes<[]> {
+    let Latency = 0;
+    let NumMicroOps = 0;
+}
 
 // Define generic 1 micro-op types
 
diff --git a/llvm/lib/Target/AArch64/AArch64SchedNeoverseV3.td b/llvm/lib/Target/AArch64/AArch64SchedNeoverseV3.td
index e23576a20d277..3dd2988088f0b 100644
--- a/llvm/lib/Target/AArch64/AArch64SchedNeoverseV3.td
+++ b/llvm/lib/Target/AArch64/AArch64SchedNeoverseV3.td
@@ -94,7 +94,10 @@ def : WriteRes<WriteLDHi,    []> { let Latency = 4; }
 //===----------------------------------------------------------------------===//
 
 // Define generic 0 micro-op types
-def V3Write_0c : SchedWriteRes<[]> { let Latency = 0; }
+def V3Write_0c : SchedWriteRes<[]> {
+    let Latency = 0;
+    let NumMicroOps = 0;
+}
 
 // Define generic 1 micro-op types
 
diff --git a/llvm/lib/Target/AArch64/AArch64SchedNeoverseV3AE.td b/llvm/lib/Target/AArch64/AArch64SchedNeoverseV3AE.td
index 0f1ec669a4e5e..19b56260387e1 100644
--- a/llvm/lib/Target/AArch64/AArch64SchedNeoverseV3AE.td
+++ b/llvm/lib/Target/AArch64/AArch64SchedNeoverseV3AE.td
@@ -89,7 +89,10 @@ def : WriteRes<WriteLDHi,    []> { let Latency = 4; }
 //===----------------------------------------------------------------------===//
 
 // Define generic 0 micro-op types
-def V3AEWrite_0c : SchedWriteRes<[]> { let Latency = 0; }
+def V3AEWrite_0c : SchedWriteRes<[]> {
+    let Latency = 0;
+    let NumMicroOps = 0;
+}
 
 // Define generic 1 micro-op types
 
diff --git a/llvm/test/tools/llvm-mca/AArch64/Neoverse/N2-basic-instructions.s b/llvm/test/tools/llvm-mca/AArch64/Neoverse/N2-basic-instructions.s
index cf1cf0e98c801..d3343ab055887 100644
--- a/llvm/test/tools/llvm-mca/AArch64/Neoverse/N2-basic-instructions.s
+++ b/llvm/test/tools/llvm-mca/AArch64/Neoverse/N2-basic-instructions.s
@@ -2508,14 +2508,14 @@ drps
 # CHECK-NEXT:  1      2     0.50                        bics	x3, xzr, x3, lsl #1
 # CHECK-NEXT:  1      2     0.50                        tst	w3, w7, lsl #31
 # CHECK-NEXT:  1      2     0.50                        tst	x2, x20, asr #2
-# CHECK-NEXT:  1      1     0.25                        mov	x3, x6
-# CHECK-NEXT:  1      1     0.25                        mov	x3, xzr
-# CHECK-NEXT:  1      1     0.25                        mov	wzr, w2
-# CHECK-NEXT:  1      1     0.25                        mov	w3, w5
+# CHECK-NEXT:  0      0     0.00                        mov	x3, x6
+# CHECK-NEXT:  0      0     0.00                        mov	x3, xzr
+# CHECK-NEXT:  0      0     0.00                        mov	wzr, w2
+# CHECK-NEXT:  0      0     0.00                        mov	w3, w5
 # CHECK-NEXT:  1      1     0.25                        movz	w2, #0, lsl #16
 # CHECK-NEXT:  1      1     0.25                        mov	w2, #-1235
 # CHECK-NEXT:  1      1     0.25                        mov	x2, #5299989643264
-# CHECK-NEXT:  1      1     0.25                        mov	x2, #0
+# CHECK-NEXT:  0      0     0.00                        mov	x2, #0
 # CHECK-NEXT:  1      1     0.25                        movk	w3, #0
 # CHECK-NEXT:  1      1     0.25                        movz	x4, #0, lsl #16
 # CHECK-NEXT:  1      1     0.25                        movk	w5, #0, lsl #16
@@ -2557,7 +2557,7 @@ drps
 
 # CHECK:      Resource pressure per iteration:
 # CHECK-NEXT: [0.0]  [0.1]  [1.0]  [1.1]  [2]    [3.0]  [3.1]  [4]    [5]    [6.0]  [6.1]  [7]    [8]
-# CHECK-NEXT: 11.00  11.00  33.00  33.00  87.33  151.33 151.33 517.00 251.00 162.50 162.50 215.50 85.50
+# CHECK-NEXT: 11.00  11.00  33.00  33.00  87.33  151.33 151.33 515.75 249.75 161.25 161.25 215.50 85.50
 
 # CHECK:      Resource pressure by instruction:
 # CHECK-NEXT: [0.0]  [0.1]  [1.0]  [1.1]  [2]    [3.0]  [3.1]  [4]    [5]    [6.0]  [6.1]  [7]    [8]    Instructions:
@@ -3692,14 +3692,14 @@ drps
 # CHECK-NEXT:  -      -      -      -      -      -      -     0.50   0.50    -      -      -      -     bics	x3, xzr, x3, lsl #1
 # CHECK-NEXT:  -      -      -      -      -      -      -     0.50   0.50    -      -      -      -     tst	w3, w7, lsl #31
 # CHECK-NEXT:  -      -      -      -      -      -      -     0.50   0.50    -      -      -      -     tst	x2, x20, asr #2
-# CHECK-NEXT:  -      -      -      -      -      -      -     0.25   0.25   0.25   0.25    -      -     mov	x3, x6
-# CHECK-NEXT:  -      -      -      -      -      -      -     0.25   0.25   0.25   0.25    -      -     mov	x3, xzr
-# CHECK-NEXT:  -      -      -      -      -      -      -     0.25   0.25   0.25   0.25    -      -     mov	wzr, w2
-# CHECK-NEXT:  -      -      -      -      -      -      -     0.25   0.25   0.25   0.25    -      -     mov	w3, w5
+# CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -      -      -     mov	x3, x6
+# CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -      -      -     mov	x3, xzr
+# CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -      -      -     mov	wzr, w2
+# CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -      -      -     mov	w3, w5
 # CHECK-NEXT:  -      -      -      -      -      -      -     0.25   0.25   0.25   0.25    -      -     movz	w2, #0, lsl #16
 # CHECK-NEXT:  -      -      -      -      -      -      -     0.25   0.25   0.25   0.25    -      -     mov	w2, #-1235
 # CHECK-NEXT:  -      -      -      -      -      -      -     0.25   0.25   0.25   0.25    -      -     mov	x2, #5299989643264
-# CHECK-NEXT:  -      -      -      -      -      -      -     0.25   0.25   0.25   0.25    -      -     mov	x2, #0
+# CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -      -      -     mov	x2, #0
 # CHECK-NEXT:  -      -      -      -      -      -      -     0.25   0.25   0.25   0.25    -      -     movk	w3, #0
 # CHECK-NEXT:  -      -      -      -      -      -      -     0.25   0.25   0.25   0.25    -      -     movz	x4, #0, lsl #16
 # CHECK-NEXT:  -      -      -      -      -      -      -     0.25   0.25   0.25   0.25    -      -     movk	w5, #0, lsl #16
diff --git a/llvm/test/tools/llvm-mca/AArch64/Neoverse/N3-basic-instructions.s b/llvm/test/tools/llvm-mca/AArch64/Neoverse/N3-basic-instructions.s
index b9758280e2491..f7311b5e41b2e 100644
--- a/llvm/test/tools/llvm-mca/AArch64/Neoverse/N3-basic-instructions.s
+++ b/llvm/test/tools/llvm-mca/AArch64/Neoverse/N3-basic-instructions.s
@@ -1888,7 +1888,7 @@ drps
 # CHECK-NEXT:  1      2     0.50                        fccmpe	d31, d5, #7, ne
 # CHECK-NEXT:  1      2     0.50                        fcsel	s3, s20, s9, pl
 # CHECK-NEXT:  1      2     0.50                        fcsel	d9, d10, d11, mi
-# CHECK-NEXT:  1      2     0.50                        fmov	s0, s1
+# CHECK-NEXT:  0      0     0.00                        fmov	s0, s1
 # CHECK-NEXT:  1      2     0.50                        fabs	s2, s3
 # CHECK-NEXT:  1      2     0.50                        fneg	s4, s5
 # CHECK-NEXT:  1      7     1.00                        fsqrt	s6, s7
@@ -1901,7 +1901,7 @@ drps
 # CHECK-NEXT:  1      3     1.00                        frinta	s20, s21
 # CHECK-NEXT:  1      3     1.00                        frintx	s22, s23
 # CHECK-NEXT:  1      3     1.00                        frinti	s24, s25
-# CHECK-NEXT:  1      2     0.50                        fmov	d0, d1
+# CHECK-NEXT:  0      0     0.00                        fmov	d0, d1
 # CHECK-NEXT:  1      2     0.50                        fabs	d2, d3
 # CHECK-NEXT:  1      2     0.50                        fneg	d4, d5
 # CHECK-NEXT:  1      12    1.00                        fsqrt	d6, d7
@@ -2508,14 +2508,14 @@ drps
 # CHECK-NEXT:  1      2     0.50                        bics	x3, xzr, x3, lsl #1
 # CHECK-NEXT:  1      2     0.50                        tst	w3, w7, lsl #31
 # CHECK-NEXT:  1      2     0.50                        tst	x2, x20, asr #2
-# CHECK-NEXT:  1      1     0.25                        mov	x3, x6
-# CHECK-NEXT:  1      1     0.25                        mov	x3, xzr
-# CHECK-NEXT:  1      1     0.25                        mov	wzr, w2
-# CHECK-NEXT:  1      1     0.25                        mov	w3, w5
+# CHECK-NEXT:  0      0     0.00                        mov	x3, x6
+# CHECK-NEXT:  0      0     0.00                        mov	x3, xzr
+# CHECK-NEXT:  0      0     0.00                        mov	wzr, w2
+# CHECK-NEXT:  0      0     0.00                        mov	w3, w5
 # CHECK-NEXT:  1      1     0.25                        movz	w2, #0, lsl #16
 # CHECK-NEXT:  1      1     0.25                        mov	w2, #-1235
 # CHECK-NEXT:  1      1     0.25                        mov	x2, #5299989643264
-# CHECK-NEXT:  1      1     0.25                        mov	x2, #0
+# CHECK-NEXT:  0      0     0.00                        mov	x2, #0
 # CHECK-NEXT:  1      1     0.25                        movk	w3, #0
 # CHECK-NEXT:  1      1     0.25                        movz	x4, #0, lsl #16
 # CHECK-NEXT:  1      1     0.25                        movk	w5, #0, lsl #16
@@ -2557,7 +2557,7 @@ drps
 
 # CHECK:      Resource pressure per iteration:
 # CHECK-NEXT: [0.0]  [0.1]  [1.0]  [1.1]  [2]    [3.0]  [3.1]  [4]    [5]    [6.0]  [6.1]  [7]    [8]
-# CHECK-NEXT: 11.00  11.00  33.00  33.00  99.33  163.33 163.33 357.75 212.75 156.25 156.25 184.50 64.50
+# CHECK-NEXT: 11.00  11.00  33.00  33.00  99.33  163.33 163.33 356.50 211.50 155.00 155.00 183.50 63.50
 
 # CHECK:      Resource pressure by instruction:
 # CHECK-NEXT: [0.0]  [0.1]  [1.0]  [1.1]  [2]    [3.0]  [3.1]  [4]    [5]    [6.0]  [6.1]  [7]    [8]    Instructions:
@@ -3072,7 +3072,7 @@ drps
 # CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -     0.50   0.50   fccmpe	d31, d5, #7, ne
 # CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -     0.50   0.50   fcsel	s3, s20, s9, pl
 # CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -     0.50   0.50   fcsel	d9, d10, d11, mi
-# CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -     0.50   0.50   fmov	s0, s1
+# CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -      -      -     fmov	s0, s1
 # CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -     0.50   0.50   fabs	s2, s3
 # CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -     0.50   0.50   fneg	s4, s5
 # CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -     1.00    -     fsqrt	s6, s7
@@ -3085,7 +3085,7 @@ drps
 # CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -     1.00    -     frinta	s20, s21
 # CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -     1.00    -     frintx	s22, s23
 # CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -     1.00    -     frinti	s24, s25
-# CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -     0.50   0.50   fmov	d0, d1
+# CHECK-NEXT:  -     ...
[truncated]

…eoverse cores

NeoverseZeroMove was introduced for Neoverse-V2 and was added to V3 and V3AE.
Use NeoverseZeroMove for Neoverse-V1, N2, N3 in the same way, including these instructions:
MOV Xd|Wd, #0|XZR|WZR

For all Neoverse targets, the following instructions are also decoded as not utilizing the scheduling and execution resources of the machine:
MOV Wd,Wn
MOV Xd,Xn

For Neoverse-N3 only, these instructions also have zero latency
FMOV Dd, Dn
FMOV Sd, Sn

Change-Id: I955cfe3efc689bea305a708eb6d7259dced6fe04
Copy link
Contributor

@rj-jesus rj-jesus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, the changes generally look good, but I'm not sure we should be modelling zero-latency moves with zero micro-ops? AFAIU these instructions still count as a micro-op (towards decode bandwidth, for example) despite not using execution resources. If you can share any references to the contrary, that would be much appreciated.

NeoverseZeroMove was introduced for Neoverse-V2 and was added to V3 and V3AE.
Use NeoverseZeroMove for Neoverse-V1, N2, N3 in the same way, including these instructions:
MOV Xd|Wd, #0|XZR|WZR

For all Neoverse targets, the following instructions are also decoded as not utilizing the scheduling and execution resources of the machine:
MOV Wd,Wn
MOV Xd,Xn

For Neoverse-N3 only, these instructions also have zero latency
FMOV Dd, Dn
FMOV Sd, Sn

Change-Id: Ie6b1f5c3f4d74f26bdd4c67c5e6c5acf6a8e00cc
@simonwallis2
Copy link
Contributor Author

The categories of instructions with zero micro ops were previously added to the .td files for V1 (in 2023) and for N3 (in 2024). This patch extends this to N2, V2, V3, V3AE.

I based this patch solely on my own reading of the Neoverse SWOGs, in particular section 4.15 (variously 4.12, 4.11) about zero latency instructions not using the scheduling and execution resources, and section 2.1 about macro ops proceeding through register renaming and dispatch.

It looks like these zero-latency instructions still take up decode resources, which we don’t currently describe explicitly.
But the work done is largely register renaming, and there are no micro-ops dispatched to the individual pipelines.
So I think zero micro-ops is the correct way to represent this in the .td file.

@rj-jesus
Copy link
Contributor

rj-jesus commented Nov 7, 2025

It looks like these zero-latency instructions still take up decode resources, which we don’t currently describe explicitly.
But the work done is largely register renaming, and there are no micro-ops dispatched to the individual pipelines.

I believe we implicitly model decode constraints in the "IssueWidth". My main concern with modelling these instructions with zero micro-ops is that it might trick the machine scheduler into assuming they can be scheduled freely, which AFAIU isn't true. Also, OP_RETIRED suggests these instructions do count as a micro-op.

Do you have any performance data that suggests this is preferable for performance or at least neutral?

@simonwallis2
Copy link
Contributor Author

Do you have any performance data that suggests this is preferable for performance or at least neutral?

On Neoverse-V2 the benchmarks that I ran reported zero change in performance.

@rj-jesus
Copy link
Contributor

rj-jesus commented Nov 7, 2025

Consider this example, which issues 20 mov x9, 0 per iteration for 10^9 iterations. What I see:

$ perf stat -e cycles,instructions,op_retired ./a.out 

 Performance counter stats for './a.out':

     3,503,570,565      cycles                                                                
    22,003,024,649      instructions                     #    6.28  insn per cycle            
    22,003,034,412      op_retired                                                            

       1.133565071 seconds time elapsed

       1.050723000 seconds user
       0.003995000 seconds sys

If you normalise the IPC perf computes by 21/22 (because the CMP+B is fused), you get exactly 6.
LLVM-MCA currently works out something similar: https://godbolt.org/z/7az3YEEY9.

If we make zero-latency moves zero micro-ops, then we'll have:

Iterations:        100000
Instructions:      2000000
Total Cycles:      6251
Total uOps:        0

Dispatch Width:    6
uOps Per Cycle:    0.00
IPC:               319.95
Block RThroughput: 0.0

(The IPC becomes bottlenecked on the MicroOpBufferSize.)

It does seem to me that these instructions should be modelled as one micro-op... Unless I'm missing something or unless there's a compelling reason for us to make this change, I believe it would be better if we left it as is. What do you think?

@simonwallis2
Copy link
Contributor Author

It does seem to me that these instructions should be modelled as one micro-op... Unless I'm missing something or unless there's a compelling reason for us to make this change, I believe it would be better if we left it as is. What do you think?

I see that if making this zero micro-op change in isolation leads to llvm-mca reporting unrealistic cycle counts then that would be unhelpful. We still do not model all of the dispatch constraints described in the SWOG Special Considerations section 4.1. Modelling the MOPS per cycle limit would be key to reporting sensible numbers for an instruction with zero uOPs.

Thanks for supplying these examples, very useful. I tried modifying them to clarify to myself how the IPC is affected by MOPS limits and uOPs limits but was not able to provide a definite answer.

I propose to update the patch and remove the zero-micro-op change.
So the zero latency instructions stay as one micro-op, but we still proliferate zero latency instructions to those targets where they are not currently modelled.

NeoverseZeroMove was introduced for Neoverse-V2 and was added to V3 and V3AE.
Use NeoverseZeroMove for Neoverse-V1, N2, N3 in the same way, including these instructions:
MOV Xd|Wd, #0|XZR|WZR

For all Neoverse targets, the following instructions are also decoded as not utilizing the scheduling and execution resources of the machine:
MOV Wd,Wn
MOV Xd,Xn

For Neoverse-N3 only, these instructions also have zero latency
FMOV Dd, Dn
FMOV Sd, Sn

Change-Id: I7a6a971cf75c60d8f75b210f0529c4ad813775a3
NeoverseZeroMove was introduced for Neoverse-V2 and was added to V3 and V3AE.
Use NeoverseZeroMove for Neoverse-V1, N2, N3 in the same way, including these instructions:
MOV Xd|Wd, #0|XZR|WZR

For all Neoverse targets, the following instructions are also decoded as not utilizing the scheduling and execution resources of the machine:
MOV Wd,Wn
MOV Xd,Xn

For Neoverse-N3 only, these instructions also have zero latency
FMOV Dd, Dn
FMOV Sd, Sn

Change-Id: I95c53d373f35bb0bea5174a16c7ab3ac25acf684
@simonwallis2
Copy link
Contributor Author

I reverted my changes for V2, V3, V3A and their corresponding tests,
restoring the old behaviour of zero latency not implying zero micro-ops,

I updated the changes and tests for N2, N3, V1 so they match V2,V3,V3A.
I made it explicit in N3.td and V1.td that NumMicroOps is 1.
Note that for N3, this removes the zero-micro-ops count from the instruction setffr (only).

@rj-jesus
Copy link
Contributor

I see that if making this zero micro-op change in isolation leads to llvm-mca reporting unrealistic cycle counts then that would be unhelpful. We still do not model all of the dispatch constraints described in the SWOG Special Considerations section 4.1. Modelling the MOPS per cycle limit would be key to reporting sensible numbers for an instruction with zero uOPs.

It would certainly be good to model those dispatch constraints (FYI, in case you haven't noticed them, this patch mentions slightly different dispatch constraints for the Neoverse V2 than those in the latest SWOG). There are a few other attributes that are currently not modelled in the Neoverse scheduling tables that could be worth looking at, for example BufferSize's, which should help LLVM-MCA be slightly more accurate.

NeoverseZeroMove was introduced for Neoverse-V2 and was added to V3 and V3AE.
Use NeoverseZeroMove for Neoverse-V1, N2, N3 in the same way, including these instructions:
MOV Xd|Wd, #0|XZR|WZR

For all Neoverse targets, the following instructions are also decoded as not utilizing the scheduling and execution resources of the machine:
MOV Wd,Wn
MOV Xd,Xn

For Neoverse-N3 only, these instructions also have zero latency
FMOV Dd, Dn
FMOV Sd, Sn

Change-Id: Ibbc0ba1da02dd4bf5ca28b33164d8fa4e93958d6
Copy link
Contributor

@rj-jesus rj-jesus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more suggestions but otherwise this is almost good to go.
Thanks for the patience. :)


// Predicate set/initialize, set flags
def : InstRW<[N3Write_2c_1M], (instregex "^PTRUES_[BHSD]")>;
def : InstRW<[N3Write_0or2c_1M], (instregex "^PTRUES_[BHSD]")>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTRUES isn't listed in Section 4.11, but it is mentioned in Table 2-23, so I'll assume the latter to be correct (matching what you've implemented).

NeoverseZeroMove was introduced for Neoverse-V2 and was added to V3 and V3AE.
Use NeoverseZeroMove for Neoverse-V1, N2, N3 in the same way, including these instructions:
MOV Xd|Wd, #0|XZR|WZR

For all Neoverse targets, the following instructions are also decoded as not utilizing the scheduling and execution resources of the machine:
MOV Wd,Wn
MOV Xd,Xn

For Neoverse-N3 only, these instructions also have zero latency
FMOV Dd, Dn
FMOV Sd, Sn

Change-Id: I2d51b0ee6736d14f8212583f234431c555cc2574
Copy link
Contributor

@rj-jesus rj-jesus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, cheers. Just please fix the conflict in llvm/lib/Target/AArch64/AArch64SchedNeoverseN3.td and let the CI tests run. :)

@github-actions
Copy link

🐧 Linux x64 Test Results

  • 186368 tests passed
  • 4859 tests skipped

@simonwallis2 simonwallis2 merged commit 6fc48de into llvm:main Nov 19, 2025
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants