[AMDGPU] TTI: Prioritize per-iter base adds in LSR plan comparison #509
Conversation
Mach-O has 32-bit file offsets in the MachO::section_64 structs. dSYM files can contain sections whose start offset exceeds UINT32_MAX, which means MachO::section_64.offset gets truncated. We can detect when this happens and adjust the section offset to be 64-bit safe. This lets tools get the correct section contents for large dSYM files, and allows tools that parse DWARF, like llvm-gsymutil, to load and convert these files correctly.
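As a rough illustration of the recovery, here is a hypothetical helper (not the actual patch; it assumes sections appear in ascending file-offset order, so a running 64-bit cursor can restore the truncated high bits):
```cpp
#include <cstdint>

// Hypothetical sketch: rebuild a 64-bit section offset from the truncated
// 32-bit MachO::section_64::offset field, assuming sections are visited in
// ascending file-offset order and RunningOffset tracks a 64-bit cursor.
uint64_t adjustSectionOffset(uint32_t TruncatedOffset, uint64_t RunningOffset) {
  // Splice the truncated low 32 bits onto the cursor's high 32 bits.
  uint64_t Candidate =
      (RunningOffset & ~uint64_t(0xFFFFFFFF)) | uint64_t(TruncatedOffset);
  // If that lands behind the cursor, the 32-bit field wrapped once more.
  if (Candidate < RunningOffset)
    Candidate += uint64_t(1) << 32;
  return Candidate;
}
```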
This reverts commit df1d786.
Upstream the CXXDefaultArgExpr support for AggregateExpr
Add faccessat entrypoints for aarch64 and riscv linux. Entrypoints are removed if faccessat2 syscall is not available.
ConstantRange uses `[-1, -1)` as the canonical form of a full set. Therefore, the `for (APInt I = Lower; I != Upper; ++I)` idiom doesn't work for full ranges. This patch fixes the value enumeration in `ConstantComparesGatherer` to prevent missing values for full sets. Closes llvm#166369.
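A minimal sketch of the pitfall, assuming the usual ConstantRange API (the real ConstantComparesGatherer code differs and also bounds how many values it will enumerate):
```cpp
#include "llvm/ADT/APInt.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/ConstantRange.h"
using namespace llvm;

// For a full set, Lower == Upper, so `for (APInt I = Lower; I != Upper; ++I)`
// exits immediately and enumerates nothing. A do/while visits Lower first and
// therefore handles the full-set encoding correctly.
void enumerateRange(const ConstantRange &CR, SmallVectorImpl<APInt> &Out) {
  if (CR.isEmptySet())
    return;
  APInt I = CR.getLower();
  do {
    Out.push_back(I);
    ++I;
  } while (I != CR.getUpper());
}
```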
…no-gpu-bundle-output flag (llvm#163834) Currently, the command `clang -c -emit-llvm --no-gpu-bundle-output --offload-arch=gfx900,gfx1030 -O3 -x hip square.hip` leads to a bundled output:
```
❯ ../bin/clang -c -emit-llvm --no-gpu-bundle-output --offload-arch=gfx900,gfx1030 -O3 -x hip square.hip
❯ ls
square.hip square.bc
```
This doesn't match my expectation of the behavior of `--no-gpu-bundle-output`, so this adds a check in OffloadingActionBuilder for the flag when replacing the host compile action with a bundling action.
PR llvm#159163's probability computation for epilogue loops does not handle the possibility of an original loop probability of one. Runtime loop unrolling does not make sense for such an infinite loop, and the computation ends up dividing by zero. This patch works around that case. Issue llvm#165998.
Fix Python virtual environment paths for Windows in the Lifetime Safety Analysis benchmark
### What changed?
- Added conditional path setting for the Python executable in the virtual environment based on the platform
- For Windows, use the `Scripts/python` path
- For other platforms, use the `bin/python` path
- Updated the commands that use the Python virtual environment to use the platform-specific path
### How to test?
`ninja benchmark_lifetime_safety_analysis`
Fixes llvm#166143
Omitting `-o /dev/null` may try to write output to the current dir, which may not have write permissions on some build systems. This fixes the test added by llvm#165737
…llvm#166312) This is consistent with some other nodes that require a constant. Particularly intrinsics with ImmArg.
…66290) Eventually this should be generated by tablegen for all functions. For now, add a manual implementation for sincos_stret, which I have an immediate use for. This will allow pulling repeated code across targets into shared call-sequence code. Also add sqrt, just to make sure we can handle adding return attributes on the declaration.
Upstream the Builtin ExpOp
Avoid a weird lambda returning a function pointer, and sink the libcall logic to where the operation is handled. This allows chaining the libcall logic to try sincos_stret and fall back to sincos. The resulting cost seems too low.
This avoids AArch64 legality rules depending on libcall availability. ARM, AArch64, and X86 all had custom lowering of fsincos, all of which just emitted calls to sincos_stret / sincosf_stret. This messes with the cost heuristics around legality, because it is really an expand/libcall cost, not a favorable custom lowering. This is a bit ugly, because we're emitting code that tries to match the C-ABI-lowered IR type for the aggregate return type. This also gives an easy way to lift the unhandled x86_32 Darwin case, since ARM already handled the return-as-sret case.
Previously, `clang -march=znver5 -O3` would emit the following for
`shl`, `lshr` and `ashr <64 x i8>`:
```asm
.LCPI0_2:
.byte 8
.byte 4
.byte 2
.byte 1
.byte 0
.byte 0
.byte 0
.byte 0
.LCPI0_3:
.byte 32
.byte 16
.byte 8
.byte 4
.byte 2
.byte 1
.byte 0
.byte 0
shl:
vpsllw zmm1, zmm1, 5
vpmovb2m k1, zmm1
vpaddb zmm1, zmm1, zmm1
vgf2p8affineqb zmm0 {k1}, zmm0, qword ptr [rip + .LCPI0_2]{1to8}, 0
vpmovb2m k1, zmm1
vpaddb zmm1, zmm1, zmm1
vgf2p8affineqb zmm0 {k1}, zmm0, qword ptr [rip + .LCPI0_3]{1to8}, 0
vpmovb2m k1, zmm1
vpaddb zmm0 {k1}, zmm0, zmm0
ret
.LCPI1_3:
.byte 0
.byte 0
.byte 0
.byte 0
.byte 128
.byte 64
.byte 32
.byte 16
.LCPI1_4:
.byte 0
.byte 0
.byte 128
.byte 64
.byte 32
.byte 16
.byte 8
.byte 4
.LCPI1_5:
.byte 0
.byte 128
.byte 64
.byte 32
.byte 16
.byte 8
.byte 4
.byte 2
lshr:
vpsllw zmm1, zmm1, 5
vpmovb2m k1, zmm1
vpaddb zmm1, zmm1, zmm1
vgf2p8affineqb zmm0 {k1}, zmm0, qword ptr [rip + .LCPI1_3]{1to8}, 0
vpmovb2m k1, zmm1
vpaddb zmm1, zmm1, zmm1
vgf2p8affineqb zmm0 {k1}, zmm0, qword ptr [rip + .LCPI1_4]{1to8}, 0
vpmovb2m k1, zmm1
vgf2p8affineqb zmm0 {k1}, zmm0, qword ptr [rip + .LCPI1_5]{1to8}, 0
ret
ashr:
vpsllw zmm1, zmm1, 5
vpunpckhbw zmm2, zmm0, zmm0
vpunpckhbw zmm4, zmm1, zmm1
vpsraw zmm3, zmm2, 4
vpunpcklbw zmm0, zmm0, zmm0
vpmovb2m k1, zmm4
vpaddw zmm4, zmm4, zmm4
vpunpcklbw zmm1, zmm1, zmm1
vmovdqu8 zmm2 {k1}, zmm3
vpmovb2m k1, zmm4
vpsraw zmm3, zmm2, 2
vpaddw zmm4, zmm4, zmm4
vmovdqu8 zmm2 {k1}, zmm3
vpsraw zmm3, zmm2, 1
vpmovb2m k1, zmm4
vmovdqu8 zmm2 {k1}, zmm3
vpmovb2m k1, zmm1
vpsraw zmm3, zmm0, 4
vpaddw zmm1, zmm1, zmm1
vpsrlw zmm2, zmm2, 8
vmovdqu8 zmm0 {k1}, zmm3
vpmovb2m k1, zmm1
vpsraw zmm3, zmm0, 2
vpaddw zmm1, zmm1, zmm1
vmovdqu8 zmm0 {k1}, zmm3
vpsraw zmm3, zmm0, 1
vpmovb2m k1, zmm1
vmovdqu8 zmm0 {k1}, zmm3
vpsrlw zmm0, zmm0, 8
vpackuswb zmm0, zmm0, zmm2
ret
```
With this commit, the generated assembly becomes this:
```asm
.LCPI0_2:
.byte 0
.byte 255
.byte 0
.byte 255
.LCPI0_3:
.byte 255
.byte 0
.byte 255
.byte 0
shl:
vpsrlw zmm2, zmm1, 8
vpandd zmm3, zmm0, dword ptr [rip + .LCPI0_2]{1to16}
vpandd zmm1, zmm1, dword ptr [rip + .LCPI0_3]{1to16}
movabs rax, -6148914691236517206
kmovq k1, rax
vpsllvw zmm2, zmm3, zmm2
vpsllvw zmm0, zmm0, zmm1
vmovdqu8 zmm0 {k1}, zmm2
ret
.LCPI1_0:
.byte 255
.byte 0
lshr:
vpbroadcastw zmm2, word ptr [rip + .LCPI1_0]
movabs rax, -6148914691236517206
kmovq k1, rax
vpandq zmm3, zmm1, zmm2
vpandq zmm2, zmm0, zmm2
vpsrlw zmm1, zmm1, 8
vpsrlvw zmm2, zmm2, zmm3
vpsrlvw zmm0, zmm0, zmm1
vmovdqu8 zmm2 {k1}, zmm0
vmovdqa64 zmm0, zmm2
ret
.LCPI2_1:
.byte 255
.byte 0
.byte 255
.byte 0
ashr:
vpsrlw zmm2, zmm1, 8
vpandd zmm1, zmm1, dword ptr [rip + .LCPI2_1]{1to16}
movabs rax, -6148914691236517206
vpsravw zmm2, zmm0, zmm2
vpsllw zmm0, zmm0, 8
kmovq k1, rax
vpsraw zmm0, zmm0, 8
vpsravw zmm0, zmm0, zmm1
vmovdqu8 zmm0 {k1}, zmm2
ret
```
While I don't have AVX512 hardware, llvm-mca suggests significant
speedups, and I've done some simple correctness tests on random inputs
using the Intel Software Development Emulator.
Update the verifier to first check whether the number of incoming values matches the number of predecessors, before using incoming_values_and_blocks. We unfortunately need to check here as well, as this may be called before verifyPhiRecipes runs. Also update the verifier unit tests to actually fail for the expected recipes.
…irst` (llvm#166384) This paves the way to make `--stack-first` the default. See: llvm#151015
When running UTC on SLU/guards.ll (without LLVM changes), there are a number of changes in the UTC-generated checks. Submitting those first to simplify the diff of PR llvm#164271, as most of the changes in the latter were actually these.
…66260) MachineFunctionSplitter was missing a skipFunction() check, causing it to incorrectly split functions that should be skipped (e.g., functions with optnone attribute). This patch adds an early skipFunction() check in runOnMachineFunction() to ensure these functions are never split, regardless of profile data availability or other splitting conditions.
…ction (llvm#165774) Updated the conditions for generating the cvt.ftz.f32.bf16 instruction to include sm90 and isa7.8, so that ftz is only generated when it is supported. --------- Co-authored-by: Justin Fargnoli <[email protected]>
Broken by 831e79a, though presubmit was somehow green.
Introduced in OpenMP 6.0, the device UID shall be a unique identifier of a device on a given system. (Not necessarily a UUID.) Since it is not guaranteed that the (U)UIDs defined by the device vendor libraries, such as HSA, do not overlap with those of other vendors, the device UIDs in offload are always combined with the offload plugin name. In case the vendor library does not specify any device UID for a given device, we fall back to the offload-internal device ID. The device UID can be retrieved using the `llvm-offload-device-info` tool.
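A sketch of the composition rule just described, with illustrative names (not the actual liboffload code):
```cpp
#include <optional>
#include <string>

// The plugin name is always prefixed so vendor-provided (U)UIDs cannot
// collide across vendors; if the vendor library reports no UID, fall back
// to the offload-internal device ID.
std::string makeDeviceUID(const std::string &PluginName,
                          const std::optional<std::string> &VendorUID,
                          int InternalDeviceID) {
  if (VendorUID)
    return PluginName + "-" + *VendorUID;
  return PluginName + "-" + std::to_string(InternalDeviceID);
}
```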
From llvm#148554 (comment) - this adds an option for API tests to be run with both PDB readers on Windows. As there are a lot of failures with PDB, this is opt-in per test. To get PDB, `-g -gcodeview` has to be used with Clang. `-gcodeview` alone isn't enough, because it won't cause clang to pass `-debug` to the linker. llvm#149498 tracks the (currently) failing tests.
Ensure that the nested native build uses the same python interpreter as the main build, in case the python that CMake detects first is not the python that the user has specified explicitly. For example, if the person building LLVM wants to use a different python interpreter to build (e.g., testing the build with `python3.14` when `python3` is a link to `python3.8`, or the default python doesn't have development headers available), then they could add `-DPYTHON_EXECUTABLE=python3.14` when invoking CMake. This should be forwarded to the native CMake build to ensure that the same python is used. Original fix by Anuj Mittal <[email protected]>.
Reviewers: paperchalice, phoebewang, arsenm Reviewed By: arsenm Pull Request: llvm#165113
There are no tests that specifically stop/start at x86-partial-reduction, so no test cases have been updated for this patch. Reviewers: phoebewang, paperchalice, RKSimon, arsenm Reviewed By: arsenm, RKSimon Pull Request: llvm#166048
Currently, a jump table entry contains labels only. For example:
```
BPF.JT.0.0:
        .quad LBB0_1
        .quad LBB0_2
        .size BPF.JT.0.0, 16
```
Since the jump table entry contains a label, a relocation is necessary so the linker can resolve the label value. The relocation looks like this:
```
Relocation section '.rel.jumptables' at offset 0x160 contains 2 entries:
    Offset             Info             Type            Symbol's Value  Symbol's Name
0000000000000000  0000000200000002 R_BPF_64_ABS64   0000000000000000  .text
0000000000000008  0000000200000002 R_BPF_64_ABS64   0000000000000000  .text
```
You can see that the symbol value is 0, which makes .rel.jumptables not very useful. Instead of placing the label itself in the jump table entry, use the difference between the label and the section begin symbol. This avoids the relocation, and the eventual jump table entries in the object file remain the same as before:
```
Hex dump of section '.jumptables':
  0x00000000 68000000 00000000 78000000 00000000 h.......x.......
```
The register values between `2 << 30` (inclusive) and `2 << 31` (exclusive) correspond to frame indices. To obtain the frame index from a given register value, we interpret the first 30 bits as an unsigned integer. Thus, currently only non-negative frame indices can be represented. However, we should be able to represent negative frame indices as register values as well; this is used by reaching-definitions analysis, for example. To do that, we now interpret the first 30 bits of the register value as a signed integer. --------- Co-authored-by: Mikhail Gudim <[email protected]> Co-authored-by: Petr Penzin <[email protected]>
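A sketch of the signed encoding described above, with illustrative constants (not the exact LLVM `Register` implementation):
```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t FIBase = 2u << 30; // start of the frame-index range

// Two's-complement truncation to 30 bits keeps the sign information, so
// negative (e.g., fixed) frame indices can round-trip as well.
uint32_t frameIndexToReg(int FI) {
  return FIBase | (static_cast<uint32_t>(FI) & ((1u << 30) - 1));
}

int regToFrameIndex(uint32_t Reg) {
  assert(Reg >= FIBase && "not a frame-index register value");
  uint32_t Low = Reg & ((1u << 30) - 1);
  // Sign-extend the 30-bit field back to a full int.
  return static_cast<int>(Low << 2) >> 2;
}
```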
This patch introduces `SBFrameList`, a new SB API class that allows iterating over stack frames lazily without calling `SBThread::GetFrameAtIndex` in a loop. The new `SBThread::GetFrames()` method returns an `SBFrameList` that supports Python iteration (`for frame in frame_list:`), indexing (`frame_list[0]`, `frame_list[-1]`), and length queries (`len()`). The implementation uses `StackFrameListSP` as the opaque pointer, sharing the thread's underlying frame list to ensure frames are materialized on demand. This is particularly useful for ScriptedFrameProviders, where user scripts will be able to iterate, filter, and replace frames lazily without materializing the entire stack upfront. Signed-off-by: Med Ismail Bennani <[email protected]>
This new test case breaks the buildbot starting https://lab.llvm.org/buildbot/#/builders/201/builds/6934. The corresponding clang test case sets the target triple to avoid failures.
…6670) I was looking at the calls to `usleep` in debugserver and noticed that these default arguments are never overwritten. I converted them to constants in the function, which makes it easier to reason about.
This is step 1 of exposing WASM `ref.func` to LLVM. This PR only handles creating the instruction and a test for assembling it.
From the OpenACC 3.4 Specification:
```
1924 2.7.9 copyout clause
1925 The copyout clause may appear on structured data and compute constructs, on declare
1926 directives, and on exit data directives. The clause may optionally have a zero modifier if the
1927 copyout clause appears on a structured data or compute construct.
1928 Only the following modifiers may appear in the optional modifier-list: always, alwaysin or zero.
1929 Additionally, on structured data and compute constructs capture modifier may appear
```
`readonly` is not a legal modifier for the `copyout` clause. The call to `genDataOperandOperationsWithModifier` should be checking the parsed modifier for the `copyout` clause against the `Zero` modifier.
Revealed in PR llvm#166102, which itself doesn't _cause_ the profile to be dropped. Referencing it in case that makes debugging easier.
…ead exits (llvm#164714) As noted in the doc comment of `handleDeadExits`, the dummy switch is just an artifact of the constraints placed by the fact that we operate in a loop pass. Adding weights here is unnecessary, but the complexity is low, and it helps keep things easy for profile propagation verification (in a sense, the overall complexity would be higher if we special-cased this somehow). Issue llvm#147390
This was very hackily patched in 16ef893 to not use env -u. The internal shell does not support unset, but does support env -u. Disable the test for now so we can enable the internal shell, with a TODO to re-enable it after the internal shell landing has stuck. Reviewers: fmayer, vitalybuka, w2yehia, daltenty, mingmingl-llvm, madanial0 Reviewed By: mingmingl-llvm Pull Request: llvm#166637
This fixes a regression exposed after 4454152. This introduces a few small regressions for true16. There are more cases where the value can propagate through subregister extracts which need new handling. They're also small enough that perhaps there's a way to avoid needing to deal with this case in the first place.
Removing skip after confirming builds pass locally. Upgraded workers to clang 20 Fixes llvm#162516
LSR on AMDGPU was selecting plans that reduce Setup/AddRec cost while increasing per-iteration base additions (`NumBaseAdds`) in the loop body. With minimal addressing modes, each extra `add i32` becomes a separate VALU op and hurts throughput (observed on gfx942). Teach GCNTTIImpl to compare LSR plans with `NumBaseAdds` and then `Insns` as the dominant criteria, ahead of preheader costs (`SetupCost`, `AddRecCost`). Also return false from isNumRegsMajorCostOfLSR() and true from shouldDropLSRSolutionIfLessProfitable(). This steers LSR away from “-setup, +more loop adds” trades; the reduced repro now keeps the baseline, and a large kernel regains performance.
- Fixed a few lit-tests by limiting to GFX9 and above - Regenerated no-crash tests
- Reorder GCNTTIImpl::isLSRCostLess() for GFX9+ to prioritize per-iteration work:
1) Insns (lower is better)
2) NumBaseAdds (tie-breaker to Insns)
3) NumIVMuls (penalize mul/mul_hi/addc IV chains)
4) AddRecCost, then SetupCost (preheader costs only if per-iter ties)
5) Minor keys: ScaleCost, then NumRegs
- Keep pre-GFX9 behavior unchanged (fall back to BaseT::isLSRCostLess).
- Rationale: AMDGPU lacks rich addressing modes; we must avoid LSR “wins” that
reduce setup but add per-iteration base-adds / wide IV mul chains. This ordering
consistently favors plans with less per-iteration work and reduces VALU pressure
on gfx9+ (e.g., gfx942). In practice LSR now often drops its “improvement” in
favor of the baseline plan (as seen in -debug-only=lsr logs).
Tests:
- Regenerate `llvm/test/CodeGen/AMDGPU/agpr-copy-no-free-registers.ll` checks
(kernel `introduced_copy_to_sgpr`). Assembly changed in a beneficial way:
fewer VALU ops in the hot loops and lower register usage:
num_vgpr: 28 → 25
numbered_sgpr: 26 → 20
No functional change to the test’s intent (no RA asserts; copy/RA behavior intact).
Checks updated via `utils/update_llc_test_checks.py`.
… base-add count
Adjust the LSR cost comparison order for GFX9+ to prioritize per-iteration IV efficiency over minimizing base-adds. The previous order checked NumBaseAdds early, causing the cost model to prefer baseline solutions with multiple IVs (NumBaseAdds=0) over optimized LSR solutions with fewer IVs but one base-add (NumBaseAdds=1). This resulted in extra register moves between IVs in the loop body, increasing VALU pressure.
Changes to priority order for GFX9+:
1. Insns (total per-iteration instructions)
2. NumIVMuls (IV multiplication chains) <- moved above NumBaseAdds
3. AddRecCost (per-iteration IV updates) <- moved above NumBaseAdds
4. NumBaseAdds (base address adds) <- moved down
5. SetupCost (preheader costs)
6. ScaleCost, NumRegs (tie-breakers)
Rationale: AddRecCost directly measures per-iteration IV update cost (fewer IVs = lower cost), which better reflects actual loop-body work than NumBaseAdds alone. This fixes cases where the baseline with 2 IVs generated more VALU instructions than the LSR solution with 1 IV + base-add.
This patch refines the AMDGPU LSR cost model to better reflect the lack of rich addressing modes on the architecture. Key changes:
1. Implement getScalingFactorCost() to return 1 for base+scale*index addressing modes. While AMDGPU's isLegalAddressingMode() reports such modes as "legal", they require a separate ADD instruction, unlike architectures with hardware-supported complex addressing.
2. Update isLSRCostLess() to use EffInsns = Insns + ScaleCost for the primary comparison. This ensures that LSR solutions using scaled addressing are properly penalized for the actual per-iteration work on AMDGPU.
New priority order for GFX9+:
1. EffInsns (Insns + ScaleCost) - true per-iteration cost
2. NumIVMuls - IV multiplication chains
3. AddRecCost - IV update cost
4. NumBaseAdds - base address additions
5. SetupCost - preheader costs
6. ImmCost, NumRegs - tie-breakers
Lit tests were carefully analyzed before regeneration to ensure:
- No functional regressions (all test objectives preserved)
- No incorrect machine sinking or divergence-handling issues
- Acceptable trade-offs (preheader overhead vs. per-iteration efficiency)
- No register spills or unnecessary barriers introduced
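For reference, a minimal sketch of the comparison order this commit describes, assuming the standard `TargetTransformInfo::LSRCost` fields (illustrative only, not the verbatim GCNTTIImpl change):
```cpp
#include <tuple>

struct LSRCost { // mirrors llvm::TargetTransformInfo::LSRCost fields
  unsigned Insns, NumRegs, AddRecCost, NumIVMuls, NumBaseAdds,
           ImmCost, SetupCost, ScaleCost;
};

bool isLSRCostLessGFX9(const LSRCost &C1, const LSRCost &C2) {
  // Scaled addressing is not free on AMDGPU: fold its cost into the
  // per-iteration instruction count before anything else is compared.
  unsigned EffInsns1 = C1.Insns + C1.ScaleCost;
  unsigned EffInsns2 = C2.Insns + C2.ScaleCost;
  // Lexicographic comparison in the priority order listed above.
  return std::tie(EffInsns1, C1.NumIVMuls, C1.AddRecCost, C1.NumBaseAdds,
                  C1.SetupCost, C1.ImmCost, C1.NumRegs) <
         std::tie(EffInsns2, C2.NumIVMuls, C2.AddRecCost, C2.NumBaseAdds,
                  C2.SetupCost, C2.ImmCost, C2.NumRegs);
}
```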
```yaml
    if: github.event.pull_request.draft == false
    runs-on:
      group: compiler-generic-runners
    env:
      svc_acc_org_secret: ${{secrets.CI_GITHUB_TOKEN}}
      input_sha: ${{ github.event.pull_request.head.sha != '' && github.event.pull_request.head.sha || github.sha }}
      input_pr_num: ${{ github.event.pull_request.number != '' && github.event.pull_request.number || 0 }}
      input_pr_url: ${{ github.event.pull_request.html_url != '' && github.event.pull_request.html_url || '' }}
      input_pr_title: ${{ github.event.pull_request.title != '' && github.event.pull_request.title || '' }}
      # set the pipeline name here based on branch name
      pipeline_name: ${{secrets.CI_JENKINS_JOB_NAME}}
      JENKINS_URL: ${{secrets.CI_JENKINS_URL}}
      CONTAINER_IMAGE: ${{ secrets.JENKINS_TRIGGER_DOCKER_IMAGE }}

    # Steps represent a sequence of tasks that will be executed as part of the job
    steps:
      # Checks-out your repository under $GITHUB_WORKSPACE, so your job can access it
      - name: Set environment variable for container image
        run: |
          echo "CONTAINER_IMAGE=${{ secrets.JENKINS_TRIGGER_DOCKER_IMAGE }}" >> $GITHUB_ENV
          echo "CONTAINER_NAME=my_container_${{ github.run_id }}" >> $GITHUB_ENV

      - name: Pull container image
        run: docker pull "${{env.CONTAINER_IMAGE}}"

      - name: Run container
        run: |
          docker run -d --name "${{env.CONTAINER_NAME}}" $CONTAINER_IMAGE sleep infinity
          #docker exec "${{env.CONTAINER_NAME}}" /bin/bash -c "git clone ${{secrets.CI_UTILS_REPO}} ."
          docker exec "${{env.CONTAINER_NAME}}" /bin/bash -c "echo 'Running commands inside the container'"

      - name: Escape pull request title
        run: |
          import json
          import os
          import shlex
          with open('${{ github.event_path }}') as fh:
              event = json.load(fh)
          escaped = event['pull_request']['title']
          with open(os.environ['GITHUB_ENV'], 'a') as fh:
              print(f'PR_TITLE={escaped}', file=fh)
        shell: python3 {0}

      - name: Run Jenkins Cancel Script
        env:
          JENKINS_URL: ${{secrets.CI_JENKINS_URL}}
          JENKINS_USER: ${{secrets.CI_JENKINS_USER}}
          JENKINS_API_TOKEN: ${{secrets.CI_JENKINS_TOKEN}}
          JENKINS_JOB_NAME: ${{secrets.CI_JENKINS_JOB_NAME}}
          PR_NUMBER: ${{ github.event.pull_request.number }}
          COMMIT_HASH: ${{ github.event.after }}
        run: |
          docker exec -e JENKINS_JOB_NAME=${{secrets.CI_JENKINS_JOB_NAME}} -e PR_NUMBER=${{ github.event.pull_request.number }} -e COMMIT_HASH=${{ github.event.after }} -e JENKINS_URL=${{secrets.CI_JENKINS_URL}} -e JENKINS_USER=${{secrets.CI_JENKINS_USER}} -e JENKINS_API_TOKEN=${{secrets.CI_JENKINS_TOKEN}} "${{env.CONTAINER_NAME}}" /bin/bash -c "PYTHONHTTPSVERIFY=0 python3 cancel_previous_build.py"

      # Runs a set of commands using the runners shell
      - name: Getting Event Details
        run: |
          echo $(pwd)
          echo $GITHUB_ENV
          echo $GITHUB_REPOSITORY
          echo $GITHUB_SERVER_URL
          echo "GITHUB_SHA is: $GITHUB_SHA"
          echo "GITHUB_WORKFLOW_SHA is: $GITHUB_WORKFLOW_SHA"
          echo "GITHUB_BASE_REF is: $GITHUB_BASE_REF"
          echo "GITHUB_REF_NAME is: $GITHUB_REF_NAME"
          echo "github.event.pull_request.id is: ${{github.event.pull_request.id}}"
          echo "github.event.pull_request.html_url is: ${{github.event.pull_request.html_url}}"
          echo "github.event.pull_request.number is: ${{github.event.pull_request.number}}"
          echo "github.event.pull_request.url is: ${{github.event.pull_request.url}}"
          echo "github.event.pull_request.issue_url is: ${{github.event.pull_request.issue_url}}"
          echo "github.event.pull_request.head.sha is: ${{github.event.pull_request.head.sha}}"
          echo "github.event.pull_request.base.ref is: ${{github.event.pull_request.base.ref}}"
          echo "github.event.pull_request.merge_commit_sha is: ${{github.event.pull_request.merge_commit_sha}}"
          echo "github.event.pull_request is: ${{github.event.pull_request}}"

      - name: Trigger Jenkins Pipeline
        if: steps.check_changes.outcome != 'failure'
        run: |
          echo "--Running jenkins_api.py with input sha - $input_sha for pull request - $input_pr_url"
          docker exec -e GITHUB_REPOSITORY="$GITHUB_REPOSITORY" -e svc_acc_org_secret="$svc_acc_org_secret" -e input_sha="$input_sha" -e input_pr_url="$input_pr_url" -e pipeline_name="$pipeline_name" \
            -e input_pr_num="$input_pr_num" -e PR_TITLE="$PR_TITLE" -e JENKINS_URL="$JENKINS_URL" -e GITHUB_PAT="$svc_acc_org_secret" "${{env.CONTAINER_NAME}}" \
            /bin/bash -c 'echo \"PR NUM: "$input_pr_num"\" && PYTHONHTTPSVERIFY=0 python3 jenkins_api.py -s \"${JENKINS_URL}\" -jn "$pipeline_name" -ghr "$GITHUB_REPOSITORY" -ghsha "$input_sha" -ghprn "$input_pr_num" -ghpru "$input_pr_url" -ghprt "$PR_TITLE" -ghpat="$svc_acc_org_secret"'

      - name: Stop and remove container
        if: always()
        run: |
          docker stop "${{env.CONTAINER_NAME}}"
          docker rm "${{env.CONTAINER_NAME}}"
```
CodeQL check warning: Workflow does not contain permissions (Medium)
Copilot autofix suggestion: To fix the problem, explicitly declare a `permissions:` block at the root of the workflow YAML file (line 2, immediately after `name:` and before `on:`) to limit the GITHUB_TOKEN supplied to jobs. As a minimal baseline, set `contents: read`, which allows only read access to repository contents. If future steps need further access (such as posting checks or updating issues), they can be added to the permissions block, but for now, limiting to `contents: read` is the secure default. No methods, imports, or code logic changes are required; only the YAML workflow file must be updated.
Modified lines R2-R3:
```diff
@@ -1,4 +1,6 @@
 name: Compiler CI PSDB trigger on amd-staging branch
+permissions:
+  contents: read

 # Controls when the workflow will run
 on:
```
```yaml
    if: github.event.pull_request.draft == false
    runs-on:
      group: compiler-generic-runners
    env:
      PR_SHA: ${{ github.event.pull_request.head.sha != '' && github.event.pull_request.head.sha || github.sha }}
      PR_NUMBER: ${{ github.event.pull_request.number != '' && github.event.pull_request.number || 0 }}
      PR_URL: ${{ github.event.pull_request.html_url != '' && github.event.pull_request.html_url || '' }}
      PR_TITLE: ${{ github.event.pull_request.title != '' && github.event.pull_request.title || '' }}
      BASE_BRANCH: ${{ github.event.pull_request.base.ref != '' && github.event.pull_request.base.ref || '' }}
      GITHUB_TOKEN: ${{secrets.CI_GITHUB_TOKEN}}

    steps:
      # Checks-out your repository under $GITHUB_WORKSPACE, so your job can access it
      - name: Set environment variable for container image
        run: |
          echo "CONTAINER_IMAGE=${{ secrets.BUILDBOT_DOCKER_IMAGE }}" >> $GITHUB_ENV
          echo "CONTAINER_NAME=my_container_${{ github.run_id }}" >> $GITHUB_ENV

      - name: Pull container image
        run: docker pull "${{env.CONTAINER_IMAGE}}"

      - name: Run container
        run: |
          docker run -d --name "${{env.CONTAINER_NAME}}" $CONTAINER_IMAGE sleep infinity
          docker exec "${{env.CONTAINER_NAME}}" /bin/bash -c "echo 'Running commands inside the container'"

      - name: Escape pull request title
        run: |
          import json
          import os
          import shlex
          with open('${{ github.event_path }}') as fh:
              event = json.load(fh)
          escaped = event['pull_request']['title']
          with open(os.environ['GITHUB_ENV'], 'a') as fh:
              print(f'PR_TITLE={escaped}', file=fh)
        shell: python3 {0}

      - name: Trigger Buildbot Build
        run: |
          echo "${{ secrets.BUILDBOT_HOST }}:${{ secrets.BUILDBOT_WORKER_PORT }}"
          docker exec -e PR_TITLE="$PR_TITLE" "${{env.CONTAINER_NAME}}" /bin/bash -c 'buildbot sendchange -W ${{ secrets.BUILDBOT_USER }} -a ${{secrets.BUILDBOT_USER}}:${{secrets.BUILDBOT_PWD}} --master="${{ secrets.BUILDBOT_HOST }}:${{ secrets.BUILDBOT_WORKER_PORT }}" --branch=${{ env.BASE_BRANCH }} --revision=${{ env.PR_SHA }} -p PR_NUMBER:${{ env.PR_NUMBER }} -p PR_TITLE:"$PR_TITLE" -p PR_URL:${{ env.PR_URL }} -p SHA:${{ env.PR_SHA }}'

      - name: Set Initial Status to Pending
        run: |
          docker exec -e PR_SHA=$PR_SHA -e GITHUB_TOKEN=$GITHUB_TOKEN "${{env.CONTAINER_NAME}}" /bin/bash -c "python3 -c \"
          import os
          import requests
          GITHUB_TOKEN = os.getenv('GITHUB_TOKEN')
          TARGET_SHA = os.getenv('PR_SHA')
          print('debug', TARGET_SHA)
          api_url = f'https://api.github.com/repos/AMD-Lightning-Internal/llvm-project/statuses/{TARGET_SHA}'
          headers = {
              'Authorization': f'token {GITHUB_TOKEN}',
              'Content-Type': 'application/json'
          }
          payload = {
              'state': 'pending',
              'context': 'buildbot',
              'description': 'Build is in queue'
          }
          response = requests.post(api_url, json=payload, headers=headers)
          if response.status_code == 201:
              print('Status set to pending successfully.')
          else:
              print(f'Failed to set status: {response.status_code} {response.text}')
          \""

      - name: Poll Buildbot build status
        run: |
          python3 -c "
          import os
          import time
          import requests
          GITHUB_TOKEN = os.getenv('GITHUB_TOKEN')
          BUILD_URL = 'http://${{ secrets.BUILDBOT_HOST }}:${{ secrets.BUILDBOT_MASTER_PORT }}/api/v2/builds'
          TARGET_SHA = os.getenv('PR_SHA')
          print('debug', TARGET_SHA)
          MAX_RETRIES = 10
          RETRY_INTERVAL = 30  # seconds

          def get_build_properties(build_id):
              build_properties_url = f'http://${{ secrets.BUILDBOT_HOST }}:${{ secrets.BUILDBOT_MASTER_PORT }}/api/v2/builds/{build_id}/properties'
              response = requests.get(build_properties_url, headers={'Accept': 'application/json', 'Authorization': f'token {GITHUB_TOKEN}'})
              return response.json()

          for i in range(MAX_RETRIES):
              response = requests.get(BUILD_URL, headers={'Accept': 'application/json'})
              response_json = response.json()
              print(f'Attempt {i + 1}: Buildbot response:', response_json)

              # Check if any build has the target SHA
              builds = response_json.get('builds', [])
              print(builds)
              build_with_sha = None
              for build in builds:
                  build_id = build['buildid']
                  properties = get_build_properties(build_id)
                  #print(properties)
                  #prop = properties.get('revision', [])

                  if 'properties' in properties:
                      print(properties['properties'])
                      if 'revision' in properties['properties'][0]:
                          print(properties['properties'][0])
                      if 'revision' in properties['properties'][0] and properties['properties'][0]['revision'][0] == TARGET_SHA:
                          build_with_sha = build
                          break

              if build_with_sha:
                  print('Build started successfully for SHA:', TARGET_SHA)
                  break
              else:
                  print('Build for SHA not started yet, retrying in', RETRY_INTERVAL, 'seconds')
                  time.sleep(RETRY_INTERVAL)
          else:
              print('Build did not start for SHA:', TARGET_SHA, 'after maximum retries')
              exit(1)
          "

      - name: Stop and remove container
        if: always()
        run: |
          docker stop "${{env.CONTAINER_NAME}}"
          docker rm "${{env.CONTAINER_NAME}}"
```
CodeQL check warning: Workflow does not contain permissions (Medium)
Copilot autofix suggestion: To fix this issue, an explicit permissions block should be added to the workflow YAML file. This block should be placed at the root of the workflow (i.e., before `jobs:`), so it applies to all jobs unless further overridden per job. The permissions should be as minimal as required by the workflow. From the workflow code provided, the `statuses: write` permission is needed (the workflow sets commit status via the API), and possibly `contents: read` if it needs to fetch the repository contents. Unless there's evidence the workflow needs other write-level permissions, the block should specify:
    permissions:
      contents: read
      statuses: write
This must be added on a new line after the `name:` and before the `on:` block. No additional imports or complex changes are needed; only a change to the YAML file.
Modified lines R2-R4:
```diff
@@ -1,4 +1,7 @@
 name: Trigger amd-debug Buildbot Build
+permissions:
+  contents: read
+  statuses: write
 on:
   workflow_dispatch:
   pull_request:
```
```yaml
    runs-on:
      group: compiler-generic-runners

    steps:
      - name: Set environment variable for container image
        run: |
          echo "CONTAINER_IMAGE=${{ secrets.JENKINS_TRIGGER_DOCKER_IMAGE }}" >> $GITHUB_ENV
          echo "CONTAINER_NAME=my_container_${{ github.run_id }}" >> $GITHUB_ENV

      - name: Pull container image
        run: docker pull "${{env.CONTAINER_IMAGE}}"

      - name: Run container
        run: |
          docker run -d --name "${{env.CONTAINER_NAME}}" $CONTAINER_IMAGE sleep infinity
          docker exec "${{env.CONTAINER_NAME}}" /bin/bash -c "echo 'Running commands inside the container'"

      - name: Trigger compute-rocm-dkms-afar job
        run: |
          docker exec "${{env.CONTAINER_NAME}}" /bin/bash -c "python -c \"
          import requests
          import time
          from requests.auth import HTTPBasicAuth

          jenkins_user = '${{ secrets.CI_JENKINS_USER }}'
          jenkins_token = '${{ secrets.ROCM_JENKINS_CI_TOKEN }}'
          jenkins_host = '${{ secrets.ROCM_JENKINS_HOST }}'
          jenkins_job = '${{ secrets.ROCM_JENKINS_OSDB_JOB }}'

          jenkins_url = f'{jenkins_host}/job/{jenkins_job}/buildWithParameters'

          response = requests.post(jenkins_url, auth=HTTPBasicAuth(jenkins_user, jenkins_token))

          if response.status_code == 201:
              print('Jenkins job triggered successfully!')
              queue_url = response.headers.get('Location')
              if queue_url:
                  print(f'Queue URL: {queue_url}')
                  print(f'Getting build URL (max 5 attempts with 10 second interval)...')
                  # Poll the queue item to get the build number, limited to 5 attempts
                  max_attempts = 5
                  attempts = 0
                  while attempts < max_attempts:
                      queue_response = requests.get(queue_url + 'api/json', auth=HTTPBasicAuth(jenkins_user, jenkins_token))
                      queue_data = queue_response.json()
                      if 'executable' in queue_data:
                          build_number = queue_data['executable']['number']
                          build_url = f'{jenkins_host}/job/{jenkins_job}/{build_number}/'
                          print(f'Build URL: {build_url}')
                          break
                      attempts += 1
                      time.sleep(10)  # Wait for 10 seconds before polling again
                  else:
                      print('Exceeded maximum attempts to get the build URL. The trigger happened, so not failing the workflow')
              else:
                  print('Build URL not found in the response headers.')

          elif response.status_code == 200:
              print('Request was successful, but check the response content for details.')
              print(response.text)
          else:
              print(f'Failed to trigger Jenkins job. Status code: {response.status_code}')
          \""

      - name: Stop and remove container
        if: always()
        run: |
          docker stop "${{env.CONTAINER_NAME}}"
          docker rm "${{env.CONTAINER_NAME}}"
```
CodeQL check warning: Workflow does not contain permissions (Medium)
Copilot autofix suggestion: To fix this, add a permissions block at the top level of the workflow YAML (recommended), so it applies to all jobs unless overridden. The workflow does not perform any actions that require write privileges (e.g., modifying contents, issues, metadata, etc.). Therefore, the most restrictive and safe default is to set `contents: read`, which prevents the token from being used to make changes to the repository. This should be inserted directly after the `name:` declaration, before the `on:` block. No additional methods, imports, or code changes are needed; just an update to the YAML structure in .github/workflows/compute-rocm-dkmd-afar-trigger.yml.
Modified lines R2-R3:
```diff
@@ -1,4 +1,6 @@
 name: Trigger compute-rocm-dkms-afar job on push
+permissions:
+  contents: read

 on:
   push: # This triggers the workflow on push events
```
AMDGPU: reflect lack of complex addressing in LSR; fix regressions after SROA copy-load promotion
Background
Recent SROA changes promote copy-load style IR more aggressively. As a side effect, LoopStrengthReduce (LSR) now triggers on loops where it previously stayed quiet. On AMDGPU (no rich base+index addressing modes), generic LSR heuristics could pick plans that looked cheaper but expanded into extra address formation work in the loop body, causing performance regressions.
Problem
Generic `isLSRCostLess` treats a legal base+scale*index mode as free, so LSR could trade preheader setup cost for extra per-iteration base adds and IV multiply chains. With AMDGPU's minimal addressing modes, each of those becomes a separate VALU instruction in the loop body (observed on gfx942).
Solution
- Implement `getScalingFactorCost` in `GCNTTIImpl` to model scaled addressing: if `HasBaseReg && Scale != 0`, return cost 1 (one extra per-iteration instruction).
- Update `isLSRCostLess` (GFX9+) to compare on `Effective Insns = Insns + ScaleCost`. This makes scaled address formation count as real per-iteration work.
Priority order on GFX9+:
1. EffInsns (Insns + ScaleCost)
2. NumIVMuls
3. AddRecCost
4. NumBaseAdds
5. SetupCost
6. ImmCost, NumRegs (tie-breakers)
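A sketch of the scaling-factor hook under these assumptions (the real `GCNTTIImpl::getScalingFactorCost` override takes more parameters; this only shows the rule stated above):
```cpp
#include <cstdint>

// Sketch: a "legal" base+scale*index mode still costs one ADD per loop
// iteration on AMDGPU, so report cost 1 instead of 0 for scaled addressing.
int getScalingFactorCostSketch(bool HasBaseReg, int64_t Scale) {
  if (HasBaseReg && Scale != 0)
    return 1; // one extra per-iteration address add
  return 0;   // base-only (or offset-only) addressing is free
}
```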
Rationale
AMDGPU reports base+scale*index addressing as "legal", but there is no hardware support for complex addressing: the scale materializes as a separate ADD every iteration. Counting that cost up front steers LSR away from "-setup, +more loop adds" trades.
Validation
Lit tests were analyzed before regeneration to confirm no functional regressions, no incorrect machine sinking or divergence handling, and no new spills or barriers. The regenerated `llvm/test/CodeGen/AMDGPU/agpr-copy-no-free-registers.ll` checks (kernel `introduced_copy_to_sgpr`) show fewer VALU ops in the hot loops and lower register usage (num_vgpr: 28 → 25, numbered_sgpr: 26 → 20).
Notes
P.S. The commented-out debug messages will stay until the final form (pending extensive CQE performance testing); I'll remove them then.