
@jatinwadhwa921

Backmerging with Msft commits

zhaoxul-qti and others added 30 commits April 24, 2025 09:09
### Description
Add support for the Upsample operator to the op builder in QNN-EP.

### Motivation and Context
- Enhance QNN-EP support for Upsample operator.
- Add unit test for Upsample operator in QNN-EP.
### Description
Add 8-bit support for MatMulNBits on x86

__AVX512 VNNI__
| M | N | K | 8-bit Time (ns) | 4-bit Time (ns) | Slow down (8-bit / 4-bit) |
|:-----:|:-------:|:-------:|:----------------:|:----------------:|:------------------------:|
| 1 | 4096 | 4096 | 34145 | 27723 | **1.23×** |
| 1 | 11008 | 4096 | 415285 | 68656 | **6.05×** |
| 1 | 4096 | 11008 | 407801 | 68061 | **5.99×** |
| 1 | 11008 | 11008 | 2674538 | 1003532 | **2.67×** |
| 4096 | 4096 | 4096 | 80338759 | 86321713 | **0.93×** |
| 4096 | 11008 | 4096 | 213421935 | 225245276 | **0.95×** |
| 4096 | 4096 | 11008 | 240164365 | 228966953 | **1.05×** |
| 4096 | 11008 | 11008 | 628352046 | 596738340 | **1.05×** |

__AVX512__
| M | N | K | 8-bit Time (ns) | 4-bit Time (ns) | Slow down (8-bit / 4-bit) |
|:-----:|:-------:|:-------:|:----------------:|:----------------:|:------------------------:|
| 1 | 4096 | 4096 | 53324 | 37882 | **1.41×** |
| 1 | 11008 | 4096 | 244560 | 103255 | **2.37×** |
| 1 | 4096 | 11008 | 435131 | 95734 | **4.55×** |
| 1 | 11008 | 11008 | 2790710 | 1075216 | **2.60×** |
| 4096 | 4096 | 4096 | 200629000 | 132841540 | **1.51×** |
| 4096 | 11008 | 4096 | 532141914 | 350613184 | **1.52×** |
| 4096 | 4096 | 11008 | 544011977 | 351679619 | **1.55×** |
| 4096 | 11008 | 11008 | 1421865147 | 925593210 | **1.54×** |

Token generation is bottlenecked by memory access; the 8-bit model's 2x size is
the major reason for the token-generation slowdown.

On non-VNNI platforms, the sum of four 8-bit products cannot be held in an int16
intermediate without risking overflow, so extra instructions are needed to avoid
it. This is the major reason for the non-VNNI slowdown.
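
To make the overflow concrete, here is a minimal scalar sketch (not the MLAS kernel; it assumes the usual u8 activation x s8 weight convention) showing why even two 8-bit products exceed the int16 range while the 4-bit case stays comfortably inside it:

```cpp
// Worst-case magnitudes for the intermediate sums that the u8 x s8
// dot-product instructions accumulate in int16 on non-VNNI hardware.
#include <cstdint>
#include <cstdio>

int main() {
  // 4-bit weights: |q - zp| <= 15, activation <= 255.
  int pair_4bit = 255 * 15 + 255 * 15;    // 7650   -> fits in int16
  // 8-bit weights: a single product is already close to INT16_MAX.
  int pair_8bit = 255 * 127 + 255 * 127;  // 64770  -> overflows int16
  // VNNI (VPDPBUSD) sums four such products directly into an int32 accumulator.
  int quad_8bit = 4 * 255 * 127;          // 129540 -> fine in int32

  std::printf("4-bit pair: %d, 8-bit pair: %d (INT16_MAX=%d), 8-bit quad in int32: %d\n",
              pair_4bit, pair_8bit, INT16_MAX, quad_8bit);
  return 0;
}
```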

### Motivation and Context
The MatMul4Bits model has a repetition issue; the 6b model resolved this issue.
This PR fixes an incorrect input/output shape. Following the [DML EP's
implementation](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/dml/DmlExecutionProvider/src/Operators/DmlOperatorRotaryEmbedding.cpp#L142C47-L142C94),
we should ensure the input shape is [batch_size, sequence_length,
num_heads, head_size].
…osoft#24533)

### Description

This PR fixes the program variable data type and revises `ProgramInput`:
- add support for int4/uint4
- fix inconsistency in handling the number of components for int8/uint8
in `ToProgramVariableDataType`
- add a constructor for `ProgramInput` to allow "flattening" the shape
easily
- fix DequantizeLinear
…wnstream node is not QuantizeLinear (microsoft#24537)

### Description
Updates the WeightBiasQuantization optimizer to skip processing on
Conv/Gemm nodes if the downstream child node is not a QuantizeLinear.

#### Before this PR
Original graph:
```
input_0 -> DQ -> Conv -> graph_output (or non-Q node)
                 ^  ^
                 |  |
weights_f32------+
                    |
bias_f32------------+
```
Becomes:

```
input_0 -> DQ ------> Conv -> graph_output (or non-Q node)
                      ^  ^
                      |  |
weights_quant -> DQ --+
                         |
bias_quant -> DQ --------+
```
The above is **NOT** a valid QDQ node unit for Conv because the Conv's
output is not consumed by a QuantizeLinear node.

#### With this PR
The above example graph remains unchanged after L1 optimizations:
```
input_0 -> DQ -> Conv -> graph_output (or non-Q node)
                 ^  ^
                 |  |
weights_f32------+
                    |
bias_f32------------+
```


### Motivation and Context
The previous behavior caused inaccuracy for a customer model. Automatically quantizing the
weights and biases of a Conv/Gemm is detrimental if the output of the
Conv/Gemm is not consumed by a QuantizeLinear node. In this scenario,
the whole node group is not considered a valid QDQ node unit, and so the
EP has to run the Conv/Gemm as float32/float16 anyway. If the Conv/Gemm
is running as float32/float16, then quantizing the weights and biases
introduces inaccuracy for no gain.

PR that originally added this optimizer:
microsoft#22969
### Description
<!-- Describe your changes. -->
Add wrappers for the AutoEP C API changes to the C++ API.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
…24525)

### Description
<!-- Describe your changes. -->

An additional check for non-constant inputs was added to
ConvActivationFusion in microsoft#20282. This was to avoid fusing an Add in a
Conv+Add+Relu that has another non-constant input.


https://github.com/microsoft/onnxruntime/blob/6c8cb6a6d1993f84fcf4008f468a071c0b73aad3/onnxruntime/core/optimizer/conv_activation_fusion.cc#L26-L39

However, this check fails to account for implicit inputs and will read
past the end of a node's explicit input defs if any implicit inputs are
present.

Moreover, this check is no longer necessary after microsoft#19470 removed
Conv+Add+Relu fusion from ConvActivationFusion.

This change removes the check and some other unused code.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Fix microsoft#24473.
…crosoft#24492)

### Description
<!-- Describe your changes. -->

Fix MatMulScaleFusion handling of scales with leading dimensions. The
previous approach accepted a Mul/Div with a scale that broadcasted
additional leading dimensions to its output shape. This caused a shape
mismatch in the fused replacement.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Fix microsoft#24407.
### Description
Fixes a segfault that occurs when an EP library is re-loaded in the same
process.


### Motivation and Context
A recent [PR](microsoft#24430)
updated the Environment to unload all EP libraries on destruction of
`OrtEnv`, but we forgot to update the state to mark the EP library
as unloaded. This caused a segfault when the EP library was
re-loaded.
Fixed the bug in microsoft#24228 which caused incorrect results for Phi models
when flash attention is disabled.
…nt, etc. (microsoft#24527)

### Description
Fixed a few issues related to Conv2dMM and MatMul in the Native WebGPU
backend.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Support 8 bits in MatMulNBits cuda kernel.

The `MatMulFloat8bKernel` CUDA kernel performs a matrix-vector
multiplication (GEMM) where the matrix B is quantized per block using
8-bit integers.

The kernel computes $Output = A \times B$, where:
* $A$ is a row vector (shape `[M, K]`) of type `T` (`float` or `half`).
* $B$ is a matrix (shape `[K, N]`) quantized using 8-bit unsigned
integers (`uint8_t`) with a block structure. It's stored as `[N,
K/block_size, block_size]`.
* `scales_data` contains the dequantization scales (shape `[N,
K/block_size]`).
* `zero_points` contains the dequantization zero points (shape `[N,
K/block_size]`), if used (`has_zero_point` is true).
* `output` is the resulting row vector (shape `[M, N]`).

The kernel uses a thread block structure of `(kWarpSize,
kColsPerThreadBlock)`, meaning each block handles `kColsPerThreadBlock`
(which is 8) columns of the output. Each warp within the block is
responsible for one output element (`[m_id, n_id]`). Threads within a
warp cooperate to compute the dot product along the K dimension. Each
thread (`lane_id`) handles `kElementsPerThreadPerIteration` (which is 8)
elements of the K dimension in each step.
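
For reference, the computation the kernel parallelizes can be written as a plain scalar loop (a sketch of the semantics only, assuming the `[N, K/block_size, block_size]` layout for B and a default zero point of 128 when none is provided; the real CUDA code distributes this work across warps as described above):

```cpp
#include <cstdint>

// Scalar reference for Output[m, n] = sum_k A[m, k] * dequant(B[n, k]).
// B is stored as [N, K/block_size, block_size] uint8, with one scale and
// (optionally) one zero point per [n, k-block].
void MatMulNBitsRef(const float* A, const uint8_t* B, const float* scales,
                    const uint8_t* zero_points, float* output,
                    int M, int N, int K, int block_size) {
  const int blocks_per_K = K / block_size;
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < N; ++n) {
      float sum = 0.0f;
      for (int kb = 0; kb < blocks_per_K; ++kb) {
        // Scale/zero point are fetched once per block, as in kKernelAlgo = 2.
        const float scale = scales[n * blocks_per_K + kb];
        const float zp = zero_points
                             ? static_cast<float>(zero_points[n * blocks_per_K + kb])
                             : 128.0f;  // assumed default; check the kernel
        const uint8_t* b_block =
            B + (static_cast<size_t>(n) * blocks_per_K + kb) * block_size;
        for (int i = 0; i < block_size; ++i) {
          const int k = kb * block_size + i;
          sum += A[m * K + k] * (static_cast<float>(b_block[i]) - zp) * scale;
        }
      }
      output[m * N + n] = sum;
    }
  }
}
```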

Here's a breakdown of the three algorithms (`kKernelAlgo`):

1.  **`kKernelAlgo = 0` (Unrolling):**
* **Strategy:** This algorithm processes the K dimension by iterating in
large steps (`k_per_iter = kWarpSize * kElementsPerThreadPerIteration =
32 * 8 = 256`). Inside the main loop, it uses a macro
(`UnRollReduction`) with `#pragma unroll` directives to aggressively
unroll the innermost computations. It tries unrolling factors of 16, 4,
and 1 sequentially to cover as much of the K dimension as possible with
unrolled code.
* **Pros:** Can significantly reduce loop overhead (branching
instructions, counter updates) and expose more instruction-level
parallelism, potentially hiding memory latency.
* **Cons:** Can lead to a large increase in compiled code size (register
pressure, potential instruction cache misses). The effectiveness heavily
depends on the compiler and the specific GPU architecture. The
multi-stage unrolling adds complexity. It requires `k_per_iter` to be a
multiple of `block_size` for correct scale/zp indexing within the
unrolled loop.
* **Performance Expectation:** Potentially the highest performance *if*
the unrolling is effective on the target hardware and doesn't cause
resource issues (registers, cache). Often good for compute-bound or
latency-bound scenarios where loop overhead is a bottleneck.

2.  **`kKernelAlgo = 1` (Simple Loop):**
* **Strategy:** This algorithm also iterates along the K dimension in
steps of `k_per_iter` (256), but uses a simple `for` loop without
explicit `#pragma unroll`. It relies on the compiler's default loop
optimization capabilities.
* **Pros:** Simpler code, smaller code size compared to Algorithm 0.
Less likely to cause register pressure or instruction cache issues.
Easier for the compiler to reason about.
* **Cons:** May incur higher loop overhead compared to effective
unrolling. Performance might be lower if loop overhead is significant.
* **Performance Expectation:** A solid baseline. Might be close to
Algorithm 0 if the compiler performs implicit unrolling effectively, or
faster if Algorithm 0 suffers from code bloat penalties.

3.  **`kKernelAlgo = 2` (Block Size Iteration):**
* **Strategy:** This algorithm changes the iteration strategy
fundamentally. Instead of iterating in fixed steps of `k_per_iter`, it
iterates based on the quantization `block_size`. The outer loop runs
`blocks_per_K` (`K / block_size`) times. Inside this loop, the scale and
zero point for the *entire block* are fetched once per warp. Then, each
thread checks if its assigned K-elements (`lane_offset`) fall within the
current `block_size` chunk and processes them using the fetched
scale/zp.
* **Pros:** Directly aligns with the block quantization data structure.
Fetches scale/zero-point values less frequently (once per `block_size`
chunk per warp), potentially reducing shared memory bank conflicts or
register usage compared to calculating the index (`current_meta_k`) in
every inner step as in Algo 0/1. Might have better memory access
patterns for scale/zp data.
* **Cons:** The outer loop iterates `K / block_size` times. If
`block_size` is small (e.g., 16, 32), this could be many iterations. The
logic inside the loop (`if (current_k_base < k_end_block ...)`) adds
conditional execution.
* **Performance Expectation:** Performance depends heavily on the
`block_size`. If `block_size` is large (e.g., 128, 256), the number of
outer loop iterations is small, and the efficiency gain from fetching
scale/zp once per block might outweigh the overhead. If `block_size` is
small, the overhead of the outer loop might dominate.

**Next Step:**

1. **Profile:** The most reliable way is to benchmark all three
algorithms (`kKernelAlgo = 0, 1, 2`) on your target GPU hardware with
representative input sizes (`N`, `K`), data types (`T`), and
`block_size` values. Use profiling tools like NVIDIA Nsight Compute to
analyze performance metrics (execution time, occupancy, instruction
throughput, memory bandwidth, cache hit rates, register spills).
2.  **Hypothesize based on `block_size`:**
* For **large `block_size`** (e.g., 128, 256), Algorithm 2 might be
competitive or even the best due to efficient scale/ZP handling.
Algorithm 0 could also be very fast.
* For **small `block_size`** (e.g., 16, 32), Algorithm 0 (unroll) or
Algorithm 1 (simple loop) might outperform Algorithm 2 due to lower loop
overhead in the K dimension.
3. Compare performance with TRT LLM FpA IntB GEMM.

### Motivation and Context
4-bit quantization has accuracy loss for some LLMs; more bits are needed for some layers.
…t#24371)

### Description
<!-- Describe your changes. -->
ONNX Runtime manages a number of CPU-based accelerators, i.e. those that
can operate on CPU-based inputs.
However, several of them, like `Qnn`, `Openvino` and `Vitis`, may require
CPU-based inputs to be aligned to 4K so they can be memory mapped, or may
prefer to override the device with their own CPU-accessible allocator.

To mitigate that, we introduce a new CPU-based allocator that produces
4K-aligned memory.

We also adjust the allocation planner to override the plain CPU device. When we
detect a compiled CPU-based EP, we adjust the device accordingly by
requesting the EP to return `OrtMemType::OrtMemTypeCPUInput`. This gives
the EP an opportunity to return either a GPU/NPU device or a CPU device
depending on the mode it is operating in.

We select the device with the larger alignment between CPU default devices.

We also adjust memory patterns to make sure 4K alignment is respected in
the contiguous buffers when appropriate.
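
The alignment part of the new allocator can be sketched with standard C++ as follows (illustrative only; the real allocator plugs into ORT's IAllocator interface and the function names here are made up):

```cpp
#include <cstdlib>
#include <new>

constexpr std::size_t kPageAlignment = 4096;  // 4K, so buffers can be memory mapped / shared

// Round the size up to a multiple of the alignment, as std::aligned_alloc requires,
// then allocate page-aligned memory.
void* AllocAligned4K(std::size_t size) {
  const std::size_t rounded = (size + kPageAlignment - 1) & ~(kPageAlignment - 1);
  void* p = std::aligned_alloc(kPageAlignment, rounded);  // _aligned_malloc on MSVC
  if (p == nullptr) throw std::bad_alloc();
  return p;
}

void FreeAligned4K(void* p) {
  std::free(p);  // _aligned_free on MSVC
}
```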

### Motivation and Context
CPU-based providers notably accept CPU-based inputs, but they require
4K-aligned allocations; otherwise the input incurs an extra copy.
This is especially noticeable with intermediate values that are produced
by upstream CPU-based nodes.

Qnn has its own allocator when it is enabled; we make sure it is correctly advertised to the allocation
planner. This PR excludes Qnn allocator usage for intermediate values
due to the overhead contributed by memhandle management.


Cc: @quic-ashigarg

---------

Co-authored-by: edgchen1 <[email protected]>
### Description
1. Update the GitHub Actions pipelines' triggers and make all of them the
same.
2. Format the YAML files.

Before this change, the pipelines' triggers were set as following:

```
on:
  push:
    branches: [ main, 'rel-*']
  pull_request:
    branches: [ main, 'rel-*']
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
```

I set "cancel-in-progress: true" because for pipeline runs triggered by
pull requests if the pull request was updated(a new commit was added
there), the old pipeline runs can be cancelled. However, this setting
doesn't work well for the runs triggered by "push" events for the main
branch. Let's say, we merged a PR , then it triggered this pipeline.
Then before the pipeline is finished, we merged another PR. Then the old
pipeline run will be cancelled. But we do want it to be cancelled. Each
commit in the main branch should be verified.

### Motivation and Context
### Description
<!-- Describe your changes. -->

Fix memleakdbg call stack output.

The call stack output was getting clobbered:

`C:\dev\onnxruntime\build\Debug\_deps\googletest-src\googletest\include\gtest\internal\gtest-port.h(1631):
l\gtest-port.h(1631): eadLocal<testing::Sequence *>::GetOrCreateValue`

I think the issue is that this aliasing of `buffer` and `symbol`:

https://github.com/microsoft/onnxruntime/blob/173a11a4e7a2f7a360c9db6abbe601a06a16f004/onnxruntime/core/platform/windows/debug_alloc.cc#L97-L100

does not play nicely with a call to `_snprintf_s` like this:

https://github.com/microsoft/onnxruntime/blob/173a11a4e7a2f7a360c9db6abbe601a06a16f004/onnxruntime/core/platform/windows/debug_alloc.cc#L115

The clobbered output does not match the predefined, ignored patterns, so
we see spurious mem leak check output.

This change updates the memleakdbg output generation to use C++ ostreams
instead of fixed-size buffers and `_snprintf_s`.
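
Schematically, the aliasing problem and the ostream-based replacement look like this (a simplified sketch, not the actual debug_alloc.cc code):

```cpp
#include <cstdio>
#include <sstream>
#include <string>

// Problematic pattern: 'symbol' points into 'buffer', and 'buffer' is also the
// destination of the snprintf-style call, so the source gets overwritten mid-format.
void FormatFrameBroken(char (&buffer)[1024], const char* file, int line, const char* symbol) {
  // symbol may alias buffer here -> clobbered output (original code used _snprintf_s)
  std::snprintf(buffer, sizeof(buffer), "%s(%d): %s", file, line, symbol);
}

// Ostream-based version: build the line in a separate string, so no aliasing is possible.
std::string FormatFrame(const char* file, int line, const char* symbol) {
  std::ostringstream oss;
  oss << file << "(" << line << "): " << symbol;
  return oss.str();
}
```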

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Fix spurious mem leak check output.
Fix microsoft#24535.
### Description
This PR updates ONNX Runtime's LLM conversion tools to use [PyTorch
2.7](https://pytorch.org/blog/pytorch-2-7/) and reduces memory usage
during export.

### Motivation and Context
Importing the `transformers` package with `import transformers` will
take a long time because of the many namespaces it has at the top level.
It is more efficient to only import the desired class names.
Additionally, the benchmarking of the PyTorch model includes the deep
copy of the inputs when it does not need to. The deep copy can be
performed before measuring latency.
…icrosoft#24196)

### Description
During inference, using the QNN EP option to set enable_htp_shared_memory_allocator gives a hint that we use RPC allocated buffers to avoid buffer copy between CPU and NPU.

With the current PR, we add hints in the compilation phase that if RPC memory is going to be used, any additional  allocations done on the CPU can be avoided.

### Motivation and Context
This should help reduce peak CPU memory consumption while running AI workloads using shared memory.

Related PR: microsoft#23136

Co-authored-by: Ashish Garg (AISW) <[email protected]>
These are source files, not executables; do not set the executable
permission bit on them.
### Description
1. Add benchmark script for MatMulNBits. 
2. Update kernel based on benchmark results:
  - Change kernel back to handle m=1
  - Use simple loop kernel instead of unrolling
  - Change partial sum to float type to trade off precision and
    performance (less precision loss, no obvious performance drop)

Example output of benchmark:
```
------------------------------------------------------------------------------------------------------------------------
Benchmarking MatMulNBits on NVIDIA A100-SXM4-80GB (Compute Capability: 8.0)
------------------------------------------------------------------------------------------------------------------------
CUDA Graph   | M        | N        | K        | Bits   | Block Size | Threads  | Latency (us)    | StdDev (us)  | TFLOPS
------------------------------------------------------------------------------------------------------------------------
True         | 1        | 3072     | 8192     | 4      | 32         | 0        | 95.7            | 5.7          | 0.526
True         | 1        | 3072     | 8192     | 8      | 32         | 0        | 110.7           | 81.1         | 0.454
True         | 1        | 3072     | 8192     | 4      | 128        | 0        | 93.7            | 41.2         | 0.537
True         | 1        | 3072     | 8192     | 8      | 128        | 0        | 105.0           | 129.3        | 0.479
True         | 1        | 5120     | 3072     | 4      | 32         | 0        | 86.7            | 49.9         | 0.363
True         | 1        | 5120     | 3072     | 8      | 32         | 0        | 90.1            | 41.1         | 0.349
True         | 1        | 5120     | 3072     | 4      | 128        | 0        | 83.9            | 46.7         | 0.375
True         | 1        | 5120     | 3072     | 8      | 128        | 0        | 85.2            | 57.1         | 0.369
True         | 1        | 8192     | 3072     | 4      | 32         | 0        | 107.3           | 29.2         | 0.469
True         | 1        | 8192     | 3072     | 8      | 32         | 0        | 102.3           | 57.1         | 0.492
True         | 1        | 8192     | 3072     | 4      | 128        | 0        | 99.2            | 61.2         | 0.507
True         | 1        | 8192     | 3072     | 8      | 128        | 0        | 97.5            | 47.4         | 0.516
True         | 1        | 200064   | 3072     | 4      | 32         | 0        | 1456.4          | 11.0         | 0.844
True         | 1        | 200064   | 3072     | 8      | 32         | 0        | 1336.4          | 10.3         | 0.920
True         | 1        | 200064   | 3072     | 4      | 128        | 0        | 1261.6          | 16.6         | 0.974
True         | 1        | 200064   | 3072     | 8      | 128        | 0        | 1232.6          | 17.9         | 0.997
True         | 256      | 3072     | 8192     | 4      | 32         | 0        | 211.1           | 5.8          | 61.030
True         | 256      | 3072     | 8192     | 8      | 32         | 0        | 217.8           | 62.8         | 59.154
True         | 256      | 3072     | 8192     | 4      | 128        | 0        | 208.7           | 63.3         | 61.751
True         | 256      | 3072     | 8192     | 8      | 128        | 0        | 213.0           | 58.2         | 60.491
True         | 256      | 5120     | 3072     | 4      | 32         | 0        | 151.9           | 57.4         | 53.028
True         | 256      | 5120     | 3072     | 8      | 32         | 0        | 156.2           | 71.1         | 51.554
True         | 256      | 5120     | 3072     | 4      | 128        | 0        | 151.4           | 22.6         | 53.198
True         | 256      | 5120     | 3072     | 8      | 128        | 0        | 154.6           | 47.1         | 52.092
True         | 256      | 8192     | 3072     | 4      | 32         | 0        | 219.0           | 4.4          | 58.847
True         | 256      | 8192     | 3072     | 8      | 32         | 0        | 226.6           | 14.5         | 56.860
True         | 256      | 8192     | 3072     | 4      | 128        | 0        | 206.7           | 39.9         | 62.333
True         | 256      | 8192     | 3072     | 8      | 128        | 0        | 216.2           | 41.3         | 59.587
True         | 256      | 200064   | 3072     | 4      | 32         | 0        | 3110.9          | 11.3         | 101.152
True         | 256      | 200064   | 3072     | 8      | 32         | 0        | 3290.9          | 8.3          | 95.619
True         | 256      | 200064   | 3072     | 4      | 128        | 0        | 3055.2          | 10.2         | 102.995
True         | 256      | 200064   | 3072     | 8      | 128        | 0        | 3220.4          | 9.8          | 97.712
True         | 1024     | 3072     | 8192     | 4      | 32         | 0        | 363.6           | 40.2         | 141.754
True         | 1024     | 3072     | 8192     | 8      | 32         | 0        | 369.0           | 46.0         | 139.669
True         | 1024     | 3072     | 8192     | 4      | 128        | 0        | 362.8           | 55.6         | 142.052
True         | 1024     | 3072     | 8192     | 8      | 128        | 0        | 367.5           | 56.5         | 140.256
True         | 1024     | 5120     | 3072     | 4      | 32         | 0        | 221.6           | 58.1         | 145.383
True         | 1024     | 5120     | 3072     | 8      | 32         | 0        | 225.4           | 56.6         | 142.938
True         | 1024     | 5120     | 3072     | 4      | 128        | 0        | 220.2           | 36.9         | 146.306
True         | 1024     | 5120     | 3072     | 8      | 128        | 0        | 224.1           | 57.8         | 143.751
True         | 1024     | 8192     | 3072     | 4      | 32         | 0        | 346.2           | 41.8         | 148.854
True         | 1024     | 8192     | 3072     | 8      | 32         | 0        | 352.8           | 21.6         | 146.097
True         | 1024     | 8192     | 3072     | 4      | 128        | 0        | 344.5           | 18.9         | 149.627
True         | 1024     | 8192     | 3072     | 8      | 128        | 0        | 350.6           | 10.6         | 147.016
True         | 1024     | 200064   | 3072     | 4      | 32         | 0        | 6822.0          | 44.1         | 184.504
True         | 1024     | 200064   | 3072     | 8      | 32         | 0        | 7018.5          | 38.4         | 179.339
True         | 1024     | 200064   | 3072     | 4      | 128        | 0        | 6757.8          | 51.5         | 186.257
True         | 1024     | 200064   | 3072     | 8      | 128        | 0        | 6947.7          | 38.1         | 181.167
------------------------------------------------------------------------------------------------------------------------
```
### Motivation and Context
Follow-up to microsoft#24509
)

### Description

This PR updates how the K path is identified in Phi-4 multimodal.

### Motivation and Context

This is needed as part of the updates made to the rewritten modeling
code for the speech component of Phi-4 multimodal.
Added context to command line example when specifying platform.

### Description
Docker image build fails due to missing context in the command line
example when specifying the platform.

The dockerfiles directory is assumed to be the current directory in the
command example, so the parent directory must be specified as the
context of the `docker build` command.
…icrosoft#24220)

### Description
<!-- Describe your changes. -->

This PR adds a new CMake option:
onnxruntime_ENABLE_CONVSYMKERNELAVX2_SAT_CHECKER. When enabled, this
option activates a saturation checker for the VPMADDUBSW instruction
used in the ConvSymKernelAvx2 path.

The checker works by calling a helper function before each VPMADDUBSW
instruction. This function simulates the computation using C++ and
intrinsics with higher-precision types (int32_t) to detect whether the
result exceeds the bounds of int16_t (i.e., greater than INT16_MAX or
less than INT16_MIN).

By default, the checker logs a warning only once per inference session.
However, the logic can be easily extended to print more frequently if
needed. Developers can also reuse this pattern to implement similar
saturation checks for other instructions.
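
A sketch of what such a checker helper can look like (illustrative; the actual checker guarded by onnxruntime_ENABLE_CONVSYMKERNELAVX2_SAT_CHECKER may differ): it redoes the VPMADDUBSW pairwise u8 x s8 sums in int32_t and flags any result outside the int16_t range.

```cpp
#include <cstdint>
#include <cstdio>

// Simulate one VPMADDUBSW (u8 x s8 -> pairwise-summed i16 with saturation) in int32_t
// and warn if any pairwise sum would saturate. 'a' holds 32 unsigned bytes, 'b' 32 signed bytes.
bool CheckMaddubsSaturation(const uint8_t a[32], const int8_t b[32]) {
  bool saturated = false;
  for (int i = 0; i < 32; i += 2) {
    const int32_t sum = static_cast<int32_t>(a[i]) * b[i] +
                        static_cast<int32_t>(a[i + 1]) * b[i + 1];
    if (sum > INT16_MAX || sum < INT16_MIN) {
      saturated = true;
    }
  }
  if (saturated) {
    std::fprintf(stderr, "Warning: VPMADDUBSW would saturate; results may lose accuracy.\n");
  }
  return saturated;
}
```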

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

On some models running with AVX2 (instead of AVX-VNNI), we've observed
accuracy degradation due to saturation in vectorized instructions. This
saturation checker provides a way to debug and detect those cases by
reporting potential overflow in intermediate computations.
There are 2 tests that appear twice in the same list, so I removed the
duplicates:
- `^test_batchnorm_example_training_mode`
- `^test_batchnorm_epsilon_training_mode`

The other 3 tests passed locally, so I am enabling them to see if they
also pass on the pipelines:
- `test_batchnorm_epsilon_old`
- `test_batchnorm_example_old`
- `test_gathernd_example_int32_batch_dim1`

Sample run:
```
> .\build\Windows\Debug\Debug\onnx_test_runner.exe "C:\work\onnxruntime\build\Windows\Debug\_deps\onnx-src\onnx\backend\test\data\node\test_gathernd_example_int32_batch_dim1"
Load Test Case: gathernd_example_int32_batch_dim1 in C:\work\onnxruntime\build\Windows\Debug\_deps\onnx-src\onnx\backend\test\data\node\test_gathernd_example_int32_batch_dim1
result:
        Models: 1
        Total test cases: 1
                Succeeded: 1
                Not implemented: 0
                Failed: 0
        Stats by Operator type:
                Not implemented(0):
                Failed:
```
…oft#24575)

Bumps
[microsoft/onnxruntime-github-actions](https://github.com/microsoft/onnxruntime-github-actions)
from 0.0.5 to 0.0.6.
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/microsoft/onnxruntime-github-actions/commit/9e3f6d0517cad4c4055d8ddd8b8bbadcc08e4e9a"><code>9e3f6d0</code></a>
Release artifacts for v0.0.6</li>
<li><a
href="https://github.com/microsoft/onnxruntime-github-actions/commit/4bc5bccb384a9785d4cbe25104735780bf10a27b"><code>4bc5bcc</code></a>
Initial commit on orphan branch</li>
<li>See full diff in <a
href="https://github.com/microsoft/onnxruntime-github-actions/compare/v0.0.5...v0.0.6">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=microsoft/onnxruntime-github-actions&package-manager=github_actions&previous-version=0.0.5&new-version=0.0.6)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)


</details>

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [ruff](https://github.com/astral-sh/ruff) from 0.9.5 to 0.11.6.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/astral-sh/ruff/releases">ruff's
releases</a>.</em></p>
<blockquote>
<h2>0.11.6</h2>
<h2>Release Notes</h2>
<h3>Preview features</h3>
<ul>
<li>Avoid adding whitespace to the end of a docstring after an escaped
quote (<a
href="https://redirect.github.com/astral-sh/ruff/pull/17216">#17216</a>)</li>
<li>[<code>airflow</code>] Extract <code>AIR311</code> from
<code>AIR301</code> rules (<code>AIR301</code>, <code>AIR311</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/17310">#17310</a>,
<a
href="https://redirect.github.com/astral-sh/ruff/pull/17422">#17422</a>)</li>
</ul>
<h3>Bug fixes</h3>
<ul>
<li>Raise syntax error when <code>\</code> is at end of file (<a
href="https://redirect.github.com/astral-sh/ruff/pull/17409">#17409</a>)</li>
</ul>
<h2>Contributors</h2>
<ul>
<li><a
href="https://github.com/AlexWaygood"><code>@​AlexWaygood</code></a></li>
<li><a
href="https://github.com/BurntSushi"><code>@​BurntSushi</code></a></li>
<li><a href="https://github.com/Lee-W"><code>@​Lee-W</code></a></li>
<li><a
href="https://github.com/MatthewMckee4"><code>@​MatthewMckee4</code></a></li>
<li><a
href="https://github.com/MichaReiser"><code>@​MichaReiser</code></a></li>
<li><a
href="https://github.com/cake-monotone"><code>@​cake-monotone</code></a></li>
<li><a href="https://github.com/carljm"><code>@​carljm</code></a></li>
<li><a
href="https://github.com/charliermarsh"><code>@​charliermarsh</code></a></li>
<li><a
href="https://github.com/dcreager"><code>@​dcreager</code></a></li>
<li><a
href="https://github.com/dhruvmanila"><code>@​dhruvmanila</code></a></li>
<li><a
href="https://github.com/github-actions"><code>@​github-actions</code></a></li>
<li><a
href="https://github.com/maxmynter"><code>@​maxmynter</code></a></li>
<li><a
href="https://github.com/mishamsk"><code>@​mishamsk</code></a></li>
<li><a href="https://github.com/mtshiba"><code>@​mtshiba</code></a></li>
<li><a href="https://github.com/ntBre"><code>@​ntBre</code></a></li>
<li><a
href="https://github.com/renovate"><code>@​renovate</code></a></li>
<li><a href="https://github.com/sharkdp"><code>@​sharkdp</code></a></li>
</ul>
<h2>Install ruff 0.11.6</h2>
<h3>Install prebuilt binaries via shell script</h3>
<pre lang="sh"><code>curl --proto '=https' --tlsv1.2 -LsSf
https://github.com/astral-sh/ruff/releases/download/0.11.6/ruff-installer.sh
| sh
</code></pre>
<h3>Install prebuilt binaries via powershell script</h3>
<pre lang="sh"><code>powershell -ExecutionPolicy Bypass -c &quot;irm
https://github.com/astral-sh/ruff/releases/download/0.11.6/ruff-installer.ps1
| iex&quot;
</code></pre>
<h2>Download ruff 0.11.6</h2>
<table>
<thead>
<tr>
<th>File</th>
<th>Platform</th>
<th>Checksum</th>
</tr>
</thead>
</table>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/astral-sh/ruff/blob/main/CHANGELOG.md">ruff's
changelog</a>.</em></p>
<blockquote>
<h2>0.11.6</h2>
<h3>Preview features</h3>
<ul>
<li>Avoid adding whitespace to the end of a docstring after an escaped
quote (<a
href="https://redirect.github.com/astral-sh/ruff/pull/17216">#17216</a>)</li>
<li>[<code>airflow</code>] Extract <code>AIR311</code> from
<code>AIR301</code> rules (<code>AIR301</code>, <code>AIR311</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/17310">#17310</a>,
<a
href="https://redirect.github.com/astral-sh/ruff/pull/17422">#17422</a>)</li>
</ul>
<h3>Bug fixes</h3>
<ul>
<li>Raise syntax error when <code>\</code> is at end of file (<a
href="https://redirect.github.com/astral-sh/ruff/pull/17409">#17409</a>)</li>
</ul>
<h2>0.11.5</h2>
<h3>Preview features</h3>
<ul>
<li>[<code>airflow</code>] Add missing <code>AIR302</code> attribute
check (<a
href="https://redirect.github.com/astral-sh/ruff/pull/17115">#17115</a>)</li>
<li>[<code>airflow</code>] Expand module path check to individual
symbols (<code>AIR302</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/17278">#17278</a>)</li>
<li>[<code>airflow</code>] Extract <code>AIR312</code> from
<code>AIR302</code> rules (<code>AIR302</code>, <code>AIR312</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/17152">#17152</a>)</li>
<li>[<code>airflow</code>] Update oudated <code>AIR301</code>,
<code>AIR302</code> rules (<a
href="https://redirect.github.com/astral-sh/ruff/pull/17123">#17123</a>)</li>
<li>[syntax-errors] Async comprehension in sync comprehension (<a
href="https://redirect.github.com/astral-sh/ruff/pull/17177">#17177</a>)</li>
<li>[syntax-errors] Check annotations in annotated assignments (<a
href="https://redirect.github.com/astral-sh/ruff/pull/17283">#17283</a>)</li>
<li>[syntax-errors] Extend annotation checks to <code>await</code> (<a
href="https://redirect.github.com/astral-sh/ruff/pull/17282">#17282</a>)</li>
</ul>
<h3>Bug fixes</h3>
<ul>
<li>[<code>flake8-pie</code>] Avoid false positive for multiple
assignment with <code>auto()</code> (<code>PIE796</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/17274">#17274</a>)</li>
</ul>
<h3>Rule changes</h3>
<ul>
<li>[<code>ruff</code>] Fix <code>RUF100</code> to detect unused
file-level <code>noqa</code> directives with specific codes (<a
href="https://redirect.github.com/astral-sh/ruff/issues/17042">#17042</a>)
(<a
href="https://redirect.github.com/astral-sh/ruff/pull/17061">#17061</a>)</li>
<li>[<code>flake8-pytest-style</code>] Avoid false positive for legacy
form of <code>pytest.raises</code> (<code>PT011</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/17231">#17231</a>)</li>
</ul>
<h3>Documentation</h3>
<ul>
<li>Fix formatting of &quot;See Style Guide&quot; link (<a
href="https://redirect.github.com/astral-sh/ruff/pull/17272">#17272</a>)</li>
</ul>
<h2>0.11.4</h2>
<h3>Preview features</h3>
<ul>
<li>[<code>ruff</code>] Implement <code>invalid-rule-code</code> as
<code>RUF102</code> (<a
href="https://redirect.github.com/astral-sh/ruff/pull/17138">#17138</a>)</li>
<li>[syntax-errors] Detect duplicate keys in <code>match</code> mapping
patterns (<a
href="https://redirect.github.com/astral-sh/ruff/pull/17129">#17129</a>)</li>
<li>[syntax-errors] Detect duplicate attributes in <code>match</code>
class patterns (<a
href="https://redirect.github.com/astral-sh/ruff/pull/17186">#17186</a>)</li>
<li>[syntax-errors] Detect invalid syntax in annotations (<a
href="https://redirect.github.com/astral-sh/ruff/pull/17101">#17101</a>)</li>
</ul>
<h3>Bug fixes</h3>
<ul>
<li>[syntax-errors] Fix multiple assignment error for class fields in
<code>match</code> patterns (<a
href="https://redirect.github.com/astral-sh/ruff/pull/17184">#17184</a>)</li>
<li>Don't skip visiting non-tuple slice in <code>typing.Annotated</code>
subscripts (<a
href="https://redirect.github.com/astral-sh/ruff/pull/17201">#17201</a>)</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/astral-sh/ruff/commit/fcd50a0496d725f773c6da149035f98bd90b6a30"><code>fcd50a0</code></a>
Bump 0.11.6 (<a
href="https://redirect.github.com/astral-sh/ruff/issues/17449">#17449</a>)</li>
<li><a
href="https://github.com/astral-sh/ruff/commit/3ada36b766583c92c82bccce3519a467ae068630"><code>3ada36b</code></a>
Auto generate <code>visit_source_order</code> (<a
href="https://redirect.github.com/astral-sh/ruff/issues/17180">#17180</a>)</li>
<li><a
href="https://github.com/astral-sh/ruff/commit/bd8983821289e436c2d4c1463c118baa02c7ef5b"><code>bd89838</code></a>
[red-knot] Initial tests for protocols (<a
href="https://redirect.github.com/astral-sh/ruff/issues/17436">#17436</a>)</li>
<li><a
href="https://github.com/astral-sh/ruff/commit/b32407b6f3c300650b8a3b0a6cb1ce3c5f812c84"><code>b32407b</code></a>
[red-knot] Dataclasses: synthesize <code>__init__</code> with proper
signature (<a
href="https://redirect.github.com/astral-sh/ruff/issues/17428">#17428</a>)</li>
<li><a
href="https://github.com/astral-sh/ruff/commit/b4de245a5accc5ebe35e580a73040da8d99ed566"><code>b4de245</code></a>
[red-knot] Dataclasses: support <code>order=True</code> (<a
href="https://redirect.github.com/astral-sh/ruff/issues/17406">#17406</a>)</li>
<li><a
href="https://github.com/astral-sh/ruff/commit/914095d08f02ed91b1acf807aca89723f3632fb9"><code>914095d</code></a>
[red-knot] Super-basic generic inference at call sites (<a
href="https://redirect.github.com/astral-sh/ruff/issues/17301">#17301</a>)</li>
<li><a
href="https://github.com/astral-sh/ruff/commit/5350288d0773f986e90653c44a6304d9411b5782"><code>5350288</code></a>
[red-knot] Check assignability of bound methods to callables (<a
href="https://redirect.github.com/astral-sh/ruff/issues/17430">#17430</a>)</li>
<li><a
href="https://github.com/astral-sh/ruff/commit/649610cc98add11d8ff48c6d0fba928fb1e00262"><code>649610c</code></a>
[red-knot] Support <code>super</code> (<a
href="https://redirect.github.com/astral-sh/ruff/issues/17174">#17174</a>)</li>
<li><a
href="https://github.com/astral-sh/ruff/commit/1a79722ee0fb160f8929612508d5ee88b7838d09"><code>1a79722</code></a>
[<code>airflow</code>] Extend <code>AIR311</code> rules (<a
href="https://redirect.github.com/astral-sh/ruff/issues/17422">#17422</a>)</li>
<li><a
href="https://github.com/astral-sh/ruff/commit/b67590bfde9de44757a3365d43040b8f93c10f35"><code>b67590b</code></a>
[red-knot] simplify union size limit handling (<a
href="https://redirect.github.com/astral-sh/ruff/issues/17429">#17429</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/astral-sh/ruff/compare/0.9.5...0.11.6">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=ruff&package-manager=pip&previous-version=0.9.5&new-version=0.11.6)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)


---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Changming Sun <[email protected]>
skottmckay and others added 27 commits April 29, 2025 13:52
…ull knowledge (microsoft#24568)

### Description
<!-- Describe your changes. -->
GetDeviceInfoIfSupported -> GetSupportedDevices

The EP sees all devices so it can make decisions with full knowledge. This
is mainly applicable to GPU EPs like WebGPU.

The EP has to iterate over the devices and call CreateEpDevice for the devices it
supports.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
Fix the DML autoep selection test. It should only select one device, as that's
all the test infrastructure is set up to handle.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
…microsoft#24587)

### Description
`LoadPluginOrProviderBridge` is called when attempting to load a Plugin.
It uses the passed `library_path` to attempt to load the Plugin as a
`Provider` - using `ProviderLibrary` - to see if it can be treated as a
'ProviderBridge'. `ProviderLibrary` attempts to load the Provider by
prefixing the path to the onnxruntime.dll. Plugins needn't be
redistributed with OnnxRuntime, so the path to the Plugin _may_ be an
absolute path, and if so `ProviderLibrary` fails. At the same time -
however - `LoadPluginOrProviderBridge` needs to support
OnnxRuntime-relative paths: As 'Providers' are migrated to 'Plugins',
existing Providers should be usable as Plugins. To accommodate both
scenarios, this PR:

1. Adds support to `ProviderLibrary` to be created with an absolute
path.
2. Validates the path passed to `LoadPluginOrProviderBridge`:
   1. if it is absolute, the same absolute path is passed to
      `ProviderLibrary` and `EpLibraryPlugin`.
   2. if the path is not absolute, it is converted to an absolute path by
      prefixing the OnnxRuntime location, and the same path is passed to
      `ProviderLibrary` and `EpLibraryPlugin`.
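
The path handling can be sketched with std::filesystem as follows (names are illustrative, not the actual ORT helpers):

```cpp
#include <filesystem>

namespace fs = std::filesystem;

// If 'library_path' is already absolute, use it as-is; otherwise treat it as
// relative to the directory containing onnxruntime.dll / libonnxruntime.so.
fs::path ResolveEpLibraryPath(const fs::path& library_path, const fs::path& ort_dir) {
  if (library_path.is_absolute()) {
    return library_path;
  }
  return ort_dir / library_path;  // OnnxRuntime-relative path
}
```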

### Motivation and Context
This PR enables `LoadPluginOrProviderBridge` to be called with an
absolute path to the Plugin, allowing it to be used as a
'ProviderBridge', or with an OnnxRuntime-relative path to the Plugin.

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description
<!-- Describe your changes. -->

Add some logic to detect whether I8MM is actually supported.

This info can be read from the registry. See the helpful comments here
for more details:

https://github.com/Dr-Noob/cpufetch/blob/a0c08ccc0b64b524ad2122e0595099f73cbba9c4/src/arm/midr.c#L30-L52

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Detect I8MM correctly to enable better performance.

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
…icrosoft#24534)

1. Migrate "Linux CPU Minimal Build E2E CI Pipeline" and
"onnxruntime-binary-size-checks-ci-pipeline" to Github Action
2. Add support for building the ONNX Runtime minimal build with vcpkg.
3. Auto format the yaml files with ruamel.yaml 
4. Update vcpkg to the latest release.
- Registered the ScatterND Op in QNN EP
- Created the op as part of the Simple Op Builder
- Added unit test to verify the Op runs on QNN
- Skipping ScatterND on QNN CPU (To Do)

### Description

Add ScatterND Op Support in QNN EP



### Motivation and Context

Performance improvement, as the ScatterND op currently falls back to ORT CPU due to missing
support.
Compute 'total_sequence_length' the same way as JSEP.
### Description
<!-- Describe your changes. -->

The PR adds CPU support by following the release logistics in
https://github.com/onnx/onnx/wiki/Logistics-for-ONNX-Release-1.18.0. The
goal is to make the minimal changes needed to ensure ONNX Runtime works
fine with ONNX 1.18.0.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Essentially, incoming ONNX 1.18.0 provides the following
(1) Introduce opset 23 (included in this PR)
(2) Support Attention, RMSNormalization, and RotaryEmbedding (**NOT**
included in this PR)
(3) Support float4e2m1 (**NOT** included in this PR)

### Remaining Issues

1. onnx.patch
* ONNXRUNTIME is using static functions (shape inference) from ONNX
(microsoft#24558)
* GroupNormalization-18 is deprecated because its spec was wrong
(microsoft#24560)
* Contrib op registration api from ONNX: OpSchemaRegisterOnce is changed
to explicit, and ONNXRUNTIME was leveraging it to do fluent-chaining
style. (microsoft#24561)
2. Support float4e2m1
(microsoft#24553)
3. Support
Attention(microsoft#24554),
RMSNormalization(microsoft#24555),
and
RotaryEmbedding(microsoft#24556)
4. Disable QNN tests
### Description
Fix a corner case for Expand when the output size is 0



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This fix is required to pass YOLOv9
Fix:
```
/local/mnt/workspace/onnxruntime-qnn-ep/onnxruntime/core/providers/qnn/builder/opbuilder/softmax_op_builder.cc:
In function ‘std::vector<unsigned int>
onnxruntime::qnn::FlattenShapeFromAxis(std::vector<unsigned int>&,
int32_t)’:

/local/mnt/workspace/onnxruntime-qnn-ep/onnxruntime/core/providers/qnn/builder/opbuilder/softmax_op_builder.cc:47:28:
error: comparison of integer expressions of different signedness:
‘int32_t’ {aka ‘int’} and ‘std::vector<unsigned int>::size_type’ {aka
‘long unsigned int’} [-Werror=sign-compare]
   47 |   assert(axis >= 0 && axis < input_shape.size());
      |
```
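
The usual fix for this class of warning is to compare values of the same signedness, for example (illustrative, not necessarily the exact change made):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

void Example(const std::vector<unsigned int>& input_shape, int32_t axis) {
  // Compare like with like: cast the already-validated non-negative axis
  // to the unsigned size type before comparing with size().
  assert(axis >= 0 && static_cast<size_t>(axis) < input_shape.size());
}
```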
…t#24578)

### Description

Fix warning caused by `-Wstrict-aliasing`.
Fix transpose store op.

Test results:
```
$ ./onnxruntime_test_all
[...]
[----------] Global test environment tear-down
[==========] 4761 tests from 311 test suites ran. (47828 ms total)
[  PASSED  ] 4759 tests.
[  SKIPPED ] 2 tests, listed below:
[  SKIPPED ] MatMulFpQ4.MatMul2DSym
[  SKIPPED ] MatMulFpQ4.MatMul2DBlkZp

  YOU HAVE 6 DISABLED TESTS
```
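
For context, `-Wstrict-aliasing` typically flags loads/stores done through a type-punned pointer; the portable fix is to go through `memcpy`, which compilers optimize to the same single store. A generic sketch (not the actual transpose store change):

```cpp
#include <cstdint>
#include <cstring>

// Flagged by -Wstrict-aliasing: writing a float object through a uint32_t*.
void StoreBitsPunned(float* dst, uint32_t bits) {
  *reinterpret_cast<uint32_t*>(dst) = bits;  // aliasing violation
}

// Well-defined alternative: copy the raw bytes with memcpy; compilers emit the same store.
void StoreBits(float* dst, uint32_t bits) {
  std::memcpy(dst, &bits, sizeof(bits));
}
```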
…transformers/models/stable_diffusion/requirements (microsoft#24591)

Bumps [transformers](https://github.com/huggingface/transformers) from
4.41.2 to 4.50.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/huggingface/transformers/releases">transformers's
releases</a>.</em></p>
<blockquote>
<h1>Release v4.50.0</h1>
<h2>New Model Additions</h2>
<h3>Model-based releases</h3>
<p>Starting with version v4.49.0, we have been doing model-based
releases, additionally to our traditional, software-based monthly
releases. These model-based releases provide a tag from which models may
be installed.</p>
<p>Contrarily to our software-releases; these are not pushed to pypi and
are kept on our GitHub. Each release has a tag attributed to it, such
as:</p>
<ul>
<li><code>v4.49.0-Gemma-3</code></li>
<li><code>v4.49.0-AyaVision</code></li>
</ul>
<p>⚠️ As bugs are identified and fixed on each model, the release tags
are updated so that installing from that tag always gives the best
experience possible with that model.</p>
<p>Each new model release will always be based on the current state of
the main branch at the time of its creation. This ensures that new
models start with the latest features and fixes available.</p>
<p>For example, if two models—Gemma-3 and AyaVision—are released from
main, and then a fix for gemma3 is merged, it will look something like
this:</p>
<pre><code> o---- v4.49.0-Gemma-3 (includes AyaVision, plus main fixes)
            /                  \  
---o--o--o--o--o-- (fix for gemma3) --o--o--o main
       \          
        o---- v4.49.0-AyaVision
</code></pre>
<p>We strive to merge model specific fixes on their respective branches
as fast as possible!</p>
<h3>Gemma 3</h3>
<p><img
src="https://github.com/user-attachments/assets/2b7f31b3-02bd-496a-9d4e-a1867bd6d9d4"
alt="image" /></p>
<p>Gemma 3 is heavily referenced in the following <a
href="https://github.com/huggingface/transformers/releases/tag/v4.49.0-Gemma-3">model-based
release</a> and we recommend reading these if you want all the
information relative to that model.</p>
<p>The Gemma 3 model was proposed by Google. It is a vision-language
model composed by a <a
href="https://huggingface.co/docs/transformers/model_doc/siglip">SigLIP</a>
vision encoder and a <a
href="https://huggingface.co/docs/transformers/model_doc/gemma_2">Gemma
2</a> language decoder linked by a multimodal linear projection.</p>
<p>It cuts an image into a fixed number of tokens same way as Siglip if
the image does not exceed certain aspect ratio. For images that exceed
the given aspect ratio, it crops the image into multiple smaller pacthes
and concatenates them with the base image embedding.</p>
<p>One particularity is that the model uses bidirectional attention on
all the image tokens. Also, the model interleaves sliding window local
attention with full causal attention in the language backbone, where
each sixth layer is a full causal attention layer.</p>
<ul>
<li>Gemma3 by <a
href="https://github.com/RyanMullins"><code>@​RyanMullins</code></a> in
<a
href="https://redirect.github.com/huggingface/transformers/issues/36658">#36658</a></li>
</ul>
<h3>Shield Gemma2</h3>
<p>ShieldGemma 2 is built on <a
href="https://ai.google.dev/gemma/docs/core/model_card_3">Gemma 3</a>,
is a 4 billion (4B) parameter model that checks the safety of both
synthetic and natural images against key categories to help you build
robust datasets and models. With this addition to the Gemma family of
models, researchers and developers can now easily minimize the risk of
harmful content in their models across key areas of harm as defined
below:</p>
<ul>
<li>No Sexually Explicit content: The image shall not contain content
that depicts explicit or graphic sexual acts (e.g., pornography, erotic
nudity, depictions of rape or sexual assault).</li>
<li>No Dangerous Content: The image shall not contain content that
facilitates or encourages activities that could cause real-world harm
(e.g., building firearms and explosive devices, promotion of terrorism,
instructions for suicide).</li>
<li>No Violence/Gore content: The image shall not contain content that
depicts shocking, sensational, or gratuitous violence (e.g., excessive
blood and gore, gratuitous violence against animals, extreme injury or
moment of death).</li>
</ul>
<p>We recommend using ShieldGemma 2 as an input filter to vision
language models, or as an output filter of image generation systems. To
train a robust image safety model, we curated training datasets of
natural and synthetic images and instruction-tuned Gemma 3 to
demonstrate strong performance.</p>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/huggingface/transformers/commit/0b057e66b52556da3a1cbc29e2a98c0784ea9c33"><code>0b057e6</code></a>
fix import issue</li>
<li><a
href="https://github.com/huggingface/transformers/commit/26fbd6919af810bf508eaea8b9eb9dcee829e228"><code>26fbd69</code></a>
v 4.50.0</li>
<li><a
href="https://github.com/huggingface/transformers/commit/523f6e743c74ecea90d0c37a172c9819b5691a19"><code>523f6e7</code></a>
Fix: dtype cannot be str (<a
href="https://redirect.github.com/huggingface/transformers/issues/36262">#36262</a>)</li>
<li><a
href="https://github.com/huggingface/transformers/commit/3f9ff19b4ec7dcf4112225079f26ea756aafd211"><code>3f9ff19</code></a>
Minor Gemma 3 fixes (<a
href="https://redirect.github.com/huggingface/transformers/issues/36884">#36884</a>)</li>
<li><a
href="https://github.com/huggingface/transformers/commit/f94b0c59f20447c0e6bdb6d381ea014fa47ecac8"><code>f94b0c5</code></a>
Use <code>deformable_detr</code> kernel from the Hub (<a
href="https://redirect.github.com/huggingface/transformers/issues/36853">#36853</a>)</li>
<li><a
href="https://github.com/huggingface/transformers/commit/2638d54e7851f1323dc78a8b513b041835aba27b"><code>2638d54</code></a>
Gemma 3 tests expect greedy decoding (<a
href="https://redirect.github.com/huggingface/transformers/issues/36882">#36882</a>)</li>
<li><a
href="https://github.com/huggingface/transformers/commit/b8aadc31d56e49d8b9075e73e5c433f7c5b4e04b"><code>b8aadc3</code></a>
:red_circle: :red_circle: :red_circle: supersede paligemma forward to
shift p...</li>
<li><a
href="https://github.com/huggingface/transformers/commit/6321876b5bac106d7e7c84b53418ea31fe1d9754"><code>6321876</code></a>
add eustlb as an actor</li>
<li><a
href="https://github.com/huggingface/transformers/commit/94f487626a296deac0022dda6462c0d9f2336106"><code>94f4876</code></a>
[generate] model defaults being inherited only happens for newer models
(<a
href="https://redirect.github.com/huggingface/transformers/issues/36881">#36881</a>)</li>
<li><a
href="https://github.com/huggingface/transformers/commit/f19d018bfff1613ba05dcbf7e82c461d49aee73e"><code>f19d018</code></a>
Revert &quot;Update deprecated Jax calls (<a
href="https://redirect.github.com/huggingface/transformers/issues/35919">#35919</a>)&quot;
(<a
href="https://redirect.github.com/huggingface/transformers/issues/36880">#36880</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/huggingface/transformers/compare/v4.41.2...v4.50.0">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=transformers&package-manager=pip&previous-version=4.41.2&new-version=4.50.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)


Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
### Description
After deinitialize_onnxruntime_vitisai_ep() has run, s_domains_vitisaiep becomes invalid, which may cause an exception.

### Motivation and Context
Call deregister_xir_ops() before deinitialize_onnxruntime_vitisai_ep() to avoid dangling pointers. A sketch of the intended ordering follows.
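A minimal sketch of the teardown order, with assumed declarations; the actual signatures live in the VitisAI EP sources:

```c++
#include <vector>

struct OrtCustomOpDomain;  // opaque here; the real definition is in the ORT C API

// Assumed declarations for this sketch; the real ones live in the VitisAI EP.
extern std::vector<OrtCustomOpDomain*> s_domains_vitisaiep;
void deregister_xir_ops(const std::vector<OrtCustomOpDomain*>& domains);
void deinitialize_onnxruntime_vitisai_ep();

// The fix is purely an ordering change: deregister the XIR op domains while
// s_domains_vitisaiep is still valid, then tear down the EP library state.
void ShutdownVitisAIEp() {
  deregister_xir_ops(s_domains_vitisaiep);
  deinitialize_onnxruntime_vitisai_ep();
}
```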

Co-authored-by: GenMing Zhong <[email protected]>
)

This PR enables matmul8bits for the dp4/subgroupMatrix path in webgpu.

This PR is separated from microsoft#24546 for easier review.
### Description
This PR incorporates the changes requested in PR 24394.

Changes are summarized below:
1. Reordered enable_ovep_qdq_optimizer to appear before all output
parameters, as suggested in review. Other parameters were also reordered
for clarity.
2. Replaced the non-release build check with the RELEASE flag for clarity.
This allows every build configuration except release to dump the model, as
sketched below.
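A rough sketch of the gate described in item 2; only the RELEASE macro comes from the PR text, and the helper shown here is hypothetical:

```c++
#include <string>

// Hypothetical helper; stands in for whatever the OVEP uses to write the model out.
void DumpSerializedModel(const std::string& serialized_model, const std::string& path);

void MaybeDumpModel(const std::string& serialized_model) {
#ifndef RELEASE
  // Every build configuration except release keeps model dumping compiled in.
  DumpSerializedModel(serialized_model, "ovep_model_dump.onnx");
#else
  (void)serialized_model;  // compiled out entirely in release builds
#endif
}
```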
### Description
Add int64 as a supported datatype for moving nodes to the CoreML EP.

We already convert constants automatically from int64 to int32 for
CoreML by calling narrow.

This PR adds the same conversion for outputs as well; a minimal sketch of the checked narrowing follows.
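The sketch below illustrates the kind of checked narrowing involved; the helper is illustrative only, not the EP's actual narrow() call:

```c++
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <vector>

// Illustrative checked narrowing from int64 to int32: values CoreML can represent
// pass through; anything out of range fails loudly instead of silently wrapping.
std::vector<int32_t> NarrowInt64ToInt32(const std::vector<int64_t>& values) {
  std::vector<int32_t> out;
  out.reserve(values.size());
  for (int64_t v : values) {
    if (v < std::numeric_limits<int32_t>::min() || v > std::numeric_limits<int32_t>::max()) {
      throw std::runtime_error("int64 value out of int32 range; cannot map to CoreML");
    }
    out.push_back(static_cast<int32_t>(v));
  }
  return out;
}
```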

### Motivation and Context
- More nodes supported on CoreML

### Note on the Unsqueeze op
According to microsoft#22975, there is a bug with the Unsqueeze op with scalar
inputs on x86.

I was running into a bug with Unsqueeze ops that expand a scalar input
into a tensor of shape [1], since CoreML's MLProgram does not support
scalar values. I adapted the HandleX86ArchUnsqueeze method; alternatives
would be to replace the op with an Identity operator or to add some
additional checks. I went with adapting HandleX86ArchUnsqueeze since it
seemed like the fastest solution.
### Description
- Introduces `USE_<EP>_PROVIDER_INTERFACE` pre-processor macros that
indicate when an EP interface is enabled but the full EP is not being
compiled (see the sketch after this list).
- Previously, the CMake configuration turned on `USE_<EP>` for both use
cases. This prevented tests from determining whether the full EP or only
the interface was available, which caused test failures. It also turned
on all EP code paths in core ORT code at the same time, which caused
compilation and logic errors.
- Adds the new NV EP to the list of EPs whose interface is enabled when ORT
is built with `--enable_generic_interface`.
- Updates the Windows Arm64 QNN CI Pipeline to actually use the
`--enable_generic_interface` flag.
- Previously, it was not actually being passed to the build command, so
no unit tests were being run with the flag enabled.
- Adds unit tests to check that adding an EP to the session options
fails when only the generic interface (but not the full EP) is built.
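A rough sketch of how the new macros separate the two build modes; the guarded bodies here are placeholder comments, not actual ORT code:

```c++
// USE_QNN                    -> the full QNN EP is compiled in.
// USE_QNN_PROVIDER_INTERFACE -> only the EP interface is available (no kernels).
#if defined(USE_QNN) || defined(USE_QNN_PROVIDER_INTERFACE)
// Code that only needs the EP to be addressable (e.g. adding it to session
// options by name) compiles in either configuration.
#endif

#if defined(USE_QNN)
// Code and tests that need the full EP implementation stay guarded on the
// original macro, so interface-only builds no longer pull them in.
#endif
```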

#### CI Pipelines that use --enable_generic_interface
- Windows ARM64 QNN CI Pipeline:
- Builds ORT with `--use_qnn --enable_generic_interface` and runs all
normal QNN EP unit tests.
- Builds ORT with `--use_qnn --enable_generic_interface` and runs new
unit tests that try to add the following EPs to the session options
(expect failure): OpenVINO, CUDA, NV, TensorRT, VitisAI
- Build and Test OpenVINO EP (AlmaLinux8, Py3.12) / build_test_pipeline:
- Builds ORT with `--use_openvino --enable_generic_interface` and runs
all normal OpenVINO EP unit tests.
- Builds ORT with `--use_openvino --enable_generic_interface` and runs
new unit tests that try to add the following EPs to the session options
(expect failure): QNN, CUDA, NV, TensorRT, VitisAI
- windows_x64_release_ep_generic_interface
- Builds ORT with `--enable_generic_interface` and now runs CPU EP unit
tests (didn't previously).

### Motivation and Context
Fix use of `--enable_generic_interface` and make sure tests actually
run.
### Description
Update the QNN nuget package to use the Arm64x binary.
Enable building with the generic interface.
Copy the QNN libs as part of the QNN EP project build instead of the test_all project.
Update the DML nuget package to enable the generic interface, and pack the shared.dll into the package.
…ders/impl/gather_op_builder.cc. (microsoft#24609)

### Description

Fix unused variable warning in
onnxruntime/core/providers/coreml/builders/impl/gather_op_builder.cc.

### Motivation and Context

Fix build.
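For illustration only, this is the usual shape such a fix takes; the actual change in the gather op builder may simply delete the declaration:

```c++
// A value that is only needed in some build configurations can be kept but
// annotated, which silences -Wunused-variable without changing behavior.
int Example(int input) {
  [[maybe_unused]] const int saved = input;
  return input + 1;
}
```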
…lls (microsoft#24606)

### Description
Fixes microsoft#24500

- Fixes local build of onnxruntime.dll to have a valid version, such as
"1.23.0", instead of the literal string "ORT_VERSION"
- Adds version info to onnxruntime_providers_qnn.dll,
onnxruntime_providers_cuda.dll, and onnxruntime_providers_tensorrt.dll.
It was missing completely. This was done by adding
`onnxruntime_providers_*.rc` files to define each EP's [DLL version
info](https://learn.microsoft.com/en-us/windows/win32/menurc/versioninfo-resource).

Fixed onnxruntime.dll version info (local non-ADO build): screenshot attached to the PR.

Fixed onnxruntime_providers_qnn.dll version info (adds the QNN SDK version too): screenshot attached to the PR.


### Motivation and Context
We create dlls with invalid or missing version info.
### Description

This PR adds support for atomic types for program output. Applying an atomic
type to a program output can be done in the following way:
```c++
program.AddOutput({output_tensor, ProgramTensorMetadataDependency::TypeAndRank, ProgramOutput::Atomic});
```
The last argument, `ProgramOutput::Atomic`, marks the output as atomic.

The support for atomic types is minimal. According to the [spec](https://www.w3.org/TR/WGSL/#atomic-types), the only valid operations on atomic objects are the [atomic builtin functions](https://www.w3.org/TR/WGSL/#atomic-builtin-functions). This means atomic values cannot be read or written the normal way: the indices helper's Get* and Set* functions will not work on atomic types, so use the WGSL builtin functions directly. The OffsetToIndices and IndicesToOffset functions still work.
### Description

While cleaning up the options, I missed the part in the provider bridge
that translates session options to TRT options.
To better integrate with current IHV work, I adopted the approach that
QNN and OV use to pipe through session options. Since all of this is
string-based, it would be great to be able to access a general point of
truth like `EpContextModelGenerationOptions` in the provider wrapped types.

https://github.com/microsoft/onnxruntime/blob/6df620675290d97d7e406faf232b8b521333b6e8/onnxruntime/core/framework/session_options.h#L73

This is a fix on top of microsoft#24456; @ankan-ban and @chilo-ms to review. A sketch of the string-based lookup is below.
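To make the string-based piping concrete, here is a sketch of the kind of lookup involved. The config keys are the standard ORT EP-context session option keys; the settings struct and destination fields are illustrative only:

```c++
#include <string>

// Illustrative target for the translated options; not the real TRT option struct.
struct TrtEpContextSettings {
  bool dump_ep_context_model = false;
  std::string ep_context_file_path;
};

// Works with any config-options object exposing GetConfigOrDefault(key, default),
// which is how QNN and OV read these settings from the session options today.
template <typename ConfigOptionsLike>
TrtEpContextSettings ReadEpContextSettings(const ConfigOptionsLike& config_options) {
  TrtEpContextSettings settings;
  settings.dump_ep_context_model =
      config_options.GetConfigOrDefault("ep.context_enable", "0") == "1";
  settings.ep_context_file_path =
      config_options.GetConfigOrDefault("ep.context_file_path", "");
  return settings;
}
```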
### Description
The Windows TRT version was set to 10.8 when CI was migrating to GitHub Actions;
reset it to the latest 10.9.
The Linux TRT CI and other packaging CIs have no issue, as they are correctly
set to 10.9.


[QNN EP] Add Einsum support for some equations. The intent is not to support all equations, but to enable them case by case to improve performance.
@jatinwadhwa921 jatinwadhwa921 requested a review from ankitm3k May 2, 2025 05:54
@jatinwadhwa921 jatinwadhwa921 merged commit e354009 into ovep-develop May 2, 2025
4 of 7 checks passed