Conversation

@Rohanjames1997
Contributor

Description

This PR makes the following changes:

  1. Reroutes pointwise convolution to use GEMM, since a 1×1 convolution is mathematically the same operation. This also unlocks GEMM's performance benefits, as shown in the Performance section below.
  2. Makes the convolution kernels branchless, using built-in MLAS functions such as MlasBlendFloat32x4.
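For context, the equivalence in point 1 can be sketched in NumPy: a 1×1 (pointwise) convolution over an NCHW tensor is exactly a GEMM between the (Cout × Cin) weight matrix and the input flattened to (Cin × H·W). The shapes below are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, Cin, Cout, H, W = 1, 8, 16, 5, 5  # hypothetical shapes
x = rng.standard_normal((N, Cin, H, W)).astype(np.float32)
w = rng.standard_normal((Cout, Cin, 1, 1)).astype(np.float32)

# Pointwise (1x1) convolution, written directly:
# out[n, o, h, w] = sum_c x[n, c, h, w] * w[o, c, 0, 0]
conv = np.einsum('nchw,oc->nohw', x, w[:, :, 0, 0])

# The same computation as a GEMM: (Cout x Cin) @ (Cin x H*W),
# with the spatial dims flattened into the GEMM's N dimension.
gemm = (w[:, :, 0, 0] @ x.reshape(N, Cin, H * W)).reshape(N, Cout, H, W)

assert np.allclose(conv, gemm, atol=1e-5)
```

Because the reshape is free for contiguous NCHW data, routing the pointwise case through the GEMM kernel costs nothing extra and inherits all of its tuning.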

Performance

This speeds up any Conv model that uses the pointwise kernel.
For example, MobileNet inference throughput improves from 500 to 590 inferences/sec (~18%).

Testing

  • Build passed: ./build.sh --config=Release --build_shared_lib --parallel --cmake_extra_defines onnxruntime_USE_ARM_NEON_NCHWC=ON
  • Unit tests passed: ./build/Linux/Release/onnxruntime_mlas_test --gtest_filter=Conv2dNchwc_*
  • Perf: ./build/Linux/Release/onnxruntime_perf_test -x 32 -I -m times -r 2000 ~/scripts/mobilenet.onnx

Happy to run additional perf tests as required.

@Rohanjames1997
Contributor Author

@hariharans29 this may be of interest to you 🙂

TIA!

@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@hariharans29
Member

Nice, thank you! Is there a performance uplift from removing the branches in the kernel, or is the main perf benefit coming from switching the pointwise kernel to use the GEMM kernel?

@Rohanjames1997
Contributor Author

Rohanjames1997 commented Dec 2, 2025

The majority of the perf gain is from using GEMM. Making the kernels branchless yields no measurable gain, but it is better SIMD practice, and MLAS has good support for it with these built-in functions.
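To illustrate the branchless pattern being discussed, here is a minimal NumPy sketch (not MLAS's actual implementation) of the select-by-mask idea behind intrinsics like MlasBlendFloat32x4: instead of branching per lane, both inputs are combined with bitwise ops on the float bit patterns. The helper name `blend_f32x4` is hypothetical.

```python
import numpy as np

def blend_f32x4(mask, a, b):
    """Branchless select, in the spirit of a SIMD blend such as
    MlasBlendFloat32x4: picks a[i] where mask[i] is set, else b[i],
    using bitwise ops on the float bit patterns instead of branches."""
    m = np.where(mask, np.uint32(0xFFFFFFFF), np.uint32(0))
    av = a.view(np.uint32)
    bv = b.view(np.uint32)
    return ((av & m) | (bv & ~m)).view(np.float32)

a = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
b = np.array([9.0, 8.0, 7.0, 6.0], dtype=np.float32)
mask = np.array([True, False, True, False])
print(blend_f32x4(mask, a, b))  # -> [1. 8. 3. 6.]
```

The win is not usually raw speed on modern predictors; it is that every lane takes the same code path, which keeps the SIMD pipeline uniform and the kernel easier to vectorize.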

@Rohanjames1997
Contributor Author

I ran the failing CI tests - onnxruntime_global_thread_pools_test and onnxruntime_test_all - on the main branch, and they fail even there 🤔

Looks like these are required CIs too. Any idea what can be done?

@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@hariharans29
Member

> I ran the failing CI tests - onnxruntime_global_thread_pools_test and onnxruntime_test_all - on the main branch, and they fail even there 🤔
>
> Looks like these are required CIs too. Any idea what can be done?

Hmm - I don't see it on other PRs. Let's see what happens on this run.

