Conversation

@Rohanjames1997
Contributor

Description

This PR makes the following changes:

  1. Reroutes pointwise convolution to use GEMM, since a 1×1 convolution is mathematically the same operation. This also unlocks GEMM's performance benefits, as shown in the Performance section below.
  2. Makes the convolution kernels branchless, using built-in MLAS functions such as MlasBlendFloat32x4.
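For context, the equivalence in point 1 can be sketched in NumPy: a 1×1 (pointwise) convolution over an NCHW tensor is exactly a GEMM between the (Cout × Cin) weight matrix and the input flattened to (Cin × H·W). The shapes below are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, Cin, Cout, H, W = 1, 8, 16, 5, 5  # hypothetical shapes
x = rng.standard_normal((N, Cin, H, W)).astype(np.float32)
w = rng.standard_normal((Cout, Cin, 1, 1)).astype(np.float32)

# Pointwise (1x1) convolution, written directly:
# out[n, o, h, w] = sum_c x[n, c, h, w] * w[o, c, 0, 0]
conv = np.einsum('nchw,oc->nohw', x, w[:, :, 0, 0])

# The same computation as a GEMM: (Cout x Cin) @ (Cin x H*W),
# with the spatial dims flattened into the GEMM's N dimension.
gemm = (w[:, :, 0, 0] @ x.reshape(N, Cin, H * W)).reshape(N, Cout, H, W)

assert np.allclose(conv, gemm, atol=1e-5)
```

Because the reshape is free for contiguous NCHW data, routing the pointwise case through the GEMM kernel costs nothing extra and inherits all of its tuning.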

Performance

This speeds up any Conv model that uses the pointwise kernel.
For example, MobileNet inference throughput improves from 500 to 590 inferences/sec (~18%).

Testing

  • Build passed: ./build.sh --config=Release --build_shared_lib --parallel --cmake_extra_defines onnxruntime_USE_ARM_NEON_NCHWC=ON
  • Unit tests passed: ./build/Linux/Release/onnxruntime_mlas_test --gtest_filter=Conv2dNchwc_*
  • Perf: ./build/Linux/Release/onnxruntime_perf_test -x 32 -I -m times -r 2000 ~/scripts/mobilenet.onnx

Happy to run additional perf tests as required.

@Rohanjames1997
Contributor Author

@hariharans29 this may be of interest to you 🙂

TIA!

@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@hariharans29
Member

Nice, thank you! Is there a performance uplift from removing the branches in the kernel, or is the main perf benefit coming from switching the pointwise kernel to use the GEMM kernel?

@Rohanjames1997
Contributor Author

Rohanjames1997 commented Dec 2, 2025

The majority of the perf gain is from using GEMM. Making the kernels branchless yields no measurable gain, but it is better SIMD practice, and MLAS has good support for it with these built-in functions.
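To illustrate the branchless pattern being discussed, here is a minimal NumPy sketch (not MLAS's actual implementation) of the select-by-mask idea behind intrinsics like MlasBlendFloat32x4: instead of branching per lane, both inputs are combined with bitwise ops on the float bit patterns. The helper name `blend_f32x4` is hypothetical.

```python
import numpy as np

def blend_f32x4(mask, a, b):
    """Branchless select, in the spirit of a SIMD blend such as
    MlasBlendFloat32x4: picks a[i] where mask[i] is set, else b[i],
    using bitwise ops on the float bit patterns instead of branches."""
    m = np.where(mask, np.uint32(0xFFFFFFFF), np.uint32(0))
    av = a.view(np.uint32)
    bv = b.view(np.uint32)
    return ((av & m) | (bv & ~m)).view(np.float32)

a = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
b = np.array([9.0, 8.0, 7.0, 6.0], dtype=np.float32)
mask = np.array([True, False, True, False])
print(blend_f32x4(mask, a, b))  # -> [1. 8. 3. 6.]
```

The win is not usually raw speed on modern predictors; it is that every lane takes the same code path, which keeps the SIMD pipeline uniform and the kernel easier to vectorize.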

@Rohanjames1997
Contributor Author

I ran the failing CI tests - onnxruntime_global_thread_pools_test and onnxruntime_test_all - on the main branch, and they fail even there 🤔

Looks like these are required CIs too. Any idea what can be done?

@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@hariharans29
Member

> I ran the failing CI tests - onnxruntime_global_thread_pools_test and onnxruntime_test_all - on the main branch, and they fail even there 🤔
>
> Looks like these are required CIs too. Any idea what can be done?

Hmm - I don't see it on other PRs. Let's see what happens on this run.

