Simplify & optimize Arm64 NCHWc Convolution kernels #26691
base: main
Conversation
@hariharans29 this may be of interest to you 🙂 TIA!
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline
Azure Pipelines successfully started running 4 pipeline(s).
Nice, thank you! Is there a performance uplift from removing the branches in the kernel, or does the main perf benefit come from switching the pointwise kernel to the GEMM kernel?
The majority of the perf gain is from using GEMM. Making the kernel branchless yields no noticeable gain on its own, but I realized it's better SIMD practice, and MLAS has great support for it with its built-in blend functions.
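To illustrate the branchless idea discussed above, here is a scalar model of a SIMD lane-wise blend (what NEON's vbslq_f32 does, and roughly what an MLAS blend helper like MlasBlendFloat32x4 wraps; the exact MLAS signature is an assumption here). Instead of branching per element, each output lane is selected from one of two inputs by a bit mask:

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Scalar model of a 4-lane float blend: for each lane, take the bits of `a`
// where the mask bits are set, else the bits of `b`. This turns a per-element
// branch such as `x > 0 ? x : 0` (ReLU) into straight-line, branch-free code.
std::array<float, 4> BlendFloat32x4(const std::array<uint32_t, 4>& mask,
                                    const std::array<float, 4>& a,
                                    const std::array<float, 4>& b) {
  std::array<float, 4> out{};
  for (int i = 0; i < 4; ++i) {
    uint32_t ai, bi;
    std::memcpy(&ai, &a[i], sizeof(ai));
    std::memcpy(&bi, &b[i], sizeof(bi));
    const uint32_t r = (ai & mask[i]) | (bi & ~mask[i]);
    std::memcpy(&out[i], &r, sizeof(r));
  }
  return out;
}
```

A branchless ReLU, for example, builds the mask with a lane-wise compare (all-ones where x > 0) and blends x against a zero vector, so the pipeline never takes a data-dependent branch.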
I ran the failing CI tests. It looks like these are required CIs, too. Any idea what can be done?
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline
Azure Pipelines successfully started running 4 pipeline(s).
Hmm - I don't see it on other PRs. Let's see what happens on this run. |
Description
This PR makes the following changes:
- Replaces the conditional branches in the NCHWc convolution kernels with the branchless MlasBlendFloat32x4 blend helper.
- Switches the pointwise (1x1) convolution path to use the GEMM kernel.

Performance
This speeds up any Conv model that uses the pointwise kernel. For example, MobileNet inference improves from 500 inf/sec to 590 inf/sec (an ~18% improvement).
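The pointwise-to-GEMM mapping rests on a simple identity: a 1x1 convolution with C_in input channels and C_out output channels over H*W pixels is exactly a (C_out x C_in) by (C_in x H*W) matrix product, so it can be handed to a tuned GEMM kernel. A naive reference sketch (layout and names are illustrative, not the MLAS implementation):

```cpp
#include <cstddef>
#include <vector>

// Pointwise (1x1) convolution expressed as a plain GEMM over flattened pixels:
//   Y[co][p] = sum_ci W[co][ci] * X[ci][p]   for each pixel p in [0, H*W).
// W is C_out x C_in, X is C_in x (H*W), Y is C_out x (H*W), all row-major.
void PointwiseConvAsGemm(const std::vector<float>& W,
                         const std::vector<float>& X,
                         std::vector<float>& Y,
                         std::size_t Cout, std::size_t Cin, std::size_t HW) {
  for (std::size_t co = 0; co < Cout; ++co) {
    for (std::size_t p = 0; p < HW; ++p) {
      float acc = 0.0f;
      for (std::size_t ci = 0; ci < Cin; ++ci) {
        acc += W[co * Cin + ci] * X[ci * HW + p];
      }
      Y[co * HW + p] = acc;
    }
  }
}
```

Because the kernel has no spatial extent, no im2col reshuffle is needed; the feature map is already the right-hand matrix, which is why the GEMM path dominates the perf gain.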
Testing
./build.sh --config=Release --build_shared_lib --parallel --cmake_extra_defines onnxruntime_USE_ARM_NEON_NCHWC=ON
./build/Linux/Release/onnxruntime_mlas_test --gtest_filter=Conv2dNchwc_*
./build/Linux/Release/onnxruntime_perf_test -x 32 -I -m times -r 2000 ~/scripts/mobilenet.onnx

Happy to run additional perf tests as required.