Due to a branch prediction issue on the Kunpeng processor, we found that memset_generic performs poorly when setting medium sizes. We therefore restructured the logic and unrolled the loop in set_long by a factor of 4 to solve the problem. Sizes below 1K also benefit from this change.
Another change is that DC ZVA does not seem to help when setting zero, so we dropped it and use set_long to set zero instead. Having fewer branches to predict also gives the zero case a slight improvement.
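
For illustration only, the shape of a 4x unrolled set loop can be sketched in C as below. This is not the code in memset_kunpeng.S: the function name sketch_memset_unrolled, the 16-byte block and the 64-byte step are assumptions chosen to show one loop branch per four stores, with zero taking the same path as any other value.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch, not the assembly from this patch. */
static void *sketch_memset_unrolled(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    unsigned char block[16];

    assert(dst != NULL || n == 0);

    /* Replicate the byte into a 16-byte pattern (assumed to stand in
       for a vector register holding the fill value). */
    for (size_t i = 0; i < sizeof block; i++)
        block[i] = (unsigned char)c;

    /* Main loop unrolled 4 times: four 16-byte stores per iteration,
       so the loop branch is taken once per 64 bytes. */
    while (n >= 64) {
        memcpy(p,      block, 16);
        memcpy(p + 16, block, 16);
        memcpy(p + 32, block, 16);
        memcpy(p + 48, block, 16);
        p += 64;
        n -= 64;
    }

    /* Tail: the remaining 0..63 bytes, one byte at a time. */
    while (n > 0) {
        *p++ = (unsigned char)c;
        n--;
    }

    /* c == 0 is not special-cased; it uses the same loop. */
    return dst;
}
```

Calling sketch_memset_unrolled(buf, 0, len) exercises exactly the same loop as a non-zero fill, which mirrors the idea of removing the separate zeroing branch.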
Checked on aarch64-linux-gnu.
525de033a9 aarch64: Optimized memset for Kunpeng processor.
sysdeps/aarch64/multiarch/Makefile | 2 +-
sysdeps/aarch64/multiarch/ifunc-impl-list.c | 1 +
sysdeps/aarch64/multiarch/memset.c | 5 +-
sysdeps/aarch64/multiarch/memset_kunpeng.S | 113 ++++++++++++++++++++++++++++
4 files changed, 119 insertions(+), 2 deletions(-)