The performance improvement is about 20%-30% for larger cases and about 1%-5% for smaller cases.
Used SIMD load/store instead of GPR for large overlapping forward moves.
Reused existing memcpy implementation for smaller or overlapping backward moves.
Fixed the existing memcpy implementation so that it handles the overlapping case.
Simplified loop tails in the memcpy implementation: use a branchless, overlapping sequence of fixed-length loads/stores instead of branching on the size.
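The branchless-tail idea can be sketched in C (a hypothetical illustration with a made-up helper name; the real code does this with fixed-length SIMD loads/stores in assembly). For a copy whose size is known to be in a range such as 16..32 bytes, copy the first 16 and the last 16 bytes unconditionally; the two stores overlap whenever the size is below 32, so every byte is covered with no size-dependent branch tree:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: copy n bytes where 16 <= n <= 32.
   Load both halves before storing, so the routine is also safe
   for overlapping src/dst (memmove semantics).  The head and
   tail regions overlap when n < 32, covering all n bytes. */
static void copy16_32(uint8_t *dst, const uint8_t *src, size_t n)
{
    uint8_t head[16], tail[16];
    memcpy(head, src, 16);           /* first 16 bytes */
    memcpy(tail, src + n - 16, 16);  /* last 16 bytes, may overlap head */
    memcpy(dst, head, 16);
    memcpy(dst + n - 16, tail, 16);
}
```

In the assembly version the two temporaries are simply SIMD registers, so the whole tail is four instructions regardless of the exact size within the range.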
A cleanup/optimization converting sequences of str instructions to stp.
Added __memmove_thunderx2 to the list of available implementations.
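The forward/backward split described above is the standard memmove direction test. A minimal C sketch (byte-at-a-time for clarity; this is not the thunderx2 code, which copies in wide SIMD chunks): when the destination lies below the source, a forward copy never clobbers bytes it has yet to read, and otherwise the copy must run backward from the end:

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal memmove sketch: choose copy direction by address order
   so overlapping regions are handled correctly. */
static void *my_memmove(void *dstv, const void *srcv, size_t n)
{
    uint8_t *dst = dstv;
    const uint8_t *src = srcv;
    if ((uintptr_t)dst < (uintptr_t)src) {
        for (size_t i = 0; i < n; i++)   /* forward copy */
            dst[i] = src[i];
    } else {
        for (size_t i = n; i-- > 0; )    /* backward copy */
            dst[i] = src[i];
    }
    return dstv;
}
```

The patch applies the same choice at a coarser grain: large forward-overlapping moves get the new SIMD path, while smaller and backward-overlapping moves fall through to the existing memcpy code.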
32e902a94e aarch64: thunderx2 memmove performance improvements
ChangeLog | 11 +
sysdeps/aarch64/multiarch/ifunc-impl-list.c | 1 +
sysdeps/aarch64/multiarch/memcpy_thunderx2.S | 572 ++++++++++-----------------
sysdeps/aarch64/multiarch/memmove.c | 5 +-
4 files changed, 227 insertions(+), 362 deletions(-)