aarch64: Optimized memset for falkor

System Internals / glibc - Siddhesh Poyarekar [sourceware.org] - 20 November 2017 12:55 EST

The generic memset reads dczid_el0 on every memset. This has a significant impact on falkor for a range of sizes because reading dczid_el0 is slow.

The DZP bit in the dczid_el0 register does not change dynamically, so it is safe to read once during program startup. With this patch dczid_el0 is read once during startup and zva_size is cached. This is used to invoke the falkor-specific memset; the generic memset routine remains unchanged.

The gains due to this are significant for falkor, with run time reductions as high as 48%. Here's a sample from the falkor tests:

Function: memset
Variant: walk simple_memset __memset_falkor __memset_generic

length=256, char=0: 139.96 (-698.28%) 9.07 ( 48.26%) 17.53 length=257, char=0: 140.50 (-699.03%) 9.53 ( 45.80%) 17.58 length=258, char=0: 140.96 (-703.95%) 9.58 ( 45.36%) 17.53 length=259, char=0: 141.56 (-705.16%) 9.53 ( 45.79%) 17.58 length=260, char=0: 142.15 (-710.76%) 9.57 ( 45.39%) 17.53 length=261, char=0: 142.50 (-710.39%) 9.53 ( 45.78%) 17.58 length=262, char=0: 142.97 (-715.09%) 9.57 ( 45.42%) 17.54 length=263, char=0: 143.51 (-716.18%) 9.53 ( 45.80%) 17.58 length=264, char=0: 143.93 (-720.55%) 9.58 ( 45.39%) 17.54 length=265, char=0: 144.56 (-722.07%) 9.53 ( 45.80%) 17.59 length=266, char=0: 144.98 (-726.42%) 9.58 ( 45.42%) 17.54 length=267, char=0: 145.53 (-727.53%) 9.53 ( 45.80%) 17.59 length=268, char=0: 146.25 (-731.81%) 9.53 ( 45.79%) 17.58 length=269, char=0: 146.52 (-735.39%) 9.53 ( 45.66%) 17.54 length=270, char=0: 146.97 (-735.81%) 9.53 ( 45.80%) 17.58 length=271, char=0: 147.54 (-741.08%) 9.58 ( 45.38%) 17.54 length=512, char=0: 268.26 (-1307.85%) 12.06 ( 36.71%) 19.05 length=513, char=0: 268.73 (-1273.89%) 13.56 ( 30.68%) 19.56 length=514, char=0: 269.31 (-1276.89%) 13.56 ( 30.68%) 19.56 length=515, char=0: 269.73 (-1279.05%) 13.56 ( 30.68%) 19.56 length=516, char=0: 270.34 (-1282.24%) 13.56 ( 30.67%) 19.56 length=517, char=0: 270.83 (-1284.71%) 13.56 ( 30.66%) 19.56 length=518, char=0: 271.20 (-1286.54%) 13.56 ( 30.67%) 19.56 length=519, char=0: 271.67 (-1288.67%) 13.65 ( 30.24%) 19.56 length=520, char=0: 272.14 (-1291.04%) 13.65 ( 30.22%) 19.56 length=521, char=0: 272.66 (-1293.69%) 13.65 ( 30.23%) 19.56 length=522, char=0: 273.14 (-1296.13%) 13.65 ( 30.20%) 19.56 length=523, char=0: 273.64 (-1298.75%) 13.65 ( 30.23%) 19.56 length=524, char=0: 274.34 (-1302.16%) 13.66 ( 30.20%) 19.57 length=525, char=0: 274.64 (-1297.78%) 13.56 ( 30.99%) 19.65 length=526, char=0: 275.20 (-1300.04%) 13.56 ( 31.01%) 19.66 length=527, char=0: 275.66 (-1302.86%) 13.56 ( 30.99%) 19.65 length=1024, char=0: 524.46 (-2169.75%) 20.12 ( 12.92%) 23.11 length=1025, char=0: 525.14 (-2124.63%) 21.62 ( 8.40%) 23.61 length=1026, char=0: 525.59 (-2125.36%) 21.88 ( 7.37%) 23.62 length=1027, char=0: 525.98 (-2127.14%) 21.62 ( 8.46%) 23.62 length=1028, char=0: 526.68 (-2131.10%) 21.62 ( 8.42%) 23.61 length=1029, char=0: 527.10 (-2131.70%) 21.79 ( 7.73%) 23.62 length=1030, char=0: 527.54 (-2118.51%) 21.62 ( 9.10%) 23.78 length=1031, char=0: 527.98 (-2136.37%) 21.62 ( 8.43%) 23.61 length=1032, char=0: 528.70 (-2139.38%) 21.62 ( 8.43%) 23.61 length=1033, char=0: 529.25 (-2124.37%) 21.62 ( 9.11%) 23.79 length=1034, char=0: 529.48 (-2142.95%) 21.62 ( 8.43%) 23.61 length=1035, char=0: 530.11 (-2145.13%) 21.62 ( 8.44%) 23.61 length=1036, char=0: 530.76 (-2147.10%) 21.79 ( 7.73%) 23.62 length=1037, char=0: 531.03 (-2149.45%) 21.62 ( 8.42%) 23.61 length=1038, char=0: 531.64 (-2151.87%) 21.62 ( 8.42%) 23.61 length=1039, char=0: 531.99 (-2151.63%) 21.80 ( 7.75%) 23.63

- sysdeps/aarch64/memset-reg.h: New file.
- sysdeps/aarch64/memset.S: Use it. (__memset): Rename to MEMSET macro. [ZVA_MACRO]: Use zva_macro.
- sysdeps/aarch64/multiarch/Makefile (sysdep_routines): Add memset_generic and memset_falkor.
- sysdeps/aarch64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Add memset ifuncs.
- sysdeps/aarch64/multiarch/init-arch.h (INIT_ARCH): New local variable zva_size.
- sysdeps/aarch64/multiarch/memset.c: New file.
- sysdeps/aarch64/multiarch/memset_generic.S: New file.
- sysdeps/aarch64/multiarch/memset_falkor.S: New file.
- sysdeps/aarch64/multiarch/rtld-memset.S: New file.
- sysdeps/unix/sysv/linux/aarch64/cpu-features.c (DCZID_DZP_MASK): New macro. (DCZID_BS_MASK): Likewise. (init_cpu_features): Read and set zva_size.
- sysdeps/unix/sysv/linux/aarch64/cpu-features.h (struct cpu_features): New member zva_size.

5a67c4fa01 aarch64: Optimized memset for falkor
ChangeLog | 21 ++++++++++
sysdeps/aarch64/memset-reg.h | 30 +++++++++++++++
sysdeps/aarch64/memset.S | 27 +++++--------
sysdeps/aarch64/multiarch/Makefile | 2 +-
sysdeps/aarch64/multiarch/ifunc-impl-list.c | 5 +++
sysdeps/aarch64/multiarch/init-arch.h | 8 ++--
sysdeps/aarch64/multiarch/memset.c | 40 +++++++++++++++++++
sysdeps/aarch64/multiarch/memset_falkor.S | 53 ++++++++++++++++++++++++++
sysdeps/aarch64/multiarch/memset_generic.S | 27 +++++++++++++
sysdeps/aarch64/multiarch/rtld-memset.S | 23 +++++++++++
sysdeps/unix/sysv/linux/aarch64/cpu-features.c | 10 +++++
sysdeps/unix/sysv/linux/aarch64/cpu-features.h | 1 +
12 files changed, 225 insertions(+), 22 deletions(-)

Upstream: sourceware.org


  • Share