[arm] Implement usadv16qi and ssadv16qi standard names

Programming / Compilers / GCC - ktkachov [138bc75d-0d04-0410-961f-82ee72b054a4] - 12 June 2019 08:27 EDT

This patch implements the usadv16qi and ssadv16qi standard names for arm.

The V16QImode variant is important as it is the most commonly used pattern: reducing vectors of bytes into an int. The midend expects the optab to compute the absolute differences of operands 1 and 2 and reduce them while widening along the way up to SImode. So the inputs are V16QImode and the output is V4SImode.
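As a rough illustration of what the midend expects from the unsigned variant, here is a minimal scalar sketch in C. The function names and the assignment of byte differences to particular SImode lanes are assumptions made for this example only; since the pattern is used for a reduction, the sketch assumes only the sum over the four output lanes matters.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scalar model of usadv16qi (names are illustrative only).
   op1/op2 stand for the V16QImode inputs, op3 for the V4SImode
   accumulator input, out for the V4SImode result.  */
void usad_v16qi_model(uint32_t out[4], const uint8_t op1[16],
                      const uint8_t op2[16], const uint32_t op3[4])
{
    for (int lane = 0; lane < 4; lane++)
        out[lane] = op3[lane];

    for (int i = 0; i < 16; i++) {
        /* Widen to int before subtracting so the difference cannot wrap. */
        int d = abs((int)op1[i] - (int)op2[i]);
        assert(d >= 0 && d <= 255);
        /* Lane choice is arbitrary here (assumption): four bytes per lane. */
        out[i / 4] += (uint32_t)d;
    }
}

/* Horizontal sum of the four SImode lanes, i.e. the final reduction. */
uint32_t sum_v4si(const uint32_t v[4])
{
    return v[0] + v[1] + v[2] + v[3];
}
```

Calling usad_v16qi_model once per 16-byte chunk with the previous result as op3, then sum_v4si at the end, should match the scalar loop in the example further down.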

I've based my solution on the current AArch64 implementation of the usadv16qi and ssadv16qi standard names (r260437). It emits the following sequence of instructions:

VABDL.u8 tmp, op1, op2 # op1, op2 lowpart
VABAL.u8 tmp, op1, op2 # op1, op2 highpart
VPADAL.u16 op3, tmp
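The three-instruction sequence above can be sketched in plain C for the unsigned case. This is an emulation written for illustration, not code from the patch: the *_model function names are hypothetical, and each helper only mimics the lane behaviour of the corresponding NEON instruction.

```c
#include <assert.h>
#include <stdint.h>

/* VABDL.U8 model: widening absolute difference of eight byte lanes. */
void vabdl_u8_model(uint16_t tmp[8], const uint8_t a[8], const uint8_t b[8])
{
    for (int i = 0; i < 8; i++)
        tmp[i] = (uint16_t)(a[i] > b[i] ? a[i] - b[i] : b[i] - a[i]);
}

/* VABAL.U8 model: widening absolute difference, accumulated into tmp. */
void vabal_u8_model(uint16_t tmp[8], const uint8_t a[8], const uint8_t b[8])
{
    for (int i = 0; i < 8; i++) {
        tmp[i] = (uint16_t)(tmp[i] + (a[i] > b[i] ? a[i] - b[i] : b[i] - a[i]));
        /* After VABDL + VABAL each lane holds at most 2 * 255.  */
        assert(tmp[i] <= 510);
    }
}

/* VPADAL.U16 model: add adjacent 16-bit pairs, accumulate into 32-bit lanes. */
void vpadal_u16_model(uint32_t acc[4], const uint16_t tmp[8])
{
    for (int j = 0; j < 4; j++)
        acc[j] += (uint32_t)tmp[2 * j] + tmp[2 * j + 1];
}

/* The whole sequence: op1/op2 are 16 bytes each, op3 is the accumulator. */
void usad_sequence_model(uint32_t op3[4], const uint8_t op1[16],
                         const uint8_t op2[16])
{
    uint16_t tmp[8];
    vabdl_u8_model(tmp, op1, op2);         /* low halves of op1, op2  */
    vabal_u8_model(tmp, op1 + 8, op2 + 8); /* high halves of op1, op2 */
    vpadal_u16_model(op3, tmp);
}
```

The 16-bit intermediate cannot overflow within one iteration, which is why a single widening to 32 bits at the VPADAL step is enough.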

So, for the code:

$ arm-none-linux-gnueabihf-gcc -S -O3 -march=armv8-a+simd -mfpu=auto -mfloat-abi=hard usadv16qi.c -dp

#define N 1024
unsigned char pix1[N];
unsigned char pix2[N];

int foo (void)
{
  int i_sum = 0;
  int i;
  for (i = 0; i < N; i++)
    i_sum += __builtin_abs (pix1[i] - pix2[i]);
  return i_sum;
}

we now generate on arm:
foo:
        movw    r3, #:lower16:pix2      @ 57    [c=4 l=4]  *arm_movsi_vfp/3
        movt    r3, #:upper16:pix2      @ 58    [c=4 l=4]  *arm_movt/0
        vmov.i32        q9, #0  @ v4si  @ 3     [c=4 l=4]  *neon_movv4si/2
        movw    r2, #:lower16:pix1      @ 59    [c=4 l=4]  *arm_movsi_vfp/3
        movt    r2, #:upper16:pix1      @ 60    [c=4 l=4]  *arm_movt/0
        add     r1, r3, #1024   @ 8     [c=4 l=4]  *arm_addsi3/4
.L2:
        vld1.8  {q11}, [r3]!    @ 11    [c=8 l=4]  *movmisalignv16qi_neon_load
        vld1.8  {q10}, [r2]!    @ 10    [c=8 l=4]  *movmisalignv16qi_neon_load
        cmp     r1, r3  @ 21    [c=4 l=4]  *arm_cmpsi_insn/2
        vabdl.u8        q8, d20, d22    @ 12    [c=8 l=4]  neon_vabdluv8qi
        vabal.u8        q8, d21, d23    @ 15    [c=88 l=4]  neon_vabaluv8qi
        vpadal.u16      q9, q8  @ 16    [c=8 l=4]  neon_vpadaluv8hi
        bne     .L2             @ 22    [c=16 l=4]  arm_cond_branch
        vadd.i32        d18, d18, d19   @ 24    [c=120 l=4]  quad_halves_plusv4si
        vpadd.i32       d18, d18, d18   @ 25    [c=8 l=4]  neon_vpadd_internalv2si
        vmov.32 r0, d18[0]      @ 30    [c=12 l=4]  vec_extractv2sisi/1

instead of:
foo:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        movw    r3, #:lower16:pix1
        movt    r3, #:upper16:pix1
        vmov.i32        q9, #0  @ v4si
        movw    r2, #:lower16:pix2
        movt    r2, #:upper16:pix2
        add     r1, r3, #1024
.L2:
        vld1.8  {q8}, [r3]!
        vld1.8  {q11}, [r2]!
        vmovl.u8        q10, d16
        cmp     r1, r3
        vmovl.u8        q8, d17
        vmovl.u8        q12, d22
        vmovl.u8        q11, d23
        vsub.i16        q10, q10, q12
        vsub.i16        q8, q8, q11
        vabs.s16        q10, q10
        vabs.s16        q8, q8
        vaddw.s16       q9, q9, d20
        vaddw.s16       q9, q9, d21
        vaddw.s16       q9, q9, d16
        vaddw.s16       q9, q9, d17
        bne     .L2
        vadd.i32        d18, d18, d19
        vpadd.i32       d18, d18, d18
        vmov.32 r0, d18[0]

2019-06-12 Przemyslaw Wirkus

- config/arm/iterators.md (VABAL): New int iterator.
- config/arm/neon.md (<sup>sadv16qi): New define_expand.
- config/arm/unspecs.md ("unspec"): Define UNSPEC_VABAL_S, UNSPEC_VABAL_U
values.

- gcc.target/arm/ssadv16qi.c: New test.
- gcc.target/arm/usadv16qi.c: Likewise.

b1a4ffbd1cd [arm] Implement usadv16qi and ssadv16qi standard names
gcc/ChangeLog | 7 +++++++
gcc/config/arm/iterators.md | 3 +++
gcc/config/arm/neon.md | 26 ++++++++++++++++++++++++++
gcc/config/arm/unspecs.md | 2 ++
gcc/testsuite/ChangeLog | 5 +++++
gcc/testsuite/gcc.target/arm/ssadv16qi.c | 29 +++++++++++++++++++++++++++++
gcc/testsuite/gcc.target/arm/usadv16qi.c | 29 +++++++++++++++++++++++++++++
7 files changed, 101 insertions(+)

Upstream: gcc.gnu.org

