aarch64: vp8: Optimize vp8_idct_add_neon for aarch64

Martin Storsjö [martin.st] - 19 February 2019 09:46 EST

The previous version was a fairly exact translation of the arm
version. This version does some unnecessary arithmetic (it does more
operations on vectors that are only half filled; it does 4 uaddw and
4 sqxtun instead of 2 of each), but it reduces the overhead of packing
data together (which could be done for free in the arm version).
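
For readers unfamiliar with the two instructions named above, here is a
rough scalar model in C of what one lane of uaddw followed by sqxtun does
when a residual is added to a pixel. It is only a sketch of the arithmetic,
not the assembly in this commit; the function names (uaddw_lane,
sqxtun_lane, add_residual_row4) are made up for illustration, and the
4-pixel row is an assumption based on the 4x4 VP8 transform block.

```c
#include <stdint.h>

/* One lane of UADDW: add a zero-extended 8-bit pixel to a 16-bit value,
 * wrapping modulo 2^16 (which matches a two's-complement signed add). */
static uint16_t uaddw_lane(uint16_t acc, uint8_t pixel)
{
    return (uint16_t)(acc + pixel);
}

/* One lane of SQXTUN: treat the 16 bits as a signed value and saturate
 * it to the unsigned 8-bit range 0..255. */
static uint8_t sqxtun_lane(uint16_t bits)
{
    if (bits & 0x8000)      /* sign bit set: negative, clamp to 0 */
        return 0;
    if (bits > 255)         /* too large for a byte, clamp to 255 */
        return 255;
    return (uint8_t)bits;
}

/* Hypothetical helper: add four 16-bit residuals to four pixels in place. */
static void add_residual_row4(uint8_t *dst, const int16_t *residual)
{
    for (int i = 0; i < 4; i++) {
        uint16_t sum = uaddw_lane((uint16_t)residual[i], dst[i]);
        dst[i] = sqxtun_lane(sum);
    }
}
```

Doing 4 of each instruction instead of 2 means each one processes a
vector that is only half filled (one 4-pixel row), which is the extra
arithmetic the message refers to.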

This gives a decent speedup on Cortex A53, a minor speedup on A72 and a very minor slowdown on Cortex A73.

Before:            Cortex A53    A72    A73
vp8_idct_add_neon:       79.7   67.5   65.0
After:
vp8_idct_add_neon:       67.7   64.8   66.7
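
To put these numbers in relative terms, a small helper like the one below
can be used. It assumes that lower values are better, as the description
of the speedups implies; percent_change is an illustrative name, not part
of the codebase.

```c
/* Relative change from `before` to `after`, in percent.
 * Negative means the value went down, i.e. faster if lower is better. */
static double percent_change(double before, double after)
{
    return (after - before) / before * 100.0;
}
```

With the figures above this gives roughly -15.1% on Cortex A53, -4.0% on
A72 and +2.6% on A73.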

7e42d5f0a aarch64: vp8: Optimize vp8_idct_add_neon for aarch64
libavcodec/aarch64/vp8dsp_neon.S | 49 ++++++++++++++++++++--------------------
1 file changed, 25 insertions(+), 24 deletions(-)

