Timings for Arrandale:
          C    SSE
win32:  2108   334
win64:  1152   322
Factorizing the inner loop with a call/jmp costs more than 15 cycles, even when the jmp destination is aligned.
Unrolling for ARCH_X86_64 gains 20 cycles.
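To illustrate the two code shapes being compared, here is a minimal C sketch. It is not the code from dcadsp.asm: the names mac_block, synth_factored and synth_unrolled are hypothetical, and a plain multiply-accumulate stands in for the real inner loop. It only shows the structural difference between routing the loop body through a shared helper and duplicating it inline, and makes no attempt to reproduce the timings above.

```c
#include <stddef.h>

/* Hypothetical stand-in for the inner loop: a plain multiply-accumulate. */
static float mac_block(const float *win, const float *buf, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += win[i] * buf[i];
    return acc;
}

/* Factorized shape: each half goes through the shared helper.
 * In hand-written asm this is the call/jmp variant. */
static float synth_factored(const float *win, const float *buf, size_t n)
{
    size_t half = n / 2;
    return mac_block(win, buf, half)
         + mac_block(win + half, buf + half, n - half);
}

/* Unrolled shape: the loop body is duplicated inline, so there is no
 * call or jump into shared code, at the price of larger code size. */
static float synth_unrolled(const float *win, const float *buf, size_t n)
{
    size_t half = n / 2;
    float a = 0.0f, b = 0.0f;
    for (size_t i = 0; i < half; i++) {
        a += win[i] * buf[i];
        b += win[half + i] * buf[half + i];
    }
    if (n & 1) /* odd length: the second half has one extra element */
        b += win[n - 1] * buf[n - 1];
    return a + b;
}
```

Both functions compute the same sum. A C compiler is free to inline the helper, so the cost difference described above applies to the hand-written assembly, not necessarily to this sketch.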
08e3ea6 x86: synth filter float: implement SSE2 version
libavcodec/synth_filter.c | 1 +
libavcodec/synth_filter.h | 1 +
libavcodec/x86/dcadsp.asm | 152 ++++++++++++++++++++++++++++++++++++++++++
libavcodec/x86/dcadsp_init.c | 28 ++++++++
4 files changed, 182 insertions(+)
Upstream: git.libav.org