This new iterator works in a separable way; that is, for each destination scanline, it scales the two source scanlines involved and then caches them so that they can be reused for subsequent destination scanlines.
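For illustration, the caching scheme can be sketched roughly as follows. This is a simplified single-channel sketch, not the code in the patch: the names sketch_iter_t, sketch_get_line and sketch_fetch, the 16.16 fixed-point coordinates, the 8-bit weights and the "y & 1" slot selection are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_WIDTH 64   /* hypothetical limit, keeps the sketch static */

typedef struct {
    int     y;                       /* source row cached here, -1 if empty */
    uint8_t pix[SKETCH_MAX_WIDTH];   /* that row, already scaled horizontally */
} sketch_line_t;

typedef struct {
    const uint8_t *src;              /* src_w * src_h samples, one channel */
    int            src_w, src_h;
    int            dst_w;
    uint32_t       ux;               /* 16.16 source step per destination pixel */
    sketch_line_t  lines[2];         /* the two cached scanlines */
    int            n_hscaled;        /* number of horizontal passes performed */
} sketch_iter_t;

static void sketch_init(sketch_iter_t *it, const uint8_t *src,
                        int src_w, int src_h, int dst_w, uint32_t ux)
{
    assert(src_w > 0 && src_h > 0);
    assert(dst_w > 0 && dst_w <= SKETCH_MAX_WIDTH);
    it->src = src;
    it->src_w = src_w;
    it->src_h = src_h;
    it->dst_w = dst_w;
    it->ux = ux;
    it->lines[0].y = it->lines[1].y = -1;
    it->n_hscaled = 0;
}

/* Return source row y scaled to dst_w pixels, reusing the cache if possible. */
static const uint8_t *sketch_get_line(sketch_iter_t *it, int y)
{
    sketch_line_t *line = &it->lines[y & 1];

    if (line->y != y) {
        const uint8_t *row = it->src + (size_t)y * (size_t)it->src_w;
        for (int x = 0; x < it->dst_w; x++) {
            uint32_t fx = (uint32_t)x * it->ux;
            int x0 = (int)(fx >> 16);
            if (x0 > it->src_w - 1)
                x0 = it->src_w - 1;
            int x1 = x0 + 1 < it->src_w ? x0 + 1 : it->src_w - 1;
            uint32_t w = (fx >> 8) & 0xff;     /* horizontal weight, 0..255 */
            line->pix[x] = (uint8_t)((row[x0] * (256 - w) + row[x1] * w) >> 8);
        }
        line->y = y;
        it->n_hscaled++;
    }
    return line->pix;
}

/* Produce one destination scanline at 16.16 source coordinate fy. */
static void sketch_fetch(sketch_iter_t *it, uint32_t fy, uint8_t *out)
{
    int y0 = (int)(fy >> 16);
    if (y0 > it->src_h - 1)
        y0 = it->src_h - 1;
    int y1 = y0 + 1 < it->src_h ? y0 + 1 : it->src_h - 1;
    uint32_t w = (fy >> 8) & 0xff;             /* vertical weight, 0..255 */

    /* Adjacent rows have different parity, so they never share a slot. */
    const uint8_t *top = sketch_get_line(it, y0);
    const uint8_t *bot = sketch_get_line(it, y1);

    for (int x = 0; x < it->dst_w; x++)
        out[x] = (uint8_t)((top[x] * (256 - w) + bot[x] * w) >> 8);
}
```

When upscaling, consecutive destination scanlines usually map to the same pair of source rows, so in this sketch the horizontal pass runs only when a new source row is needed and the remaining work per scanline is the vertical blend.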
There are two versions of the code, one that uses 64 bit arithmetic, and one that uses 32 bit arithmetic only. The latter version is used on 32 bit systems, where it is expected to be faster.
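As a rough illustration of the difference between the two variants, the sketch below interpolates two a8r8g8b8 pixels either with all four channels packed into one 64 bit word or with two channels per 32 bit word. The function names, the 0..256 weight range and the lane layout are assumptions chosen for the example; the layout used by the patch may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Spread the four 8-bit channels into 16-bit lanes: 00AA 00GG 00RR 00BB. */
static uint64_t sketch_expand(uint32_t p)
{
    uint64_t ag = (p >> 8) & 0x00ff00ff;
    uint64_t rb = p & 0x00ff00ff;
    return (ag << 32) | rb;
}

/* 64 bit variant: a single multiply-add pair handles all four channels. */
static uint32_t sketch_lerp_64(uint32_t l, uint32_t r, uint32_t w)
{
    assert(w <= 256);
    /* Each lane ends up <= 255 * 256, so lanes never carry into each other. */
    uint64_t v = sketch_expand(l) * (256 - w) + sketch_expand(r) * w;
    uint32_t ag = (uint32_t)(v >> 32) & 0xff00ff00;
    uint32_t rb = ((uint32_t)v >> 8) & 0x00ff00ff;
    return ag | rb;
}

/* 32 bit variant: two channels per word, so twice as many multiplies,
 * but no 64 bit arithmetic is needed. */
static uint32_t sketch_lerp_32(uint32_t l, uint32_t r, uint32_t w)
{
    assert(w <= 256);
    uint32_t l_ag = (l >> 8) & 0x00ff00ff, r_ag = (r >> 8) & 0x00ff00ff;
    uint32_t l_rb = l & 0x00ff00ff,        r_rb = r & 0x00ff00ff;
    uint32_t ag = l_ag * (256 - w) + r_ag * w;
    uint32_t rb = l_rb * (256 - w) + r_rb * w;
    return (ag & 0xff00ff00) | ((rb >> 8) & 0x00ff00ff);
}
```

Both functions compute the same result; on a system without native 64 bit registers the 64 bit multiplies would typically be split into several instructions, which is presumably why the 32 bit variant is expected to be faster there.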
This scheme saves a substantial amount of arithmetic for larger scalings; the per-pixel times for various configurations as reported by scaling-bench are graphed here:
The "sse2" graph is the current default on x86, "mmx" is with sse2 disabled, and "old c" is with both sse2 and mmx disabled. The "new 32" and "new 64" graphs show times for the new code. As the graphs show, the 64 bit version of the new code beats "old c" for all scaling ratios.
The data was taken on a Sandy Bridge Core i3-2350M CPU @ 2.0 GHz running in 64 bit mode.
The data used to generate the graph is available in this directory:
There is also a Gnumeric spreadsheet v2.gnumeric containing the per-pixel values and the graph.
V2:
- Add error message in the OOM/bad matrix case
- Save some shifts by storing the cached scanlines in AGBR order
- Special cased version that uses 32 bit arithmetic when sizeof(long) <= 4
3518a0d Add an iterator that can fetch bilinearly scaled images
pixman/pixman-fast-path.c | 241 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 241 insertions(+)