The frontend optimizations are to:

1. Reorganize logically connected basic blocks so they are either in
   the same cache line or in adjacent cache lines.
2. Avoid cases where basic blocks unnecessarily cross cache lines.
3. Try to 32-byte align any basic blocks possible without sacrificing
   code size. Smaller / less hot basic blocks are used for this.
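The first two rules above can be sketched in a few lines. This is an illustrative model only (Python, with the 64-byte cache line assumed by the patch); the helper names are made up for this sketch and do not appear in the patch:

```python
CACHE_LINE = 64  # x86 cache line size assumed throughout the patch


def crosses_cache_line(start, size, line=CACHE_LINE):
    """True if a basic block at offset `start` with `size` bytes spans a
    cache-line boundary (rule 2 tries to avoid this layout)."""
    return (start % line) + size > line


def is_32_byte_aligned(start):
    """True if a block start satisfies the 32-byte alignment of rule 3."""
    return start % 32 == 0
```

For example, a 2-byte block placed at offset 63 straddles two cache lines and costs an extra fetch, while the same block at offset 62 does not.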
Overall code size shrank by 168 bytes. This should make up for any extra costs of aligning to 64 bytes.
In general, performance before this change deviated a great deal depending on whether entry alignment % 64 was 0, 16, 32, or 48. These changes essentially make the current implementation at least equal to the best-aligned variant of the original for any arguments.
The only additional optimization is in the page cross case. The branch-on-equals case was removed from the size == [4, 7] case. As well, the [4, 7] and [2, 3] cases were swapped, as [4, 7] is likely a hotter argument size.
test-memcmp and test-wmemcmp are both passing.
1bd8b8d58f x86: Optimize memcmp-evex-movbe.S for frontend behavior and size
sysdeps/x86_64/multiarch/memcmp-evex-movbe.S | 434 +++++++++++++++------------
1 file changed, 242 insertions(+), 192 deletions(-)