New optimizations for X86 in upcoming GCC 5.0

Part1. Vectorization of loads/stores group.

GCC 5.0 significantly improves vector code quality for load groups and store groups. By loads/stores group I mean iterated consecutive sequence of loads/stores. For example:

x = a[i], y = a[i + 1], z = a[i + 2] iterated by “i” is loads group of size 3

Group size is distance between smallest and largest loads/stores addresses. In the example (i + 2) – (i) + 1 = 3

Number of loads/stores in a group is less or equal to group size. For example:

x = a[i], z = a[I + 2] iterated by “i” is still loads group of size 3, but has only 2 loads.

GCC 4.9 vectorizes groups of any power of 2 size (2, 4, 8 …).

GCC 5.0 vectorizes groups of size 3 and any power of 2 (2, 4, 8 …). Other sizes (5, 6, 7, 9,...) are rarely used.

The most frequent case where loads/stores groups are applicable is array of structures.

Image conversion (RGB structure to some other)
(test case to try at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52252)
N-dimentional coordinates. (Normalize array of XYZ points)
(test case to try at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61403)
Multiplication of vectors by constant matrix:

a[i][0] = 7 * b[i][0] - 3 * b[i][1];
a[i][1] = 2 * b[i][0] + b[i][1];

Basically, GCC 5.0

Introduces vectorization of load/store groups of size 3
Improves load groups vectorization for all supported sizes
Maximizes load/store groups performance by generating code that is more optimal for particular x86 CPU.

Below is the table that tries to estimate performance impact on byte structures (maximum number of elements in a vector) comparing GCC 4.9 and GCC 5.0 performance on the following loop:

  int i, j, k;
  byte *in = a, *out = b;
  for (i = 0; i < 1024; i++)
    {
      for (k = 0; k < STGSIZE; k++)
        {
          byte s = 0;
          for (j = 0; j < LDGSIZE; j++)
            s += in[j] * c[j][k];
          out[k] = s;
        }
      in += LDGSIZE;
      out += STGSIZE;
    }

Where

“c” is constant matrix:

const byte c[8][8] = {1, -1, 1, -1, 1, -1, 1, -1,
                      1, 1, -1, -1, 1, 1, -1, -1,
                      1, 1, 1, 1, -1, -1, -1, -1,
                      -1, 1, -1, 1, -1, 1, -1, 1,
                      -1, -1, 1, 1, -1, -1, 1, 1,
                      -1, -1, -1, -1, 1, 1, 1, 1,
                      -1, -1, -1, 1, 1, 1, -1, 1,
                      1, -1, 1, 1, 1, -1, -1, -1};

Used to simplify calculations in the loop to just “add” and “sub” which are usually very fast.

“in“ and “out” pointers to global arrays "a[1024 * LDGSIZE]" and "b[1024 * STGSIZE]"
byte is unsigned char
LDGSIZE and STGSIZE – macros defining loads and stores group sizes accordingly

Compilation options "-Ofast" plus "-march=slm" for Silvermont, "-march=core-avx2" for Haswell and all combinations of -DLDGSIZE={1,2,3,4,8} -DSTGSIZE={1,2,3,4,8}

GCC 5.0 to 4.9 gain (in times, bigger is better).

Silvermont: Intel(R) Atom(TM) CPU C2750 @ 2.41GHz

gain is up to 6.5 times!

As we can see results for stores group of size 3 are not so good. This is because vectorization of stores group of size 3 currently takes 8 “pshufb” instructions which are about 5 ticks on Silvermont. However the loop is still vectorized and if there were more costly calculations inside the loop vectorization would give gain (we can see that for loads groups of size 2, 3, 4 and 8).

Haswell: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz

gain is up to 3 times!

Here the results for stores groups of size 3 are much better as "pshufb" instruction on Haswell takes only 1 tick. The biggest improvement we see for newly implemented loads/stores groups of size 3.

You can find compilers used for the measurements at:

GCC 4.9: https://gcc.gnu.org/gcc-4.9

GCC 5.0 trunk built at revision 217914: https://gcc.gnu.org/viewcvs/gcc/trunk/?pathrev=218160

Download matrix.zip