AltiVec Programming

Compiler and Other Practical Issues

In the Good Old Days I did my first PowerPC programming with the Metrowerks compiler. I must have been spoiled. If you write a leaf routine (a routine that doesn't call any other routines) and pass floating-point values as parameters, your first 12 floating-point parameters and first 15 (or so) local variables are automatically put in registers. This makes it really easy to write fast, efficient code: you just make sure that each line of C code performs one operation that converts into a single machine instruction.
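As an illustration of that one-operation-per-line style, here is a small leaf routine. The function name and its arithmetic are invented for this sketch; whether each statement really becomes a single instruction depends on the compiler and the target.

```c
/* Hypothetical leaf routine: it calls nothing, takes floating-point
   parameters, and performs one arithmetic operation per statement. */
static float blend(float a, float b, float t)
{
    float d = b - a;   /* one subtract */
    float s = d * t;   /* one multiply */
    float r = a + s;   /* one add */
    return r;
}
```

With three parameters and three locals, this stays well inside the register counts mentioned above.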

Unfortunately, GCC (the GNU C Compiler, or GNU Compiler Collection) offers no such efficiency, even if you turn the optimization all the way up. Instead, you have to explicitly declare your register variables, naming the register that each one should occupy.

Even with this construct, you cannot get more than 12 vector registers: v20 through v31. GCC will accept a declaration that names a register outside this range (for a variable called, say, v_bad), but will mysteriously and silently fail to generate the proper code when that variable is used.

Splat from a Variable

It is common to want to load from a scalar variable into all elements of a vector. The quickest way to do this is with 4 instructions:

v1 = vec_ld(0, &scalar);
temp = vec_lvsl(0, &scalar); // temp is declared "vector char"
v1 = vec_perm(v1, v1, temp);
v1 = vec_splat(v1, 0);
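To see why the four steps work, here is a minimal portable C sketch that emulates them byte by byte. The vec128 type and the emu_ functions are hypothetical stand-ins, not AltiVec intrinsics. They assume the usual behavior of the real operations: vec_ld ignores the low 4 address bits, vec_lvsl produces the byte indices sh, sh+1, ..., sh+15 (sh being those low 4 bits), and vec_perm picks bytes out of the 32-byte concatenation of its first two arguments. A 32-bit scalar is assumed.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one 128-bit vector register. */
typedef struct { uint8_t b[16]; } vec128;

/* Like vec_ld: load the aligned 16-byte block that contains addr.
   The caller must own that whole block. */
static vec128 emu_ld(const void *addr)
{
    vec128 v;
    const uint8_t *base = (const uint8_t *)addr - ((uintptr_t)addr & 15);
    memcpy(v.b, base, 16);
    return v;
}

/* Like vec_lvsl: byte i of the result is sh + i, where sh = addr & 15. */
static vec128 emu_lvsl(const void *addr)
{
    vec128 v;
    unsigned sh = (unsigned)((uintptr_t)addr & 15);
    for (int i = 0; i < 16; i++)
        v.b[i] = (uint8_t)(sh + i);
    return v;
}

/* Like vec_perm: the low 5 bits of each control byte select one byte
   from the 32 bytes formed by a followed by b. */
static vec128 emu_perm(vec128 a, vec128 b, vec128 c)
{
    vec128 r;
    for (int i = 0; i < 16; i++) {
        unsigned idx = c.b[i] & 31;
        r.b[i] = (idx < 16) ? a.b[idx] : b.b[idx - 16];
    }
    return r;
}

/* Like vec_splat on 32-bit elements: copy element elem into all four. */
static vec128 emu_splat32(vec128 a, int elem)
{
    vec128 r;
    for (int i = 0; i < 4; i++)
        memcpy(r.b + 4 * i, a.b + 4 * elem, 4);
    return r;
}

/* The four-step sequence from the text. */
static vec128 splat_from_variable(const uint32_t *scalar)
{
    vec128 v1 = emu_ld(scalar);        /* aligned block holding the scalar */
    vec128 temp = emu_lvsl(scalar);    /* rotation control from the address */
    v1 = emu_perm(v1, v1, temp);       /* rotate the scalar into element 0 */
    return emu_splat32(v1, 0);         /* replicate element 0 */
}

/* Read 32-bit element i back out of an emulated vector. */
static uint32_t word_of(vec128 v, int i)
{
    uint32_t w;
    memcpy(&w, v.b + 4 * i, 4);
    return w;
}
```

Calling splat_from_variable on any element of a 16-byte-aligned array of four uint32_t values yields a vector whose four words all equal that element, whichever of the four positions it occupied.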

Simple Operators

Quadword Arithmetic

Here is a set of routines that treat a vector as a 128-bit signed or unsigned value, performing the arithmetic and other operations that you can perform on a normal 32-bit word.

vec_ne_0(x) - boolean test for x not equal to zero. If x is all 0's the result will be all 0's; otherwise the result will be all 1's.

r = vec_ne_0(x):
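   p1 = (vector char) (0, 4, 8, 12, 0, 4, 8, 12, 0, 4, 8, 12, 0, 4, 8, 12);
   k0 = vec_splat_s8(0);
- - - - -
   r = vec_cmpne_u32(x, k0);   // each word: all 1's if that word of x is nonzero
   r = vec_perm(r, r, p1);     // every word now holds one byte from each word's mask
   r = vec_cmpne_u32(r, k0);   // all 1's everywhere if any word of x was nonzero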
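The three steps can be checked with a small portable C emulation. This is only a sketch: vec128 and the emu_ functions are hypothetical stand-ins for the vector type and for the vec_cmpne_u32 and vec_perm operations used in the listing, and the final step is assumed to compare the permuted result r against zero.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one 128-bit vector register. */
typedef struct { uint8_t b[16]; } vec128;

/* Per 32-bit word: all 1's where a and b differ, all 0's where equal. */
static vec128 emu_cmpne_u32(vec128 a, vec128 b)
{
    vec128 r;
    for (int w = 0; w < 4; w++) {
        int differ = memcmp(a.b + 4 * w, b.b + 4 * w, 4) != 0;
        memset(r.b + 4 * w, differ ? 0xFF : 0x00, 4);
    }
    return r;
}

/* Like vec_perm: the low 5 bits of each control byte select one byte
   from the 32 bytes formed by a followed by b. */
static vec128 emu_perm(vec128 a, vec128 b, vec128 c)
{
    vec128 r;
    for (int i = 0; i < 16; i++) {
        unsigned idx = c.b[i] & 31;
        r.b[i] = (idx < 16) ? a.b[idx] : b.b[idx - 16];
    }
    return r;
}

/* 128-bit "not equal to zero" test: all 0's if x is zero, else all 1's. */
static vec128 emu_ne_0(vec128 x)
{
    static const vec128 p1 = {{0, 4, 8, 12, 0, 4, 8, 12,
                               0, 4, 8, 12, 0, 4, 8, 12}};
    vec128 k0;
    memset(k0.b, 0, 16);

    vec128 r = emu_cmpne_u32(x, k0);  /* per-word nonzero masks */
    r = emu_perm(r, r, p1);           /* gather one byte of each mask into every word */
    return emu_cmpne_u32(r, k0);      /* any nonzero byte turns the word into all 1's */
}
```

The permute is what spreads the information across the vector: after it, every word contains one byte from each of the four per-word masks, so a single nonzero word anywhere in x makes all four words nonzero.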

Software Prefetch

The Data Stream Touch instructions (dst, dstt, dstst, dststt, etc.; vec_dst in C; hereafter called just DST) do not operate on data in the level 2 cache. In other words, if your data is already in the L2 cache, use of DST will not bring that data into the L1 cache for you. This is of critical importance for code that repeatedly accesses amounts of data in excess of 32K but less than 256K.

However, if you have two extra vector registers per data stream, you can perform the operation yourself. Set up your loop so that each pass consumes 32 bytes of each input stream (that is 2 vectors, and 1 cache line). Make sure your pointers are set up so that you start reading your data at a 32-byte boundary (if not, use separate code to handle the 1 to 31 bytes before the boundary). At the beginning of each pass, use two vec_ld operations to load 32 bytes of data into a pair of "next-input" buffer vectors. At the end of the pass (after your calculation and after writing the output stream), transfer the buffer vectors into the "current-input" vectors; these are the vectors from which your calculation will draw its values on the next pass through the loop. Then loop, and the process repeats with the reload of the buffer vectors. Of course, you have to set this up with two extra vec_ld instructions before the loop starts, and an extra copy of the calculation code after the loop ends. Here is a complete example:

Original Loop

for(i=0; i<n; i++) {   // n = number of vectors to process (assumed name)
   in = vec_ld(p); p += vec_stride(in);
   out = vec_mul(vec_add(vec_xor(in, k1), k2), k3, k4);
   vec_st(q, out); q += vec_stride(out);
}

Modified Loop Using Software Prefetch

in1 = vec_ld(p);
in2 = vec_ld(p + vec_stride(in1)); p += 2 * vec_stride(in1);
for(i=0; i<n-2; i+=2) {   // n = number of vectors to process (assumed name; must be even)
   // prefetch
   buf1 = vec_ld(p);
   buf2 = vec_ld(p + vec_stride(in1)); p += 2 * vec_stride(in1);
   // calculate
   out = vec_mul(vec_add(vec_xor(in1, k1), k2), k3, k4);
   vec_st(q, out);
   out = vec_mul(vec_add(vec_xor(in2, k1), k2), k3, k4);
   vec_st(q + vec_stride(out), out); q += 2 * vec_stride(out);
   // transfer
   in1 = buf1; in2 = buf2;
}
// finish the last 2 vectors
out = vec_mul(vec_add(vec_xor(in1, k1), k2), k3, k4);
vec_st(q, out);
out = vec_mul(vec_add(vec_xor(in2, k1), k2), k3, k4);
vec_st(q + vec_stride(out), out); q += 2 * vec_stride(out);

Whenever the needed data is in L2 cache, this technique will save 3 cycles per input vector no matter how many cycles are used for the calculation itself. If the data is already in L1 cache each time through the loop, it might cause a slowdown of 1 cycle per vector. If the input data is not in L1 or L2 cache and the calculation is slow enough to be of comparable speed to the L3 (or RAM) loading time, then the technique will also provide a speed improvement, of variable amount depending on the exact time values involved.
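The loop restructuring itself can be tried out in plain C, without any vector hardware. The sketch below is only a structural model and will not show the cache timing: the vec4 type, the calc function with its constants, and the names run_simple and run_buffered are all hypothetical. It assumes an even number of vectors, at least 2.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one 16-byte vector. */
typedef struct { uint32_t w[4]; } vec4;

/* Stand-in calculation: ((in xor k1) + k2) * k3 on each element,
   with made-up constants. */
static vec4 calc(vec4 in)
{
    vec4 out;
    for (int i = 0; i < 4; i++)
        out.w[i] = ((in.w[i] ^ 0x5A5A5A5Au) + 7u) * 3u;
    return out;
}

/* Original form: load, calculate, store, one vector per pass. */
static void run_simple(const vec4 *p, vec4 *q, size_t n)
{
    for (size_t i = 0; i < n; i++)
        q[i] = calc(p[i]);
}

/* Double-buffered form: n must be even and at least 2. */
static void run_buffered(const vec4 *p, vec4 *q, size_t n)
{
    /* two extra loads before the loop starts */
    vec4 in1 = p[0], in2 = p[1];
    p += 2;

    for (size_t i = 0; i + 2 < n; i += 2) {
        /* prefetch: load the next pair into the buffer vectors */
        vec4 buf1 = p[0], buf2 = p[1];
        p += 2;
        /* calculate on the pair loaded during the previous pass */
        q[0] = calc(in1);
        q[1] = calc(in2);
        q += 2;
        /* transfer: buffers become the current inputs */
        in1 = buf1;
        in2 = buf2;
    }

    /* extra copy of the calculation for the last 2 vectors */
    q[0] = calc(in1);
    q[1] = calc(in2);
}
```

Both functions write the same output for the same input; the buffered version only changes when each load is issued relative to the calculation that uses it.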

Footnotes

1 : DST: Data Stream Touch

This page was written in the "embarrassingly readable" markup language RHTF, and was last updated on 2011 Jan 22.
