/* ------------------ Vector SIMD Macro Library ---------------------\

  vector_simd.h                                                              Wrap to 132 columns in Emacs:   ^u 132 ^x f

These macros provide a simple way to write vector (SSE or AltiVec)
code without having to worry about the differences between the
different processor implementations.

The definitions used by this module are supplied with GCC (at least
by version 4.0.1) in the files emmintrin.h, xmmintrin.h, etc. all
in the directory /usr/lib/gcc/.../include/. I have also archived these
files in shared/proj/include/gcc-sse-intrinsics.h.

ASSUMPTIONS

This module is designed to address the fact that AltiVec and SSE are
not completely interoperable. In particular, some operations, like
unaligned loads and arbitrary byte-level permute, are only supported
by one or the other. In certain cases, scalar code might have to
be used to fill in a missing function.

This problem is mitigated somewhat by providing multiple options to
the programmer. The limitations of interoperability are exposed by
providing multiple similar macros. The divide macros (VDIV_xx and
VDIVE_xx) provide a good example of this. Both are provided because
one is substantially faster on the AltiVec platform, and it is
anticipated that some applications might not need the full precision
of VDIV.

Another consequence of the differences between AltiVec and SSE is that
some operations need temp variables or extra constant parameters on
one platform but not on the other. For example, VMUL requires a vector
of 0's on AltiVec, because there is no native multiply (only
multiply-add). In order for the source code to be the same on both
platforms, the source code must invoke a macro that (on AltiVec)
declares a variable and fill it with 0's. However, this macro does not
need to do anything on Intel. This is the purpose of the INIT_xxx
macros. They provide the necessary constants and temp variables, but
only when compiling for the hardware target(s) that need them. Each
macro has a comment explaining which operations require it -- if your
code does not use those operations, you need not include the
initialize macro.

USAGE

The sourcefile should include this include file before declaring any
vector variables.

See detailed instructions below on how to make the compiler and/or Xcode
generate code for Intel and/or PPC.

In order to actually generate vector instructions, the makefile or
build script must supply a flag PPC_SIMD or INTEL_SIMD to tell the
code what target it is being compiled for. <<%%%In a future
implementation>> If neither is defined, the macros generate equivalent
scalar code.

Your subroutines and functions should invoke INIT_xxx macros within
any scope that uses SIMD instructions that need the corresponding
variables. Each INIT_xxx macro has a comment explaining which
operations use its variable(s). The variable name is supplied both to
the INIT_xxx macro and to the SIMD operation (e.g. VMUL_F4) so that
you can choose the name of the variable suitably for your namespace,
and to allow you to invoke the INIT_xxx macro multiple times in the
same scope (which might be desirable for multithreaded code, for
example)

REVISION HISTORY

 2009040x (much revision here, probably noted in linpack.c or PDE3_main.c)
 20090407 Add NO_SIMD capability, which also serves as documentation of what
each macro is supposed to do. Discover a syntactic error in the naming (or
usage convention) of the src and new arguments to VSHRx_F4: note that
3 elements of the destination come from "new" and only one from "src";
it therefore makes sense to rename all the "src" and "new" parameters
in a manner similar to the "up" and "down" convention. As currently written,
I am using the name "src" to refer to the "lower" half and "new" to "upper"
(or vice versa for the little-endian perception).
 20090408 A few bug-fixes for NO_SIMD; break INIT_SRTMP up into two
separate macros INIT_SHTMP_F4 and INIT_ROTMP_F4.
 Add V4F8 capability (in NO_SIMD case only). This greatly simplifies
conversion of PDE4 to double precision, but is also an important step
towards support of the AVX (Advanced Vector Extensions) which offer
native V4F8 capability in hardware and are expected to be introduced
sometime in 2010 or later.
 20090423 Add two versions of V2F8 (Intel native via SSE2, and emulated
for use when NO_SIMD or PPC_SIMD are defined). These do not include
shift and rotate because I currently would only need it for loading
data offset by 16 bytes, which would be supported in hardware anyway.
 20101015 Add two versions of V4F8 (emulated and using AVX
intrinsics). The intrinsics are not available yet but should arrive
with the Sandy Bridge CPUs from Intel. (%%% not yet tested, and I
think the comments regarding MUX and DEM are a little confused; I
should check the corresponding comments in the V4F4 version)
 20101101 Rename all the 128-bit operations: _I8 is now _16I8, _I16
is now _8I16, _F4 is now _4F4, and _F28 is _2F8. Also, _F48 is now _4F8.

TO DO

Merge with hwi_vector.h

Rename all the "XXX_F8" macros to "XXX_4F8", remove the redundant
copies in the second V4F8_IMPL_EMUL block, and make sure PDE4-main.c
(the only client) still builds. Then merge the rest of the
V4F8_IMPL_EMUL code into the first block (which is where it should all
be) leaving only the V4F8_IMPL_AVX version at the bottom.
  (This confusion resulted from lack of clarity on how this file
should be organized: Is the code grouped by the type of vector being
manipulated, or by hardware platform? Early on I decided it should be
grouped by hardware.)

Get some of the simpler AVX macros working for a client of _4F8
(the Intel file is
<immintrin.h> which includes all of the SIMD headers including
<avxintrin.h>. MacPorts version is in
/opt/local/lib/.../gcc/.../include/; Apple version is in
/usr/lib/clang/.../include/ (which requires the clang compiler)

Make sure everything works on scalar targets, (i.e. when NO_SIMD is
defined, flags like __SSE__ should be ignored and it should use the
emulated version for everything). Test PDE4 again to make sure I
didn't break it.

Implement parallel comparison, at the very least in a form that
reduces the result to a single boolean. An SSE version looks like
this (from "20050908 Altivec to SSE Conversion.pdf"):

  // SSE version of AltiVec's vec_any_eq intrinsic
  int _mm_any_eq( vFloat a, vFloat b )
  {
    //test a==b for each float in a & b
    vFloat mask = _mm_cmpeq_ps( a, b );

    //copy top bit of each result to maskbits
    int maskBits = _mm_movemask_ps( mask );

    return maskBits != 0;
  }

Conditional execution. Here are examples for both Altivec and SSE:

  // Both examples perform:  if (a > 0) a += a;

  // Altivec:
  vUInt32 mask = vec_cmpgt( a, zero );
  vFloat twoA = vec_add( a, a);
  a = vec_sel( a, twoA, mask );

  // SSE:
  vFloat _mm_sel_ps( vFloat a, vFloat b, vFloat mask )
  {
    b = _mm_and_ps( b, mask );
    a = _mm_andnot_ps( mask, a );
    return _mm_or_ps( a, b );
  }
  vFloat mask = _mm_cmpgt_ps( a, zero );
  vFloat twoA = _mm_add_ps( a, a);
  a = _mm_sel_ps( a, twoA, mask );

In some cases you can take advantage of the fact that 0 (float) is the
same as 0 (integer). Also note that both architectures have min and
max (vec_min, _mm_min_ps and vec_max, _mm_max_ps)

\-------------------------------------------------------------------*/

/* In this first section we try to auto-configure based on the runtime
   flags passed in by the compiler.

   You also need to compile your code with "-faltivec" or "-msse3",
   depending on whether it's building for PowerPC or Intel respectively.
   (Later Intel projects use something like -msse4 or -mcorei7-avx and
   these also define the appropriate runtime flags)

   In an Xcode project, you should:

  1. Click the target you want to add the setting to, or the root of
     the project tree if you want it in all targets.
  2. "Get Info" or click the inspector button.
  3. In the Build tab, set popup "Collection" to "All Settings"
  4. Use the search box to shorten the list when necessary.
  5. Click on the + button under the setting list
  6. Enter "PER_ARCH_CFLAGS_ppc" for the setting name and "-DPPC_SIMD"
     for its value.
  7. Add another setting called "PER_ARCH_CFLAGS_i386", set to "-DINTEL_SIMD".
  8. Enable both of the SIMD-related options ("Enable SSE3 Extensions"
     and "Enable AltiVec Extensions")
  9. On earlier versions of Xcode, the options in step 8 might not both
     be available. In this case, you can add "-faltivec" and/or "-msse3"
     to the PER_ARCH_xx settings in steps 6 and 7.
 */

// Detect SSE3 setting from -msse3 -mavx, etc.
#ifndef NO_SIMD
# ifndef PPC_SIMD
#   ifndef INTEL_SIMD
#     ifdef __SSE3__
#       define INTEL_SIMD
#     endif
#   endif
# endif
#endif

// Use endian to select Intel or PPC
#ifndef NO_SIMD
# ifndef PPC_SIMD
#   ifndef INTEL_SIMD
#     ifdef __LITTLE_ENDIAN__
#       define INTEL_SIMD
#     else
#       ifdef __BIG_ENDIAN__
#         define PPC_SIMD
#       endif
#     endif
#   endif
# endif
#endif

// Upgrade Intel SSE to AVX if available
#if (defined(INTEL_SIMD) && defined(__AVX__))
# define INTEL_USE_AVX
#endif

/* If user supplied no option and compiler doesn't provide either type of
   ENDIAN, then we just bail into emulation mode. */
#ifndef PPC_SIMD
# ifndef INTEL_SIMD
#   ifndef NO_SIMD
#     define NO_SIMD
#   endif
# endif
#endif


// Figure out how to implement each type of vector

#ifdef NO_SIMD
# define V4F8_IMPL_EMUL
#endif

/* Altivec has no V2F8 capability, except trivial operations like move,
   so we emulate V4F8. */
#ifdef PPC_SIMD
# define V4F8_IMPL_EMUL
#endif

/* Intel can do V4F8 only if it has AVX */
#ifdef INTEL_SIMD
# ifdef INTEL_USE_AVX
#   define V4F8_IMPL_AVX
# else
#   define V4F8_IMPL_EMUL
# endif
#endif


#ifdef NO_SIMD
# include <string.h>
#endif

#define U8  unsigned char
#define S8    signed char
#define U16 unsigned short
#define S16   signed short
#define U32 unsigned int
#define S32   signed int
#define U64 unsigned long long
#define S64   signed long long
#define F23E7 float
#define F52E11 double


///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
//                                                                                                                   //
//  #include<stdio.h>                                                                                                //
//  /*                                                     */                    /*                             */   //
//  /*                                                     */                    /*                             */   //
//  int                 main()  {char      a[]     ={     'T',    'h','i',     's',32,     'p','r'        ,'o','g'   //
//  ,'r','a','m',      32,   'j',   'u',   's',    't'    ,32           ,'d',    'o',   'e'     ,'s'    ,32    ,'i'  //
//  ,'t'              ,32    ,'t'    ,'h'  ,'e'   ,32,    'h'    ,97,'r','d'   ,32,   'w','a','y',33,   32,    40,   //
//  68,               'o',   'n',    39,   't'    ,32     ,'y'   ,'o',   117,    32    ,'t'             ,'h'   ,'i'  //
//  ,'n'               /*     Xy     =a     +3     +n      ++     ;a=     b-     (*     x/z              );     if   //
//  (Xy-++n<(z+*x))z  =b;a   +b,     z+=     x*/,107 ,    63,63    ,63,41,'\n'   ,00};   puts(a);}       /*.RPM.*/   //
//                                                                                                                   //
//        Emulated versions of the V4F4 macros. (These also serve as documentation for what each macro does)         //
//                                                                                                                   //
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

#ifdef NO_SIMD
/* Non-SIMD implementation of the macros. This serves as a fallback
for targets that do not have (or do not specify) SIMD instruction
support, and also serve as a reference definition for the macros. */

  typedef F23E7  V4F4[4];
  typedef U8     V16I8[16];
  typedef U16    V8I16[16];
  typedef float HWIV_4F4_ALIGNED[4];

// Loads, fills and splats. Notice grossly inefficient implementation of the
// integer versions -- this emulation code is not intended to actually be
// fast or anything. But if it's a problem we could cast V16I8 into V4F4 and
// invoke the V4F4 load/fill/etc macro.
# define VLOAD_16I8(dest, src)  memcpy((void *) (dest), (void *) (src), 16);
# define VLOAD_8I16(dest, src) memcpy((void *) (dest), (void *) (src), 16);
# define VLOAD_4F4(dest, src) ({ (dest)[0]=(src)[0]; (dest)[1]=(src)[1]; \
                                (dest)[2]=(src)[2]; (dest)[3]=(src)[3]; })
# define VCOPY_4F4(dest, src) VLOAD_4F4(dest, src)

# define VLOADO_16I8(dest, src, offset)  memcpy((void *) (dest), \
                               (void *) (((char *) (src)) + offset), 16);
# define VLOADO_8I16(dest, src, offset)  memcpy((void *) (dest), \
                               (void *) (((char *) (src)) + offset), 16);
// VLOADO_F4 only works if the pointer is a multiple of sizeof(F4); this
// shouldn't be a problem because the vector hardware requires this anyway
# define VLOADO_4F4(dest, src, offset) ({ (dest)[0]=(src)[(offset)/4]; \
                                         (dest)[1]=(src)[(offset)/4+1]; \
                                         (dest)[2]=(src)[(offset)/4+2]; \
                                         (dest)[3]=(src)[(offset)/4+3]; })

# define VFILL_8I16(dest, sc0, sc1, sc2, sc3, sc4, sc5, sc6, sc7, tmp) ({ \
  (dest)[0]=(sc0); (dest)[1]=(sc1); (dest)[2]=(sc2); (dest)[3]=(sc3); \
  (dest)[4]=(sc4); (dest)[5]=(sc5); (dest)[6]=(sc6); (dest)[7]=(sc7); })

// tmp must be a pointer to an array of 4 F4's, and it must be aligned to
// a 16-byte boundary.
# define VFILL_4F4(dest, sc0, sc1, sc2, sc3, tmp) ({ \
  (dest)[0]=(sc0); (dest)[1]=(sc1); (dest)[2]=(sc2); (dest)[3]=(sc3); })

// VSPLAT_F4 takes an F4 and puts it into all 4 fields of a V4F4. The src
// parameter is a scalar but must be something you can take the pointer of.
// The tmp paramater is a V16I8, used only by the AltiVec version.
# define INIT_SPLAT_4F4(varname)
# define VSPLAT_4F4(dst, src, tmp) ({ (dst)[0]=(src); (dst)[1]=(src); \
                                     (dst)[2]=(src); (dst)[3]=(src); })

// Data Stream Touch
# define VEC_DST(ptr, opts, x)

// Stores
# define VSAVE_16I8(dest, src) ({ (dest)[0]=(src)[0]; (dest)[1]=(src)[1]; \
                                (dest)[2]=(src)[2]; (dest)[3]=(src)[3]; })

# define VSAVE_8I16(dest, src) memcpy((void *) (dest), (void *) (src), 16);
# define VSAVE_4F4(dest, src)  memcpy((void *) (dest), (void *) (src), 16);

# define VSAVEO_16I8(dest, offset, src) memcpy((void *) (dest), \
                               (void *) (((char *) (src)) + offset), 16);
# define VSAVEO_8I16(dest, offset, src) memcpy((void *) (dest), \
                               (void *) (((char *) (src)) + offset), 16);
# define VSAVEO_4F4(dest, offset, src) ({ (dest)[(offset)/4]=(src)[0]; \
                                         (dest)[(offset)/4+1]=(src)[1]; \
                                         (dest)[(offset)/4+2]=(src)[2]; \
                                         (dest)[(offset)/4+3]=(src)[3]; })

// Addition and subtraction
# define VADD_8I16(dest, a, b) ({ (dest)[0]=a[0]+b[0]; (dest)[1]=a[1]+b[1]; \
                                 (dest)[2]=a[2]+b[2]; (dest)[3]=a[3]+b[3]; \
                                 (dest)[4]=a[4]+b[4]; (dest)[5]=a[5]+b[5]; \
                                 (dest)[6]=a[6]+b[6]; (dest)[7]=a[7]+b[7]; \
                                 (dest)[8]=a[8]+b[8]; (dest)[9]=a[9]+b[9]; \
                              (dest)[10]=a[10]+b[10]; (dest)[11]=a[11]+b[11]; \
                              (dest)[12]=a[12]+b[12]; (dest)[13]=a[13]+b[13]; \
                             (dest)[14]=a[14]+b[14]; (dest)[15]=a[15]+b[15]; })
# define VADD_4F4(dest, a, b) ({ (dest)[0]=a[0]+b[0]; (dest)[1]=a[1]+b[1]; \
                                (dest)[2]=a[2]+b[2]; (dest)[3]=a[3]+b[3]; })

# define INIT_MUL0_4F4(varname)
# define INIT_MTMP_4F4(varname)

// Multiplication
# define VMUL_4F4(dest, a, b, v0) ({ (dest)[0]=a[0]*b[0]; (dest)[1]=a[1]*b[1]; \
                                (dest)[2]=a[2]*b[2]; (dest)[3]=a[3]*b[3]; })
# define VMADD_4F4(dest, a, b, c, t)  ({ (dest)[0]=a[0]*b[0] + c[0]; \
                                        (dest)[1]=a[1]*b[1] + c[1]; \
                                        (dest)[2]=a[2]*b[2] + c[2]; \
                                        (dest)[3]=a[3]*b[3] + c[3]; })
# define VNMSUB_4F4(dest, a, b, c, t) ({ (dest)[0]=c[0] - a[0]*b[0]; \
                                        (dest)[1]=c[1] - a[1]*b[1]; \
                                        (dest)[2]=c[2] - a[2]*b[2]; \
                                        (dest)[3]=c[3] - a[3]*b[3]; })

// Divide estimate
# define VDIVE_4F4(dest, a, b, v0) ({ (dest)[0]=a[0]/b[0]; (dest)[1]=a[1]/b[1];\
                                   (dest)[2]=a[2]/b[2]; (dest)[3]=a[3]/b[3]; })

// Bitwise operations
# define VXOR_8I16(dest, a, b) ({ (dest)[0]=a[0]^b[0]; (dest)[1]=a[1]^b[1]; \
                                 (dest)[2]=a[2]^b[2]; (dest)[3]=a[3]^b[3]; \
                                 (dest)[4]=a[4]^b[4]; (dest)[5]=a[5]^b[5]; \
                                 (dest)[6]=a[6]^b[6]; (dest)[7]=a[7]^b[7]; \
                                 (dest)[8]=a[8]^b[8]; (dest)[9]=a[9]^b[9]; \
                              (dest)[10]=a[10]^b[10]; (dest)[11]=a[11]^b[11]; \
                              (dest)[12]=a[12]^b[12]; (dest)[13]=a[13]^b[13]; \
                             (dest)[14]=a[14]^b[14]; (dest)[15]=a[15]^b[15]; })

// Declare this if you are doing any shift operation
# define INIT_SHTMP_4F4(vc) /* nop */

// Declare this if you are doing any rotate operaton
# define INIT_ROTMP_4F4(vc) float vc

// Vector "shift" and "rotate": These move floating point data one
// space to the right or left. The "new" parameter is a vector that
// supplies the element into the spot vacated by a left or right shift.
# define VSHR_4F4(dest, src, new, tmp) ({ dest[3]=new[2]; dest[2]=new[1]; \
                                         dest[1]=new[0]; dest[0]=src[3]; })
# define VSHRC_4F4(dest, src, new, tmp) VSHR_4F4(dest, src, new, tmp)

// We use the temp to hold one element so that it works properly when
// src and dest are the same vector.
# define VROR_4F4(dest, src, tmp)  ({ tmp = src[3]; \
                                    dest[3]=src[2]; dest[2]=src[1]; \
                                    dest[1]=src[0]; dest[0]=tmp; })

# define VSHL_4F4(dest, src, new, tmp)  ({ dest[0]=src[1]; dest[1]=src[2]; \
                                          dest[2]=src[3]; dest[3]=new[0]; })
# define VSHLC_4F4(dest, src, new, tmp) VSHL_4F4(dest, src, new, tmp)

# define VROL_4F4(dest, src, tmp)  ({ tmp = src[0]; \
                                    dest[0]=src[1]; dest[1]=src[2]; \
                                    dest[2]=src[3]; dest[3]=tmp; })

// The multiplex (MUX) and demultiplex (DEM) operations work as follows:
// [w x] = [a b c d e f g h]    w=MUX0(u,v) x=MUX1(u,v)
// [u v] = [a c e g b d f h]    u=DEM0(w,x) v=DEM1(w,x)
// This is described in greater detail in each operator.

// VDEM0_F4 (Vector DEMultiplex 0-mod) is for extracting every other element
// from a set of 8 elements. The source is in two vectors, "a" and "b".
// "a" contains elements 0, 1, 2 and 3 of the set, and b contains elements
// 4, 5, 6, and 7. The operator places elements 0, 2, 4 and 6 of the set
// into positions 0, 1, 2, and 3 (respectively) of the destination vector.
# define INIT_VDEM0_4F4(vc)
# define VDEM0_4F4(dst, a, b, vc) ({ dst[0]=a[0]; dst[1]=a[2]; \
                                 dst[2]=b[0]; dst[3]=b[2]; })
# define INIT_VDEM1_4F4(vc)
# define VDEM1_4F4(dst, a, b, vc) ({ dst[0]=a[1]; dst[1]=a[3]; \
                                 dst[2]=b[1]; dst[3]=b[3]; })

# define VMUX0_4F4(dst, a, b) ({ dst[0]=a[0]; dst[1]=b[0]; \
                             dst[2]=a[1]; dst[3]=b[1]; })

# define VMUX1_4F4(dst, a, b) ({ dst[0]=a[2]; dst[1]=b[2]; \
                             dst[2]=a[3]; dst[3]=b[3]; })

#define v4ADD(a,b) {(a)[0]+(b)[0],(a)[1]+(b)[1],(a)[2]+(b)[2],(a)[3]+(b)[3]}
#define v4SUB(a,b) {(a)[0]-(b)[0],(a)[1]-(b)[1],(a)[2]-(b)[2],(a)[3]-(b)[3]}
#define v4MUL(a,b) {(a)[0]*(b)[0],(a)[1]*(b)[1],(a)[2]*(b)[2],(a)[3]*(b)[3]}
#define v4SET(v0,v1,v2,v3) {(v0),(v1),(v2),(v3)}
#define v4SPLAT(a) {(a),(a),(a),(a)}
#define v4ROUP(a) {(a)[3],(a)[0],(a)[1],(a)[2]}
#define v4RODN(a) {(a)[1],(a)[2],(a)[3],(a)[0]}
#define v4RAISE(a, new) {(new)[3],(a)[0],(a)[1],(a)[2]}
#define v4LOWER(a, new) {(a)[1],(a)[2],(a)[3],(new)[0]}

#endif

#ifdef V4F8_IMPL_EMUL

  typedef F52E11 V4F8[4];

/* V4F8 macros.
   Most of these copy the F4 version because the implementation is exactly
   the same with double as for float. Notable exceptions are the INIT_XXX
   macros that declare a scalar variable, and the one (VSAVE) that calls
   memcpy.
 */

# define VLOAD_F8(dest, src) ({ (dest)[0]=(src)[0]; (dest)[1]=(src)[1]; \
                                (dest)[2]=(src)[2]; (dest)[3]=(src)[3]; })
# define VCOPY_F8(dest, src) VLOAD_F8(dest, src)
# define VLOADO_F8(dest, src, offset) ({ (dest)[0]=(src)[(offset)/8]; \
                                         (dest)[1]=(src)[(offset)/8+1]; \
                                         (dest)[2]=(src)[(offset)/8+2]; \
                                         (dest)[3]=(src)[(offset)/8+3]; })
# define VFILL_F8(dest, sc0, sc1, sc2, sc3, tmp) ({ \
                    (dest)[0]=(sc0); (dest)[1]=(sc1); (dest)[2]=(sc2); (dest)[3]=(sc3); })
# define INIT_SPLAT_F8(varname)
# define VSPLAT_F8(dst, src, tmp) ({ (dst)[0]=(src); (dst)[1]=(src); \
                                     (dst)[2]=(src); (dst)[3]=(src); })
# define VSAVE_F8(dest, src)  memcpy((void *) (dest), (void *) (src), 32);
# define VSAVEO_F8(dest, offset, src) ({ (dest)[(offset)/8]=(src)[0]; \
                                         (dest)[(offset)/8+1]=(src)[1]; \
                                         (dest)[(offset)/8+2]=(src)[2]; \
                                         (dest)[(offset)/8+3]=(src)[3]; })
# define VADD_F8(dest, a, b) ({ (dest)[0]=a[0]+b[0]; (dest)[1]=a[1]+b[1]; \
                                (dest)[2]=a[2]+b[2]; (dest)[3]=a[3]+b[3]; })
# define INIT_MUL0_F8(varname)
# define INIT_MTMP_F8(varname)
# define VMUL_F8(dest, a, b, v0) ({ (dest)[0]=a[0]*b[0]; (dest)[1]=a[1]*b[1]; \
                                (dest)[2]=a[2]*b[2]; (dest)[3]=a[3]*b[3]; })
# define VMADD_F8(dest, a, b, c, t)  ({ (dest)[0]=a[0]*b[0] + c[0]; \
                                        (dest)[1]=a[1]*b[1] + c[1]; \
                                        (dest)[2]=a[2]*b[2] + c[2]; \
                                        (dest)[3]=a[3]*b[3] + c[3]; })
# define VNMSUB_F8(dest, a, b, c, t) ({ (dest)[0]=c[0] - a[0]*b[0]; \
                                        (dest)[1]=c[1] - a[1]*b[1]; \
                                        (dest)[2]=c[2] - a[2]*b[2]; \
                                        (dest)[3]=c[3] - a[3]*b[3]; })
# define VDIVE_F8(dest, a, b, v0) ({ (dest)[0]=a[0]/b[0]; (dest)[1]=a[1]/b[1];\
                                   (dest)[2]=a[2]/b[2]; (dest)[3]=a[3]/b[3]; })
# define INIT_SHTMP_F8(vc) /* nop */
# define INIT_ROTMP_F8(vc) double vc
# define VSHR_F8(dest, src, new, tmp) ({ dest[3]=new[2]; dest[2]=new[1]; \
                                         dest[1]=new[0]; dest[0]=src[3]; })
# define VSHRC_F8(dest, src, new, tmp) VSHR_F8(dest, src, new, tmp)
# define VROR_F8(dest, src, tmp) ({ tmp = src[3]; \
                                    dest[3]=src[2]; dest[2]=src[1]; \
                                    dest[1]=src[0]; dest[0]=tmp; })
# define VSHL_F8(dest, src, new, tmp) ({ dest[0]=src[1]; dest[1]=src[2]; \
                                          dest[2]=src[3]; dest[3]=new[0]; })
# define VSHLC_F8(dest, src, new, tmp) VSHL_F8(dest, src, new, tmp)
# define VROL_F8(dest, src, tmp) ({ tmp = src[0]; \
                                    dest[0]=src[1]; dest[1]=src[2]; \
                                    dest[2]=src[3]; dest[3]=tmp; })
# define INIT_VDEM0_F8(vc) /* nop */
# define VDEM0_F8(dst, a, b, vc) ({ dst[0]=a[0]; dst[1]=a[2]; \
                                 dst[2]=b[0]; dst[3]=b[2]; })
# define INIT_VDEM1_F8(vc) /* nop */
# define VDEM1_F8(dst, a, b, vc) ({ dst[0]=a[1]; dst[1]=a[3]; \
                                 dst[2]=b[1]; dst[3]=b[3]; })
# define VMUX0_F8(dst, a, b) ({ dst[0]=a[0]; dst[1]=b[0]; \
                             dst[2]=a[1]; dst[3]=b[1]; })
# define VMUX1_F8(dst, a, b)  ({ dst[0]=a[2]; dst[1]=b[2]; \
                             dst[2]=a[3]; dst[3]=b[3]; })

#endif


/*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~*\
|                                                                             |
|                                                                             |
|                                                                             |
|                                                                             |
|                                                                             |
|        /^^^^^\                                       /^^^^^\  _-^^^^^^/ TM  |
|       /  ,-   )_-----_ --- ---- ----..----.  ---.---/  ,-   )/   ,---/      |
|      /  /_)  //  __   )| |/   |/  _// .-   )/    _//  /_)  //   /           |
|     /  //   //  / /  /|          / / (/___//   .^ /  //   /(   |            |
|    /  / ^^^'(   (/  / |   /|   _/ (  `----/   /  /  / ^^^' |   `---/        |
|   /__/       \____-'  |__/ |__/    \____//___/  /__/        \_____/         |
|                                                                             |
|                                                                             |
|                                                                             |
|                                                                             |
|                                                                             |
|                                                                             |
\*___________________________________________________________________________*/

/* AltiVec primitives. Most of this is lifted or adapted from PDE3. */
#ifdef PPC_SIMD
  typedef vector F23E7 V4F4;
  typedef vector U8    V16I8;
  typedef vector U16   V8I16;

// Loads, fills and splats
# define VLOAD_18I8(dest, src) (dest) = vec_ld(0, (src))
# define VLOAD_8I16(dest, src) (dest) = vec_ld(0, (src))
# define VLOAD_4F4(dest, src) (dest) = vec_ld(0, (src))
# define VCOPY_4F4(dest, src) (dest) = (src)

# define VLOADO_16I8(dest, src, offset) (dest) = vec_ld((offset), (src))
# define VLOADO_8I16(dest, src, offset) (dest) = vec_ld((offset), (src))
# define VLOADO_4F4(dest, src, offset) (dest) = vec_ld((offset), (src))

# define VFILL_8I16(dest, sc0, sc1, sc2, sc3, sc4, sc5, sc6, sc7, tmp) ({ \
  (tmp)[0]=(sc0); (tmp)[1]=(sc1); (tmp)[2]=(sc2); (tmp)[3]=(sc3); \
  (tmp)[4]=(sc4); (tmp)[5]=(sc5); (tmp)[6]=(sc6); (tmp)[7]=(sc7); \
  (dest) = vec_ld(0, (tmp));  })

// tmp must be a pointer to an array of 4 F4's, and it must be aligned to
// a 16-byte boundary.
# define VFILL_4F4(dest, sc0, sc1, sc2, sc3, tmp) ({ \
  (tmp)[0]=(sc0); (tmp)[1]=(sc1); (tmp)[2]=(sc2); (tmp)[3]=(sc3); \
  (dest) = vec_ld(0, (tmp));  })

// VSPLAT_F4 takes an F4 and puts it into all 4 fields of a V4F4. The src
// parameter is a scalar but must be something you can take the pointer of.
// The tmp paramater is a V16I8, used only by the AltiVec version.
# define INIT_SPLAT_4F4(varname) V16I8 varname
# define VSPLAT_4F4(dst, src, tmp) ({ \
  dst = vec_ld(0, &src); \
  tmp = vec_lvsl(0, &src); \
  dst = vec_perm(dst, dst, tmp); \
  dst = vec_splat(dst, 0); })

// Data Stream Touch
# define VEC_DST(ptr, opts, x) vec_dst((ptr), (opts), (x))

// Stores
# define VSAVE_16I8(dest, src) vec_st((src), 0, (dest))
# define VSAVE_8I16(dest, src) vec_st((src), 0, (dest))
# define VSAVE_4F4(dest, src) vec_st((src), 0, (dest))

# define VSAVEO_16I8(dest, offset, src) vec_st((src), (offset), (dest))
# define VSAVEO_8I16(dest, offset, src) vec_st((src), (offset), (dest))
# define VSAVEO_4F4(dest, offset, src) vec_st((src), (offset), (dest))

// Addition and subtraction
# define VADD_8I16(dest, a, b) (dest) = vec_add((a), (b))
# define VADD_4F4(dest, a, b) (dest) = vec_add((a), (b))

/* The following defines are used to declare variables that are used for the
   "v0" and "t" parameters in the multiply and multiply-add (and similar)
   operations. VMUL requires a "v0" operand, which is to contain a vector
   of zeros of the same component type (e.g. a vector of F4's for use with
   VMUL_F4). If you are doing any VMUL operations, and do not already have
   a vector of zeros of the appropriate type, then declare a variable with
   the INIT_MUL0 macro. */
// On PowerPC, INIT_MUL0_xx declares a variable, because PPC has
// no simple multiply operation, so multiply has to be done in terms of multiply-add.
# define INIT_MUL0_4F4(varname) V4F4 varname = (V4F4) ( 0.0, 0.0, 0.0, 0.0 )

/* If you are doing any MADD, NMSUB, etc. operations, declare one or more temp
   variables for use as the fifth argument, using INIT_MTMP. */
// On PowerPC, INIT_MTMP_xx does nothing, because PPC actually has FMA (fused
// multiply-add) operations.
# define INIT_MTMP_4F4(varname) /* nop */

// Multiplication
# define VMUL_4F4(dest, a, b, v0) (dest) = vec_madd((a), (b), (v0))
# define VMADD_4F4(dest, a, b, c, t) (dest) = vec_madd((a), (b), (c))
# define VNMSUB_4F4(dest, a, b, c, t) (dest) = vec_nmsub((a), (b), (c))

// Divide estimate
# define VDIVE_4F4(dest, a, b, v0) ({ \
  (dest) = vec_re((b)); \
  (dest) = vec_madd((dest), (a), (v0)); })

// Bitwise operations
# define VXOR_8I16(dest, a, b) (dest) = vec_xor((a), (b))

// Declare this if you are doing any shift operation
# define INIT_SHTMP_4F4(vc) V16I8 vc

// Declare this if you are doing any rotate operaton
# define INIT_ROTMP_4F4(vc) V16I8 vc

// Floating point "shift" and "rotate": These move floating point data one
// space to the right or left. Note that ROL is defined in terms of SHL, and
// likewise for ROR/SHR.
//
// The "tmp" parameter is a V16I8, used only by the PowerPC versions.
//
# define VSHL_4F4(dest, src, new, tmp) ({  (tmp) = vec_lvsl(0, 4); \
                 (dest) = vec_perm((src), (new), (tmp)); })
// VSHLC is for performing more SHL's after having done a first one. This
// is only for when you are shifting the same type of data in the same
// direction. On AltiVec it will run a bit faster because the temp does
// not need to be re-loaded.
# define VSHLC_4F4(dest, src, new, tmp) (dest) = vec_perm((src), (new), (tmp));
# define VROL_4F4(dest, src, tmp) VSHL_4F4(dest, src, src, tmp)
# define VSHR_4F4(dest, new, src, tmp) ({ (tmp) = vec_lvsl(0, 12); \
                 (dest) = vec_perm((new), (src), (tmp)); })
# define VSHRC_4F4(dest, src, new, tmp) (dest) = vec_perm((src), (new), (tmp));
# define VROR_4F4(dest, src, tmp) VSHR_4F4(dest, src, src, tmp)

// The multiplex (MUX) and demultiplex (DEM) operations work as follows:
// [w x] = [a b c d e f g h]    w=MUXH(u,v) x=MUXL(u,v)
// [u v] = [a c e g b d f h]    u=DEM0(w,x) v=DEM1(w,x)
// This is described in greater detail in each operator.

/* VDEM0_F4 (Vector DEMultiplex 0-mod) is for extracting every other element
   from a set of 8 elements. The source is in two vectors, "a" and "b". "a"
   contains elements 0, 1, 2 and 3 of the set, and b contains elements 4, 5,
   6, and 7. The operator places elements 0, 2, 4 and 6 of the set into
   positions 0, 1, 2, and 3 (respectively) of the destination vector. */
# define INIT_VDEM0_4F4(vc) V16I8 vc = (V16I8) \
                ( 0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27 )
# define VDEM0_4F4(dst, a, b, vc) (dst) = vec_perm((a), (b), (vc))

//
# define INIT_VDEM1_4F4(vc) V16I8 vc = (V16I8) \
                ( 4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31 )
# define VDEM1_4F4(dst, a, b, vc) (dst) = vec_perm((a), (b), (vc))

//
# define VMUX0_4F4(dst, a, b) (dst) = vec_mergeh((a), (b))

//
# define VMUX1_4F4(dst, a, b) (dst) = vec_mergel((a), (b))

#endif

/* Sources for info about the vector intrinsics:
  developer.apple.com/hardwaredrivers/ve/sse.html
    (also in "20050908 Altivec to SSE Conversion.pdf")
  http://gcc.gnu.org/onlinedocs/gcc-3.4.4/gcc/X86-Built_002din-Functions.html
  http://gcc.gnu.org/onlinedocs/gcc/i386-and-x86_002d64-Options.html
  en.wikipedia.org/wiki/Streaming_SIMD_Extensions
 */


/*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~*\
|                                                                             |
|        @@@@                       @@@@                     @@@@             |
|        @@@@                       @@@@                     @@@@             |
|        """"                       @@@@                     @@@@             |
|        eeee    eeee  ,e@@e..   eee@@@@eee                  @@@@             |
|        @@@@    @@@@@@@@@@@@@@. @@@@@@@@@@                  @@@@             |
|        @@@@    @@@@f'    `@@@@    @@@@                     @@@@             |
|        @@@@    @@@@       @@@@    @@@@        ,e@@@e.      @@@@             |
|        @@@@    @@@@       @@@@    @@@@     e@@@@@@@@@@@e   @@@@             |
|        @@@@    @@@@       @@@@    @@@@   .@@@@'     `@@@@i @@@@             |
|        @@@@    @@@@       @@@@    @@@@kee@@@@eeeeeeeee@@@@ @@@@             |
|        @@@@    @@@@       @@@@    `@@@@@@@@@@@@@@@@@@@@@@@@@@@@  (R)        |
|                                          @@@@.                              |
|                                          `@@@@e.    .eeee-                  |
|                                            *@@@@@@@@@@@*                    |
|                                               "*@@@@*"                      |
|                                                                             |
\*___________________________________________________________________________*/

#ifdef INTEL_SIMD

# if (defined(USE_IMMINTRIN) || defined (__AVX__))
#   include <immintrin.h>
# else
#   include <xmmintrin.h>
#   include <emmintrin.h>
# endif

  typedef __m128  V4F4;
  typedef __v16qi V16I8;
  typedef __v8hi  V8I16;
  typedef float __attribute__((aligned (16))) HWIV_4F4_ALIGNED[4];

// Loads and Stores -- the bane of all aspiring GFLOPs champions.
//

/* For some reason the GCC include files do not define typechecked versions
   of (most of) the SSE integer operations. */
//static __inline __v8hi __attribute__((__always_inline__, __nodebug__))
static __inline __v8hi __attribute__((__always_inline__))
_mm_load_v8hi (short const *__P)
{
  return (__v8hi) *(__v8hi *)__P;
}
# define VLOAD_8I16(dest, src) (dest) = _mm_load_v8hi(src)
# define VLOADO_8I16(dest, src, offset) (dest) = _mm_load_v8hi((src) + (offset)/2)

# define VFILL_8I16(dest, sc0, sc1, sc2, sc3, sc4, sc5, sc6, sc7, tmp) ({ \
  (tmp)[0]=(sc0); (tmp)[1]=(sc1); (tmp)[2]=(sc2); (tmp)[3]=(sc3); \
  (tmp)[4]=(sc4); (tmp)[5]=(sc5); (tmp)[6]=(sc6); (tmp)[7]=(sc7); \
  VLOAD_8I16(dest, tmp);  })

// On Intel, _mm_load_ps is just a pointer deference with typecasting
# define VLOAD_4F4(dest, src) (dest) = _mm_load_ps(src)
# define VLOADO_4F4(dest, src, offset) (dest) = _mm_load_ps((src) + (offset)/4)
# define VCOPY_4F4(dest, src) (dest) = (src)

# define VFILL_4F4(dest, sc0, sc1, sc2, sc3, tmp) ({ (tmp)[0]=(sc0); \
               (tmp)[1]=(sc1); (tmp)[2]=(sc2); (tmp)[3]=(sc3); \
               VLOAD_4F4(dest, tmp); })

// VSPLAT_F4 takes an F4 and puts it into all 4 fields of a V4F4. The src
// parameter is a scalar but must be something you can take the pointer of.
// The tmp paramater is a V16I8, used only by the AltiVec version.
# define INIT_SPLAT_4F4(varname) /* nop */
# define VSPLAT_4F4(dst, src, tmp) (dst) = _mm_set_ps1(src)

// For now, no Data Stream Touch on Intel
# define VEC_DST(ptr, opts, x) // no-op

// Store operations

//static __inline void __attribute__((__always_inline__, __nodebug__))
static __inline void __attribute__((__always_inline__))
_mm_store_v8hi (short *__P, __v8hi __A)
{
  *(__v8hi *)__P = (__v8hi)__A;
}
# define VSAVE_8I16(dest, src) _mm_store_v8hi((dest), (src))

// _mm_stream_ps is really __builtin_ia32_movntps
// _mm_store_ps is an assignment to a derefernced pointer lvalue
# define VSAVE_4F4(dest, src) _mm_store_ps((dest), (src))
# define VSAVEO_4F4(dest, offset, src) _mm_store_ps((dest)+(offset)/4, (src))

// __builtin_ia32_addps is _mm_add_ps
# define VADD_8I16(dest, a, b) (dest) = _mm_add_epi16((a), (b))
# define VADD_4F4(dest, a, b) (dest) = _mm_add_ps((a), (b))

// INIT_MUL0_xx is described above. On Intel it does nothing because Intel
// actually has a multiply operation.
# define INIT_MUL0_4F4(varname) /* nop */

// INIT_MTMP_xx is described above. On Intel it declares a variable because
// Intel has no FMA (fused multiply-add) operations.
# define INIT_MTMP_4F4(varname) V4F4 varname = _mm_setzero_ps()

// _mm_mul_ps is __builtin_ia32_mulps
# define VMUL_4F4(dest, a, b, v0) (dest) = _mm_mul_ps((a), (b))

// We use the temp var t to hold the intermediate result. Note that
// Intel results will differ from PowerPC results, because the intermediate
// result within the FMADD hardware has full double precision.
# define VMADD_4F4(dest, a, b, c, t) ({ (t) = _mm_mul_ps((a), (b)); \
                                  (dest) = _mm_add_ps((t), (c)); })
# define VNMSUB_4F4(dest, a, b, c, t) ({ (t) = _mm_mul_ps((a), (b)); \
                                  (dest) = _mm_sub_ps((c), (t)); })

// _mm_div_ps is __builtin_ia32_divps
# define VDIVE_4F4(dest, a, b, v0) (dest) = _mm_div_ps((a), (b))

// Bitwise operations
# define VXOR_8I16(dest, a, b) (dest) = _mm_xor_si128((a), (b))

// Floating point "shift" and "rotate": These move floating point data one
// space to the right or left. Note that SHL is defined in terms of ROL, and
// likewise for the SHR/ROR.
//
// The "tmp" parameter is a V16I8, used only by the PowerPC versions.
//
// _mm_shuffle_ps is __builtin_ia32_shufps
// Declare this if you are doing any shift operation
# define INIT_SHTMP_4F4(vc) /* nop */

// Declare this if you are doing any rotate operaton
# define INIT_ROTMP_4F4(vc) /* nop */


# define VROL_4F4(dest, src, tmp) \
                 (dest) = _mm_shuffle_ps((src), (src), 0x39)
// _mm_move_ss is __builtin_ia32_movss
# define VSHL_4F4(dest, src, new, vs) ({ \
                 (dest) = _mm_move_ss((src), (new)); \
                 VROL_4F4(dest, dest, 0); })
// VSHLC is for performing more SHL's after having done a first one. This
// is only for when you are shifting the same type of data in the same
// direction. On AltiVec it will run a bit faster because the temp does
// not need to be re-loaded.
# define VSHLC_4F4(dest, src, new, tmp) VSHL_4F4(dest, src, new, tmp)

# define VROR_4F4(dest, src, tmp) \
                 (dest) = _mm_shuffle_ps((src), (src), 0x93)
#if 0
// This version produced the "smear towards right" distortion I observed immedately
// after porting to Intel.
# define VSHR_4F4(dest, new, src, vs) ({ \
                 VROR_4F4(dest, src, 0); \
                 (dest) = _mm_move_ss((dest), (new)); \
                 (dest) = _mm_shuffle_ps((dest), (dest), 0xe4); })
#else
// This is the fixed version.
# define VSHR_4F4(dest, new, src, vs) ({ \
                 (dest) = _mm_shuffle_ps((new), (src), 0x0f); \
                 (dest) = _mm_shuffle_ps((dest), (src), 0x98); \
                 })
#endif

# define VSHRC_4F4(dest, src, new, tmp) VSHR_4F4(dest, src, new, tmp)


// The multiplex (MUX) and demultiplex (DEM) operations work as follows:
// [w x] = [a b c d e f g h]    w=MUXH(u,v) x=MUXL(u,v)
// [u v] = [a c e g b d f h]    u=DEM0(w,x) v=DEM1(w,x)
//
// On Intel, we don't need a constant vector vc
# define INIT_VDEM0_4F4(vc) /* nop */
# define INIT_VDEM1_4F4(vc) /* nop */

# define VDEM0_4F4(dst, a, b, vc)  (dst) = _mm_shuffle_ps((a), (b), 0x88)
# define VDEM1_4F4(dst, a, b, vc)  (dst) = _mm_shuffle_ps((a), (b), 0xdd)

# define VMUX0_4F4(dst, a, b) ({ (dst) = _mm_shuffle_ps((a), (b), 0x44); \
                             (dst) = _mm_shuffle_ps((dst), (dst), 0xd8); })
# define VMUX1_4F4(dst, a, b) ({ (dst) = _mm_shuffle_ps((a), (b), 0xbb); \
                             (dst) = _mm_shuffle_ps((dst), (dst), 0x8d); })


// Here is the subset for FORTRAN-style code
#define v4ADD(a,b) _mm_add_ps((a), (b))
#define v4SUB(a,b) _mm_sub_ps((a), (b))
#define v4MUL(a,b) _mm_mul_ps((a), (b))
// in v4SET, note the reversal of argument order
#define v4SET(v0,v1,v2,v3) _mm_set_ps((v3),(v2),(v1),(v0))
#define v4SPLAT(a) _mm_set1_ps(a)
#define v4ROUP(src) _mm_shuffle_ps((src), (src), 0x93)
#define v4RODN(src) _mm_shuffle_ps((src), (src), 0x39)
#define v4RAISE(src, new) _mm_shuffle_ps(_mm_shuffle_ps((new), (src), 0x0f), (src), 0x98)
#define v4LOWER(src, new) _mm_shuffle_ps(_mm_move_ss((src),(new)), _mm_move_ss((src),(new)), 0x39)

#endif


/* Partial implementation of V2F8 */

#ifdef NO_SIMD
# define V2F8_IMPL_EMUL
#endif

/* Altivec has no V2F8 capability, except trivial operations like move,
   so we emulate it. */
#ifdef PPC_SIMD
# define V2F8_IMPL_EMUL
#endif

#ifdef INTEL_SIMD
# define V2F8_IMPL_SSE3
#endif

#ifdef V2F8_IMPL_EMUL

/* Emulated implementation of V2F8 */

  typedef F52E11 V2F8[2];

# define VLOAD_2F8(dest, src) ({ (dest)[0]=(src)[0]; (dest)[1]=(src)[1]; })
# define VCOPY_2F8(dest, src) VLOAD_2F8(dest, src)
# define VLOADO_2F8(dest, src, offset) ({ (dest)[0]=(src)[(offset)/8]; \
                                         (dest)[1]=(src)[(offset)/8+1]; })
# define VFILL_2F8(dest, sc0, sc1, tmp) ({ (dest)[0]=(sc0); (dest)[1]=(sc1); })
# define VSPLAT_2F8(dst, src) VFILL_2F8(dst, src, src, 0)
# define VSAVE_2F8(dest, src) ({ (dest)[0]=(src)[0]; (dest)[1]=(src)[1]; })

# define VSAVEO_2F8(dest, offset, src) ({ (dest)[(offset)/8]=(src)[0]; \
                                         (dest)[(offset)/8+1]=(src)[1]; })

# define VADD_2F8(dest, a, b) ({ (dest)[0]=a[0]+b[0]; (dest)[1]=a[1]+b[1]; })

# define INIT_MUL0_2F8(varname)
# define INIT_MTMP_2F8(varname)

# define VMUL_2F8(dest, a, b, v0) ({ \
                                (dest)[0]=a[0]*b[0]; (dest)[1]=a[1]*b[1]; })
# define VMADD_2F8(dest, a, b, c, t)  ({ (dest)[0]=a[0]*b[0] + c[0]; \
                                         (dest)[1]=a[1]*b[1] + c[1]; })
# define VNMSUB_2F8(dest, a, b, c, t) ({ (dest)[0]=c[0] - a[0]*b[0]; \
                                         (dest)[1]=c[1] - a[1]*b[1]; })

# define VDIVE_2F8(dest, a, b, v0) ({ \
                                   (dest)[0]=a[0]/b[0]; (dest)[1]=a[1]/b[1]; })

// The multiplex (MUX) and demultiplex (DEM) operations work as follows:
// [w x] = [a b c d]    w=MUX0(u,v) x=MUX1(u,v)
// [u v] = [a c b d]    u=DEM0(w,x) v=DEM1(w,x)
// This is described in greater detail in each operator.

# define VDEM0_2F8(dst, a, b)     ({ dst[0]=a[0]; dst[1]=b[0]; })
# define VDEM1_2F8(dst, a, b)     ({ dst[0]=a[1]; dst[1]=b[1]; })

# define VMUX0_2F8(dst, a, b)     ({ dst[0]=a[0]; dst[1]=b[0]; })
# define VMUX1_2F8(dst, a, b)     ({ dst[0]=a[1]; dst[1]=b[1]; })

#endif


#ifdef V2F8_IMPL_SSE3

# if (defined(USE_IMMINTRIN) || defined (__AVX__))
#   include <immintrin.h>
# else
#   include <emmintrin.h>
# endif

  typedef __m128d  V2F8;

# define VLOAD_2F8(dest, src)         (dest) = _mm_load_pd(src)
# define VCOPY_2F8(dest, src)         (dest) = (src)
# define VLOADO_2F8(dst, src, offset)  (dst) = _mm_load_pd((src) + (offset)/8)

# define VFILL_2F8(dest, sc0, sc1, tmp) ({ (tmp)[0]=(sc0); (tmp)[1]=(sc1); \
                                                      VLOAD_2F8(dest, tmp); })

/* _mm_set_pd1 is deprecated, and is just a synonym for _mm_set1_pd; apparently
   both are provided in MacOS 10.4's version of emmintrin.h -20110920 */
# define VSPLAT_2F8(dst, src)          (dst) = _mm_set1_pd(src)

# define VSAVE_2F8(dest, src) _mm_store_pd((dest), (src))
# define VSAVEO_2F8(dest, offset, src) _mm_store_pd((dest)+(offset)/8, (src))

# define VADD_2F8(dest, a, b)         (dest) = _mm_add_pd((a), (b))

# define INIT_MUL0_2F8(varname) /* nop */
# define INIT_MTMP_2F8(varname) V2F8 varname = _mm_setzero_pd()

# define VMUL_2F8(dest, a, b, v0)     (dest) = _mm_mul_pd((a), (b))

# define VMADD_2F8(dest, a, b, c, t)  ({ (t) = _mm_mul_pd((a), (b)); \
                                      (dest) = _mm_add_pd((t), (c)); })

# define VNMSUB_2F8(dest, a, b, c, t) ({ (t) = _mm_mul_pd((a), (b)); \
                                      (dest) = _mm_sub_pd((c), (t)); })

# define VDIVE_2F8(dest, a, b, v0)    (dest) = _mm_div_pd((a), (b))

// The multiplex (MUX) and demultiplex (DEM) operations work as follows:
// [w x] = [a b c d]    w=MUX0(u,v) x=MUX1(u,v)
// [u v] = [a c b d]    u=DEM0(w,x) v=DEM1(w,x)
// This is described in greater detail in each operator.

# define VDEM0_2F8(dst, a, b)          (dst) = _mm_shuffle_pd((a), (b), 0)
# define VDEM1_2F8(dst, a, b)          (dst) = _mm_shuffle_pd((a), (b), 3)

# define VMUX0_2F8(dst, a, b)          (dst) = _mm_shuffle_pd((a), (b), 0)
# define VMUX1_2F8(dst, a, b)          (dst) = _mm_shuffle_pd((a), (b), 3)

#endif

/*
  new = 11 9 7 5  src == 4 3 2 1

  dst = 1 1 11 11

  dst = 3 2 1 11
*/


/* Partial implementation of V4F8 */

#ifdef V4F8_IMPL_EMUL

/* Emulated implementation of V4F8 */

# define VLOAD_4F8(dest, src) ({ (dest)[0]=(src)[0]; (dest)[1]=(src)[1]; \
                                 (dest)[2]=(src)[2]; (dest)[3]=(src)[3]; })
# define VCOPY_4F8(dest, src) VLOAD_4F8(dest, src)
# define VLOADO_4F8(dest, src, offset) ({ (dest)[0]=(src)[(offset)/8]; \
                                         (dest)[1]=(src)[(offset)/8+1]; \
                                         (dest)[2]=(src)[(offset)/8+2]; \
                                         (dest)[3]=(src)[(offset)/8+3]; })
# define VFILL_4F8(dest, sc0, sc1, tmp) ({ (dest)[0]=(sc0); (dest)[1]=(sc1); \
                                          (dest)[2]=(sc0); (dest)[3]=(sc1); })
# define VSPLAT_4F8(dst, src) VFILL_4F8(dst, src, src, 0)
# define VSAVE_4F8(dest, src) ({ (dest)[0]=(src)[0]; (dest)[1]=(src)[1]; \
                                 (dest)[0]=(src)[2]; (dest)[1]=(src)[3]; })

# define VSAVEO_4F8(dest, offset, src) ({ (dest)[(offset)/8]=(src)[0]; \
                                         (dest)[(offset)/8+1]=(src)[1]; \
                                         (dest)[(offset)/8+2]=(src)[2]; \
                                         (dest)[(offset)/8+3]=(src)[3]; })

# define VADD_4F8(dest, a, b) ({ (dest)[0]=a[0]+b[0]; (dest)[1]=a[1]+b[1]; \
                                 (dest)[2]=a[2]+b[2]; (dest)[3]=a[3]+b[3]; })

# define INIT_MUL0_4F8(varname)
# define INIT_MTMP_4F8(varname)

# define VMUL_4F8(dest, a, b, v0) ({ \
                                (dest)[0]=a[0]*b[0]; (dest)[1]=a[1]*b[1]; \
                                (dest)[2]=a[2]*b[2]; (dest)[3]=a[3]*b[3]; })
# define VMADD_4F8(dest, a, b, c, t)  ({ (dest)[0]=a[0]*b[0] + c[0]; \
                                         (dest)[1]=a[1]*b[1] + c[1]; \
                                         (dest)[2]=a[2]*b[2] + c[2]; \
                                         (dest)[3]=a[3]*b[3] + c[3]; })
# define VNMSUB_4F8(dest, a, b, c, t) ({ (dest)[0]=c[0] - a[0]*b[0]; \
                                         (dest)[1]=c[1] - a[1]*b[1]; \
                                         (dest)[2]=c[2] - a[2]*b[2]; \
                                         (dest)[3]=c[3] - a[3]*b[3]; })

# define VDIVE_4F8(dest, a, b, v0) ({ \
                                   (dest)[0]=a[0]/b[0]; (dest)[1]=a[1]/b[1]; \
                                   (dest)[2]=a[2]/b[2]; (dest)[3]=a[3]/b[3]; })

/* The multiplex (MUX) and demultiplex (DEM) operations are used to interleave
   and un-interleave two lists of data.

  If list A consists of the F8 values (A0, A1, A2, A3, A4, ...) stored
consecutively in memory, and similarly list B consists of (B0, B1, ...)
then the demultiplex operation reads this data in and writes it out into a
single list containing (A0, B0, A1, B1, A2, ...). The multiplex operation
does the opposite.
  To illustrate the vector macros, consider 4 vectors u, v, w, and h
that consist of two data items in both interleaved and un-interleaved
orderings. One data stream consists of the odd letters: a, c, e, g, i,
... and the other consists of the even letters. If we define the
vectors as follows:

 vectors [w x] = scalars [a b c d e f g h]  (interleaved odd and even letters)
         [u v] =         [a c e g b d f h]  (two separate streams)

Then we can convert back and forth between the two orderings via these macro
calls:

    w=MUX0(u,v); x=MUX1(u,v); // multiplex (interleave)
    u=DEM0(w,x); v=DEM1(w,x); // demultiplex (de-interleave)
 */

# define VDEM0_4F8(dst, a, b) \
                      ({ dst[0]=a[0]; dst[1]=a[2]; dst[2]=b[0]; dst[3]=b[2]; })
# define VDEM1_4F8(dst, a, b) \
                      ({ dst[0]=a[1]; dst[1]=a[3]; dst[2]=b[1]; dst[3]=b[3]; })

# define VMUX0_4F8(dst, a, b) \
                      ({ dst[0]=a[0]; dst[1]=b[0]; dst[2]=a[1]; dst[3]=b[1]; })
# define VMUX1_4F8(dst, a, b) \
                      ({ dst[0]=a[2]; dst[1]=b[2]; dst[2]=a[3]; dst[3]=b[3]; })

#endif


#ifdef V4F8_IMPL_AVX

# if (defined(USE_IMMINTRIN) || defined (__AVX__))
/* AVX is defined through immintrin.h */
#   include <immintrin.h>
# else
  error__AVX__not_defined error__AVX__not_defined;
# endif

  typedef __m256d  V4F8;

# define VLOAD_4F8(dest, src)         (dest) = _mm256_load_pd(src)
# define VCOPY_4F8(dest, src)         (dest) = (src)
# define VLOADO_4F8(dst, src, offset) \
                                    (dst) = _mm256_load_pd((src) + (offset)/8)

# define VFILL_4F8(dest, sc0, sc1, sc2, sc3, tmp) \
         ({ (tmp)[0]=(sc0); (tmp)[1]=(sc1); (tmp)[2]=(sc2); (tmp)[3]=(sc3); \
                                                      VLOAD_4F8(dest, tmp); })

# define VSPLAT_4F8(dst, src)          (dst) = _mm256_set1_pd(src)

# define VSAVE_4F8(dest, src)          _mm256_store_pd((dest), (src))
# define VSAVEO_4F8(dest, offset, src) \
                                     _mm256_store_pd((dest)+(offset)/8, (src))

# define VADD_4F8(dest, a, b)         (dest) = _mm256_add_pd((a), (b))

# define INIT_MUL0_4F8(varname) /* nop */
# define INIT_MTMP_4F8(varname) V4F8 varname = _mm256_setzero_pd()

# define VMUL_4F8(dest, a, b, v0)     (dest) = _mm256_mul_pd((a), (b))

# define VMADD_4F8(dest, a, b, c, t)  ({ (t) = _mm256_mul_pd((a), (b)); \
                                      (dest) = _mm256_add_pd((t), (c)); })

# define VNMSUB_4F8(dest, a, b, c, t) ({ (t) = _mm256_mul_pd((a), (b)); \
                                      (dest) = _mm256_sub_pd((c), (t)); })

# define VDIVE_4F8(dest, a, b, v0)    (dest) = _mm256_div_pd((a), (b))

/* The multiplex (MUX) and demultiplex (DEM) operations work as follows:
   [w x] = [a b c d]    w=MUX0(u,v) x=MUX1(u,v)
   [u v] = [a c b d]    u=DEM0(w,x) v=DEM1(w,x)

 vectors [w x] = scalars [a b c d e f g h]
         [u v] =         [a c e g b d f h]

    w=MUX0(u,v); x=MUX1(u,v); // multiplex (interleave)
    u=DEM0(w,x); v=DEM1(w,x); // demultiplex (de-interleave)

 This is described in greater detail in each operator. */

# define VDEM0_4F8(dst, a, b)          (dst) = _mm256_shuffle_pd((a), (b), 0)
# define VDEM1_4F8(dst, a, b)          (dst) = _mm256_shuffle_pd((a), (b), 3)

# define VMUX0_4F8(dst, a, b)          (dst) = _mm256_unpacklo_pd((a), (b), 0)
# define VMUX1_4F8(dst, a, b)          (dst) = _mm256_unpackhi_pd((a), (b), 3)

#endif