This file describes how to use SSE builtin functions in gcc 3.2 for QDP.

1. The performance bottleneck on current versions of the P-4 (and
Xeon) appears to be memory bus bandwidth. It shows up both in
single-processor code and when communication is present. Therefore,
it is imperative to decrease the amount of data needed per
floating-point operation. While very little can be done to improve the
FLOP/memory ratio for Wilson fermions, domain wall fermions (DWF) offer
an extra data reuse opportunity, because the gauge field is replicated
along the fifth dimension.

2. Therefore, I suggest arranging 5-d DWF fields in memory so that:
   a) the 5th dimension is allocated locally on a processor
   b) psi[x][y][z][t][s][color][dirac][re_im] is mapped into memory in
      the following way
       psi[x][y][z][t][0][0][0][re]
       psi[x][y][z][t][1][0][0][re]
       psi[x][y][z][t][2][0][0][re]
       psi[x][y][z][t][3][0][0][re]
       psi[x][y][z][t][0][0][0][im]
       psi[x][y][z][t][1][0][0][im]
       psi[x][y][z][t][2][0][0][im]
       psi[x][y][z][t][3][0][0][im]
       psi[x][y][z][t][0][1][0][re]
       ....
       psi[x][y][z][t][3][2][0][im]
       psi[x][y][z][t][0][0][1][re]
       ....
       psi[x][y][z][t][3][2][3][im]
       psi[x][y][z][t][4][0][0][re]
       ....
      Thus slices 0...3 of each component are packed into one SSE
      vector, and no SSE shuffling is required for complex
      multiplication.
   c) In such a situation the layout of the gauge fields is not
      important (most of the time is spent in the Dirac inverter). We
      will assume that the gauge field is stored without SSE
      considerations.
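
The mapping in 2.b can be written down as an index function. The sketch
below is an illustration only: the names (dwf_offset, S_BLOCK, etc.) are
hypothetical and do not come from p4sse.hh or dwf.cc, and it assumes
that the extent of the fifth dimension is a multiple of 4.

```cpp
#include <cstddef>

// Hypothetical constants; the extents follow the listing in 2.b.
const std::size_t S_BLOCK = 4;   // s-slices packed into one SSE vector
const std::size_t N_COLOR = 3;
const std::size_t N_DIRAC = 4;
const std::size_t N_REIM  = 2;   // 0 = re, 1 = im

// Floats occupied by one block of four s-slices at a single site (96).
const std::size_t BLOCK_FLOATS = N_DIRAC * N_COLOR * N_REIM * S_BLOCK;

// Offset, in floats, of psi[x][y][z][t][s][color][dirac][re_im]
// relative to the start of site (x,y,z,t).
// Order from fastest to slowest: s within a block, re/im, color,
// dirac, block of s.
inline std::size_t dwf_offset(std::size_t s, std::size_t color,
                              std::size_t dirac, std::size_t re_im)
{
    return (s / S_BLOCK) * BLOCK_FLOATS
         + ((dirac * N_COLOR + color) * N_REIM + re_im) * S_BLOCK
         + s % S_BLOCK;
}
```

With this ordering the four floats at offsets 4k ... 4k+3 always belong
to the same (color, dirac, re_im) component of four consecutive
s-slices, which is what lets them be loaded as one SSE vector.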

3. Given the layout above, we need the following operations

   a) loading SSE vectors into SSE registers -- gcc 3.2 does it for us
      for the V4SF vector type.
   b) storing SSE registers -- unfortunately, gcc 3.2.3 generates poor
      code for that. It appears this could be fixed with operator=().
   c) Real vector multiplication, addition and subtraction -- inline
      wrappers around SSE-specific builtins do that.
   d) Complex multiplication and addition -- inlining real operations
      seems to do the trick.
   e) loading gauge field elements:  movss
   f) Replicating elements of gauge field for vector operations --
      there is no builtin corresponding to pshufd (we need only
      imm8=0). There are two possibilities:
        i) implement builtin and patch gcc
       ii) use GNU __asm__
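
Items 3.c, 3.d and 3.f can be illustrated with a small sketch. It is not
the code from p4sse.hh: it uses the arithmetic operators that current
GCC and Clang provide for vector types in place of the builtin wrappers
of 3.c, and the names (cvec4, splat, cadd, cmul_scalar) are
hypothetical.

```cpp
// Four packed floats, corresponding to the V4SF type in 3.a
// (GCC/Clang generic vector extension).
typedef float v4sf __attribute__((vector_size(16)));

// Four complex numbers, one per s-slice, with real and imaginary
// parts held in separate vectors as in the layout of 2.b.
struct cvec4 {
    v4sf re;
    v4sf im;
};

// Replicate one scalar into all four lanes (the effect needed in 3.f).
inline v4sf splat(float a)
{
    v4sf v = {a, a, a, a};
    return v;
}

// Complex addition is two real vector additions.
inline cvec4 cadd(const cvec4& a, const cvec4& b)
{
    cvec4 r;
    r.re = a.re + b.re;
    r.im = a.im + b.im;
    return r;
}

// Multiply four complex numbers by one complex scalar u, e.g. a single
// gauge field element shared by all four s-slices. Only real vector
// multiplies, adds and subtracts are used; no lanes are shuffled.
inline cvec4 cmul_scalar(float u_re, float u_im, const cvec4& p)
{
    const v4sf ur = splat(u_re);
    const v4sf ui = splat(u_im);
    cvec4 r;
    r.re = ur * p.re - ui * p.im;
    r.im = ur * p.im + ui * p.re;
    return r;
}
```

Because the same gauge element multiplies all four s-slices, it is
loaded once and reused for four complex products, which is the data
reuse argued for in item 1.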

4. File p4sse.hh shows a possible implementation of the primitive
   operations needed for efficient use of SSE hardware along the lines
   outlined above. I've chosen to use 3.f.ii for the shuffle.
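
For illustration, a shuffle in the style of 3.f.ii might look like the
sketch below. This is not the contents of p4sse.hh; the function names
are hypothetical, the asm assumes the default AT&T syntax on an x86
target with SSE2, and a plain C++ fallback is included for other
targets.

```cpp
// Four packed floats (GCC/Clang generic vector extension).
typedef float v4sf __attribute__((vector_size(16)));

// Copy lane 0 of v into all four lanes.
inline v4sf splat_lane0(v4sf v)
{
#if defined(__GNUC__) && defined(__SSE2__)
    v4sf r;
    // pshufd with imm8 = 0 selects dword 0 of the source for every
    // destination dword. AT&T operand order: imm8, source, destination.
    __asm__("pshufd $0, %1, %0" : "=x"(r) : "x"(v));
    return r;
#else
    // Fallback without inline asm, same result.
    v4sf r = {v[0], v[0], v[0], v[0]};
    return r;
#endif
}

// Put one gauge field float into lane 0 with the other lanes zero
// (what a movss load from memory produces, see 3.e), then replicate it.
inline v4sf load_splat(const float* p)
{
    v4sf v = {*p, 0.0f, 0.0f, 0.0f};
    return splat_lane0(v);
}
```

The "x" constraint asks the compiler to keep both operands in SSE
registers, so register allocation stays with gcc and the function can
still be inlined.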

5. File dwf.cc shows some QDP++-like operations built on top of p4sse.hh.


