\documentclass{article} \usepackage{noweb} \usepackage[boxed]{algorithm2e} %\usepackage[dvipdfm,colorlinks=true,linkcolor=cyan]{hyperref} \usepackage[dvipdfm,colorlinks=true,linkcolor=black]{hyperref} \newcommand{\note}[1]{$[\![$NB: #1$]\!]$} \setlength{\parindent}{0pt} \setlength{\topmargin}{-80pt} %\setlength{\oddsidemargin}{-28pt} %\setlength{\evensidemargin}{-28pt} %\setlength{\textwidth}{546pt} \setlength{\textheight}{720pt} \author{Andrew Pochinsky} \date{August 5, 2004} \title{Conjugate Gradient for Domain Wall Fermions with 4-d EO preconditioning\\Version 1.1.1} \begin{document} \maketitle \begin{abstract} This document presents an implementation of a conjugate gradient solver for the Domain Wall Fermion Dirac operator using Pentium~4 streaming SIMD extension (SSE). The code targets SciDAC's cluster machines implementing the QMP protocol. \end{abstract} \tableofcontents \section{INTRODUCTION} The code below interfaces with a Chroma-like upper level environment to provide file access and machine initialization and configuration. In fact, this file is an implementation of a level 3 routine for solving the Dirac equation. There are some restrictions on input parameters imposed by the algorithm and a particular way the SSE is used by the implementation. There are the following restrictions on the lattice geometry: \begin{itemize} \item All four-dimensional extends of the lattice should be even. This is required for even-odd decomposition used in the preconditioner. \item The fifth-dimension extend should be a multiple of 4. It is needed for efficient use of SSE registers and simplification of vector code. \item The implementation supports up to four dimensional tori as a network topology. \end{itemize} Because of many issues involved in optimizing the code, it is advantageous to put together some definitions and outline here the optimization strategy used. \subsection{Definitions} \begin{description} \item[Lattice extend] is the total size of the lattice in a given dimension. \item[Network] is the logical topology of the network presented by QMP to the apptication. \item[Node] is a computing element in the network which runs an execution thread. For this implementation we assume that there is one compute node per network location. If an SMP is used, it is the responsibility of QMP to provide a proper abstraction to the application. \item[Sublattice] is the part of the lattice that resides on a compute node. \item[Site] is a point on the lattice. \end{description} \subsection{Optimization Strategy} For this code we assume that scarsity of resources makes us run the inverter on a small number of nodes compared to the number of sites. This is based on the observation that physics needs grow faster than SciDAC budget and computer deployment plans. We also assume, that the current trend in computer industry persists, namely, that the processors grow faster while memory speed and latency continues to lag in relatice terms. We also want a solver whose performance would degrade gracefully when one moves out of the optimization domain. In particular, we impose no limitation on the size of sublattice. There is even no requirement that all sublattices should be of the same size. For the optimization sweetspot, we assume that the typical problem is too large to fit into the cache hierarhy and mostly resides in main memory. This is true now for existing and proposed clusters and is like to remain true for the future, since large scale computations tend to use larger lattices most of the time. \pagebreak \section{PHYSICS} Here we give the fermion action and $\gamma$-matrix and other conventions. \subsection{Dirac Operator} The Domain Wall Fermion Dirac operator is \begin{eqnarray*} \chi_{s,x} = D\psi &=& M_0\psi_{s,x}+ \sum_{\mu}\left((1+\gamma_\mu)U_{x,\mu}\psi_{s,x+\hat\mu}+ (1-\gamma_\mu)U^\dagger_{x-\hat\mu,\mu}\psi_{s,x-\hat\mu}\right)\\ && +(1+\gamma_5)M^{(+)}_s\psi_{s+1,x} +(1-\gamma_5)M^{(-)}_s\psi_{s-1,x} \end{eqnarray*} where \begin{displaymath} M^{(+)}_s = \left\{\begin{array}{cl} 1,& \mbox{if}\quad s < N_s-1\\ -m_f,& \mbox{if}\quad s = N_s-1 \end{array}\right. \end{displaymath} and \begin{displaymath} M^{(-)}_s = \left\{\begin{array}{cl} 1,& \mbox{if}\quad s > 0\\ -m_f,& \mbox{if}\quad s = 0 \end{array}\right. \end{displaymath} We also assume that $\psi_{N_s,x}=\psi_{0,x}$ and $\psi_{-1,x}=\psi_{N_s-1,x}$. \subsection{Gamma matrices} We use the same $\gamma$-matrix basis as Chroma to simplify conversion between two codes. The choice below could be changed with a few modifications to the rest of the code, if $\gamma_5$ is kept diagonal, and one of other $\gamma$-matrices has all nonzero entries equal to $+1$. \pagebreak \[ \gamma_0 = -\sigma_2 \otimes \sigma_1 = \left(\begin{array}{cc} 0&i\sigma_1\\ -i\sigma_1&0 \end{array}\right) = \left(\begin{array}{cccc} 0&0&0&i\\ 0&0&i&0\\ 0&-i&0&0\\ -i&0&0&0 \end{array}\right)\\ \] First, the projector and the reconstructor for $(1+\gamma_0)$: <>= g->f[0][c].re = f->f[0][c].re - f->f[3][c].im; g->f[0][c].im = f->f[0][c].im + f->f[3][c].re; g->f[1][c].re = f->f[1][c].re - f->f[2][c].im; g->f[1][c].im = f->f[1][c].im + f->f[2][c].re; @ <>= qs->f[0][c].re += hh[k].f[0][c].re; qs->f[3][c].im -= hh[k].f[0][c].re; qs->f[0][c].im += hh[k].f[0][c].im; qs->f[3][c].re += hh[k].f[0][c].im; qs->f[1][c].re += hh[k].f[1][c].re; qs->f[2][c].im -= hh[k].f[1][c].re; qs->f[1][c].im += hh[k].f[1][c].im; qs->f[2][c].re += hh[k].f[1][c].im; @ Now, same for $(1-\gamma_0)$: <>= g->f[0][c].re = f->f[0][c].re + f->f[3][c].im; g->f[0][c].im = f->f[0][c].im - f->f[3][c].re; g->f[1][c].re = f->f[1][c].re + f->f[2][c].im; g->f[1][c].im = f->f[1][c].im - f->f[2][c].re; @ <>= qs->f[0][c].re += hh[k].f[0][c].re; qs->f[3][c].im += hh[k].f[0][c].re; qs->f[0][c].im += hh[k].f[0][c].im; qs->f[3][c].re -= hh[k].f[0][c].im; qs->f[1][c].re += hh[k].f[1][c].re; qs->f[2][c].im += hh[k].f[1][c].re; qs->f[1][c].im += hh[k].f[1][c].im; qs->f[2][c].re -= hh[k].f[1][c].im; @ %\pagebreak \[ \gamma_1 = \sigma_2 \otimes \sigma_2 = \left(\begin{array}{cc} 0 & -i\sigma_2\\ i\sigma_2 & 0 \end{array}\right) = \left(\begin{array}{cccc} 0&0&0&-1\\ 0&0&1&0\\ 0&1&0&0\\ -1&0&0&0 \end{array}\right)\\ \] First, the projector and the reconstructor for $(1+\gamma_1)$: <>= g->f[0][c].re = f->f[0][c].re - f->f[3][c].re; g->f[0][c].im = f->f[0][c].im - f->f[3][c].im; g->f[1][c].re = f->f[1][c].re + f->f[2][c].re; g->f[1][c].im = f->f[1][c].im + f->f[2][c].im; @ <>= qs->f[0][c].re += hh[k].f[0][c].re; qs->f[3][c].re -= hh[k].f[0][c].re; qs->f[0][c].im += hh[k].f[0][c].im; qs->f[3][c].im -= hh[k].f[0][c].im; qs->f[1][c].re += hh[k].f[1][c].re; qs->f[2][c].re += hh[k].f[1][c].re; qs->f[1][c].im += hh[k].f[1][c].im; qs->f[2][c].im += hh[k].f[1][c].im; @ Now, same for $(1-\gamma_1)$: <>= g->f[0][c].re = f->f[0][c].re + f->f[3][c].re; g->f[0][c].im = f->f[0][c].im + f->f[3][c].im; g->f[1][c].re = f->f[1][c].re - f->f[2][c].re; g->f[1][c].im = f->f[1][c].im - f->f[2][c].im; @ <>= qs->f[0][c].re += hh[k].f[0][c].re; qs->f[3][c].re += hh[k].f[0][c].re; qs->f[0][c].im += hh[k].f[0][c].im; qs->f[3][c].im += hh[k].f[0][c].im; qs->f[1][c].re += hh[k].f[1][c].re; qs->f[2][c].re -= hh[k].f[1][c].re; qs->f[1][c].im += hh[k].f[1][c].im; qs->f[2][c].im -= hh[k].f[1][c].im; @ \pagebreak \[ \gamma_2 = -\sigma_2 \otimes \sigma_3 = \left(\begin{array}{cc} 0 & i\sigma_3\\ -i\sigma_3&0 \end{array}\right) = \left(\begin{array}{cccc} 0&0&i&0\\ 0&0&0&-i\\ -i&0&0&0\\ 0&i&0&0 \end{array}\right)\\ \] First, the projector and the reconstructor for $(1+\gamma_2)$: <>= g->f[0][c].re = f->f[0][c].re - f->f[2][c].im; g->f[0][c].im = f->f[0][c].im + f->f[2][c].re; g->f[1][c].re = f->f[1][c].re + f->f[3][c].im; g->f[1][c].im = f->f[1][c].im - f->f[3][c].re; @ <>= qs->f[0][c].re += hh[k].f[0][c].re; qs->f[2][c].im -= hh[k].f[0][c].re; qs->f[0][c].im += hh[k].f[0][c].im; qs->f[2][c].re += hh[k].f[0][c].im; qs->f[1][c].re += hh[k].f[1][c].re; qs->f[3][c].im += hh[k].f[1][c].re; qs->f[1][c].im += hh[k].f[1][c].im; qs->f[3][c].re -= hh[k].f[1][c].im; @ Now, same for $(1-\gamma_2)$: <>= g->f[0][c].re = f->f[0][c].re + f->f[2][c].im; g->f[0][c].im = f->f[0][c].im - f->f[2][c].re; g->f[1][c].re = f->f[1][c].re - f->f[3][c].im; g->f[1][c].im = f->f[1][c].im + f->f[3][c].re; @ <>= qs->f[0][c].re += hh[k].f[0][c].re; qs->f[2][c].im += hh[k].f[0][c].re; qs->f[0][c].im += hh[k].f[0][c].im; qs->f[2][c].re -= hh[k].f[0][c].im; qs->f[1][c].re += hh[k].f[1][c].re; qs->f[3][c].im -= hh[k].f[1][c].re; qs->f[1][c].im += hh[k].f[1][c].im; qs->f[3][c].re += hh[k].f[1][c].im; @ %\pagebreak \[ \gamma_3 = \sigma_1 \otimes 1 = \left(\begin{array}{cc} 0 & 1\\ 1 & 0 \end{array}\right) = \left(\begin{array}{cccc} 0&0&1&0\\ 0&0&0&1\\ 1&0&0&0\\ 0&1&0&0 \end{array}\right)\\ \] First, the projector and the reconstructor for $(1+\gamma_3)$: <>= g->f[0][c].re = f->f[0][c].re + f->f[2][c].re; g->f[0][c].im = f->f[0][c].im + f->f[2][c].im; g->f[1][c].re = f->f[1][c].re + f->f[3][c].re; g->f[1][c].im = f->f[1][c].im + f->f[3][c].im; @ <>= qs->f[0][c].re = hh[k].f[0][c].re; qs->f[2][c].re = hh[k].f[0][c].re; qs->f[0][c].im = hh[k].f[0][c].im; qs->f[2][c].im = hh[k].f[0][c].im; qs->f[1][c].re = hh[k].f[1][c].re; qs->f[3][c].re = hh[k].f[1][c].re; qs->f[1][c].im = hh[k].f[1][c].im; qs->f[3][c].im = hh[k].f[1][c].im; @ Now, same for $(1-\gamma_3)$: <>= g->f[0][c].re = f->f[0][c].re - f->f[2][c].re; g->f[0][c].im = f->f[0][c].im - f->f[2][c].im; g->f[1][c].re = f->f[1][c].re - f->f[3][c].re; g->f[1][c].im = f->f[1][c].im - f->f[3][c].im; @ <>= qs->f[0][c].re += hh[k].f[0][c].re; qs->f[2][c].re -= hh[k].f[0][c].re; qs->f[0][c].im += hh[k].f[0][c].im; qs->f[2][c].im -= hh[k].f[0][c].im; qs->f[1][c].re += hh[k].f[1][c].re; qs->f[3][c].re -= hh[k].f[1][c].re; qs->f[1][c].im += hh[k].f[1][c].im; qs->f[3][c].im -= hh[k].f[1][c].im; @ \pagebreak \section{CONJUGATE GRADIENT} Here we develop the algorithm used in the solver. \subsection{Standard Algorithm} The basic conjugate gradient algorithm~\ref{primitive} is simple. Its only requirement is that matrix $A$ is hermitian. Otherwise, it appears suited for DWF better than other iterative solvers. \incmargin{20pt} \begin{algorithm} \SetKwData{Input}{input} \KwIn{$A$, the matrix} \KwIn{$b$, the right hand side of the linear equation} \KwIn{$x_0$, an initial guess} \KwIn{$n$, the maximum number of iterations} \KwIn{$\epsilon$, required precision} \KwOut{$x$, approximate solution} \KwOut{$\rho$, final residue} \KwOut{$k$, number of iterations used} \SetKw{Break}{break} \dontprintsemicolon \Begin{ $x\leftarrow x_0$\; $p\leftarrow r\leftarrow b - Ax$\; $\rho\leftarrow\langle r,r\rangle$\; $k\leftarrow 0$\; \While{$\rho>\epsilon$ or $k < n$}{ $q\leftarrow Ap$\; $\alpha\leftarrow \rho/\langle p,q\rangle$\; $r\leftarrow r-\alpha q$\; $x\leftarrow x+\alpha p$\; $\gamma\leftarrow \langle r,r\rangle$\; \If{$\gamma<\epsilon$}{$\rho\leftarrow\gamma$\; \Break} $\beta\leftarrow\gamma/\rho$\; $\rho\leftarrow\gamma$\; $p\leftarrow r+\beta p$\; $k\leftarrow k+1$\; } \Return {$x$, $\rho$, $k$}. } \caption{\label{primitive}Generic Conjugate Gradient Solver} \end{algorithm} \decmargin{20pt} \subsection{Overlap Opportunities} Our approach to overlapping computations with communications is to break the sublattice into boundary and inside pieces. After that, we first compute $(1\pm\gamma_\mu)$ projections on the boundary and start send and receive operations. While communications are in progress, everything is computed on the inside nodes of the sublattice. Once receive is complete, we compute the operator on the boundary sites. Such an approach helps to improve temporal locality (and, therefore, cache utilization) at the expence of losing the ability of overlap if one of the sublattice dimensions is $2$. However, it is unlikely that we could afford a large enough cluster to be forced into this corner of the parameter space. \subsection{Non-hermitial Matrix} Hermiticity of $M$ is the only obstacle in applying algorithm~\ref{primitive} directly to our problem $M\psi=\eta$. This issue can be easily resolved by multiplying both sides by $M^\dagger$. However, instead of using algorithm~\ref{primitive} with $A=M^\dagger M$, it is better to keep $M$ and $M^\dagger$ separate---this makes it possible to hide one of the global sum computations, thus improving machine size scaling. Algorithm~\ref{real} is what we use in the solver. \incmargin{20pt} \begin{algorithm} \SetKwData{Input}{input} \KwIn{$M$, the matrix} \KwIn{$b$, the right hand side of the linear equation} \KwIn{$x_0$, an initial guess} \KwIn{$n$, the maximum number of iterations} \KwIn{$\epsilon$, required precision} \KwOut{$x$, approximate solution} \KwOut{$\rho$, final residue} \KwOut{$k$, number of iterations used} \SetKw{Break}{break} \dontprintsemicolon \Begin{ $x\leftarrow x_0$\; $p\leftarrow r\leftarrow b - M^{\dagger}Mx$\; $\rho\leftarrow\langle r,r\rangle$\; $k\leftarrow 0$\; \While{$\rho>\epsilon$ or $k < n$}{ $z\leftarrow Mp$\; $q\leftarrow M^{\dagger}z$\; $\alpha\leftarrow \rho/\langle z,z\rangle$\; $r\leftarrow r-\alpha q$\; $x\leftarrow x+\alpha p$\; $\gamma\leftarrow \langle r,r\rangle$\; \If{$\gamma<\epsilon$}{$\rho\leftarrow\gamma$\; \Break} $\beta\leftarrow\gamma/\rho$\; $\rho\leftarrow\gamma$\; $p\leftarrow r+\beta p$\; $k\leftarrow k+1$\; } \Return {$x$, $\rho$, $k$}. } \caption{\label{real}DWF-ready Gradient Solver.} \end{algorithm} \decmargin{20pt} \pagebreak \section{PRECONDITIONING} We use four dimensional preconditioner to improve convergence of the CG. Following Kostas Orginos, let us color the lattice sites according to the parity of $x_0+x_1+x+2+x_3$. Then we can rewrite $\chi=D\psi$ as follows: \begin{displaymath} \left(\begin{array}{c} \chi_e\\ \chi_o \end{array}\right) = D\psi = \left(\begin{array}{cc} Q_{ee}&Q_{eo}\\ Q_{oe}&Q_{oo} \end{array}\right) \left(\begin{array}{c} \psi_e\\ \psi_o \end{array}\right) \end{displaymath} From the form of $D$ it follows that all dependance on the gauge field is located in $Q_{xy}$, and that $Q_{xx}$ does not depend on $U$. That, in turn, allows us to invert $Q_{xx}$ easily. With this in mind, one writes: \begin{displaymath} \left(\begin{array}{cc} Q_{ee}&Q_{eo}\\ Q_{oe}&Q_{oo} \end{array}\right)= \left(\begin{array}{cc} Q_{ee}&0\\ Q_{oe}&Q_{oo} \end{array}\right) \left(\begin{array}{cc} 1&Q_{ee}^{-1}Q_{eo}\\ 0&1-Q_{oo}^{-1}Q_{oe}Q_{ee}^{-1}Q_{eo} \end{array}\right) \end{displaymath} Now, to solve the equation \begin{displaymath} D\psi=\eta, \end{displaymath} one needs to perform the following steps: \begin{enumerate} \item Compute \begin{displaymath} \phi_o=Q_{oo}^{-1}(\eta_o-Q_{oe}Q_{ee}^{-1}\eta_e) \end{displaymath} \item Set $M=1-Q_{oo}^{-1}Q_{oe}Q_{ee}^{-1}Q_{eo}$ for the following. \item Compute \begin{displaymath} \varphi_o=M^\dagger\phi_o \end{displaymath} \item Solve for $\psi_o$ the following equation using Algorithm~\ref{real} \begin{displaymath} M^{\dagger}M\psi_o=\varphi_o \end{displaymath} \item Compute \begin{displaymath} \psi_e=Q_{ee}^{-1}(\eta_e-Q_{eo}\psi_o) \end{displaymath} \end{enumerate} Note, that $M^\dagger=1-(Q_{eo})^\dagger (Q_{ee}^{-1})^\dagger (Q_{oe})^\dagger (Q_{oo}^{-1})^\dagger = 1-S_{oe}S_{ee}^{-1}S_{oe}S_{oo}^{-1}$, where \begin{eqnarray*} S_{ee}&=&Q_{ee}[\gamma_{5}\rightarrow-\gamma_{5}]\\ S_{oo}&=&Q_{oo}[\gamma_{5}\rightarrow-\gamma_{5}]\\ S_{oe}&=&Q_{eo}[\gamma_{\mu}\rightarrow-\gamma_{\mu}]\\ S_{eo}&=&Q_{oe}[\gamma_{\mu}\rightarrow-\gamma_{\mu}]\\ \end{eqnarray*} \subsection{$Q_{xx}$ inversion} The previous section is based on a tacit assumption that $Q_{ee}$ and $Q_{oo}$ are easy to invert. Here we show that it is so. Let us rewrite \begin{displaymath} \chi_{s,x}=\left(Q_{ee}\psi\right)_{s,x}= M_0\psi_{s,x}+(1+\gamma_5)M_s^{(+)}\psi_{s+1,x}+ (1+\gamma_5)M_s^{(-)}\psi_{s-1,x} \end{displaymath} as follows: \begin{displaymath} \left(Q_{ee}\psi\right)_{s,x}= M_0\left(\left({1+\gamma_5}\over{2}\right)\left(\psi_{s,x} +{2M_s^{(+)}\over M_0}\psi_{s+1,x}\right)+ \left({1-\gamma_5}\over{2}\right)\left(\psi_{s,x} +{2M_s^{(-)} \over M_0}\psi_{s-1,x}\right)\right). \end{displaymath} Thus, \begin{displaymath} Q_{ee}= \frac{1+\gamma_5}{2}\left(\begin{array}{ccccc} a&b&\cdots&0&0\\ 0&a&&0&0\\ \vdots&&\ddots&&\vdots\\ 0&0&&a&b\\ c&0&\cdots&0&a \end{array}\right) +\frac{1-\gamma_5}{2}\left(\begin{array}{ccccc} a&0&\cdots&0&c\\ b&a&&0&0\\ \vdots&&\ddots&&\vdots\\ 0&0&&a&0\\ 0&0&\cdots&b&a\\ \end{array}\right)=P_{+}A+P_{-}B, \end{displaymath} where $a=M_0$, $b=2$, $c=-2m_f$. Now, since $P_{\pm}$ comute with $A$ and $B$, $Q_{ee}^{-1}=P_{+}A^{-1}+P_{-}B^{-1}$. Computing $A^{-1}$ and $B^{-1}$ is done by decomposition $A=L_AR_A$, $B=L_BR_B$, where \begin{displaymath} R_A=\left( \begin{array}{ccccc} a&b&\cdots&0&0\\ 0&a&&0&0\\ \vdots&&\ddots&&\vdots\\ 0&0&&a&b\\ 0&0&\cdots&0&a\\ \end{array} \right)\quad\quad R_B=\left( \begin{array}{ccccc} a&0&\cdots&0&0\\ b&a&&0&0\\ \vdots&&\ddots&&\vdots\\ 0&0&&a&0\\ 0&0&\cdots&b&a\\ \end{array} \right),\\ \end{displaymath} and \begin{displaymath} L_A=\left(\begin{array}{cccccc} 1&0&0&0&\cdots&0\\ 0&1&0&0&&0\\ \vdots&&&&&\vdots\\ c/a&-bc/a^2&b^2c/a^3&-b^3c/a^4&\cdots&1+(-b)^{n-1}c/a^n\\ \end{array}\right) \end{displaymath} \begin{displaymath} L_B=\left(\begin{array}{cccccc} 1+(-b)^{n-1}c/a^n&(-b)^{n-2}c/a^{n-1}&\cdots&b^2c/a^3&-bc/a^2&c/a\\ 0&1&&0&0&0\\ \vdots&&&&&\vdots\\ 0&0&\cdots&0&0&1\\ \end{array}\right). \end{displaymath} In these terms, \begin{displaymath} Q_{ee}^{-1}=\frac{1+\gamma_5}{2}R_A^{-1}L_A^{-1} +\frac{1-\gamma_5}{2}R_B^{-1}L_B^{-1}. \end{displaymath} We will also need \begin{displaymath} S_{ee}^{-1}=\frac{1-\gamma_5}{2}R_A^{-1}L_A^{-1} +\frac{1+\gamma_5}{2}R_B^{-1}L_B^{-1}. \end{displaymath} For further reference, \begin{displaymath} \gamma_5 = \left(\begin{array}{cccc} 1&0&0&0\\ 0&1&0&0\\ 0&0&-1&0\\ 0&0&0&-1\\ \end{array}\right). \end{displaymath} \pagebreak \section{CODE} This section contains chunks of the source that go into \texttt{dwf.c} source file. We start with the interface functions and elaborate from there. \subsection{Interface Functions} We can not expect the user to call different parts of the interface in an appropriate order. Therefore, successful initialization allows the user to call other interface elements, as well as prevents repeated initializations. <>= static int inited_p = 0; @ \subsubsection{SSE DWF Initializer} <>= int SSE_DWF_init(const int lattice[DIM+1], SSE_DWF_FP_SIZE fp_size, void *(*allocator)(size_t size), void (*deallocator)(void *)) { if (inited_p) return 1; /* error: second init */ <> <> <> <> <> <> <> inited_p = 1; return 0; <> } @ If any error occurs during initialization, we simply unroll state and fail: <>= error: SSE_DWF_fini(); return 1; @ Check if the user requested special allocation mechanisms: <>= if (allocator) tmalloc = allocator; else tmalloc = malloc; if (deallocator) tfree = deallocator; else tfree = free; @ <>= static void *(*tmalloc)(size_t size); static void (*tfree)(void *ptr); @ For now we only support single precision floating point numbers: <>= if (fp_size != SSE_DWF_FLOAT) goto error; @ For single precision arithmentics, $L_s$ should be a muplitple of 4. <>= if (lattice[DIM] % Vs) goto error; tlattice[DIM] = lattice[DIM]; @ Otherwise, lattice sizes must be even to allow us to do red/black preconditioning: <>= { int i; for (i = 0; i < DIM; i++) { if (lattice[i] & 1) goto error; tlattice[i] = lattice[i]; } } @ <>= static int tlattice[DIM+1]; @ \subsubsection{SSE DWF Clean Up} The cleanup routine may be called from partially initialized context, we should be able to do a partial cleanup. <>= void SSE_DWF_fini(void) { <> <> <> inited_p = 0; } @ \subsubsection{DWF Fermion Allocator} When one needs an SSE DWF fermion, the allocator does the job. Remember, users are stupid enough to call this function in the uninitialized state. It is convenient to break all internal fermions into odd and even parts at this stage. <>= struct SSE_DWF_Fermion { vEvenFermion *even; vOddFermion *odd; }; @ Now, the fermion allocator proper: <>= SSE_DWF_Fermion * SSE_DWF_allocate_fermion(void) { SSE_DWF_Fermion *ptr; if (!inited_p) return 0; ptr = tmalloc(sizeof (*ptr)); if (ptr == 0) return 0; ptr->even = allocate_even_fermion(); if (ptr->even == 0) goto error1; ptr->odd = allocate_odd_fermion(); if (ptr->odd == 0) goto error2; return ptr; error2: free16(ptr->even); error1: tfree(ptr); return 0; } @ \subsubsection{DWF Fermion Exporter} When we need to create an SSE fermion field and populate it from an outer environment, we use the following procedure <>= SSE_DWF_Fermion * SSE_DWF_load_fermion(const void *OuterFermion, void *env, SSE_DWF_fermion_reader reader) { SSE_DWF_Fermion *ptr = SSE_DWF_allocate_fermion(); /* Handle both lack of memory and missing initialization */ if (ptr == 0) return 0; <> return ptr; } @ \subsubsection{DWF Fermion Importer} For moving data back to the outer environment, the following importer is used: <>= void SSE_DWF_save_fermion(void *OuterFermion, void *env, SSE_DWF_fermion_writer writer, SSE_DWF_Fermion *CGfermion) { if (!inited_p) return; <> } @ \subsubsection{DWF Fermion Deallocator} We only free pointers that we allocated. The magic is in [[free16()]]---it knows about all heap objects allocated by [[alloc16()]]. <>= void SSE_DWF_delete_fermion(SSE_DWF_Fermion *ptr) { if (!inited_p) return; free16(ptr->even); free16(ptr->odd); tfree(ptr); } @ \subsubsection{DWF Gauge Exporter} Unlike fermions, gauge fields are 4-d in the solver. Though they are not loaded by SSE memory operations, we still allocate 16-byte aligned memory for them (apparently for no good reason at all.) <>= SSE_DWF_Gauge * SSE_DWF_load_gauge(const void *OuterGauge_U, const void *OuterGauge_V, void *env, SSE_DWF_gauge_reader reader) { SSE_DWF_Gauge *g; if (!inited_p) return 0; g = allocate_gauge_field(); if (g == 0) return 0; <> return g; } @ Let us also define [[SSE_DWF_Gauge]] here. We do not need anything fancy for the gauge field: <>= struct SSE_DWF_Gauge { complex v[Nc][Nc]; }; @ \subsubsection{DWF Gauge Deallocator} Gauge deallocator is very much like fermion deallocator. We only keep them separate to help the type system cope with a error making user. <>= void SSE_DWF_delete_gauge(SSE_DWF_Gauge *ptr) { if (!inited_p) return; free16(ptr); } @ \subsubsection{The Solver} Finally, the solver itself. Here we check if the system has been properly initialized and dispatch on the float size (but not now yet.) <>= int SSE_DWF_cg_solver(SSE_DWF_Fermion *psi, /* result */ double *out_eps, int *out_iter, const SSE_DWF_Gauge *gauge, double M, double m0, const SSE_DWF_Fermion *x0, /* guess */ const SSE_DWF_Fermion *eta, /* rhs */ double eps, int max_iter) { int status; if (!inited_p) return 1; U = (SU3 *)gauge; <> <> <> <> return status; } @ Save one argument in many functions: <>= static SU3 *U; @ \subsection{Memory Allocation} SSE does like properly aligned memory. While automatic variables are aligned by the compiler, extra care is needed when dealing with the heap. The code allocates all its own memory aligned on 16-byte boundary by calling [[alloc16()]], and returns the memory through [[free16()]]. <>= static void * alloc16(int size) { int xsize = PAD16(size + sizeof (struct memblock)); struct memblock *p = tmalloc(xsize); if (p == 0) return p; p->data = ALIGN16(&p[1]); p->size = size; p->next = memblock.next; p->prev = &memblock; p->next->prev = p; p->prev->next = p; return p->data; } @ For readability, here are alignment operations: <>= #define PAD16(size) (15+(size)) #define ALIGN16(addr) ((void *)(~15 & (15 + (size_t)(addr)))) @ For deallocation we need to find an appropriate memory block: <>= static void free16(void *ptr) { struct memblock *p; if (ptr == 0) return; for (p = memblock.next; p != &memblock; p = p->next) { if (p->data != ptr) continue; p->next->prev = p->prev; p->prev->next = p->next; tfree(p); return; } /* this is BAD: control should reach here! */ } @ The head of the memory list is stored in a static variable. Of course, such an implementation makes no threadable, but let us worry about that when the time is right. <>= static struct memblock memblock = { &memblock, &memblock, NULL, 0 }; @ Finally, the datatype for the linked list: <>= struct memblock { struct memblock *next; struct memblock *prev; void *data; size_t size; }; @ \subsubsection{Field allocators} First, the prototypes: <>= static vEvenFermion *allocate_even_fermion(void); static vOddFermion *allocate_odd_fermion(void); static SSE_DWF_Gauge *allocate_gauge_field(void); /* vFermion *allocate_subfermion(int size); */ @ The only difference between even and odd fermions is (possibly) their size: <>= vEvenFermion * allocate_even_fermion(void) { return alloc16(even_odd.size * S_4 * sizeof (vFermion)); } vOddFermion * allocate_odd_fermion(void) { return alloc16(odd_even.size * S_4 * sizeof (vFermion)); } SSE_DWF_Gauge * allocate_gauge_field(void) { return alloc16(gauge_XYZT * sizeof (SSE_DWF_Gauge)); } @ \subsection{Probing Cluster Topology} There is no proper way to query QMP about lattice layout. We have to request the minimal meaningful information the library provides and try to repeat outer layer's partitioning of the lattice. There are good chances of success, but this is a potential danger spot. Here we prepare compute where on the lattice this node is and to build up our understanding of neighbors. Maybe optimistically, we assume that once QMP is initialized, it reports logical dimensions and coordinates properly, so that we do not need to be paranoidal about errors here. <>= { int i, dn; const QMP_u32_t *xn, *xc; if (!QMP_logical_topology_is_declared()) /* The user must have declared logical topology before */ goto error; dn = QMP_get_logical_number_of_dimensions(); if (dn > DIM) /* Too high dimension of the logical network */ goto error; xn = QMP_get_logical_dimensions(); xc = QMP_get_logical_coordinates(); for (i = 0; i < dn; i++) { network[i] = xn[i]; coord[i] = xc[i]; } for (; i < dn; i++) { network[i] = 1; coord[i] = 0; } } @ Some global variables: <>= static int network[DIM]; static int coord[DIM]; @ \subsection{Moving Data} \subsubsection{Reading the Gauge Field} Let us start with reading of the gauge field from the outer environment first. Here we assume that there is an address translation function to help us in talking to the outer layer. <>= { int x[DIM], i, d, a, b, p1; <> <> <> for (d = 0; d < DIM; d++) <> } @ At a given site, load [[DIM]] gauge elements: <>= p1 = to_Ulinear(x, &bounds, -1); for (d = 0; d < DIM; d++) { for (a = 0; a < Nc; a++) { for (b = 0; b < Nc; b++) { g[p1 + d].v[a][b].re = reader(OuterGauge_U, env, x, d, a, b, 0); g[p1 + d].v[a][b].im = reader(OuterGauge_U, env, x, d, a, b, 1); } } } @ To start a scan over the lattice, initialize [[x]] and start the loop: <>= for (i = 0; i < DIM; i++) x[i] = bounds.lo[i]; for (i = 0; i < DIM;) { @ <>= for (i = 0; i < DIM; i++) x[i] = bounds->lo[i]; for (i = 0; i < DIM;) { @ Once all is done with the site [[x]], we are ready to advance the index: <>= for (i = 0; i < DIM; i++) { <> } } @ <>= for (i = 0; i < DIM; i++) { <> } } @ Since we are going to use a [[DIM-1]] dimensional scan as well, let us write it down here: <>= for (i = 0; i < DIM; i++) { if (i == d) continue; <> } } @ Now we can scan [[DIM]]-dimensional indices: <>= if (++x[i] == bounds.hi[i]) x[i] = bounds.lo[i]; else break; @ <>= if (++x[i] == bounds->hi[i]) x[i] = bounds->lo[i]; else break; @ DWF Dirac operator needs backward gauge links. We get them from [[OuterGauge_V]]. Here we only read the boundary links. <>= { if (network[d] == 1) continue; <> <> <> } @ Now we read a boundary element: <>= x[d] = bounds.lo[d] - 1; p1 = to_Ulinear(x, &bounds, d); x[d] = bounds.lo[d]; for (a = 0; a < Nc; a++) { for (b = 0; b < Nc; b++) { g[p1].v[a][b].re = reader(OuterGauge_V, env, x, d, a, b, 0); g[p1].v[a][b].im = reader(OuterGauge_V, env, x, d, a, b, 1); } } @ \subsubsection{Reading a Fermion} There are but two complications in reading the domain wall fermion. First, this is a good time to break the fermion into red and black pieces. In addition, here we construct SSE fermions. <>= { int x[DIM+1], i; <> <> <> } @ Data conversion is inherently inefficient. We do not try to overoptimize it here: <>= { int p = parity(x); int p1 = S_4 * to_HFlinear(x, &bounds, -1, 0); /* p is taken care of! */ vFermion *f = p? &ptr->odd[p1].f: &ptr->even[p1].f; for (x[DIM] = 0; x[DIM] < tlattice[DIM]; x[DIM] += Vs, f++) { int d; for (d = 0; d < Fd; d++) { int c; for (c = 0; c < Nc; c++) { f->f[d][c].re = import_vector(OuterFermion, env, reader, x, c, d, 0); f->f[d][c].im = import_vector(OuterFermion, env, reader, x, c, d, 1); } } } } @ A simple packer of [[Vs]] elements into a vector: <>= static inline vReal import_vector(const void *z, void *env, SSE_DWF_fermion_reader reader, int x[DIM+1], int c, int d, int re_im) { vReal f; REAL *v = (REAL *)&f; int i, xs; for (xs = x[DIM], i = 0; i < Vs; i++, x[DIM]++) { *v++ = reader(z, env, x, c, d, re_im); } x[DIM] = xs; return f; } @ \subsubsection{Writing a Fermion} Writing a fermion is not much different: <>= { int x[DIM+1], i; <> <> <> } @ <>= { int p = parity(x); int p1 = S_4 * to_HFlinear(x, &bounds, -1, 0); /* p is taken care of! */ vFermion *f = p? &CGfermion->odd[p1].f: &CGfermion->even[p1].f; for (x[DIM] = 0; x[DIM] < tlattice[DIM]; x[DIM] += Vs, f++) { int d; for (d = 0; d < Fd; d++) { int c; for (c = 0; c < Nc; c++) { save_vector(OuterFermion, env, writer, x, c, d, 0, &f->f[d][c].re); save_vector(OuterFermion, env, writer, x, c, d, 1, &f->f[d][c].im); } } } } @ Here's another little helper good only for writing back the fermion from SSE to the outer environement: <>= static inline void save_vector(void *z, void *env, SSE_DWF_fermion_writer writer, int x[DIM+1], int c, int d, int re_im, vReal *f) { REAL *v = (REAL *)f; int i, xs; for (xs = x[DIM], i = 0; i < Vs; i++, x[DIM]++) { writer(z, env, x, c, d, re_im, *v++); } x[DIM] = xs; } @ %--------- \subsection{Solver Initialization} Here are all pieces for setting up the structures needed to run the solver. \subsubsection{Constructing the neighbor tables} <>= if (init_tables()) { /* Something went wrong in the table construction */ goto error; } @ The table initializer creates all tables necessary for communication and computation. Memory is allocated here for index arrays. <>= static int init_tables(void) { struct neighbor tmp; int i, v; init_neighbor(&bounds, &neighbor); <> tmp = neighbor; build_neighbor(&even_odd, &bounds, 0, &tmp); build_neighbor(&odd_even, &bounds, 1, &tmp); return 0; } @ First, we set global data: <>= static struct bounds bounds; static int gauge_XYZT; static int S_4, S_4_1; @ <>= S_4 = tlattice[DIM] / 4; S_4_1 = S_4 - 1; for (v = 1, i = 0; i < DIM; i++) { v *= bounds.hi[i] - bounds.lo[i]; } gauge_XYZT = DIM * v; for (i = 0; i < DIM; i++) { if (network[i] < 2) continue; gauge_XYZT += v / (bounds.hi[i] - bounds.lo[i]); } @ The [[struct bounds]] helps us to navigate through the local part of the lattice. It is used by the initialization code only. <>= struct bounds { int lo[DIM]; int hi[DIM]; }; @ We keep two [[struct neighbor]], one for computation on the even sublattice, another---on the odd. In addition to [[even_odd]] and [[odd_even]], we need one more [[struct neighbor]] to keep the allocated pointers in. <>= static struct neighbor neighbor; static struct neighbor odd_even; static struct neighbor even_odd; @ Let us start with computing the boundary of the sublattice <>= static inline int lattice_start(int lat, int net, int coord) { int q = lat / net; int r = lat % net; return coord * q + ((coord < r)? coord: r); } static inline void mk_sublattice(struct bounds *bounds, int coord[]) { int i; for (i = 0; i < DIM; i++) { bounds->lo[i] = lattice_start(tlattice[i], network[i], coord[i]); bounds->hi[i] = lattice_start(tlattice[i], network[i], coord[i] + 1); } } @ All dynamic data are allocated in [[init_neighbor]] and are stored in [[neighbor]]. <>= static void init_neighbor(struct bounds *bounds, struct neighbor *neighbor); @ <>= static void init_neighbor(struct bounds *bounds, struct neighbor *neighbor) { int i; mk_sublattice(bounds, coord); neighbor->qmp_smask = 0; <> <> <> <> } @ <>= for (neighbor->size = 1, neighbor->inside_size = 1, i = 0; i < DIM; i++) { int ext = bounds->hi[i] - bounds->lo[i]; neighbor->size *= ext; if (network[i] > 1) neighbor->inside_size *= ext - 2; else neighbor->inside_size *= ext; } neighbor->boundary_size = neighbor->size - neighbor->inside_size; neighbor->site = tmalloc(neighbor->size * sizeof (struct site)); @ <>= if (neighbor->inside_size) neighbor->inside = tmalloc(neighbor->inside_size * sizeof (int)); else neighbor->inside = 0; @ <>= if (neighbor->boundary_size) neighbor->boundary = tmalloc(neighbor->boundary_size * sizeof (struct boundary)); else neighbor->boundary = 0; @ <>= for (i = 0; i < 2 * DIM; i++) { int d = i / 2; if (network[d] > 1) { neighbor->snd_size[i] = neighbor->size / (bounds->hi[d] - bounds->lo[d]); neighbor->snd[i] = tmalloc(neighbor->snd_size[i] * sizeof (int)); } else { neighbor->snd_size[i] = 0; neighbor->snd[i] = 0; } } @ Here is the definition of the neighbor table we spent soo much time initializing: <>= struct neighbor { int size; /* size of site table */ int inside_size; /* number of inside sites */ int boundary_size; /* number of boundary sites */ int snd_size[2*DIM]; /* size of send buffers in 8 dirs */ int rcv_size[2*DIM]; /* size of receive buffers */ int *snd[2*DIM]; /* i->x translation for send buffers */ int *inside; /* i->x translation for inside sites */ struct boundary *boundary; /* i->x,mask translation for boundary */ struct site *site; /* x->site translation for sites */ vHalfFermion *snd_buf[2*DIM]; /* Send buffers */ vHalfFermion *rcv_buf[2*DIM]; /* Receive buffers */ int qmp_size[4*DIM]; /* sizes of QMP buffers */ void *qmp_xbuf[4*DIM]; /* QMP snd/rcv buffer addresses */ vHalfFermion *qmp_buf[4*DIM]; /* send and receive buffers for QMP */ QMP_msgmem_t qmp_mm[4*DIM]; /* msgmem's for send and receive */ int Nx; /* number of msegs */ QMP_msghandle_t qmp_sh[2*DIM]; /* handles for sends */ QMP_msghandle_t qmp_sv[2*DIM]; /* copies of handles for finilization */ int qmp_smask; /* send flags for qmp_sh[] */ int Ns; /* number of send handles */ QMP_msghandle_t qmp_rh[2*DIM]; /* handles for receives */ int Nr; /* number of receive handles */ QMP_msghandle_t qmp_cr; /* common receive handle */ }; @ For boundary sites we only need 8 bits for the boundary indicators. However, allocating a whole [[int]] for [[mask]] is what the compiler does anyway. <>= struct boundary { int index; /* x-index of this boundary site */ int mask; /* bitmask of the borders */ }; @ In the following structure we keep information about links and neighboors of the site. Note, that there is one address for four forward links: they are packed in memory as defined in the comment. <>= struct site { int Uup; /* up-links are Uup, Uup+1, Uup+2, Uup+3 */ int Udown[DIM]; /* four down-links */ int F[2*DIM]; /* eight neighboring fermions on the other sublattice */ }; @ Now we can define [[build_neighbor()]]: <>= static void build_neighbor(struct neighbor *out, struct bounds *bounds, int par, struct neighbor *in) { int i,d, s, p, m; int x[DIM]; <> <> <> } @ <>= static void build_neighbor(struct neighbor *out, struct bounds *bounds, int parity, struct neighbor *in); @ First part is easy: we start with copying [[in]] to [[out]], resetting fields which will be computed shortly and setting [[p]] to [[bounds->lo]]: <>= *out = *in; out->size = 0; out->inside_size = 0; out->boundary_size = 0; for (d = 0; d < DIM; d++) { out->rcv_size[2*d] = out->snd_size[2*d] = 0; out->rcv_size[2*d+1] = out->snd_size[2*d+1] = 0; } @ This is a good place to reuse our lattice walking chunks. <>= <> s = parity(x); if (s != par) goto next; <> <> <> out->size++; in->site++; next: <> @ For [[p]] we use a function to compute it from [[x]]. As for [[m]], its eight low bits encode if there is a boundary nearby. Note, that even bits corresponds to \emph{step down} and odd bits correspond to \emph{step up}. <>= p = to_HFlinear(x, bounds, -1, 0); for (m = 0, d = 0; d < DIM; d++) { if (network[d] > 1) { if (x[d] == bounds->lo[d]) m |= 1 << (2 * d); if (x[d] + 1 == bounds->hi[d]) m |= 1 << (2 * d + 1); } } @ If no boundary was found near [[p]], we put it into inside. Otherwise, [[p]] belongs to the boundary. <>= if (m) { <> } else { <> } @ For the inside, simply add [[p]] to the list of sites and advance pointers and counters: <>= *in->inside++ = p; out->inside_size++; @ For the boundary, place [[p]] into [[index]] and [[m]] into [[mask]] and advance pointers. We also take the opprotunity to place [[p]] into send buffers where bits of [[m]] are set <>= in->boundary->index = p; in->boundary->mask = m; in->boundary++; out->boundary_size++; for (d = 0; d < 2*DIM; d++) { if ((m & (1 << d)) == 0) continue; *in->snd[d]++ = p; out->snd_size[d]++; } @ We are ready now to build local neighbors. All gauge fields are local, and we still have [[m]] to tell if the other sublattice neighbor is local or not. <>= in->site->Uup = to_Ulinear(x, bounds, -1); for (d = 0; d < DIM; d++) { in->site->Udown[d] = to_Ulinear(x, bounds, d); if ((m & (1 << (2 * d))) == 0) in->site->F[2*d] = S_4 * to_HFlinear(x, bounds, d, -1); if ((m & (1 << (2 * d + 1))) == 0) in->site->F[2*d + 1] = S_4 * to_HFlinear(x, bounds, d, +1); } @ The only piece left is the one dealing with outside indices. This is a tricky part, but we just happen to have almost enough machinery already to solve it: <>= for (d = 0; d < DIM; d++) { if (network[d] < 2) continue; construct_rec(out, par, bounds, d, +1); construct_rec(out, par, bounds, d, -1); } @ We also need a function that will walk through a boundary of a neighbor building the outside part of the [[site[].F]] indices. <>= static void construct_rec(struct neighbor *out, int par, struct bounds *bounds, int dir, int step); @ <>= static void construct_rec(struct neighbor *out, int par, struct bounds *bounds, int dir, int step) { struct bounds xb; int xc[DIM], x[DIM]; int s, d, p, k; int dz = dir * 2 + ((step>0)?1:0); <> <> <> } @ Constucting the neighbor's network position is straightforward: <>= for (d = 0; d < DIM; d++) { int v = coord[d] + ((d==dir)?step:0); if (v < 0) v += network[d]; if (v >= network[d]) v -= network[d]; xc[d] = v; } mk_sublattice(&xb, xc); @ The initial point should be on the surface we are walking: <>= for (d = 0; d < DIM; d++) x[d] = ((d == dir) && (step < 0))? (xb.hi[d] - 1): xb.lo[d]; @ Walking through the hypersurface is very much like walking through the sublattice below. There are only two differences: (a) we are walking opposite parity sublattice surface here and, (b) while advancing the point, we should stay on the surface selected above. <>= /* ZZZ: This needs some cleaning */ k = 0; do { for (d = 0, s = par; d < DIM; d++) s += x[d]; if (!(s & 1)) goto next; <> <> next: for (d = 0; d < DIM; d++) { if (d == dir) continue; if (++x[d] == xb.hi[d]) x[d] = xb.lo[d]; else break; } } while (d != DIM); out->rcv_size[dz^1] = k; /* XXX is it true? */ @ <>= p = to_HFlinear(x, bounds, dir, -step); @ <>= out->site[p].F[dz] = S_4 * k++; @ Here we do the reverse, namely, free all memory allocated by [[init_tables()]]: <>= { int i; if (neighbor.site) { tfree(neighbor.site); neighbor.site = 0; } if (neighbor.inside) { tfree(neighbor.inside); neighbor.inside = 0; } if (neighbor.boundary) { tfree(neighbor.boundary); neighbor.boundary = 0; } for (i = 2 * DIM; i--;) { if (neighbor.snd[i] == 0) continue; tfree(neighbor.snd[i]); neighbor.snd[i] = 0; } } @ \subsubsection{Address translation routines} Let us define a couple of functions for translating 4-d lattice positions into 1-d offsets. Computing linear position on the sublattice is used often enough to be placed in a function. To avoid writing two very similar functions, we pass two arguments [[q]], and [[z]] to specify that $q$-component of [[p]] should adjusted by $z$. If $q < 0$, [[q]] and [[z]] are ignored. <>= static int to_HFlinear(int p[], struct bounds *b, int q, int z) { int x, d; for (x = 0, d = 4; d--;) { int v = p[d] + ((d == q)?z:0); int s = b->hi[d] - b->lo[d]; if (v < 0) v += tlattice[d]; if (v >= tlattice[d]) v -= tlattice[d]; x = x * s + v - b->lo[d]; } return x / 2; } @ Computing the index of the gauge link is similar to [[to_HFlinear]], except that the extra parameter [[q]] tells us which of [[p]] should be stepped down by one. If $q<0$, we are computing forward link position. <>= static int to_Ulinear(int p[], struct bounds *b, int q) { int x, d; if ((q < 0) || (p[q] > b->lo[q]) || (network[q] < 2)) { <> } else { <> } } @ Regular gauge links sits four per site and their indices are easy to compute: <>= for (x = 0, d = 4; d--;) { int s = b->hi[d] - b->lo[d]; int v = p[d] - ((q == d)?1:0); if (v < 0) v += tlattice[d]; x = x * s + v - b->lo[d]; } return 4 * x + ((q < 0)?0:q); @ For borrowed links we need first to skip all regulars and previous faces and then count position on the borrowed 3-face: <>= int s0, v0; for (d = 0, v0 = 1; d < 4; d++) v0 *= b->hi[d] - b->lo[d]; for (d = 0, s0 = 4 * v0; d < q; d++) s0 += v0 / (b->hi[d] - b->lo[d]); for (d = 4, x = 0; d--;) { int s = b->hi[d] - b->lo[d]; int v = p[d]; if (d == q) continue; x = x * s + v - b->lo[d]; } return s0 + x; @ \subsection{QMP Initialization} <>= #include @ Once the tables and sizes are known, allocate all send and receive buffers and register them with QMP. <>= if (build_buffers(&even_odd)) goto error; if (build_buffers(&odd_even)) goto error; @ There are three cases we need to consider when preparing the communication handles. Note: return 1 if there was trouble. <>= static int build_buffers(struct neighbor *nb); @ <>= static int build_buffers(struct neighbor *nb) { int i, k, Nr; QMP_msghandle_t Rh[2*DIM]; Nr = nb->Ns = nb->Nx = 0; for (i = 0; i < DIM; i++) { switch (network[i]) { case 1: break; case 2: <> break; default: /* Order here is important */ <> <> break; } } <> return 0; } @ If there is only two nodes in a direction, we use only up link to communicate (becasuse there is only one wire between the nodes.) <>= k = make_buffer(nb, nb->snd_size[2*i] + nb->snd_size[2*i+1]); nb->snd_buf[2*i] = nb->qmp_buf[k]; nb->snd_buf[2*i+1] = nb->snd_buf[2*i] + S_4 * nb->snd_size[2*i]; make_send(nb, k, i, +1); k = make_buffer(nb, nb->rcv_size[2*i] + nb->rcv_size[2*i+1]); nb->rcv_buf[2*i] = nb->qmp_buf[k]; nb->rcv_buf[2*i+1] = nb->snd_buf[2*i] + S_4 * nb->snd_size[2*i]; Nr = make_receive(nb, k, i, -1, Rh, Nr); /* -1 here helps with a bug in GigE QMP */ @ On a large machine, up and down buffers are separate: <>= k = make_buffer(nb, nb->snd_size[2*i+1]); nb->snd_buf[2*i+1] = nb->qmp_buf[k]; make_send(nb, k, i, +1); k = make_buffer(nb, nb->rcv_size[2*i+1]); nb->rcv_buf[2*i+1] = nb->qmp_buf[k]; Nr = make_receive(nb, k, i, +1, Rh, Nr); @ <>= k = make_buffer(nb, nb->snd_size[2*i]); nb->snd_buf[2*i] = nb->qmp_buf[k]; make_send(nb, k, i, -1); k = make_buffer(nb, nb->rcv_size[2*i]); nb->rcv_buf[2*i] = nb->qmp_buf[k]; Nr = make_receive(nb, k, i, -1, Rh, Nr); @ Allocate a buffer of [[size]] [[vHalfFermion]]'s fit for send and/or receive. <>= static int make_buffer(struct neighbor *nb, int size); @ <>= static int make_buffer(struct neighbor *nb, int size) { int bcount = size * S_4 * sizeof (vHalfFermion); int N = nb->Nx; nb->qmp_size[N] = size; sse_aligned_buffer(nb, N, bcount); nb->qmp_mm[N] = QMP_declare_msgmem(nb->qmp_buf[N], bcount); nb->Nx = N + 1; return N; } @ Construct a send handle. This function also places a copy of the send handle into a proper place and sets a bit in [[qmp_smask]]. <>= static void make_send(struct neighbor *nb, int k, int i, int d); @ <>= static void make_send(struct neighbor *nb, int k, int i, int d) { QMP_msghandle_t h = QMP_declare_send_relative(nb->qmp_mm[k], i, d, 1); int j = 2 * i + ((d < 0)? 0: 1); nb->qmp_sh[j] = h; nb->qmp_sv[nb->Ns++] = h; nb->qmp_smask |= (1 << j); } @ Constructing a receive handle is similar. We increment [[Nr]] to keep the count of filled positions in [[Rh]]. <>= static int make_receive(struct neighbor *nb, int k, int i, int d, QMP_msghandle_t Rh[2*DIM], int Nr); @ <>= static int make_receive(struct neighbor *nb, int k, int i, int d, QMP_msghandle_t Rh[2*DIM], int Nr) { Rh[Nr] = QMP_declare_receive_relative(nb->qmp_mm[k], i, d, 1); return Nr+1; } @ Finally, aggregate all receive handles: <>= nb->qmp_cr = QMP_declare_multiple(Rh, Nr); @ SSE likes its memory aligned at 16 bytes. We need to keep that in mind when asking for QMP memory. Note, that this function may be in violation of a strict interpretation of the QMP Specification, but on many SciDAC calls numerous assurances were given that such usage is permissable. <>= static void sse_aligned_buffer(struct neighbor *nb, int k, int size); @ <>= static void sse_aligned_buffer(struct neighbor *nb, int k, int size) { int xcount = size + 15; char *ptr = QMP_allocate_aligned_memory(xcount); nb->qmp_buf[k] = (void *)(~15 & (15 + (unsigned long)(ptr))); nb->qmp_xbuf[k] = ptr; } @ Freeing QMP structure does the reverse of the allocator: <>= free_buffers(&even_odd); free_buffers(&odd_even); @ There are some unsettling omissions in the QMP specification. What follows is based on the tribal wisdom which was not codified. <>= static void free_buffers(struct neighbor *nb); @ <>= static void free_buffers(struct neighbor *nb) { int i; <> <> <> } @ Here we assume that [[QMP_free_msghandle()]] knows what to do with a bad handle returned from [[QMP_declare_send...()]] and [[QMP_declare_receive...()]]. The first common wisdom is that [[QMP_declare_multiple()]] invalidates individual handles. We only need to free one handle in [[nb->qmp_cr]]: <>= QMP_free_msghandle(nb->qmp_cr); @ There is no need to walk through the dimension again: we conveniently packed all send handles into an array: <>= for (i = nb->Ns; i--;) QMP_free_msghandle(nb->qmp_sv[i]); @ Two steps are needed to deallocate QMP memory: <>= for (i = nb->Nx; i--;) { QMP_free_msgmem(nb->qmp_mm[i]); QMP_free_aligned_memory(nb->qmp_xbuf[i]); } @ \subsection{Parts of the Solver} Here are three principal parts of the solver. Firts, we compute the right hand side of the equation to be solved by the CG. Next, there is a solver of a hermitian matrix. Finally, the second half of the solution is computed. \subsubsection{Compute the RHS} Here we perform steps 1--3 of the outline above. <>= compute_Qee1(auxA_e, eta->even); compute_Qoe(auxB_o, auxA_e); compute_sum_o(auxA_o, eta->odd, -1, auxB_o); compute_Qoo1(auxB_o, auxA_o); compute_Mx(Phi_o, auxB_o); @ <>= static vOddFermion *auxA_o, *auxB_o, *Phi_o; static vEvenFermion *auxA_e; @ <>= Phi_o = allocate_odd_fermion(); if (Phi_o == 0) goto error; auxA_o = allocate_odd_fermion(); if (auxA_o == 0) goto error; auxB_o = allocate_odd_fermion(); if (auxB_o == 0) goto error; auxA_e = allocate_even_fermion(); if (auxA_e == 0) goto error; @ <>= if (auxA_e) free16(auxA_e); auxA_e = 0; if (auxB_o) free16(auxB_o); auxB_o = 0; if (auxA_o) free16(auxA_o); auxA_o = 0; if (Phi_o) free16(Phi_o); Phi_o = 0; @ \subsection{Field Operations} Hermitian solver follows: <>= status = cg(psi->odd, Phi_o, x0->odd, eps, max_iter, out_eps, out_iter); @ <>= static int cg(vOddFermion *psi, const vOddFermion *b, const vOddFermion *x0, double epsilon, int max_iter, double *out_eps, int *out_iter); @ <>= static int cg(vOddFermion *x_o, const vOddFermion *b, const vOddFermion *x0, double epsilon, int N, double *out_eps, int *out_N) { double rho, alpha, beta, gamma, norm_z; int status = 1; int k; copy_o(x_o, x0); compute_MxM(p_o, &norm_z, x_o); compute_sum_oN(r_o, &rho, b, -1, p_o); copy_o(p_o, r_o); <> for (k = 0; (rho > epsilon) && (k < N); k++) { compute_MxM(q_o, &norm_z, p_o); <> alpha = rho / norm_z; compute_sum2_oN(r_o, &gamma, -alpha, q_o); compute_sum2_o(x_o, alpha, p_o); <> if (gamma <= epsilon) { rho = gamma; status = 0; break; } beta = gamma / rho; rho = gamma; compute_sum2x_o(p_o, r_o, beta); } <> *out_N = k; *out_eps = rho; return status; } @ Temporaries used by the CG <>= static vOddFermion *r_o, *p_o, *q_o; @ <>= r_o = allocate_odd_fermion(); if (r_o == 0) goto error; p_o = allocate_odd_fermion(); if (p_o == 0) goto error; q_o = allocate_odd_fermion(); if (q_o == 0) goto error; @ <>= if (r_o) free16(r_o); r_o = 0; if (p_o) free16(p_o); p_o = 0; if (q_o) free16(q_o); q_o = 0; @ \subsubsection{Computing the even part of the result} Again, this is simpling performing step 5 of the outline above: <>= compute_Qeo(auxA_e, psi->odd); compute_sum_e(auxB_e, eta->even, -1, auxA_e); compute_Qee1(psi->even, auxB_e); @ <>= static vEvenFermion *auxB_e; @ <>= auxB_e = allocate_even_fermion(); if (auxB_e == 0) goto error; @ <>= if (auxB_e) free16(auxB_e); auxB_e = 0; @ \subsubsection{[[copy_o(d, s)]] or $d\leftarrow s$} This is a copies $d\leftarrow s$. Since it is used outside of the cg loop, we do not worry too much about efficiency here. Hence, cache pollution. <>= static void copy_o(vOddFermion *dst, const vOddFermion *src); @ <>= static void copy_o(vOddFermion *dst, const vOddFermion *src) { int i = odd_even.size * S_4 * sizeof (vOddFermion) / sizeof (vReal); vReal *d = (vReal *)dst; const vReal *s = (const vReal *)src; for ( ;i--;) *d++ = *s++; } @ \subsubsection{[[compute_sum2_o(d,alpha,s)]], or $d\leftarrow d + \alpha s$} This is a function we can not speedup much: too many bytes are needed per operation. In principle, one can play with uncached loads and stores, but let us leave that for later. <>= static void compute_sum2_o(vOddFermion *dst, double alpha, const vOddFermion *src); @ <>= static void compute_sum2_o(vOddFermion *dst, double alpha, const vOddFermion *src) { int i = odd_even.size * S_4 * sizeof (vOddFermion) / sizeof (vReal); vReal a = vmk1(alpha); vReal *d = (vReal *)dst; const vReal *s = (const vReal *)src; for ( ;i--;) *d++ += a * *s++; } @ \subsubsection{[[compute_sum2x_o(d,s,alpha)]], or $d\leftarrow \alpha d + s$} Almost the same as the previous one, but scaling is applied to another summand. <>= static void compute_sum2x_o(vOddFermion *dst, const vOddFermion *src, double alpha); @ <>= static void compute_sum2x_o(vOddFermion *dst, const vOddFermion *src, double alpha) { int i = odd_even.size * S_4 * sizeof (vOddFermion) / sizeof (vReal); vReal a = vmk1(alpha); vReal *d = (vReal *)dst; const vReal *s = (const vReal *)src; for ( ;i--; d++) *d = a * *d + *s++; } @ \subsubsection{[[compute_sum_x(d,x,alpha,y)]] or $q\leftarrow x + \alpha y$} Next are a pair of general sums with the destination distict from the sources. Do we really need separate functions for these? <>= static void compute_sum_e(vEvenFermion *d, const vEvenFermion *x, double alpha, const vEvenFermion *y); static void compute_sum_o(vOddFermion *d, const vOddFermion *x, double alpha, const vOddFermion *y); @ <>= static void compute_sum_e(vEvenFermion *d, const vEvenFermion *x, double alpha, const vEvenFermion *y) { const vReal *X = (const vReal *)x; const vReal *Y = (const vReal *)y; vReal *D = (vReal *)d; vReal a = vmk1(alpha); int i = even_odd.size * S_4 * sizeof (vEvenFermion) / sizeof (vReal); for (;i--;) *D++ = *X++ + a * *Y++; } @ <>= static void compute_sum_o(vOddFermion *d, const vOddFermion *x, double alpha, const vOddFermion *y) { const vReal *X = (const vReal *)x; const vReal *Y = (const vReal *)y; vReal *D = (vReal *)d; vReal a = vmk1(alpha); int i = odd_even.size * S_4 * sizeof (vOddFermion) / sizeof (vReal); for (;i--;) *D++ = *X++ + a * *Y++; } @ \subsubsection{Compute $d\leftarrow x + \alpha y$ and $r\leftarrow\langle d,d\rangle$} There are two remaining sums which compute a sum of two fermions and the norm of the result at the same time. <>= static void compute_sum_oN(vOddFermion *d, double *norm, const vOddFermion *x, double alpha, const vOddFermion *y); <>= static void compute_sum_oN(vOddFermion *d, double *norm, const vOddFermion *x, double alpha, const vOddFermion *y) { const vReal *X = (const vReal *)x; const vReal *Y = (const vReal *)y; vReal *D = (vReal *)d; vReal a = vmk1(alpha); vReal s = vmk1(0.0); vReal v; int i = odd_even.size * S_4 * sizeof (vOddFermion) / sizeof (vReal); for (;i--;) { v = *X++ + a * *Y++; s += v * v; *D++ = v; } *norm = vsum(s); <> } @ <>= static void compute_sum2_oN(vOddFermion *d, double *norm, double alpha, const vOddFermion *y); <>= static void compute_sum2_oN(vOddFermion *d, double *norm, double alpha, const vOddFermion *y) { const vReal *Y = (const vReal *)y; vReal *D = (vReal *)d; vReal a = vmk1(alpha); vReal s = vmk1(0.0); vReal v; int i = odd_even.size * S_4 * sizeof (vOddFermion) / sizeof (vReal); for (;i--;) { v = *D + a * *Y++; s += v * v; *D++ = v; } *norm = vsum(s); <> } @ \subsubsection{Compute $\eta\leftarrow M^{\dagger}M\psi$ and friends} Last three easy pieces. <>= static void compute_MxM(vOddFermion *eta, double *norm, const vOddFermion *psi); static void compute_M(vOddFermion *eta, double *norm, const vOddFermion *psi); static void compute_Mx(vOddFermion *eta, const vOddFermion *psi); @ <>= static void compute_MxM(vOddFermion *eta, double *norm, const vOddFermion *psi) { compute_M(auxB_o, norm, psi); compute_Mx(eta, auxB_o); } @ Computation of $M$ starts the global sum which will be completed separately. <>= static void compute_M(vOddFermion *eta, double *norm, const vOddFermion *psi) { compute_Qee1Qeo(auxA_e, psi); compute_1Qoo1Qoe(eta, norm, psi, auxA_e); } @ For $M^{\dagger}$ the order of factors differs from optimal. For now we have to live with the inefficiency here. <>= static void compute_Mx(vOddFermion *eta, const vOddFermion *psi) { compute_Soo1(auxA_o, psi); compute_See1Seo(auxA_e, auxA_o); compute_1Soe(eta, psi, auxA_e); } @ \subsubsection{Standalone diagonal pieces} Some code savings are still possible, since [[compute_Qee1()]] may differ from [[compute_Qoo1()]] by the number of sites only. <>= static void compute_Qxx1(vFermion *eta, const vFermion *psi, int xyzt); static void inline compute_Qee1(vEvenFermion *eta, const vEvenFermion *psi) { compute_Qxx1(&eta->f, &psi->f, even_odd.size); } static void inline compute_Qoo1(vOddFermion *eta, const vOddFermion *psi) { compute_Qxx1(&eta->f, &psi->f, odd_even.size); } static void compute_Soo1(vOddFermion *eta, const vOddFermion *psi); @ \begin{displaymath} \chi = Q_{xx}^{-1}\psi \end{displaymath} <>= static void compute_Qxx1(vFermion *chi, const vFermion *psi, int size) { const vFermion *qs, *qx5; <> <> for (i = 0; i < size; i++) { xyzt5 = i * S_4; <> <> <> } } @ \begin{displaymath} \chi = S_{oo}^{-1}\psi \end{displaymath} <>= static void compute_Soo1(vOddFermion *Chi, const vOddFermion *Psi) { vFermion *chi = &Chi->f; const vFermion *psi = &Psi->f; int size = odd_even.size; const vFermion *qs, *qx5; <> <> for (i = 0; i < size; i++) { xyzt5 = i * S_4; <> <> <> } } @ \pagebreak \subsubsection{$Q_{xx}^{-1}$ and $S_{xx}^{-1}$ on a single $s$-chain} Therefore, <>= <> <> @ <>= <> <> @ And <>= <> <> @ <>= <> <> @ <>= <> <> @ <>= <> <> @ For both $Q_{xx}^{-1}$ and $S_{xx}^{-1}$ we need to compute $R_A$ and $R_B$. This can be done iteratively: \begin{eqnarray*} y^{(A)}_k &=& \left\{\begin{array}{ll}\frac{1}{a}z_k,&\quad\mbox{if } k=n-1\\ \frac{1}{a}z_k-\frac{b}{a}y^{(A)}_{k+1},&\quad\mbox{otherwise}\end{array} \right.\\ y^{(B)}_k &=& \left\{\begin{array}{ll}\frac{1}{a}z_0,&\quad\mbox{if } k=0\\ \frac{1}{a}z_k-\frac{b}{a}y^{(B)}_{k-1},&\quad\mbox{otherwise}\end{array} \right.\\ \end{eqnarray*} It turns out, that these computations are faster on the regular FP instructions than on SSE. For this reason corresponding parts for $L_X^{-1}$ depend on the memory layout of [[vReal]]. Let us compute constant pieces first. Division is slow, so we compute $1/a$ and $-b/a$ once and for all: <>= static REAL inv_a; static REAL b_over_a; @ <>= inv_a = 1.0 / a; b_over_a = -b * inv_a; @ Computing $z^{(A)}=L_A^{-1}x$ and $z^{(B)}=L_B^{-1}x$ is easy: \begin{eqnarray*} z^{(A)}_k&=&\left\{\begin{array}{ll} -\sum_{j=0}^{n-2}{\frac{(-b)^jc/a^{j+1}}{1+(-b)^{n-1}c/a^n}x_j}+ \frac{1}{1+(-b)^{n-1}c/a^n}x_{n-1}, &\quad\mbox{if }k=n-1\\ x_k,&\quad\mbox{otherwise} \end{array}\right.\\ z^{(B)}_k&=&\left\{\begin{array}{ll} \frac{1}{1+(-b)^{n-1}c/a^n}x_{0}- \sum_{j=1}^{n-1}{\frac{(-b)^{n-1-j}c/a^{n-j}}{1+(-b)^{n-1}c/a^n}x_j}, &\quad\mbox{if }k=0\\ x_k,&\quad\mbox{otherwise} \end{array}\right.\\ \end{eqnarray*} We need to rewrite these expressions in a form suitable for SSE. Let us write \begin{eqnarray*} z^{(A)}_{n-1}&=&z^{(A)}_a+z^{(A)}_b+z^{(A)}_c+z^{(A)}_d\\ z^{(B)}_{0}&=&z^{(B)}_a+z^{(B)}_b+z^{(B)}_c+z^{(B)}_d\\ \end{eqnarray*} where \begin{eqnarray*} z^{(A)}_a&=&\sum_{j=0}^{n/4-1}\frac{-b^{4j}c/a^{4j+1}}{1+(-b)^{n-1}c/a^n} x_{4j}\\ z^{(A)}_b&=&\sum_{j=0}^{n/4-1}\frac{b^{4j+1}c/a^{4j+2}}{1+(-b)^{n-1}c/a^n} x_{4j+1}\\ z^{(A)}_c&=&\sum_{j=0}^{n/4-1}\frac{-b^{4j+2}c/a^{4j+3}}{1+(-b)^{n-1}c/a^n} x_{4j+2}\\ z^{(A)}_d&=&\sum_{j=0}^{n/4-2}\frac{b^{4j+3}c/a^{4j+4}}{1+(-b)^{n-1}c/a^n} x_{4j+3} + \frac{1}{1+(-b)^{n-1}c/a^n}x_{n-1}\\ z^{(B)}_a&=&\frac{1}{1+(-b)^{n-1}c/a^n}x_{0}+ \sum_{j=1}^{n/4-1}\frac{b^{n-1-4j}c/a^{n-4j}}{1+(-b)^{n-1}c/a^n} x_{4j}\\ z^{(B)}_b&=&\sum_{j=0}^{n/4-1}\frac{-b^{n-2-4j}c/a^{n-4j-1}}{1+(-b)^{n-1}c/a^n} x_{4j+1}\\ z^{(B)}_b&=&\sum_{j=0}^{n/4-1}\frac{b^{n-3-4j}c/a^{n-4j-2}}{1+(-b)^{n-1}c/a^n} x_{4j+2}\\ z^{(B)}_b&=&\sum_{j=0}^{n/4-1}\frac{-b^{n-4-4j}c/a^{n-4j-3}}{1+(-b)^{n-1}c/a^n} x_{4j+3}\\ \end{eqnarray*} These sums could be effectively computed with SSE, because their structures match DWF memory layout. Here are constants needed to compute $z^{(A)}$ and $z^{(B)}$: <>= static vReal vfx_A; static vReal vfx_B; static vReal vab; static REAL c0; @ <>= c0 = 1./(1.-c/b*pow(b/a, S_4*4.)); vab = vmk1(pow(b/a, 4.)); vfx_A = vmk4(-c0*c/a, c0*c*b/(a*a), -c0*c*b*b/(a*a*a), c0*c*b*b*b/(a*a*a*a)); vfx_B = vmk4(c0*c*b*b*b/(a*a*a*a), -c0*c*b*b/(a*a*a), c0*c*b/(a*a), -c0*c/a); @ We need [[math.h]] for the prototype of [[pow()]]: <>= #include @ \subsubsection{Compute $L_A^{-1}$ and $L_B^{-1}$} There are two cases: \begin{enumerate} \item $L_X^{-1}$ is computed as part of standalone diagonal piece. In this case, the computation is done from $q$ to $r$ and $L_X^{-1}$ copies elements as needed. \item $Q_{xx}^{-1}$ is part of combined operator. In this case $q$ is aliased to $r$, and no copy is performed. \end{enumerate} Before spelling out the details, let us define a few handy macros: <>= #if defined(qs) #define QSETUP(s) #define Q2R(d,pt) #else #define QSETUP(s) qs = &qx5[s]; #define Q2R(d,pt) rs->f[d][c].pt = qs->f[d][c].pt; #endif @ <>= #undef QSETUP #undef Q2R @ For completeness, here are definitions of variables used in the pieces bellow: <>= vReal fx; vHalfFermion zV; vcomplex zn; complex zX[2][3]; @ <>= vhfzero(&zV); fx = vfx_A; <> for (s = 0; s < S_4_1; s++, fx = fx * vab) { rs = &rx5[s]; QSETUP(s) <> } rs = &rx5[S_4_1]; QSETUP(S_4_1) vput_3(&fx, c0); <> for (c = 0; c < 3; c++) { <> zn.re = qs->f[0][c].re; zn.im = qs->f[0][c].im; vput_3(&zn.re, zX[0][c].re); vput_3(&zn.im, zX[0][c].im); rs->f[0][c].re = zn.re; rs->f[0][c].im = zn.im; zn.re = qs->f[1][c].re; zn.im = qs->f[1][c].im; vput_3(&zn.re, zX[1][c].re); vput_3(&zn.im, zX[1][c].im); rs->f[1][c].re = zn.re; rs->f[1][c].im = zn.im; } <> @ This piece is used twice: once in the loop over $L_s$, and the second time after correcting $s_3$: <>= for (c = 0; c < 3; c++) { zV.f[0][c].re += fx * qs->f[0][c].re; Q2R(0,re) zV.f[0][c].im += fx * qs->f[0][c].im; Q2R(0,im) zV.f[1][c].re += fx * qs->f[1][c].re; Q2R(1,re) zV.f[1][c].im += fx * qs->f[1][c].im; Q2R(1,im) } @ The only difference between $L_A^{-1}$ on lower components is the source of the data and the destination of the result. We have to repeat most of the above pieces though. <>= vhfzero(&zV); fx = vfx_A; <> for (s = 0; s < S_4_1; s++, fx = fx * vab) { rs = &rx5[s]; QSETUP(s) <> } rs = &rx5[S_4_1]; QSETUP(S_4_1) vput_3(&fx, c0); <> for (c = 0; c < 3; c++) { <> zn.re = qs->f[2][c].re; zn.im = qs->f[2][c].im; vput_3(&zn.re, zX[0][c].re); vput_3(&zn.im, zX[0][c].im); rs->f[2][c].re = zn.re; rs->f[2][c].im = zn.im; zn.re = qs->f[3][c].re; zn.im = qs->f[3][c].im; vput_3(&zn.re, zX[1][c].re); vput_3(&zn.im, zX[1][c].im); rs->f[3][c].re = zn.re; rs->f[3][c].im = zn.im; } <> @ <>= for (c = 0; c < 3; c++) { zV.f[0][c].re += fx * qs->f[2][c].re; Q2R(2,re) zV.f[0][c].im += fx * qs->f[2][c].im; Q2R(2,im) zV.f[1][c].re += fx * qs->f[3][c].re; Q2R(3,re) zV.f[1][c].im += fx * qs->f[3][c].im; Q2R(3,im) } @ For $L_B^{-1}$ the difference is in the direction of the sweep along the $s$-chain: <>= vhfzero(&zV); fx = vfx_B; <> for (s = S_4; --s; fx = fx * vab) { rs = &rx5[s]; QSETUP(s) <> } rs = &rx5[0]; QSETUP(0) vput_0(&fx, c0); <> for (c = 0; c < 3; c++) { <> zn.re = qs->f[0][c].re; zn.im = qs->f[0][c].im; vput_0(&zn.re, zX[0][c].re); vput_0(&zn.im, zX[0][c].im); rs->f[0][c].re = zn.re; rs->f[0][c].im = zn.im; zn.re = qs->f[1][c].re; zn.im = qs->f[1][c].im; vput_0(&zn.re, zX[1][c].re); vput_0(&zn.im, zX[1][c].im); rs->f[1][c].re = zn.re; rs->f[1][c].im = zn.im; } <> @ Again, some repetition is needed for the lower component case: <>= vhfzero(&zV); fx = vfx_B; <> for (s = S_4; --s; fx = fx * vab) { rs = &rx5[s]; QSETUP(s) <> } rs = &rx5[0]; QSETUP(0) vput_0(&fx, c0); <> for (c = 0; c < 3; c++) { <> zn.re = qs->f[2][c].re; zn.im = qs->f[2][c].im; vput_0(&zn.re, zX[0][c].re); vput_0(&zn.im, zX[0][c].im); rs->f[2][c].re = zn.re; rs->f[2][c].im = zn.im; zn.re = qs->f[3][c].re; zn.im = qs->f[3][c].im; vput_0(&zn.re, zX[1][c].re); vput_0(&zn.im, zX[1][c].im); rs->f[3][c].re = zn.re; rs->f[3][c].im = zn.im; } <> @ By now, we have four partial sums which must be combined into $z_{n-1}$: <>= zX[0][c].re = vsum(zV.f[0][c].re); zX[0][c].im = vsum(zV.f[0][c].im); zX[1][c].re = vsum(zV.f[1][c].re); zX[1][c].im = vsum(zV.f[1][c].im); @ \subsubsection{Compute $R_A^{-1}$ and $R_B^{-1}$} Since $R_X^{-1}$ is always computed after $L_X^{-1}$, it takes its source from [[rs]] and places the result back into [[rs]]. For $R_X^{-1}$, again combinations of $A$ and $B$ and upper and lower parts require some cut, paste and edit. <>= <> for (s = S_4; s--;) { rs = &rx5[s]; for (c = 0; c < 3; c++) { <> <> } } @ <>= <> for (s = S_4; s--;) { rs = &rx5[s]; for (c = 0; c < 3; c++) { <> <> } } @ <>= <> for (s = 0; s < S_4; s++) { rs = &rx5[s]; for (c = 0; c < 3; c++) { <> <> } } @ <>= <> for (s = 0; s < S_4; s++) { rs = &rx5[s]; for (c = 0; c < 3; c++) { <> <> } } @ We do not handle boundary cases in a special way. Instead, the previos value of $y$ is stored in [[yOut]]: <>= complex yOut[2][3]; @ <>= yOut[0][0].re = yOut[0][0].im = 0; yOut[0][1].re = yOut[0][1].im = 0; yOut[0][2].re = yOut[0][2].im = 0; yOut[1][0].re = yOut[1][0].im = 0; yOut[1][1].re = yOut[1][1].im = 0; yOut[1][2].re = yOut[1][2].im = 0; @ Now, the magic of copy paste: <>= { REAL *rs0re = (REAL *)&rs->f[0][c].re; REAL *rs0im = (REAL *)&rs->f[0][c].im; rs0re[3] = inv_a * rs0re[3] + b_over_a * yOut[0][c].re; rs0re[2] = inv_a * rs0re[2] + b_over_a * rs0re[3]; rs0re[1] = inv_a * rs0re[1] + b_over_a * rs0re[2]; yOut[0][c].re = rs0re[0] = inv_a * rs0re[0] + b_over_a * rs0re[1]; rs0im[3] = inv_a * rs0im[3] + b_over_a * yOut[0][c].im; rs0im[2] = inv_a * rs0im[2] + b_over_a * rs0im[3]; rs0im[1] = inv_a * rs0im[1] + b_over_a * rs0im[2]; yOut[0][c].im = rs0im[0] = inv_a * rs0im[0] + b_over_a * rs0im[1]; } @ <>= { REAL *rs1re = (REAL *)&rs->f[1][c].re; REAL *rs1im = (REAL *)&rs->f[1][c].im; rs1re[3] = inv_a * rs1re[3] + b_over_a * yOut[1][c].re; rs1re[2] = inv_a * rs1re[2] + b_over_a * rs1re[3]; rs1re[1] = inv_a * rs1re[1] + b_over_a * rs1re[2]; yOut[1][c].re = rs1re[0] = inv_a * rs1re[0] + b_over_a * rs1re[1]; rs1im[3] = inv_a * rs1im[3] + b_over_a * yOut[1][c].im; rs1im[2] = inv_a * rs1im[2] + b_over_a * rs1im[3]; rs1im[1] = inv_a * rs1im[1] + b_over_a * rs1im[2]; yOut[1][c].im = rs1im[0] = inv_a * rs1im[0] + b_over_a * rs1im[1]; } @ <>= { REAL *rs2re = (REAL *)&rs->f[2][c].re; REAL *rs2im = (REAL *)&rs->f[2][c].im; rs2re[3] = inv_a * rs2re[3] + b_over_a * yOut[0][c].re; rs2re[2] = inv_a * rs2re[2] + b_over_a * rs2re[3]; rs2re[1] = inv_a * rs2re[1] + b_over_a * rs2re[2]; yOut[0][c].re = rs2re[0] = inv_a * rs2re[0] + b_over_a * rs2re[1]; rs2im[3] = inv_a * rs2im[3] + b_over_a * yOut[0][c].im; rs2im[2] = inv_a * rs2im[2] + b_over_a * rs2im[3]; rs2im[1] = inv_a * rs2im[1] + b_over_a * rs2im[2]; yOut[0][c].im = rs2im[0] = inv_a * rs2im[0] + b_over_a * rs2im[1]; } @ <>= { REAL *rs3re = (REAL *)&rs->f[3][c].re; REAL *rs3im = (REAL *)&rs->f[3][c].im; rs3re[3] = inv_a * rs3re[3] + b_over_a * yOut[1][c].re; rs3re[2] = inv_a * rs3re[2] + b_over_a * rs3re[3]; rs3re[1] = inv_a * rs3re[1] + b_over_a * rs3re[2]; yOut[1][c].re = rs3re[0] = inv_a * rs3re[0] + b_over_a * rs3re[1]; rs3im[3] = inv_a * rs3im[3] + b_over_a * yOut[1][c].im; rs3im[2] = inv_a * rs3im[2] + b_over_a * rs3im[3]; rs3im[1] = inv_a * rs3im[1] + b_over_a * rs3im[2]; yOut[1][c].im = rs3im[0] = inv_a * rs3im[0] + b_over_a * rs3im[1]; } @ <>= { REAL *rs0re = (REAL *)&rs->f[0][c].re; REAL *rs0im = (REAL *)&rs->f[0][c].im; rs0re[0] = inv_a * rs0re[0] + b_over_a * yOut[0][c].re; rs0re[1] = inv_a * rs0re[1] + b_over_a * rs0re[0]; rs0re[2] = inv_a * rs0re[2] + b_over_a * rs0re[1]; yOut[0][c].re = rs0re[3] = inv_a * rs0re[3] + b_over_a * rs0re[2]; rs0im[0] = inv_a * rs0im[0] + b_over_a * yOut[0][c].im; rs0im[1] = inv_a * rs0im[1] + b_over_a * rs0im[0]; rs0im[2] = inv_a * rs0im[2] + b_over_a * rs0im[1]; yOut[0][c].im = rs0im[3] = inv_a * rs0im[3] + b_over_a * rs0im[2]; } @ <>= { REAL *rs1re = (REAL *)&rs->f[1][c].re; REAL *rs1im = (REAL *)&rs->f[1][c].im; rs1re[0] = inv_a * rs1re[0] + b_over_a * yOut[1][c].re; rs1re[1] = inv_a * rs1re[1] + b_over_a * rs1re[0]; rs1re[2] = inv_a * rs1re[2] + b_over_a * rs1re[1]; yOut[1][c].re = rs1re[3] = inv_a * rs1re[3] + b_over_a * rs1re[2]; rs1im[0] = inv_a * rs1im[0] + b_over_a * yOut[1][c].im; rs1im[1] = inv_a * rs1im[1] + b_over_a * rs1im[0]; rs1im[2] = inv_a * rs1im[2] + b_over_a * rs1im[1]; yOut[1][c].im = rs1im[3] = inv_a * rs1im[3] + b_over_a * rs1im[2]; } @ <>= { REAL *rs2re = (REAL *)&rs->f[2][c].re; REAL *rs2im = (REAL *)&rs->f[2][c].im; rs2re[0] = inv_a * rs2re[0] + b_over_a * yOut[0][c].re; rs2re[1] = inv_a * rs2re[1] + b_over_a * rs2re[0]; rs2re[2] = inv_a * rs2re[2] + b_over_a * rs2re[1]; yOut[0][c].re = rs2re[3] = inv_a * rs2re[3] + b_over_a * rs2re[2]; rs2im[0] = inv_a * rs2im[0] + b_over_a * yOut[0][c].im; rs2im[1] = inv_a * rs2im[1] + b_over_a * rs2im[0]; rs2im[2] = inv_a * rs2im[2] + b_over_a * rs2im[1]; yOut[0][c].im = rs2im[3] = inv_a * rs2im[3] + b_over_a * rs2im[2]; } @ <>= { REAL *rs3re = (REAL *)&rs->f[3][c].re; REAL *rs3im = (REAL *)&rs->f[3][c].im; rs3re[0] = inv_a * rs3re[0] + b_over_a * yOut[1][c].re; rs3re[1] = inv_a * rs3re[1] + b_over_a * rs3re[0]; rs3re[2] = inv_a * rs3re[2] + b_over_a * rs3re[1]; yOut[1][c].re = rs3re[3] = inv_a * rs3re[3] + b_over_a * rs3re[2]; rs3im[0] = inv_a * rs3im[0] + b_over_a * yOut[1][c].im; rs3im[1] = inv_a * rs3im[1] + b_over_a * rs3im[0]; rs3im[2] = inv_a * rs3im[2] + b_over_a * rs3im[1]; yOut[1][c].im = rs3im[3] = inv_a * rs3im[3] + b_over_a * rs3im[2]; } @ \subsubsection{Standalone off-diagonal pieces} First, simple off-diagonal parts. <>= static void compute_Qxy(vFermion *d, const vFermion *s, struct neighbor *nb); static void inline compute_Qoe(vOddFermion *d, const vEvenFermion *s) { compute_Qxy(&d->f, &s->f, &odd_even); } static void inline compute_Qeo(vEvenFermion *d, const vOddFermion *s) { compute_Qxy(&d->f, &s->f, &even_odd); } static void compute_1Sxy(vFermion *d, const vFermion *q, const vFermion *s, struct neighbor *nb); static void inline compute_1Soe(vOddFermion *d, const vOddFermion *q, const vEvenFermion *s) { compute_1Sxy(&d->f, &q->f, &s->f, &odd_even); } @ \subsubsection{Common off-diagonal parts} We start from the top level: \begin{displaymath} \chi = Q_{xy}\psi \end{displaymath} <>= static void compute_Qxy(vFermion *chi, const vFermion *psi, struct neighbor *nb) { <> <> <> <> <> <> <> <> <> <> } @ For other functions, little need to be changes at this granularity. E.g., the final part of $M^{\dagger}$ is \begin{displaymath} \chi = \eta-S_{xy}\psi \end{displaymath} <>= static void compute_1Sxy(vFermion *chi, const vFermion *eta, const vFermion *psi, struct neighbor *nb) { <> <> <> <> <> <> <> <> <> <> } @ Likewise, \begin{displaymath} \chi = Q_{xx}^{-1}Q_{xy}\psi \end{displaymath} <>= static void compute_Qxx1Qxy(vFermion *chi, const vFermion *psi, struct neighbor *nb) { <> <> <> <> <> <> <> <> <> <> <> } @ and \begin{displaymath} \chi = S_{xx}^{-1}S_{xy}\psi \end{displaymath} <>= static void compute_Sxx1Sxy(vFermion *chi, const vFermion *psi, struct neighbor *nb) { <> <> <> <> <> <> <> <> <> <> <> } @ Finally, \begin{displaymath} \chi=\eta-Q_{xx}^{-1}Q_{xy}\psi \end{displaymath} <>= static void compute_1Qxx1Qxy(vFermion *chi, double *norm, const vFermion *eta, const vFermion *psi, struct neighbor *nb) { <> <> <> vReal vv; vReal nv; *norm = 0; <> <> <> <> <> <> <> <> <> } @ Remember, that $Z_{xy}$ always puts result into [[q]]. For standalone diagonal pieces a couple of [[define]]'s help to manage [[__restrict__]] pointers properly. <>= #define qx5 rx5 #define qs rs @ <>= #undef qs #undef qx5 @ \subsubsection{Projections to be sent} Next we compute $(1\pm\gamma_\mu)$ projections to be sent to our neighbors. There are two cases here, one of $Q_{xy}$ and another for $S_{xy}$. In principle, we might have handled both of them with some jungling of the [[struct neighbor]] tables, but let us go a simple if extensive way for now. <>= { int k, i, s, c, *src; const vFermion *f; vHalfFermion *g; k = 0; <> k = 1; <> k = 2; <> k = 3; <> k = 4; <> k = 5; <> k = 6; <> k = 7; <> } @ <>= { int k, i, s, c, *src; const vFermion *f; vHalfFermion *g; k = 0; <> k = 1; <> k = 2; <> k = 3; <> k = 4; <> k = 5; <> k = 6; <> k = 7; <> } @ <>= for (i = nb->snd_size[k], g = nb->snd_buf[k], src = nb->snd[k]; i--; src++) { for (s = S_4, f = &psi[*src]; s--; g++, f++) { for (c = 0; c < 3; c++) { <> } } } <> @ <>= for (i = nb->snd_size[k], g = nb->snd_buf[k], src = nb->snd[k]; i--; src++) { for (s = S_4, f = &psi[*src]; s--; g++, f++) { for (c = 0; c < 3; c++) { <> } } } <> @ <>= for (i = nb->snd_size[k], g = nb->snd_buf[k], src = nb->snd[k]; i--; src++) { for (s = S_4, f = &psi[*src]; s--; g++, f++) { for (c = 0; c < 3; c++) { <> } } } <> @ <>= for (i = nb->snd_size[k], g = nb->snd_buf[k], src = nb->snd[k]; i--; src++) { for (s = S_4, f = &psi[*src]; s--; g++, f++) { for (c = 0; c < 3; c++) { <> } } } <> @ <>= for (i = nb->snd_size[k], g = nb->snd_buf[k], src = nb->snd[k]; i--; src++) { for (s = S_4, f = &psi[*src]; s--; g++, f++) { for (c = 0; c < 3; c++) { <> } } } <> @ <>= for (i = nb->snd_size[k], g = nb->snd_buf[k], src = nb->snd[k]; i--; src++) { for (s = S_4, f = &psi[*src]; s--; g++, f++) { for (c = 0; c < 3; c++) { <> } } } <> @ <>= for (i = nb->snd_size[k], g = nb->snd_buf[k], src = nb->snd[k]; i--; src++) { for (s = S_4, f = &psi[*src]; s--; g++, f++) { for (c = 0; c < 3; c++) { <> } } } <> @ <>= for (i = nb->snd_size[k], g = nb->snd_buf[k], src = nb->snd[k]; i--; src++) { for (s = S_4, f = &psi[*src]; s--; g++, f++) { for (c = 0; c < 3; c++) { <> } } } <> @ \subsubsection{Parts of $Q_{xy}\psi$} Let us start with the simplest of the five operators we need. <>= for (i = 0; i < nb->inside_size; i++) { xyzt = nb->inside[i]; xyzt5 = xyzt * S_4; <> <> <> } @ <>= for (i = 0; i < nb->boundary_size; i++) { int m = nb->boundary[i].mask; xyzt = nb->boundary[i].index; xyzt5 = xyzt * S_4; <> <> <> } @ <>= for (s = 0; s < S_4; s++) { <> <> <> } @ <>= for (s = 0; s < S_4; s++) { <> <> <> } @ <>= <> for (c = 0; c < 3; c++) { k=0; f=&psi[ps[0]]; g=&gg[0]; <> k=1; f=&psi[ps[1]]; g=&gg[1]; <> k=2; f=&psi[ps[2]]; g=&gg[2]; <> k=3; f=&psi[ps[3]]; g=&gg[3]; <> k=4; f=&psi[ps[4]]; g=&gg[4]; <> k=5; f=&psi[ps[5]]; g=&gg[5]; <> k=6; f=&psi[ps[6]]; g=&gg[6]; <> k=7; f=&psi[ps[7]]; g=&gg[7]; <> } @ <>= <> for (c = 0; c < 3; c++) { if ((m & 0x01) == 0) { k=0; f=&psi[ps[0]]; g=&gg[0]; <> } if ((m & 0x02) == 0) { k=1; f=&psi[ps[1]]; g=&gg[1]; <> } if ((m & 0x04) == 0) { k=2; f=&psi[ps[2]]; g=&gg[2]; <> } if ((m & 0x08) == 0) { k=3; f=&psi[ps[3]]; g=&gg[3]; <> } if ((m & 0x10) == 0) { k=4; f=&psi[ps[4]]; g=&gg[4]; <> } if ((m & 0x20) == 0) { k=5; f=&psi[ps[5]]; g=&gg[5]; <> } if ((m & 0x40) == 0) { k=6; f=&psi[ps[6]]; g=&gg[6]; <> } if ((m & 0x80) == 0) { k=7; f=&psi[ps[7]]; g=&gg[7]; <> } } @ <>= rs = &rx5[s]; for (c = 0; c < 3; c++) { k = 6; <> k = 7; <> k = 2; <> k = 3; <> k = 0; <> k = 1; <> k = 4; <> k = 5; <> } @ Now we have everything we need to compute $U(1\pm\gamma_\mu)\psi$ pieces: <>= for (d = 0; d < 8; d++) { vHalfFermion * __restrict__ h = &hh[d]; vSU3 *u = &V[d]; g = &gg[d]; <> } @ If the neighbor is on another node, it is in the receive buffer by now. <>= for (d = 0; d < 8; d++) { vHalfFermion * __restrict__ h = &hh[d]; vSU3 *u = &V[d]; g = (m & (1 << d))? &nb->rcv_buf[d][ps[d]]: &gg[d]; <> } @ <>= for (c = 0; c < 3; c++) { h->f[0][c].re=u->v[c][0].re*g->f[0][0].re-u->v[c][0].im*g->f[0][0].im +u->v[c][1].re*g->f[0][1].re-u->v[c][1].im*g->f[0][1].im +u->v[c][2].re*g->f[0][2].re-u->v[c][2].im*g->f[0][2].im; h->f[0][c].im=u->v[c][0].im*g->f[0][0].re+u->v[c][0].re*g->f[0][0].im +u->v[c][1].im*g->f[0][1].re+u->v[c][1].re*g->f[0][1].im +u->v[c][2].im*g->f[0][2].re+u->v[c][2].re*g->f[0][2].im; h->f[1][c].re=u->v[c][0].re*g->f[1][0].re-u->v[c][0].im*g->f[1][0].im +u->v[c][1].re*g->f[1][1].re-u->v[c][1].im*g->f[1][1].im +u->v[c][2].re*g->f[1][2].re-u->v[c][2].im*g->f[1][2].im; h->f[1][c].im=u->v[c][0].im*g->f[1][0].re+u->v[c][0].re*g->f[1][0].im +u->v[c][1].im*g->f[1][1].re+u->v[c][1].re*g->f[1][1].im +u->v[c][2].im*g->f[1][2].re+u->v[c][2].re*g->f[1][2].im; } @ \subsubsection{Parts of $\eta-S_{xy}\psi$} <>= for (i = 0; i < nb->inside_size; i++) { const vFermion *ex5, *es; xyzt = nb->inside[i]; xyzt5 = xyzt * S_4; <> ex5 = &eta[xyzt5]; <> <> } @ <>= for (i = 0; i < nb->boundary_size; i++) { const vFermion *ex5, *es; int m = nb->boundary[i].mask; xyzt = nb->boundary[i].index; xyzt5 = xyzt * S_4; <> ex5 = &eta[xyzt5]; <> <> } @ <>= for (s = 0; s < S_4; s++) { <> <> <> } @ <>= for (s = 0; s < S_4; s++) { <> <> <> } @ <>= <> for (c = 0; c < 3; c++) { k=0; f=&psi[ps[0]]; g=&gg[0]; <> k=1; f=&psi[ps[1]]; g=&gg[1]; <> k=2; f=&psi[ps[2]]; g=&gg[2]; <> k=3; f=&psi[ps[3]]; g=&gg[3]; <> k=4; f=&psi[ps[4]]; g=&gg[4]; <> k=5; f=&psi[ps[5]]; g=&gg[5]; <> k=6; f=&psi[ps[6]]; g=&gg[6]; <> k=7; f=&psi[ps[7]]; g=&gg[7]; <> } @ <>= <> for (c = 0; c < 3; c++) { if ((m & 0x01) == 0) { k=0; f=&psi[ps[0]]; g=&gg[0]; <> } if ((m & 0x02) == 0) { k=1; f=&psi[ps[1]]; g=&gg[1]; <> } if ((m & 0x04) == 0) { k=2; f=&psi[ps[2]]; g=&gg[2]; <> } if ((m & 0x08) == 0) { k=3; f=&psi[ps[3]]; g=&gg[3]; <> } if ((m & 0x10) == 0) { k=4; f=&psi[ps[4]]; g=&gg[4]; <> } if ((m & 0x20) == 0) { k=5; f=&psi[ps[5]]; g=&gg[5]; <> } if ((m & 0x40) == 0) { k=6; f=&psi[ps[6]]; g=&gg[6]; <> } if ((m & 0x80) == 0) { k=7; f=&psi[ps[7]]; g=&gg[7]; <> } } @ <>= rs = &rx5[s]; es = &ex5[s]; for (c = 0; c < 3; c++) { k = 7; <> k = 6; <> k = 3; <> k = 2; <> k = 0; <> k = 1; <> k = 4; <> k = 5; <> <> } @ <>= rs->f[0][c].re = es->f[0][c].re - rs->f[0][c].re; rs->f[0][c].im = es->f[0][c].im - rs->f[0][c].im; rs->f[1][c].re = es->f[1][c].re - rs->f[1][c].re; rs->f[1][c].im = es->f[1][c].im - rs->f[1][c].im; rs->f[2][c].re = es->f[2][c].re - rs->f[2][c].re; rs->f[2][c].im = es->f[2][c].im - rs->f[2][c].im; rs->f[3][c].re = es->f[3][c].re - rs->f[3][c].re; rs->f[3][c].im = es->f[3][c].im - rs->f[3][c].im; @ \subsubsection{Parts of $Q_{xx}^{-1}Q_{xy}\psi$} <>= for (i = 0; i < nb->inside_size; i++) { xyzt = nb->inside[i]; xyzt5 = xyzt * S_4; <> <> <> <> } @ <>= for (i = 0; i < nb->boundary_size; i++) { int m = nb->boundary[i].mask; xyzt = nb->boundary[i].index; xyzt5 = xyzt * S_4; <> <> <> <> } @ \subsubsection{Parts of $S_{xx}^{-1}S_{xy}\psi$} <>= for (i = 0; i < nb->inside_size; i++) { xyzt = nb->inside[i]; xyzt5 = xyzt * S_4; <> <> <> <> } @ <>= for (i = 0; i < nb->boundary_size; i++) { int m = nb->boundary[i].mask; xyzt = nb->boundary[i].index; xyzt5 = xyzt * S_4; <> <> <> <> } @ <>= for (s = 0; s < S_4; s++) { <> <> <> } @ <>= for (s = 0; s < S_4; s++) { <> <> <> } @ <>= rs = &rx5[s]; for (c = 0; c < 3; c++) { k = 7; <> k = 6; <> k = 3; <> k = 2; <> k = 0; <> k = 1; <> k = 4; <> k = 5; <> } @ \subsubsection{Parts of $\eta-Q_{xx}^{-1}Q_{xy}\psi$} <>= for (i = 0; i < nb->inside_size; i++) { const vFermion *ex5, *es; xyzt = nb->inside[i]; xyzt5 = xyzt * S_4; <> ex5 = &eta[xyzt5]; <> <> <> } @ <>= for (i = 0; i < nb->boundary_size; i++) { const vFermion *ex5, *es; int m = nb->boundary[i].mask; xyzt = nb->boundary[i].index; xyzt5 = xyzt * S_4; <> ex5 = &eta[xyzt5]; <> <> <> } @ <>= <> for (s = 0; s < S_4; s++) { rs = &rx5[s]; es = &ex5[s]; nv = vmk1(0.0); for (c = 0; c < 3; c++) { <> } *norm += vsum(nv); } @ <>= vv = es->f[0][c].re - rs->f[0][c].re; rs->f[0][c].re = vv; nv += vv * vv; vv = es->f[0][c].im - rs->f[0][c].im; rs->f[0][c].im = vv; nv += vv * vv; vv = es->f[1][c].re - rs->f[1][c].re; rs->f[1][c].re = vv; nv += vv * vv; vv = es->f[1][c].im - rs->f[1][c].im; rs->f[1][c].im = vv; nv += vv * vv; vv = es->f[2][c].re - rs->f[2][c].re; rs->f[2][c].re = vv; nv += vv * vv; vv = es->f[2][c].im - rs->f[2][c].im; rs->f[2][c].im = vv; nv += vv * vv; vv = es->f[3][c].re - rs->f[3][c].re; rs->f[3][c].re = vv; nv += vv * vv; vv = es->f[3][c].im - rs->f[3][c].im; rs->f[3][c].im = vv; nv += vv * vv; @ \subsubsection{Miscallenious} We also need to uplift the gauge fields <>= Uup = &U[nb->site[xyzt].Uup]; for (d = 0; d < 4; d++, Uup++) { Udown = &U[nb->site[xyzt].Udown[d]]; for (c1 = 0; c1 < 3; c1++) { for (c2 = 0; c2 < 3; c2++) { V[d*2+0].v[c1][c2].re = vmk1(Uup->v[c1][c2].re); V[d*2+0].v[c1][c2].im = vmk1(Uup->v[c1][c2].im); /* conjugate down-link */ V[d*2+1].v[c1][c2].re = vmk1( Udown->v[c2][c1].re); V[d*2+1].v[c1][c2].im = vmk1(-Udown->v[c2][c1].im); } } } @ We want to keep code small, so computing the neighbors is done in a loop: <>= for (d = 0; d < 8; d++) { ps[d] = p5[d] + s; } @ \subsubsection{Combined pieces} In these cases, $Q_{xx}^{-1}$ is applied to the result of $Q_{xy}$ <>= static void compute_Qxx1Qxy(vFermion *d, const vFermion *s, struct neighbor *nb); static void inline compute_Qee1Qeo(vEvenFermion *d, const vOddFermion *s) { compute_Qxx1Qxy(&d->f, &s->f, &even_odd); } static void compute_Sxx1Sxy(vFermion *d, const vFermion *s, struct neighbor *nb); static void inline compute_See1Seo(vEvenFermion *d, const vOddFermion *s) { compute_Sxx1Sxy(&d->f, &s->f, &even_odd); } static void compute_1Qxx1Qxy(vFermion *d, double *norm, const vFermion *q, const vFermion *s, struct neighbor *nb); static void inline compute_1Qoo1Qoe(vOddFermion *d, double *norm, const vOddFermion *q, const vEvenFermion *s) { compute_1Qxx1Qxy(&d->f, norm, &q->f, &s->f, &odd_even); } @ \subsubsection{Common Locals} Some local bindings are used by all parts above. Let us collect them together. <>= int i, xyzt5, s, c; vFermion * __restrict__ rx5, * __restrict__ rs; @ Others are used only in $Z_{xy}$ parts: <>= int xyzt, k, d; const vFermion *f; vHalfFermion *g; vHalfFermion gg[8], hh[8]; vSU3 V[8]; int ps[8], p5[8]; @ <>= const SU3 *Uup, *Udown; int c1, c2; @ For the inside sites, compute the $s$-chain address of the neighbor. For the boundary sites, the address of the $s$-chain in the receive buffer is used instead: <>= for (d = 0; d < 8; d++) p5[d] = nb->site[xyzt].F[d]; <> @ <>= rx5 = &chi[xyzt5]; @ <>= qx5 = &psi[xyzt5]; @ \subsubsection{Common globals} @ Some of these values depend of [[m0]] and [[M]]. Here we compute their values: <>= { double a = M; double b = 2.; double c = -2*m0; <> } @ \subsection{QMP Pieces} Here are miscalenious piece of QMP: We are ready to use the result of the global sum. Check that it has been computed. <>= /* relax, QMP does not support split reductions yet. */ @ All receive operations are bundled together, so that starting and stopping them is easy: <>= QMP_start(nb->qmp_cr); @ <>= QMP_wait(nb->qmp_cr); @ For the send operations, we can easily start them separately, because of the way the send buffers are filled. Since the case of [[network[d]==2]] is handled specially, we use [[qmp_smask]] to decide when to start send. <>= if (nb->qmp_smask & (1 << k)) { QMP_start(nb->qmp_sh[k]); sending = nb; } @ It is convenient to invert waiting for the send to complete---we only wait for it just before the send buffer is filled again and at the end of CG to provide a clean completion to the routine. <>= if (sending) { int i; /* This is QMP_wait_vector(nb->qmp_sv, nb->Ns); */ for (i = sending->Ns; i--;) QMP_wait(sending->qmp_sv[i]); sending = 0; } @ A global variable [[sending]] is set to a corrsponding [[struct neighbor]] when a send operation is started, so that we do not wait on never started send. <>= static struct neighbor *sending = 0; @ \subsubsection{Global sums} Until the split global sums are implemented in QMP, everything is done at the beginning, when [[*norm]] contains the local part of the sum. Start the global operation which will distribute the pieces, compute the sum, and provide the result to each node. <>= QMP_sum_double(norm); @ \subsection{SSE Types and Operations} It is convenient to place all SSE specific matter into a separate file which we include into the solver source: <>= #define Vs 4 /* Length of SSE vector */ #define REAL float /* floating point type compatible with vReal */ #include @ Here we define the top level structure of [[sse.h]]: <>= #ifndef _SSE_H #define _SSE_H <> <> #endif @ \subsubsection{SSE Types} Let us start with the floating point scalar type. The C standard does not provide proper type encapsulation in [[typedef]], but we do not need a fool-proof solution here. First is our floating point vector type. The fact that its length is four is used heavily in the code above and below. The attribute [[aligned(16)]] helps gcc to keep variables properly aligned. <>= typedef REAL vReal __attribute__((mode(V4SF),aligned(16))); @ Now, let us declare complex types. They come in two kinds: scalar and vector, as usual. <>= typedef struct { REAL re, im; } complex; typedef struct { vReal re, im; } vcomplex; @ We handled enough general cases to express lattice specific data types. The gauge field needs two kind of types (yes, they are scalar and vector, what else?): <>= typedef struct SU3 { complex v[3][3]; } SU3; typedef struct { vcomplex v[3][3]; } vSU3; @ But we only use vector fermions. However, two component spinors come handy (note that we have committed to the color index varying faster than the spinor index. Is it a good choice?---That is unclear. Changes throughout the code are needed, however, to flip the order of indices.) <>= typedef struct { vcomplex f[4][3]; } vFermion; typedef struct { vcomplex f[2][3]; } vHalfFermion; @ Strictly speaking, there is no need to have separate types for even/odd sublattices. But, while writing the CG, the compiler caught quite a few logic errors because of these two tiny structures. <>= typedef struct { vFermion f; } vEvenFermion; typedef struct { vFermion f; } vOddFermion; @ \subsubsection{SSE inline functions} For efficiency, all functions dealing with SSE data are inlined. The code below requires [[gcc 3.3.x]]. By the good grace of [[gcc]] we already have arithmetic operations and assignments on SSE vectors. A few more functions will complete the needed set. All functions below are defined [[inline]], so that gcc can eliminate the standard function call dance.\note{In fact, gcc does a reasonably good job in dissolving inline function calls in C, but sometimes a residue is left. If you want to use [[inline]], frequent consultations with [[gcc -S]] are advantageous}. First, propagate a scalar value to all four components of the SSE vector. <>= static inline vReal vmk1(REAL a) { vReal v = __builtin_ia32_loadss((float *)&a); asm("shufps\t$0,%0,%0" : "+x" (v)); return v; } @ Packaging four values into an SSE vector is next. This defines the numbering conventions: which element is zeroth etc. <>= static inline vReal vmk4(REAL a0, REAL a1, REAL a2, REAL a3) { vReal v; REAL *r = (REAL *)&v; r[0] = a0; r[1] = a1; r[2] = a2; r[3] = a3; return v; } @ Next, sum all four components of the SSE vector. Maybe there is a craftier way to do it, but let us leave it as an exercise for now: <>= static inline REAL vsum(vReal v) { REAL *vv = (REAL *)&v; return vv[0] + vv[1] + vv[2] + vv[3]; } @ Mutators for vector numbers. We only need access to the $0^{th}$ and the $3^{rd}$ elements of the vector: <>= static inline void vput_3(vReal *v, REAL a3) { ((REAL *)v)[3] = a3; } static inline void vput_0(vReal *v, REAL a0) { ((REAL *)v)[0] = a0; } @ Given \begin{eqnarray*} a & = & (a_0, a_1, a_2, a_3),\\ b & = & (b_0, b_1, b_2, b_3),\\ \end{eqnarray*} compute various shifts as follows: \begin{displaymath} \mbox{\texttt{shift\_up1}} \leftarrow (a_1, a_2, a_3, b_0) \end{displaymath} <>= static inline vReal shift_up1(vReal a, vReal b) { vReal x = a; vReal y = b; asm("shufps\t$0x30,%0,%1\n\t" "shufps\t$0x29,%1,%0" : "+x" (x), "+x" (y)); return x; } @ \begin{displaymath} \mbox{\texttt{shift\_up2}} \leftarrow (a_2, a_3, b_0, b_1) \end{displaymath} <>= static inline vReal shift_up2(vReal a, vReal b) { vReal x = a; asm("shufps\t$0x4e,%1,%0" : "+x" (x): "x" (b)); return x; } @ \begin{displaymath} \mbox{\texttt{shift\_up3}} \leftarrow (a_3, b_0, b_1, b_2) \end{displaymath} <>= static inline vReal shift_up3(vReal a, vReal b) { vReal x = a; asm("shufps\t$0x03,%1,%0\n\t" "shufps\t$0x9c,%1,%0" : "+x" (x): "x" (b)); return x; } @ \begin{displaymath} \mbox{\texttt{shift\_down1}} \leftarrow (a_3, b_0, b_1, b_2) \end{displaymath} <>= static inline vReal shift_down1(vReal a, vReal b) { return shift_up3(a, b); } @ \begin{displaymath} \mbox{\texttt{shift\_down2}} \leftarrow (a_2, a_3, b_0, b_1) \end{displaymath} <>= static inline vReal shift_down2(vReal a, vReal b) { return shift_up2(a, b); } @ \begin{displaymath} \mbox{\texttt{shift\_down3}} \leftarrow (a_1, a_2, a_3, b_0) \end{displaymath} <>= static inline vReal shift_down3(vReal a, vReal b) { return shift_up1(a, b); } @ The very last of the SSE functions: clear a half fermion: <>= static inline void vhfzero(vHalfFermion *v) { vReal z = vmk1(0.0); v->f[0][0].re = v->f[0][0].im = v->f[0][1].re = v->f[0][1].im = v->f[0][2].re = v->f[0][2].im = v->f[1][0].re = v->f[1][0].im = v->f[1][1].re = v->f[1][1].im = v->f[1][2].re = v->f[1][2].im = z; } @ %----------------------------------- \subsection{Generally Useful Functions} Here is a collection of simple functions that are useful throughout the code: <>= static inline int parity(const int x[DIM]) { int i, v; for (i = v = 0; i < DIM; i++) v += x[i]; return v & 1; } @ \subsection{Handy Constants} For some constants it is better to have symbolic names even if one can not easily change their values. <>= #define Nc 3 /* Number of colors */ #define DIM 4 /* number of dimensions */ #define Fd 4 /* Fermion representation dimension */ @ \subsection{Source File} Finally, let us put together all the pieces: <>= #include #include "sse-dwf-cg.h" <> <> <> <> <> <> <> @ \pagebreak \section{CHUNKS} \nowebchunks \end{document}