This chapter describes the iterative methods contained in this library. The following notation is used: $Ax = b$ represents the linear system to be solved, where $A$ is an $n \times n$ sparse matrix and $x_0$ is the initial vector. $M$ represents an appropriate preconditioner.
The Conjugate Gradient (CG) method is an iterative Krylov subspace algorithm for solving linear systems with symmetric matrices.
The Biconjugate Gradient Stabilized (Bi-CGSTAB) method is an iterative Krylov subspace algorithm for solving linear systems with nonsymmetric matrices.
The Generalized Minimum Residual (GMRES(m)) method is an iterative Krylov subspace algorithm for solving linear systems with nonsymmetric matrices. GMRES(m) is restarted periodically: if it does not converge within $m$ iterations, the solution obtained after $m$ iterations is used as the initial guess for the next restart cycle. PARCEL provides variants of GMRES(m) with the Classical Gram-Schmidt (sometimes referred to as standard Gram-Schmidt) and Modified Gram-Schmidt (MGS) orthogonalization methods.
$R$ is an upper triangular matrix with $r_{ij}$ representing its elements, and $g_i$ denotes the vector consisting of the first $i$ elements of the vector $g$.
The Communication Avoiding Generalized Minimum Residual (CA-GMRES(s,t)) method is an iterative Krylov subspace algorithm for solving linear systems with nonsymmetric matrices. The computation performed in $s$ iterations of the GMRES(m) method is equivalent to one iteration of the CA-GMRES(s,t) method. CA-GMRES(s,t) is restarted periodically: if it does not converge within $s \cdot t$ iterations, the solution obtained so far is used as the initial guess for the next restart cycle. CA-GMRES(s,t) is equivalent to GMRES(m) if $s$ and $t$ are chosen so that $m = s \cdot t$, and its convergence property is the same as that of GMRES(m) in exact arithmetic. In addition, the number of global collective communication calls is reduced by a communication-avoiding QR factorization. As CA-GMRES(s,t) produces $s$ basis vectors at once, the linear independence of the basis vectors may deteriorate because of round-off errors when $s$ is too large, leading to worse convergence than GMRES(m). In order to improve the orthogonality of the basis vectors, the PARCEL implementation of CA-GMRES(s,t) provides an option to use the Newton basis in addition to the monomial basis.
$e_i$ denotes a unit vector, $\lambda_i$ denotes the eigenvalues obtained from the Hessenberg matrix generated by the iteration process, and the elements of a matrix are denoted by the corresponding lowercase letter with row and column subscripts (for example, $h_{ij}$ for the $i$-th row, $j$-th column element of the Hessenberg matrix $H$).
In iterative algorithms, the application of a preconditioner $M \approx A$ can help to reduce the number of iterations until convergence. The notation $M^{-1}v$ means solving the system $Mz = v$ with respect to the vector $z$. Although a more accurate approximation of $A$ leads to fewer iterations until convergence, the cost of applying the preconditioner normally increases. In sequential computing, one of the most common preconditioners is the incomplete LU factorization (ILU). In parallel computing, however, parallel preconditioners are needed. PARCEL provides the following parallel preconditioners: Point Jacobi, Block Jacobi, and Additive Schwarz.
The point Jacobi preconditioner is one of the simplest preconditioners: $M$ consists only of the diagonal elements of the matrix $A$. Compared to other preconditioners, the efficiency of the point Jacobi preconditioner is very low. However, it can easily be applied in parallel and it does not require any additional communication between processes.
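As a minimal sketch (an illustration added here, not the PARCEL implementation), applying the point Jacobi preconditioner amounts to dividing each vector element by the corresponding diagonal entry of $A$, which can be extracted once from the CRS arrays described later in this chapter:

```cpp
#include <cstddef>
#include <vector>

// Sketch of a point Jacobi preconditioner on CRS data (0-based indexing assumed).
// diag[i] holds a_ii; applying M^{-1} to a residual r is an element-wise division
// and therefore needs no inter-process communication.
std::vector<double> extract_diagonal(std::size_t n,
                                     const std::vector<int>& rowptr,    // CRS row pointers
                                     const std::vector<int>& col,       // CRS column indices
                                     const std::vector<double>& val) {  // CRS non-zero values
    std::vector<double> diag(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (int k = rowptr[i]; k < rowptr[i + 1]; ++k)
            if (static_cast<std::size_t>(col[k]) == i) diag[i] = val[k];
    return diag;
}

// z = M^{-1} r with M = diag(A)
void apply_point_jacobi(const std::vector<double>& diag,
                        const std::vector<double>& r,
                        std::vector<double>& z) {
    for (std::size_t i = 0; i < r.size(); ++i) z[i] = r[i] / diag[i];
}
```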
LU factorization decomposes a square matrix $A$ into a lower triangular matrix $L$ and an upper triangular matrix $U$. For a typical sparse matrix, the LU factors can be much less sparse than the original matrix, which is called fill-in. Fill-in increases the memory and computational requirements. In order to avoid this issue, PARCEL provides the zero fill-in incomplete LU factorization (ILU(0)).
The diagonal incomplete LU factorization (D-ILU) computes a diagonal matrix $D$, a lower triangular matrix $L$, and an upper triangular matrix $U$ for the input matrix $A$. The preconditioner is constructed as $M = L D^{-1} U$ with $L = D + L_A$ and $U = D + U_A$, where $L_A$ and $U_A$ are the strictly lower and strictly upper triangular parts of $A$. Two different conditions exist in order to construct the diagonal matrix $D$:
Condition 1: The diagonal elements of $M$ equal those of $A$.
Condition 2: The row sums of $M$ equal the row sums of $A$.
Given the conditions above, $D$ can be computed as follows.
If condition 1 is to be fulfilled, the following computation is used: $d_{ii} = a_{ii} - \sum_{j<i} a_{ij}\, d_{jj}^{-1}\, a_{ji}$.
If condition 2 is to be fulfilled, the following computation is used: $d_{ii} = a_{ii} - \sum_{j<i} a_{ij}\, d_{jj}^{-1} \sum_{k>j} a_{jk}$.
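For illustration only (assuming a small dense matrix stored row-major rather than PARCEL's sparse data layout), the diagonal $D$ for condition 1 can be built with a single forward sweep:

```cpp
#include <cstddef>
#include <vector>

// Sketch of building the D-ILU diagonal for condition 1 (diag(M) = diag(A)):
//   d_ii = a_ii - sum_{j<i} a_ij * d_jj^{-1} * a_ji
// The matrix is assumed dense and row-major (a[i*n+j]) purely for readability.
std::vector<double> dilu_diagonal(std::size_t n, const std::vector<double>& a) {
    std::vector<double> d(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        double dii = a[i * n + i];
        for (std::size_t j = 0; j < i; ++j)
            dii -= a[i * n + j] / d[j] * a[j * n + i];
        d[i] = dii;
    }
    return d;
}
```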
The Block Jacobi preconditioner constructs $M$ from the diagonal blocks of the matrix $A$. For each block, an incomplete LU factorization is computed. When the preconditioner is applied, each block can be processed independently, resulting in a high level of parallelism. In addition, no communication is needed when each block is defined within the sub-matrix held by one process.
Jacobi preconditioning approximately solves the system $Az = r$ by performing a fixed number of Jacobi iterations, $z^{(k+1)} = D^{-1}\left(r - (A - D)\, z^{(k)}\right)$, where $D$ is the diagonal part of $A$. Since there are no data dependencies within a single iteration, it is well suited to GPU-based computation.
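A minimal sketch of this fixed-iteration Jacobi preconditioner (an illustration added here, again assuming 0-based CRS arrays; PARCEL's GPU implementation differs):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Fixed number of Jacobi sweeps approximating z ~= A^{-1} r, used as a preconditioner.
// Iteration: z_{k+1} = D^{-1} ( r - (A - D) z_k ),  starting from z_0 = 0.
// Rows are independent within one sweep, so each row update can run in parallel.
void jacobi_precondition(std::size_t n, int iterations,
                         const std::vector<int>& rowptr,   // CRS row pointers
                         const std::vector<int>& col,      // CRS column indices
                         const std::vector<double>& val,   // CRS non-zero values
                         const std::vector<double>& r,
                         std::vector<double>& z) {
    std::vector<double> diag(n, 0.0), znew(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (int k = rowptr[i]; k < rowptr[i + 1]; ++k)
            if (static_cast<std::size_t>(col[k]) == i) diag[i] = val[k];

    std::fill(z.begin(), z.end(), 0.0);
    for (int it = 0; it < iterations; ++it) {
        for (std::size_t i = 0; i < n; ++i) {
            double sum = r[i];
            for (int k = rowptr[i]; k < rowptr[i + 1]; ++k)
                if (static_cast<std::size_t>(col[k]) != i) sum -= val[k] * z[col[k]];
            znew[i] = sum / diag[i];
        }
        z.swap(znew);
    }
}
```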
This section describes the QR decomposition algorithms that can be used in this library.
The classical Gram-Schmidt method computes the QR factorization by Gram-Schmidt orthogonalization. This algorithm has high parallelism but poor orthogonality.
The modified Gram-Schmidt method computes the QR factorization with a reordered variant of the classical Gram-Schmidt method that reduces round-off error. This algorithm improves the orthogonality over the classical Gram-Schmidt method, but requires more collective communication.
The Cholesky QR factorization consists of a matrix product and a Cholesky factorization. This algorithm has high computational intensity and can be computed with a single collective communication. However, its orthogonality is poor.
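The sketch below (an illustration added here, not PARCEL's implementation) shows the CholeskyQR idea for a tall-skinny matrix $V$ distributed by rows: the Gram matrix $G = V^T V$ is formed with a single MPI_Allreduce, $G$ is factorized as $G = R^T R$, and $Q = V R^{-1}$ is computed locally.

```cpp
#include <mpi.h>
#include <cmath>
#include <cstddef>
#include <vector>

// CholeskyQR sketch for a row-distributed tall-skinny matrix V (nloc x s, row-major).
// One MPI_Allreduce forms the global Gram matrix; everything else is process-local.
void cholesky_qr(std::size_t nloc, std::size_t s,
                 std::vector<double>& V,   // in: V, out: Q (same layout)
                 std::vector<double>& R,   // out: s x s upper triangular factor, row-major
                 MPI_Comm comm) {
    // Local Gram matrix contribution, then global sum (the single collective call).
    std::vector<double> G(s * s, 0.0);
    for (std::size_t i = 0; i < nloc; ++i)
        for (std::size_t j = 0; j < s; ++j)
            for (std::size_t k = 0; k < s; ++k)
                G[j * s + k] += V[i * s + j] * V[i * s + k];
    MPI_Allreduce(MPI_IN_PLACE, G.data(), static_cast<int>(s * s),
                  MPI_DOUBLE, MPI_SUM, comm);

    // Cholesky factorization G = R^T R (R upper triangular).
    R.assign(s * s, 0.0);
    for (std::size_t j = 0; j < s; ++j) {
        double d = G[j * s + j];
        for (std::size_t k = 0; k < j; ++k) d -= R[k * s + j] * R[k * s + j];
        R[j * s + j] = std::sqrt(d);
        for (std::size_t i = j + 1; i < s; ++i) {
            double v = G[j * s + i];
            for (std::size_t k = 0; k < j; ++k) v -= R[k * s + j] * R[k * s + i];
            R[j * s + i] = v / R[j * s + j];
        }
    }

    // Q = V R^{-1}: solve each local row against the upper triangular R in place.
    for (std::size_t i = 0; i < nloc; ++i) {
        for (std::size_t j = 0; j < s; ++j) {
            double v = V[i * s + j];
            for (std::size_t k = 0; k < j; ++k) v -= V[i * s + k] * R[k * s + j];
            V[i * s + j] = v / R[j * s + j];
        }
    }
}
```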
In order to save memory in sparse matrix computation, only the non-zero elements and their locations are stored. PARCEL supports different sparse matrix formats.
The CRS (Compressed Row Storage) format compresses the row indices of the non-zero elements of the sparse matrix. Together with the compressed row indices, the column index and the value of each non-zero element are stored in one-dimensional arrays. The compressed row indices are stored in increasing order. The following three arrays are stored (see the worked example below):
Values of non-zero elements :: the non-zero entries of the matrix, stored row by row
Compressed row indices :: for each row, the position in the value array at which that row starts (row pointers)
Column indices of non-zero elements :: the column index of each stored non-zero element
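As a worked example (added here for illustration, using 0-based indexing), the following 4×4 matrix and its three CRS arrays are written out as arrays:

```cpp
// Example matrix (9 non-zero elements):
//     | 4 1 0 0 |
// A = | 2 5 3 0 |
//     | 0 0 6 7 |
//     | 0 8 0 9 |
double crsA[]       = {4, 1, 2, 5, 3, 6, 7, 8, 9};    // values of non-zero elements
int    crsRow_ptr[] = {0, 2, 5, 7, 9};                 // row i occupies entries [crsRow_ptr[i], crsRow_ptr[i+1])
int    crsCol[]     = {0, 1, 0, 1, 2, 2, 3, 1, 3};     // column indices of non-zero elements
```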
This section explains how to install the SYCL version of PARCEL.
You can use it by including the header file ./Include/parcel_sycl.hpp and compiling your program with a SYCL-compatible compiler. Examples of compilation on SGI8600 (Intel Xeon Gold 6248R + NVIDIA V100) are shown below: the first example targets the CPU (OpenMP backend) and the second targets the GPU (CUDA backend).
module purge
module load gnu/9.5.0
module load openmpi/cur
. ~/spack/share/spack/setup-env.sh
spack load hipsycl
export sycl_root=`spack location -i hipsycl`
export llvm_root=`spack location -i llvm`
export LD_LIBRARY_PATH=$sycl_root/lib/:$llvm_root/lib/:${LD_LIBRARY_PATH}
compiler_base="acpp"
export MPICXX_CXX=${compiler_base}
export OMPI_CXX=${compiler_base}
src_name=sample.cpp
compiler_option=" -O3 --acpp-targets=omp -mlong-double-128 -march=native -mtune=native -Wpass-failed=transform-warning"
compiler=mpicxx
${compiler} ${compiler_option} -o a.out ${src_name}
module purge
module load gnu/9.5.0
module load cuda/12.6
module load openmpi/cur
. ~/spack/share/spack/setup-env.sh
spack load hipsycl
export sycl_root=`spack location -i hipsycl`
export llvm_root=`spack location -i llvm`
export cuda_root=/home/app/cuda/12.6
export ACPP_CUDA_PATH=$cuda_root
export LD_LIBRARY_PATH=$sycl_root/lib/:$llvm_root/lib/:$cuda_root/lib64:${LD_LIBRARY_PATH}
compiler_base="acpp"
export MPICXX_CXX=${compiler_base}
export OMPI_CXX=${compiler_base}
src_name=sample.cpp
compiler_option=" -O3 --acpp-targets=cuda:sm_70 -march=native -mtune=native -v"
compiler=mpicxx
${compiler} ${compiler_option} -o a.out ${src_name}
AdaptiveCpp can be installed from source code. For more details, please refer to the following page: https://github.com/AdaptiveCpp/AdaptiveCpp
One can also use spack to install AdaptiveCpp. Examples on SGI8600 (Intel Xeon Gold 6248R + NVIDIA V100) are shown below.
module purge
module load gnu/9.5.0
module load openmpi/cur
. ~/spack/share/spack/setup-env.sh
spack compiler find
spack external find
spack install llvm@18
spack install hipsycl % llvm@18
When the CPU architectures differ between login nodes and compute nodes, as in Fugaku or Wisteria Odyssey, the installation should be performed on the compute nodes.
module purge
module load gnu/9.5.0
module load cuda/12.6
module load openmpi/cur
. ~/spack/share/spack/setup-env.sh
spack compiler find
spack external find
spack external find cuda
spack install llvm@18
spack install hipsycl +cuda % llvm@18
This section describes the classes provided in the SYCL version of PARCEL.
The SYCL version of PARCEL uses the following data types (parceltype):
An MPI communicator class for the SYCL version of PARCEL.
**Syntax**
| Definition | Arguments |
|---|---|
| parcel::comm(MPI_Comm comm, sycl::queue* q) class parcel_comm; | comm: MPI communicator q: pointer to a sycl::queue |
A class for manipulating sparse matrices in the SYCL version of PARCEL.
| Definition | Arguments |
|---|---|
| template< class parceltype > matrix(sycl::queue* q) class parcel_matrix; | parceltype: Data precision of the CRS matrix q: Pointer to a sycl::queue |
| Function Name | Description | Arguments |
|---|---|---|
| set_crsMat<T>(comm,nstart,nend,n,nnz,rowptr,crsCol,mem) | Define a CRS-format matrix for MPI parallelization | T:Floating-point type parcel::comm* comm:MPI communicator for the SYCL version of PARCEL PARCEL_INT nstart:Starting row of the matrix handled by each process PARCEL_INT nend:Ending row of the matrix handled by each process size_t n:Size of the vector handled by each process size_t nnz:Number of non-zero elements in the coefficient matrix handled by each process PARCEL_INT* rowptr:Pointer table in CRS format PARCEL_INT* crsCol:Column indices of non-zero elements in the partitioned matrix stored in CRS format T* mem:Non-zero elements of the coefficient matrix |
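A minimal usage sketch based on the set_crsMat signature above (parcel_double is assumed here as the data precision; the complete flow appears in the sample program later in this chapter):

```cpp
#include <mpi.h>
#include "./Include/parcel_sycl.hpp"

// Sketch only: define a row-partitioned CRS matrix from local CRS arrays.
void setup_matrix(MPI_Comm mpi_comm, sycl::queue& q,
                  PARCEL_INT n, size_t nnz,
                  PARCEL_INT* crsRow_ptr, PARCEL_INT* crsCol, double* crsA,
                  parcel::matrix<parcel_double>& A) {
    parcel::comm comm(mpi_comm, &q);
    PARCEL_INT nstart = n * comm.mpi_rank();   // first global row owned by this rank
    PARCEL_INT nend   = nstart + n - 1;        // last global row owned by this rank
    A.set_crsMat(&comm, nstart, nend, n, nnz, crsRow_ptr, crsCol, crsA);
    q.wait();
}
```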
A class for manipulating vectors in the SYCL version of PARCEL.
| Definition | Arguments |
|---|---|
| template< class parceltype > vector(size_t n, sycl::queue* q) class parcel_vector; | parceltype: Data precision of the vector n: Number of array elements q: Pointer to a sycl::queue |
| Function Name | Description | Arguments |
|---|---|---|
| queue() | Retrieves the sycl::queue pointer | |
| resize(n, q) | Changes the number of array elements and the sycl::queue | size_t n: Number of array elements sycl::queue* q: Pointer to a sycl::queue |
| setValue<T>(mem) | Copies array elements | T: Floating-point type T* mem: Pointer to array elements |
| setValue_hi(mem) | Copies array elements to the high part of a double-double array | double* mem: Pointer to array elements |
| setValue_lo(mem) | Copies array elements to the low part of a double-double array | double* mem: Pointer to array elements |
| getValue<T>(mem) | Retrieves array elements | T: Floating-point type T* mem: Pointer to array elements |
| getValue<T>(mem_hi, mem_lo) | Retrieves double-double array elements | T: Floating-point type T* mem_hi: Pointer to high part T* mem_lo: Pointer to low part |
| getValue_hi<T>(mem) | Retrieves the high part of double-double array elements | T: Floating-point type T* mem: Pointer to array elements |
| getValue_lo<T>(mem) | Retrieves the low part of double-double array elements | T: Floating-point type T* mem: Pointer to array elements |
| size() | Retrieves the number of array elements | |
| data() | Retrieves the pointer to the beginning of the array | |
| data_hi() | Retrieves the pointer to the beginning of the high part of a double-double array | |
| data_lo() | Retrieves the pointer to the beginning of the low part of a double-double array | |
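A minimal usage sketch of the vector class (parcel_double is assumed here as the data precision), copying a host array in and out around a solver call:

```cpp
#include "./Include/parcel_sycl.hpp"

// Sketch only: create a PARCEL vector, copy host data in, and copy the result back.
void vector_roundtrip(sycl::queue& q, size_t n, double* host_x) {
    parcel::vector<parcel_double> v(n, &q);
    v.setValue(host_x);    // copy the host array into the PARCEL vector
    // ... a solver call would use v here ...
    v.getValue(host_x);    // copy the (updated) elements back to the host array
    q.wait();
}
```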
A class for the Preconditioned Conjugate Gradient (PCG) method.
| Definition | Arguments |
|---|---|
| template< class Tspmv = parcel_double, class Tprecon = parcel_double, class Taxpy = parcel_double, class Tdot = parcel_double > class pcg; | parceltype Tspmv: Precision for SpMV computation parceltype Tprecon: Precision for preconditioning parceltype Taxpy: Precision for Axpy computation parceltype Tdot: Precision for dot product computation |
| Function Name | Description | Arguments |
|---|---|---|
| solver< Tmat, Tvecx, Tvecb, Tvec, Tprecon_mat, Tprecon_vec >(Mat, x, b) | Executes the Conjugate Gradient method | parceltype Tmat: Precision of the coefficient matrix parceltype Tvecx: Precision of the solution vector parceltype Tvecb: Precision of the right-hand side vector parceltype Tvec: Precision for internal Krylov subspace data parceltype Tprecon_mat: Precision of the preconditioner matrix parceltype Tprecon_vec: Precision of the preconditioner vector parcel::matrix<Tmat> Mat: Coefficient matrix parcel::vector<Tvecx> x: Solution vector parcel::vector<Tvecb> b: Right-hand side vector |
| set_precon_algo(algo) | Sets the preconditioner for the Krylov subspace method (0: No preconditioner, 1: Point Jacobi, 2: Block Jacobi, 4: Jacobi) | PARCEL_INT algo |
| set_jacobi(iter) | Sets the number of iterations for the Jacobi method | PARCEL_INT iter |
| set_blockjacobi(nblock) | Specifies the number of blocks for the Block Jacobi method | PARCEL_INT nblock |
| set_scaling_matrix(flag) | Enables or disables scaling of the preconditioner matrix (false: disabled, true: enabled) | bool flag |
| set_scaling_vector(flag) | Enables or disables scaling of the preconditioner vector (false: disabled, true: enabled) | bool flag |
| set_tolerance(tol) | Sets the convergence threshold for the relative residual norm | double tol |
| set_max_iteration(itr) | Sets the maximum number of iterations for the Krylov subspace method | PARCEL_INT itr |
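A minimal usage sketch of the PCG class (parcel_double is assumed for every precision parameter; the full sample program later in this chapter shows the complete flow including matrix and vector setup):

```cpp
#include "./Include/parcel_sycl.hpp"

// Sketch only: configure and run the preconditioned CG solver described above.
void solve_with_pcg(parcel::matrix<parcel_double>& A,
                    parcel::vector<parcel_double>& x,
                    parcel::vector<parcel_double>& b) {
    parcel::krylov::pcg<parcel_double, parcel_double,
                        parcel_double, parcel_double> solver;
    solver.set_tolerance(1.0e-9);                        // relative residual threshold
    solver.set_max_iteration(1000);                      // iteration limit
    solver.set_precon_algo(parcel::PRECON_POINTJACOBI);  // point Jacobi preconditioning
    solver.solver<parcel_double, parcel_double, parcel_double,
                  parcel_double, parcel_double, parcel_double>(A, x, b);
}
```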
A class for the Preconditioned Bi-Conjugate Gradient Stabilized (BiCGStab) method.
| Definition | Arguments |
|---|---|
| template< class Tspmv = parcel_double, class Tprecon = parcel_double, class Taxpy = parcel_double, class Tdot = parcel_double > class pbicgstab; | parceltype Tspmv: Precision for SpMV computation parceltype Tprecon: Precision for preconditioning parceltype Taxpy: Precision for Axpy computation parceltype Tdot: Precision for dot product computation |
| Function Name | Description | Arguments |
|---|---|---|
| solver< Tmat, Tvecx, Tvecb, Tvec, Tprecon_mat, Tprecon_vec >(Mat, x, b) | Executes the Preconditioned BiCGStab method | parceltype Tmat: Precision of the coefficient matrix parceltype Tvecx: Precision of the solution vector parceltype Tvecb: Precision of the right-hand side vector parceltype Tvec: Precision for internal Krylov subspace data parceltype Tprecon_mat: Precision of the preconditioner matrix parceltype Tprecon_vec: Precision of the preconditioner vector parcel::matrix<Tmat> Mat: Coefficient matrix parcel::vector<Tvecx> x: Solution vector parcel::vector<Tvecb> b: Right-hand side vector |
| set_precon_algo(algo) | Sets the preconditioner for the Krylov subspace method (0: No preconditioner, 1: Point Jacobi, 2: Block Jacobi, 4: Jacobi) | PARCEL_INT algo |
| set_jacobi(iter) | Sets the number of iterations for the Jacobi method | PARCEL_INT iter |
| set_blockjacobi(nblock) | Specifies the number of blocks for the Block Jacobi method | PARCEL_INT nblock |
| set_scaling_matrix(flag) | Enables or disables scaling of the preconditioner matrix (false: disabled, true: enabled) | bool flag |
| set_scaling_vector(flag) | Enables or disables scaling of the preconditioner vector (false: disabled, true: enabled) | bool flag |
| set_tolerance(tol) | Sets the convergence threshold for the relative residual norm | double tol |
| set_max_iteration(itr) | Sets the maximum number of iterations for the Krylov subspace method | PARCEL_INT itr |
A class for the Preconditioned Generalized Minimal Residual (GMRES) method.
| Definition | Arguments |
|---|---|
| template< class Tspmv = parcel_double, class Tprecon = parcel_double, class Taxpy = parcel_double, class Tdot = parcel_double > class pgmres; | parceltype Tspmv: Precision for SpMV computation parceltype Tprecon: Precision for preconditioning parceltype Taxpy: Precision for Axpy computation parceltype Tdot: Precision for dot product computation |
| Function Name | Description | Arguments |
|---|---|---|
| solver< Tmat, Tvecx, Tvecb, Tvec, Tprecon_mat, Tprecon_vec, Tgs, Thess_mat >(Mat, x, b) | Executes the Preconditioned GMRES method | parceltype Tmat: Precision of the coefficient matrix parceltype Tvecx: Precision of the solution vector parceltype Tvecb: Precision of the right-hand side vector parceltype Tvec: Precision for internal Krylov subspace data parceltype Tprecon_mat: Precision of the preconditioner matrix parceltype Tprecon_vec: Precision of the preconditioner vector parceltype Tgs: Precision for Gram-Schmidt orthogonalization parceltype Thess_mat: Precision of the Hessenberg matrix parcel::matrix<Tmat> Mat: Coefficient matrix parcel::vector<Tvecx> x: Solution vector parcel::vector<Tvecb> b: Right-hand side vector |
| set_precon_algo(algo) | Sets the preconditioner for the Krylov subspace method (0: No preconditioner, 1: Point Jacobi, 2: Block Jacobi, 4: Jacobi) | PARCEL_INT algo |
| set_jacobi(iter) | Sets the number of iterations for the Jacobi method | PARCEL_INT iter |
| set_blockjacobi(nblock) | Specifies the number of blocks for the Block Jacobi method | PARCEL_INT nblock |
| set_scaling_matrix(flag) | Enables or disables scaling of the preconditioner matrix (false: disabled, true: enabled) | bool flag |
| set_scaling_vector(flag) | Enables or disables scaling of the preconditioner vector (false: disabled, true: enabled) | bool flag |
| set_tolerance(tol) | Sets the convergence threshold for the relative residual norm | double tol |
| set_max_iteration(itr) | Sets the maximum number of iterations for the Krylov subspace method | PARCEL_INT itr |
A class for the Preconditioned Communication-Avoiding Generalized Minimal Residual (CA-GMRES) method.
| Definition | Arguments |
|---|---|
| template< class Tspmv = parcel_double, class Tprecon = parcel_double, class Taxpy = parcel_double, class Tdot = parcel_double > class pcagmres; | parceltype Tspmv: Precision for SpMV computation parceltype Tprecon: Precision for preconditioning parceltype Taxpy: Precision for Axpy computation parceltype Tdot: Precision for dot product computation |
| Function Name | Description | Arguments |
|---|---|---|
| solver< Tmat, Tvecx, Tvecb, Tvec, Tprecon_mat, Tprecon_vec, Tbasis, TQR, Thess_mat >(Mat, x, b) | Executes the Communication-Avoiding GMRES method | parceltype Tmat: Precision of the coefficient matrix parceltype Tvecx: Precision of the solution vector parceltype Tvecb: Precision of the right-hand side vector parceltype Tvec: Precision for internal Krylov subspace data parceltype Tprecon_mat: Precision of the preconditioner matrix parceltype Tprecon_vec: Precision of the preconditioner vector parceltype Tbasis: Precision of the basis vectors parceltype TQR: Precision for QR decomposition parceltype Thess_mat: Precision of the Hessenberg matrix parcel::matrix<Tmat> Mat: Coefficient matrix parcel::vector<Tvecx> x: Solution vector parcel::vector<Tvecb> b: Right-hand side vector |
| set_precon_algo(algo) | Sets the preconditioner for the Krylov subspace method (0: No preconditioner, 1: Point Jacobi, 2: Block Jacobi, 4: Jacobi) | PARCEL_INT algo |
| set_jacobi(iter) | Sets the number of iterations for the Jacobi method | PARCEL_INT iter |
| set_blockjacobi(nblock) | Specifies the number of blocks for the Block Jacobi method | PARCEL_INT nblock |
| set_scaling_matrix(flag) | Enables or disables scaling of the preconditioner matrix (false: disabled, true: enabled) | bool flag |
| set_scaling_vector(flag) | Enables or disables scaling of the preconditioner vector (false: disabled, true: enabled) | bool flag |
| set_tolerance(tol) | Sets the convergence threshold for the relative residual norm | double tol |
| set_max_iteration(itr) | Sets the maximum number of iterations for the Krylov subspace method | PARCEL_INT itr |
This section shows a sample program using the SYCL version of PARCEL.
#include <mpi.h>
#include "./Include/parcel_sycl.hpp"
#include "./parcel_default.hpp"
void run_parcel_sycl_main(
MPI_Comm mpi_comm_c,
int n,
int nnz,
double* crsA,
int* crsCol,
int* crsRow_ptr,
double* vecx,
double* vecb
) {
sycl::device device;                              // default device (selected by the default selector)
sycl::property::queue::in_order property;
sycl::queue q{device,property};                   // in-order queue used by PARCEL
parcel::comm comm(mpi_comm_c,&q);
int myrank = comm.mpi_rank();
int nprocess = comm.mpi_size();
if(myrank == 0) std::cout << q.get_device().get_info<sycl::info::device::name>() << std::endl;
parcel::matrix<parcel_solver_data_type> parcel_matrix(&q);
PARCEL_INT nlocal = n;
PARCEL_INT nstart = nlocal*myrank;
PARCEL_INT nend = nlocal*(myrank+1)-1;
q.wait();
parcel_matrix.set_crsMat(
&comm,nstart,nend,
n,nnz,
crsRow_ptr,
crsCol,
crsA
);
q.wait();
parcel::vector<parcel_solver_data_type> vx(n,&q);
parcel::vector<parcel_solver_data_type> vb(n,&q);
vx.setValue(vecx);
vb.setValue(vecb);
parcel::krylov::pcg<
parcel_solver_calc_type,
parcel_solver_calc_precon_type,
parcel_solver_calc_type,
parcel_solver_calc_type
> solver;
solver.set_tolerance( 1.0e-9 );
solver.set_max_iteration( MAX_ITERATION );
PARCEL_INT precon_algo = parcel_solver_precon_algo;
switch (precon_algo) {
case parcel::PRECON_NONE:
// none
solver.set_precon_algo(parcel::PRECON_NONE);
break;
case parcel::PRECON_POINTJACOBI:
// point jacobi precon
solver.set_precon_algo(parcel::PRECON_POINTJACOBI);
break;
case parcel::PRECON_BLOCKJACOBI:
// block jacobi precon
solver.set_precon_algo(parcel::PRECON_BLOCKJACOBI);
solver.set_scaling_matrix(true);
solver.set_scaling_vector(true);
solver.set_blockjacobi(BLOCKJACOBI_BLOCK);
break;
case parcel::PRECON_JACOBI:
// jacobi precon
solver.set_precon_algo(parcel::PRECON_JACOBI);
solver.set_scaling_matrix(true);
solver.set_scaling_vector(true);
solver.set_jacobi(JACOBI_ITER);
break;
}
q.wait();
solver.solver<
parcel_solver_data_type,
parcel_solver_data_type,
parcel_solver_data_type,
parcel_solver_data_type,
parcel_solver_data_precon_mat_type,
parcel_solver_data_precon_vec_type
>(parcel_matrix,vx,vb);
vx.getValue(vecx); q.wait();
q.wait();
return ;
}
extern "C" {
void run_parcel_sycl_f(
int* mpi_comm_f,
int n, int nnz,
double* crsA,
int* crsCol,
int* crsRow_ptr,
double* vecx,
double* vecb
) {
MPI_Comm mpi_comm_c = MPI_Comm_f2c(mpi_comm_f[0]);
run_parcel_sycl_main(
mpi_comm_c,
n, nnz,
crsA,
crsCol,
crsRow_ptr,
vecx,
vecb );
return ;
}
}
make_poisson3d_crs_get_nnz and make_poisson3d_crs are subroutines that generate CRS-format matrices with support for matrix partitioning.
module parcel_sycl_wrapper
  interface
     subroutine run_parcel_sycl( &
          mpi_comm_f, &
          n,nnz, &
          crsA, &
          crsCol, &
          crsRow_ptr, &
          vecx, &
          vecb &
          ) &
          bind(c,name="run_parcel_sycl_f")
       implicit none
       integer mpi_comm_f
       integer(kind=4),value :: n
       integer(kind=4),value :: nnz
       real*8 crsA(*)
       integer(kind=4) crsCol(*)
       integer(kind=4) crsRow_ptr(*)
       real*8 vecx(*)
       real*8 vecb(*)
     end subroutine run_parcel_sycl
  end interface
contains
end module parcel_sycl_wrapper

program main
  use mpi
  use parcel_sycl_wrapper
  implicit none
  integer(kind=4) nx,ny,nz
  integer(kind=4) nx0,ny0,nz0
  integer(kind=4),parameter :: gnx = 16
  integer(kind=4),parameter :: gny = 16
  integer(kind=4),parameter :: gnz = 16
  integer(kind=4) gn,n
  integer(kind=4) nnz
  integer(kind=4) istart
  real*8,allocatable :: crsA(:)
  integer(kind=4),allocatable :: crsCol(:)
  integer(kind=4),allocatable :: crsRow_ptr(:)
  real*8,allocatable :: vecx(:)
  real*8,allocatable :: vecb(:)
  integer ierr
  integer npes,myrank
  integer i
  integer mpi_comm_f

  mpi_comm_f = MPI_COMM_WORLD

  ! MPI Init
  call MPI_Init(ierr)
  call MPI_Comm_size(mpi_comm_f,npes,ierr)
  call MPI_Comm_rank(mpi_comm_f,myrank,ierr)

  nx = gnx
  ny = gny
  nz = gnz/npes
  nx0 = 1
  ny0 = 1
  nz0 = 1 + nz*myrank
  gn = gnx*gny*gnz
  n = nx * ny * nz
  istart = 1+n*myrank

  call make_poisson3d_crs_get_nnz( &
       nx,ny,nz, &
       nx0,ny0,nz0,&
       gnx,gny,gnz,&
       nnz, &
       istart)

  allocate(crsA(nnz))
  allocate(crsCol(nnz))
  allocate(crsRow_ptr(n+1))
  allocate(vecx(n))
  allocate(vecb(n))
  vecx = 0.0d0
  vecb = 1.0d0

  call make_poisson3d_crs( &
       nx,ny,nz, &
       nx0,ny0,nz0,&
       gnx,gny,gnz,&
       crsA,crsCol,crsRow_ptr,n,nnz, &
       istart)

  call run_parcel_sycl( &
       mpi_comm_f, &
       n,nnz, &
       crsA, &
       crsCol, &
       crsRow_ptr, &
       vecx, &
       vecb &
       )

  deallocate(crsA)
  deallocate(crsCol)
  deallocate(crsRow_ptr)
  deallocate(vecx)
  deallocate(vecb)

  call MPI_Finalize(ierr)
end program main
extern "C" {
void run_parcel_sycl_c(
MPI_Comm mpi_comm_c,
int n, int nnz,
double* crsA,
int* crsCol,
int* crsRow_ptr,
double* vecx,
double* vecb
) {
run_parcel_sycl_main(
mpi_comm_c,
n, nnz,
crsA,
crsCol,
crsRow_ptr,
vecx,
vecb );
return ;
}
}
make_poisson3d_crs_get_nnz_ and make_poisson3d_crs_ are functions that generate CRS-format matrices with support for matrix partitioning.
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
extern void make_poisson3d_crs_get_nnz_(
int* nx,
int* ny,
int* nz,
int* nx0,
int* ny0,
int* nz0,
int* gnx,
int* gny,
int* gnz,
int* nnz,
int* istart);
extern void make_poisson3d_crs_(
int* nx,
int* ny,
int* nz,
int* nx0,
int* ny0,
int* nz0,
int* gnx,
int* gny,
int* gnz,
double* crsA,
int* crsCol,
int* crsRow_ptr,
int* n,
int* nnz0,
int* istart);
extern void run_parcel_sycl_c(
MPI_Comm mpi_comm_c,
int n, int nnz,
double* crsA,
int* crsCol,
int* crsRow_ptr,
double* vecx,
double* vecb
);
int main(int argc, char* argv[]) {
int gnx = 16;
int gny = 16;
int gnz = 16;
MPI_Init(&argc, &argv);
MPI_Comm mpi_comm_c = MPI_COMM_WORLD;
int npes,myrank;
MPI_Comm_size(MPI_COMM_WORLD,&npes);
MPI_Comm_rank(MPI_COMM_WORLD,&myrank);
int nx = gnx;
int ny = gny;
int nz = gnz/npes;
int nx0 = 1;
int ny0 = 1 ;
int nz0 = 1 + nz*myrank;
int gn = gnx*gny*gnz;
int n = nx * ny * nz;
int istart = 1+n*myrank;
int nnz;
make_poisson3d_crs_get_nnz_(
&nx,&ny,&nz,
&nx0,&ny0,&nz0,
&gnx,&gny,&gnz,
&nnz,
&istart);
double* crsA = (double*)malloc(sizeof(double)*nnz);
int* crsCol = (int*)malloc(sizeof(int)*nnz);
int* crsRow_ptr = (int*)malloc(sizeof(int)*(n+1));
double* vecx = (double*)malloc(sizeof(double)*n);
double* vecb = (double*)malloc(sizeof(double)*n);
make_poisson3d_crs_(
&nx,&ny,&nz,
&nx0,&ny0,&nz0,
&gnx,&gny,&gnz,
crsA,crsCol,crsRow_ptr,
&n,&nnz,
&istart);
for(int i = 0;i<n;i++){
vecx[i] = 0.0;
vecb[i] = 1.0;
}
run_parcel_sycl_c(
mpi_comm_c,
n,nnz,
crsA,
crsCol,
crsRow_ptr,
vecx,
vecb
);
free(crsA);
free(crsCol);
free(crsRow_ptr);
free(vecx);
free(vecb);
MPI_Finalize();
return 0;
}