1 Krylov subspace methods

This chapter describes the iterative methods contained in this library. The following notation is used: \(Ax = b\) denotes the linear system to be solved, where \(A\) is an n\(\times\)n sparse matrix and \(x_0\) is the initial guess vector. \(K\) denotes an appropriate preconditioner.


1.1 Preconditioned Conjugate Gradient Method [1,2]

The Conjugate Gradient (CG) method is an iterative Krylov subspace algorithm for solving linear systems with symmetric matrices.

  1. \(\mathbf{{Compute\ }}\mathbf{r}^{\mathbf{0}}\mathbf{= b - A}\mathbf{x}^{\mathbf{0}}\mathbf{{\ for\ some\ initial\ guess\ }}\mathbf{x}^{\mathbf{- 1}}\mathbf{=}\mathbf{x}^{\mathbf{0}}\mathbf{\ }\)
  2. \(\mathbf{p}^{\mathbf{- 1}}\mathbf{= 0}\)
  3. \(\mathbf{\alpha}_{\mathbf{- 1}}\mathbf{= 0\ }\mathbf{\ }\)
  4. \(\mathbf{\beta}_{\mathbf{- 1}}\mathbf{= 0}\)
  5. \(\mathbf{{solve\ s\ from\ }}\mathbf{K}\mathbf{s = \ }\mathbf{r}^{\mathbf{0}}\)
  6. \(\mathbf{\rho}_{\mathbf{0}}\mathbf{=}\left\langle \mathbf{s,}\mathbf{r}^{\mathbf{0}} \right\rangle\)
  7. \(\mathbf{for\ i = 0,1,2,3}\mathbf{\ldots}\)
  8.   \(\mathbf{p}^{\mathbf{i}}\mathbf{= s +}\mathbf{\beta}_{\mathbf{i - 1}}\mathbf{p}^{\mathbf{i - 1}}\)
  9.   \(\mathbf{q}^{\mathbf{i}}\mathbf{= A}\mathbf{p}^{\mathbf{i}}\)
  10.   \(\mathbf{\gamma =}\left\langle \mathbf{p}^{\mathbf{i}}\mathbf{,}\mathbf{q}^{\mathbf{i}} \right\rangle\)
  11.   \(\mathbf{x}^{\mathbf{i}}\mathbf{=}\mathbf{x}^{\mathbf{i - 1}}\mathbf{+}\mathbf{\alpha}_{\mathbf{i - 1}}\mathbf{p}^{\mathbf{i - 1}}\)
  12.   \(\mathbf{\alpha}_{\mathbf{i}}\mathbf{=}\mathbf{\rho}_{\mathbf{i}}\mathbf{\ /\ \gamma}\)
  13.   \(\mathbf{r}^{\mathbf{i + 1}}\mathbf{=}\mathbf{r}^{\mathbf{i}}\mathbf{-}\mathbf{\alpha}_{\mathbf{i}}\mathbf{q}^{\mathbf{i}}\)
  14.   \(\mathbf{{solve\ s\ from\ }}\mathbf{K}\mathbf{s = \ }\mathbf{r}^{\mathbf{i + 1}}\)
  15.   \(\mathbf{\rho}_{\mathbf{i + 1}}\mathbf{=}\left\langle \mathbf{s,}\mathbf{r}^{\mathbf{i + 1}} \right\rangle\)
  16.   \(\mathbf{{if\ }}\left\| \mathbf{r}^{\mathbf{i + 1}} \right\|\mathbf{\ /\ }\left\| \mathbf{b} \right\|\mathbf{{\ small\ enough\ then}}\)
  17.     \(\mathbf{x}^{\mathbf{i + 1}}\mathbf{=}\mathbf{x}^{\mathbf{i}}\mathbf{+}\mathbf{\alpha}_{\mathbf{i}}\mathbf{p}^{\mathbf{i}}\)
  18.     \(\mathbf{{quit}}\)
  19.   \(\mathbf{{endif}}\)
  20.   \(\mathbf{\beta}_{\mathbf{i}}\mathbf{=}\mathbf{\rho}_{\mathbf{i + 1}}\mathbf{\ /\ }\mathbf{\rho}_{\mathbf{i}}\)
  21. \(\mathbf{{end}}\)
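
For reference, the recurrences above can be written compactly in Fortran. The following is a minimal sketch (not the PARCEL implementation), written for a dense symmetric positive definite matrix with the point Jacobi preconditioner \(K = \mathrm{diag}(A)\); it uses the conventional ordering in which the solution is updated in the same iteration, which produces the same iterates as the listing above in exact arithmetic.

  ! Minimal sketch (not the PARCEL implementation) of preconditioned CG
  ! for a dense symmetric positive definite matrix A with the point
  ! Jacobi preconditioner K = diag(A).
  subroutine pcg_dense(n, A, b, x, tol, maxit)
    implicit none
    integer, intent(in) :: n, maxit
    double precision, intent(in)    :: A(n,n), b(n), tol
    double precision, intent(inout) :: x(n)   ! in: initial guess, out: solution
    double precision :: r(n), p(n), q(n), s(n), d(n)
    double precision :: alpha, beta, gamma, rho, rho_new, bnorm
    integer :: i, it
    do i = 1, n
       d(i) = A(i,i)                          ! K = diag(A)
    end do
    r = b - matmul(A, x)                      ! r = b - A x
    s = r / d                                 ! solve K s = r
    p = s
    rho   = dot_product(s, r)
    bnorm = sqrt(dot_product(b, b))
    do it = 1, maxit
       q     = matmul(A, p)                   ! q = A p
       gamma = dot_product(p, q)
       alpha = rho / gamma
       x = x + alpha*p
       r = r - alpha*q
       if (sqrt(dot_product(r, r))/bnorm < tol) exit  ! ||r|| / ||b|| small enough
       s = r / d                              ! solve K s = r
       rho_new = dot_product(s, r)
       beta    = rho_new / rho
       rho     = rho_new
       p = s + beta*p                         ! next search direction
    end do
  end subroutine pcg_dense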


1.2 Preconditioned Biconjugate Gradient Stabilized Method [1,2]

The Biconjugate Gradient Stabilized (Bi-CGSTAB) method is an iterative Krylov subspace algorithm for solving linear systems with nonsymmetric matrices.

  1. \(\mathbf{{Compute\ }}\mathbf{r}_{\mathbf{0}}\mathbf{= b - A}\mathbf{x}_{\mathbf{0}}\mathbf{{\ for\ some\ initial\ guess\ }}\mathbf{x}_{\mathbf{- 1}}\mathbf{=}\mathbf{x}_{\mathbf{0}}\)
  2. \(\mathbf{p}_{\mathbf{0}}\mathbf{=}\mathbf{r}_{\mathbf{0}}\)
  3. \(\mathbf{c}_{\mathbf{1}}\mathbf{=}\left\langle \mathbf{r}_{\mathbf{0}}\mathbf{,}\mathbf{r}_{\mathbf{0}} \right\rangle\)
  4. \(\mathbf{for\ i = 0,1,2,3}\mathbf{\ldots}\)
  5.   \(\mathbf{{solve\ }}\widehat{\mathbf{p}}\mathbf{{\ from\ K}}\widehat{\mathbf{p}}\mathbf{= \ }\mathbf{p}_{\mathbf{i}}\)
  6.   \(\mathbf{q = A}\widehat{\mathbf{p}}\)
  7.   \(\mathbf{c}_{\mathbf{2}}\mathbf{=}\left\langle \mathbf{r}_{\mathbf{0}}\mathbf{,q} \right\rangle\)
  8.   \(\mathbf{\alpha =}\mathbf{c}_{\mathbf{1}}\mathbf{\ /\ }\mathbf{c}_{\mathbf{2}}\)
  9.   \(\mathbf{e =}\mathbf{r}_{\mathbf{i}}\mathbf{- \alpha q}\)
  10.   \(\mathbf{{solve\ }}\widehat{\mathbf{e}}\mathbf{{\ from\ K}}\widehat{\mathbf{e}}\mathbf{= e}\)
  11.   \(\mathbf{v = A}\widehat{\mathbf{e}}\)
  12.   \(\mathbf{c}_{\mathbf{3}}\mathbf{=}\left\langle \mathbf{e,v} \right\rangle\mathbf{\ /\ }\left\langle \mathbf{v,v} \right\rangle\)
  13.   \(\mathbf{x}_{\mathbf{i + 1}}\mathbf{=}\mathbf{x}_{\mathbf{i}}\mathbf{+ \alpha}\widehat{\mathbf{p}}\mathbf{+}\mathbf{c}_{\mathbf{3}}\widehat{\mathbf{e}}\)
  14.   \(\mathbf{r}_{\mathbf{i + 1}}\mathbf{= e -}\mathbf{c}_{\mathbf{3}}\mathbf{v}\)
  15.   \(\mathbf{c}_{\mathbf{1}}\mathbf{=}\left\langle \mathbf{r}_{\mathbf{0}}\mathbf{,}\mathbf{r}_{\mathbf{i + 1}} \right\rangle\)
  16.   \(\mathbf{{if\ }}\left\| \mathbf{r}_{\mathbf{i + 1}} \right\|\mathbf{\ /\ }\left\| \mathbf{b} \right\|\mathbf{{\ small\ enough\ }}\mathbf{{the}}\mathbf{n}\)
  17.     \(\mathbf{{quit}}\)
  18.   \(\mathbf{{endif}}\)
  19.   \(\mathbf{\beta =}\mathbf{c}_{\mathbf{1}}\mathbf{\ /\ }\left( \mathbf{c}_{\mathbf{2}}\mathbf{c}_{\mathbf{3}} \right)\)
  20.   \(\mathbf{p}_{\mathbf{i + 1}}\mathbf{=}\mathbf{r}_{\mathbf{i + 1}}\mathbf{+ \beta}\left( \mathbf{p}_{\mathbf{i}}\mathbf{-}\mathbf{c}_{\mathbf{3}}\mathbf{q} \right)\)
  21. \(\mathbf{{end}}\)


1.3 Preconditioned Generalized Minimum Residual Method [1,2]

The Generalized Minimum Residual (GMRES(m)) method is an iterative Krylov subspace algorithm for solving linear systems with nonsymmetric matrices. If GMRES(m) does not converge within \(m\) iterations, it is restarted, and the solution obtained after \(m\) iterations is used as the initial guess for the next restart cycle. PARCEL provides variants of GMRES(m) with the Classical Gram-Schmidt (sometimes referred to as standard Gram-Schmidt) and Modified Gram-Schmidt (MGS) orthogonalization methods.

\(\mathbf{H}_{\mathbf{n}}\) denotes the upper triangular matrix with elements \(\mathbf{h}_{\mathbf{j,k}}\) obtained after the Givens rotations, and \(\mathbf{e}_{\mathbf{i}}\) denotes the vector consisting of the first \(i\) elements of the vector \(\mathbf{e}\).

  1. \(\mathbf{for\ j = 0,1,2,3}\mathbf{\ldots}\)
  2.   \(\mathbf{r = b - A}\mathbf{x}_{\mathbf{0}}\)
  3.   \(\mathbf{v}_{\mathbf{1}}\mathbf{= - r\ /\ }\left\| \mathbf{r} \right\|\)
  4.   \(\mathbf{e =}\left( \mathbf{-}\left\| \mathbf{r} \right\|\mathbf{,0,\ldots,0} \right)^{\mathbf{T}}\)
  5.   \(\mathbf{n = m}\)
  6.   \(\mathbf{for\ i = 1,2,3}\mathbf{\ldots}\mathbf{m}\)
  7.      \(\mathbf{{solve\ }}{\widehat{\mathbf{v}}}_{\mathbf{i}}\mathbf{{\ from\ K}}{\widehat{\mathbf{v}}}_{\mathbf{i}}\mathbf{=}\mathbf{v}_{\mathbf{i}}\)
  8.      \(\mathbf{\omega = A}{\widehat{\mathbf{v}}}_{\mathbf{i}}\)
  9.      \(\mathbf{for\ k = 1,2,3}\mathbf{\ldots}\mathbf{i}\)
  10.       \(\mathbf{h}_{\mathbf{k,i}}\mathbf{=}\left\langle \mathbf{\omega,}\mathbf{v}_{\mathbf{k}} \right\rangle\)
  11.       \(\mathbf{\omega = \omega -}\mathbf{h}_{\mathbf{k,i}}\mathbf{v}_{\mathbf{k}}\)
  12.     \(\mathbf{{end}}\)
  13.     \(\mathbf{h}_{\mathbf{i + 1,i}}\mathbf{=}\left\| \mathbf{\omega} \right\|\)
  14.     \(\mathbf{v}_{\mathbf{i + 1}}\mathbf{= \omega/}\left\| \mathbf{\omega} \right\|\)
  15.     \(\mathbf{for\ k = 1,2,3}\mathbf{\ldots}\mathbf{i - 1}\)
  16.       \(\begin{pmatrix} \mathbf{h}_{\mathbf{k,i}} \\ \mathbf{h}_{\mathbf{k + 1,i}} \\ \end{pmatrix}\mathbf{=}\begin{pmatrix} \mathbf{c}_{\mathbf{k}} & \mathbf{- s}_{\mathbf{k}} \\ \mathbf{s}_{\mathbf{k}} & \mathbf{c}_{\mathbf{k}} \\ \end{pmatrix}\begin{pmatrix} \mathbf{h}_{\mathbf{k,i}} \\ \mathbf{h}_{\mathbf{k + 1,i}} \\ \end{pmatrix}\)
  17.     \(\mathbf{{end}}\)
  18.     \(\mathbf{c}_{\mathbf{i}}\mathbf{=}\frac{\mathbf{1}}{\sqrt{\mathbf{1 +}\left( \frac{\mathbf{h}_{\mathbf{i + 1,i}}}{\mathbf{h}_{\mathbf{i,i}}} \right)^{\mathbf{2}}}}\)
  19.     \(\mathbf{s}_{\mathbf{i}}\mathbf{= -}\frac{\mathbf{h}_{\mathbf{i + 1,i}}}{\mathbf{h}_{\mathbf{i,i}}}\frac{\mathbf{1}}{\sqrt{\mathbf{1 +}\left( \frac{\mathbf{h}_{\mathbf{i + 1,i}}}{\mathbf{h}_{\mathbf{i,i}}} \right)^{\mathbf{2}}}}\)
  20.     \(\begin{pmatrix} \mathbf{e}_{\mathbf{i}} \\ \mathbf{e}_{\mathbf{i + 1}} \\ \end{pmatrix}\mathbf{=}\begin{pmatrix} \mathbf{c}_{\mathbf{i}} & \mathbf{- s}_{\mathbf{i}} \\ \mathbf{s}_{\mathbf{i}} & \mathbf{c}_{\mathbf{i}} \\ \end{pmatrix}\begin{pmatrix} \mathbf{e}_{\mathbf{i}} \\ \mathbf{e}_{\mathbf{i + 1}} \\ \end{pmatrix}\)
  21.     \(\mathbf{h}_{\mathbf{i,i}}\mathbf{=}\mathbf{c}_{\mathbf{i}}\mathbf{h}_{\mathbf{i,i}}\mathbf{- \ }\mathbf{s}_{\mathbf{i}}\mathbf{h}_{\mathbf{i + 1,i}}\)
  22.     \(\mathbf{h}_{\mathbf{i + 1,i}}\mathbf{= 0}\)
  23.     \(\mathbf{{if\ }}\left\| \mathbf{e}_{\mathbf{i + 1}} \right\|\mathbf{\ /\ }\left\| \mathbf{b} \right\|\mathbf{{\ small\ enough\ }}\mathbf{{then}}\)
  24.       \(\mathbf{n = i}\)
  25.       \(\mathbf{{exit}}\)
  26.     \(\mathbf{{endif}}\)
  27.   \(\mathbf{{end}}\)
  28.   \(\mathbf{y}_{\mathbf{n}}\mathbf{=}\mathbf{H}_{\mathbf{n}}^{\mathbf{- 1}}\mathbf{e}_{\mathbf{n}}\)
  29.   \(\mathbf{{solve\ }}\widehat{\mathbf{x}}\mathbf{{\ from\ K}}\widehat{\mathbf{x}}\mathbf{=}\sum_{\mathbf{k = 1}}^{\mathbf{n}}{\mathbf{y}_{\mathbf{k}}\mathbf{v}_{\mathbf{k}}}\)
  30.   \(\mathbf{x}_{\mathbf{n}}\mathbf{=}\mathbf{x}_{\mathbf{0}}\mathbf{+}\widehat{\mathbf{x}}\)
  31.   \(\mathbf{x}_{\mathbf{0}}\mathbf{=}\mathbf{x}_{\mathbf{n}}\)
  32. \(\mathbf{{end}}\)
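
As an illustration of steps 18-22, the following is a minimal Fortran sketch (an assumption about the implementation, not the PARCEL source) of one Givens rotation step, which eliminates \(h_{i+1,i}\) and updates the right hand side vector \(e\).

  ! Minimal sketch: construct and apply the Givens rotation that
  ! eliminates h(i+1,i), following steps 18-22 of the listing above.
  subroutine givens_step(h_ii, h_ip1i, e_i, e_ip1, c_i, s_i)
    implicit none
    double precision, intent(inout) :: h_ii, h_ip1i, e_i, e_ip1
    double precision, intent(out)   :: c_i, s_i
    double precision :: t, e_old_i, e_old_ip1
    t   = h_ip1i / h_ii
    c_i = 1.0d0 / sqrt(1.0d0 + t*t)
    s_i = -t * c_i
    ! rotate the right hand side vector e
    e_old_i   = e_i
    e_old_ip1 = e_ip1
    e_i   = c_i*e_old_i - s_i*e_old_ip1
    e_ip1 = s_i*e_old_i + c_i*e_old_ip1
    ! eliminate the subdiagonal element of column i
    h_ii   = c_i*h_ii - s_i*h_ip1i
    h_ip1i = 0.0d0
  end subroutine givens_step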


1.4 Preconditioned Communication Avoiding Generalized Minimum Residual Method [3,4]

The Communication Avoiding Generalized Minimum Residual (CA-GMRES(s,t)) method is an iterative Krylov subspace algorithm for solving linear systems with nonsymmetric matrices. One iteration of CA-GMRES(s,t) performs the computation corresponding to \(s\) iterations of the GMRES(m) method. If CA-GMRES(s,t) does not converge within \(t\) outer iterations, it is restarted, and the solution obtained at that point is used as the initial guess for the next restart cycle. CA-GMRES(s,t) is equivalent to GMRES(m) if \(s\) and \(t\) are chosen so that \(m = t \times s\), and its convergence property is the same as that of GMRES(m) in exact arithmetic. In addition, the number of global collective communication calls is reduced by a communication avoiding QR factorization. As CA-GMRES(s,t) produces \(s\) basis vectors at once, the linear independence of the basis vectors may degrade because of round-off errors when \(s\) is too large, leading to worse convergence than GMRES(m). In order to improve the orthogonality of the basis vectors, the PARCEL implementation of CA-GMRES(s,t) provides an option to use the Newton basis in addition to the monomial basis.

\(\mathbf{e}\) denotes a unit vector, \({\widetilde{\mathbf{\rho}}}_{\mathbf{k}}\) denotes the element in the \(s\)-th row and \(s\)-th column of the matrix \(\mathbf{R}_{\mathbf{k}}\), and \(\mathbf{E}\) denotes the eigenvalues obtained from the Hessenberg matrix generated by the iteration process.

  1. \(\mathbf{B}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{= \lbrack}\mathbf{e}_{\mathbf{2}}\mathbf{,}\mathbf{e}_{\mathbf{3}}\mathbf{,\ldots,}\mathbf{e}_{\mathbf{s + 1}}\mathbf{\rbrack}\)
  2. \(\mathbf{for\ j = 0,1,2,3\ldots}\)
  3.   \(\mathbf{r = b - A}\mathbf{x}_{\mathbf{0}}\)
  4.   \(\mathbf{v}_{\mathbf{1}}\mathbf{= r\ /\ }\left\| \mathbf{r} \right\|\)
  5.   \(\mathbf{\zeta =}\left( \left\| \mathbf{r} \right\|\mathbf{,0,\ldots,0} \right)^{\mathbf{T}}\)
  6.   \(\mathbf{for\ k = 0,1\ldots,t - 1}\)
  7.     \(\mathbf{Fix\ basis\ conversion\ matrix\ \lbrack}\mathbf{B}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{,E\rbrack}\)
  8.     \(\mathbf{Compute\ basis\ vectors\ \lbrack\ s,}{\acute{\mathbf{V}}}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{,}\mathbf{B}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{\ \rbrack}\)
  9.     \(\mathbf{if(k.eq.0)}\)
  10.       \(\mathbf{{QR\ decomposition\ }}\mathbf{V}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{=}\mathbf{Q}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{R}_{\mathbf{0}}^{\mathbf{\sim}}\)
  11.       \(\mathfrak{Q}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{=}\mathbf{Q}_{\mathbf{0}}^{\mathbf{\sim}}\)
  12.       \(\mathfrak{H}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{=}\mathbf{R}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{B}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{R}_{\mathbf{0}}^{\mathbf{- 1}}\)
  13.       \(\mathcal{H}_{\mathbf{0}}\mathbf{=}\mathfrak{H}_{\mathbf{0}}^{\mathbf{\sim}}\)
  14.       \(\mathbf{for\ o = 1,2\ldots,s}\)
  15.       \(\mathbf{Givens\ rotation\ \lbrack o,}\mathcal{H}_{\mathbf{0}}\mathbf{,\zeta\rbrack}\)
  16.       \(\mathbf{{if\ }}\left\| \mathbf{\zeta}_{\mathbf{o + 1}} \right\|\mathbf{\ /\ }\left\| \mathbf{b} \right\|\mathbf{{\ small\ enough\ then}}\)
  17.         \(\mathbf{update\ \ solution\ \ vector\lbrack s,k,o,}{\acute{\mathbf{V}}}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{,}{\acute{\mathbf{Q}}}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{,}\mathbf{R}_{\mathbf{0}}\mathbf{,}\mathcal{H}_{\mathbf{0}}\mathbf{,\zeta,}\mathbf{x}_{\mathbf{0}}\mathbf{\ \rbrack}\)
  18.         \(\mathbf{{quit}}\)
  19.       \(\mathbf{{endif}}\)
  20.     \(\mathbf{{else}}\)
  21.       \({\acute{\mathfrak{R}}}_{\mathbf{k - 1,k}}^{\mathbf{\sim}}\mathbf{=}\left( \mathfrak{Q}_{\mathbf{k - 1}}^{\mathbf{\sim}} \right)^{\mathbf{H}}{\acute{\mathbf{V}}}_{\mathbf{k}}^{\mathbf{\sim}}\)
  22.       \({\acute{\mathbf{V}}}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{=}{\acute{\mathbf{V}}}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{-}\mathfrak{Q}_{\mathbf{k - 1}}^{\mathbf{\sim}}{\acute{\mathfrak{R}}}_{\mathbf{k - 1,k}}^{\mathbf{\sim}}\)
  23.       \(\mathbf{{QR\ decomposition\ }}{\acute{\mathbf{V}}}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{=}{\acute{\mathbf{Q}}}_{\mathbf{k}}^{\mathbf{\sim}}{\acute{\mathbf{R}}}_{\mathbf{k}}^{\mathbf{\sim}}\)
  24.       \(\mathbf{R}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{=}\begin{pmatrix} \mathbf{R}_{\mathbf{k}} & \mathbf{z}_{\mathbf{k}} \\ \mathbf{0}_{\mathbf{1,k}} & \mathbf{\rho} \\ \end{pmatrix}\)
  25.       \(\mathfrak{H}_{\mathbf{k - 1,k}}^{\mathbf{\sim}}\mathbf{= -}\mathfrak{H}_{\mathbf{k - 1}}\mathfrak{R}_{\mathbf{k - 1,k}}\mathbf{R}_{\mathbf{k}}^{\mathbf{- 1}}\mathbf{+}\mathfrak{R}_{\mathbf{k - 1,k}}^{\mathbf{\sim}}\mathbf{B}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{R}_{\mathbf{k}}^{\mathbf{- 1}}\)
  26.       \(\mathbf{H}_{\mathbf{k}}\mathbf{=}\mathbf{R}_{\mathbf{k}}\mathbf{B}_{\mathbf{k}}\mathbf{R}_{\mathbf{k}}^{\mathbf{- 1}}\mathbf{+}{\widetilde{\mathbf{\rho}}}_{\mathbf{k}}^{\mathbf{- 1}}\mathbf{b}_{\mathbf{k}}\mathbf{z}_{\mathbf{k}}\mathbf{e}_{\mathbf{s}}^{\mathbf{T}}\mathbf{-}\mathbf{h}_{\mathbf{k - 1}}\mathbf{e}_{\mathbf{1}}\mathbf{e}_{\mathbf{s(k - 1)}}^{\mathbf{T}}\mathfrak{R}_{\mathbf{k - 1,k}}\mathbf{R}_{\mathbf{k}}^{\mathbf{- 1}}\)
  27.       \(\mathbf{h}_{\mathbf{k}}\mathbf{=}{\widetilde{\mathbf{\rho}}}_{\mathbf{k}}^{\mathbf{- 1}}\mathbf{\rho}_{\mathbf{k}}\mathbf{b}_{\mathbf{k}}\)
  28.       \(\mathfrak{H}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{=}\begin{pmatrix} \mathfrak{H}_{\mathbf{k - 1}} & \mathfrak{H}_{\mathbf{k - 1,k}}^{\mathbf{\sim}} \\ \begin{matrix} \mathbf{h}_{\mathbf{k - 1}}\mathbf{e}_{\mathbf{1}}\mathbf{e}_{\mathbf{s(k - 1)}}^{\mathbf{T}} \\ \mathbf{0}_{\mathbf{1,s(k - 1)}} \\ \end{matrix} & \begin{matrix} \mathbf{H}_{\mathbf{k}} \\ \mathbf{h}_{\mathbf{k}}\mathbf{e}_{\mathbf{s}}^{\mathbf{T}} \\ \end{matrix} \\ \end{pmatrix}\)
  29.       \(\mathcal{H}_{\mathbf{k}}\mathbf{=}\begin{pmatrix} \mathfrak{H}_{\mathbf{k - 1,k}}^{\mathbf{\sim}} \\ \mathbf{H}_{\mathbf{k}} \\ \mathbf{h}_{\mathbf{k}}\mathbf{e}_{\mathbf{s}}^{\mathbf{T}} \\ \end{pmatrix}\)
  30.       \(\mathbf{for\ o = 1 + sk,\ldots,s(k + 1)}\)
  31.         \(\mathbf{Givens\ rotation\ \lbrack o,}\mathcal{H}_{\mathbf{k}}\mathbf{,\zeta\rbrack}\)
  32.         \(\mathbf{{if\ }}\left\| \mathbf{\zeta}_{\mathbf{o + 1}} \right\|\mathbf{\ /\ }\left\| \mathbf{b} \right\|\mathbf{{\ small\ enough\ then}}\)
  33.           \(\mathbf{update\ \ solution\ \ vector\lbrack s,k,o,}{\acute{\mathbf{V}}}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{,}\mathfrak{Q}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{,}{\acute{\mathbf{R}}}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{,}\mathcal{H}_{\mathbf{k}}\mathbf{,\zeta,}\mathbf{x}_{\mathbf{0}}\mathbf{\ \rbrack}\)
  34.           \(\mathbf{{quit}}\)
  35.         \(\mathbf{{endif}}\)
  36.       \(\mathbf{{end}}\)
  37.     \(\mathbf{{endif}}\)
  38.   \(\mathbf{{end}}\)
  39.   \(\mathbf{update\ \ solution\ \ vector\lbrack s,t - 1,st,}{\acute{\mathbf{V}}}_{\mathbf{t - 1}}^{\mathbf{\sim}}\mathbf{,}\mathfrak{Q}_{\mathbf{t - 1}}^{\mathbf{\sim}}\mathbf{,}{\acute{\mathbf{R}}}_{\mathbf{t - 1}}^{\mathbf{\sim}}\mathbf{,}\mathcal{H}_{\mathbf{t - 1}}\mathbf{,\zeta,}\mathbf{x}_{\mathbf{0}}\mathbf{\ \rbrack}\)
  40. \(\mathbf{{end}}\)

  1. \(\mathbf{{if\ compute\ Newton\ Basis}}\)
  2.   \(\mathbf{i = 0}\)
  3.   \(\mathbf{{while}}\left( \mathbf{i \leq s - 1} \right)\)
  4.     \(\mathbf{{if}}\left( \mathbf{i.eq.s - 1} \right)\mathbf{{then}}\)
  5.       \(\mathbf{B}_{\mathbf{i,i}}\mathbf{= Re\lbrack}\mathbf{E}_{\mathbf{i}}\mathbf{\rbrack}\)
  6.     \(\mathbf{{else}}\)
  7.       \(\mathbf{{if}}\left( \mathbf{{Im}}\left\lbrack \mathbf{E}_{\mathbf{i}} \right\rbrack\mathbf{.eq.0} \right)\mathbf{{then}}\)
  8.         \(\mathbf{B}_{\mathbf{i,i}}\mathbf{= Re\lbrack}\mathbf{E}_{\mathbf{i}}\mathbf{\rbrack}\)
  9.       \(\mathbf{{else}}\)
  10.         \(\mathbf{B}_{\mathbf{i,i}}\mathbf{= Re\lbrack}\mathbf{E}_{\mathbf{i}}\mathbf{\rbrack}\)
  11.         \(\mathbf{B}_{\mathbf{i + 1,i + 1}}\mathbf{= Re\lbrack}\mathbf{E}_{\mathbf{i}}\mathbf{\rbrack}\)
  12.         \(\mathbf{B}_{\mathbf{i,i + 1}}\mathbf{= -}\left( \mathbf{Im\lbrack}\mathbf{E}_{\mathbf{i}}\mathbf{\rbrack} \right)^{\mathbf{2}}\)
  13.         \(\mathbf{i = i + 1}\)
  14.       \(\mathbf{{endif}}\)
  15.     \(\mathbf{{endif}}\)
  16.     \(\mathbf{i = i + 1}\)
  17.   \(\mathbf{{end\ \ }}\)
  18. \(\mathbf{{end\ \ }}\)

The elements of the matrix \(\mathbf{B}\) are represented as \(\mathbf{b}_{\mathbf{k,i}}\).

  1. \(\mathbf{for\ k = 1,2,3\ldots s}\)
  2.   \(\mathbf{{solve\ }}{\widehat{\mathbf{v}}}_{\mathbf{k - 1}}\mathbf{{\ from\ K}}{\widehat{\mathbf{v}}}_{\mathbf{k - 1}}\mathbf{=}\mathbf{v}_{\mathbf{k - 1}}\)
  3.   \(\mathbf{if(k.ne.1)then}\)
  4.     \(\mathbf{\alpha =}\mathbf{b}_{\mathbf{k - 1,k - 1}}\)
  5.     \(\mathbf{\beta =}\mathbf{b}_{\mathbf{k - 2,k - 1}}\)
  6.     \(\mathbf{v}_{\mathbf{k}}\mathbf{= A}{\widehat{\mathbf{v}}}_{\mathbf{k - 1}}\mathbf{- \alpha}\mathbf{v}_{\mathbf{k - 1}}\mathbf{+ \beta}\mathbf{v}_{\mathbf{k - 2}}\)
  7.   \(\mathbf{{else}}\)
  8.     \(\mathbf{\alpha =}\mathbf{b}_{\mathbf{k - 1,k - 1}}\)
  9.     \(\mathbf{v}_{\mathbf{k}}\mathbf{= A}{\widehat{\mathbf{v}}}_{\mathbf{k - 1}}\mathbf{- \alpha}\mathbf{v}_{\mathbf{k - 1}}\)
  10.   \(\mathbf{{endif}}\)
  11. \(\mathbf{{en}}\mathbf{d}\)

The elements of the matrix \(\mathcal{H}\) are represented as \(\mathcal{h}_{\mathbf{k,i}}\).

  1. \(\mathbf{for\ k = 1,2,3\ldots i - 1}\)
  2.   \(\begin{pmatrix} \mathcal{h}_{\mathbf{k,i}} \\ \mathcal{h}_{\mathbf{k + 1,i}} \\ \end{pmatrix}\mathbf{=}\begin{pmatrix} \mathbf{c}_{\mathbf{k}} & \mathbf{- s}_{\mathbf{k}} \\ \mathbf{s}_{\mathbf{k}} & \mathbf{c}_{\mathbf{k}} \\ \end{pmatrix}\begin{pmatrix} \mathcal{h}_{\mathbf{k,i}} \\ \mathcal{h}_{\mathbf{k + 1,i}} \\ \end{pmatrix}\)
  3. \(\mathbf{\text{end}}\)
  4. \(\mathbf{c}_{\mathbf{i}}\mathbf{=}\frac{\mathbf{1}}{\sqrt{\mathbf{1 +}\left( \frac{\mathcal{h}_{\mathbf{i + 1,i}}}{\mathcal{h}_{\mathbf{i,i}}} \right)^{\mathbf{2}}}}\)
  5. \(\mathbf{s}_{\mathbf{i}}\mathbf{= -}\frac{\mathcal{h}_{\mathbf{i + 1,i}}}{\mathcal{h}_{\mathbf{i,i}}}\frac{\mathbf{1}}{\sqrt{\mathbf{1 +}\left( \frac{\mathcal{h}_{\mathbf{i + 1,i}}}{\mathcal{h}_{\mathbf{i,i}}} \right)^{\mathbf{2}}}}\)
  6. \(\begin{pmatrix} \mathbf{\zeta}_{\mathbf{i}} \\ \mathbf{\zeta}_{\mathbf{i + 1}} \\ \end{pmatrix}\mathbf{=}\begin{pmatrix} \mathbf{c}_{\mathbf{i}} & \mathbf{- s}_{\mathbf{i}} \\ \mathbf{s}_{\mathbf{i}} & \mathbf{c}_{\mathbf{i}} \\ \end{pmatrix}\begin{pmatrix} \mathbf{\zeta}_{\mathbf{i}} \\ \mathbf{\zeta}_{\mathbf{i + 1}} \\ \end{pmatrix}\)
  7. \(\mathcal{h}_{\mathbf{i,i}}\mathbf{=}\mathbf{c}_{\mathbf{i}}\mathcal{h}_{\mathbf{i,i}}\mathbf{- \ }\mathbf{s}_{\mathbf{i}}\mathcal{h}_{\mathbf{i + 1,i}}\)
  8. \(\mathcal{h}_{\mathbf{i + 1,i}}\mathbf{= 0}\)

  1. \(\mathbf{y =}\mathcal{H}^{\mathbf{- 1}}\mathbf{\zeta}\)
  2. \(\mathbf{{solve\ }}\widehat{\mathbf{x}}\mathbf{{\ from\ K}}\widehat{\mathbf{x}}\mathbf{=}\sum_{\mathbf{k = 0}}^{\mathbf{s}\mathbf{m}\mathbf{- 1}}\mathbf{Q}_{\mathbf{k}}\mathbf{y}_{\mathbf{k}}\mathbf{+}\sum_{\mathbf{l =}\mathbf{{sm}}}^{\mathbf{n - 1}}\mathbf{V}_{\mathbf{l}\mathbf{- sm}}\sum_{\mathbf{k =}\mathbf{{sm}}}^{\mathbf{n - 1}}{\mathbf{R}_{\mathbf{l - sm}\mathbf{,}\mathbf{k}\mathbf{- sm}}^{\mathbf{- 1}}\mathbf{y}_{\mathbf{k}}}\)
  3. \(\mathbf{x}_{\mathbf{0}}\mathbf{=}\mathbf{x}_{\mathbf{0}}\mathbf{+}\widehat{\mathbf{x}}\)


1.5 Preconditioned Chebyshev basis Conjugate Gradient Method [5,6]

The Chebyshev basis Conjugate Gradient (CBCG) method is an iterative Krylov subspace algorithm for solving linear systems with symmetric matrices. The CBCG method computes \(k\) iterations of the CG method in one iteration, and thereby reduces the number of global collective communication calls. The largest and smallest eigenvalues are needed to generate the Chebyshev basis. PARCEL provides two methods to obtain these eigenvalues: the power method and the communication avoiding Arnoldi method.

  1. \(\mathbf{{Compute\ }}\mathbf{r}_{\mathbf{0}}\mathbf{= b - A}\mathbf{x}_{\mathbf{0}}\mathbf{{\ for\ some\ initial\ gue}}\mathbf{{ss\ }}\mathbf{x}_{\mathbf{- 1}}\mathbf{=}\mathbf{x}_{\mathbf{0}}\)
  2. \(\mathbf{Compute\ Chebyshev\ basis\ \lbrack}\mathbf{S}_{\mathbf{0}}\mathbf{,}{\mathbf{A}\mathbf{S}}_{\mathbf{0}}\mathbf{,}\mathbf{r}_{\mathbf{0}}\mathbf{,}\mathbf{\lambda}_{\mathbf{\max}}\mathbf{,}\mathbf{\lambda}_{\mathbf{\min}}\mathbf{\rbrack}\)
  3. \(\mathbf{Q}_{\mathbf{0}}\mathbf{=}\mathbf{S}_{\mathbf{0}}\)
  4. \(\mathbf{for\ i = 0,1,2,3\ldots}\)
  5. \(\mathbf{{Compute\ }}\mathbf{Q}_{\mathbf{i}}^{\mathbf{T}}\mathbf{A}\mathbf{Q}_{\mathbf{i}}\mathbf{\ ,\ }\mathbf{Q}_{\mathbf{i}}^{\mathbf{T}}\mathbf{r}_{\mathbf{{ik}}}\)
  6.   \(\mathbf{\alpha}_{\mathbf{i}}\mathbf{= \ }\left( \mathbf{Q}_{\mathbf{i}}^{\mathbf{T}}\mathbf{A}\mathbf{Q}_{\mathbf{i}} \right)^{\mathbf{- 1}}\left( \mathbf{Q}_{\mathbf{i}}^{\mathbf{T}}\mathbf{r}_{\mathbf{{ik}}} \right)\)
  7.   \(\mathbf{x}_{\left( \mathbf{i + 1} \right)\mathbf{k}}\mathbf{=}\mathbf{x}_{\mathbf{{ik}}}\mathbf{+}\mathbf{Q}_{\mathbf{i}}\mathbf{\alpha}_{\mathbf{i}}\)
  8.   \(\mathbf{r}_{\left( \mathbf{i + 1} \right)\mathbf{k}}\mathbf{=}\mathbf{r}_{\mathbf{{ik}}}\mathbf{- A}\mathbf{Q}_{\mathbf{i}}\mathbf{\alpha}_{\mathbf{i}}\)
  9.   \(\mathbf{{if\ }}\left\| \mathbf{r}_{\left( \mathbf{i + 1} \right)\mathbf{k}} \right\|\mathbf{\ /\ }\left\| \mathbf{b} \right\|\mathbf{{\ small\ enough\ then}}\)
  10.     \(\mathbf{{quit}}\)
  11.   \(\mathbf{{endif}}\)
  12.   \(\mathbf{Compute\ Chebyshev\ basis\ \lbrack}\mathbf{S}_{\mathbf{i + 1}}\mathbf{,}{\mathbf{A}\mathbf{S}}_{\mathbf{i + 1}}\mathbf{,}\mathbf{r}_{\left( \mathbf{i + 1} \right)\mathbf{k}}\mathbf{,}\mathbf{\lambda}_{\mathbf{\max}}\mathbf{,}\mathbf{\lambda}_{\mathbf{\min}}\mathbf{\rbrack}\)
  13.   \(\mathbf{{Compute\ }}\mathbf{Q}_{\mathbf{i}}^{\mathbf{T}}\mathbf{A}\mathbf{S}_{\mathbf{i + 1}}\)
  14.   \(\mathbf{B}_{\mathbf{i}}\mathbf{=}\left( \mathbf{Q}_{\mathbf{i}}^{\mathbf{T}}\mathbf{A}\mathbf{Q}_{\mathbf{i}} \right)^{\mathbf{- 1}}\left( \mathbf{Q}_{\mathbf{i}}^{\mathbf{T}}\mathbf{A}\mathbf{S}_{\mathbf{i + 1}} \right)\)
  15.   \(\mathbf{Q}_{\mathbf{i + 1}}\mathbf{=}\mathbf{S}_{\mathbf{i + 1}}\mathbf{-}\mathbf{Q}_{\mathbf{i}}\mathbf{B}_{\mathbf{i}}\)
  16.   \(\mathbf{A}\mathbf{Q}_{\mathbf{i + 1}}\mathbf{=}\mathbf{{AS}}_{\mathbf{i + 1}}\mathbf{-}\mathbf{{AQ}}_{\mathbf{i}}\mathbf{B}_{\mathbf{i}}\)
  17. \(\mathbf{{end}}\)

\(\mathbf{\lambda}_{\mathbf{\max}}\) and \(\mathbf{\lambda}_{\mathbf{\min}}\) represent the largest and smallest eigenvalues of \(\mathbf{AK}^{-1}\).

  1. \(\mathbf{\eta = 2\ /\ (}\mathbf{\lambda}_{\mathbf{\max}}\mathbf{-}\mathbf{\lambda}_{\mathbf{\min}}\mathbf{)}\)
  2. \(\mathbf{\zeta =}\left( \mathbf{\lambda}_{\mathbf{\max}}\mathbf{+}\mathbf{\lambda}_{\mathbf{\min}} \right)\mathbf{\ /\ (}\mathbf{\lambda}_{\mathbf{\max}}\mathbf{-}\mathbf{\lambda}_{\mathbf{\min}}\mathbf{)}\)
  3. \(\mathbf{s}_{\mathbf{0}}\mathbf{=}\mathbf{r}_{\mathbf{0}}\mathbf{\ }\)
  4. \(\mathbf{{solve\ }}{\widetilde{\mathbf{s}}}_{\mathbf{0}}\mathbf{{\ from\ K}}{\widetilde{\mathbf{s}}}_{\mathbf{0}}\mathbf{= \ }\mathbf{s}_{\mathbf{0}}\)
  5. \(\mathbf{s}_{\mathbf{1}}\mathbf{= \eta A}{\widetilde{\mathbf{s}}}_{\mathbf{0}}\mathbf{- \zeta}\mathbf{s}_{\mathbf{0}}\)
  6. \(\mathbf{{solve\ }}{\widetilde{\mathbf{s}}}_{\mathbf{1}}\mathbf{{\ from\ K}}{\widetilde{\mathbf{s}}}_{\mathbf{1}}\mathbf{= \ }\mathbf{s}_{\mathbf{1}}\)
  7. \(\mathbf{for\ j = 2,3,\ldots,k}\)
  8.   \(\mathbf{s}_{\mathbf{j}}\mathbf{= 2\eta A}{\widetilde{\mathbf{s}}}_{\mathbf{j - 1}}\mathbf{- 2\zeta}\mathbf{s}_{\mathbf{j - 1}}\mathbf{-}\mathbf{s}_{\mathbf{j - 2}}\)
  9.   \(\mathbf{{solve\ }}{\widetilde{\mathbf{s}}}_{\mathbf{j}}\mathbf{{\ from\ K}}{\widetilde{\mathbf{s}}}_{\mathbf{j}}\mathbf{= \ }\mathbf{s}_{\mathbf{j}}\)
  10. \(\mathbf{{end}}\)
  11. \(\mathbf{S = (}{\widetilde{\mathbf{s}}}_{\mathbf{0}}\mathbf{,}{\widetilde{\mathbf{s}}}_{\mathbf{1}}\mathbf{,\ldots,}{\widetilde{\mathbf{s}}}_{\mathbf{k - 1}}\mathbf{)}\)
  12. \(\mathbf{AS = (A}{\widetilde{\mathbf{s}}}_{\mathbf{0}}\mathbf{,}{\mathbf{A}\widetilde{\mathbf{s}}}_{\mathbf{1}}\mathbf{,\ldots,}{\mathbf{A}\widetilde{\mathbf{s}}}_{\mathbf{k - 1}}\mathbf{)}\)
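
As a brief worked note, the scaling parameters \(\eta\) and \(\zeta\) defined in steps 1 and 2 map the spectrum of \(AK^{-1}\) onto the interval \([-1,1]\), on which the Chebyshev polynomials are defined:

\[ \eta\lambda_{\min} - \zeta = -1, \qquad \eta\lambda_{\max} - \zeta = 1, \]

so that, in exact arithmetic, the basis vectors generated above satisfy \(s_{j} = T_{j}\!\left( \eta AK^{-1} - \zeta I \right) r_{0}\), where \(T_{j}\) denotes the Chebyshev polynomial of degree \(j\).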


1.6 Preconditioned Communication Avoiding Arnoldi Method [3]

The Preconditioned Communication Avoiding Arnoldi method (CA-Arnoldi(s,t)) is an eigenvalue solver for nonsymmetric matrices. CA-Arnoldi(s,t) performs the computation corresponding to \(s\) iterations of the Arnoldi method in a single iteration, and by repeating this \(t\) times, the eigenvalues and eigenvectors of the resulting Hessenberg matrix of dimension t\(\times\)s are computed. By applying communication avoiding QR factorization algorithms, the number of collective communications is reduced compared to the Arnoldi method.

\(\mathbf{e}\) denotes a unit vector, \({\widetilde{\mathbf{\rho}}}_{\mathbf{k}}\) denotes the element in the \(s\)-th row and \(s\)-th column of the matrix \(\mathbf{R}_{\mathbf{k}}\), and \(\mathbf{E}\) denotes the eigenvalues obtained from the Hessenberg matrix generated by the iteration process.

  1. \(\mathbf{B}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{= \lbrack}\mathbf{e}_{\mathbf{2}}\mathbf{,}\mathbf{e}_{\mathbf{3}}\mathbf{,\ldots,}\mathbf{e}_{\mathbf{s + 1}}\mathbf{\rbrack}\)
  2.   \(\mathbf{r = }\mathbf{x}_{\mathbf{0}}\)
  3.   \(\mathbf{\zeta =}\left( \left\| \mathbf{r} \right\|\mathbf{,0,\ldots,0} \right)^{\mathbf{T}}\)
  4.   \(\mathbf{for\ k = 0,1\ldots,t - 1}\)
  5.     \(\mathbf{Fix\ basis\ conversion\ matrix\ \lbrack}\mathbf{B}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{,E\rbrack}\)
  6.     \(\mathbf{Compute\ basis\ vectors\ \lbrack\ s,}{\acute{\mathbf{V}}}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{,}\mathbf{B}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{\ \rbrack}\)
  7.     \(\mathbf{if(k.eq.0)}\)
  8.       \(\mathbf{{QR\ decomposition\ }}\mathbf{V}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{=}\mathbf{Q}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{R}_{\mathbf{0}}^{\mathbf{\sim}}\)
  9.       \(\mathfrak{Q}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{=}\mathbf{Q}_{\mathbf{0}}^{\mathbf{\sim}}\)
  10.       \(\mathfrak{H}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{=}\mathbf{R}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{B}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{R}_{\mathbf{0}}^{\mathbf{- 1}}\)
  11.       \(\mathcal{H}_{\mathbf{0}}\mathbf{=}\mathfrak{H}_{\mathbf{0}}^{\mathbf{\sim}}\)
  12.       \(\mathbf{for\ o = 1,2\ldots,s}\)
  13.       \(\mathbf{Givens\ rotation\ \lbrack o,}\mathcal{H}_{\mathbf{0}}\mathbf{,\zeta\rbrack}\)
  14.       \(\mathbf{{if\ }}\left\| \mathbf{\zeta}_{\mathbf{o + 1}} \right\|\mathbf{\ /\ }\left\| \mathbf{b} \right\|\mathbf{{\ small\ enough\ then}}\)
  15.         \(\mathbf{{quit}}\)
  16.       \(\mathbf{{endif}}\)
  17.     \(\mathbf{{else}}\)
  18.       \({\acute{\mathfrak{R}}}_{\mathbf{k - 1,k}}^{\mathbf{\sim}}\mathbf{=}\left( \mathfrak{Q}_{\mathbf{k - 1}}^{\mathbf{\sim}} \right)^{\mathbf{H}}{\acute{\mathbf{V}}}_{\mathbf{k}}^{\mathbf{\sim}}\)
  19.       \({\acute{\mathbf{V}}}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{=}{\acute{\mathbf{V}}}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{-}\mathfrak{Q}_{\mathbf{k - 1}}^{\mathbf{\sim}}{\acute{\mathfrak{R}}}_{\mathbf{k - 1,k}}^{\mathbf{\sim}}\)
  20.       \(\mathbf{{QR\ decomposition\ }}{\acute{\mathbf{V}}}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{=}{\acute{\mathbf{Q}}}_{\mathbf{k}}^{\mathbf{\sim}}{\acute{\mathbf{R}}}_{\mathbf{k}}^{\mathbf{\sim}}\)
  21.       \(\mathbf{R}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{=}\begin{pmatrix} \mathbf{R}_{\mathbf{k}} & \mathbf{z}_{\mathbf{k}} \\ \mathbf{0}_{\mathbf{1,k}} & \mathbf{\rho} \\ \end{pmatrix}\)
  22.       \(\mathfrak{H}_{\mathbf{k - 1,k}}^{\mathbf{\sim}}\mathbf{= -}\mathfrak{H}_{\mathbf{k - 1}}\mathfrak{R}_{\mathbf{k - 1,k}}\mathbf{R}_{\mathbf{k}}^{\mathbf{- 1}}\mathbf{+}\mathfrak{R}_{\mathbf{k - 1,k}}^{\mathbf{\sim}}\mathbf{B}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{R}_{\mathbf{k}}^{\mathbf{- 1}}\)
  23.       \(\mathbf{H}_{\mathbf{k}}\mathbf{=}\mathbf{R}_{\mathbf{k}}\mathbf{B}_{\mathbf{k}}\mathbf{R}_{\mathbf{k}}^{\mathbf{- 1}}\mathbf{+}{\widetilde{\mathbf{\rho}}}_{\mathbf{k}}^{\mathbf{- 1}}\mathbf{b}_{\mathbf{k}}\mathbf{z}_{\mathbf{k}}\mathbf{e}_{\mathbf{s}}^{\mathbf{T}}\mathbf{-}\mathbf{h}_{\mathbf{k - 1}}\mathbf{e}_{\mathbf{1}}\mathbf{e}_{\mathbf{s(k - 1)}}^{\mathbf{T}}\mathfrak{R}_{\mathbf{k - 1,k}}\mathbf{R}_{\mathbf{k}}^{\mathbf{- 1}}\)
  24.       \(\mathbf{h}_{\mathbf{k}}\mathbf{=}{\widetilde{\mathbf{\rho}}}_{\mathbf{k}}^{\mathbf{- 1}}\mathbf{\rho}_{\mathbf{k}}\mathbf{b}_{\mathbf{k}}\)
  25.       \(\mathfrak{H}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{=}\begin{pmatrix} \mathfrak{H}_{\mathbf{k - 1}} & \mathfrak{H}_{\mathbf{k - 1,k}}^{\mathbf{\sim}} \\ \begin{matrix} \mathbf{h}_{\mathbf{k - 1}}\mathbf{e}_{\mathbf{1}}\mathbf{e}_{\mathbf{s(k - 1)}}^{\mathbf{T}} \\ \mathbf{0}_{\mathbf{1,s(k - 1)}} \\ \end{matrix} & \begin{matrix} \mathbf{H}_{\mathbf{k}} \\ \mathbf{h}_{\mathbf{k}}\mathbf{e}_{\mathbf{s}}^{\mathbf{T}} \\ \end{matrix} \\ \end{pmatrix}\)
  26.       \(\mathcal{H}_{\mathbf{k}}\mathbf{=}\begin{pmatrix} \mathfrak{H}_{\mathbf{k - 1,k}}^{\mathbf{\sim}} \\ \mathbf{H}_{\mathbf{k}} \\ \mathbf{h}_{\mathbf{k}}\mathbf{e}_{\mathbf{s}}^{\mathbf{T}} \\ \end{pmatrix}\)
  27.       \(\mathbf{for\ o = 1 + sk,\ldots,s(k + 1)}\)
  28.         \(\mathbf{Givens\ rotation\ \lbrack o,}\mathcal{H}_{\mathbf{k}}\mathbf{,\zeta\rbrack}\)
  29.         \(\mathbf{{if\ }}\left\| \mathbf{\zeta}_{\mathbf{o + 1}} \right\|\mathbf{\ /\ }\left\| \mathbf{b} \right\|\mathbf{{\ small\ enough\ then}}\)
  30.           \(\mathbf{{quit}}\)
  31.         \(\mathbf{{endif}}\)
  32.       \(\mathbf{{end}}\)
  33.     \(\mathbf{{endif}}\)
  34.   \(\mathbf{{end}}\)
  35.   \(\mathbf{Eigenvalue\ problem\ \lbrack}\mathfrak{H}_{\mathbf{t - 1}}^{\mathbf{\sim}}\mathbf{,E,z\rbrack}\)
  36.   \(\mathbf{Eigenvectors\ X =}\mathfrak{Q}_{\mathbf{t - 1}}^{\mathbf{\sim}}\mathbf{z}\)


2 Preconditioning

In iterative algorithms, the application of a preconditioner of the form \(K \approx A\) can help to reduce the number of iterations until convergence. The notation [\({{solve\ }}\widehat{{p}}{{\ from\ K}}\widehat{{p}}{= \ }{p}\)] means to approximately solve \({A}\widehat{p}{= \ }{p}\) with respect to the vector \(\widehat{p}\). Although a more accurate approximation of \(A\) reduces the number of iterations until convergence, it normally increases the cost of applying the preconditioner. In non-parallel computing, one of the most common preconditioners is the incomplete LU factorization (ILU). However, in parallel computing, parallel preconditioners are needed. PARCEL provides the following parallel preconditioners: Point Jacobi, Block Jacobi, Additive Schwarz, and Neumann expansion.


2.1 Point Jacobi Preconditioner [1,2]

The point Jacobi preconditioner is one of the simplest preconditioners. \(K\) consists only of the diagonal elements of the matrix \(A\). Compared to other preconditioners, the efficiency of the point Jacobi preconditioner is very low. However, it can easily be applied in parallel, and it does not require any additional communication between processors.
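
A minimal sketch of applying this preconditioner (an illustration, not the PARCEL API): solving \(Ks = r\) reduces to an element-wise division by the diagonal of \(A\).

  ! Minimal sketch: apply the point Jacobi preconditioner K = diag(A),
  ! i.e. solve K s = r element-wise.
  subroutine point_jacobi_apply(n, diagA, r, s)
    implicit none
    integer, intent(in) :: n
    double precision, intent(in)  :: diagA(n), r(n)
    double precision, intent(out) :: s(n)
    integer :: i
    do i = 1, n
       s(i) = r(i) / diagA(i)   ! assumes non-zero diagonal entries
    end do
  end subroutine point_jacobi_apply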


2.2 Zero Fill-in Incomplete LU Factorization Preconditioners (ILU(0)) [1,2]

The LU factorization decomposes a square matrix \(A\) into a lower triangular matrix \(L\) and an upper triangular matrix \(U\). For a typical sparse matrix, the LU factors can be much less sparse than the original matrix; this effect is called fill-in. Fill-in increases the memory and computational requirements. In order to avoid this issue, PARCEL provides the zero fill-in incomplete LU factorization (ILU(0)).
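
In the standard formulation of ILU(0), the triangular factors are constrained to the non-zero pattern of \(A\):

\[ K = \tilde{L}\tilde{U} \approx A, \qquad \tilde{l}_{ij} = \tilde{u}_{ij} = 0 \quad \text{if} \quad a_{ij} = 0, \]

and the product \(\tilde{L}\tilde{U}\) is required to match \(A\) on that non-zero pattern.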

2.3 Diagonal Incomplete LU Factorization Preconditioners (D-ILU) [1,2]

The diagonal incomplete LU factorization (D-ILU) computes a diagonal matrix \(D\) for the input matrix \(A\), which is split into its diagonal part \(D_{A}\), its strictly lower triangular part \(L_{A}\), and its strictly upper triangular part \(U_{A}\): \[A = L_{A} + D_{A} + U_{A}\] The preconditioner \(K\) is constructed from \(L_{A}\), \(U_{A}\) and \(D\) (\(D \neq D_{A}\)) as: \[K = \left( D + L_{A} \right)D^{- 1}\left( D + U_{A} \right)\] Two different conditions can be used to construct the diagonal matrix \(D\): either the diagonal elements of \(K\) coincide with those of \(A\), or the sums of the elements in each row of \(K\) coincide with those of \(A\) (these correspond to the two D-ILU settings of the ilu_method parameter in Section 6). Given the chosen condition, \(D\) is computed accordingly.
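
As a sketch of the first condition (the diagonal elements of \(K\) coincide with those of \(A\)), the standard D-ILU construction determines the entries of \(D\) recursively as

\[ d_{ii} = a_{ii} - \sum_{j < i} a_{ij}\, d_{jj}^{-1}\, a_{ji}, \]

where only the non-zero entries of the \(i\)-th row of \(L_{A}\) contribute to the sum.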


2.4 Block Jacobi Preconditioner [1,2]

The Block Jacobi Preconditioner constructs \(K\) out of the diagonal blocks of the matrix \(A\). For each block, the incomplete ILU factorization is computed. When the preconditioner is applied, each block can be processed independently, resulting in a high level of parallelism. In addition, no communication is needed when each block is defined within a sub-matrix on each processor.


2.5 Additive Schwarz Preconditioner [1,2]

The Additive Schwarz Preconditioner constructs \(K\) out of the overlapping diagonal blocks of \(A\). For each block the incomplete LU factorization is computed. Compared to the Block Jacobi Preconditioner, the Additive Schwarz Preconditioner may require additional communication. PARCEL provides four different overlapping methods, BASIC, RESTRICT, INTERPOLATE and NONE.

Fig.1 Additive Schwarz Preconditioner with overlapping diagonal blocks

BASIC

Solve \(Ks=r\) with an extended \(r\) that includes the overlapping regions shared with the neighboring processes; \(s\) in the overlapping regions is determined by summing up the results from the neighboring regions.

Fig.2 BASIC

RESTRICT

  Solve \(Ks=r\) with an extended \(r\) that includes the overlapping regions shared with the neighboring processes; \(s\) in the overlapping regions is determined without taking account of the results from the neighboring regions.

Fig.3 RESTRICT

INTERPOLATE

Solve \(Ks=r\) with an extended \(r\) that does not include the overlapping regions shared with the neighboring processes; \(s\) in the overlapping regions is determined by summing up the results from the neighboring regions.

Fig.4 INTERPOLATE

NONE

Solve \(Ks=r\) with an extended \(r\) that does not include the overlapping regions shared with the neighboring processes; \(s\) in the overlapping regions is determined without taking account of the results from the neighboring regions.

Fig.5 NONE


2.6 Fine-block Preconditioner [7]

The fine-block preconditioner generates a preconditioning matrix so that SIMD operations can be applied to the incomplete LU factorization. In each block of the block Jacobi preconditioner or the additive Schwarz preconditioner, fine diagonal blocks are defined. By ignoring the off-diagonal elements only within the fine diagonal blocks, data dependencies are eliminated and SIMD operations are enabled. In Fig.6, 3 \(\times\) 3 fine blocks are defined within the 9 \(\times\) 9 blocks of the block Jacobi preconditioner; data dependencies are eliminated by ignoring the off-diagonal elements only within the fine blocks (shown in white), and three vector elements are processed by SIMD operations.

Fig.6 Subdividing preconditioning (block Jacobi preconditioner and subdivided block Jacobi preconditioner)


2.7 Neumann Series Preconditioner [2]

The diagonal scaling of the matrix \(A\) yields \(\bar{A}\). In the Neumann series preconditioner, the preconditioning matrix \(K\) is defined by an approximate inverse of \(\bar{A}\) using the Neumann series.

\[ \bar{A}^{-1}\simeq K^{-1}=[I+F+F^2+F^3+...] \]

Here, \(\bar{A}=I-F\). This algorithm can be implemented using sparse matrix-vector products (SpMV) and has high parallel efficiency. In addition, by combining it with the block Jacobi preconditioner, a block Neumann series preconditioner is constructed that avoids inter-process communication in the SpMV. These preconditioners are available only in the GPU version.
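
A truncated series of order \(m\) can be applied with \(m\) SpMV operations through a Horner-type recurrence (a standard evaluation shown here for illustration; the truncation order corresponds to the Neuman_itrmax parameter of the solver routines): starting from \(z_{0} = r\),

\[ z_{k} = r + F z_{k-1} \quad (k = 1,\ldots,m), \qquad z_{m} = \sum_{k=0}^{m} F^{k} r \;\simeq\; \bar{A}^{-1} r. \]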

3 QR factorization

This section describes the QR decomposition algorithms that can be used in this library.


3.1 Classical Gram-Schmidt (CGS) Method [1,2]

The QR factorization by the classical Gram-Schmidt orthogonalization. This algorithm has high parallelism but poor orthogonality.

  1. \(\mathbf{f}\mathbf{or\ i = 1,\ldots,n\ do}\)
  2.   \(\mathbf{for\ k = 1,i - 1\ do}\)
  3.     \(\mathbf{R}_{\mathbf{k,i}}\mathbf{= <}\mathbf{Q}_{\mathbf{k}}\mathbf{,}\mathbf{V}_{\mathbf{i}}\mathbf{>}\)
  4.   \(\mathbf{{enddo\ }}\)
  5.   \(\mathbf{for\ k = 1,i - 1\ do}\)
  6.     \(\mathbf{V}_{\mathbf{i}}\mathbf{=}\mathbf{V}_{\mathbf{i}}\mathbf{-}\mathbf{R}_{\mathbf{k,i}}\mathbf{Q}_{\mathbf{k}}\)
  7.     \(\mathbf{{enddo\ }}\)
  8.   \(\mathbf{R}_{\mathbf{i,i}}\mathbf{=}\sqrt{\mathbf{<}\mathbf{V}_{\mathbf{i}}\mathbf{,}\mathbf{V}_{\mathbf{i}}\mathbf{>}}\)
  9.   \(\mathbf{Q}_{\mathbf{i}}\mathbf{=}\frac{\mathbf{1}}{\left\| \mathbf{V}_{\mathbf{i}} \right\|}\mathbf{V}_{\mathbf{i}}\)
  10. \(\mathbf{{enddo\ }}\)


3.2 Modified Gram-Schmidt (MGS) Method [1,2]

The QR factorization by a modified version of the classical Gram-Schmidt algorithm that reduces rounding errors. This algorithm improves the orthogonality over the classical Gram-Schmidt method, but requires more collective communications.

  1. \(\mathbf{f}\mathbf{or\ i = 1,\ldots,n\ do}\)
  2.   \(\mathbf{for\ k = 1,i - 1\ do}\)
  3.     \(\mathbf{R}_{\mathbf{k,i}}\mathbf{= <}\mathbf{Q}_{\mathbf{k}}\mathbf{,}\mathbf{V}_{\mathbf{i}}\mathbf{>}\)
  4.     \(\mathbf{V}_{\mathbf{i}}\mathbf{=}\mathbf{V}_{\mathbf{i}}\mathbf{-}\mathbf{R}_{\mathbf{k,i}}\mathbf{Q}_{\mathbf{k}}\)
  5.   \(\mathbf{{enddo\ }}\)
  6.   \(\mathbf{R}_{\mathbf{i,i}}\mathbf{=}\sqrt{\mathbf{<}\mathbf{V}_{\mathbf{i}}\mathbf{,}\mathbf{V}_{\mathbf{i}}\mathbf{>}}\)
  7.   \(\mathbf{Q}_{\mathbf{i}}\mathbf{=}\frac{\mathbf{1}}{\left\| \mathbf{V}_{\mathbf{i}} \right\|}\mathbf{V}_{\mathbf{i}}\)
  8. \(\mathbf{{enddo\ }}\)


3.3 Tall Skinny QR (TSQR) Method [3]

TSQR is based on the Householder QR factorization shown below, and has good orthogonality. The sequential TSQR is used between threads, while the parallel TSQR is used between processes.

  1. \(\mathbf{for\ i = 1,\ldots,n\ do}\)
  2.   \(\mathbf{y}_{\mathbf{i}}\mathbf{(i:m) = A(i:m,i) -}\left\| \mathbf{A(i:m,i)} \right\|\mathbf{e}_{\mathbf{i}}\mathbf{(i:m)}\)
  3.   \(\mathbf{t}_{\mathbf{i}}\mathbf{=}\frac{\mathbf{2}}{\mathbf{<}\mathbf{y}_{\mathbf{i}}\left( \mathbf{i:m} \right)\mathbf{,\ }\mathbf{y}_{\mathbf{i}}\mathbf{(i:m) >}}\)
  4.   \(\mathbf{Q}_{\mathbf{i}}\mathbf{=}\left( \mathbf{I -}\mathbf{t}_{\mathbf{i}}\mathbf{y}_{\mathbf{i}}\mathbf{(i:m)}{\mathbf{y}_{\mathbf{i}}\mathbf{(i:m)}}^{\mathbf{T}} \right)\)
  5.   \(\mathbf{A(i:m,i:n) =}\mathbf{Q}_{\mathbf{i}}\mathbf{A(i:m,i:n)}\)
  6. \(\mathbf{{enddo}}\)
Sequential TSQR       Parallel TSQR


3.4 Cholesky QR Method [8]

The Cholesky QR factorization consists of a matrix product and the Cholesky factorization. This algorithm has high computational intensity and can be computed with one collective communication. However, its orthogonality is poor.

  1.  \(\mathbf{B =}\mathbf{V}^{\mathbf{T}}\mathbf{V}\)
  2.  \(\mathbf{R}^{\mathbf{T}}\mathbf{R = Cholesky\ decomposition(B)}\)
  3.  \(\mathbf{Q = V}\mathbf{R}^{\mathbf{- 1}}\)
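
A minimal sketch of these three steps using BLAS/LAPACK (an illustration for a dense m \(\times\) n input \(V\) under the above notation, not the PARCEL implementation):

  ! Minimal sketch of the Cholesky QR factorization of an m x n matrix V:
  !   B = V^T V,  B = R^T R (Cholesky),  Q = V R^{-1}.
  ! On return V is overwritten with Q; the upper triangle of R holds the
  ! triangular factor (its strictly lower triangle is not referenced).
  subroutine cholesky_qr(m, n, V, R, info)
    implicit none
    integer, intent(in) :: m, n
    double precision, intent(inout) :: V(m, n)
    double precision, intent(out)   :: R(n, n)
    integer, intent(out) :: info
    ! B = V^T V (stored in R)
    call dgemm('T', 'N', n, n, m, 1.0d0, V, m, V, m, 0.0d0, R, n)
    ! Cholesky factorization B = R^T R (upper triangle of R)
    call dpotrf('U', n, R, n, info)
    if (info /= 0) return
    ! Q = V R^{-1} (V is overwritten with Q)
    call dtrsm('R', 'U', 'N', 'N', m, n, 1.0d0, R, n, V, m)
  end subroutine cholesky_qr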


3.5 Cholesky QR2 Method [9]

The Cholesky QR2 factorization applies a second Cholesky QR factorization to the orthogonal matrix obtained by the first Cholesky QR factorization. This algorithm improves the orthogonality by executing the Cholesky QR factorization twice.

  1.  \(\mathbf{B}_{\mathbf{1}}\mathbf{=}\mathbf{V}^{\mathbf{T}}\mathbf{V}\)
  2.  \(\mathbf{R}_{\mathbf{1}}^{\mathbf{T}}\mathbf{R}_{\mathbf{1}}\mathbf{= Cholesky\ decomposition(}\mathbf{B}_{\mathbf{1}}\mathbf{)}\)
  3.  \(\mathbf{Q}_{\mathbf{1}}\mathbf{= V}\mathbf{R}_{\mathbf{1}}^{\mathbf{- 1}}\)
  4.  \(\mathbf{B}_{\mathbf{2}}\mathbf{=}\mathbf{Q}_{\mathbf{1}}^{\mathbf{T}}\mathbf{Q}_{\mathbf{1}}\)
  5.  \(\mathbf{R}_{\mathbf{2}}^{\mathbf{T}}\mathbf{R}_{\mathbf{2}}\mathbf{= Cholesky\ decomposition(}\mathbf{B}_{\mathbf{2}}\mathbf{)}\)
  6.  \(\mathbf{Q}_{\mathbf{2}}\mathbf{=}\mathbf{Q}_{\mathbf{1}}\mathbf{R}_{\mathbf{2}}^{\mathbf{- 1}}\)
  7.  \(\mathbf{V =}\mathbf{Q}_{\mathbf{1}}\mathbf{R}_{\mathbf{1}}\mathbf{=}\mathbf{Q}_{\mathbf{2}}\mathbf{R}_{\mathbf{2}}\mathbf{R}_{\mathbf{1}}\mathbf{= QR}\)
  8.  \(\mathbf{Q =}\mathbf{Q}_{\mathbf{2}}\)
  9.  \(\mathbf{R =}\mathbf{R}_{\mathbf{2}}\mathbf{R}_{\mathbf{1}}\)


4 Sparse Matrix Data Formats

In order to save memory in sparse matrix computation, only the non-zero elements and their locations are stored. PARCEL supports different sparse matrix formats.


4.1 Compressed Row Storage (CRS) Format

The CRS format compresses the row indices of the non-zero elements of the sparse matrix. Together with the compressed rows, the column index and the value of each non-zero element are stored in one-dimensional arrays. Compressed row indices are stored in increasing order.

\[ A = \left( \begin{array}{ccc} a & b & c & 0\\ 0 & d & 0 & 0\\ e & 0 & f & g\\ 0 & h & i & j \end{array} \right) \]


One process case

Values of non-zero elements :: \(crsA=\{a,b,c,d,e,f,g,h,i,j\}\)

Compressed row indices :: \(crsRow\_ptr=\{1,4,5,8,11\}\)

Column indices of non-zero elements :: \(crsCol=\{1,2,3,2,1,3,4,2,3,4\}\)


Two process case

Values of non-zero elements :: \(crsA=\{a,b,c,d\}\)

Compressed row indices :: \(crsRow\_ptr=\{1,4,5\}\)

Column indices of non-zero elements :: \(crsCol=\{1,2,3,2\}\)

Values of non-zero elements :: \(crsA=\{e,f,g,h,i,j\}\)

Compressed row indices :: \(crsRow\_ptr=\{1,4,7\}\)

Column indices of non-zero elements :: \(crsCol=\{1,3,4,2,3,4\}\)
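
For illustration, a sparse matrix-vector product \(y = Ax\) over the CRS arrays defined above can be written as follows (single-process case, 1-based indexing; a sketch, not the PARCEL implementation).

  ! Minimal sketch: y = A*x for a matrix stored in the CRS format
  ! using the arrays crsA, crsRow_ptr and crsCol shown above.
  subroutine crs_spmv(n, crsRow_ptr, crsCol, crsA, x, y)
    implicit none
    integer, intent(in) :: n
    integer, intent(in) :: crsRow_ptr(n+1), crsCol(*)
    double precision, intent(in)  :: crsA(*), x(*)
    double precision, intent(out) :: y(n)
    integer :: i, j
    do i = 1, n
       y(i) = 0.0d0
       do j = crsRow_ptr(i), crsRow_ptr(i+1) - 1
          y(i) = y(i) + crsA(j) * x(crsCol(j))
       end do
    end do
  end subroutine crs_spmv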



4.2 Diagonal (DIA) Format

The DIA format stores the non-zero elements of a matrix as diagonals. This format provides high performance for band block diagonal matrices. In addition to the element values, an offset value for each diagonal of the matrix is stored in a one-dimensional array. A negative offset indicates that the diagonal is below the main diagonal of the matrix, and a positive offset indicates that the diagonal is above the main diagonal. The main diagonal has offset zero. Offset values are stored in increasing order.

\[ A = \left( \begin{array}{ccc} a & b & c & 0\\ 0 & d & 0 & 0\\ e & 0 & f & g\\ 0 & h & i & j \end{array} \right) \]


Single process case

Elements :: \(diaA=\{0,0,e,h,0,0,0,i,a,d,f,j,b,0,g,0,c,0,0,0\}\)

Offsets :: \(offset=\{-2,-1,0,1,2\}\)

Number of diagonals :: \(nnd=5\)


Two process case

Elements :: \(diaA=\{a,d,b,0,c,0\}\)

Offsets :: \(offset=\{0,1,2\}\)

Number of diagonals :: \(nnd=3\)

Elements :: \(diaA=\{e,h,0,i,f,j,g,0\}\)

Offsets :: \(offset=\{-2,-1,0,1\}\)

Number of diagonals :: \(nnd=4\)
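
For illustration, a sparse matrix-vector product \(y = Ax\) over the DIA arrays defined above can be written as follows (single-process case; each diagonal occupies a length-n segment of diaA padded with zeros, as in the example; a sketch, not the PARCEL implementation).

  ! Minimal sketch: y = A*x for a matrix stored in the DIA format.
  ! Diagonal k occupies diaA((k-1)*n+1 : k*n); the entry in row i
  ! belongs to column i + offset(k).
  subroutine dia_spmv(n, nnd, offset, diaA, x, y)
    implicit none
    integer, intent(in) :: n, nnd
    integer, intent(in) :: offset(nnd)
    double precision, intent(in)  :: diaA(n*nnd), x(n)
    double precision, intent(out) :: y(n)
    integer :: i, k, jcol
    y(1:n) = 0.0d0
    do k = 1, nnd
       do i = 1, n
          jcol = i + offset(k)
          if (jcol >= 1 .and. jcol <= n) then
             y(i) = y(i) + diaA((k-1)*n + i) * x(jcol)
          end if
       end do
    end do
  end subroutine dia_spmv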



4.3 Domain Decomposition Method (DDM) Format

4.3.1 Domain Decomposition Parameters

The DDM format extends the DIA format for simulations based on stencil computations, such as finite difference methods with domain decomposition. By specifying the list of processes that exchange data with each other so that it matches the domain decomposition of the simulation, the communication needed for parallel processing of matrices given by one-, two-, and three-dimensional structured grids can be minimized. The related parameter settings are described below.

The number of neighboring processes for \(m\)-dimensional domain decomposition is given by \(3^{m}-1\). For example, in the two-dimensional case the number of neighboring processes is eight, while in the three-dimensional case it becomes \(26\).

The list of neighboring processes is specified by their ranks in the (one-dimensional) MPI communicator. At the boundaries of the computational domain, negative values are assigned to neighbor_ranks for the directions in which no neighboring process exists. The sample code below can be used as the standard way of setting neighbor_ranks. The parameters npe_x, npe_y and npe_z represent the number of processes in each direction (x, y, z), and rank_x, rank_y and rank_z represent the ranks in each direction. The definition of the one-dimensional index iirank must be consistent with that used in the simulation.


ixmax = 0; iymax = 0; izmax = 0; num = 0;
neighbor_ranks = -1
if ( npe_x > 1 ) ixmax=1
if ( npe_y > 1 ) iymax=1
if ( npe_z > 1 ) izmax=1
do iz=-izmax, izmax
  do iy=-iymax, iymax
    do ix=-ixmax, ixmax
      n1x = rank_x +ix; n1y = rank_y +iy; n1z = rank_z +iz;
      if ( ix*ix +iy*iy +iz*iz == 0 ) cycle
      num = num +1
      if ( n1z >=0 .and. n1z < npe_z ) then
        if ( n1y >=0 .and. n1y < npe_y ) then
          if ( n1x >=0 .and. n1x < npe_x ) then
            iirank = npe_x *npe_y *n1z +npe_x *n1y +n1x;
            neighbor_ranks(num) = iirank;
          endif
        endif
      endif
    end do
  end do
end do


The grid positions referred to in stencil computations such as finite differences are specified in each direction. For example, in the case of a seven-point stencil computation based on three-point finite differences in each direction for the three-dimensional Poisson equation, the grid offsets for the east, west, south, north, top, and bottom points relative to the base grid can be set as shown below.

i  ioff_grids(*,1)  ioff_grids(*,2)  ioff_grids(*,3)
 base 1 0 0 0
 east 2 1 0 0
 west 3 -1 0 0
 south 4 0 1 0
 north 5 0 -1 0
 top 6 0 0 1
 bottom 7 0 0 -1

\[ A = \left( \begin{array}{ccc} a & b & 0 & 0\\ c & d & e & 0\\ 0 & f & g & h\\ 0 & 0 & i & j \end{array} \right) \]


Single process case

Elements     :: \(val\_dia=\{a,d,g,j,b,e,h,0,0,c,f,i\}\)

Offsets     :: \(ioff\_dia=\{0,0,-1\}\)

Number of diagonals :: \(nnd=3\)

Stencils  ::

i  ioff_grids(*,1)
 reference 1 0
 east  2 1
 west  3 -1

Two process case

Elements     :: \(val\_dia=\{a,d,b,e,0,c\}\)

Offsets     :: \(ioff\_dia=\{0,0,-1\}\)

Number of diagonals :: \(nnd=3\)

Stencils  ::

i  ioff_grids(*,1)
 reference 1 0
 east  2 1
 west  3 -1

Elements     :: \(val\_dia=\{g,j,h,0,f,i\}\)

Offsets     :: \(ioff\_dia=\{0,0,0\}\)

Number of diagonals :: \(nnd=3\)

Stencils  ::

i  ioff_grids(*,1)
 reference 1 0
 east  2 1
 west  3 -1


5 Install

This chapter explains how to compile and install the PARCEL library.

5.1 Compiling the PARCEL library

PARCEL requires a C compiler, a Fortran compiler that supports a preprocessor, MPI, LAPACK, and OpenMP. When compiling with \(gfortran\), the option \(-cpp\) has to be specified to activate the preprocessor. Compiler options can be changed in the file arch/make_config. An example of a make_config file is shown below.

To compile, execute the make command in the top directory of the extracted archive. Then, execute “make install” to create the example, include, and lib directories in the directory specified by INSTALLDIR. If parcel_sparse.a has been created in the lib directory, the compilation was successful. For large-scale problems of 2 billion dimensions or more, 64-bit integers are required, so it is necessary to compile with 64-bit integer types. To compile the GPU version of PARCEL, set CUDAFLAG to “ON”. By setting CUDAAWARE to “ON”, a library that supports CUDA-Aware MPI is created. The cuBLAS and cuSPARSE libraries need to be linked.

  CUDAFLAG  = ON
  CUDAAWARE = ON
  CC     = mpiicc
  FC     = mpiifort
 
  CFLAGS  = -O3
  FFLAGS  = -O3 -fpp -qopenmp -xHost  -mcmodel=large
 
  FP_MODE = -fp-model precise
 
  INCLUDEDIR =
  LD_MPI     = -lmpifort
  LD_FORT    = -lifcore
  LD_LAPACK  = -mkl
  LD_CUDA    = -lcudart -lcublas -lcusparse
  NVCFLAGS  = \
            -arch=sm_80 -ccbin=${CC} -O3 -restrict  \
            --expt-extended-lambda -use_fast_math   \
            --default-stream per-thread  \
            --nvlink-options -Werror     \
            --maxrregcount=128  \
            --extra-device-vectorization  \
            --restrict \
             "-std=c++11"
 
  #FDEFINE = -DPARCEL_INT8
  INSTALLDIR = ../PARCEL_1.2

    Option     Explanation
CC C compiler command name
FC Fortran compiler command name
CFLAGS C compiler options
FFLAGS Fortran compiler options
FP_MODE Option to suppress optimizations with arithmetic order change (Only for quad precision)
CUDAFLAG CUDA compile flag (ON: enable, OFF: disable)
CUDAAWARE CUDA-Aware MPI compile flag (ON: enable, OFF: disable)
NVCFLAGS CUDA compiler options
INCLUDEDIR Include directories
LD_MPI MPI library
LD_LAPACK LAPACK library
LD_CUDA CUDA library(cuBLAS,cuSPARSE)
FDEFINE 64bit integer becomes available with -DPARCEL_INT8.
INSTALLDIR PARCEL installation directory


5.2 Usage of the PARCEL library

All the routines in the PARCEL library can be called by linking parcel_sparse.a which is generated in the directory lib.


6 PARCEL routines

The interface of each routine in the PARCEL library is described below. The following specifications are common to all routines.

The names of routines starting with ‘parcel_d’ and ‘parcel_dd’ denote PARCEL routines implemented in double precision and quadruple (double-double) precision, respectively.

The GPU version allocates memory on the GPU with the Fortran type(c_ptr). The subroutine PARCEL_Init must be called before executing any Krylov subspace solver subroutine in order to specify the GPU used by each process. The input matrix \(A\), the right hand side vector \(b\), and the initial vector \(x\) are also allocated in the GPU memory with the Fortran type(c_ptr). Sparse matrix-vector products in the CRS format are implemented using cuSPARSE (cusparseSpMV). The GPU version does not support the fine-block preconditioner, but the Neumann series preconditioner is available.

The communication-computation overlap is implemented in the following three modes. iovlflag=0 does not use any communication-computation overlap. iovlflag=1 uses the communication-computation overlap in which the computation is started after all non-blocking communications are launched at once. iovlflag=2 uses the communication-computation overlap in which the non-blocking communication and the computation in each direction are processed sequentially with barrier synchronization, which reuses and saves communication buffers. The communication-computation overlap is not supported in the GPU version.

When precon_thblock is zero or negative, or is greater than the number of threads per process, precon_thblock is set to the number of threads per process, which is the default value. One can reduce the number of blocks to enlarge the block size and improve the accuracy of the preconditioning, but this also leads to a degradation of the thread parallel performance. In the GPU version, one block per process is computed by the ILU factorization of cuSPARSE (cusparseDcsrsv2_solve).

When independ_nvec is zero or negative, the fine-block preconditioner is not applied. On A64FX, independ_nvec=300 is recommended to facilitate optimizations such as SIMD operations and software pipelining by the Fujitsu compiler. A larger independ_nvec improves the computational performance, while the accuracy of the preconditioning becomes worse. When independ_nvec is greater than the size of the vector on each process, n, the fine-block preconditioner becomes equivalent to the point Jacobi preconditioner.

The thread optimization of SpMV employs a cyclic loop division to facilitate the reuse of on-cache data between different threads. nBlock is the chunk size for this loop division. When nBlock is zero or negative, the default value of nBlock=2000 is used.
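
As an illustration of this chunked cyclic division (an assumption about the loop structure, not the PARCEL source), an OpenMP static schedule with chunk size nBlock distributes blocks of nBlock consecutive rows to the threads in a round-robin fashion:

  ! Minimal sketch: CRS SpMV row loop divided cyclically among threads
  ! in chunks of nBlock rows via an OpenMP static schedule.
  subroutine crs_spmv_cyclic(n, nBlock, crsRow_ptr, crsCol, crsA, x, y)
    implicit none
    integer, intent(in) :: n, nBlock
    integer, intent(in) :: crsRow_ptr(n+1), crsCol(*)
    double precision, intent(in)  :: crsA(*), x(*)
    double precision, intent(out) :: y(n)
    integer :: i, j
    !$omp parallel do schedule(static, nBlock) private(j)
    do i = 1, n
       y(i) = 0.0d0
       do j = crsRow_ptr(i), crsRow_ptr(i+1) - 1
          y(i) = y(i) + crsA(j) * x(crsCol(j))
       end do
    end do
    !$omp end parallel do
  end subroutine crs_spmv_cyclic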

When the i-th eigenvalue is a real value, the corresponding eigenvector is stored in Evec(n*(i-1)+1:n*i). When the i-th eigenvalue is a complex value and the (i+1)-th eigenvalue is its complex conjugate value, the i-th and (i+1)-th eigenvectors are stored as follows. The CA-Arnoldi method is not supported in the GPU version.

Eigenvector Real Part Imaginary Part
i-th   Evec(n*(i-1)+1:n*i)     Evec(n*i+1:n*(i+1))  
(i+1)-th   Evec(n*(i-1)+1:n*i)     -Evec(n*i+1:n*(i+1))  


6.1 PARCEL_Init


call PARCEL_Init(myrank)


 Initialize the PARCEL execution environment.

Parameter(dimension) Type Input/Output Description
myrank integer*4 in MPI process rank


6.2 PARCEL_Free


call PARCEL_Free()


 Terminates the PARCEL execution environment.


6.3 parcel_dcg (CPU/GPU)


call parcel_dcg( icomm, vecx,vecb, n,gn,nnz,istart, crsA,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, iret )


A simultaneous linear equation system Ax = b is solved by the conjugate gradient method (CG method). Non-zero elements are stored in CRS format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx(n) CPU : double precision / GPU : type(c_ptr) in/out in: initial vector, out: solution vector
vecb(n) CPU : double precision / GPU : type(c_ptr) in Right hand side vector
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
nnz integer*4 / integer*8 in Number of non-zero elements on each process
istart integer*4 / integer*8 in Start line of matrix on each process
crsA(nnz) CPU : double precision / GPU : type(c_ptr) in Non-zero elements of matrix stored in CRS format
crsRow_ptr(n+1) CPU : integer*4 / integer*8 / GPU : type(c_ptr) in Pointer table in CRS format
crsCol(nnz) CPU : integer*4 / integer*8 / GPU : type(c_ptr) in Column numbers of non-zero elements in CRS format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz, 4: Neumann series(GPU), 5: block Neumann series(GPU) )
ilu_method integer in Flag for incomplete LU factorization(0:ILU(0),1:D-ILU(DIA components match),2:D-ILU(Element sums in a row match),Only 1 and 2 are available for additive Schwartz)
addL integer in Overlapping width in the additive Schwartz method
iflagAS integer in Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
Neuman_itrmax integer in Order of Neumann series preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
iret integer out Error code (0:normal)
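
A minimal CPU usage sketch is shown below. It assumes a small symmetric tridiagonal test matrix assembled in the CRS format on a single MPI process and the point Jacobi preconditioner; the argument order follows the interface above, and the routine is referenced as an external procedure after linking parcel_sparse.a.

  program test_parcel_dcg
    use mpi
    implicit none
    integer, parameter :: n = 4, gn = 4, nnz = 10
    integer :: icomm, istart, ierr, myrank
    integer :: ipreflag, ilu_method, addL, iflagAS
    integer :: itrmax, iovlflag, precon_thblock, Neuman_itrmax
    integer :: independ_nvec, nBlock, iret
    double precision :: rtolmax
    double precision :: vecx(n), vecb(n), crsA(nnz)
    integer :: crsRow_ptr(n+1), crsCol(nnz)
    double precision, allocatable :: reshistory(:)

    call MPI_Init(ierr)
    icomm = MPI_COMM_WORLD
    call MPI_Comm_rank(icomm, myrank, ierr)

    ! 1D Laplacian: 2 on the diagonal, -1 on the off-diagonals (symmetric)
    crsRow_ptr = (/ 1, 3, 6, 9, 11 /)
    crsCol     = (/ 1, 2,  1, 2, 3,  2, 3, 4,  3, 4 /)
    crsA       = (/ 2.0d0, -1.0d0,  -1.0d0, 2.0d0, -1.0d0, &
                   -1.0d0, 2.0d0, -1.0d0,  -1.0d0, 2.0d0 /)
    istart = 1                  ! first row owned by this process
    vecb   = 1.0d0              ! right hand side vector
    vecx   = 0.0d0              ! initial guess

    ipreflag   = 1              ! point Jacobi preconditioner
    ilu_method = 0; addL = 0; iflagAS = 1
    itrmax  = 100; rtolmax = 1.0d-8
    allocate(reshistory(itrmax))
    iovlflag = 0; precon_thblock = 0; Neuman_itrmax = 0
    independ_nvec = 0; nBlock = 0

    call parcel_dcg( icomm, vecx, vecb, n, gn, nnz, istart,        &
                     crsA, crsRow_ptr, crsCol,                     &
                     ipreflag, ilu_method, addL, iflagAS,          &
                     itrmax, rtolmax, reshistory, iovlflag,        &
                     precon_thblock, Neuman_itrmax, independ_nvec, &
                     nBlock, iret )

    if (myrank == 0) print *, 'iterations =', itrmax, ' iret =', iret
    call MPI_Finalize(ierr)
  end program test_parcel_dcg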


6.4 parcel_dbicgstab (CPU/GPU)


call parcel_dbicgstab( icomm, vecx,vecb, n,gn,nnz,istart, crsA,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, iret )


A simultaneous linear equation system Ax = b is solved by the stabilized biconjugate gradient method (Bi-CGSTAB method). Non-zero elements are stored in CRS format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx(n) CPU : double precision / GPU : type(c_ptr) in/out in: initial vector, out: solution vector
vecb(n) CPU : double precision / GPU : type(c_ptr) in Right hand side vector
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
nnz integer*4 / integer*8 in Number of non-zero elements on each process
istart integer*4 / integer*8 in Start line of matrix on each process
crsA(nnz) CPU : double precision / GPU : type(c_ptr) in Non-zero elements of matrix stored in CRS format
crsRow_ptr(n+1) CPU : integer*4 / integer*8 / GPU : type(c_ptr) in Pointer table in CRS format
crsCol(nnz) CPU : integer*4 / integer*8 / GPU : type(c_ptr) in Column numbers of non-zero elements in CRS format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz, 4: Neumann series(GPU), 5: block Neumann series(GPU) )
ilu_method integer in Flag for incomplete LU factorization(0:ILU(0),1:D-ILU(DIA components match),2:D-ILU(Element sums in a row match),Only 1 and 2 are available for additive Schwartz)
addL integer in Overlapping width in the additive Schwartz method
iflagAS integer in Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
Neuman_itrmax integer in Order of Neumann series preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
iret integer out Error code (0:normal)
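
The following is a minimal CPU usage sketch for parcel_dbicgstab; it is not taken from the library's example programs. It builds a 1D Laplace-type tridiagonal test matrix in CRS format on a single MPI process and calls the solver with point Jacobi preconditioning. The program name, the test matrix, and all parameter values are illustrative; 1-based global numbering is assumed for istart and crsCol, and any library setup described in earlier chapters is omitted.

  program sketch_dbicgstab
    use mpi
    implicit none
    integer, parameter :: n = 100                  ! rows on this process
    integer :: gn, nnz, istart
    integer :: icomm, ierr, nprocs, myrank, i, k
    integer, allocatable :: crsRow_ptr(:), crsCol(:)
    double precision, allocatable :: crsA(:), vecx(:), vecb(:), reshistory(:)
    integer :: ipreflag, ilu_method, addL, iflagAS, itrmax, iovlflag
    integer :: precon_thblock, Neuman_itrmax, independ_nvec, nBlock, iret
    double precision :: rtolmax

    call MPI_Init(ierr)
    icomm = MPI_COMM_WORLD
    call MPI_Comm_size(icomm, nprocs, ierr)
    call MPI_Comm_rank(icomm, myrank, ierr)

    ! Single-process layout; with nprocs > 1 each rank holds its own row block.
    gn = n
    istart = 1                                     ! assumed 1-based start row
    nnz = 3*n - 2
    allocate(crsA(nnz), crsCol(nnz), crsRow_ptr(n+1), vecx(n), vecb(n))

    ! Tridiagonal test matrix: 2 on the diagonal, -1 on the off-diagonals.
    k = 0
    do i = 1, n
      crsRow_ptr(i) = k + 1
      if (i > 1) then
        k = k + 1; crsA(k) = -1.0d0; crsCol(k) = i - 1
      end if
      k = k + 1; crsA(k) = 2.0d0; crsCol(k) = i
      if (i < n) then
        k = k + 1; crsA(k) = -1.0d0; crsCol(k) = i + 1
      end if
    end do
    crsRow_ptr(n+1) = k + 1

    vecb(:) = 1.0d0                                ! right hand side
    vecx(:) = 0.0d0                                ! initial guess

    ipreflag   = 1                                 ! point Jacobi
    ilu_method = 0
    addL       = 0
    iflagAS    = 4                                 ! NONE (no additive Schwarz)
    itrmax     = 1000
    rtolmax    = 1.0d-8
    allocate(reshistory(itrmax))
    iovlflag       = 0
    precon_thblock = 1                             ! placeholder (block Jacobi / Schwarz only)
    Neuman_itrmax  = 0                             ! placeholder (Neumann series only)
    independ_nvec  = 0                             ! placeholder (subdivision preconditioning only)
    nBlock         = 2000                          ! placeholder

    call parcel_dbicgstab( icomm, vecx,vecb, n,gn,nnz,istart, &
         crsA,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, &
         itrmax,rtolmax, reshistory, iovlflag, precon_thblock, &
         Neuman_itrmax, independ_nvec, nBlock, iret )

    if (myrank == 0) print *, 'iterations =', itrmax, '  iret =', iret

    call MPI_Finalize(ierr)
  end program sketch_dbicgstab

On return, itrmax holds the number of iterations actually performed and reshistory(1:itrmax) the relative residual history.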


6.5 parcel_dgmres (CPU/GPU)


call parcel_dgmres( icomm, vecx,vecb, n,gn,nnz,istart, crsA,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, gmres_m,gmres_GSflag, iret )


A simultaneous linear equation system Ax = b is solved by the generalized minimum residual method (GMRES (m) method). Non-zero elements are stored in CRS format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx(n) CPU : double precision / GPU : type(c_ptr) in/out in: initial vector, out: solution vector
vecb(n) CPU : double precision / GPU : type(c_ptr) in Right hand side vector
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
nnz integer*4 / integer*8 in Number of non-zero elements on each process
istart integer*4 / integer*8 in Start line of matrix on each process
crsA(nnz) CPU : double precision / GPU : type(c_ptr) in Non-zero elements of matrix stored in CRS format
crsRow_ptr(n+1) CPU : integer*4 / integer*8 / GPU : type(c_ptr) in Pointer table in CRS format
crsCol(nnz) CPU : integer*4 / integer*8 / GPU : type(c_ptr) in Column numbers of non-zero elements in CRS format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz, 4: Neumann series (GPU), 5: block Neumann series (GPU))
ilu_method integer in Flag for incomplete LU factorization (0: ILU(0), 1: D-ILU (DIA components match), 2: D-ILU (element sums in a row match); only 1 and 2 are available for the additive Schwarz method)
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
Neuman_itrmax integer in Order of Neumann series preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
gmres_m integer in Number of iterations until restart
gmres_GSflag integer in Flag for orthogonalization algorithm (1: MGS, 2: CGS)
iret integer out Error code (0:normal)
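
To illustrate the two additional arguments, the fragment below reuses the CRS arrays and solver parameters from the parcel_dbicgstab sketch in 6.4; gmres_m and gmres_GSflag are assumed to be declared as integers, and the values shown are illustrative.

    gmres_m      = 30        ! restart the Krylov basis every 30 iterations
    gmres_GSflag = 1         ! orthogonalize with modified Gram-Schmidt (MGS)

    call parcel_dgmres( icomm, vecx,vecb, n,gn,nnz,istart, &
         crsA,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, &
         itrmax,rtolmax, reshistory, iovlflag, precon_thblock, &
         Neuman_itrmax, independ_nvec, nBlock, gmres_m,gmres_GSflag, iret )

A larger gmres_m reduces the number of restarts but increases memory use and the cost of orthogonalization.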


6.6 parcel_dcagmres (CPU/GPU)


call parcel_dcagmres( icomm, vecx,vecb, n,gn,nnz,istart, crsA,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, cagmres_sstep, cagmres_tstep, cagmres_basis, cagmres_QRflag, iret )


A simultaneous linear equation system Ax = b is solved by the communication reduced generalized minimum residual method (CA-GMRES (s, t) method). Non-zero elements are stored in CRS format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx(n) CPU : double precision / GPU : type(c_ptr) in/out in: initial vector, out: solution vector
vecb(n) CPU : double precision / GPU : type(c_ptr) in Right hand side vector
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
nnz integer*4 / integer*8 in Number of non-zero elements on each process
istart integer*4 / integer*8 in Start line of matrix on each process
crsA(nnz) CPU : double precision / GPU : type(c_ptr) in Non-zero elements of matrix stored in CRS format
crsRow_ptr(n+1) CPU : integer*4 / integer*8 / GPU : type(c_ptr) in Pointer table in CRS format
crsCol(nnz) CPU : integer*4 / integer*8 / GPU : type(c_ptr) in Column numbers of non-zero elements in CRS format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz, 4: Neumann series (GPU), 5: block Neumann series (GPU))
ilu_method integer in Flag for incomplete LU factorization (0: ILU(0), 1: D-ILU (DIA components match), 2: D-ILU (element sums in a row match); only 1 and 2 are available for the additive Schwarz method)
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
Neuman_itrmax integer in Order of Neumann series preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
cagmres_sstep integer in Number of communication avoiding steps s in the CA-GMRES method
cagmres_tstep integer in Number of outer iterations t in the CA-GMRES method (restart length = st)
cagmres_basis integer in Flag for basis vector in the CA-GMRES method (0: monomial basis, 1: Newton basis)
cagmres_QRflag integer in Flag for QR factorization in the CA-GMRES method (1: MGS, 2: CGS, 3: TSQR, 4: CholeskyQR, 5: CholeskyQR2)
iret integer out Error code (0:normal)
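
The fragment below, again reusing the CRS data and solver parameters from the sketch in 6.4, shows one way the communication-avoiding parameters might be set; the values are illustrative and the cagmres_* variables are assumed to be declared as integers.

    cagmres_sstep  = 5       ! s: basis vectors generated per communication-avoiding block
    cagmres_tstep  = 6       ! t: outer iterations, so the restart length is s*t = 30
    cagmres_basis  = 1       ! Newton basis (often better conditioned than the monomial basis)
    cagmres_QRflag = 3       ! TSQR

    call parcel_dcagmres( icomm, vecx,vecb, n,gn,nnz,istart, &
         crsA,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, &
         itrmax,rtolmax, reshistory, iovlflag, precon_thblock, &
         Neuman_itrmax, independ_nvec, nBlock, &
         cagmres_sstep, cagmres_tstep, cagmres_basis, cagmres_QRflag, iret )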


6.7 parcel_dcbcg (CPU/GPU)


call parcel_dcbcg( icomm, vecx,vecb, n,gn,nnz,istart, crsA,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, cbcg_kstep,cbcg_Eigenflag,power_method_itrmax, caarnoldi_sstep,caarnoldi_tstep, caarnoldi_basis,caarnoldi_QRflag, iret )


A simultaneous linear equation system Ax = b is solved by the Chebyshev basis conjugate gradient method (CBCG method). Non-zero elements are stored in CRS format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx(n) CPU : double precision / GPU : type(c_ptr) in/out in: initial vector, out: solution vector
vecb(n) CPU : double precision / GPU : type(c_ptr) in Right hand side vector
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
nnz integer*4 / integer*8 in Number of non-zero elements on each process
istart integer*4 / integer*8 in Start line of matrix on each process
crsA(nnz) CPU : double precision / GPU : type(c_ptr) in Non-zero elements of matrix stored in CRS format
crsRow_ptr(n+1) CPU : integer*4 / integer*8 / GPU : type(c_ptr) in Pointer table in CRS format
crsCol(nnz) CPU : integer*4 / integer*8 / GPU : type(c_ptr) in Column numbers of non-zero elements in CRS format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz, 4: Neumann series (GPU), 5: block Neumann series (GPU))
ilu_method integer in Flag for incomplete LU factorization (0: ILU(0), 1: D-ILU (DIA components match), 2: D-ILU (element sums in a row match); only 1 and 2 are available for the additive Schwarz method)
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
Neuman_itrmax integer in Order of Neumann series preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
cbcg_kstep integer in Number of communication avoiding steps
cbcg_Eigenflag integer in Flag for eigenvalue computation (1: power method, 2: CA-ARNOLDI)
power_method_itrmax integer in Maximum number of iterations in the power method
caarnoldi_sstep integer in Number of communication avoiding steps s of the CA-ARNOLDI method
caarnoldi_tstep integer in Number of outer iterations t in the CA-ARNOLDI method (restart length = st)
caarnoldi_basis integer in Flag for basis vector in the CA-ARNOLDI method (0: monomial basis, 1: Newton basis)
caarnoldi_QRflag integer in Flag for QR factorization in the CA-ARNOLDI method (1: MGS, 2: CGS, 3: TSQR, 4: Cholesky QR, 5: Cholesky QR2)
iret integer out Error code (0:normal)
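
A parameter-setting fragment for parcel_dcbcg is shown below, reusing the CRS data and solver parameters from 6.4. The Chebyshev basis requires an estimate of the extreme eigenvalues, which is why the eigenvalue-estimation arguments appear in the interface; all values here are illustrative and the cbcg_*, power_method_* and caarnoldi_* variables are assumed to be declared as integers.

    cbcg_kstep          = 8     ! Chebyshev basis vectors per communication-avoiding step
    cbcg_Eigenflag      = 1     ! estimate the eigenvalue bounds with the power method
    power_method_itrmax = 100
    ! The caarnoldi_* arguments are presumably referenced only when
    ! cbcg_Eigenflag = 2 (CA-ARNOLDI), but are supplied here regardless.
    caarnoldi_sstep = 4;  caarnoldi_tstep = 5
    caarnoldi_basis = 0;  caarnoldi_QRflag = 1

    call parcel_dcbcg( icomm, vecx,vecb, n,gn,nnz,istart, &
         crsA,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, &
         itrmax,rtolmax, reshistory, iovlflag, precon_thblock, &
         Neuman_itrmax, independ_nvec, nBlock, &
         cbcg_kstep,cbcg_Eigenflag,power_method_itrmax, &
         caarnoldi_sstep,caarnoldi_tstep, caarnoldi_basis,caarnoldi_QRflag, iret )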


6.8 parcel_dcaarnoldi (CPU)


call parcel_dcaarnoldi( icomm, vecx, vecb, n, gn, nnz, istart, crsA, crsRow_ptr, crsCol, ipreflag, ilu_method, addL, iflagAS, itrmax, iovlflag, precon_thblock, independ_nvec, nBlock, caarnoldi_sstep, caarnoldi_tstep, caarnoldi_basis, caarnoldi_QRflag, Evalr, Evali, Evec, Eerr, EmaxID, EminID, iret )


An eigenvalue problem is solved by the communication avoiding Arnoldi method (CA-Arnoldi(s, t)). Non-zero elements are stored in CRS format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx(n) double precision in/out in: initial vector, out: solution vector
vecb(n) double precision in Right hand side vector
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
nnz integer*4 / integer*8 in Number of non-zero elements on each process
istart integer*4 / integer*8 in Start line of matrix on each process
crsA(nnz) double precision in Non-zero elements of matrix stored in CRS format
crsRow_ptr(n+1) integer*4 / integer*8 in Pointer table in CRS format
crsCol(nnz) integer*4 / integer*8 in Column numbers of non-zero elements in CRS format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz)
ilu_method integer in Flag for incomplete LU factorization (0: ILU(0), 1: D-ILU (DIA components match), 2: D-ILU (element sums in a row match); only 1 and 2 are available for the additive Schwarz method)
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
caarnoldi_sstep integer in Number of communication avoiding steps s in the CA-ARNOLDI method
caarnoldi_tstep integer in Number of outer iterations t in the CA-ARNOLDI method (restart length = st)
caarnoldi_basis integer in Flag for basis vector in the CA-ARNOLDI method (0: monomial basis, 1: Newton basis)
caarnoldi_QRflag integer in Flag for QR factorization in the CA-ARNOLDI method (1: MGS, 2: CGS, 3: TSQR, 4: CholeskyQR, 5: CholeskyQR2)
Evalr(caarnoldi_sstep*caarnoldi_tstep) double precision out Real part of the eigenvalue
Evali(caarnoldi_sstep*caarnoldi_tstep) double precision out Imaginary part of eigenvalue
Evec(n*caarnoldi_sstep*caarnoldi_tstep) double precision out Eigenvectors
Eerr(caarnoldi_sstep*caarnoldi_tstep) double precision out Error norm \({\left\| Ax - \lambda x \right\|}/{\left\| \lambda x \right\|}\)
EmaxID integer out ID number of the maximum eigenvalue
EminID integer out ID number of the minimum eigenvalue
iret integer out Error code (0:normal)
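
The fragment below illustrates how the output arrays of parcel_dcaarnoldi might be sized, continuing from the CRS data of the sketch in 6.4. The number of computed Ritz pairs equals caarnoldi_sstep*caarnoldi_tstep; the assumption that eigenvector j occupies Evec((j-1)*n+1:j*n) is ours and should be checked against the library's examples.

    integer :: caarnoldi_sstep, caarnoldi_tstep, caarnoldi_basis, caarnoldi_QRflag
    integer :: nev, EmaxID, EminID
    double precision, allocatable :: Evalr(:), Evali(:), Evec(:), Eerr(:)

    caarnoldi_sstep  = 4
    caarnoldi_tstep  = 8
    caarnoldi_basis  = 1      ! Newton basis
    caarnoldi_QRflag = 3      ! TSQR
    nev = caarnoldi_sstep * caarnoldi_tstep        ! 32 Ritz pairs
    allocate(Evalr(nev), Evali(nev), Eerr(nev), Evec(n*nev))

    call parcel_dcaarnoldi( icomm, vecx, vecb, n, gn, nnz, istart, &
         crsA, crsRow_ptr, crsCol, ipreflag, ilu_method, addL, iflagAS, &
         itrmax, iovlflag, precon_thblock, independ_nvec, nBlock, &
         caarnoldi_sstep, caarnoldi_tstep, caarnoldi_basis, caarnoldi_QRflag, &
         Evalr, Evali, Evec, Eerr, EmaxID, EminID, iret )

    print *, 'largest Ritz value:', Evalr(EmaxID), Evali(EmaxID), '  residual:', Eerr(EmaxID)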


6.9 parcel_dcg_dia (CPU/GPU)


call parcel_dcg_dia( icomm, vecx,vecb, n,gn,istart, diaA,offset,nnd, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, iret )


A simultaneous linear equation system Ax = b is solved by the conjugate gradient method (CG method). Non-zero elements are stored in DIA format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx(n) CPU : double precision / GPU : type(c_ptr) in/out in: initial vector, out: solution vector
vecb(n) CPU : double precision / GPU : type(c_ptr) in Right hand side vector
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
istart integer*4 / integer*8 in Start line of matrix on each process
diaA(n*nnd) CPU : double precision / GPU : type(c_ptr) in Non-zero elements of matrix stored in DIA format
offset(nnd) CPU : integer*4 / integer*8 / GPU : type(c_ptr) in Offset of each non-zero diagonal elements array in DIA format
nnd integer*4 / integer*8 in Number of diagonal elements arrays in DIA format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz, 4: Neumann series (GPU), 5: block Neumann series (GPU))
ilu_method integer in Flag for incomplete LU factorization (0: ILU(0), 1: D-ILU (DIA components match), 2: D-ILU (element sums in a row match); only 1 and 2 are available for the additive Schwarz method)
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
Neuman_itrmax integer in Order of Neumann series preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
iret integer out Error code (0:normal)
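
For comparison with the CRS sketch in 6.4, the fragment below stores the same tridiagonal test matrix in DIA format and calls parcel_dcg_dia (n, vecx, vecb, the solver parameters and the loop index i are reused from that sketch). The packing of diaA assumed here, with diagonal d stored contiguously in diaA((d-1)*n+1:d*n) and zeros where a diagonal runs outside the matrix, should be checked against the DIA format description earlier in this manual.

    integer, parameter :: nnd = 3                  ! three stored diagonals
    integer :: offset(nnd)
    double precision, allocatable :: diaA(:)

    offset = (/ -1, 0, 1 /)                        ! sub-, main and super-diagonal
    allocate(diaA(n*nnd))
    diaA(:) = 0.0d0
    do i = 1, n
      if (i > 1) diaA(i)       = -1.0d0            ! offset -1 (assumed first block)
      diaA(i + n)              =  2.0d0            ! offset  0
      if (i < n) diaA(i + 2*n) = -1.0d0            ! offset +1
    end do
    vecx(:) = 0.0d0                                ! reset the initial guess

    call parcel_dcg_dia( icomm, vecx,vecb, n,gn,istart, diaA,offset,nnd, &
         ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, &
         iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, iret )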


6.10 parcel_dbicgstab_dia (CPU/GPU)


call parcel_dbicgstab_dia( icomm, vecx,vecb, n,gn,istart, diaA,offset,nnd, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, iret )


A simultaneous linear equation system Ax = b is solved by the stabilized biconjugate gradient method (Bi-CGSTAB method). Non-zero elements are stored in DIA format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx(n) CPU : double precision / GPU : type(c_ptr) in/out in: initial vector, out: solution vector
vecb(n) CPU : double precision / GPU : type(c_ptr) in Right hand side vector
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
istart integer*4 / integer*8 in Start line of matrix on each process
diaA(n*nnd) CPU : double precision / GPU : type(c_ptr) in Non-zero elements of matrix stored in DIA format
offset(nnd) CPU : integer*4 / integer*8 / GPU : type(c_ptr) in Offset of each non-zero diagonal elements array in DIA format
nnd integer*4 / integer*8 in Number of diagonal elements arrays in DIA format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz, 4: Neumann series (GPU), 5: block Neumann series (GPU))
ilu_method integer in Flag for incomplete LU factorization (0: ILU(0), 1: D-ILU (DIA components match), 2: D-ILU (element sums in a row match); only 1 and 2 are available for the additive Schwarz method)
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
Neuman_itrmax integer in Order of Neumann series preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
iret integer out Error code (0:normal)


6.11 parcel_dgmres_dia (CPU/GPU)


call parcel_dgmres_dia( icomm, vecx,vecb, n,gn,istart, diaA,offset,nnd, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, gmres_m,gmres_GSflag, iret )


A simultaneous linear equation system Ax = b is solved by the generalized minimum residual method (GMRES (m) method). Non-zero elements are stored in DIA format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx(n) CPU : double precision / GPU : type(c_ptr) in/out in: initial vector, out: solution vector
vecb(n) CPU : double precision / GPU : type(c_ptr) in Right hand side vector
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
istart integer*4 / integer*8 in Start line of matrix on each process
diaA(n*nnd) CPU : double precision / GPU : type(c_ptr) in Non-zero elements of matrix stored in DIA format
offset(nnd) CPU : integer*4 / integer*8 / GPU : type(c_ptr) in Offset of each non-zero diagonal elements array in DIA format
nnd integer*4 / integer*8 in Number of diagonal elements arrays in DIA format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz, 4: Neumann series (GPU), 5: block Neumann series (GPU))
ilu_method integer in Flag for incomplete LU factorization (0: ILU(0), 1: D-ILU (DIA components match), 2: D-ILU (element sums in a row match); only 1 and 2 are available for the additive Schwarz method)
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
Neuman_itrmax integer in Order of Neumann series preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
gmres_m integer in Number of iterations until restart
gmres_GSflag integer in Flag for orthogonalization algorithm (1: MGS, 2: CGS)
iret integer out Error code (0:normal)


6.12 parcel_dcagmres_dia (CPU/GPU)


call parcel_dcagmres_dia( icomm, vecx,vecb, n,gn,istart, diaA,offset,nnd, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, cagmres_sstep, cagmres_tstep, cagmres_basis, cagmres_QRflag, iret )


A simultaneous linear equation system Ax = b is solved by the communication reduced generalized minimum residual method (CA-GMRES (s, t) method). Non-zero elements are stored in DIA format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx(n) CPU : double precision / GPU : type(c_ptr) in/out in: initial vector, out: solution vector
vecb(n) CPU : double precision / GPU : type(c_ptr) in Right hand side vector
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
istart integer*4 / integer*8 in Start line of matrix on each process
diaA(n*nnd) CPU : double precision / GPU : type(c_ptr) in Non-zero elements of matrix stored in DIA format
offset(nnd) CPU : integer*4 / integer*8 / GPU : type(c_ptr) in Offset of each non-zero diagonal elements array in DIA format
nnd integer*4 / integer*8 in Number of diagonal elements arrays in DIA format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz, 4: Neumann series (GPU), 5: block Neumann series (GPU))
ilu_method integer in Flag for incomplete LU factorization (0: ILU(0), 1: D-ILU (DIA components match), 2: D-ILU (element sums in a row match); only 1 and 2 are available for the additive Schwarz method)
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
Neuman_itrmax integer in Order of Neumann series preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
cagmres_sstep integer in Number of communication avoiding steps s in the CA-GMRES method
cagmres_tstep integer in Number of outer iterations t in the CA-GMRES method (restart length = st)
cagmres_basis integer in Flag for basis vector in the CA-GMRES method (0: monomial basis, 1: Newton basis)
cagmres_QRflag integer in Flag for QR factorization in the CA-GMRES method (1: MGS, 2: CGS, 3: TSQR, 4: CholeskyQR, 5: CholeskyQR2)
iret integer out Error code (0:normal)


6.13 parcel_dcbcg_dia (CPU/GPU)


call parcel_dcbcg_dia( icomm, vecx,vecb, n,gn,istart, diaA,offset,nnd, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, cbcg_kstep,cbcg_Eigenflag,power_method_itrmax, caarnoldi_sstep,caarnoldi_tstep, caarnoldi_basis,caarnoldi_QRflag, iret )


A simultaneous linear equation system Ax = b is solved by the Chebyshev basis conjugate gradient method (CBCG method). Non-zero elements are stored in DIA format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx(n) CPU : double precision / GPU : type(c_ptr) in/out in: initial vector, out: solution vector
vecb(n) CPU : double precision / GPU : type(c_ptr) in Right hand side vector
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
istart integer*4 / integer*8 in Start line of matrix on each process
diaA(n*nnd) CPU : double precision / GPU : type(c_ptr) in Non-zero elements of matrix stored in DIA format
offset(nnd) CPU : integer*4 / integer*8 / GPU : type(c_ptr) in Offset of each non-zero diagonal elements array in DIA format
nnd integer*4 / integer*8 in Number of diagonal elements arrays in DIA format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz, 4: Neumann series (GPU), 5: block Neumann series (GPU))
ilu_method integer in Flag for incomplete LU factorization (0: ILU(0), 1: D-ILU (DIA components match), 2: D-ILU (element sums in a row match); only 1 and 2 are available for the additive Schwarz method)
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
Neuman_itrmax integer in Order of Neumann series preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
cbcg_kstep integer in Number of communication avoiding steps
cbcg_Eigenflag integer in Flag for eigenvalue computation (1: power method, 2: CA-ARNOLDI)
power_method_itrmax integer in Maximum number of iterations in the power method
caarnoldi_sstep integer in Number of communication avoiding steps s of the CA-ARNOLDI method
caarnoldi_tstep integer in Number of outer iterations t in the CA-ARNOLDI method (restart length = st)
caarnoldi_basis integer in Flag for basis vector in the CA-ARNOLDI method (0: monomial basis, 1: Newton basis)
caarnoldi_QRflag integer in Flag for QR factorization in the CA-ARNOLDI method (1: MGS, 2: CGS, 3: TSQR, 4: Cholesky QR, 5: Cholesky QR2)
iret integer out Error code (0:normal)


6.14 parcel_dcaarnoldi_dia (CPU)


call parcel_dcaarnoldi_dia( icomm, vecx, vecb, n, gn, istart, diaA,offset,nnd, ipreflag, ilu_method, addL, iflagAS, itrmax, iovlflag, precon_thblock, independ_nvec, nBlock, caarnoldi_sstep, caarnoldi_tstep, caarnoldi_basis, caarnoldi_QRflag, Evalr, Evali, Evec, Eerr, EmaxID, EminID, iret )


An eigenvalue problem is solved by the communication avoiding Arnoldi method (CA-Arnoldi(s, t)). Non-zero elements are stored in DIA format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx(n) double precision in/out in: initial vector, out: solution vector
vecb(n) double precision in Right hand side vector
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
istart integer*4 / integer*8 in Start line of matrix on each process
diaA(n*nnd) double precision in Non-zero elements of matrix stored in DIA format
offset(nnd) integer*4 / integer*8 in Offset of each non-zero diagonal elements array in DIA format
nnd integer*4 / integer*8 in Number of diagonal elements arrays in DIA format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz)
ilu_method integer in Flag for incomplete LU factorization (0: ILU(0), 1: D-ILU (DIA components match), 2: D-ILU (element sums in a row match); only 1 and 2 are available for the additive Schwarz method)
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
caarnoldi_sstep integer in Number of communication avoiding steps s in the CA-ARNOLDI method
caarnoldi_tstep integer in Number of outer iterations t in the CA-ARNOLDI method (restart length = st)
caarnoldi_basis integer in Flag for basis vector in the CA-ARNOLDI method (0: monomial basis, 1: Newton basis)
caarnoldi_QRflag integer in Flag for QR factorization in the CA-ARNOLDI method (1: MGS, 2: CGS, 3: TSQR, 4: CholeskyQR, 5: CholeskyQR2)
Evalr(caarnoldi_sstep*caarnoldi_tstep) double precision out Real part of the eigenvalue
Evali(caarnoldi_sstep*caarnoldi_tstep) double precision out Imaginary part of eigenvalue
Evec(n*caarnoldi_sstep*caarnoldi_tstep) double precision out Eigenvectors
Eerr(caarnoldi_sstep*caarnoldi_tstep) double precision out Error norm \({\left\| Ax - \lambda x \right\|}/{\left\| \lambda x \right\|}\)
EmaxID integer out ID number of the maximum eigenvalue
EminID integer out ID number of the minimum eigenvalue
iret integer out Error code (0:normal)


6.15 parcel_dcg_ddm (CPU/GPU)


call parcel_dcg_ddm( icomm, x,b, m, nnd,ioff_dia,val_dia, num_neighbor_ranks,neighbor_ranks,margin,ncomp_grids, ndiv_grids,num_grids,ioff_grids, div_direc_th, ipreflag, addL,iflagAS, itrmax,rtolmax,reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, iret )


A simultaneous linear equation system Ax = b is solved by the conjugate gradient method (CG method). Non-zero elements are stored in DDM format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
x(m) CPU : double precision / GPU : type(c_ptr) in/out in: initial vector, out: solution vector
b(m) CPU : double precision / GPU : type(c_ptr) in Right hand side vector
m integer in Size of vector on each process
nnd integer in Number of diagonal elements arrays on each process
ioff_dia(nnd) CPU : integer / GPU : type(c_ptr) in Offset of each non-zero diagonal elements array in DDM format
val_dia(m*nnd) CPU : double precision / GPU : type(c_ptr) in Non-zero elements of matrix stored in DDM format
num_neighbor_ranks integer in Number of neighboring processes
neighbor_ranks(num_neighbor_ranks) integer in List of neighboring processes
margin integer in Maximum absolute value of the offsets in ioff_grids
ncomp_grids integer in Dimension of structured grids
ndiv_grids(ncomp_grids) integer in Number of decomposed domains in each direction
num_grids(ncomp_grids) integer in Number of grids on each process
ioff_grids(nnd,ncomp_grids) integer in Offset of stencil data location in each direction
div_direc_th integer in Direction of thread parallelization (1: x, 2: y, 3: z)
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz, 4: Neumann series (GPU), 5: block Neumann series (GPU))
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
Neuman_itrmax integer in Order of Neumann series preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
iret integer out Error code (0:normal)
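
Because the DDM arrays encode a structured-grid domain decomposition, only the call pattern is sketched below; the decomposition data (local size, neighbor ranks, stencil offsets, halo width) must be built as described in the DDM format section earlier in this manual, every value shown is a placeholder, and the remaining solver parameters are reused from the sketch in 6.4.

    integer :: m, nnd, num_neighbor_ranks, margin, ncomp_grids, div_direc_th
    integer, allocatable :: ioff_dia(:), neighbor_ranks(:)
    integer, allocatable :: ndiv_grids(:), num_grids(:), ioff_grids(:,:)
    double precision, allocatable :: val_dia(:), x(:), b(:)

    ncomp_grids  = 3                               ! placeholder: 3D structured grid
    nnd          = 7                               ! placeholder: 7-point stencil
    div_direc_th = 3                               ! placeholder: thread-parallelize in z
    ! ... allocate and fill m, x, b, val_dia(m*nnd), ioff_dia(nnd), margin,
    !     num_neighbor_ranks, neighbor_ranks, ndiv_grids, num_grids and
    !     ioff_grids(nnd,ncomp_grids) according to the DDM format chapter ...

    call parcel_dcg_ddm( icomm, x,b, m, nnd,ioff_dia,val_dia, &
         num_neighbor_ranks,neighbor_ranks,margin,ncomp_grids, &
         ndiv_grids,num_grids,ioff_grids, div_direc_th, &
         ipreflag, addL,iflagAS, itrmax,rtolmax,reshistory, &
         iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, iret )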


6.16 parcel_dbicgstab_ddm (CPU/GPU)


call parcel_dbicgstab_ddm( icomm, x,b, m, nnd,ioff_dia,val_dia, num_neighbor_ranks,neighbor_ranks,margin,ncomp_grids, ndiv_grids,num_grids,ioff_grids, div_direc_th, ipreflag, addL,iflagAS, itrmax,rtolmax,reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, iret )


A simultaneous linear equation system Ax = b is solved by the stabilized biconjugate gradient method (Bi-CGSTAB method). Non-zero elements are stored in DDM format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
x(m) CPU : double precision / GPU : type(c_ptr) in/out in: initial vector, out: solution vector
b(m) CPU : double precision / GPU : type(c_ptr) in Right hand side vector
m integer in Size of vector on each process
nnd integer in Number of diagonal elements arrays on each process
ioff_dia(nnd) CPU : integer / GPU : type(c_ptr) in Offset of each non-zero diagonal elements array in DDM format
val_dia(m*nnd) CPU : double precision / GPU : type(c_ptr) in Non-zero elements of matrix stored in DDM format
num_neighbor_ranks integer in Number of neighboring processes
neighbor_ranks(num_neighbor_ranks) integer in List of neighboring processes
margin integer in Maximum absolute value of the offsets in ioff_grids
ncomp_grids integer in Dimension of structured grids
ndiv_grids(ncomp_grids) integer in Number of decomposed domains in each direction
num_grids(ncomp_grids) integer in Number of grids on each process
ioff_grids(nnd,ncomp_grids) integer in Offset of stencil data location in each direction
div_direc_th integer in Direction of thread parallelization (1: x, 2: y, 3: z)
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz, 4: Neumann series (GPU), 5: block Neumann series (GPU))
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
Neuman_itrmax integer in Order of Neumann series preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
iret integer out Error code (0:normal)


6.17 parcel_dgmres_ddm (CPU/GPU)


call parcel_dgmres_ddm( icomm, x,b, m, nnd,ioff_dia,val_dia, num_neighbor_ranks,neighbor_ranks,margin,ncomp_grids, ndiv_grids,num_grids,ioff_grids, div_direc_th, ipreflag, addL,iflagAS, itrmax,rtolmax,reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, gmres_m, gmres_GSflag, iret )


A simultaneous linear equation system Ax = b is solved by the generalized minimum residual method (GMRES (m) method). Non-zero elements are stored in DDM format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
x(m) CPU : double precision / GPU : type(c_ptr) in/out in: initial vector, out: solution vector
b(m) CPU : double precision / GPU : type(c_ptr) in Right hand side vector
m integer in Size of vector on each process
nnd integer in Number of diagonal elements arrays on each process
ioff_dia(nnd) CPU : integer / GPU : type(c_ptr) in Offset of each non-zero diagonal elements array in DDM format
val_dia(m*nnd) CPU : double precision / GPU : type(c_ptr) in Non-zero elements of matrix stored in DDM format
num_neighbor_ranks integer in Number of neighboring processes
neighbor_ranks(num_neighbor_ranks) integer in List of neighboring processes
margin integer in Maximum absolute value of the offsets in ioff_grids
ncomp_grids integer in Dimension of structured grids
ndiv_grids(ncomp_grids) integer in Number of decomposed domains in each direction
num_grids(ncomp_grids) integer in Number of grids on each process
ioff_grids(nnd,ncomp_grids) integer in Offset of stencil data location in each direction
div_direc_th integer in Direction of thread parallelization (1: x, 2: y, 3: z)
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz, 4: Neumann series (GPU), 5: block Neumann series (GPU))
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
Neuman_itrmax integer in Order of Neumann series preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
gmres_m integer in Number of iterations until restart
gmres_GSflag integer in Flag for orthogonalization algorithm (1: MGS, 2: CGS)
iret integer out Error code (0:normal)


6.18 parcel_dcagmres_ddm (CPU/GPU)


call parcel_dcagmres_ddm( icomm, x,b, m, nnd,ioff_dia,val_dia, num_neighbor_ranks,neighbor_ranks,margin,ncomp_grids, ndiv_grids,num_grids,ioff_grids, div_direc_th, ipreflag, addL,iflagAS, itrmax,rtolmax,reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, cagmres_sstep, cagmres_tstep, cagmres_basis, cagmres_QRflag, iret )


A simultaneous linear equation system Ax = b is solved by the communication reduced generalized minimum residual method (CA-GMRES (s, t) method). Non-zero elements are stored in DDM format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
x(m) CPU : double precision / GPU : type(c_ptr) in/out in: initial vector, out: solution vector
b(m) CPU : double precision / GPU : type(c_ptr) in Right hand side vector
m integer in Size of vector on each process
nnd integer in Number of diagonal elements arrays on each process
ioff_dia(nnd) CPU : integer / GPU : type(c_ptr) in Offset of each non-zero diagonal elements array in DDM format
val_dia(m*nnd) CPU : double precision / GPU : type(c_ptr) in Non-zero elements of matrix stored in DDM format
num_neighbor_ranks integer in Number of neighboring processes
neighbor_ranks(num_neighbor_ranks) integer in List of neighboring processes
margin integer in Maximum absolute value of the offsets in ioff_grids
ncomp_grids integer in Dimension of structured grids
ndiv_grids(ncomp_grids) integer in Number of decomposed domains in each direction
num_grids(ncomp_grids) integer in Number of grids on each process
ioff_grids(nnd,ncomp_grids) integer in Offset of stencil data location in each direction
div_direc_th integer in Direction of thread parallelization (1: x, 2: y, 3: z)
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz, 4: Neumann series (GPU), 5: block Neumann series (GPU))
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
Neuman_itrmax integer in Order of Neumann series preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
cagmres_sstep integer in Number of communication avoiding steps s in the CA-GMRES method
cagmres_tstep integer in Number of outer iterations t in the CA-GMRES method (restart length = st)
cagmres_basis integer in Flag for basis vector in the CA-GMRES method (0: monomial basis, 1: Newton basis)
cagmres_QRflag integer in Flag for QR factorization in the CA-GMRES method (1: MGS, 2: CGS, 3: TSQR, 4: CholeskyQR, 5: CholeskyQR2)
iret integer out Error code (0:normal)


6.19 parcel_dcbcg_ddm (CPU/GPU)


call parcel_dcbcg_ddm( icomm, x,b, m, nnd,ioff_dia,val_dia, num_neighbor_ranks,neighbor_ranks,margin,ncomp_grids, ndiv_grids,num_grids,ioff_grids, div_direc_th, ipreflag, addL,iflagAS, itrmax,rtolmax,reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, cbcg_kstep,cbcg_Eigenflag,power_method_itrmax, caarnoldi_sstep,caarnoldi_tstep, caarnoldi_basis,caarnoldi_QRflag, iret )


A simultaneous linear equation system Ax = b is solved by the Chebyshev basis conjugate gradient method (CBCG method). Non-zero elements are stored in DDM format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
x(m) CPU : double precision / GPU : type(c_ptr) in/out in: initial vector, out: solution vector
b(m) CPU : double precision / GPU : type(c_ptr) in Right hand side vector
m integer in Size of vector on each process
nnd integer in Number of diagonal elements arrays on each process
ioff_dia(nnd) CPU : integer / GPU : type(c_ptr) in Offset of each non-zero diagonal elements array in DDM format
val_dia(m*nnd) CPU : double precision / GPU : type(c_ptr) in Non-zero elements of matrix stored in DDM format
num_neighbor_ranks integer in Number of neighboring processes
neighbor_ranks(num_neighbor_ranks) integer in List of neighboring processes
margin integer in Maximum absolute value of the offsets in ioff_grids
ncomp_grids integer in Dimension of structured grids
ndiv_grids(ncomp_grids) integer in Number of decomposed domains in each direction
num_grids(ncomp_grids) integer in Number of grids on each process
ioff_grids(nnd,ncomp_grids) integer in Offset of stencil data location in each direction
div_direc_th integer in Direction of thread parallelization (1: x, 2: y, 3: z)
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz, 4: Neumann series (GPU), 5: block Neumann series (GPU))
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
Neuman_itrmax integer in Order of Neumann series preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
cbcg_kstep integer in Number of communication avoiding steps
cbcg_Eigenflag integer in Flag for eigenvalue computation (1: power method, 2: CA-ARNOLDI)
power_method_itrmax integer in Maximum number of iterations in the power method
caarnoldi_sstep integer in Number of communication avoiding steps s of the CA-ARNOLDI method
caarnoldi_tstep integer in Number of outer iterations t in the CA-ARNOLDI method (restart length = st)
caarnoldi_basis integer in Flag for basis vector in the CA-ARNOLDI method (0: monomial basis, 1: Newton basis)
caarnoldi_QRflag integer in Flag for QR factorization in the CA-ARNOLDI method (1: MGS, 2: CGS, 3: TSQR, 4: Cholesky QR, 5: Cholesky QR2)
iret integer out Error code (0:normal)


6.20 parcel_dcaarnoldi_ddm (CPU)


call parcel_dcaarnoldi_ddm( icomm, x,b, m, nnd,ioff_dia,val_dia, num_neighbor_ranks,neighbor_ranks,margin,ncomp_grids, ndiv_grids,num_grids,ioff_grids, div_direc_th, ipreflag, addL,iflagAS, itrmax,rtolmax,reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, caarnoldi_sstep, caarnoldi_tstep, caarnoldi_basis, caarnoldi_QRflag, Evalr, Evali, Evec, Eerr, EmaxID, EminID, iret )


An eigenvalue problem is solved by the communication avoiding Arnoldi method (CA-Arnoldi(s, t)). Non-zero elements are stored in DDM format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
x(m) double precision in/out in: initial vector, out: solution vector
b(m) double precision in Right hand side vector
m integer in Size of vector on each process
nnd integer in Number of diagonal elements arrays on each process
ioff_dia(nnd) integer in Offset of each non-zero diagonal elements array in DDM format
val_dia(m*nnd) double precision in Non-zero elements of matrix stored in DDM format
num_neighbor_ranks integer in Number of neighboring processes
neighbor_ranks(num_neighbor_ranks) integer in List of neighboring processes
margin integer in Maximum absolute value of the offsets in ioff_grids
ncomp_grids integer in Dimension of structured grids
ndiv_grids(ncomp_grids) integer in Number of decomposed domains in each direction
num_grids(ncomp_grids) integer in Number of grids on each process
ioff_grids(nnd,ncomp_grids) integer in Offset of stencil data location in each direction
div_direc_th integer in Direction of thread parallelization (1: x, 2: y, 3: z)
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz)
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
caarnoldi_sstep integer in Number of communication avoiding steps s in the CA-ARNOLDI method
caarnoldi_tstep integer in Number of outer iterations t in the CA-ARNOLDI method (restart length = st)
caarnoldi_basis integer in Flag for basis vector in the CA-ARNOLDI method (0: monomial basis, 1: Newton basis)
caarnoldi_QRflag integer in Flag for QR factorization in the CA-ARNOLDI method (1: MGS, 2: CGS, 3: TSQR, 4: CholeskyQR, 5: CholeskyQR2)
Evalr(caarnoldi_sstep*caarnoldi_tstep) double precision out Real part of the eigenvalue
Evali(caarnoldi_sstep*caarnoldi_tstep) double precision out Imaginary part of eigenvalue
Evec(m*caarnoldi_sstep*caarnoldi_tstep) double precision out Eigenvectors
Eerr(caarnoldi_sstep*caarnoldi_tstep) double precision out Error norm \({\left\| Ax - \lambda x \right\|}/{\left\| \lambda x \right\|}\)
EmaxID integer out ID number of the maximum eigenvalue
EminID integer out ID number of the minimum eigenvalue
iret integer out Error code (0:normal)


6.21 parcel_ddcg (CPU)


call parcel_ddcg( icomm, vecx_hi,vecx_lo, vecb_hi,vecb_lo, n,gn,nnz,istart, crsA_hi,crsA_lo,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, precision_A, precision_b, precision_x, precision_precon, iret )


A simultaneous linear equation system Ax = b is solved by the conjugate gradient method (CG method). Non-zero elements are stored in CRS format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx_hi(n) double precision in/out in: initial vector (upper bits), out: solution vector (upper bits)
vecx_lo(n) double precision in/out in: initial vector (lower bits), out: solution vector (lower bits)
vecb_hi(n) double precision in Right hand side vector (upper bits)
vecb_lo(n) double precision in Right hand side vector (lower bits)
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
nnz integer*4 / integer*8 in Number of non-zero elements on each process
istart integer*4 / integer*8 in Start line of matrix on each process
crsA_hi(nnz) double precision in Non-zero elements (upper bits) of matrix stored in CRS format
crsA_lo(nnz) double precision in Non-zero elements (lower bits) of matrix stored in CRS format
crsRow_ptr(n+1) integer*4 / integer*8 in Pointer table in CRS format
crsCol(nnz) integer*4 / integer*8 in Column numbers of non-zero elements in CRS format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz)
ilu_method integer in Flag for incomplete LU factorization (0: ILU(0), 1: D-ILU (DIA components match), 2: D-ILU (element sums in a row match); only 1 and 2 are available for the additive Schwarz method)
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
precision_A integer in Precision of matrix (1: double precision, 2: quad precision)
precision_b integer in Precision of right hand side vector (1: double precision, 2: quad precision)
precision_x integer in Precision of solution vector (1: double precision, 2: quad precision)
precision_precon integer in Precision of preconditioner matrix (1: double precision, 2: quad precision)
iret integer out Error code (0:normal)
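
The fragment below promotes the double-precision CRS data from the sketch in 6.4 to the high/low word pairs expected by parcel_ddcg. Putting the original value in the high word and zero in the low word is exact for data that is only available in double precision; the precision_* choices shown are illustrative.

    double precision, allocatable :: vecx_hi(:), vecx_lo(:), vecb_hi(:), vecb_lo(:)
    double precision, allocatable :: crsA_hi(:), crsA_lo(:)
    integer :: precision_A, precision_b, precision_x, precision_precon

    allocate(vecx_hi(n), vecx_lo(n), vecb_hi(n), vecb_lo(n))
    allocate(crsA_hi(nnz), crsA_lo(nnz))
    crsA_hi = crsA;   crsA_lo = 0.0d0              ! double -> double-double (low word = 0)
    vecb_hi = vecb;   vecb_lo = 0.0d0
    vecx_hi = 0.0d0;  vecx_lo = 0.0d0              ! zero initial guess

    precision_A      = 1                           ! matrix kept in double precision
    precision_b      = 2                           ! right hand side treated in quad precision
    precision_x      = 2                           ! quad-precision solution updates
    precision_precon = 1                           ! double-precision preconditioner

    call parcel_ddcg( icomm, vecx_hi,vecx_lo, vecb_hi,vecb_lo, n,gn,nnz,istart, &
         crsA_hi,crsA_lo,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, &
         itrmax,rtolmax, reshistory, iovlflag, precon_thblock, &
         independ_nvec, nBlock, precision_A, precision_b, precision_x, &
         precision_precon, iret )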


6.22 parcel_ddbicgstab (CPU)


call parcel_ddbicgstab( icomm, vecx_hi,vecx_lo, vecb_hi,vecb_lo, n,gn,nnz,istart, crsA_hi,crsA_lo,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, precision_A, precision_b, precision_x, precision_precon, iret )


A simultaneous linear equation system Ax = b is solved by the stabilized biconjugate gradient method (Bi-CGSTAB method). Non-zero elements are stored in CRS format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx_hi(n) double precision in/out in: initial vector (upper bits), out: solution vector (upper bits)
vecx_lo(n) double precision in/out in: initial vector (lower bits), out: solution vector (lower bits)
vecb_hi(n) double precision in Right hand side vector (upper bits)
vecb_lo(n) double precision in Right hand side vector (lower bits)
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
nnz integer*4 / integer*8 in Number of non-zero elements on each process
istart integer*4 / integer*8 in Start line of matrix on each process
crsA_hi(nnz) double precision in Non-zero elements (upper bits) of matrix stored in CRS format
crsA_lo(nnz) double precision in Non-zero elements (lower bits) of matrix stored in CRS format
crsRow_ptr(n+1) integer*4 / integer*8 in Pointer table in CRS format
crsCol(nnz) integer*4 / integer*8 in Column numbers of non-zero elements in CRS format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz)
ilu_method integer in Flag for incomplete LU factorization (0: ILU(0), 1: D-ILU (DIA components match), 2: D-ILU (element sums in a row match); only 1 and 2 are available for the additive Schwarz method)
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
precision_A integer in Precision of matrix (1: double precision, 2: quad precision)
precision_b integer in Precision of right hand side vector (1: double precision, 2: quad precision)
precision_x integer in Precision of solution vector (1: double precision, 2: quad precision)
precision_precon integer in Precision of preconditioner matrix (1: double precision, 2: quad precision)
iret integer out Error code (0:normal)


6.23 parcel_ddgmres (CPU)


call parcel_ddgmres( icomm, vecx_hi,vecx_lo, vecb_hi,vecb_lo, n,gn,nnz,istart, crsA_hi,crsA_lo,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, gmres_m,gmres_GSflag, precision_A, precision_b, precision_x, precision_precon, iret )


A simultaneous linear equation system Ax = b is solved by the generalized minimum residual method (GMRES (m) method). Non-zero elements are stored in CRS format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx_hi(n) double precision in/out in: initial vector (upper bits), out: solution vector (upper bits)
vecx_lo(n) double precision in/out in: initial vector (lower bits), out: solution vector (lower bits)
vecb_hi(n) double precision in Right hand side vector (upper bits)
vecb_lo(n) double precision in Right hand side vector (lower bits)
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
nnz integer*4 / integer*8 in Number of non-zero elements on each process
istart integer*4 / integer*8 in Start line of matrix on each process
crsA_hi(nnz) double precision in Non-zero elements (upper bits) of matrix stored in CRS format
crsA_lo(nnz) double precision in Non-zero elements (lower bits) of matrix stored in CRS format
crsRow_ptr(n+1) integer*4 / integer*8 in Pointer table in CRS format
crsCol(nnz) integer*4 / integer*8 in Column numbers of non-zero elements in CRS format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz)
ilu_method integer in Flag for incomplete LU factorization (0: ILU(0), 1: D-ILU (DIA components match), 2: D-ILU (element sums in a row match); only 1 and 2 are available for the additive Schwarz method)
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
gmres_m integer in Number of iterations until restart
gmres_GSflag integer in Flag for orthogonalization algorithm (1: MGS, 2: CGS)
precision_A integer in Precision of matrix (1: double precision, 2: quad precision)
precision_b integer in Precision of right hand side vector (1: double precision, 2: quad precision)
precision_x integer in Precision of solution vector (1: double precision, 2: quad precision)
precision_precon integer in Precision of preconditioner matrix (1: double precision, 2: quad precision)
iret integer out Error code (0:normal)
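
As a usage illustration (a minimal sketch, not taken from the library distribution), the program below calls parcel_ddgmres on a CRS matrix produced by an arbitrary user routine make_matrix_CRS, as in Section 8.1. Since only double-precision input data are assumed to be available, the lower words crsA_lo, vecb_lo, and vecx_lo are simply zero-filled; the module names and parameter values follow the Section 8 samples, and the precision_* flags are set to 2 purely for illustration.


  program ddgmres_example
    use mpi
    use PARCEL_integer
    use krylov
    use PARCEL_wrapper
    implicit none
    integer n,gn,nnz,istart
    integer itrmax,iret,myrank,ierr
    integer ipreflag,ILU_method,addL,iflagAS,iovlflag
    integer precon_thblock,independ_nvec,nBlock
    integer gmres_m,gmres_GSflag
    integer precision_A,precision_b,precision_x,precision_precon
    real*8  rtolmax
    real*8,allocatable  :: crsA_hi(:),crsA_lo(:)
    integer,allocatable :: crsRow_ptr(:),crsCol(:)
    real*8,allocatable  :: vecx_hi(:),vecx_lo(:),vecb_hi(:),vecb_lo(:)
    real*8,allocatable  :: reshistory(:)
    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD,myrank,ierr)
    call PARCEL_Init(myrank)
    ! Upper words of the matrix come from an arbitrary user routine (cf. Section 8.1);
    ! the lower words are set to zero here.
    call make_matrix_CRS(n,gn,nnz,istart,crsA_hi,crsRow_ptr,crsCol)
    allocate(crsA_lo(nnz)); crsA_lo=0.0d0
    allocate(vecx_hi(n),vecx_lo(n),vecb_hi(n),vecb_lo(n))
    vecx_hi=0.0d0; vecx_lo=0.0d0
    vecb_hi=1.0d0; vecb_lo=0.0d0
    itrmax=100
    allocate(reshistory(itrmax))
    rtolmax=1.0d-8
    ipreflag=0; ILU_method=1; addL=0; iflagAS=1
    iovlflag=0; precon_thblock=-1; independ_nvec=-1; nBlock=2000
    gmres_m=30; gmres_GSflag=1
    precision_A=2; precision_b=2; precision_x=2; precision_precon=2
    call parcel_ddgmres( MPI_COMM_WORLD, vecx_hi,vecx_lo, vecb_hi,vecb_lo, &
         n,gn,nnz,istart, crsA_hi,crsA_lo,crsRow_ptr,crsCol, &
         ipreflag,ILU_method,addL,iflagAS, itrmax,rtolmax, reshistory, &
         iovlflag, precon_thblock, independ_nvec, nBlock, gmres_m,gmres_GSflag, &
         precision_A, precision_b, precision_x, precision_precon, iret )
    call PARCEL_Free()
    call MPI_Finalize(ierr)
  end program ddgmres_example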


6.24 parcel_ddcg_dia (CPU)


call parcel_ddcg_dia( icomm, vecx_hi,vecx_lo, vecb_hi,vecb_lo, n,gn,istart, diaA_hi,diaA_lo,offset,nnd, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, precision_A, precision_b, precision_x, precision_precon, iret )


A simultaneous linear equation system Ax = b is solved by the conjugate gradient method (CG method). Non-zero elements are stored in DIA format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx_hi(n) double precision in/out in: Initial vector (upper bits), out: Solution vector (upper bits)
vecx_lo(n) double precision in/out in: Initial vector (lower bits), out: Solution vector (lower bits)
vecb_hi(n) double precision in Right hand side vector (upper bits)
vecb_lo(n) double precision in Right hand side vector (lower bits)
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
istart integer*4 / integer*8 in Start line of matrix on each process
diaA_hi(n*nnd) double precision in Non-zero elements (upper bits) of matrix stored in DIA format
diaA_lo(n*nnd) double precision in Non-zero elements (lower bits) of matrix stored in DIA format
offset(nnd) integer*4 / integer*8 in Offset of each non-zero diagonal elements array in DIA format
nnd integer*4 / integer*8 in Number of diagonal elements arrays in DIA format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz)
ilu_method integer in Flag for incomplete LU factorization (0: ILU(0), 1: D-ILU (DIA components match), 2: D-ILU (element sums in a row match); only 1 and 2 are available for additive Schwarz)
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
precision_A integer in Precision of matrix (1: double precision, 2: quad precision)
precision_b integer in Precision of right hand side vector (1: double precision, 2: quad precision)
precision_x integer in Precision of solution vector (1: double precision, 2: quad precision)
precision_precon integer in Precision of preconditioner matrix (1: double precision, 2: quad precision)
iret integer out Error code (0:normal)


6.25 parcel_ddbicgstab_dia (CPU)


call parcel_ddbicgstab_dia( icomm, vecx_hi,vecx_lo, vecb_hi,vecb_lo, n,gn,istart, diaA_hi,diaA_lo,offset,nnd, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, precision_A, precision_b, precision_x, precision_precon, iret )


A simultaneous linear equation system Ax = b is solved by the stabilized biconjugate gradient method (Bi-CGSTAB method). Non-zero elements are stored in DIA format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx_hi(n) double precision in/out in: Initial vector (upper bits), out: Solution vector (upper bits)
vecx_lo(n) double precision in/out in: Initial vector (lower bits), out: Solution vector (lower bits)
vecb_hi(n) double precision in Right hand side vector (upper bits)
vecb_lo(n) double precision in Right hand side vector (lower bits)
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
istart integer*4 / integer*8 in Start line of matrix on each process
diaA_hi(n*nnd) double precision in Non-zero elements (upper bits) of matrix stored in DIA format
diaA_lo(n*nnd) double precision in Non-zero elements (lower bits) of matrix stored in DIA format
offset(nnd) integer*4 / integer*8 in Offset of each non-zero diagonal elements array in DIA format
nnd integer*4 / integer*8 in Number of diagonal elements arrays in DIA format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz)
ilu_method integer in Flag for incomplete LU factorization (0: ILU(0), 1: D-ILU (DIA components match), 2: D-ILU (element sums in a row match); only 1 and 2 are available for additive Schwarz)
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
precision_A integer in Precision of matrix (1: double precision, 2: quad precision)
precision_b integer in Precision of right hand side vector (1: double precision, 2: quad precision)
precision_x integer in Precision of solution vector (1: double precision, 2: quad precision)
precision_precon integer in Precision of preconditioner matrix (1: double precision, 2: quad precision)
iret integer out Error code (0:normal)


6.26 parcel_ddgmres_dia (CPU)


call parcel_ddgmres_dia( icomm, vecx_hi,vecx_lo, vecb_hi,vecb_lo, n,gn,istart, diaA_hi,diaA_lo,offset,nnd, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, gmres_m,gmres_GSflag, precision_A, precision_b, precision_x, precision_precon, iret )


A simultaneous linear equation system Ax = b is solved by the generalized minimum residual method (GMRES (m) method). Non-zero elements are stored in DIA format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx_hi(n) double precision in/out in: Initial vector (upper bits), out: Solution vector (upper bits)
vecx_lo(n) double precision in/out in: Initial vector (lower bits), out: Solution vector (lower bits)
vecb_hi(n) double precision in Right hand side vector (upper bits)
vecb_lo(n) double precision in Right hand side vector (lower bits)
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
istart integer*4 / integer*8 in Start line of matrix on each process
diaA_hi(n*nnd) double precision in Non-zero elements (upper bits) of matrix stored in DIA format
diaA_lo(n*nnd) double precision in Non-zero elements (lower bits) of matrix stored in DIA format
offset(nnd) integer*4 / integer*8 in Offset of each non-zero diagonal elements array in DIA format
nnd integer*4 / integer*8 in Number of diagonal elements arrays in DIA format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz)
ilu_method integer in Flag for incomplete LU factorization (0: ILU(0), 1: D-ILU (DIA components match), 2: D-ILU (element sums in a row match); only 1 and 2 are available for additive Schwarz)
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
gmres_m integer in Number of iterations until restart
gmres_GSflag integer in Flag for orthogonalization algorithm (1: MGS, 2: CGS)
precision_A integer in Precision of matrix (1: double precision, 2: quad precision)
precision_b integer in Precision of right hand side vector (1: double precision, 2: quad precision)
precision_x integer in Precision of solution vector (1: double precision, 2: quad precision)
precision_precon integer in Precision of preconditioner matrix (1: double precision, 2: quad precision)
iret integer out Error code (0:normal)


6.27 parcel_ddcg_ddm (CPU)


call parcel_ddcg_ddm( icomm, x_hi,x_lo, b_hi,b_lo, m, nnd,ioff_dia, val_dia_hi,val_dia_lo, num_neighbor_ranks,neighbor_ranks,margin,ncomp_grids, ndiv_grids,num_grids,ioff_grids, div_direc_th, ipreflag, addL,iflagAS, itrmax,rtolmax,reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, precision_A, precision_b, precision_x, precision_precon, iret )


A simultaneous linear equation system Ax = b is solved by the conjugate gradient method (CG method). Non-zero elements are stored in DDM format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
x_hi(m) double precision in/out in: Initial vector (upper bits), out: Solution vector (upper bits)
x_lo(m) double precision in/out in: Initial vector (lower bits), out: Solution vector (lower bits)
b_hi(m) double precision in Right hand side vector (upper bits)
b_lo(m) double precision in Right hand side vector (lower bits)
m integer in Size of vector on each process
nnd integer in Number of diagonal elements arrays on each process
ioff_dia(nnd) integer in Offset of each non-zero diagonal elements array in DDM format
val_dia_hi(m*nnd) double precision in Non-zero elements of matrix stored in DDM format (upper bits)
val_dia_lo(m*nnd) double precision in Non-zero elements of matrix stored in DDM format (lower bits)
num_neighbor_ranks integer in Number of neighboring processes
neighbor_ranks(num_neighbor_ranks) integer in List of neighboring processes
margin integer in Maximum absolute value of ioff_grids
ncomp_grids integer in Dimension of structured grids
ndiv_grids(ncomp_grids) integer in Number of decomposed domains in each direction
num_grids(ncomp_grids) integer in Number of grids on each process
ioff_grids(nnd,ncomp_grids) integer in Offset of stencil data location in each direction
div_direc_th integer in Direction of thread parallelization ( 1:x, 2:y, 3:z)
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz)
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
precision_A integer in Precision of matrix (1: double precision, 2: quad precision)
precision_b integer in Precision of right hand side vector (1: double precision, 2: quad precision)
precision_x integer in Precision of solution vector (1: double precision, 2: quad precision)
precision_precon integer in Precision of preconditioner matrix (1: double precision, 2: quad precision)
iret integer out Error code (0:normal)
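
Most of the arguments above describe a structured-grid stencil in DDM format. As a purely illustrative sketch (the stencil layout, like the matrix values themselves, is defined by the user's matrix generation routine, cf. make_matrix_DDM in Section 8.3), a 3D 7-point stencil with nnd = 7 could be described by the offsets below; margin is then 1 because no offset exceeds one grid cell in any direction.


  program stencil_offsets_example
    implicit none
    ! Hypothetical offsets of a 3D 7-point stencil (nnd = 7, ncomp_grids = 3).
    ! Column 1: x-offset, column 2: y-offset, column 3: z-offset of each diagonal.
    integer,parameter :: nnd = 7, ncomp_grids = 3
    integer :: ioff_grids(nnd,ncomp_grids), margin
    ioff_grids(1,:) = (/ 0, 0, 0 /)   ! centre point
    ioff_grids(2,:) = (/-1, 0, 0 /)   ! x-1
    ioff_grids(3,:) = (/ 1, 0, 0 /)   ! x+1
    ioff_grids(4,:) = (/ 0,-1, 0 /)   ! y-1
    ioff_grids(5,:) = (/ 0, 1, 0 /)   ! y+1
    ioff_grids(6,:) = (/ 0, 0,-1 /)   ! z-1
    ioff_grids(7,:) = (/ 0, 0, 1 /)   ! z+1
    margin = maxval(abs(ioff_grids))  ! = 1 for this stencil
    print *, 'margin =', margin
  end program stencil_offsets_example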


6.28 parcel_ddbicgstab_ddm (CPU)


call parcel_ddbicgstab_ddm( icomm, x_hi,x_lo, b_hi,b_lo, m, nnd,ioff_dia, val_dia_hi,val_dia_lo, num_neighbor_ranks,neighbor_ranks,margin,ncomp_grids, ndiv_grids,num_grids,ioff_grids, div_direc_th, ipreflag, addL,iflagAS, itrmax,rtolmax,reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, precision_A, precision_b, precision_x, precision_precon, iret )


A simultaneous linear equation system Ax = b is solved by the stabilized biconjugate gradient method (Bi-CGSTAB method). Non-zero elements are stored in DDM format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
x_hi(m) double precision in/out in: Initial vector (upper bits), out: Solution vector (upper bits)
x_lo(m) double precision in/out in: Initial vector (lower bits), out: Solution vector (lower bits)
b_hi(m) double precision in Right hand side vector (upper bits)
b_lo(m) double precision in Right hand side vector (lower bits)
m integer in Size of vector on each process
nnd integer in Number of diagonal elements arrays on each process
ioff_dia(nnd) integer in Offset of each non-zero diagonal elements array in DDM format
val_dia_hi(m*nnd) double precision in Non-zero elements of matrix stored in DDM format (upper bits)
val_dia_lo(m*nnd) double precision in Non-zero elements of matrix stored in DDM format (lower bits)
num_neighbor_ranks integer in Number of neighboring processes
neighbor_ranks(num_neighbor_ranks) integer in List of neighboring processes
margin integer in Maximum absolute value of ioff_grids
ncomp_grids integer in Dimension of structured grids
ndiv_grids(ncomp_grids) integer in Number of decomposed domains in each direction
num_grids(ncomp_grids) integer in Number of grids on each process
ioff_grids(nnd,ncomp_grids) integer in Offset of stencil data location in each direction
div_direc_th integer in Direction of thread parallelization ( 1:x, 2:y, 3:z)
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz)
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
precision_A integer in Precision of matrix (1: double precision, 2: quad precision)
precision_b integer in Precision of right hand side vector (1: double precision, 2: quad precision)
precision_x integer in Precision of solution vector (1: double precision, 2: quad precision)
precision_precon integer in Precision of preconditioner matrix (1: double precision, 2: quad precision)
iret integer out Error code (0:normal)


6.29 parcel_ddgmres_ddm (CPU)


call parcel_ddgmres_ddm( icomm, x_hi,x_lo, b_hi,b_lo, m, nnd,ioff_dia, val_dia_hi,val_dia_lo, num_neighbor_ranks,neighbor_ranks,margin,ncomp_grids, ndiv_grids,num_grids,ioff_grids, div_direc_th, ipreflag, addL,iflagAS, itrmax,rtolmax,reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, gmres_m, gmres_GSflag, precision_A, precision_b, precision_x, precision_precon, iret )


A simultaneous linear equation system Ax = b is solved by the generalized minimum residual method (GMRES (m) method). Non-zero elements are stored in DDM format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
x_hi(m) double precision in/out in: Initial vector (upper bits), out: Solution vector (upper bits)
x_lo(m) double precision in/out in: Initial vector (lower bits), out: Solution vector (lower bits)
b_hi(m) double precision in Right hand side vector (upper bits)
b_lo(m) double precision in Right hand side vector (lower bits)
m integer in Size of vector on each process
nnd integer in Number of diagonal elements arrays on each process
ioff_dia(nnd) integer in Offset of each non-zero diagonal elements array in DDM format
val_dia_hi(m*nnd) double precision in Non-zero elements of matrix stored in DDM format (upper bits)
val_dia_lo(m*nnd) double precision in Non-zero elements of matrix stored in DDM format (lower bits)
num_neighbor_ranks integer in Number of neighboring processes
neighbor_ranks(num_neighbor_ranks) integer in List of neighboring processes
margin integer in Maximum absolute value of ioff_grids
ncomp_grids integer in Dimension of structured grids
ndiv_grids(ncomp_grids) integer in Number of decomposed domains in each direction
num_grids(ncomp_grids) integer in Number of grids on each process
ioff_grids(nnd,ncomp_grids) integer in Offset of stencil data location in each direction
div_direc_th integer in Direction of thread parallelization ( 1:x, 2:y, 3:z)
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz)
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
gmres_m integer in Number of iterations until restart
gmres_GSflag integer in Flag for orthogonalization algorithm (1: MGS, 2: CGS)
precision_A integer in Precision of matrix (1: double precision, 2: quad precision)
precision_b integer in Precision of right hand side vector (1: double precision, 2: quad precision)
precision_x integer in Precision of solution vector (1: double precision, 2: quad precision)
precision_precon integer in Precision of preconditioner matrix (1: double precision, 2: quad precision)
iret integer out Error code (0:normal)


6.30 parcel_dqr (CPU)


call parcel_dqr(n,s,icomm,V,Q,R,iQRflag)


The QR factorization of the m*s matrix V is computed, where m is the sum of the parameter n over the MPI processes belonging to the MPI communicator icomm.

Parameter(dimension) Type Input/Output Description
n integer*4 / integer*8 in Number of rows of the matrix on each process
s integer*4 / integer*8 in Number of columns of the matrix
icomm integer in MPI communicator
V(s*n) double precision in Input matrix (column major)
Q(s*n) double precision out Orthogonal matrix (column major)
R(s*s) double precision out Upper triangular matrix (column major)
iQRflag integer in Flag for QR factorization (1:MGS,2:CGS,3:TSQR,4:CholeskyQR,5:CholeskyQR2)
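
A minimal usage sketch is shown below (illustrative only; module names and initialization follow the Section 8 samples). Each process fills its local n*s block of V, stored column major, with random numbers, and parcel_dqr returns the corresponding block of Q together with the s*s factor R computed by the algorithm selected through iQRflag.


  program dqr_example
    use mpi
    use PARCEL_wrapper
    implicit none
    integer,parameter :: n = 1000   ! local rows on this process
    integer,parameter :: s = 8      ! columns of the tall-skinny matrix
    real*8  :: V(s*n), Q(s*n), R(s*s)
    integer :: iQRflag, myrank, ierr
    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD,myrank,ierr)
    call PARCEL_Init(myrank)
    call random_number(V)           ! local n x s block (column major)
    iQRflag = 5                     ! CholeskyQR2
    call parcel_dqr(n, s, MPI_COMM_WORLD, V, Q, R, iQRflag)
    call PARCEL_Free()
    call MPI_Finalize(ierr)
  end program dqr_example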


7 CUDA support routines for Fortran

 The interface of each CUDA support routine for Fortran is described. These routines can be replaced with other implementations such as CUDA Fortran and OpenACC.

7.1 cudaMalloc_FP64 (GPU)


call cudaMalloc_FP64(a_d,n)


 Allocates GPU memory of type double precision and assigns it to the type(c_ptr) member mem of the type(devicemem) structure.

Parameter(dimension) Type Input/Output Description
a_d type(devicemem) out PARCEL GPU memory structure
n integer*4 / integer*8 in Number of elements to allocate


7.2 cudaMalloc_PARCEL_INT (GPU)


call cudaMalloc_PARCEL_INT(a_d,n)


 Allocates GPU memory of the PARCEL integer type (integer*4 in the 32-bit integer version, integer*8 in the 64-bit integer version) and assigns it to the type(c_ptr) member mem of the type(devicemem) structure.

Parameter(dimension) Type Input/Output Description
a_d type(devicemem) out PARCEL GPU memory structure
n integer*4 / integer*8 in Number of elements to allocate


7.3 cudaMalloc_INT (GPU)


call cudaMalloc_INT(a_d,n)


 Allocates GPU memory of type integer*4 and assigns it to the type(c_ptr) member mem of the type(devicemem) structure.

Parameter(dimension) Type Input/Output Description
a_d type(devicemem) out PARCEL GPU memory structure
n integer*4 / integer*8 in Number of elements to allocate


7.4 cudaFree (GPU)


call cudaFree(a_d)


 Frees the GPU memory.

Parameter(dimension) Type Input/Output Description
a_d type(devicemem) out PARCEL GPU memory structure


7.5 cudaMemcpy_h2d_FP64 (GPU)


call cudaMemcpy_h2d_FP64(a_h,a_d,n,offset_h,offset_d)


 Copies FP64 data from CPU memory to GPU memory.

Parameter(dimension) Type Input/Output Description
a_h double precision in Source CPU memory
a_d type(devicemem) out Destination PARCEL GPU memory structure
n integer*4 / integer*8 in Number of elements to copy
offset_h integer*4 / integer*8 in Offset from the start address of the copy source a_h
offset_d integer*4 / integer*8 in Offset from the start address of the copy destination a_d


7.6 cudaMemcpy_h2d_PARCEL_INT (GPU)


call cudaMemcpy_h2d_PARCEL_INT(a_h,a_d,n,offset_h,offset_d)


 Copies PARCEL integer data (integer*4 in the 32-bit integer version, integer*8 in the 64-bit integer version) from CPU memory to GPU memory.

Parameter(dimension) Type Input/Output Description
a_h integer*4 / integer*8 in Source CPU memory
a_d type(devicemem) out Destination PARCEL GPU memory structure
n integer*4 / integer*8 in Number of elements to copy
offset_h integer*4 / integer*8 in Offset from the start address of the copy source a_h
offset_d integer*4 / integer*8 in Offset from the start address of the copy destination a_d


7.7 cudaMemcpy_h2d_INT (GPU)


call cudaMemcpy_h2d_INT(a_h,a_d,n,offset_h,offset_d)


 Copies integer*4 data from CPU memory to GPU memory.

Parameter(dimension) Type Input/Output Description
a_h integer*4 in Source CPU memory
a_d type(devicemem) out Destination PARCEL GPU memory structure
n integer*4 / integer*8 in Number of elements to copy
offset_h integer*4 / integer*8 in Offset from the start address of the copy source a_h
offset_d integer*4 / integer*8 in Offset from the start address of the copy destination a_d


7.8 cudaMemcpy_d2h_FP64 (GPU)


call cudaMemcpy_d2h_FP64(a_d,a_h,n,offset_d,offset_h)


 Copies FP64 data from GPU memory to CPU memory.

Parameter(dimension) Type Input/Output Description
a_d type(devicemem) in Source PARCEL GPU memory structure
a_h double precision out Destination CPU memory
n integer*4 / integer*8 in Number of elements to copy
offset_d integer*4 / integer*8 in Offset from the start address of the copy source a_d
offset_h integer*4 / integer*8 in Offset from the start address of the copy destination a_h


7.9 cudaMemcpy_d2h_PARCEL_INT (GPU)


call cudaMemcpy_d2h_PARCEL_INT(a_d,a_h,n,offset_d,offset_h)


 Copies PARCEL integer data (integer*4 in the 32-bit integer version, integer*8 in the 64-bit integer version) from GPU memory to CPU memory.

Parameter(dimension) Type Input/Output Description
a_d type(devicemem) in Source PARCEL GPU memory structure
a_h integer*4 / integer*8 out Destination CPU memory
n integer*4 / integer*8 in Number of elements to copy
offset_d integer*4 / integer*8 in Offset from the start address of the copy source a_d
offset_h integer*4 / integer*8 in Offset from the start address of the copy destination a_h


7.10 cudaMemcpy_d2h_INT (GPU)


call cudaMemcpy_d2h_INT(a_d,a_h,n,offset_d,offset_h)


 Copies integer*4 data from GPU memory to CPU memory.

Parameter(dimension) Type Input/Output Description
a_d type(devicemem) in Source PARCEL GPU memory structure
a_h integer*4 out Destination CPU memory
n integer*4 / integer*8 in Number of elements to copy
offset_d integer*4 / integer*8 in Offset from the start address of the copy source a_d
offset_h integer*4 / integer*8 in Offset from the start address of the copy destination a_h
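
Taken together, these routines cover the host-device transfers used by the GPU samples in Section 8. The sketch below is a minimal round trip (illustrative only; the module names and the intType kind follow the Section 8 samples): it allocates a double-precision device buffer, copies an array to the GPU and back, and frees the buffer.


  program cuda_roundtrip_example
    use mpi
    use PARCEL_integer
    use PARCEL_wrapper
    implicit none
    integer,parameter :: n = 1000
    integer(kind=intType),parameter :: offset_zero = 0
    real*8 :: a_h(n), b_h(n)
    type(devicemem) :: a_d
    integer myrank,ierr
    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD,myrank,ierr)
    call PARCEL_Init(myrank)
    a_h = 1.0d0
    call cudaMalloc_FP64(a_d,n)                                      ! n doubles on the GPU
    call cudaMemcpy_h2d_FP64(a_h,a_d,n,offset_zero,offset_zero)      ! host -> device
    call cudaMemcpy_d2h_FP64(a_d,b_h,n,offset_zero,offset_zero)      ! device -> host
    call cudaFree(a_d)
    call PARCEL_Free()
    call MPI_Finalize(ierr)
  end program cuda_roundtrip_example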


8 How to use (Fortran)

The use of the PARCEL routines is explained below using Fortran sample programs of a CG solver. If the preprocessor option CUDA_SOLVER is enabled, a CUDA (GPU) version of each sample program is generated.

8.1 CRS Format (Fortran)

A sample code in CRS format is shown below. Here, make_matrix_CRS denotes an arbitrary user routine that generates a matrix in the CRS format used by PARCEL.


  program main
    use mpi
    use PARCEL_integer
    use krylov
    use PARCEL_wrapper
    implicit none
    integer n,gn,nnz,istart
    real*8,allocatable  :: crsA(:)
    integer,allocatable :: crsRow_ptr(:),crsCol(:)
    real*8,allocatable  :: vecx(:)
    real*8,allocatable  :: vecb(:)
    integer itrmax
    real*8  rtolmax
    real*8, allocatable :: reshistory(:)
    integer ipreflag
    integer ILU_method
    integer iflagAS
    integer addL
    integer iovlflag
    integer precon_thblock
    integer independ_nvec
    integer nblock
    integer iret
    integer myrank
    integer ierr
 #ifdef CUDA_SOLVER
    type(devicemem) cu_vecx
    type(devicemem) cu_vecb
    type(devicemem) cu_crsA
    type(devicemem) cu_crsRow_ptr
    type(devicemem) cu_crsCol
    integer(kind=intType),parameter :: offset_zero=0
 #endif
    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD,myrank,ierr)
    call PARCEL_Init(myrank)
    call make_matrix_CRS(n,gn,nnz,istart,crsA,crsRow_ptr,crsCol)
    allocate(vecx(n))
    allocate(vecb(n))
    itrmax=100
    allocate(reshistory(itrmax))
    ipreflag=0
    ILU_method=1
    addL=0
    iflagAS=1
    rtolmax=1.0d-8
    iovlflag=0
    precon_thblock=-1
    independ_nvec=-1
    nblock = 2000
    vecb=1.0d0
    vecx=1.0d0
 #ifdef CUDA_SOLVER
    call cudaMalloc_FP64(cu_vecx,n)
    call cudaMalloc_FP64(cu_vecb,n)
    call cudaMalloc_FP64(cu_crsA,nnz)
    call cudaMalloc_PARCEL_INT(cu_crsRow_ptr,n+1)
    call cudaMalloc_PARCEL_INT(cu_crsCol,nnz)
    call cudaMemcpy_h2d_fp64(vecx,cu_vecx,n,offset_zero,offset_zero)
    call cudaMemcpy_h2d_fp64(vecb,cu_vecb,n,offset_zero,offset_zero)
    call cudaMemcpy_h2d_fp64(crsA,cu_crsA,nnz,offset_zero,offset_zero)
    call cudaMemcpy_h2d_PARCEL_INT(crsRow_ptr,cu_crsRow_ptr,n+1,offset_zero,offset_zero)
    call cudaMemcpy_h2d_PARCEL_INT(crsCol,cu_crsCol,nnz,offset_zero,offset_zero)
 #endif
    call parcel_dcg( &
         MPI_COMM_WORLD, &
 #ifdef CUDA_SOLVER
         cu_vecx%mem,cu_vecb%mem, &
         n,gn,nnz,istart, &
         cu_crsA%mem,cu_crsRow_ptr%mem,cu_crsCol%mem,&
 #else
         vecx,vecb, &
         n,gn,nnz,istart, &
         crsA,crsRow_ptr,crsCol, &
 #endif
         ipreflag,ILU_method,addL,iflagAS, &
         itrmax,rtolmax, &
         reshistory, &
         iovlflag, &
         precon_thblock, independ_nvec, &
         nblock, &
         iret &
         )
 #ifdef CUDA_SOLVER
    call cudaMemcpy_d2h_fp64(cu_vecx,vecx,n,offset_zero,offset_zero)
    call cudaFree(cu_vecx)
    call cudaFree(cu_vecb)
    call cudaFree(cu_crsA)
    call cudaFree(cu_crsRow_ptr)
    call cudaFree(cu_crsCol)
 #endif
    call PARCEL_Free()
    call MPI_Finalize(ierr)
    deallocate(vecx)
    deallocate(vecb)
    deallocate(reshistory)
  end program main


8.2 DIA Format (Fortran)

A sample code in DIA format is shown below. Here, make_matrix_DIA denotes an arbitrary user routine that generates a matrix in the DIA format used by PARCEL.


  program main
    use mpi
    use PARCEL_integer
    use krylov
    use PARCEL_wrapper
    implicit none
    integer n,gn,nnd,istart
    real*8,allocatable  :: diaA(:)
    integer,allocatable :: offset(:)
    real*8,allocatable  :: vecx(:)
    real*8,allocatable  :: vecb(:)
    integer itrmax
    real*8  rtolmax
    real*8, allocatable :: reshistory(:)
    integer ipreflag
    integer ILU_method
    integer iflagAS
    integer addL
    integer iovlflag
    integer precon_thblock
    integer independ_nvec
    integer nblock
    integer iret
    integer myrank
    integer ierr
 #ifdef CUDA_SOLVER
    type(devicemem) cu_vecx
    type(devicemem) cu_vecb
    type(devicemem) cu_diaA
    type(devicemem) cu_offset
    integer(kind=intType),parameter :: offset_zero=0
 #endif
    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD,myrank,ierr)
    call PARCEL_Init(myrank)
    call make_matrix_DIA(n,gn,nnd,istart,diaA,offset)
    allocate(vecx(n))
    allocate(vecb(n))
    itrmax=100
    allocate(reshistory(itrmax))
    ipreflag=0
    ILU_method=1
    addL=0
    iflagAS=1
    rtolmax=1.0d-8
    iovlflag=0
    precon_thblock=-1
    independ_nvec=-1
    nblock = 2000
    vecb=1.0d0
    vecx=1.0d0
 #ifdef CUDA_SOLVER
    call cudaMalloc_FP64(cu_vecx,n)
    call cudaMalloc_FP64(cu_vecb,n)
    call cudaMalloc_FP64(cu_diaA,n*nnd)
    call cudaMalloc_PARCEL_INT(cu_offset,nnd)
    call cudaMemcpy_h2d_fp64(vecx,cu_vecx,n,offset_zero,offset_zero)
    call cudaMemcpy_h2d_fp64(vecb,cu_vecb,n,offset_zero,offset_zero)
    call cudaMemcpy_h2d_fp64(diaA,cu_diaA,n*nnd,offset_zero,offset_zero)
    call cudaMemcpy_h2d_PARCEL_INT(offset,cu_offset,nnd,offset_zero,offset_zero)
 #endif
    call parcel_dcg_dia(&
         MPI_COMM_WORLD, &
 #ifdef CUDA_SOLVER
         cu_vecx%mem,cu_vecb%mem, &
         n,gn,istart,&
         cu_diaA%mem,cu_offset%mem,nnd, &
 #else
         vecx,vecb,&
         n,gn,istart,&
         diaA,offset,nnd,&
 #endif
         ipreflag,ILU_method,addL,iflagAS,&
         itrmax,rtolmax,&
         reshistory,&
         iovlflag, &
         precon_thblock, independ_nvec, &
         nblock, &
         iret&
         )
 #ifdef CUDA_SOLVER
    call cudaMemcpy_d2h_fp64(cu_vecx,vecx,n,offset_zero,offset_zero)
    call cudaFree(cu_vecx)
    call cudaFree(cu_vecb)
    call cudaFree(cu_diaA)
    call cudaFree(cu_offset)
 #endif
    call PARCEL_Free()
    call MPI_Finalize(ierr)
    deallocate(vecx)
    deallocate(vecb)
    deallocate(reshistory)
  end program main


8.3 DDM Format (Fortran)

A Fortran sample code in DDM format is shown below. Here, make_matrix_DDM denotes an arbitrary user routine that generates a matrix in the DDM format used by PARCEL.


    program main
      use mpi
      use PARCEL_integer
      use krylov
      use ddm_wrapper
      implicit none
      integer ityp_eq
      integer gnx,gny,gnz
      integer itrmax
      integer MAX_NITER
      integer nx,ny,nz
      integer nx0,ny0,nz0
      integer n,gn
      integer m
      integer m_
      integer npes,myrank
      integer ierr
      integer i
      integer ipreflag
      integer addL
      integer iflagAS
      real*8  rtolmax
      real*8  abstolmax
      integer solver,ityp_solver
      integer iret
      real*8,allocatable  :: reshistory_DDM(:)
      integer ILU_method
      integer iovlflag
      real*8,allocatable  :: vecx(:)
      real*8,allocatable  :: vecx0(:)
      real*8,allocatable  :: vecb(:)
      real*8  stime,etime
      integer niter,restart
      integer npey,npez
      integer precon_thblock
      integer independ_nvec
      integer nblock
      integer npe_x,npe_y,npe_z
      integer div_direc_th
      integer rank_x, rank_y, rank_z
      integer,parameter :: nnd = 7
      integer,parameter :: margin = 1
      integer,parameter :: ndim=3
      integer npe_dim(ndim)
      integer gn_grids(ndim)
      integer n_grids(ndim)
      integer istart_grids(ndim)
      integer,parameter :: num_neighbor_ranks_max = 3**ndim -1
      integer neighbor_ranks( num_neighbor_ranks_max )
      integer num_neighbor_ranks
      real*8, allocatable  :: diaA(:)
      integer, allocatable :: offset(:)
      integer, allocatable :: offset_dim(:,:)
 #ifdef CUDA_SOLVER
      type(devicemem) cu_vecx
      type(devicemem) cu_vecb
      type(devicemem) cu_diaA
      type(devicemem) cu_offset
      integer(kind=intType),parameter :: offset_zero=0
 #endif
      namelist/input/ &
           ityp_eq, &
           gnx, &
           gny, &
           gnz, &
           MAX_NITER, &
           rtolmax, &
           abstolmax, &
           ipreflag, &
           independ_nvec, &
           addL, &
           iflagAS, &
           ILU_method, &
           iovlflag, &
           ityp_solver, &
           precon_thblock, &
           nblock
      namelist/input_ddm/ &
           npe_x,&
           npe_y,&
           npe_z,&
           div_direc_th
      open(33,file='input_namelist')
      read(33,nml=input)
      read(33,nml=input_ddm)
      close(33)
      stime = 0.0d0
      etime = 0.0d0
      itrmax = MAX_NITER
      solver = ityp_solver
      call MPI_Init(ierr)
      call MPI_Comm_size(MPI_COMM_WORLD,npes,ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD,myrank,ierr)
      call PARCEL_Init(myrank)
      gn_grids(1)  = gnx
      gn_grids(2)  = gny
      gn_grids(3)  = gnz
      call set_rank_xyz( myrank, npe_x, npe_y, npe_z, rank_x, rank_y, rank_z )
      call set_grids_decompostion( &
           npe_x, npe_y, npe_z, &
           rank_x, rank_y, rank_z,  &
           ndim, gn_grids, npe_dim, n_grids,  &
           istart_grids, margin, m )
      num_neighbor_ranks = -1
      call set_neighbor_ranks( &
           npe_x, npe_y, npe_z, rank_x, rank_y, rank_z, &
           num_neighbor_ranks, &
           neighbor_ranks, num_neighbor_ranks_max)
      allocate( diaA(nnd*m) )
      allocate( offset(nnd) )
      allocate( offset_dim(nnd,ndim) )
      allocate( reshistory_DDM(MAX_NITER) )
      allocate( vecx(m) )
      allocate( vecx0(m) )
      allocate( vecb(m) )
      diaA = 0.0d0
      vecx = 0.0d0
      vecx0 = 0.0d0
      vecb = 0.0d0
      reshistory_DDM = 0
      offset = 0
      offset_dim = 0
      call make_matrix_DDM( &
           gnx,gny,gnz, &
           ndim, n_grids,  &
           istart_grids, &
           margin, nnd, m, &
           offset_dim, offset, &
           diaA, vecb, vecx, vecx0,&
           rank_x, rank_y, rank_z, &
           npe_x, npe_y, npe_z )
 #ifdef CUDA_SOLVER
      call cudaMalloc_FP64(cu_vecx,m)
      call cudaMalloc_FP64(cu_vecb,m)
      call cudaMalloc_FP64(cu_diaA,m*nnd)
      call cudaMalloc_INT(cu_offset,nnd)
      call cudaMemcpy_h2d_fp64(vecx,cu_vecx,m,offset_zero,offset_zero)
      call cudaMemcpy_h2d_fp64(vecb,cu_vecb,m,offset_zero,offset_zero)
      call cudaMemcpy_h2d_fp64(diaA,cu_diaA,m*nnd,offset_zero,offset_zero)
      call cudaMemcpy_h2d_INT(offset,cu_offset, nnd,offset_zero,offset_zero)
 #endif
      call parcel_dcg_ddm(&
           MPI_COMM_WORLD, &
 #ifdef CUDA_SOLVER
           cu_vecx%mem,cu_vecb%mem, &
           m, &
           nnd,  cu_offset%mem, cu_diaA%mem, &
 #else
           vecx,vecb,  &
           m, &
           nnd, offset, diaA, &
 #endif
           num_neighbor_ranks, neighbor_ranks, margin, ndim,&
           npe_dim, n_grids, offset_dim,&
           div_direc_th, &
           ipreflag, &
           addL, iflagAS,&
           MAX_NITER, rtolmax, reshistory_DDM, &
           iovlflag, &
           precon_thblock, &
           independ_nvec, &
           nBlock, &
           iret &
           )
 #ifdef CUDA_SOLVER
      call cudaMemcpy_d2h_fp64(cu_vecx,vecx,m,offset_zero,offset_zero)
      call cudaFree(cu_vecx)
      call cudaFree(cu_vecb)
      call cudaFree(cu_diaA)
      call cudaFree(cu_offset)
 #endif
      call MPI_Barrier(MPI_COMM_WORLD,ierr)
      deallocate( diaA )
      deallocate( offset )
      deallocate( offset_dim )
      deallocate( reshistory_DDM )
      deallocate( vecx )
      deallocate( vecx0 )
      deallocate( vecb )
      call PARCEL_Free()
      call MPI_Finalize(ierr)
    end program main


9 How to use (C)

A Fortran routine can be called from C by appending "_" to the end of the routine name and passing every argument as a pointer. The PARCEL routines can therefore also be used in C programs. Regarding the use of MPI, the C MPI communicator should be converted to a Fortran MPI communicator via MPI_Comm_c2f in the MPI library. The use of the PARCEL routines is explained below using C sample programs of a CG solver. If the preprocessor option CUDA_SOLVER is enabled, a CUDA (GPU) version of each sample program is generated.
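
For example (a minimal sketch of the calling convention only; solve is a hypothetical Fortran subroutine, not a PARCEL routine), a Fortran subroutine solve(n, x) is referenced from C as solve_, with every argument passed by address, and the communicator is converted with MPI_Comm_c2f before it is handed to Fortran:


   #include "mpi.h"

   /* Hypothetical Fortran routine  subroutine solve(n, x)  seen from C:
      trailing underscore, every argument passed by address. */
   extern void solve_(int *n, double *x);

   int main( int argc, char *argv[]){
     int n = 4;
     double x[4] = {0.0, 0.0, 0.0, 0.0};
     MPI_Init(&argc,&argv);
     /* Convert the C communicator to a Fortran handle before passing it on. */
     MPI_Fint icomm_fort = MPI_Comm_c2f(MPI_COMM_WORLD);
     (void)icomm_fort;
     /* The actual call would read:  solve_(&n, x);  (scalars by address, arrays as pointers). */
     (void)n; (void)x;
     MPI_Finalize();
     return 0;
   }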

9.1 CRS Format (C)

A sample code in CRS format is shown below. Here, make_matrix_CRS denotes an arbitrary user routine that generates a matrix in the CRS format used by PARCEL.


 
   #include < stdio.h >
   #include < stdlib.h >
   #include "mpi.h"
   #ifdef CUDA_SOLVER
   #include < cuda.h >
   #include < cuda_runtime.h >
   #endif
   int main( int argc, char *argv[]){
     int npes;
     int myrank;
     MPI_Fint icomm_fort;
     MPI_Init(&argc,&argv);
     MPI_Comm_size(MPI_COMM_WORLD, &npes);
     MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
     icomm_fort=MPI_Comm_c2f(MPI_COMM_WORLD);
     parcel_init_c_(&myrank);
     int n,gn,nnz,istart;
     double* crsA;
     int* crsRow_ptr;
     int* crsCol;
     make_matrix_CRS(&n,&gn,&nnz,&istart,&crsA,&crsRow_ptr,&crsCol);
     double *vecx = (double *)malloc(sizeof(double)*n);
     double *vecb = (double *)malloc(sizeof(double)*n);
     for(int i=0;i < n;i++){
       vecx[i] = 0.0;
       vecb[i] = 1.0;
     }
     int ipreflag = 0;
     int ilu_method = 1;
     int addl = 0;
     int iflagas = 1;
     int max_niter = 100;
     double rtolmax = 1.0e-8;
     int iret;
     double *reshistory = (double *)malloc(sizeof(double)*max_niter);
     int iovlflag = 0;
     int precon_thblock = -1;
     int independ_nvec = -1;
     int nBlock = 2000;
 #ifdef CUDA_SOLVER
     typedef struct Fortran_GPUMem{
       void *a_d;
     } devicemem;
     size_t len = sizeof(double)*n;
     devicemem cu_vecx;cudaMalloc(&(cu_vecx.a_d),len);
     cudaMemcpy( cu_vecx.a_d, vecx,  len, cudaMemcpyHostToDevice);
     devicemem cu_vecb;cudaMalloc(&(cu_vecb.a_d),len);
     cudaMemcpy( cu_vecb.a_d, vecb,  len, cudaMemcpyHostToDevice);
     len = sizeof(double)*nnz;
     devicemem cu_crsa;cudaMalloc(&(cu_crsa.a_d),len);
     cudaMemcpy( cu_crsa.a_d, crsA,  len, cudaMemcpyHostToDevice);
     len = sizeof(int)*nnz;
     devicemem cu_crscol;cudaMalloc(&(cu_crscol.a_d),len);
     cudaMemcpy( cu_crscol.a_d, crsCol,  len, cudaMemcpyHostToDevice);
     len = sizeof(int)*(n+1);
     devicemem cu_crsrow_ptr;cudaMalloc(&(cu_crsrow_ptr.a_d),len);
     cudaMemcpy( cu_crsrow_ptr.a_d, crsRow_ptr,  len, cudaMemcpyHostToDevice);
 #endif
     parcel_dcg_c_(
          &icomm_fort ,
 #ifdef CUDA_SOLVER
          &cu_vecx,
          &cu_vecb,
          &n,
          &gn,
          &nnz,
          &istart,
          &cu_crsa ,
          &cu_crsrow_ptr,
          &cu_crscol,
 #else
          vecx,
          vecb,
          &n,
          &gn,
          &nnz,
          &istart,
         crsA ,
         crsRow_ptr,
         crsCol,
 #endif
          &ipreflag ,
          &ilu_method ,
          &addl ,
          &iflagas ,
          &max_niter ,
          &rtolmax ,
          reshistory,
          &iovlflag ,
          &precon_thblock ,
          &independ_nvec ,
          &nBlock,
          &iret
          );
 #ifdef CUDA_SOLVER
     len = sizeof(double)*n;
     cudaMemcpy(  vecx,cu_vecx.a_d,  len, cudaMemcpyDeviceToHost);
 
     cudaFree(cu_vecx.a_d);
     cudaFree(cu_vecb.a_d);
     cudaFree(cu_crsa.a_d);
     cudaFree(cu_crscol.a_d);
     cudaFree(cu_crsrow_ptr.a_d);
 #endif
     free(vecx);
     free(vecb);
     free(reshistory);
     parcel_free_c_();
     MPI_Finalize();
     }


9.2 DIA Format (C)

A sample code in DIA format is shown below. Here, make_matrix_DIA denotes an arbitrary user routine that generates a matrix in the DIA format used by PARCEL.


 
   #include < stdio.h >
   #include < stdlib.h >
   #include "mpi.h"
   #ifdef CUDA_SOLVER
   #include < cuda.h >
   #include < cuda_runtime.h >
   #endif
   int main( int argc, char *argv[]){
     int npes;
     int myrank;
     MPI_Fint icomm_fort;
     MPI_Init(&argc,&argv);
     MPI_Comm_size(MPI_COMM_WORLD, &npes);
     MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
     icomm_fort=MPI_Comm_c2f(MPI_COMM_WORLD);
     parcel_init_c_(&myrank);
     int n,gn,nnd,istart;
     double* diaA;
     int* offset;
     make_matrix_DIA(&n,&gn,&nnd,&istart,&diaA,&offset);
     double *vecx = (double *)malloc(sizeof(double)*n);
     double *vecb = (double *)malloc(sizeof(double)*n);
     for(int i=0;i < n;i++){
       vecx[i] = 0.0;
       vecb[i] = 1.0;
     }
     int ipreflag = 0;
     int ilu_method = 1;
     int addl = 0;
     int iflagas = 1;
     int max_niter = 100;
     double rtolmax = 1.0e-8;
     int iret;
     double *reshistory = (double *)malloc(sizeof(double)*max_niter);
     int iovlflag = 0;
     int precon_thblock = -1;
     int independ_nvec = -1;
     int nBlock = 2000;
 #ifdef CUDA_SOLVER
     typedef struct Fortran_GPUMem{
       void *a_d;
     } devicemem;
     size_t len = sizeof(double)*n;
     devicemem cu_vecx;cudaMalloc(&(cu_vecx.a_d),len);
     cudaMemcpy( cu_vecx.a_d, vecx,  len, cudaMemcpyHostToDevice);
     devicemem cu_vecb;cudaMalloc(&(cu_vecb.a_d),len);
     cudaMemcpy( cu_vecb.a_d, vecb,  len, cudaMemcpyHostToDevice);
     len = sizeof(double)*n*nnd;
     devicemem cu_diaA;cudaMalloc(&(cu_diaA.a_d),len);
     cudaMemcpy( cu_diaA.a_d, diaA,  len, cudaMemcpyHostToDevice);
     len = sizeof(int)*nnd;
     devicemem cu_offset;cudaMalloc(&(cu_offset.a_d),len);
     cudaMemcpy( cu_offset.a_d, offset,  len, cudaMemcpyHostToDevice);
 #endif
     parcel_dcg_dia_c_(
          &icomm_fort ,
 #ifdef CUDA_SOLVER
          &cu_vecx,
          &cu_vecb,
          &n,
          &gn,
          &istart,
          &cu_diaA,
          &cu_offset,
 #else
          vecx,
          vecb,
          &n,
          &gn,
          &istart,
          diaA ,
          offset,
 #endif
          &nnd,
          &ipreflag ,
          &ilu_method ,
          &addl ,
          &iflagas ,
          &max_niter ,
          &rtolmax ,
          reshistory,
          &iovlflag ,
          &precon_thblock ,
          &independ_nvec ,
          &nBlock,
          &iret
          );
 #ifdef CUDA_SOLVER
     len = sizeof(double)*n;
     cudaMemcpy(  vecx,cu_vecx.a_d,  len, cudaMemcpyDeviceToHost);
     cudaFree(cu_vecx.a_d);
     cudaFree(cu_vecb.a_d);
     cudaFree(cu_diaA.a_d);
     cudaFree(cu_offset.a_d);
 #endif
     free(vecx);
     free(vecb);
     free(reshistory);
     parcel_free_c_();
     MPI_Finalize();
     }


9.3 DDM Format (C)

A C sample code in DDM format is shown below. Here, make_matrix_ddm denotes an arbitrary user routine that generates a matrix in the DDM format used by PARCEL. set_rank_xyz, set_grids_decompostion, and set_neighbor_ranks are arbitrary user routines that determine the process decomposition and the neighboring processes in the format required by PARCEL.


 
   #include < stdio.h >
   #include < stdlib.h >
   #include < math.h >
   #include "mpi.h"
   #ifdef CUDA_SOLVER
   #include < cuda.h >
   #include < cuda_runtime.h >
   #endif
   int main( int argc, char *argv[]){
     int ityp_eq;
     int itrmax;
     int max_niter;
     int nx,ny,nz;
     int nx0,ny0,nz0;
     int n,gn;
     int istart,iend;
     int npes,myrank;
     int ierr;
     int i;
     int iret;
     int myranky,myrankz;
     int npey,npez;
     MPI_Fint icomm_fort;
     int gnx = 100;
     int gny = 100;
     int gnz = 100;
     max_niter = 100;
     double rtolmax = 1.0e-8;
     int ipreflag = 0;
     int independ_nvec = -1;
     int addl = 0;
     int iflagas = 1;
     int ilu_method = 1;
     int iovlflag = 0;
     int precon_thblock = -1;
     int nBlock = 2000;
     int npe_x = 1;
     int npe_y = 4;
     int npe_z = 1;
     int div_direc_th = 1;
     MPI_Init(&argc,&argv);
     MPI_Comm_size(MPI_COMM_WORLD, &npes);
     MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
     icomm_fort=MPI_Comm_c2f(MPI_COMM_WORLD);
     parcel_init_c_(&myrank);
     int nnd = 7;
     int rank_x,rank_y,rank_z;
     set_rank_xyz(
          &myrank,
          &npe_x,
          &npe_y,
          &npe_z,
          &rank_x,
          &rank_y,
          &rank_z);
     int ndim=3 ;
     int margin = 1;
     int *npe_dim = (int *)malloc(sizeof(int)*ndim);
     int *gn_grids = (int *)malloc(sizeof(int)*ndim);
     int *n_grids = (int *)malloc(sizeof(int)*ndim);
     int *istart_grids = (int *)malloc(sizeof(int)*ndim);
     for(i=0;i < ndim;i++){
       npe_dim[i] = 0;
       gn_grids[i] = 0;
       n_grids[i] = 0;
       istart_grids[i] = 0;
     }
     int m=-1;
     gn_grids[0]  = gnx;
     gn_grids[1]  = gny;
     gn_grids[2]  = gnz;
     set_grids_decompostion(
                &npe_x, &npe_y, &npe_z,
                &rank_x, &rank_y, &rank_z,
                &ndim, gn_grids, npe_dim, n_grids,
                istart_grids, &margin, &m );
     int num_neighbor_ranks_max = pow(3,ndim) -1;
     int *neighbor_ranks = (int *)malloc(sizeof(int)*num_neighbor_ranks_max);
     int num_neighbor_ranks=-1;
     for(i=0;i < num_neighbor_ranks_max;i++){
       neighbor_ranks[i] = -1;
     }
     set_neighbor_ranks(
            &npe_x,
            &npe_y,
            &npe_z,
            &rank_x,
            &rank_y,
            &rank_z,
            &num_neighbor_ranks,
            neighbor_ranks,
            &num_neighbor_ranks_max);
     int *offset = (int *)malloc(sizeof(int)*nnd);
     for(i=0;i < nnd;i++){
       offset[i] = 0;
     }
     int *offset_dim = (int *)malloc(sizeof(int)*nnd*ndim);
     for(i=0;i < nnd*ndim;i++){
       offset_dim[i] = 0;
     }
     double *diaA  = (double *)malloc(sizeof(double)*m*nnd);
     for(i=0;i < m*nnd;i++){
       diaA[i] = 0.0;
     }
     double *vecx  = (double *)malloc(sizeof(double)*m);
     double *vecx0 = (double *)malloc(sizeof(double)*m);
     double *vecb  = (double *)malloc(sizeof(double)*m);
     for(i=0;i < m;i++){
       vecx[i] = 0.0;
       vecx0[i] = 0.0;
       vecb[i] = 0.0;
     }
     double *reshistory_ddm  = (double *)malloc(sizeof(double)*max_niter);
     for(i=0;i < max_niter;i++){
       reshistory_ddm[i] = 0.0;
     }
     int nnd_ = nnd;
     make_matrix_ddm(
             &gnx,
             &gny,
             &gnz,
             &ndim,
             n_grids,
             istart_grids,
             &margin,
             &nnd_,
             &m,
             offset_dim,
             offset,
             diaA,
             vecb,
             vecx,
             vecx0,
             &rank_x,
             &rank_y,
             &rank_z,
             &npe_x,
             &npe_y,
             &npe_z);
 #ifdef CUDA_SOLVER
     typedef struct Fortran_GPUMem{
       void *a_d;
     } devicemem;
     size_t len = sizeof(double)*m;
     devicemem cu_vecx;cudaMalloc(&(cu_vecx.a_d),len);
     cudaMemcpy( cu_vecx.a_d, vecx,  len, cudaMemcpyHostToDevice);
     devicemem cu_vecb;cudaMalloc(&(cu_vecb.a_d),len);
     cudaMemcpy( cu_vecb.a_d, vecb,  len, cudaMemcpyHostToDevice);
     len = sizeof(double)*m*nnd;
     devicemem cu_diaA;cudaMalloc(&(cu_diaA.a_d),len);
     cudaMemcpy( cu_diaA.a_d, diaA,  len, cudaMemcpyHostToDevice);
     len = sizeof(int)*nnd;
     devicemem cu_offset;cudaMalloc(&(cu_offset.a_d),len);
     cudaMemcpy( cu_offset.a_d, offset,  len, cudaMemcpyHostToDevice);
 #endif
     parcel_dcg_ddm_c_(
              &icomm_fort,
 #ifdef CUDA_SOLVER
              &cu_vecx,
              &cu_vecb,
              &m,
              &nnd,
              &cu_offset,
              &cu_diaA,
 #else
              vecx,
              vecb,
              &m,
              &nnd,
              offset,
              diaA,
 #endif
              &num_neighbor_ranks,
              neighbor_ranks,
              &margin,
              &ndim,
              npe_dim,
              n_grids,
              offset_dim,
              &div_direc_th,
              &ipreflag,
              &addl,
              &iflagas,
              &max_niter,
              &rtolmax,
              reshistory_ddm,
              &iovlflag,
              &precon_thblock,
              &independ_nvec ,
              &nBlock ,
              &iret
              );
 #ifdef CUDA_SOLVER
     len = sizeof(double)*m;
     cudaMemcpy(  vecx,cu_vecx.a_d,  len, cudaMemcpyDeviceToHost);
     cudaFree(cu_diaA.a_d);
     cudaFree(cu_offset.a_d);
     cudaFree(cu_vecx.a_d);
     cudaFree(cu_vecb.a_d);
 #endif
     free(offset);
     free(offset_dim);
     free(diaA);
     free(vecx);
     free(vecx0);
     free(vecb);
     free(reshistory_ddm);
     free(neighbor_ranks);
     free(npe_dim);
     free(gn_grids);
     free(n_grids);
     free(istart_grids);
     MPI_Barrier(MPI_COMM_WORLD);
     parcel_free_c_();
     MPI_Finalize();
     return 0;
   }



10 REFERENCES

[1] R. Barrett, M. Berry, T. F. Chan, et al., “Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods”, SIAM (1994).
[2] Y. Saad, “Iterative methods for sparse linear systems”, SIAM (2003).
[3] M. Hoemmen, “Communication-avoiding Krylov subspace methods”, Ph.D. thesis, University of California, Berkeley (2010).
[4] Y. Idomura, T. Ina, A. Mayumi, et al., “Application of a communication-avoiding generalized minimal residual method to a gyrokinetic five dimensional Eulerian code on many core platforms”, ScalA 17: 8th Workshop on Latest Advances in Scalable Algorithms for Large Scale Systems, pp. 1-8 (2017).
[5] R. Suda, L. Cong, D. Watanabe, et al., “Communication-Avoiding CG Method: New Direction of Krylov Subspace Methods towards Exa-scale Computing”, RIMS Kokyuroku, pp. 102-111 (2016).
[6] Y. Idomura, T. Ina, S. Yamashita, et al., “Communication avoiding multigrid preconditioned conjugate gradient method for extreme scale multiphase CFD simulations”, ScalA 18: 9th Workshop on Latest Advances in Scalable Algorithms for Large Scale Systems, pp. 17-24 (2018).
[7] T. Ina, Y. Idomura, T. Imamura, et al., “Iterative methods with mixed-precision preconditioning for ill-conditioned linear systems in multiphase CFD simulations”, ScalA 21: 12th Workshop on Latest Advances in Scalable Algorithms for Large Scale Systems (2021).
[8] A. Stathopoulos, K. Wu, “A block orthogonalization procedure with constant synchronization requirements”, SIAM J. Sci. Comput. 23, 2165–2182 (2002).
[9] T. Fukaya, Y. Nakatsukasa, Y. Yanagisawa, et al., “CholeskyQR2: A Simple and Communication-Avoiding Algorithm for Computing a Tall-Skinny QR Factorization on a Large-Scale Parallel System”, ScalA 14: 5th Workshop on Latest Advances in Scalable Algorithms for Large Scale Systems, pp. 31-38 (2014).
[9] T. Fukaya, Y. Nakatsukasa, Y. Yanagisawa, et al., “CholeskyQR2: A Simple and Communication-Avoiding Algorithm for Computing a Tall-Skinny QR Factorization on a Large-Scale Parallel System,” ScalA 14:5th Workshop on Latest Advances in Scalable Algorithms for Large Scale Systems, pp. 31-38,(2014)