1 Krylov subspace methods

This chapter describes the iterative methods contained in this library. The following notation is used: \(Ax = b\) denotes the linear system to be solved, where \(A\) is an n\(\times\)n sparse matrix and \(x_0\) is the initial guess vector. \(K\) denotes an appropriate preconditioner.


1.1 Preconditioned Conjugate Gradient Method [1,2]

The Conjugate Gradient (CG) method is an iterative Krylov subspace algorithm for solving linear systems with symmetric matrices.

  1. \(\mathbf{{Compute\ }}\mathbf{r}^{\mathbf{0}}\mathbf{= b - A}\mathbf{x}^{\mathbf{0}}\mathbf{{\ for\ some\ initial\ guess\ }}\mathbf{x}^{\mathbf{- 1}}\mathbf{=}\mathbf{x}^{\mathbf{0}}\mathbf{\ }\)
  2. \(\mathbf{p}^{\mathbf{- 1}}\mathbf{= 0}\)
  3. \(\mathbf{\alpha}_{\mathbf{- 1}}\mathbf{= 0\ }\mathbf{\ }\)
  4. \(\mathbf{\beta}_{\mathbf{- 1}}\mathbf{= 0}\)
  5. \(\mathbf{{solve\ s\ from\ }}\mathbf{K}\mathbf{s = \ }\mathbf{r}^{\mathbf{0}}\)
  6. \(\mathbf{\rho}_{\mathbf{0}}\mathbf{=}\left\langle \mathbf{s,}\mathbf{r}^{\mathbf{0}} \right\rangle\)
  7. \(\mathbf{for\ i = 0,1,2,3}\mathbf{\ldots}\)
  8.   \(\mathbf{p}^{\mathbf{i}}\mathbf{= s +}\mathbf{\beta}_{\mathbf{i - 1}}\mathbf{p}^{\mathbf{i - 1}}\)
  9.   \(\mathbf{q}^{\mathbf{i}}\mathbf{= A}\mathbf{p}^{\mathbf{i}}\)
  10.   \(\mathbf{\gamma =}\left\langle \mathbf{p}^{\mathbf{i}}\mathbf{,}\mathbf{q}^{\mathbf{i}} \right\rangle\)
  11.   \(\mathbf{x}^{\mathbf{i}}\mathbf{=}\mathbf{x}^{\mathbf{i - 1}}\mathbf{+}\mathbf{\alpha}_{\mathbf{i - 1}}\mathbf{p}^{\mathbf{i - 1}}\)
  12.   \(\mathbf{\alpha}_{\mathbf{i}}\mathbf{=}\mathbf{\rho}_{\mathbf{i}}\mathbf{\ /\ \gamma}\)
  13.   \(\mathbf{r}^{\mathbf{i + 1}}\mathbf{=}\mathbf{r}^{\mathbf{i}}\mathbf{-}\mathbf{\alpha}_{\mathbf{i}}\mathbf{q}^{\mathbf{i}}\)
  14.   \(\mathbf{{solve\ s\ from\ }}\mathbf{K}\mathbf{s = \ }\mathbf{r}^{\mathbf{i + 1}}\)
  15.   \(\mathbf{\rho}_{\mathbf{i + 1}}\mathbf{=}\left\langle \mathbf{s,}\mathbf{r}^{\mathbf{i + 1}} \right\rangle\)
  16.   \(\mathbf{{if\ }}\left\| \mathbf{r}^{\mathbf{i + 1}} \right\|\mathbf{\ /\ }\left\| \mathbf{b} \right\|\mathbf{{\ small\ enough\ then}}\)
  17.     \(\mathbf{x}^{\mathbf{i + 1}}\mathbf{=}\mathbf{x}^{\mathbf{i}}\mathbf{+}\mathbf{\alpha}_{\mathbf{i}}\mathbf{p}^{\mathbf{i}}\)
  18.     \(\mathbf{{quit}}\)
  19.   \(\mathbf{{endif}}\)
  20.   \(\mathbf{\beta}_{\mathbf{i}}\mathbf{=}\mathbf{\rho}_{\mathbf{i + 1}}\mathbf{\ /\ }\mathbf{\rho}_{\mathbf{i}}\)
  21. \(\mathbf{{end}}\)
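
For reference, the recurrences above can be written compactly in Fortran. The following is a minimal sketch (not the PARCEL implementation), written for a dense symmetric positive definite matrix with the point Jacobi preconditioner \(K = \mathrm{diag}(A)\); it uses the conventional ordering in which the solution is updated in the same iteration, which produces the same iterates as the listing above in exact arithmetic.

  ! Minimal sketch (not the PARCEL implementation) of preconditioned CG
  ! for a dense symmetric positive definite matrix A with the point
  ! Jacobi preconditioner K = diag(A).
  subroutine pcg_dense(n, A, b, x, tol, maxit)
    implicit none
    integer, intent(in) :: n, maxit
    double precision, intent(in)    :: A(n,n), b(n), tol
    double precision, intent(inout) :: x(n)   ! in: initial guess, out: solution
    double precision :: r(n), p(n), q(n), s(n), d(n)
    double precision :: alpha, beta, gamma, rho, rho_new, bnorm
    integer :: i, it
    do i = 1, n
       d(i) = A(i,i)                          ! K = diag(A)
    end do
    r = b - matmul(A, x)                      ! r = b - A x
    s = r / d                                 ! solve K s = r
    p = s
    rho   = dot_product(s, r)
    bnorm = sqrt(dot_product(b, b))
    do it = 1, maxit
       q     = matmul(A, p)                   ! q = A p
       gamma = dot_product(p, q)
       alpha = rho / gamma
       x = x + alpha*p
       r = r - alpha*q
       if (sqrt(dot_product(r, r))/bnorm < tol) exit  ! ||r|| / ||b|| small enough
       s = r / d                              ! solve K s = r
       rho_new = dot_product(s, r)
       beta    = rho_new / rho
       rho     = rho_new
       p = s + beta*p                         ! next search direction
    end do
  end subroutine pcg_dense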


1.2 Preconditioned Biconjugate Gradient Stabilized Method [1,2]

The Biconjugate Gradient Stabilized (Bi-CGSTAB) method is an iterative Krylov subspace algorithm for solving linear systems with nonsymmetric matrices.

  1. \(\mathbf{{Compute\ }}\mathbf{r}_{\mathbf{0}}\mathbf{= b - A}\mathbf{x}_{\mathbf{0}}\mathbf{{\ for\ some\ initial\ guess\ }}\mathbf{x}_{\mathbf{- 1}}\mathbf{=}\mathbf{x}_{\mathbf{0}}\)
  2. \(\mathbf{p}_{\mathbf{0}}\mathbf{=}\mathbf{r}_{\mathbf{0}}\)
  3. \(\mathbf{c}_{\mathbf{1}}\mathbf{=}\left\langle \mathbf{r}_{\mathbf{0}}\mathbf{,}\mathbf{r}_{\mathbf{0}} \right\rangle\)
  4. \(\mathbf{for\ i = 0,1,2,3}\mathbf{\ldots}\)
  5.   \(\mathbf{{solve\ }}\widehat{\mathbf{p}}\mathbf{{\ from\ K}}\widehat{\mathbf{p}}\mathbf{= \ }\mathbf{p}_{\mathbf{i}}\)
  6.   \(\mathbf{q = A}\widehat{\mathbf{p}}\)
  7.   \(\mathbf{c}_{\mathbf{2}}\mathbf{=}\left\langle \mathbf{r}_{\mathbf{0}}\mathbf{,q} \right\rangle\)
  8.   \(\mathbf{\alpha =}\mathbf{c}_{\mathbf{1}}\mathbf{\ /\ }\mathbf{c}_{\mathbf{2}}\)
  9.   \(\mathbf{e =}\mathbf{r}_{\mathbf{i}}\mathbf{- \alpha q}\)
  10.   \(\mathbf{{solve\ }}\widehat{\mathbf{e}}\mathbf{{\ from\ K}}\widehat{\mathbf{e}}\mathbf{= e}\)
  11.   \(\mathbf{v = A}\widehat{\mathbf{e}}\)
  12.   \(\mathbf{c}_{\mathbf{3}}\mathbf{=}\left\langle \mathbf{e,v} \right\rangle\mathbf{\ /\ }\left\langle \mathbf{v,v} \right\rangle\)
  13.   \(\mathbf{x}_{\mathbf{i + 1}}\mathbf{=}\mathbf{x}_{\mathbf{i}}\mathbf{+ \alpha}\widehat{\mathbf{p}}\mathbf{+}\mathbf{c}_{\mathbf{3}}\widehat{\mathbf{e}}\)
  14.   \(\mathbf{r}_{\mathbf{i + 1}}\mathbf{= e -}\mathbf{c}_{\mathbf{3}}\mathbf{v}\)
  15.   \(\mathbf{c}_{\mathbf{1}}\mathbf{=}\left\langle \mathbf{r}_{\mathbf{0}}\mathbf{,}\mathbf{r}_{\mathbf{i + 1}} \right\rangle\)
  16.   \(\mathbf{{if\ }}\left\| \mathbf{r}_{\mathbf{i + 1}} \right\|\mathbf{\ /\ }\left\| \mathbf{b} \right\|\mathbf{{\ small\ enough\ }}\mathbf{{the}}\mathbf{n}\)
  17.     \(\mathbf{{quit}}\)
  18.   \(\mathbf{{endif}}\)
  19.   \(\mathbf{\beta =}\mathbf{c}_{\mathbf{1}}\mathbf{\ /\ }\left( \mathbf{c}_{\mathbf{2}}\mathbf{c}_{\mathbf{3}} \right)\)
  20.   \(\mathbf{p}_{\mathbf{i + 1}}\mathbf{=}\mathbf{r}_{\mathbf{i + 1}}\mathbf{+ \beta}\left( \mathbf{p}_{\mathbf{i}}\mathbf{-}\mathbf{c}_{\mathbf{3}}\mathbf{q} \right)\)
  21. \(\mathbf{{end}}\)


1.3 Preconditioned Generalized Minimum Residual Method [1,2]

The Generalized Minimum Residual (GMRES(m)) method is an iterative Krylov subspace algorithm for solving linear systems with nonsymmetric matrices. If GMRES(m) does not converge within \(m\) iterations, it is restarted, and the solution obtained after \(m\) iterations is used as the initial guess for the next restart cycle. PARCEL provides variants of GMRES(m) with the Classical Gram-Schmidt (sometimes referred to as standard Gram-Schmidt) and Modified Gram-Schmidt (MGS) orthogonalization methods.

\(\mathbf{H}_{\mathbf{n}}\) denotes the upper triangular matrix with elements \(\mathbf{h}_{\mathbf{j,k}}\) obtained after the Givens rotations, and \(\mathbf{e}_{\mathbf{i}}\) denotes the vector consisting of the first \(i\) elements of the vector \(\mathbf{e}\).

  1. \(\mathbf{for\ j = 0,1,2,3}\mathbf{\ldots}\)
  2.   \(\mathbf{r = b - A}\mathbf{x}_{\mathbf{0}}\)
  3.   \(\mathbf{v}_{\mathbf{1}}\mathbf{= - r\ /\ }\left\| \mathbf{r} \right\|\)
  4.   \(\mathbf{e =}\left( \mathbf{-}\left\| \mathbf{r} \right\|\mathbf{,0,\ldots,0} \right)^{\mathbf{T}}\)
  5.   \(\mathbf{n = m}\)
  6.   \(\mathbf{for\ i = 1,2,3}\mathbf{\ldots}\mathbf{m}\)
  7.      \(\mathbf{{solve\ }}{\widehat{\mathbf{v}}}_{\mathbf{i}}\mathbf{{\ from\ K}}{\widehat{\mathbf{v}}}_{\mathbf{i}}\mathbf{=}\mathbf{v}_{\mathbf{i}}\)
  8.      \(\mathbf{\omega = A}{\widehat{\mathbf{v}}}_{\mathbf{i}}\)
  9.      \(\mathbf{for\ k = 1,2,3}\mathbf{\ldots}\mathbf{i}\)
  10.       \(\mathbf{h}_{\mathbf{k,i}}\mathbf{=}\left\langle \mathbf{\omega,}\mathbf{v}_{\mathbf{k}} \right\rangle\)
  11.       \(\mathbf{\omega = \omega -}\mathbf{h}_{\mathbf{k,i}}\mathbf{v}_{\mathbf{k}}\)
  12.     \(\mathbf{{end}}\)
  13.     \(\mathbf{h}_{\mathbf{i + 1,i}}\mathbf{=}\left\| \mathbf{\omega} \right\|\)
  14.     \(\mathbf{v}_{\mathbf{i + 1}}\mathbf{= \omega/}\left\| \mathbf{\omega} \right\|\)
  15.     \(\mathbf{for\ k = 1,2,3}\mathbf{\ldots}\mathbf{i - 1}\)
  16.       \(\begin{pmatrix} \mathbf{h}_{\mathbf{k,i}} \\ \mathbf{h}_{\mathbf{k + 1,i}} \\ \end{pmatrix}\mathbf{=}\begin{pmatrix} \mathbf{c}_{\mathbf{k}} & \mathbf{- s}_{\mathbf{k}} \\ \mathbf{s}_{\mathbf{k}} & \mathbf{c}_{\mathbf{k}} \\ \end{pmatrix}\begin{pmatrix} \mathbf{h}_{\mathbf{k,i}} \\ \mathbf{h}_{\mathbf{k + 1,i}} \\ \end{pmatrix}\)
  17.     \(\mathbf{{end}}\)
  18.     \(\mathbf{c}_{\mathbf{i}}\mathbf{=}\frac{\mathbf{1}}{\sqrt{\mathbf{1 +}\left( \frac{\mathbf{h}_{\mathbf{i + 1,i}}}{\mathbf{h}_{\mathbf{i,i}}} \right)^{\mathbf{2}}}}\)
  19.     \(\mathbf{s}_{\mathbf{i}}\mathbf{= -}\frac{\mathbf{h}_{\mathbf{i + 1,i}}}{\mathbf{h}_{\mathbf{i,i}}}\frac{\mathbf{1}}{\sqrt{\mathbf{1 +}\left( \frac{\mathbf{h}_{\mathbf{i + 1,i}}}{\mathbf{h}_{\mathbf{i,i}}} \right)^{\mathbf{2}}}}\)
  20.     \(\begin{pmatrix} \mathbf{e}_{\mathbf{i}} \\ \mathbf{e}_{\mathbf{i + 1}} \\ \end{pmatrix}\mathbf{=}\begin{pmatrix} \mathbf{c}_{\mathbf{i}} & \mathbf{- s}_{\mathbf{i}} \\ \mathbf{s}_{\mathbf{i}} & \mathbf{c}_{\mathbf{i}} \\ \end{pmatrix}\begin{pmatrix} \mathbf{e}_{\mathbf{i}} \\ \mathbf{e}_{\mathbf{i + 1}} \\ \end{pmatrix}\)
  21.     \(\mathbf{h}_{\mathbf{i,i}}\mathbf{=}\mathbf{c}_{\mathbf{i}}\mathbf{h}_{\mathbf{i,i}}\mathbf{- \ }\mathbf{s}_{\mathbf{i}}\mathbf{h}_{\mathbf{i + 1,i}}\)
  22.     \(\mathbf{h}_{\mathbf{i + 1,i}}\mathbf{= 0}\)
  23.     \(\mathbf{{if\ }}\left\| \mathbf{e}_{\mathbf{i + 1}} \right\|\mathbf{\ /\ }\left\| \mathbf{b} \right\|\mathbf{{\ small\ enough\ }}\mathbf{{then}}\)
  24.       \(\mathbf{n = i}\)
  25.       \(\mathbf{{exit}}\)
  26.     \(\mathbf{{endif}}\)
  27.   \(\mathbf{{end}}\)
  28.   \(\mathbf{y}_{\mathbf{n}}\mathbf{=}\mathbf{H}_{\mathbf{n}}^{\mathbf{- 1}}\mathbf{e}_{\mathbf{n}}\)
  29.   \(\mathbf{{solve\ }}\widehat{\mathbf{x}}\mathbf{{\ from\ K}}\widehat{\mathbf{x}}\mathbf{=}\sum_{\mathbf{k = 1}}^{\mathbf{n}}{\mathbf{y}_{\mathbf{k}}\mathbf{v}_{\mathbf{k}}}\)
  30.   \(\mathbf{x}_{\mathbf{n}}\mathbf{=}\mathbf{x}_{\mathbf{0}}\mathbf{+}\widehat{\mathbf{x}}\)
  31.   \(\mathbf{x}_{\mathbf{0}}\mathbf{=}\mathbf{x}_{\mathbf{n}}\)
  32. \(\mathbf{{end}}\)
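
As an illustration of steps 18-22, the following is a minimal Fortran sketch (an assumption about the implementation, not the PARCEL source) of one Givens rotation step, which eliminates \(h_{i+1,i}\) and updates the right hand side vector \(e\).

  ! Minimal sketch: construct and apply the Givens rotation that
  ! eliminates h(i+1,i), following steps 18-22 of the listing above.
  subroutine givens_step(h_ii, h_ip1i, e_i, e_ip1, c_i, s_i)
    implicit none
    double precision, intent(inout) :: h_ii, h_ip1i, e_i, e_ip1
    double precision, intent(out)   :: c_i, s_i
    double precision :: t, e_old_i, e_old_ip1
    t   = h_ip1i / h_ii
    c_i = 1.0d0 / sqrt(1.0d0 + t*t)
    s_i = -t * c_i
    ! rotate the right hand side vector e
    e_old_i   = e_i
    e_old_ip1 = e_ip1
    e_i   = c_i*e_old_i - s_i*e_old_ip1
    e_ip1 = s_i*e_old_i + c_i*e_old_ip1
    ! eliminate the subdiagonal element of column i
    h_ii   = c_i*h_ii - s_i*h_ip1i
    h_ip1i = 0.0d0
  end subroutine givens_step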


1.4 Preconditioned Communication Avoiding Generalized Minimum Residual Method [3,4]

The Communication Avoiding Generalized Minimum Residual (CA-GMRES(s,t)) method is an iterative Krylov subspace algorithm for solving linear systems with nonsymmetric matrices. One iteration of CA-GMRES(s,t) performs the computation corresponding to \(s\) iterations of the GMRES(m) method. If CA-GMRES(s,t) does not converge within \(t\) outer iterations, it is restarted, and the solution obtained at that point is used as the initial guess for the next restart cycle. CA-GMRES(s,t) is equivalent to GMRES(m) if \(s\) and \(t\) are chosen so that \(m = t \times s\), and its convergence property is the same as that of GMRES(m) in exact arithmetic. In addition, the number of global collective communication calls is reduced by a communication avoiding QR factorization. As CA-GMRES(s,t) produces \(s\) basis vectors at once, the linear independence of the basis vectors may degrade because of round-off errors when \(s\) is too large, leading to worse convergence than GMRES(m). In order to improve the orthogonality of the basis vectors, the PARCEL implementation of CA-GMRES(s,t) provides an option to use the Newton basis in addition to the monomial basis.

\(\mathbf{e}\) denotes a unit vector, \({\widetilde{\mathbf{\rho}}}_{\mathbf{k}}\) denotes the element in the \(s\)-th row and \(s\)-th column of the matrix \(\mathbf{R}_{\mathbf{k}}\), and \(\mathbf{E}\) denotes the eigenvalues obtained from the Hessenberg matrix generated by the iteration process.

  1. \(\mathbf{B}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{= \lbrack}\mathbf{e}_{\mathbf{2}}\mathbf{,}\mathbf{e}_{\mathbf{3}}\mathbf{,\ldots,}\mathbf{e}_{\mathbf{s + 1}}\mathbf{\rbrack}\)
  2. \(\mathbf{for\ j = 0,1,2,3\ldots}\)
  3.   \(\mathbf{r = b - A}\mathbf{x}_{\mathbf{0}}\)
  4.   \(\mathbf{v}_{\mathbf{1}}\mathbf{= r\ /\ }\left\| \mathbf{r} \right\|\)
  5.   \(\mathbf{\zeta =}\left( \left\| \mathbf{r} \right\|\mathbf{,0,\ldots,0} \right)^{\mathbf{T}}\)
  6.   \(\mathbf{for\ k = 0,1\ldots,t - 1}\)
  7.     \(\mathbf{Fix\ basis\ conversion\ matrix\ \lbrack}\mathbf{B}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{,E\rbrack}\)
  8.     \(\mathbf{Compute\ basis\ vectors\ \lbrack\ s,}{\acute{\mathbf{V}}}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{,}\mathbf{B}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{\ \rbrack}\)
  9.     \(\mathbf{if(k.eq.0)}\)
  10.       \(\mathbf{{QR\ decomposition\ }}\mathbf{V}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{=}\mathbf{Q}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{R}_{\mathbf{0}}^{\mathbf{\sim}}\)
  11.       \(\mathfrak{Q}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{=}\mathbf{Q}_{\mathbf{0}}^{\mathbf{\sim}}\)
  12.       \(\mathfrak{H}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{=}\mathbf{R}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{B}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{R}_{\mathbf{0}}^{\mathbf{- 1}}\)
  13.       \(\mathcal{H}_{\mathbf{0}}\mathbf{=}\mathfrak{H}_{\mathbf{0}}^{\mathbf{\sim}}\)
  14.       \(\mathbf{for\ o = 1,2\ldots,s}\)
  15.       \(\mathbf{Givens\ rotation\ \lbrack o,}\mathcal{H}_{\mathbf{0}}\mathbf{,\zeta\rbrack}\)
  16.       \(\mathbf{{if\ }}\left\| \mathbf{\zeta}_{\mathbf{o + 1}} \right\|\mathbf{\ /\ }\left\| \mathbf{b} \right\|\mathbf{{\ small\ enough\ then}}\)
  17.         \(\mathbf{update\ \ solution\ \ vector\lbrack s,k,o,}{\acute{\mathbf{V}}}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{,}{\acute{\mathbf{Q}}}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{,}\mathbf{R}_{\mathbf{0}}\mathbf{,}\mathcal{H}_{\mathbf{0}}\mathbf{,\zeta,}\mathbf{x}_{\mathbf{0}}\mathbf{\ \rbrack}\)
  18.         \(\mathbf{{quit}}\)
  19.       \(\mathbf{{endif}}\)
  20.     \(\mathbf{{else}}\)
  21.       \({\acute{\mathfrak{R}}}_{\mathbf{k - 1,k}}^{\mathbf{\sim}}\mathbf{=}\left( \mathfrak{Q}_{\mathbf{k - 1}}^{\mathbf{\sim}} \right)^{\mathbf{H}}{\acute{\mathbf{V}}}_{\mathbf{k}}^{\mathbf{\sim}}\)
  22.       \({\acute{\mathbf{V}}}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{=}{\acute{\mathbf{V}}}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{-}\mathfrak{Q}_{\mathbf{k - 1}}^{\mathbf{\sim}}{\acute{\mathfrak{R}}}_{\mathbf{k - 1,k}}^{\mathbf{\sim}}\)
  23.       \(\mathbf{{QR\ decomposition\ }}{\acute{\mathbf{V}}}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{=}{\acute{\mathbf{Q}}}_{\mathbf{k}}^{\mathbf{\sim}}{\acute{\mathbf{R}}}_{\mathbf{k}}^{\mathbf{\sim}}\)
  24.       \(\mathbf{R}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{=}\begin{pmatrix} \mathbf{R}_{\mathbf{k}} & \mathbf{z}_{\mathbf{k}} \\ \mathbf{0}_{\mathbf{1,k}} & \mathbf{\rho} \\ \end{pmatrix}\)
  25.       \(\mathfrak{H}_{\mathbf{k - 1,k}}^{\mathbf{\sim}}\mathbf{= -}\mathfrak{H}_{\mathbf{k - 1}}\mathfrak{R}_{\mathbf{k - 1,k}}\mathbf{R}_{\mathbf{k}}^{\mathbf{- 1}}\mathbf{+}\mathfrak{R}_{\mathbf{k - 1,k}}^{\mathbf{\sim}}\mathbf{B}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{R}_{\mathbf{k}}^{\mathbf{- 1}}\)
  26.       \(\mathbf{H}_{\mathbf{k}}\mathbf{=}\mathbf{R}_{\mathbf{k}}\mathbf{B}_{\mathbf{k}}\mathbf{R}_{\mathbf{k}}^{\mathbf{- 1}}\mathbf{+}{\widetilde{\mathbf{\rho}}}_{\mathbf{k}}^{\mathbf{- 1}}\mathbf{b}_{\mathbf{k}}\mathbf{z}_{\mathbf{k}}\mathbf{e}_{\mathbf{s}}^{\mathbf{T}}\mathbf{-}\mathbf{h}_{\mathbf{k - 1}}\mathbf{e}_{\mathbf{1}}\mathbf{e}_{\mathbf{s(k - 1)}}^{\mathbf{T}}\mathfrak{R}_{\mathbf{k - 1,k}}\mathbf{R}_{\mathbf{k}}^{\mathbf{- 1}}\)
  27.       \(\mathbf{h}_{\mathbf{k}}\mathbf{=}{\widetilde{\mathbf{\rho}}}_{\mathbf{k}}^{\mathbf{- 1}}\mathbf{\rho}_{\mathbf{k}}\mathbf{b}_{\mathbf{k}}\)
  28.       \(\mathfrak{H}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{=}\begin{pmatrix} \mathfrak{H}_{\mathbf{k - 1}} & \mathfrak{H}_{\mathbf{k - 1,k}}^{\mathbf{\sim}} \\ \begin{matrix} \mathbf{h}_{\mathbf{k - 1}}\mathbf{e}_{\mathbf{1}}\mathbf{e}_{\mathbf{s(k - 1)}}^{\mathbf{T}} \\ \mathbf{0}_{\mathbf{1,s(k - 1)}} \\ \end{matrix} & \begin{matrix} \mathbf{H}_{\mathbf{k}} \\ \mathbf{h}_{\mathbf{k}}\mathbf{e}_{\mathbf{s}}^{\mathbf{T}} \\ \end{matrix} \\ \end{pmatrix}\)
  29.       \(\mathcal{H}_{\mathbf{k}}\mathbf{=}\begin{pmatrix} \mathfrak{H}_{\mathbf{k - 1,k}}^{\mathbf{\sim}} \\ \mathbf{H}_{\mathbf{k}} \\ \mathbf{h}_{\mathbf{k}}\mathbf{e}_{\mathbf{s}}^{\mathbf{T}} \\ \end{pmatrix}\)
  30.       \(\mathbf{for\ o = 1 + sk,\ldots,s(k + 1)}\)
  31.         \(\mathbf{Givens\ rotation\ \lbrack o,}\mathcal{H}_{\mathbf{k}}\mathbf{,\zeta\rbrack}\)
  32.         \(\mathbf{{if\ }}\left\| \mathbf{\zeta}_{\mathbf{o + 1}} \right\|\mathbf{\ /\ }\left\| \mathbf{b} \right\|\mathbf{{\ small\ enough\ then}}\)
  33.           \(\mathbf{update\ \ solution\ \ vector\lbrack s,k,o,}{\acute{\mathbf{V}}}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{,}\mathfrak{Q}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{,}{\acute{\mathbf{R}}}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{,}\mathcal{H}_{\mathbf{k}}\mathbf{,\zeta,}\mathbf{x}_{\mathbf{0}}\mathbf{\ \rbrack}\)
  34.           \(\mathbf{{quit}}\)
  35.         \(\mathbf{{endif}}\)
  36.       \(\mathbf{{end}}\)
  37.     \(\mathbf{{endif}}\)
  38.   \(\mathbf{{end}}\)
  39.   \(\mathbf{update\ \ solution\ \ vector\lbrack s,t - 1,st,}{\acute{\mathbf{V}}}_{\mathbf{t - 1}}^{\mathbf{\sim}}\mathbf{,}\mathfrak{Q}_{\mathbf{t - 1}}^{\mathbf{\sim}}\mathbf{,}{\acute{\mathbf{R}}}_{\mathbf{t - 1}}^{\mathbf{\sim}}\mathbf{,}\mathcal{H}_{\mathbf{t - 1}}\mathbf{,\zeta,}\mathbf{x}_{\mathbf{0}}\mathbf{\ \rbrack}\)
  40. \(\mathbf{{end}}\)

  1. \(\mathbf{{if\ compute\ Newton\ Basis}}\)
  2.   \(\mathbf{i = 0}\)
  3.   \(\mathbf{{while}}\left( \mathbf{i \leq s - 1} \right)\)
  4.     \(\mathbf{{if}}\left( \mathbf{i.eq.s - 1} \right)\mathbf{{then}}\)
  5.       \(\mathbf{B}_{\mathbf{i,i}}\mathbf{= Re\lbrack}\mathbf{E}_{\mathbf{i}}\mathbf{\rbrack}\)
  6.     \(\mathbf{{else}}\)
  7.       \(\mathbf{{if}}\left( \mathbf{{Im}}\left\lbrack \mathbf{E}_{\mathbf{i}} \right\rbrack\mathbf{.eq.0} \right)\mathbf{{then}}\)
  8.         \(\mathbf{B}_{\mathbf{i,i}}\mathbf{= Re\lbrack}\mathbf{E}_{\mathbf{i}}\mathbf{\rbrack}\)
  9.       \(\mathbf{{else}}\)
  10.         \(\mathbf{B}_{\mathbf{i,i}}\mathbf{= Re\lbrack}\mathbf{E}_{\mathbf{i}}\mathbf{\rbrack}\)
  11.         \(\mathbf{B}_{\mathbf{i + 1,i + 1}}\mathbf{= Re\lbrack}\mathbf{E}_{\mathbf{i}}\mathbf{\rbrack}\)
  12.         \(\mathbf{B}_{\mathbf{i,i + 1}}\mathbf{= -}\left( \mathbf{Im\lbrack}\mathbf{E}_{\mathbf{i}}\mathbf{\rbrack} \right)^{\mathbf{2}}\)
  13.         \(\mathbf{i = i + 1}\)
  14.       \(\mathbf{{endif}}\)
  15.     \(\mathbf{{endif}}\)
  16.     \(\mathbf{i = i + 1}\)
  17.   \(\mathbf{{end\ \ }}\)
  18. \(\mathbf{{end\ \ }}\)

The elements of the matrix \(\mathbf{B}\) are represented as \(\mathbf{b}_{\mathbf{k,i}}\).

  1. \(\mathbf{for\ k = 1,2,3\ldots s}\)
  2.   \(\mathbf{{solve\ }}{\widehat{\mathbf{v}}}_{\mathbf{k - 1}}\mathbf{{\ from\ K}}{\widehat{\mathbf{v}}}_{\mathbf{k - 1}}\mathbf{=}\mathbf{v}_{\mathbf{k - 1}}\)
  3.   \(\mathbf{if(k.ne.1)then}\)
  4.     \(\mathbf{\alpha =}\mathbf{b}_{\mathbf{k - 1,k - 1}}\)
  5.     \(\mathbf{\beta =}\mathbf{b}_{\mathbf{k - 2,k - 1}}\)
  6.     \(\mathbf{v}_{\mathbf{k}}\mathbf{= A}{\widehat{\mathbf{v}}}_{\mathbf{k - 1}}\mathbf{- \alpha}\mathbf{v}_{\mathbf{k - 1}}\mathbf{+ \beta}\mathbf{v}_{\mathbf{k - 2}}\)
  7.   \(\mathbf{{else}}\)
  8.     \(\mathbf{\alpha =}\mathbf{b}_{\mathbf{k - 1,k - 1}}\)
  9.     \(\mathbf{v}_{\mathbf{k}}\mathbf{= A}{\widehat{\mathbf{v}}}_{\mathbf{k - 1}}\mathbf{- \alpha}\mathbf{v}_{\mathbf{k - 1}}\)
  10.   \(\mathbf{{endif}}\)
  11. \(\mathbf{{en}}\mathbf{d}\)

The elements of the matrix \(\mathcal{H}\) are represented as \(\mathcal{h}_{\mathbf{k,i}}\).

  1. \(\mathbf{for\ k = 1,2,3\ldots i - 1}\)
  2.   \(\begin{pmatrix} \mathcal{h}_{\mathbf{k,i}} \\ \mathcal{h}_{\mathbf{k + 1,i}} \\ \end{pmatrix}\mathbf{=}\begin{pmatrix} \mathbf{c}_{\mathbf{k}} & \mathbf{- s}_{\mathbf{k}} \\ \mathbf{s}_{\mathbf{k}} & \mathbf{c}_{\mathbf{k}} \\ \end{pmatrix}\begin{pmatrix} \mathcal{h}_{\mathbf{k,i}} \\ \mathcal{h}_{\mathbf{k + 1,i}} \\ \end{pmatrix}\)
  3. \(\mathbf{\text{end}}\)
  4. \(\mathbf{c}_{\mathbf{i}}\mathbf{=}\frac{\mathbf{1}}{\sqrt{\mathbf{1 +}\left( \frac{\mathcal{h}_{\mathbf{i + 1,i}}}{\mathcal{h}_{\mathbf{i,i}}} \right)^{\mathbf{2}}}}\)
  5. \(\mathbf{s}_{\mathbf{i}}\mathbf{= -}\frac{\mathcal{h}_{\mathbf{i + 1,i}}}{\mathcal{h}_{\mathbf{i,i}}}\frac{\mathbf{1}}{\sqrt{\mathbf{1 +}\left( \frac{\mathcal{h}_{\mathbf{i + 1,i}}}{\mathcal{h}_{\mathbf{i,i}}} \right)^{\mathbf{2}}}}\)
  6. \(\begin{pmatrix} \mathbf{\zeta}_{\mathbf{i}} \\ \mathbf{\zeta}_{\mathbf{i + 1}} \\ \end{pmatrix}\mathbf{=}\begin{pmatrix} \mathbf{c}_{\mathbf{i}} & \mathbf{- s}_{\mathbf{i}} \\ \mathbf{s}_{\mathbf{i}} & \mathbf{c}_{\mathbf{i}} \\ \end{pmatrix}\begin{pmatrix} \mathbf{\zeta}_{\mathbf{i}} \\ \mathbf{\zeta}_{\mathbf{i + 1}} \\ \end{pmatrix}\)
  7. \(\mathcal{h}_{\mathbf{i,i}}\mathbf{=}\mathbf{c}_{\mathbf{i}}\mathcal{h}_{\mathbf{i,i}}\mathbf{- \ }\mathbf{s}_{\mathbf{i}}\mathcal{h}_{\mathbf{i + 1,i}}\)
  8. \(\mathcal{h}_{\mathbf{i + 1,i}}\mathbf{= 0}\)

  1. \(\mathbf{y =}\mathcal{H}^{\mathbf{- 1}}\mathbf{\zeta}\)
  2. \(\mathbf{{solve\ }}\widehat{\mathbf{x}}\mathbf{{\ from\ K}}\widehat{\mathbf{x}}\mathbf{=}\sum_{\mathbf{k = 0}}^{\mathbf{s}\mathbf{m}\mathbf{- 1}}\mathbf{Q}_{\mathbf{k}}\mathbf{y}_{\mathbf{k}}\mathbf{+}\sum_{\mathbf{l =}\mathbf{{sm}}}^{\mathbf{n - 1}}\mathbf{V}_{\mathbf{l}\mathbf{- sm}}\sum_{\mathbf{k =}\mathbf{{sm}}}^{\mathbf{n - 1}}{\mathbf{R}_{\mathbf{l - sm}\mathbf{,}\mathbf{k}\mathbf{- sm}}^{\mathbf{- 1}}\mathbf{y}_{\mathbf{k}}}\)
  3. \(\mathbf{x}_{\mathbf{0}}\mathbf{=}\mathbf{x}_{\mathbf{0}}\mathbf{+}\widehat{\mathbf{x}}\)


1.5 Preconditioned Chebyshev basis Conjugate Gradient Method [5,6]

The Chebyshev basis Conjugate Gradient (CBCG) method is an iterative Krylov subspace algorithm for solving linear systems with symmetric matrices. The CBCG method computes \(k\) iterations of the CG method in one iteration, and thereby reduces the number of global collective communication calls. The largest and smallest eigenvalues are needed to generate the Chebyshev basis. PARCEL provides two methods to obtain these eigenvalues: the power method and the communication avoiding Arnoldi method.

  1. \(\mathbf{{Compute\ }}\mathbf{r}_{\mathbf{0}}\mathbf{= b - A}\mathbf{x}_{\mathbf{0}}\mathbf{{\ for\ some\ initial\ gue}}\mathbf{{ss\ }}\mathbf{x}_{\mathbf{- 1}}\mathbf{=}\mathbf{x}_{\mathbf{0}}\)
  2. \(\mathbf{Compute\ Chebyshev\ basis\ \lbrack}\mathbf{S}_{\mathbf{0}}\mathbf{,}{\mathbf{A}\mathbf{S}}_{\mathbf{0}}\mathbf{,}\mathbf{r}_{\mathbf{0}}\mathbf{,}\mathbf{\lambda}_{\mathbf{\max}}\mathbf{,}\mathbf{\lambda}_{\mathbf{\min}}\mathbf{\rbrack}\)
  3. \(\mathbf{Q}_{\mathbf{0}}\mathbf{=}\mathbf{S}_{\mathbf{0}}\)
  4. \(\mathbf{for\ i = 0,1,2,3\ldots}\)
  5. \(\mathbf{{Compute\ }}\mathbf{Q}_{\mathbf{i}}^{\mathbf{T}}\mathbf{A}\mathbf{Q}_{\mathbf{i}}\mathbf{\ ,\ }\mathbf{Q}_{\mathbf{i}}^{\mathbf{T}}\mathbf{r}_{\mathbf{{ik}}}\)
  6.   \(\mathbf{\alpha}_{\mathbf{i}}\mathbf{= \ }\left( \mathbf{Q}_{\mathbf{i}}^{\mathbf{T}}\mathbf{A}\mathbf{Q}_{\mathbf{i}} \right)^{\mathbf{- 1}}\left( \mathbf{Q}_{\mathbf{i}}^{\mathbf{T}}\mathbf{r}_{\mathbf{{ik}}} \right)\)
  7.   \(\mathbf{x}_{\left( \mathbf{i + 1} \right)\mathbf{k}}\mathbf{=}\mathbf{x}_{\mathbf{{ik}}}\mathbf{+}\mathbf{Q}_{\mathbf{i}}\mathbf{\alpha}_{\mathbf{i}}\)
  8.   \(\mathbf{r}_{\left( \mathbf{i + 1} \right)\mathbf{k}}\mathbf{=}\mathbf{r}_{\mathbf{{ik}}}\mathbf{- A}\mathbf{Q}_{\mathbf{i}}\mathbf{\alpha}_{\mathbf{i}}\)
  9.   \(\mathbf{{if\ }}\left\| \mathbf{r}_{\left( \mathbf{i + 1} \right)\mathbf{k}} \right\|\mathbf{\ /\ }\left\| \mathbf{b} \right\|\mathbf{{\ small\ enough\ then}}\)
  10.     \(\mathbf{{quit}}\)
  11.   \(\mathbf{{endif}}\)
  12.   \(\mathbf{Compute\ Chebyshev\ basis\ \lbrack}\mathbf{S}_{\mathbf{i + 1}}\mathbf{,}{\mathbf{A}\mathbf{S}}_{\mathbf{i + 1}}\mathbf{,}\mathbf{r}_{\left( \mathbf{i + 1} \right)\mathbf{k}}\mathbf{,}\mathbf{\lambda}_{\mathbf{\max}}\mathbf{,}\mathbf{\lambda}_{\mathbf{\min}}\mathbf{\rbrack}\)
  13.   \(\mathbf{{Compute\ }}\mathbf{Q}_{\mathbf{i}}^{\mathbf{T}}\mathbf{A}\mathbf{S}_{\mathbf{i + 1}}\)
  14.   \(\mathbf{B}_{\mathbf{i}}\mathbf{=}\left( \mathbf{Q}_{\mathbf{i}}^{\mathbf{T}}\mathbf{A}\mathbf{Q}_{\mathbf{i}} \right)^{\mathbf{- 1}}\left( \mathbf{Q}_{\mathbf{i}}^{\mathbf{T}}\mathbf{A}\mathbf{S}_{\mathbf{i + 1}} \right)\)
  15.   \(\mathbf{Q}_{\mathbf{i + 1}}\mathbf{=}\mathbf{S}_{\mathbf{i + 1}}\mathbf{-}\mathbf{Q}_{\mathbf{i}}\mathbf{B}_{\mathbf{i}}\)
  16.   \(\mathbf{A}\mathbf{Q}_{\mathbf{i + 1}}\mathbf{=}\mathbf{{AS}}_{\mathbf{i + 1}}\mathbf{-}\mathbf{{AQ}}_{\mathbf{i}}\mathbf{B}_{\mathbf{i}}\)
  17. \(\mathbf{{end}}\)

\(\mathbf{\lambda}_{\mathbf{\max}}\) and \(\mathbf{\lambda}_{\mathbf{\min}}\) represent the largest and smallest eigenvalues of \(\mathbf{AK}^{-1}\).

  1. \(\mathbf{\eta = 2\ /\ (}\mathbf{\lambda}_{\mathbf{\max}}\mathbf{-}\mathbf{\lambda}_{\mathbf{\min}}\mathbf{)}\)
  2. \(\mathbf{\zeta =}\left( \mathbf{\lambda}_{\mathbf{\max}}\mathbf{+}\mathbf{\lambda}_{\mathbf{\min}} \right)\mathbf{\ /\ (}\mathbf{\lambda}_{\mathbf{\max}}\mathbf{-}\mathbf{\lambda}_{\mathbf{\min}}\mathbf{)}\)
  3. \(\mathbf{s}_{\mathbf{0}}\mathbf{=}\mathbf{r}_{\mathbf{0}}\mathbf{\ }\)
  4. \(\mathbf{{solve\ }}{\widetilde{\mathbf{s}}}_{\mathbf{0}}\mathbf{{\ from\ K}}{\widetilde{\mathbf{s}}}_{\mathbf{0}}\mathbf{= \ }\mathbf{s}_{\mathbf{0}}\)
  5. \(\mathbf{s}_{\mathbf{1}}\mathbf{= \eta A}{\widetilde{\mathbf{s}}}_{\mathbf{0}}\mathbf{- \zeta}\mathbf{s}_{\mathbf{0}}\)
  6. \(\mathbf{{solve\ }}{\widetilde{\mathbf{s}}}_{\mathbf{1}}\mathbf{{\ from\ K}}{\widetilde{\mathbf{s}}}_{\mathbf{1}}\mathbf{= \ }\mathbf{s}_{\mathbf{1}}\)
  7. \(\mathbf{for\ j = 2,3,\ldots,k}\)
  8.   \(\mathbf{s}_{\mathbf{j}}\mathbf{= 2\eta A}{\widetilde{\mathbf{s}}}_{\mathbf{j - 1}}\mathbf{- 2\zeta}\mathbf{s}_{\mathbf{j - 1}}\mathbf{-}\mathbf{s}_{\mathbf{j - 2}}\)
  9.   \(\mathbf{{solve\ }}{\widetilde{\mathbf{s}}}_{\mathbf{j}}\mathbf{{\ from\ K}}{\widetilde{\mathbf{s}}}_{\mathbf{j}}\mathbf{= \ }\mathbf{s}_{\mathbf{j}}\)
  10. \(\mathbf{{end}}\)
  11. \(\mathbf{S = (}{\widetilde{\mathbf{s}}}_{\mathbf{0}}\mathbf{,}{\widetilde{\mathbf{s}}}_{\mathbf{1}}\mathbf{,\ldots,}{\widetilde{\mathbf{s}}}_{\mathbf{k - 1}}\mathbf{)}\)
  12. \(\mathbf{AS = (A}{\widetilde{\mathbf{s}}}_{\mathbf{0}}\mathbf{,}{\mathbf{A}\widetilde{\mathbf{s}}}_{\mathbf{1}}\mathbf{,\ldots,}{\mathbf{A}\widetilde{\mathbf{s}}}_{\mathbf{k - 1}}\mathbf{)}\)
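
As a brief worked note, the scaling parameters \(\eta\) and \(\zeta\) defined in steps 1 and 2 map the spectrum of \(AK^{-1}\) onto the interval \([-1,1]\), on which the Chebyshev polynomials are defined:

\[ \eta\lambda_{\min} - \zeta = -1, \qquad \eta\lambda_{\max} - \zeta = 1, \]

so that, in exact arithmetic, the basis vectors generated above satisfy \(s_{j} = T_{j}\!\left( \eta AK^{-1} - \zeta I \right) r_{0}\), where \(T_{j}\) denotes the Chebyshev polynomial of degree \(j\).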


1.6 Preconditioned Communication Avoiding Arnoldi Method [3]

The Preconditioned Communication Avoiding Arnoldi method (CA-Arnoldi(s,t)) is an eigenvalue solver for nonsymmetric matrices. CA-Arnoldi(s,t) performs the computation corresponding to \(s\) iterations of the Arnoldi method in a single iteration, and by repeating this \(t\) times, the eigenvalues and eigenvectors of the resulting Hessenberg matrix of dimension t\(\times\)s are computed. By applying communication avoiding QR factorization algorithms, the number of collective communications is reduced compared to the Arnoldi method.

\(\mathbf{e}\) denotes a unit vector, \({\widetilde{\mathbf{\rho}}}_{\mathbf{k}}\) denotes the element in the \(s\)-th row and \(s\)-th column of the matrix \(\mathbf{R}_{\mathbf{k}}\), and \(\mathbf{E}\) denotes the eigenvalues obtained from the Hessenberg matrix generated by the iteration process.

  1. \(\mathbf{B}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{= \lbrack}\mathbf{e}_{\mathbf{2}}\mathbf{,}\mathbf{e}_{\mathbf{3}}\mathbf{,\ldots,}\mathbf{e}_{\mathbf{s + 1}}\mathbf{\rbrack}\)
  2.   \(\mathbf{r = }\mathbf{x}_{\mathbf{0}}\)
  3.   \(\mathbf{\zeta =}\left( \left\| \mathbf{r} \right\|\mathbf{,0,\ldots,0} \right)^{\mathbf{T}}\)
  4.   \(\mathbf{for\ k = 0,1\ldots,t - 1}\)
  5.     \(\mathbf{Fix\ basis\ conversion\ matrix\ \lbrack}\mathbf{B}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{,E\rbrack}\)
  6.     \(\mathbf{Compute\ basis\ vectors\ \lbrack\ s,}{\acute{\mathbf{V}}}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{,}\mathbf{B}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{\ \rbrack}\)
  7.     \(\mathbf{if(k.eq.0)}\)
  8.       \(\mathbf{{QR\ decomposition\ }}\mathbf{V}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{=}\mathbf{Q}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{R}_{\mathbf{0}}^{\mathbf{\sim}}\)
  9.       \(\mathfrak{Q}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{=}\mathbf{Q}_{\mathbf{0}}^{\mathbf{\sim}}\)
  10.       \(\mathfrak{H}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{=}\mathbf{R}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{B}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{R}_{\mathbf{0}}^{\mathbf{- 1}}\)
  11.       \(\mathcal{H}_{\mathbf{0}}\mathbf{=}\mathfrak{H}_{\mathbf{0}}^{\mathbf{\sim}}\)
  12.       \(\mathbf{for\ o = 1,2\ldots,s}\)
  13.       \(\mathbf{Givens\ rotation\ \lbrack o,}\mathcal{H}_{\mathbf{0}}\mathbf{,\zeta\rbrack}\)
  14.       \(\mathbf{{if\ }}\left\| \mathbf{\zeta}_{\mathbf{o + 1}} \right\|\mathbf{\ /\ }\left\| \mathbf{b} \right\|\mathbf{{\ small\ enough\ then}}\)
  15.         \(\mathbf{{quit}}\)
  16.       \(\mathbf{{endif}}\)
  17.     \(\mathbf{{else}}\)
  18.       \({\acute{\mathfrak{R}}}_{\mathbf{k - 1,k}}^{\mathbf{\sim}}\mathbf{=}\left( \mathfrak{Q}_{\mathbf{k - 1}}^{\mathbf{\sim}} \right)^{\mathbf{H}}{\acute{\mathbf{V}}}_{\mathbf{k}}^{\mathbf{\sim}}\)
  19.       \({\acute{\mathbf{V}}}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{=}{\acute{\mathbf{V}}}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{-}\mathfrak{Q}_{\mathbf{k - 1}}^{\mathbf{\sim}}{\acute{\mathfrak{R}}}_{\mathbf{k - 1,k}}^{\mathbf{\sim}}\)
  20.       \(\mathbf{{QR\ decomposition\ }}{\acute{\mathbf{V}}}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{=}{\acute{\mathbf{Q}}}_{\mathbf{k}}^{\mathbf{\sim}}{\acute{\mathbf{R}}}_{\mathbf{k}}^{\mathbf{\sim}}\)
  21.       \(\mathbf{R}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{=}\begin{pmatrix} \mathbf{R}_{\mathbf{k}} & \mathbf{z}_{\mathbf{k}} \\ \mathbf{0}_{\mathbf{1,k}} & \mathbf{\rho} \\ \end{pmatrix}\)
  22.       \(\mathfrak{H}_{\mathbf{k - 1,k}}^{\mathbf{\sim}}\mathbf{= -}\mathfrak{H}_{\mathbf{k - 1}}\mathfrak{R}_{\mathbf{k - 1,k}}\mathbf{R}_{\mathbf{k}}^{\mathbf{- 1}}\mathbf{+}\mathfrak{R}_{\mathbf{k - 1,k}}^{\mathbf{\sim}}\mathbf{B}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{R}_{\mathbf{k}}^{\mathbf{- 1}}\)
  23.       \(\mathbf{H}_{\mathbf{k}}\mathbf{=}\mathbf{R}_{\mathbf{k}}\mathbf{B}_{\mathbf{k}}\mathbf{R}_{\mathbf{k}}^{\mathbf{- 1}}\mathbf{+}{\widetilde{\mathbf{\rho}}}_{\mathbf{k}}^{\mathbf{- 1}}\mathbf{b}_{\mathbf{k}}\mathbf{z}_{\mathbf{k}}\mathbf{e}_{\mathbf{s}}^{\mathbf{T}}\mathbf{-}\mathbf{h}_{\mathbf{k - 1}}\mathbf{e}_{\mathbf{1}}\mathbf{e}_{\mathbf{s(k - 1)}}^{\mathbf{T}}\mathfrak{R}_{\mathbf{k - 1,k}}\mathbf{R}_{\mathbf{k}}^{\mathbf{- 1}}\)
  24.       \(\mathbf{h}_{\mathbf{k}}\mathbf{=}{\widetilde{\mathbf{\rho}}}_{\mathbf{k}}^{\mathbf{- 1}}\mathbf{\rho}_{\mathbf{k}}\mathbf{b}_{\mathbf{k}}\)
  25.       \(\mathfrak{H}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{=}\begin{pmatrix} \mathfrak{H}_{\mathbf{k - 1}} & \mathfrak{H}_{\mathbf{k - 1,k}}^{\mathbf{\sim}} \\ \begin{matrix} \mathbf{h}_{\mathbf{k - 1}}\mathbf{e}_{\mathbf{1}}\mathbf{e}_{\mathbf{s(k - 1)}}^{\mathbf{T}} \\ \mathbf{0}_{\mathbf{1,s(k - 1)}} \\ \end{matrix} & \begin{matrix} \mathbf{H}_{\mathbf{k}} \\ \mathbf{h}_{\mathbf{k}}\mathbf{e}_{\mathbf{s}}^{\mathbf{T}} \\ \end{matrix} \\ \end{pmatrix}\)
  26.       \(\mathcal{H}_{\mathbf{k}}\mathbf{=}\begin{pmatrix} \mathfrak{H}_{\mathbf{k - 1,k}}^{\mathbf{\sim}} \\ \mathbf{H}_{\mathbf{k}} \\ \mathbf{h}_{\mathbf{k}}\mathbf{e}_{\mathbf{s}}^{\mathbf{T}} \\ \end{pmatrix}\)
  27.       \(\mathbf{for\ o = 1 + sk,\ldots,s(k + 1)}\)
  28.         \(\mathbf{Givens\ rotation\ \lbrack o,}\mathcal{H}_{\mathbf{k}}\mathbf{,\zeta\rbrack}\)
  29.         \(\mathbf{{if\ }}\left\| \mathbf{\zeta}_{\mathbf{o + 1}} \right\|\mathbf{\ /\ }\left\| \mathbf{b} \right\|\mathbf{{\ small\ enough\ then}}\)
  30.           \(\mathbf{{quit}}\)
  31.         \(\mathbf{{endif}}\)
  32.       \(\mathbf{{end}}\)
  33.     \(\mathbf{{endif}}\)
  34.   \(\mathbf{{end}}\)
  35.   \(\mathbf{Eigenvalue\ problem\ \lbrack}\mathfrak{H}_{\mathbf{t - 1}}^{\mathbf{\sim}}\mathbf{,E,z\rbrack}\)
  36.   \(\mathbf{Eigenvectors\ X =}\mathfrak{Q}_{\mathbf{t - 1}}^{\mathbf{\sim}}\mathbf{z}\)


2 Preconditioning

In iterative algorithms, the application of a preconditioner of the form \(K \approx A\) can help to reduce the number of iterations until convergence. The notation [\({{solve\ }}\widehat{{p}}{{\ from\ K}}\widehat{{p}}{= \ }{p}\)] means to approximately solve \({A}\widehat{p}{= \ }{p}\) with respect to the vector \(\widehat{p}\). Although a more accurate approximation of \(A\) reduces the number of iterations until convergence, it normally increases the cost of applying the preconditioner. In non-parallel computing, one of the most common preconditioners is the incomplete LU factorization (ILU). However, in parallel computing, parallel preconditioners are needed. PARCEL provides the following parallel preconditioners: Point Jacobi, Block Jacobi, Additive Schwarz, and Neumann expansion.


2.1 Point Jacobi Preconditioner [1,2]

The point Jacobi preconditioner is one of the simplest preconditioners. \(K\) consists only of the diagonal elements of the matrix \(A\). Compared to other preconditioners, the efficiency of the point Jacobi preconditioner is very low. However, it can easily be applied in parallel, and it does not require any additional communication between processors.
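
A minimal sketch of applying this preconditioner (an illustration, not the PARCEL API): solving \(Ks = r\) reduces to an element-wise division by the diagonal of \(A\).

  ! Minimal sketch: apply the point Jacobi preconditioner K = diag(A),
  ! i.e. solve K s = r element-wise.
  subroutine point_jacobi_apply(n, diagA, r, s)
    implicit none
    integer, intent(in) :: n
    double precision, intent(in)  :: diagA(n), r(n)
    double precision, intent(out) :: s(n)
    integer :: i
    do i = 1, n
       s(i) = r(i) / diagA(i)   ! assumes non-zero diagonal entries
    end do
  end subroutine point_jacobi_apply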


2.2 Zero Fill-in Incomplete LU Factorization Preconditioners (ILU(0)) [1,2]

The LU factorization decomposes a square matrix \(A\) into a lower triangular matrix \(L\) and an upper triangular matrix \(U\). For a typical sparse matrix, the LU factors can be much less sparse than the original matrix; this effect is called fill-in. Fill-in increases the memory and computational requirements. In order to avoid this issue, PARCEL provides the zero fill-in incomplete LU factorization (ILU(0)).
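
In the standard formulation of ILU(0), the triangular factors are constrained to the non-zero pattern of \(A\):

\[ K = \tilde{L}\tilde{U} \approx A, \qquad \tilde{l}_{ij} = \tilde{u}_{ij} = 0 \quad \text{if} \quad a_{ij} = 0, \]

and the product \(\tilde{L}\tilde{U}\) is required to match \(A\) on that non-zero pattern.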

2.3 Diagonal Incomplete LU Factorization Preconditioners (D-ILU) [1,2]

The diagonal incomplete LU factorization (D-ILU) computes a diagonal matrix \(D\) for the input matrix \(A\), which is split into its diagonal part \(D_{A}\), its strictly lower triangular part \(L_{A}\), and its strictly upper triangular part \(U_{A}\): \[A = L_{A} + D_{A} + U_{A}\] The preconditioner \(K\) is constructed from \(L_{A}\), \(U_{A}\) and \(D\) (\(D \neq D_{A}\)) as: \[K = \left( D + L_{A} \right)D^{- 1}\left( D + U_{A} \right)\] Two different conditions can be used to construct the diagonal matrix \(D\): either the diagonal elements of \(K\) coincide with those of \(A\), or the sums of the elements in each row of \(K\) coincide with those of \(A\) (these correspond to the two D-ILU settings of the ilu_method parameter in Section 6). Given the chosen condition, \(D\) is computed accordingly.
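
As a sketch of the first condition (the diagonal elements of \(K\) coincide with those of \(A\)), the standard D-ILU construction determines the entries of \(D\) recursively as

\[ d_{ii} = a_{ii} - \sum_{j < i} a_{ij}\, d_{jj}^{-1}\, a_{ji}, \]

where only the non-zero entries of the \(i\)-th row of \(L_{A}\) contribute to the sum.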


2.4 Block Jacobi Preconditioner [1,2]

The Block Jacobi Preconditioner constructs \(K\) out of the diagonal blocks of the matrix \(A\). For each block, the incomplete ILU factorization is computed. When the preconditioner is applied, each block can be processed independently, resulting in a high level of parallelism. In addition, no communication is needed when each block is defined within a sub-matrix on each processor.


2.5 Additive Schwarz Preconditioner [1,2]

The Additive Schwarz Preconditioner constructs \(K\) out of the overlapping diagonal blocks of \(A\). For each block the incomplete LU factorization is computed. Compared to the Block Jacobi Preconditioner, the Additive Schwarz Preconditioner may require additional communication. PARCEL provides four different overlapping methods, BASIC, RESTRICT, INTERPOLATE and NONE.

Fig.1 Additive Schwarz Preconditioner with overlapping diagonal blocks

BASIC

Solve \(Ks=r\) with an extended \(r\) that includes the overlapping regions shared with the neighboring processes; \(s\) in the overlapping regions is determined by summing up the results from the neighboring regions.

Fig.2 BASIC

RESTRICT

  Solve \(Ks=r\) with an extended \(r\) that includes the overlapping regions shared with the neighboring processes; \(s\) in the overlapping regions is determined without taking account of the results from the neighboring regions.

Fig.3 RESTRICT

INTERPOLATE

Solve \(Ks=r\) with an extended \(r\) that does not include the overlapping regions shared with the neighboring processes; \(s\) in the overlapping regions is determined by summing up the results from the neighboring regions.

Fig.4 INTERPOLATE

NONE

Solve \(Ks=r\) with an extended \(r\) that does not include the overlapping regions shared with the neighboring processes; \(s\) in the overlapping regions is determined without taking account of the results from the neighboring regions.

Fig.5 NONE


2.6 Fine-block Preconditioner [7]

The fine-block preconditioner generates a preconditioning matrix so that SIMD operations can be applied to the incomplete LU factorization. In each block of the block Jacobi preconditioner or the additive Schwarz preconditioner, fine diagonal blocks are defined. By ignoring the off-diagonal elements only within the fine diagonal blocks, data dependencies are eliminated and SIMD operations are enabled. In Fig.6, 3 \(\times\) 3 fine blocks are defined within the 9 \(\times\) 9 blocks of the block Jacobi preconditioner; data dependencies are eliminated by ignoring the off-diagonal elements only within the fine blocks (shown in white), and three vector elements are processed by SIMD operations.

Fig.6 Subdividing preconditioning (block Jacobi preconditioner and subdivided block Jacobi preconditioner)


2.7 Neumann Series Preconditioner [2]

The diagonal scaling of the matrix \(A\) yields \(\bar{A}\). In the Neumann series preconditioner, the preconditioning matrix \(K\) is defined by an approximate inverse of \(\bar{A}\) using the Neumann series.

\[ \bar{A}^{-1}\simeq K^{-1}=[I+F+F^2+F^3+...] \]

Here, \(\bar{A}=I-F\). This algorithm can be implemented using sparse matrix-vector products (SpMV) and has high parallel efficiency. In addition, by combining it with the block Jacobi preconditioner, a block Neumann series preconditioner is constructed that avoids inter-process communication in the SpMV. These preconditioners are available only in the GPU version.
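
A truncated series of order \(m\) can be applied with \(m\) SpMV operations through a Horner-type recurrence (a standard evaluation shown here for illustration; the truncation order corresponds to the Neuman_itrmax parameter of the solver routines): starting from \(z_{0} = r\),

\[ z_{k} = r + F z_{k-1} \quad (k = 1,\ldots,m), \qquad z_{m} = \sum_{k=0}^{m} F^{k} r \;\simeq\; \bar{A}^{-1} r. \]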

3 QR factorization

This section describes the QR decomposition algorithms that can be used in this library.


3.1 Classical Gram-Schmidt (CGS) Method [1,2]

The QR factorization by the classical Gram-Schmidt orthogonalization. This algorithm has high parallelism but poor orthogonality.

  1. \(\mathbf{f}\mathbf{or\ i = 1,\ldots,n\ do}\)
  2.   \(\mathbf{for\ k = 1,i - 1\ do}\)
  3.     \(\mathbf{R}_{\mathbf{k,i}}\mathbf{= <}\mathbf{Q}_{\mathbf{k}}\mathbf{,}\mathbf{V}_{\mathbf{i}}\mathbf{>}\)
  4.   \(\mathbf{{enddo\ }}\)
  5.   \(\mathbf{for\ k = 1,i - 1\ do}\)
  6.     \(\mathbf{V}_{\mathbf{i}}\mathbf{=}\mathbf{V}_{\mathbf{i}}\mathbf{-}\mathbf{R}_{\mathbf{k,i}}\mathbf{Q}_{\mathbf{k}}\)
  7.     \(\mathbf{{enddo\ }}\)
  8.   \(\mathbf{R}_{\mathbf{i,i}}\mathbf{=}\sqrt{\mathbf{<}\mathbf{V}_{\mathbf{i}}\mathbf{,}\mathbf{V}_{\mathbf{i}}\mathbf{>}}\)
  9.   \(\mathbf{Q}_{\mathbf{i}}\mathbf{=}\frac{\mathbf{1}}{\left\| \mathbf{V}_{\mathbf{i}} \right\|}\mathbf{V}_{\mathbf{i}}\)
  10. \(\mathbf{{enddo\ }}\)


3.2 Modified Gram-Schmidt (MGS) Method [1,2]

The QR factorization by a modified version of the classical Gram-Schmidt algorithm that reduces rounding errors. This algorithm improves the orthogonality over the classical Gram-Schmidt method, but requires more collective communications.

  1. \(\mathbf{f}\mathbf{or\ i = 1,\ldots,n\ do}\)
  2.   \(\mathbf{for\ k = 1,i - 1\ do}\)
  3.     \(\mathbf{R}_{\mathbf{k,i}}\mathbf{= <}\mathbf{Q}_{\mathbf{k}}\mathbf{,}\mathbf{V}_{\mathbf{i}}\mathbf{>}\)
  4.     \(\mathbf{V}_{\mathbf{i}}\mathbf{=}\mathbf{V}_{\mathbf{i}}\mathbf{-}\mathbf{R}_{\mathbf{k,i}}\mathbf{Q}_{\mathbf{k}}\)
  5.   \(\mathbf{{enddo\ }}\)
  6.   \(\mathbf{R}_{\mathbf{i,i}}\mathbf{=}\sqrt{\mathbf{<}\mathbf{V}_{\mathbf{i}}\mathbf{,}\mathbf{V}_{\mathbf{i}}\mathbf{>}}\)
  7.   \(\mathbf{Q}_{\mathbf{i}}\mathbf{=}\frac{\mathbf{1}}{\left\| \mathbf{V}_{\mathbf{i}} \right\|}\mathbf{V}_{\mathbf{i}}\)
  8. \(\mathbf{{enddo\ }}\)


3.3 Tall Skinny QR (TSQR) Method [3]

TSQR is based on the Householder QR factorization shown below, and has good orthogonality. The sequential TSQR is used between threads, while the parallel TSQR is used between processes.

  1. \(\mathbf{for\ i = 1,\ldots,n\ do}\)
  2.   \(\mathbf{y}_{\mathbf{i}}\mathbf{(i:m) = A(i:m,i) -}\left\| \mathbf{A(i:m,i)} \right\|\mathbf{e}_{\mathbf{i}}\mathbf{(i:m)}\)
  3.   \(\mathbf{t}_{\mathbf{i}}\mathbf{=}\frac{\mathbf{2}}{\mathbf{<}\mathbf{y}_{\mathbf{i}}\left( \mathbf{i:m} \right)\mathbf{,\ }\mathbf{y}_{\mathbf{i}}\mathbf{(i:m) >}}\)
  4.   \(\mathbf{Q}_{\mathbf{i}}\mathbf{=}\left( \mathbf{I -}\mathbf{t}_{\mathbf{i}}\mathbf{y}_{\mathbf{i}}\mathbf{(i:m)}{\mathbf{y}_{\mathbf{i}}\mathbf{(i:m)}}^{\mathbf{T}} \right)\)
  5.   \(\mathbf{A(i:m,i:n) =}\mathbf{Q}_{\mathbf{i}}\mathbf{A(i:m,i:n)}\)
  6. \(\mathbf{{enddo}}\)
Sequential TSQR       Parallel TSQR


3.4 Cholesky QR Method [8]

The Cholesky QR factorization consists of a matrix product and the Cholesky factorization. This algorithm has high computational intensity and can be computed with one collective communication. However, its orthogonality is poor.

  1.  \(\mathbf{B =}\mathbf{V}^{\mathbf{T}}\mathbf{V}\)
  2.  \(\mathbf{R}^{\mathbf{T}}\mathbf{R = Cholesky\ decomposition(B)}\)
  3.  \(\mathbf{Q = V}\mathbf{R}^{\mathbf{- 1}}\)
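
A minimal sketch of these three steps using BLAS/LAPACK (an illustration for a dense m \(\times\) n input \(V\) under the above notation, not the PARCEL implementation):

  ! Minimal sketch of the Cholesky QR factorization of an m x n matrix V:
  !   B = V^T V,  B = R^T R (Cholesky),  Q = V R^{-1}.
  ! On return V is overwritten with Q; the upper triangle of R holds the
  ! triangular factor (its strictly lower triangle is not referenced).
  subroutine cholesky_qr(m, n, V, R, info)
    implicit none
    integer, intent(in) :: m, n
    double precision, intent(inout) :: V(m, n)
    double precision, intent(out)   :: R(n, n)
    integer, intent(out) :: info
    ! B = V^T V (stored in R)
    call dgemm('T', 'N', n, n, m, 1.0d0, V, m, V, m, 0.0d0, R, n)
    ! Cholesky factorization B = R^T R (upper triangle of R)
    call dpotrf('U', n, R, n, info)
    if (info /= 0) return
    ! Q = V R^{-1} (V is overwritten with Q)
    call dtrsm('R', 'U', 'N', 'N', m, n, 1.0d0, R, n, V, m)
  end subroutine cholesky_qr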


3.5 Cholesky QR2 Method [9]

The Cholesky QR2 factorization applies a second Cholesky QR factorization to the orthogonal matrix obtained by the first Cholesky QR factorization. This algorithm improves the orthogonality by executing the Cholesky QR factorization twice.

  1.  \(\mathbf{B}_{\mathbf{1}}\mathbf{=}\mathbf{V}^{\mathbf{T}}\mathbf{V}\)
  2.  \(\mathbf{R}_{\mathbf{1}}^{\mathbf{T}}\mathbf{R}_{\mathbf{1}}\mathbf{= Cholesky\ decomposition(}\mathbf{B}_{\mathbf{1}}\mathbf{)}\)
  3.  \(\mathbf{Q}_{\mathbf{1}}\mathbf{= V}\mathbf{R}_{\mathbf{1}}^{\mathbf{- 1}}\)
  4.  \(\mathbf{B}_{\mathbf{2}}\mathbf{=}\mathbf{Q}_{\mathbf{1}}^{\mathbf{T}}\mathbf{Q}_{\mathbf{1}}\)
  5.  \(\mathbf{R}_{\mathbf{2}}^{\mathbf{T}}\mathbf{R}_{\mathbf{2}}\mathbf{= Cholesky\ decomposition(}\mathbf{B}_{\mathbf{2}}\mathbf{)}\)
  6.  \(\mathbf{Q}_{\mathbf{2}}\mathbf{=}\mathbf{Q}_{\mathbf{1}}\mathbf{R}_{\mathbf{2}}^{\mathbf{- 1}}\)
  7.  \(\mathbf{V =}\mathbf{Q}_{\mathbf{1}}\mathbf{R}_{\mathbf{1}}\mathbf{=}\mathbf{Q}_{\mathbf{2}}\mathbf{R}_{\mathbf{2}}\mathbf{R}_{\mathbf{1}}\mathbf{= QR}\)
  8.  \(\mathbf{Q =}\mathbf{Q}_{\mathbf{2}}\)
  9.  \(\mathbf{R =}\mathbf{R}_{\mathbf{2}}\mathbf{R}_{\mathbf{1}}\)


4 Sparse Matrix Data Formats

In order to save memory in sparse matrix computation, only the non-zero elements and their locations are stored. PARCEL supports different sparse matrix formats.


4.1 Compressed Row Storage (CRS) Format

The CRS format compresses the row indices of the non-zero elements of the sparse matrix. Together with the compressed rows, the column index and the value of each non-zero element are stored in one-dimensional arrays. Compressed row indices are stored in increasing order.

\[ A = \left( \begin{array}{ccc} a & b & c & 0\\ 0 & d & 0 & 0\\ e & 0 & f & g\\ 0 & h & i & j \end{array} \right) \]


One process case

Values of non-zero elements :: \(crsA=\{a,b,c,d,e,f,g,h,i,j\}\)

Compressed row indices :: \(crsRow\_ptr=\{1,4,5,8,11\}\)

Column indices of non-zero elements :: \(crsCol=\{1,2,3,2,1,3,4,2,3,4\}\)


Two process case

Values of non-zero elements :: \(crsA=\{a,b,c,d\}\)

Compressed row indices :: \(crsRow\_ptr=\{1,4,5\}\)

Column indices of non-zero elements :: \(crsCol=\{1,2,3,2\}\)

Values of non-zero elements :: \(crsA=\{e,f,g,h,i,j\}\)

Compressed row indices :: \(crsRow\_ptr=\{1,4,7\}\)

Column indices of non-zero elements :: \(crsCol=\{1,3,4,2,3,4\}\)
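
For illustration, a sparse matrix-vector product \(y = Ax\) over the CRS arrays defined above can be written as follows (single-process case, 1-based indexing; a sketch, not the PARCEL implementation).

  ! Minimal sketch: y = A*x for a matrix stored in the CRS format
  ! using the arrays crsA, crsRow_ptr and crsCol shown above.
  subroutine crs_spmv(n, crsRow_ptr, crsCol, crsA, x, y)
    implicit none
    integer, intent(in) :: n
    integer, intent(in) :: crsRow_ptr(n+1), crsCol(*)
    double precision, intent(in)  :: crsA(*), x(*)
    double precision, intent(out) :: y(n)
    integer :: i, j
    do i = 1, n
       y(i) = 0.0d0
       do j = crsRow_ptr(i), crsRow_ptr(i+1) - 1
          y(i) = y(i) + crsA(j) * x(crsCol(j))
       end do
    end do
  end subroutine crs_spmv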



4.2 Diagonal (DIA) Format

The DIA format stores the non-zero elements of a matrix as diagonals. This format provides high performance for band block diagonal matrices. In addition to the element values, an offset value for each diagonal of the matrix is stored in a one-dimensional array. A negative offset indicates that the diagonal is below the main diagonal of the matrix, and a positive offset indicates that the diagonal is above the main diagonal. The main diagonal has offset zero. Offset values are stored in increasing order.

\[ A = \left( \begin{array}{ccc} a & b & c & 0\\ 0 & d & 0 & 0\\ e & 0 & f & g\\ 0 & h & i & j \end{array} \right) \]


Single process case

Elements :: \(diaA=\{0,0,e,h,0,0,0,i,a,d,f,j,b,0,g,0,c,0,0,0\}\)

Offsets :: \(offset=\{-2,-1,0,1,2\}\)

Number of diagonals :: \(nnd=5\)


Two process case

Elements :: \(diaA=\{a,d,b,0,c,0\}\)

Offsets :: \(offset=\{0,1,2\}\)

Number of diagonals :: \(nnd=3\)

Elements :: \(diaA=\{e,h,0,i,f,j,g,0\}\)

Offsets :: \(offset=\{-2,-1,0,1\}\)

Number of diagonals :: \(nnd=4\)
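
For illustration, a sparse matrix-vector product \(y = Ax\) over the DIA arrays defined above can be written as follows (single-process case; each diagonal occupies a length-n segment of diaA padded with zeros, as in the example; a sketch, not the PARCEL implementation).

  ! Minimal sketch: y = A*x for a matrix stored in the DIA format.
  ! Diagonal k occupies diaA((k-1)*n+1 : k*n); the entry in row i
  ! belongs to column i + offset(k).
  subroutine dia_spmv(n, nnd, offset, diaA, x, y)
    implicit none
    integer, intent(in) :: n, nnd
    integer, intent(in) :: offset(nnd)
    double precision, intent(in)  :: diaA(n*nnd), x(n)
    double precision, intent(out) :: y(n)
    integer :: i, k, jcol
    y(1:n) = 0.0d0
    do k = 1, nnd
       do i = 1, n
          jcol = i + offset(k)
          if (jcol >= 1 .and. jcol <= n) then
             y(i) = y(i) + diaA((k-1)*n + i) * x(jcol)
          end if
       end do
    end do
  end subroutine dia_spmv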



4.3 Domain Decomposition Method (DDM) Format

4.3.1 Domain Decomposition Parameters

The DDM format extends the DIA format for simulations based on stencil computations, such as finite difference methods with domain decomposition. By specifying the list of processes that exchange data with each other so that it matches the domain decomposition of the simulation, the communication needed for parallel processing of matrices given by one-, two-, and three-dimensional structured grids can be minimized. The related parameter settings are described below.

The number of neighboring processes for \(m\)-dimensional domain decomposition is given by \(3^{m}-1\). For example, in the two-dimensional case the number of neighboring processes is eight, while in the three-dimensional case it becomes \(26\).

The list of neighboring processes is specified by their ranks in the (one-dimensional) MPI communicator. At the boundaries of the computational domain, negative values are assigned to neighbor_ranks for the directions in which no neighboring process exists. The sample code below can be used as the standard way of setting neighbor_ranks. The parameters npe_x, npe_y and npe_z represent the number of processes in each direction (x, y, z), and rank_x, rank_y and rank_z represent the ranks in each direction. The definition of the one-dimensional index iirank must be consistent with that used in the simulation.


ixmax = 0; iymax = 0; izmax = 0; num = 0;
neighbor_ranks = -1
if ( npe_x > 1 ) ixmax=1
if ( npe_y > 1 ) iymax=1
if ( npe_z > 1 ) izmax=1
do iz=-izmax, izmax
  do iy=-iymax, iymax
    do ix=-ixmax, ixmax
      n1x = rank_x +ix; n1y = rank_y +iy; n1z = rank_z +iz;
      if ( ix*ix +iy*iy +iz*iz == 0 ) cycle
      num = num +1
      if ( n1z >=0 .and. n1z < npe_z ) then
        if ( n1y >=0 .and. n1y < npe_y ) then
          if ( n1x >=0 .and. n1x < npe_x ) then
            iirank = npe_x *npe_y *n1z +npe_x *n1y +n1x;
            neighbor_ranks(num) = iirank;
          endif
        endif
      endif
    end do
  end do
end do


The grid positions referred to in stencil computations such as finite differences are specified in each direction. For example, in the case of a seven-point stencil computation based on three-point finite differences in each direction for the three-dimensional Poisson equation, the grid offsets for the east, west, south, north, top, and bottom points relative to the base grid can be set as shown below.

i  ioff_grids(*,1)  ioff_grids(*,2)  ioff_grids(*,3)
 base 1 0 0 0
 east 2 1 0 0
 west 3 -1 0 0
 south 4 0 1 0
 north 5 0 -1 0
 top 6 0 0 1
 bottom 7 0 0 -1

\[ A = \left( \begin{array}{ccc} a & b & 0 & 0\\ c & d & e & 0\\ 0 & f & g & h\\ 0 & 0 & i & j \end{array} \right) \]


Single process case

Elements     :: \(val\_dia=\{a,d,g,j,b,e,h,0,0,c,f,i\}\)

Offsets     :: \(ioff\_dia=\{0,0,-1\}\)

Number of diagonals :: \(nnd=3\)

Stencils  ::

i  ioff_grids(*,1)
 reference 1 0
 east  2 1
 west  3 -1

Two process case

Elements     :: \(val\_dia=\{a,d,b,e,0,c\}\)

Offsets     :: \(ioff\_dia=\{0,0,-1\}\)

Number of diagonals :: \(nnd=3\)

Stencils  ::

i  ioff_grids(*,1)
 reference 1 0
 east  2 1
 west  3 -1

Elements     :: \(val\_dia=\{g,j,h,0,f,i\}\)

Offsets     :: \(ioff\_dia=\{0,0,0\}\)

Number of diagonals :: \(nnd=3\)

Stencils  ::

i  ioff_grids(*,1)
 reference 1 0
 east  2 1
 west  3 -1


5 Install

This chapter explains how to compile and install the PARCEL library.

5.1 Compiling the PARCEL library

PARCEL requires a C compiler, a Fortran compiler that supports a preprocessor, MPI, LAPACK, and OpenMP. When compiling with \(gfortran\), the option \(-cpp\) has to be specified to activate the preprocessor. Compiler options can be changed in the file arch/make_config. An example of a make_config file is shown below.

To compile, execute the make command in the top directory of the extracted archive. Then, execute “make install” to create the example, include, and lib directories in the directory specified by INSTALLDIR. If parcel_sparse.a has been created in the lib directory, the compilation was successful. For large-scale problems of 2 billion dimensions or more, 64-bit integers are required, so it is necessary to compile with 64-bit integer types. To compile the GPU version of PARCEL, set CUDAFLAG to “ON”. By setting CUDAAWARE to “ON”, a library that supports CUDA-Aware MPI is created. The cuBLAS and cuSPARSE libraries need to be linked.

  CUDAFLAG  = ON
  CUDAAWARE = ON
  CC     = mpiicc
  FC     = mpiifort
 
  CFLAGS  = -O3
  FFLAGS  = -O3 -fpp -qopenmp -xHost  -mcmodel=large
 
  FP_MODE = -fp-model precise
 
  INCLUDEDIR =
  LD_MPI     = -lmpifort
  LD_FORT    = -lifcore
  LD_LAPACK  = -mkl
  LD_CUDA    = -lcudart -lcublas -lcusparse
  NVCFLAGS  = \
            -arch=sm_80 -ccbin=${CC} -O3 -restrict  \
            --expt-extended-lambda -use_fast_math   \
            --default-stream per-thread  \
            --nvlink-options -Werror     \
            --maxrregcount=128  \
            --extra-device-vectorization  \
            --restrict \
             "-std=c++11"
 
  #FDEFINE = -DPARCEL_INT8
  INSTALLDIR = ../PARCEL_1.2

    Option     Explanation
CC C compiler command name
FC Fortran compiler command name
CFLAGS C compiler options
FFLAGS Fortran compiler options
FP_MODE Option to suppress optimizations with arithmetic order change (Only for quad precision)
CUDAFLAG CUDA compile flag (ON: enable, OFF: disable)
CUDAAWARE CUDA-Aware MPI compile flag (ON: enable, OFF: disable)
NVCFLAGS CUDA compiler options
INCLUDEDIR Include directories
LD_MPI MPI library
LD_LAPACK LAPACK library
LD_CUDA CUDA library(cuBLAS,cuSPARSE)
FDEFINE 64bit integer becomes available with -DPARCEL_INT8.
INSTALLDIR PARCEL installation directory


5.2 Usage of the PARCEL library

All the routines in the PARCEL library can be called by linking parcel_sparse.a which is generated in the directory lib.


6 PARCEL routines

The interface of each routine in the PARCEL library is described below. The following specifications are common to all routines.

The names of routines starting with ‘parcel_d’ and ‘parcel_dd’ denote PARCEL routines implemented in double precision and quadruple (double-double) precision, respectively.

The GPU version allocates memory on the GPU with the Fortran type(c_ptr). The subroutine PARCEL_Init must be called before executing any Krylov subspace solver subroutine in order to specify the GPU used by each process. The input matrix \(A\), the right hand side vector \(b\), and the initial vector \(x\) are also allocated in the GPU memory with the Fortran type(c_ptr). Sparse matrix-vector products in the CRS format are implemented using cuSPARSE (cusparseSpMV). The GPU version does not support the fine-block preconditioner, but the Neumann series preconditioner is available.

The communication-computation overlap is implemented in the following three modes. iovlflag=0 does not use any communication-computation overlap. iovlflag=1 uses the communication-computation overlap in which the computation is started after all non-blocking communications are launched at once. iovlflag=2 uses the communication-computation overlap in which the non-blocking communication and the computation in each direction are processed sequentially with barrier synchronization, which reuses and saves communication buffers. The communication-computation overlap is not supported in the GPU version.

When precon_thblock is zero or negative, or is greater than the number of threads per process, precon_thblock is set to the number of threads per process, which is the default value. One can reduce the number of blocks to enlarge the block size and improve the accuracy of the preconditioning, but this also leads to a degradation of the thread parallel performance. In the GPU version, one block per process is computed by the ILU factorization of cuSPARSE (cusparseDcsrsv2_solve).

When independ_nvec is zero or negative, the fine-block preconditioner is not applied. On A64FX, independ_nvec=300 is recommended to facilitate optimizations such as SIMD operations and software pipelining by the Fujitsu compiler. A larger independ_nvec improves the computational performance, while the accuracy of the preconditioning becomes worse. When independ_nvec is greater than the size of the vector on each process, n, the fine-block preconditioner becomes equivalent to the point Jacobi preconditioner.

The thread optimization of SpMV employs a cyclic loop division to facilitate the reuse of on-cache data between different threads. nBlock is the chunk size for this loop division. When nBlock is zero or negative, the default value of nBlock=2000 is used.
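
As an illustration of this chunked cyclic division (an assumption about the loop structure, not the PARCEL source), an OpenMP static schedule with chunk size nBlock distributes blocks of nBlock consecutive rows to the threads in a round-robin fashion:

  ! Minimal sketch: CRS SpMV row loop divided cyclically among threads
  ! in chunks of nBlock rows via an OpenMP static schedule.
  subroutine crs_spmv_cyclic(n, nBlock, crsRow_ptr, crsCol, crsA, x, y)
    implicit none
    integer, intent(in) :: n, nBlock
    integer, intent(in) :: crsRow_ptr(n+1), crsCol(*)
    double precision, intent(in)  :: crsA(*), x(*)
    double precision, intent(out) :: y(n)
    integer :: i, j
    !$omp parallel do schedule(static, nBlock) private(j)
    do i = 1, n
       y(i) = 0.0d0
       do j = crsRow_ptr(i), crsRow_ptr(i+1) - 1
          y(i) = y(i) + crsA(j) * x(crsCol(j))
       end do
    end do
    !$omp end parallel do
  end subroutine crs_spmv_cyclic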

When the i-th eigenvalue is a real value, the corresponding eigenvector is stored in Evec(n*(i-1)+1:n*i). When the i-th eigenvalue is a complex value and the (i+1)-th eigenvalue is its complex conjugate value, the i-th and (i+1)-th eigenvectors are stored as follows. The CA-Arnoldi method is not supported in the GPU version.

Eigenvector Real Part Imaginary Part
i-th   Evec(n*(i-1)+1:n*i)     Evec(n*i+1:n*(i+1))  
(i+1)-th   Evec(n*(i-1)+1:n*i)     -Evec(n*i+1:n*(i+1))  


6.1 PARCEL_Init


call PARCEL_Init(myrank)


 Initialize the PARCEL execution environment.

Parameter(dimension) Type Input/Output Description
myrank integer*4 in MPI process rank


6.2 PARCEL_Free


call PARCEL_Free()


 Terminates the PARCEL execution environment.


6.3 parcel_dcg (CPU/GPU)


call parcel_dcg( icomm, vecx,vecb, n,gn,nnz,istart, crsA,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, iret )


A simultaneous linear equation system Ax = b is solved by the conjugate gradient method (CG method). Non-zero elements are stored in CRS format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx(n) CPU : double precision / GPU : type(c_ptr) in/out in: initial vector, out: solution vector
vecb(n) CPU : double precision / GPU : type(c_ptr) in Right hand side vector
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
nnz integer*4 / integer*8 in Number of non-zero elements on each process
istart integer*4 / integer*8 in Start line of matrix on each process
crsA(nnz) CPU : double precision / GPU : type(c_ptr) in Non-zero elements of matrix stored in CRS format
crsRow_ptr(n+1) CPU : integer*4 / integer*8 / GPU : type(c_ptr) in Pointer table in CRS format
crsCol(nnz) CPU : integer*4 / integer*8 / GPU : type(c_ptr) in Column numbers of non-zero elements in CRS format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz, 4: Neumann series(GPU), 5: block Neumann series(GPU) )
ilu_method integer in Flag for incomplete LU factorization(0:ILU(0),1:D-ILU(DIA components match),2:D-ILU(Element sums in a row match),Only 1 and 2 are available for additive Schwartz)
addL integer in Overlapping width in the additive Schwartz method
iflagAS integer in Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
Neuman_itrmax integer in Order of Neumann series preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
iret integer out Error code (0:normal)
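
A minimal CPU usage sketch is shown below. It assumes a small symmetric tridiagonal test matrix assembled in the CRS format on a single MPI process and the point Jacobi preconditioner; the argument order follows the interface above, and the routine is referenced as an external procedure after linking parcel_sparse.a.

  program test_parcel_dcg
    use mpi
    implicit none
    integer, parameter :: n = 4, gn = 4, nnz = 10
    integer :: icomm, istart, ierr, myrank
    integer :: ipreflag, ilu_method, addL, iflagAS
    integer :: itrmax, iovlflag, precon_thblock, Neuman_itrmax
    integer :: independ_nvec, nBlock, iret
    double precision :: rtolmax
    double precision :: vecx(n), vecb(n), crsA(nnz)
    integer :: crsRow_ptr(n+1), crsCol(nnz)
    double precision, allocatable :: reshistory(:)

    call MPI_Init(ierr)
    icomm = MPI_COMM_WORLD
    call MPI_Comm_rank(icomm, myrank, ierr)

    ! 1D Laplacian: 2 on the diagonal, -1 on the off-diagonals (symmetric)
    crsRow_ptr = (/ 1, 3, 6, 9, 11 /)
    crsCol     = (/ 1, 2,  1, 2, 3,  2, 3, 4,  3, 4 /)
    crsA       = (/ 2.0d0, -1.0d0,  -1.0d0, 2.0d0, -1.0d0, &
                   -1.0d0, 2.0d0, -1.0d0,  -1.0d0, 2.0d0 /)
    istart = 1                  ! first row owned by this process
    vecb   = 1.0d0              ! right hand side vector
    vecx   = 0.0d0              ! initial guess

    ipreflag   = 1              ! point Jacobi preconditioner
    ilu_method = 0; addL = 0; iflagAS = 1
    itrmax  = 100; rtolmax = 1.0d-8
    allocate(reshistory(itrmax))
    iovlflag = 0; precon_thblock = 0; Neuman_itrmax = 0
    independ_nvec = 0; nBlock = 0

    call parcel_dcg( icomm, vecx, vecb, n, gn, nnz, istart,        &
                     crsA, crsRow_ptr, crsCol,                     &
                     ipreflag, ilu_method, addL, iflagAS,          &
                     itrmax, rtolmax, reshistory, iovlflag,        &
                     precon_thblock, Neuman_itrmax, independ_nvec, &
                     nBlock, iret )

    if (myrank == 0) print *, 'iterations =', itrmax, ' iret =', iret
    call MPI_Finalize(ierr)
  end program test_parcel_dcg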


6.4 parcel_dbicgstab (CPU/GPU)


call parcel_dbicgstab( icomm, vecx,vecb, n,gn,nnz,istart, crsA,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, iret )


A simultaneous linear equation system Ax = b is solved by the stabilized biconjugate gradient method (Bi-CGSTAB method). Non-zero elements are stored in CRS format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx(n) CPU : double precision / GPU : type(c_ptr) in/out in: initial vector, out: solution vector
vecb(n) CPU : double precision / GPU : type(c_ptr) in Right hand side vector
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
nnz integer*4 / integer*8 in Number of non-zero elements on each process
istart integer*4 / integer*8 in Start line of matrix on each process
crsA(nnz) CPU : double precision / GPU : type(c_ptr) in Non-zero elements of matrix stored in CRS format
crsRow_ptr(n+1) CPU : integer*4 / integer*8 / GPU : type(c_ptr) in Pointer table in CRS format
crsCol(nnz) CPU : integer*4 / integer*8 / GPU : type(c_ptr) in Column numbers of non-zero elements in CRS format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz, 4: Neumann series(GPU), 5: block Neumann series(GPU) )
ilu_method integer in Flag for incomplete LU factorization(0:ILU(0),1:D-ILU(DIA components match),2:D-ILU(Element sums in a row match),Only 1 and 2 are available for additive Schwartz)
addL integer in Overlapping width in the additive Schwartz method
iflagAS integer in Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
Neuman_itrmax integer in Order of Neumann series preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
iret integer out Error code (0:normal)
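
The following is a minimal CPU usage sketch for parcel_dbicgstab; it is not taken from the library's example programs. It builds a 1D Laplace-type tridiagonal test matrix in CRS format on a single MPI process and calls the solver with point Jacobi preconditioning. The program name, the test matrix, and all parameter values are illustrative; 1-based global numbering is assumed for istart and crsCol, and any library setup described in earlier chapters is omitted.

  program sketch_dbicgstab
    use mpi
    implicit none
    integer, parameter :: n = 100                  ! rows on this process
    integer :: gn, nnz, istart
    integer :: icomm, ierr, nprocs, myrank, i, k
    integer, allocatable :: crsRow_ptr(:), crsCol(:)
    double precision, allocatable :: crsA(:), vecx(:), vecb(:), reshistory(:)
    integer :: ipreflag, ilu_method, addL, iflagAS, itrmax, iovlflag
    integer :: precon_thblock, Neuman_itrmax, independ_nvec, nBlock, iret
    double precision :: rtolmax

    call MPI_Init(ierr)
    icomm = MPI_COMM_WORLD
    call MPI_Comm_size(icomm, nprocs, ierr)
    call MPI_Comm_rank(icomm, myrank, ierr)

    ! Single-process layout; with nprocs > 1 each rank holds its own row block.
    gn = n
    istart = 1                                     ! assumed 1-based start row
    nnz = 3*n - 2
    allocate(crsA(nnz), crsCol(nnz), crsRow_ptr(n+1), vecx(n), vecb(n))

    ! Tridiagonal test matrix: 2 on the diagonal, -1 on the off-diagonals.
    k = 0
    do i = 1, n
      crsRow_ptr(i) = k + 1
      if (i > 1) then
        k = k + 1; crsA(k) = -1.0d0; crsCol(k) = i - 1
      end if
      k = k + 1; crsA(k) = 2.0d0; crsCol(k) = i
      if (i < n) then
        k = k + 1; crsA(k) = -1.0d0; crsCol(k) = i + 1
      end if
    end do
    crsRow_ptr(n+1) = k + 1

    vecb(:) = 1.0d0                                ! right hand side
    vecx(:) = 0.0d0                                ! initial guess

    ipreflag   = 1                                 ! point Jacobi
    ilu_method = 0
    addL       = 0
    iflagAS    = 4                                 ! NONE (no additive Schwarz)
    itrmax     = 1000
    rtolmax    = 1.0d-8
    allocate(reshistory(itrmax))
    iovlflag       = 0
    precon_thblock = 1                             ! placeholder (block Jacobi / Schwarz only)
    Neuman_itrmax  = 0                             ! placeholder (Neumann series only)
    independ_nvec  = 0                             ! placeholder (subdivision preconditioning only)
    nBlock         = 2000                          ! placeholder

    call parcel_dbicgstab( icomm, vecx,vecb, n,gn,nnz,istart, &
         crsA,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, &
         itrmax,rtolmax, reshistory, iovlflag, precon_thblock, &
         Neuman_itrmax, independ_nvec, nBlock, iret )

    if (myrank == 0) print *, 'iterations =', itrmax, '  iret =', iret

    call MPI_Finalize(ierr)
  end program sketch_dbicgstab

On return, itrmax holds the number of iterations actually performed and reshistory(1:itrmax) the relative residual history.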


6.5 parcel_dgmres (CPU/GPU)


call parcel_dgmres( icomm, vecx,vecb, n,gn,nnz,istart, crsA,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, gmres_m,gmres_GSflag, iret )


A simultaneous linear equation system Ax = b is solved by the generalized minimum residual method (GMRES (m) method). Non-zero elements are stored in CRS format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx(n) CPU : double precision / GPU : type(c_ptr) in/out in: initial vector, out: solution vector
vecb(n) CPU : double precision / GPU : type(c_ptr) in Right hand side vector
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
nnz integer*4 / integer*8 in Number of non-zero elements on each process
istart integer*4 / integer*8 in Start line of matrix on each process
crsA(nnz) CPU : double precision / GPU : type(c_ptr) in Non-zero elements of matrix stored in CRS format
crsRow_ptr(n+1) CPU : integer*4 / integer*8 / GPU : type(c_ptr) in Pointer table in CRS format
crsCol(nnz) CPU : integer*4 / integer*8 / GPU : type(c_ptr) in Column numbers of non-zero elements in CRS format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz, 4: Neumann series (GPU), 5: block Neumann series (GPU))
ilu_method integer in Flag for incomplete LU factorization (0: ILU(0), 1: D-ILU (DIA components match), 2: D-ILU (element sums in a row match); only 1 and 2 are available for the additive Schwarz method)
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
Neuman_itrmax integer in Order of Neumann series preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
gmres_m integer in Number of iterations until restart
gmres_GSflag integer in Flag for orthogonalization algorithm (1: MGS, 2: CGS)
iret integer out Error code (0:normal)
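
To illustrate the two additional arguments, the fragment below reuses the CRS arrays and solver parameters from the parcel_dbicgstab sketch in 6.4; gmres_m and gmres_GSflag are assumed to be declared as integers, and the values shown are illustrative.

    gmres_m      = 30        ! restart the Krylov basis every 30 iterations
    gmres_GSflag = 1         ! orthogonalize with modified Gram-Schmidt (MGS)

    call parcel_dgmres( icomm, vecx,vecb, n,gn,nnz,istart, &
         crsA,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, &
         itrmax,rtolmax, reshistory, iovlflag, precon_thblock, &
         Neuman_itrmax, independ_nvec, nBlock, gmres_m,gmres_GSflag, iret )

A larger gmres_m reduces the number of restarts but increases memory use and the cost of orthogonalization.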


6.6 parcel_dcagmres (CPU/GPU)


call parcel_dcagmres( icomm, vecx,vecb, n,gn,nnz,istart, crsA,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, cagmres_sstep, cagmres_tstep, cagmres_basis, cagmres_QRflag, iret )


A simultaneous linear equation system Ax = b is solved by the communication reduced generalized minimum residual method (CA-GMRES (s, t) method). Non-zero elements are stored in CRS format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx(n) CPU : double precision / GPU : type(c_ptr) in/out in: initial vector, out: solution vector
vecb(n) CPU : double precision / GPU : type(c_ptr) in Right hand side vector
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
nnz integer*4 / integer*8 in Number of non-zero elements on each process
istart integer*4 / integer*8 in Start line of matrix on each process
crsA(nnz) CPU : double precision / GPU : type(c_ptr) in Non-zero elements of matrix stored in CRS format
crsRow_ptr(n+1) CPU : integer*4 / integer*8 / GPU : type(c_ptr) in Pointer table in CRS format
crsCol(nnz) CPU : integer*4 / integer*8 / GPU : type(c_ptr) in Column numbers of non-zero elements in CRS format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz, 4: Neumann series (GPU), 5: block Neumann series (GPU))
ilu_method integer in Flag for incomplete LU factorization (0: ILU(0), 1: D-ILU (DIA components match), 2: D-ILU (element sums in a row match); only 1 and 2 are available for the additive Schwarz method)
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
Neuman_itrmax integer in Order of Neumann series preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
cagmres_sstep integer in Number of communication avoiding steps s in the CA-GMRES method
cagmres_tstep integer in Number of outer iterations t in the CA-GMRES method (restart length = st)
cagmres_basis integer in Flag for basis vector in the CA-GMRES method (0: monomial basis, 1: Newton basis)
cagmres_QRflag integer in Flag for QR factorization in the CA-GMRES method (1: MGS, 2: CGS, 3: TSQR, 4: CholeskyQR, 5: CholeskyQR2)
iret integer out Error code (0:normal)
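
The fragment below, again reusing the CRS data and solver parameters from the sketch in 6.4, shows one way the communication-avoiding parameters might be set; the values are illustrative and the cagmres_* variables are assumed to be declared as integers.

    cagmres_sstep  = 5       ! s: basis vectors generated per communication-avoiding block
    cagmres_tstep  = 6       ! t: outer iterations, so the restart length is s*t = 30
    cagmres_basis  = 1       ! Newton basis (often better conditioned than the monomial basis)
    cagmres_QRflag = 3       ! TSQR

    call parcel_dcagmres( icomm, vecx,vecb, n,gn,nnz,istart, &
         crsA,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, &
         itrmax,rtolmax, reshistory, iovlflag, precon_thblock, &
         Neuman_itrmax, independ_nvec, nBlock, &
         cagmres_sstep, cagmres_tstep, cagmres_basis, cagmres_QRflag, iret )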


6.7 parcel_dcbcg (CPU/GPU)


call parcel_dcbcg( icomm, vecx,vecb, n,gn,nnz,istart, crsA,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, cbcg_kstep,cbcg_Eigenflag,power_method_itrmax, caarnoldi_sstep,caarnoldi_tstep, caarnoldi_basis,caarnoldi_QRflag, iret )


A simultaneous linear equation system Ax = b is solved by the Chebyshev basis conjugate gradient method (CBCG method). Non-zero elements are stored in CRS format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx(n) CPU : double precision / GPU : type(c_ptr) in/out in: initial vector, out: solution vector
vecb(n) CPU : double precision / GPU : type(c_ptr) in Right hand side vector
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
nnz integer*4 / integer*8 in Number of non-zero elements on each process
istart integer*4 / integer*8 in Start line of matrix on each process
crsA(nnz) CPU : double precision / GPU : type(c_ptr) in Non-zero elements of matrix stored in CRS format
crsRow_ptr(n+1) CPU : integer*4 / integer*8 / GPU : type(c_ptr) in Pointer table in CRS format
crsCol(nnz) CPU : integer*4 / integer*8 / GPU : type(c_ptr) in Column numbers of non-zero elements in CRS format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz, 4: Neumann series (GPU), 5: block Neumann series (GPU))
ilu_method integer in Flag for incomplete LU factorization (0: ILU(0), 1: D-ILU (DIA components match), 2: D-ILU (element sums in a row match); only 1 and 2 are available for the additive Schwarz method)
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
Neuman_itrmax integer in Order of Neumann series preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
cbcg_kstep integer in Number of communication avoiding steps
cbcg_Eigenflag integer in Flag for eigenvalue computation (1: power method, 2: CA-ARNOLDI)
power_method_itrmax integer in Maximum number of iterations in the power method
caarnoldi_sstep integer in Number of communication avoiding steps s of the CA-ARNOLDI method
caarnoldi_tstep integer in Number of outer iterations t in the CA-ARNOLDI method (restart length = st)
caarnoldi_basis integer in Flag for basis vector in the CA-ARNOLDI method (0: monomial basis, 1: Newton basis)
caarnoldi_QRflag integer in Flag for QR factorization in the CA-ARNOLDI method (1: MGS, 2: CGS, 3: TSQR, 4: Cholesky QR, 5: Cholesky QR2)
iret integer out Error code (0:normal)
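
A parameter-setting fragment for parcel_dcbcg is shown below, reusing the CRS data and solver parameters from 6.4. The Chebyshev basis requires an estimate of the extreme eigenvalues, which is why the eigenvalue-estimation arguments appear in the interface; all values here are illustrative and the cbcg_*, power_method_* and caarnoldi_* variables are assumed to be declared as integers.

    cbcg_kstep          = 8     ! Chebyshev basis vectors per communication-avoiding step
    cbcg_Eigenflag      = 1     ! estimate the eigenvalue bounds with the power method
    power_method_itrmax = 100
    ! The caarnoldi_* arguments are presumably referenced only when
    ! cbcg_Eigenflag = 2 (CA-ARNOLDI), but are supplied here regardless.
    caarnoldi_sstep = 4;  caarnoldi_tstep = 5
    caarnoldi_basis = 0;  caarnoldi_QRflag = 1

    call parcel_dcbcg( icomm, vecx,vecb, n,gn,nnz,istart, &
         crsA,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, &
         itrmax,rtolmax, reshistory, iovlflag, precon_thblock, &
         Neuman_itrmax, independ_nvec, nBlock, &
         cbcg_kstep,cbcg_Eigenflag,power_method_itrmax, &
         caarnoldi_sstep,caarnoldi_tstep, caarnoldi_basis,caarnoldi_QRflag, iret )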


6.8 parcel_dcaarnoldi (CPU)


call parcel_dcaarnoldi( icomm, vecx, vecb, n, gn, nnz, istart, crsA, crsRow_ptr, crsCol, ipreflag, ilu_method, addL, iflagAS, itrmax, iovlflag, precon_thblock, independ_nvec, nBlock, caarnoldi_sstep, caarnoldi_tstep, caarnoldi_basis, caarnoldi_QRflag, Evalr, Evali, Evec, Eerr, EmaxID, EminID, iret )


An eigenvalue problem is solved by the communication avoiding Arnoldi method (CA-Arnoldi(s, t)). Non-zero elements are stored in CRS format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx(n) double precision in/out in: initial vector, out: solution vector
vecb(n) double precision in Right hand side vector
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
nnz integer*4 / integer*8 in Number of non-zero elements on each process
istart integer*4 / integer*8 in Start line of matrix on each process
crsA(nnz) double precision in Non-zero elements of matrix stored in CRS format
crsRow_ptr(n+1) integer*4 / integer*8 in Pointer table in CRS format
crsCol(nnz) integer*4 / integer*8 in Column numbers of non-zero elements in CRS format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz)
ilu_method integer in Flag for incomplete LU factorization (0: ILU(0), 1: D-ILU (DIA components match), 2: D-ILU (element sums in a row match); only 1 and 2 are available for the additive Schwarz method)
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
caarnoldi_sstep integer in Number of communication avoiding steps s in the CA-ARNOLDI method
caarnoldi_tstep integer in Number of outer iterations t in the CA-ARNOLDI method (restart length = st)
caarnoldi_basis integer in Flag for basis vector in the CA-ARNOLDI method (0: monomial basis, 1: Newton basis)
caarnoldi_QRflag integer in Flag for QR factorization in the CA-ARNOLDI method (1: MGS, 2: CGS, 3: TSQR, 4: CholeskyQR, 5: CholeskyQR2)
Evalr(caarnoldi_sstep*caarnoldi_tstep) double precision out Real part of the eigenvalue
Evali(caarnoldi_sstep*caarnoldi_tstep) double precision out Imaginary part of eigenvalue
Evec(n*caarnoldi_sstep*caarnoldi_tstep) double precision out Eigenvectors
Eerr(caarnoldi_sstep*caarnoldi_tstep) double precision out Error norm \({\left\| Ax - \lambda x \right\|}/{\left\| \lambda x \right\|}\)
EmaxID integer out ID number of the maximum eigenvalue
EminID integer out ID number of the minimum eigenvalue
iret integer out Error code (0:normal)
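
The fragment below illustrates how the output arrays of parcel_dcaarnoldi might be sized, continuing from the CRS data of the sketch in 6.4. The number of computed Ritz pairs equals caarnoldi_sstep*caarnoldi_tstep; the assumption that eigenvector j occupies Evec((j-1)*n+1:j*n) is ours and should be checked against the library's examples.

    integer :: caarnoldi_sstep, caarnoldi_tstep, caarnoldi_basis, caarnoldi_QRflag
    integer :: nev, EmaxID, EminID
    double precision, allocatable :: Evalr(:), Evali(:), Evec(:), Eerr(:)

    caarnoldi_sstep  = 4
    caarnoldi_tstep  = 8
    caarnoldi_basis  = 1      ! Newton basis
    caarnoldi_QRflag = 3      ! TSQR
    nev = caarnoldi_sstep * caarnoldi_tstep        ! 32 Ritz pairs
    allocate(Evalr(nev), Evali(nev), Eerr(nev), Evec(n*nev))

    call parcel_dcaarnoldi( icomm, vecx, vecb, n, gn, nnz, istart, &
         crsA, crsRow_ptr, crsCol, ipreflag, ilu_method, addL, iflagAS, &
         itrmax, iovlflag, precon_thblock, independ_nvec, nBlock, &
         caarnoldi_sstep, caarnoldi_tstep, caarnoldi_basis, caarnoldi_QRflag, &
         Evalr, Evali, Evec, Eerr, EmaxID, EminID, iret )

    print *, 'largest Ritz value:', Evalr(EmaxID), Evali(EmaxID), '  residual:', Eerr(EmaxID)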


6.9 parcel_dcg_dia (CPU/GPU)


call parcel_dcg_dia( icomm, vecx,vecb, n,gn,istart, diaA,offset,nnd, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, iret )


A simultaneous linear equation system Ax = b is solved by the conjugate gradient method (CG method). Non-zero elements are stored in DIA format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx(n) CPU : double precision / GPU : type(c_ptr) in/out in: initial vector, out: solution vector
vecb(n) CPU : double precision / GPU : type(c_ptr) in Right hand side vector
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
istart integer*4 / integer*8 in Start line of matrix on each process
diaA(n*nnd) CPU : double precision / GPU : type(c_ptr) in Non-zero elements of matrix stored in DIA format
offset(nnd) CPU : integer*4 / integer*8 / GPU : type(c_ptr) in Offset of each non-zero diagonal elements array in DIA format
nnd integer*4 / integer*8 in Number of diagonal elements arrays in DIA format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz, 4: Neumann series (GPU), 5: block Neumann series (GPU))
ilu_method integer in Flag for incomplete LU factorization (0: ILU(0), 1: D-ILU (DIA components match), 2: D-ILU (element sums in a row match); only 1 and 2 are available for the additive Schwarz method)
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
Neuman_itrmax integer in Order of Neumann series preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
iret integer out Error code (0:normal)
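
For comparison with the CRS sketch in 6.4, the fragment below stores the same tridiagonal test matrix in DIA format and calls parcel_dcg_dia (n, vecx, vecb, the solver parameters and the loop index i are reused from that sketch). The packing of diaA assumed here, with diagonal d stored contiguously in diaA((d-1)*n+1:d*n) and zeros where a diagonal runs outside the matrix, should be checked against the DIA format description earlier in this manual.

    integer, parameter :: nnd = 3                  ! three stored diagonals
    integer :: offset(nnd)
    double precision, allocatable :: diaA(:)

    offset = (/ -1, 0, 1 /)                        ! sub-, main and super-diagonal
    allocate(diaA(n*nnd))
    diaA(:) = 0.0d0
    do i = 1, n
      if (i > 1) diaA(i)       = -1.0d0            ! offset -1 (assumed first block)
      diaA(i + n)              =  2.0d0            ! offset  0
      if (i < n) diaA(i + 2*n) = -1.0d0            ! offset +1
    end do
    vecx(:) = 0.0d0                                ! reset the initial guess

    call parcel_dcg_dia( icomm, vecx,vecb, n,gn,istart, diaA,offset,nnd, &
         ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, &
         iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, iret )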


6.10 parcel_dbicgstab_dia (CPU/GPU)


call parcel_dbicgstab_dia( icomm, vecx,vecb, n,gn,istart, diaA,offset,nnd, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, iret )


A simultaneous linear equation system Ax = b is solved by the stabilized biconjugate gradient method (Bi-CGSTAB method). Non-zero elements are stored in DIA format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx(n) CPU : double precision / GPU : type(c_ptr) in/out in: initial vector, out: solution vector
vecb(n) CPU : double precision / GPU : type(c_ptr) in Right hand side vector
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
istart integer*4 / integer*8 in Start line of matrix on each process
diaA(n*nnd) CPU : double precision / GPU : type(c_ptr) in Non-zero elements of matrix stored in DIA format
offset(nnd) CPU : integer*4 / integer*8 / GPU : type(c_ptr) in Offset of each non-zero diagonal elements array in DIA format
nnd integer*4 / integer*8 in Number of diagonal elements arrays in DIA format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz, 4: Neumann series (GPU), 5: block Neumann series (GPU))
ilu_method integer in Flag for incomplete LU factorization (0: ILU(0), 1: D-ILU (DIA components match), 2: D-ILU (element sums in a row match); only 1 and 2 are available for the additive Schwarz method)
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
Neuman_itrmax integer in Order of Neumann series preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
iret integer out Error code (0:normal)


6.11 parcel_dgmres_dia (CPU/GPU)


call parcel_dgmres_dia( icomm, vecx,vecb, n,gn,istart, diaA,offset,nnd, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, gmres_m,gmres_GSflag, iret )


A simultaneous linear equation system Ax = b is solved by the generalized minimum residual method (GMRES (m) method). Non-zero elements are stored in DIA format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx(n) CPU : double precision / GPU : type(c_ptr) in/out in: initial vector, out: solution vector
vecb(n) CPU : double precision / GPU : type(c_ptr) in Right hand side vector
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
istart integer*4 / integer*8 in Start line of matrix on each process
diaA(n*nnd) CPU : double precision / GPU : type(c_ptr) in Non-zero elements of matrix stored in DIA format
offset(nnd) CPU : integer*4 / integer*8 / GPU : type(c_ptr) in Offset of each non-zero diagonal elements array in DIA format
nnd integer*4 / integer*8 in Number of diagonal elements arrays in DIA format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz, 4: Neumann series (GPU), 5: block Neumann series (GPU))
ilu_method integer in Flag for incomplete LU factorization (0: ILU(0), 1: D-ILU (DIA components match), 2: D-ILU (element sums in a row match); only 1 and 2 are available for the additive Schwarz method)
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
Neuman_itrmax integer in Order of Neumann series preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
gmres_m integer in Number of iterations until restart
gmres_GSflag integer in Flag for orthogonalization algorithm (1: MGS, 2: CGS)
iret integer out Error code (0:normal)


6.12 parcel_dcagmres_dia (CPU/GPU)


call parcel_dcagmres_dia( icomm, vecx,vecb, n,gn,istart, diaA,offset,nnd, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, cagmres_sstep, cagmres_tstep, cagmres_basis, cagmres_QRflag, iret )


A simultaneous linear equation system Ax = b is solved by the communication reduced generalized minimum residual method (CA-GMRES (s, t) method). Non-zero elements are stored in DIA format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx(n) CPU : double precision / GPU : type(c_ptr) in/out in: initial vector, out: solution vector
vecb(n) CPU : double precision / GPU : type(c_ptr) in Right hand side vector
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
istart integer*4 / integer*8 in Start line of matrix on each process
diaA(n*nnd) CPU : double precision / GPU : type(c_ptr) in Non-zero elements of matrix stored in DIA format
offset(nnd) CPU : integer*4 / integer*8 / GPU : type(c_ptr) in Offset of each non-zero diagonal elements array in DIA format
nnd integer*4 / integer*8 in Number of diagonal elements arrays in DIA format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz, 4: Neumann series (GPU), 5: block Neumann series (GPU))
ilu_method integer in Flag for incomplete LU factorization (0: ILU(0), 1: D-ILU (DIA components match), 2: D-ILU (element sums in a row match); only 1 and 2 are available for the additive Schwarz method)
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
Neuman_itrmax integer in Order of Neumann series preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
cagmres_sstep integer in Number of communication avoiding steps s in the CA-GMRES method
cagmres_tstep integer in Number of outer iterations t in the CA-GMRES method (restart length = st)
cagmres_basis integer in Flag for basis vector in the CA-GMRES method (0: monomial basis, 1: Newton basis)
cagmres_QRflag integer in Flag for QR factorization in the CA-GMRES method (1: MGS, 2: CGS, 3: TSQR, 4: CholeskyQR, 5: CholeskyQR2)
iret integer out Error code (0:normal)


6.13 parcel_dcbcg_dia (CPU/GPU)


call parcel_dcbcg_dia( icomm, vecx,vecb, n,gn,istart, diaA,offset,nnd, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, cbcg_kstep,cbcg_Eigenflag,power_method_itrmax, caarnoldi_sstep,caarnoldi_tstep, caarnoldi_basis,caarnoldi_QRflag, iret )


A simultaneous linear equation system Ax = b is solved by the Chebyshev basis conjugate gradient method (CBCG method). Non-zero elements are stored in DIA format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx(n) CPU : double precision / GPU : type(c_ptr) in/out in: initial vector, out: solution vector
vecb(n) CPU : double precision / GPU : type(c_ptr) in Right hand side vector
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
istart integer*4 / integer*8 in Start line of matrix on each process
diaA(n*nnd) CPU : double precision / GPU : type(c_ptr) in Non-zero elements of matrix stored in DIA format
offset(nnd) CPU : integer*4 / integer*8 / GPU : type(c_ptr) in Offset of each non-zero diagonal elements array in DIA format
nnd integer*4 / integer*8 in Number of diagonal elements arrays in DIA format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz, 4: Neumann series (GPU), 5: block Neumann series (GPU))
ilu_method integer in Flag for incomplete LU factorization (0: ILU(0), 1: D-ILU (DIA components match), 2: D-ILU (element sums in a row match); only 1 and 2 are available for the additive Schwarz method)
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
Neuman_itrmax integer in Order of Neumann series preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
cbcg_kstep integer in Number of communication avoiding steps
cbcg_Eigenflag integer in Flag for eigenvalue computation (1: power method, 2: CA-ARNOLDI)
power_method_itrmax integer in Maximum number of iterations in the power method
caarnoldi_sstep integer in Number of communication avoiding steps s of the CA-ARNOLDI method
caarnoldi_tstep integer in Number of outer iterations t in the CA-ARNOLDI method (restart length = st)
caarnoldi_basis integer in Flag for basis vector in the CA-ARNOLDI method (0: monomial basis, 1: Newton basis)
caarnoldi_QRflag integer in Flag for QR factorization in the CA-ARNOLDI method (1: MGS, 2: CGS, 3: TSQR, 4: Cholesky QR, 5: Cholesky QR2)
iret integer out Error code (0:normal)


6.14 parcel_dcaarnoldi_dia (CPU)


call parcel_dcaarnoldi_dia( icomm, vecx, vecb, n, gn, istart, diaA,offset,nnd, ipreflag, ilu_method, addL, iflagAS, itrmax, iovlflag, precon_thblock, independ_nvec, nBlock, caarnoldi_sstep, caarnoldi_tstep, caarnoldi_basis, caarnoldi_QRflag, Evalr, Evali, Evec, Eerr, EmaxID, EminID, iret )


An eigenvalue problem is solved by the communication avoiding Arnoldi method (CA-Arnoldi(s, t)). Non-zero elements are stored in DIA format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx(n) double precision in/out in: initial vector, out: solution vector
vecb(n) double precision in Right hand side vector
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
istart integer*4 / integer*8 in Start line of matrix on each process
diaA(n*nnd) double precision in Non-zero elements of matrix stored in DIA format
offset(nnd) integer*4 / integer*8 in Offset of each non-zero diagonal elements array in DIA format
nnd integer*4 / integer*8 in Number of diagonal elements arrays in DIA format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz)
ilu_method integer in Flag for incomplete LU factorization (0: ILU(0), 1: D-ILU (DIA components match), 2: D-ILU (element sums in a row match); only 1 and 2 are available for the additive Schwarz method)
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
caarnoldi_sstep integer in Number of communication avoiding steps s in the CA-ARNOLDI method
caarnoldi_tstep integer in Number of outer iterations t in the CA-ARNOLDI method (restart length = st)
caarnoldi_basis integer in Flag for basis vector in the CA-ARNOLDI method (0: monomial basis, 1: Newton basis)
caarnoldi_QRflag integer in Flag for QR factorization in the CA-ARNOLDI method (1: MGS, 2: CGS, 3: TSQR, 4: CholeskyQR, 5: CholeskyQR2)
Evalr(caarnoldi_sstep*caarnoldi_tstep) double precision out Real part of the eigenvalue
Evali(caarnoldi_sstep*caarnoldi_tstep) double precision out Imaginary part of eigenvalue
Evec(n*caarnoldi_sstep*caarnoldi_tstep) double precision out Eigenvectors
Eerr(caarnoldi_sstep*caarnoldi_tstep) double precision out Error norm \({\left\| Ax - \lambda x \right\|}/{\left\| \lambda x \right\|}\)
EmaxID integer out ID number of the maximum eigenvalue
EminID integer out ID number of the minimum eigenvalue
iret integer out Error code (0:normal)


6.15 parcel_dcg_ddm (CPU/GPU)


call parcel_dcg_ddm( icomm, x,b, m, nnd,ioff_dia,val_dia, num_neighbor_ranks,neighbor_ranks,margin,ncomp_grids, ndiv_grids,num_grids,ioff_grids, div_direc_th, ipreflag, addL,iflagAS, itrmax,rtolmax,reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, iret )


A simultaneous linear equation system Ax = b is solved by the conjugate gradient method (CG method). Non-zero elements are stored in DDM format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
x(m) CPU : double precision / GPU : type(c_ptr) in/out in: initial vector, out: solution vector
b(m) CPU : double precision / GPU : type(c_ptr) in Right hand side vector
m integer in Size of vector on each process
nnd integer in Number of diagonal elements arrays on each process
ioff_dia(nnd) CPU : integer / GPU : type(c_ptr) in Offset of each non-zero diagonal elements array in DDM format
val_dia(m*nnd) CPU : double precision / GPU : type(c_ptr) in Non-zero elements of matrix stored in DDM format
num_neighbor_ranks integer in Number of neighboring processes
neighbor_ranks(num_neighbor_ranks) integer in List of neighboring processes
margin integer in Maximum absolute value of the offsets in ioff_grids
ncomp_grids integer in Dimension of structured grids
ndiv_grids(ncomp_grids) integer in Number of decomposed domains in each direction
num_grids(ncomp_grids) integer in Number of grids on each process
ioff_grids(nnd,ncomp_grids) integer in Offset of stencil data location in each direction
div_direc_th integer in Direction of thread parallelization (1: x, 2: y, 3: z)
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz, 4: Neumann series (GPU), 5: block Neumann series (GPU))
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
Neuman_itrmax integer in Order of Neumann series preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
iret integer out Error code (0:normal)
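
Because the DDM arrays encode a structured-grid domain decomposition, only the call pattern is sketched below; the decomposition data (local size, neighbor ranks, stencil offsets, halo width) must be built as described in the DDM format section earlier in this manual, every value shown is a placeholder, and the remaining solver parameters are reused from the sketch in 6.4.

    integer :: m, nnd, num_neighbor_ranks, margin, ncomp_grids, div_direc_th
    integer, allocatable :: ioff_dia(:), neighbor_ranks(:)
    integer, allocatable :: ndiv_grids(:), num_grids(:), ioff_grids(:,:)
    double precision, allocatable :: val_dia(:), x(:), b(:)

    ncomp_grids  = 3                               ! placeholder: 3D structured grid
    nnd          = 7                               ! placeholder: 7-point stencil
    div_direc_th = 3                               ! placeholder: thread-parallelize in z
    ! ... allocate and fill m, x, b, val_dia(m*nnd), ioff_dia(nnd), margin,
    !     num_neighbor_ranks, neighbor_ranks, ndiv_grids, num_grids and
    !     ioff_grids(nnd,ncomp_grids) according to the DDM format chapter ...

    call parcel_dcg_ddm( icomm, x,b, m, nnd,ioff_dia,val_dia, &
         num_neighbor_ranks,neighbor_ranks,margin,ncomp_grids, &
         ndiv_grids,num_grids,ioff_grids, div_direc_th, &
         ipreflag, addL,iflagAS, itrmax,rtolmax,reshistory, &
         iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, iret )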


6.16 parcel_dbicgstab_ddm (CPU/GPU)


call parcel_dbicgstab_ddm( icomm, x,b, m, nnd,ioff_dia,val_dia, num_neighbor_ranks,neighbor_ranks,margin,ncomp_grids, ndiv_grids,num_grids,ioff_grids, div_direc_th, ipreflag, addL,iflagAS, itrmax,rtolmax,reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, iret )


A simultaneous linear equation system Ax = b is solved by the stabilized biconjugate gradient method (Bi-CGSTAB method). Non-zero elements are stored in DDM format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
x(m) CPU : double precision / GPU : type(c_ptr) in/out in: initial vector, out: solution vector
b(m) CPU : double precision / GPU : type(c_ptr) in Right hand side vector
m integer in Size of vector on each process
nnd integer in Number of diagonal elements arrays on each process
ioff_dia(nnd) CPU : integer / GPU : type(c_ptr) in Offset of each non-zero diagonal elements array in DDM format
val_dia(m*nnd) CPU : double precision / GPU : type(c_ptr) in Non-zero elements of matrix stored in DDM format
num_neighbor_ranks integer in Number of neighboring processes
neighbor_ranks(num_neighbor_ranks) integer in List of neighboring processes
margin integer in Maximum absolute value of the offsets in ioff_grids
ncomp_grids integer in Dimension of structured grids
ndiv_grids(ncomp_grids) integer in Number of decomposed domains in each direction
num_grids(ncomp_grids) integer in Number of grids on each process
ioff_grids(nnd,ncomp_grids) integer in Offset of stencil data location in each direction
div_direc_th integer in Direction of thread parallelization (1: x, 2: y, 3: z)
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz, 4: Neumann series (GPU), 5: block Neumann series (GPU))
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
Neuman_itrmax integer in Order of Neumann series preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
iret integer out Error code (0:normal)


6.17 parcel_dgmres_ddm (CPU/GPU)


call parcel_dgmres_ddm( icomm, x,b, m, nnd,ioff_dia,val_dia, num_neighbor_ranks,neighbor_ranks,margin,ncomp_grids, ndiv_grids,num_grids,ioff_grids, div_direc_th, ipreflag, addL,iflagAS, itrmax,rtolmax,reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, gmres_m, gmres_GSflag, iret )


A simultaneous linear equation system Ax = b is solved by the generalized minimum residual method (GMRES (m) method). Non-zero elements are stored in DDM format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
x(m) CPU : double precision / GPU : type(c_ptr) in/out in: initial vector, out: solution vector
b(m) CPU : double precision / GPU : type(c_ptr) in Right hand side vector
m integer in Size of vector on each process
nnd integer in Number of diagonal elements arrays on each process
ioff_dia(nnd) CPU : integer / GPU : type(c_ptr) in Offset of each non-zero diagonal elements array in DDM format
val_dia(m*nnd) CPU : double precision / GPU : type(c_ptr) in Non-zero elements of matrix stored in DDM format
num_neighbor_ranks integer in Number of neighboring processes
neighbor_ranks(num_neighbor_ranks) integer in List of neighboring processes
margin integer in Maximum absolute value of the offsets in ioff_grids
ncomp_grids integer in Dimension of structured grids
ndiv_grids(ncomp_grids) integer in Number of decomposed domains in each direction
num_grids(ncomp_grids) integer in Number of grids on each process
ioff_grids(nnd,ncomp_grids) integer in Offset of stencil data location in each direction
div_direc_th integer in Direction of thread parallelization (1: x, 2: y, 3: z)
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz, 4: Neumann series (GPU), 5: block Neumann series (GPU))
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
Neuman_itrmax integer in Order of Neumann series preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
gmres_m integer in Number of iterations until restart
gmres_GSflag integer in Flag for orthogonalization algorithm (1: MGS, 2: CGS)
iret integer out Error code (0:normal)


6.18 parcel_dcagmres_ddm (CPU/GPU)


call parcel_dcagmres_ddm( icomm, x,b, m, nnd,ioff_dia,val_dia, num_neighbor_ranks,neighbor_ranks,margin,ncomp_grids, ndiv_grids,num_grids,ioff_grids, div_direc_th, ipreflag, addL,iflagAS, itrmax,rtolmax,reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, cagmres_sstep, cagmres_tstep, cagmres_basis, cagmres_QRflag, iret )


A simultaneous linear equation system Ax = b is solved by the communication reduced generalized minimum residual method (CA-GMRES (s, t) method). Non-zero elements are stored in DDM format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
x(m) CPU : double precision / GPU : type(c_ptr) in/out in: initial vector, out: solution vector
b(m) CPU : double precision / GPU : type(c_ptr) in Right hand side vector
m integer in Size of vector on each process
nnd integer in Number of diagonal elements arrays on each process
ioff_dia(nnd) CPU : integer / GPU : type(c_ptr) in Offset of each non-zero diagonal elements array in DDM format
val_dia(m*nnd) CPU : double precision / GPU : type(c_ptr) in Non-zero elements of matrix stored in DDM format
num_neighbor_ranks integer in Number of neighboring processes
neighbor_ranks(num_neighbor_ranks) integer in List of neighboring processes
margin integer in Maximum absolute value of the offsets in ioff_grids
ncomp_grids integer in Dimension of structured grids
ndiv_grids(ncomp_grids) integer in Number of decomposed domains in each direction
num_grids(ncomp_grids) integer in Number of grids on each process
ioff_grids(nnd,ncomp_grids) integer in Offset of stencil data location in each direction
div_direc_th integer in Direction of thread parallelization (1: x, 2: y, 3: z)
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz, 4: Neumann series (GPU), 5: block Neumann series (GPU))
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
Neuman_itrmax integer in Order of Neumann series preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
cagmres_sstep integer in Number of communication avoiding steps s in the CA-GMRES method
cagmres_tstep integer in Number of outer iterations t in the CA-GMRES method (restart length = st)
cagmres_basis integer in Flag for basis vector in the CA-GMRES method (0: monomial basis, 1: Newton basis)
cagmres_QRflag integer in Flag for QR factorization in the CA-GMRES method (1: MGS, 2: CGS, 3: TSQR, 4: CholeskyQR, 5: CholeskyQR2)
iret integer out Error code (0:normal)


6.19 parcel_dcbcg_ddm (CPU/GPU)


call parcel_dcbcg_ddm( icomm, x,b, m, nnd,ioff_dia,val_dia, num_neighbor_ranks,neighbor_ranks,margin,ncomp_grids, ndiv_grids,num_grids,ioff_grids, div_direc_th, ipreflag, addL,iflagAS, itrmax,rtolmax,reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, cbcg_kstep,cbcg_Eigenflag,power_method_itrmax, caarnoldi_sstep,caarnoldi_tstep, caarnoldi_basis,caarnoldi_QRflag, iret )


A simultaneous linear equation system Ax = b is solved by the Chebyshev basis conjugate gradient method (CBCG method). Non-zero elements are stored in DDM format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
x(m) CPU : double precision / GPU : type(c_ptr) in/out in: initial vector, out: solution vector
b(m) CPU : double precision / GPU : type(c_ptr) in Right hand side vector
m integer in Size of vector on each process
nnd integer in Number of diagonal elements arrays on each process
ioff_dia(nnd) CPU : integer / GPU : type(c_ptr) in Offset of each non-zero diagonal elements array in DDM format
val_dia(m*nnd) CPU : double precision / GPU : type(c_ptr) in Non-zero elements of matrix stored in DDM format
num_neighbor_ranks integer in Number of neighboring processes
neighbor_ranks(num_neighbor_ranks) integer in List of neighboring processes
margin integer in Maximum absolute value of the offsets in ioff_grids
ncomp_grids integer in Dimension of structured grids
ndiv_grids(ncomp_grids) integer in Number of decomposed domains in each direction
num_grids(ncomp_grids) integer in Number of grids on each process
ioff_grids(nnd,ncomp_grids) integer in Offset of stencil data location in each direction
div_direc_th integer in Direction of thread parallelization (1: x, 2: y, 3: z)
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz, 4: Neumann series (GPU), 5: block Neumann series (GPU))
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
Neuman_itrmax integer in Order of Neumann series preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
cbcg_kstep integer in Number of communication avoiding steps
cbcg_Eigenflag integer in Flag for eigenvalue computation (1: power method, 2: CA-ARNOLDI)
power_method_itrmax integer in Maximum number of iterations in the power method
caarnoldi_sstep integer in Number of communication avoiding steps s of the CA-ARNOLDI method
caarnoldi_tstep integer in Number of outer iterations t in the CA-ARNOLDI method (restart length = st)
caarnoldi_basis integer in Flag for basis vector in the CA-ARNOLDI method (0: monomial basis, 1: Newton basis)
caarnoldi_QRflag integer in Flag for QR factorization in the CA-ARNOLDI method (1: MGS, 2: CGS, 3: TSQR, 4: Cholesky QR, 5: Cholesky QR2)
iret integer out Error code (0:normal)


6.20 parcel_dcaarnoldi_ddm (CPU)


call parcel_dcaarnoldi_ddm( icomm, x,b, m, nnd,ioff_dia,val_dia, num_neighbor_ranks,neighbor_ranks,margin,ncomp_grids, ndiv_grids,num_grids,ioff_grids, div_direc_th, ipreflag, addL,iflagAS, itrmax,rtolmax,reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, caarnoldi_sstep, caarnoldi_tstep, caarnoldi_basis, caarnoldi_QRflag, Evalr, Evali, Evec, Eerr, EmaxID, EminID, iret )


An eigenvalue problem is solved by the communication avoiding Arnoldi method (CA-Arnoldi(s, t)). Non-zero elements are stored in DDM format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
x(m) double precision in/out in: initial vector, out: solution vector
b(m) double precision in Right hand side vector
m integer in Size of vector on each process
nnd integer in Number of diagonal elements arrays on each process
ioff_dia(nnd) integer in Offset of each non-zero diagonal elements array in DDM format
val_dia(m*nnd) double precision in Non-zero elements of matrix stored in DDM format
num_neighbor_ranks integer in Number of neighboring processes
neighbor_ranks(num_neighbor_ranks) integer in List of neighboring processes
margin integer in Maximum absolute value of the offsets in ioff_grids
ncomp_grids integer in Dimension of structured grids
ndiv_grids(ncomp_grids) integer in Number of decomposed domains in each direction
num_grids(ncomp_grids) integer in Number of grids on each process
ioff_grids(nnd,ncomp_grids) integer in Offset of stencil data location in each direction
div_direc_th integer in Direction of thread parallelization (1: x, 2: y, 3: z)
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz)
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
caarnoldi_sstep integer in Number of communication avoiding steps s in the CA-ARNOLDI method
caarnoldi_tstep integer in Number of outer iterations t in the CA-ARNOLDI method (restart length = st)
caarnoldi_basis integer in Flag for basis vector in the CA-ARNOLDI method (0: monomial basis, 1: Newton basis)
caarnoldi_QRflag integer in Flag for QR factorization in the CA-ARNOLDI method (1: MGS, 2: CGS, 3: TSQR, 4: CholeskyQR, 5: CholeskyQR2)
Evalr(caarnoldi_sstep*caarnoldi_tstep) double precision out Real part of the eigenvalue
Evali(caarnoldi_sstep*caarnoldi_tstep) double precision out Imaginary part of eigenvalue
Evec(m*caarnoldi_sstep*caarnoldi_tstep) double precision out Eigenvectors
Eerr(caarnoldi_sstep*caarnoldi_tstep) double precision out Error norm \({\left\| Ax - \lambda x \right\|}/{\left\| \lambda x \right\|}\)
EmaxID integer out ID number of the maximum eigenvalue
EminID integer out ID number of the minimum eigenvalue
iret integer out Error code (0:normal)


6.21 parcel_ddcg (CPU)


call parcel_ddcg( icomm, vecx_hi,vecx_lo, vecb_hi,vecb_lo, n,gn,nnz,istart, crsA_hi,crsA_lo,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, precision_A, precision_b, precision_x, precision_precon, iret )


A simultaneous linear equation system Ax = b is solved by the conjugate gradient method (CG method). Non-zero elements are stored in CRS format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx_hi(n) double precision in/out in: initial vector (upper bits), out: solution vector (upper bits)
vecx_lo(n) double precision in/out in: initial vector (lower bits), out: solution vector (lower bits)
vecb_hi(n) double precision in Right hand side vector (upper bits)
vecb_lo(n) double precision in Right hand side vector (lower bits)
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
nnz integer*4 / integer*8 in Number of non-zero elements on each process
istart integer*4 / integer*8 in Start line of matrix on each process
crsA_hi(nnz) double precision in Non-zero elements (upper bits) of matrix stored in CRS format
crsA_lo(nnz) double precision in Non-zero elements (lower bits) of matrix stored in CRS format
crsRow_ptr(n+1) integer*4 / integer*8 in Pointer table in CRS format
crsCol(nnz) integer*4 / integer*8 in Column numbers of non-zero elements in CRS format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz)
ilu_method integer in Flag for incomplete LU factorization (0: ILU(0), 1: D-ILU (DIA components match), 2: D-ILU (element sums in a row match); only 1 and 2 are available for the additive Schwarz method)
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
precision_A integer in Precision of matrix (1: double precision, 2: quad precision)
precision_b integer in Precision of right hand side vector (1: double precision, 2: quad precision)
precision_x integer in Precision of solution vector (1: double precision, 2: quad precision)
precision_precon integer in Precision of preconditioner matrix (1: double precision, 2: quad precision)
iret integer out Error code (0:normal)
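
The fragment below promotes the double-precision CRS data from the sketch in 6.4 to the high/low word pairs expected by parcel_ddcg. Putting the original value in the high word and zero in the low word is exact for data that is only available in double precision; the precision_* choices shown are illustrative.

    double precision, allocatable :: vecx_hi(:), vecx_lo(:), vecb_hi(:), vecb_lo(:)
    double precision, allocatable :: crsA_hi(:), crsA_lo(:)
    integer :: precision_A, precision_b, precision_x, precision_precon

    allocate(vecx_hi(n), vecx_lo(n), vecb_hi(n), vecb_lo(n))
    allocate(crsA_hi(nnz), crsA_lo(nnz))
    crsA_hi = crsA;   crsA_lo = 0.0d0              ! double -> double-double (low word = 0)
    vecb_hi = vecb;   vecb_lo = 0.0d0
    vecx_hi = 0.0d0;  vecx_lo = 0.0d0              ! zero initial guess

    precision_A      = 1                           ! matrix kept in double precision
    precision_b      = 2                           ! right hand side treated in quad precision
    precision_x      = 2                           ! quad-precision solution updates
    precision_precon = 1                           ! double-precision preconditioner

    call parcel_ddcg( icomm, vecx_hi,vecx_lo, vecb_hi,vecb_lo, n,gn,nnz,istart, &
         crsA_hi,crsA_lo,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, &
         itrmax,rtolmax, reshistory, iovlflag, precon_thblock, &
         independ_nvec, nBlock, precision_A, precision_b, precision_x, &
         precision_precon, iret )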


6.22 parcel_ddbicgstab (CPU)


call parcel_ddbicgstab( icomm, vecx_hi,vecx_lo, vecb_hi,vecb_lo, n,gn,nnz,istart, crsA_hi,crsA_lo,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, precision_A, precision_b, precision_x, precision_precon, iret )


A simultaneous linear equation system Ax = b is solved by the stabilized biconjugate gradient method (Bi-CGSTAB method). Non-zero elements are stored in CRS format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx_hi(n) double precision in/out in: initial vector (upper bits), out: solution vector (upper bits)
vecx_lo(n) double precision in/out in: initial vector (lower bits), out: solution vector (lower bits)
vecb_hi(n) double precision in Right hand side vector (upper bits)
vecb_lo(n) double precision in Right hand side vector (lower bits)
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
nnz integer*4 / integer*8 in Number of non-zero elements on each process
istart integer*4 / integer*8 in Start line of matrix on each process
crsA_hi(nnz) double precision in Non-zero elements (upper bits) of matrix stored in CRS format
crsA_lo(nnz) double precision in Non-zero elements (lower bits) of matrix stored in CRS format
crsRow_ptr(n+1) integer*4 / integer*8 in Pointer table in CRS format
crsCol(nnz) integer*4 / integer*8 in Column numbers of non-zero elements in CRS format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz)
ilu_method integer in Flag for incomplete LU factorization (0: ILU(0), 1: D-ILU (DIA components match), 2: D-ILU (element sums in a row match); only 1 and 2 are available for the additive Schwarz method)
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
precision_A integer in Precision of matrix (1: double precision, 2: quad precision)
precision_b integer in Precision of right hand side vector (1: double precision, 2: quad precision)
precision_x integer in Precision of solution vector (1: double precision, 2: quad precision)
precision_precon integer in Precision of preconditioner matrix (1: double precision, 2: quad precision)
iret integer out Error code (0:normal)


6.23 parcel_ddgmres (CPU)


call parcel_ddgmres( icomm, vecx_hi,vecx_lo, vecb_hi,vecb_lo, n,gn,nnz,istart, crsA_hi,crsA_lo,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, gmres_m,gmres_GSflag, precision_A, precision_b, precision_x, precision_precon, iret )


A simultaneous linear equation system Ax = b is solved by the generalized minimum residual method (GMRES (m) method). Non-zero elements are stored in CRS format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx_hi(n) double precision in/out in: initial vector (upper bits), out: solution vector (upper bits)
vecx_lo(n) double precision in/out in: initial vector (lower bits), out: solution vector (lower bits)
vecb_hi(n) double precision in Right hand side vector (upper bits)
vecb_lo(n) double precision in Right hand side vector (lower bits)
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
nnz integer*4 / integer*8 in Number of non-zero elements on each process
istart integer*4 / integer*8 in Start line of matrix on each process
crsA_hi(nnz) double precision in Non-zero elements (upper bits) of matrix stored in CRS format
crsA_lo(nnz) double precision in Non-zero elements (lower bits) of matrix stored in CRS format
crsRow_ptr(n+1) integer*4 / integer*8 in Pointer table in CRS format
crsCol(nnz) integer*4 / integer*8 in Column numbers of non-zero elements in CRS format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz)
ilu_method integer in Flag for incomplete LU factorization (0: ILU(0), 1: D-ILU (DIA components match), 2: D-ILU (element sums in a row match); only 1 and 2 are available for the additive Schwarz method)
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
gmres_m integer in Number of iterations until restart
gmres_GSflag integer in Flag for orthogonalization algorithm (1: MGS, 2: CGS)
precision_A integer in Precision of matrix (1: double precision, 2: quad precision)
precision_b integer in Precision of right hand side vector (1: double precision, 2: quad precision)
precision_x integer in Precision of solution vector (1: double precision, 2: quad precision)
precision_precon integer in Precision of preconditioner matrix (1: double precision, 2: quad precision)
iret integer out Error code (0:normal)
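
As a usage illustration (a minimal sketch, not taken from the library distribution), the program below calls parcel_ddgmres on a CRS matrix produced by an arbitrary user routine make_matrix_CRS, as in Section 8.1. Since only double-precision input data are assumed to be available, the lower words crsA_lo, vecb_lo, and vecx_lo are simply zero-filled; the module names and parameter values follow the Section 8 samples, and the precision_* flags are set to 2 purely for illustration.


  program ddgmres_example
    use mpi
    use PARCEL_integer
    use krylov
    use PARCEL_wrapper
    implicit none
    integer n,gn,nnz,istart
    integer itrmax,iret,myrank,ierr
    integer ipreflag,ILU_method,addL,iflagAS,iovlflag
    integer precon_thblock,independ_nvec,nBlock
    integer gmres_m,gmres_GSflag
    integer precision_A,precision_b,precision_x,precision_precon
    real*8  rtolmax
    real*8,allocatable  :: crsA_hi(:),crsA_lo(:)
    integer,allocatable :: crsRow_ptr(:),crsCol(:)
    real*8,allocatable  :: vecx_hi(:),vecx_lo(:),vecb_hi(:),vecb_lo(:)
    real*8,allocatable  :: reshistory(:)
    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD,myrank,ierr)
    call PARCEL_Init(myrank)
    ! Upper words of the matrix come from an arbitrary user routine (cf. Section 8.1);
    ! the lower words are set to zero here.
    call make_matrix_CRS(n,gn,nnz,istart,crsA_hi,crsRow_ptr,crsCol)
    allocate(crsA_lo(nnz)); crsA_lo=0.0d0
    allocate(vecx_hi(n),vecx_lo(n),vecb_hi(n),vecb_lo(n))
    vecx_hi=0.0d0; vecx_lo=0.0d0
    vecb_hi=1.0d0; vecb_lo=0.0d0
    itrmax=100
    allocate(reshistory(itrmax))
    rtolmax=1.0d-8
    ipreflag=0; ILU_method=1; addL=0; iflagAS=1
    iovlflag=0; precon_thblock=-1; independ_nvec=-1; nBlock=2000
    gmres_m=30; gmres_GSflag=1
    precision_A=2; precision_b=2; precision_x=2; precision_precon=2
    call parcel_ddgmres( MPI_COMM_WORLD, vecx_hi,vecx_lo, vecb_hi,vecb_lo, &
         n,gn,nnz,istart, crsA_hi,crsA_lo,crsRow_ptr,crsCol, &
         ipreflag,ILU_method,addL,iflagAS, itrmax,rtolmax, reshistory, &
         iovlflag, precon_thblock, independ_nvec, nBlock, gmres_m,gmres_GSflag, &
         precision_A, precision_b, precision_x, precision_precon, iret )
    call PARCEL_Free()
    call MPI_Finalize(ierr)
  end program ddgmres_example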


6.24 parcel_ddcg_dia (CPU)


call parcel_ddcg_dia( icomm, vecx_hi,vecx_lo, vecb_hi,vecb_lo, n,gn,istart, diaA_hi,diaA_lo,offset,nnd, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, precision_A, precision_b, precision_x, precision_precon, iret )


A simultaneous linear equation system Ax = b is solved by the conjugate gradient method (CG method). Non-zero elements are stored in DIA format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx_hi(n) double precision in/out in: Initial vector (upper bits), out: Solution vector (upper bits)
vecx_lo(n) double precision in/out in: Initial vector (lower bits), out: Solution vector (lower bits)
vecb_hi(n) double precision in Right hand side vector (upper bits)
vecb_lo(n) double precision in Right hand side vector (lower bits)
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
istart integer*4 / integer*8 in Start line of matrix on each process
diaA_hi(n*nnd) double precision in Non-zero elements (upper bits) of matrix stored in DIA format
diaA_lo(n*nnd) double precision in Non-zero elements (lower bits) of matrix stored in DIA format
offset(nnd) integer*4 / integer*8 in Offset of each non-zero diagonal elements array in DIA format
nnd integer*4 / integer*8 in Number of diagonal elements arrays in DIA format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz)
ilu_method integer in Flag for incomplete LU factorization (0: ILU(0), 1: D-ILU (DIA components match), 2: D-ILU (element sums in a row match); only 1 and 2 are available for additive Schwarz)
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
precision_A integer in Precision of matrix (1: double precision, 2: quad precision)
precision_b integer in Precision of right hand side vector (1: double precision, 2: quad precision)
precision_x integer in Precision of solution vector (1: double precision, 2: quad precision)
precision_precon integer in Precision of preconditioner matrix (1: double precision, 2: quad precision)
iret integer out Error code (0:normal)


6.25 parcel_ddbicgstab_dia (CPU)


call parcel_ddbicgstab_dia( icomm, vecx_hi,vecx_lo, vecb_hi,vecb_lo, n,gn,istart, diaA_hi,diaA_lo,offset,nnd, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, precision_A, precision_b, precision_x, precision_precon, iret )


A simultaneous linear equation system Ax = b is solved by the stabilized biconjugate gradient method (Bi-CGSTAB method). Non-zero elements are stored in DIA format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx_hi(n) double precision in/out in: Initial vector (upper bits), out: Solution vector (upper bits)
vecx_lo(n) double precision in/out in: Initial vector (lower bits), out: Solution vector (lower bits)
vecb_hi(n) double precision in Right hand side vector (upper bits)
vecb_lo(n) double precision in Right hand side vector (lower bits)
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
istart integer*4 / integer*8 in Start line of matrix on each process
diaA_hi(n*nnd) double precision in Non-zero elements (upper bits) of matrix stored in DIA format
diaA_lo(n*nnd) double precision in Non-zero elements (lower bits) of matrix stored in DIA format
offset(nnd) integer*4 / integer*8 in Offset of each non-zero diagonal elements array in DIA format
nnd integer*4 / integer*8 in Number of diagonal elements arrays in DIA format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz)
ilu_method integer in Flag for incomplete LU factorization (0: ILU(0), 1: D-ILU (DIA components match), 2: D-ILU (element sums in a row match); only 1 and 2 are available for additive Schwarz)
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
precision_A integer in Precision of matrix (1: double precision, 2: quad precision)
precision_b integer in Precision of right hand side vector (1: double precision, 2: quad precision)
precision_x integer in Precision of solution vector (1: double precision, 2: quad precision)
precision_precon integer in Precision of preconditioner matrix (1: double precision, 2: quad precision)
iret integer out Error code (0:normal)


6.26 parcel_ddgmres_dia (CPU)


call parcel_ddgmres_dia( icomm, vecx_hi,vecx_lo, vecb_hi,vecb_lo, n,gn,istart, diaA_hi,diaA_lo,offset,nnd, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, gmres_m,gmres_GSflag, precision_A, precision_b, precision_x, precision_precon, iret )


A simultaneous linear equation system Ax = b is solved by the generalized minimum residual method (GMRES (m) method). Non-zero elements are stored in DIA format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx_hi(n) double precision in/out in: Initial vector (upper bits), out: Solution vector (upper bits)
vecx_lo(n) double precision in/out in: Initial vector (lower bits), out: Solution vector (lower bits)
vecb_hi(n) double precision in Right hand side vector (upper bits)
vecb_lo(n) double precision in Right hand side vector (lower bits)
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
istart integer*4 / integer*8 in Start line of matrix on each process
diaA_hi(n*nnd) double precision in Non-zero elements (upper bits) of matrix stored in DIA format
diaA_lo(n*nnd) double precision in Non-zero elements (lower bits) of matrix stored in DIA format
offset(nnd) integer*4 / integer*8 in Offset of each non-zero diagonal elements array in DIA format
nnd integer*4 / integer*8 in Number of diagonal elements arrays in DIA format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz)
ilu_method integer in Flag for incomplete LU factorization (0: ILU(0), 1: D-ILU (DIA components match), 2: D-ILU (element sums in a row match); only 1 and 2 are available for additive Schwarz)
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
gmres_m integer in Number of iterations until restart
gmres_GSflag integer in Flag for orthogonalization algorithm (1: MGS, 2: CGS)
precision_A integer in Precision of matrix (1: double precision, 2: quad precision)
precision_b integer in Precision of right hand side vector (1: double precision, 2: quad precision)
precision_x integer in Precision of solution vector (1: double precision, 2: quad precision)
precision_precon integer in Precision of preconditioner matrix (1: double precision, 2: quad precision)
iret integer out Error code (0:normal)


6.27 parcel_ddcg_ddm (CPU)


call parcel_ddcg_ddm( icomm, x_hi,x_lo, b_hi,b_lo, m, nnd,ioff_dia, val_dia_hi,val_dia_lo, num_neighbor_ranks,neighbor_ranks,margin,ncomp_grids, ndiv_grids,num_grids,ioff_grids, div_direc_th, ipreflag, addL,iflagAS, itrmax,rtolmax,reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, precision_A, precision_b, precision_x, precision_precon, iret )


A simultaneous linear equation system Ax = b is solved by the conjugate gradient method (CG method). Non-zero elements are stored in DDM format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
x_hi(m) double precision in/out in: Initial vector (upper bits), out: Solution vector (upper bits)
x_lo(m) double precision in/out in: Initial vector (lower bits), out: Solution vector (lower bits)
b_hi(m) double precision in Right hand side vector (upper bits)
b_lo(m) double precision in Right hand side vector (lower bits)
m integer in Size of vector on each process
nnd integer in Number of diagonal elements arrays on each process
ioff_dia(nnd) integer in Offset of each non-zero diagonal elements array in DDM format
val_dia_hi(m*nnd) double precision in Non-zero elements of matrix stored in DDM format (upper bits)
val_dia_lo(m*nnd) double precision in Non-zero elements of matrix stored in DDM format (lower bits)
num_neighbor_ranks integer in Number of neighboring processes
neighbor_ranks(num_neighbor_ranks) integer in List of neighboring processes
margin integer in Maximum absolute value of ioff_grids
ncomp_grids integer in Dimension of structured grids
ndiv_grids(ncomp_grids) integer in Number of decomposed domains in each direction
num_grids(ncomp_grids) integer in Number of grids on each process
ioff_grids(nnd,ncomp_grids) integer in Offset of stencil data location in each direction
div_direc_th integer in Direction of thread parallelization ( 1:x, 2:y, 3:z)
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz)
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
precision_A integer in Precision of matrix (1: double precision, 2: quad precision)
precision_b integer in Precision of right hand side vector (1: double precision, 2: quad precision)
precision_x integer in Precision of solution vector (1: double precision, 2: quad precision)
precision_precon integer in Precision of preconditioner matrix (1: double precision, 2: quad precision)
iret integer out Error code (0:normal)
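
Most of the arguments above describe a structured-grid stencil in DDM format. As a purely illustrative sketch (the stencil layout, like the matrix values themselves, is defined by the user's matrix generation routine, cf. make_matrix_DDM in Section 8.3), a 3D 7-point stencil with nnd = 7 could be described by the offsets below; margin is then 1 because no offset exceeds one grid cell in any direction.


  program stencil_offsets_example
    implicit none
    ! Hypothetical offsets of a 3D 7-point stencil (nnd = 7, ncomp_grids = 3).
    ! Column 1: x-offset, column 2: y-offset, column 3: z-offset of each diagonal.
    integer,parameter :: nnd = 7, ncomp_grids = 3
    integer :: ioff_grids(nnd,ncomp_grids), margin
    ioff_grids(1,:) = (/ 0, 0, 0 /)   ! centre point
    ioff_grids(2,:) = (/-1, 0, 0 /)   ! x-1
    ioff_grids(3,:) = (/ 1, 0, 0 /)   ! x+1
    ioff_grids(4,:) = (/ 0,-1, 0 /)   ! y-1
    ioff_grids(5,:) = (/ 0, 1, 0 /)   ! y+1
    ioff_grids(6,:) = (/ 0, 0,-1 /)   ! z-1
    ioff_grids(7,:) = (/ 0, 0, 1 /)   ! z+1
    margin = maxval(abs(ioff_grids))  ! = 1 for this stencil
    print *, 'margin =', margin
  end program stencil_offsets_example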


6.28 parcel_ddbicgstab_ddm (CPU)


call parcel_ddbicgstab_ddm( icomm, x_hi,x_lo, b_hi,b_lo, m, nnd,ioff_dia, val_dia_hi,val_dia_lo, num_neighbor_ranks,neighbor_ranks,margin,ncomp_grids, ndiv_grids,num_grids,ioff_grids, div_direc_th, ipreflag, addL,iflagAS, itrmax,rtolmax,reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, precision_A, precision_b, precision_x, precision_precon, iret )


A simultaneous linear equation system Ax = b is solved by the stabilized biconjugate gradient method (Bi-CGSTAB method). Non-zero elements are stored in DDM format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
x_hi(m) double precision in/out in: Initial vector (upper bits), out: Solution vector (upper bits)
x_lo(m) double precision in/out in: Initial vector (lower bits), out: Solution vector (lower bits)
b_hi(m) double precision in Right hand side vector (upper bits)
b_lo(m) double precision in Right hand side vector (lower bits)
m integer in Size of vector on each process
nnd integer in Number of diagonal elements arrays on each process
ioff_dia(nnd) integer in Offset of each non-zero diagonal elements array in DDM format
val_dia_hi(m*nnd) double precision in Non-zero elements of matrix stored in DDM format (upper bits)
val_dia_lo(m*nnd) double precision in Non-zero elements of matrix stored in DDM format (lower bits)
num_neighbor_ranks integer in Number of neighboring processes
neighbor_ranks(num_neighbor_ranks) integer in List of neighboring processes
margin integer in Maximum absolute value of ioff_grids
ncomp_grids integer in Dimension of structured grids
ndiv_grids(ncomp_grids) integer in Number of decomposed domains in each direction
num_grids(ncomp_grids) integer in Number of grids on each process
ioff_grids(nnd,ncomp_grids) integer in Offset of stencil data location in each direction
div_direc_th integer in Direction of thread parallelization ( 1:x, 2:y, 3:z)
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz)
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
precision_A integer in Precision of matrix (1: double precision, 2: quad precision)
precision_b integer in Precision of right hand side vector (1: double precision, 2: quad precision)
precision_x integer in Precision of solution vector (1: double precision, 2: quad precision)
precision_precon integer in Precision of preconditioner matrix (1: double precision, 2: quad precision)
iret integer out Error code (0:normal)


6.29 parcel_ddgmres_ddm (CPU)


call parcel_ddgmres_ddm( icomm, x_hi,x_lo, b_hi,b_lo, m, nnd,ioff_dia, val_dia_hi,val_dia_lo, num_neighbor_ranks,neighbor_ranks,margin,ncomp_grids, ndiv_grids,num_grids,ioff_grids, div_direc_th, ipreflag, addL,iflagAS, itrmax,rtolmax,reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, gmres_m, gmres_GSflag, precision_A, precision_b, precision_x, precision_precon, iret )


A simultaneous linear equation system Ax = b is solved by the generalized minimum residual method (GMRES (m) method). Non-zero elements are stored in DDM format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
x_hi(m) double precision in/out in: Initial vector (upper bits), out: Solution vector (upper bits)
x_lo(m) double precision in/out in: Initial vector (lower bits), out: Solution vector (lower bits)
b_hi(m) double precision in Right hand side vector (upper bits)
b_lo(m) double precision in Right hand side vector (lower bits)
m integer in Size of vector on each process
nnd integer in Number of diagonal elements arrays on each process
ioff_dia(nnd) integer in Offset of each non-zero diagonal elements array in DDM format
val_dia_hi(m*nnd) double precision in Non-zero elements of matrix stored in DDM format (upper bits)
val_dia_lo(m*nnd) double precision in Non-zero elements of matrix stored in DDM format (lower bits)
num_neighbor_ranks integer in Number of neighboring processes
neighbor_ranks(num_neighbor_ranks) integer in List of neighboring processes
margin integer in Maximum absolute value of ioff_grids
ncomp_grids integer in Dimension of structured grids
ndiv_grids(ncomp_grids) integer in Number of decomposed domains in each direction
num_grids(ncomp_grids) integer in Number of grids on each process
ioff_grids(nnd,ncomp_grids) integer in Offset of stencil data location in each direction
div_direc_th integer in Direction of thread parallelization ( 1:x, 2:y, 3:z)
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz)
addL integer in Overlapping width in the additive Schwarz method
iflagAS integer in Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
gmres_m integer in Number of iterations until restart
gmres_GSflag integer in Flag for orthogonalization algorithm (1: MGS, 2: CGS)
precision_A integer in Precision of matrix (1: double precision, 2: quad precision)
precision_b integer in Precision of right hand side vector (1: double precision, 2: quad precision)
precision_x integer in Precision of solution vector (1: double precision, 2: quad precision)
precision_precon integer in Precision of preconditioner matrix (1: double precision, 2: quad precision)
iret integer out Error code (0:normal)


6.30 parcel_dqr (CPU)


call parcel_dqr(n,s,icomm,V,Q,R,iQRflag)


The QR factorization of the m*s matrix V is computed, where m is the sum of the parameter n over the MPI processes belonging to the MPI communicator icomm.

Parameter(dimension) Type Input/Output Description
n integer*4 / integer*8 in Number of rows of the matrix on each process
s integer*4 / integer*8 in Number of columns of the matrix
icomm integer in MPI communicator
V(s*n) double precision in Input matrix (column major)
Q(s*n) double precision out Orthogonal matrix (column major)
R(s*s) double precision out Upper triangular matrix (column major)
iQRflag integer in Flag for QR factorization (1:MGS,2:CGS,3:TSQR,4:CholeskyQR,5:CholeskyQR2)
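
A minimal usage sketch is shown below (illustrative only; module names and initialization follow the Section 8 samples). Each process fills its local n*s block of V, stored column major, with random numbers, and parcel_dqr returns the corresponding block of Q together with the s*s factor R computed by the algorithm selected through iQRflag.


  program dqr_example
    use mpi
    use PARCEL_wrapper
    implicit none
    integer,parameter :: n = 1000   ! local rows on this process
    integer,parameter :: s = 8      ! columns of the tall-skinny matrix
    real*8  :: V(s*n), Q(s*n), R(s*s)
    integer :: iQRflag, myrank, ierr
    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD,myrank,ierr)
    call PARCEL_Init(myrank)
    call random_number(V)           ! local n x s block (column major)
    iQRflag = 5                     ! CholeskyQR2
    call parcel_dqr(n, s, MPI_COMM_WORLD, V, Q, R, iQRflag)
    call PARCEL_Free()
    call MPI_Finalize(ierr)
  end program dqr_example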


7 CUDA support routines for Fortran

 The interface of each CUDA support routine for Fortran is described. These routines can be replaced with other implementations such as CUDA Fortran and OpenACC.

7.1 cudaMalloc_FP64 (GPU)


call cudaMalloc_FP64(a_d,n)


 Allocates GPU memory of type double precision and assigns it to the type(c_ptr) member mem of the type(devicemem) structure.

Parameter(dimension) Type Input/Output Description
a_d type(devicemem) out PARCEL GPU memory structure
n integer*4 / integer*8 in Number of elements to allocate


7.2 cudaMalloc_PARCEL_INT (GPU)


call cudaMalloc_PARCEL_INT(a_d,n)


 Allocates GPU memory of the PARCEL integer type (integer*4 in the 32-bit integer version, integer*8 in the 64-bit integer version) and assigns it to the type(c_ptr) member mem of the type(devicemem) structure.

Parameter(dimension) Type Input/Output Description
a_d type(devicemem) out PARCEL GPU memory structure
n integer*4 / integer*8 in Number of elements to allocate


7.3 cudaMalloc_INT (GPU)


call cudaMalloc_INT(a_d,n)


 Allocates GPU memory of type integer*4 and assigns it to the type(c_ptr) member mem of the type(devicemem) structure.

Parameter(dimension) Type Input/Output Description
a_d type(devicemem) out PARCEL GPU memory structure
n integer*4 / integer*8 in Number of elements to allocate


7.4 cudaFree (GPU)


call cudaFree(a_d)


 Frees the GPU memory.

Parameter(dimension) Type Input/Output Description
a_d type(devicemem) out PARCEL GPU memory structure


7.5 cudaMemcpy_h2d_FP64 (GPU)


call cudaMemcpy_h2d_FP64(a_h,a_d,n,offset_h,offset_d)


 Copies FP64 data from CPU memory to GPU memory.

Parameter(dimension) Type Input/Output Description
a_h double precision in Source CPU memory
a_d type(devicemem) out Destination PARCEL GPU memory structure
n integer*4 / integer*8 in Number of elements to copy
offset_h integer*4 / integer*8 in Offset from the start address of the copy source a_h
offset_d integer*4 / integer*8 in Offset from the start address of the copy destination a_d


7.6 cudaMemcpy_h2d_PARCEL_INT (GPU)


call cudaMemcpy_h2d_PARCEL_INT(a_h,a_d,n,offset_h,offset_d)


 Copies PARCEL integer data (integer*4 in the 32-bit integer version, integer*8 in the 64-bit integer version) from CPU memory to GPU memory.

Parameter(dimension) Type Input/Output Description
a_h integer*4 / integer*8 in Source CPU memory
a_d type(devicemem) out Destination PARCEL GPU memory structure
n integer*4 / integer*8 in Number of elements to copy
offset_h integer*4 / integer*8 in Offset from the start address of the copy source a_h
offset_d integer*4 / integer*8 in Offset from the start address of the copy destination a_d


7.7 cudaMemcpy_h2d_INT (GPU)


call cudaMemcpy_h2d_INT(a_h,a_d,n,offset_h,offset_d)


 Copies integer*4 data from CPU memory to GPU memory.

Parameter(dimension) Type Input/Output Description
a_h integer*4 in Source CPU memory
a_d type(devicemem) out Destination PARCEL GPU memory structure
n integer*4 / integer*8 in Number of elements to copy
offset_h integer*4 / integer*8 in Offset from the start address of the copy source a_h
offset_d integer*4 / integer*8 in Offset from the start address of the copy destination a_d


7.8 cudaMemcpy_d2h_FP64 (GPU)


call cudaMemcpy_d2h_FP64(a_d,a_h,n,offset_d,offset_h)


 Copies FP64 data from GPU memory to CPU memory.

Parameter(dimension) Type Input/Output Description
a_d type(devicemem) in Source PARCEL GPU memory structure
a_h double precision out Destination CPU memory
n integer*4 / integer*8 in Number of elements to copy
offset_d integer*4 / integer*8 in Offset from the start address of the copy source a_d
offset_h integer*4 / integer*8 in Offset from the start address of the copy destination a_h


7.9 cudaMemcpy_d2h_PARCEL_INT (GPU)


call cudaMemcpy_d2h_PARCEL_INT(a_d,a_h,n,offset_d,offset_h)


 Copies PARCEL integer data (integer*4 in the 32-bit integer version, integer*8 in the 64-bit integer version) from GPU memory to CPU memory.

Parameter(dimension) Type Input/Output Description
a_d type(devicemem) in Source PARCEL GPU memory structure
a_h integer*4 / integer*8 out Destination CPU memory
n integer*4 / integer*8 in Number of elements to copy
offset_d integer*4 / integer*8 in Offset from the start address of the copy source a_d
offset_h integer*4 / integer*8 in Offset from the start address of the copy destination a_h


7.10 cudaMemcpy_d2h_INT (GPU)


call cudaMemcpy_d2h_INT(a_d,a_h,n,offset_d,offset_h)


 Copies integer*4 data from GPU memory to CPU memory.

Parameter(dimension) Type Input/Output Description
a_d type(devicemem) in Source PARCEL GPU memory structure
a_h integer*4 out Destination CPU memory
n integer*4 / integer*8 in Number of elements to copy
offset_d integer*4 / integer*8 in Offset from the start address of the copy source a_d
offset_h integer*4 / integer*8 in Offset from the start address of the copy destination a_h
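
Taken together, these routines cover the host-device transfers used by the GPU samples in Section 8. The sketch below is a minimal round trip (illustrative only; the module names and the intType kind follow the Section 8 samples): it allocates a double-precision device buffer, copies an array to the GPU and back, and frees the buffer.


  program cuda_roundtrip_example
    use mpi
    use PARCEL_integer
    use PARCEL_wrapper
    implicit none
    integer,parameter :: n = 1000
    integer(kind=intType),parameter :: offset_zero = 0
    real*8 :: a_h(n), b_h(n)
    type(devicemem) :: a_d
    integer myrank,ierr
    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD,myrank,ierr)
    call PARCEL_Init(myrank)
    a_h = 1.0d0
    call cudaMalloc_FP64(a_d,n)                                      ! n doubles on the GPU
    call cudaMemcpy_h2d_FP64(a_h,a_d,n,offset_zero,offset_zero)      ! host -> device
    call cudaMemcpy_d2h_FP64(a_d,b_h,n,offset_zero,offset_zero)      ! device -> host
    call cudaFree(a_d)
    call PARCEL_Free()
    call MPI_Finalize(ierr)
  end program cuda_roundtrip_example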


8 How to use (Fortran)

The use of the PARCEL routines is explained below using Fortran sample programs of a CG solver. If the preprocessor option CUDA_SOLVER is enabled, a CUDA (GPU) version of each sample program is generated.

8.1 CRS Format (Fortran)

A sample code in CRS format is shown below. Here, make_matrix_CRS denotes an arbitrary user routine that generates a matrix in the CRS format used by PARCEL.


  program main
    use mpi
    use PARCEL_integer
    use krylov
    use PARCEL_wrapper
    implicit none
    integer n,gn,nnz,istart
    real*8,allocatable  :: crsA(:)
    integer,allocatable :: crsRow_ptr(:),crsCol(:)
    real*8,allocatable  :: vecx(:)
    real*8,allocatable  :: vecb(:)
    integer itrmax
    real*8  rtolmax
    real*8, allocatable :: reshistory(:)
    integer ipreflag
    integer ILU_method
    integer iflagAS
    integer addL
    integer iovlflag
    integer precon_thblock
    integer independ_nvec
    integer nblock
    integer iret
    integer myrank
    integer ierr
 #ifdef CUDA_SOLVER
    type(devicemem) cu_vecx
    type(devicemem) cu_vecb
    type(devicemem) cu_crsA
    type(devicemem) cu_crsRow_ptr
    type(devicemem) cu_crsCol
    integer(kind=intType),parameter :: offset_zero=0
 #endif
    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD,myrank,ierr)
    call PARCEL_Init(myrank)
    call make_matrix_CRS(n,gn,nnz,istart,crsA,crsRow_ptr,crsCol)
    allocate(vecx(n))
    allocate(vecb(n))
    itrmax=100
    allocate(reshistory(itrmax))
    ipreflag=0
    ILU_method=1
    addL=0
    iflagAS=1
    rtolmax=1.0d-8
    iovlflag=0
    precon_thblock=-1
    independ_nvec=-1
    nblock = 2000
    vecb=1.0d0
    vecx=1.0d0
 #ifdef CUDA_SOLVER
    call cudaMalloc_FP64(cu_vecx,n)
    call cudaMalloc_FP64(cu_vecb,n)
    call cudaMalloc_FP64(cu_crsA,nnz)
    call cudaMalloc_PARCEL_INT(cu_crsRow_ptr,n+1)
    call cudaMalloc_PARCEL_INT(cu_crsCol,nnz)
    call cudaMemcpy_h2d_fp64(vecx,cu_vecx,n,offset_zero,offset_zero)
    call cudaMemcpy_h2d_fp64(vecb,cu_vecb,n,offset_zero,offset_zero)
    call cudaMemcpy_h2d_fp64(crsA,cu_crsA,nnz,offset_zero,offset_zero)
    call cudaMemcpy_h2d_PARCEL_INT(crsRow_ptr,cu_crsRow_ptr,n+1,offset_zero,offset_zero)
    call cudaMemcpy_h2d_PARCEL_INT(crsCol,cu_crsCol,nnz,offset_zero,offset_zero)
 #endif
    call parcel_dcg( &
         MPI_COMM_WORLD, &
 #ifdef CUDA_SOLVER
         cu_vecx%mem,cu_vecb%mem, &
         n,gn,nnz,istart, &
         cu_crsA%mem,cu_crsRow_ptr%mem,cu_crsCol%mem,&
 #else
         vecx,vecb, &
         n,gn,nnz,istart, &
         crsA,crsRow_ptr,crsCol, &
 #endif
         ipreflag,ILU_method,addL,iflagAS, &
         itrmax,rtolmax, &
         reshistory, &
         iovlflag, &
         precon_thblock, independ_nvec, &
         nblock, &
         iret &
         )
 #ifdef CUDA_SOLVER
    call cudaMemcpy_d2h_fp64(cu_vecx,vecx,n,offset_zero,offset_zero)
    call cudaFree(cu_vecx)
    call cudaFree(cu_vecb)
    call cudaFree(cu_crsA)
    call cudaFree(cu_crsRow_ptr)
    call cudaFree(cu_crsCol)
 #endif
    call PARCEL_Free()
    call MPI_Finalize(ierr)
    deallocate(vecx)
    deallocate(vecb)
    deallocate(reshistory)
  end program main


8.2 DIA Format (Fortran)

A sample code in DIA format is shown below. Here, make_matrix_DIA denotes an arbitrary user routine that generates a matrix in the DIA format used by PARCEL.


  program main
    use mpi
    use PARCEL_integer
    use krylov
    use PARCEL_wrapper
    implicit none
    integer n,gn,nnd,istart
    real*8,allocatable  :: diaA(:)
    integer,allocatable :: offset(:)
    real*8,allocatable  :: vecx(:)
    real*8,allocatable  :: vecb(:)
    integer itrmax
    real*8  rtolmax
    real*8, allocatable :: reshistory(:)
    integer ipreflag
    integer ILU_method
    integer iflagAS
    integer addL
    integer iovlflag
    integer precon_thblock
    integer independ_nvec
    integer nblock
    integer iret
    integer myrank
    integer ierr
 #ifdef CUDA_SOLVER
    type(devicemem) cu_vecx
    type(devicemem) cu_vecb
    type(devicemem) cu_diaA
    type(devicemem) cu_offset
    integer(kind=intType),parameter :: offset_zero=0
 #endif
    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD,myrank,ierr)
    call PARCEL_Init(myrank)
    call make_matrix_DIA(n,gn,nnd,istart,diaA,offset)
    allocate(vecx(n))
    allocate(vecb(n))
    itrmax=100
    allocate(reshistory(itrmax))
    ipreflag=0
    ILU_method=1
    addL=0
    iflagAS=1
    rtolmax=1.0d-8
    iovlflag=0
    precon_thblock=-1
    independ_nvec=-1
    nblock = 2000
    vecb=1.0d0
    vecx=1.0d0
 #ifdef CUDA_SOLVER
    call cudaMalloc_FP64(cu_vecx,n)
    call cudaMalloc_FP64(cu_vecb,n)
    call cudaMalloc_FP64(cu_diaA,n*nnd)
    call cudaMalloc_PARCEL_INT(cu_offset,nnd)
    call cudaMemcpy_h2d_fp64(vecx,cu_vecx,n,offset_zero,offset_zero)
    call cudaMemcpy_h2d_fp64(vecb,cu_vecb,n,offset_zero,offset_zero)
    call cudaMemcpy_h2d_fp64(diaA,cu_diaA,n*nnd,offset_zero,offset_zero)
    call cudaMemcpy_h2d_PARCEL_INT(offset,cu_offset,nnd,offset_zero,offset_zero)
 #endif
    call parcel_dcg_dia(&
         MPI_COMM_WORLD, &
 #ifdef CUDA_SOLVER
         cu_vecx%mem,cu_vecb%mem, &
         n,gn,istart,&
         cu_diaA%mem,cu_offset%mem,nnd, &
 #else
         vecx,vecb,&
         n,gn,istart,&
         diaA,offset,nnd,&
 #endif
         ipreflag,ILU_method,addL,iflagAS,&
         itrmax,rtolmax,&
         reshistory,&
         iovlflag, &
         precon_thblock, independ_nvec, &
         nblock, &
         iret&
         )
 #ifdef CUDA_SOLVER
    call cudaMemcpy_d2h_fp64(cu_vecx,vecx,n,offset_zero,offset_zero)
    call cudaFree(cu_vecx)
    call cudaFree(cu_vecb)
    call cudaFree(cu_diaA)
    call cudaFree(cu_offset)
 #endif
    call PARCEL_Free()
    call MPI_Finalize(ierr)
    deallocate(vecx)
    deallocate(vecb)
    deallocate(reshistory)
  end program main


8.3 DDM Format (Fortran)

A Fortran sample code in DDM format is shown below. Here, make_matrix_DDM denotes an arbitrary user routine that generates a matrix in the DDM format used by PARCEL.


    program main
      use mpi
      use PARCEL_integer
      use krylov
      use ddm_wrapper
      implicit none
      integer ityp_eq
      integer gnx,gny,gnz
      integer itrmax
      integer MAX_NITER
      integer nx,ny,nz
      integer nx0,ny0,nz0
      integer n,gn
      integer m
      integer m_
      integer npes,myrank
      integer ierr
      integer i
      integer ipreflag
      integer addL
      integer iflagAS
      real*8  rtolmax
      real*8  abstolmax
      integer solver,ityp_solver
      integer iret
      real*8,allocatable  :: reshistory_DDM(:)
      integer ILU_method
      integer iovlflag
      real*8,allocatable  :: vecx(:)
      real*8,allocatable  :: vecx0(:)
      real*8,allocatable  :: vecb(:)
      real*8  stime,etime
      integer niter,restart
      integer npey,npez
      integer precon_thblock
      integer independ_nvec
      integer nblock
      integer npe_x,npe_y,npe_z
      integer div_direc_th
      integer rank_x, rank_y, rank_z
      integer,parameter :: nnd = 7
      integer,parameter :: margin = 1
      integer,parameter :: ndim=3
      integer npe_dim(ndim)
      integer gn_grids(ndim)
      integer n_grids(ndim)
      integer istart_grids(ndim)
      integer,parameter :: num_neighbor_ranks_max = 3**ndim -1
      integer neighbor_ranks( num_neighbor_ranks_max )
      integer num_neighbor_ranks
      real*8, allocatable  :: diaA(:)
      integer, allocatable :: offset(:)
      integer, allocatable :: offset_dim(:,:)
 #ifdef CUDA_SOLVER
      type(devicemem) cu_vecx
      type(devicemem) cu_vecb
      type(devicemem) cu_diaA
      type(devicemem) cu_offset
      integer(kind=intType),parameter :: offset_zero=0
 #endif
      namelist/input/ &
           ityp_eq, &
           gnx, &
           gny, &
           gnz, &
           MAX_NITER, &
           rtolmax, &
           abstolmax, &
           ipreflag, &
           independ_nvec, &
           addL, &
           iflagAS, &
           ILU_method, &
           iovlflag, &
           ityp_solver, &
           precon_thblock, &
           nblock
      namelist/input_ddm/ &
           npe_x,&
           npe_y,&
           npe_z,&
           div_direc_th
      open(33,file='input_namelist')
      read(33,nml=input)
      read(33,nml=input_ddm)
      close(33)
      stime = 0.0d0
      etime = 0.0d0
      itrmax = MAX_NITER
      solver = ityp_solver
      call MPI_Init(ierr)
      call MPI_Comm_size(MPI_COMM_WORLD,npes,ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD,myrank,ierr)
      call PARCEL_Init(myrank)
      gn_grids(1)  = gnx
      gn_grids(2)  = gny
      gn_grids(3)  = gnz
      call set_rank_xyz( myrank, npe_x, npe_y, npe_z, rank_x, rank_y, rank_z )
      call set_grids_decompostion( &
           npe_x, npe_y, npe_z, &
           rank_x, rank_y, rank_z,  &
           ndim, gn_grids, npe_dim, n_grids,  &
           istart_grids, margin, m )
      num_neighbor_ranks = -1
      call set_neighbor_ranks( &
           npe_x, npe_y, npe_z, rank_x, rank_y, rank_z, &
           num_neighbor_ranks, &
           neighbor_ranks, num_neighbor_ranks_max)
      allocate( diaA(nnd*m) )
      allocate( offset(nnd) )
      allocate( offset_dim(nnd,ndim) )
      allocate( reshistory_DDM(MAX_NITER) )
      allocate( vecx(m) )
      allocate( vecx0(m) )
      allocate( vecb(m) )
      diaA = 0.0d0
      vecx = 0.0d0
      vecx0 = 0.0d0
      vecb = 0.0d0
      reshistory_DDM = 0
      offset = 0
      offset_dim = 0
      call make_matrix_DDM( &
           gnx,gny,gnz, &
           ndim, n_grids,  &
           istart_grids, &
           margin, nnd, m, &
           offset_dim, offset, &
           diaA, vecb, vecx, vecx0,&
           rank_x, rank_y, rank_z, &
           npe_x, npe_y, npe_z )
 #ifdef CUDA_SOLVER
      call cudaMalloc_FP64(cu_vecx,m)
      call cudaMalloc_FP64(cu_vecb,m)
      call cudaMalloc_FP64(cu_diaA,m*nnd)
      call cudaMalloc_INT(cu_offset,nnd)
      call cudaMemcpy_h2d_fp64(vecx,cu_vecx,m,offset_zero,offset_zero)
      call cudaMemcpy_h2d_fp64(vecb,cu_vecb,m,offset_zero,offset_zero)
      call cudaMemcpy_h2d_fp64(diaA,cu_diaA,m*nnd,offset_zero,offset_zero)
      call cudaMemcpy_h2d_INT(offset,cu_offset, nnd,offset_zero,offset_zero)
 #endif
      call parcel_dcg_ddm(&
           MPI_COMM_WORLD, &
 #ifdef CUDA_SOLVER
           cu_vecx%mem,cu_vecb%mem, &
           m, &
           nnd,  cu_offset%mem, cu_diaA%mem, &
 #else
           vecx,vecb,  &
           m, &
           nnd, offset, diaA, &
 #endif
           num_neighbor_ranks, neighbor_ranks, margin, ndim,&
           npe_dim, n_grids, offset_dim,&
           div_direc_th, &
           ipreflag, &
           addL, iflagAS,&
           MAX_NITER, rtolmax, reshistory_DDM, &
           iovlflag, &
           precon_thblock, &
           independ_nvec, &
           nBlock, &
           iret &
           )
 #ifdef CUDA_SOLVER
      call cudaMemcpy_d2h_fp64(cu_vecx,vecx,m,offset_zero,offset_zero)
      call cudaFree(cu_vecx)
      call cudaFree(cu_vecb)
      call cudaFree(cu_diaA)
      call cudaFree(cu_offset)
 #endif
      call MPI_Barrier(MPI_COMM_WORLD,ierr)
      deallocate( diaA )
      deallocate( offset )
      deallocate( offset_dim )
      deallocate( reshistory_DDM )
      deallocate( vecx )
      deallocate( vecx0 )
      deallocate( vecb )
      call PARCEL_Free()
      call MPI_Finalize(ierr)
    end program main


9 How to use (C)

A Fortran routine can be called from C by appending "_" to the end of the routine name and passing every argument as a pointer. The PARCEL routines can therefore also be used in C programs. Regarding the use of MPI, the C MPI communicator should be converted to a Fortran MPI communicator via MPI_Comm_c2f in the MPI library. The use of the PARCEL routines is explained below using C sample programs of a CG solver. If the preprocessor option CUDA_SOLVER is enabled, a CUDA (GPU) version of each sample program is generated.
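
For example (a minimal sketch of the calling convention only; solve is a hypothetical Fortran subroutine, not a PARCEL routine), a Fortran subroutine solve(n, x) is referenced from C as solve_, with every argument passed by address, and the communicator is converted with MPI_Comm_c2f before it is handed to Fortran:


   #include "mpi.h"

   /* Hypothetical Fortran routine  subroutine solve(n, x)  seen from C:
      trailing underscore, every argument passed by address. */
   extern void solve_(int *n, double *x);

   int main( int argc, char *argv[]){
     int n = 4;
     double x[4] = {0.0, 0.0, 0.0, 0.0};
     MPI_Init(&argc,&argv);
     /* Convert the C communicator to a Fortran handle before passing it on. */
     MPI_Fint icomm_fort = MPI_Comm_c2f(MPI_COMM_WORLD);
     (void)icomm_fort;
     /* The actual call would read:  solve_(&n, x);  (scalars by address, arrays as pointers). */
     (void)n; (void)x;
     MPI_Finalize();
     return 0;
   }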

9.1 CRS Format (C)

A sample code in CRS format is shown below. Here, make_matrix_CRS denotes an arbitrary user routine that generates a matrix in the CRS format used by PARCEL.


 
   #include < stdio.h >
   #include < stdlib.h >
   #include "mpi.h"
   #ifdef CUDA_SOLVER
   #include < cuda.h >
   #include < cuda_runtime.h >
   #endif
   int main( int argc, char *argv[]){
     int npes;
     int myrank;
     MPI_Fint icomm_fort;
     MPI_Init(&argc,&argv);
     MPI_Comm_size(MPI_COMM_WORLD, &npes);
     MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
     icomm_fort=MPI_Comm_c2f(MPI_COMM_WORLD);
     parcel_init_c_(&myrank);
     int n,gn,nnz,istart;
     double* crsA;
     int* crsRow_ptr;
     int* crsCol;
     make_matrix_CRS(&n,&gn,&nnz,&istart,&crsA,&crsRow_ptr,&crsCol);
     double *vecx = (double *)malloc(sizeof(double)*n);
     double *vecb = (double *)malloc(sizeof(double)*n);
     for(int i=0;i < n;i++){
       vecx[i] = 0.0;
       vecb[i] = 1.0;
     }
     int ipreflag = 0;
     int ilu_method = 1;
     int addl = 0;
     int iflagas = 1;
     int max_niter = 100;
     double rtolmax = 1.0e-8;
     int iret;
     double *reshistory = (double *)malloc(sizeof(double)*max_niter);
     int iovlflag = 0;
     int precon_thblock = -1;
     int independ_nvec = -1;
     int nBlock = 2000;
 #ifdef CUDA_SOLVER
     typedef struct Fortran_GPUMem{
       void *a_d;
     } devicemem;
     size_t len = sizeof(double)*n;
     devicemem cu_vecx;cudaMalloc(&(cu_vecx.a_d),len);
     cudaMemcpy( cu_vecx.a_d, vecx,  len, cudaMemcpyHostToDevice);
     devicemem cu_vecb;cudaMalloc(&(cu_vecb.a_d),len);
     cudaMemcpy( cu_vecb.a_d, vecb,  len, cudaMemcpyHostToDevice);
     len = sizeof(double)*nnz;
     devicemem cu_crsa;cudaMalloc(&(cu_crsa.a_d),len);
     cudaMemcpy( cu_crsa.a_d, crsA,  len, cudaMemcpyHostToDevice);
     len = sizeof(int)*nnz;
     devicemem cu_crscol;cudaMalloc(&(cu_crscol.a_d),len);
     cudaMemcpy( cu_crscol.a_d, crsCol,  len, cudaMemcpyHostToDevice);
     len = sizeof(int)*(n+1);
     devicemem cu_crsrow_ptr;cudaMalloc(&(cu_crsrow_ptr.a_d),len);
     cudaMemcpy( cu_crsrow_ptr.a_d, crsRow_ptr,  len, cudaMemcpyHostToDevice);
 #endif
     parcel_dcg_c_(
          &icomm_fort ,
 #ifdef CUDA_SOLVER
          &cu_vecx,
          &cu_vecb,
          &n,
          &gn,
          &nnz,
          &istart,
          &cu_crsa ,
          &cu_crsrow_ptr,
          &cu_crscol,
 #else
          vecx,
          vecb,
          &n,
          &gn,
          &nnz,
          &istart,
         crsA ,
         crsRow_ptr,
         crsCol,
 #endif
          &ipreflag ,
          &ilu_method ,
          &addl ,
          &iflagas ,
          &max_niter ,
          &rtolmax ,
          reshistory,
          &iovlflag ,
          &precon_thblock ,
          &independ_nvec ,
          &nBlock,
          &iret
          );
 #ifdef CUDA_SOLVER
     len = sizeof(double)*n;
     cudaMemcpy(  vecx,cu_vecx.a_d,  len, cudaMemcpyDeviceToHost);
 
     cudaFree(cu_vecx.a_d);
     cudaFree(cu_vecb.a_d);
     cudaFree(cu_crsa.a_d);
     cudaFree(cu_crscol.a_d);
     cudaFree(cu_crsrow_ptr.a_d);
 #endif
     free(vecx);
     free(vecb);
     free(reshistory);
     parcel_free_c_();
     MPI_Finalize();
     }


9.2 DIA Format (C)

A sample code in DIA format is shown below. Here, make_matrix_DIA denotes an arbitrary user routine that generates a matrix in the DIA format used by PARCEL.


 
   #include < stdio.h >
   #include < stdlib.h >
   #include "mpi.h"
   #ifdef CUDA_SOLVER
   #include < cuda.h >
   #include < cuda_runtime.h >
   #endif
   int main( int argc, char *argv[]){
     int npes;
     int myrank;
     MPI_Fint icomm_fort;
     MPI_Init(&argc,&argv);
     MPI_Comm_size(MPI_COMM_WORLD, &npes);
     MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
     icomm_fort=MPI_Comm_c2f(MPI_COMM_WORLD);
     parcel_init_c_(&myrank);
     int n,gn,nnd,istart;
     double* diaA;
     int* offset;
     make_matrix_DIA(&n,&gn,&nnd,&istart,&diaA,&offset);
     double *vecx = (double *)malloc(sizeof(double)*n);
     double *vecb = (double *)malloc(sizeof(double)*n);
     for(int i=0;i < n;i++){
       vecx[i] = 0.0;
       vecb[i] = 1.0;
     }
     int ipreflag = 0;
     int ilu_method = 1;
     int addl = 0;
     int iflagas = 1;
     int max_niter = 100;
     double rtolmax = 1.0e-8;
     int iret;
     double *reshistory = (double *)malloc(sizeof(double)*max_niter);
     int iovlflag = 0;
     int precon_thblock = -1;
     int independ_nvec = -1;
     int nBlock = 2000;
 #ifdef CUDA_SOLVER
     typedef struct Fortran_GPUMem{
       void *a_d;
     } devicemem;
     size_t len = sizeof(double)*n;
     devicemem cu_vecx;cudaMalloc(&(cu_vecx.a_d),len);
     cudaMemcpy( cu_vecx.a_d, vecx,  len, cudaMemcpyHostToDevice);
     devicemem cu_vecb;cudaMalloc(&(cu_vecb.a_d),len);
     cudaMemcpy( cu_vecb.a_d, vecb,  len, cudaMemcpyHostToDevice);
     len = sizeof(double)*n*nnd;
     devicemem cu_diaA;cudaMalloc(&(cu_diaA.a_d),len);
     cudaMemcpy( cu_diaA.a_d, diaA,  len, cudaMemcpyHostToDevice);
     len = sizeof(int)*nnd;
     devicemem cu_offset;cudaMalloc(&(cu_offset.a_d),len);
     cudaMemcpy( cu_offset.a_d, offset,  len, cudaMemcpyHostToDevice);
 #endif
     parcel_dcg_dia_c_(
          &icomm_fort ,
 #ifdef CUDA_SOLVER
          &cu_vecx,
          &cu_vecb,
          &n,
          &gn,
          &istart,
          &cu_diaA,
          &cu_offset,
 #else
          vecx,
          vecb,
          &n,
          &gn,
          &istart,
          diaA ,
          offset,
 #endif
          &nnd,
          &ipreflag ,
          &ilu_method ,
          &addl ,
          &iflagas ,
          &max_niter ,
          &rtolmax ,
          reshistory,
          &iovlflag ,
          &precon_thblock ,
          &independ_nvec ,
          &nBlock,
          &iret
          );
 #ifdef CUDA_SOLVER
     len = sizeof(double)*n;
     cudaMemcpy(  vecx,cu_vecx.a_d,  len, cudaMemcpyDeviceToHost);
     cudaFree(cu_vecx.a_d);
     cudaFree(cu_vecb.a_d);
     cudaFree(cu_diaA.a_d);
     cudaFree(cu_offset.a_d);
 #endif
     free(vecx);
     free(vecb);
     free(reshistory);
     parcel_free_c_();
     MPI_Finalize();
     }


9.3 DDM Format (C)

A C sample code in DDM format is shown below. Here, make_matrix_ddm denotes an arbitrary user routine that generates a matrix in the DDM format used by PARCEL. set_rank_xyz, set_grids_decompostion, and set_neighbor_ranks are arbitrary user routines that determine the process decomposition and the neighboring processes in the format required by PARCEL.


 
   #include < stdio.h >
   #include < stdlib.h >
   #include < math.h >
   #include "mpi.h"
   #ifdef CUDA_SOLVER
   #include < cuda.h >
   #include < cuda_runtime.h >
   #endif
   int main( int argc, char *argv[]){
     int ityp_eq;
     int itrmax;
     int max_niter;
     int nx,ny,nz;
     int nx0,ny0,nz0;
     int n,gn;
     int istart,iend;
     int npes,myrank;
     int ierr;
     int i;
     int iret;
     int myranky,myrankz;
     int npey,npez;
     MPI_Fint icomm_fort;
     int gnx = 100;
     int gny = 100;
     int gnz = 100;
     max_niter = 100;
     double rtolmax = 1.0e-8;
     int ipreflag = 0;
     int independ_nvec = -1;
     int addl = 0;
     int iflagas = 1;
     int ilu_method = 1;
     int iovlflag = 0;
     int precon_thblock = -1;
     int nBlock = 2000;
     int npe_x = 1;
     int npe_y = 4;
     int npe_z = 1;
     int div_direc_th = 1;
     MPI_Init(&argc,&argv);
     MPI_Comm_size(MPI_COMM_WORLD, &npes);
     MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
     icomm_fort=MPI_Comm_c2f(MPI_COMM_WORLD);
     parcel_init_c_(&myrank);
     int nnd = 7;
     int rank_x,rank_y,rank_z;
     set_rank_xyz(
          &myrank,
          &npe_x,
          &npe_y,
          &npe_z,
          &rank_x,
          &rank_y,
          &rank_z);
     int ndim=3 ;
     int margin = 1;
     int *npe_dim = (int *)malloc(sizeof(int)*ndim);
     int *gn_grids = (int *)malloc(sizeof(int)*ndim);
     int *n_grids = (int *)malloc(sizeof(int)*ndim);
     int *istart_grids = (int *)malloc(sizeof(int)*ndim);
     for(i=0;i < ndim;i++){
       npe_dim[i] = 0;
       gn_grids[i] = 0;
       n_grids[i] = 0;
       istart_grids[i] = 0;
     }
     int m=-1;
     gn_grids[0]  = gnx;
     gn_grids[1]  = gny;
     gn_grids[2]  = gnz;
     set_grids_decompostion(
                &npe_x, &npe_y, &npe_z,
                &rank_x, &rank_y, &rank_z,
                &ndim, gn_grids, npe_dim, n_grids,
                istart_grids, &margin, &m );
     int num_neighbor_ranks_max = pow(3,ndim) -1;
     int *neighbor_ranks = (int *)malloc(sizeof(int)*num_neighbor_ranks_max);
     int num_neighbor_ranks=-1;
     for(i=0;i < num_neighbor_ranks_max;i++){
       neighbor_ranks[i] = -1;
     }
     set_neighbor_ranks(
            &npe_x,
            &npe_y,
            &npe_z,
            &rank_x,
            &rank_y,
            &rank_z,
            &num_neighbor_ranks,
            neighbor_ranks,
            &num_neighbor_ranks_max);
     int *offset = (int *)malloc(sizeof(int)*nnd);
     for(i=0;i < nnd;i++){
       offset[i] = 0;
     }
     int *offset_dim = (int *)malloc(sizeof(int)*nnd*ndim);
     for(i=0;i < nnd*ndim;i++){
       offset_dim[i] = 0;
     }
     double *diaA  = (double *)malloc(sizeof(double)*m*nnd);
     for(i=0;i < m*nnd;i++){
       diaA[i] = 0.0;
     }
     double *vecx  = (double *)malloc(sizeof(double)*m);
     double *vecx0 = (double *)malloc(sizeof(double)*m);
     double *vecb  = (double *)malloc(sizeof(double)*m);
     for(i=0;i < m;i++){
       vecx[i] = 0.0;
       vecx0[i] = 0.0;
       vecb[i] = 0.0;
     }
     double *reshistory_ddm  = (double *)malloc(sizeof(double)*max_niter);
     for(i=0;i < max_niter;i++){
       reshistory_ddm[i] = 0.0;
     }
     int nnd_ = nnd;
     make_matrix_ddm(
             &gnx,
             &gny,
             &gnz,
             &ndim,
             n_grids,
             istart_grids,
             &margin,
             &nnd_,
             &m,
             offset_dim,
             offset,
             diaA,
             vecb,
             vecx,
             vecx0,
             &rank_x,
             &rank_y,
             &rank_z,
             &npe_x,
             &npe_y,
             &npe_z);
 #ifdef CUDA_SOLVER
     typedef struct Fortran_GPUMem{
       void *a_d;
     } devicemem;
     size_t len = sizeof(double)*m;
     devicemem cu_vecx;cudaMalloc(&(cu_vecx.a_d),len);
     cudaMemcpy( cu_vecx.a_d, vecx,  len, cudaMemcpyHostToDevice);
     devicemem cu_vecb;cudaMalloc(&(cu_vecb.a_d),len);
     cudaMemcpy( cu_vecb.a_d, vecb,  len, cudaMemcpyHostToDevice);
     len = sizeof(double)*m*nnd;
     devicemem cu_diaA;cudaMalloc(&(cu_diaA.a_d),len);
     cudaMemcpy( cu_diaA.a_d, diaA,  len, cudaMemcpyHostToDevice);
     len = sizeof(int)*nnd;
     devicemem cu_offset;cudaMalloc(&(cu_offset.a_d),len);
     cudaMemcpy( cu_offset.a_d, offset,  len, cudaMemcpyHostToDevice);
 #endif
     parcel_dcg_ddm_c_(
              &icomm_fort,
 #ifdef CUDA_SOLVER
              &cu_vecx,
              &cu_vecb,
              &m,
              &nnd,
              &cu_offset,
              &cu_diaA,
 #else
              vecx,
              vecb,
              &m,
              &nnd,
              offset,
              diaA,
 #endif
              &num_neighbor_ranks,
              neighbor_ranks,
              &margin,
              &ndim,
              npe_dim,
              n_grids,
              offset_dim,
              &div_direc_th,
              &ipreflag,
              &addl,
              &iflagas,
              &max_niter,
              &rtolmax,
              reshistory_ddm,
              &iovlflag,
              &precon_thblock,
              &independ_nvec ,
              &nBlock ,
              &iret
              );
 #ifdef CUDA_SOLVER
     len = sizeof(double)*m;
     cudaMemcpy(  vecx,cu_vecx.a_d,  len, cudaMemcpyDeviceToHost);
     cudaFree(cu_diaA.a_d);
     cudaFree(cu_offset.a_d);
     cudaFree(cu_vecx.a_d);
     cudaFree(cu_vecb.a_d);
 #endif
     free(offset);
     free(offset_dim);
     free(diaA);
     free(vecx);
     free(vecx0);
     free(vecb);
     free(reshistory_ddm);
     free(neighbor_ranks);
     free(npe_dim);
     free(gn_grids);
     free(n_grids);
     free(istart_grids);
     MPI_Barrier(MPI_COMM_WORLD);
     parcel_free_c_();
     MPI_Finalize();
     return 0;
   }



10 REFERENCES

[1] R. Barrett, M. Berry, T. F. Chan, et al., “Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods”, SIAM (1994).
[2] Y. Saad, “Iterative methods for sparse linear systems”, SIAM (2003).
[3] M. Hoemmen, “Communication-avoiding Krylov subspace methods”, Ph.D. thesis, University of California, Berkeley (2010).
[4] Y. Idomura, T. Ina, A. Mayumi, et al., “Application of a communication-avoiding generalized minimal residual method to a gyrokinetic five dimensional Eulerian code on many core platforms”, ScalA 17: 8th Workshop on Latest Advances in Scalable Algorithms for Large Scale Systems, pp. 1-8 (2017).
[5] R. Suda, L. Cong, D. Watanabe, et al., “Communication-Avoiding CG Method: New Direction of Krylov Subspace Methods towards Exa-scale Computing”, RIMS Kokyuroku, pp. 102-111 (2016).
[6] Y. Idomura, T. Ina, S. Yamashita, et al., “Communication avoiding multigrid preconditioned conjugate gradient method for extreme scale multiphase CFD simulations”, ScalA 18: 9th Workshop on Latest Advances in Scalable Algorithms for Large Scale Systems, pp. 17-24 (2018).
[7] T. Ina, Y. Idomura, T. Imamura, et al., “Iterative methods with mixed-precision preconditioning for ill-conditioned linear systems in multiphase CFD simulations”, ScalA 21: 12th Workshop on Latest Advances in Scalable Algorithms for Large Scale Systems (2021).
[8] A. Stathopoulos, K. Wu, “A block orthogonalization procedure with constant synchronization requirements”, SIAM J. Sci. Comput. 23, 2165–2182 (2002).
[9] T. Fukaya, Y. Nakatsukasa, Y. Yanagisawa, et al., “CholeskyQR2: A Simple and Communication-Avoiding Algorithm for Computing a Tall-Skinny QR Factorization on a Large-Scale Parallel System”, ScalA 14: 5th Workshop on Latest Advances in Scalable Algorithms for Large Scale Systems, pp. 31-38 (2014).
[9] T. Fukaya, Y. Nakatsukasa, Y. Yanagisawa, et al., “CholeskyQR2: A Simple and Communication-Avoiding Algorithm for Computing a Tall-Skinny QR Factorization on a Large-Scale Parallel System,” ScalA 14:5th Workshop on Latest Advances in Scalable Algorithms for Large Scale Systems, pp. 31-38,(2014)