1 Krylov subspace methods

This chapter describes the iterative methods contained in this library. The following notation is used, $$Ax = b$$ represents the linear system which will be solved, with $$A$$ being an n$$\times$$n sparse matrix and $$x_0$$ representing the initial vector. $$K$$ represents an appropriate preconditioner.

1.1 Preconditioned Conjugate Gradient Method 

The Conjugate Gradient (CG) method is an iterative Krylov subspace algorithm for solving symmetric matrices.

• Algorithm description
1. $$\mathbf{{Compute\ }}\mathbf{r}^{\mathbf{0}}\mathbf{= b - A}\mathbf{x}^{\mathbf{0}}\mathbf{{\ for\ some\ initial\ guess\ }}\mathbf{x}^{\mathbf{- 1}}\mathbf{=}\mathbf{x}^{\mathbf{0}}\mathbf{\ }$$
2. $$\mathbf{p}^{\mathbf{- 1}}\mathbf{= 0}$$
3. $$\mathbf{\alpha}_{\mathbf{- 1}}\mathbf{= 0\ }\mathbf{\ }$$
4. $$\mathbf{\beta}_{\mathbf{- 1}}\mathbf{= 0}$$
5. $$\mathbf{{solve\ s\ from\ }}\mathbf{K}\mathbf{s = \ }\mathbf{r}^{\mathbf{0}}$$
6. $$\mathbf{\rho}_{\mathbf{0}}\mathbf{=}\left\langle \mathbf{s,}\mathbf{r}^{\mathbf{0}} \right\rangle$$
7. $$\mathbf{for\ i = 0,1,2,3}\mathbf{\ldots}$$
8. 　　$$\mathbf{p}^{\mathbf{i}}\mathbf{= s +}\mathbf{\beta}_{\mathbf{i - 1}}\mathbf{p}^{\mathbf{i - 1}}$$
9. 　　$$\mathbf{q}^{\mathbf{i}}\mathbf{= A}\mathbf{p}^{\mathbf{i}}$$
10. 　　$$\mathbf{\gamma =}\left\langle \mathbf{p}^{\mathbf{i}}\mathbf{,}\mathbf{q}^{\mathbf{i}} \right\rangle$$
11. 　　$$\mathbf{x}^{\mathbf{i}}\mathbf{=}\mathbf{x}^{\mathbf{i - 1}}\mathbf{+}\mathbf{\alpha}_{\mathbf{i - 1}}\mathbf{p}^{\mathbf{i - 1}}$$
12. 　　$$\mathbf{\alpha}_{\mathbf{i}}\mathbf{=}\mathbf{\rho}_{\mathbf{i}}\mathbf{\ /\ \gamma}$$
13. 　　$$\mathbf{r}^{\mathbf{i + 1}}\mathbf{=}\mathbf{r}^{\mathbf{i}}\mathbf{-}\mathbf{\alpha}_{\mathbf{i}}\mathbf{q}^{\mathbf{i}}$$
14. 　　$$\mathbf{{solve\ s\ from\ }}\mathbf{K}\mathbf{s = \ }\mathbf{r}^{\mathbf{i + 1}}$$
15. 　　$$\mathbf{\rho}_{\mathbf{i + 1}}\mathbf{=}\left\langle \mathbf{s,}\mathbf{r}^{\mathbf{i + 1}} \right\rangle$$
16. 　　$$\mathbf{{if\ }}\left\| \mathbf{r}^{\mathbf{i + 1}} \right\|\mathbf{\ /\ }\left\| \mathbf{b} \right\|\mathbf{{\ small\ enough\ the}}\mathbf{n}$$
17. 　　　　$$\mathbf{x}^{\mathbf{i}}\mathbf{=}\mathbf{x}^{\mathbf{i - 1}}\mathbf{+}\mathbf{\alpha}_{\mathbf{i - 1}}\mathbf{p}^{\mathbf{i - 1}}$$
18. 　　　　$$\mathbf{{qui}}\mathbf{t}$$
19. 　　$$\mathbf{{endi}}\mathbf{f}$$
20. 　　$$\mathbf{\beta}_{\mathbf{i}}\mathbf{=}\mathbf{\rho}_{\mathbf{i + 1}}\mathbf{\ /\ }\mathbf{\rho}_{\mathbf{i}}$$
21. $$\mathbf{{\ en}}\mathbf{d}$$

1.2 Preconditioned Biconjugate Gradient Stabilized Method

The Biconjugate Gradient Stabilized (Bi-CGSTAB) method is an iterative Krylov subspace algorithm for solving nonsymmetric matrices.

• Algorithm description
1. $$\mathbf{{Compute\ }}\mathbf{r}_{\mathbf{0}}\mathbf{= b - A}\mathbf{x}_{\mathbf{0}}\mathbf{{\ for\ some\ initial\ guess\ }}\mathbf{x}_{\mathbf{- 1}}\mathbf{=}\mathbf{x}_{\mathbf{0}}$$
2. $$\mathbf{p}_{\mathbf{0}}\mathbf{=}\mathbf{r}_{\mathbf{0}}$$
3. $$\mathbf{c}_{\mathbf{1}}\mathbf{=}\left\langle \mathbf{r}_{\mathbf{0}}\mathbf{,}\mathbf{r}_{\mathbf{0}} \right\rangle$$
4. $$\mathbf{for\ i = 0,1,2,3}\mathbf{\ldots}$$
5. 　　$$\mathbf{{solve\ }}\widehat{\mathbf{p}}\mathbf{{\ from\ K}}\widehat{\mathbf{p}}\mathbf{= \ }\mathbf{p}_{\mathbf{i}}$$
6. 　　$$\mathbf{q = A}\widehat{\mathbf{p}}$$
7. 　　$$\mathbf{c}_{\mathbf{2}}\mathbf{=}\left\langle \mathbf{r}_{\mathbf{0}}\mathbf{,q} \right\rangle$$
8. 　　$$\mathbf{\alpha =}\mathbf{c}_{\mathbf{1}}\mathbf{\ /\ }\mathbf{c}_{\mathbf{2}}$$
9. 　　$$\mathbf{e =}\mathbf{r}_{\mathbf{i}}\mathbf{- \alpha q}$$
10. 　　$$\mathbf{{solve\ }}\widehat{\mathbf{e}}\mathbf{{\ from\ K}}\widehat{\mathbf{e}}\mathbf{= e}$$
11. 　　$$\mathbf{v = A}\widehat{\mathbf{e}}$$
12. 　　$$\mathbf{c}_{\mathbf{3}}\mathbf{=}\left\langle \mathbf{e,v} \right\rangle\mathbf{\ /\ }\left\langle \mathbf{v,v} \right\rangle$$
13. 　　$$\mathbf{x}_{\mathbf{i + 1}}\mathbf{=}\mathbf{x}_{\mathbf{i}}\mathbf{+ \alpha}\widehat{\mathbf{p}}\mathbf{+}\mathbf{c}_{\mathbf{3}}\widehat{\mathbf{e}}$$
14. 　　$$\mathbf{r}_{\mathbf{i + 1}}\mathbf{= e -}\mathbf{c}_{\mathbf{3}}\mathbf{v}$$
15. 　　$$\mathbf{c}_{\mathbf{1}}\mathbf{=}\left\langle \mathbf{r}_{\mathbf{0}}\mathbf{,}\mathbf{r}_{\mathbf{i + 1}} \right\rangle$$
16. 　　$$\mathbf{{if\ }}\left\| \mathbf{r}_{\mathbf{i + 1}} \right\|\mathbf{\ /\ }\left\| \mathbf{b} \right\|\mathbf{{\ small\ enough\ }}\mathbf{{the}}\mathbf{n}$$
17. 　　　　$$\mathbf{{quit}}$$
18. 　　$$\mathbf{{endif}}$$
19. 　　$$\mathbf{\beta =}\mathbf{c}_{\mathbf{1}}\mathbf{\ /\ }\left( \mathbf{c}_{\mathbf{2}}\mathbf{c}_{\mathbf{3}} \right)$$
20. 　　$$\mathbf{p}_{\mathbf{i + 1}}\mathbf{=}\mathbf{r}_{\mathbf{i + 1}}\mathbf{+ \beta}\left( \mathbf{p}_{\mathbf{i}}\mathbf{-}\mathbf{c}_{\mathbf{3}}\mathbf{q} \right)$$
21. $$\mathbf{{end}}$$

1.3 Preconditioned Generalized Minimum Residual Method 

The Generalized Minimum Residual (GMRES(m)) method is an iterative Krylov subspace algorithm for solving nonsymmetric matrices. GMRES(m) is restarted periodically after $$m$$ iterations if it did not converge within $$m$$ iterations, the solution obtained until $$m$$ iterations will be used as an input for the new restart cycle. PARCEL provides variants of GMRES with the Classical Gram-Schmidt (sometimes referred to as standard Gram-Schmidt) and Modified Gram-Schmidt (MGS) orthogonalization methods.

• Algorithm description

$$\mathbf{H}_{\mathbf{n}}$$ is an upper triangular matrix with $$\mathbf{h}_{\mathbf{j,k}}$$ representing its elements, $$\mathbf{\ }\mathbf{e}_{\mathbf{i}}$$ represents a vector consisting of the first i elements of the vector $$\mathbf{e}$$.

1. $$\mathbf{for\ j = 0,1,2,3}\mathbf{\ldots}$$
2. 　　$$\mathbf{r = b - A}\mathbf{x}_{\mathbf{0}}$$
3. 　　$$\mathbf{v}_{\mathbf{1}}\mathbf{= - r\ /\ }\left\| \mathbf{r} \right\|$$
4. 　　$$\mathbf{e =}\left( \mathbf{-}\left\| \mathbf{r} \right\|\mathbf{,0,\ldots,0} \right)^{\mathbf{T}}$$
5. 　　$$\mathbf{n = m}$$
6. 　　$$\mathbf{for\ i = 1,2,3}\mathbf{\ldots}\mathbf{m}$$
7. 　 　　　$$\mathbf{{solve\ }}{\widehat{\mathbf{v}}}_{\mathbf{i}}\mathbf{{\ from\ K}}{\widehat{\mathbf{v}}}_{\mathbf{i}}\mathbf{=}\mathbf{v}_{\mathbf{i}}$$
8. 　　 　　$$\mathbf{\omega = A}{\widehat{\mathbf{v}}}_{\mathbf{i}}$$
9. 　　　 　$$\mathbf{for\ k = 1,2,3}\mathbf{\ldots}\mathbf{i}$$
10. 　　　　　　$$\mathbf{h}_{\mathbf{k,i}}\mathbf{=}\left\langle \mathbf{\omega,}\mathbf{v}_{\mathbf{k}} \right\rangle$$
11. 　　　　　　$$\mathbf{\omega = \omega -}\mathbf{h}_{\mathbf{k,i}}\mathbf{v}_{\mathbf{k}}$$
12. 　　　　$$\mathbf{{end}}$$
13. 　　　　$$\mathbf{h}_{\mathbf{i + 1,i}}\mathbf{=}\left\| \mathbf{\omega} \right\|$$
14. 　　　　$$\mathbf{v}_{\mathbf{i + 1}}\mathbf{= \omega/}\left\| \mathbf{\omega} \right\|$$
15. 　　　　$$\mathbf{for\ k = 1,2,3}\mathbf{\ldots}\mathbf{i - 1}$$
16. 　　　　　　$$\begin{pmatrix} \mathbf{h}_{\mathbf{k,i}} \\ \mathbf{h}_{\mathbf{k + 1,i}} \\ \end{pmatrix}\mathbf{=}\begin{pmatrix} \mathbf{c}_{\mathbf{i}} & \mathbf{- s}_{\mathbf{i}} \\ \mathbf{s}_{\mathbf{i}} & \mathbf{c}_{\mathbf{i}} \\ \end{pmatrix}\begin{pmatrix} \mathbf{h}_{\mathbf{k,i}} \\ \mathbf{h}_{\mathbf{k + 1,i}} \\ \end{pmatrix}$$
17. 　　　　$$\mathbf{{end}}$$
18. 　　　　$$\mathbf{c}_{\mathbf{i}}\mathbf{=}\frac{\mathbf{1}}{\sqrt{\mathbf{1 +}\left( \frac{\mathbf{h}_{\mathbf{i + 1,i}}}{\mathbf{h}_{\mathbf{i,i}}} \right)^{\mathbf{2}}}}$$
19. 　　　　$$\mathbf{s}_{\mathbf{i}}\mathbf{= -}\frac{\mathbf{h}_{\mathbf{i + 1,i}}}{\mathbf{h}_{\mathbf{i,i}}}\frac{\mathbf{1}}{\sqrt{\mathbf{1 +}\left( \frac{\mathbf{h}_{\mathbf{i + 1,i}}}{\mathbf{h}_{\mathbf{i,i}}} \right)^{\mathbf{2}}}}$$
20. 　　　　$$\begin{pmatrix} \mathbf{e}_{\mathbf{i}} \\ \mathbf{e}_{\mathbf{i + 1}} \\ \end{pmatrix}\mathbf{=}\begin{pmatrix} \mathbf{c}_{\mathbf{i}} & \mathbf{- s}_{\mathbf{i}} \\ \mathbf{s}_{\mathbf{i}} & \mathbf{c}_{\mathbf{i}} \\ \end{pmatrix}\begin{pmatrix} \mathbf{e}_{\mathbf{i}} \\ \mathbf{e}_{\mathbf{i + 1}} \\ \end{pmatrix}$$
21. 　　　　$$\mathbf{h}_{\mathbf{i,i}}\mathbf{=}\mathbf{c}_{\mathbf{i}}\mathbf{h}_{\mathbf{i,i}}\mathbf{- \ }\mathbf{s}_{\mathbf{i}}\mathbf{h}_{\mathbf{i + 1,i}}$$
22. 　　　　$$\mathbf{h}_{\mathbf{i + 1,i}}\mathbf{= 0}$$
23. 　　　　$$\mathbf{{if\ }}\left\| \mathbf{e}_{\mathbf{i + 1}} \right\|\mathbf{\ /\ }\left\| \mathbf{b} \right\|\mathbf{{\ small\ enough\ }}\mathbf{{then}}$$
24. 　　　　　　$$\mathbf{n = i}$$
25. 　　　　　　$$\mathbf{{exit}}$$
26. 　　　　$$\mathbf{{endif}}$$
27. 　　$$\mathbf{{end}}$$
28. 　　$$\mathbf{y}_{\mathbf{n}}\mathbf{=}\mathbf{H}_{\mathbf{n}}^{\mathbf{- 1}}\mathbf{e}_{\mathbf{n}}$$
29. 　　$$\mathbf{{solve\ }}\widehat{\mathbf{x}}\mathbf{{\ from\ K}}\widehat{\mathbf{x}}\mathbf{=}\sum_{\mathbf{k = 1}}^{\mathbf{n}}{\mathbf{y}_{\mathbf{k}}\mathbf{v}_{\mathbf{k}}}$$
30. 　　$$\mathbf{x}_{\mathbf{n}}\mathbf{=}\mathbf{x}_{\mathbf{0}}\mathbf{+}\widehat{\mathbf{x}}$$
31. 　　$$\mathbf{x}_{\mathbf{0}}\mathbf{=}\mathbf{x}_{\mathbf{n}}$$
32. $$\mathbf{{end}}$$

1.4 Preconditioned Communication Avoiding Generalized Minimum Residual Method [2,3]

The Communication Avoiding Generalized Minimum Residual (CA-GMRES(s,t)) method is an iterative Krylov subspace algorithm for solving nonsymmetric matrices. The calculation which is done in $$s$$ iterations in the GMRES(m) method is equivalent to one iteration of the CA-GMRES method. CA-GMRES(s,t) is restarted periodically after $$t$$ iterations if it did not converge within $$t$$ iterations, the solution obtained until $$m$$ iterations will be used as an input for the new restart cycle. CA-GMRES(s,t) is equivalent to GMRES(m) if $$s$$ and $$t$$ are chosen so that $$m = t \times s$$. The convergence property of CA-GMRES(s,t) is the same as GMRES(m) in exact arithmetic. In addition, the number of global collective communication calls is reduced by communication avoiding QR factorization. As CA-GMRES(s,t) produces $$s$$ basis vectors at once, when $$s$$ is too large, the linear independency of the basic vectors may become worse because of round off errors, leading to worse convergence property than GMRES(m). In order to improve the orthogonality of the basic vectors, the PARCEL implementation of CA-GMRES(s,t) provides an option to use the Newton basis in addition to the monomial basis.

• Algorithm description

$$\mathbf{e}$$ represents a unit vector, $${\widetilde{\mathbf{\rho}}}_{\mathbf{k}}\mathbf{=}\mathbf{R}_{\mathbf{k}}$$ represents the $$s$$-th row and $$s$$-th column element of the matrix $$\mathbf{R}_{\mathbf{k}}$$, and $$\mathbf{E}$$ represents the eigenvalues obtained from the Hessenberg matrix, which is generated by the iteration process.

1. $$\mathbf{B}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{= \lbrack}\mathbf{e}_{\mathbf{2}}\mathbf{,}\mathbf{e}_{\mathbf{3}}\mathbf{,\ldots,}\mathbf{e}_{\mathbf{s + 1}}\mathbf{\rbrack}$$
2. $$\mathbf{for\ j = 0,1,2,3\ldots}$$
3. 　　$$\mathbf{r = b - A}\mathbf{x}_{\mathbf{0}}$$
4. 　　$$\mathbf{v}_{\mathbf{1}}\mathbf{= r\ /\ }\left\| \mathbf{r} \right\|$$
5. 　　$$\mathbf{\zeta =}\left( \left\| \mathbf{r} \right\|\mathbf{,0,\ldots,0} \right)^{\mathbf{T}}$$
6. 　　$$\mathbf{for\ k = 0,1\ldots,t - 1}$$
7. 　　　　$$\mathbf{Fix\ basis\ conversion\ matrix\ \lbrack}\mathbf{B}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{,E\rbrack}$$
8. 　　　　$$\mathbf{Comupute\ \ basic\ vector\ \ \lbrack\ s,}{\acute{\mathbf{V}}}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{,}\mathbf{B}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{\ \rbrack}$$
9. 　　　　$$\mathbf{if(k.eq.0)}$$
10. 　　　　　　$$\mathbf{{QR\ decomposition\ }}\mathbf{V}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{=}\mathbf{Q}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{R}_{\mathbf{0}}^{\mathbf{\sim}}$$
11. 　　　　　　$$\mathfrak{Q}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{=}\mathbf{Q}_{\mathbf{0}}^{\mathbf{\sim}}$$
12. 　　　　　　$$\mathfrak{H}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{=}\mathbf{R}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{B}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{R}_{\mathbf{0}}^{\mathbf{- 1}}$$
13. 　　　　　　$$\mathcal{H}_{\mathbf{0}}\mathbf{=}\mathfrak{H}_{\mathbf{0}}^{\mathbf{\sim}}$$
14. 　　　　　　$$\mathbf{for\ o = 1,2\ldots,s}$$
15. 　　　　　　$$\mathbf{Givens\ rotation\ \lbrack o,}\mathcal{H}_{\mathbf{0}}\mathbf{,\zeta\rbrack}$$
16. 　　　　　　$$\mathbf{{if\ }}\left\| \mathbf{\zeta}_{\mathbf{o + 1}} \right\|\mathbf{\ /\ }\left\| \mathbf{b} \right\|\mathbf{{\ small\ enough\ then}}$$
17. 　　　　　　　　$$\mathbf{update\ \ solution\ \ vector\lbrack s,k,o,}{\acute{\mathbf{V}}}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{,}{\acute{\mathbf{Q}}}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{,}\mathbf{R}_{\mathbf{0}}\mathbf{,}\mathcal{H}_{\mathbf{0}}\mathbf{,\zeta,}\mathbf{x}_{\mathbf{0}}\mathbf{\ \rbrack}$$
18. 　　　　　　　　$$\mathbf{{quit}}$$
19. 　　　　　　$$\mathbf{{endif}}$$
20. 　　　　$$\mathbf{{else}}$$
21. 　　　　　　$${\acute{\mathfrak{R}}}_{\mathbf{k - 1,k}}^{\mathbf{\sim}}\mathbf{=}\left( \mathfrak{Q}_{\mathbf{k - 1}}^{\mathbf{\sim}} \right)^{\mathbf{H}}{\acute{\mathbf{V}}}_{\mathbf{k}}^{\mathbf{\sim}}$$
22. 　　　　　　$${\acute{\mathbf{V}}}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{=}{\acute{\mathbf{V}}}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{-}\mathfrak{Q}_{\mathbf{k - 1}}^{\mathbf{\sim}}{\acute{\mathfrak{R}}}_{\mathbf{k - 1,k}}^{\mathbf{\sim}}$$
23. 　　　　　　$$\mathbf{{QR\ decomposition\ }}{\acute{\mathbf{V}}}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{=}{\acute{\mathbf{Q}}}_{\mathbf{k}}^{\mathbf{\sim}}{\acute{\mathbf{R}}}_{\mathbf{k}}^{\mathbf{\sim}}$$
24. 　　　　　　$$\mathbf{R}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{=}\begin{pmatrix} \mathbf{R}_{\mathbf{k}} & \mathbf{z}_{\mathbf{k}} \\ \mathbf{0}_{\mathbf{1,k}} & \mathbf{\rho} \\ \end{pmatrix}$$
25. 　　　　　　$$\mathfrak{H}_{\mathbf{k - 1,k}}^{\mathbf{\sim}}\mathbf{= -}\mathfrak{H}_{\mathbf{k - 1}}\mathfrak{R}_{\mathbf{k - 1,k}}\mathbf{R}_{\mathbf{k}}^{\mathbf{- 1}}\mathbf{+}\mathfrak{R}_{\mathbf{k - 1,k}}^{\mathbf{\sim}}\mathbf{B}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{R}_{\mathbf{k}}^{\mathbf{- 1}}$$
26. 　　　　　　$$\mathbf{H}_{\mathbf{k}}\mathbf{=}\mathbf{R}_{\mathbf{k}}\mathbf{B}_{\mathbf{k}}\mathbf{R}_{\mathbf{k}}^{\mathbf{- 1}}\mathbf{+}{\widetilde{\mathbf{\rho}}}_{\mathbf{k}}^{\mathbf{- 1}}\mathbf{b}_{\mathbf{k}}\mathbf{z}_{\mathbf{k}}\mathbf{e}_{\mathbf{s}}^{\mathbf{T}}\mathbf{-}\mathbf{h}_{\mathbf{k - 1}}\mathbf{e}_{\mathbf{1}}\mathbf{e}_{\mathbf{s(k - 1)}}^{\mathbf{T}}\mathfrak{R}_{\mathbf{k - 1,k}}\mathbf{R}_{\mathbf{k}}^{\mathbf{- 1}}$$
27. 　　　　　　$$\mathbf{h}_{\mathbf{k}}\mathbf{=}{\widetilde{\mathbf{\rho}}}_{\mathbf{k}}^{\mathbf{- 1}}\mathbf{\rho}_{\mathbf{k}}\mathbf{b}_{\mathbf{k}}$$
28. 　　　　　　$$\mathfrak{H}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{=}\begin{pmatrix} \mathfrak{H}_{\mathbf{k - 1}} & \mathfrak{H}_{\mathbf{k - 1,k}}^{\mathbf{\sim}} \\ \begin{matrix} \mathbf{h}_{\mathbf{k - 1}}\mathbf{e}_{\mathbf{1}}\mathbf{e}_{\mathbf{s(k - 1)}}^{\mathbf{T}} \\ \mathbf{0}_{\mathbf{1,s(k - 1)}} \\ \end{matrix} & \begin{matrix} \mathbf{H}_{\mathbf{k}} \\ \mathbf{h}_{\mathbf{k}}\mathbf{e}_{\mathbf{s}}^{\mathbf{T}} \\ \end{matrix} \\ \end{pmatrix}$$
29. 　　　　　　$$\mathcal{H}_{\mathbf{k}}\mathbf{=}\begin{pmatrix} \mathfrak{H}_{\mathbf{k - 1,k}}^{\mathbf{\sim}} \\ \mathbf{H}_{\mathbf{k}} \\ \mathbf{h}_{\mathbf{k}}\mathbf{e}_{\mathbf{s}}^{\mathbf{T}} \\ \end{pmatrix}$$
30. 　　　　　　$$\mathbf{for\ o = 1 + sk,\ldots,s(k + 1)}$$
31. 　　　　　　　　$$\mathbf{Givens\ rotation\ \lbrack o,}\mathcal{H}_{\mathbf{k}}\mathbf{,\zeta\rbrack}$$
32. 　　　　　　　　$$\mathbf{{if\ }}\left\| \mathbf{\zeta}_{\mathbf{o + 1}} \right\|\mathbf{\ /\ }\left\| \mathbf{b} \right\|\mathbf{{\ small\ enough\ then}}$$
33. 　　　　　　　　　　$$\mathbf{update\ \ solution\ \ vector\lbrack s,k,o,}{\acute{\mathbf{V}}}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{,}\mathfrak{Q}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{,}{\acute{\mathbf{R}}}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{,}\mathcal{H}_{\mathbf{k}}\mathbf{,\zeta,}\mathbf{x}_{\mathbf{0}}\mathbf{\ \rbrack}$$
34. 　　　　　　　　　　$$\mathbf{{quit}}$$
35. 　　　　　　　　$$\mathbf{{endif}}$$
36. 　　　　　　$$\mathbf{{end}}$$
37. 　　　　$$\mathbf{{endif}}$$
38. 　　$$\mathbf{{end}}$$
39. 　　$$\mathbf{update\ \ solution\ \ vector\lbrack s,t - 1,st,}{\acute{\mathbf{V}}}_{\mathbf{t - 1}}^{\mathbf{\sim}}\mathbf{,}\mathfrak{Q}_{\mathbf{t - 1}}^{\mathbf{\sim}}\mathbf{,}{\acute{\mathbf{R}}}_{\mathbf{t - 1}}^{\mathbf{\sim}}\mathbf{,}\mathcal{H}_{\mathbf{t - 1}}\mathbf{,\zeta,}\mathbf{x}_{\mathbf{0}}\mathbf{\ \rbrack}$$
40. $$\mathbf{{end}}$$

• Fix basis conversion matrix $$\mathbf{\lbrack s,B,E\rbrack}$$
1. $$\mathbf{{if\ compute\ Newton\ Basis}}$$
2. 　　$$\mathbf{i = 0}$$
3. 　　$$\mathbf{{while}}\left( \mathbf{i \leq s - 1} \right)$$
4. 　　　　$$\mathbf{{if}}\left( \mathbf{i.eq.s - 1} \right)\mathbf{{then}}$$
5. 　　　　　　$$\mathbf{B}_{\mathbf{i,i}}\mathbf{= Re\lbrack}\mathbf{E}_{\mathbf{i}}\mathbf{\rbrack}$$
6. 　　　　$$\mathbf{{else}}$$
7. 　　　　　　$$\mathbf{{if}}\left( \mathbf{{Im}}\left\lbrack \mathbf{E}_{\mathbf{i}} \right\rbrack\mathbf{.eq.0} \right)\mathbf{{then}}$$
8. 　　　　　　　　$$\mathbf{B}_{\mathbf{i,i}}\mathbf{= Re\lbrack}\mathbf{E}_{\mathbf{i}}\mathbf{\rbrack}$$
9. 　　　　　　$$\mathbf{{else}}$$
10. 　　　　　　　　$$\mathbf{B}_{\mathbf{i,i}}\mathbf{= Re\lbrack}\mathbf{E}_{\mathbf{i}}\mathbf{\rbrack}$$
11. 　　　　　　　　$$\mathbf{B}_{\mathbf{i + 1,i + 1}}\mathbf{= Re\lbrack}\mathbf{E}_{\mathbf{i}}\mathbf{\rbrack}$$
12. 　　　　　　　　$$\mathbf{B}_{\mathbf{i,i + 1}}\mathbf{= -}\left( \mathbf{Im\lbrack}\mathbf{E}_{\mathbf{i}}\mathbf{\rbrack} \right)^{\mathbf{2}}$$
13. 　　　　　　　　$$\mathbf{i = i + 1}$$
14. 　　　　　　$$\mathbf{{endif}}$$
15. 　　　　$$\mathbf{{endif}}$$
16. 　　　　$$\mathbf{i = i + 1}$$
17. 　　$$\mathbf{{end\ \ }}$$
18. $$\mathbf{{end\ \ }}$$

• Comupute basis vector $$\mathbf{\lbrack\ s,v\ ,B\rbrack}$$

The elements of the matrix $$\mathbf{B}$$ are represented as $$\mathbf{b}_{\mathbf{k,i}}$$

1. $$\mathbf{for\ k = 1,2,3\ldots s}$$
2. 　　$$\mathbf{{solve\ }}{\widehat{\mathbf{v}}}_{\mathbf{k - 1}}\mathbf{{\ from\ K}}{\widehat{\mathbf{v}}}_{\mathbf{k - 1}}\mathbf{=}\mathbf{v}_{\mathbf{k - 1}}$$
3. 　　$$\mathbf{if(k.ne.1)then}$$
4. 　　　　$$\mathbf{\alpha =}\mathbf{b}_{\mathbf{k - 1,k - 1}}$$
5. 　　　　$$\mathbf{\beta =}\mathbf{b}_{\mathbf{k - 2,k - 1}}$$
6. 　　　　$$\mathbf{v}_{\mathbf{k}}\mathbf{= A}{\widehat{\mathbf{v}}}_{\mathbf{k - 1}}\mathbf{- \alpha}\mathbf{v}_{\mathbf{k - 1}}\mathbf{+ \beta}\mathbf{v}_{\mathbf{k - 2}}$$
7. 　　$$\mathbf{{else}}$$
8. 　　　　$$\mathbf{\alpha =}\mathbf{b}_{\mathbf{k - 1,k - 1}}$$
9. 　　　　$$\mathbf{v}_{\mathbf{k}}\mathbf{= A}{\widehat{\mathbf{v}}}_{\mathbf{k - 1}}\mathbf{- \alpha}\mathbf{v}_{\mathbf{k - 1}}$$
10. 　　$$\mathbf{{endif}}$$
11. $$\mathbf{{en}}\mathbf{d}$$

• Givens rotation $$\mathbf{\lbrack i,}\mathcal{H,}\mathbf{\zeta\rbrack}$$

The elements of the matrix $$\mathcal{H}$$ are represented as $$\mathcal{h}_{\mathbf{k,i}}$$.

1. $$\mathbf{for\ k = 1,2,3\ldots i - 1}$$
2. 　　$$\begin{pmatrix} \mathcal{h}_{\mathbf{k,i}} \\ \mathcal{h}_{\mathbf{k + 1,i}} \\ \end{pmatrix}\mathbf{=}\begin{pmatrix} \mathbf{c}_{\mathbf{i}} & \mathbf{- s}_{\mathbf{i}} \\ \mathbf{s}_{\mathbf{i}} & \mathbf{c}_{\mathbf{i}} \\ \end{pmatrix}\begin{pmatrix} \mathcal{h}_{\mathbf{k,i}} \\ \mathcal{h}_{\mathbf{k + 1,i}} \\ \end{pmatrix}$$
3. $$\mathbf{\text{end}}$$
4. $$\mathbf{c}_{\mathbf{i}}\mathbf{=}\frac{\mathbf{1}}{\sqrt{\mathbf{1 +}\left( \frac{\mathcal{h}_{\mathbf{i + 1,i}}}{\mathcal{h}_{\mathbf{i,i}}} \right)^{\mathbf{2}}}}$$
5. $$\mathbf{s}_{\mathbf{i}}\mathbf{= -}\frac{\mathcal{h}_{\mathbf{i + 1,i}}}{\mathcal{h}_{\mathbf{i,i}}}\frac{\mathbf{1}}{\sqrt{\mathbf{1 +}\left( \frac{\mathcal{h}_{\mathbf{i + 1,i}}}{\mathcal{h}_{\mathbf{i,i}}} \right)^{\mathbf{2}}}}$$
6. $$\begin{pmatrix} \mathbf{\zeta}_{\mathbf{i}} \\ \mathbf{\zeta}_{\mathbf{i + 1}} \\ \end{pmatrix}\mathbf{=}\begin{pmatrix} \mathbf{c}_{\mathbf{i}} & \mathbf{- s}_{\mathbf{i}} \\ \mathbf{s}_{\mathbf{i}} & \mathbf{c}_{\mathbf{i}} \\ \end{pmatrix}\begin{pmatrix} \mathbf{\zeta}_{\mathbf{i}} \\ \mathbf{\zeta}_{\mathbf{i + 1}} \\ \end{pmatrix}$$
7. $$\mathcal{h}_{\mathbf{i,i}}\mathbf{=}\mathbf{c}_{\mathbf{i}}\mathcal{h}_{\mathbf{i,i}}\mathbf{- \ }\mathbf{s}_{\mathbf{i}}\mathcal{h}_{\mathbf{i + 1,i}}$$
8. $$\mathcal{h}_{\mathbf{i + 1,i}}\mathbf{= 0}$$

• Update solution vector $$\mathbf{\lbrack s,k,n,V,Q,R,}\mathcal{H,}\mathbf{\zeta,}\mathbf{x}_{\mathbf{0}}\mathbf{\ \rbrack}$$
1. $$\mathbf{y =}\mathcal{H}^{\mathbf{- 1}}\mathbf{\zeta}$$
2. $$\mathbf{{solve\ }}\widehat{\mathbf{x}}\mathbf{{\ from\ K}}\widehat{\mathbf{x}}\mathbf{=}\sum_{\mathbf{k = 0}}^{\mathbf{s}\mathbf{m}\mathbf{- 1}}\mathbf{Q}_{\mathbf{k}}\mathbf{y}_{\mathbf{k}}\mathbf{+}\sum_{\mathbf{l =}\mathbf{{sm}}}^{\mathbf{n - 1}}\mathbf{V}_{\mathbf{l}\mathbf{- sm}}\sum_{\mathbf{k =}\mathbf{{sm}}}^{\mathbf{n - 1}}{\mathbf{R}_{\mathbf{l - sm}\mathbf{,}\mathbf{k}\mathbf{- sm}}^{\mathbf{- 1}}\mathbf{y}_{\mathbf{k}}}$$
3. $$\mathbf{x}_{\mathbf{0}}\mathbf{=}\mathbf{x}_{\mathbf{0}}\mathbf{+}\widehat{\mathbf{x}}$$

1.5 Preconditioned Chebyshev basis Conjugate Gradient Method [4,5]

The Chebyshev basis Conjugate Gradient (CBCG) method is an iterative Krylov subspace algorithm for solving symmetric matrices. The CBCG method calculates $$k$$ iterations of the CG method in one iteration. By this approach, the CBCG method reduces the global collective communication calls. In order to use the Chebyshev basis the largest and smallest eigenvalues are needed. PARCEL provides two methods to obtain these eigenvalues, the so called power method and the communication avoiding Arnoldi method.

• Algorithm
1. $$\mathbf{{Compute\ }}\mathbf{r}_{\mathbf{0}}\mathbf{= b - A}\mathbf{x}_{\mathbf{0}}\mathbf{{\ for\ some\ initial\ gue}}\mathbf{{ss\ }}\mathbf{x}_{\mathbf{- 1}}\mathbf{=}\mathbf{x}_{\mathbf{0}}$$
2. $$\mathbf{Compute\ Chebyshev\ basis\ \lbrack}\mathbf{S}_{\mathbf{0}}\mathbf{,}{\mathbf{A}\mathbf{S}}_{\mathbf{0}}\mathbf{,}\mathbf{r}_{\mathbf{0}}\mathbf{,}\mathbf{\lambda}_{\mathbf{\max}}\mathbf{,}\mathbf{\lambda}_{\mathbf{\min}}\mathbf{\rbrack}$$
3. $$\mathbf{Q}_{\mathbf{0}}\mathbf{=}\mathbf{S}_{\mathbf{0}}$$
4. $$\mathbf{for\ i = 0,1,2,3\ldots}$$
5. $$\mathbf{{Compute\ }}\mathbf{Q}_{\mathbf{i}}^{\mathbf{T}}\mathbf{A}\mathbf{Q}_{\mathbf{i}}\mathbf{\ ,\ }\mathbf{Q}_{\mathbf{i}}^{\mathbf{T}}\mathbf{r}_{\mathbf{{ik}}}$$
6. 　　$$\mathbf{\alpha}_{\mathbf{i}}\mathbf{= \ }\left( \mathbf{Q}_{\mathbf{i}}^{\mathbf{T}}\mathbf{A}\mathbf{Q}_{\mathbf{i}} \right)^{\mathbf{- 1}}\left( \mathbf{Q}_{\mathbf{i}}^{\mathbf{T}}\mathbf{r}_{\mathbf{{ik}}} \right)$$
7. 　　$$\mathbf{x}_{\left( \mathbf{i + 1} \right)\mathbf{k}}\mathbf{=}\mathbf{x}_{\mathbf{{ik}}}\mathbf{+}\mathbf{Q}_{\mathbf{i}}\mathbf{\alpha}_{\mathbf{i}}$$
8. 　　$$\mathbf{r}_{\left( \mathbf{i + 1} \right)\mathbf{k}}\mathbf{=}\mathbf{r}_{\mathbf{{ik}}}\mathbf{- A}\mathbf{Q}_{\mathbf{i}}\mathbf{\alpha}_{\mathbf{i}}$$
9. 　　$$\mathbf{{if\ }}\left\| \mathbf{r}_{\left( \mathbf{i + 1} \right)\mathbf{k}} \right\|\mathbf{\ /\ }\left\| \mathbf{b} \right\|\mathbf{{\ small\ enough\ }}\mathbf{{then}}$$
10. 　　$$\mathbf{Compute\ Chebyshev\ basis\ \lbrack}\mathbf{S}_{\mathbf{i + 1}}\mathbf{,}{\mathbf{A}\mathbf{S}}_{\mathbf{i + 1}}\mathbf{,}\mathbf{r}_{\left( \mathbf{i + 1} \right)\mathbf{k}}\mathbf{,}\mathbf{\lambda}_{\mathbf{\max}}\mathbf{,}\mathbf{\lambda}_{\mathbf{\min}}\mathbf{\rbrack}$$
11. 　　$$\mathbf{{Compute\ }}\mathbf{Q}_{\mathbf{i}}^{\mathbf{T}}\mathbf{A}\mathbf{S}_{\mathbf{i + 1}}$$
12. 　　$$\mathbf{B}_{\mathbf{i}}\mathbf{=}\left( \mathbf{Q}_{\mathbf{i}}^{\mathbf{T}}\mathbf{A}\mathbf{Q}_{\mathbf{i}} \right)^{\mathbf{- 1}}\left( \mathbf{Q}_{\mathbf{i}}^{\mathbf{T}}\mathbf{A}\mathbf{S}_{\mathbf{i + 1}} \right)\mathbf{\ }$$
13. 　　$$\mathbf{Q}_{\mathbf{i + 1}}\mathbf{=}\mathbf{S}_{\mathbf{i + 1}}\mathbf{-}\mathbf{Q}_{\mathbf{i}}\mathbf{B}_{\mathbf{i}}$$
14. 　　$$\mathbf{A}\mathbf{Q}_{\mathbf{i + 1}}\mathbf{=}\mathbf{{AS}}_{\mathbf{i + 1}}\mathbf{-}\mathbf{{AQ}}_{\mathbf{i}}\mathbf{B}_{\mathbf{i}}$$
15. $$\mathbf{{end}}$$

• Compute Chebyshev basis $$\mathbf{\lbrack S,AS,r,}\mathbf{\lambda}_{\mathbf{\max}}\mathbf{,}\mathbf{\lambda}_{\mathbf{\min}}\mathbf{\rbrack}$$

$$\mathbf{\lambda}_{\mathbf{\max}}$$, $$\mathbf{\lambda}_{\mathbf{\min}}$$represent the largest and smallest eigenvalues of $$\mathbf{AK}^{-1}$$.

1. $$\mathbf{\eta = 2\ /\ (}\mathbf{\lambda}_{\mathbf{\max}}\mathbf{-}\mathbf{\lambda}_{\mathbf{\min}}\mathbf{)}$$
2. $$\mathbf{\zeta =}\left( \mathbf{\lambda}_{\mathbf{\max}}\mathbf{+}\mathbf{\lambda}_{\mathbf{\min}} \right)\mathbf{\ /\ (}\mathbf{\lambda}_{\mathbf{\max}}\mathbf{-}\mathbf{\lambda}_{\mathbf{\min}}\mathbf{)}$$
3. $$\mathbf{s}_{\mathbf{0}}\mathbf{=}\mathbf{r}_{\mathbf{0}}\mathbf{\ }$$
4. $$\mathbf{{solve\ }}{\widetilde{\mathbf{s}}}_{\mathbf{0}}\mathbf{{\ from\ K}}{\widetilde{\mathbf{s}}}_{\mathbf{0}}\mathbf{= \ }\mathbf{s}_{\mathbf{0}}$$
5. $$\mathbf{s}_{\mathbf{1}}\mathbf{= \eta A}{\widetilde{\mathbf{s}}}_{\mathbf{0}}\mathbf{- \zeta}\mathbf{s}_{\mathbf{0}}$$
6. $$\mathbf{{solve\ }}{\widetilde{\mathbf{s}}}_{\mathbf{1}}\mathbf{{\ from\ K}}{\widetilde{\mathbf{s}}}_{\mathbf{1}}\mathbf{= \ }\mathbf{s}_{\mathbf{1}}$$
7. $$\mathbf{for\ j = 2,3,\ldots,k}$$
8. 　　$$\mathbf{s}_{\mathbf{j}}\mathbf{= 2\eta A}{\widetilde{\mathbf{s}}}_{\mathbf{j}}\mathbf{- 2\zeta}\mathbf{s}_{\mathbf{j - 1}}\mathbf{-}\mathbf{s}_{\mathbf{j - 2}}$$
9. 　　$$\mathbf{{solve\ }}{\widetilde{\mathbf{s}}}_{\mathbf{j}}\mathbf{{\ from\ K}}{\widetilde{\mathbf{s}}}_{\mathbf{j}}\mathbf{= \ }\mathbf{s}_{\mathbf{j}}$$
10. $$\mathbf{{end}}$$
11. $$\mathbf{S = (}{\widetilde{\mathbf{s}}}_{\mathbf{0}}\mathbf{,}{\widetilde{\mathbf{s}}}_{\mathbf{1}}\mathbf{,\ldots,}{\widetilde{\mathbf{s}}}_{\mathbf{k - 1}}\mathbf{)}$$
12. $$\mathbf{AS = (A}{\widetilde{\mathbf{s}}}_{\mathbf{0}}\mathbf{,}{\mathbf{A}\widetilde{\mathbf{s}}}_{\mathbf{1}}\mathbf{,\ldots,}{\mathbf{A}\widetilde{\mathbf{s}}}_{\mathbf{k - 1}}\mathbf{)}$$

1.6 Preconditioned Communication Avoiding Arnoldi Method 

The Preconditioned Communication Avoiding Arnoldi Method (CA-Arnoldi(s,t)) is an eigenvalue calculation algorithm for asymmetric matrices. CA-Arnoldi(s,t) computes s iterations of the Arnordi method in a single iteration, and by repeating this for t times, eigenvalues and eigenvectors of t$$\times$$s Hessenberg matrices are computed. By applying communication-reducing QR factorization algorithms, the number of collective communications is reduced compared to the Arnoldi method.

• Algorithm description

$$\mathbf{e}$$ represents a unit vector, $${\widetilde{\mathbf{\rho}}}_{\mathbf{k}}\mathbf{=}\mathbf{R}_{\mathbf{k}}$$ represents the $$s$$-th row and $$s$$-th column element of the matrix $$\mathbf{R}_{\mathbf{k}}$$, and $$\mathbf{E}$$ represents the eigenvalues obtained from the Hessenberg matrix, which is generated by the iteration process.

1. $$\mathbf{B}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{= \lbrack}\mathbf{e}_{\mathbf{2}}\mathbf{,}\mathbf{e}_{\mathbf{3}}\mathbf{,\ldots,}\mathbf{e}_{\mathbf{s + 1}}\mathbf{\rbrack}$$
2. 　　$$\mathbf{r = }\mathbf{x}_{\mathbf{0}}$$
3. 　　$$\mathbf{\zeta =}\left( \left\| \mathbf{r} \right\|\mathbf{,0,\ldots,0} \right)^{\mathbf{T}}$$
4. 　　$$\mathbf{for\ k = 0,1\ldots,t - 1}$$
5. 　　　　$$\mathbf{Fix\ basis\ conversion\ matrix\ \lbrack}\mathbf{B}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{,E\rbrack}$$
6. 　　　　$$\mathbf{Comupute\ \ basic\ vector\ \ \lbrack\ s,}{\acute{\mathbf{V}}}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{,}\mathbf{B}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{\ \rbrack}$$
7. 　　　　$$\mathbf{if(k.eq.0)}$$
8. 　　　　　　$$\mathbf{{QR\ decomposition\ }}\mathbf{V}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{=}\mathbf{Q}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{R}_{\mathbf{0}}^{\mathbf{\sim}}$$
9. 　　　　　　$$\mathfrak{Q}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{=}\mathbf{Q}_{\mathbf{0}}^{\mathbf{\sim}}$$
10. 　　　　　　$$\mathfrak{H}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{=}\mathbf{R}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{B}_{\mathbf{0}}^{\mathbf{\sim}}\mathbf{R}_{\mathbf{0}}^{\mathbf{- 1}}$$
11. 　　　　　　$$\mathcal{H}_{\mathbf{0}}\mathbf{=}\mathfrak{H}_{\mathbf{0}}^{\mathbf{\sim}}$$
12. 　　　　　　$$\mathbf{for\ o = 1,2\ldots,s}$$
13. 　　　　　　$$\mathbf{Givens\ rotation\ \lbrack o,}\mathcal{H}_{\mathbf{0}}\mathbf{,\zeta\rbrack}$$
14. 　　　　　　$$\mathbf{{if\ }}\left\| \mathbf{\zeta}_{\mathbf{o + 1}} \right\|\mathbf{\ /\ }\left\| \mathbf{b} \right\|\mathbf{{\ small\ enough\ then}}$$
15. 　　　　　　　　$$\mathbf{{quit}}$$
16. 　　　　　　$$\mathbf{{endif}}$$
17. 　　　　$$\mathbf{{else}}$$
18. 　　　　　　$${\acute{\mathfrak{R}}}_{\mathbf{k - 1,k}}^{\mathbf{\sim}}\mathbf{=}\left( \mathfrak{Q}_{\mathbf{k - 1}}^{\mathbf{\sim}} \right)^{\mathbf{H}}{\acute{\mathbf{V}}}_{\mathbf{k}}^{\mathbf{\sim}}$$
19. 　　　　　　$${\acute{\mathbf{V}}}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{=}{\acute{\mathbf{V}}}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{-}\mathfrak{Q}_{\mathbf{k - 1}}^{\mathbf{\sim}}{\acute{\mathfrak{R}}}_{\mathbf{k - 1,k}}^{\mathbf{\sim}}$$
20. 　　　　　　$$\mathbf{{QR\ decomposition\ }}{\acute{\mathbf{V}}}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{=}{\acute{\mathbf{Q}}}_{\mathbf{k}}^{\mathbf{\sim}}{\acute{\mathbf{R}}}_{\mathbf{k}}^{\mathbf{\sim}}$$
21. 　　　　　　$$\mathbf{R}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{=}\begin{pmatrix} \mathbf{R}_{\mathbf{k}} & \mathbf{z}_{\mathbf{k}} \\ \mathbf{0}_{\mathbf{1,k}} & \mathbf{\rho} \\ \end{pmatrix}$$
22. 　　　　　　$$\mathfrak{H}_{\mathbf{k - 1,k}}^{\mathbf{\sim}}\mathbf{= -}\mathfrak{H}_{\mathbf{k - 1}}\mathfrak{R}_{\mathbf{k - 1,k}}\mathbf{R}_{\mathbf{k}}^{\mathbf{- 1}}\mathbf{+}\mathfrak{R}_{\mathbf{k - 1,k}}^{\mathbf{\sim}}\mathbf{B}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{R}_{\mathbf{k}}^{\mathbf{- 1}}$$
23. 　　　　　　$$\mathbf{H}_{\mathbf{k}}\mathbf{=}\mathbf{R}_{\mathbf{k}}\mathbf{B}_{\mathbf{k}}\mathbf{R}_{\mathbf{k}}^{\mathbf{- 1}}\mathbf{+}{\widetilde{\mathbf{\rho}}}_{\mathbf{k}}^{\mathbf{- 1}}\mathbf{b}_{\mathbf{k}}\mathbf{z}_{\mathbf{k}}\mathbf{e}_{\mathbf{s}}^{\mathbf{T}}\mathbf{-}\mathbf{h}_{\mathbf{k - 1}}\mathbf{e}_{\mathbf{1}}\mathbf{e}_{\mathbf{s(k - 1)}}^{\mathbf{T}}\mathfrak{R}_{\mathbf{k - 1,k}}\mathbf{R}_{\mathbf{k}}^{\mathbf{- 1}}$$
24. 　　　　　　$$\mathbf{h}_{\mathbf{k}}\mathbf{=}{\widetilde{\mathbf{\rho}}}_{\mathbf{k}}^{\mathbf{- 1}}\mathbf{\rho}_{\mathbf{k}}\mathbf{b}_{\mathbf{k}}$$
25. 　　　　　　$$\mathfrak{H}_{\mathbf{k}}^{\mathbf{\sim}}\mathbf{=}\begin{pmatrix} \mathfrak{H}_{\mathbf{k - 1}} & \mathfrak{H}_{\mathbf{k - 1,k}}^{\mathbf{\sim}} \\ \begin{matrix} \mathbf{h}_{\mathbf{k - 1}}\mathbf{e}_{\mathbf{1}}\mathbf{e}_{\mathbf{s(k - 1)}}^{\mathbf{T}} \\ \mathbf{0}_{\mathbf{1,s(k - 1)}} \\ \end{matrix} & \begin{matrix} \mathbf{H}_{\mathbf{k}} \\ \mathbf{h}_{\mathbf{k}}\mathbf{e}_{\mathbf{s}}^{\mathbf{T}} \\ \end{matrix} \\ \end{pmatrix}$$
26. 　　　　　　$$\mathcal{H}_{\mathbf{k}}\mathbf{=}\begin{pmatrix} \mathfrak{H}_{\mathbf{k - 1,k}}^{\mathbf{\sim}} \\ \mathbf{H}_{\mathbf{k}} \\ \mathbf{h}_{\mathbf{k}}\mathbf{e}_{\mathbf{s}}^{\mathbf{T}} \\ \end{pmatrix}$$
27. 　　　　　　$$\mathbf{for\ o = 1 + sk,\ldots,s(k + 1)}$$
28. 　　　　　　　　$$\mathbf{Givens\ rotation\ \lbrack o,}\mathcal{H}_{\mathbf{k}}\mathbf{,\zeta\rbrack}$$
29. 　　　　　　　　$$\mathbf{{if\ }}\left\| \mathbf{\zeta}_{\mathbf{o + 1}} \right\|\mathbf{\ /\ }\left\| \mathbf{b} \right\|\mathbf{{\ small\ enough\ then}}$$
30. 　　　　　　　　　　$$\mathbf{{quit}}$$
31. 　　　　　　　　$$\mathbf{{endif}}$$
32. 　　　　　　$$\mathbf{{end}}$$
33. 　　　　$$\mathbf{{endif}}$$
34. 　　$$\mathbf{{end}}$$
35. 　　$$\mathbf{Eigen\ Value\ problem\ \lbrack\mathfrak{H}_{t - 1}^{\sim},E,z\rbrack}$$
36. 　　$$\mathbf{Eigen Vectors \ X = \mathfrak{Q}_{t - 1}^{\sim}z}$$

2 Preconditioning

In iterative algorithms, the application of a preconditioner of the form $$K \approx A$$ can help to reduce the number of iterations until convergence. The notation [$${{solve\ }}\widehat{{p}}{{\ from\ K}}\widehat{{p}}{= \ }{p}$$] means [$$approximately solve {A}\widehat{p}{= \ }{p}$$ with respect to the vector $$\widehat{p}$$]. Although the more accurate approximation leads to the less number of iterations until convergence, the cost of preconditioner is normally increased. In non-parallel computing, one of the most common preconditioners is the incomplete LU factorization (ILU). However, in parallel computing, parallel preconditioners are needed. PARCEL provides the following parallel preconditioners: Point Jacobi, Block Jacobi and Additive Schwarz.

2.1 Point Jacobi Preconditioner 

The point Jacobi preconditioner is one of the most simplest preconditioners. $$K$$ only consists of the diagonal elements of the matrix $$A$$. Compared to other preconditioners the efficiency of the point Jacobi preconditioner is very low. However, it can be easily applied in parallel and it does not require any additional communication between processors.

2.2 Zero Fill-in Incomplete LU Factorization Preconditioners (ILU(0)) 

LU factorization decompose a square matrix $$A$$ into a lower triangular matrix $$L$$ and an upper triangular matrix $$U$$. For a typical sparse matrix, the LU factors can be much less sparse than the original matrix, which is called as fill-in. This increases the memory and computational requirements. In order to avoid this issue, PARCEL provides the zero fill-in incomplete LU factorization (ILU(0)).

2.3 Diagonal Incomplete LU Factorization Preconditioners (D-ILU) 

The diagonal incomplete LU factorization (D-ILU) computes a diagonal matrix $$D$$, a lower triangular matrix $$L$$ and an upper triangular matrix $$U$$ for the input matrix $$A$$. $A = L_{A} + D_{A} + U_{A}$ The preconditioner $$K$$ is constructed with $$L_{A}$$, $$U_{A}$$ and $$D$$ ($$D \neq D_{A}$$) with: $K = \left( D + L_{A} \right)D^{- 1}\left( D + U_{A} \right)$ Two different conditions exist in order to construct the diagonal matrix $$D$$:

• Condition 1: The diagonal elements of $$K$$ equal those of $$A$$: $A_{{ii}} = K_{{ii}}\ \ (i = 1,\ldots,n)$

• Condition 2: The row sum of $$K$$ equals the row sum of $$A$$: $\sum_{j}^{}A_{{ij}} = \sum_{j}^{}K_{{ij}}\ \ (i = 1,\ldots,n)$

Given the conditions above $$D$$ can be computed as follows:

• If condition 1 should be fullfilled the following computation is used: $D_{{ii}} = A_{{ii}} - \sum_{j < i}^{}{A_{{ij}}D_{jj}^{-1}A_{{ji}}}\ \ (i = 1,\ldots,n)$

• If condition 2 should be fullfilled the following computation is used: $D_{{ii}} = A_{{ii}} - \sum_{j < i}^{}{\sum_{k > j}^{}{A_{{ij}}D_{{jj}}^{-1}A_{jk}}}\ \ (i = 1,\ldots,n)$

2.4 Block Jacobi Preconditioner 

The Block Jacobi Preconditioner constructs $$K$$ out of the diagonal blocks of the matrix $$A$$. For each block, the incomplete ILU factorization is computed. When the preconditioner is applied, each block can be processed independently, resulting in a high level of parallelism. In addition, no communication is needed when each block is defined within a sub-matrix on each processor.

The Additive Schwarz Preconditioner constructs $$K$$ out of the overlapping diagonal blocks of $$A$$. For each block the incomplete LU factorization is computed. Compared to the Block Jacobi Preconditioner, the Additive Schwarz Preconditioner may require additional communication. PARCEL provides four different overlapping methods, BASIC, RESTRICT, INTERPOLATE and NONE. Fig.1 Additive Schwarz Preconditioner with overlapping diagonal blocks

BASIC

Solve $$Ks=r$$ with an extended $$r$$ including an overlapping region with the neighboring process, $$s$$ in the overlapping region is determined by summing up the results between the neighboring region. Fig.2 BASIC

RESTRICT

Solve $$Ks=r$$ with an extended $$r$$ including an overlapping region with the neighboring process, $$s$$ in the overlapping region is determined without taking account of the result in the neighboring region. Fig.3 RESTRICT

INTERPOLATE

Solve $$Ks=r$$ with an extended $$r$$ without an overlapping region with the neighboring process, $$s$$ in the overlapping region is determined by summing up the results between the neighboring regions. Fig.4 INTERPOLATE

NONE

Solve $$Ks=r$$ with an extended $$r$$ without an overlapping region with the neighboring process, $$s$$ in the overlapping region is determined without taking account of the result in the neighboring regions. Fig.5 NONE

2.6 Fine-block Preconditioner 

The fine-block preconditioner generates a preconditionig matrix so that SIMD operations can be applied to the incomplete LU decomposition. In each block of the block Jacobi preconditioner or the additive Schwarz preconditioner, fine diagonal blocks are defined. By ignoring off diagonal elements only in the fine diagonal blocks, data dependency is eliminated, and SIMD operations are enabled. In Fig.6, 3 $$\times$$ 3 fine blocks are defined within 9 $$\times$$ 9 blocks of the block Jacobi preconditioner, where data dependency is eliminated by ignoring off diagonal elements only within fine blocks (shown in white) and three vector elements are processed by SIMD operations.

 Fig.6 Subdividing preconditioning

3 QR factorization

This section describes the QR decomposition algorithms that can be used in this library.

3.1 Classical Gram-Schmidt (CGS) Method

The QR factorization by the Gram-Schmidt orthogonalization. This algorithm has High parallelism but poor orthogonality.

• Algorithm description
1. $$\mathbf{f}\mathbf{or\ i = 1,\ldots,n\ do}$$
2. 　　$$\mathbf{for\ k = 1,i - 1\ do}$$
3. 　　　　$$\mathbf{R}_{\mathbf{k,i}}\mathbf{= <}\mathbf{Q}_{\mathbf{k}}\mathbf{,}\mathbf{V}_{\mathbf{i}}\mathbf{>}$$
4. 　　$$\mathbf{{enddo\ }}$$
5. 　　$$\mathbf{for\ k = 1,i - 1\ do}$$
6. 　　　　$$\mathbf{V}_{\mathbf{i}}\mathbf{=}\mathbf{V}_{\mathbf{i}}\mathbf{-}\mathbf{R}_{\mathbf{k,i}}\mathbf{Q}_{\mathbf{k}}$$
7. 　　　　$$\mathbf{{enddo\ }}$$
8. 　　$$\mathbf{R}_{\mathbf{i,i}}\mathbf{= <}\mathbf{V}_{\mathbf{i}}\mathbf{,}\mathbf{V}_{\mathbf{i}}\mathbf{>}$$
9. 　　$$\mathbf{Q}_{\mathbf{i}}\mathbf{=}\frac{\mathbf{1}}{\left\| \mathbf{V}_{\mathbf{i}} \right\|}\mathbf{V}_{\mathbf{i}}$$
10. $$\mathbf{{enddo\ }}$$

3.2 Modified Gram-Schmidt (MGS) Method

The QR factorization with a modified algorithm of the classical Gram-Schmidt method to reduce the error. This algorithm improves the orthogonality from the classical Gram-Schmidt method, but requires more collective communication.

• Algorithm description
1. $$\mathbf{f}\mathbf{or\ i = 1,\ldots,n\ do}$$
2. 　　$$\mathbf{for\ k = 1,i - 1\ do}$$
3. 　　　　$$\mathbf{R}_{\mathbf{k,i}}\mathbf{= <}\mathbf{Q}_{\mathbf{k}}\mathbf{,}\mathbf{V}_{\mathbf{i}}\mathbf{>}$$
4. 　　　　$$\mathbf{V}_{\mathbf{i}}\mathbf{=}\mathbf{V}_{\mathbf{i}}\mathbf{-}\mathbf{R}_{\mathbf{k,i}}\mathbf{Q}_{\mathbf{k}}$$
5. 　　$$\mathbf{{enddo\ }}$$
6. 　　$$\mathbf{R}_{\mathbf{i,i}}\mathbf{= <}\mathbf{V}_{\mathbf{i}}\mathbf{,}\mathbf{V}_{\mathbf{i}}\mathbf{>}$$
7. 　　$$\mathbf{Q}_{\mathbf{i}}\mathbf{=}\frac{\mathbf{1}}{\left\| \mathbf{V}_{\mathbf{i}} \right\|}\mathbf{V}_{\mathbf{i}}$$
8. $$\mathbf{{enddo\ }}$$

3.3 Tall Skinny QR (TSQR) Method

TSQR is based on the Householder QR factorization shown below, and has good orthogonality. The sequential TSQR is used between threads, while the parallel TSQR is used between processes.

• Algorithm description
1. $$\mathbf{for\ i = 1,\ldots,n\ do}$$
2. 　　$$\mathbf{y}_{\mathbf{i}}\mathbf{(i:m) = A(i:m,i) -}\left\| \mathbf{A(i:m,i)} \right\|\mathbf{e}_{\mathbf{i}}\mathbf{(i:m)}$$
3. 　　$$\mathbf{t}_{\mathbf{i}}\mathbf{=}\frac{\mathbf{2}}{\mathbf{<}\mathbf{y}_{\mathbf{i}}\left( \mathbf{i:m} \right)\mathbf{,\ }\mathbf{y}_{\mathbf{i}}\mathbf{(i:m) >}}$$
4. 　　$$\mathbf{Q}_{\mathbf{i}}\mathbf{=}\left( \mathbf{I -}\mathbf{t}_{\mathbf{i}}\mathbf{y}_{\mathbf{i}}\mathbf{(i:m)}{\mathbf{y}_{\mathbf{i}}\mathbf{(i:m)}}^{\mathbf{T}} \right)$$
5. 　　$$\mathbf{A(i:m,i) =}\mathbf{Q}_{\mathbf{i}}\mathbf{A(i:m,i)}$$
6. $$\mathbf{{enddo}}$$
Sequential TSQR Parallel TSQR  3.4 Cholesky QR Method

The Cholesky QR factoriztion consists of a matrix product and the Cholesky factorization. This algorithm has high computational intensity and can be computed with one collective communication. However, its orthogonality is poor.

• Algorithm description
1. 　$$\mathbf{B =}\mathbf{V}^{\mathbf{T}}\mathbf{V}$$
2. 　$$\mathbf{Cholesky\ decomposition(B)}$$
3. 　$$\mathbf{Q = V}\mathbf{R}^{\mathbf{- 1}}$$

3.5 Cholesky QR2 Method

The Cholesky QR2 factorization performs the second Cholesky QR factorization for the orthogonal matrix obtained by the first Cholesky QR factorization. This algorithm improves the orthogonality by executing the Cholesky QR factorization twice.

• Algorithm description
1. 　$$\mathbf{B}_{\mathbf{1}}\mathbf{=}\mathbf{V}^{\mathbf{T}}\mathbf{V}$$
2. 　$$\mathbf{R}_{\mathbf{1}}^{\mathbf{T}}\mathbf{R}_{\mathbf{1}}\mathbf{= Cholesky\ decomposition(}\mathbf{B}_{\mathbf{1}}\mathbf{)}$$
3. 　$$\mathbf{Q}_{\mathbf{1}}\mathbf{= V}\mathbf{R}_{\mathbf{1}}^{\mathbf{- 1}}$$
4. 　$$\mathbf{B}_{\mathbf{2}}\mathbf{=}\mathbf{Q}_{\mathbf{1}}^{\mathbf{T}}\mathbf{Q}_{\mathbf{1}}$$
5. 　$$\mathbf{R}_{\mathbf{2}}^{\mathbf{T}}\mathbf{R}_{\mathbf{2}}\mathbf{= Cholesky\ decomposition(}\mathbf{B}_{\mathbf{2}}\mathbf{)}$$
6. 　$$\mathbf{Q}_{\mathbf{2}}\mathbf{=}\mathbf{Q}_{\mathbf{1}}\mathbf{R}_{\mathbf{2}}^{\mathbf{- 1}}$$
7. 　$$\mathbf{V =}\mathbf{Q}_{\mathbf{1}}\mathbf{R}_{\mathbf{1}}\mathbf{=}\mathbf{Q}_{\mathbf{2}}\mathbf{R}_{\mathbf{2}}\mathbf{R}_{\mathbf{1}}\mathbf{= QR}$$
8. 　$$\mathbf{Q =}\mathbf{Q}_{\mathbf{2}}$$
9. 　$$\mathbf{R =}\mathbf{R}_{\mathbf{2}}\mathbf{R}_{\mathbf{1}}$$

4 Sparse Matrix Data Formats

In order to save memory in sparse matrix computation, only the non-zero elements and their locations are stored. PARCEL supports different sparse matrix formats.

4.1 Compressed Row Storage (CRS) Format

The CRS format compresses the row indices of the non-zero elements of the sparse matrix. Together with the compressed rows, the column index and the value for each non-zero element are stored in one dimensional arrays. Compresed row indeces are stored in an incremental order.

$A = \left( \begin{array}{ccc} a & b & c & 0\\ 0 & d & 0 & 0\\ e & 0 & f & g\\ 0 & h & i & j \end{array} \right)$

One process case

• In case a single process processes the whole matrix from above

Values of non-zero elements :: $$crsA=\{a,b,c,d,e,f,g,h,i,j\}$$

Compressed row indices :: $$crsRow\_ptr=\{1,4,5,8,11\}$$

Column indices of non-zero elements :: $$crsCol=\{1,2,3,2,1,3,4,2,3,4\}$$

Two process case

• Rank 0 process

Values of non-zero elements :: $$crsA=\{a,b,c,d\}$$

Compressed row indices :: $$crsRow\_ptr=\{1,4,5\}$$

Column indices of non-zero elements :: $$crsCol=\{1,2,3,2\}$$

• Rank 1 process

Values of non-zero elements :: $$crsA=\{e,f,g,h,i,j\}$$

Compressed row indices :: $$crsRow\_ptr=\{1,4,7\}$$

Column indices of non-zero elements :: $$crsCol=\{1,3,4,2,3,4\}$$

4.2 Diagonal (DIA) Format

The DIA format stores the non-zero elements of a matrix as diagonals. This format provides high performance for band block diagonal matrices. In addition to the element values, an offset value for each diagonal of the matrix needs to be stored inside an one dimensional array. A negative offset indicates that the diagonal is below the main diagonal of the matrix, a positive offset indicates that the diagonal is above the main diagonal. The main diagonal has the offset zero. Offset values are stored in an incremental order.

$A = \left( \begin{array}{ccc} a & b & c & 0\\ 0 & d & 0 & 0\\ e & 0 & f & g\\ 0 & h & i & j \end{array} \right)$

Single process case

• In case the matrix above is processed by a single process

Elements :: $$diaA=\{0,0,e,h,0,0,0,i,a,d,f,j,b,0,g,0,c,0,0,0\}$$

Offsets :: $$offset=\{-2,-1,0,1,2\}$$

Number of diagonals :: $$nnd=5$$

Two process case

• Rank 0 process

Elements :: $$diaA=\{a,d,b,0,c,0\}$$

Offsets :: $$offset=\{0,1,2\}$$

Number of diagonals :: $$nnd=3$$

• Rank 1 process

Elements :: $$diaA=\{e,h,0,i,f,j,g,0\}$$

Offsets :: $$offset=\{-2,-1,0,1\}$$

Number of diagonals :: $$nnd=4$$

4.3 Domain Decomposition Method (DDM) Format

4.3.1 Domain Decomposition Parameters

In the DDM format, the DIA format is extended by assuming simulations based on stencil computation such as finite difference with domain decomposition. By specifying a list of processes which exchange data with each other so that it matches domain decomposition in the simulation, one can minimize communication needed for parallel processing of matrices given by one-, two-, and three-dimensional structured grids. Related parameter settings are shown below.

• Number of neighbor processes: num_neighbor_ranks

The number of neighboring processes for $$m$$-dimensional domain decomposition is given by $$3^{m-1}$$. For example, in the two dimensional case, the number of neighboring processes is eight, while for the three dimensional case, it becomes $$26$$.

• 1D array: neighbor_ranks ( num_neighbor_ranks )

A list of the neoghboring processes is specified by their ranks on (one dimensional) MPI communicator. At the boundaries of computational domain, negative indexes are substituted to neighbor_ranks for the directions, where the neighboring processes do not exist. A sample code below can be used as the standard method for setting neighbor_ranks. The parameters npe_x, npe_y and npe_z represent the number of processes in each direction (x, y, z). rank_x, rank_y and rank_z represent the ranks in each direction. Here, the definition of one-dimensional index iirank should be consistent with that in the simulation.

 ixmax = 0; iymax = 0; izmax = 0; num = 0; neighbor_ranks = -1 if ( npe_x > 1 ) ixmax=1 if ( npe_y > 1 ) iymax=1 if ( npe_z > 1 ) izmax=1 do iz=-izmax, izmax 　　do iy=-iymax, iymax 　　　　do ix=-ixmax, ixmax 　　　　　　n1x = rank_x +ix; n1y = rank_y +iy; n1z = rank_z +iz; 　　　　　　if ( ix*ix +iy*iy + iz*iz == 0 ) cycle 　　　　　　num = num +1 　　　　　　if ( n1z >=0 .and. n1z < npe_z ) then 　　　　　　　　if ( n1y >=0 .and. n1y < npe_y ) then 　　　　　　　　　　if ( n1x >=0 .and. n1x < npe_x ) then 　　　　　　　　　　　　iirank = npe_x *npe_y *n1z +npe_x *n1y +n1x; 　　　　　　　　　　　　neighbor_ranks(num) = iirank; 　　　　　　　　　　endif 　　　　　　　　endif 　　　　　　endif 　　　　end do 　　end do end do

• 2D array: ioff_grids (nnd, ncomp_grids )

Grid positions referred in stencil computation such as finite difference are specified in each direction. For example, in the case of seven stencil computation consisting of three point finite difference for three-dimensional Poisson equation, the grid offsets for East, West, South, North, Up and Down from the base grid can be set as shown below.

i  ioff_grids(*,1)  ioff_grids(*,2)  ioff_grids(*,3)
base 1 0 0 0
east 2 1 0 0
west 3 -1 0 0
south 4 0 1 0
north 5 0 -1 0
top 6 0 0 1
bottom 7 0 0 -1

• Matrix Structure One-dimensional Poisson equation with three point finite difference.

$A = \left( \begin{array}{ccc} a & b & 0 & 0\\ c & d & e & 0\\ 0 & f & g & h\\ 0 & 0 & i & j \end{array} \right)$

Single process case

• Rank 0 process

Elements　　　　 ::　$$val\_dia=\{a,d,g,j,b,e,h,0,0,c,f,i\}$$

Offsets 　　　 ::　$$ioff\_dia=\{0,0,-1\}$$

Number of diagonals　::　$$nnd=3$$

Stencils 　::

i  ioff_grids(*,1)
reference 1 0
east　 2 1
west　 3 -1

Two process case

• Rank 0 process

Elements　　　　 ::　$$val\_dia=\{a,d,b,e,0,c\}$$

Offsets 　　　 ::　$$ioff\_dia=\{0,0,-1\}$$

Number of diagonals　::　$$nnd=3$$

Stencils 　::

i  ioff_grids(*,1)
reference 1 0
east　 2 1
west　 3 -1
• Rank 1 process

Elements　　　　 ::　$$val\_dia=\{g,j,h,0,f,i\}$$

Offsets 　　　 ::　$$ioff\_dia=\{0,0,0\}$$

Number of diagonals　::　$$nnd=3$$

Stencils 　::

i  ioff_grids(*,1)
reference 1 0
east　 2 1
west　 3 -1

5 Install

This chapter explains how to compile and install the PARCEL library.

5.1 Compiling the PARCEL library

PARCEL requires a C compiler, a FORTRAN compiler, which supports preprocessor, MPI, LAPACK, and OpenMP. When compiling with $$gfortran$$, the option $$-cpp$$ has to be specified to to activate preprocessor. Compiler options can be changed inside the file arch/make_config. An example of a make_config file is shown below.

To compile, execute the make command in the top directory of the extracted archive. Then, execute “make install” to create the example, include, lib directory in the directory specified in INSTALLDIR. If parcel_sparse.a is created in the lib directory, the compilation is successful. For large-scale problems of 2 billion dimensions or more, 64-bit integers are required, so it is necessary to compile with 64-bit integer types.

• An example of make_config

CC     = mpiicc
FC     = mpiifort

CFLAGS  = -O3
FFLAGS  = -O3 -fpp -qopenmp -xHost  -mcmodel=large

FP_MODE = -fp-model precise

INCLUDEDIR =
LD_MPI     = -lmpifort -lifcore
LD_LAPACK  = -mkl

#FDEFINE = -DPARCEL_INT8
INSTALLDIR = ../PARCEL_1.1

Option     Explanation
CC C compiler command name
FC FORTRAN compiler command name
CFLAGS C compiler options
FFLAGS FORTRAN compiler options
FP_MODE Option to suppress optimizations with arithmetic order change (Only for quad precision)
INCLUDEDIR Include directories
LD_MPI MPI library
LD_LAPACK LAPACK library
FDEFINE 64bit integer becomes available with -DPARCEL_INT8.
INSTALLDIR PARCEL installation directory

5.2 Usage of the PARCEL library

All the routines in the PARCEL library can be called by linking parcel_sparse.a which is generated in the directory lib.

6 PARCEL routines

The interface of each rountine in the PARCEL library is described. The following implementations are commonry used.

• Precision

The names of routines starting with ‘parcel_d’ and ‘parcel_dd’ donote the PARCEL routines implemented in Double precision and Quadruple (Double-Double) precision, respectively.

• Communication-computation overlap

The communication-computation overlap is implemneted in the following three modes. iovlflag=0 does not use any communication-computation overlap. iovlflag=1 uses the communication-computation overlap, in which computation processes are started after all non-blocking communications are launched at once. iovlflag=0 uses the communication-computation overlap, in which non-blocking communication and computation in each direction are processed sequencially with taking barrior synchronization, which reuses and saves communication buffers.

• Number of blocks in the block Jacobi preconditioner and the additive Schwarz preconditioner

When precon_thblock is zero or negative, or is greater than the number of threads per process, precon_nblock is set to be the number of threads per process, which is the default value. One can reduce the number of threads to enlarge the block size and improve the accuracy of preconditioning, but this also leads to the degradation of thread parallel performance.

• Size of diagonal blocks in the fine-block preconditioner

When independ_nvec is zero or negative, the fine-block preconditioner is not applied. On A64FX, independ_nvec=300 is recommended to facilitate optimizations for SIMD operations and software pipelining on the Fujitsu compiler. Larger independ_nvec improves the computational performance, while the accuracy of preconditioning becomes worse. When independ_nvec is greater than the size of vector on each process n, the fine-block preconditioner becomes equivalent to the point Jacobi preconditioner.

Thread optimization of SpMV employs a cyclic loop division to facilitate the reuse of on-cache data between different threads. nBlock is the chank size for this loop division. When nBlock is zero or negative, the default value of nBlock=2000 is set.

• Data structure of eigenvector in CA-Arnoldi method

When the i-th eigenvalue is a real value, the corresponding eigenvector is stored in Evec(n*(i-1)+1:n*i).

When the i-th eigenvalue is a complex value and the (i+1)-th eigenvalue is its complex conjugate value, the i-th and (i+1)-th eigenvectors are stored as follows.

Eigenvector Real Part Imaginary Part
i-th     Evec(n*(i-1)+1:n*i)         Evec(n*i+1:n*(i+1))
(i+1)-th     Evec(n*(i-1)+1:n*i)        -Evec(n*i+1:n*(i+1))

6.1 parcel_dcg

• Interface

call parcel_dcg( icomm, vecx,vecb, n,gn,nnz,istart, crsA,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, iret )

• Function

A simultaneous linear equation system Ax = b is solved by the conjugate gradient method (CG method). Non-zero elements are stored in CRS format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx(n) double precision in/out in: initial vector, out: solution vector
vecb(n) double precision in Right hand side vector
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
nnz integer*4 / integer*8 in Number of non-zero elements on each process
istart integer*4 / integer*8 in Start line of matrix on each process
crsA(nnz) double precision in Non-zero elements of matrix stored in CRS format
crsRow_ptr(n+1) integer*4 / integer*8 in Pointer table in CRS format
crsCol(nnz) integer*4 / integer*8 in Column numbers of non-zero elements in CRS format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz)
ilu_method integer in Flag for incomplete LU factorization(0:ILU(0),1:D-ILU(DIA components match),2:D-ILU(Element sums in a row match),Only 1 and 2 are available for additive Schwartz)
iflagAS integer in Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
iret integer out Error code (0:normal)

6.2 parcel_dbicgstab

• Interface

call parcel_dbicgstab( icomm, vecx,vecb, n,gn,nnz,istart, crsA,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, iret )

• Function

A simultaneous linear equation system Ax = b is solved by the stabilized biconjugate gradient method (Bi-CGSTAB method). Non-zero elements are stored​in CRS format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx(n) double precision in/out in: initial vector, out: solution vector
vecb(n) double precision in Right hand side vector
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
nnz integer*4 / integer*8 in Number of non-zero elements on each process
istart integer*4 / integer*8 in Start line of matrix on each process
crsA(nnz) double precision in Non-zero elements of matrix stored in CRS format
crsRow_ptr(n+1) integer*4 / integer*8 in Pointer table in CRS format
crsCol(nnz) integer*4 / integer*8 in Column numbers of non-zero elements in CRS format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz)
ilu_method integer in Flag for incomplete LU factorization(0:ILU(0),1:D-ILU(DIA components match),2:D-ILU(Element sums in a row match),Only 1 and 2 are available for additive Schwartz)
iflagAS integer in Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
iret integer out Error code (0:normal)

6.3 parcel_dgmres

• Interface

call parcel_dgmres( icomm, vecx,vecb, n,gn,nnz,istart, crsA,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, gmres_m,gmres_GSflag, iret )

• Function

A simultaneous linear equation system Ax = b is solved by the generalized minimum residual method (GMRES (m) method). Non-zero elements are stored in CRS format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx(n) double precision in/out in: initial vector, out: solution vector
vecb(n) double precision in Right hand side vector
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
nnz integer*4 / integer*8 in Number of non-zero elements on each process
istart integer*4 / integer*8 in Start line of matrix on each process
crsA(nnz) double precision in Non-zero elements of matrix stored in CRS format
crsRow_ptr(n+1) integer*4 / integer*8 in Pointer table in CRS format
crsCol(nnz) integer*4 / integer*8 in Column numbers of non-zero elements in CRS format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz)
ilu_method integer in Flag for incomplete LU factorization(0:ILU(0),1:D-ILU(DIA components match),2:D-ILU(Element sums in a row match),Only 1 and 2 are available for additive Schwartz)
iflagAS integer in Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
gmres_m integer in Number of iterations until restart
gmres_GSflag integer in Flag for orthogonalization algorithm (1: MGS, 2: CGS)
iret integer out Error code (0:normal)

6.4 parcel_dcagmres

• Interface

call parcel_dcagmres( icomm, vecx,vecb, n,gn,nnz,istart, crsA,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, cagmres_sstep, cagmres_tstep, cagmres_basis, cagmres_QRflag, iret )

• Function

A simultaneous linear equation system Ax = b is solved by the communication reduced generalized minimum residual method (CA-GMRES (s, t) method). Non-zero elements are stored in CRS format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx(n) double precision in/out in: initial vector, out: solution vector
vecb(n) double precision in Right hand side vector
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
nnz integer*4 / integer*8 in Number of non-zero elements on each process
istart integer*4 / integer*8 in Start line of matrix on each process
crsA(nnz) double precision in Non-zero elements of matrix stored in CRS format
crsRow_ptr(n+1) integer*4 / integer*8 in Pointer table in CRS format
crsCol(nnz) integer*4 / integer*8 in Column numbers of non-zero elements in CRS format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz)
ilu_method integer in Flag for incomplete LU factorization(0:ILU(0),1:D-ILU(DIA components match),2:D-ILU(Element sums in a row match),Only 1 and 2 are available for additive Schwartz)
iflagAS integer in Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
cagmres_sstep integer in Number of communication avoiding steps s in the CA-GMRES method
cagmres_tstep integer in Number of outer iterations t in the CA-GMRES method (restart length = st)
cagmres_basis integer in Flag for basis vector in the CA-GMRES method (0: monomial basis, 1: Newton basis)
cagmres_QRflag integer in Flag for QR factorization in the CA-GMRES method (1: MGS, 2: CGS, 3: TSQR, 4: CholeskyQR, 5: CholeskyQR2)
iret integer out Error code (0:normal)

6.5 parcel_dcbcg

• Interface

call parcel_dcbcg( icomm, vecx,vecb, n,gn,nnz,istart, crsA,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, cbcg_kstep,cbcg_Eigenflag,power_method_itrmax, caarnoldi_sstep,caarnoldi_tstep, caarnoldi_basis,caarnoldi_QRflag, iret )

• Function

A simultaneous linear equation system Ax = b is solved by the Chebyshev basis conjugate gradient method (CBCG method). Non-zero elements are stored in CRS format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx(n) double precision in/out in: initial vector, out: solution vector
vecb(n) double precision in Right hand side vector
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
nnz integer*4 / integer*8 in Number of non-zero elements on each process
istart integer*4 / integer*8 in Start line of matrix on each process
crsA(nnz) double precision in Non-zero elements of matrix stored in CRS format
crsRow_ptr(n+1) integer*4 / integer*8 in Pointer table in CRS format
crsCol(nnz) integer*4 / integer*8 in Column numbers of non-zero elements in CRS format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz)
ilu_method integer in Flag for incomplete LU factorization(0:ILU(0),1:D-ILU(DIA components match),2:D-ILU(Element sums in a row match),Only 1 and 2 are available for additive Schwartz)
iflagAS integer in Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
cbcg_kstep integer in Number of communication avoiding steps
cbcg_Eigenflag integer in Flag for eigenvalue computation (1: power method, 2: CA-ARNOLDI)
power_method_itrmax integer in Maximum number of iterations in the power method
caarnoldi_sstep integer in Number of communication avoiding steps s of the CA-ARNOLDI method
caarnoldi_tstep integer in Number of outer iterations t in the CA-ARNOLDI method (restart length = st)
caarnoldi_basis integer in Flag for basis vector in the CA-ARNOLDI method (0: monomial basis, 1: Newton basis)
caarnoldi_QRflag integer in Flag for QR factorization in the CA-ARNOLDI method (1: MGS, 2: CGS, 3: TSQR, 4: Cholesky QR, 5: Cholesky QR2)
iret integer out Error code (0:normal)

6.6 parcel_dcaarnoldi

• Interface

call parcel_dcaarnoldi( icomm, vecx, vecb, n, gn, nnz, istart, crsA, crsRow_ptr, crsCol, ipreflag, ilu_method, addL, iflagAS, itrmax, iovlflag, precon_thblock, independ_nvec, nBlock, caarnoldi_sstep, caarnoldi_tstep, caarnoldi_basis, caarnoldi_QRflag, Evalr, Evali, Evec, Eerr, EmaxID, EminID, iret )

• Function

An eigenvalue problem is solved by the communication avoiding Arnoldi method (CA-Arnoldi((s,t))). Non-zero elements are stored in CRS format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx(n) double precision in/out in: initial vector, out: solution vector
vecb(n) double precision in Right hand side vector
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
nnz integer*4 / integer*8 in Number of non-zero elements on each process
istart integer*4 / integer*8 in Start line of matrix on each process
crsA(nnz) double precision in Non-zero elements of matrix stored in CRS format
crsRow_ptr(n+1) integer*4 / integer*8 in Pointer table in CRS format
crsCol(nnz) integer*4 / integer*8 in Column numbers of non-zero elements in CRS format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz)
ilu_method integer in Flag for incomplete LU factorization(0:ILU(0),1:D-ILU(DIA components match),2:D-ILU(Element sums in a row match),Only 1 and 2 are available for additive Schwartz)
iflagAS integer in Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
caarnoldi_sstep integer in Number of communication avoiding steps s in the CA-ARNOLDI method
caarnoldi_tstep integer in Number of outer iterations t in the CA-ARNOLDI method (restart length = st)
caarnoldi_basis integer in Flag for basis vector in the CA-ARNOLDI method (0: monomial basis, 1: Newton basis)
caarnoldi_QRflag integer in Flag for QR factorization in the CA-ARNOLDI method (1: MGS, 2: CGS, 3: TSQR, 4: CholeskyQR, 5: CholeskyQR2)
Evalr(caarnoldi_sstep*caarnoldi_tstep) double precision out Real part of the eigenvalue
Evali(caarnoldi_sstep*caarnoldi_tstep) double precision out Imaginary part of eigenvalue
Evec(n*caarnoldi_sstep*caarnoldi_tstep) double precision out Eigenvectors
Eerr(caarnoldi_sstep*caarnoldi_tstep) double precision out Error norm $$\frac{\left\| Ax - \lambda x \right\|}{\left\| \text{λx} \right\|}$$
EmaxID integer out ID number of the maximum eigenvalue
EminID integer out ID number of the minimam eigenvalue
iret integer out Error code (0:normal)

6.7 parcel_dcg_dia

• Interface

call parcel_dcg_dia( icomm, vecx,vecb, n,gn,istart, diaA,offset,nnd, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, iret )

• Function

A simultaneous linear equation system Ax = b is solved by the conjugate gradient method (CG method). Non-zero elements are stored in DIA format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx(n) double precision in/out in: initial vector, out: solution vector
vecb(n) double precision in Right hand side vector
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
istart integer*4 / integer*8 in Start line of matrix on each process
diaA(n*nnd) double precision in Non-zero elements of matrix stored in DIA format
offset(nnd) integer*4 / integer*8 in Offset of each non-zero diagonal elements array in DIA format
nnd integer*4 / integer*8 in Number of diagonal elements arrays in DIA format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz)
ilu_method integer in Flag for incomplete LU factorization(0:ILU(0),1:D-ILU(DIA components match),2:D-ILU(Element sums in a row match),Only 1 and 2 are available for additive Schwartz)
iflagAS integer in Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
iret integer out Error code (0:normal)

6.8 parcel_dbicgstab_dia

• Interface

call parcel_dbicgstab_dia( icomm, vecx,vecb, n,gn,istart, diaA,offset,nnd, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, iret )

• Function

A simultaneous linear equation system Ax = b is solved by the stabilized biconjugate gradient method (Bi-CGSTAB method). Non-zero elements are stored in DIA format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx(n) double precision in/out in: initial vector, out: solution vector
vecb(n) double precision in Right hand side vector
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
istart integer*4 / integer*8 in Start line of matrix on each process
diaA(n*nnd) double precision in Non-zero elements of matrix stored in DIA format
offset(nnd) integer*4 / integer*8 in Offset of each non-zero diagonal elements array in DIA format
nnd integer*4 / integer*8 in Number of diagonal elements arrays in DIA format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz)
ilu_method integer in Flag for incomplete LU factorization(0:ILU(0),1:D-ILU(DIA components match),2:D-ILU(Element sums in a row match),Only 1 and 2 are available for additive Schwartz)
iflagAS integer in Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
iret integer out Error code (0:normal)

6.9 parcel_dgmres_dia

• Interface

call parcel_dgmres_dia( icomm, vecx,vecb, n,gn,istart, diaA,offset,nnd, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, gmres_m,gmres_GSflag, iret )

• Function

A simultaneous linear equation system Ax = b is solved by the generalized minimum residual method (GMRES (m) method). Non-zero elements are stored in DIA format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx(n) double precision in/out in: initial vector, out: solution vector
vecb(n) double precision in Right hand side vector
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
istart integer*4 / integer*8 in Start line of matrix on each process
diaA(n*nnd) double precision in Non-zero elements of matrix stored in DIA format
offset(nnd) integer*4 / integer*8 in Offset of each non-zero diagonal elements array in DIA format
nnd integer*4 / integer*8 in Number of diagonal elements arrays in DIA format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz)
ilu_method integer in Flag for incomplete LU factorization(0:ILU(0),1:D-ILU(DIA components match),2:D-ILU(Element sums in a row match),Only 1 and 2 are available for additive Schwartz)
iflagAS integer in Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
gmres_m integer in Number of iterations until restart
gmres_GSflag integer in Flag for orthogonalization algorithm (1: MGS, 2: CGS)
iret integer out Error code (0:normal)

6.10 parcel_dcagmres_dia

• Interface

call parcel_dcagmres_dia( icomm, vecx,vecb, n,gn,istart, diaA,offset,nnd, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, cagmres_sstep, cagmres_tstep, cagmres_basis, cagmres_QRflag, iret )

• Function

A simultaneous linear equation system Ax = b is solved by the communication reduced generalized minimum residual method (CA-GMRES (s, t) method). Non-zero elements are stored in DIA format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx(n) double precision in/out in: initial vector, out: solution vector
vecb(n) double precision in Right hand side vector
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
istart integer*4 / integer*8 in Start line of matrix on each process
diaA(n*nnd) double precision in Non-zero elements of matrix stored in DIA format
offset(nnd) integer*4 / integer*8 in Offset of each non-zero diagonal elements array in DIA format
nnd integer*4 / integer*8 in Number of diagonal elements arrays in DIA format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz)
ilu_method integer in Flag for incomplete LU factorization(0:ILU(0),1:D-ILU(DIA components match),2:D-ILU(Element sums in a row match),Only 1 and 2 are available for additive Schwartz)
iflagAS integer in Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
cagmres_sstep integer in Small communication step of CA-GMRES method
cagmres_tstep integer in List step of the CA-GMRES method
cagmres_basis integer in Base generation flag of CA-GMRES method (0: single basis, 1: newton basis)
cagmres_QRflag integer in QR Decomposition flag of CA-GMRES method (1: MGS, 2: CGS, 3: TSQR, 4: CholeskyQR, 5: CholeskyQR2)
iret integer out Error code (0:normal)

6.11 parcel_dcbcg_dia

• Interface

call parcel_dcbcg_dia( icomm, vecx,vecb, n,gn,istart, diaA,offset,nnd, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, cbcg_kstep,cbcg_Eigenflag,power_method_itrmax, caarnoldi_sstep,caarnoldi_tstep, caarnoldi_basis,caarnoldi_QRflag, iret )

• Function

A simultaneous linear equation system Ax = b is solved by the Chebyshev basis conjugate gradient method (CBCG method). Non-zero elements are stored in DIA format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx(n) double precision in/out in: initial vector, out: solution vector
vecb(n) double precision in Right hand side vector
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
istart integer*4 / integer*8 in Start line of matrix on each process
diaA(n*nnd) double precision in Non-zero elements of matrix stored in DIA format
offset(nnd) integer*4 / integer*8 in Offset of each non-zero diagonal elements array in DIA format
nnd integer*4 / integer*8 in Number of diagonal elements arrays in DIA format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz)
ilu_method integer in Flag for incomplete LU factorization(0:ILU(0),1:D-ILU(DIA components match),2:D-ILU(Element sums in a row match),Only 1 and 2 are available for additive Schwartz)
iflagAS integer in Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
cbcg_kstep integer in Number of communication avoiding steps
cbcg_Eigenflag integer in Flag for eigenvalue computation (1: power method, 2: CA-ARNOLDI)
power_method_itrmax integer in Maximum number of iterations in the power method
caarnoldi_sstep integer in Number of communication avoiding steps s of the CA-ARNOLDI method
caarnoldi_tstep integer in Number of outer iterations t in the CA-ARNOLDI method (restart length = st)
caarnoldi_basis integer in Flag for basis vector in the CA-ARNOLDI method (0: monomial basis, 1: Newton basis)
caarnoldi_QRflag integer in Flag for QR factorization in the CA-ARNOLDI method (1: MGS, 2: CGS, 3: TSQR, 4: Cholesky QR, 5: Cholesky QR2)
iret integer out Error code (0:normal)

6.12 parcel_dcaarnoldi_dia

• Interface

call parcel_dcaarnoldi_dia( icomm, vecx, vecb, n, gn, istart, diaA,offset,nnd, ipreflag, ilu_method, addL, iflagAS, itrmax, iovlflag, precon_thblock, independ_nvec, nBlock, caarnoldi_sstep, caarnoldi_tstep, caarnoldi_basis, caarnoldi_QRflag, Evalr, Evali, Evec, Eerr, EmaxID, EminID, iret )

• Function

An eigenvalue problem is solved by the communication avoiding Arnoldi method (CA-Arnoldi((s,t))). Non-zero elements are stored in DIA format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx(n) double precision in/out in: initial vector, out: solution vector
vecb(n) double precision in Right hand side vector
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
istart integer*4 / integer*8 in Start line of matrix on each process
diaA(n*nnd) double precision in Non-zero elements of matrix stored in DIA format
offset(nnd) integer*4 / integer*8 in Offset of each non-zero diagonal elements array in DIA format
nnd integer*4 / integer*8 in Number of diagonal elements arrays in DIA format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz)
ilu_method integer in Flag for incomplete LU factorization(0:ILU(0),1:D-ILU(DIA components match),2:D-ILU(Element sums in a row match),Only 1 and 2 are available for additive Schwartz)
iflagAS integer in Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
caarnoldi_sstep integer in Number of communication avoiding steps s in the CA-ARNOLDI method
caarnoldi_tstep integer in Number of outer iterations t in the CA-ARNOLDI method (restart length = st)
caarnoldi_basis integer in Flag for basis vector in the CA-ARNOLDI method (0: monomial basis, 1: Newton basis)
caarnoldi_QRflag integer in Flag for QR factorization in the CA-ARNOLDI method (1: MGS, 2: CGS, 3: TSQR, 4: CholeskyQR, 5: CholeskyQR2)
Evalr(caarnoldi_sstep*caarnoldi_tstep) double precision out Real part of the eigenvalue
Evali(caarnoldi_sstep*caarnoldi_tstep) double precision out Imaginary part of eigenvalue
Evec(n*caarnoldi_sstep*caarnoldi_tstep) double precision out Eigenvectors
Eerr(caarnoldi_sstep*caarnoldi_tstep) double precision out Error norm $${\left\| Ax - \lambda x \right\|}/{\left\| \text{λx} \right\|}$$
EmaxID integer out ID number of the maximum eigenvalue
EminID integer out ID number of the minimam eigenvalue
iret integer out Error code (0:normal)

6.13 parcel_dcg_ddm

• Interface

call parcel_dcg_ddm( icomm, x,b, m, nnd,ioff_dia,val_dia, num_neighbor_ranks,neighbor_ranks,margin,ncomp_grids, ndiv_grids,num_grids,ioff_grids, div_direc_th, ipreflag, addL,iflagAS, itrmax,rtolmax,reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, iret )

• Function

A simultaneous linear equation system Ax = b is solved by the conjugate gradient method (CG method). Non-zero elements are stored in DDM format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
x(m) double precision in/out in: initial vector, out: solution vector
b(m) double precision in Right hand side vector
m integer in Size of vector on each process
nnd integer in Number of diagonal elements arrays on each process
ioff_dia(nnd) integer in Offset of each non-zero diagonal elements array in DDM format
val_dia(m*nnd) double precision in Non-zero elements of matrix stored in DDM format
num_neighbor_ranks integer in Number of neighboring processes
neighbor_ranks(num_neighbor_ranks) integer in List of neighboring processes
margin integer in Maximum number of absolute value of ioff_grids
ncomp_grids integer in Dimension of structured grids
ndiv_grids(ncomp_grids) integer in Number of decomposed domains in each direction
num_grids(ncomp_grids) integer in Number of grids on each process
ioff_grids(nnd,ncomp_grids) integer in Offset of stencil data location in each direction
div_direc_th integer in Direction of thread parallelization ( 1:x, 2:y, 3:z)
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz)
iflagAS integer in Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
iret integer out Error code (0:normal)

6.14 parcel_dbicgstab_ddm

• Interface

call parcel_dbicgstab_ddm( icomm, x,b, m, nnd,ioff_dia,val_dia, num_neighbor_ranks,neighbor_ranks,margin,ncomp_grids, ndiv_grids,num_grids,ioff_grids, div_direc_th, ipreflag, addL,iflagAS, itrmax,rtolmax,reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, iret )

• Function

A simultaneous linear equation system Ax = b is solved by the stabilized biconjugate gradient method (Bi-CGSTAB method). Non-zero elements are stored in DDM format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
x(m) double precision in/out in: initial vector, out: solution vector
b(m) double precision in Right hand side vector
m integer in Size of vector on each process
nnd integer in Number of diagonal elements arrays on each process
ioff_dia(nnd) integer in Offset of each non-zero diagonal elements array in DDM format
val_dia(m*nnd) double precision in Non-zero elements of matrix stored in DDM format
num_neighbor_ranks integer in Number of neighboring processes
neighbor_ranks(num_neighbor_ranks) integer in List of neighboring processes
margin integer in Maximum number of absolute value of ioff_grids
ncomp_grids integer in Dimension of structured grids
ndiv_grids(ncomp_grids) integer in Number of decomposed domains in each direction
num_grids(ncomp_grids) integer in Number of grids on each process
ioff_grids(nnd,ncomp_grids) integer in Offset of stencil data location in each direction
div_direc_th integer in Direction of thread parallelization ( 1:x, 2:y, 3:z)
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz)
iflagAS integer in Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
iret integer out Error code (0:normal)

6.15 parcel_dgmres_ddm

• Interface

call parcel_dgmres_ddm( icomm, x,b, m, nnd,ioff_dia,val_dia, num_neighbor_ranks,neighbor_ranks,margin,ncomp_grids, ndiv_grids,num_grids,ioff_grids, div_direc_th, ipreflag, addL,iflagAS, itrmax,rtolmax,reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, gmres_m, gmres_GSflag, iret )

• Function

A simultaneous linear equation system Ax = b is solved by the generalized minimum residual method (GMRES (m) method). Non-zero elements are stored in DDM format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
x(m) double precision in/out in: initial vector, out: solution vector
b(m) double precision in Right hand side vector
m integer in Size of vector on each process
nnd integer in Number of diagonal elements arrays on each process
ioff_dia(nnd) integer in Offset of each non-zero diagonal elements array in DDM format
val_dia(m*nnd) double precision in Non-zero elements of matrix stored in DDM format
num_neighbor_ranks integer in Number of neighboring processes
neighbor_ranks(num_neighbor_ranks) integer in List of neighboring processes
margin integer in Maximum number of absolute value of ioff_grids
ncomp_grids integer in Dimension of structured grids
ndiv_grids(ncomp_grids) integer in Number of decomposed domains in each direction
num_grids(ncomp_grids) integer in Number of grids on each process
ioff_grids(nnd,ncomp_grids) integer in Offset of stencil data location in each direction
div_direc_th integer in Direction of thread parallelization ( 1:x, 2:y, 3:z)
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz)
iflagAS integer in Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
gmres_m integer in Number of iterations until restart
gmres_GSflag integer in Flag for orthogonalization algorithm (1: MGS, 2: CGS)
iret integer out Error code (0:normal)

6.16 parcel_dcagmres_ddm

• Interface

call parcel_dcagmres_ddm( icomm, x,b, m, nnd,ioff_dia,val_dia, num_neighbor_ranks,neighbor_ranks,margin,ncomp_grids, ndiv_grids,num_grids,ioff_grids, div_direc_th, ipreflag, addL,iflagAS, itrmax,rtolmax,reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, cagmres_sstep, cagmres_tstep, cagmres_basis, cagmres_QRflag, iret )

• Function

A simultaneous linear equation system Ax = b is solved by the communication reduced generalized minimum residual method (CA-GMRES (s, t) method). Non-zero elements are stored in DDM format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
x(m) double precision in/out in: initial vector, out: solution vector
b(m) double precision in Right hand side vector
m integer in Size of vector on each process
nnd integer in Number of diagonal elements arrays on each process
ioff_dia(nnd) integer in Offset of each non-zero diagonal elements array in DDM format
val_dia(m*nnd) double precision in Non-zero elements of matrix stored in DDM format
num_neighbor_ranks integer in Number of neighboring processes
neighbor_ranks(num_neighbor_ranks) integer in List of neighboring processes
margin integer in Maximum number of absolute value of ioff_grids
ncomp_grids integer in Dimension of structured grids
ndiv_grids(ncomp_grids) integer in Number of decomposed domains in each direction
num_grids(ncomp_grids) integer in Number of grids on each process
ioff_grids(nnd,ncomp_grids) integer in Offset of stencil data location in each direction
div_direc_th integer in Direction of thread parallelization ( 1:x, 2:y, 3:z)
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz)
iflagAS integer in Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
cagmres_sstep integer in Number of communication avoiding steps s in the CA-GMRES method
cagmres_tstep integer in Number of outer iterations t in the CA-GMRES method (restart length = st)
cagmres_basis integer in Flag for basis vector in the CA-GMRES method (0: monomial basis, 1: Newton basis)
cagmres_QRflag integer in Flag for QR factorization in the CA-GMRES method (1: MGS, 2: CGS, 3: TSQR, 4: CholeskyQR, 5: CholeskyQR2)
iret integer out Error code (0:normal)

6.17 parcel_dcbcg_ddm

• Interface

call parcel_dcbcg_ddm( icomm, x,b, m, nnd,ioff_dia,val_dia, num_neighbor_ranks,neighbor_ranks,margin,ncomp_grids, ndiv_grids,num_grids,ioff_grids, div_direc_th, ipreflag, addL,iflagAS, itrmax,rtolmax,reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, cbcg_kstep,cbcg_Eigenflag,power_method_itrmax, caarnoldi_sstep,caarnoldi_tstep, caarnoldi_basis,caarnoldi_QRflag, iret )

• Function

A simultaneous linear equation system Ax = b is solved by the Chebyshev basis conjugate gradient method (CBCG method). Non-zero elements are stored in DDM format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
x(m) double precision in/out in: initial vector, out: solution vector
b(m) double precision in Right hand side vector
m integer in Size of vector on each process
nnd integer in Number of diagonal elements arrays on each process
ioff_dia(nnd) integer in Offset of each non-zero diagonal elements array in DDM format
val_dia(m*nnd) double precision in Non-zero elements of matrix stored in DDM format
num_neighbor_ranks integer in Number of neighboring processes
neighbor_ranks(num_neighbor_ranks) integer in List of neighboring processes
margin integer in Maximum number of absolute value of ioff_grids
ncomp_grids integer in Dimension of structured grids
ndiv_grids(ncomp_grids) integer in Number of decomposed domains in each direction
num_grids(ncomp_grids) integer in Number of grids on each process
ioff_grids(nnd,ncomp_grids) integer in Offset of stencil data location in each direction
div_direc_th integer in Direction of thread parallelization ( 1:x, 2:y, 3:z)
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz)
iflagAS integer in Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
cbcg_kstep integer in Number of communication avoiding steps
cbcg_Eigenflag integer in Flag for eigenvalue computation (1: power method, 2: CA-ARNOLDI)
power_method_itrmax integer in Maximum number of iterations in the power method
caarnoldi_sstep integer in Number of communication avoiding steps s of the CA-ARNOLDI method
caarnoldi_tstep integer in Number of outer iterations t in the CA-ARNOLDI method (restart length = st)
caarnoldi_basis integer in Flag for basis vector in the CA-ARNOLDI method (0: monomial basis, 1: Newton basis)
caarnoldi_QRflag integer in Flag for QR factorization in the CA-ARNOLDI method (1: MGS, 2: CGS, 3: TSQR, 4: Cholesky QR, 5: Cholesky QR2)
iret integer out Error code (0:normal)

6.18 parcel_dcaarnoldi_ddm

• Interface

call parcel_dcaarnoldi_ddm( icomm, x,b, m, nnd,ioff_dia,val_dia, num_neighbor_ranks,neighbor_ranks,margin,ncomp_grids, ndiv_grids,num_grids,ioff_grids, div_direc_th, ipreflag, addL,iflagAS, itrmax,rtolmax,reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, caarnoldi_sstep, caarnoldi_tstep, caarnoldi_basis, caarnoldi_QRflag, Evalr, Evali, Evec, Eerr, EmaxID, EminID, iret )

• Function

An eigenvalue problem is solved by the communication avoiding Arnoldi method (CA-Arnoldi((s,t))). Non-zero elements are stored in DDM format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
x(m) double precision in/out in: initial vector, out: solution vector
b(m) double precision in Right hand side vector
m integer in Size of vector on each process
nnd integer in Number of diagonal elements arrays on each process
ioff_dia(nnd) integer in Offset of each non-zero diagonal elements array in DDM format
val_dia(m*nnd) double precision in Non-zero elements of matrix stored in DDM format
num_neighbor_ranks integer in Number of neighboring processes
neighbor_ranks(num_neighbor_ranks) integer in List of neighboring processes
margin integer in Maximum number of absolute value of ioff_grids
ncomp_grids integer in Dimension of structured grids
ndiv_grids(ncomp_grids) integer in Number of decomposed domains in each direction
num_grids(ncomp_grids) integer in Number of grids on each process
ioff_grids(nnd,ncomp_grids) integer in Offset of stencil data location in each direction
div_direc_th integer in Direction of thread parallelization ( 1:x, 2:y, 3:z)
ipreflag integer in
iflagAS integer in Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
caarnoldi_sstep integer in Number of communication avoiding steps s in the CA-ARNOLDI method
caarnoldi_tstep integer in Number of outer iterations t in the CA-ARNOLDI method (restart length = st)
caarnoldi_basis integer in Flag for basis vector in the CA-ARNOLDI method (0: monomial basis, 1: Newton basis)
caarnoldi_QRflag integer in Flag for QR factorization in the CA-ARNOLDI method (1: MGS, 2: CGS, 3: TSQR, 4: CholeskyQR, 5: CholeskyQR2)
Evalr(caarnoldi_sstep*caarnoldi_tstep) double precision out Real part of the eigenvalue
Evali(caarnoldi_sstep*caarnoldi_tstep) double precision out Imaginary part of eigenvalue
Evec(m*caarnoldi_sstep*caarnoldi_tstep) double precision out Eigenvectors
Eerr(caarnoldi_sstep*caarnoldi_tstep) double precision out Error norm $${\left\| Ax - \lambda x \right\|}/{\left\| \text{λx} \right\|}$$
EmaxID integer out ID number of the maximum eigenvalue
EminID integer out ID number of the minimam eigenvalue
iret integer out Error code (0:normal)

6.19 parcel_ddcg

• Interface

call parcel_ddcg( icomm, vecx_hi,vecx_lo, vecb_hi,vecb_lo, n,gn,nnz,istart, crsA_hi,crsA_lo,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, precision_A, precision_b, precision_x, precision_precon, iret )

• Function

A simultaneous linear equation system Ax = b is solved by the conjugate gradient method (CG method). Non-zero elements are stored in CRS format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx_hi(n) double precision in/out in: Initial vector (upper bits), out: Solution vector (upper bits)
vecx_lo(n) double precision in/out in: initial value of the iteration (lower bit), out: solution vector (lower bit)
vecb_hi(n) double precision in Right hand side vector (upper bit)
vecb_lo(n) double precision in Right hand side vector (lower bit)
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
nnz integer*4 / integer*8 in Number of non-zero elements on each process
istart integer*4 / integer*8 in Start line of matrix on each process
crsA_hi(nnz) double precision in Non-zero elements (upper bits) of matrix stored in CRS format
crsA_lo(nnz) double precision in Non-zero elements (lower bit) of matrix stored in CRS format
crsRow_ptr(n+1) integer*4 / integer*8 in Pointer table in CRS format
crsCol(nnz) integer*4 / integer*8 in Column numbers of non-zero elements in CRS format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz)
ilu_method integer in Flag for incomplete LU factorization(0:ILU(0),1:D-ILU(DIA components match),2:D-ILU(Element sums in a row match),Only 1 and 2 are available for additive Schwartz)
iflagAS integer in Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
precision_A integer in Precision of matrix (1: double precision, 2: quad precision)
precision_b integer in Precision of right hand side vector (1: double precision, 2: quad precision)
precision_x integer in Precision of solution vector (1: double precision, 2: quad precision)
precision_precon integer in Precision of preconditioner matrix (1: double precision, 2: quad precision)
iret integer out Error code (0:normal)

6.20 parcel_ddbicgstab

• Interface

call parcel_ddbicgstab( icomm, vecx_hi,vecx_lo, vecb_hi,vecb_lo, n,gn,nnz,istart, crsA_hi,crsA_lo,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, precision_A, precision_b, precision_x, precision_precon, iret )

• Function

A simultaneous linear equation system Ax = b is solved by the stabilized biconjugate gradient method (Bi-CGSTAB method). Non-zero elements are stored in CRS format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx_hi(n) double precision in/out in: Initial vector (upper bits), out: Solution vector (upper bits)
vecx_lo(n) double precision in/out in: initial value of the iteration (lower bit), out: solution vector (lower bit)
vecb_hi(n) double precision in Right hand side vector (upper bit)
vecb_lo(n) double precision in Right hand side vector (lower bit)
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
nnz integer*4 / integer*8 in Number of non-zero elements on each process
istart integer*4 / integer*8 in Start line of matrix on each process
crsA_hi(nnz) double precision in Non-zero elements (upper bits) of matrix stored in CRS format
crsA_lo(nnz) double precision in Non-zero elements (lower bit) of matrix stored in CRS format
crsRow_ptr(n+1) integer*4 / integer*8 in Pointer table in CRS format
crsCol(nnz) integer*4 / integer*8 in Column numbers of non-zero elements in CRS format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz)
ilu_method integer in Flag for incomplete LU factorization(0:ILU(0),1:D-ILU(DIA components match),2:D-ILU(Element sums in a row match),Only 1 and 2 are available for additive Schwartz)
iflagAS integer in Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
precision_A integer in Precision of matrix (1: double precision, 2: quad precision)
precision_b integer in Precision of right hand side vector (1: double precision, 2: quad precision)
precision_x integer in Precision of solution vector (1: double precision, 2: quad precision)
precision_precon integer in Precision of preconditioner matrix (1: double precision, 2: quad precision)
iret integer out Error code (0:normal)

6.21 parcel_ddgmres

• Interface

call parcel_ddgmres( icomm, vecx_hi,vecx_lo, vecb_hi,vecb_lo, n,gn,nnz,istart, crsA_hi,crsA_lo,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, gmres_m,gmres_GSflag, precision_A, precision_b, precision_x, precision_precon, iret )

• Function

A simultaneous linear equation system Ax = b is solved by the generalized minimum residual method (GMRES (m) method). Non-zero elements are stored in CRS format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx_hi(n) double precision in/out in: Initial vector (upper bits), out: Solution vector (upper bits)
vecx_lo(n) double precision in/out in: initial value of the iteration (lower bit), out: solution vector (lower bit)
vecb_hi(n) double precision in Right hand side vector (upper bit)
vecb_lo(n) double precision in Right hand side vector (lower bit)
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
nnz integer*4 / integer*8 in Number of non-zero elements on each process
istart integer*4 / integer*8 in Start line of matrix on each process
crsA_hi(nnz) double precision in Non-zero elements (upper bits) of matrix stored in CRS format
crsA_lo(nnz) double precision in Non-zero elements (lower bit) of matrix stored in CRS format
crsRow_ptr(n+1) integer*4 / integer*8 in Pointer table in CRS format
crsCol(nnz) integer*4 / integer*8 in Column numbers of non-zero elements in CRS format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz)
ilu_method integer in Flag for incomplete LU factorization(0:ILU(0),1:D-ILU(DIA components match),2:D-ILU(Element sums in a row match),Only 1 and 2 are available for additive Schwartz)
iflagAS integer in Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
gmres_m integer in Number of iterations until restart
gmres_GSflag integer in Flag for orthogonalization algorithm (1: MGS, 2: CGS)
precision_A integer in Precision of matrix (1: double precision, 2: quad precision)
precision_b integer in Precision of right hand side vector (1: double precision, 2: quad precision)
precision_x integer in Precision of solution vector (1: double precision, 2: quad precision)
precision_precon integer in Precision of preconditioner matrix (1: double precision, 2: quad precision)
iret integer out Error code (0:normal)

6.22 parcel_ddcg_dia

• Interface

call parcel_ddcg_dia( icomm, vecx_hi,vecx_lo, vecb_hi,vecb_lo, n,gn,istart, diaA_hi,diaA_lo,offset,nnd, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, precision_A, precision_b, precision_x, precision_precon, iret )

• Function

A simultaneous linear equation system Ax = b is solved by the conjugate gradient method (CG method). Non-zero elements are stored in DIA format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx_hi(n) double precision in/out in: Initial vector (upper bits), out: Solution vector (upper bits)
vecx_lo(n) double precision in/out in: initial value of the iteration (lower bit), out: solution vector (lower bit)
vecb_hi(n) double precision in Right hand side vector (upper bit)
vecb_lo(n) double precision in Right hand side vector (lower bit)
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
istart integer*4 / integer*8 in Start line of matrix on each process
diaA_hi(n*nnd) double precision in Non-zero elements (upper bits) of matrix stored in DIA format
diaA_lo(n*nnd) double precision in Non-zero elements (lower bit) of matrix stored in DIA format
offset(nnd) integer*4 / integer*8 in Offset of each non-zero diagonal elements array in DIA format
nnd integer*4 / integer*8 in Number of diagonal elements arrays in DIA format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz)
ilu_method integer in Flag for incomplete LU factorization(0:ILU(0),1:D-ILU(DIA components match),2:D-ILU(Element sums in a row match),Only 1 and 2 are available for additive Schwartz)
iflagAS integer in Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
precision_A integer in Precision of matrix (1: double precision, 2: quad precision)
precision_b integer in Precision of right hand side vector (1: double precision, 2: quad precision)
precision_x integer in Precision of solution vector (1: double precision, 2: quad precision)
precision_precon integer in Precision of preconditioner matrix (1: double precision, 2: quad precision)
iret integer out Error code (0:normal)

6.23 parcel_ddbicgstab_dia

• Interface

call parcel_ddbicgstab_dia( icomm, vecx_hi,vecx_lo, vecb_hi,vecb_lo, n,gn,istart, diaA_hi,diaA_lo,offset,nnd, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, precision_A, precision_b, precision_x, precision_precon, iret )

• Function

A simultaneous linear equation system Ax = b is solved by the stabilized biconjugate gradient method (Bi-CGSTAB method). Non-zero elements are stored in DIA format

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx_hi(n) double precision in/out in: Initial vector (upper bits), out: Solution vector (upper bits)
vecx_lo(n) double precision in/out in: initial value of the iteration (lower bit), out: solution vector (lower bit)
vecb_hi(n) double precision in Right hand side vector (upper bit)
vecb_lo(n) double precision in Right hand side vector (lower bit)
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
istart integer*4 / integer*8 in Start line of matrix on each process
diaA_hi(n*nnd) double precision in Non-zero elements (upper bits) of matrix stored in DIA format
diaA_lo(n*nnd) double precision in Non-zero elements (lower bit) of matrix stored in DIA format
offset(nnd) integer*4 / integer*8 in Offset of each non-zero diagonal elements array in DIA format
nnd integer*4 / integer*8 in Number of diagonal elements arrays in DIA format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz)
ilu_method integer in Flag for incomplete LU factorization(0:ILU(0),1:D-ILU(DIA components match),2:D-ILU(Element sums in a row match),Only 1 and 2 are available for additive Schwartz)
iflagAS integer in Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
precision_A integer in Precision of matrix (1: double precision, 2: quad precision)
precision_b integer in Precision of right hand side vector (1: double precision, 2: quad precision)
precision_x integer in Precision of solution vector (1: double precision, 2: quad precision)
precision_precon integer in Precision of preconditioner matrix (1: double precision, 2: quad precision)
iret integer out Error code (0:normal)

6.24 parcel_ddgmres_dia

• Interface

call parcel_ddgmres_dia( icomm, vecx_hi,vecx_lo, vecb_hi,vecb_lo, n,gn,istart, diaA_hi,diaA_lo,offset,nnd, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, precision_A, precision_b, precision_x, precision_precon, iret )

• Function

A simultaneous linear equation system Ax = b is solved by the generalized minimum residual method (GMRES (m) method). Non-zero elements are stored in DIA format

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
vecx_hi(n) double precision in/out in: Initial vector (upper bits), out: Solution vector (upper bits)
vecx_lo(n) double precision in/out in: initial value of the iteration (lower bit), out: solution vector (lower bit)
vecb_hi(n) double precision in Right hand side vector (upper bit)
vecb_lo(n) double precision in Right hand side vector (lower bit)
n integer*4 / integer*8 in Size of vector on each process
gn integer*4 / integer*8 in Total size of vector
istart integer*4 / integer*8 in Start line of matrix on each process
diaA_hi(n*nnd) double precision in Non-zero elements (upper bits) of matrix stored in DIA format
diaA_lo(n*nnd) double precision in Non-zero elements (lower bit) of matrix stored in DIA format
offset(nnd) integer*4 / integer*8 in Offset of each non-zero diagonal elements array in DIA format
nnd integer*4 / integer*8 in Number of diagonal elements arrays in DIA format
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz)
ilu_method integer in Flag for incomplete LU factorization(0:ILU(0),1:D-ILU(DIA components match),2:D-ILU(Element sums in a row match),Only 1 and 2 are available for additive Schwartz)
iflagAS integer in Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in Convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
gmres_m integer in Number of iterations until restart
gmres_GSflag integer in Flag for orthogonalization algorithm (1: MGS, 2: CGS)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
precision_A integer in Precision of matrix (1: double precision, 2: quad precision)
precision_b integer in Precision of right hand side vector (1: double precision, 2: quad precision)
precision_x integer in Precision of solution vector (1: double precision, 2: quad precision)
precision_precon integer in Precision of preconditioner matrix (1: double precision, 2: quad precision)
iret integer out Error code (0:normal)

6.25 parcel_ddcg_ddm

• Interface

call parcel_ddcg_ddm( icomm, x_hi,x_lo, b_hi,b_lo, m, nnd,ioff_dia, val_dia_hi,val_dia_lo, num_neighbor_ranks,neighbor_ranks,margin,ncomp_grids, ndiv_grids,num_grids,ioff_grids, div_direc_th, ipreflag, addL,iflagAS, itrmax,rtolmax,reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, precision_A, precision_b, precision_x, precision_precon, iret )

• Function

A simultaneous linear equation system Ax = b is solved by the conjugate gradient method (CG method). Non-zero elements are stored in DDM format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
x_hi(m) double precision in/out in: Initial vector (upper bits), out: Solution vector (upper bits)
x_lo(m) double precision in/out in: Initial vector (lower bits), out: Solution vector (lower bits)
b_hi(m) double precision in Right hand side vector (upper bit)
b_lo(m) double precision in Right hand side vector (lower bit)
m integer in Size of vector on each process
nnd integer in Number of diagonal elements arrays on each process
ioff_dia(nnd) integer in Offset of each non-zero diagonal elements array in DDM format
val_dia_hi(m*nnd) double precision in Non-zero elements of matrix stored in DDM format (upper bit)
val_dia_lo(m*nnd) double precision in Non-zero elements of matrix stored in DDM format (lower bit)
num_neighbor_ranks integer in Number of neighboring processes
neighbor_ranks(num_neighbor_ranks) integer in List of neighboring processes
margin integer in Maximum number of absolute value of ioff_grids
ncomp_grids integer in Dimension of structured grids
ndiv_grids(ncomp_grids) integer in Number of decomposed domains in each direction
num_grids(ncomp_grids) integer in Number of grids on each process
ioff_grids(nnd,ncomp_grids) integer in Offset of stencil data location in each direction
div_direc_th integer in Direction of thread parallelization ( 1:x, 2:y, 3:z)
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz)
iflagAS integer in Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
precision_A integer in Precision of matrix (1: double precision, 2: quad precision)
precision_b integer in Precision of right hand side vector (1: double precision, 2: quad precision)
precision_x integer in Precision of solution vector (1: double precision, 2: quad precision)
precision_precon integer in Precision of preconditioner matrix (1: double precision, 2: quad precision)
iret integer out Error code (0:normal)

6.26 parcel_ddbicgstab_ddm

• Interface

call parcel_ddbicgstab_ddm( icomm, x_hi,x_lo, b_hi,b_lo, m, nnd,ioff_dia, val_dia_hi,val_dia_lo, num_neighbor_ranks,neighbor_ranks,margin,ncomp_grids, ndiv_grids,num_grids,ioff_grids, div_direc_th, ipreflag, addL,iflagAS, itrmax,rtolmax,reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, precision_A, precision_b, precision_x, precision_precon, iret )

• Function

A simultaneous linear equation system Ax = b is solved by the stabilized biconjugate gradient method (Bi-CGSTAB method). Non-zero elements are stored in DDM format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
x_hi(m) double precision in/out in: Initial vector (upper bits), out: Solution vector (upper bits)
x_lo(m) double precision in/out in: Initial vector (lower bits), out: Solution vector (lower bits)
b_hi(m) double precision in Right hand side vector (upper bit)
b_lo(m) double precision in Right hand side vector (lower bit)
m integer in Size of vector on each process
nnd integer in Number of diagonal elements arrays on each process
ioff_dia(nnd) integer in Offset of each non-zero diagonal elements array in DDM format
val_dia_hi(m*nnd) double precision in Non-zero elements of matrix stored in DDM format (upper bit)
val_dia_lo(m*nnd) double precision in Non-zero elements of matrix stored in DDM format (lower bit)
num_neighbor_ranks integer in Number of neighboring processes
neighbor_ranks(num_neighbor_ranks) integer in List of neighboring processes
margin integer in Maximum number of absolute value of ioff_grids
ncomp_grids integer in Dimension of structured grids
ndiv_grids(ncomp_grids) integer in Number of decomposed domains in each direction
num_grids(ncomp_grids) integer in Number of grids on each process
ioff_grids(nnd,ncomp_grids) integer in Offset of stencil data location in each direction
div_direc_th integer in Direction of thread parallelization ( 1:x, 2:y, 3:z)
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz)
iflagAS integer in Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
precision_A integer in Precision of matrix (1: double precision, 2: quad precision)
precision_b integer in Precision of right hand side vector (1: double precision, 2: quad precision)
precision_x integer in Precision of solution vector (1: double precision, 2: quad precision)
precision_precon integer in Precision of preconditioner matrix (1: double precision, 2: quad precision)
iret integer out Error code (0:normal)

6.27 parcel_ddgmres_ddm

• Interface

call parcel_ddgmres_ddm( icomm, x_hi,x_lo, b_hi,b_lo, m, nnd,ioff_dia, val_dia_hi,val_dia_lo, num_neighbor_ranks,neighbor_ranks,margin,ncomp_grids, ndiv_grids,num_grids,ioff_grids, div_direc_th, ipreflag, addL,iflagAS, itrmax,rtolmax,reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, gmres_m, gmres_GSflag, precision_A, precision_b, precision_x, precision_precon, iret )

• Function

A simultaneous linear equation system Ax = b is solved by the generalized minimum residual method (GMRES (m) method). Non-zero elements are stored in DDM format.

Parameter(dimension) Type Input/Output Description
icomm integer in MPI communicator
x_hi(m) double precision in/out in: Initial vector (upper bits), out: Solution vector (upper bits)
x_lo(m) double precision in/out in: Initial vector (lower bits), out: Solution vector (lower bits)
b_hi(m) double precision in Right hand side vector (upper bit)
b_lo(m) double precision in Right hand side vector (lower bit)
m integer in Size of vector on each process
nnd integer in Number of diagonal elements arrays on each process
ioff_dia(nnd) integer in Offset of each non-zero diagonal elements array in DDM format
val_dia_hi(m*nnd) double precision in Non-zero elements of matrix stored in DDM format (upper bit)
val_dia_lo(m*nnd) double precision in Non-zero elements of matrix stored in DDM format (lower bit)
num_neighbor_ranks integer in Number of neighboring processes
neighbor_ranks(num_neighbor_ranks) integer in List of neighboring processes
margin integer in Maximum number of absolute value of ioff_grids
ncomp_grids integer in Dimension of structured grids
ndiv_grids(ncomp_grids) integer in Number of decomposed domains in each direction
num_grids(ncomp_grids) integer in Number of grids on each process
ioff_grids(nnd,ncomp_grids) integer in Offset of stencil data location in each direction
div_direc_th integer in Direction of thread parallelization ( 1:x, 2:y, 3:z)
ipreflag integer in Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz)
iflagAS integer in Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE)
itrmax integer in/out in: maximum number of iterations, out: number of iterations
rtolmax double precision in convergence criterion (norm of relative residual error)
reshistory(itrmax) double precision out History of relative residual error
iovlflag integer in Flag for communication-computation overlap (0: none, 1: all processes, 2: each process)
precon_thblock integer in Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning
independ_nvec integer in Size of diagonal blocks to be calculated independently in subdivision preconditioning
nBlock integer in Number of rows allocated to the thread cyclically.
gmres_m integer in Number of iterations until restart
gmres_GSflag integer in Flag for orthogonalization algorithm (1: MGS, 2: CGS)
precision_A integer in Precision of matrix (1: double precision, 2: quad precision)
precision_b integer in Precision of right hand side vector (1: double precision, 2: quad precision)
precision_x integer in Precision of solution vector (1: double precision, 2: quad precision)
precision_precon integer in Precision of preconditioner matrix (1: double precision, 2: quad precision)
iret integer out Error code (0:normal)

6.28 parcel_dqr

• Interface

call parcel_dqr(n,s,icomm,V,Q,R,iQRflag)

• Function

QR factorization for the m*s matrix, where the sum of the parameter n for the MPI processes belonging to the MPI communicator icomm is m.

Parameter(dimension) Type Input/Output Description
n integer*4 / integer*8 in Size of rows in the matrix on each process
s integer*4 / integer*8 in Size of column in the matrix
icomm integer in MPI communicator
V(s*n) double precision in Input Matrix (Column Major)
Q(s*n) double precision out Orthogonal matrix (Column Major)
R(s*s) double precision out Upper triangular matrix (Column Major)
iQRflag integer in Flag for QR factorization (1:MGS,2:CGS,3:TSQR,4:CholeskyQR,5:CholeskyQR2)

7 How to use (Fortran)

The use of the PARCEL routines is explained by FORTRAN sample programs of a CG solver.

7.1 CRS Format (Fortran)

A sample code in CRS format is shown below. Here, make_matrix_CRS denotes an arbitrary routine, which generates a matrix in CRS format in PARCEL.

program main
use mpi
implicit none

integer n,gn,nnz,istart
real*8,allocatable  :: crsA(:)
integer,allocatable :: crsRow_ptr(:),crsCol(:)
real*8,allocatable  :: vecx(:)
real*8,allocatable  :: vecb(:)
integer itrmax
real*8  rtolmax
real*8, allocatable :: reshistory(:)
integer ipreflag
integer ILU_method
integer iflagAS
integer iovlflag
integer precon_thblock
integer independ_nvec
integer nblock
integer iret
integer ierr

call MPI_Init(ierr)

call make_matrix_CRS(n,gn,nnz,istart,crsA,crsRow_ptr,crsCol)

allocate(vecx(n))
allocate(vecb(n))
allocate(reshistory(itrmax))
ipreflag=0
ILU_method=1
iflagAS=1
itrmax=100
rtolmax=1.0d-8
iovlflag=0
precon_thblock=-1
independ_nvec=-1
nblock = 2000

vecb=1.0d0
vecx=1.0d0
call parcel_dcg( &
MPI_COMM_WORLD, &
vecx,vecb, &
n,gn,nnz,istart, &
crsA,crsRow_ptr,crsCol, &
itrmax,rtolmax, &
reshistory, &
iovlflag,iret &
precon_thblock, independ_nvec, &
nblock, &
iret &
)
call MPI_Finalize(ierr)

deallocate(vecx)
deallocate(vecb)

deallocate(reshistory)

end program main

7.2 DIA Format (Fortran)

A sample code in DIA format are shown below. Here, make_matrix_DIA denotes an arbitrary routine, which generates a matrix in DIA format in PARCEL.

program main
use mpi
implicit none

integer n,gn,nnd,istart
real*8,allocatable  :: diaA(:)
integer,allocatable :: offset(:)
real*8,allocatable  :: vecx(:)
real*8,allocatable  :: vecb(:)
integer itrmax
real*8  rtolmax
real*8, allocatable :: reshistory(:)
integer ipreflag
integer ILU_method
integer iflagAS
integer iovlflag
integer precon_thblock
integer independ_nvec
integer nblock
integer iret
integer ierr

call MPI_Init(ierr)

call make_matrix_DIA(n,gn,nnd,istart,diaA,offset)

allocate(vecx(n))
allocate(vecb(n))
allocate(reshistory(itrmax))

ipreflag=0
ILU_method=1
iflagAS=1
itrmax=100
rtolmax=1.0d-8
iovlflag=0
precon_thblock=-1
independ_nvec=-1
nblock = 2000

vecb=1.0d0
vecx=1.0d0

call parcel_dcg_dia(&
MPI_COMM_WORLD, &
vecx,vecb,&
n,gn,istart,&
diaA,offset,nnd,&
itrmax,rtolmax,&
reshistory,&
iovlflag,
precon_thblock, independ_nvec, &
nblock, &
iret&
)

call MPI_Finalize(ierr)

deallocate(reshistory)

end program main

7.3 DDM Format (Fortran)

A Fortran sample code in DDM format is shown below. Here, make_matrix_DDM denotes an arbitrary routine, which generates a matrix in DDM format in PARCEL.

program main
use mpi
implicit none

integer ityp_eq
integer gnx,gny,gnz
integer itrmax
integer MAX_NITER
integer nx,ny,nz
integer nx0,ny0,nz0
integer n,gn

integer m
integer m_
integer npes,myrank
integer ierr
integer i
integer ipreflag
integer iflagAS
real*8  rtolmax
real*8  abstolmax
integer solver,ityp_solver
integer iret
real*8,allocatable  :: reshistory_DDM(:)
! -----
integer ILU_method
integer iovlflag

real*8,allocatable  :: vecx(:)
real*8,allocatable  :: vecb(:)

integer niter,restart
integer npey,npez
integer precon_thblock
integer independ_nvec
integer nblock

integer npe_x,npe_y,npe_z
integer div_direc_th

integer rank_x, rank_y, rank_z
integer,parameter :: nnd = 7
integer,parameter :: margin = 1
integer,parameter :: ndim=3
integer npe_dim(ndim)
integer gn_grids(ndim)
integer n_grids(ndim)
integer istart_grids(ndim)
integer,parameter :: num_neighbor_ranks_max = 3**ndim -1
integer neighbor_ranks( num_neighbor_ranks_max )
integer num_neighbor_ranks
real*8, allocatable  :: diaA(:)
integer, allocatable :: offset(:)
integer, allocatable :: offset_dim(:,:)

namelist/input/ &
ityp_eq, &
gnx, &
gny, &
gnz, &
MAX_NITER, &
rtolmax, &
abstolmax, &
ipreflag, &
independ_nvec, &
iflagAS, &
ILU_method, &
iovlflag, &
ityp_solver, &
precon_thblock, &
nblock &

namelist/input_ddm/ &
npe_x,&
npe_y,&
npe_z,&
div_direc_th

open(33,file='input_namelist')
close(33)

stime = 0.0d0
etime = 0.0d0

itrmax = MAX_NITER
solver = ityp_solver
! -----
call MPI_Init(ierr)
call MPI_Comm_size(MPI_COMM_WORLD,npes,ierr)
call MPI_Comm_rank(MPI_COMM_WORLD,myrank,ierr)

gn_grids(1)  = gnx
gn_grids(2)  = gny
gn_grids(3)  = gnz

call set_rank_xyz( myrank, npe_x, npe_y, npe_z, rank_x, rank_y, rank_z )
call set_grids_decompostion( &
npe_x, npe_y, npe_z, &
rank_x, rank_y, rank_z,  &
ndim, gn_grids, npe_dim, n_grids,  &
istart_grids, margin, m )

num_neighbor_ranks = -1
call set_neighbor_ranks( &
npe_x, npe_y, npe_z, rank_x, rank_y, rank_z, &
num_neighbor_ranks, &
neighbor_ranks, num_neighbor_ranks_max)

allocate( diaA(nnd*m) )
allocate( offset(nnd) )
allocate( offset_dim(nnd,ndim) )
allocate( reshistory_DDM(MAX_NITER) )
allocate( vecx(m) )
allocate( vecx0(m) )
allocate( vecb(m) )

diaA = 0.0d0
vecx = 0.0d0
vecx0 = 0.0d0
vecb = 0.0d0
reshistory_DDM = 0
offset = 0
offset_dim = 0

call make_matrix_DDM( &
gnx,gny,gnz, &
ndim, n_grids,  &
istart_grids, &
margin, nnd, m, &
offset_dim, offset, &
diaA, vecb, vecx, vecx0,&
rank_x, rank_y, rank_z, &
npe_x, npe_y, npe_z )

call parcel_dcg_ddm(&
MPI_COMM_WORLD, &
vecx,vecb,  &
m, &
nnd, offset, diaA, &
num_neighbor_ranks, neighbor_ranks, margin, ndim,&
npe_dim, n_grids, offset_dim,&
div_direc_th, &
ipreflag, &
MAX_NITER, rtolmax, reshistory_DDM, &
iovlflag, &
precon_thblock, &
independ_nvec, &
nBlock, &
iret &
)

call MPI_Barrier(MPI_COMM_WORLD,ierr)

deallocate( diaA )
deallocate( offset )
deallocate( offset_dim )
deallocate( reshistory_DDM )
deallocate( vecx )
deallocate( vecx0 )
deallocate( vecb )

call MPI_Finalize(ierr)

end program main

8 How to use (C)

A Fortran routine can be called from C by adding "_" to the end of the routine name and passing the argument as a pointer. The PARCEL routines can also be used in C programs. Regarding the use of MPI, the C MPI communicator should be converted to the Fortran MPI communicator via MPI_Comm_c2f in the MPI library. In the following, we explain how to use the PARCEL routines by using a sample code of the CG method in C.

8.1 CRS Format (C)

A sample code in CRS format is shown below. Here, make_matrix_CRS denotes an arbitrary routine, which generates a matrix in CRS format in PARCEL.

#include < stdio.h >
#include < stdlib.h >
#include "mpi.h"
int main( int argc, char *argv[]){

int npes;
int myrank;
MPI_Fint icomm_fort;
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD, &npes);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
icomm_fort=MPI_Comm_c2f(MPI_COMM_WORLD);

int n,gn,nnz,istart;
double* crsA;
int* crsRow_ptr;
int* crsCol;
make_matrix_CRS(n,gn,nnz,istart,crsA,crsRow_ptr,crsCol)
double *vecx = (double *)malloc(sizeof(double)*n);
double *vecb = (double *)malloc(sizeof(double)*n);
for(int i=0;i < n;i++){
vecx[i] = 0.0;
vecb[i] = 1.0;
}
int ipreflag = 0;
int ilu_method = 1;
int iflagas = 1;
int max_niter = 100;
int rtolmax = 1.0e-8;
double *reshistory = (double *)malloc(sizeof(double)*max_niter);
int iovlflag = 0;
int precon_thblock = -1;
int independ_nvec = -1;
int nBlock = 2000;

parcel_dcg_(
&icomm_fort ,
vecx,
vecb,
&n,
&gn,
&nnz,
&istart,
crsa ,
crsrow_ptr,
crscol,
&ipreflag ,
&ilu_method ,
&iflagas ,
&max_niter ,
&rtolmax ,
reshistory,
&iovlflag ,
&precon_thblock ,
&independ_nvec ,
&nBlock,
&iret
);
free(vecx);
free(vecb);
free(reshistory);
MPI_Finalize();
}

8.2 DIA Format (C)

A sample code in DIA format are shown below. Here, make_matrix_DIA denotes an arbitrary routine, which generates a matrix in DIA format in PARCEL.

#include < stdio.h >
#include < stdlib.h >
#include "mpi.h"
int main( int argc, char *argv[]){

int npes;
int myrank;
MPI_Fint icomm_fort;
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD, &npes);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
icomm_fort=MPI_Comm_c2f(MPI_COMM_WORLD);

int n,gn,nnd,istart;
double* diaA;
int* offset;
make_matrix_DIA(n,gn,nnd,istart,diaA,offset)
double *vecx = (double *)malloc(sizeof(double)*n);
double *vecb = (double *)malloc(sizeof(double)*n);
for(int i=0;i < n;i++){
vecx[i] = 0.0;
vecb[i] = 1.0;
}
int ipreflag = 0;
int ilu_method = 1;
int iflagas = 1;
int max_niter = 100;
int rtolmax = 1.0e-8;
double *reshistory = (double *)malloc(sizeof(double)*max_niter);
int iovlflag = 0;
int precon_thblock = -1;
int independ_nvec = -1;
int nBlock = 2000;

parcel_dcg_dia_(
&icomm_fort ,
vecx,
vecb,
&n,
&gn,
&istart,
diaA ,
offset,
&nnd,
&ipreflag ,
&ilu_method ,
&iflagas ,
&max_niter ,
&rtolmax ,
reshistory,
&iovlflag ,
&precon_thblock ,
&independ_nvec ,
&nBlock,
&iret
);
free(vecx);
free(vecb);
free(reshistory);
MPI_Finalize();

8.3 DDM Format (C)

A C sample code in DDM format is shown below. Here, make_matrix_ddm denotes an arbitrary routine, which generates a matrix in DDM format in PARCEL. set_rank_xyz, set_grids_decompostion, and set_neighbor_ranks are arbitrary routines, which compute neighbor processes in PARCEL format.

int main( int argc, char *argv[]){
int ityp_eq;
int itrmax;
int max_niter;
int nx,ny,nz;
int nx0,ny0,nz0;
int n,gn;
int istart,iend;

int npes,myrank;
int ierr;
int i;
int iret;

int myranky,myrankz;
int npey,npez;
MPI_Fint icomm_fort;

int gnx = 100;
int gny = 100;
int gnz = 100;
int max_niter = 100;
int rtolmax = 1.0e-8;
int ipreflag = 0;
int independ_nvec = -1;
int iflagas = 1;
int ilu_method = 1;
int iovlflag = 0;
int precon_thblock = -1;
int nBlock = 2000;

int npe_x = 1;
int npe_y = 4;
int npe_z = 1;
int div_direc_th = 1;

MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD, &npes);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
icomm_fort=MPI_Comm_c2f(MPI_COMM_WORLD);

int nnd = 7;

int rank_x,rank_y,rank_z;
set_rank_xyz(
&myrank,
&npe_x,
&npe_y,
&npe_z,
&rank_x,
&rank_y,
&rank_z);
int ndim=3 ;
int margin = 1;

int *npe_dim = (int *)malloc(sizeof(int)*ndim);
int *gn_grids = (int *)malloc(sizeof(int)*ndim);
int *n_grids = (int *)malloc(sizeof(int)*ndim);
int *istart_grids = (int *)malloc(sizeof(int)*ndim);
for(i=0;i < ndim;i++){
npe_dim[i] = 0;
gn_grids[i] = 0;
n_grids[i] = 0;
istart_grids[i] = 0;
}
int m=-1;

gn_grids  = gnx;
gn_grids  = gny;
gn_grids  = gnz;
set_grids_decompostion(
&npe_x, &npe_y, &npe_z,
&rank_x, &rank_y, &rank_z,
&ndim, gn_grids, npe_dim, n_grids,
istart_grids, &margin, &m );

int num_neighbor_ranks_max = pow(3,ndim) -1;
int *neighbor_ranks = (int *)malloc(sizeof(int)*num_neighbor_ranks_max);

int num_neighbor_ranks=-1;

for(i=0;i < num_neighbor_ranks_max;i++){
neighbor_ranks[i] = -1;
}

set_neighbor_ranks(
&npe_x,
&npe_y,
&npe_z,
&rank_x,
&rank_y,
&rank_z,
&num_neighbor_ranks,
neighbor_ranks,
&num_neighbor_ranks_max);

int *offset = (int *)malloc(sizeof(int)*nnd);
for(i=0;i < nnd;i++){
offset[i] = 0;
}
int *offset_dim = (int *)malloc(sizeof(int)*nnd*ndim);
for(i=0;i < nnd*ndim;i++){
offset_dim[i] = 0;
}
double *diaA  = (double *)malloc(sizeof(double)*m*nnd);
for(i=0;i < m*nnd;i++){
diaA[i] = 0.0;
}
double *vecx  = (double *)malloc(sizeof(double)*m);
double *vecb  = (double *)malloc(sizeof(double)*m);
for(i=0;i < m;i++){
vecx[i] = 0.0;
vecb[i] = 0.0;
}
double *reshistory_ddm  = (double *)malloc(sizeof(double)*max_niter);
for(i=0;i < max_niter;i++){
reshistory_ddm[i] = 0.0;
}

int nnd_ = nnd;
make_matrix_ddm(
&gnx,
&gny,
&gnz,
&ndim,
n_grids,
istart_grids,
&margin,
&nnd_,
&m,
offset_dim,
offset,
diaA,
vecb,
vecx,
vecx0,
&rank_x,
&rank_y,
&rank_z,
&npe_x,
&npe_y,
&npe_z);
parcel_dcg_ddm_(
&icomm_fort,
vecx,
vecb,
&m,
&nnd,
offset,
diaA,
&num_neighbor_ranks,
neighbor_ranks,
&margin,
&ndim,
npe_dim,
n_grids,
offset_dim,
&div_direc_th,
&ipreflag,
&iflagas,
&max_niter,
&rtolmax,
reshistory_ddm,
&iovlflag,
&precon_thblock,
&independ_nvec ,
&nBlock ,
&iret
);

free(offset);
free(offset_dim);
free(diaA);
free(vecx);
free(vecb);
free(reshistory_ddm);
free(neighbor_ranks);
free(npe_dim);
free(gn_grids);
free(n_grids);
free(istart_grids);

MPI_Barrier(MPI_COMM_WORLD);
MPI_Finalize();

return 0;
}

R. Barret, M. Berry, T. F. Chan, et al., “Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods”, SIAM（1994）
M. Hoemmen, “Communication-avoiding Krylov subspace methods”. Ph.D.thesis, University of California, Berkeley（2010）
Y. Idomura, T. Ina, A. Mayumi, et al., “Application of a communication-avoiding generalized minimal residual method to a gyrokinetic five dimensional eulerian code on many core platforms”,ScalA 17:8th Workshop on Latest Advances in Scalable Algorithms for Large Scale Systems, pp.1-8, (2017).
R. Suda, L. Cong, D. Watanabe, et al., “Communication-Avoiding CG Method: New Direction of Krylov Subspace Methods towards Exa-scale Computing”, RIMS Kokyuroku ,pp. 102-111, (2016).
Y. Idomura, T. Ina, S. Yamashita, et al., “Communication avoiding multigrid preconditioned conjugate gradient method for extreme scale multiphase CFD simulations”. ScalA 18:9th Workshop on Latest Advances in Scalable Algorithms for Large Scale Systems,pp. 17-24.(2018)
 T. Ina, Y. Idomura, T. Imamura, et al., “Iterative methods with mixed-precision preconditioning for ill-conditioned linear systems in multiphase CFD simulations”, ScalA21:12th Workshop on Latest Advances in Scalable Algorithms for Large Scale Systems .(2021)
 A. Stathopoulos, K. Wu, “A block orthogonalization procedure with constant synchronization requirements”. SIAM J. Sci. Comput. 23, 2165–2182 (2002)
 T. Fukaya, Y. Nakatsukasa, Y. Yanagisawa, et al., “CholeskyQR2: A Simple and Communication-Avoiding Algorithm for Computing a Tall-Skinny QR Factorization on a Large-Scale Parallel System,” ScalA 14:5th Workshop on Latest Advances in Scalable Algorithms for Large Scale Systems, pp. 31-38,(2014)