This chapter describes the iterative methods contained in this library. The following notation is used: \(Ax = b\) denotes the linear system to be solved, where \(A\) is an n\(\times\)n sparse matrix and \(x_0\) is the initial guess vector. \(K\) denotes an appropriate preconditioner.
The Conjugate Gradient (CG) method is an iterative Krylov subspace algorithm for solving linear systems with symmetric matrices.
The Biconjugate Gradient Stabilized (Bi-CGSTAB) method is an iterative Krylov subspace algorithm for solving linear systems with nonsymmetric matrices.
The Generalized Minimum Residual (GMRES(m)) method is an iterative Krylov subspace algorithm for solving linear systems with nonsymmetric matrices. If GMRES(m) does not converge within \(m\) iterations, it is restarted, and the solution obtained after \(m\) iterations is used as the initial guess for the next restart cycle. PARCEL provides variants of GMRES with the Classical Gram-Schmidt (sometimes referred to as standard Gram-Schmidt) and Modified Gram-Schmidt (MGS) orthogonalization methods.
\(\mathbf{H}_{n}\) is an upper triangular matrix whose elements are denoted by \(h_{j,k}\), and \(\mathbf{e}_{i}\) denotes the vector consisting of the first \(i\) elements of the vector \(\mathbf{e}\).
The Communication Avoiding Generalized Minimum Residual (CA-GMRES(s,t)) method is an iterative Krylov subspace algorithm for solving linear systems with nonsymmetric matrices. One iteration of CA-GMRES(s,t) is equivalent to \(s\) iterations of the GMRES(m) method. If CA-GMRES(s,t) does not converge within \(t\) outer iterations, it is restarted, and the solution obtained at that point is used as the initial guess for the next restart cycle. CA-GMRES(s,t) is equivalent to GMRES(m) when \(s\) and \(t\) are chosen so that \(m = t \times s\), and in exact arithmetic its convergence property is the same as that of GMRES(m). In addition, the number of global collective communication calls is reduced by a communication avoiding QR factorization. Since CA-GMRES(s,t) generates \(s\) basis vectors at once, the linear independence of the basis vectors may deteriorate because of round-off errors when \(s\) is too large, leading to worse convergence than GMRES(m). In order to improve the orthogonality of the basis vectors, the PARCEL implementation of CA-GMRES(s,t) provides an option to use the Newton basis in addition to the monomial basis.
\(\mathbf{e}\) denotes a unit vector, \(\widetilde{\rho}_{k}\) denotes the \((s,s)\) element of the matrix \(\mathbf{R}_{k}\), and \(\mathbf{E}\) denotes the eigenvalues obtained from the Hessenberg matrix generated by the iteration process.
The elements of the matrix \(\mathbf{B}\) are denoted by \(b_{k,i}\).
The elements of the matrix \(\mathcal{H}\) are denoted by \(h_{k,i}\).
The Chebyshev basis Conjugate Gradient (CBCG) method is an iterative Krylov subspace algorithm for solving linear systems with symmetric matrices. The CBCG method computes \(k\) CG iterations at once, thereby reducing the number of global collective communication calls. In order to use the Chebyshev basis, the largest and smallest eigenvalues are needed. PARCEL provides two methods to compute them: the power method and the communication avoiding Arnoldi method.
\(\lambda_{\max}\) and \(\lambda_{\min}\) represent the largest and smallest eigenvalues of \(\mathbf{AK}^{-1}\), respectively.
The Preconditioned Communication Avoiding Arnoldi method (CA-Arnoldi(s,t)) is an eigenvalue solver for nonsymmetric matrices. CA-Arnoldi(s,t) computes \(s\) iterations of the Arnoldi method in a single iteration, and by repeating this \(t\) times, the eigenvalues and eigenvectors of the resulting \(t \times s\) dimensional Hessenberg matrix are computed. By applying communication avoiding QR factorization algorithms, the number of collective communication calls is reduced compared to the Arnoldi method.
\(\mathbf{e}\) denotes a unit vector, \(\widetilde{\rho}_{k}\) denotes the \((s,s)\) element of the matrix \(\mathbf{R}_{k}\), and \(\mathbf{E}\) denotes the eigenvalues obtained from the Hessenberg matrix generated by the iteration process.
In iterative algorithms, applying a preconditioner of the form \(K \approx A\) can reduce the number of iterations until convergence. The notation [solve \(\widehat{p}\) from \(K\widehat{p} = p\)] means [approximately solve \(A\widehat{p} = p\) with respect to the vector \(\widehat{p}\)]. A more accurate approximation usually leads to fewer iterations until convergence, but the cost of applying the preconditioner normally increases. In sequential computing, one of the most common preconditioners is the incomplete LU factorization (ILU). In parallel computing, however, parallel preconditioners are needed. PARCEL provides the following parallel preconditioners: Point Jacobi, Block Jacobi, Additive Schwarz, and Neumann series.
The point Jacobi preconditioner is one of the simplest preconditioners: \(K\) consists only of the diagonal elements of the matrix \(A\). Compared to other preconditioners, its effectiveness is limited. However, it is easily applied in parallel and does not require any additional communication between processors.
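As a minimal illustration (not part of the PARCEL interface), applying the point Jacobi preconditioner amounts to a diagonal scaling, i.e. solving \(K\widehat{p}=p\) with \(K=\mathrm{diag}(A)\):

! minimal sketch: solve K*phat = p with K = diag(A)
! (illustrative names only; not part of the PARCEL interface)
subroutine point_jacobi_apply(n, diagA, p, phat)
  implicit none
  integer, intent(in) :: n
  double precision, intent(in)  :: diagA(n), p(n)   ! diagA(i) = A(i,i)
  double precision, intent(out) :: phat(n)
  integer :: i
  do i = 1, n
    phat(i) = p(i) / diagA(i)
  end do
end subroutine point_jacobi_apply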
LU factorization decomposes a square matrix \(A\) into a lower triangular matrix \(L\) and an upper triangular matrix \(U\). For a typical sparse matrix, the LU factors can be much less sparse than the original matrix; this phenomenon is known as fill-in, and it increases the memory and computational requirements. In order to avoid this issue, PARCEL provides the zero fill-in incomplete LU factorization (ILU(0)).
The diagonal incomplete LU factorization (D-ILU) constructs a diagonal matrix \(D\) for the input matrix \(A\), which is split into its strictly lower triangular part \(L_{A}\), its diagonal part \(D_{A}\), and its strictly upper triangular part \(U_{A}\): \[A = L_{A} + D_{A} + U_{A}\] The preconditioner \(K\) is constructed from \(L_{A}\), \(U_{A}\), and \(D\) (\(D \neq D_{A}\)) as \[K = \left( D + L_{A} \right)D^{- 1}\left( D + U_{A} \right)\] Two different conditions can be used to construct the diagonal matrix \(D\):
Condition 1: The diagonal elements of \(K\) equal those of \(A\): \[A_{{ii}} = K_{{ii}}\ \ (i = 1,\ldots,n)\]
Condition 2: The row sum of \(K\) equals the row sum of \(A\): \[\sum_{j}^{}A_{{ij}} = \sum_{j}^{}K_{{ij}}\ \ (i = 1,\ldots,n)\]
Given the conditions above, \(D\) can be computed by the following recurrences (an illustrative sketch follows the formulas):
To fulfill condition 1, the following recurrence is used: \[D_{{ii}} = A_{{ii}} - \sum_{j < i}^{}{A_{{ij}}D_{jj}^{-1}A_{{ji}}}\ \ (i = 1,\ldots,n)\]
To fulfill condition 2, the following recurrence is used: \[D_{{ii}} = A_{{ii}} - \sum_{j < i}^{}{\sum_{k > j}^{}{A_{{ij}}D_{{jj}}^{-1}A_{jk}}}\ \ (i = 1,\ldots,n)\]
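For illustration only, the following Fortran sketch computes the D-ILU diagonal for condition 1 on a small dense matrix; a real implementation would visit only the non-zero pattern of \(A\), and all names here are hypothetical.

! illustrative dense construction of the D-ILU diagonal for condition 1:
!   D(i) = A(i,i) - sum_{j<i} A(i,j) * A(j,i) / D(j)
subroutine dilu_diag_condition1(n, A, D)
  implicit none
  integer, intent(in) :: n
  double precision, intent(in)  :: A(n,n)
  double precision, intent(out) :: D(n)
  integer :: i, j
  do i = 1, n
    D(i) = A(i,i)
    do j = 1, i-1
      D(i) = D(i) - A(i,j) * A(j,i) / D(j)
    end do
  end do
end subroutine dilu_diag_condition1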
The Block Jacobi preconditioner constructs \(K\) from the diagonal blocks of the matrix \(A\). For each block, an incomplete LU factorization is computed. When the preconditioner is applied, each block can be processed independently, resulting in a high level of parallelism. In addition, no communication is needed when each block is defined within the sub-matrix stored on each processor.
The Additive Schwarz preconditioner constructs \(K\) from overlapping diagonal blocks of \(A\). For each block, the incomplete LU factorization is computed. Compared to the Block Jacobi preconditioner, the Additive Schwarz preconditioner may require additional communication. PARCEL provides four different overlapping methods: BASIC, RESTRICT, INTERPOLATE, and NONE.
Fig.1 Additive Schwarz preconditioner with overlapping diagonal blocks
Solve \(Ks=r\) with an extended \(r\) whose overlapping region is filled with data from the neighboring process; \(s\) in the overlapping region is determined by summing up the results of the neighboring regions.
Fig.2 BASIC
Solve \(Ks=r\) with an extended \(r\) whose overlapping region is filled with data from the neighboring process; \(s\) in the overlapping region is determined without taking account of the results of the neighboring regions.
Fig.3 RESTRICT
Solve \(Ks=r\) with an extended \(r\) whose overlapping region is not filled with data from the neighboring process; \(s\) in the overlapping region is determined by summing up the results of the neighboring regions.
Fig.4 INTERPOLATE
Solve \(Ks=r\) with an extended \(r\) whose overlapping region is not filled with data from the neighboring process; \(s\) in the overlapping region is determined without taking account of the results of the neighboring regions.
Fig.5 NONE
The fine-block preconditioner generates a preconditioning matrix so that SIMD operations can be applied to the incomplete LU factorization. Within each block of the Block Jacobi or Additive Schwarz preconditioner, fine diagonal blocks are defined. By ignoring the off-diagonal elements only within these fine diagonal blocks, data dependencies are eliminated and SIMD operations are enabled. In Fig.6, 3 \(\times\) 3 fine blocks are defined within the 9 \(\times\) 9 blocks of the Block Jacobi preconditioner; data dependencies are eliminated by ignoring the off-diagonal elements only within the fine blocks (shown in white), and three vector elements are processed by SIMD operations.
Block Jacobi Preconditioner
Subdividing Block Jacobi Preconditioner
Fig.6 Subdividing preconditioning
The diagonal scaling of the matrix \(A\) yields \(\bar{A}\). In the Neumann series preconditioner, the preconditioning matrix \(K\) is defined by an approximate inverse of \(\bar{A}\) based on the Neumann series
\[ \bar{A}^{-1}\simeq K^{-1}=I+F+F^2+F^3+\cdots \]
where \(\bar{A}=I-F\). This algorithm can be implemented using only sparse matrix-vector products (SpMV) and therefore has high parallel efficiency. In addition, by combining it with the Block Jacobi preconditioner, the block Neumann series preconditioner is constructed, which avoids inter-process communication in the SpMV. These preconditioners are available only in the GPU version.
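For illustration only, the following Fortran sketch evaluates the truncated series \(K^{-1}r \approx (I+F+\cdots+F^{p})r\) with \(F=I-\bar{A}\) for a small dense matrix; the names are hypothetical, and the PARCEL implementation applies the same recurrence with SpMV on the sparse matrix instead of dense products.

! illustrative dense evaluation of z = (I + F + ... + F**p) * r with F = I - Abar
subroutine neumann_apply(n, p, Abar, r, z)
  implicit none
  integer, intent(in) :: n, p
  double precision, intent(in)  :: Abar(n,n), r(n)
  double precision, intent(out) :: z(n)
  double precision :: t(n)
  integer :: k
  t = r                          ! t = F**0 * r
  z = r
  do k = 1, p
    t = t - matmul(Abar, t)      ! t = F*t = (I - Abar)*t
    z = z + t                    ! accumulate F**k * r
  end do
end subroutine neumann_apply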
This section describes the QR decomposition algorithms that can be used in this library.
The QR factorization by the classical Gram-Schmidt orthogonalization. This algorithm has high parallelism but poor orthogonality.
The QR factorization by the modified Gram-Schmidt method, which reduces the rounding error of the classical Gram-Schmidt method. This algorithm improves the orthogonality compared with the classical Gram-Schmidt method, but requires more collective communication.
TSQR is based on the Householder QR factorization shown below, and has good orthogonality. The sequential TSQR is used between threads, while the parallel TSQR is used between processes.
Figures: Sequential TSQR and Parallel TSQR
The Cholesky QR factorization consists of a matrix-matrix product and a Cholesky factorization. This algorithm has high computational intensity and requires only one collective communication, but its orthogonality is poor.
The Cholesky QR2 factorization applies a second Cholesky QR factorization to the orthogonal matrix obtained by the first Cholesky QR factorization. This algorithm improves the orthogonality by executing the Cholesky QR factorization twice.
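As a brief reminder (standard formulation, not specific to the PARCEL implementation), for a tall-skinny matrix \(V\) the two factorizations can be written as
\[
\begin{aligned}
&\text{Cholesky QR:}  && W = V^{T}V,\quad W = R^{T}R\ \text{(Cholesky)},\quad Q = V R^{-1},\\
&\text{Cholesky QR2:} && W_{2} = Q^{T}Q,\quad W_{2} = R_{2}^{T}R_{2},\quad Q_{2} = Q R_{2}^{-1},\quad R \leftarrow R_{2}R.
\end{aligned}
\]
Only the Gram matrix \(W\) requires a global reduction, which is why each pass needs a single collective communication.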
In order to save memory in sparse matrix computation, only the non-zero elements and their locations are stored. PARCEL supports different sparse matrix formats.
The CRS format compresses the row indices of the non-zero elements of the sparse matrix. Together with the compressed row pointers, the column index and the value of each non-zero element are stored in one-dimensional arrays. The compressed row indices are stored in increasing order. A sketch of the matrix-vector product in this format is given after the examples below.
\[ A = \left( \begin{array}{cccc} a & b & c & 0\\ 0 & d & 0 & 0\\ e & 0 & f & g\\ 0 & h & i & j \end{array} \right) \]
Values of non-zero elements :: \(crsA=\{a,b,c,d,e,f,g,h,i,j\}\)
Compressed row indices :: \(crsRow\_ptr=\{1,4,5,8,11\}\)
Column indices of non-zero elements :: \(crsCol=\{1,2,3,2,1,3,4,2,3,4\}\)
In the case of a two-process decomposition, the first process holds rows 1–2:

Values of non-zero elements :: \(crsA=\{a,b,c,d\}\)
Compressed row indices :: \(crsRow\_ptr=\{1,4,5\}\)
Column indices of non-zero elements :: \(crsCol=\{1,2,3,2\}\)
The second process holds rows 3–4:

Values of non-zero elements :: \(crsA=\{e,f,g,h,i,j\}\)
Compressed row indices :: \(crsRow\_ptr=\{1,4,7\}\)
Column indices of non-zero elements :: \(crsCol=\{1,3,4,2,3,4\}\)
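To illustrate how the CRS arrays are interpreted, the following serial Fortran sketch computes \(y=Ax\) from the three arrays (1-based indices as in the examples above; an illustration, not a PARCEL routine):

! serial sketch of y = A*x from the CRS arrays of the examples above
subroutine crs_matvec(n, nnz, crsA, crsRow_ptr, crsCol, x, y)
  implicit none
  integer, intent(in) :: n, nnz
  integer, intent(in) :: crsRow_ptr(n+1), crsCol(nnz)
  double precision, intent(in)  :: crsA(nnz), x(n)
  double precision, intent(out) :: y(n)
  integer :: i, k
  do i = 1, n
    y(i) = 0.0d0
    do k = crsRow_ptr(i), crsRow_ptr(i+1) - 1   ! non-zeros of row i
      y(i) = y(i) + crsA(k) * x(crsCol(k))
    end do
  end do
end subroutine crs_matvec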
The DIA format stores the non-zero elements of a matrix by diagonals. This format provides high performance for banded matrices. In addition to the element values, an offset for each stored diagonal is kept in a one-dimensional array. A negative offset indicates that the diagonal is below the main diagonal of the matrix, a positive offset indicates that the diagonal is above it, and the main diagonal has offset zero. The offsets are stored in increasing order. A sketch of the matrix-vector product in this format is given after the examples below.
\[ A = \left( \begin{array}{cccc} a & b & c & 0\\ 0 & d & 0 & 0\\ e & 0 & f & g\\ 0 & h & i & j \end{array} \right) \]
Elements :: \(diaA=\{0,0,e,h,0,0,0,i,a,d,f,j,b,0,g,0,c,0,0,0\}\)
Offsets :: \(offset=\{-2,-1,0,1,2\}\)
Number of diagonals :: \(nnd=5\)
In the case of a two-process decomposition, the first process holds rows 1–2:

Elements :: \(diaA=\{a,d,b,0,c,0\}\)
Offsets :: \(offset=\{0,1,2\}\)
Number of diagonals :: \(nnd=3\)
The second process holds rows 3–4:

Elements :: \(diaA=\{e,h,0,i,f,j,g,0\}\)
Offsets :: \(offset=\{-2,-1,0,1\}\)
Number of diagonals :: \(nnd=4\)
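Similarly, the following serial Fortran sketch computes \(y=Ax\) from the DIA arrays, assuming the layout of the examples above in which diagonal \(d\) occupies \(diaA((d-1)n+1:dn)\) with zero padding outside the matrix (an illustration, not a PARCEL routine):

! serial sketch of y = A*x in the DIA format
! diaA((d-1)*n+i) holds A(i, i+offset(d)); padded entries are zero
subroutine dia_matvec(n, nnd, diaA, offset, x, y)
  implicit none
  integer, intent(in) :: n, nnd
  integer, intent(in) :: offset(nnd)
  double precision, intent(in)  :: diaA(n*nnd), x(n)
  double precision, intent(out) :: y(n)
  integer :: i, d, j
  y = 0.0d0
  do d = 1, nnd
    do i = 1, n
      j = i + offset(d)          ! column addressed by diagonal d in row i
      if (j >= 1 .and. j <= n) y(i) = y(i) + diaA((d-1)*n + i) * x(j)
    end do
  end do
end subroutine dia_matvec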
The DDM format extends the DIA format for simulations based on stencil computations, such as finite difference with domain decomposition. By specifying the list of processes that exchange data with each other so that it matches the domain decomposition of the simulation, the communication needed for parallel processing of matrices arising from one-, two-, and three-dimensional structured grids can be minimized. The related parameter settings are described below.
The number of neighboring processes for \(m\)-dimensional domain decomposition is given by \(3^{m}-1\). For example, in the two-dimensional case the number of neighboring processes is eight, while in the three-dimensional case it becomes \(26\).
The list of neighboring processes is specified by their ranks in the (one-dimensional) MPI communicator. At the boundaries of the computational domain, negative values are set in neighbor_ranks for the directions in which no neighboring process exists. The sample code below can be used as the standard way of setting neighbor_ranks. The parameters npe_x, npe_y, and npe_z represent the number of processes in each direction (x, y, z), and rank_x, rank_y, and rank_z represent the process coordinates in each direction. The definition of the one-dimensional rank iirank must be consistent with that used in the simulation.
ixmax = 0; iymax = 0; izmax = 0
num = 0
neighbor_ranks = -1
if ( npe_x > 1 ) ixmax = 1
if ( npe_y > 1 ) iymax = 1
if ( npe_z > 1 ) izmax = 1
! loop over the up to 3**m-1 neighboring positions
do iz = -izmax, izmax
  do iy = -iymax, iymax
    do ix = -ixmax, ixmax
      n1x = rank_x + ix; n1y = rank_y + iy; n1z = rank_z + iz
      if ( ix*ix + iy*iy + iz*iz == 0 ) cycle   ! skip the process itself
      num = num + 1
      if ( n1z >= 0 .and. n1z < npe_z ) then
        if ( n1y >= 0 .and. n1y < npe_y ) then
          if ( n1x >= 0 .and. n1x < npe_x ) then
            ! one-dimensional rank consistent with the simulation
            iirank = npe_x*npe_y*n1z + npe_x*n1y + n1x
            neighbor_ranks(num) = iirank
          endif
        endif
      endif
    end do
  end do
end do
Grid positions referenced in stencil computations such as finite difference are specified in each direction. For example, for the seven-point stencil arising from the three-point finite difference of the three-dimensional Poisson equation, the grid offsets from the base grid point to its east, west, south, north, top, and bottom neighbors can be set as shown in the table below, followed by the corresponding assignments.
 | i | ioff_grids(*,1) | ioff_grids(*,2) | ioff_grids(*,3) |
---|---|---|---|---|
base | 1 | 0 | 0 | 0 |
east | 2 | 1 | 0 | 0 |
west | 3 | -1 | 0 | 0 |
south | 4 | 0 | 1 | 0 |
north | 5 | 0 | -1 | 0 |
top | 6 | 0 | 0 | 1 |
bottom | 7 | 0 | 0 | -1 |
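For reference, the table above corresponds to assignments of the following form (illustrative Fortran; it assumes ioff_grids is declared as ioff_grids(nnd,ncomp_grids) with nnd = 7 and ncomp_grids = 3, as in the parameter tables of the DDM routines):

! base grid point and the six finite-difference neighbors (offsets in x, y, z)
ioff_grids(1,1) =  0; ioff_grids(1,2) =  0; ioff_grids(1,3) =  0   ! base
ioff_grids(2,1) =  1; ioff_grids(2,2) =  0; ioff_grids(2,3) =  0   ! east
ioff_grids(3,1) = -1; ioff_grids(3,2) =  0; ioff_grids(3,3) =  0   ! west
ioff_grids(4,1) =  0; ioff_grids(4,2) =  1; ioff_grids(4,3) =  0   ! south
ioff_grids(5,1) =  0; ioff_grids(5,2) = -1; ioff_grids(5,3) =  0   ! north
ioff_grids(6,1) =  0; ioff_grids(6,2) =  0; ioff_grids(6,3) =  1   ! top
ioff_grids(7,1) =  0; ioff_grids(7,2) =  0; ioff_grids(7,3) = -1   ! bottom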
\[ A = \left( \begin{array}{cccc} a & b & 0 & 0\\ c & d & e & 0\\ 0 & f & g & h\\ 0 & 0 & i & j \end{array} \right) \]
Elements :: \(val\_dia=\{a,d,g,j,b,e,h,0,0,c,f,i\}\)
Offsets :: \(ioff\_dia=\{0,0,-1\}\)
Number of diagonals :: \(nnd=3\)
Stencils ::
 | i | ioff_grids(*,1) |
---|---|---|
reference | 1 | 0 |
east | 2 | 1 |
west | 3 | -1 |
In the case of a two-process decomposition, the first process holds rows 1–2:

Elements :: \(val\_dia=\{a,d,b,e,0,c\}\)
Offsets :: \(ioff\_dia=\{0,0,-1\}\)
Number of diagonals :: \(nnd=3\)
Stencils ::
 | i | ioff_grids(*,1) |
---|---|---|
reference | 1 | 0 |
east | 2 | 1 |
west | 3 | -1 |
The second process holds rows 3–4:

Elements :: \(val\_dia=\{g,j,h,0,f,i\}\)
Offsets :: \(ioff\_dia=\{0,0,0\}\)
Number of diagonals :: \(nnd=3\)
Stencils ::
 | i | ioff_grids(*,1) |
---|---|---|
reference | 1 | 0 |
east | 2 | 1 |
west | 3 | -1 |
This chapter explains how to compile and install the PARCEL library.
PARCEL requires a C compiler, a Fortran compiler that supports a preprocessor, MPI, LAPACK, and OpenMP. When compiling with gfortran, the option -cpp has to be specified to activate the preprocessor. Compiler options can be changed in the file arch/make_config. An example of a make_config file is shown below.
To compile, execute the make command in the top directory of the extracted archive. Then execute "make install" to create the example, include, and lib directories in the directory specified by INSTALLDIR. If parcel_sparse.a has been created in the lib directory, the compilation was successful. Large-scale problems with 2 billion or more dimensions require 64-bit integers, so in that case the library must be compiled with 64-bit integer types. To compile the GPU version of PARCEL, set CUDAFLAG to "ON". By setting CUDAAWARE to "ON", a library that supports CUDA-aware MPI is created. The cuBLAS and cuSPARSE libraries need to be linked.
CUDAFLAG = ON
CUDAAWARE = ON
CC = mpiicc
FC = mpiifort
CFLAGS = -O3
FFLAGS = -O3 -fpp -qopenmp -xHost -mcmodel=large
FP_MODE = -fp-model precise
INCLUDEDIR =
LD_MPI = -lmpifort
LD_FORT = -lifcore
LD_LAPACK = -mkl
LD_CUDA = -lcudart -lcublas -lcusparse
NVCFLAGS = \
  -arch=sm_80 -ccbin=${CC} -O3 -restrict \
  --expt-extended-lambda -use_fast_math \
  --default-stream per-thread \
  --nvlink-options -Werror \
  --maxrregcount=128 \
  --extra-device-vectorization \
  --restrict \
  "-std=c++11"
#FDEFINE = -DPARCEL_INT8
INSTALLDIR = ../PARCEL_1.2
Option | Explanation |
---|---|
CC | C compiler command name |
FC | Fortran compiler command name |
CFLAGS | C compiler options |
FFLAGS | Fortran compiler options |
FP_MODE | Option to suppress optimizations that change the order of arithmetic operations (needed only for quadruple precision) |
CUDAFLAG | CUDA compile flag (ON: enabled, OFF: disabled) |
CUDAAWARE | CUDA-Aware MPI compile flag (ON: enabled, OFF: disabled) |
NVCFLAGS | CUDA compiler options |
INCLUDEDIR | Include directories |
LD_MPI | MPI library |
LD_LAPACK | LAPACK library |
LD_CUDA | CUDA libraries (cuBLAS, cuSPARSE) |
FDEFINE | Set -DPARCEL_INT8 to enable 64-bit integers |
INSTALLDIR | PARCEL installation directory |
All the routines in the PARCEL library can be called by linking parcel_sparse.a, which is generated in the lib directory.
The interface of each routine in the PARCEL library is described below. The following conventions are commonly used.
Routine names starting with ‘parcel_d’ and ‘parcel_dd’ denote PARCEL routines implemented in double precision and quadruple (double-double) precision, respectively.
The GPU version allocates memory on the GPU via Fortran type(c_ptr). The subroutine PARCEL_Init must be called before any Krylov subspace solver in order to specify the GPU used by each process. The input matrix \(A\), the right-hand-side vector \(b\), and the initial vector \(x\) must also be allocated in GPU memory and passed as Fortran type(c_ptr). Sparse matrix-vector products in the CRS format are implemented using cuSPARSE (cusparseSpMV). The GPU version does not support the fine-block preconditioner, but the Neumann series preconditioner is available.
The communication-computation overlap is implemented in the following three modes. iovlflag=0 does not use any communication-computation overlap. iovlflag=1 overlaps communication and computation by launching all non-blocking communications at once before starting the computation. iovlflag=2 overlaps communication and computation by processing the non-blocking communication and computation in each direction sequentially with barrier synchronization, which reuses and thus saves communication buffers. The communication-computation overlap is not supported in the GPU version.
When precon_thblock is zero or negative, or greater than the number of threads per process, precon_thblock is set to the number of threads per process, which is the default value. One can reduce the number of blocks to enlarge the block size and improve the accuracy of the preconditioning, but this also degrades the thread parallel performance. In the GPU version, one block per process is processed by the cuSPARSE ILU routines (cusparseDcsrsv2_solve).
When independ_nvec is zero or negative, the fine-block preconditioner is not applied. On A64FX, independ_nvec=300 is recommended to facilitate optimizations such as SIMD operations and software pipelining by the Fujitsu compiler. A larger independ_nvec improves the computational performance, while the accuracy of the preconditioning becomes worse. When independ_nvec is greater than the size of the vector on each process, n, the fine-block preconditioner becomes equivalent to the point Jacobi preconditioner.
Thread parallelization of SpMV employs a cyclic loop division to facilitate the reuse of on-cache data between different threads. nBlock is the chunk size of this loop division. When nBlock is zero or negative, the default value nBlock=2000 is used.
When the i-th eigenvalue is real, the corresponding eigenvector is stored in Evec(n*(i-1)+1:n*i). When the i-th eigenvalue is complex and the (i+1)-th eigenvalue is its complex conjugate, the i-th and (i+1)-th eigenvectors are stored as shown in the following table (see also the sketch after the table). The CA-Arnoldi method is not supported in the GPU version.
Eigenvector | Real Part | Imaginary Part |
---|---|---|
i-th | Evec(n*(i-1)+1:n*i) | Evec(n*i+1:n*(i+1)) |
(i+1)-th | Evec(n*(i-1)+1:n*i) | -Evec(n*i+1:n*(i+1)) |
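For illustration, a complex conjugate eigenvector pair can be reconstructed from Evec as follows (a hypothetical sketch following the table above; this routine is not part of the PARCEL interface):

! reconstruct the i-th and (i+1)-th (complex conjugate) eigenvectors
! from the packed storage described in the table above
subroutine get_complex_evec_pair(n, i, Evec, zvec_i, zvec_ip1)
  implicit none
  integer, intent(in) :: n, i
  double precision, intent(in) :: Evec(*)
  complex(kind(0d0)), intent(out) :: zvec_i(n), zvec_ip1(n)
  zvec_i   = cmplx( Evec(n*(i-1)+1:n*i),  Evec(n*i+1:n*(i+1)), kind(0d0) )
  zvec_ip1 = cmplx( Evec(n*(i-1)+1:n*i), -Evec(n*i+1:n*(i+1)), kind(0d0) )
end subroutine get_complex_evec_pair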
call PARCEL_Init(myrank)
Initializes the PARCEL execution environment.
Parameter(dimension) | Type | Input/Output | Description |
---|---|---|---|
myrank | integer*4 | in | MPI process rank |
call PARCEL_Free()
Terminates the PARCEL execution environment.
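A minimal sketch of the expected calling sequence is shown below; it assumes standard MPI initialization around the documented PARCEL_Init and PARCEL_Free calls.

program parcel_init_example
  use mpi
  implicit none
  integer :: myrank, ierr
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)
  call PARCEL_Init(myrank)   ! must be called before any Krylov solver (selects the GPU in the GPU version)
  ! ... assemble matrices and call the PARCEL solver routines here ...
  call PARCEL_Free()
  call MPI_Finalize(ierr)
end program parcel_init_example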
call parcel_dcg( icomm, vecx,vecb, n,gn,nnz,istart, crsA,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, iret )
A simultaneous linear equation system Ax = b is solved by the conjugate gradient method (CG method). Non-zero elements are stored in CRS format. A usage sketch is given after the parameter table.
Parameter(dimension) | Type | Input/Output | Description |
---|---|---|---|
icomm | integer | in | MPI communicator |
vecx(n) | CPU : double precision / GPU : type(c_ptr) | in/out | in: initial vector, out: solution vector |
vecb(n) | CPU : double precision / GPU : type(c_ptr) | in | Right hand side vector |
n | integer*4 / integer*8 | in | Size of vector on each process |
gn | integer*4 / integer*8 | in | Total size of vector |
nnz | integer*4 / integer*8 | in | Number of non-zero elements on each process |
istart | integer*4 / integer*8 | in | Start line of matrix on each process |
crsA(nnz) | CPU : double precision / GPU : type(c_ptr) | in | Non-zero elements of matrix stored in CRS format |
crsRow_ptr(n+1) | CPU : integer*4 / integer*8 / GPU : type(c_ptr) | in | Pointer table in CRS format |
crsCol(nnz) | CPU : integer*4 / integer*8 / GPU : type(c_ptr) | in | Column numbers of non-zero elements in CRS format |
ipreflag | integer | in | Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz, 4: Neumann series(GPU), 5: block Neumann series(GPU) ) |
ilu_method | integer | in | Flag for incomplete LU factorization(0:ILU(0),1:D-ILU(DIA components match),2:D-ILU(Element sums in a row match),Only 1 and 2 are available for additive Schwartz) |
addL | integer | in | Overlapping width in the additive Schwartz method |
iflagAS | integer | in | Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE) |
itrmax | integer | in/out | in: maximum number of iterations, out: number of iterations |
rtolmax | double precision | in | Convergence criterion (norm of relative residual error) |
reshistory(itrmax) | double precision | out | History of relative residual error |
iovlflag | integer | in | Flag for communication-computation overlap (0: none, 1: all processes, 2: each process) |
precon_thblock | integer | in | Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning |
Neuman_itrmax | integer | in | Order of Neumann series preconditioning |
independ_nvec | integer | in | Size of diagonal blocks to be calculated independently in subdivision preconditioning |
nBlock | integer | in | Number of rows allocated to the thread cyclically. |
iret | integer | out | Error code (0:normal) |
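The following is a minimal usage sketch of parcel_dcg for a distributed tridiagonal (1D Laplace-type) system in CRS format. It is an illustrative sketch only: it assumes the CPU interface with default-kind integers, a 1-based global start row for istart, and global column indices in crsCol, following the CRS examples above; the chosen parameter values simply follow the table.

program parcel_dcg_example
  use mpi
  implicit none
  integer, parameter :: nlocal = 100
  integer :: icomm, myrank, nprocs, ierr
  integer :: n, gn, nnz, istart, i, k, gi
  integer :: ipreflag, ilu_method, addL, iflagAS, itrmax, iovlflag
  integer :: precon_thblock, Neuman_itrmax, independ_nvec, nBlock, iret
  double precision :: rtolmax
  double precision, allocatable :: vecx(:), vecb(:), crsA(:), reshistory(:)
  integer, allocatable :: crsRow_ptr(:), crsCol(:)

  call MPI_Init(ierr)
  icomm = MPI_COMM_WORLD
  call MPI_Comm_rank(icomm, myrank, ierr)
  call MPI_Comm_size(icomm, nprocs, ierr)
  call PARCEL_Init(myrank)

  n      = nlocal
  gn     = nlocal * nprocs
  istart = myrank * nlocal + 1        ! assumed 1-based global start row
  allocate(vecx(n), vecb(n), crsA(3*n), crsCol(3*n), crsRow_ptr(n+1))

  ! assemble the local rows of a global tridiagonal matrix (2 on the
  ! diagonal, -1 on the off-diagonals); crsCol holds global column indices
  k = 0
  do i = 1, n
    crsRow_ptr(i) = k + 1
    gi = istart + i - 1
    if (gi > 1) then
      k = k + 1; crsA(k) = -1.0d0; crsCol(k) = gi - 1
    end if
    k = k + 1; crsA(k) = 2.0d0; crsCol(k) = gi
    if (gi < gn) then
      k = k + 1; crsA(k) = -1.0d0; crsCol(k) = gi + 1
    end if
  end do
  crsRow_ptr(n+1) = k + 1
  nnz = k

  vecx = 0.0d0                        ! initial guess
  vecb = 1.0d0                        ! right-hand side

  ipreflag = 1; ilu_method = 0; addL = 0; iflagAS = 1   ! point Jacobi, no overlap
  itrmax = 1000; rtolmax = 1.0d-8
  allocate(reshistory(itrmax))
  iovlflag = 0; precon_thblock = 0; Neuman_itrmax = 0
  independ_nvec = 0; nBlock = 0

  call parcel_dcg( icomm, vecx, vecb, n, gn, nnz, istart,        &
                   crsA, crsRow_ptr, crsCol,                     &
                   ipreflag, ilu_method, addL, iflagAS,          &
                   itrmax, rtolmax, reshistory, iovlflag,        &
                   precon_thblock, Neuman_itrmax, independ_nvec, &
                   nBlock, iret )
  if (myrank == 0) print *, 'iterations =', itrmax, '  iret =', iret

  call PARCEL_Free()
  call MPI_Finalize(ierr)
end program parcel_dcg_example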
call parcel_dbicgstab( icomm, vecx,vecb, n,gn,nnz,istart, crsA,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, iret )
A simultaneous linear equation system Ax = b is solved by the stabilized biconjugate gradient method (Bi-CGSTAB method). Non-zero elements are stored in CRS format.
Parameter(dimension) | Type | Input/Output | Description |
---|---|---|---|
icomm | integer | in | MPI communicator |
vecx(n) | CPU : double precision / GPU : type(c_ptr) | in/out | in: initial vector, out: solution vector |
vecb(n) | CPU : double precision / GPU : type(c_ptr) | in | Right hand side vector |
n | integer*4 / integer*8 | in | Size of vector on each process |
gn | integer*4 / integer*8 | in | Total size of vector |
nnz | integer*4 / integer*8 | in | Number of non-zero elements on each process |
istart | integer*4 / integer*8 | in | Start line of matrix on each process |
crsA(nnz) | CPU : double precision / GPU : type(c_ptr) | in | Non-zero elements of matrix stored in CRS format |
crsRow_ptr(n+1) | CPU : integer*4 / integer*8 / GPU : type(c_ptr) | in | Pointer table in CRS format |
crsCol(nnz) | CPU : integer*4 / integer*8 / GPU : type(c_ptr) | in | Column numbers of non-zero elements in CRS format |
ipreflag | integer | in | Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz, 4: Neumann series(GPU), 5: block Neumann series(GPU) ) |
ilu_method | integer | in | Flag for incomplete LU factorization(0:ILU(0),1:D-ILU(DIA components match),2:D-ILU(Element sums in a row match),Only 1 and 2 are available for additive Schwartz) |
addL | integer | in | Overlapping width in the additive Schwartz method |
iflagAS | integer | in | Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE) |
itrmax | integer | in/out | in: maximum number of iterations, out: number of iterations |
rtolmax | double precision | in | Convergence criterion (norm of relative residual error) |
reshistory(itrmax) | double precision | out | History of relative residual error |
iovlflag | integer | in | Flag for communication-computation overlap (0: none, 1: all processes, 2: each process) |
precon_thblock | integer | in | Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning |
Neuman_itrmax | integer | in | Order of Neumann series preconditioning |
independ_nvec | integer | in | Size of diagonal blocks to be calculated independently in subdivision preconditioning |
nBlock | integer | in | Number of rows allocated to the thread cyclically. |
iret | integer | out | Error code (0:normal) |
call parcel_dgmres( icomm, vecx,vecb, n,gn,nnz,istart, crsA,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, gmres_m,gmres_GSflag, iret )
A simultaneous linear equation system Ax = b is solved by the generalized minimum residual method (GMRES (m) method). Non-zero elements are stored in CRS format.
Parameter(dimension) | Type | Input/Output | Description |
---|---|---|---|
icomm | integer | in | MPI communicator |
vecx(n) | CPU : double precision / GPU : type(c_ptr) | in/out | in: initial vector, out: solution vector |
vecb(n) | CPU : double precision / GPU : type(c_ptr) | in | Right hand side vector |
n | integer*4 / integer*8 | in | Size of vector on each process |
gn | integer*4 / integer*8 | in | Total size of vector |
nnz | integer*4 / integer*8 | in | Number of non-zero elements on each process |
istart | integer*4 / integer*8 | in | Start line of matrix on each process |
crsA(nnz) | CPU : double precision / GPU : type(c_ptr) | in | Non-zero elements of matrix stored in CRS format |
crsRow_ptr(n+1) | CPU : integer*4 / integer*8 / GPU : type(c_ptr) | in | Pointer table in CRS format |
crsCol(nnz) | CPU : integer*4 / integer*8 / GPU : type(c_ptr) | in | Column numbers of non-zero elements in CRS format |
ipreflag | integer | in | Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz, 4: Neumann series(GPU), 5: block Neumann series(GPU) ) |
ilu_method | integer | in | Flag for incomplete LU factorization(0:ILU(0),1:D-ILU(DIA components match),2:D-ILU(Element sums in a row match),Only 1 and 2 are available for additive Schwartz) |
addL | integer | in | Overlapping width in the additive Schwartz method |
iflagAS | integer | in | Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE) |
itrmax | integer | in/out | in: maximum number of iterations, out: number of iterations |
rtolmax | double precision | in | Convergence criterion (norm of relative residual error) |
reshistory(itrmax) | double precision | out | History of relative residual error |
iovlflag | integer | in | Flag for communication-computation overlap (0: none, 1: all processes, 2: each process) |
precon_thblock | integer | in | Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning |
Neuman_itrmax | integer | in | Order of Neumann series preconditioning |
independ_nvec | integer | in | Size of diagonal blocks to be calculated independently in subdivision preconditioning |
nBlock | integer | in | Number of rows allocated to the thread cyclically. |
gmres_m | integer | in | Number of iterations until restart |
gmres_GSflag | integer | in | Flag for orthogonalization algorithm (1: MGS, 2: CGS) |
iret | integer | out | Error code (0:normal) |
call parcel_dcagmres( icomm, vecx,vecb, n,gn,nnz,istart, crsA,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, cagmres_sstep, cagmres_tstep, cagmres_basis, cagmres_QRflag, iret )
A simultaneous linear equation system Ax = b is solved by the communication reduced generalized minimum residual method (CA-GMRES (s, t) method). Non-zero elements are stored in CRS format.
Parameter(dimension) | Type | Input/Output | Description |
---|---|---|---|
icomm | integer | in | MPI communicator |
vecx(n) | CPU : double precision / GPU : type(c_ptr) | in/out | in: initial vector, out: solution vector |
vecb(n) | CPU : double precision / GPU : type(c_ptr) | in | Right hand side vector |
n | integer*4 / integer*8 | in | Size of vector on each process |
gn | integer*4 / integer*8 | in | Total size of vector |
nnz | integer*4 / integer*8 | in | Number of non-zero elements on each process |
istart | integer*4 / integer*8 | in | Start line of matrix on each process |
crsA(nnz) | CPU : double precision / GPU : type(c_ptr) | in | Non-zero elements of matrix stored in CRS format |
crsRow_ptr(n+1) | CPU : integer*4 / integer*8 / GPU : type(c_ptr) | in | Pointer table in CRS format |
crsCol(nnz) | CPU : integer*4 / integer*8 / GPU : type(c_ptr) | in | Column numbers of non-zero elements in CRS format |
ipreflag | integer | in | Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz, 4: Neumann series(GPU), 5: block Neumann series(GPU) ) |
ilu_method | integer | in | Flag for incomplete LU factorization(0:ILU(0),1:D-ILU(DIA components match),2:D-ILU(Element sums in a row match),Only 1 and 2 are available for additive Schwartz) |
addL | integer | in | Overlapping width in the additive Schwartz method |
iflagAS | integer | in | Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE) |
itrmax | integer | in/out | in: maximum number of iterations, out: number of iterations |
rtolmax | double precision | in | Convergence criterion (norm of relative residual error) |
reshistory(itrmax) | double precision | out | History of relative residual error |
iovlflag | integer | in | Flag for communication-computation overlap (0: none, 1: all processes, 2: each process) |
precon_thblock | integer | in | Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning |
Neuman_itrmax | integer | in | Order of Neumann series preconditioning |
independ_nvec | integer | in | Size of diagonal blocks to be calculated independently in subdivision preconditioning |
nBlock | integer | in | Number of rows allocated to the thread cyclically. |
cagmres_sstep | integer | in | Number of communication avoiding steps s in the CA-GMRES method |
cagmres_tstep | integer | in | Number of outer iterations t in the CA-GMRES method (restart length = st) |
cagmres_basis | integer | in | Flag for basis vector in the CA-GMRES method (0: monomial basis, 1: Newton basis) |
cagmres_QRflag | integer | in | Flag for QR factorization in the CA-GMRES method (1: MGS, 2: CGS, 3: TSQR, 4: CholeskyQR, 5: CholeskyQR2) |
iret | integer | out | Error code (0:normal) |
call parcel_dcbcg( icomm, vecx,vecb, n,gn,nnz,istart, crsA,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, cbcg_kstep,cbcg_Eigenflag,power_method_itrmax, caarnoldi_sstep,caarnoldi_tstep, caarnoldi_basis,caarnoldi_QRflag, iret )
A simultaneous linear equation system Ax = b is solved by the Chebyshev basis conjugate gradient method (CBCG method). Non-zero elements are stored in CRS format.
Parameter(dimension) | Type | Input/Output | Description |
---|---|---|---|
icomm | integer | in | MPI communicator |
vecx(n) | CPU : double precision / GPU : type(c_ptr) | in/out | in: initial vector, out: solution vector |
vecb(n) | CPU : double precision / GPU : type(c_ptr) | in | Right hand side vector |
n | integer*4 / integer*8 | in | Size of vector on each process |
gn | integer*4 / integer*8 | in | Total size of vector |
nnz | integer*4 / integer*8 | in | Number of non-zero elements on each process |
istart | integer*4 / integer*8 | in | Start line of matrix on each process |
crsA(nnz) | double precision | in | Non-zero elements of matrix stored in CRS format |
crsRow_ptr(n+1) | integer*4 / integer*8 | in | Pointer table in CRS format |
crsCol(nnz) | integer*4 / integer*8 | in | Column numbers of non-zero elements in CRS format |
ipreflag | integer | in | Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz, 4: Neumann series(GPU), 5: block Neumann series(GPU) ) |
ilu_method | integer | in | Flag for incomplete LU factorization(0:ILU(0),1:D-ILU(DIA components match),2:D-ILU(Element sums in a row match),Only 1 and 2 are available for additive Schwartz) |
addL | integer | in | Overlapping width in the additive Schwartz method |
iflagAS | integer | in | Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE) |
itrmax | integer | in/out | in: maximum number of iterations, out: number of iterations |
rtolmax | double precision | in | Convergence criterion (norm of relative residual error) |
reshistory(itrmax) | double precision | out | History of relative residual error |
iovlflag | integer | in | Flag for communication-computation overlap (0: none, 1: all processes, 2: each process) |
precon_thblock | integer | in | Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning |
Neuman_itrmax | integer | in | Order of Neumann series preconditioning |
independ_nvec | integer | in | Size of diagonal blocks to be calculated independently in subdivision preconditioning |
nBlock | integer | in | Number of rows allocated to the thread cyclically. |
cbcg_kstep | integer | in | Number of communication avoiding steps |
cbcg_Eigenflag | integer | in | Flag for eigenvalue computation (1: power method, 2: CA-ARNOLDI) |
power_method_itrmax | integer | in | Maximum number of iterations in the power method |
caarnoldi_sstep | integer | in | Number of communication avoiding steps s of the CA-ARNOLDI method |
caarnoldi_tstep | integer | in | Number of outer iterations t in the CA-ARNOLDI method (restart length = st) |
caarnoldi_basis | integer | in | Flag for basis vector in the CA-ARNOLDI method (0: monomial basis, 1: Newton basis) |
caarnoldi_QRflag | integer | in | Flag for QR factorization in the CA-ARNOLDI method (1: MGS, 2: CGS, 3: TSQR, 4: Cholesky QR, 5: Cholesky QR2) |
iret | integer | out | Error code (0:normal) |
call parcel_dcaarnoldi( icomm, vecx, vecb, n, gn, nnz, istart, crsA, crsRow_ptr, crsCol, ipreflag, ilu_method, addL, iflagAS, itrmax, iovlflag, precon_thblock, independ_nvec, nBlock, caarnoldi_sstep, caarnoldi_tstep, caarnoldi_basis, caarnoldi_QRflag, Evalr, Evali, Evec, Eerr, EmaxID, EminID, iret )
An eigenvalue problem is solved by the communication avoiding Arnoldi method (CA-Arnoldi(s,t)). Non-zero elements are stored in CRS format.
Parameter(dimension) | Type | Input/Output | Description |
---|---|---|---|
icomm | integer | in | MPI communicator |
vecx(n) | double precision | in/out | in: initial vector, out: solution vector |
vecb(n) | double precision | in | Right hand side vector |
n | integer*4 / integer*8 | in | Size of vector on each process |
gn | integer*4 / integer*8 | in | Total size of vector |
nnz | integer*4 / integer*8 | in | Number of non-zero elements on each process |
istart | integer*4 / integer*8 | in | Start line of matrix on each process |
crsA(nnz) | double precision | in | Non-zero elements of matrix stored in CRS format |
crsRow_ptr(n+1) | integer*4 / integer*8 | in | Pointer table in CRS format |
crsCol(nnz) | integer*4 / integer*8 | in | Column numbers of non-zero elements in CRS format |
ipreflag | integer | in | Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz) |
ilu_method | integer | in | Flag for incomplete LU factorization(0:ILU(0),1:D-ILU(DIA components match),2:D-ILU(Element sums in a row match),Only 1 and 2 are available for additive Schwartz) |
addL | integer | in | Overlapping width in the additive Schwartz method |
iflagAS | integer | in | Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE) |
itrmax | integer | in/out | in: maximum number of iterations, out: number of iterations |
iovlflag | integer | in | Flag for communication-computation overlap (0: none, 1: all processes, 2: each process) |
precon_thblock | integer | in | Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning |
independ_nvec | integer | in | Size of diagonal blocks to be calculated independently in subdivision preconditioning |
nBlock | integer | in | Number of rows allocated to the thread cyclically. |
caarnoldi_sstep | integer | in | Number of communication avoiding steps s in the CA-ARNOLDI method |
caarnoldi_tstep | integer | in | Number of outer iterations t in the CA-ARNOLDI method (restart length = st) |
caarnoldi_basis | integer | in | Flag for basis vector in the CA-ARNOLDI method (0: monomial basis, 1: Newton basis) |
caarnoldi_QRflag | integer | in | Flag for QR factorization in the CA-ARNOLDI method (1: MGS, 2: CGS, 3: TSQR, 4: CholeskyQR, 5: CholeskyQR2) |
Evalr(caarnoldi_sstep*caarnoldi_tstep) | double precision | out | Real part of the eigenvalue |
Evali(caarnoldi_sstep*caarnoldi_tstep) | double precision | out | Imaginary part of eigenvalue |
Evec(n*caarnoldi_sstep*caarnoldi_tstep) | double precision | out | Eigenvectors |
Eerr(caarnoldi_sstep*caarnoldi_tstep) | double precision | out | Error norm \(\left\| Ax - \lambda x \right\| / \left\| \lambda x \right\|\) |
EmaxID | integer | out | ID number of the maximum eigenvalue |
EminID | integer | out | ID number of the minimum eigenvalue |
iret | integer | out | Error code (0:normal) |
call parcel_dcg_dia( icomm, vecx,vecb, n,gn,istart, diaA,offset,nnd, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, iret )
A simultaneous linear equation system Ax = b is solved by the conjugate gradient method (CG method). Non-zero elements are stored in DIA format.
Parameter(dimension) | Type | Input/Output | Description |
---|---|---|---|
icomm | integer | in | MPI communicator |
vecx(n) | CPU : double precision / GPU : type(c_ptr) | in/out | in: initial vector, out: solution vector |
vecb(n) | CPU : double precision / GPU : type(c_ptr) | in | Right hand side vector |
n | integer*4 / integer*8 | in | Size of vector on each process |
gn | integer*4 / integer*8 | in | Total size of vector |
istart | integer*4 / integer*8 | in | Start line of matrix on each process |
diaA(n*nnd) | CPU : double precision / GPU : type(c_ptr) | in | Non-zero elements of matrix stored in DIA format |
offset(nnd) | CPU : integer*4 / integer*8 / GPU : type(c_ptr) | in | Offset of each non-zero diagonal elements array in DIA format |
nnd | integer*4 / integer*8 | in | Number of diagonal elements arrays in DIA format |
ipreflag | integer | in | Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz, 4: Neumann series(GPU), 5: block Neumann series(GPU) ) |
ilu_method | integer | in | Flag for incomplete LU factorization(0:ILU(0),1:D-ILU(DIA components match),2:D-ILU(Element sums in a row match),Only 1 and 2 are available for additive Schwartz) |
addL | integer | in | Overlapping width in the additive Schwartz method |
iflagAS | integer | in | Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE) |
itrmax | integer | in/out | in: maximum number of iterations, out: number of iterations |
rtolmax | double precision | in | Convergence criterion (norm of relative residual error) |
reshistory(itrmax) | double precision | out | History of relative residual error |
iovlflag | integer | in | Flag for communication-computation overlap (0: none, 1: all processes, 2: each process) |
precon_thblock | integer | in | Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning |
Neuman_itrmax | integer | in | Order of Neumann series preconditioning |
independ_nvec | integer | in | Size of diagonal blocks to be calculated independently in subdivision preconditioning |
nBlock | integer | in | Number of rows allocated to the thread cyclically. |
iret | integer | out | Error code (0:normal) |
call parcel_dbicgstab_dia( icomm, vecx,vecb, n,gn,istart, diaA,offset,nnd, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, iret )
A simultaneous linear equation system Ax = b is solved by the stabilized biconjugate gradient method (Bi-CGSTAB method). Non-zero elements are stored in DIA format.
Parameter(dimension) | Type | Input/Output | Description |
---|---|---|---|
icomm | integer | in | MPI communicator |
vecx(n) | CPU : double precision / GPU : type(c_ptr) | in/out | in: initial vector, out: solution vector |
vecb(n) | CPU : double precision / GPU : type(c_ptr) | in | Right hand side vector |
n | integer*4 / integer*8 | in | Size of vector on each process |
gn | integer*4 / integer*8 | in | Total size of vector |
istart | integer*4 / integer*8 | in | Start line of matrix on each process |
diaA(n*nnd) | CPU : double precision / GPU : type(c_ptr) | in | Non-zero elements of matrix stored in DIA format |
offset(nnd) | CPU : integer*4 / integer*8 / GPU : type(c_ptr) | in | Offset of each non-zero diagonal elements array in DIA format |
nnd | integer*4 / integer*8 | in | Number of diagonal elements arrays in DIA format |
ipreflag | integer | in | Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz, 4: Neumann series(GPU), 5: block Neumann series(GPU) ) |
ilu_method | integer | in | Flag for incomplete LU factorization(0:ILU(0),1:D-ILU(DIA components match),2:D-ILU(Element sums in a row match),Only 1 and 2 are available for additive Schwartz) |
addL | integer | in | Overlapping width in the additive Schwartz method |
iflagAS | integer | in | Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE) |
itrmax | integer | in/out | in: maximum number of iterations, out: number of iterations |
rtolmax | double precision | in | Convergence criterion (norm of relative residual error) |
reshistory(itrmax) | double precision | out | History of relative residual error |
iovlflag | integer | in | Flag for communication-computation overlap (0: none, 1: all processes, 2: each process) |
precon_thblock | integer | in | Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning |
Neuman_itrmax | integer | in | Order of Neumann series preconditioning |
independ_nvec | integer | in | Size of diagonal blocks to be calculated independently in subdivision preconditioning |
nBlock | integer | in | Number of rows allocated to the thread cyclically. |
iret | integer | out | Error code (0:normal) |
call parcel_dgmres_dia( icomm, vecx,vecb, n,gn,istart, diaA,offset,nnd, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, gmres_m,gmres_GSflag, iret )
A simultaneous linear equation system Ax = b is solved by the generalized minimum residual method (GMRES (m) method). Non-zero elements are stored in DIA format.
Parameter(dimension) | Type | Input/Output | Description |
---|---|---|---|
icomm | integer | in | MPI communicator |
vecx(n) | CPU : double precision / GPU : type(c_ptr) | in/out | in: initial vector, out: solution vector |
vecb(n) | CPU : double precision / GPU : type(c_ptr) | in | Right hand side vector |
n | integer*4 / integer*8 | in | Size of vector on each process |
gn | integer*4 / integer*8 | in | Total size of vector |
istart | integer*4 / integer*8 | in | Start line of matrix on each process |
diaA(n*nnd) | CPU : double precision / GPU : type(c_ptr) | in | Non-zero elements of matrix stored in DIA format |
offset(nnd) | CPU : integer*4 / integer*8 / GPU : type(c_ptr) | in | Offset of each non-zero diagonal elements array in DIA format |
nnd | integer*4 / integer*8 | in | Number of diagonal elements arrays in DIA format |
ipreflag | integer | in | Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz, 4: Neumann series(GPU), 5: block Neumann series(GPU) ) |
ilu_method | integer | in | Flag for incomplete LU factorization(0:ILU(0),1:D-ILU(DIA components match),2:D-ILU(Element sums in a row match),Only 1 and 2 are available for additive Schwartz) |
addL | integer | in | Overlapping width in the additive Schwartz method |
iflagAS | integer | in | Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE) |
itrmax | integer | in/out | in: maximum number of iterations, out: number of iterations |
rtolmax | double precision | in | Convergence criterion (norm of relative residual error) |
reshistory(itrmax) | double precision | out | History of relative residual error |
iovlflag | integer | in | Flag for communication-computation overlap (0: none, 1: all processes, 2: each process) |
precon_thblock | integer | in | Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning |
Neuman_itrmax | integer | in | Order of Neumann series preconditioning |
independ_nvec | integer | in | Size of diagonal blocks to be calculated independently in subdivision preconditioning |
nBlock | integer | in | Number of rows allocated to the thread cyclically. |
gmres_m | integer | in | Number of iterations until restart |
gmres_GSflag | integer | in | Flag for orthogonalization algorithm (1: MGS, 2: CGS) |
iret | integer | out | Error code (0:normal) |
call parcel_dcagmres_dia( icomm, vecx,vecb, n,gn,istart, diaA,offset,nnd, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, cagmres_sstep, cagmres_tstep, cagmres_basis, cagmres_QRflag, iret )
A simultaneous linear equation system Ax = b is solved by the communication reduced generalized minimum residual method (CA-GMRES (s, t) method). Non-zero elements are stored in DIA format.
Parameter(dimension) | Type | Input/Output | Description |
---|---|---|---|
icomm | integer | in | MPI communicator |
vecx(n) | CPU : double precision / GPU : type(c_ptr) | in/out | in: initial vector, out: solution vector |
vecb(n) | CPU : double precision / GPU : type(c_ptr) | in | Right hand side vector |
n | integer*4 / integer*8 | in | Size of vector on each process |
gn | integer*4 / integer*8 | in | Total size of vector |
istart | integer*4 / integer*8 | in | Start line of matrix on each process |
diaA(n*nnd) | CPU : double precision / GPU : type(c_ptr) | in | Non-zero elements of matrix stored in DIA format |
offset(nnd) | CPU : integer*4 / integer*8 / GPU : type(c_ptr) | in | Offset of each non-zero diagonal elements array in DIA format |
nnd | integer*4 / integer*8 | in | Number of diagonal elements arrays in DIA format |
ipreflag | integer | in | Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz, 4: Neumann series(GPU), 5: block Neumann series(GPU) ) |
ilu_method | integer | in | Flag for incomplete LU factorization(0:ILU(0),1:D-ILU(DIA components match),2:D-ILU(Element sums in a row match),Only 1 and 2 are available for additive Schwartz) |
addL | integer | in | Overlapping width in the additive Schwartz method |
iflagAS | integer | in | Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE) |
itrmax | integer | in/out | in: maximum number of iterations, out: number of iterations |
rtolmax | double precision | in | Convergence criterion (norm of relative residual error) |
reshistory(itrmax) | double precision | out | History of relative residual error |
iovlflag | integer | in | Flag for communication-computation overlap (0: none, 1: all processes, 2: each process) |
precon_thblock | integer | in | Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning |
Neuman_itrmax | integer | in | Order of Neumann series preconditioning |
independ_nvec | integer | in | Size of diagonal blocks to be calculated independently in subdivision preconditioning |
nBlock | integer | in | Number of rows allocated to the thread cyclically. |
cagmres_sstep | integer | in | Number of communication avoiding steps s in the CA-GMRES method |
cagmres_tstep | integer | in | Number of outer iterations t in the CA-GMRES method (restart length = st) |
cagmres_basis | integer | in | Flag for basis vector in the CA-GMRES method (0: monomial basis, 1: Newton basis) |
cagmres_QRflag | integer | in | Flag for QR factorization in the CA-GMRES method (1: MGS, 2: CGS, 3: TSQR, 4: CholeskyQR, 5: CholeskyQR2) |
iret | integer | out | Error code (0:normal) |
call parcel_dcbcg_dia( icomm, vecx,vecb, n,gn,istart, diaA,offset,nnd, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, cbcg_kstep,cbcg_Eigenflag,power_method_itrmax, caarnoldi_sstep,caarnoldi_tstep, caarnoldi_basis,caarnoldi_QRflag, iret )
A simultaneous linear equation system Ax = b is solved by the Chebyshev basis conjugate gradient method (CBCG method). Non-zero elements are stored in DIA format.
Parameter(dimension) | Type | Input/Output | Description |
---|---|---|---|
icomm | integer | in | MPI communicator |
vecx(n) | CPU : double precision / GPU : type(c_ptr) | in/out | in: initial vector, out: solution vector |
vecb(n) | CPU : double precision / GPU : type(c_ptr) | in | Right hand side vector |
n | integer*4 / integer*8 | in | Size of vector on each process |
gn | integer*4 / integer*8 | in | Total size of vector |
istart | integer*4 / integer*8 | in | Start line of matrix on each process |
diaA(n*nnd) | CPU : double precision / GPU : type(c_ptr) | in | Non-zero elements of matrix stored in DIA format |
offset(nnd) | CPU : integer*4 / integer*8 / GPU : type(c_ptr) | in | Offset of each non-zero diagonal elements array in DIA format |
nnd | integer*4 / integer*8 | in | Number of diagonal elements arrays in DIA format |
ipreflag | integer | in | Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz, 4: Neumann series(GPU), 5: block Neumann series(GPU) ) |
ilu_method | integer | in | Flag for incomplete LU factorization(0:ILU(0),1:D-ILU(DIA components match),2:D-ILU(Element sums in a row match),Only 1 and 2 are available for additive Schwartz) |
addL | integer | in | Overlapping width in the additive Schwartz method |
iflagAS | integer | in | Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE) |
itrmax | integer | in/out | in: maximum number of iterations, out: number of iterations |
rtolmax | double precision | in | Convergence criterion (norm of relative residual error) |
reshistory(itrmax) | double precision | out | History of relative residual error |
iovlflag | integer | in | Flag for communication-computation overlap (0: none, 1: all processes, 2: each process) |
precon_thblock | integer | in | Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning |
Neuman_itrmax | integer | in | Order of Neumann series preconditioning |
independ_nvec | integer | in | Size of diagonal blocks to be calculated independently in subdivision preconditioning |
nBlock | integer | in | Number of rows allocated to the thread cyclically. |
cbcg_kstep | integer | in | Number of communication avoiding steps |
cbcg_Eigenflag | integer | in | Flag for eigenvalue computation (1: power method, 2: CA-ARNOLDI) |
power_method_itrmax | integer | in | Maximum number of iterations in the power method |
caarnoldi_sstep | integer | in | Number of communication avoiding steps s of the CA-ARNOLDI method |
caarnoldi_tstep | integer | in | Number of outer iterations t in the CA-ARNOLDI method (restart length = st) |
caarnoldi_basis | integer | in | Flag for basis vector in the CA-ARNOLDI method (0: monomial basis, 1: Newton basis) |
caarnoldi_QRflag | integer | in | Flag for QR factorization in the CA-ARNOLDI method (1: MGS, 2: CGS, 3: TSQR, 4: Cholesky QR, 5: Cholesky QR2) |
iret | integer | out | Error code (0:normal) |
call parcel_dcaarnoldi_dia( icomm, vecx, vecb, n, gn, istart, diaA,offset,nnd, ipreflag, ilu_method, addL, iflagAS, itrmax, iovlflag, precon_thblock, independ_nvec, nBlock, caarnoldi_sstep, caarnoldi_tstep, caarnoldi_basis, caarnoldi_QRflag, Evalr, Evali, Evec, Eerr, EmaxID, EminID, iret )
An eigenvalue problem is solved by the communication avoiding Arnoldi method (CA-Arnoldi(s,t)). Non-zero elements are stored in DIA format.
Parameter(dimension) | Type | Input/Output | Description |
---|---|---|---|
icomm | integer | in | MPI communicator |
vecx(n) | double precision | in/out | in: initial vector, out: solution vector |
vecb(n) | double precision | in | Right hand side vector |
n | integer*4 / integer*8 | in | Size of vector on each process |
gn | integer*4 / integer*8 | in | Total size of vector |
istart | integer*4 / integer*8 | in | Start line of matrix on each process |
diaA(n*nnd) | double precision | in | Non-zero elements of matrix stored in DIA format |
offset(nnd) | integer*4 / integer*8 | in | Offset of each non-zero diagonal elements array in DIA format |
nnd | integer*4 / integer*8 | in | Number of diagonal elements arrays in DIA format |
ipreflag | integer | in | Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwartz) |
ilu_method | integer | in | Flag for incomplete LU factorization(0:ILU(0),1:D-ILU(DIA components match),2:D-ILU(Element sums in a row match),Only 1 and 2 are available for additive Schwartz) |
addL | integer | in | Overlapping width in the additive Schwartz method |
iflagAS | integer | in | Flag for the additive Schwartz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE) |
itrmax | integer | in/out | in: maximum number of iterations, out: number of iterations |
iovlflag | integer | in | Flag for communication-computation overlap (0: none, 1: all processes, 2: each process) |
precon_thblock | integer | in | Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning |
independ_nvec | integer | in | Size of diagonal blocks to be calculated independently in subdivision preconditioning |
nBlock | integer | in | Number of rows allocated to the thread cyclically. |
caarnoldi_sstep | integer | in | Number of communication avoiding steps s in the CA-ARNOLDI method |
caarnoldi_tstep | integer | in | Number of outer iterations t in the CA-ARNOLDI method (restart length = st) |
caarnoldi_basis | integer | in | Flag for basis vector in the CA-ARNOLDI method (0: monomial basis, 1: Newton basis) |
caarnoldi_QRflag | integer | in | Flag for QR factorization in the CA-ARNOLDI method (1: MGS, 2: CGS, 3: TSQR, 4: CholeskyQR, 5: CholeskyQR2) |
Evalr(caarnoldi_sstep*caarnoldi_tstep) | double precision | out | Real part of the eigenvalue |
Evali(caarnoldi_sstep*caarnoldi_tstep) | double precision | out | Imaginary part of eigenvalue |
Evec(n*caarnoldi_sstep*caarnoldi_tstep) | double precision | out | Eigenvectors |
Eerr(caarnoldi_sstep*caarnoldi_tstep) | double precision | out | Error norm \(\left\| Ax - \lambda x \right\| / \left\| \lambda x \right\|\) |
EmaxID | integer | out | ID number of the maximum eigenvalue |
EminID | integer | out | ID number of the minimum eigenvalue |
iret | integer | out | Error code (0:normal) |
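For orientation, the fragment below sketches a call to parcel_dcaarnoldi_dia. It is only an illustrative sketch: it assumes that the DIA-format matrix (diaA, offset, nnd), the vectors vecx and vecb, the sizes n, gn, istart, and itrmax have already been prepared by the user (for example by a generator routine such as the make_matrix_DIA used in the sample programs at the end of this chapter), and the flag values are examples rather than recommended settings.

```fortran
! Sketch: computing s*t = 40 approximate eigenvalues with CA-Arnoldi(s,t).
! n, gn, istart, diaA, offset, nnd, vecx, vecb and itrmax are assumed to
! have been prepared by the user.
integer, parameter :: sstep = 4, tstep = 10
real*8, allocatable :: Evalr(:), Evali(:), Evec(:), Eerr(:)
integer EmaxID, EminID, iret

allocate(Evalr(sstep*tstep), Evali(sstep*tstep))
allocate(Evec(n*sstep*tstep), Eerr(sstep*tstep))

call parcel_dcaarnoldi_dia( MPI_COMM_WORLD, vecx, vecb, n, gn, istart, &
     diaA, offset, nnd,                                                &
     0, 1, 0, 1,         & ! ipreflag, ilu_method, addL, iflagAS
     itrmax, 0,          & ! itrmax (in/out), iovlflag
     -1, -1, 2000,       & ! precon_thblock, independ_nvec, nBlock
     sstep, tstep, 0, 3, & ! s, t, monomial basis, TSQR
     Evalr, Evali, Evec, Eerr, EmaxID, EminID, iret )
```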
call parcel_dcg_ddm( icomm, x,b, m, nnd,ioff_dia,val_dia, num_neighbor_ranks,neighbor_ranks,margin,ncomp_grids, ndiv_grids,num_grids,ioff_grids, div_direc_th, ipreflag, addL,iflagAS, itrmax,rtolmax,reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, iret )
A simultaneous linear equation system Ax = b is solved by the conjugate gradient method (CG method). Non-zero elements are stored in DDM format.
Parameter(dimension) | Type | Input/Output | Description |
---|---|---|---|
icomm | integer | in | MPI communicator |
x(m) | CPU : double precision / GPU : type(c_ptr) | in/out | in: initial vector, out: solution vector |
b(m) | CPU : double precision / GPU : type(c_ptr) | in | Right hand side vector |
m | integer | in | Size of vector on each process |
nnd | integer | in | Number of diagonal elements arrays on each process |
ioff_dia(nnd) | integer | in | Offset of each non-zero diagonal elements array in DDM format |
val_dia(m*nnd) | double precision | in | Non-zero elements of matrix stored in DDM format |
num_neighbor_ranks | integer | in | Number of neighboring processes |
neighbor_ranks(num_neighbor_ranks) | integer | in | List of neighboring processes |
margin | integer | in | Maximum absolute value of ioff_grids |
ncomp_grids | integer | in | Dimension of structured grids |
ndiv_grids(ncomp_grids) | integer | in | Number of decomposed domains in each direction |
num_grids(ncomp_grids) | integer | in | Number of grids on each process |
ioff_grids(nnd,ncomp_grids) | integer | in | Offset of stencil data location in each direction |
div_direc_th | integer | in | Direction of thread parallelization ( 1:x, 2:y, 3:z) |
ipreflag | integer | in | Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz, 4: Neumann series (GPU), 5: block Neumann series (GPU)) |
addL | integer | in | Overlapping width in the additive Schwarz method |
iflagAS | integer | in | Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE) |
itrmax | integer | in/out | in: maximum number of iterations, out: number of iterations |
rtolmax | double precision | in | convergence criterion (norm of relative residual error) |
reshistory(itrmax) | double precision | out | History of relative residual error |
iovlflag | integer | in | Flag for communication-computation overlap (0: none, 1: all processes, 2: each process) |
precon_thblock | integer | in | Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning |
Neuman_itrmax | integer | in | Order of Neumann series preconditioning |
independ_nvec | integer | in | Size of diagonal blocks to be calculated independently in subdivision preconditioning |
nBlock | integer | in | Number of rows allocated to the thread cyclically. |
iret | integer | out | Error code (0:normal) |
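The stencil-related arguments nnd, ioff_grids, and margin describe the matrix structure on the structured grid. The fragment below is a sketch, assuming a standard 3D 7-point stencil (the same nnd = 7 and margin = 1 as in the DDM sample program at the end of this chapter); the ordering of the stencil entries must of course match the one used by the matrix generator.

```fortran
! Stencil description for a 3D 7-point stencil in DDM format (illustrative).
integer, parameter :: nnd = 7, ncomp_grids = 3
integer :: ioff_grids(nnd,ncomp_grids), margin

ioff_grids(1,:) = (/ 0, 0, 0 /)   ! centre point
ioff_grids(2,:) = (/-1, 0, 0 /)   ! -x neighbour
ioff_grids(3,:) = (/ 1, 0, 0 /)   ! +x neighbour
ioff_grids(4,:) = (/ 0,-1, 0 /)   ! -y neighbour
ioff_grids(5,:) = (/ 0, 1, 0 /)   ! +y neighbour
ioff_grids(6,:) = (/ 0, 0,-1 /)   ! -z neighbour
ioff_grids(7,:) = (/ 0, 0, 1 /)   ! +z neighbour
margin = maxval(abs(ioff_grids))  ! = 1, the maximum absolute offset
```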
call parcel_dbicgstab_ddm( icomm, x,b, m, nnd,ioff_dia,val_dia, num_neighbor_ranks,neighbor_ranks,margin,ncomp_grids, ndiv_grids,num_grids,ioff_grids, div_direc_th, ipreflag, addL,iflagAS, itrmax,rtolmax,reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, iret )
A simultaneous linear equation system Ax = b is solved by the stabilized biconjugate gradient method (Bi-CGSTAB method). Non-zero elements are stored in DDM format.
Parameter(dimension) | Type | Input/Output | Description |
---|---|---|---|
icomm | integer | in | MPI communicator |
x(m) | CPU : double precision / GPU : type(c_ptr) | in/out | in: initial vector, out: solution vector |
b(m) | CPU : double precision / GPU : type(c_ptr) | in | Right hand side vector |
m | integer | in | Size of vector on each process |
nnd | integer | in | Number of diagonal elements arrays on each process |
ioff_dia(nnd) | CPU : integer / GPU : type(c_ptr) | in | Offset of each non-zero diagonal elements array in DDM format |
val_dia(m*nnd) | CPU : double precision / GPU : type(c_ptr) | in | Non-zero elements of matrix stored in DDM format |
num_neighbor_ranks | integer | in | Number of neighboring processes |
neighbor_ranks(num_neighbor_ranks) | integer | in | List of neighboring processes |
margin | integer | in | Maximum absolute value of ioff_grids |
ncomp_grids | integer | in | Dimension of structured grids |
ndiv_grids(ncomp_grids) | integer | in | Number of decomposed domains in each direction |
num_grids(ncomp_grids) | integer | in | Number of grids on each process |
ioff_grids(nnd,ncomp_grids) | integer | in | Offset of stencil data location in each direction |
div_direc_th | integer | in | Direction of thread parallelization ( 1:x, 2:y, 3:z) |
ipreflag | integer | in | Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz, 4: Neumann series (GPU), 5: block Neumann series (GPU)) |
addL | integer | in | Overlapping width in the additive Schwarz method |
iflagAS | integer | in | Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE) |
itrmax | integer | in/out | in: maximum number of iterations, out: number of iterations |
rtolmax | double precision | in | convergence criterion (norm of relative residual error) |
reshistory(itrmax) | double precision | out | History of relative residual error |
iovlflag | integer | in | Flag for communication-computation overlap (0: none, 1: all processes, 2: each process) |
precon_thblock | integer | in | Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning |
Neuman_itrmax | integer | in | Order of Neumann series preconditioning |
independ_nvec | integer | in | Size of diagonal blocks to be calculated independently in subdivision preconditioning |
nBlock | integer | in | Number of rows allocated to the thread cyclically. |
iret | integer | out | Error code (0:normal) |
call parcel_dgmres_ddm( icomm, x,b, m, nnd,ioff_dia,val_dia, num_neighbor_ranks,neighbor_ranks,margin,ncomp_grids, ndiv_grids,num_grids,ioff_grids, div_direc_th, ipreflag, addL,iflagAS, itrmax,rtolmax,reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, gmres_m, gmres_GSflag, iret )
A simultaneous linear equation system Ax = b is solved by the generalized minimum residual method (GMRES (m) method). Non-zero elements are stored in DDM format.
Parameter(dimension) | Type | Input/Output | Description |
---|---|---|---|
icomm | integer | in | MPI communicator |
x(m) | CPU : double precision / GPU : type(c_ptr) | in/out | in: initial vector, out: solution vector |
b(m) | CPU : double precision / GPU : type(c_ptr) | in | Right hand side vector |
m | integer | in | Size of vector on each process |
nnd | integer | in | Number of diagonal elements arrays on each process |
ioff_dia(nnd) | CPU : integer / GPU : type(c_ptr) | in | Offset of each non-zero diagonal elements array in DDM format |
val_dia(m*nnd) | CPU : double precision / GPU : type(c_ptr) | in | Non-zero elements of matrix stored in DDM format |
num_neighbor_ranks | integer | in | Number of neighboring processes |
neighbor_ranks(num_neighbor_ranks) | integer | in | List of neighboring processes |
margin | integer | in | Maximum absolute value of ioff_grids |
ncomp_grids | integer | in | Dimension of structured grids |
ndiv_grids(ncomp_grids) | integer | in | Number of decomposed domains in each direction |
num_grids(ncomp_grids) | integer | in | Number of grids on each process |
ioff_grids(nnd,ncomp_grids) | integer | in | Offset of stencil data location in each direction |
div_direc_th | integer | in | Direction of thread parallelization ( 1:x, 2:y, 3:z) |
ipreflag | integer | in | Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz, 4: Neumann series (GPU), 5: block Neumann series (GPU)) |
addL | integer | in | Overlapping width in the additive Schwarz method |
iflagAS | integer | in | Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE) |
itrmax | integer | in/out | in: maximum number of iterations, out: number of iterations |
rtolmax | double precision | in | convergence criterion (norm of relative residual error) |
reshistory(itrmax) | double precision | out | History of relative residual error |
iovlflag | integer | in | Flag for communication-computation overlap (0: none, 1: all processes, 2: each process) |
precon_thblock | integer | in | Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning |
Neuman_itrmax | integer | in | Order of Neumann series preconditioning |
independ_nvec | integer | in | Size of diagonal blocks to be calculated independently in subdivision preconditioning |
nBlock | integer | in | Number of rows allocated to the thread cyclically. |
gmres_m | integer | in | Number of iterations until restart |
gmres_GSflag | integer | in | Flag for orthogonalization algorithm (1: MGS, 2: CGS) |
iret | integer | out | Error code (0:normal) |
call parcel_dcagmres_ddm( icomm, x,b, m, nnd,ioff_dia,val_dia, num_neighbor_ranks,neighbor_ranks,margin,ncomp_grids, ndiv_grids,num_grids,ioff_grids, div_direc_th, ipreflag, addL,iflagAS, itrmax,rtolmax,reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, cagmres_sstep, cagmres_tstep, cagmres_basis, cagmres_QRflag, iret )
A simultaneous linear equation system Ax = b is solved by the communication avoiding generalized minimum residual method (CA-GMRES(s,t) method). Non-zero elements are stored in DDM format.
Parameter(dimension) | Type | Input/Output | Description |
---|---|---|---|
icomm | integer | in | MPI communicator |
x(m) | CPU : double precision / GPU : type(c_ptr) | in/out | in: initial vector, out: solution vector |
b(m) | CPU : double precision / GPU : type(c_ptr) | in | Right hand side vector |
m | integer | in | Size of vector on each process |
nnd | integer | in | Number of diagonal elements arrays on each process |
ioff_dia(nnd) | CPU : integer / GPU : type(c_ptr) | in | Offset of each non-zero diagonal elements array in DDM format |
val_dia(m*nnd) | CPU : double precision / GPU : type(c_ptr) | in | Non-zero elements of matrix stored in DDM format |
num_neighbor_ranks | integer | in | Number of neighboring processes |
neighbor_ranks(num_neighbor_ranks) | integer | in | List of neighboring processes |
margin | integer | in | Maximum absolute value of ioff_grids |
ncomp_grids | integer | in | Dimension of structured grids |
ndiv_grids(ncomp_grids) | integer | in | Number of decomposed domains in each direction |
num_grids(ncomp_grids) | integer | in | Number of grids on each process |
ioff_grids(nnd,ncomp_grids) | integer | in | Offset of stencil data location in each direction |
div_direc_th | integer | in | Direction of thread parallelization ( 1:x, 2:y, 3:z) |
ipreflag | integer | in | Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz, 4: Neumann series (GPU), 5: block Neumann series (GPU)) |
addL | integer | in | Overlapping width in the additive Schwarz method |
iflagAS | integer | in | Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE) |
itrmax | integer | in/out | in: maximum number of iterations, out: number of iterations |
rtolmax | double precision | in | convergence criterion (norm of relative residual error) |
reshistory(itrmax) | double precision | out | History of relative residual error |
iovlflag | integer | in | Flag for communication-computation overlap (0: none, 1: all processes, 2: each process) |
precon_thblock | integer | in | Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning |
Neuman_itrmax | integer | in | Order of Neumann series preconditioning |
independ_nvec | integer | in | Size of diagonal blocks to be calculated independently in subdivision preconditioning |
nBlock | integer | in | Number of rows allocated to the thread cyclically. |
cagmres_sstep | integer | in | Number of communication avoiding steps s in the CA-GMRES method |
cagmres_tstep | integer | in | Number of outer iterations t in the CA-GMRES method (restart length = st) |
cagmres_basis | integer | in | Flag for basis vector in the CA-GMRES method (0: monomial basis, 1: Newton basis) |
cagmres_QRflag | integer | in | Flag for QR factorization in the CA-GMRES method (1: MGS, 2: CGS, 3: TSQR, 4: CholeskyQR, 5: CholeskyQR2) |
iret | integer | out | Error code (0:normal) |
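Since the restart length of CA-GMRES(s,t) is the product s*t, the communication avoiding parameters can be chosen so that this product matches the restart length one would otherwise use with GMRES(m). The values below are purely illustrative, not recommendations from the library documentation:

```fortran
! Illustrative CA-GMRES settings equivalent to a restart length of 60.
cagmres_sstep  = 5    ! s: basis vectors generated per communication avoiding step
cagmres_tstep  = 12   ! t: outer iterations, restart length = s*t = 60
cagmres_basis  = 1    ! Newton basis (0 selects the monomial basis)
cagmres_QRflag = 3    ! TSQR for the communication avoiding QR factorization
```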
call parcel_dcbcg_ddm( icomm, x,b, m, nnd,ioff_dia,val_dia, num_neighbor_ranks,neighbor_ranks,margin,ncomp_grids, ndiv_grids,num_grids,ioff_grids, div_direc_th, ipreflag, addL,iflagAS, itrmax,rtolmax,reshistory, iovlflag, precon_thblock, Neuman_itrmax, independ_nvec, nBlock, cbcg_kstep,cbcg_Eigenflag,power_method_itrmax, caarnoldi_sstep,caarnoldi_tstep, caarnoldi_basis,caarnoldi_QRflag, iret )
A simultaneous linear equation system Ax = b is solved by the Chebyshev basis conjugate gradient method (CBCG method). Non-zero elements are stored in DDM format.
Parameter(dimension) | Type | Input/Output | Description |
---|---|---|---|
icomm | integer | in | MPI communicator |
x(m) | CPU : double precision / GPU : type(c_ptr) | in/out | in: initial vector, out: solution vector |
b(m) | CPU : double precision / GPU : type(c_ptr) | in | Right hand side vector |
m | integer | in | Size of vector on each process |
nnd | integer | in | Number of diagonal elements arrays on each process |
ioff_dia(nnd) | CPU : integer / GPU : type(c_ptr) | in | Offset of each non-zero diagonal elements array in DDM format |
val_dia(m*nnd) | CPU : double precision / GPU : type(c_ptr) | in | Non-zero elements of matrix stored in DDM format |
num_neighbor_ranks | integer | in | Number of neighboring processes |
neighbor_ranks(num_neighbor_ranks) | integer | in | List of neighboring processes |
margin | integer | in | Maximum absolute value of ioff_grids |
ncomp_grids | integer | in | Dimension of structured grids |
ndiv_grids(ncomp_grids) | integer | in | Number of decomposed domains in each direction |
num_grids(ncomp_grids) | integer | in | Number of grids on each process |
ioff_grids(nnd,ncomp_grids) | integer | in | Offset of stencil data location in each direction |
div_direc_th | integer | in | Direction of thread parallelization ( 1:x, 2:y, 3:z) |
ipreflag | integer | in | Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz, 4: Neumann series (GPU), 5: block Neumann series (GPU)) |
addL | integer | in | Overlapping width in the additive Schwarz method |
iflagAS | integer | in | Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE) |
itrmax | integer | in/out | in: maximum number of iterations, out: number of iterations |
rtolmax | double precision | in | convergence criterion (norm of relative residual error) |
reshistory(itrmax) | double precision | out | History of relative residual error |
iovlflag | integer | in | Flag for communication-computation overlap (0: none, 1: all processes, 2: each process) |
precon_thblock | integer | in | Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning |
Neuman_itrmax | integer | in | Order of Neumann series preconditioning |
independ_nvec | integer | in | Size of diagonal blocks to be calculated independently in subdivision preconditioning |
nBlock | integer | in | Number of rows allocated to the thread cyclically. |
cbcg_kstep | integer | in | Number of communication avoiding steps |
cbcg_Eigenflag | integer | in | Flag for eigenvalue computation (1: power method, 2: CA-ARNOLDI) |
power_method_itrmax | integer | in | Maximum number of iterations in the power method |
caarnoldi_sstep | integer | in | Number of communication avoiding steps s of the CA-ARNOLDI method |
caarnoldi_tstep | integer | in | Number of outer iterations t in the CA-ARNOLDI method (restart length = st) |
caarnoldi_basis | integer | in | Flag for basis vector in the CA-ARNOLDI method (0: monomial basis, 1: Newton basis) |
caarnoldi_QRflag | integer | in | Flag for QR factorization in the CA-ARNOLDI method (1: MGS, 2: CGS, 3: TSQR, 4: Cholesky QR, 5: Cholesky QR2) |
iret | integer | out | Error code (0:normal) |
call parcel_dcaarnoldi_ddm( icomm, x,b, m, nnd,ioff_dia,val_dia, num_neighbor_ranks,neighbor_ranks,margin,ncomp_grids, ndiv_grids,num_grids,ioff_grids, div_direc_th, ipreflag, addL,iflagAS, itrmax,rtolmax,reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, caarnoldi_sstep, caarnoldi_tstep, caarnoldi_basis, caarnoldi_QRflag, Evalr, Evali, Evec, Eerr, EmaxID, EminID, iret )
An eigenvalue problem is solved by the communication avoiding Arnoldi method (CA-Arnoldi(s,t)). Non-zero elements are stored in DDM format.
Parameter(dimension) | Type | Input/Output | Description |
---|---|---|---|
icomm | integer | in | MPI communicator |
x(m) | double precision | in/out | in: initial vector, out: solution vector |
b(m) | double precision | in | Right hand side vector |
m | integer | in | Size of vector on each process |
nnd | integer | in | Number of diagonal elements arrays on each process |
ioff_dia(nnd) | integer | in | Offset of each non-zero diagonal elements array in DDM format |
val_dia(m*nnd) | double precision | in | Non-zero elements of matrix stored in DDM format |
num_neighbor_ranks | integer | in | Number of neighboring processes |
neighbor_ranks(num_neighbor_ranks) | integer | in | List of neighboring processes |
margin | integer | in | Maximum absolute value of ioff_grids |
ncomp_grids | integer | in | Dimension of structured grids |
ndiv_grids(ncomp_grids) | integer | in | Number of decomposed domains in each direction |
num_grids(ncomp_grids) | integer | in | Number of grids on each process |
ioff_grids(nnd,ncomp_grids) | integer | in | Offset of stencil data location in each direction |
div_direc_th | integer | in | Direction of thread parallelization ( 1:x, 2:y, 3:z) |
ipreflag | integer | in | Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz) |
addL | integer | in | Overlapping width in the additive Schwarz method |
iflagAS | integer | in | Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE) |
itrmax | integer | in/out | in: maximum number of iterations, out: number of iterations |
rtolmax | double precision | in | convergence criterion (norm of relative residual error) |
reshistory(itrmax) | double precision | out | History of relative residual error |
iovlflag | integer | in | Flag for communication-computation overlap (0: none, 1: all processes, 2: each process) |
precon_thblock | integer | in | Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning |
independ_nvec | integer | in | Size of diagonal blocks to be calculated independently in subdivision preconditioning |
nBlock | integer | in | Number of rows allocated to the thread cyclically. |
caarnoldi_sstep | integer | in | Number of communication avoiding steps s in the CA-ARNOLDI method |
caarnoldi_tstep | integer | in | Number of outer iterations t in the CA-ARNOLDI method (restart length = st) |
caarnoldi_basis | integer | in | Flag for basis vector in the CA-ARNOLDI method (0: monomial basis, 1: Newton basis) |
caarnoldi_QRflag | integer | in | Flag for QR factorization in the CA-ARNOLDI method (1: MGS, 2: CGS, 3: TSQR, 4: CholeskyQR, 5: CholeskyQR2) |
Evalr(caarnoldi_sstep*caarnoldi_tstep) | double precision | out | Real part of the eigenvalue |
Evali(caarnoldi_sstep*caarnoldi_tstep) | double precision | out | Imaginary part of eigenvalue |
Evec(m*caarnoldi_sstep*caarnoldi_tstep) | double precision | out | Eigenvectors |
Eerr(caarnoldi_sstep*caarnoldi_tstep) | double precision | out | Error norm \(\left\| Ax - \lambda x \right\| / \left\| \lambda x \right\|\) |
EmaxID | integer | out | ID number of the maximum eigenvalue |
EminID | integer | out | ID number of the minimum eigenvalue |
iret | integer | out | Error code (0:normal) |
call parcel_ddcg( icomm, vecx_hi,vecx_lo, vecb_hi,vecb_lo, n,gn,nnz,istart, crsA_hi,crsA_lo,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, precision_A, precision_b, precision_x, precision_precon, iret )
A simultaneous linear equation system Ax = b is solved by the conjugate gradient method (CG method). Non-zero elements are stored in CRS format.
Parameter(dimension) | Type | Input/Output | Description |
---|---|---|---|
icomm | integer | in | MPI communicator |
vecx_hi(n) | double precision | in/out | in: initial vector (upper bits), out: solution vector (upper bits) |
vecx_lo(n) | double precision | in/out | in: initial vector (lower bits), out: solution vector (lower bits) |
vecb_hi(n) | double precision | in | Right hand side vector (upper bits) |
vecb_lo(n) | double precision | in | Right hand side vector (lower bits) |
n | integer*4 / integer*8 | in | Size of vector on each process |
gn | integer*4 / integer*8 | in | Total size of vector |
nnz | integer*4 / integer*8 | in | Number of non-zero elements on each process |
istart | integer*4 / integer*8 | in | Start line of matrix on each process |
crsA_hi(nnz) | double precision | in | Non-zero elements (upper bits) of matrix stored in CRS format |
crsA_lo(nnz) | double precision | in | Non-zero elements (lower bit) of matrix stored in CRS format |
crsRow_ptr(n+1) | integer*4 / integer*8 | in | Pointer table in CRS format |
crsCol(nnz) | integer*4 / integer*8 | in | Column numbers of non-zero elements in CRS format |
ipreflag | integer | in | Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz) |
ilu_method | integer | in | Flag for incomplete LU factorization (0: ILU(0), 1: D-ILU (DIA components match), 2: D-ILU (element sums in a row match); only 1 and 2 are available for the additive Schwarz method) |
addL | integer | in | Overlapping width in the additive Schwarz method |
iflagAS | integer | in | Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE) |
itrmax | integer | in/out | in: maximum number of iterations, out: number of iterations |
rtolmax | double precision | in | Convergence criterion (norm of relative residual error) |
reshistory(itrmax) | double precision | out | History of relative residual error |
iovlflag | integer | in | Flag for communication-computation overlap (0: none, 1: all processes, 2: each process) |
precon_thblock | integer | in | Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning |
independ_nvec | integer | in | Size of diagonal blocks to be calculated independently in subdivision preconditioning |
nBlock | integer | in | Number of rows allocated to the thread cyclically. |
precision_A | integer | in | Precision of matrix (1: double precision, 2: quad precision) |
precision_b | integer | in | Precision of right hand side vector (1: double precision, 2: quad precision) |
precision_x | integer | in | Precision of solution vector (1: double precision, 2: quad precision) |
precision_precon | integer | in | Precision of preconditioner matrix (1: double precision, 2: quad precision) |
iret | integer | out | Error code (0:normal) |
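A common way to drive the double-double routines when only a double-precision matrix is available is to place the data in the upper-bit arrays and zero the lower-bit arrays. The fragment below is a sketch under that assumption; it presumes that the CRS matrix (crsA, crsRow_ptr, crsCol), the sizes n, gn, nnz, istart, and the usual solver control variables have already been set up as in the CRS sample program at the end of this chapter.

```fortran
! Sketch: calling parcel_ddcg with a double-precision matrix whose lower
! bits are zero; the internal working precision is selected through the
! precision_A/b/x/precon flags (here all quad precision).
real*8, allocatable :: vecx_hi(:), vecx_lo(:), vecb_hi(:), vecb_lo(:)
real*8, allocatable :: crsA_hi(:), crsA_lo(:)

allocate(vecx_hi(n), vecx_lo(n), vecb_hi(n), vecb_lo(n))
allocate(crsA_hi(nnz), crsA_lo(nnz))

crsA_hi = crsA      ! upper bits: the original double-precision values
crsA_lo = 0.0d0     ! lower bits: zero when no extended-precision input exists
vecb_hi = 1.0d0
vecb_lo = 0.0d0
vecx_hi = 1.0d0
vecx_lo = 0.0d0

call parcel_ddcg( MPI_COMM_WORLD, vecx_hi, vecx_lo, vecb_hi, vecb_lo,   &
     n, gn, nnz, istart, crsA_hi, crsA_lo, crsRow_ptr, crsCol,          &
     ipreflag, ilu_method, addL, iflagAS, itrmax, rtolmax, reshistory,  &
     iovlflag, precon_thblock, independ_nvec, nBlock,                   &
     2, 2, 2, 2, & ! precision_A, precision_b, precision_x, precision_precon
     iret )
```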
call parcel_ddbicgstab( icomm, vecx_hi,vecx_lo, vecb_hi,vecb_lo, n,gn,nnz,istart, crsA_hi,crsA_lo,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, precision_A, precision_b, precision_x, precision_precon, iret )
A simultaneous linear equation system Ax = b is solved by the stabilized biconjugate gradient method (Bi-CGSTAB method). Non-zero elements are stored in CRS format.
Parameter(dimension) | Type | Input/Output | Description |
---|---|---|---|
icomm | integer | in | MPI communicator |
vecx_hi(n) | double precision | in/out | in: initial vector (upper bits), out: solution vector (upper bits) |
vecx_lo(n) | double precision | in/out | in: initial vector (lower bits), out: solution vector (lower bits) |
vecb_hi(n) | double precision | in | Right hand side vector (upper bits) |
vecb_lo(n) | double precision | in | Right hand side vector (lower bits) |
n | integer*4 / integer*8 | in | Size of vector on each process |
gn | integer*4 / integer*8 | in | Total size of vector |
nnz | integer*4 / integer*8 | in | Number of non-zero elements on each process |
istart | integer*4 / integer*8 | in | Start line of matrix on each process |
crsA_hi(nnz) | double precision | in | Non-zero elements (upper bits) of matrix stored in CRS format |
crsA_lo(nnz) | double precision | in | Non-zero elements (lower bit) of matrix stored in CRS format |
crsRow_ptr(n+1) | integer*4 / integer*8 | in | Pointer table in CRS format |
crsCol(nnz) | integer*4 / integer*8 | in | Column numbers of non-zero elements in CRS format |
ipreflag | integer | in | Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz) |
ilu_method | integer | in | Flag for incomplete LU factorization (0: ILU(0), 1: D-ILU (DIA components match), 2: D-ILU (element sums in a row match); only 1 and 2 are available for the additive Schwarz method) |
addL | integer | in | Overlapping width in the additive Schwarz method |
iflagAS | integer | in | Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE) |
itrmax | integer | in/out | in: maximum number of iterations, out: number of iterations |
rtolmax | double precision | in | Convergence criterion (norm of relative residual error) |
reshistory(itrmax) | double precision | out | History of relative residual error |
iovlflag | integer | in | Flag for communication-computation overlap (0: none, 1: all processes, 2: each process) |
precon_thblock | integer | in | Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning |
independ_nvec | integer | in | Size of diagonal blocks to be calculated independently in subdivision preconditioning |
nBlock | integer | in | Number of rows allocated to the thread cyclically. |
precision_A | integer | in | Precision of matrix (1: double precision, 2: quad precision) |
precision_b | integer | in | Precision of right hand side vector (1: double precision, 2: quad precision) |
precision_x | integer | in | Precision of solution vector (1: double precision, 2: quad precision) |
precision_precon | integer | in | Precision of preconditioner matrix (1: double precision, 2: quad precision) |
iret | integer | out | Error code (0:normal) |
call parcel_ddgmres( icomm, vecx_hi,vecx_lo, vecb_hi,vecb_lo, n,gn,nnz,istart, crsA_hi,crsA_lo,crsRow_ptr,crsCol, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, gmres_m,gmres_GSflag, precision_A, precision_b, precision_x, precision_precon, iret )
A simultaneous linear equation system Ax = b is solved by the generalized minimum residual method (GMRES (m) method). Non-zero elements are stored in CRS format.
Parameter(dimension) | Type | Input/Output | Description |
---|---|---|---|
icomm | integer | in | MPI communicator |
vecx_hi(n) | double precision | in/out | in: initial vector (upper bits), out: solution vector (upper bits) |
vecx_lo(n) | double precision | in/out | in: initial vector (lower bits), out: solution vector (lower bits) |
vecb_hi(n) | double precision | in | Right hand side vector (upper bits) |
vecb_lo(n) | double precision | in | Right hand side vector (lower bits) |
n | integer*4 / integer*8 | in | Size of vector on each process |
gn | integer*4 / integer*8 | in | Total size of vector |
nnz | integer*4 / integer*8 | in | Number of non-zero elements on each process |
istart | integer*4 / integer*8 | in | Start line of matrix on each process |
crsA_hi(nnz) | double precision | in | Non-zero elements (upper bits) of matrix stored in CRS format |
crsA_lo(nnz) | double precision | in | Non-zero elements (lower bit) of matrix stored in CRS format |
crsRow_ptr(n+1) | integer*4 / integer*8 | in | Pointer table in CRS format |
crsCol(nnz) | integer*4 / integer*8 | in | Column numbers of non-zero elements in CRS format |
ipreflag | integer | in | Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz) |
ilu_method | integer | in | Flag for incomplete LU factorization (0: ILU(0), 1: D-ILU (DIA components match), 2: D-ILU (element sums in a row match); only 1 and 2 are available for the additive Schwarz method) |
addL | integer | in | Overlapping width in the additive Schwarz method |
iflagAS | integer | in | Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE) |
itrmax | integer | in/out | in: maximum number of iterations, out: number of iterations |
rtolmax | double precision | in | Convergence criterion (norm of relative residual error) |
reshistory(itrmax) | double precision | out | History of relative residual error |
iovlflag | integer | in | Flag for communication-computation overlap (0: none, 1: all processes, 2: each process) |
precon_thblock | integer | in | Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning |
independ_nvec | integer | in | Size of diagonal blocks to be calculated independently in subdivision preconditioning |
nBlock | integer | in | Number of rows allocated to the thread cyclically. |
gmres_m | integer | in | Number of iterations until restart |
gmres_GSflag | integer | in | Flag for orthogonalization algorithm (1: MGS, 2: CGS) |
precision_A | integer | in | Precision of matrix (1: double precision, 2: quad precision) |
precision_b | integer | in | Precision of right hand side vector (1: double precision, 2: quad precision) |
precision_x | integer | in | Precision of solution vector (1: double precision, 2: quad precision) |
precision_precon | integer | in | Precision of preconditioner matrix (1: double precision, 2: quad precision) |
iret | integer | out | Error code (0:normal) |
call parcel_ddcg_dia( icomm, vecx_hi,vecx_lo, vecb_hi,vecb_lo, n,gn,istart, diaA_hi,diaA_lo,offset,nnd, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, precision_A, precision_b, precision_x, precision_precon, iret )
A simultaneous linear equation system Ax = b is solved by the conjugate gradient method (CG method). Non-zero elements are stored in DIA format.
Parameter(dimension) | Type | Input/Output | Description |
---|---|---|---|
icomm | integer | in | MPI communicator |
vecx_hi(n) | double precision | in/out | in: initial vector (upper bits), out: solution vector (upper bits) |
vecx_lo(n) | double precision | in/out | in: initial vector (lower bits), out: solution vector (lower bits) |
vecb_hi(n) | double precision | in | Right hand side vector (upper bits) |
vecb_lo(n) | double precision | in | Right hand side vector (lower bits) |
n | integer*4 / integer*8 | in | Size of vector on each process |
gn | integer*4 / integer*8 | in | Total size of vector |
istart | integer*4 / integer*8 | in | Start line of matrix on each process |
diaA_hi(n*nnd) | double precision | in | Non-zero elements (upper bits) of matrix stored in DIA format |
diaA_lo(n*nnd) | double precision | in | Non-zero elements (lower bit) of matrix stored in DIA format |
offset(nnd) | integer*4 / integer*8 | in | Offset of each non-zero diagonal elements array in DIA format |
nnd | integer*4 / integer*8 | in | Number of diagonal elements arrays in DIA format |
ipreflag | integer | in | Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz) |
ilu_method | integer | in | Flag for incomplete LU factorization (0: ILU(0), 1: D-ILU (DIA components match), 2: D-ILU (element sums in a row match); only 1 and 2 are available for the additive Schwarz method) |
addL | integer | in | Overlapping width in the additive Schwarz method |
iflagAS | integer | in | Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE) |
itrmax | integer | in/out | in: maximum number of iterations, out: number of iterations |
rtolmax | double precision | in | Convergence criterion (norm of relative residual error) |
reshistory(itrmax) | double precision | out | History of relative residual error |
iovlflag | integer | in | Flag for communication-computation overlap (0: none, 1: all processes, 2: each process) |
precon_thblock | integer | in | Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning |
independ_nvec | integer | in | Size of diagonal blocks to be calculated independently in subdivision preconditioning |
nBlock | integer | in | Number of rows allocated to the thread cyclically. |
precision_A | integer | in | Precision of matrix (1: double precision, 2: quad precision) |
precision_b | integer | in | Precision of right hand side vector (1: double precision, 2: quad precision) |
precision_x | integer | in | Precision of solution vector (1: double precision, 2: quad precision) |
precision_precon | integer | in | Precision of preconditioner matrix (1: double precision, 2: quad precision) |
iret | integer | out | Error code (0:normal) |
call parcel_ddbicgstab_dia( icomm, vecx_hi,vecx_lo, vecb_hi,vecb_lo, n,gn,istart, diaA_hi,diaA_lo,offset,nnd, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, precision_A, precision_b, precision_x, precision_precon, iret )
A simultaneous linear equation system Ax = b is solved by the stabilized biconjugate gradient method (Bi-CGSTAB method). Non-zero elements are stored in DIA format
Parameter(dimension) | Type | Input/Output | Description |
---|---|---|---|
icomm | integer | in | MPI communicator |
vecx_hi(n) | double precision | in/out | in: initial vector (upper bits), out: solution vector (upper bits) |
vecx_lo(n) | double precision | in/out | in: initial vector (lower bits), out: solution vector (lower bits) |
vecb_hi(n) | double precision | in | Right hand side vector (upper bits) |
vecb_lo(n) | double precision | in | Right hand side vector (lower bits) |
n | integer*4 / integer*8 | in | Size of vector on each process |
gn | integer*4 / integer*8 | in | Total size of vector |
istart | integer*4 / integer*8 | in | Start line of matrix on each process |
diaA_hi(n*nnd) | double precision | in | Non-zero elements (upper bits) of matrix stored in DIA format |
diaA_lo(n*nnd) | double precision | in | Non-zero elements (lower bit) of matrix stored in DIA format |
offset(nnd) | integer*4 / integer*8 | in | Offset of each non-zero diagonal elements array in DIA format |
nnd | integer*4 / integer*8 | in | Number of diagonal elements arrays in DIA format |
ipreflag | integer | in | Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz) |
ilu_method | integer | in | Flag for incomplete LU factorization (0: ILU(0), 1: D-ILU (DIA components match), 2: D-ILU (element sums in a row match); only 1 and 2 are available for the additive Schwarz method) |
addL | integer | in | Overlapping width in the additive Schwarz method |
iflagAS | integer | in | Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE) |
itrmax | integer | in/out | in: maximum number of iterations, out: number of iterations |
rtolmax | double precision | in | Convergence criterion (norm of relative residual error) |
reshistory(itrmax) | double precision | out | History of relative residual error |
iovlflag | integer | in | Flag for communication-computation overlap (0: none, 1: all processes, 2: each process) |
precon_thblock | integer | in | Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning |
independ_nvec | integer | in | Size of diagonal blocks to be calculated independently in subdivision preconditioning |
nBlock | integer | in | Number of rows allocated to the thread cyclically. |
precision_A | integer | in | Precision of matrix (1: double precision, 2: quad precision) |
precision_b | integer | in | Precision of right hand side vector (1: double precision, 2: quad precision) |
precision_x | integer | in | Precision of solution vector (1: double precision, 2: quad precision) |
precision_precon | integer | in | Precision of preconditioner matrix (1: double precision, 2: quad precision) |
iret | integer | out | Error code (0:normal) |
call parcel_ddgmres_dia( icomm, vecx_hi,vecx_lo, vecb_hi,vecb_lo, n,gn,istart, diaA_hi,diaA_lo,offset,nnd, ipreflag,ilu_method,addL,iflagAS, itrmax,rtolmax, reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, precision_A, precision_b, precision_x, precision_precon, iret )
A simultaneous linear equation system Ax = b is solved by the generalized minimum residual method (GMRES (m) method). Non-zero elements are stored in DIA format
Parameter(dimension) | Type | Input/Output | Description |
---|---|---|---|
icomm | integer | in | MPI communicator |
vecx_hi(n) | double precision | in/out | in: initial vector (upper bits), out: solution vector (upper bits) |
vecx_lo(n) | double precision | in/out | in: initial vector (lower bits), out: solution vector (lower bits) |
vecb_hi(n) | double precision | in | Right hand side vector (upper bits) |
vecb_lo(n) | double precision | in | Right hand side vector (lower bits) |
n | integer*4 / integer*8 | in | Size of vector on each process |
gn | integer*4 / integer*8 | in | Total size of vector |
istart | integer*4 / integer*8 | in | Start line of matrix on each process |
diaA_hi(n*nnd) | double precision | in | Non-zero elements (upper bits) of matrix stored in DIA format |
diaA_lo(n*nnd) | double precision | in | Non-zero elements (lower bit) of matrix stored in DIA format |
offset(nnd) | integer*4 / integer*8 | in | Offset of each non-zero diagonal elements array in DIA format |
nnd | integer*4 / integer*8 | in | Number of diagonal elements arrays in DIA format |
ipreflag | integer | in | Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz) |
ilu_method | integer | in | Flag for incomplete LU factorization (0: ILU(0), 1: D-ILU (DIA components match), 2: D-ILU (element sums in a row match); only 1 and 2 are available for the additive Schwarz method) |
addL | integer | in | Overlapping width in the additive Schwarz method |
iflagAS | integer | in | Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE) |
itrmax | integer | in/out | in: maximum number of iterations, out: number of iterations |
rtolmax | double precision | in | Convergence criterion (norm of relative residual error) |
reshistory(itrmax) | double precision | out | History of relative residual error |
iovlflag | integer | in | Flag for communication-computation overlap (0: none, 1: all processes, 2: each process) |
gmres_m | integer | in | Number of iterations until restart |
gmres_GSflag | integer | in | Flag for orthogonalization algorithm (1: MGS, 2: CGS) |
precon_thblock | integer | in | Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning |
independ_nvec | integer | in | Size of diagonal blocks to be calculated independently in subdivision preconditioning |
nBlock | integer | in | Number of rows allocated to the thread cyclically. |
precision_A | integer | in | Precision of matrix (1: double precision, 2: quad precision) |
precision_b | integer | in | Precision of right hand side vector (1: double precision, 2: quad precision) |
precision_x | integer | in | Precision of solution vector (1: double precision, 2: quad precision) |
precision_precon | integer | in | Precision of preconditioner matrix (1: double precision, 2: quad precision) |
iret | integer | out | Error code (0:normal) |
call parcel_ddcg_ddm( icomm, x_hi,x_lo, b_hi,b_lo, m, nnd,ioff_dia, val_dia_hi,val_dia_lo, num_neighbor_ranks,neighbor_ranks,margin,ncomp_grids, ndiv_grids,num_grids,ioff_grids, div_direc_th, ipreflag, addL,iflagAS, itrmax,rtolmax,reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, precision_A, precision_b, precision_x, precision_precon, iret )
A simultaneous linear equation system Ax = b is solved by the conjugate gradient method (CG method). Non-zero elements are stored in DDM format.
Parameter(dimension) | Type | Input/Output | Description |
---|---|---|---|
icomm | integer | in | MPI communicator |
x_hi(m) | double precision | in/out | in: initial vector (upper bits), out: solution vector (upper bits) |
x_lo(m) | double precision | in/out | in: initial vector (lower bits), out: solution vector (lower bits) |
b_hi(m) | double precision | in | Right hand side vector (upper bits) |
b_lo(m) | double precision | in | Right hand side vector (lower bits) |
m | integer | in | Size of vector on each process |
nnd | integer | in | Number of diagonal elements arrays on each process |
ioff_dia(nnd) | integer | in | Offset of each non-zero diagonal elements array in DDM format |
val_dia_hi(m*nnd) | double precision | in | Non-zero elements of matrix stored in DDM format (upper bit) |
val_dia_lo(m*nnd) | double precision | in | Non-zero elements of matrix stored in DDM format (lower bit) |
num_neighbor_ranks | integer | in | Number of neighboring processes |
neighbor_ranks(num_neighbor_ranks) | integer | in | List of neighboring processes |
margin | integer | in | Maximum absolute value of ioff_grids |
ncomp_grids | integer | in | Dimension of structured grids |
ndiv_grids(ncomp_grids) | integer | in | Number of decomposed domains in each direction |
num_grids(ncomp_grids) | integer | in | Number of grids on each process |
ioff_grids(nnd,ncomp_grids) | integer | in | Offset of stencil data location in each direction |
div_direc_th | integer | in | Direction of thread parallelization ( 1:x, 2:y, 3:z) |
ipreflag | integer | in | Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz) |
addL | integer | in | Overlapping width in the additive Schwarz method |
iflagAS | integer | in | Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE) |
itrmax | integer | in/out | in: maximum number of iterations, out: number of iterations |
rtolmax | double precision | in | convergence criterion (norm of relative residual error) |
reshistory(itrmax) | double precision | out | History of relative residual error |
iovlflag | integer | in | Flag for communication-computation overlap (0: none, 1: all processes, 2: each process) |
precon_thblock | integer | in | Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning |
independ_nvec | integer | in | Size of diagonal blocks to be calculated independently in subdivision preconditioning |
nBlock | integer | in | Number of rows allocated to the thread cyclically. |
precision_A | integer | in | Precision of matrix (1: double precision, 2: quad precision) |
precision_b | integer | in | Precision of right hand side vector (1: double precision, 2: quad precision) |
precision_x | integer | in | Precision of solution vector (1: double precision, 2: quad precision) |
precision_precon | integer | in | Precision of preconditioner matrix (1: double precision, 2: quad precision) |
iret | integer | out | Error code (0:normal) |
call parcel_ddbicgstab_ddm( icomm, x_hi,x_lo, b_hi,b_lo, m, nnd,ioff_dia, val_dia_hi,val_dia_lo, num_neighbor_ranks,neighbor_ranks,margin,ncomp_grids, ndiv_grids,num_grids,ioff_grids, div_direc_th, ipreflag, addL,iflagAS, itrmax,rtolmax,reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, precision_A, precision_b, precision_x, precision_precon, iret )
A simultaneous linear equation system Ax = b is solved by the stabilized biconjugate gradient method (Bi-CGSTAB method). Non-zero elements are stored in DDM format.
Parameter(dimension) | Type | Input/Output | Description |
---|---|---|---|
icomm | integer | in | MPI communicator |
x_hi(m) | double precision | in/out | in: initial vector (upper bits), out: solution vector (upper bits) |
x_lo(m) | double precision | in/out | in: initial vector (lower bits), out: solution vector (lower bits) |
b_hi(m) | double precision | in | Right hand side vector (upper bits) |
b_lo(m) | double precision | in | Right hand side vector (lower bits) |
m | integer | in | Size of vector on each process |
nnd | integer | in | Number of diagonal elements arrays on each process |
ioff_dia(nnd) | integer | in | Offset of each non-zero diagonal elements array in DDM format |
val_dia_hi(m*nnd) | double precision | in | Non-zero elements of matrix stored in DDM format (upper bit) |
val_dia_lo(m*nnd) | double precision | in | Non-zero elements of matrix stored in DDM format (lower bit) |
num_neighbor_ranks | integer | in | Number of neighboring processes |
neighbor_ranks(num_neighbor_ranks) | integer | in | List of neighboring processes |
margin | integer | in | Maximum absolute value of ioff_grids |
ncomp_grids | integer | in | Dimension of structured grids |
ndiv_grids(ncomp_grids) | integer | in | Number of decomposed domains in each direction |
num_grids(ncomp_grids) | integer | in | Number of grids on each process |
ioff_grids(nnd,ncomp_grids) | integer | in | Offset of stencil data location in each direction |
div_direc_th | integer | in | Direction of thread parallelization ( 1:x, 2:y, 3:z) |
ipreflag | integer | in | Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz) |
addL | integer | in | Overlapping width in the additive Schwarz method |
iflagAS | integer | in | Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE) |
itrmax | integer | in/out | in: maximum number of iterations, out: number of iterations |
rtolmax | double precision | in | convergence criterion (norm of relative residual error) |
reshistory(itrmax) | double precision | out | History of relative residual error |
iovlflag | integer | in | Flag for communication-computation overlap (0: none, 1: all processes, 2: each process) |
precon_thblock | integer | in | Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning |
independ_nvec | integer | in | Size of diagonal blocks to be calculated independently in subdivision preconditioning |
nBlock | integer | in | Number of rows allocated to the thread cyclically. |
precision_A | integer | in | Precision of matrix (1: double precision, 2: quad precision) |
precision_b | integer | in | Precision of right hand side vector (1: double precision, 2: quad precision) |
precision_x | integer | in | Precision of solution vector (1: double precision, 2: quad precision) |
precision_precon | integer | in | Precision of preconditioner matrix (1: double precision, 2: quad precision) |
iret | integer | out | Error code (0:normal) |
call parcel_ddgmres_ddm( icomm, x_hi,x_lo, b_hi,b_lo, m, nnd,ioff_dia, val_dia_hi,val_dia_lo, num_neighbor_ranks,neighbor_ranks,margin,ncomp_grids, ndiv_grids,num_grids,ioff_grids, div_direc_th, ipreflag, addL,iflagAS, itrmax,rtolmax,reshistory, iovlflag, precon_thblock, independ_nvec, nBlock, gmres_m, gmres_GSflag, precision_A, precision_b, precision_x, precision_precon, iret )
A simultaneous linear equation system Ax = b is solved by the generalized minimum residual method (GMRES (m) method). Non-zero elements are stored in DDM format.
Parameter(dimension) | Type | Input/Output | Description |
---|---|---|---|
icomm | integer | in | MPI communicator |
x_hi(m) | double precision | in/out | in: initial vector (upper bits), out: solution vector (upper bits) |
x_lo(m) | double precision | in/out | in: initial vector (lower bits), out: solution vector (lower bits) |
b_hi(m) | double precision | in | Right hand side vector (upper bits) |
b_lo(m) | double precision | in | Right hand side vector (lower bits) |
m | integer | in | Size of vector on each process |
nnd | integer | in | Number of diagonal elements arrays on each process |
ioff_dia(nnd) | integer | in | Offset of each non-zero diagonal elements array in DDM format |
val_dia_hi(m*nnd) | double precision | in | Non-zero elements of matrix stored in DDM format (upper bit) |
val_dia_lo(m*nnd) | double precision | in | Non-zero elements of matrix stored in DDM format (lower bit) |
num_neighbor_ranks | integer | in | Number of neighboring processes |
neighbor_ranks(num_neighbor_ranks) | integer | in | List of neighboring processes |
margin | integer | in | Maximum absolute value of ioff_grids |
ncomp_grids | integer | in | Dimension of structured grids |
ndiv_grids(ncomp_grids) | integer | in | Number of decomposed domains in each direction |
num_grids(ncomp_grids) | integer | in | Number of grids on each process |
ioff_grids(nnd,ncomp_grids) | integer | in | Offset of stencil data location in each direction |
div_direc_th | integer | in | Direction of thread parallelization ( 1:x, 2:y, 3:z) |
ipreflag | integer | in | Flag for preconditioner (0: none, 1: point Jacobi, 2: block Jacobi, 3: additive Schwarz) |
addL | integer | in | Overlapping width in the additive Schwarz method |
iflagAS | integer | in | Flag for the additive Schwarz method (1: BASIC, 2: RESTRICT, 3: INTERPOLATE, 4: NONE) |
itrmax | integer | in/out | in: maximum number of iterations, out: number of iterations |
rtolmax | double precision | in | convergence criterion (norm of relative residual error) |
reshistory(itrmax) | double precision | out | History of relative residual error |
iovlflag | integer | in | Flag for communication-computation overlap (0: none, 1: all processes, 2: each process) |
precon_thblock | integer | in | Number of blocks for each process of Block Jacobi preconditioning and Additive Schwarz preconditioning |
independ_nvec | integer | in | Size of diagonal blocks to be calculated independently in subdivision preconditioning |
nBlock | integer | in | Number of rows allocated to the thread cyclically. |
gmres_m | integer | in | Number of iterations until restart |
gmres_GSflag | integer | in | Flag for orthogonalization algorithm (1: MGS, 2: CGS) |
precision_A | integer | in | Precision of matrix (1: double precision, 2: quad precision) |
precision_b | integer | in | Precision of right hand side vector (1: double precision, 2: quad precision) |
precision_x | integer | in | Precision of solution vector (1: double precision, 2: quad precision) |
precision_precon | integer | in | Precision of preconditioner matrix (1: double precision, 2: quad precision) |
iret | integer | out | Error code (0:normal) |
call parcel_dqr(n,s,icomm,V,Q,R,iQRflag)
The QR factorization of an m*s matrix is computed, where m is the sum of the parameter n over the MPI processes belonging to the MPI communicator icomm.
Parameter(dimension) | Type | Input/Output | Description |
---|---|---|---|
n | integer*4 / integer*8 | in | Size of rows in the matrix on each process |
s | integer*4 / integer*8 | in | Size of column in the matrix |
icomm | integer | in | MPI communicator |
V(s*n) | double precision | in | Input Matrix (Column Major) |
Q(s*n) | double precision | out | Orthogonal matrix (Column Major) |
R(s*s) | double precision | out | Upper triangular matrix (Column Major) |
iQRflag | integer | in | Flag for QR factorization (1:MGS,2:CGS,3:TSQR,4:CholeskyQR,5:CholeskyQR2) |
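A minimal usage sketch of parcel_dqr is shown below; the local row count n and the data placed in V are placeholders, and MPI is assumed to have been initialized already.

```fortran
! Sketch: QR factorization of a tall-skinny matrix distributed by rows.
! Each rank holds n local rows of the global m*s matrix, stored column major.
integer, parameter :: s = 4
integer :: n, iQRflag
real*8, allocatable :: V(:), Q(:), R(:)

n = 1000                      ! local row count; the sum over ranks gives m
allocate(V(s*n), Q(s*n), R(s*s))
call random_number(V)         ! fill the local block with test data
iQRflag = 3                   ! TSQR

call parcel_dqr(n, s, MPI_COMM_WORLD, V, Q, R, iQRflag)
! On return, Q holds the local rows of the orthogonal factor and
! R the s*s upper triangular factor.
```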
The interface of each CUDA support routine for Fortran is
described. These routines can be replaced with other implementations
such as CUDA Fortran and OpenACC.
call cudaMalloc_FP64(a_d,n)
Allocates GPU memory of type double precision and assigns it to the type(c_ptr) member mem of the type(devicemem) structure.
Parameter(dimension) | Type | Input/Output | Description |
---|---|---|---|
a_d | type(devicemem) | out | PARCEL GPU memory structure |
n | integer*4 / integer*8 | in | Number of elements to allocate |
call cudaMalloc_PARCEL_INT(a_d,n)
Allocates GPU memory of type integer*4 in the 32-bit integer version, or integer*8 in the 64-bit integer version, and assigns it to the type(c_ptr) member mem of the type(devicemem) structure.
Parameter(dimension) | Type | Input/Output | Description |
---|---|---|---|
a_d | type(devicemem) | out | PARCEL GPU memory structure |
n | integer*4 / integer*8 | in | Number of elements to allocate |
call cudaMalloc_INT(a_d,n)
Allocates GPU memory of type integer*4 and assigns it to the type(c_ptr) member mem of the type(devicemem) structure.
Parameter(dimension) | Type | Input/Output | Description |
---|---|---|---|
a_d | type(devicemem) | out | PARCEL GPU memory structure |
n | integer*4 / integer*8 | in | Number of elements to allocate |
call cudaFree(a_d)
Frees GPU memory.
Parameter(dimension) | Type | Input/Output | Description |
---|---|---|---|
a_d | type(devicemem) | out | PARCEL GPU memory structure |
call cudaMemcpy_h2d_FP64(a_h,a_d,n,offset_h,offset_d)
Copies FP64 (double precision) data from CPU memory to GPU memory.
Parameter(dimension) | Type | Input/Output | Description |
---|---|---|---|
a_h | double precision | in | Source CPU memory |
a_d | type(devicemem) | out | Destination PARCEL GPU memory structure |
n | integer*4 / integer*8 | in | Number of elements to copy |
offset_h | integer*4 / integer*8 | in | Offset from the first address of the source array a_h |
offset_d | integer*4 / integer*8 | in | Offset from the first address of the destination array a_d |
call cudaMemcpy_h2d_PARCEL_INT(a_h,a_d,n,offset_h,offset_d)
Copies integer data (integer*4 in the 32-bit integer version, integer*8 in the 64-bit integer version) from CPU memory to GPU memory.
Parameter(dimension) | Type | Input/Output | Description |
---|---|---|---|
a_h | integer*4 / integer*8 | in | Source CPU memory |
a_d | type(devicemem) | out | Destination PARCEL GPU memory structure |
n | integer*4 / integer*8 | in | Number of elements to copy |
offset_h | integer*4 / integer*8 | in | Offset from the first address of the source array a_h |
offset_d | integer*4 / integer*8 | in | Offset from the first address of the destination array a_d |
call cudaMemcpy_h2d_INT(a_h,a_d,n,offset_h,offset_d)
Copies integer*4 data from CPU memory to GPU memory.
Parameter(dimension) | Type | Input/Output | Description |
---|---|---|---|
a_h | integer*4 | in | Source CPU memory |
a_d | type(devicemem) | out | Destination PARCEL GPU memory structure |
n | integer*4 / integer*8 | in | Number of elements to copy |
offset_h | integer*4 / integer*8 | in | Offset from the first address of the source array a_h |
offset_d | integer*4 / integer*8 | in | Offset from the first address of the destination array a_d |
call cudaMemcpy_d2h_FP64(a_d,a_h,n,offset_d,offset_h)
Copies FP64 (double precision) data from GPU memory to CPU memory.
Parameter(dimension) | Type | Input/Output | Description |
---|---|---|---|
a_d | type(devicemem) | in | Source PARCEL GPU memory structure |
a_h | double precision | out | Destination CPU memory |
n | integer*4 / integer*8 | in | Number of elements to copy |
offset_d | integer*4 / integer*8 | in | Offset from the first address of the source array a_d |
offset_h | integer*4 / integer*8 | in | Offset from the first address of the destination array a_h |
call cudaMemcpy_d2h_PARCEL_INT(a_d,a_h,n,offset_d,offset_h)
Copies integer data (integer*4 in the 32-bit integer version, integer*8 in the 64-bit integer version) from GPU memory to CPU memory.
Parameter(dimension) | Type | Input/Output | Description |
---|---|---|---|
a_d | type(devicemem) | in | Source PARCEL GPU memory structure |
a_h | integer*4 / integer*8 | out | Destination CPU memory |
n | integer*4 / integer*8 | in | Number of elements to copy |
offset_d | integer*4 / integer*8 | in | Offset from the first address of the source array a_d |
offset_h | integer*4 / integer*8 | in | Offset from the first address of the destination array a_h |
call cudaMemcpy_d2h_INT(a_d,a_h,n,offset_d,offset_h)
Copies integer*4 data from GPU memory to CPU memory.
Parameter(dimension) | Type | Input/Output | Description |
---|---|---|---|
a_d | type(devicemem) | in | Source PARCEL GPU memory structure |
a_h | integer*4 | out | Destination CPU memory |
n | integer*4 / integer*8 | in | Number of elements to copy |
offset_d | integer*4 / integer*8 | in | Offset from the first address of the source array a_d |
offset_h | integer*4 / integer*8 | in | Offset from the first address of the destination array a_h |
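To illustrate how the offset arguments of the copy routines fit together (a sketch, assuming zero-based offsets as suggested by the offset_zero constant used in the sample programs below), the following fragment copies a sub-range of a host array to the device and back:

```fortran
! Sketch: partial host/device copies with a non-zero host-side offset.
! Copies nn elements starting at host offset off_h into the beginning of
! the device buffer, then copies them back to the same host location.
integer(kind=intType), parameter :: nn = 100, off_h = 100, off_d = 0
real*8 :: a_h(1000)
type(devicemem) :: a_d

a_h = 0.0d0
call cudaMalloc_FP64(a_d, nn)
call cudaMemcpy_h2d_FP64(a_h, a_d, nn, off_h, off_d)
call cudaMemcpy_d2h_FP64(a_d, a_h, nn, off_d, off_h)
call cudaFree(a_d)
```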
The use of the PARCEL routines is illustrated by Fortran sample programs for a CG solver. If the preprocessor option CUDA_SOLVER is enabled, a CUDA version of each sample program is generated.
A sample code in CRS format is shown below. Here, make_matrix_CRS denotes an arbitrary user routine that generates a matrix in CRS format for PARCEL.
```fortran
program main
  use mpi
  use PARCEL_integer
  use krylov
  use PARCEL_wrapper
  implicit none
  integer n,gn,nnz,istart
  real*8,allocatable  :: crsA(:)
  integer,allocatable :: crsRow_ptr(:),crsCol(:)
  real*8,allocatable  :: vecx(:)
  real*8,allocatable  :: vecb(:)
  integer itrmax
  real*8 rtolmax
  real*8, allocatable :: reshistory(:)
  integer ipreflag
  integer ILU_method
  integer iflagAS
  integer addL
  integer iovlflag
  integer precon_thblock
  integer independ_nvec
  integer nblock
  integer iret
  integer myrank
  integer ierr
#ifdef CUDA_SOLVER
  type(devicemem) cu_vecx
  type(devicemem) cu_vecb
  type(devicemem) cu_crsA
  type(devicemem) cu_crsRow_ptr
  type(devicemem) cu_crsCol
  integer(kind=intType),parameter :: offset_zero=0
#endif

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD,myrank,ierr)
  call PARCEL_Init(myrank)

! generate the matrix in CRS format (user routine)
  call make_matrix_CRS(n,gn,nnz,istart,crsA,crsRow_ptr,crsCol)

  allocate(vecx(n))
  allocate(vecb(n))

! solver control parameters
  ipreflag=0
  ILU_method=1
  addL=0
  iflagAS=1
  itrmax=100
  rtolmax=1.0d-8
  iovlflag=0
  precon_thblock=-1
  independ_nvec=-1
  nblock = 2000

! reshistory must be allocated after itrmax has been set
  allocate(reshistory(itrmax))

  vecb=1.0d0
  vecx=1.0d0

#ifdef CUDA_SOLVER
! allocate GPU memory and copy the matrix and vectors to the device
  call cudaMalloc_FP64(cu_vecx,n)
  call cudaMalloc_FP64(cu_vecb,n)
  call cudaMalloc_FP64(cu_crsA,nnz)
  call cudaMalloc_PARCEL_INT(cu_crsRow_ptr,n+1)
  call cudaMalloc_PARCEL_INT(cu_crsCol,nnz)
  call cudaMemcpy_h2d_fp64(vecx,cu_vecx,n,offset_zero,offset_zero)
  call cudaMemcpy_h2d_fp64(vecb,cu_vecb,n,offset_zero,offset_zero)
  call cudaMemcpy_h2d_fp64(crsA,cu_crsA,nnz,offset_zero,offset_zero)
  call cudaMemcpy_h2d_PARCEL_INT(crsRow_ptr,cu_crsRow_ptr,n+1,offset_zero,offset_zero)
  call cudaMemcpy_h2d_PARCEL_INT(crsCol,cu_crsCol,nnz,offset_zero,offset_zero)
#endif

  call parcel_dcg( &
       MPI_COMM_WORLD, &
#ifdef CUDA_SOLVER
       cu_vecx%mem,cu_vecb%mem, &
       n,gn,nnz,istart, &
       cu_crsA%mem,cu_crsRow_ptr%mem,cu_crsCol%mem, &
#else
       vecx,vecb, &
       n,gn,nnz,istart, &
       crsA,crsRow_ptr,crsCol, &
#endif
       ipreflag,ILU_method,addL,iflagAS, &
       itrmax,rtolmax, &
       reshistory, &
       iovlflag, &
       precon_thblock, independ_nvec, &
       nblock, &
       iret &
       )

  call PARCEL_Free()
  call MPI_Finalize(ierr)

#ifdef CUDA_SOLVER
! copy the solution back to the host and release GPU memory
  call cudaMemcpy_d2h_fp64(cu_vecx,vecx,n,offset_zero,offset_zero)
  call cudaFree(cu_vecx)
  call cudaFree(cu_vecb)
  call cudaFree(cu_crsA)
  call cudaFree(cu_crsRow_ptr)
  call cudaFree(cu_crsCol)
#endif

  deallocate(vecx)
  deallocate(vecb)
  deallocate(reshistory)
end program main
```
A sample code in DIA format is shown below. Here, make_matrix_DIA denotes an arbitrary routine, which generates a matrix in DIA format in PARCEL.
program main
  use mpi
  use PARCEL_integer
  use krylov
  use PARCEL_wrapper
  implicit none
  integer n,gn,nnd,istart
  real*8,allocatable :: diaA(:)
  integer,allocatable :: offset(:)
  real*8,allocatable :: vecx(:)
  real*8,allocatable :: vecb(:)
  integer itrmax
  real*8 rtolmax
  real*8, allocatable :: reshistory(:)
  integer ipreflag
  integer ILU_method
  integer iflagAS
  integer addL
  integer iovlflag
  integer precon_thblock
  integer independ_nvec
  integer nblock
  integer iret
  integer myrank
  integer ierr
#ifdef CUDA_SOLVER
  type(devicemem) cu_vecx
  type(devicemem) cu_vecb
  type(devicemem) cu_diaA
  type(devicemem) cu_offset
  integer(kind=intType),parameter :: offset_zero=0
#endif

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD,myrank,ierr)
  call PARCEL_Init(myrank)

! generate the matrix in DIA format (user routine)
  call make_matrix_DIA(n,gn,nnd,istart,diaA,offset)

  allocate(vecx(n))
  allocate(vecb(n))

! solver parameters
  ipreflag=0
  ILU_method=1
  addL=0
  iflagAS=1
  itrmax=100
  rtolmax=1.0d-8
  iovlflag=0
  precon_thblock=-1
  independ_nvec=-1
  nblock = 2000
  allocate(reshistory(itrmax))   ! allocated after itrmax has been set

  vecb=1.0d0
  vecx=1.0d0

#ifdef CUDA_SOLVER
! allocate GPU memory and copy the matrix and vectors to the GPU
  call cudaMalloc_FP64(cu_vecx,n)
  call cudaMalloc_FP64(cu_vecb,n)
  call cudaMalloc_FP64(cu_diaA,n*nnd)
  call cudaMalloc_PARCEL_INT(cu_offset,nnd)
  call cudaMemcpy_h2d_fp64(vecx,cu_vecx,n,offset_zero,offset_zero)
  call cudaMemcpy_h2d_fp64(vecb,cu_vecb,n,offset_zero,offset_zero)
  call cudaMemcpy_h2d_fp64(diaA,cu_diaA,n*nnd,offset_zero,offset_zero)
  call cudaMemcpy_h2d_PARCEL_INT(offset,cu_offset,nnd,offset_zero,offset_zero)
#endif

  call parcel_dcg_dia( &
       MPI_COMM_WORLD, &
#ifdef CUDA_SOLVER
       cu_vecx%mem,cu_vecb%mem, &
       n,gn,istart, &
       cu_diaA%mem,cu_offset%mem,nnd, &
#else
       vecx,vecb, &
       n,gn,istart, &
       diaA,offset,nnd, &
#endif
       ipreflag,ILU_method,addL,iflagAS, &
       itrmax,rtolmax, &
       reshistory, &
       iovlflag, precon_thblock, independ_nvec, &
       nblock, &
       iret &
       )

#ifdef CUDA_SOLVER
! copy the solution back to the CPU and release the GPU memory
  call cudaMemcpy_d2h_fp64(cu_vecx,vecx,n,offset_zero,offset_zero)
  call cudaFree(cu_vecx)
  call cudaFree(cu_vecb)
  call cudaFree(cu_diaA)
  call cudaFree(cu_offset)
#endif

  deallocate(vecx)
  deallocate(vecb)
  deallocate(reshistory)

  call PARCEL_Free()
  call MPI_Finalize(ierr)
end program main
A Fortran sample code in DDM format is shown below. Here, make_matrix_DDM denotes an arbitrary routine, which generates a matrix in DDM format in PARCEL.
program main
  use mpi
  use PARCEL_integer
  use krylov
  use ddm_wrapper
  implicit none
  integer ityp_eq
  integer gnx,gny,gnz
  integer itrmax
  integer MAX_NITER
  integer nx,ny,nz
  integer nx0,ny0,nz0
  integer n,gn
  integer m
  integer m_
  integer npes,myrank
  integer ierr
  integer i
  integer ipreflag
  integer addL
  integer iflagAS
  real*8 rtolmax
  real*8 abstolmax
  real*8 stime,etime
  integer solver,ityp_solver
  integer iret
  real*8,allocatable :: reshistory_DDM(:)
  integer ILU_method
  integer iovlflag
  real*8,allocatable :: vecx(:)
  real*8,allocatable :: vecx0(:)
  real*8,allocatable :: vecb(:)
  integer niter,restart
  integer npey,npez
  integer precon_thblock
  integer independ_nvec
  integer nblock
  integer npe_x,npe_y,npe_z
  integer div_direc_th
  integer rank_x, rank_y, rank_z
  integer,parameter :: nnd = 7
  integer,parameter :: margin = 1
  integer,parameter :: ndim = 3
  integer npe_dim(ndim)
  integer gn_grids(ndim)
  integer n_grids(ndim)
  integer istart_grids(ndim)
  integer,parameter :: num_neighbor_ranks_max = 3**ndim - 1
  integer neighbor_ranks( num_neighbor_ranks_max )
  integer num_neighbor_ranks
  real*8, allocatable :: diaA(:)
  integer, allocatable :: offset(:)
  integer, allocatable :: offset_dim(:,:)
#ifdef CUDA_SOLVER
  type(devicemem) cu_vecx
  type(devicemem) cu_vecb
  type(devicemem) cu_diaA
  type(devicemem) cu_offset
  integer(kind=intType),parameter :: offset_zero=0
#endif

  namelist/input/ &
       ityp_eq, &
       gnx, &
       gny, &
       gnz, &
       MAX_NITER, &
       rtolmax, &
       abstolmax, &
       ipreflag, &
       independ_nvec, &
       addL, &
       iflagAS, &
       ILU_method, &
       iovlflag, &
       ityp_solver, &
       precon_thblock, &
       nblock
  namelist/input_ddm/ &
       npe_x, &
       npe_y, &
       npe_z, &
       div_direc_th

  open(33,file='input_namelist')
  read(33,nml=input)
  read(33,nml=input_ddm)
  close(33)

  stime = 0.0d0
  etime = 0.0d0
  itrmax = MAX_NITER
  solver = ityp_solver

  call MPI_Init(ierr)
  call MPI_Comm_size(MPI_COMM_WORLD,npes,ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD,myrank,ierr)
  call PARCEL_Init(myrank)

  gn_grids(1) = gnx
  gn_grids(2) = gny
  gn_grids(3) = gnz

! set up the process grid, domain decomposition, and neighboring processes (user routines)
  call set_rank_xyz( myrank, npe_x, npe_y, npe_z, rank_x, rank_y, rank_z )
  call set_grids_decompostion( &
       npe_x, npe_y, npe_z, &
       rank_x, rank_y, rank_z, &
       ndim, gn_grids, npe_dim, n_grids, &
       istart_grids, margin, m )

  num_neighbor_ranks = -1
  call set_neighbor_ranks( &
       npe_x, npe_y, npe_z, rank_x, rank_y, rank_z, &
       num_neighbor_ranks, &
       neighbor_ranks, num_neighbor_ranks_max )

  allocate( diaA(nnd*m) )
  allocate( offset(nnd) )
  allocate( offset_dim(nnd,ndim) )
  allocate( reshistory_DDM(MAX_NITER) )
  allocate( vecx(m) )
  allocate( vecx0(m) )
  allocate( vecb(m) )

  diaA = 0.0d0
  vecx = 0.0d0
  vecx0 = 0.0d0
  vecb = 0.0d0
  reshistory_DDM = 0.0d0
  offset = 0
  offset_dim = 0

! generate the matrix in DDM format (user routine)
  call make_matrix_DDM( &
       gnx,gny,gnz, &
       ndim, n_grids, &
       istart_grids, &
       margin, nnd, m, &
       offset_dim, offset, &
       diaA, vecb, vecx, vecx0, &
       rank_x, rank_y, rank_z, &
       npe_x, npe_y, npe_z )

#ifdef CUDA_SOLVER
! allocate GPU memory and copy the matrix and vectors to the GPU
  call cudaMalloc_FP64(cu_vecx,m)
  call cudaMalloc_FP64(cu_vecb,m)
  call cudaMalloc_FP64(cu_diaA,m*nnd)
  call cudaMalloc_INT(cu_offset,nnd)
  call cudaMemcpy_h2d_fp64(vecx,cu_vecx,m,offset_zero,offset_zero)
  call cudaMemcpy_h2d_fp64(vecb,cu_vecb,m,offset_zero,offset_zero)
  call cudaMemcpy_h2d_fp64(diaA,cu_diaA,m*nnd,offset_zero,offset_zero)
  call cudaMemcpy_h2d_INT(offset,cu_offset,nnd,offset_zero,offset_zero)
#endif

  call parcel_dcg_ddm( &
       MPI_COMM_WORLD, &
#ifdef CUDA_SOLVER
       cu_vecx%mem,cu_vecb%mem, &
       m, &
       nnd, cu_offset%mem, cu_diaA%mem, &
#else
       vecx,vecb, &
       m, &
       nnd, offset, diaA, &
#endif
       num_neighbor_ranks, neighbor_ranks, margin, ndim, &
       npe_dim, n_grids, offset_dim, &
       div_direc_th, &
       ipreflag, &
       addL, iflagAS, &
       MAX_NITER, rtolmax, reshistory_DDM, &
       iovlflag, &
       precon_thblock, &
       independ_nvec, &
       nblock, &
       iret &
       )

#ifdef CUDA_SOLVER
! copy the solution back to the CPU and release the GPU memory
  call cudaMemcpy_d2h_fp64(cu_vecx,vecx,m,offset_zero,offset_zero)
  call cudaFree(cu_vecx)
  call cudaFree(cu_vecb)
  call cudaFree(cu_diaA)
  call cudaFree(cu_offset)
#endif

  call MPI_Barrier(MPI_COMM_WORLD,ierr)

  deallocate( diaA )
  deallocate( offset )
  deallocate( offset_dim )
  deallocate( reshistory_DDM )
  deallocate( vecx )
  deallocate( vecx0 )
  deallocate( vecb )

  call PARCEL_Free()
  call MPI_Finalize(ierr)
end program main
The PARCEL routines can also be used in C programs. A Fortran routine can be called from C by adding “_” to the end of the routine name and passing each argument as a pointer. Regarding the use of MPI, the C MPI communicator should be converted to a Fortran MPI communicator via MPI_Comm_c2f in the MPI library. The use of the PARCEL routines is illustrated below with C sample programs of a CG solver. If the preprocessor option CUDA_SOLVER is enabled, a CUDA version of each sample program is generated.
A sample code in CRS format is shown below. Here, make_matrix_CRS denotes an arbitrary routine, which generates a matrix in CRS format in PARCEL.
#include < stdio.h > #include < stdlib.h > #include "mpi.h" #ifdef CUDA_SOLVER #include < cuda.h > #include < cuda_runtime.h > #endif int main( int argc, char *argv[]){ int npes; int myrank; MPI_Fint icomm_fort; MPI_Init(&argc,&argv); MPI_Comm_size(MPI_COMM_WORLD, &npes); MPI_Comm_rank(MPI_COMM_WORLD, &myrank); icomm_fort=MPI_Comm_c2f(MPI_COMM_WORLD); parcel_init_c_(&myrank); int n,gn,nnz,istart; double* crsA; int* crsRow_ptr; int* crsCol; make_matrix_CRS(n,gn,nnz,istart,crsA,crsRow_ptr,crsCol) double *vecx = (double *)malloc(sizeof(double)*n); double *vecb = (double *)malloc(sizeof(double)*n); for(int i=0;i < n;i++){ vecx[i] = 0.0; vecb[i] = 1.0; } int ipreflag = 0; int ilu_method = 1; int addl = 0; int iflagas = 1; int max_niter = 100; int rtolmax = 1.0e-8; double *reshistory = (double *)malloc(sizeof(double)*max_niter); int iovlflag = 0; int precon_thblock = -1; int independ_nvec = -1; int nBlock = 2000; #ifdef CUDA_SOLVER typedef struct Fortran_GPUMem{ void *a_d; } devicemem; size_t len = sizeof(double)*n; devicemem cu_vecx;cudaMalloc(&(cu_vecx.a_d),len); cudaMemcpy( cu_vecx.a_d, vecx, len, cudaMemcpyHostToDevice); devicemem cu_vecb;cudaMalloc(&(cu_vecb.a_d),len); cudaMemcpy( cu_vecb.a_d, vecb, len, cudaMemcpyHostToDevice); len = sizeof(double)*nnz; devicemem cu_crsa;cudaMalloc(&(cu_crsa.a_d),len); cudaMemcpy( cu_crsa.a_d, crsa, len, cudaMemcpyHostToDevice); len = sizeof(parcel_int)*nnz; devicemem cu_crscol;cudaMalloc(&(cu_crscol.a_d),len); cudaMemcpy( cu_crscol.a_d, crscol, len, cudaMemcpyHostToDevice); len = sizeof(parcel_int)*(n+1); devicemem cu_crsrow_ptr;cudaMalloc(&(cu_crsrow_ptr.a_d),len); cudaMemcpy( cu_crsrow_ptr.a_d, crsrow_ptr, len, cudaMemcpyHostToDevice); #endif parcel_dcg_c_( &icomm_fort , #ifdef CUDA_SOLVER &cu_vecx, &cu_vecb, &n, &gn, &nnz, &istart, &cu_crsa , &cu_crsrow_ptr, &cu_crscol, #else vecx, vecb, &n, &gn, &nnz, &istart, crsa , crsrow_ptr, crscol, #endif &ipreflag , &ilu_method , &addl , &iflagas , &max_niter , &rtolmax , reshistory, &iovlflag , &precon_thblock , &independ_nvec , &nBlock, &iret ); #ifdef CUDA_SOLVER len = sizeof(double)*n; cudaMemcpy( vecx,cu_vecx.a_d, len, cudaMemcpyDeviceToHost); cudaFree(cu_vecx.a_d); cudaFree(cu_vecb.a_d); cudaFree(cu_crsa.a_d); cudaFree(cu_crscol.a_d); cudaFree(cu_crsrow_ptr.a_d); #endif free(vecx); free(vecb); free(reshistory); parcel_free_c_(); MPI_Finalize(); }
A sample code in DIA format is shown below. Here, make_matrix_DIA denotes an arbitrary routine, which generates a matrix in DIA format in PARCEL.
#include < stdio.h > #include < stdlib.h > #include "mpi.h" #ifdef CUDA_SOLVER #include < cuda.h > #include < cuda_runtime.h > #endif int main( int argc, char *argv[]){ int npes; int myrank; MPI_Fint icomm_fort; MPI_Init(&argc,&argv); MPI_Comm_size(MPI_COMM_WORLD, &npes); MPI_Comm_rank(MPI_COMM_WORLD, &myrank); icomm_fort=MPI_Comm_c2f(MPI_COMM_WORLD); parcel_init_c_(&myrank); int n,gn,nnd,istart; double* diaA; int* offset; make_matrix_DIA(n,gn,nnd,istart,diaA,offset) double *vecx = (double *)malloc(sizeof(double)*n); double *vecb = (double *)malloc(sizeof(double)*n); for(int i=0;i < n;i++){ vecx[i] = 0.0; vecb[i] = 1.0; } int ipreflag = 0; int ilu_method = 1; int addl = 0; int iflagas = 1; int max_niter = 100; int rtolmax = 1.0e-8; double *reshistory = (double *)malloc(sizeof(double)*max_niter); int iovlflag = 0; int precon_thblock = -1; int independ_nvec = -1; int nBlock = 2000; #ifdef CUDA_SOLVER size_t len = sizeof(double)*n; cudaMemcpy( cu_vecx.a_d, vecx, len, cudaMemcpyHostToDevice); cudaMemcpy( cu_vecb.a_d, vecb, len, cudaMemcpyHostToDevice); len = sizeof(double)*n*nnd; devicemem cu_diaA;cudaMalloc(&(cu_diaA.a_d),len); cudaMemcpy( cu_diaA.a_d, diaA, len, cudaMemcpyHostToDevice); len = sizeof(parcel_int)*nnd; devicemem cu_offset;cudaMalloc(&(cu_offset.a_d),len); cudaMemcpy( cu_offset.a_d, offset, len, cudaMemcpyHostToDevice); #endif parcel_dcg_dia_c_( &icomm_fort , #ifdef CUDA_SOLVER &cu_vecx, &cu_vecb, &n, &gn, &istart, &cu_diaA, &cu_offset, #else vecx, vecb, &n, &gn, &istart, diaA , offset, #endif &nnd, &ipreflag , &ilu_method , &addl , &iflagas , &max_niter , &rtolmax , reshistory, &iovlflag , &precon_thblock , &independ_nvec , &nBlock, &iret ); #ifdef CUDA_SOLVER len = sizeof(double)*n; cudaMemcpy( vecx,cu_vecx.a_d, len, cudaMemcpyDeviceToHost); cudaFree(cu_vecx.a_d); cudaFree(cu_vecb.a_d); cudaFree(cu_diaA.a_d); cudaFree(cu_offset.a_d); #endif free(vecx); free(vecb); free(reshistory); parcel_free_c_(); MPI_Finalize(); }
A C sample code in DDM format is shown below. Here, make_matrix_ddm denotes an arbitrary routine, which generates a matrix in DDM format in PARCEL. set_rank_xyz, set_grids_decompostion, and set_neighbor_ranks are arbitrary routines, which set up the process grid, the domain decomposition, and the neighboring processes in PARCEL format.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "mpi.h"
#ifdef CUDA_SOLVER
#include <cuda.h>
#include <cuda_runtime.h>
#endif

int main( int argc, char *argv[])
{
  int ityp_eq;
  int itrmax;
  int nx,ny,nz;
  int nx0,ny0,nz0;
  int n,gn;
  int istart,iend;
  int npes,myrank;
  int ierr;
  int i;
  int iret;
  int myranky,myrankz;
  int npey,npez;
  MPI_Fint icomm_fort;

  /* solver and decomposition parameters */
  int gnx = 100;
  int gny = 100;
  int gnz = 100;
  int max_niter = 100;
  double rtolmax = 1.0e-8;
  int ipreflag = 0;
  int independ_nvec = -1;
  int addl = 0;
  int iflagas = 1;
  int ilu_method = 1;
  int iovlflag = 0;
  int precon_thblock = -1;
  int nBlock = 2000;
  int npe_x = 1;
  int npe_y = 4;
  int npe_z = 1;
  int div_direc_th = 1;

  MPI_Init(&argc,&argv);
  MPI_Comm_size(MPI_COMM_WORLD, &npes);
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
  /* convert the C communicator to a Fortran communicator handle */
  icomm_fort = MPI_Comm_c2f(MPI_COMM_WORLD);
  parcel_init_c_(&myrank);

  int nnd = 7;
  int rank_x,rank_y,rank_z;
  set_rank_xyz( &myrank, &npe_x, &npe_y, &npe_z, &rank_x, &rank_y, &rank_z);

  int ndim = 3;
  int margin = 1;
  int *npe_dim      = (int *)malloc(sizeof(int)*ndim);
  int *gn_grids     = (int *)malloc(sizeof(int)*ndim);
  int *n_grids      = (int *)malloc(sizeof(int)*ndim);
  int *istart_grids = (int *)malloc(sizeof(int)*ndim);
  for(i=0;i<ndim;i++){
    npe_dim[i] = 0;
    gn_grids[i] = 0;
    n_grids[i] = 0;
    istart_grids[i] = 0;
  }

  int m = -1;
  gn_grids[0] = gnx;
  gn_grids[1] = gny;
  gn_grids[2] = gnz;
  /* set up the domain decomposition (user routine); m returns the local problem size */
  set_grids_decompostion( &npe_x, &npe_y, &npe_z,
                          &rank_x, &rank_y, &rank_z,
                          &ndim, gn_grids, npe_dim, n_grids,
                          istart_grids, &margin, &m );

  int num_neighbor_ranks_max = (int)pow(3,ndim) - 1;
  int *neighbor_ranks = (int *)malloc(sizeof(int)*num_neighbor_ranks_max);
  int num_neighbor_ranks = -1;
  for(i=0;i<num_neighbor_ranks_max;i++){
    neighbor_ranks[i] = -1;
  }
  set_neighbor_ranks( &npe_x, &npe_y, &npe_z,
                      &rank_x, &rank_y, &rank_z,
                      &num_neighbor_ranks,
                      neighbor_ranks, &num_neighbor_ranks_max);

  int *offset = (int *)malloc(sizeof(int)*nnd);
  for(i=0;i<nnd;i++){
    offset[i] = 0;
  }
  int *offset_dim = (int *)malloc(sizeof(int)*nnd*ndim);
  for(i=0;i<nnd*ndim;i++){
    offset_dim[i] = 0;
  }
  double *diaA = (double *)malloc(sizeof(double)*m*nnd);
  for(i=0;i<m*nnd;i++){
    diaA[i] = 0.0;
  }
  double *vecx  = (double *)malloc(sizeof(double)*m);
  double *vecx0 = (double *)malloc(sizeof(double)*m);
  double *vecb  = (double *)malloc(sizeof(double)*m);
  for(i=0;i<m;i++){
    vecx[i] = 0.0;
    vecx0[i] = 0.0;
    vecb[i] = 0.0;
  }
  double *reshistory_ddm = (double *)malloc(sizeof(double)*max_niter);
  for(i=0;i<max_niter;i++){
    reshistory_ddm[i] = 0.0;
  }

  int nnd_ = nnd;
  /* generate the matrix in DDM format (user routine) */
  make_matrix_ddm( &gnx, &gny, &gnz,
                   &ndim, n_grids, istart_grids,
                   &margin, &nnd_, &m,
                   offset_dim, offset,
                   diaA, vecb, vecx, vecx0,
                   &rank_x, &rank_y, &rank_z,
                   &npe_x, &npe_y, &npe_z);

#ifdef CUDA_SOLVER
  /* PARCEL GPU memory structure as seen from C */
  typedef struct Fortran_GPUMem{
    void *a_d;
  } devicemem;
  size_t len = sizeof(double)*m;
  devicemem cu_vecx; cudaMalloc(&(cu_vecx.a_d),len);
  cudaMemcpy( cu_vecx.a_d, vecx, len, cudaMemcpyHostToDevice);
  devicemem cu_vecb; cudaMalloc(&(cu_vecb.a_d),len);
  cudaMemcpy( cu_vecb.a_d, vecb, len, cudaMemcpyHostToDevice);
  len = sizeof(double)*m*nnd;
  devicemem cu_diaA; cudaMalloc(&(cu_diaA.a_d),len);
  cudaMemcpy( cu_diaA.a_d, diaA, len, cudaMemcpyHostToDevice);
  len = sizeof(int)*nnd;
  devicemem cu_offset; cudaMalloc(&(cu_offset.a_d),len);
  cudaMemcpy( cu_offset.a_d, offset, len, cudaMemcpyHostToDevice);
#endif

  parcel_dcg_ddm_c_(
    &icomm_fort,
#ifdef CUDA_SOLVER
    &cu_vecx, &cu_vecb,
    &m,
    &nnd, &cu_offset, &cu_diaA,
#else
    vecx, vecb,
    &m,
    &nnd, offset, diaA,
#endif
    &num_neighbor_ranks, neighbor_ranks, &margin, &ndim,
    npe_dim, n_grids, offset_dim,
    &div_direc_th,
    &ipreflag,
    &addl, &iflagas,
    &max_niter, &rtolmax, reshistory_ddm,
    &iovlflag,
    &precon_thblock, &independ_nvec,
    &nBlock,
    &iret );

#ifdef CUDA_SOLVER
  /* copy the solution back to the CPU and release the GPU memory */
  len = sizeof(double)*m;
  cudaMemcpy( vecx, cu_vecx.a_d, len, cudaMemcpyDeviceToHost);
  cudaFree(cu_diaA.a_d);
  cudaFree(cu_offset.a_d);
  cudaFree(cu_vecx.a_d);
  cudaFree(cu_vecb.a_d);
#endif

  free(offset);
  free(offset_dim);
  free(diaA);
  free(vecx);
  free(vecx0);
  free(vecb);
  free(reshistory_ddm);
  free(neighbor_ranks);
  free(npe_dim);
  free(gn_grids);
  free(n_grids);
  free(istart_grids);

  MPI_Barrier(MPI_COMM_WORLD);
  parcel_free_c_();
  MPI_Finalize();
  return 0;
}
[1] R. Barrett, M. Berry, T. F. Chan, et al., “Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods”, SIAM (1994).
[2] Y. Saad, “Iterative Methods for Sparse Linear Systems”, SIAM (2003).
[3] M. Hoemmen, “Communication-avoiding Krylov subspace methods”, Ph.D. thesis, University of California, Berkeley (2010).
[4] Y. Idomura, T. Ina, A. Mayumi, et al., “Application of a communication-avoiding generalized minimal residual method to a gyrokinetic five dimensional Eulerian code on many core platforms”, ScalA'17: 8th Workshop on Latest Advances in Scalable Algorithms for Large Scale Systems, pp. 1-8 (2017).
[5] R. Suda, L. Cong, D. Watanabe, et al., “Communication-Avoiding CG Method: New Direction of Krylov Subspace Methods towards Exa-scale Computing”, RIMS Kokyuroku, pp. 102-111 (2016).
[6] Y. Idomura, T. Ina, S. Yamashita, et al., “Communication avoiding multigrid preconditioned conjugate gradient method for extreme scale multiphase CFD simulations”, ScalA'18: 9th Workshop on Latest Advances in Scalable Algorithms for Large Scale Systems, pp. 17-24 (2018).
[7] T. Ina, Y. Idomura, T. Imamura, et al., “Iterative methods with mixed-precision preconditioning for ill-conditioned linear systems in multiphase CFD simulations”, ScalA'21: 12th Workshop on Latest Advances in Scalable Algorithms for Large Scale Systems (2021).
[8] A. Stathopoulos, K. Wu, “A block orthogonalization procedure with constant synchronization requirements”, SIAM J. Sci. Comput. 23, pp. 2165-2182 (2002).
[9] T. Fukaya, Y. Nakatsukasa, Y. Yanagisawa, et al., “CholeskyQR2: A Simple and Communication-Avoiding Algorithm for Computing a Tall-Skinny QR Factorization on a Large-Scale Parallel System”, ScalA'14: 5th Workshop on Latest Advances in Scalable Algorithms for Large Scale Systems, pp. 31-38 (2014).