This chapter describes the iterative methods contained in this library. The following notation is used: $Ax = b$ represents the linear system to be solved, where $A$ is an $n \times n$ sparse matrix and $x_0$ is the initial vector. $M$ represents an appropriate preconditioner.
The Conjugate Gradient (CG) method is an iterative Krylov subspace algorithm for solving linear systems with symmetric matrices.
The Biconjugate Gradient Stabilized (Bi-CGSTAB) method is an iterative Krylov subspace algorithm for solving linear systems with nonsymmetric matrices.
The Generalized Minimum Residual (GMRES(m)) method is an iterative Krylov subspace algorithm for solving linear systems with nonsymmetric matrices. GMRES(m) is restarted periodically: if it does not converge within $m$ iterations, the solution obtained after $m$ iterations is used as the initial guess for the next restart cycle. PARCEL provides variants of GMRES(m) with the Classical Gram-Schmidt (sometimes referred to as standard Gram-Schmidt) and Modified Gram-Schmidt (MGS) orthogonalization methods.
$R$ is an upper triangular matrix with $r_{ij}$ representing its elements, and $g_i$ denotes the vector consisting of the first $i$ elements of the vector $g$.
The Communication Avoiding Generalized Minimum Residual (CA-GMRES(s,t)) method is an iterative Krylov subspace algorithm for solving linear systems with nonsymmetric matrices. The computation performed in $s$ iterations of the GMRES(m) method is equivalent to one iteration of the CA-GMRES(s,t) method. CA-GMRES(s,t) is restarted periodically: if it does not converge within $s \cdot t$ iterations, the solution obtained so far is used as the initial guess for the next restart cycle. CA-GMRES(s,t) is equivalent to GMRES(m) if $s$ and $t$ are chosen so that $m = s \cdot t$, and its convergence property is the same as that of GMRES(m) in exact arithmetic. In addition, the number of global collective communication calls is reduced by a communication-avoiding QR factorization. As CA-GMRES(s,t) produces $s$ basis vectors at once, the linear independence of the basis vectors may deteriorate because of round-off errors when $s$ is too large, leading to worse convergence than GMRES(m). In order to improve the orthogonality of the basis vectors, the PARCEL implementation of CA-GMRES(s,t) provides an option to use the Newton basis in addition to the monomial basis.
$e_i$ denotes a unit vector, $\lambda_i$ denotes the eigenvalues obtained from the Hessenberg matrix generated by the iteration process, and the elements of a matrix are denoted by the corresponding lowercase letter with row and column subscripts (for example, $h_{ij}$ for the $i$-th row, $j$-th column element of the Hessenberg matrix $H$).
In iterative algorithms, the application of a preconditioner $M \approx A$ can help to reduce the number of iterations until convergence. The notation $M^{-1}v$ means solving the system $Mz = v$ with respect to the vector $z$. Although a more accurate approximation of $A$ leads to fewer iterations until convergence, the cost of applying the preconditioner normally increases. In sequential computing, one of the most common preconditioners is the incomplete LU factorization (ILU). In parallel computing, however, parallel preconditioners are needed. PARCEL provides the following parallel preconditioners: Point Jacobi, Block Jacobi, and Additive Schwarz.
The point Jacobi preconditioner is one of the simplest preconditioners: $M$ consists only of the diagonal elements of the matrix $A$. Compared to other preconditioners, the efficiency of the point Jacobi preconditioner is very low. However, it can easily be applied in parallel and it does not require any additional communication between processes.
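As a minimal sketch (an illustration added here, not the PARCEL implementation), applying the point Jacobi preconditioner amounts to dividing each vector element by the corresponding diagonal entry of $A$, which can be extracted once from the CRS arrays described later in this chapter:

```cpp
#include <cstddef>
#include <vector>

// Sketch of a point Jacobi preconditioner on CRS data (0-based indexing assumed).
// diag[i] holds a_ii; applying M^{-1} to a residual r is an element-wise division
// and therefore needs no inter-process communication.
std::vector<double> extract_diagonal(std::size_t n,
                                     const std::vector<int>& rowptr,    // CRS row pointers
                                     const std::vector<int>& col,       // CRS column indices
                                     const std::vector<double>& val) {  // CRS non-zero values
    std::vector<double> diag(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (int k = rowptr[i]; k < rowptr[i + 1]; ++k)
            if (static_cast<std::size_t>(col[k]) == i) diag[i] = val[k];
    return diag;
}

// z = M^{-1} r with M = diag(A)
void apply_point_jacobi(const std::vector<double>& diag,
                        const std::vector<double>& r,
                        std::vector<double>& z) {
    for (std::size_t i = 0; i < r.size(); ++i) z[i] = r[i] / diag[i];
}
```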
LU factorization decomposes a square matrix $A$ into a lower triangular matrix $L$ and an upper triangular matrix $U$. For a typical sparse matrix, the LU factors can be much less sparse than the original matrix, which is called fill-in. Fill-in increases the memory and computational requirements. In order to avoid this issue, PARCEL provides the zero fill-in incomplete LU factorization (ILU(0)).
The diagonal incomplete LU factorization (D-ILU) computes a diagonal matrix $D$, a lower triangular matrix $L$, and an upper triangular matrix $U$ for the input matrix $A$. The preconditioner is constructed as $M = L D^{-1} U$ with $L = D + L_A$ and $U = D + U_A$, where $L_A$ and $U_A$ are the strictly lower and strictly upper triangular parts of $A$. Two different conditions exist in order to construct the diagonal matrix $D$:
Condition 1: The diagonal elements of $M$ equal those of $A$.
Condition 2: The row sums of $M$ equal the row sums of $A$.
Given the conditions above, $D$ can be computed as follows.
If condition 1 is to be fulfilled, the following computation is used: $d_{ii} = a_{ii} - \sum_{j<i} a_{ij}\, d_{jj}^{-1}\, a_{ji}$.
If condition 2 is to be fulfilled, the following computation is used: $d_{ii} = a_{ii} - \sum_{j<i} a_{ij}\, d_{jj}^{-1} \sum_{k>j} a_{jk}$.
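For illustration only (assuming a small dense matrix stored row-major rather than PARCEL's sparse data layout), the diagonal $D$ for condition 1 can be built with a single forward sweep:

```cpp
#include <cstddef>
#include <vector>

// Sketch of building the D-ILU diagonal for condition 1 (diag(M) = diag(A)):
//   d_ii = a_ii - sum_{j<i} a_ij * d_jj^{-1} * a_ji
// The matrix is assumed dense and row-major (a[i*n+j]) purely for readability.
std::vector<double> dilu_diagonal(std::size_t n, const std::vector<double>& a) {
    std::vector<double> d(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        double dii = a[i * n + i];
        for (std::size_t j = 0; j < i; ++j)
            dii -= a[i * n + j] / d[j] * a[j * n + i];
        d[i] = dii;
    }
    return d;
}
```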
The Block Jacobi preconditioner constructs $M$ from the diagonal blocks of the matrix $A$. For each block, an incomplete LU factorization is computed. When the preconditioner is applied, each block can be processed independently, resulting in a high level of parallelism. In addition, no communication is needed when each block is defined within the sub-matrix held by one process.
Jacobi preconditioning approximately solves the system $Az = r$ by performing a fixed number of Jacobi iterations, $z^{(k+1)} = D^{-1}\left(r - (A - D)\, z^{(k)}\right)$, where $D$ is the diagonal part of $A$. Since there are no data dependencies within a single iteration, it is well suited to GPU-based computation.
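A minimal sketch of this fixed-iteration Jacobi preconditioner (an illustration added here, again assuming 0-based CRS arrays; PARCEL's GPU implementation differs):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Fixed number of Jacobi sweeps approximating z ~= A^{-1} r, used as a preconditioner.
// Iteration: z_{k+1} = D^{-1} ( r - (A - D) z_k ),  starting from z_0 = 0.
// Rows are independent within one sweep, so each row update can run in parallel.
void jacobi_precondition(std::size_t n, int iterations,
                         const std::vector<int>& rowptr,   // CRS row pointers
                         const std::vector<int>& col,      // CRS column indices
                         const std::vector<double>& val,   // CRS non-zero values
                         const std::vector<double>& r,
                         std::vector<double>& z) {
    std::vector<double> diag(n, 0.0), znew(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (int k = rowptr[i]; k < rowptr[i + 1]; ++k)
            if (static_cast<std::size_t>(col[k]) == i) diag[i] = val[k];

    std::fill(z.begin(), z.end(), 0.0);
    for (int it = 0; it < iterations; ++it) {
        for (std::size_t i = 0; i < n; ++i) {
            double sum = r[i];
            for (int k = rowptr[i]; k < rowptr[i + 1]; ++k)
                if (static_cast<std::size_t>(col[k]) != i) sum -= val[k] * z[col[k]];
            znew[i] = sum / diag[i];
        }
        z.swap(znew);
    }
}
```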
This section describes the QR decomposition algorithms that can be used in this library.
The classical Gram-Schmidt method computes the QR factorization by Gram-Schmidt orthogonalization. This algorithm has high parallelism but poor orthogonality.
The modified Gram-Schmidt method computes the QR factorization with a reordered variant of the classical Gram-Schmidt method that reduces round-off error. This algorithm improves the orthogonality over the classical Gram-Schmidt method, but requires more collective communication.
The Cholesky QR factorization consists of a matrix product and a Cholesky factorization. This algorithm has high computational intensity and can be computed with a single collective communication. However, its orthogonality is poor.
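The sketch below (an illustration added here, not PARCEL's implementation) shows the CholeskyQR idea for a tall-skinny matrix $V$ distributed by rows: the Gram matrix $G = V^T V$ is formed with a single MPI_Allreduce, $G$ is factorized as $G = R^T R$, and $Q = V R^{-1}$ is computed locally.

```cpp
#include <mpi.h>
#include <cmath>
#include <cstddef>
#include <vector>

// CholeskyQR sketch for a row-distributed tall-skinny matrix V (nloc x s, row-major).
// One MPI_Allreduce forms the global Gram matrix; everything else is process-local.
void cholesky_qr(std::size_t nloc, std::size_t s,
                 std::vector<double>& V,   // in: V, out: Q (same layout)
                 std::vector<double>& R,   // out: s x s upper triangular factor, row-major
                 MPI_Comm comm) {
    // Local Gram matrix contribution, then global sum (the single collective call).
    std::vector<double> G(s * s, 0.0);
    for (std::size_t i = 0; i < nloc; ++i)
        for (std::size_t j = 0; j < s; ++j)
            for (std::size_t k = 0; k < s; ++k)
                G[j * s + k] += V[i * s + j] * V[i * s + k];
    MPI_Allreduce(MPI_IN_PLACE, G.data(), static_cast<int>(s * s),
                  MPI_DOUBLE, MPI_SUM, comm);

    // Cholesky factorization G = R^T R (R upper triangular).
    R.assign(s * s, 0.0);
    for (std::size_t j = 0; j < s; ++j) {
        double d = G[j * s + j];
        for (std::size_t k = 0; k < j; ++k) d -= R[k * s + j] * R[k * s + j];
        R[j * s + j] = std::sqrt(d);
        for (std::size_t i = j + 1; i < s; ++i) {
            double v = G[j * s + i];
            for (std::size_t k = 0; k < j; ++k) v -= R[k * s + j] * R[k * s + i];
            R[j * s + i] = v / R[j * s + j];
        }
    }

    // Q = V R^{-1}: solve each local row against the upper triangular R in place.
    for (std::size_t i = 0; i < nloc; ++i) {
        for (std::size_t j = 0; j < s; ++j) {
            double v = V[i * s + j];
            for (std::size_t k = 0; k < j; ++k) v -= V[i * s + k] * R[k * s + j];
            V[i * s + j] = v / R[j * s + j];
        }
    }
}
```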
In order to save memory in sparse matrix computation, only the non-zero elements and their locations are stored. PARCEL supports different sparse matrix formats.
The CRS (Compressed Row Storage) format compresses the row indices of the non-zero elements of the sparse matrix. Together with the compressed row indices, the column index and the value of each non-zero element are stored in one-dimensional arrays. The compressed row indices are stored in increasing order. The following three arrays are stored (see the worked example below):
Values of non-zero elements :: the non-zero entries of the matrix, stored row by row
Compressed row indices :: for each row, the position in the value array at which that row starts (row pointers)
Column indices of non-zero elements :: the column index of each stored non-zero element
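As a worked example (added here for illustration, using 0-based indexing), the following 4×4 matrix and its three CRS arrays are written out as arrays:

```cpp
// Example matrix (9 non-zero elements):
//     | 4 1 0 0 |
// A = | 2 5 3 0 |
//     | 0 0 6 7 |
//     | 0 8 0 9 |
double crsA[]       = {4, 1, 2, 5, 3, 6, 7, 8, 9};    // values of non-zero elements
int    crsRow_ptr[] = {0, 2, 5, 7, 9};                 // row i occupies entries [crsRow_ptr[i], crsRow_ptr[i+1])
int    crsCol[]     = {0, 1, 0, 1, 2, 2, 3, 1, 3};     // column indices of non-zero elements
```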
This section explains how to install the SYCL version of PARCEL.
You can use it by including the header file ./Include/parcel_sycl.hpp and compiling your program with a SYCL-compatible compiler. Examples of compilation on SGI8600 (Intel Xeon Gold 6248R + NVIDIA V100) are shown below: the first example targets the CPU (OpenMP backend) and the second targets the GPU (CUDA backend).
module purge
module load gnu/9.5.0
module load openmpi/cur
. ~/spack/share/spack/setup-env.sh
spack load hipsycl
export sycl_root=`spack location -i hipsycl`
export llvm_root=`spack location -i llvm`
export LD_LIBRARY_PATH=$sycl_root/lib/:$llvm_root/lib/:${LD_LIBRARY_PATH}
compiler_base="acpp"
export MPICXX_CXX=${compiler_base}
export OMPI_CXX=${compiler_base}
src_name=sample.cpp
compiler_option=" -O3 --acpp-targets=omp -mlong-double-128 -march=native -mtune=native -Wpass-failed=transform-warning"
compiler=mpicxx
${compiler} ${compiler_option} -o a.out ${src_name}
module purge
module load gnu/9.5.0
module load cuda/12.6
module load openmpi/cur
. ~/spack/share/spack/setup-env.sh
spack load hipsycl
export sycl_root=`spack location -i hipsycl`
export llvm_root=`spack location -i llvm`
export cuda_root=/home/app/cuda/12.6
export ACPP_CUDA_PATH=$cuda_root
export LD_LIBRARY_PATH=$sycl_root/lib/:$llvm_root/lib/:$cuda_root/lib64:${LD_LIBRARY_PATH}
compiler_base="acpp"
export MPICXX_CXX=${compiler_base}
export OMPI_CXX=${compiler_base}
src_name=sample.cpp
compiler_option=" -O3 --acpp-targets=cuda:sm_70 -march=native -mtune=native -v"
compiler=mpicxx
${compiler} ${compiler_option} -o a.out ${src_name}
AdaptiveCpp can be installed from source code. For more details, please refer to the following page: https://github.com/AdaptiveCpp/AdaptiveCpp
One can also use spack to install AdaptiveCpp. Examples on SGI8600 (Intel Xeon Gold 6248R + NVIDIA V100) are shown below.
module purge
module load gnu/9.5.0
module load openmpi/cur
. ~/spack/share/spack/setup-env.sh
spack compiler find
spack external find
spack install llvm@18
spack install hipsycl % llvm@18
When the CPU architectures differ between login nodes and compute nodes, as in Fugaku or Wisteria Odyssey, the installation should be performed on the compute nodes.
module purge
module load gnu/9.5.0
module load cuda/12.6
module load openmpi/cur
. ~/spack/share/spack/setup-env.sh
spack compiler find
spack external find
spack external find cuda
spack install llvm@18
spack install hipsycl +cuda % llvm@18
This section describes the classes provided in the SYCL version of PARCEL.
The SYCL version of PARCEL uses the following data types (parceltype):
An MPI communicator class for the SYCL version of PARCEL.
**Syntax**
| Definition | Arguments |
|---|---|
| parcel::comm(MPI_Comm comm, sycl::queue* q) class parcel_comm; | comm: MPI communicator q: pointer to a sycl::queue |
A class for manipulating sparse matrices in the SYCL version of PARCEL.
| Definition | Arguments |
|---|---|
| template< class parceltype > matrix(sycl::queue* q) class parcel_matrix; | parceltype: Data precision of the CRS matrix q: Pointer to a sycl::queue |
| Function Name | Description | Arguments |
|---|---|---|
| set_crsMat<T>(comm,nstart,nend,n,nnz,rowptr,crsCol,mem) | Define a CRS-format matrix for MPI parallelization | T:Floating-point type parcel::comm* comm:MPI communicator for the SYCL version of PARCEL PARCEL_INT nstart:Starting row of the matrix handled by each process PARCEL_INT nend:Ending row of the matrix handled by each process size_t n:Size of the vector handled by each process size_t nnz:Number of non-zero elements in the coefficient matrix handled by each process PARCEL_INT* rowptr:Pointer table in CRS format PARCEL_INT* crsCol:Column indices of non-zero elements in the partitioned matrix stored in CRS format T* mem:Non-zero elements of the coefficient matrix |
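A minimal usage sketch based on the set_crsMat signature above (parcel_double is assumed here as the data precision; the complete flow appears in the sample program later in this chapter):

```cpp
#include <mpi.h>
#include "./Include/parcel_sycl.hpp"

// Sketch only: define a row-partitioned CRS matrix from local CRS arrays.
void setup_matrix(MPI_Comm mpi_comm, sycl::queue& q,
                  PARCEL_INT n, size_t nnz,
                  PARCEL_INT* crsRow_ptr, PARCEL_INT* crsCol, double* crsA,
                  parcel::matrix<parcel_double>& A) {
    parcel::comm comm(mpi_comm, &q);
    PARCEL_INT nstart = n * comm.mpi_rank();   // first global row owned by this rank
    PARCEL_INT nend   = nstart + n - 1;        // last global row owned by this rank
    A.set_crsMat(&comm, nstart, nend, n, nnz, crsRow_ptr, crsCol, crsA);
    q.wait();
}
```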
A class for manipulating vectors in the SYCL version of PARCEL.
| Definition | Arguments |
|---|---|
| template< class parceltype > vector(size_t n, sycl::queue* q) class parcel_vector; | parceltype: Data precision of the vector n: Number of array elements q: Pointer to a sycl::queue |
| Function Name | Description | Arguments |
|---|---|---|
| queue() | Retrieves the sycl::queue pointer | |
| resize(n, q) | Changes the number of array elements and the sycl::queue | size_t n: Number of array elements sycl::queue* q: Pointer to a sycl::queue |
| setValue<T>(mem) | Copies array elements | T: Floating-point type T* mem: Pointer to array elements |
| setValue_hi(mem) | Copies array elements to the high part of a double-double array | double* mem: Pointer to array elements |
| setValue_lo(mem) | Copies array elements to the low part of a double-double array | double* mem: Pointer to array elements |
| getValue<T>(mem) | Retrieves array elements | T: Floating-point type T* mem: Pointer to array elements |
| getValue<T>(mem_hi, mem_lo) | Retrieves double-double array elements | T: Floating-point type T* mem_hi: Pointer to high part T* mem_lo: Pointer to low part |
| getValue_hi<T>(mem) | Retrieves the high part of double-double array elements | T: Floating-point type T* mem: Pointer to array elements |
| getValue_lo<T>(mem) | Retrieves the low part of double-double array elements | T: Floating-point type T* mem: Pointer to array elements |
| size() | Retrieves the number of array elements | |
| data() | Retrieves the pointer to the beginning of the array | |
| data_hi() | Retrieves the pointer to the beginning of the high part of a double-double array | |
| data_lo() | Retrieves the pointer to the beginning of the low part of a double-double array | |
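A minimal usage sketch of the vector class (parcel_double is assumed here as the data precision), copying a host array in and out around a solver call:

```cpp
#include "./Include/parcel_sycl.hpp"

// Sketch only: create a PARCEL vector, copy host data in, and copy the result back.
void vector_roundtrip(sycl::queue& q, size_t n, double* host_x) {
    parcel::vector<parcel_double> v(n, &q);
    v.setValue(host_x);    // copy the host array into the PARCEL vector
    // ... a solver call would use v here ...
    v.getValue(host_x);    // copy the (updated) elements back to the host array
    q.wait();
}
```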
A class for the Preconditioned Conjugate Gradient (PCG) method.
| Definition | Arguments |
|---|---|
| template< class Tspmv = parcel_double, class Tprecon = parcel_double, class Taxpy = parcel_double, class Tdot = parcel_double > class pcg; | parceltype Tspmv: Precision for SpMV computation parceltype Tprecon: Precision for preconditioning parceltype Taxpy: Precision for Axpy computation parceltype Tdot: Precision for dot product computation |
| Function Name | Description | Arguments |
|---|---|---|
| solver< Tmat, Tvecx, Tvecb, Tvec, Tprecon_mat, Tprecon_vec >(Mat, x, b) | Executes the Conjugate Gradient method | parceltype Tmat: Precision of the coefficient matrix parceltype Tvecx: Precision of the solution vector parceltype Tvecb: Precision of the right-hand side vector parceltype Tvec: Precision for internal Krylov subspace data parceltype Tprecon_mat: Precision of the preconditioner matrix parceltype Tprecon_vec: Precision of the preconditioner vector parcel::matrix<Tmat> Mat: Coefficient matrix parcel::vector<Tvecx> x: Solution vector parcel::vector<Tvecb> b: Right-hand side vector |
| set_precon_algo(algo) | Sets the preconditioner for the Krylov subspace method (0: No preconditioner, 1: Point Jacobi, 2: Block Jacobi, 4: Jacobi) | PARCEL_INT algo |
| set_jacobi(iter) | Sets the number of iterations for the Jacobi method | PARCEL_INT iter |
| set_blockjacobi(nblock) | Specifies the number of blocks for the Block Jacobi method | PARCEL_INT nblock |
| set_scaling_matrix(flag) | Enables or disables scaling of the preconditioner matrix (false: disabled, true: enabled) | bool flag |
| set_scaling_vector(flag) | Enables or disables scaling of the preconditioner vector (false: disabled, true: enabled) | bool flag |
| set_tolerance(tol) | Sets the convergence threshold for the relative residual norm | double tol |
| set_max_iteration(itr) | Sets the maximum number of iterations for the Krylov subspace method | PARCEL_INT itr |
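A minimal usage sketch of the PCG class (parcel_double is assumed for every precision parameter; the full sample program later in this chapter shows the complete flow including matrix and vector setup):

```cpp
#include "./Include/parcel_sycl.hpp"

// Sketch only: configure and run the preconditioned CG solver described above.
void solve_with_pcg(parcel::matrix<parcel_double>& A,
                    parcel::vector<parcel_double>& x,
                    parcel::vector<parcel_double>& b) {
    parcel::krylov::pcg<parcel_double, parcel_double,
                        parcel_double, parcel_double> solver;
    solver.set_tolerance(1.0e-9);                        // relative residual threshold
    solver.set_max_iteration(1000);                      // iteration limit
    solver.set_precon_algo(parcel::PRECON_POINTJACOBI);  // point Jacobi preconditioning
    solver.solver<parcel_double, parcel_double, parcel_double,
                  parcel_double, parcel_double, parcel_double>(A, x, b);
}
```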
A class for the Preconditioned Bi-Conjugate Gradient Stabilized (BiCGStab) method.
| Definition | Arguments |
|---|---|
| template< class Tspmv = parcel_double, class Tprecon = parcel_double, class Taxpy = parcel_double, class Tdot = parcel_double > class pbicgstab; | parceltype Tspmv: Precision for SpMV computation parceltype Tprecon: Precision for preconditioning parceltype Taxpy: Precision for Axpy computation parceltype Tdot: Precision for dot product computation |
| Function Name | Description | Arguments |
|---|---|---|
| solver< Tmat, Tvecx, Tvecb, Tvec, Tprecon_mat, Tprecon_vec >(Mat, x, b) | Executes the Preconditioned BiCGStab method | parceltype Tmat: Precision of the coefficient matrix parceltype Tvecx: Precision of the solution vector parceltype Tvecb: Precision of the right-hand side vector parceltype Tvec: Precision for internal Krylov subspace data parceltype Tprecon_mat: Precision of the preconditioner matrix parceltype Tprecon_vec: Precision of the preconditioner vector parcel::matrix<Tmat> Mat: Coefficient matrix parcel::vector<Tvecx> x: Solution vector parcel::vector<Tvecb> b: Right-hand side vector |
| set_precon_algo(algo) | Sets the preconditioner for the Krylov subspace method (0: No preconditioner, 1: Point Jacobi, 2: Block Jacobi, 4: Jacobi) | PARCEL_INT algo |
| set_jacobi(iter) | Sets the number of iterations for the Jacobi method | PARCEL_INT iter |
| set_blockjacobi(nblock) | Specifies the number of blocks for the Block Jacobi method | PARCEL_INT nblock |
| set_scaling_matrix(flag) | Enables or disables scaling of the preconditioner matrix (false: disabled, true: enabled) | bool flag |
| set_scaling_vector(flag) | Enables or disables scaling of the preconditioner vector (false: disabled, true: enabled) | bool flag |
| set_tolerance(tol) | Sets the convergence threshold for the relative residual norm | double tol |
| set_max_iteration(itr) | Sets the maximum number of iterations for the Krylov subspace method | PARCEL_INT itr |
A class for the Preconditioned Generalized Minimal Residual (GMRES) method.
| Definition | Arguments |
|---|---|
| template< class Tspmv = parcel_double, class Tprecon = parcel_double, class Taxpy = parcel_double, class Tdot = parcel_double > class pgmres; | parceltype Tspmv: Precision for SpMV computation parceltype Tprecon: Precision for preconditioning parceltype Taxpy: Precision for Axpy computation parceltype Tdot: Precision for dot product computation |
| Function Name | Description | Arguments |
|---|---|---|
| solver< Tmat, Tvecx, Tvecb, Tvec, Tprecon_mat, Tprecon_vec, Tgs, Thess_mat >(Mat, x, b) | Executes the Preconditioned GMRES method | parceltype Tmat: Precision of the coefficient matrix parceltype Tvecx: Precision of the solution vector parceltype Tvecb: Precision of the right-hand side vector parceltype Tvec: Precision for internal Krylov subspace data parceltype Tprecon_mat: Precision of the preconditioner matrix parceltype Tprecon_vec: Precision of the preconditioner vector parceltype Tgs: Precision for Gram-Schmidt orthogonalization parceltype Thess_mat: Precision of the Hessenberg matrix parcel::matrix<Tmat> Mat: Coefficient matrix parcel::vector<Tvecx> x: Solution vector parcel::vector<Tvecb> b: Right-hand side vector |
| set_precon_algo(algo) | Sets the preconditioner for the Krylov subspace method (0: No preconditioner, 1: Point Jacobi, 2: Block Jacobi, 4: Jacobi) | PARCEL_INT algo |
| set_jacobi(iter) | Sets the number of iterations for the Jacobi method | PARCEL_INT iter |
| set_blockjacobi(nblock) | Specifies the number of blocks for the Block Jacobi method | PARCEL_INT nblock |
| set_scaling_matrix(flag) | Enables or disables scaling of the preconditioner matrix (false: disabled, true: enabled) | bool flag |
| set_scaling_vector(flag) | Enables or disables scaling of the preconditioner vector (false: disabled, true: enabled) | bool flag |
| set_tolerance(tol) | Sets the convergence threshold for the relative residual norm | double tol |
| set_max_iteration(itr) | Sets the maximum number of iterations for the Krylov subspace method | PARCEL_INT itr |
A class for the Preconditioned Communication-Avoiding Generalized Minimal Residual (CA-GMRES) method.
| Definition | Arguments |
|---|---|
| template< class Tspmv = parcel_double, class Tprecon = parcel_double, class Taxpy = parcel_double, class Tdot = parcel_double > class pcagmres; | parceltype Tspmv: Precision for SpMV computation parceltype Tprecon: Precision for preconditioning parceltype Taxpy: Precision for Axpy computation parceltype Tdot: Precision for dot product computation |
| Function Name | Description | Arguments |
|---|---|---|
| solver< Tmat, Tvecx, Tvecb, Tvec, Tprecon_mat, Tprecon_vec, Tbasis, TQR, Thess_mat >(Mat, x, b) | Executes the Communication-Avoiding GMRES method | parceltype Tmat: Precision of the coefficient matrix parceltype Tvecx: Precision of the solution vector parceltype Tvecb: Precision of the right-hand side vector parceltype Tvec: Precision for internal Krylov subspace data parceltype Tprecon_mat: Precision of the preconditioner matrix parceltype Tprecon_vec: Precision of the preconditioner vector parceltype Tbasis: Precision of the basis vectors parceltype TQR: Precision for QR decomposition parceltype Thess_mat: Precision of the Hessenberg matrix parcel::matrix<Tmat> Mat: Coefficient matrix parcel::vector<Tvecx> x: Solution vector parcel::vector<Tvecb> b: Right-hand side vector |
| set_precon_algo(algo) | Sets the preconditioner for the Krylov subspace method (0: No preconditioner, 1: Point Jacobi, 2: Block Jacobi, 4: Jacobi) | PARCEL_INT algo |
| set_jacobi(iter) | Sets the number of iterations for the Jacobi method | PARCEL_INT iter |
| set_blockjacobi(nblock) | Specifies the number of blocks for the Block Jacobi method | PARCEL_INT nblock |
| set_scaling_matrix(flag) | Enables or disables scaling of the preconditioner matrix (false: disabled, true: enabled) | bool flag |
| set_scaling_vector(flag) | Enables or disables scaling of the preconditioner vector (false: disabled, true: enabled) | bool flag |
| set_tolerance(tol) | Sets the convergence threshold for the relative residual norm | double tol |
| set_max_iteration(itr) | Sets the maximum number of iterations for the Krylov subspace method | PARCEL_INT itr |
This section shows a sample program using the SYCL version of PARCEL.
#include <mpi.h>
#include "./Include/parcel_sycl.hpp"
#include "./parcel_default.hpp"
void run_parcel_sycl_main(
MPI_Comm mpi_comm_c,
int n,
int nnz,
double* crsA,
int* crsCol,
int* crsRow_ptr,
double* vecx,
double* vecb
) {
sycl::device device;                              // default device (selected by the default selector)
sycl::property::queue::in_order property;
sycl::queue q{device,property};                   // in-order queue used by PARCEL
parcel::comm comm(mpi_comm_c,&q);
int myrank = comm.mpi_rank();
int nprocess = comm.mpi_size();
if(myrank == 0) std::cout << q.get_device().get_info<sycl::info::device::name>() << std::endl;
parcel::matrix<parcel_solver_data_type> parcel_matrix(&q);
PARCEL_INT nlocal = n;
PARCEL_INT nstart = nlocal*myrank;
PARCEL_INT nend = nlocal*(myrank+1)-1;
q.wait();
parcel_matrix.set_crsMat(
&comm,nstart,nend,
n,nnz,
crsRow_ptr,
crsCol,
crsA
);
q.wait();
parcel::vector<parcel_solver_data_type> vx(n,&q);
parcel::vector<parcel_solver_data_type> vb(n,&q);
vx.setValue(vecx);
vb.setValue(vecb);
parcel::krylov::pcg<
parcel_solver_calc_type,
parcel_solver_calc_precon_type,
parcel_solver_calc_type,
parcel_solver_calc_type
> solver;
solver.set_tolerance( 1.0e-9 );
solver.set_max_iteration( MAX_ITERATION );
PARCEL_INT precon_algo = parcel_solver_precon_algo;
switch (precon_algo) {
case parcel::PRECON_NONE:
// none
solver.set_precon_algo(parcel::PRECON_NONE);
break;
case parcel::PRECON_POINTJACOBI:
// point jacobi precon
solver.set_precon_algo(parcel::PRECON_POINTJACOBI);
break;
case parcel::PRECON_BLOCKJACOBI:
// block jacobi precon
solver.set_precon_algo(parcel::PRECON_BLOCKJACOBI);
solver.set_scaling_matrix(true);
solver.set_scaling_vector(true);
solver.set_blockjacobi(BLOCKJACOBI_BLOCK);
break;
case parcel::PRECON_JACOBI:
// jacobi precon
solver.set_precon_algo(parcel::PRECON_JACOBI);
solver.set_scaling_matrix(true);
solver.set_scaling_vector(true);
solver.set_jacobi(JACOBI_ITER);
break;
}
q.wait();
solver.solver<
parcel_solver_data_type,
parcel_solver_data_type,
parcel_solver_data_type,
parcel_solver_data_type,
parcel_solver_data_precon_mat_type,
parcel_solver_data_precon_vec_type
>(parcel_matrix,vx,vb);
vx.getValue(vecx); q.wait();
q.wait();
return ;
}
extern "C" {
void run_parcel_sycl_f(
int* mpi_comm_f,
int n, int nnz,
double* crsA,
int* crsCol,
int* crsRow_ptr,
double* vecx,
double* vecb
) {
MPI_Comm mpi_comm_c = MPI_Comm_f2c(mpi_comm_f[0]);
run_parcel_sycl_main(
mpi_comm_c,
n, nnz,
crsA,
crsCol,
crsRow_ptr,
vecx,
vecb );
return ;
}
}
make_poisson3d_crs_get_nnz and make_poisson3d_crs are subroutines that generate CRS-format matrices with support for matrix partitioning.
module parcel_sycl_wrapper
  interface
     subroutine run_parcel_sycl( &
          mpi_comm_f, &
          n,nnz, &
          crsA, &
          crsCol, &
          crsRow_ptr, &
          vecx, &
          vecb &
          ) &
          bind(c,name="run_parcel_sycl_f")
       implicit none
       integer mpi_comm_f
       integer(kind=4),value :: n
       integer(kind=4),value :: nnz
       real*8 crsA(*)
       integer(kind=4) crsCol(*)
       integer(kind=4) crsRow_ptr(*)
       real*8 vecx(*)
       real*8 vecb(*)
     end subroutine run_parcel_sycl
  end interface
contains
end module parcel_sycl_wrapper

program main
  use mpi
  use parcel_sycl_wrapper
  implicit none
  integer(kind=4) nx,ny,nz
  integer(kind=4) nx0,ny0,nz0
  integer(kind=4),parameter :: gnx = 16
  integer(kind=4),parameter :: gny = 16
  integer(kind=4),parameter :: gnz = 16
  integer(kind=4) gn,n
  integer(kind=4) nnz
  integer(kind=4) istart
  real*8,allocatable :: crsA(:)
  integer(kind=4),allocatable :: crsCol(:)
  integer(kind=4),allocatable :: crsRow_ptr(:)
  real*8,allocatable :: vecx(:)
  real*8,allocatable :: vecb(:)
  integer ierr
  integer npes,myrank
  integer i
  integer mpi_comm_f

  mpi_comm_f = MPI_COMM_WORLD

  ! MPI Init
  call MPI_Init(ierr)
  call MPI_Comm_size(mpi_comm_f,npes,ierr)
  call MPI_Comm_rank(mpi_comm_f,myrank,ierr)

  nx = gnx
  ny = gny
  nz = gnz/npes
  nx0 = 1
  ny0 = 1
  nz0 = 1 + nz*myrank
  gn = gnx*gny*gnz
  n = nx * ny * nz
  istart = 1+n*myrank

  call make_poisson3d_crs_get_nnz( &
       nx,ny,nz, &
       nx0,ny0,nz0,&
       gnx,gny,gnz,&
       nnz, &
       istart)

  allocate(crsA(nnz))
  allocate(crsCol(nnz))
  allocate(crsRow_ptr(n+1))
  allocate(vecx(n))
  allocate(vecb(n))
  vecx = 0.0d0
  vecb = 1.0d0

  call make_poisson3d_crs( &
       nx,ny,nz, &
       nx0,ny0,nz0,&
       gnx,gny,gnz,&
       crsA,crsCol,crsRow_ptr,n,nnz, &
       istart)

  call run_parcel_sycl( &
       mpi_comm_f, &
       n,nnz, &
       crsA, &
       crsCol, &
       crsRow_ptr, &
       vecx, &
       vecb &
       )

  deallocate(crsA)
  deallocate(crsCol)
  deallocate(crsRow_ptr)
  deallocate(vecx)
  deallocate(vecb)

  call MPI_Finalize(ierr)
end program main
extern "C" {
void run_parcel_sycl_c(
MPI_Comm mpi_comm_c,
int n, int nnz,
double* crsA,
int* crsCol,
int* crsRow_ptr,
double* vecx,
double* vecb
) {
run_parcel_sycl_main(
mpi_comm_c,
n, nnz,
crsA,
crsCol,
crsRow_ptr,
vecx,
vecb );
return ;
}
}
make_poisson3d_crs_get_nnz_ and make_poisson3d_crs_ are functions that generate CRS-format matrices with support for matrix partitioning.
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
extern void make_poisson3d_crs_get_nnz_(
int* nx,
int* ny,
int* nz,
int* nx0,
int* ny0,
int* nz0,
int* gnx,
int* gny,
int* gnz,
int* nnz,
int* istart);
extern void make_poisson3d_crs_(
int* nx,
int* ny,
int* nz,
int* nx0,
int* ny0,
int* nz0,
int* gnx,
int* gny,
int* gnz,
double* crsA,
int* crsCol,
int* crsRow_ptr,
int* n,
int* nnz0,
int* istart);
extern void run_parcel_sycl_c(
MPI_Comm mpi_comm_c,
int n, int nnz,
double* crsA,
int* crsCol,
int* crsRow_ptr,
double* vecx,
double* vecb
);
int main(int argc, char* argv[]) {
int gnx = 16;
int gny = 16;
int gnz = 16;
MPI_Init(&argc, &argv);
MPI_Comm mpi_comm_c = MPI_COMM_WORLD;
int npes,myrank;
MPI_Comm_size(MPI_COMM_WORLD,&npes);
MPI_Comm_rank(MPI_COMM_WORLD,&myrank);
int nx = gnx;
int ny = gny;
int nz = gnz/npes;
int nx0 = 1;
int ny0 = 1 ;
int nz0 = 1 + nz*myrank;
int gn = gnx*gny*gnz;
int n = nx * ny * nz;
int istart = 1+n*myrank;
int nnz;
make_poisson3d_crs_get_nnz_(
&nx,&ny,&nz,
&nx0,&ny0,&nz0,
&gnx,&gny,&gnz,
&nnz,
&istart);
double* crsA = (double*)malloc(sizeof(double)*nnz);
int* crsCol = (int*)malloc(sizeof(int)*nnz);
int* crsRow_ptr = (int*)malloc(sizeof(int)*(n+1));
double* vecx = (double*)malloc(sizeof(double)*n);
double* vecb = (double*)malloc(sizeof(double)*n);
make_poisson3d_crs_(
&nx,&ny,&nz,
&nx0,&ny0,&nz0,
&gnx,&gny,&gnz,
crsA,crsCol,crsRow_ptr,
&n,&nnz,
&istart);
for(int i = 0;i<n;i++){
vecx[i] = 0.0;
vecb[i] = 1.0;
}
run_parcel_sycl_c(
mpi_comm_c,
n,nnz,
crsA,
crsCol,
crsRow_ptr,
vecx,
vecb
);
free(crsA);
free(crsCol);
free(crsRow_ptr);
free(vecx);
free(vecb);
MPI_Finalize();
return 0;
}