PARCEL 1.2 (Jan. 31, 2023)
PARCEL 1.2 is available.
The following changes have been made in this update.
- 1. Support for NVIDIA GPUs via CUDA has been added.
Download
Source code
License
The source code is licensed under the GNU Lesser General Public License (LGPL).
Old Version
Overview of PARCEL
Matrix solvers for systems of simultaneous linear equations are classified into direct and iterative solvers. In most extreme-scale problems, iterative solvers based on Krylov subspace methods are essential from the viewpoints of computational cost and memory usage. The PARallel Computing ELements (PARCEL) library provides highly efficient parallel Krylov subspace solvers for modern massively parallel supercomputers, which are characterized by accelerated computation and comparatively slow improvement in inter-node communication performance. PARCEL is based on a hybrid MPI+OpenMP parallel programming model, and supports the latest communication-avoiding (CA) Krylov methods (the Chebyshev basis CG method [5,6] and the CA-GMRES method [3,4]) in addition to the conventional Krylov subspace methods (the CG method, the BiCGStab method, and the GMRES method) [1,2]. The CA-Krylov methods reduce the number of collective communications to avoid a communication bottleneck. The fine-block preconditioner [7] for SIMD operations and the Neumann series preconditioner [2] for GPU operations are available in addition to the conventional preconditioners (the point Jacobi preconditioner, the ILU preconditioners, the block Jacobi preconditioner, and the additive Schwarz preconditioner) [1,2]. QR factorization routines (the classical Gram-Schmidt method [1,2], the modified Gram-Schmidt method [1,2], the tall skinny QR method [3], the Cholesky QR method [8], and the Cholesky QR2 method [9]) and an eigenvalue solver based on the CA-Arnoldi method [3], which are implemented for the CA-GMRES method, are also available. PARCEL supports three matrix formats (the Compressed Row Storage (CRS) format, the Diagonal (DIA) format, and the Domain Decomposition Method (DDM) format) and two data types (double precision and quadruple precision), and can be called from programs written in C and Fortran.
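As a concrete illustration of the CRS format listed above, the following minimal sketch assembles the 7-point stencil matrix of the three-dimensional Poisson equation used in the benchmarks below. It shows only the storage scheme itself (the val/col/ptr arrays); it does not reproduce PARCEL's actual calling interface, and all names are illustrative.

```c
/* Sketch: assemble the 7-point finite-difference Poisson matrix on an
 * n x n x n grid in the Compressed Row Storage (CRS) format.
 * CRS keeps three arrays: val (nonzero values), col (their column indices),
 * and ptr (the index in val/col at which each row starts).
 * This illustrates the storage scheme only; it is not PARCEL's API. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int  n       = 8;               /* grid points per direction (768 in the benchmarks) */
    const long nrow    = (long)n * n * n;
    const long nnz_max = 7 * nrow;        /* at most 7 nonzeros per row */

    double *val = malloc(nnz_max * sizeof(double));
    long   *col = malloc(nnz_max * sizeof(long));
    long   *ptr = malloc((nrow + 1) * sizeof(long));

    long nnz = 0;
    for (int k = 0; k < n; k++)
      for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++) {
            long row = (long)k * n * n + (long)j * n + i;
            ptr[row] = nnz;
            /* couplings to the six neighbours (homogeneous Dirichlet boundary) */
            if (k > 0)     { val[nnz] = -1.0; col[nnz++] = row - (long)n * n; }
            if (j > 0)     { val[nnz] = -1.0; col[nnz++] = row - n; }
            if (i > 0)     { val[nnz] = -1.0; col[nnz++] = row - 1; }
            val[nnz] = 6.0;  col[nnz++] = row;          /* diagonal entry */
            if (i < n - 1) { val[nnz] = -1.0; col[nnz++] = row + 1; }
            if (j < n - 1) { val[nnz] = -1.0; col[nnz++] = row + n; }
            if (k < n - 1) { val[nnz] = -1.0; col[nnz++] = row + (long)n * n; }
        }
    ptr[nrow] = nnz;

    printf("rows = %ld, nonzeros = %ld\n", nrow, nnz);

    free(val); free(col); free(ptr);
    return 0;
}
```

Arrays of this kind, together with a right-hand-side vector, define the linear systems solved in the benchmarks below; for the argument lists of the actual solver routines, refer to the user manual bundled with the source code.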
Performance
≪ Performance comparisons of CG solvers with block Jacobi preconditioners using 32 CPUs on BDEC-01/Wisteria-O (A64FX) at Univ. Tokyo. ≫
Three-dimensional Poisson equation with a problem size of 768×768×768
| | PETSc | PARCEL | PARCEL | PARCEL |
|---|---|---|---|---|
| Elapsed time [s] | 75.08 | 73.30 | 36.68 | 11.31 |
| Memory usage [GB] | 382 | 194 | 194 | 166 |
| Iteration number | 1632 | 1633 | 1905 | 1571 |
≪ Performance comparisons of CG solvers with block Jacobi preconditioners using 32 CPUs on HPE SGI8600 (Intel Xeon Gold 6248R) at Japan Atomic Energy Agency. ≫
Three-dimensional Poisson equation with a problem size of 768×768×768
| | PETSc | PARCEL | PARCEL | PARCEL |
|---|---|---|---|---|
| Elapsed time [s] | 137.03 | 125.45 | 167.97 | 83.03 |
| Memory usage [GB] | 369 | 158 | 158 | 126 |
| Iteration number | 1437 | 1438 | 1903 | 1536 |
≪ Performance comparisons of CG solvers with preconditioners using 32 GPUs on HPE SGI8600 (Intel Xeon Gold 6248R, NVIDIA Tesla V100 SXM2) at Japan Atomic Energy Agency. ≫
Three-dimensional Poisson equation with a problem size of 768×768×768
| | PETSc | AmgX | PARCEL | PARCEL | PARCEL |
|---|---|---|---|---|---|
| Elapsed time [s] | 58.31 | 9.45 | 58.25 | 11.88 | 7.89 |
| Memory usage [GB] | | | | | |
| Iteration number | 882 | 982 | 883 | 982 | 982 |
References
[1] R. Barrett, M. Berry, T. F. Chan, et al., "Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods", SIAM (1994).
[2] Y. Saad, "Iterative Methods for Sparse Linear Systems", SIAM (2003).
[3] M. Hoemmen, "Communication-avoiding Krylov subspace methods", Ph.D. thesis, University of California, Berkeley (2010).
[4] Y. Idomura, T. Ina, A. Mayumi, et al., "Application of a communication-avoiding generalized minimal residual method to a gyrokinetic five dimensional Eulerian code on many core platforms", ScalA17: 8th Workshop on Latest Advances in Scalable Algorithms for Large Scale Systems, pp. 1-8 (2017).
[5] R. Suda, L. Cong, D. Watanabe, et al., "Communication-Avoiding CG Method: New Direction of Krylov Subspace Methods towards Exa-scale Computing", RIMS Kokyuroku, pp. 102-111 (2016).
[6] Y. Idomura, T. Ina, S. Yamashita, et al., "Communication avoiding multigrid preconditioned conjugate gradient method for extreme scale multiphase CFD simulations", ScalA18: 9th Workshop on Latest Advances in Scalable Algorithms for Large Scale Systems, pp. 17-24 (2018).
[7] T. Ina, Y. Idomura, T. Imamura, S. Yamashita, and N. Onodera, "Iterative methods with mixed-precision preconditioning for ill-conditioned linear systems in multiphase CFD simulations", ScalA21: 12th Workshop on Latest Advances in Scalable Algorithms for Large Scale Systems (2021).
[8] A. Stathopoulos, K. Wu, "A block orthogonalization procedure with constant synchronization requirements", SIAM J. Sci. Comput. 23, 2165-2182 (2002).
[9] T. Fukaya, Y. Nakatsukasa, Y. Yanagisawa, et al., "CholeskyQR2: A Simple and Communication-Avoiding Algorithm for Computing a Tall-Skinny QR Factorization on a Large-Scale Parallel System", ScalA14: 5th Workshop on Latest Advances in Scalable Algorithms for Large Scale Systems, pp. 31-38 (2014).
Developer
Computer Science Research and Development Office, Center for Computational Science & e-Systems, Japan Atomic Energy Agency
Contact
ccse-quad(at)ml.jaea.go.jp
※ Substitute @ for (at).