PARCEL 1.2 (Jan. 31, 2023)
PARCEL 1.2 is available.
The following changes have been made in this update.
- 1. Support for NVIDIA GPUs via CUDA has been added.
Download
Source code
License
The source code is licensed under the GNU Lesser General Public License (LGPL).
Old Version
Overview of PARCEL
Matrix solvers for systems of simultaneous linear equations are classified into direct and iterative solvers. In most extreme-scale problems, iterative solvers based on Krylov subspace methods are essential from the viewpoints of computational cost and memory usage. The PARallel Computing ELements (PARCEL) library provides highly efficient parallel Krylov subspace solvers for modern massively parallel supercomputers, which are characterized by accelerated computation and comparatively slow improvement in inter-node communication performance. PARCEL is based on a hybrid MPI+OpenMP parallel programming model, and supports the latest communication-avoiding (CA) Krylov methods (the Chebyshev basis CG method [5,6] and the CA-GMRES method [3,4]) in addition to the conventional Krylov subspace methods (the CG method, the BiCGStab method, and the GMRES method) [1,2]. The CA-Krylov methods reduce the number of collective communications to avoid a communication bottleneck. The fine-block preconditioner [7] for SIMD operations and the Neumann series preconditioner [2] for GPU operations are available in addition to the conventional preconditioners (the point Jacobi preconditioner, the ILU preconditioners, the block Jacobi preconditioner, and the additive Schwarz preconditioner) [1,2]. QR factorization routines (the classical Gram-Schmidt method [1,2], the modified Gram-Schmidt method [1,2], the tall skinny QR method [3], the Cholesky QR method [8], and the Cholesky QR2 method [9]) and an eigenvalue solver based on the CA-Arnoldi method [3], which are implemented for the CA-GMRES method, are also available. PARCEL supports three matrix formats (the Compressed Row Storage (CRS) format, the Diagonal (DIA) format, and the Domain Decomposition Method (DDM) format) and two data types (double precision and quadruple precision), and can be called from programs written in C and Fortran.
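As a concrete illustration of the CRS format listed above, the following minimal sketch assembles the 7-point stencil matrix of the three-dimensional Poisson equation used in the benchmarks below. It shows only the storage scheme itself (the val/col/ptr arrays); it does not reproduce PARCEL's actual calling interface, and all names are illustrative.

```c
/* Sketch: assemble the 7-point finite-difference Poisson matrix on an
 * n x n x n grid in the Compressed Row Storage (CRS) format.
 * CRS keeps three arrays: val (nonzero values), col (their column indices),
 * and ptr (the index in val/col at which each row starts).
 * This illustrates the storage scheme only; it is not PARCEL's API. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int  n       = 8;               /* grid points per direction (768 in the benchmarks) */
    const long nrow    = (long)n * n * n;
    const long nnz_max = 7 * nrow;        /* at most 7 nonzeros per row */

    double *val = malloc(nnz_max * sizeof(double));
    long   *col = malloc(nnz_max * sizeof(long));
    long   *ptr = malloc((nrow + 1) * sizeof(long));

    long nnz = 0;
    for (int k = 0; k < n; k++)
      for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++) {
            long row = (long)k * n * n + (long)j * n + i;
            ptr[row] = nnz;
            /* couplings to the six neighbours (homogeneous Dirichlet boundary) */
            if (k > 0)     { val[nnz] = -1.0; col[nnz++] = row - (long)n * n; }
            if (j > 0)     { val[nnz] = -1.0; col[nnz++] = row - n; }
            if (i > 0)     { val[nnz] = -1.0; col[nnz++] = row - 1; }
            val[nnz] = 6.0;  col[nnz++] = row;          /* diagonal entry */
            if (i < n - 1) { val[nnz] = -1.0; col[nnz++] = row + 1; }
            if (j < n - 1) { val[nnz] = -1.0; col[nnz++] = row + n; }
            if (k < n - 1) { val[nnz] = -1.0; col[nnz++] = row + (long)n * n; }
        }
    ptr[nrow] = nnz;

    printf("rows = %ld, nonzeros = %ld\n", nrow, nnz);

    free(val); free(col); free(ptr);
    return 0;
}
```

Arrays of this kind, together with a right-hand-side vector, define the linear systems solved in the benchmarks below; for the argument lists of the actual solver routines, refer to the user manual bundled with the source code.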
Performance
≪ Performance comparisons of CG solvers with block Jacobi preconditioners using 32 CPUs on BDEC-01/Wisteria-O (A64FX) at Univ. Tokyo. ≫
Three-dimensional Poisson equation with a problem size of 768×768×768
| | PETSc | PARCEL | PARCEL | PARCEL |
|---|---|---|---|---|
| Elapsed time [s] | 75.08 | 73.30 | 36.68 | 11.31 |
| Memory usage [GB] | 382 | 194 | 194 | 166 |
| Iteration number | 1632 | 1633 | 1905 | 1571 |
≪ Performance comparisons of CG solvers with block Jacobi preconditioners using 32 CPUs on HPE SGI8600 (Intel Xeon Gold 6248R) at Japan Atomic Energy Agency. ≫
Three-dimensional Poisson equation with a problem size of 768×768×768
| | PETSc | PARCEL | PARCEL | PARCEL |
|---|---|---|---|---|
| Elapsed time [s] | 137.03 | 125.45 | 167.97 | 83.03 |
| Memory usage [GB] | 369 | 158 | 158 | 126 |
| Iteration number | 1437 | 1438 | 1903 | 1536 |
≪ Performance comparisons of CG solvers with preconditioners using 32 GPUs on HPE SGI8600 (Intel Xeon Gold 6248R, NVIDIA Tesla V100 SXM2) at Japan Atomic Energy Agency. ≫
Three-dimensional Poisson equation with a problem size of 768×768×768
| | PETSc | AmgX | PARCEL | PARCEL | PARCEL |
|---|---|---|---|---|---|
| Elapsed time [s] | 58.31 | 9.45 | 58.25 | 11.88 | 7.89 |
| Memory usage [GB] | | | | | |
| Iteration number | 882 | 982 | 883 | 982 | 982 |
References
[1] R. Barrett, M. Berry, T. F. Chan, et al., "Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods", SIAM (1994).
[2] Y. Saad, "Iterative Methods for Sparse Linear Systems", SIAM (2003).
[3] M. Hoemmen, "Communication-avoiding Krylov subspace methods", Ph.D. thesis, University of California, Berkeley (2010).
[4] Y. Idomura, T. Ina, A. Mayumi, et al., "Application of a communication-avoiding generalized minimal residual method to a gyrokinetic five dimensional Eulerian code on many core platforms", ScalA17: 8th Workshop on Latest Advances in Scalable Algorithms for Large Scale Systems, pp. 1-8 (2017).
[5] R. Suda, L. Cong, D. Watanabe, et al., "Communication-Avoiding CG Method: New Direction of Krylov Subspace Methods towards Exa-scale Computing", RIMS Kokyuroku, pp. 102-111 (2016).
[6] Y. Idomura, T. Ina, S. Yamashita, et al., "Communication avoiding multigrid preconditioned conjugate gradient method for extreme scale multiphase CFD simulations", ScalA18: 9th Workshop on Latest Advances in Scalable Algorithms for Large Scale Systems, pp. 17-24 (2018).
[7] T. Ina, Y. Idomura, T. Imamura, S. Yamashita, and N. Onodera, "Iterative methods with mixed-precision preconditioning for ill-conditioned linear systems in multiphase CFD simulations", ScalA21: 12th Workshop on Latest Advances in Scalable Algorithms for Large Scale Systems (2021).
[8] A. Stathopoulos, K. Wu, "A block orthogonalization procedure with constant synchronization requirements", SIAM J. Sci. Comput. 23, 2165-2182 (2002).
[9] T. Fukaya, Y. Nakatsukasa, Y. Yanagisawa, et al., "CholeskyQR2: A Simple and Communication-Avoiding Algorithm for Computing a Tall-Skinny QR Factorization on a Large-Scale Parallel System", ScalA14: 5th Workshop on Latest Advances in Scalable Algorithms for Large Scale Systems, pp. 31-38 (2014).
Developer
Computer Science Research and Development Office, Center for Computational Science & e-Systems, Japan Atomic Energy Agency
Contact
ccse-quad(at)ml.jaea.go.jp
※ Substitute @ for (at).