#### HOW TO UTILIZE MULTI CORE CPUS - Toward Sustained Petascale Computing -

#### Motoi Okuda<sup>1</sup>

 <sup>1</sup> Technical Computing Solutions Unit Fujitsu Limited
9-3, Nakase 1-chome, Mihamaku, Chiba City Chiba 261-8588, JAPAN <u>m.okuda@jp.fujitsu.com</u>

Advances in semiconductor technology make it possible to integrate several cores on one CPU chip. Such a CPU is called a multi-core or many-core CPU. This approach can dramatically improve the peak performance of a single CPU chip. However, it also raises new problems: how to use multi-/many-core CPUs effectively and easily, and how to balance core performance against the memory bandwidth between the cores and memory.

Fujitsu has been developing a new architecture called **Integrated Multi-core Parallel ArChiTecture** to address these problems. In this presentation, I will explain the concept and outline of Integrated Multi-core Parallel ArChiTecture and the performance of FX1, Fujitsu's high-end technical computing server, which implements it. I will also outline SPARC64<sup>™</sup> VIIIfx, Fujitsu's new high-end CPU for technical computing, and Fujitsu's future petascale computer, which inherits Integrated Multi-core Parallel ArChiTecture.

# How to utilize multi core CPUs - Toward Sustained Petascale Computing -

April 24th, 2009

Motoi Okuda

**Fujitsu Limited** 

JAEA CCSE Workshop, April. 24th, 2009

#### Agenda

- Outline of Fujitsu's HPC Solution Offerings
- High end Technical Computing Server FX1
- Fujitsu's Challenges for Petascale Computing
- Conclusion


## **Fujitsu's Technical Computing Platform Solutions**




All Rights Reserved, Copyright FUJITSU LIMITED 2009

**FX1 Specifications** 

| Component    | Item                         | Specification                                                                            |
|--------------|------------------------------|------------------------------------------------------------------------------------------|
| CPU          | Processor                    | SPARC64™ VII @ 2.5 GHz                                                                   |
|              | Cache                        | L1: 64 KB instruction & 64 KB data / core<br>L2: 6 MB/CPU, shared                        |
|              | Cores                        | 4                                                                                        |
|              | Performance                  | 40 GFlops                                                                                |
|              | Simultaneous multi-threading | 2 threads/core                                                                           |
|              | Barrier synchronization      | CPU-wide high-speed barrier mechanism between cores                                      |
| Node         | CPUs                         | 1                                                                                        |
|              | Memory capacity              | Max 32 GB                                                                                |
|              | Memory error-checking        | ECC, extended ECC                                                                        |
|              | Memory bandwidth             | 40 GB/s                                                                                  |
|              | Interfaces                   | InfiniBand™ HCA (2 GBps) x 1; 1000baseT x 2                                              |
| Interconnect | Topology                     | Fat-tree                                                                                 |
|              | Interface                    | InfiniBand™ DDR                                                                          |
|              | Additional functions         | Intelligent SW with barrier synchronization and hardware assisted reduction capabilities |
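The balance between core performance and memory bandwidth raised in the abstract can be checked against the figures in this table. The sketch below is illustrative arithmetic only: the value of 4 floating-point operations per cycle per core is an assumption inferred from 40 GFlops / (4 cores × 2.5 GHz), and the function names are hypothetical.

```python
# Illustrative arithmetic on the FX1 table values; not vendor code.
CLOCK_GHZ = 2.5                 # from the table
CORES_PER_CPU = 4               # from the table
MEMORY_BANDWIDTH_GBS = 40.0     # from the table (GB/s per node)
# Assumption: inferred from 40 GFlops / (4 cores * 2.5 GHz), not stated in the table.
FLOPS_PER_CYCLE_PER_CORE = 4


def peak_gflops(clock_ghz, cores, flops_per_cycle):
    """Theoretical peak of one CPU in GFlops."""
    return clock_ghz * cores * flops_per_cycle


def bytes_per_flop(bandwidth_gbs, gflops):
    """Memory bandwidth available per floating-point operation (B/F)."""
    return bandwidth_gbs / gflops


peak = peak_gflops(CLOCK_GHZ, CORES_PER_CPU, FLOPS_PER_CYCLE_PER_CORE)
ratio = bytes_per_flop(MEMORY_BANDWIDTH_GBS, peak)
print(f"peak = {peak} GFlops, balance = {ratio} byte/flop")
```

With these inputs the node offers one byte of memory bandwidth per peak floating-point operation.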




## FX1 LINPACK Benchmark Score on JAXA system

- FX1 LINPACK benchmark on the 130 TFlops JAXA system (3,008 nodes = 3,008 CPUs = 12,032 cores)

|             | Results              | Compared to November 2008 TOP500 list (latest)    |
|-------------|----------------------|---------------------------------------------------|
| Performance | 110.6 TFlops         | 1st in Japan, 17th in world                       |
| Efficiency  | 91.19%               | 1st in world                                      |
| Runtime     | 60 hours, 40 minutes | 1st in world                                      |
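The efficiency figure is the measured LINPACK performance divided by the theoretical peak of the nodes used. The following sketch reproduces that ratio under two assumptions that are not stated on this slide: 4 floating-point operations per cycle per core (inferred from the specification table) and the 2.52 GHz clock quoted for FX1 in the OpenMP comparison. With the nominal 2.5 GHz the same arithmetic gives roughly 91.9%, so the reported 91.19% appears consistent with the 2.52 GHz figure.

```python
# Illustrative check of the reported LINPACK efficiency; not an official calculation.
NODES = 3008                    # from the slide
CORES_PER_CPU = 4               # from the specification table
RMAX_TFLOPS = 110.6             # measured performance from the slide
CLOCK_GHZ = 2.52                # assumption: clock quoted in the OpenMP comparison
FLOPS_PER_CYCLE_PER_CORE = 4    # assumption: inferred from 40 GFlops per CPU


def rpeak_tflops(nodes, clock_ghz, cores, flops_per_cycle):
    """Theoretical peak of the whole system in TFlops."""
    return nodes * clock_ghz * cores * flops_per_cycle / 1000.0


def efficiency_percent(rmax, rpeak):
    """LINPACK efficiency as a percentage of theoretical peak."""
    return 100.0 * rmax / rpeak


rpeak = rpeak_tflops(NODES, CLOCK_GHZ, CORES_PER_CPU, FLOPS_PER_CYCLE_PER_CORE)
efficiency = efficiency_percent(RMAX_TFLOPS, rpeak)
print(f"Rpeak = {rpeak:.2f} TFlops, efficiency = {efficiency:.2f}%")
```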


# Integrated Multi-core Parallel ArChiTecture

#### Concept

- Highly efficient thread level parallel processing technology for multi-core chip
- Supports highly efficient hybrid parallel programming model (MPI + thread parallelization by OpenMP or automatic parallelization)





#### Advantage

- Handles the multi-core CPU as one equivalent faster CPU
  - Reduces the number of MPI processes to 1/n<sub>core</sub>
    - → Increases parallel efficiency
    - → Reduces OS jitter effect
  - Reduces memory access and increases cache usage
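The reduction in MPI process count can be made concrete with the JAXA configuration quoted earlier (12,032 cores, 4 cores per CPU). This is a minimal sketch; `mpi_layout` is a hypothetical helper written for illustration, not part of any MPI library.

```python
def mpi_layout(total_cores, cores_per_cpu, hybrid):
    """Return (mpi_processes, threads_per_process) for a given machine size.

    Pure MPI runs one process per core. The hybrid model runs one process
    per CPU and parallelizes across that CPU's cores with threads.
    """
    if total_cores % cores_per_cpu != 0:
        raise ValueError("total_cores must be a multiple of cores_per_cpu")
    threads = cores_per_cpu if hybrid else 1
    return total_cores // threads, threads


pure = mpi_layout(12032, 4, hybrid=False)
hybrid = mpi_layout(12032, 4, hybrid=True)
print("pure MPI:", pure)
print("hybrid  :", hybrid)
```

In this configuration the hybrid model leaves the total core count unchanged while cutting the number of MPI processes to one quarter.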

#### Challenge

- How to decrease the thread level parallelization overhead?
- How to decrease the cost for application implementation?


## Integrated Multi-core Parallel ArChiTecture Key Technologies

## CPU technologies

- Hardware barrier synchronization between cores
  - → Reduces overhead for parallel execution; 10 times faster than software emulation
  - → Start-up time is comparable to that of the vector unit
  - → Barrier overhead remains constant regardless of the number of cores
- Shared L2 cache memory (6 MB)
  - → Reduces the number of cache-to-cache data transfers
  - → Efficient cache memory usage
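To show where barrier synchronization sits in a thread-parallel loop, here is a minimal sketch using the software barrier in Python's standard library. It is not the hardware mechanism described above; it only illustrates the pattern (compute, wait for all threads, then use everyone's results) whose per-use cost a hardware barrier is meant to reduce. The thread count of 4 mirrors the four cores per CPU, and `run` and `worker` are hypothetical names.

```python
import threading

N_THREADS = 4  # one thread per core, mirroring the 4-core CPU


def run(data):
    """Sum `data` in parallel phases separated by a barrier."""
    barrier = threading.Barrier(N_THREADS)
    partial = [0] * N_THREADS   # phase 1 output, one slot per thread
    totals = [0] * N_THREADS    # phase 2 output, one slot per thread

    def worker(tid):
        # Phase 1: each thread sums its own strided slice of the data.
        partial[tid] = sum(data[tid::N_THREADS])
        # No thread may read `partial` until every thread has written its slot.
        barrier.wait()
        # Phase 2: every thread combines all partial sums.
        totals[tid] = sum(partial)

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(N_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals


result = run(list(range(1, 101)))
print(result)
```

Every fine-grained parallel loop pays for a synchronization point like `barrier.wait()`, which is why its overhead matters for thread-level parallel efficiency.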

## Compiler technologies

Highly efficient thread parallelization (automatic parallelization or OpenMP) by vectorization technology



## Integrated Multi-core Parallel ArChiTecture FX1 OpenMP Thread Parallelization Performance

#### Comparison of thread overhead on several OpenMP functions

Systems compared (chart legend):

- FX1 SPARC64<sup>™</sup> VII (2.52 GHz), 4 threads
- WoodCrest (3.00 GHz), 2 threads
- HX600 AMD Barcelona (2.3 GHz), 4 threads
- Harpertown (3.16 GHz), 4 threads
- HPC2500 SPARC64<sup>™</sup> V (1.3 GHz), 4 threads




#### Integrated Multi-core Parallel ArChiTecture **FX1** Hybrid Parallelization Performance

## Performance comparison of NPB class C between pure MPI and Hybrid parallelization (automatic parallelization) on 256 CPUs (1,024 cores)

Hybrid parallelization outperforms pure MPI on 5 of the 8 programs




#### FX1 Intelligent Interconnect Outline

Combination of fat tree topology InfiniBand DDR interconnect and the highly-functional switch (Intelligent switch)

## Intelligent switch (ISW)

Result of the PSI (Petascale System Interconnect) national project

#### Functions

- Hardware barrier function among nodes
- Hardware assistance for MPI functions (synchronization and reduction)
- Global ping for OS scheduling

#### Advantages

- Faster HW barrier speeds up OpenMP and data parallel FORTRAN (XPF)
- Fast collective operations accelerate highly parallel applications
- Reduces OS jitter effect
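Collective operations such as reductions combine one value from every node, so their cost grows with machine size. The slides do not describe how the intelligent switch performs its reduction; the sketch below is a generic pairwise tree reduction, offered only to show why such operations become a scaling concern at thousands of nodes. `tree_reduce` is a hypothetical helper.

```python
import operator


def tree_reduce(values, op):
    """Combine values pairwise, level by level.

    Returns (result, rounds), where rounds is the number of combining
    levels needed, which equals ceil(log2(len(values))).
    """
    if not values:
        raise ValueError("values must not be empty")
    level = list(values)
    rounds = 0
    while len(level) > 1:
        # Combine neighbours (0,1), (2,3), ...
        reduced = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        # An unpaired last element is carried up to the next level unchanged.
        if len(level) % 2 == 1:
            reduced.append(level[-1])
        level = reduced
        rounds += 1
    return level[0], rounds


# One contribution per node on a 3,008-node system.
total, rounds = tree_reduce([1] * 3008, operator.add)
print(total, rounds)
```

Each round in a software implementation typically involves message exchanges between nodes, which is the latency that hardware assistance in the interconnect aims to cut.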



Intelligent Switch & its connection



## FX1 Intelligent Interconnect & Integrated Multi-core Parallel ArChiTecture: FX1 Hybrid Parallelization Performance

#### Performance comparison of HIMENO-BMT grid-M\* between pure MPI, pure MPI + ISW and hybrid parallelization + ISW

Hybrid parallelization (MPI + automatic parallelization across the four cores), assisted by Integrated Multi-core Parallel ArChiTecture and ISW, achieves high parallel efficiency on FX1




\*: Size M means a mesh size of 256 × 128 × 128.








## Conclusion

- Key issues for sustained petascale computing
  - How to utilize multi-core CPUs?
  - How to handle a hundred thousand processes?
- Fujitsu's technical challenge
  - New Integrated Multi-core Parallel ArChiTecture and innovative interconnect, which provide a highly efficient hybrid parallel programming environment
- Fujitsu's stepwise approach to product release ensures that users are ready for petascale computing
  - Step 1: The new high-end technical computing server FX1 provides the environment for migrating applications to the petascale system. Design of a petascale system that inherits the FX1 architecture.
  - Step 2: Petascale system with a new high-performance, highly reliable, low-power CPU and innovative interconnect


