#### LESSONS LEARNED FROM 1-YEAR EXPERIENCE WITH SX-9 AND TOWARDS NEXT-GENERATION VECTOR COMPUTING

#### Hiroaki Kobayashi

Cyberscience Center, Tohoku University, 6-3 Aramaki-Aza-Aoba, Sendai 980-8578, JAPAN, koba@isc.tohoku.ac.jp

In this talk, I would like to share our one-year experience with SX-9, the latest vector system, which was installed at Tohoku University in 2008. I will start with the HPC Challenge benchmark results of our SX-9. The HPC Challenge benchmark suite is designed for comprehensive benchmarking of high-end systems. It focuses on evaluating sustained memory and network bandwidth and the sustained performance of representative kernels such as FFT, which cannot be clearly evaluated by LINPACK (HPL) alone. The SX-9 system achieved the top score in 19 of the 28 HPC Challenge benchmark tests. I will also discuss performance-tuning options for SX-9, especially those for the on-chip, software-controllable cache, which is newly introduced into the SX-9 vector processor to compensate for its limited off-chip memory bandwidth. We exploit the locality of vector data references in differential equations and in indirect memory accesses through list vectors in some leading scientific and engineering applications. Finally, I will introduce our ongoing research on the design of the next-generation vector processor, in which multiple vector cores share an on-chip cache, and discuss a preliminary performance evaluation of this multi-vector-core processor.

# Lessons Learned from 1-year SX-9 Experience and Toward the Next Generation Vector Computing

### Hiroaki Kobayashi

Director and Professor, Cyberscience Center, Tohoku University, koba@isc.tohoku.ac.jp

> 20th CCSE Workshop April 24, 2009



Hiroaki Kobayashi, Tohoku University

### Agenda

- Lessons Learned from our 1-year experiences of SX-9
  - HPCC Benchmark Results
  - Tuning approaches to highly-efficient vector processing with caching
- Towards Next Generation Vector Computing
  - Multi Vector-Core Processor
- Summary



### Supercomputers of Tohoku University



Single Processor Performance



### **SX-9** Processor Architecture

[Figure: SX-9 processor block diagram with the vector unit (vector pipes ×8, mask registers, vector registers, logical), the ADB, the scalar unit with its cache, and main memory, annotated with 4B/F and 2.5B/F data rates. An inset chart shows the improvement over 5 years (vs. SX-7) for total performance, frequency, and pipelines, with the values 11.6x, 2.9x, 2x, and 2x.]

Peak performance: 102.4 Gflop/s = 4 ops × 8 units × 3.2 GHz
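The peak figure on this slide can be checked with simple arithmetic. The following is a minimal sketch, not part of the original talk: the function name is hypothetical, and reading the slide's first factor as 4 operations per unit per cycle is an assumption (it is the value that reproduces 102.4). The 256-processor count is the one listed for the full SX-9 system in the specification table.

```python
def peak_gflops(ops_per_unit_per_cycle: int, units: int, freq_ghz: float) -> float:
    """Peak rate in Gflop/s = operations per unit per cycle x units x clock in GHz."""
    return ops_per_unit_per_cycle * units * freq_ghz

# Factors as read from the slide: 4 operations x 8 units x 3.2 GHz
# (the "4" is an assumption recovered from the garbled label).
per_cpu = peak_gflops(4, 8, 3.2)

# Full system: 256 processors.
system_tflops = per_cpu * 256 / 1000.0

print(f"per processor:  {per_cpu:.1f} Gflop/s")
print(f"256 processors: {system_tflops:.1f} Tflop/s")
```

The second value rounds to the 26.2 TF listed as the SX-9 system peak.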







### The HPCC benchmark consists of


# System Specifications

| System name               | Manufacturer                  | Processor Type | Freq.    | # of<br>Cores | # of MPI<br>Procs | # of<br>Threads | Peak [TF] | Interconnect<br>Type | Network BW |
|---------------------------|-------------------------------|----------------|----------|---------------|------------------|-----------------|-----------|----------------------|------------|
| SX-9                      | NEC                           | SX-9           | 3.2GHz   | 256           | 256              | 1               | 26.2      | IXS                  | 128GB/s    |
| SX-9(SMP)                 | NEC                           | SX-9           | 3.2GHz   | 32            | 2                | 16              | 3.2       | IXS                  | 128GB/s    |
| SX-8                      | NEC                           | SX-8           | 2GHz     | 40            | 40               | 1               | 0.64      | IXS                  | 16GB/s     |
| SX-8(SMP)                 | NEC                           | SX-8           | 2GHz     | 40            | 5                | 8               | 0.64      | IXS                  | 16GB/s     |
| SX-7                      | NEC                           | SX-7           | 0.552GHz | 32            | 32               | 1               | 0.28256   | none                 | none       |
| SX-7(SMP)                 | NEC                           | SX-7           | 0.552GHz | 32            | 2                | 16              | 0.28256   | none                 | none       |
| iDataPlex                 | IBM-Serviware                 | Xeon X5472     | 3.0GHz   | 1088          | 1,088            | 1               | 13.5      | Infiniband           | 2GB/s      |
| BL2x220                   | HP                            | Xeon E5450     | 3.0GHz   | 256           | 256              | 1               | 3.072     | Infiniband           | 2GB/s      |
| SC5832                    | SiCortex                      | SiCortex Ice9  | 0.7GHz   | 5760          | 5,760            | 1               | 8.064     | Custom               | 6GB/s      |
| Blue Gene/P               | IBM                           | PowerPC 450    | 0.85GHz  | 131,072       | 131,072          | 1               | 557       | Torus                | 425MB/s    |
| Blue Gene/P<br>(SMP)      | IBM                           | PowerPC 450    | 0.85GHz  | 131,072       | 32,768           | 4               | 557       | Torus                | 425MB/s    |
| XT5                       | CRAY                          | AMD Opteron    | 2.3GHz   | 149,058       | 74,529           | 2               | 1,381.62  | Seastar              | 9.6GB/s    |
| Darwin                    | ClusterVision/<br>Dell/QLogic | Xeon 5160      | 3GHz     | 256           | 256              | 1               | 3.072     | Infiniband           | 2GB/s      |
| Altix 8200EX              | SGI                           | Xeon X5472     | 3GHz     | 1,024         | 1,024            | 1               | 12.288    | Infiniband           | 2GB/s      |
| Intel Endeavor<br>cluster | Intel                         | Xeon 5160      | 3GHz     | 1,024         | 1,024            | 1               | 11.4688   | Infiniband           | 2GB/s      |



### STREAM (Averaged)




### Random Access (SN/EP)





**G-FFT** 








(As of 2008.11.10)

| Rank | System        | Institution             | Peak Perf.<br>(Tflop/s) |                   | G-FFT Results<br>(Tflop/s) | Efficiency |
|------|---------------|-------------------------|-------------------------|-------------------|----------------------------|------------|
| 1    | Cray XT5      | Oak Ridge National Lab. | 1381.6                  | 37544<br>(150176) | 5.8                        | 0.4%       |
| 2    | BlueGene/P    | Argonne National Lab.   | 557                     | 32768<br>(131072) | 5.1                        | 0.9%       |
| 3    | Red Storm/XT3 | Sandia National Lab.    | 124.4                   | 12960<br>(25920)  | 2.9                        | 2.3%       |
| 4    | SX-9          | Tohoku Univ.            | 26.2                    | 256<br>(256)      | 2.3                        | 9.1%       |
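The efficiency column is the G-FFT result divided by the peak performance. As a minimal sketch (the function name is hypothetical, and taking 124.4 Tflop/s as the Red Storm peak is an assumption, since that cell is garbled in the source), the column can be recomputed from the rounded table entries:

```python
# (system, peak in Tflop/s, G-FFT result in Tflop/s), as listed in the table.
# The Red Storm peak of 124.4 is an assumption recovered from a garbled cell.
results = [
    ("Cray XT5", 1381.6, 5.8),
    ("BlueGene/P", 557.0, 5.1),
    ("Red Storm/XT3", 124.4, 2.9),
    ("SX-9", 26.2, 2.3),
]

def efficiency_percent(peak_tflops: float, gfft_tflops: float) -> float:
    """G-FFT result as a percentage of peak performance."""
    return 100.0 * gfft_tflops / peak_tflops

efficiencies = {name: efficiency_percent(peak, gfft) for name, peak, gfft in results}
for name, eff in efficiencies.items():
    print(f"{name:14s} {eff:4.1f}%")
```

The first three rows reproduce the slide's percentages. The SX-9 value computed from the rounded entries comes out slightly below the 9.1% shown, presumably because the slide used unrounded figures.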




### Latency






### Bandwidth




### **PingPong Performance of SX-9 IXS**

- Peak: 128 GB/s in each direction






# Discussion on Tuning Techniques for SX-9

#### Points for tuning

- Effect of 2.5B/F, reduced from 4B/F
  - be aware of the high vector processing rate relative to the lower memory bandwidth
  - increase computation and reduce memory operations as much as possible
- Effect of the 256KB ADB
  - identify temporal locality in vector data references
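To make the B/F point concrete, here is a minimal sketch under a deliberately simple model that is not from the talk: if memory traffic is the only limit, a kernel that needs `required_bpf` bytes per flop from off-chip memory can sustain at most `supplied_bpf / required_bpf` of peak. The function name and the example kernel are hypothetical.

```python
def sustainable_fraction(supplied_bpf: float, required_bpf: float) -> float:
    """Upper bound on the fraction of peak flop/s when memory traffic is the only limit."""
    if required_bpf <= 0:
        return 1.0
    return min(1.0, supplied_bpf / required_bpf)

# Hypothetical kernel a(i) = b(i) + s * c(i) in double precision:
# 2 loads + 1 store = 24 bytes for 2 flops, i.e. 12 B/F with no on-chip reuse.
triad_bpf = 24 / 2

for supplied in (4.0, 2.5):
    bound = sustainable_fraction(supplied, triad_bpf)
    print(f"{supplied} B/F machine: at most {bound:.0%} of peak")

# If on-chip reuse (for example through the ADB) removed half of the memory
# traffic, the requirement would drop to 6 B/F and the bound would double.
with_reuse = sustainable_fraction(2.5, triad_bpf / 2)
print(f"2.5 B/F with half the traffic removed: at most {with_reuse:.0%} of peak")
```

Under this model, reducing memory operations per flop is the only way to raise the bound once the machine's B/F is fixed, which is why both tuning points target memory traffic.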

#### Applications examined

- Earthquake
  - Simulation of seismic slow slip model
- Turbulent flow
  - Direct numerical simulation of turbulent channel flow
- Antenna
  - FDTD simulation of lens antenna using Fourier transform
- Land Mine
  - FDTD simulation of array antenna ground penetrating radar for land mine detection
- Turbine
  - Direct numerical simulation of unsteady flow through turbine channels for hydroelectric generators
- Plasma
  - Simulation of upper hybrid wave in plasma using Lax-Wendroff method


# Tuning Options for Efficient Vector Processing with On-Chip Caching on SX-9

### Selective Caching

- increases the opportunities for cache hits on data with higher temporal locality

### Cache Blocking

- increases cache hit rates by avoiding capacity misses
- decreases vector length

### Loop Unrolling/Loop Fusion

- increases arithmetic density/vector length in the loop body
- decreases branch overhead
- increases the temporal locality of data by removing duplicated vector loads across nested loops
  - SX-9 performance is now more sensitive to this because of its limited memory bandwidth
- increases the possibility of register spills and/or eviction from the cache if their capacities are insufficient, because the data must stay on the chip for a long time
  - this also puts pressure on the memory system, especially at the 2.5B/F of SX-9
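The effect of loop fusion on duplicated loads can be illustrated with a small counting experiment. This is a minimal sketch, not code from the talk: `CountingArray` and both loop functions are hypothetical, and element reads stand in for vector loads from memory.

```python
class CountingArray:
    """List wrapper that counts element reads, standing in for vector loads."""

    def __init__(self, values):
        self.values = list(values)
        self.loads = 0

    def __getitem__(self, i):
        self.loads += 1
        return self.values[i]

    def __len__(self):
        return len(self.values)


def separate_loops(b, c):
    n = len(b)
    x = [0.0] * n
    y = [0.0] * n
    for i in range(n):      # first loop loads b
        x[i] = 2.0 * b[i]
    for i in range(n):      # second loop loads b again, plus c
        y[i] = b[i] + c[i]
    return x, y


def fused_loop(b, c):
    n = len(b)
    x = [0.0] * n
    y = [0.0] * n
    for i in range(n):      # one loop: b is loaded once and reused
        bi = b[i]
        x[i] = 2.0 * bi
        y[i] = bi + c[i]
    return x, y


n = 8
b1, c1 = CountingArray(range(n)), CountingArray(range(n))
b2, c2 = CountingArray(range(n)), CountingArray(range(n))
res_separate = separate_loops(b1, c1)
res_fused = fused_loop(b2, c2)
print("loads of b, separate loops:", b1.loads, " fused loop:", b2.loads)
```

Both versions compute the same results, but the fused loop reads `b` half as often. The cost is that more values are live in the loop body at once, which is the register and cache pressure noted in the last bullet.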



# Selective Caching & Blocking (Case 1)






### Selective Caching & Blocking (Case 2)





# Selective Caching for Difference Equation Code (Case 3)




### Selective Caching and Blocking: Tradeoff between Vector Length and Cache Hit Rate



[Figure: results plotted against the size of j (vectorized loop size: 128, 256, 750), for the 256KB ADB and for a simulated 1MB ADB.]
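The tradeoff named in this slide title comes from the fact that blocking shortens the vectorized loop so that the working set fits in the cache. As a minimal sketch (the helper names and the working set of four double-precision arrays are assumptions, not values from the talk), the largest block that fits and the resulting strip-mined loop ranges can be computed as follows:

```python
def max_block_elems(cache_bytes: int, arrays_in_flight: int, elem_bytes: int = 8) -> int:
    """Largest per-array block length whose combined working set fits in the cache."""
    return cache_bytes // (arrays_in_flight * elem_bytes)


def strip_mine(n: int, block: int):
    """Split the index range [0, n) into consecutive blocks of at most `block` elements."""
    return [(start, min(start + block, n)) for start in range(0, n, block)]


KB = 1024
# Hypothetical working set: 4 double-precision arrays must stay cached together.
block_256k = max_block_elems(256 * KB, arrays_in_flight=4)
block_1m = max_block_elems(1024 * KB, arrays_in_flight=4)
print("256 KB cache: block of up to", block_256k, "elements per array")
print("1 MB cache:   block of up to", block_1m, "elements per array")

# Loop sizes taken from the figure's axis labels: a loop of 750 in blocks of 256.
blocks = strip_mine(750, 256)
print(blocks)
```

A larger cache allows longer blocks, so less vector length has to be given up to obtain the same hit rate.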



# Effects of Loop Unrolling on ADB (Case 1)

### Earthquake





### 16-Node Performance in CFD



Comparison with a TX-7 scalar system

|                      | TX-7 (Itanium) | TX-7 (Itanium)  | SX-9            | SX-9          | SX-9          |
|----------------------|----------------|-----------------|-----------------|---------------|---------------|
| Cores                | 1              | 64              | 1               | 16            | 256           |
| Peak Perf.           | 6.4 GF (1x)    | 409.6 GF (64x)  | 102.4 GF (16x)  | 1.6 TF (256x) | 26 TF (4096x) |
| Sustained<br>Speedup | 1x             | 36x             | 21x             | 316x          | 3460x         |

- Started with a scalar-tuned code for TX-7/i9610
- Almost 99.9% vector performance was achieved.
- 0.2 billion cells were solved by the present method.
- Flat MPI shows better parallel efficiency than hybrid.
- 161x speedup obtained on 16 nodes with 256 CPUs
- 9 hours on 16 nodes (256 CPUs) of SX-9, versus 36 days on one TX-7 node with 64 Itanium cores
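The run times quoted here are consistent with the sustained speedups in the comparison table. A minimal sketch of the arithmetic, using only numbers from this slide (the variable names are hypothetical):

```python
# Sustained speedups relative to one TX-7 core, from the comparison table.
speedup_tx7_64 = 36.0
speedup_sx9_256 = 3460.0
peak_ratio_sx9_256 = 4096.0  # peak of 256 SX-9 CPUs relative to one TX-7 core

# Speedup of 16 SX-9 nodes (256 CPUs) over one 64-core TX-7 node.
node_speedup = speedup_sx9_256 / speedup_tx7_64

# Expected SX-9 run time if the TX-7 run takes 36 days.
hours_on_sx9 = 36 * 24 / node_speedup

# Sustained speedup as a fraction of the peak-performance ratio.
relative_efficiency = speedup_sx9_256 / peak_ratio_sx9_256

print(f"{node_speedup:.0f}x faster, {hours_on_sx9:.1f} hours, "
      f"{relative_efficiency:.0%} of the peak ratio")
```

The 36-day TX-7 run maps to roughly 9 hours on 16 SX-9 nodes.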



# Towards Next Generation Vector Computing

Disclaimer: Information provided in this talk does not reflect any future design of the NEC systems.



### SX Performance Trend






### Design Choices for the Next Vector Processor



- increasing flop/s rate
- SMP on a chip
- Limited memory bandwidth
  - decreasing B/F rate per core
- Large on-chip cache
  - decreasing off-chip memory access
  - decreasing I/O driving power
  - Private
    - exclusive, no conflict
    - fast
    - limited capacity
  - Shared
    - large
    - effective for shared data in SMT
    - access conflicts
    - limited B/F rate
  - Distributed shared
  - Multi-level
    - 1st level: fast, private
    - 2nd level: large, shared



Centralized (Shared)



### **Toward a Multi Vector Core Processor!**



A. Musa, Y. Sato, T. Soga, R. Egawa, H. Takizawa, H. Kobayashi, "Caching for A Chip Multi Vector Processor," presented at SC08, 2008.


### Prefetching Effects of the On-chip Shared Vector Cache in Multithreading of the Difference Scheme

**FDTD** kernel

          DO 10 k=0,Nz
          DO 10 i=0,Nx
          DO 10 j=0,Ny
            E_x(i,j,k) = C_x_a(i,j,k) * E_x(i,j,k)
         &    + C_x_b(i,j,k) * ((H_z(i,j,k) - H_z(i,j-1,k))/dy
         &    - (H_y(i,j,k) - H_y(i,j,k-1))/dz - E_x_Current(i,j,k))
            E_z(i,j,k) = C_z_a(i,j,k) * E_z(i,j,k)
         &    + C_z_b(i,j,k) * ((H_y(i,j,k) - H_y(i-1,j,k))/dx
         &    - (H_x(i,j,k) - H_x(i,j-1,k))/dy - E_z_Current(i,j,k))
            E_y(i,j,k) = C_y_a(i,j,k) * E_y(i,j,k)
         &    + C_y_b(i,j,k) * ((H_x(i,j,k) - H_x(i,j,k-1))/dz
         &    - (H_z(i,j,k) - H_z(i-1,j,k))/dx - E_y_Current(i,j,k))
       10 CONTINUE

Slide annotations: the references to plane k-1, `H_y(i,j,k-1)` in the `E_x` update and `H_x(i,j,k-1)` in the `E_y` update, are marked as inter-thread locality; the references within plane k, at i-1 and j-1, are marked as intra-thread locality.

A. Musa, Y. Sato, T. Soga, R. Egawa, H. Takizawa, H. Kobayashi, "Caching for A Chip Multi Vector Processor," presented at SC08, 2008.



### Thread Mapping of Difference Code on Cores

          DO 10 k=0,Nz
          DO 10 j=0,Ny
          DO 10 i=0,Nx
            ... = ... (H_y(i,j,k) - H_y(i,j,k-1)) ...
       10 CONTINUE

| Core 0 | Core 1 |
|--------|--------|
| `DO 10 k=0,Nz,4`<br>`DO 10 j=0,Ny`<br>`DO 10 i=0,Nx`<br>`... = ... (H_y(i,j,k) - H_y(i,j,k-1)) ...`<br>`10 CONTINUE` | `DO 10 k=1,Nz,4`<br>`DO 10 j=0,Ny`<br>`DO 10 i=0,Nx`<br>`... = ... (H_y(i,j,k) - H_y(i,j,k-1)) ...`<br>`10 CONTINUE` |
| **Core 2** | **Core 3** |
| `DO 10 k=2,Nz,4`<br>`DO 10 j=0,Ny`<br>`DO 10 i=0,Nx`<br>`... = ... (H_y(i,j,k) - H_y(i,j,k-1)) ...`<br>`10 CONTINUE` | `DO 10 k=3,Nz,4`<br>`DO 10 j=0,Ny`<br>`DO 10 i=0,Nx`<br>`... = ... (H_y(i,j,k) - H_y(i,j,k-1)) ...`<br>`10 CONTINUE` |
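With this cyclic mapping, the plane k-1 that one core needs is the plane k that a neighboring core has just loaded, so a shared cache acts as a prefetcher between threads. The following is a minimal sketch of that effect, not the simulator used in the cited work. Everything in it is an assumption made for illustration: whole k-planes are the unit of caching, replacement is LRU with a capacity of 8 planes, iterations are interleaved in k order as a stand-in for cores running in lockstep, and the private-cache configuration is a hypothetical baseline.

```python
from collections import OrderedDict


class PlaneCache:
    """LRU cache of k-planes that counts off-chip loads (misses)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.planes = OrderedDict()
        self.misses = 0

    def access(self, k):
        if k in self.planes:
            self.planes.move_to_end(k)       # hit: refresh LRU position
            return
        self.misses += 1                     # miss: plane comes from off-chip memory
        self.planes[k] = True
        if len(self.planes) > self.capacity:
            self.planes.popitem(last=False)  # evict the least recently used plane


def off_chip_loads(nz, cores, shared, capacity=8):
    """Count plane loads when iteration k reads planes k and k-1 and k is dealt cyclically to cores."""
    if shared:
        caches = [PlaneCache(capacity)]
    else:
        caches = [PlaneCache(capacity) for _ in range(cores)]
    for k in range(1, nz + 1):               # start at 1 so that plane k-1 exists
        cache = caches[0] if shared else caches[k % cores]
        cache.access(k - 1)
        cache.access(k)
    return sum(c.misses for c in caches)


shared_loads = off_chip_loads(nz=16, cores=4, shared=True)
private_loads = off_chip_loads(nz=16, cores=4, shared=False)
print("shared cache:", shared_loads, "plane loads; private caches:", private_loads)
```

In this model each plane is fetched from memory once with the shared cache, whereas with private caches every iteration fetches both of its planes, because no core ever revisits a plane it has already loaded.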




### Cache Behavior on Cores






### Performance of Multi Vector-Cores with the Shared Cache




# Prefetching Effects of the On-chip Shared Vector Cache in Multithreading



A. Musa, Y. Sato, T. Soga, R. Egawa, H. Takizawa, H. Kobayashi, "Caching for A Chip Multi Vector Processor," presented at SC08, 2008.

# Lessons learned from SX-9 Experiences and Towards the Next-Generation Vector Computing

Great potential of SX-9

- Powerful tool for "Short Time to Innovations" in computational science
  - Top scores in 19 of the 28 HPCC benchmark tests
- The first on-chip cache mechanism for the SX architecture works well!
  - Definitely covers the lack of off-chip memory bandwidth, but...
  - **more capacity** and **more sophisticated data management** are needed
- Towards the Next-Generation Vector Computing
  - **Virtualization** of distributed vector computing resources
  - Multicore design of the vector architecture
    - Research challenges
      - Hardware/software-controlled optimizations for on-chip data handling are needed
        - Tradeoff between loop unrolling and selective caching with prefetching, and outstanding load handling on a miss
      - New memory hierarchy design for the multicore era of the vector architecture, taking power consumption and sustained performance into account




### Acknowledgements

- Tohoku University
  - Koki Okabe
  - Ryusuke Egawa
  - Hiroyuki Takizawa
  - Ei-ichi Ito
  - Kenji Oizumi
  - Other colleagues and students of the project
- NEC
  - Akihiko Musa
  - Takashi Soga
  - Youichi Shimomura
  - Yoko Isobe
  - Tatsunobu Kokubo
  - Naoyuki Sogo
  - Masaaki Yamagata
  - Other NEC engineers involved in SX R&D



Disclaimer: Information provided in this talk does not reflect any future design of the NEC systems.