Optimizing LBM for Karolina and LUMI supercomputers

Jakub Klinkovský

Czech Technical University in Prague
Faculty of Nuclear Sciences and Physical Engineering
Department of Software Engineering

LBM in Krakow 2024
February 9, 2024

Outline

  1. The code: TNL-LBM
  2. Hardware overview
  3. Performance analysis
  4. Parallel scalability

The Code: TNL-LBM

  • Open-source project: https://gitlab.com/tnl-project/tnl-lbm/
  • TNL ⇨ TNL-LBM ⇨ private modules
  • Modular architecture with pluggable components (collision operators, streaming patterns, boundary conditions, macroscopic quantities, etc.) – a schematic sketch follows below
  • High-performance code in C++ and CUDA
  • Distributed computing with MPI
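
The following is only a schematic illustration of how such compile-time composition can look; the class and function names are hypothetical and do not reflect the actual TNL-LBM interfaces.

// Hypothetical sketch: the collision operator, streaming pattern and boundary
// conditions are template parameters, so each configuration compiles into a
// fully specialized and inlined GPU kernel without virtual dispatch.
template <typename CollisionOperator, typename StreamingPattern, typename BoundaryConditions>
struct LBMUpdate
{
    template <typename LatticeView>
    __host__ __device__ void operator()(LatticeView& lattice, int x, int y, int z) const
    {
        auto f = StreamingPattern::pull(lattice, x, y, z);   // gather distribution functions
        BoundaryConditions::apply(f, lattice, x, y, z);      // treat boundary nodes
        CollisionOperator::collide(f);                       // e.g. a cumulant collision operator
        StreamingPattern::push(lattice, f, x, y, z);         // scatter updated values back
    }
};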

Credits: R. Straka, R. Fučík, P. Eichler, J. Klinkovský, T. Oberhuber, et al.

The Hardware: Karolina

Number of nodes: 72
Processors per node: 2× AMD EPYC 7763 (64 cores, 2.45–3.5 GHz)
Memory per node: 1024 GB DDR4 3200 MT/s
Accelerators per node: 8× NVIDIA A100 (40 GB HBM2 memory)
Intra-node connection: NVLink 3.0 (12 sub-links, 25 GB/s per sub-link per direction)
Inter-node connection: 4× 200 Gb/s InfiniBand ports

Karolina Compute Node Overview

[Figure: Karolina compute node diagram]

The Other Hardware: LUMI

Number of nodes: 2978
Processors per node: 1× AMD EPYC 7A53 "Trento" (64 cores, 2 GHz)
Memory per node: 512 GB DDR4
Accelerators per node: 4× AMD MI250x (128 GB HBM2 memory)
Intra-node connection: Infinity Fabric (50 GB/s per sub-link per direction; also CPU–GPU)
Inter-node connection: Cray Slingshot 11 (25 GB/s per link per direction; directly GPU–GPU)

LUMI-G Compute Node Overview

How To Measure Performance

  • Traditionally in FLOPS (floating-point operations per second)
    • NVIDIA A100: 19.5 TFLOPS in single-precision, 9.7 TFLOPS in double-prec.
    • AMD MI250x: 47.9 TFLOPS in single-precision and double-precision
  • But performance is not only about computing – we also need to access data
    • NVIDIA A100: 1.5 TB/s global memory
    • AMD MI250x: 3.2 TB/s global memory
  • Roofline model – performance (FLOPS) vs operational intensity (FLOPS/byte)
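
For example, the ridge point of the roofline (the operational intensity above which a kernel can reach peak FLOPS) follows directly from the peak values above:

I_{\text{ridge}} = \frac{\text{peak FLOPS}}{\text{peak bandwidth}}: \qquad \text{A100: } \frac{19.5\ \text{TFLOPS}}{1.5\ \text{TB/s}} = 13\ \text{FLOP/B}, \qquad \text{MI250x: } \frac{47.9\ \text{TFLOPS}}{3.2\ \text{TB/s}} \approx 15\ \text{FLOP/B}.

Kernels whose operational intensity lies below the ridge point are limited by memory bandwidth, not by compute.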

Ultimately, we want the computational time to be as short as possible. However, other metrics allow a better comparison across different resolutions, hardware, etc.

For LBM, we use LUPS (lattice updates per second) or MLUPS, GLUPS.
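
A minimal sketch of how this metric is obtained from a benchmark run (the function is illustrative, not part of TNL-LBM):

#include <cstdint>

// One lattice update = one site updated by one time step, so the total number of
// updates is (lattice size) × (number of time steps).
double giga_lattice_updates_per_second(std::int64_t nx, std::int64_t ny, std::int64_t nz,
                                       std::int64_t time_steps, double wall_time_seconds)
{
    const double updates = double(nx) * double(ny) * double(nz) * double(time_steps);
    return updates / wall_time_seconds / 1e9;
}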

Spoiler: NVIDIA A100 vs AMD MI250x

  1. 1× A100

    5.4 GLUPS


  2. ½× MI250x (one of its two GCDs)

    0.74 GLUPS


Benchmark problem: channel flow with the D3Q27 model, cumulant collision operator, lattice size 128³, 256³, or 512³, in single-precision.
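
A back-of-the-envelope check of these numbers, assuming the minimal memory traffic of one read and one write per distribution function (2 × 27 × 4 B = 216 B per lattice update in single precision):

5.4\ \text{GLUPS} \times 216\ \text{B} \approx 1.17\ \text{TB/s} \approx 75\,\%\ \text{of the A100 peak (1.5 TB/s)}, \qquad 0.74\ \text{GLUPS} \times 216\ \text{B} \approx 0.16\ \text{TB/s} \approx 10\,\%\ \text{of one MI250x GCD's peak (≈1.6 TB/s)}.

This already suggests that the A100 kernel runs near the memory-bandwidth limit, while the MI250x is far from it.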

Programming: NVIDIA CUDA vs AMD ROCm/HIP

HIP is a C++ programming interface built on top of ROCm and modeled after CUDA.

Both toolkits provide similar low-level libraries (e.g. cuBLAS and rocBLAS/hipBLAS) and are supported in all major scientific frameworks (e.g. PyTorch, TensorFlow).

CUDA                                  | ROCm/HIP
since 2007                            | since 2016
proprietary                           | open-source (with private development)
official support for all GPUs         | only 9 GPUs are officially supported
extensive API for all GPU features    | limited API compared to CUDA
graphical profiler (Nsight Compute)   | only a command-line profiler (rocprofiler)
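
To illustrate how closely HIP mirrors CUDA (a generic sketch, not code from TNL-LBM): the device code is typically identical and the host API differs essentially only by prefix.

// Identical kernel source compiles with both nvcc and hipcc:
__global__ void scale(float* data, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= a;
}

// Host side, CUDA:                         // Host side, HIP:
// cudaMalloc(&ptr, bytes);                 // hipMalloc(&ptr, bytes);
// scale<<<grid, block>>>(ptr, 2.f, n);     // scale<<<grid, block>>>(ptr, 2.f, n);
// cudaDeviceSynchronize();                 // hipDeviceSynchronize();

This is the kind of translation that hipify automates; performance tuning, however, does not carry over (see below).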

Profiling on NVIDIA GPU

Roofline plot for NVIDIA Quadro RTX 5000 (11 TFlop/s in single-precision, 448 GB/s):

[Figure: roofline plot of the LBM kernel]

The LBM kernel is memory-bound and close to the peak bandwidth on all NVIDIA GPUs. So, what is wrong with the performance on the AMD MI250x?

Profiling on AMD? Well...

  • AMD has only a command-line profiler, with different metrics and configuration than NVIDIA's tools → I don't have any results 🙁
  • AMD models its programming interface after CUDA
    • hipify can transform CUDA code to HIP
    • but performance and optimizations clearly do not translate easily
  • Important hardware differences – e.g. the warp size is always 32 on NVIDIA GPUs, but it can be either 32 or 64 on AMD GPUs (see the sketch after this list)
  • Compiler differences:
    • nvcc optimizes the TNL-LBM code well, but it is proprietary to NVIDIA
    • AMD's compiler is open-source (based on LLVM)
    • LLVM/Clang can also compile CUDA code – but it is not as good as nvcc
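
A sketch of how the warp-size difference can be handled portably; the constant and macro usage below is illustrative, not the approach used in TNL-LBM.

// NVIDIA GPUs always use warps of 32 threads; AMD CDNA accelerators such as the
// MI250x use wavefronts of 64 (some consumer RDNA GPUs use 32). Hard-coding 32,
// e.g. in shuffle-based reductions or block-size heuristics, can waste half of
// each wavefront on AMD, so the value should be a compile-time parameter.
#if defined(__HIP_PLATFORM_AMD__)
constexpr int warp_size = 64;
#else
constexpr int warp_size = 32;
#endif

// example: pick the thread block size as a multiple of the warp/wavefront size
constexpr int threads_per_block = 4 * warp_size;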

Parallel scalability

How to use more GPUs? Split the domain into subdomains and add communication.

Optimizations:

  1. Overlapping computation and communication (LBM in Krakow 2023)
  2. Domain decomposition
    • in this talk: 1D decomposition (slicing along the x-axis); a simplified sketch follows below
    • multidimensional decomposition is work in progress 🚧
  3. Data layout in GPU memory, GPU block size, ...

Sp = \frac{\text{time of 1 worker}}{\text{time of $N$ workers}}, \qquad E\!f\!f = \frac{\text{time of 1 worker}}{N \times \text{time of $N$ workers}} = \frac{Sp}{N}.
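
A heavily simplified sketch of one time step with 1D slicing and overlapped halo exchange; all kernel-launch helpers are hypothetical placeholders, not the TNL-LBM implementation.

#include <mpi.h>

// Each rank owns a slab of the lattice sliced along the x-axis. The interior of the
// slab is updated asynchronously while the two boundary slices are exchanged with
// the neighbouring ranks; the boundary slices are updated last.
void time_step(float* send_left, float* recv_left, float* send_right, float* recv_right,
               int count, int left_rank, int right_rank)
{
    launch_interior_kernel();                        // hypothetical: async update of the slab interior

    pack_boundary_slices(send_left, send_right);     // hypothetical: copy the df's at the slab faces

    MPI_Request reqs[4];
    MPI_Isend(send_left,  count, MPI_FLOAT, left_rank,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(recv_left,  count, MPI_FLOAT, left_rank,  1, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(send_right, count, MPI_FLOAT, right_rank, 1, MPI_COMM_WORLD, &reqs[2]);
    MPI_Irecv(recv_right, count, MPI_FLOAT, right_rank, 0, MPI_COMM_WORLD, &reqs[3]);
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

    launch_boundary_kernel(recv_left, recv_right);   // hypothetical: update the two boundary slices
    synchronize_device();                            // hypothetical: wait for both kernels before the next step
}

At the ends of the domain, MPI_PROC_NULL can be passed as the neighbour rank so the same code handles the boundary ranks.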

Strong Scaling on Karolina

N_nodes   N_GPUs   GLUPS     Sp      Eff
1         1        5.40      1.00    1.00
1         2        10.83     2.01    1.00
1         4        21.79     4.04    1.01
1         8        43.40     8.04    1.00
2         16       85.90     15.91   0.99
4         32       166.90    30.92   0.97
8         64       256.84    47.58   0.74
16        128      263.62    48.84   0.38

Strong Scaling on LUMI-G

N_nodes   N_GPUs   GLUPS     Sp      Eff
1         1/2      0.45      1.00    1.00
1         1        0.93      2.08    1.04
1         2        1.72      3.85    0.96
1         4        3.61      8.12    1.01
2         8        8.06      18.11   1.13
4         16       59.59     133.90  4.18
8         32       107.04    240.53  3.76
16        64       132.95    298.77  2.33

Future Work

  • Finish multidimensional decomposition, optimize it on NVIDIA GPUs
  • Thorough investigation on AMD GPUs, optimize data loading (streaming pattern)

Thank you for your attention!

Dziękuję za uwagę!