Optimizing LBM for Karolina and LUMI supercomputers

Jakub Klinkovský

Czech Technical University in Prague
Faculty of Nuclear Sciences and Physical Engineering
Department of Software Engineering

LBM in Krakow 2024
February 9, 2024

Outline

  1. The code: TNL-LBM
  2. Hardware overview
  3. Performance analysis
  4. Parallel scalability

The Code: TNL-LBM

  • Open-source project: https://gitlab.com/tnl-project/tnl-lbm/
  • TNL ⇨ TNL-LBM ⇨ private modules
  • Modular architecture with pluggable components (collision operators, streaming patterns, boundary conditions, macroscopic quantities, etc.); see the sketch below
  • High-performance code in C++ and CUDA
  • Distributed computing with MPI

Credits: R. Straka, R. Fučík, P. Eichler, J. Klinkovský, T. Oberhuber, et al.
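
A minimal sketch of how such pluggable components can be composed at compile time with C++ templates is given below. The type names (D3Q27, BGKCollision, ABPattern, LBMSolver) are illustrative placeholders, not the actual TNL-LBM API.

    // Illustrative sketch only: hypothetical policy types standing in for
    // TNL-LBM's pluggable components.
    #include <cstddef>

    // Velocity-set traits (D3Q27 as an example).
    struct D3Q27 { static constexpr std::size_t Q = 27; };

    // Collision operator policy: relaxes distributions toward equilibrium.
    template <typename Model>
    struct BGKCollision {
        template <typename Lattice>
        static void apply(Lattice&) { /* relax f toward f_eq */ }
    };

    // Streaming pattern policy: propagates distributions to neighboring sites.
    template <typename Model>
    struct ABPattern {
        template <typename Lattice>
        static void stream(Lattice&) { /* copy f from source to destination array */ }
    };

    // The solver is assembled from the policies at compile time, so the
    // compiler can inline and fuse the whole lattice update (important for
    // GPU kernels, where virtual dispatch would be costly).
    template <typename Model,
              template <typename> class Collision,
              template <typename> class Streaming>
    struct LBMSolver {
        template <typename Lattice>
        void update(Lattice& lattice) {
            Collision<Model>::apply(lattice);
            Streaming<Model>::stream(lattice);
        }
    };

    using Solver = LBMSolver<D3Q27, BGKCollision, ABPattern>;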

The Hardware: Karolina

Number of nodes          72
Processors per node      2× AMD EPYC 7763 (64 cores, 2.45–3.5 GHz)
Memory per node          1024 GB DDR4 (3200 MT/s)
Accelerators per node    8× NVIDIA A100 (40 GB HBM2 each)
Intra-node connection    NVLink 3.0 (12 sub-links, 25 GB/s per sub-link per direction)
Inter-node connection    4× 200 Gb/s InfiniBand ports
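
For a rough sense of scale (ignoring protocol overhead): each A100 has 12 × 25 GB/s ≈ 300 GB/s of NVLink bandwidth per direction, while the node's 4 × 200 Gb/s InfiniBand ports together provide about 100 GB/s per direction, i.e. roughly 12.5 GB/s per GPU when all eight GPUs exchange data across nodes. Inter-node communication is therefore the much tighter bottleneck.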

Karolina Compute Node Overview

[Figure: Karolina compute node overview]

The Other Hardware: LUMI

Number of nodes          2978
Processors per node      1× AMD EPYC 7A53 "Trento" (64 cores, 2 GHz)
Memory per node          512 GB DDR4
Accelerators per node    4× AMD MI250X (128 GB HBM2e each)
Intra-node connection    Infinity Fabric (50 GB/s per link per direction; also used for CPU-GPU links)
Inter-node connection    Cray Slingshot-11 (25 GB/s per link per direction; NICs attached directly to the GPUs)
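
Because the Slingshot NICs are attached directly to the GPUs, a GPU-aware MPI library can move halo data device-to-device without staging through host memory. A minimal sketch of that pattern is given below; the function and buffer names are illustrative (not taken from TNL-LBM), and it assumes the MPI installation is built with GPU support.

    // Illustrative GPU-aware halo exchange: device pointers are passed
    // directly to MPI, so the library can transfer GPU-to-GPU.
    #include <mpi.h>

    void exchange_halos(double* d_send_left, double* d_recv_left,
                        double* d_send_right, double* d_recv_right,
                        int count, int left, int right, MPI_Comm comm)
    {
        MPI_Request reqs[4];
        // Post non-blocking receives and sends with device buffers.
        MPI_Irecv(d_recv_left,  count, MPI_DOUBLE, left,  0, comm, &reqs[0]);
        MPI_Irecv(d_recv_right, count, MPI_DOUBLE, right, 1, comm, &reqs[1]);
        MPI_Isend(d_send_right, count, MPI_DOUBLE, right, 0, comm, &reqs[2]);
        MPI_Isend(d_send_left,  count, MPI_DOUBLE, left,  1, comm, &reqs[3]);
        MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
    }

The same pattern applies on Karolina; the halo buffers would simply be allocated on the device (e.g. with cudaMalloc on Karolina or hipMalloc on LUMI).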

LUMI-G Compute Node Overview