Czech Technical University in Prague Faculty of Nuclear Sciences and Physical Engineering Department of Software Engineering
Credits: R. Straka, R. Fučík, P. Eichler, J. Klinkovský, T. Oberhuber, et al.
Ultimately, we want the computational time to be as short as possible. However, other metrics allow a better comparison across different resolutions, hardware, etc.
For LBM, we use LUPS (lattice updates per second) or MLUPS, GLUPS.
1× A100: 5.4 GLUPS
½× MI250x: 0.74 GLUPS
Benchmark problem: channel flow with the D3Q27 model, cumulant collision operator, lattice size $128^3$, $256^3$, or $512^3$, in single precision.
HIP is a C++ programming interface built on top of ROCm and modeled after CUDA.
Both toolkits provide similar low-level libraries (e.g. cuBLAS and rocBLAS/hipBLAS) and are supported in all major scientific frameworks (e.g. PyTorch, TensorFlow).
Roofline plot for NVIDIA Quadro RTX 5000 (11 TFlop/s peak in single precision, 448 GB/s memory bandwidth):
The LBM kernel is memory-bound and close to the peak bandwidth on all NVIDIA GPUs. So, what is wrong with the performance on the AMD MI250x?
hipify (translates CUDA sources to HIP)
nvcc (NVIDIA's CUDA compiler)
How to use more GPUs? Split the domain into subdomains and add communication.
Optimizations:
$$Sp = \frac{\text{time of 1 worker}}{\text{time of } N \text{ workers}}, \qquad E\!f\!f = \frac{\text{time of 1 worker}}{N \times \text{time of } N \text{ workers}} = \frac{Sp}{N}.$$