Czech Technical University in Prague Faculty of Nuclear Sciences and Physical Engineering Department of Software Engineering
Credits: R. Straka, R. Fučík, P. Eichler, J. Klinkovský, T. Oberhuber, et al.
Ultimately, we want the computational time to be as short as possible. However, other metrics allow a better comparison across different resolutions, hardware, etc.
For LBM, we use LUPS (lattice updates per second) or MLUPS, GLUPS.
1× A100: 5.4 GLUPS
½× MI250x: 0.74 GLUPS
Benchmark problem: channel flow with the D3Q27 model, cumulant collision operator, lattice size $128^3$, $256^3$, or $512^3$, in single precision.
HIP is a C++ programming interface built on top of ROCm and modeled after CUDA.
Both toolkits provide similar low-level libraries (e.g. cuBLAS and rocBLAS/hipBLAS) and are supported in all major scientific frameworks (e.g. PyTorch, TensorFlow).
Roofline plot for NVIDIA Quadro RTX 5000 (11 TFlop/s peak in single precision, 448 GB/s memory bandwidth):
The LBM kernel is memory-bound and close to the peak bandwidth on all NVIDIA GPUs. So, what is wrong with the performance on the AMD MI250x?
hipify (translates CUDA sources to HIP)
nvcc (NVIDIA's CUDA compiler)
How to use more GPUs? Split the domain into subdomains and add communication.
Optimizations:
$$Sp = \frac{\text{time of 1 worker}}{\text{time of } N \text{ workers}}, \qquad E\!f\!f = \frac{\text{time of 1 worker}}{N \times \text{time of } N \text{ workers}} = \frac{Sp}{N}.$$