22 GLUPS with TNL-LBM
in Python

Jakub Klinkovský

Department of Software Engineering, FNSPE CTU in Prague

Workshop on Scientific Computing 2026, Děčín

Motivation

  • Lattice Boltzmann Method (LBM) is powerful for numerical simulations of fluid flows
  • High-performance implementations require GPUs and distributed computing
  • Setting up simulations in C++/CUDA involves a slow edit-compile-run cycle
  • Goal: combine HPC performance with Python's ease of use

Write HPC code in C++, use it interactively from Python

Template Numerical Library

TNL — open-source C++ header-only library providing building blocks for HPC numerical solvers

  • Unified interface for multicore CPUs, GPUs (CUDA, HIP), and distributed systems (MPI)
  • Core building blocks: parallel algorithms, linear algebra, numerical meshes, ODE solvers, optimization methods, etc.

TNL Modules

All following modules have full multi-GPU support based on the common core library.

Module Method Applications
TNL‑SPH Smoothed Particle Hydrodynamics Free-surface flow
TNL‑MHFEM Mixed-Hybrid Finite Element Method Multiphase flow in porous media, advection-diffusion-reaction PDEs
TNL‑LBM Lattice Boltzmann Method Turbulent flow, incompressible Navier-Stokes, non-Newtonian flow
PyTNL Python bindings Modern UX for TNL-MHFEM and TNL-LBM

And more... (performance benchmarks, incubating projects, etc.)

AI Assistance Disclosure

  • TNL (the core library) contains some AI-assisted (vibecoded ) commits
  • PyTNL and TNL-LBM are still 100% human work
  • These slides were about 90% vibecoded
    • Assisted-by: opencode (using GLM-5.1)
  • The speaker that you see and hear is still 100% human too

TNL-LBM Key Features

  • Modular design with pluggable components combined at compile time
  • Cumulant collision operator for D3Q27 and A-A pattern streaming
  • NDArray data layout — optimized for GPU memory coalescing
  • CUDA-aware MPI — overlapping computation and communication
using NSE_CONFIG = LBM_CONFIG<
    TraitsSP,                    // float on GPU
    D3Q27_KernelStruct,          // 27-velocity kernel
    NSE_Data_ConstInflow,        // constant inflow data
    D3Q27_CUM,                   // cumulant collision
    D3Q27_EQ_INV_CUM,            // cumulant-based equilibrium
    D3Q27_STREAMING,             // A-A pattern streaming
    D3Q27_BC_All,                // all boundary conditions
    D3Q27_MACRO_Default >;       // rho, vx, vy, vz

TNL-LBM State Class

State<NSE_CONFIG> — the simulation orchestrator; holds all state and provides virtual methods for customization

template <typename NSE>
struct State {
    // Key virtual methods (override in subclass):
    virtual void setupBoundaries();           // set BCs on lattice faces
    virtual void resetDFs();                  // set initial distribution funcs
    virtual void updateKernelVelocities();    // inflow velocity profile
    virtual void getOutputDataNames();        // return names of output arrays
    virtual void outputData();                // output data for visualization
};

Users subclass as StateLocal : State<NSE_CONFIG>, override virtuals, then call execute(state)

Comparison with Other Projects

Benchmark on NVIDIA RTX A5000, lattice 2563256^3, D3Q27 cumulant

Project SP GLUPS DP GLUPS
waLBerla 3.120 0.656
TNL-LBM * 3.087 0.354
VirtualFluids 2.708 0.167
OpenLB ** 2.509 0.411
TCLB 0.680 0.283
Palabos 0.197 0.191

SP — Single Precision float (32-bit)

DP — Double Precision double (64-bit)

GLUPS — Giga Lattice Updates Per Second (10910^9 site updates/s)


* ~10% improvement in SP over our recent publication in Journal of Supercomputing

** cumulant operator not available → BGK

TNL-LBM is competitive among open-source LBM codes

The C++ Workflow Problem

Setting up an LBM simulation in C++/CUDA:

  1. Write/modify source code or config files
  2. Recompile the solver (minutes for CUDA)
  3. Run the simulation
  4. Parse output files separately
  5. Repeat...

This is slow for prototyping and experimentation

Improved Workflow

What if we could:

  • Define boundary conditions interactively in a Jupyter notebook?
  • Change inflow velocities without recompilation?
  • Orchestrate simulations directly from Python without subprocess.run?
  • Use NumPy/Matplotlib directly on simulation data?

From C++ to Python

...welcome PyTNL bindings 🎉

PyTNL Installation

Python bindings for TNL data structures — powered by nanobind, with DLPack interop (NumPy / CuPy zero-copy)

Installation (source build — compiles C++ extensions):

  1. Install dependencies:
    • Arch Linux: pacman -S base-devel git python openmpi
    • Ubuntu: apt install build-essential git python3-dev libopenmpi-dev
  2. (Optional) Install CUDA
  3. Install PyTNL: pip install pytnl

👉 Import with import pytnl (or some submodule, see later)

🐞 Bugs? Report at gitlab.com/tnl-project/pytnl/-/issues

📦 PyTNL Package Structure

import pytnl
import pytnl.devices              # Host / Cuda device selectors
import pytnl._containers          # CPU binary (compiled C++)
import pytnl._containers_cuda     # GPU binary (compiled C++)
import pytnl.containers           # Python wrapper: Array, Vector, NDArray, ...
import pytnl._matrices            # CPU binary
import pytnl._matrices_cuda       # GPU binary
import pytnl.matrices             # Python wrapper: sparse/dense matrices
import pytnl._meshes              # CPU binary
import pytnl._meshes_cuda         # GPU binary
import pytnl.meshes               # Python wrapper: Grid, Mesh, topologies

Binary modules start with _ and are considered internal
Python wrappers (containers, matrices, meshes) provide clean high-level interface and dispatching for C++ templates

🪄 C++ Templates in Python

C++ template classes get subscript syntax via metaclass — Class[params] resolves to the compiled C++ type:

import pytnl.devices
from pytnl.containers import Array, NDArray, DistributedNDArray
from pytnl.meshes import Grid, Mesh, topologies

a = Array[float]()                       # → _containers.Array_float
a = Array[float, devices.Cuda]()         # → _containers_cuda.Array_float
nd = NDArray[3, int]()                   # → _containers.NDArray_3_int
nd = NDArray[3, float, devices.Cuda]()   # → _containers_cuda.NDArray_3_float
g = Grid[3]()                            # → _meshes.Grid_3
m = Mesh[topologies.Triangle]()          # → _meshes.Mesh_Triangle
  • Device defaults to Host; specifying Cuda switches to the _cuda binary module
  • All objects have full type annotations — enjoy IntelliSense in your IDE 🤩

pytnl_lbm — LBM Bindings

Python bindings for TNL-LBM — built from the TNL-LBM repo (PyTNL + TNL + nanobind fetched by CMake automatically)

git clone https://gitlab.com/tnl-project/tnl-lbm.git
cd tnl-lbm
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
cmake --build build            # produces build/pytnl_lbm/pytnl_lbm.so

The module contains bindings for all LBM components:

  • Compile-time config (fixed): D3Q27 + Cumulant + A-A pattern
  • Runtime config (from Python): domain size, boundaries, physical parameters

🐞 Bugs? Report at gitlab.com/tnl-project/tnl-lbm/-/issues

The Trampoline Pattern

Python subclasses can override C++ virtual methods via nanobind trampoline:

template <typename NSE>
struct PyState : public State<NSE> {
    NB_TRAMPOLINE(State<NSE>, 16);

    void setupBoundaries() override {
        NB_OVERRIDE(setupBoundaries);
    }
    void updateKernelVelocities() override {
        NB_OVERRIDE(updateKernelVelocities);
    }
    // ... 14 more virtual methods
};

Simulation in C++ vs Python

Same class hierarchy, same virtual method overrides, no recompilation needed 😊

CUDA vs Python Benchmark

Helios, 4× NVIDIA A100-SXM4-80GB, D3Q27 cumulant, SP,
lattice 2560×640×6402560 \times 640 \times 640 (~234 GB memory used)

mpirun $MPIRUNFLAGS python ./sim_NSE/sim_1.py --resolution 20

C++/CUDA Python
GLUPS 22.022 22.022
SimUpdate time 1729.6 s 1729.6 s
Total walltime 1991.7 s 1980.6 s

Python overhead is negligible — the execute(state) call runs the entire time loop in C++. Python only affects setup and per-timestep callbacks (but additional work here affects C++ too).

Summary

  1. PyTNL provides nanobind-based Python bindings for TNL with DLPack interop
  2. pytnl_lbm enables defining LBM simulations from Python via virtual method trampolines
  3. Same simulation can be written in C++ or Python
  4. Python overhead is negligible for simulations — compute stays in C++/CUDA

Limitations & Future Work

  • Changing LBM config (e.g. precision, collision operator, velocity set) requires recompilation
  • Only SP_D3Q27_CUM_ConstInflow instantiation exported so far — more configs planned
  • TNL's DistributedNDArray does not map nicely to NumPy arrays

Thank You! Questions?

Repositories:

Main publication:

J. Klinkovský, P. Eichler, R. Straka, T. Oberhuber, R. Fučík,
TNL-LBM: Scalable LBM implementation based on Template Numerical Library, J. Supercomputing 82, 167 (2026).