Parallel Forall blog posts
- https://devblogs.nvidia.com/faster-parallel-reductions-kepler/
- https://devblogs.nvidia.com/using-shared-memory-cuda-cc/
- https://devblogs.nvidia.com/cuda-8-features-revealed/
- https://devblogs.nvidia.com/inside-pascal/
- https://devblogs.nvidia.com/cuda-9-features-revealed/
- https://devblogs.nvidia.com/cooperative-groups/
- https://devblogs.nvidia.com/using-cuda-warp-level-primitives/
- https://devblogs.nvidia.com/introduction-cuda-aware-mpi/
- https://devblogs.nvidia.com/benchmarking-cuda-aware-mpi/
- https://devblogs.nvidia.com/benchmarking-gpudirect-rdma-on-modern-server-platforms/
- https://devblogs.nvidia.com/fast-multi-gpu-collectives-nccl/
- https://devblogs.nvidia.com/cuda-pro-tip-profiling-mpi-applications/
- https://devblogs.nvidia.com/unified-memory-cuda-beginners/
- https://devblogs.nvidia.com/beyond-gpu-memory-limits-unified-memory-pascal/
- https://devblogs.nvidia.com/maximizing-unified-memory-performance-cuda/
- https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/
- https://developer.nvidia.com/blog/accelerating-standard-c-with-gpus-using-stdpar/
- https://developer.nvidia.com/blog/accelerating-python-on-gpus-with-nvc-and-cython/
- https://developer.nvidia.com/blog/multi-gpu-programming-with-standard-parallel-c-part-1/
- https://developer.nvidia.com/blog/multi-gpu-programming-with-standard-parallel-c-part-2
- https://developer.nvidia.com/blog/developing-accelerated-code-with-standard-language-parallelism/
- https://developer.nvidia.com/blog/cutlass-linear-algebra-cuda/
CUDA pro tips
- https://devblogs.nvidia.com/cuda-pro-tip-increase-performance-with-vectorized-memory-access/ – optimization via vectorized memory access using built-in vector types
- https://devblogs.nvidia.com/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
- https://devblogs.nvidia.com/cuda-pro-tip-nvprof-your-handy-universal-gpu-profiler/
- https://devblogs.nvidia.com/cuda-pro-tip-generate-custom-application-profile-timelines-nvtx/ (see also http://www.vi-hps.org/projects/score-p and https://www.vampir.eu/)
- https://devblogs.nvidia.com/cuda-pro-tip-minimize-the-tail-effect/
- https://devblogs.nvidia.com/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/
- https://developer.nvidia.com/blog/cuda-pro-tip-understand-fat-binaries-jit-caching/
- https://developer.nvidia.com/blog/cuda-pro-tip-the-fast-way-to-query-device-properties/
- https://developer.nvidia.com/blog/maximizing-performance-with-massively-parallel-hash-maps-on-gpus/
- https://developer.nvidia.com/blog/controlling-data-movement-to-boost-performance-on-ampere-architecture/
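Two of the pro tips above (grid-stride loops and vectorized memory access) combine naturally. A minimal sketch, assuming the input size is a multiple of 4 and the pointers are 16-byte aligned; the kernel name and launch parameters are illustrative:

```cuda
// Grid-stride copy kernel using vectorized float4 accesses.
// A production kernel would also handle the scalar remainder
// when n is not a multiple of 4.
__global__ void copy_vec4(const float4* __restrict__ in,
                          float4* __restrict__ out, size_t n4)
{
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n4;
         i += (size_t)gridDim.x * blockDim.x)
        out[i] = in[i];  // one 16-byte load + one 16-byte store per iteration
}

// The grid size is independent of the problem size thanks to the stride loop:
// copy_vec4<<<num_sms * 4, 256>>>(in4, out4, n / 4);
```

The stride loop lets one launch configuration cover any problem size, while the float4 accesses quarter the number of memory instructions compared to scalar loads.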
Compilers
- GPUCC: An Open-Source GPGPU Compiler
- Compiling CUDA with clang
- CppCon 2016: "Bringing Clang and C++ to GPUs: An Open-Source, CUDA-Compatible GPU C++ Compiler"
Tools
nvvp does not support new GPUs (starting with Volta/Turing, sm_7*).
nvvp (GUI) and nvprof (CLI) are marked as "deprecated" in the documentation; they are being replaced by Nsight Compute (for profiling individual CUDA kernels) and Nsight Systems (for system-wide profiling, including CPU activity and the overall timeline). The new tools are finally Java-free, but they still have some quirks (e.g. remote connections ignore .ssh/config). I also could not find a way to export a report to PDF. Both tools require at least sm_7*. They can be launched with the ncu-ui and nsys-ui commands.
- https://developer.nvidia.com/blog/migrating-nvidia-nsight-tools-nvvp-nvprof/
- https://developer.nvidia.com/blog/transitioning-nsight-systems-nvidia-visual-profiler-nvprof/
- https://developer.nvidia.com/blog/using-nsight-compute-to-inspect-your-kernels/
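Typical command-line invocations of the new tools (the flags below exist in the documented CLIs; the application and report names are placeholders):

```
# System-wide timeline (replaces the nvvp/nvprof timeline view)
nsys profile -o timeline ./my_app

# Per-kernel metrics (replaces nvprof's --metrics)
ncu -o kernels --set full ./my_app
```

The resulting .nsys-rep and .ncu-rep files can then be opened in nsys-ui and ncu-ui, including over a remote connection.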
Miscellaneous
- CUDA syntax cheatsheet
- __ldg() instruction (obsoleted by the "unified cache" in the Pascal architecture)
- Demystifying GPU Microarchitecture through Microbenchmarking
- Useful nvidia-smi queries
- From http://stackoverflow.com/questions/43953872/need-help-to-deal-with-overlap-of-gpu-and-cpu-execution: "unless you put markers in the host code, there is no direct way from the profiler to tell what exactly is overlapping with the kernel call". On markers, see http://docs.nvidia.com/cuda/profiler-users-guide/index.html#marking-regions-of-cpu-activity
- What is the canonical way to check for errors using the CUDA runtime API?
- CUDA toolkit documentation: Incomplete-LU and Cholesky Preconditioned Iterative Methods Using cuSPARSE and cuBLAS
- CUDA Driver API vs. CUDA runtime
- Reduce and Scan
- julia:
- cuobjdump -res-usage <binary-file-name> dumps the resource usage of all kernels in the binary file – see CUDA Binary Utilities
- Dan Alcantara – hashing on GPUs (implemented in CUDPP)
- A Simple GPU Hash Table (code)
- Virtual functions in CUDA:
- ZLUDA is CUDA for AMD GPUs
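On the canonical error-checking question linked above: the usual pattern is a macro that wraps every runtime-API call. A minimal sketch (the macro name is illustrative, not from any official header):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Report file/line and abort on any runtime-API failure.
#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",           \
                         cudaGetErrorString(err_), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc(&d_ptr, bytes));
//   kernel<<<grid, block>>>(...);
//   CUDA_CHECK(cudaGetLastError());        // catches launch errors
//   CUDA_CHECK(cudaDeviceSynchronize());   // catches asynchronous execution errors
```

Note that kernel launches return no error code themselves, hence the separate cudaGetLastError() and cudaDeviceSynchronize() checks after the launch.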
Popular articles
- More Oil, Less Toil: How GPUs Can Make the Most of Fossil Fuel Resources
- Comparing NVLink vs PCI-E with NVIDIA Tesla P100 GPUs on OpenPOWER Servers
- https://diit.cz/clanek/nvidia-volta-gv100
- https://diit.cz/clanek/nvidia-titan-v-uveden
- Nvidia has banned the use of GeForce and Titan GPUs in data centres
- Nvidia's Titan V GPUs spit out 'wrong answers' in scientific simulations
- 25 Years Later: A Brief Analysis of GPU Processing Efficiency
Benchmarks
- Rodinia Benchmark Suite
- Achievable triad bandwidth on GPUs, with ATS/system-allocator test
- A GPU benchmark tool for evaluating GPUs on mixed operational intensity kernels (CUDA, OpenCL, HIP)
- Multi-GPU Computing Benchmark Suite (CUDA)
- Micro-benchmarking CUDA-capable GPGPUs to understand the architecture of each Streaming Multiprocessor (SM) in handling outstanding memory requests
- A benchmark / comparison of Mergesort and FFT in CUDA, MPI, and OpenMP