Exploring SYCL for Batched Kernels with Memory Allocations

Aymeric Millan, Maison de la Simulation

Batched parallelism with local allocations is an extremely common pattern in HPC, appearing in multi-dimensional FFTs, neural network processing, and split computation of numerical operators. Supporting it efficiently is especially complex on GPUs, where memory per thread is limited and dynamic memory allocations are challenging. This study investigates whether the native abstractions of SYCL can support performance portability for this pattern. We implement versions of a batched semi-Lagrangian advection kernel using each parallel construct of SYCL. We evaluate them in terms of maintainability, performance portability, and memory footprint on CPUs and GPUs (AMD, Intel, NVIDIA), with two distinct SYCL implementations (AdaptiveCpp and DPC++). Our results demonstrate that no single parallel construct of SYCL emerges as the best solution and that a construct offering a higher level of abstraction would be required to support this common pattern.

Aymeric Millan is a second-year PhD student specializing in high-performance programming models at "Maison de la Simulation," a joint laboratory involving CEA, CNRS, Université Paris-Saclay, and Université Versailles Saint-Quentin. The lab focuses on high-performance computing and numerical simulation, with work spanning physical applications, parallel software engineering, programming models, visualization techniques, and artificial intelligence.

Aymeric holds a master’s degree in HPC-AI, where he developed expertise in high-performance computing and artificial intelligence. His research primarily involves optimizing GPU code performance to enhance computational efficiency. During his previous internship, Aymeric worked on accelerating AI inference in conjunction with physical simulation codes, gaining experience integrating AI with high-performance computing.

His PhD research currently focuses on developing programming models that optimize hardware usage, specifically for nested parallelism with local memory allocations. He aims to manage GPU memory layers in an optimal and portable way for specific usage patterns, in order to improve overall computational performance.

See upcoming and previous presentations at the CS Seminar Series.