Improving the Performance of OpenMP Using Lightweight Threads

Adrián Castelló
Seminar

In recent years, most of the growth in the computational power of computers has come from increasing the total number of cores. The same trend is visible in co-processors such as the Intel Xeon Phi. However, a higher number of cores demands extra effort to extract all of the available computational power through parallel codes. Sequential codes can be parallelized in different ways: manually, by directly using thread libraries such as pthreads, or with compiler support, through directive-based programming models such as OpenMP. Both approaches would work well on current multi- and manycore architectures if they were applied carefully and efficiently to application codes, but they are usually realized as a single common implementation. Each piece of parallel code can have different requirements (e.g., data size, communication, computation, or I/O), yet current solutions create the same thread configuration for all of them. Moreover, the thread context switches in those solutions add a non-negligible overhead, which is clearly noticeable when implementing fine-grained parallelism.

In this regard, adaptable and lightweight threads can be used to handle several parallel code structures. They can improve performance by reducing the sometimes unnecessary overhead introduced by conventional heavyweight threads and by adapting their structure to the code requirements. In this work, we study several parallel code patterns in OpenMP and investigate the use of lightweight user-level threads (ULTs) in the translation of those OpenMP codes. A lightweight ULT library called Argobots is used in this study, and the ULTs and tasklets supported in Argobots are applied to implement OpenMP threads and tasks. We analyze two well-known OpenMP implementations, GCC OpenMP and Intel OpenMP, as well as our translation of OpenMP codes to Argobots, for various OpenMP pragmas including nested parallel constructs, task creation, and task yield. Experimental results and analyses on a 36-core Intel machine show that current pthreads-based OpenMP implementations have some limitations in terms of performance and expressiveness, and that lightweight threads can overcome these limitations with low-overhead context switching and flexible scheduling capabilities.
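
As a rough illustration of the code patterns named above (nested parallel constructs, task creation, and task yield), the following is a minimal sketch in C with standard OpenMP pragmas. It is not taken from the talk and is not the Argobots translation. The function names nested_sum and task_sum and the constants OUTER, INNER, and MAX_TASKS are hypothetical and chosen only for this example.

```c
#include <assert.h>

/* Hypothetical sizes, chosen only for illustration. */
#define OUTER 4
#define INNER 8
#define MAX_TASKS 64

/* Nested parallel constructs: every outer iteration opens its own inner
 * parallel region. Whether the inner region really gets extra threads
 * depends on the runtime settings (e.g. OMP_MAX_ACTIVE_LEVELS). */
static long nested_sum(void) {
    long total = 0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < OUTER; i++) {
        long row = 0;
        #pragma omp parallel for reduction(+:row)
        for (int j = 0; j < INNER; j++) {
            row += (long)i * INNER + j;
        }
        total += row;
    }
    return total;
}

/* Task creation and task yield: one thread creates n small tasks, and
 * each task offers the scheduler a point at which it may be suspended. */
static long task_sum(int n) {
    long results[MAX_TASKS] = {0};
    assert(n >= 0 && n <= MAX_TASKS);

    #pragma omp parallel
    {
        #pragma omp single
        {
            for (int k = 0; k < n; k++) {
                #pragma omp task firstprivate(k) shared(results)
                {
                    long v = (long)k * k;
                    /* Hint only: the runtime may switch to another task here. */
                    #pragma omp taskyield
                    results[k] = v;
                }
            }
            /* Wait for all tasks created by this thread. */
            #pragma omp taskwait
        }
    }

    long sum = 0;
    for (int k = 0; k < n; k++) {
        sum += results[k];
    }
    return sum;
}
```

Calling nested_sum() returns 496 (the sum of 0 through 31) and task_sum(10) returns 285 (the sum of the squares of 0 through 9), regardless of the number of threads. The sketch uses only pragmas and no runtime library calls, so it also builds and gives the same results when OpenMP support is disabled. In a pthreads-based runtime each thread in these regions is an operating-system thread, which is where the overhead discussed above arises for fine-grained work.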

Bio:
Adrián Castelló is a Research Assistant and first-year Computer Science Ph.D. student at Universitat Jaume I de Castelló (Spain). He received his BS and MS in Computer Science from the same university in 2011 and 2014, respectively. His current research interests focus on (but are not limited to) GPU computing and HPC networks. He worked as a Software Developer at Universitat Jaume I de Castelló between 2011 and 2014.