FastForward for Efficient Pipeline Parallelism: A Cache-Optimized Concurrent Lock-Free Queue

John Giacomoni
Seminar

The industry-wide shift to parallel architectures is a serious problem for sequential applications. For decades, application developers could rely on rising transistor counts and improving fabrication technology to increase performance without parallel algorithm designs; that trend has ended. The question today is, "How do we continue to increase the performance of sequential applications on parallel platforms?" We believe that fine-grain thread-level parallelism (TLP) is a viable approach for many applications on general-purpose multicore/multiprocessor machines, given: 1) a low-overhead, software-only core-to-core communication primitive, and 2) the ability to control timing jitter between threads.

This talk will present the design and implementation of FastForward, our cache-optimized concurrent lock-free queue designed to support fine-grain thread-level parallelism (TLP) on general-purpose commodity multicore/multiprocessor systems. In this context, we operationally define fine-grain TLP as work whose per-unit overhead is 0.5 to 5 times that of an empty interrupt handler (approximately 100-1000 nanoseconds per unit of work). FastForward's enqueue and dequeue times on a 2.6 GHz AMD Opteron 2218-based system are as low as 28.5 nanoseconds, up to 5x faster than the next-best solution.
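To make the core idea concrete, the following is a minimal sketch of a FastForward-style single-producer/single-consumer queue. The names (ff_queue, ff_enqueue, ff_dequeue, BUF_SIZE) are illustrative rather than FastForward's actual API, and the sketch omits the memory-ordering barriers and cache-line "temporal slipping" details needed to achieve the quoted performance. The key point it does show: the slot contents themselves (NULL vs. non-NULL) signal empty/full, so the producer and consumer never read each other's index variables, and those indices never bounce between caches.

    /* Sketch of a FastForward-style SPSC queue in C. Assumes the queue
     * is zero-initialized (all slots NULL) and that payloads are never
     * NULL, since NULL marks an empty slot. Real code also needs
     * compiler/memory barriers appropriate to the target architecture. */
    #include <stddef.h>

    #define BUF_SIZE 1024                     /* power of two */
    #define NEXT(i)  (((i) + 1) & (BUF_SIZE - 1))

    struct ff_queue {
        void  *buf[BUF_SIZE];  /* NULL == empty slot */
        size_t head;           /* written only by the producer */
        size_t tail;           /* written only by the consumer */
    };

    /* Producer: a non-NULL slot means "full", so no shared index is read. */
    int ff_enqueue(struct ff_queue *q, void *data)
    {
        if (q->buf[q->head] != NULL)
            return -1;                        /* full; caller retries */
        q->buf[q->head] = data;
        q->head = NEXT(q->head);
        return 0;
    }

    /* Consumer: reads a slot, then resets it to NULL to mark it empty. */
    int ff_dequeue(struct ff_queue *q, void **data)
    {
        void *d = q->buf[q->tail];
        if (d == NULL)
            return -1;                        /* empty; caller retries */
        *data = d;
        q->buf[q->tail] = NULL;
        q->tail = NEXT(q->tail);
        return 0;
    }

Because each side touches only its own private index plus the buffer itself, the common case involves no lock, no atomic read-modify-write, and no shared control cache line, which is what makes operation times in the tens of nanoseconds plausible.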

Additionally, we will demonstrate the ease of use and effectiveness of FastForward and fine-grain TLP for real applications, drawing on our work on line-rate Gigabit Ethernet frame processing on general-purpose commodity hardware.