SWAP-Assembler 2: Scalable Genome Assembler towards Millions of Cores - Practice and Experience

Jintao Meng
Seminar

There is a widening gap between the throughput of massively parallel sequencing machines and the ability to analyze the resulting sequencing data, which can amount to hundreds of terabytes or even petabytes. Our assembly tool, SWAP-Assembler, can scale to 2048 cores on TianHe 1A for the Yanhuang human genome. The work at Argonne aims to scale SWAP-Assembler to the whole of Mira (768k cores); we can currently scale to 64k cores.

SWAP-Assembler is divided into five steps, of which the most time-consuming are input & output, k-mer graph construction, and graph simplification (edge merging). We optimize these three steps so that the share of the total run time spent in each stays constant as the number of cores increases. For the input & output step, the input data is divided into virtual blocks of almost equal size, and the begin and end positions of each block are automatically aligned to the beginning symbol of a read. This data blocking strategy plays a central role in adjusting the data size to maintain communication and memory efficiency in the subsequent steps. In k-mer graph construction, to prevent communication efficiency from degrading, the message size between any two processes is kept constant (about 1 KB) in each round by increasing the number of data blocks in the input & output step in proportion to the number of processes. Memory usage also benefits, as only a small part of the input data is processed in each round. Within graph simplification, the major improvements are (1) removing the heavy MPI_Iprobe routine by keeping a global MPI_Irecv posted in each process for any incoming service requests, (2) using MPI_Ibarrier to synchronize the completion state while minimizing the overhead among all processes, and (3) combining the sending and receiving of messages between the two neighbors into one loop in the communication protocol.
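
The data blocking idea described above can be illustrated with a minimal Python sketch. It is not taken from SWAP-Assembler itself. It assumes FASTA-style input held in memory, where every read record begins with ">" at the start of a line, and the names align_to_read_start and virtual_blocks are hypothetical. Each nominal, equal-size cut point is moved forward to the next read start, so no read is split across two blocks.

```python
from typing import List, Tuple


def align_to_read_start(data: bytes, pos: int, marker: bytes = b">") -> int:
    """Return the first offset >= pos at which a read record begins.

    Assumption: a record starts at offset 0 or right after a newline,
    and its first byte is `marker` (FASTA-style input).
    """
    if pos <= 0:
        return 0
    # Searching from pos - 1 also catches a record starting exactly at pos.
    hit = data.find(b"\n" + marker, pos - 1)
    return len(data) if hit == -1 else hit + 1


def virtual_blocks(data: bytes, num_blocks: int) -> List[Tuple[int, int]]:
    """Split `data` into `num_blocks` half-open (begin, end) byte ranges.

    The ranges have almost equal size and never cut a read in two.
    A range may be empty if a single read is longer than a nominal block.
    """
    size = len(data)
    # Nominal equal-size cut points, each moved forward to a read start.
    cuts = [align_to_read_start(data, size * i // num_blocks)
            for i in range(num_blocks)]
    cuts.append(size)
    return [(cuts[i], cuts[i + 1]) for i in range(num_blocks)]


if __name__ == "__main__":
    reads = b">r1\nACGT\n>r2\nGGCC\n>r3\nTTAA\n>r4\nCCGG\n"
    for begin, end in virtual_blocks(reads, 3):
        print(begin, end, reads[begin:end])
```

Choosing the number of blocks as a multiple of the process count would let every process take one block per round, which is one way to keep the per-round data volume, and hence the message size, roughly constant as the process count grows.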

In our experiment on the human dataset, the modified SWAP-Assembler scales to 32,768 cores with a parallel efficiency of 70.7%. With a mixed fish and human dataset, it scales to 32,768 cores with an efficiency of 81.1%.

Bio:
Jintao Meng is an engineer at the Center for High Performance Computing, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, and has also been a PhD student at the Institute of Computing Technology, Chinese Academy of Sciences, since 2011. He completed his MS degree in computer science at Central China Normal University in 2008. His current research interests include parallel and distributed algorithms, high performance computing, and bioinformatics.