High Performance Data Analysis Framework

Pragnesh Patel
Seminar

The R programming language is known for its diversity and sophistication in data analysis; however, its scalability to big data has been lacking. In our work that resulted in the pbdR suite of R packages, the integration of scalable libraries and the development of ease-of-use components inside R are firmly rooted in best practices from the HPC community. This is a requirement for effective integration of the HPC components, and it is a departure from some traditional practice in R. We favor realigning R’s parallel computing infrastructure with standards in HPC rather than continuing non-standard developments that have limited scalability and limited ability to leverage results from the HPC community. We have developed several packages that provide a tight coupling of R with highly scalable libraries, enabling scalability to terabytes of data on tens of thousands of cores. We have released core and application packages on CRAN. The packages naturally separate into four categories: General, I/O, Computation, and Application. I will present the pbdR packages along with applications.