Designing an MPI-Based Parallel and Distributed Machine Learning Platform on Large-Scale HPC Clusters

Zhi-Jie Yan, Teng Gao, and Qiang Huo

Abstract

This paper presents the design of an MPI-based parallel and distributed machine learning platform on large-scale HPC clusters. Researchers and practitioners can easily implement a class of parallelizable machine learning algorithms on the platform, or quickly port an existing non-parallel implementation of a parallelizable algorithm to the platform with only minor modifications. Complicated parallel-programming functions such as scheduling, caching, and load balancing are handled automatically by the platform. The platform's performance was evaluated in a series of stress tests using a k-means clustering task on 7,500 hours of speech data (about 2.7 billion 52-dimensional feature vectors). Good scalability is demonstrated on an HPC cluster with thousands of CPU cores.
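To illustrate the class of parallelizable algorithms the platform targets, the sketch below shows one iteration of distributed k-means as it is typically structured for MPI: each worker computes partial sufficient statistics (per-cluster sums and counts) over its local data shard, a global sum-reduction (MPI_Allreduce in a real deployment) combines them, and every worker then updates the centroids identically. This is a minimal single-process sketch, not the paper's actual implementation; the cross-worker reduction is simulated by accumulating over shards in one loop.

```python
import numpy as np

def kmeans_step(shards, centroids):
    """One distributed k-means iteration (single-process sketch).

    Each shard plays the role of one MPI rank's local data. Partial
    sufficient statistics are accumulated across shards, standing in
    for an MPI_Allreduce sum, then the centroids are re-estimated.
    """
    k, d = centroids.shape
    partial_sums = np.zeros((k, d))
    partial_counts = np.zeros(k)
    for shard in shards:
        # assign every local vector to its nearest centroid
        dist = np.linalg.norm(shard[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            mask = labels == j
            partial_sums[j] += shard[mask].sum(axis=0)
            partial_counts[j] += mask.sum()
    # after the (simulated) Allreduce, every worker updates identically
    nonempty = partial_counts > 0
    new_centroids = centroids.copy()
    new_centroids[nonempty] = (
        partial_sums[nonempty] / partial_counts[nonempty, None]
    )
    return new_centroids
```

Because only the fixed-size statistics (k sums and k counts) cross the network per iteration, communication cost is independent of the data size, which is what makes this pattern scale to billions of feature vectors.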

Details

Publication type: Inproceedings
Published in: International Workshop on Statistical Machine Learning for Speech Processing, IWSML 2012
Publisher: IEEE