Hadoop: The Default Machine Learning Platform?

Apache Hadoop, since its humble beginning as an execution engine for web crawler and building search indexes, has matured into a general purpose distributed application platform and data store. Large Scale Machine Learning (LSML) techniques and algorithms proved to be quite tricky for Hadoop to handle, ever since we started offering Hadoop as a service at Yahoo in 2006. In this talk, I will discuss early experiments of implementing LSML algorithms on Hadoop at Yahoo. I will describe how it changed Hadoop, and led to generalization of the Hadoop platform to accommodate programming paradigms other than MapReduce. I will unveil some of our recent efforts to incorporate diverse LSML runtimes into Hadoop, evolving it to become *THE* LSML platform. I will also make a case for an industry-standard LSML benchmark, based on common deep analytics pipelines that utilize LSML workload.

Date:
Speakers:
Milind Bhandarkar
Affiliation:
Greenplum