Mysore Park Workshop on Distributed Computing For Machine Learning and Optimization

Dates: 18th – 20th December 2013

The Mysore Park Workshop series is a forum to promote research discussion in Computer Science and Engineering in India through workshops on contemporary topics, modeled after the Dagstuhl Workshop series. These workshops bring together leading researchers from related disciplines to promote cross-fertilization of ideas and interdisciplinary research, with particular emphasis on the overall quality of discussion. To keep quality high, participation is limited to experts from academia and industry. More information about previous workshops is available here.

This workshop will focus specifically on the area of distributed machine learning and optimization; see the Theme section below for details.

Participation in the workshop is by invitation only.

Venue

The workshop will be held at the Infosys Mysore campus, located approximately 150 km from Bangalore. Accommodation will be provided on the campus.

The closest international airport to Mysore is Bangalore/Bengaluru (BLR), which has direct connections to several European hubs (Paris, Frankfurt, London). Travel by road between Bangalore and Mysore takes approximately 3 hours. We will arrange bus transportation between Bangalore and Mysore in both directions.

Accommodation and Food

During the workshop, accommodation and food will be provided at no cost to the invited participants at the beautiful campus of Infosys at Mysore. Note: Infosys is an alcohol-free campus.

Infosys Campus, Mysore

Infosys, which adds about 12,000 employees every year, has set up this Rs. 260 crore (approximately US$60 million) corporate training centre on a 270-acre campus at Mysore, a city known for its academic and research institutions and heritage buildings.

The facilities at this centre include a 24-hour food court, sports facilities (including a swimming pool), a multiplex theatre, and an education and research block, in addition to the trainee hostel. The centre is spread over 16,00,000 sq ft (1.6 million sq ft).

The Infosys Mysore Development Center reflects the company's commitment to being a responsible consumer of energy, water and other resources. Construction and renovation projects are driven by the long-term goals of carbon neutrality and water sustainability.

New buildings at the center are designed to admit maximum daylight, which minimizes energy consumption and reduces the heat load from artificial lighting. Insulation techniques used in the buildings and green procurement practices reduce the environmental impact of operations.

The ambiance of the campus makes it a perfect setting for learning and sharing knowledge.

Visa Information

Most international visitors require a visa to enter India; please contact the Indian embassy in your country for details. Note that a "Business Visa" is the most appropriate visa category for this visit. The "Conference Visa" category is not applicable, as it is meant only for conferences organized by designated government institutes.

Theme

The past decade has witnessed an explosion of information and data, driven by a convergence of trends: growth of the web, expansion of connectivity via mobile devices and social networks, and the availability of new big data sources in scientific domains such as computational biology, neuroscience and high-energy physics. These trends have two important aspects. First, the data can no longer be stored on a single disk or machine; instead it resides on shared memory/disks distributed over an entire cluster of machines. Second, the scale of this data is leading to completely new and challenging machine learning problems across a variety of domains. New classes of distributed machine learning algorithms have started emerging that must account not only for computational efficiency but also for communication cost and correctness of parallel execution over the cluster/network. The field is attracting both academia and industry due to its novel modeling and algorithm design aspects as well as its immediate relevance to practical big data problems.

In this workshop, we aim to focus specifically on the area of distributed machine learning and optimization. We plan a three-day workshop, with each day focusing on one or more of the topics below:

a) Programming Frameworks: Distributed computation involves several critical system aspects, such as communication: when (synchronous vs. asynchronous), how (master-slave vs. decentralized), bandwidth, and fault tolerance, as well as techniques such as hashing and caching for speed-up. As a result, different frameworks such as MapReduce, MPI, Dryad and GraphLab have been proposed, each optimizing a subset of these aspects while potentially compromising on others. The key challenges include understanding any underlying unified framework, the evolution of programming frameworks, and optimally choosing a framework for a given machine learning algorithm or big data application.
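To make the synchronous master-slave pattern concrete, here is a minimal, framework-agnostic sketch in Python; all names and sizes are illustrative and not any particular framework's API. Each simulated worker computes a partial gradient on its shard of the data (the "map"), and the master sums the parts and takes one global step (the "reduce").

```python
# Minimal sketch of synchronous master-slave learning (illustrative only).
import numpy as np

def worker_gradient(X_shard, y_shard, w):
    """Partial least-squares gradient computed on one worker's shard."""
    return X_shard.T @ (X_shard @ w - y_shard)

def master_round(shards, w, lr=0.1):
    """Sum partial gradients across workers, then take one global step."""
    grad = sum(worker_gradient(X, y, w) for X, y in shards)
    n = sum(len(y) for _, y in shards)
    return w - lr * grad / n

rng = np.random.default_rng(0)
X, true_w = rng.normal(size=(1000, 5)), np.arange(5.0)
y = X @ true_w
shards = [(X[i::4], y[i::4]) for i in range(4)]   # 4 simulated workers

w = np.zeros(5)
for _ in range(200):                              # 200 synchronous rounds
    w = master_round(shards, w)
print(np.round(w, 2))                             # approaches [0 1 2 3 4]
```

An asynchronous framework would let workers read and write w without waiting for the round barrier, trading consistency for throughput.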

b) Machine Learning: There has been a great deal of work on parallelizing common supervised algorithms (SVMs, neural networks, decision trees, deep belief nets, graphical models, etc.) as well as unsupervised algorithms (K-means clustering, Latent Dirichlet Allocation, etc.). As a result, several novel generic templates for distributed machine learning have emerged that trade off aspects such as time, accuracy, generality and algorithmic complexity.
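One concrete instance of such a template is the standard map/reduce decomposition of Lloyd's K-means, sketched below with workers simulated sequentially; names and sizes are illustrative.

```python
# Map/reduce K-means sketch: map computes per-shard sufficient statistics,
# reduce combines them and recomputes centroids.
import numpy as np

def kmeans_map(points, centroids):
    """Map: per-shard sum of points and count for each cluster."""
    k, d = centroids.shape
    labels = np.argmin(((points[:, None] - centroids) ** 2).sum(-1), axis=1)
    sums, counts = np.zeros((k, d)), np.zeros(k)
    for j in range(k):
        mask = labels == j
        sums[j], counts[j] = points[mask].sum(axis=0), mask.sum()
    return sums, counts

def kmeans_reduce(shard_stats):
    """Reduce: combine shard statistics and recompute centroids."""
    sums = sum(s for s, _ in shard_stats)
    counts = sum(c for _, c in shard_stats)
    return sums / np.maximum(counts, 1)[:, None]

rng = np.random.default_rng(1)
data = np.vstack([rng.normal(m, 0.3, (200, 2)) for m in (0.0, 3.0, 6.0)])
shards = np.array_split(rng.permutation(data), 4)    # 4 simulated workers
centroids = data[rng.choice(len(data), 3, replace=False)]
for _ in range(10):
    centroids = kmeans_reduce([kmeans_map(s, centroids) for s in shards])
print(np.round(centroids, 1))   # near the three true cluster means
```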

c) Optimization: Several challenges arise from an optimization point of view once data is distributed across nodes. For example, communication bandwidth becomes a bottleneck for algorithms that communicate heavily, and guaranteeing convergence to the global single-node solution becomes non-trivial. As a result, there has been much focus on optimization techniques such as ADMM and dual decomposition, which are naturally suited to distributed optimization, as well as on asynchronous and decentralized methods that remove communication bottlenecks while still ensuring good convergence.
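A minimal sketch of consensus ADMM for distributed least squares follows; sizes are illustrative, and in a real deployment the local x-updates run in parallel on the node owning each shard. The point to notice is that each round exchanges only d-dimensional vectors, independent of how much data a node holds.

```python
# Consensus ADMM sketch for distributed least squares (illustrative).
import numpy as np

def admm_consensus(shards, d, rho=1.0, iters=50):
    z = np.zeros(d)                                 # consensus variable
    U = np.zeros((len(shards), d))                  # scaled dual variables
    # Precompute per-node solves of (A_i^T A_i + rho I) x = rhs.
    mats = [(np.linalg.inv(A.T @ A + rho * np.eye(d)), A.T @ b)
            for A, b in shards]
    for _ in range(iters):
        X = np.array([Minv @ (Atb + rho * (z - u))  # local x-updates
                      for (Minv, Atb), u in zip(mats, U)])
        z = (X + U).mean(axis=0)                    # gather and average
        U += X - z                                  # dual ascent
    return z

rng = np.random.default_rng(2)
A, w_true = rng.normal(size=(800, 6)), rng.normal(size=6)
b = A @ w_true + 0.01 * rng.normal(size=800)
shards = [(A[i::4], b[i::4]) for i in range(4)]
print(np.round(admm_consensus(shards, 6) - w_true, 3))  # close to zero
```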

d) Big Data Applications: Web companies collect huge amounts of data every day in the form of user and search logs for applications such as content recommendation and click prediction. Moreover, even non-IT companies benefit greatly from big data analysis: for example, measurements from various sensors can be used to build anomaly detection models, averting huge losses due to technical shutdowns.

Schedule

Each talk will be 55 minutes long, including 10 minutes for questions and discussion.

17th December

Buses leave Bangalore from the Microsoft Research India office at 1:30 pm.

Address: Microsoft Research India, Vigyan, #9 Lavelle Road, Bangalore, 560025

18th December

 

Session 1 (8:30 – 10:20)

Inderjit Dhillon: Divide and Conquer Methods for Big Data Analytics

SVN Vishwanathan: NOMAD: Non-locking, Stochastic Multi-machine algorithm for Asynchronous and Decentralized matrix factorization

Coffee Break (10:20 - 10:50)

Session 2 (10:50 – 12:40)

Vijay Saraswat: Programming Distributed Machine Learning Applications / Frameworks

Milind Bhandarkar: Hadoop, the Default Machine Learning Platform

Lunch Break (12:40 - 2:00)

Panel Discussion 1 (2:00 – 3:00)

To Be Announced

Coffee Break (3:00 - 3:20)

Session 3 (3:20 – 5:10)

Francesca Rossi: Preference reasoning and aggregation

Ganesh Ramakrishnan: Scaling up entity extraction and search over entities and relations

19th December

 

Session 1 (8:30-10:20)

S. Sundararajan and Shirish Shevade: Distributed Algorithms for Kernel Machines

Coffee Break (10:20 - 10:50)

Session 2 (10:50 – 12:40)

Chih-Jen Lin: Distributed Newton Methods for CTR (Click Through Rate) Classification

Tara Sainath: Techniques for Improving Training Time of Deep Neural Networks with Applications to Speech Recognition

Lunch Break (12:40 - 2:00)

Panel Discussion 2 (2:00 – 3:00)

To Be Announced

Coffee Break (3:00 - 3:20)

Session 3 (3:20 – 5:10)

Nikhil Rasiwasia: Yahoo Challenge: Large-scale Flickr-tag Image Classification with Noisy Labels.

Manik Varma: Local Deep Kernel Learning for Efficient Non-linear SVM Prediction

20th December

 

Session 1 (7:30 – 9:20)

Anju Kambadur: A Generic Framework For Sparse Machine Learning

Geoffrey Holmes: Frameworks for Distributed Machine Learning

Coffee Break (9:20 – 9:45)

Session 2 (9:45 – 11:35)

Srinivasan Parthasarathy: Large Scale Data Analytics: Challenges, and the role of Stratified Data Placement

Anima Anandkumar: To Be Announced

Buses Leaving Mysore (12:30)


Abstracts

Inderjit Dhillon (Professor, The University of Texas at Austin)

Divide and Conquer Methods for Big Data Analytics

Data is being generated at a tremendous rate in modern applications as diverse as internet applications, genomics, health care, energy management and social network analysis, and there is a great need for scalable methods to analyze these data sets. In this talk, I will present some new divide-and-conquer algorithms for various challenging problems in large-scale data analysis. Divide-and-conquer is a common paradigm, widely used in computer science and scientific computing, for example in sorting, scalable computation of n-body interactions via the fast multipole method, and eigenvalue computations of symmetric matrices. However, this paradigm has not been widely employed in problems that arise in machine learning. I will introduce some recent divide-and-conquer methods that we have developed for three representative problems: (i) classification using kernel support vector machines, (ii) dimensionality reduction for large-scale social network analysis, and (iii) structure learning of graphical models. For each of these problems, we develop specialized algorithms, in particular tailored ways of "dividing" the problem into subproblems, solving the subproblems, and finally "conquering" them. It should be noted that the subproblem solutions yield localized models for analyzing the data; an intriguing question is whether the hierarchy of localized models can be combined to yield models that are not only easier to compute but also statistically more robust.

This is joint work with Cho-Jui Hsieh, Pradeep Ravikumar, Donghyuk Shin and Si Si.
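To fix ideas, here is a hypothetical, much-simplified divide/solve/conquer sketch in Python; it shows the generic shape, not the speakers' algorithms, and assumes scikit-learn for brevity: "divide" by clustering, "solve" a local linear model per cluster, "conquer" by routing each query to its nearest cluster's model.

```python
# Generic divide/solve/conquer classification sketch (illustrative only).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_moons

X, y = make_moons(2000, noise=0.2, random_state=0)
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)    # divide
models = []
for j in range(8):                                             # solve
    Xj, yj = X[km.labels_ == j], y[km.labels_ == j]
    clf = (LogisticRegression() if len(set(yj)) > 1
           else DummyClassifier(strategy="most_frequent"))     # pure cluster
    models.append(clf.fit(Xj, yj))

def predict(Q):                                                # conquer
    return np.array([models[c].predict(q[None])[0]
                     for q, c in zip(Q, km.predict(Q))])

print((predict(X) == y).mean())  # local linear models fit a non-linear set
```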

SVN Vishwanathan (Associate Professor, Purdue University)

NOMAD: Non-locking, Stochastic Multi-machine algorithm for Asynchronous and Decentralized matrix factorization

In this talk I will describe NOMAD, an asynchronous, distributed framework for large-scale matrix factorization. A unique feature of our framework is the ability to keep the network and CPU busy by simultaneously performing communication and computation. Extensive experiments using public datasets confirm that NOMAD is able to outperform other algorithms in both multi-core and distributed-memory settings. If time permits, I will also discuss some preliminary results on extending our framework to solve a larger class of regularized risk minimization problems.
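For readers unfamiliar with the setting, the sketch below shows the stochastic-gradient updates at the core of such matrix-factorization methods, written sequentially for clarity with illustrative sizes. NOMAD's contribution, not attempted here, is running these updates asynchronously, passing ownership of item columns between machines so that communication overlaps computation.

```python
# Sequential SGD matrix-factorization sketch (illustrative sizes).
import numpy as np

rng = np.random.default_rng(3)
n_users, n_items, rank = 50, 40, 4
U_true = rng.normal(size=(n_users, rank))
V_true = rng.normal(size=(n_items, rank))
obs = [(i, j, U_true[i] @ V_true[j])              # sparse observed entries
       for i in range(n_users) for j in range(n_items) if rng.random() < 0.3]

U = 0.1 * rng.normal(size=(n_users, rank))
V = 0.1 * rng.normal(size=(n_items, rank))
lr, lam = 0.05, 1e-4
for _ in range(50):                               # epochs of SGD
    for i, j, r in obs:
        e = r - U[i] @ V[j]                       # prediction error
        U[i], V[j] = (U[i] + lr * (e * V[j] - lam * U[i]),
                      V[j] + lr * (e * U[i] - lam * V[j]))
print(np.mean([(r - U[i] @ V[j]) ** 2 for i, j, r in obs]))  # training MSE
```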

Vijay Saraswat (Member Research Staff, IBM Watson)

Programming Distributed Machine Learning Applications / Frameworks

We propose to use X10 (http://x10-lang.org) to program distributed machine learning applications.

Designed for the productivity of Java-like languages and the performance of MPI-like libraries at scale, X10 is built around a few core programming constructs for concurrency and distribution. X10 is a statically typed, garbage-collected, parallel, object-oriented programming language that supports dynamic, asynchronous, stateful computations running in a single global address space partitioned over a cluster of tens of thousands of cores. Programs can run either natively (a single binary running on each node of a cluster) or in a managed environment (a cluster of JVMs), on commodity networks, on high-performance networks (e.g. InfiniBand), or on supercomputers such as Blue Gene/Q, IBM P775, or the Japanese K computer.

X10 is particularly useful for building libraries and application frameworks involving distributed/global data structures. In earlier work we developed high-performance, in-memory implementations of application frameworks such as Hadoop MapReduce, a global matrix library, and a library for globally load-balancing highly irregular computations. Applications and libraries such as a non-resilient version of Pregel, a distributed version of the DBSCAN clustering algorithm, and an expectation-maximization algorithm for HMM training have been developed in a few hundred lines of code. Additionally, X10 is being used to develop distributed SAT solvers (e.g. SatX10, and DDX10 based on decision diagrams) and adaptive solvers using local search.

Milind Bhandarkar (Chief Scientist, Greenplum, EMC)

Hadoop, the Default Machine Learning Platform

Apache Hadoop, since its humble beginnings as an execution engine for web crawling and building search indexes, has matured into a general-purpose distributed application platform and data store. Large Scale Machine Learning (LSML) techniques and algorithms have proved quite tricky for Hadoop to handle, ever since we started offering Hadoop as a service at Yahoo in 2006. In this talk, I will discuss early experiments implementing LSML algorithms on Hadoop at Yahoo. I will describe how they changed Hadoop and led to a generalization of the platform to accommodate programming paradigms other than MapReduce. I will unveil some of our recent efforts to incorporate diverse LSML runtimes into Hadoop, evolving it to become *THE* LSML platform. I will also make a case for an industry-standard LSML benchmark based on common deep analytics pipelines that use LSML workloads.

Francesca Rossi (Professor, Univ. of Padova, Italy)

Preference reasoning and aggregation

Preference reasoning and aggregation are the main ingredients of a very active area of research which brings together multi-agent systems, knowledge representation, combinatorial optimisation, and voting theory. Preferences are often used in collective decision making: each agent expresses its preferences over a set of possible decisions, and such preferences are then aggregated to return the collective decision. The main challenges in this area are related to: handling a large set of candidate decisions with a combinatorial structure, as well as a possibly very large set of agents expressing their preferences; choosing among several formalisms to model preferences compactly; the presence of indifference, incomparability, and uncertainty in the preferences; and of course computational concerns.

After describing the main lines of activity and the most significant results in this area, I will discuss the possible uses of preference modelling, reasoning, and aggregation in two application domains: sentiment analysis and stable matching problems. In sentiment analysis, preferences have to be extracted from a massive amount of textual data, such as tweets; they must then be modelled faithfully and aggregated efficiently in order to generate an accurate collective opinion. This poses several challenges in terms of handling big data, extracting relevant information from it, and learning the missing information. In stable matching problems, there are two sets of agents, each agent expresses preferences over the agents in the other set, and the aim is to find a stable matching according to the preferences. Both systematic and local search algorithms have been used to solve such problems, but to handle problems of large size such algorithms need to be much faster, so distribution and parallelization techniques can be of great help in this respect.
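As a toy illustration of aggregation, far simpler than the combinatorial settings the talk considers, here is Borda count over total orders: each agent ranks m candidates, a candidate earns (m - 1 - position) points per ballot, and scores are summed.

```python
# Borda count: a minimal preference-aggregation example.
from collections import Counter

def borda(ballots):
    m = len(ballots[0])
    scores = Counter()
    for ranking in ballots:                 # ranking: best -> worst
        for pos, cand in enumerate(ranking):
            scores[cand] += m - 1 - pos
    return scores.most_common()

ballots = [["a", "b", "c"], ["a", "c", "b"], ["b", "c", "a"],
           ["c", "b", "a"], ["b", "a", "c"]]
print(borda(ballots))   # [('b', 6), ('a', 5), ('c', 4)]
```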

Ganesh Ramakrishnan (Assistant Professor, IIT Bombay)

Scaling up entity extraction and search over entities and relations

Entity relationship search at the Web scale, or even at the enterprise level, depends on adding dozens of entity annotations to each of billions of crawled pages and indexing the annotations at rates comparable to regular text indexing. Even small entity search benchmarks from TREC and INEX suggest that the entity catalog must support thousands of entity types and tens to hundreds of millions of entities. These targets raise many challenges, major ones being (i) fast and effective entity extractors and disambiguators, (ii) the design of highly compressed in-RAM data structures for spotting and disambiguating entity mentions, together with highly compressed disk-based annotation indices, and (iii) the use of annotations and efficient indices for effective and efficient entity-oriented search.

After providing a brief introduction to our prior work on entity annotation, disambiguation and entity-based search, we will focus on specific approaches we explored for scaling them up. In particular, we present two of our approaches geared toward scaling up operations in this area:

(a) the translation of rule based annotation to operations on the inverted index, to achieve an order of magnitude speedup (EMNLP 2006, ICDE 2008, Infoscale 2008, CIKM 2008) over the standard document-at-a-time rule-based annotation paradigm.

(b) the design of RAM data structures for spotting and disambiguating entity mentions (WWW 2012), and highly compressed disk-based annotation indices (WWW 2011). These data structures cannot be readily built on standard inverted indices. We present a Web-scale entity annotator and annotation index. Using a new workload-sensitive compressed multilevel map, we fit statistical disambiguation models for millions of entities within 1.15 GB of RAM and spend about 0.6 core-milliseconds per disambiguation. We also show how the disk-based annotation index enables entity-centric, snippet-oriented search (WWW 2011).

S. Sundararajan (Senior Applied Scientist, MSR India) and Shirish Shevade (Professor, IISc Bangalore)

Distributed Algorithms for Kernel Machines

Training a kernel machine requires solving a convex quadratic programming problem. Decomposition algorithms designed to solve this problem are very slow when the training data set is very large. Effective implementations of distributed training algorithms for kernel machines have shown promising results on very large data sets. In the first part of this talk, we review some of the prominent distributed algorithms for designing kernel machines by solving the standard primal or dual formulations. In the second part, we cover kernel-approximation-based algorithms for building distributed kernel machines, including the Nystrom approximation, related randomized algorithms, Random Fourier Features, and linearization of kernel machines. A comparison of the storage, computation and communication costs of these algorithms will be presented. We conclude with some high-level observations, recommendations and future directions.
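As a minimal illustration of the kernel-approximation route, the sketch below implements Random Fourier Features for the RBF kernel, with illustrative parameters: once data is mapped to the random feature space, inner products approximate the kernel, so an ordinary distributed linear solver suffices.

```python
# Random Fourier Features sketch: phi(x) @ phi(y) ~ exp(-gamma*||x - y||^2).
import numpy as np

def rff(X, n_features, gamma, rng):
    """Random cosine features approximating the RBF kernel."""
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, n_features))
    b = rng.uniform(0, 2 * np.pi, n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
Phi = rff(X, 2000, gamma=0.5, rng=rng)
K_exact = np.exp(-0.5 * ((X[:, None] - X[None]) ** 2).sum(-1))
print(np.abs(Phi @ Phi.T - K_exact).max())   # small approximation error
```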

Chih-Jen Lin (Professor, NTU, Taiwan)

Distributed Newton Methods for CTR (Click Through Rate) Classification

CTR (Click Through Rate) prediction is extremely important for Internet advertising. Data from users' impression and click logs pose two major challenges. First, the data collected in just a few days can contain billions or more instances. Second, the number of positive examples (i.e., clicks) is relatively small, so the data set is highly unbalanced. We develop a distributed Newton method for training very large-scale logistic regression. We use real data to analyze the scalability of our method, the relationship between test accuracy and data size, the workflow of big-data experiments, and the various tools for implementing big-data machine learning packages.
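The structure such methods exploit is easy to state: the gradient and every Hessian-vector product of L2-regularized logistic regression are sums over data shards, so each inner conjugate-gradient iteration of a truncated Newton solver needs only one d-dimensional all-reduce. Below is a sequential simulation with illustrative details, not the speaker's implementation; scipy is assumed.

```python
# Distributed-style Newton-CG for logistic regression (sequential sketch).
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_step(shards, w, lam=1.0):
    d = len(w)
    grad = lam * w + sum(X.T @ (sigmoid(X @ w) - y) for X, y in shards)
    def hess_vec(v):                     # sum of per-shard contributions
        out = lam * v
        for X, _ in shards:
            s = sigmoid(X @ w)
            out += X.T @ (s * (1 - s) * (X @ v))
        return out
    H = LinearOperator((d, d), matvec=hess_vec)
    step, _ = cg(H, grad)                # inner solve uses only matvecs
    return w - step

rng = np.random.default_rng(5)
X = rng.normal(size=(2000, 10))
y = (X @ rng.normal(size=10) > 0).astype(float)
shards = [(X[i::4], y[i::4]) for i in range(4)]
w = np.zeros(10)
for _ in range(5):                       # a few Newton iterations suffice
    w = newton_step(shards, w)
print(np.mean((sigmoid(X @ w) > 0.5) == y))   # high training accuracy
```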

Tara Sainath (Research Staff Member, IBM Watson)

Techniques for Improving Training Time of Deep Neural Networks with Applications to Speech Recognition

Deep Neural Networks have become the state of the art in acoustic modeling for speech recognition, allowing for improvements of 10-30% relative over previously used acoustic modeling techniques across a wide variety of large vocabulary continuous speech recognition (LVCSR) tasks. However, training time for these networks remains a major challenge. In this talk, I will discuss various techniques we have explored to improve training speed, including reducing the number of network parameters using a low-rank matrix factorization, parallelizing matrix multiplications with GPUs, and using a second-order, data-parallel Hessian-free training algorithm.
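A small numerical illustration of the low-rank idea follows, with shapes that are merely illustrative of a large output layer: a dense m x n weight matrix is replaced by a rank-r product U V, cutting both parameter count and multiply cost from m*n to r*(m+n). Note that a random matrix is a worst case for low-rank approximation; trained weight matrices are typically far more compressible.

```python
# Low-rank factorization of a weight layer (illustrative shapes).
import numpy as np

m, n, r = 512, 2000, 64            # layer input size, output size, rank
rng = np.random.default_rng(6)
W = rng.normal(size=(m, n)) / np.sqrt(m)

# Best rank-r approximation via truncated SVD. At training time one would
# instead parameterize the layer as U @ V and learn U and V directly.
U_, s, Vt = np.linalg.svd(W, full_matrices=False)
U, V = U_[:, :r] * s[:r], Vt[:r]

print("parameters:", m * n, "->", r * (m + n))      # roughly 6x fewer here
x = rng.normal(size=m)
print("max output change:", np.abs(x @ (U @ V) - x @ W).max())
```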

Nikhil Rasiwasia (Scientist, Yahoo! Labs, Bangalore)

Yahoo Challenge: Large-scale Flickr-tag Image Classification with Noisy Labels.

In this talk I will present an overview and the results of the Yahoo Large-scale Flickr-tag Image Classification Challenge held at ACM-MM 2013. Classification with a large number of classes and images has attracted several research efforts in recent years. The availability of datasets such as ImageNet, which boasts over 14 million images and over 21 thousand classes, has motivated researchers to develop classification algorithms that can deal with large quantities of data. However, most of the effort has been dedicated to building systems that scale when the number of classes is large. In this challenge we are interested in learning classifiers when the number of images is large. There has been some recent work that deals with thousands of images for training; in this challenge, however, we look at 200,000 images per class, a scale that has not been studied in the literature before. What makes the challenge difficult is that the annotations are provided by users of Flickr.com and are therefore inherently noisy; based on editorial review, we estimate that 20-40% of the tagged images are incorrectly labeled. Furthermore, each class can be considered a collection of sub-classes with varied visual properties; for example, the class “Nature” contains images from sub-classes such as “Waterfall”, “Mountains” and “Beach”.

Manik Varma (Researcher, MSR India)

Local Deep Kernel Learning for Efficient Non-linear SVM Prediction

Our objective is to speed up non-linear SVM prediction while maintaining classification accuracy above an acceptable limit. We generalize Localized Multiple Kernel Learning so as to learn a tree-based primal feature embedding which is high dimensional and sparse. Primal-based classification decouples prediction costs from the number of support vectors, and our tree-structured features efficiently encode non-linearities while speeding up prediction exponentially over the state-of-the-art. We develop routines for optimizing over the space of tree-structured features and efficiently scale to problems with more than half a million training points. Experiments on benchmark data sets reveal that our formulation can reduce prediction costs by more than three orders of magnitude in some cases with a moderate sacrifice in classification accuracy as compared to RBF-SVMs. Furthermore, our formulation leads to better classification accuracies over leading methods.

Anju Kambadur (Research Staff Member, IBM Watson)

A Generic Framework For Sparse Machine Learning

Data and computation are growing at an ever-increasing pace, but advances in single-threaded processor performance have reached a standstill; that is, machine learning algorithms need to be parallelized. The key requirements placed by parallel machine learning algorithms are: (1) a means to specify parallel algorithms that can operate on clusters, SMPs, and CMPs, and (2) support for rapid prototyping, to enable evaluation of several different algorithms in a short period of time. Fortunately, parallel programming is well studied; this allows us to leverage proven high performance computing (HPC) technologies such as BLAS, LAPACK, and MPI to build high-throughput, high-level, domain-specific programming abstractions and hide complex implementation details. In this talk, we present a simple conceptual framework that we developed at T.J. Watson to parallelize several greedy algorithms for sparse learning on commodity clusters, leveraging HPC technologies. In particular, we present our framework through four examples: (1) graphical Granger modeling, (2) genome-wide association studies, (3) non-negative matrix factorization, and (4) sparse inverse covariance estimation.
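For orientation, here is a hedged sketch of the greedy template such frameworks parallelize, in the style of orthogonal matching pursuit; it is illustrative, not the IBM framework. The correlation scan over candidate features (the argmax below) is the embarrassingly parallel step that maps naturally onto BLAS/MPI building blocks.

```python
# Greedy forward selection (OMP-style) for sparse regression (illustrative).
import numpy as np

def greedy_sparse_fit(X, y, k):
    support, residual = [], y.copy()
    for _ in range(k):
        scores = np.abs(X.T @ residual)          # parallelizable scan
        scores[support] = -np.inf                # skip chosen features
        support.append(int(np.argmax(scores)))
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        residual = y - X[:, support] @ coef      # refit on the support
    return support, coef

rng = np.random.default_rng(7)
X = rng.normal(size=(500, 100))
y = X[:, [3, 17, 42]] @ np.array([2.0, -1.5, 1.0])
support, coef = greedy_sparse_fit(X, y, 3)
print(support, np.round(coef, 2))   # recovers features 3, 17, 42
```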

Geoffrey Holmes (Professor, University of Waikato)

Frameworks for Distributed Machine Learning

This talk is in three parts. The first deals with an aspect of the Weka project that has received little attention, namely the use of machine learning in agricultural applications. I will outline our experiences in this field and present an application development framework that is a direct result of this activity; in particular, one project has met one of the challenges proposed by Kiri Wagstaff at ICML 2012. Second, I will talk about our work in data stream mining, with a focus on classification within the Massive Online Analysis framework (MOA). After a quick overview of what is in MOA, I will present two recent results that indicate a need for caution, together with a statement of what constitutes the state of the art in data stream classification for practitioners. I will also discuss attempts to produce a distributed version of MOA called SAMOA, a platform for data stream mining in a cluster/cloud environment; its architecture allows it to run on several distributed stream processing engines, such as S4 and Storm. Finally, I will present the idea of experiment databases, a framework for machine learning experimentation that saves effort and offers opportunities for meta-learning and hypothesis generation.

Srinivasan Parthasarathy (Professor, The Ohio State University)

Large Scale Data Analytics: Challenges, and the role of Stratified Data Placement

With the increasing popularity of XML data stores, social networks, and Web 2.0 and 3.0 applications, complex data formats such as trees and graphs are becoming ubiquitous. Managing and processing such large and complex data stores on modern computational ecosystems, to realize actionable information efficiently, is daunting. In this talk I will begin by discussing some of these challenges. Subsequently, I will discuss a critical element at the heart of this challenge: the placement, storage and access of such tera- and peta-scale data. In this work we develop a novel distributed framework to ease the burden on the programmer and propose an agile and intelligent placement service layer as a flexible yet unified means to address this challenge. Central to our framework is the notion of stratification, which seeks to initially group structurally (or semantically) similar entities into strata. Strata are then partitioned within this ecosystem according to the needs of the application, to maximize locality, balance load, or minimize data skew. Results on several real-world applications validate the efficacy and efficiency of our approach.