Microsoft Research Publications
http://research.microsoft.com/apps/dp/pu/publications.aspx
Keep current with all the latest Microsoft Research Publications and Technical Reports
© 2015 Microsoft Corporation. All rights reserved.
en-US
Wed, 10 Jun 2015 14:08:04 GMT

Semantic Parsing via Staged Query Graph Generation: Question Answering with Knowledge Base
http://research.microsoft.com/apps/pubs/default.aspx?id=244749
Wed, 01 Jul 2015 07:00:00 GMT

Language Models for Image Captioning: The Quirks and What Works
Two recent approaches have achieved state-of-the-art results in image captioning. The first uses a pipelined process where a set of candidate words is generated by a convolutional neural network (CNN) trained on images, and then a maximum entropy (ME) language model is used to arrange these words into a coherent sentence. The second uses the penultimate activation layer of the CNN as input to a recurrent neural network (RNN) that then generates the caption sequence. In this paper, we compare the merits of the different language modeling approaches for the first time by using the same state-of-the-art CNN as input. We examine issues in the different approaches, including linguistic irregularities, caption repetition, and data set overlap. By combining key aspects of both the ME and RNN methods, we achieve a new record performance on the benchmark COCO dataset.
http://research.microsoft.com/apps/pubs/default.aspx?id=249867
Wed, 01 Jul 2015 07:00:00 GMT

Approval Voting and Incentives in Crowdsourcing
The growing need for labeled training data has made crowdsourcing an important part of machine learning. The quality of crowdsourced labels is, however, adversely affected by three factors: (1) the workers are not experts; (2) the incentives of the workers are not aligned with those of the requesters; and (3) the interface does not allow workers to convey their knowledge accurately, by forcing them to make a single choice among a set of options. In this paper, we address these issues by introducing approval voting to utilize the expertise of workers who have partial knowledge of the true answer, and coupling it with a (“strictly proper”) incentive-compatible compensation mechanism. We show rigorous theoretical guarantees of optimality of our mechanism together with a simple axiomatic characterization. We also conduct preliminary empirical studies on Amazon Mechanical Turk which validate our approach.
http://research.microsoft.com/apps/pubs/default.aspx?id=249834
Wed, 01 Jul 2015 07:00:00 GMT

Pushing the Limits of Affine Rank Minimization by Adapting Probabilistic PCA
Many applications require recovering a matrix of minimal rank within an affine constraint set, with matrix completion a notable special case. Because the problem is NP-hard in general, it is common to replace the matrix rank with the nuclear norm, which acts as a convenient convex surrogate. While elegant theoretical conditions elucidate when this replacement is likely to be successful, they are highly restrictive and convex algorithms fail when the ambient rank is too high or when the constraint set is poorly structured. Non-convex alternatives fare somewhat better when carefully tuned; however, convergence to locally optimal solutions remains a continuing source of failure. Against this backdrop we derive a deceptively simple and parameter-free probabilistic PCA-like algorithm that is capable, over a wide battery of empirical tests, of successful recovery even at the theoretical limit where the number of measurements equals the degrees of freedom in the unknown low-rank matrix. Somewhat surprisingly, this is possible even when the affine constraint set is highly ill-conditioned. While proving general recovery guarantees remains elusive for non-convex algorithms, Bayesian-inspired or otherwise, we nonetheless show conditions whereby the underlying cost function has a unique stationary point located at the global optimum; no existing cost function we are aware of satisfies this property. The algorithm has also been successfully deployed on a computer vision application involving image rectification and a standard collaborative filtering benchmark.
http://research.microsoft.com/apps/pubs/default.aspx?id=249754
Wed, 01 Jul 2015 07:00:00 GMT

Clustered Sparse Bayesian Learning
Many machine learning and signal processing tasks involve computing sparse representations using an overcomplete set of features or basis vectors, with compressive sensing-based applications a notable example. While traditionally such problems have been solved individually for different tasks, this strategy ignores strong correlations that may be present in real-world data. Consequently there has been a push to exploit these statistical dependencies by jointly solving a series of sparse linear inverse problems. In the majority of the resulting algorithms, however, we must a priori decide which tasks can most judiciously be grouped together. In contrast, this paper proposes an integrated Bayesian framework for both clustering tasks together and subsequently learning optimally sparse representations within each cluster. While probabilistic models have been applied previously to solve these types of problems, they typically involve a complex hierarchical Bayesian generative model merged with some type of approximate inference, the combination of which renders rigorous analysis of the underlying behavior virtually impossible. On the other hand, our model subscribes to concrete motivating principles that we carefully evaluate both theoretically and empirically. Importantly, our analyses take into account all approximations that are involved in arriving at the actual cost function to be optimized. Results on synthetic data as well as image recovery from compressive measurements show improved performance over existing methods.
http://research.microsoft.com/apps/pubs/default.aspx?id=249756
Wed, 01 Jul 2015 07:00:00 GMT

Multi-Task Learning for Subspace Segmentation
Subspace segmentation is the process of clustering a set of data points that are assumed to lie on the union of multiple linear or affine subspaces, and is increasingly being recognized as a fundamental tool for data analysis in high dimensional settings. Arguably one of the most successful approaches is based on the observation that the sparsest representation of a given point with respect to a dictionary formed by the others involves nonzero coefficients associated with points originating in the same subspace. Such sparse representations are computed independently for each data point via ℓ1-norm minimization and then combined into an affinity matrix for use by a final spectral clustering step. The downside of this procedure is twofold. First, unlike canonical compressive sensing scenarios with ideally randomized dictionaries, the data-dependent dictionaries here are unavoidably highly structured, disrupting many of the favorable properties of the ℓ1 norm. Secondly, by treating each data point independently, we ignore useful relationships between points that can be leveraged for jointly computing such sparse representations. Consequently, we motivate a multi-task learning-based framework for learning coupled sparse representations leading to a segmentation pipeline that is both robust against correlation structure and tailored to generate an optimal affinity matrix. Theoretical analysis and empirical tests are provided to support these claims.
http://research.microsoft.com/apps/pubs/default.aspx?id=249755
Wed, 01 Jul 2015 07:00:00 GMT

How to Elect a Leader Faster than a Tournament
The problem of electing a leader from among n contenders is one of the fundamental questions in distributed computing. In its simplest formulation, the task is as follows: given n processors, all participants must eventually return a win or lose indication, such that a single contender may win. Despite a considerable amount of work on leader election, the following question is still open: can we elect a leader in an asynchronous fault-prone system faster than just running an O(log n)-time tournament, against a strong adaptive adversary? In this paper, we answer this question in the affirmative, improving on a decades-old upper bound. We introduce two new algorithmic ideas to reduce the time complexity of electing a leader to O(log* n), using O(n^2) point-to-point messages. A non-trivial application of our algorithm is a new upper bound for the tight renaming problem, assigning n items to the n participants in expected O(log^2 n) time and O(n^2) messages. We complement our results with a lower bound of Omega(n^2) messages for solving these two problems, closing the question of their message complexity.
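As a point of reference, the tournament baseline mentioned above can be sketched as rounds of pairwise duels. The coin-flip duel below is an illustrative stand-in for whatever shared-memory primitive decides each pair; it is not the paper's algorithm.

```python
import random

def tournament(n, seed=0):
    """Elect a leader by pairwise duels: each round roughly halves the number
    of contenders, so ceil(log2 n) rounds suffice. The fair coin flip stands
    in for the primitive that decides each pair."""
    rng = random.Random(seed)
    contenders = list(range(n))
    rounds = 0
    while len(contenders) > 1:
        rounds += 1
        nxt = []
        for i in range(0, len(contenders) - 1, 2):
            a, b = contenders[i], contenders[i + 1]
            nxt.append(a if rng.random() < 0.5 else b)
        if len(contenders) % 2 == 1:      # odd one out advances for free
            nxt.append(contenders[-1])
        contenders = nxt
    return contenders[0], rounds

leader, rounds = tournament(16, seed=42)
print(leader, rounds)   # a single winner after exactly log2(16) = 4 rounds
```

The round count is what the paper improves upon: the duels within one round can run concurrently, but the rounds themselves are inherently sequential.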
http://research.microsoft.com/apps/pubs/default.aspx?id=249410
Wed, 01 Jul 2015 07:00:00 GMT

Polylogarithmic-Time Leader Election in Population Protocols
Population protocols are networks of finite-state agents, interacting randomly, and updating their state using simple rules. Despite their extreme simplicity, these systems have been shown to cooperatively perform complex computational tasks, such as simulating register machines to compute standard arithmetic functions. The election of a unique leader agent is a key requirement in such computational constructions. Yet, the fastest currently known population protocol for electing a leader only has polynomial convergence time. In this paper, we give the first population protocol for leader election with polylogarithmic convergence time. The protocol structure is quite simple: each node has an associated value, and is either a leader (still in contention) or a minion (following some leader). A leader keeps incrementing its value and defeats other leaders in one-to-one interactions, and will drop from contention and become a minion if it meets a leader with higher value. Importantly, a leader also drops out if it meets a minion with higher absolute value. While these rules are quite simple, the proof that this algorithm achieves polylogarithmic convergence time is non-trivial. In particular, the argument combines careful use of concentration inequalities with anti-concentration bounds, showing that the leaders' values become spread apart as the execution progresses, which in turn implies that straggling leaders get quickly eliminated. We complement our analysis with empirical results, showing that our protocol converges fast, even for large network sizes.
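The leader/minion rules can be simulated directly. The sketch below is our simplified reading of the abstract: ties are broken by agent id, minions gossip the largest value they have seen, and the "absolute value" subtlety is ignored; the actual protocol and its analysis differ.

```python
import random

def elect_leader(n, seed=0, max_steps=1_000_000):
    """Toy simulation of the leader/minion dynamics sketched in the abstract.
    Tie-breaking by agent id and minions remembering the largest value they
    have seen are our assumptions, not details from the paper."""
    rng = random.Random(seed)
    leader = [True] * n
    value = [0] * n
    for step in range(max_steps):
        i, j = rng.sample(range(n), 2)
        for a in (i, j):
            if leader[a]:
                value[a] += 1                  # leaders keep incrementing
        if leader[i] and leader[j]:
            # leader-vs-leader: the lower (value, id) pair drops to minion
            loser = i if (value[i], i) < (value[j], j) else j
            leader[loser] = False
        elif leader[i] != leader[j]:
            lead, minion = (i, j) if leader[i] else (j, i)
            if value[minion] > value[lead]:
                leader[lead] = False           # beaten by a higher-valued minion
            else:
                value[minion] = max(value[minion], value[lead])
        else:
            # two minions spread the highest value they have seen
            value[i] = value[j] = max(value[i], value[j])
        if sum(leader) == 1:
            return step + 1
    return None

steps = elect_leader(64, seed=1)
print(steps)   # interactions until a single leader remained (None on timeout)
```

With these rules the maximum-valued leader can never be eliminated, so the simulation always terminates with exactly one leader.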
http://research.microsoft.com/apps/pubs/default.aspx?id=249409
Wed, 01 Jul 2015 07:00:00 GMT

Fast and Exact Majority in Population Protocols
Population protocols, roughly defined as systems consisting of large numbers of simple identical agents, interacting at random and updating their state following simple rules, are an important research topic at the intersection of distributed computing and biology. One of the fundamental tasks that a population protocol may solve is majority: each node starts in one of two states; the goal is for all nodes to reach a correct consensus on which of the two states was initially the majority. Despite considerable research effort, known protocols for this problem are either exact but slow (taking linear parallel time to converge), or fast but approximate (with nonzero probability of error). In this paper, we show that this tradeoff between precision and speed is not inherent. We present a new protocol called \emph{Average and Conquer (AVC)} that solves majority exactly in expected parallel convergence time $O(\log{n}/(s\epsilon) + \log{n}\log{s})$, where $n$ is the number of nodes, $\epsilon n$ is the initial node advantage of the majority state, and $s$ is the number of states the protocol employs. This shows that the majority problem can be solved exactly in time polylogarithmic in $n$, provided that the memory per node is $s = \Omega(1/\epsilon)$. On the negative side, we establish a lower bound of $\Omega(1/\epsilon)$ on the expected parallel convergence time for the case of four memory states per node, and a lower bound of $\Omega(\log{n})$ parallel time for protocols using any number of memory states per node.
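In the spirit of the protocol's name, a toy quantized-averaging scheme already shows why averaging yields exactness: the sign of the population-wide sum is invariant under pairwise averaging. The state space and output rule below are our illustration, not the actual AVC protocol; note how it needs on the order of $1/\epsilon$ values per node, echoing the $s = \Omega(1/\epsilon)$ bound above.

```python
import random

def exact_majority(n_a, n_b, s=10, seed=0, max_steps=100_000):
    """Toy quantized-averaging majority (our illustration, not AVC itself).
    Each agent holds an integer in [-s, s]: +s for initial state A, -s for B.
    Interacting agents split their sum as evenly as integers allow, so the
    total, whose sign encodes the true majority, never changes. With
    advantage eps = (n_a - n_b) / n the mean is eps * s, so an unambiguous
    all-positive (or all-negative) outcome needs roughly s >= 1/eps."""
    rng = random.Random(seed)
    vals = [s] * n_a + [-s] * n_b
    n = len(vals)
    for step in range(max_steps):
        i, j = rng.sample(range(n), 2)
        total = vals[i] + vals[j]
        vals[i], vals[j] = total // 2, total - total // 2
        if step % 100 == 0 and max(vals) - min(vals) <= 1:
            break                      # values have concentrated around the mean
    if all(v > 0 for v in vals):
        return "A"
    if all(v < 0 for v in vals):
        return "B"
    return None                        # not yet converged within max_steps

print(exact_majority(6, 4, seed=7))   # "A": the initial majority decides
```

Because the sum is conserved exactly, the answer can never be wrong; the only cost of a small advantage is a slower convergence, mirroring the precision/speed tradeoff discussed above.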
http://research.microsoft.com/apps/pubs/default.aspx?id=245180
Wed, 01 Jul 2015 07:00:00 GMT

Surrogate Functions for Maximizing Precision at the Top
The problem of maximizing precision at the top of a ranked list, often dubbed Precision@k (prec@k), finds relevance in myriad learning applications such as ranking, multi-label classification, and learning with severe label imbalance. However, despite its popularity, there exist significant gaps in our understanding of this problem and its associated performance measure. The most notable of these is the lack of a convex upper bounding surrogate for prec@k. We also lack scalable perceptron and stochastic gradient descent algorithms for optimizing this performance measure. In this paper we make key contributions in these directions. At the heart of our results is a family of truly upper bounding surrogates for prec@k. These surrogates are motivated in a principled manner and enjoy attractive properties such as consistency to prec@k under various natural margin/noise conditions. These surrogates are then used to design a class of novel perceptron algorithms for optimizing prec@k with provable mistake bounds. We also devise scalable stochastic gradient descent style methods for this problem with provable convergence bounds. Our proofs rely on novel uniform convergence bounds which require an in-depth analysis of the structural properties of prec@k and its surrogates. We conclude with experimental results comparing our algorithms with state-of-the-art cutting plane and stochastic gradient algorithms for maximizing prec@k.
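For concreteness, the performance measure itself (the quantity being bounded, not the paper's surrogates) is straightforward to compute:

```python
def prec_at_k(scores, labels, k):
    """Fraction of the k highest-scoring items that are relevant (label 1)."""
    top_k = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    return sum(labels[i] for i in top_k) / k

# Hypothetical ranker output: the top 3 items contain 2 relevant ones.
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
labels = [1,   0,   1,   1,   0]
print(prec_at_k(scores, labels, 3))   # 2/3
```

The sort makes the measure a discontinuous function of the scores, which is exactly why continuous upper-bounding surrogates are needed for gradient-based training.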
http://research.microsoft.com/apps/pubs/default.aspx?id=246579
Wed, 01 Jul 2015 07:00:00 GMT

Optimizing Non-decomposable Performance Measures: A Tale of Two Classes
Modern classification problems frequently present mild to severe label imbalance as well as specific requirements on classification characteristics, and require optimizing performance measures that are non-decomposable over the dataset, such as F-measure. Such measures have spurred much interest and pose specific challenges to learning algorithms since their non-additive nature precludes a direct application of well-studied large-scale optimization methods such as stochastic gradient descent. In this paper we reveal that for two large families of performance measures that can be expressed as functions of true positive/negative rates, it is indeed possible to implement point stochastic updates. The families we consider are concave and pseudo-linear functions of TPR, TNR which cover several popularly used performance measures such as F-measure, G-mean and H-mean. Our core contribution is an adaptive linearization scheme for these families, using which we develop optimization techniques that enable truly point-based stochastic updates. For concave performance measures we propose SPADE, a stochastic primal dual solver; for pseudo-linear measures we propose STAMP, a stochastic alternate maximization procedure. Both methods have crisp convergence guarantees, demonstrate significant speedups over existing methods, often by an order of magnitude or more, and give similar or more accurate predictions on test data.
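The two families can be made concrete with standard definitions; rewriting F-measure as a function of TPR, TNR, and the positive-class fraction p shows how such measures depend on the whole dataset rather than decomposing over individual points:

```python
def rates(labels, preds):
    """True positive rate and true negative rate of binary predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    pos = sum(labels)
    neg = len(labels) - pos
    return tp / pos, tn / neg

def g_mean(tpr, tnr):
    return (tpr * tnr) ** 0.5                  # concave in (TPR, TNR)

def h_mean(tpr, tnr):
    return 2 * tpr * tnr / (tpr + tnr)         # concave in (TPR, TNR)

def f1(tpr, tnr, p):
    """F-measure as a pseudo-linear function of TPR and TNR, where p is the
    fraction of positives in the data."""
    return 2 * p * tpr / (p * tpr + p + (1 - p) * (1 - tnr))

labels = [1, 1, 1, 0, 0, 0, 0, 0]
preds  = [1, 1, 0, 0, 0, 0, 1, 0]
tpr, tnr = rates(labels, preds)                # 2/3 and 4/5
p = sum(labels) / len(labels)                  # 3/8
print(f1(tpr, tnr, p))                         # 2/3, matching 2TP/(2TP+FP+FN) = 4/6
```

Since TPR and TNR are averages over the full dataset, a single point update changes both rates, which is the obstacle the paper's adaptive linearization scheme is designed to overcome.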
http://research.microsoft.com/apps/pubs/default.aspx?id=246580
Wed, 01 Jul 2015 07:00:00 GMT

Contextual Dueling Bandits
We consider the problem of learning to choose actions using contextual information when provided with limited feedback in the form of relative pairwise comparisons. We study this problem in the dueling-bandits framework of Yue et al. (2009), which we extend to incorporate context. Roughly, the learner’s goal is to find the best policy, or way of behaving, in some space of policies, although “best” is not always so clearly defined. Here, we propose a new and natural solution concept, rooted in game theory, called a von Neumann winner, a randomized policy that beats or ties every other policy. We show that this notion overcomes important limitations of existing solutions, particularly the Condorcet winner which has typically been used in the past, but which requires strong and often unrealistic assumptions. We then present three efficient algorithms for online learning in our setting, and for approximating a von Neumann winner from batch-like data. The first of these algorithms achieves particularly low regret, even when data is adversarial, although its time and space requirements are linear in the size of the policy space. The other two algorithms require time and space only logarithmic in the size of the policy space when provided access to an oracle for solving classification problems on the space.
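A minimal example of the solution concept: with a cyclic preference matrix there is no Condorcet winner, yet the uniform randomized policy beats or ties every fixed policy and is therefore a von Neumann winner. The check below follows the definition directly; the matrix is illustrative.

```python
def is_von_neumann_winner(w, P, tol=1e-9):
    """w is a distribution over policies; P[i][j] is the probability that
    policy i beats policy j (P[i][i] = 0.5). w is a von Neumann winner if,
    against every fixed policy j, it wins with probability >= 1/2."""
    n = len(P)
    return all(sum(w[i] * P[i][j] for i in range(n)) >= 0.5 - tol
               for j in range(n))

# Cyclic preferences: 0 beats 1, 1 beats 2, 2 beats 0.
P = [[0.5, 1.0, 0.0],
     [0.0, 0.5, 1.0],
     [1.0, 0.0, 0.5]]

# No policy beats every other head-to-head, so no Condorcet winner exists...
has_condorcet = any(all(P[i][j] > 0.5 for j in range(3) if j != i)
                    for i in range(3))
print(has_condorcet)                              # False

# ...but the uniform mixture ties every pure policy with win probability 1/2.
print(is_von_neumann_winner([1/3, 1/3, 1/3], P))  # True
```

This mirrors the minimax theorem for symmetric zero-sum games: a von Neumann winner always exists, whereas a Condorcet winner may not.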
http://research.microsoft.com/apps/pubs/default.aspx?id=246936
Wed, 01 Jul 2015 07:00:00 GMT

Memory-Centric Data Storage for Mobile Systems
Current data storage on smartphones mostly inherits from desktop/server systems a flash-centric design: the memory (DRAM) effectively acts as an I/O cache for the relatively slow flash. To improve both app responsiveness and energy efficiency, this paper proposes MobiFS, a memory-centric design for smartphone data storage. This design no longer exercises cache writeback at short fixed periods or on file synchronization calls. Instead, it incrementally checkpoints app data into flash at appropriate times, as calculated by a set of app/user-adaptive policies. MobiFS also introduces transactions into the cache to guarantee data consistency. This design trades off data staleness for better app responsiveness and energy efficiency, in a quantitative manner. Evaluations show that MobiFS achieves 18.8x higher write throughput and 11.2x more database transactions per second than the default Ext4 filesystem in Android. Popular real-world apps show improvements in response time and energy consumption of 51.6% and 35.8% on average, respectively.
http://research.microsoft.com/apps/pubs/default.aspx?id=244455
Wed, 01 Jul 2015 07:00:00 GMT

Mercury: Hybrid Centralized and Distributed Scheduling in Large Shared Clusters
Datacenter-scale computing for analytics workloads is increasingly common. High operational costs force heterogeneous applications to share cluster resources for achieving economy of scale. Scheduling such large and diverse workloads is inherently hard, and existing approaches tackle this in two alternative ways: 1) centralized solutions offer strict, secure enforcement of scheduling invariants (e.g., fairness, capacity) for heterogeneous applications; 2) distributed solutions offer scalable, efficient scheduling for homogeneous applications. We argue that these solutions are complementary, and advocate a blended approach. Concretely, we propose Mercury, a hybrid resource management framework that supports the full spectrum of scheduling, from centralized to distributed. Mercury exposes a programmatic interface that allows applications to trade off between scheduling overhead and execution guarantees. Our framework harnesses this flexibility by opportunistically utilizing resources to improve task throughput. Experimental results show gains of over 35% on production-derived workloads. These benefits can be translated by appropriate application and operator policies into job throughput or job latency improvements. We have implemented and are currently open-sourcing Mercury as an extension of Apache Hadoop / YARN.
http://research.microsoft.com/apps/pubs/default.aspx?id=244469
Wed, 01 Jul 2015 07:00:00 GMT

Optimizing Network Performance in Distributed Machine Learning
To cope with the ever-growing availability of training data, there have been several proposals to scale machine learning computation beyond a single server and distribute it across a cluster. While this enables reducing the training time, the observed speed-up is often limited by network bottlenecks. To address this, we design MLNet, a host-based communication layer that aims to improve the network performance of distributed machine learning systems. This is achieved through a combination of traffic reduction techniques (to diminish network load in the core and at the edges) and traffic management (to reduce average training time). A key feature of MLNet is its compatibility with existing hardware and software infrastructure, so it can be immediately deployed. We describe the main techniques underpinning MLNet and show through simulation that the overall training time can be reduced by up to 78%. While preliminary, our results indicate the critical role played by the network and the benefits of introducing a new communication layer to increase the performance of distributed machine learning systems.
http://research.microsoft.com/apps/pubs/default.aspx?id=245011
Wed, 01 Jul 2015 07:00:00 GMT

Real-Time City-Scale Taxi Ridesharing
We proposed and developed a taxi-sharing system that accepts taxi passengers’ real-time ride requests sent from smartphones and schedules suitable taxis to pick them up via ridesharing, subject to time, capacity, and monetary constraints. The monetary constraints provide incentives for both passengers and taxi drivers: passengers will not pay more than they would without ridesharing and are compensated if their travel time is lengthened due to ridesharing; taxi drivers make money for all the detour distance incurred by ridesharing. While such a system is of significant social and environmental benefit, e.g., saving energy consumption and satisfying people’s commuting needs, real-time taxi-sharing has not been well studied yet. To this end, we devise a taxi-sharing system based on a mobile-cloud architecture. Taxi riders and taxi drivers use the taxi-sharing service provided by the system via a smartphone app. The cloud first quickly finds candidate taxis for a taxi ride request using a taxi-searching algorithm supported by a spatio-temporal index. A scheduling process is then performed in the cloud to select a taxi that satisfies the request with the minimum increase in travel distance. We built an experimental platform using the GPS trajectories generated by over 33,000 taxis over a period of 3 months. A ride-request generator (available at http://cs.uic.edu/~sma/ridesharing) is developed based on a stochastic process modeling real ride requests learned from the dataset. Tested on this platform with extensive experiments, our proposed system demonstrated its efficiency, effectiveness, and scalability. For example, when the ratio of the number of ride requests to the number of taxis is 6, our proposed system serves three times as many taxi riders as when no ridesharing is performed, while saving 11% in total travel distance and 7% in taxi fare per rider.
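The scheduling step can be caricatured as follows. This toy greedy selection uses straight-line distances and a capacity check only, ignoring the spatio-temporal index and the time-window and monetary constraints of the real system.

```python
import math

def schedule(taxis, pickup, dropoff, capacity=4):
    """Pick the taxi whose (naively estimated) added travel distance is
    smallest, skipping full taxis. taxis: list of (position, n_passengers).
    The estimate drive-to-pickup + pickup-to-dropoff is a deliberate
    simplification of inserting the request into an existing route."""
    best, best_cost = None, float("inf")
    for idx, (pos, riders) in enumerate(taxis):
        if riders >= capacity:
            continue                       # capacity constraint
        cost = math.dist(pos, pickup) + math.dist(pickup, dropoff)
        if cost < best_cost:
            best, best_cost = idx, cost
    return best, best_cost

taxis = [((0.0, 0.0), 1), ((5.0, 5.0), 0), ((2.0, 0.0), 4)]  # last one is full
taxi, cost = schedule(taxis, pickup=(1.0, 0.0), dropoff=(1.0, 4.0))
print(taxi)   # 0: the closest non-full taxi to the pickup point
```

The real system replaces the brute-force scan with an index-backed candidate search and evaluates actual route insertions under the time and fare constraints described above.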
http://research.microsoft.com/apps/pubs/default.aspx?id=219428
Wed, 01 Jul 2015 07:00:00 GMT

Angelic Verification: Precise Verification Modulo Unknowns
Verification of open programs can be challenging in the presence of an unconstrained environment. Verifying properties that depend on the environment yields a large class of uninteresting false alarms. Using a verifier on a program thus requires extensive initial investment in modeling the environment of the program. We propose a technique called angelic verification for verification of open programs, where we constrain a verifier to report warnings only when no acceptable environment specification exists to prove the assertion. Our framework is parametric in a vocabulary and a set of angelic assertions that allows a user to configure the tool. We describe several instantiations of the framework and an evaluation on a set of real-world benchmarks to show that our technique is competitive with industrial-strength tools even without models of the environment.
http://research.microsoft.com/apps/pubs/default.aspx?id=244552
Wed, 01 Jul 2015 07:00:00 GMT

Log2: A Cost-Aware Logging Mechanism for Performance Diagnosis
Logging has been a common practice for monitoring and diagnosing performance issues. However, logging comes at a cost, especially for large-scale online service systems. First, the overhead incurred by intensive logging is non-negligible. Second, it is costly to diagnose a performance issue if there is a tremendous amount of redundant logs. Therefore, we believe that it is important to limit the overhead incurred by logging without sacrificing logging effectiveness. In this paper we propose Log2, a cost-aware logging mechanism. Given a “budget” (defined as the maximum volume of logs allowed to be output in a time interval), Log2 makes the “whether to log” decision through a two-phase filtering mechanism. In the first phase, a large number of irrelevant logs are discarded efficiently. In the second phase, useful logs are cached and output while complying with the logging budget. In this way, Log2 keeps the useful logs and discards the less useful ones. We have implemented Log2 and evaluated it on an open-source system as well as a real-world online service system from Microsoft. The experimental results show that Log2 can control logging overhead while preserving logging effectiveness.
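A minimal sketch of the two-phase, budget-aware idea; the relevance threshold and per-message utility scores below are our stand-ins for Log2's actual phase criteria.

```python
class BudgetedLogger:
    """Two-phase, budget-aware log filter (illustrative sketch only)."""

    def __init__(self, budget, min_utility):
        self.budget = budget            # max log lines emitted per interval
        self.min_utility = min_utility  # phase 1: cheap relevance cut-off
        self._cache = []

    def log(self, message, utility):
        # Phase 1: discard clearly irrelevant logs immediately and cheaply.
        if utility >= self.min_utility:
            self._cache.append((utility, message))

    def flush(self):
        # Phase 2: at the end of the interval, emit only the most useful
        # cached logs while staying within the budget.
        self._cache.sort(key=lambda t: -t[0])
        emitted = [msg for _, msg in self._cache[:self.budget]]
        self._cache.clear()
        return emitted

log = BudgetedLogger(budget=2, min_utility=0.2)
for msg, u in [("slow RPC", 0.9), ("cache miss", 0.5),
               ("heartbeat", 0.1), ("GC pause", 0.7)]:
    log.log(msg, u)
print(log.flush())   # ['slow RPC', 'GC pause']
```

The cheap first phase keeps the hot logging path fast, while the second phase enforces the volume budget only at interval boundaries.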
http://research.microsoft.com/apps/pubs/default.aspx?id=244551
Wed, 01 Jul 2015 07:00:00 GMT

Synthesizing Executable Gene Regulatory Networks from Single-Cell Gene Expression Data
http://research.microsoft.com/apps/pubs/default.aspx?id=244559
Wed, 01 Jul 2015 07:00:00 GMT

Lock-Free Algorithms under Stochastic Schedulers
In this work, we consider the following random process, motivated by the analysis of lock-free concurrent algorithms under high memory contention. In each round, a new scheduling step is allocated to one of $n$ threads, according to a distribution $\vect{p} = (p_1, p_2, \ldots, p_n)$, where thread $i$ is scheduled with probability $p_i$. When some thread first reaches a set threshold of executed steps, it registers a \emph{win}, completing its current operation, and resets its step count to $1$. At the same time, threads whose step count was close to the threshold also get reset because of the win, but to $0$ steps, being penalized for \emph{almost} winning. We are interested in two questions: how often does \emph{some} thread complete an operation (\emph{system latency}), and how often does a \emph{specific} thread complete an operation (\emph{individual latency})? We provide asymptotically tight bounds for the system and individual latency of this general concurrency pattern, for arbitrary scheduling distributions $\vect{p}$. Surprisingly, a simple characterization exists: in expectation, the system will complete a new operation every $\Theta(1/\|\vect{p}\|_2)$ steps, while thread $i$ will complete a new operation every $\Theta(\|\vect{p}\|_2/p_i^2)$ steps. The proof is interesting in its own right, as it requires a careful analysis of how the higher norms of the vector $\vect{p}$ influence the thread step counts and latencies in this random process. Our result offers a simple connection between the scheduling distribution and the average performance of concurrent algorithms, which has several applications.
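The random process is specified concretely enough to simulate; the `near` parameter below is our concrete reading of "close to the threshold", which the abstract leaves abstract.

```python
import random

def simulate(p, threshold=50, near=5, steps=100_000, seed=1):
    """Simulate the round-based process from the abstract: in each round one
    of the n threads is scheduled according to the distribution p. A thread
    reaching `threshold` steps wins (completes an operation) and resets to 1,
    while threads within `near` steps of the threshold are reset to 0,
    'penalized for almost winning'."""
    rng = random.Random(seed)
    n = len(p)
    count = [0] * n
    wins = [0] * n
    for _ in range(steps):
        i = rng.choices(range(n), weights=p, k=1)[0]
        count[i] += 1
        if count[i] >= threshold:
            wins[i] += 1               # thread i completes an operation
            count[i] = 1
            for j in range(n):
                if j != i and count[j] >= threshold - near:
                    count[j] = 0       # near-winners are knocked back
    return wins

# With a uniform schedule over 4 threads, wins come out roughly balanced.
print(simulate([0.25] * 4))
```

The characterization above predicts a system-wide completion every $\Theta(1/\|\vect{p}\|_2)$ rounds: skewing $\vect{p}$ increases $\|\vect{p}\|_2$ and so improves system latency, while the $\Theta(\|\vect{p}\|_2/p_i^2)$ individual latency shows that rarely scheduled threads pay quadratically for their small $p_i$, a trend the simulation reproduces qualitatively.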
http://research.microsoft.com/apps/pubs/default.aspx?id=245182
Wed, 01 Jul 2015 07:00:00 GMT