﻿<?xml version="1.0" encoding="utf-8" standalone="no"?>
<rss version="2.0">
  <channel>
    <title>Microsoft Research Publications</title>
    <link>http://research.microsoft.com/apps/dp/pu/publications.aspx</link>
    <description>Keep current with all the latest Microsoft Research Publications and Technical Reports</description>
    <copyright>© 2013 Microsoft Corporation. All rights reserved.</copyright>
    <language>en-US</language>
    <lastBuildDate>Fri, 17 May 2013 10:20:34 GMT</lastBuildDate>
    <pubDate>Fri, 17 May 2013 10:20:34 GMT</pubDate>
    <ttl>2880</ttl>
    <item>
      <title>You Needn't Build That: Reusable Ethics-Compliance Infrastructure for Human Subjects Research</title>
      <description>Just as security is often a secondary task when users sit down to accomplish something on their computers, ethics tends to be a secondary task for the security researchers who study these users. Both security and ethics rules are often viewed as an inconvenience to those whose productivity is reduced by demands to comply. For researchers, ethics requirements such as informed consent and debriefing are just one of many sources of friction that stand in the way of their research goals. In this paper, we describe how shared tooling could assist in three different research functions related to ethical compliance: obtaining informed consent, debriefing, and the surveying of surrogate participants when consent cannot be obtained from actual participants. Having invested the time to exceed ethical compliance standards in our recent security experiments, we believe this increased attention to ethical design has benefited participants. We are building services to perform these compliance tasks with the goal of reducing the cost of compliance to researchers and obtaining a level of attention to participant protection that would be unreasonable to expect from researchers for whom this is not a primary goal. While we are in part motivated to build reusable ethics-compliance tools because they serve a social good, we too stand to benefit; we plan to build these tools as services that facilitate the sharing of ethics-related behavioral data with the ethics research community. As members of that community, we hope to aggregate the behavioral observations flowing from myriad experiments' ethics infrastructure and use these data to iteratively improve the design of our tools. We also hope to run experiments and analyses using these data that benefit the research community as a whole. We hope that, as the flow of data on ethics-related interactions grows, other researchers will also use these data to advance the state of research ethics. 
In the remainder of this paper, we describe proposed improvements to three ethics-compliance tasks that could be achieved through reusable tooling. These tasks are the obtaining of informed consent, the debriefing of participants along with the monitoring of participants' reactions during debriefings, and the surveying of surrogate participants.</description>
      <link>http://research.microsoft.com/apps/pubs/default.aspx?id=189424</link>
      <pubDate>Thu, 23 May 2013 07:00:00 GMT</pubDate>
    </item>
    <item>
      <title>A Characteristic Study on Failures of Production Distributed Data-Parallel Programs</title>
      <description>SCOPE is adopted by thousands of developers from tens of different product teams in Microsoft Bing for daily web-scale data processing, including index building, search ranking, and advertisement display. A SCOPE job is composed of declarative SQL-like queries and imperative C# user-defined functions (UDFs), which are executed in a pipeline by thousands of machines. Tens of thousands of SCOPE jobs are executed on Microsoft clusters per day, and some of them fail after a long execution time and thus waste tremendous resources. Reducing SCOPE failures would save significant resources. This paper presents a comprehensive characteristic study on 200 SCOPE failures/fixes and 50 SCOPE failures with debugging statistics from Microsoft Bing, investigating not only major failure types, failure sources, and fixes, but also current debugging practice. Our major findings include: (1) most of the failures (84.5%) are caused by defects in data processing rather than defects in code logic; (2) table-level failures (22.5%) are mainly caused by programmers’ mistakes and frequent data-schema changes, while row-level failures (62%) are mainly caused by exceptional data; (3) 93% of fixes do not change data processing logic; (4) 8% of failures have a root cause that is not at the failure-exposing stage, making current debugging practice insufficient in these cases. Our study results provide valuable guidelines for future development of data-parallel programs. We believe that these guidelines are not limited to SCOPE, but can also be generalized to other similar data-parallel platforms.</description>
      <link>http://research.microsoft.com/apps/pubs/default.aspx?id=185279</link>
      <pubDate>Wed, 22 May 2013 07:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Structural and Temporal Patterns-Based Features</title>
      <description>In this paper, we propose a data transformation pattern to transform sequential data into a set of binary/categorical features and numerical features to enable data analysis. These features capture both structural and temporal information inherent in sequential data.</description>
      <link>http://research.microsoft.com/apps/pubs/default.aspx?id=188362</link>
      <pubDate>Tue, 21 May 2013 07:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Pinocchio: Nearly Practical Verifiable Computation</title>
      <description>To instill greater confidence in computations outsourced to the cloud, clients should be able to verify the correctness of the results returned. To this end, we introduce Pinocchio, a built system for efficiently verifying general computations while relying only on cryptographic assumptions. With Pinocchio, the client creates a public evaluation key to describe her computation; this setup is proportional to evaluating the computation once. The worker then evaluates the computation on a particular input and uses the evaluation key to produce a proof of correctness. The proof is only 288 bytes, regardless of the computation performed or the size of the inputs and outputs. Anyone can use a public verification key to check the proof. Crucially, our evaluation on seven applications demonstrates that Pinocchio is efficient in practice too. Pinocchio's verification time is typically 10ms: 5-7 orders of magnitude less than previous work; indeed Pinocchio is the first general-purpose system to demonstrate verification cheaper than native execution (for some apps). Pinocchio also reduces the worker's proof effort by an additional 19-60x. As an additional feature, Pinocchio generalizes to zero-knowledge proofs at a negligible cost over the base protocol. Finally, to aid development, Pinocchio provides an end-to-end toolchain that compiles a subset of C into programs that implement the verifiable computation protocol. For the full version of our paper, including a correction to the verification procedure, see http://eprint.iacr.org/2013/279. Pinocchio's source code is also available! See the "Related Downloads" link on the right.</description>
      <link>http://research.microsoft.com/apps/pubs/default.aspx?id=180286</link>
      <pubDate>Tue, 21 May 2013 07:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Finding the Linchpins of the Dark Web: a Study on Topologically Dedicated Hosts on Malicious Web Infrastructures</title>
      <description />
      <link>http://research.microsoft.com/apps/pubs/default.aspx?id=184251</link>
      <pubDate>Sun, 19 May 2013 07:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Direct GPU/FPGA Communication Via PCI Express</title>
      <description>We describe a mechanism for connecting GPU and FPGA devices directly via the PCI Express bus, enabling the transfer of data between these heterogeneous computing units without the intermediate use of system memory. We evaluate the performance benefits of this approach over a range of transfer sizes, and demonstrate its utility in a computer vision application. We find that bypassing system memory yields improvements as high as 2.2x in data transfer speed, and 1.9x in application performance.</description>
      <link>http://research.microsoft.com/apps/pubs/default.aspx?id=192884</link>
      <pubDate>Fri, 17 May 2013 07:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Enhancing Personalized Search by Mining and Modeling Task Behavior</title>
      <description>Personalized search systems tailor search results to the current user intent using historic search interactions. This relies on being able to find pertinent information in the user’s search history, which can be challenging for unseen queries and for new search scenarios. Building richer models of users’ current and historic search tasks can help improve the likelihood of finding relevant content and enhance the relevance and coverage of personalization methods. The task-based approach can be applied to the current user’s search history or, as we focus on here, to all users’ search histories as so-called “groupization” (a variant of personalization whereby other users’ profiles can be used to personalize the search experience). We describe a method whereby we mine historic search-engine logs to find other users performing similar tasks to the current user and leverage their on-task behavior to identify Web pages to promote in the current ranking. We investigate the effectiveness of this approach versus query-based matching and finding related historic activity from the current user (i.e., group vs. individual). As part of our studies we also explore the use of the on-task behavior of particular user cohorts, such as people who are more expert in the current topic, rather than all users, with potentially promising results. Our findings have direct implications for improving personalization in Web search engines.</description>
      <link>http://research.microsoft.com/apps/pubs/default.aspx?id=183841</link>
      <pubDate>Mon, 13 May 2013 07:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Learning to Extract Cross-Session Search Tasks</title>
      <description>Search tasks, comprising a series of search queries serving the same information need, have recently been recognized as an accurate atomic unit for modeling user search intent. Most prior research in this area has focused on short-term search tasks within a single search session, and depends heavily on human annotations for supervised classification model learning. In this work, we target the identification of long-term, or cross-session, search tasks (transcending session boundaries) by investigating inter-query dependencies learned from users' searching behavior. A semi-supervised clustering model is proposed based on the latent structural SVM framework, and a set of effective automatic annotation rules is proposed as weak supervision to relieve the burden of manual annotation. Experimental results using a large-scale search log of real user behavior collected from Bing.com confirm the effectiveness of the proposed model in identifying cross-session search tasks and the utility of the introduced weak supervision signals. Our learned model enables a more comprehensive understanding of search behavior via search logs and facilitates the development of dedicated search-engine support for long-term tasks.</description>
      <link>http://research.microsoft.com/apps/pubs/default.aspx?id=183842</link>
      <pubDate>Mon, 13 May 2013 07:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Failure Recovery: When the Cure Is Worse Than the Disease</title>
      <description>Cloud services inevitably fail: machines lose power, networks become disconnected, pesky software bugs cause sporadic crashes, and so on. Unfortunately, failure recovery itself is often faulty; e.g., recovery can accidentally recursively replicate small failures to other machines until the entire cloud service fails in a catastrophic outage, amplifying a small cold into a contagious deadly plague! We propose that failure recovery should be engineered foremost according to the maxim of primum non nocere, that it “does no harm.” Accordingly, we must consider the system holistically when failure occurs and recover only when observed activity safely allows for it.</description>
      <link>http://research.microsoft.com/apps/pubs/default.aspx?id=191008</link>
      <pubDate>Mon, 13 May 2013 07:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Using Dark Fiber to Replace Diesel Generators</title>
      <description>Cloud providers and other data center operators use geo-distributed data centers. But these data centers largely continue to employ the same designs as were appropriate for single data centers. These designs are wasteful because they do not take full advantage of geo-redundancy. Geo-redundancy can reduce other redundancy at multiple intermediate layers in individual data centers and decrease costs. We discuss options for changing infrastructure design to realize such savings. Our proposal opens up an exciting and novel area of investigation into the design of software that can effectively leverage such platforms.</description>
      <link>http://research.microsoft.com/apps/pubs/default.aspx?id=188933</link>
      <pubDate>Mon, 13 May 2013 07:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Predicting Advertiser Bidding Behaviors in Sponsored Search by Rationality Modeling</title>
      <description>We study how an advertiser changes his/her bid prices in sponsored search, by modeling his/her rationality. Predicting the bid changes of advertisers with respect to their campaign performances is a key capability of search engines, since it can be used to improve the offline evaluation of new advertising technologies and the forecast of future revenue of the search engine. Previous work on advertiser behavior modeling heavily relies on the assumption of perfect advertiser rationality; however, in most cases, this assumption does not hold in practice. Advertisers may be unwilling, incapable, and/or constrained to achieve their best response. In this paper, we explicitly model these limitations in the rationality of advertisers, and build a probabilistic advertiser behavior model from the perspective of a search engine. We then use the expected payoff to define the objective function for an advertiser to optimize given his/her limited rationality. By solving the optimization problem with Monte Carlo, we get a prediction of mixed bid strategy for each advertiser in the next period of time. We examine the effectiveness of our model both directly using real historical bids and indirectly using revenue prediction and click number prediction. Our experimental results based on the sponsored search logs from a commercial search engine show that the proposed model can provide a more accurate prediction of advertiser bid behaviors than several baseline methods.</description>
      <link>http://research.microsoft.com/apps/pubs/default.aspx?id=190943</link>
      <pubDate>Mon, 13 May 2013 07:00:00 GMT</pubDate>
    </item>
    <item>
      <title>The benefits of selecting phenotype-specific variants for applications of mixed models in genomics</title>
      <description>Applications of linear mixed models (LMMs) to problems in genomics include phenotype prediction, correction for confounding in genome-wide association studies, estimation of narrow sense heritability, and testing sets of variants (e.g., rare variants) for association. In each of these applications, the LMM uses a genetic similarity matrix, which encodes the pairwise similarity between every two individuals in a cohort. Although ideally these similarities would be estimated using strictly variants relevant to the given phenotype, the identity of such variants is typically unknown. Consequently, relevant variants are excluded and irrelevant variants are included, both having deleterious effects. For each application of the LMM, we review known effects and describe new effects showing how variable selection can be used to mitigate them.</description>
      <link>http://research.microsoft.com/apps/pubs/default.aspx?id=192474</link>
      <pubDate>Thu, 09 May 2013 07:00:00 GMT</pubDate>
    </item>
    <item>
      <title>SocialWatch: Detection of Online Service Abuse via Large-Scale Social Graphs</title>
      <description />
      <link>http://research.microsoft.com/apps/pubs/default.aspx?id=184250</link>
      <pubDate>Tue, 07 May 2013 07:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Statistical Image Completion</title>
      <description>Image completion involves filling missing parts in images. In this paper we address this problem through novel statistics of patch offsets. We observe that if we match similar patches in the image and obtain their offsets (relative positions), the statistics of these offsets are sparsely distributed. We further observe that a few dominant offsets provide reliable information for completing the image. We show that such statistics can be incorporated into both matching-based and graph-based methods for image completion. Experiments show that our method yields better results in various challenging cases, and is faster than existing state-of-the-art methods.</description>
      <link>http://research.microsoft.com/apps/pubs/default.aspx?id=192345</link>
      <pubDate>Sat, 04 May 2013 07:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Open-World Logic Programs: A New Foundation for Formal Specifications</title>
      <description>Recent advances in decision procedures and constraint solvers can enable a new generation of formal specification languages. In this paper we present a new semantic foundation for formal specifications, called open-world logic programming, which integrates with state-of-the-art solvers. Analysis, verification, and synthesis problems on open-world logic programs can be converted to constraints by a quantifier-elimination scheme using symbolic execution. This paper presents the features, semantics, and algorithms of open-world logic programs. We have implemented this approach in the FORMULA specification language, which has been used for production-quality specifications and models.</description>
      <link>http://research.microsoft.com/apps/pubs/default.aspx?id=192963</link>
      <pubDate>Wed, 01 May 2013 07:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Diversely Enumerating System-Level Architectures</title>
      <description>Embedded systems are highly constrained. System-level constraints, such as task partitioning problems and communication scheduling problems, are common, combinatorial, and fundamentally intractable. Though modern constraint solvers can help to synthesize constrained architectures, the architect's troubles do not end here: There may be (infinitely) many architectures satisfying system-level constraints. Multiple candidates must be examined and this is often infeasible for large solution spaces. In this paper we describe an improved enumeration scheme, which still reaps the benefits of modern constraint solvers. The idea is to build a diverse enumerator around an unmodified constraint solver. A diverse enumerator uniformly draws equivalence classes of solutions. Such an enumerator is powerful because it allows unbiased enumeration of the space and can be used to make inferences about the space as a whole. This paper presents the theory, practice, and algorithms for diverse enumeration of architectures with system-level constraints.</description>
      <link>http://research.microsoft.com/apps/pubs/default.aspx?id=192964</link>
      <pubDate>Wed, 01 May 2013 07:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Multi-Style Adaptive Training for Robust Cross-Lingual Spoken Language Understanding</title>
      <description>Given the increasing availability of machine translation (MT) services, one efficient strategy for cross-lingual spoken language understanding (SLU) is to first translate the input utterance from the second language into the primary language, and then call the primary language SLU system to decode the semantic knowledge. However, errors introduced in the MT process create a condition similar to the “mismatch” condition encountered in robust speech recognition. Such mismatch makes the performance of cross-lingual SLU far from acceptable. Motivated by successful solutions developed in robust speech recognition, in this paper we propose a multi-style adaptive training method to improve the robustness of the SLU system for cross-lingual SLU tasks. For evaluation, we created an English-Chinese bilingual ATIS database, and then carried out a series of experiments on that database to assess the proposed methods. Experimental results show that, without relying on any data in the second language, the proposed method significantly improves the performance on a cross-lingual SLU task while producing no degradation for input in the primary language. This greatly facilitates porting SLU to as many languages as there are MT systems without any human effort. We further study the robustness of this approach to another type of mismatch condition, caused by speech recognition errors, and demonstrate its success there as well.</description>
      <link>http://research.microsoft.com/apps/pubs/default.aspx?id=188865</link>
      <pubDate>Wed, 01 May 2013 07:00:00 GMT</pubDate>
    </item>
    <item>
      <title>On the Complexity Analysis of Randomized Block-Coordinate Descent Methods</title>
      <description>In this paper we analyze the randomized block-coordinate descent (RBCD) methods for minimizing the sum of a smooth convex function and a block-separable convex function. In particular, we extend Nesterov's technique (SIOPT 2012) for analyzing the RBCD method for minimizing a smooth convex function over a block-separable closed convex set to the aforementioned more general problem and obtain a sharper expected-value type of convergence rate than the one in Richtarik and Takac (Math Programming 2012). Also, we obtain a better high-probability type of iteration complexity, which improves upon the one by Richtarik and Takac by at least the amount $O(n/\epsilon)$, where $\epsilon$ is the target solution accuracy and $n$ is the number of problem blocks. In addition, for unconstrained smooth convex minimization, we develop a new technique called randomized estimate sequence to analyze the accelerated RBCD method proposed by Nesterov (SIOPT 2012) and establish a sharper expected-value type of convergence rate.</description>
      <link>http://research.microsoft.com/apps/pubs/default.aspx?id=192927</link>
      <pubDate>Wed, 01 May 2013 07:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Cluster-based Smoothing of Sparse Ranking Signals in Mobile Local Search</title>
      <description>Users increasingly rely on their mobile devices to search for local entities, typically businesses, while on the go. Recent work has recognized unique ranking signals in mobile local search (e.g., distance, customer rating, and number of reviews), and has proposed various ways of leveraging these signals for ranking. However, these techniques have overlooked a major challenge that is amplified in the case of mobile local search: data sparseness. In this work, we exploit domain knowledge about businesses to cluster them based on either the category of the business or the parent chain store that the business belongs to. We then smooth individual businesses' sparse ranking signals based on the hypothesis that businesses in the same cluster share similar ranking signals. Our experimental evaluation using 14 months of real mobile local search logs shows that the proposed cluster-based smoothing of these ranking signals can improve mean average precision by 5%.</description>
      <link>http://research.microsoft.com/apps/pubs/default.aspx?id=192874</link>
      <pubDate>Wed, 01 May 2013 07:00:00 GMT</pubDate>
    </item>
    <item>
      <title>SMT-based Analysis of Biological Computation</title>
      <description>Synthetic biology focuses on the re-engineering of living organisms for useful purposes while DNA computing targets the construction of therapeutics and computational circuits directly from DNA strands. The complexity of biological systems is a major engineering challenge and their modeling relies on a number of diverse formalisms. Moreover, many applications are "mission-critical" (e.g. as recognized by NASA's Synthetic Biology Initiative) and require robustness which is difficult to obtain. The ability to formally specify desired behavior and perform automated computational analysis of system models can help address these challenges, but today there are no unifying scalable analysis frameworks capable of dealing with this complexity. In this work, we study pertinent problems and modeling formalisms for DNA computing and synthetic biology and describe how they can be formalized and encoded to allow analysis using Satisfiability Modulo Theories (SMT). This work highlights biological engineering as a domain that can benefit extensively from the application of formal methods. It provides a step towards the use of such methods in computational design frameworks for biology and is part of a more general effort towards the formalization of biology and the study of biological computation.</description>
      <link>http://research.microsoft.com/apps/pubs/default.aspx?id=187333</link>
      <pubDate>Wed, 01 May 2013 07:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>