|The WW Web of Invisible Trackers
Internet advertisers reach millions of consumers through practices that involve real-time tracking of users’ online activities. The tracking is conducted by third-party ad services engaged by Web sites to facilitate marketing campaigns and service analytics. At the same time, the applications that mediate interaction with services, such as Internet browsers, reveal little or nothing to the user about the information flow between devices and services. That leaves consumers with no insight into what data is collected or how it is used. In the broader context of privacy and cyber-security, it is important to consider methods and computing designs that empower users to make well-informed decisions and take actions that keep themselves and others safe.
We present research projects that investigated three aspects of this problem: (1) characterizing the tracking ecosystem and the value exchange within it, (2) understanding users’ attitudes, behaviour, and awareness of tracking practices, and (3) designing applications and systems to increase the transparency of the data and value exchange between the user and services. We discuss the findings of three studies. They motivate us to consider alternatives to privacy-invading online practices and raise deeper questions about the design and comprehensibility of computing systems.
|Real-Time Audience Polling Using Computer Vision
How can teachers, executives, and other public speakers make an interactive presentation to a large audience? Up until now, techniques for polling the audience relied on personal electronic devices, such as smart phones or special purpose 'clickers'. To enable real-time polling of any audience, we introduce a new technique using computer vision. Each member of the audience is given a qCard: an ordinary sheet of paper with a printed barcode. The speaker asks a multiple-choice question, and audience members respond by holding their qCard in different ways. Using a laptop and digital camera, our software automatically recognizes and aggregates the responses. In this talk, we will describe our experience piloting this technology in Bangalore schools. We will also conduct a live demonstration!
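The aggregation step described above can be sketched as follows. This is a minimal illustration, not the speakers' software: it assumes the vision stage has already produced (card id, rotation angle) detections, and the mapping from card orientation to answer choice (`ORIENTATION_TO_CHOICE`) is hypothetical, since the abstract only says that cards are held "in different ways".

```python
from collections import Counter

# Hypothetical mapping (an assumption): card rotation in degrees -> answer choice.
ORIENTATION_TO_CHOICE = {0: "A", 90: "B", 180: "C", 270: "D"}


def snap_to_orientation(angle_degrees):
    """Round a measured rotation to the nearest multiple of 90 degrees."""
    return (round(angle_degrees / 90) % 4) * 90


def tally_responses(detections):
    """Aggregate (card_id, angle) detections into per-choice counts.

    A card seen in several frames is counted once; its latest detection wins.
    """
    latest = {}
    for card_id, angle in detections:
        latest[card_id] = ORIENTATION_TO_CHOICE[snap_to_orientation(angle)]
    return Counter(latest.values())


# Card 1 appears twice; only its second reading (about 90 degrees) is counted.
detections = [(1, 2.0), (2, 88.5), (3, 181.0), (1, 92.0), (4, 359.0)]
counts = tally_responses(detections)
```

Keying the tally on the card identifier is what lets the same card be seen in many video frames without being counted more than once.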
|Probabilistic Models and Machine Learning
The last forty years of the digital revolution have been driven by one simple fact: the number of transistors on a silicon chip doubles every couple of years. Today we are witnessing a second form of exponential growth: in the quantity of data being collected and stored. It is driving a transformation in information technology, from solutions that are explicitly hand-crafted to those that are learned from data. Real-world data, however, is full of complexity, ambiguity and uncertainty, and so the data revolution is driving a corresponding transformation from computing with logic to computing with probabilities. This talk will introduce the key ideas of computing with uncertainty, and will be illustrated with tutorial examples and real-world case studies.
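As one small illustration of what "computing with probabilities" can mean, the sketch below applies Bayes' rule to a set of discrete hypotheses. It is not taken from the talk; the spam-filter scenario and all of the numbers are invented for the example.

```python
def posterior(prior, likelihood):
    """Bayes' rule over discrete hypotheses.

    prior:      dict mapping hypothesis -> P(h)
    likelihood: dict mapping hypothesis -> P(data | h)
    Returns a dict mapping hypothesis -> P(h | data).
    """
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(unnormalized.values())  # P(data)
    return {h: p / evidence for h, p in unnormalized.items()}


# Illustrative numbers only: 20% of mail is spam, and the word "offer"
# appears in 60% of spam messages but only 5% of legitimate ones.
prior = {"spam": 0.2, "ham": 0.8}
likelihood_of_offer = {"spam": 0.6, "ham": 0.05}
belief = posterior(prior, likelihood_of_offer)
```

Rather than a hard yes/no rule, the result is a revised degree of belief: under these assumed numbers, seeing the word raises the probability of spam from 0.2 to 0.75.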
|Impact of Computer Science Research on Science, Technology, and Society
The field of computing is driven by scientific questions, technological innovation and societal demands. There is a wonderful interplay (push and pull) among these three drivers. For example, accelerating technological advances and monumental societal demands force us to revisit the most basic scientific questions of computing. These drivers are also measures of the impact of computing research. In my talk I will give examples from Microsoft Research of our impact on science, technology, and society. I will close with pointers to new directions for computing research.
|From the Edge of the Universe and Back Again
This talk will cover how the creation of the WorldWide Telescope to visualize the Universe served as the foundation for exploring more complex dynamic data sets, and how its guided tours will help democratize access to, and understanding of, spatio-temporal data for people back here on Earth.
Curtis Wong is a Principal Researcher in Microsoft Research eScience and co-creator of the WorldWide Telescope, which has over ten million users around the world. Curtis leveraged the ideas behind WorldWide Telescope to drive the development of Power Map, a high-performance interactive spatio-temporal data visualization in Office Excel that will be released later this year.
|Filling in the Blanks - The Importance of Basic Computing Research
One of the most exciting aspects of computer science is that the results of basic research so often end up being applied in completely unexpected ways. At Microsoft Research, we actively seek out these surprising outcomes by building a pipeline that connects long-term, blue-sky research to technological innovations. This talk will provide a glimpse into specific research projects that have had significant impact on many areas of computing.
|Fast Online Tensor Method for Overlapping Community Detection
|Frameworks for Distributed Machine Learning
This talk is in three parts. The first deals with an aspect of the Weka project that has received little attention, namely the use of machine learning in agricultural applications. I will outline our experiences in this field and present an application development framework which is a direct result of this activity. In particular, one project has met one of the challenges proposed by Kiri Wagstaff at ICML 2012. Second, I will talk about our work in data stream mining, with a focus on classification within the Massive Online Analysis (MOA) framework. After a quick overview of what is in MOA, I will present two recent results that indicate a need for caution, along with a statement of what constitutes the state of the art in data stream classification for practitioners. I will also discuss attempts to produce a distributed version of MOA called SAMOA, a platform for data stream mining in a cluster/cloud environment. It features an architecture that allows it to run on several distributed stream processing engines such as S4 and Storm. Finally, I will present the idea of experiment databases, a framework for machine learning experimentation that saves effort and offers opportunities for meta-learning and hypothesis generation.
|Distributed Newton Methods for CTR (Click Through Rate) Prediction
CTR (Click Through Rate) prediction is extremely important for Internet advertisements. Users' impression and click logs pose two major challenges. First, the data collected in just a few days contains billions of instances or more. Second, the number of positive instances (i.e., clicks) is relatively small, so the data set is highly unbalanced. We develop a distributed Newton method for training very large-scale logistic regression. We use real data to analyze the scalability of our method, the relationship between test accuracy and data size, the workflow of big-data experiments, and the various tools for implementing big-data machine learning packages.
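To make the idea concrete, here is a minimal single-process sketch of Newton's method for L2-regularized logistic regression. It is not the speakers' implementation. It only illustrates why the method can be distributed: the gradient and Hessian are sums over instances, so each data shard (standing in for a worker) can compute partial sums that are then added together. The function names, the regularization strength `reg`, and the toy data are assumptions made for the example, and a dense Hessian solve like this one would not be practical at the feature dimensions of real click logs.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def partial_sums(shard, w):
    """Gradient and Hessian contributions of one data shard (one 'worker')."""
    d = len(w)
    grad = [0.0] * d
    hess = [[0.0] * d for _ in range(d)]
    for x, y in shard:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for i in range(d):
            grad[i] += (p - y) * x[i]
            for j in range(d):
                hess[i][j] += p * (1.0 - p) * x[i] * x[j]
    return grad, hess


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def newton_logistic(shards, dim, reg=1.0, iterations=20):
    """Minimize (reg/2)*||w||^2 + total log-loss with full Newton steps."""
    w = [0.0] * dim
    for _ in range(iterations):
        # Start from the regularizer's gradient and Hessian, then add shard sums.
        grad = [reg * wi for wi in w]
        hess = [[reg if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        for shard in shards:
            g, h = partial_sums(shard, w)
            for i in range(dim):
                grad[i] += g[i]
                for j in range(dim):
                    hess[i][j] += h[i][j]
        step = solve(hess, grad)
        w = [wi - si for wi, si in zip(w, step)]
    return w


# Toy unbalanced data: each instance is ([bias, feature], label); few positives.
shard_a = [([1.0, 0.0], 0), ([1.0, 0.5], 0), ([1.0, 1.0], 0)]
shard_b = [([1.0, 1.5], 0), ([1.0, 2.0], 1), ([1.0, 3.0], 1)]
weights = newton_logistic([shard_a, shard_b], dim=2)
```

Because only the summed gradient and Hessian cross shard boundaries, splitting the same data into different shards yields the same weights up to floating-point rounding.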
|Scaling up Extraction Over Entities (and Relations)
Entity relationship search at Web scale, or even at the enterprise level, depends on adding dozens of entity annotations to each of billions of crawled pages and indexing the annotations at rates comparable to regular text indexing. Even small entity search benchmarks from TREC and INEX suggest that the entity catalog should support thousands of entity types and tens to hundreds of millions of entities. These targets raise many challenges, the major ones being (i) fast and effective entity extractors and disambiguators, (ii) the design of highly compressed data structures in RAM for spotting and disambiguating entity mentions, and of highly compressed disk-based annotation indices, and (iii) the use of annotations and efficient indices for effective and efficient entity-oriented search.
After providing a brief introduction to our prior work on entity annotation, disambiguation and entity-based search, we will focus on specific approaches we explored for scaling them up. In particular, we present two of our approaches geared toward scaling up operations in this area:
- Microsoft Research India
196/36 2nd Main
Sadashivnagar, Bangalore 560 080