To make a prairie it takes a clover and one bee,—
One clover, and a bee,
And revery.
The revery alone will do
If bees are few.
— Emily Dickinson, 1830-1886
Over the years, my research projects have spanned user interfaces, software engineering and type theory, but they all share a common goal: to make it easier to produce usable, reliable software. When you observe the work practice of an experienced professional, like a surgeon or a car mechanic, you see efficient, graceful use of task-appropriate tools. In contrast, if you watch an experienced software developer doing an everyday task, you see fumbling, confusion and frustration. Software developers are every bit as trained and talented, but their tools and processes are often poorly suited for their tasks.
My group at Microsoft Research, Human Interactions in Programming (HIP), applies user-centered design to software development: studying developers both in the lab and in the field; understanding what is difficult about their typical tasks; building new tools to make those tasks easier; and evaluating those tools with developers. My recent research studies recommender systems for team newcomers, the use of spatial memory to navigate large code bases, retaining knowledge in long-lived projects, and patterns of communication and interruption in co-located and geographically distributed development teams.
- Robert DeLine, Making CHASE Mainstream (Keynote at CHASE Workshop), 17 May 2009.
- Titus Barik, Robert DeLine, Steven Drucker, and Danyel Fisher, The Bones of the System: A Study of Logging and Telemetry at Microsoft, no. MSR-TR-2015-79, 26 October 2015.
Large software organizations are transitioning to event data platforms as they culturally shift to better support data-driven decision making. This paper offers a case study at Microsoft during such a transition. Through qualitative interviews of 28 participants, and a quantitative survey of 1,823 respondents, we catalog a diverse set of activities that leverage event data sources, identify challenges in conducting these activities, and describe tensions that emerge in data-driven cultures as event data flow through these activities within the organization. We find that the use of event data spans every job role in our interviews and survey, that different perspectives on event data create tensions between roles or teams, and that professionals report social and technical challenges across activities.
- Badrish Chandramouli, Jonathan Goldstein, Mike Barnett, Robert DeLine, Danyel Fisher, John C. Platt, James F. Terwilliger, and John Wernsing, Trill: A High-Performance Incremental Query Processor for Diverse Analytics, VLDB – Very Large Data Bases, August 2015.
This paper introduces Trill – a new query processor for analytics. Trill fulfills a combination of three requirements for a query processor to serve the diverse big data analytics space: (1) Query Model: Trill is based on a tempo-relational model that enables it to handle streaming and relational queries with early results, across the latency spectrum from real-time to offline; (2) Fabric and Language Integration: Trill is architected as a high-level language library that supports rich data-types and user libraries, and integrates well with existing distribution fabrics and applications; and (3) Performance: Trill’s throughput is high across the latency spectrum. For streaming data, Trill’s throughput is 2-4 orders of magnitude higher than comparable streaming engines. For offline relational queries, Trill’s throughput is comparable to a major modern commercial columnar DBMS. Trill uses a streaming batched-columnar data representation with a new dynamic compilation-based system architecture that addresses all these requirements. In this paper, we describe Trill’s new design and architecture, and report experimental results that demonstrate Trill’s high performance across diverse analytics scenarios. We also describe how Trill’s ability to support diverse analytics has resulted in its adoption across many usage scenarios at Microsoft.
- Miryung Kim, Thomas Zimmermann, Robert DeLine, and Andrew Begel, The Emerging Role of Data Scientists on Software Development Teams, no. MSR-TR-2015-30, 12 April 2015.
Creating and running software produces large amounts of raw data about the development process and the customer usage, which can be turned into actionable insight with the help of skilled data scientists. Unfortunately, data scientists with the analytical and software engineering skills to analyze these large data sets have been hard to come by; only recently have software companies started to develop competencies in software-oriented data analytics. To understand this emerging role, we interviewed data scientists across several product groups at Microsoft. In this paper, we describe their education and training background, their raison d’être in software engineering contexts, and the type of problems on which they work. We identify five distinct working styles of data scientists and describe a set of strategies that they employ to increase the impact and actionability of their work.
- Emanuel Zgraggen, Steven M. Drucker, Danyel Fisher, and Robert DeLine, (s|qu)eries: Visual Regular Expressions for Querying and Exploring Event Sequences, ACM – Association for Computing Machinery, April 2015.
Many different domains collect event sequence data and rely on finding and analyzing patterns within it to gain meaningful insights. Current systems that support such queries either provide limited expressiveness, hinder exploratory workflows or present interaction and visualization models which do not scale well to large and multi-faceted data sets. In this paper we present (s|qu)eries (pronounced “Squeries”), a visual query interface for creating queries on sequences (series) of data, based on regular expressions. (s|qu)eries is a touch-based system that exposes the full expressive power of regular expressions in an approachable way and interleaves query specification with result visualizations. Being able to visually investigate the results of different query-parts supports debugging and encourages iterative query-building as well as exploratory workflows. We validate our design and implementation through a set of informal interviews with data scientists who analyze event sequences on a daily basis.
- Danyel Fisher, Badrish Chandramouli, Robert DeLine, Jonathan Goldstein, Andrei Aron, Mike Barnett, John C. Platt, James F. Terwilliger, and John Wernsing, Tempe: An Interactive Data Science Environment for Exploration of Temporal and Streaming Data, no. MSR-TR-2014-148, November 2014.
Over the last two decades, data scientists have performed increasingly sophisticated analyses on larger data sets, yet their tools and workflows remain low-level. A typical analysis involves different tools for different stages of the work, requiring file transfers and considerable care to keep everything organized. Temporal data adds further complexity: users typically must write queries offline before porting them to production systems. To address these problems, this paper introduces Tempe, a web application providing an integrated, collaborative environment for both real-time and offline temporal data analysis. Tempe's central concept is a persistent research notebook retaining data sources, analysis steps and results. Analysis steps are carried out in a script editor that uses a live programming approach to display interactive, progressively updated visualizations. Tempe uses a temporal streaming engine, Trill, as its backend data processor. In the process of creating Tempe, we have discovered new interactivity and responsiveness requirements for Trill. Conversely, building around Trill has shaped the user experience for Tempe. We report on this cross-disciplinary design process to argue that end user experience can be an integral part of creating a data engine.
- Badrish Chandramouli, Jonathan Goldstein, Mike Barnett, Robert DeLine, Danyel Fisher, John C. Platt, James F. Terwilliger, and John Wernsing, The Trill Incremental Analytics Engine, no. MSR-TR-2014-54, April 2014.
This technical report introduces Trill – a new query processor for analytics. Trill fulfills a combination of three requirements for a query processor to serve the diverse big data analytics space: (1) Query Model: Trill is based on a tempo-relational model that enables it to handle streaming and relational queries with early results, across the latency spectrum from real-time to offline; (2) Fabric and Language Integration: Trill is architected as a high-level language library that supports rich data-types and user libraries, and integrates well with existing distribution fabrics and applications; and (3) Performance: Trill’s throughput is high across the latency spectrum. For streaming data, Trill’s throughput is 2-4 orders of magnitude higher than today’s comparable streaming engines. For offline relational queries, Trill’s throughput is comparable to a major modern commercial columnar DBMS.
Trill uses a streaming batched-columnar data representation with a new dynamic compilation-based system architecture that addresses all these requirements. In this technical report, we describe Trill’s new design and architecture, and report experimental results that demonstrate Trill’s high performance across diverse analytics scenarios. We also describe how Trill’s ability to support diverse analytics has resulted in its adoption across many usage scenarios at Microsoft.
- Mike Barnett, Robert DeLine, Akash Lal, and Shaz Qadeer, Get Me Here: Using Verification Tools to Answer Developer Questions, no. MSR-TR-2014-10, February 2014.
While working developers often struggle to answer reachability questions (e.g. How can execution reach this line of code? How can execution get into this state?), the research community has created analysis and verification technologies whose purpose is systematic exploration of program execution. In this paper, we show the feasibility of using verification tools to create a query engine that automatically answers certain kinds of reachability questions. For a simple query, a developer invokes the “Get Me Here” command on a line of code. Our tool uses an SMT-based static analysis to search for an execution that reaches that line of code. If the line is reachable, the tool visualizes the trace using a Code Bubbles representation to show the methods invoked, the lines executed within the methods and the values of variables. The GetMeHere tool also supports more complex queries where the user specifies a start point, intermediate points, and an end point, each of which can specify a predicate over the program's state at that point. We evaluate the tool on a set of three benchmark programs. We compare the performance of the tool with professional developers answering the same reachability questions. We conclude that the tool has sufficient accuracy, robustness and performance for future testing with professional users.
- Mike Barnett, Badrish Chandramouli, Robert DeLine, Steven Drucker, Danyel Fisher, Jonathan Goldstein, Patrick Morrison, and John Platt, Stat! - An Interactive Analytics Environment for Big Data, in ACM SIGMOD International Conference on Management of Data (SIGMOD 2013), ACM SIGMOD, June 2013.
Exploratory analysis on big data requires us to rethink data management across the entire stack – from the underlying data processing techniques to the user experience. We demonstrate Stat! – a visualization and analytics environment that allows users to rapidly experiment with exploratory queries over big data. Data scientists can use Stat! to quickly refine to the correct query, while getting immediate feedback after processing a fraction of the data. Stat! can work with multiple processing engines in the backend; in this demo, we use Stat! with the Microsoft StreamInsight streaming engine. StreamInsight is used to generate incremental early results to queries and refine these results as more data is processed. Stat! allows data scientists to explore data, dynamically compose multiple queries to generate streams of partial results, and display partial results in both textual and visual form.
- Kael Rowan, Robert DeLine, Andrew Bragdon, and Jens Jacobsen, Debugger Canvas: Industrial Experience with the Code Bubbles Paradigm, International Conference on Software Engineering, 2 June 2012.
At ICSE 2010, the Code Bubbles team from Brown University and the Code Canvas team from Microsoft Research presented similar ideas for new user experiences for an integrated development environment. Since then, the two teams formed a collaboration, along with the Microsoft Visual Studio team, to release Debugger Canvas, an industrial version of the Code Bubbles paradigm. With Debugger Canvas, a programmer debugs her code as a collection of code bubbles, annotated with call paths and variable values, on a two-dimensional pan-and-zoom surface. In this experience report, we describe new user interface ideas, describe the rationale behind our design choices, evaluate the performance overhead of the new design, and provide user feedback based on lab participants, post-release usage data, and a user survey and interviews. We conclude that the code bubbles paradigm does scale to existing customer code bases, is best implemented as a mode in the existing user experience rather than a replacement, and is most useful when the user has long or complex call paths, a large or unfamiliar code base, or complex control patterns, like factories or dynamic linking.
- Danyel Fisher, Rob DeLine, Mary Czerwinski, and Steven Drucker, Interactions with Big Data Analytics, in ACM Interactions, ACM, May 2012.