May 21, 2013

DAPSE’13: International Workshop on Data Analysis Patterns in Software Engineering

Location: San Francisco, CA, USA

Tuesday, May 21, 2013
Hyatt Regency San Francisco
5 Embarcadero Center
San Francisco, California, USA 94111

Workshop in conjunction with the ICSE 2013 conference.

Important Dates

Workshop paper submissions due
February 7, 2013 (archival papers)

Notification of authors
February 28, 2013

Camera-ready copies
March 7, 2013

Non-archival submissions accepted until
April 24, 2013

Submission Site

https://www.easychair.org/conferences/?conf=dapse2013

Data scientists in software engineering seek insights in data collected from software projects in order to improve software development. The demand for data scientists with domain knowledge in software development is growing rapidly, and there is already a shortage of such data scientists.

Data science is a skilled art with a steep learning curve. To shorten that learning curve, this workshop will collect best practices in the form of data analysis patterns, that is, analyses of data that lead to meaningful conclusions and can be reused for comparable data. At the workshop we will compile a catalog of such patterns that will help both experienced and emerging data scientists communicate better about data analysis. The workshop is intended for anyone interested in how to analyze data correctly and efficiently in a community-accepted way.

Workshop Program

8:30 – 9:00 Introductions and discussion of plans and goals by the chairs

9:00 – 10:00 Lightning talks (5-minute presentations with 2 minutes for questions)

  • Olga Baysal, Oleksii Kononenko, Reid Holmes and Mike Godfrey. Extracting Artifact Lifecycle Models From Metadata History
  • Emanuel Giger and Harald Gall. Effect Size Analysis
  • Rodrigo Souza, Christina Chavez and Roberto Bittencourt. Patterns for Cleaning Up Bug Data
  • David Weiss and Audris Mockus. The Chunking Pattern
  • Barbara Russo. Parametric Classification over Multiple Samples
  • Xiaobing Sun, Ying Chen, Bin Li and Bixin Li. Exploring Software Engineering Data with Formal Concept Analysis
  • Barbara Russo and Maximilian Steff. Commit Histories
  • Sandro Morasca. Data Analysis Anti-Patterns in Empirical Software Engineering

10:00 – 10:30 Break

10:30 – 11:15 Lightning talks 2

  • Peter Schulam, Roni Rosenfeld and Premkumar Devanbu. Building Statistical Language Models of Code
  • Scott McGrath, Dhundy Kiran Bastola and Harvey Siy. Concept to Commit: A pattern designed to trace code changes from user requests to change implementation by analyzing mailing lists and code repositories.
  • Emmanuel Letier and Camilo Fitzgerald. Measure what Counts: An Evaluation Pattern for Software Data Analysis
  • Venkatesh Prasad Ranganath and Jithin Thomas. Structural and Temporal Patterns-based Features
  • Rodrigo Souza, Christina Chavez and Roberto A. Bittencourt. Patterns for Extracting High Level Information from Bug Reports
  • Burak Turhan. Relevancy Filtering

11:15 – 12:00 Discussion on what makes a good data analysis pattern

12:00 – 13:30 Lunch

13:30 – 14:45 Breakout discussion groups

14:45 – 15:30 Breakout groups present

15:30 – 16:00 Break

16:00 – 17:00 Workshop discussion

Potential topics include:

  • How do we “evangelize” patterns?
  • How can we make patterns reusable?
  • What needs exist for data analysis patterns?
  • What are common data analysis mistakes, and how can we or our patterns help others avoid them?
  • What is the right way to catalog the patterns?
  • Where should data analysis patterns live? Should there be a web resource where people post info on patterns?
  • Additional topics solicited from attendees.

17:00 Wrap up. Discussion of future events.

17:30 End

Accepted Papers

  • Emanuel Giger and Harald Gall. Effect Size Analysis
  • Rodrigo Souza, Christina Chavez and Roberto Bittencourt. Patterns for Cleaning Up Bug Data
  • Xiaobing Sun, Ying Chen, Bin Li and Bixin Li. Exploring Software Engineering Data with Formal Concept Analysis
  • David Weiss and Audris Mockus. The Chunking Pattern
  • Barbara Russo. Parametric Classification over Multiple Samples
  • Barbara Russo and Maximilian Steff. Commit Histories
  • Sandro Morasca. Data Analysis Anti-Patterns in Empirical Software Engineering
  • Olga Baysal, Oleksii Kononenko, Reid Holmes and Mike Godfrey. Extracting Artifact Lifecycle Models From Metadata History
  • Rodrigo Souza, Christina Chavez and Roberto A. Bittencourt. Patterns for Extracting High Level Information from Bug Reports
  • Peter Schulam, Roni Rosenfeld and Premkumar Devanbu. Building Statistical Language Models of Code
  • Scott McGrath, Dhundy Kiran Bastola and Harvey Siy. Concept to Commit: A pattern designed to trace code changes from user requests to change implementation by analyzing mailing lists and code repositories.
  • Emmanuel Letier and Camilo Fitzgerald. Measure what Counts: An Evaluation Pattern for Software Data Analysis
  • Venkatesh Prasad Ranganath and Jithin Thomas. Structural and Temporal Patterns-based Features

Non-Archival Accepted Papers

  • Burak Turhan. Relevancy Filtering.

Submissions

We solicit papers (2-3 pages) describing one or more data analysis patterns. Authors should use the form best suited to describing the pattern. Where possible, we encourage authors to describe patterns as follows:

  • Pattern name: a handle for the pattern
  • Problem: when to apply the pattern
  • Solution: how to apply the pattern
  • Consequences: results and trade-offs of applying the pattern, common mistakes to avoid when applying the pattern, etc.
  • Examples: a brief summary and/or citations of example applications of the pattern in the literature; where possible, R snippets or Weka code to apply the pattern, etc.

There are two options for submitting a proposal.

  • Archival Papers: Submit the pattern by February 7, 2013. If accepted, it will be published in the workshop proceedings and the ACM and IEEE Digital Libraries.
  • Non-Archival Papers: Submit the paper by April 24, 2013; we will send notification within two weeks. If accepted, the paper will be published on the workshop web pages only; non-archival papers will not appear in the workshop proceedings or the ACM and IEEE Digital Libraries.

Both archival and non-archival papers will be reviewed by a program committee and accepted based on the clarity of the description and how broadly the proposed pattern might be applicable. Prior application of the pattern by the authors is not a requirement. This workshop is more interested in the mechanics and choice of the data analysis than in the impact of published results.

Upon notification of acceptance, all authors of accepted archival papers will be asked to complete an IEEE Copyright form and will receive further instructions for preparing their camera ready versions. At least one author of each paper is expected to present the paper at the workshop.

All submitted papers must conform to the ICSE 2013 formatting and submission instructions and must not exceed the page limits mentioned above, including figures and references. All submissions must be in English. Papers must be submitted electronically, in PDF format, using the submission site hosted by EasyChair:
https://www.easychair.org/conferences/?conf=dapse2013

It is the desire of the organizers that discussion of research at the workshop does not preclude publication of closely related material at conferences or journals. Authors of accepted papers will be able to choose whether to include their papers in the workshop proceedings.

Format

The workshop will consist of the following sessions:

  • Lightning session. Authors of accepted papers will give a lightning talk in the morning to present their proposed pattern (about 5-10 minutes depending on the number of accepted papers).
  • Discussion session. This session has two goals: (1) group the patterns into pattern types; (2) refine the pattern groups and the interactions between patterns. For example, we expect that some patterns could be composed into more powerful patterns, while other patterns could be split into smaller patterns.
  • Breakout session. For the next session, participants will break out into groups and try to use the data analysis patterns to solve several data science tasks provided by the workshop organizers. The tasks will come from both academic research and industry. The goal of this session is to assess the usefulness as well as the completeness of the patterns identified. We expect that patterns will be refined and new patterns will be discovered. At the end of the session, each group will present its findings in a 5-minute blitz presentation.

Before the workshop there will be a blog to promote and discuss accepted patterns.

After the workshop, there will be a Dagstuhl seminar on software development analytics building on the outcomes of this workshop, to which selected authors will be invited. Furthermore, the organizers plan to edit a book on “Data Science for Software Engineers” with a collection of data analysis patterns. Selected authors from the workshop will be invited to contribute chapters to this book.

Example of a Pattern

For illustrative purposes, here is an example pattern in short, simplified form. For the workshop, we expect the discussions to be more comprehensive. We welcome both simple and complex analysis patterns.

Pattern name: Contrast

Problem:
Determine if there is a difference in one or more properties between two populations.

Solution:
1. Apply a hypothesis test (Student's t-test for parametric data, the Mann-Whitney test for non-parametric data) to check whether the property differs statistically between the populations.

2. Determine the magnitude of the difference, either through visualization (e.g., a boxplot) or, when appropriate, through the mean or median.
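
To make the two steps concrete, here is a minimal R sketch (R snippets are the kind of example code suggested for patterns above). The samples a and b are invented for illustration; in practice they would hold the measured property for each population:

    # Hypothetical samples: the measured property (e.g., defect counts)
    # for each of the two populations.
    a <- c(12, 15, 9, 22, 17, 11, 14)
    b <- c(8, 10, 7, 13, 9, 6, 12)

    # Step 1: hypothesis test. wilcox.test() is R's Mann-Whitney test
    # for non-parametric data; use t.test(a, b) for parametric data.
    wilcox.test(a, b)

    # Step 2: magnitude of the difference, via a boxplot and the medians.
    boxplot(a, b, names = c("population A", "population B"))
    median(a) - median(b)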

Discussion:
Either step without the other can be misleading. For large populations, tiny differences might be statistically significant, as the sketch below illustrates; in contrast, for small populations, large differences might not be statistically significant.

Choosing the wrong hypothesis test is a common mistake.
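
To illustrate the first pitfall, a small R simulation (with invented, normally distributed data) shows how a very large sample can make a practically negligible difference look highly significant:

    # Hypothetical simulation: one million observations per population,
    # with a true difference of only 0.01 standard deviations.
    set.seed(1)
    x <- rnorm(1e6, mean = 0)
    y <- rnorm(1e6, mean = 0.01)

    t.test(x, y)$p.value   # typically far below 0.05: "significant"
    mean(y) - mean(x)      # yet the difference is only about 0.01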

Examples:
For example, at ICSE 2009, Bird et al. used a Mann-Whitney test to compare the defect proneness (= the property) between distributed and co-located binaries (= two populations). See Figure 5 in their paper for a sample visualization of the differences between the two populations.