Applications of Inductive Programming in Data Wrangling

Sumit Gulwani

Talks at Dagstuhl seminar on Approaches and Applications of Inductive Programming | October 2015

99% of computer end users do not know how to program and struggle with repetitive tasks. Inductive synthesis can revolutionize this landscape by enabling end users to automate repetitive tasks using examples. To realize this potential, we need to apply inductive synthesis to the right set of application domains. Data wrangling turns out to be a killer application area for inductive synthesis.

Data is the new oil. The digital revolution, social media, cloud computing, and IoT have led to the production of massive amounts of digital data. This data is the new currency of the digital world, since it can help drive business decisions, advertising, recommendation systems, and more. Data wrangling refers to the tedious process of transforming data from its raw format into a more structured form that is amenable to drawing insights. It is estimated that data scientists spend 80% of their time wrangling data. Inductive synthesis can enable easier and faster data wrangling. We have developed inductive synthesis tools for assisting with various data wrangling activities, including string/number/date transformations (FlashFill), extraction of structured data from semi-structured log files or webpages (FlashExtract), and formatting or table layout transformations (FlashRelate). FlashFill has been released as an Excel 2013 feature, while FlashExtract has been released as the ConvertFrom-String PowerShell cmdlet and as the custom field extraction capability in Azure OMS.
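To make the idea of automating a string transformation from examples concrete, here is a minimal sketch in Python. It is not FlashFill's algorithm: the candidate set `CANDIDATES` and the function `synthesize` are hypothetical names invented for this illustration, and a real synthesizer searches a far richer domain-specific language rather than a fixed list of five programs.

```python
# Toy programming-by-example sketch (an illustration, not FlashFill itself).
# CANDIDATES is a hypothetical, hand-written program space; inputs are
# assumed to be non-empty strings containing at least one word.
CANDIDATES = {
    "first word": lambda s: s.split()[0],
    "last word": lambda s: s.split()[-1],
    "uppercase": lambda s: s.upper(),
    "initials": lambda s: "".join(w[0] for w in s.split()),
    "last, first": lambda s: f"{s.split()[-1]}, {s.split()[0]}",
}


def synthesize(examples):
    """Return the names of all candidate programs that reproduce
    every (input, output) example pair."""
    consistent = []
    for name, program in CANDIDATES.items():
        if all(program(inp) == out for inp, out in examples):
            consistent.append(name)
    return consistent


if __name__ == "__main__":
    examples = [("Ada Lovelace", "Lovelace, Ada")]
    names = synthesize(examples)
    print(names)
    # Apply the learned program to a new row, as a spreadsheet user would.
    print(CANDIDATES[names[0]]("Alan Turing"))
```

The user supplies one input-output pair, and the search keeps only the programs that agree with it; the surviving program can then be run on the remaining rows.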

Practical deployment of inductive synthesis tools requires addressing an important challenge associated with inductive synthesis systems, namely resolving ambiguity in the example-based specification. We address this challenge using two key ideas: (i) machine-learning-based ranking techniques that predict an intended program from within the set of programs that are consistent with the examples provided by the user, and (ii) user interaction models (including program navigation and active-learning-based conversational clarification) that communicate actionable information to the user to help resolve ambiguity in the example-based specification.
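The two ideas above can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the candidate programs, the hand-picked scores (a stand-in for a learned ranking function), and the helper names `consistent_programs`, `best_program`, and `distinguishing_input` are hypothetical and do not come from the tools described here.

```python
# Toy sketch of ambiguity resolution (illustrative assumptions throughout).
# Each candidate is (description, program, score). The scores are
# hand-picked stand-ins for what a learned ranker would produce.
CANDIDATES = [
    ("first word", lambda s: s.split()[0], 0.9),
    ("first 4 characters", lambda s: s[:4], 0.4),
    ("constant 'Alan'", lambda s: "Alan", 0.1),
]


def consistent_programs(examples):
    """All candidates that reproduce every (input, output) example."""
    return [c for c in CANDIDATES
            if all(c[1](inp) == out for inp, out in examples)]


def best_program(examples):
    """Idea (i): pick the highest-scoring consistent program, or None."""
    progs = consistent_programs(examples)
    return max(progs, key=lambda c: c[2]) if progs else None


def distinguishing_input(examples, unlabeled_inputs):
    """Idea (ii): find an input on which the consistent programs disagree,
    so the user can be asked which output is intended."""
    progs = consistent_programs(examples)
    for inp in unlabeled_inputs:
        outputs = {p[1](inp) for p in progs}
        if len(outputs) > 1:
            return inp, sorted(outputs)
    return None


if __name__ == "__main__":
    examples = [("Alan Turing", "Alan")]
    print([c[0] for c in consistent_programs(examples)])
    print(best_program(examples)[0])
    print(distinguishing_input(examples, ["Alan Kay", "Grace Hopper"]))
```

A single example such as "Alan Turing" to "Alan" is consistent with all three candidates, which is exactly the ambiguity described above. Ranking picks a default, while the distinguishing input ("Grace Hopper" here, since "Alan Kay" does not separate the candidates) gives the user a concrete question to answer.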