Applications of Formal Methods to Data Wrangling and Education

Keynote at CBSoft 2015

Data is the new oil. The digital revolution and the evolution of social media, cloud computing, and IoT have led to massive amounts of digital data. This data is the new currency of the digital world, since it can help drive business processes and decisions, including advertising and recommendation systems. However, this data is locked up in semi-structured formats such as spreadsheets, text/log files, JSON/XML, web pages, and PDF documents. Data wrangling refers to the tedious process of converting such raw data to a more structured form that allows exploration and analysis for drawing insights. While data scientists spend 80% of their time wrangling data, programmatic solutions to data manipulation are beyond the expertise of the 99% of end users who do not know programming. Programming by Examples (PBE) can enable easier and faster data wrangling.

We have developed PBE technologies for many wrangling tasks, including string/number/date transformations, extraction of tabular data from log files or web pages, and formatting or table layout transformations. Some of these technologies appear in mass-market industrial products. The FlashFill PBE technology for string transformations ships as a feature in Excel 2013. The FlashExtract PBE technology for extracting structured data out of log files ships as the ConvertFrom-String PowerShell cmdlet in Windows 10 and as the custom field extraction capability in Azure Operations Management Suite (OMS).
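
To make the idea of Programming by Examples concrete, the following is a minimal sketch of a string-transformation synthesizer. It is not the FlashFill algorithm: it is a brute-force search over a tiny, hypothetical language in which a program splits the input on a delimiter, picks one token, and optionally changes its case. The names DELIMITERS, INDICES, CASE_OPS, run, and synthesize are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical toy language: a program is (delimiter, token_index, case_op).
DELIMITERS = [" ", ",", "-", "@", "."]
INDICES = [0, 1, 2, -1]
CASE_OPS = {
    "identity": lambda s: s,
    "upper": str.upper,
    "lower": str.lower,
}


def run(program, text):
    """Apply a program to an input string; return None if it does not apply."""
    delimiter, index, case_name = program
    tokens = text.split(delimiter)
    try:
        token = tokens[index]
    except IndexError:
        return None
    return CASE_OPS[case_name](token)


def synthesize(examples):
    """Return the first program consistent with every (input, output) pair."""
    for program in product(DELIMITERS, INDICES, CASE_OPS):
        if all(run(program, inp) == out for inp, out in examples):
            return program
    return None


# The user supplies two examples instead of writing code.
examples = [("Ada Lovelace", "LOVELACE"), ("Alan Turing", "TURING")]
program = synthesize(examples)

# The learned program is then applied to a new, unseen input.
result = run(program, "Grace Hopper")
print(program, result)
```

With these two examples the search settles on splitting on a space, taking the second token, and upper-casing it, so the unseen input "Grace Hopper" maps to "HOPPER". Exhaustive enumeration only works because this toy language is so small; a practical PBE system needs a far richer language and a smarter way to search and rank candidate programs.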