Parts that add up to a whole: a framework for the analysis of tables
Ana Costa e Silva
At a time when unstructured data is growing exponentially throughout the world, tools are needed to extract value from it. Tables are an important but under-exploited part of that data. The vision of this thesis is to find robust answers to the main issues involved in understanding the information held in tables in unstructured data sources.
We begin by conducting a systematic quantitative literature review to chart the table analysis field and to identify important needs that the scientific community has so far left unanswered and that we consider fundamental to the success of our vision. We believe our contributions answer these needs, as follows:
- Most authors in this field have applied heuristics to tables, and heuristics can be brittle. We draw up road maps for defining probabilistic models for table location and segmentation, which can straightforwardly accommodate the particular representation needs of the document analysis field.
- Because table analysis requires a long sequence of detailed decisions at different levels of granularity (line, character, cell, column and table levels), interaction between the decisions is fundamental. We propose flagging, which consists of measuring the probability that each partial result is erroneous. These probabilities can serve as flags to subsequent processing steps; in fact, later algorithms can track them to recover from incoherent states in subsequent results.
- Because this is a complex multi-layer problem, no single algorithm is capable of accurately treating all tables. Coordination between different table processing approaches, whether these are alternative or sequential to each other, is thus fundamental. We propose specific evaluation, a methodology that allows the space of tables to be split mathematically into sub-parts, in which alternative algorithms can be applied to maximise results and/or efficiency.
- Precision and recall, the most commonly used metrics in the table community, are not best suited for document analysis problems. We propose completeness and purity, which are purpose-built succinct performance metrics that are better suited for merging and splitting tasks.
We believe these four aspects are fundamental to the automatic understanding of tabular data. The first led us to the definition of our first two experiments, which aim at grouping lines into tables and at splitting character sequences into cells. The last three aspects - flagging, specific evaluation and our two new metrics - are applied after each experiment, in a detailed analysis of its results.
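As an illustration of how completeness and purity might be computed for a merging task such as grouping lines into tables, the following is a minimal sketch in Python. The definitions used here are assumptions, since the text above does not state them: a ground-truth table is counted as complete if some detected table contains all of its lines, and a detected table is counted as pure if all of its lines belong to a single ground-truth table. The function name and the representation of tables as sets of line numbers are hypothetical.

```python
from typing import List, Set, Tuple


def completeness_purity(
    truth: List[Set[int]], detected: List[Set[int]]
) -> Tuple[float, float]:
    """Return (completeness, purity) under the assumed definitions.

    Each table is represented as a set of line numbers (an assumption
    made for this sketch, not a representation taken from the thesis).
    """
    # A ground-truth table is complete if one detected table holds all its lines.
    complete = sum(1 for t in truth if any(t <= d for d in detected))
    # A detected table is pure if all its lines come from one ground-truth table.
    pure = sum(1 for d in detected if any(d <= t for t in truth))

    completeness = complete / len(truth) if truth else 0.0
    purity = pure / len(detected) if detected else 0.0
    return completeness, purity


# Two ground-truth tables; the detector split the first one into two parts
# and added a stray line (9) to the second.
truth_tables = [{1, 2, 3}, {7, 8}]
detected_tables = [{1, 2}, {3}, {7, 8, 9}]

completeness, purity = completeness_purity(truth_tables, detected_tables)
print(f"completeness={completeness:.2f}, purity={purity:.2f}")
```

In this toy case the split table lowers completeness while the table with a stray line lowers purity, which suggests why a pair of metrics of this kind can separate splitting errors from merging errors.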