Table Extraction Series – Part 1: Challenges
Extracting tables from PDFs and electronic documents can be easy or very complex, depending on the nature of the file. In this series, we’ll see how automatic table extraction can help companies overcome various challenges. We will also compare the different approaches available on the market (OCR, Deep Learning, and Layout Analysis) and tell you more about why a Layout Analysis approach is the best for batch-processing business documents.
In this first article of our Series, we’ll see how companies from all industries can benefit from an automatic table extraction system. We will then explain why using a traditional OCR engine will be of little help when batch-processing business documents.
Why does everyone need Automatic Table Extraction?
When looking at automating table extraction processes, users – and this is true for all industries – are always looking for:
- Faster processes,
- Versatility (the solution needs to work with documents of various formats),
- Minimal configuration or easy parameter settings.
If we’re looking at more specific use cases, here are a few examples where an Automatic Table Extraction solution will make the difference.
- Invoice automation
Most invoices and quotes contain at least one table where essential information such as product, quantity, pricing, and taxes is displayed.
Most countries still haven’t adopted electronic invoicing, in which an invoice takes the form of a PDF with a structured, machine-readable file (such as XML). As a result, companies still need to process both physical and digital invoices in their CRM and accounting systems.
- Quality control & data reconciliation
In many industries, it is crucial for companies to keep the information in their records consistent and accurate across various systems. Automatic table extraction solutions can streamline processes and help with compliance.
- Form automation
An automatic table extraction system will allow users to extract data from tables and repopulate the information in forms.
What is a table?
This question may seem silly, but while a human being can recognize a table in an electronic document in an instant, more often than not, a computer cannot.
What’s more, there is no single definition of what a table is.
A table can have different hierarchies between cells. Sometimes a cell acts as a header, and it can be located at the top, on the left, or on the right of the table. Tables can include:
- Cells & columns
- Tabular layout (no table outline)
- Stacked layout (no vertical lines)
- Mixed layout (a mix of the above)
- Justified layout
Figure: Tables, tables, tables! (slide from our IDP webinar)
Now you can see why it is complicated for a computer to detect and recognize a table in a document.
And there is more.
Structured vs. unstructured data
We’ve just seen that the computer doesn’t work like the human eye for detecting and recognizing tables (yet).
What the computer can do easily is detect and recognize tables in a document with structured data. HTML and XML files are examples of structured documents.
Structured data is also found in spreadsheets and SQL databases.
The problem is that any document that does not have a pre-defined data model, or is not organized in a pre-defined manner, contains unstructured data. Such documents are commonly estimated to represent about 90% of all documents generated.
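To illustrate why structured documents are the easy case, here is a minimal sketch in Python that reads a table out of an HTML snippet using only the standard library. The `TableParser` class and the sample markup are made up for this example, and the sketch assumes a simple, non-nested table. Because the markup already says where each row and cell starts and ends, no detection step is needed.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of every <tr> row (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being built
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up sample data for illustration.
SAMPLE_HTML = """
<table>
  <tr><th>Product</th><th>Quantity</th><th>Price</th></tr>
  <tr><td>Widget</td><td>2</td><td>9.50</td></tr>
  <tr><td>Gadget</td><td>1</td><td>24.00</td></tr>
</table>
"""

parser = TableParser()
parser.feed(SAMPLE_HTML)
table_rows = parser.rows
print(table_rows)
```

The tags carry the table structure, so the code only has to follow them. A scanned PDF offers nothing comparable: the rows and columns have to be inferred from the page itself.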
Take the example of PDF, the most common format found in businesses and organizations. Scanned and image-based PDFs are unstructured.
Digitally-born PDFs contain unstructured text and images as individual graphic objects. They are also unstructured at the content stream level.
A computer cannot extract a table from an unstructured or even semi-structured document automatically without an Intelligent Document Processing engine.
Extracting tables with OCR
But wait a minute: can’t we extract tables with a simple OCR engine? That works!
Indeed, running an OCR engine over scanned documents and images can help retrieve data from tables.
Provided that the image is of good quality and the table is clearly defined, you can easily extract the text. The output document, usually a .txt file, will display the text line by line, just as in the original document table.
However, for complex business documents that mix tables, text, images, and graphics, and without a template or specific settings, the OCR engine will extract all the text of the document without necessarily preserving the logical reading order.
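The reading-order problem can be sketched with a toy model in Python. The word list below is invented data standing in for OCR output (text plus the x and y position of each word), and `to_text_lines` is a hypothetical helper, not the API of any real OCR engine. It reproduces the line-by-line behaviour described above: words are grouped by vertical position and read left to right.

```python
from collections import defaultdict

# Hypothetical OCR result: (text, x, y) for each recognized word.
# A short paragraph sits on the left, a two-row table on the right.
words = [
    ("Total", 300, 100), ("120.00", 400, 100),
    ("Payment", 20, 100), ("due", 90, 100),
    ("within", 20, 120), ("30", 80, 120), ("days", 110, 120),
    ("Tax", 300, 120), ("24.00", 400, 120),
]


def to_text_lines(words, line_height=20):
    """Group words into lines by vertical position, then read left to right."""
    lines = defaultdict(list)
    for text, x, y in words:
        lines[y // line_height].append((x, text))
    return [" ".join(text for _, text in sorted(lines[key]))
            for key in sorted(lines)]


text_lines = to_text_lines(words)
for line in text_lines:
    print(line)
# Payment due Total 120.00
# within 30 days Tax 24.00
```

In the flat output, the paragraph and the table are merged on the same lines, and nothing distinguishes the "30" of "within 30 days" from a table value. Recovering the table requires knowing which words belong to it, and that information is lost once the text has been flattened.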
A traditional OCR engine will also perform poorly in many other contexts:
- When the quality of the document is poor, with a lot of speckles,
- When the pages are skewed, or the text is not strictly horizontal,
- When cells have a colored background or the text itself is in a light color,
- When the characters are touching the borders of the cells.
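To give a sense of scale for the skew issue, here is a small Python sketch. It computes how far a text row drifts vertically across the page when the scan is rotated by a small angle. The function name, page width, and row spacing are illustrative assumptions, and the geometry is simplified to a straight line tilted by the skew angle.

```python
import math


def vertical_drift(row_width_px, skew_degrees):
    """Vertical offset between the two ends of a row tilted by skew_degrees."""
    return row_width_px * math.tan(math.radians(skew_degrees))


# Assumed values: a table row 1000 px wide, table rows spaced 30 px apart.
drift = vertical_drift(1000, 2.0)
row_spacing_px = 30
print(f"drift across the row: {drift:.1f} px")
print("row overlaps its neighbour:", drift > row_spacing_px)
```

Under these assumptions, a skew of only two degrees moves the end of a row by more than one row height, so a method that relies on horizontal text lines would mix cells from adjacent rows.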
For all these reasons, extracting tables with an OCR engine alone is not recommended when batch-processing documents of various natures and layouts.
For this, we will need a bit of help from AI.
The following article will show how table detection, recognition, and extraction can be achieved using Deep Learning methods.