GdPicture.NET is a Nutrient product.

Document Layout Analysis, the Key for Document Understanding

    In today’s article, we will show why a strong document layout analysis system is crucial for document understanding and Intelligent Document Processing solutions.

    In our previous blog article, we described our new Key-Value Pair extractor.

    To give you better insight into how it works, we will talk a bit more about the technology that key-value pair extraction depends on: document layout analysis. Together with a strong OCR engine, document layout analysis forms the first level of any document understanding process.

    Definitions

    Document Layout Analysis

    Document layout analysis (DLA) is the identification (or detection) and categorization (or decoding) of the regions of interest in a document.

    DLA involves both a geometric analysis (tables, pictures, equations, and barcodes) and a logical layout analysis (paragraphs, lines, words, and characters) of the document.
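
    The two kinds of regions listed above can be pictured as a small tree of typed bounding boxes. The sketch below is only an illustration: the names `RegionType`, `Region`, and `count_by_type` are hypothetical and are not part of the GdPicture.NET API.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class RegionType(Enum):
    # Geometric categories named in the text
    TABLE = "table"
    PICTURE = "picture"
    EQUATION = "equation"
    BARCODE = "barcode"
    # Logical text hierarchy named in the text
    PARAGRAPH = "paragraph"
    LINE = "line"
    WORD = "word"
    CHARACTER = "character"


@dataclass
class Region:
    """A typed bounding box (pixel coordinates) with nested child regions."""
    kind: RegionType
    left: int
    top: int
    right: int
    bottom: int
    children: List["Region"] = field(default_factory=list)

    def contains(self, other: "Region") -> bool:
        """True if `other` lies entirely inside this region's box."""
        return (self.left <= other.left and self.top <= other.top
                and self.right >= other.right and self.bottom >= other.bottom)


def count_by_type(region: Region, kind: RegionType) -> int:
    """Count regions of a given type in a region tree, root included."""
    own = 1 if region.kind is kind else 0
    return own + sum(count_by_type(child, kind) for child in region.children)


# A paragraph holding one line made of two words (hypothetical coordinates)
word_a = Region(RegionType.WORD, 10, 10, 60, 30)
word_b = Region(RegionType.WORD, 70, 10, 140, 30)
line = Region(RegionType.LINE, 10, 10, 140, 30, [word_a, word_b])
paragraph = Region(RegionType.PARAGRAPH, 10, 10, 140, 30, [line])
```

    Detection fills in the boxes, and categorization assigns each box its `kind`.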

    DLA and OCR

    An OCR solution is a complex system that combines several engines which intervene at different stages of the process.

    A standard OCR process includes:

    1. Preprocessing (the cleanup phase);
    2. Thresholding (segmentation);
    3. Layout analysis;
    4. Recognition;
    5. Post-processing.

    All OCR processes include these steps, with more or less success. This is why you obtain different results on the same document when testing solutions from different vendors.
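
    To make steps 2 and 3 concrete, here is a toy sketch in plain Python. It assumes a page given as rows of grayscale values and uses a horizontal projection profile as a stand-in for layout analysis. Real engines are far more elaborate; the function names are hypothetical, and preprocessing, recognition, and post-processing are left out.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale rows, 0 = black, 255 = white


def threshold(image: Image, cutoff: int = 128) -> List[List[int]]:
    """Step 2: binarize the page, 1 = ink, 0 = background."""
    return [[1 if px < cutoff else 0 for px in row] for row in image]


def find_text_lines(binary: List[List[int]]) -> List[Tuple[int, int]]:
    """Step 3 (toy layout analysis): horizontal projection profile.

    Returns (first_row, last_row) for each run of rows that contain ink.
    """
    lines = []
    start = None
    for y, row in enumerate(binary):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                      # a text line begins
        elif not has_ink and start is not None:
            lines.append((start, y - 1))   # the text line ended on the previous row
            start = None
    if start is not None:                  # line running to the bottom edge
        lines.append((start, len(binary) - 1))
    return lines


# A tiny hypothetical page: two bands of dark pixels separated by white rows
page = [
    [255, 255, 255, 255],
    [255,  20,  30, 255],
    [255,  10, 255, 255],
    [255, 255, 255, 255],
    [ 40,  50,  60, 255],
    [255, 255, 255, 255],
]
binary = threshold(page)
text_lines = find_text_lines(binary)
```

    Each detected row range would then be handed to the recognition step (step 4).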

    Document Understanding

    When you combine layout analysis with a recognition engine, you obtain the first level of document understanding.

    *Document understanding is at the core of any Intelligent Document Processing solution.*

    If you add Natural Language Processing (NLP) capabilities, as we do in GdPicture.NET, you obtain a stronger level of document understanding and more accurate results.
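
    As a rough illustration of what combining layout and recognition makes possible, the sketch below pairs keys with values using nothing but recognized words and their bounding boxes. It is a deliberately naive heuristic (same-line grouping plus a trailing colon), not the GdPicture.NET key-value pair extractor. All names and coordinates are hypothetical.

```python
from typing import Dict, List, NamedTuple


class Word(NamedTuple):
    """A recognized word with its bounding box in pixels."""
    text: str
    left: int
    top: int
    right: int
    bottom: int


def group_into_lines(words: List[Word], tolerance: int = 5) -> List[List[Word]]:
    """Group words whose top edge is within `tolerance` pixels of the
    first word of the current line, then order each line left to right."""
    lines: List[List[Word]] = []
    for word in sorted(words, key=lambda w: (w.top, w.left)):
        if lines and abs(lines[-1][0].top - word.top) <= tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [sorted(line, key=lambda w: w.left) for line in lines]


def extract_pairs(lines: List[List[Word]]) -> Dict[str, str]:
    """Text up to the first word ending in ':' is the key;
    the rest of the line is the value."""
    pairs: Dict[str, str] = {}
    for line in lines:
        for i, word in enumerate(line):
            if word.text.endswith(":") and i + 1 < len(line):
                key = " ".join(w.text for w in line[: i + 1]).rstrip(":")
                value = " ".join(w.text for w in line[i + 1:])
                pairs[key] = value
                break
    return pairs


# Words as an OCR engine might return them, in no particular order
words = [
    Word("Total:", 10, 52, 60, 70),
    Word("Invoice", 10, 10, 70, 28),
    Word("42.00", 200, 51, 250, 70),
    Word("No:", 75, 11, 100, 28),
    Word("A-1001", 200, 10, 260, 28),
]
pairs = extract_pairs(group_into_lines(words))
```

    Without the layout step (grouping words into lines by position), the recognized text alone would not say which value belongs to which key.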

    What are the benefits of a good layout analysis system?

    A performant layout analysis system will, of course, help with OCR, but the benefits go far beyond that:

    • It improves the OCR recognition step, especially with LSTM-based recognizers.
    • It makes it possible to move beyond simple OCR processes and toward document understanding.
    • It provides better accessibility support by making the conversion to PDF/UA easier.
    • It improves the conversion of a fixed layout into an editable one (e.g., PDF to Office).

    The layout analysis system of Tesseract

    Hewlett-Packard first developed Tesseract in the 1980s. Initially proprietary software, it became open source in 2005. Since 2006, Google has been sponsoring its development.

    Tesseract was originally developed to OCR scanned books. This is why it is excellent at detecting two-column pages of text.
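
    To give an intuition for what detecting a two-column page involves, the sketch below looks for the widest ink-free vertical band inside a binarized page. This is a simple projection-profile heuristic chosen for illustration; it is not Tesseract's actual algorithm, and `find_column_gutter` is a hypothetical name.

```python
from typing import List, Optional, Tuple


def find_column_gutter(binary: List[List[int]],
                       min_width: int = 2) -> Optional[Tuple[int, int]]:
    """Return (first_col, last_col) of the widest interior run of ink-free
    columns, or None if no such run is at least `min_width` wide.

    `binary` holds rows of 0/1 pixels, where 1 means ink.
    """
    width = len(binary[0])
    ink_per_column = [sum(row[x] for row in binary) for x in range(width)]
    best = None
    start = None
    # The trailing 1 is a sentinel that closes a run reaching the right edge.
    for x, ink in enumerate(ink_per_column + [1]):
        if ink == 0 and start is None:
            start = x
        elif ink != 0 and start is not None:
            run_width = x - start
            touches_edge = start == 0 or x - 1 == width - 1  # page margin, not a gutter
            if not touches_edge and run_width >= min_width:
                if best is None or run_width > best[1] - best[0] + 1:
                    best = (start, x - 1)
            start = None
    return best


# Hypothetical page: ink in columns 1-3 and 7-8, empty band in columns 4-6
two_columns = [
    [0, 1, 1, 1, 0, 0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 0, 0, 1, 0, 0],
]
gutter = find_column_gutter(two_columns)
```

    A heuristic like this works on clean book pages but breaks down as soon as a heading, table, or picture spans both columns, which is typical of business documents.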

    However, this engine has three main limitations.

    1. It aggressively aggregates words into lines.
    2. Results are poor with business documents.
    3. Tesseract is almost impossible to maintain.
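
    The first point is easy to picture with a small sketch. On a form, two unrelated fields often sit on the same row; merging everything on that row into one line mixes them together, while splitting on wide horizontal gaps keeps them apart. The code below only illustrates that general effect under assumed coordinates. It does not reproduce Tesseract's internals, and `split_line_on_gaps` is a hypothetical helper.

```python
from typing import List, Tuple

Box = Tuple[str, int, int]  # (text, left, right) in pixels


def split_line_on_gaps(words: List[Box], max_gap: int) -> List[List[str]]:
    """Split one row of words into segments wherever the horizontal gap
    between neighboring words exceeds `max_gap` pixels."""
    segments: List[List[str]] = []
    previous_right = None
    for text, left, right in sorted(words, key=lambda w: w[1]):
        if previous_right is None or left - previous_right > max_gap:
            segments.append([text])       # wide gap: start a new segment
        else:
            segments[-1].append(text)     # narrow gap: same segment
        previous_right = right
    return segments


# Two form fields printed on the same row (hypothetical coordinates)
row = [("Name", 10, 50), ("John", 60, 100), ("Date", 300, 340), ("2020", 350, 390)]

merged = split_line_on_gaps(row, max_gap=1000)   # aggressive aggregation
separated = split_line_on_gaps(row, max_gap=40)  # gap-aware grouping
```

    With the very permissive gap, all four words end up in one line; with the tighter gap, the two fields stay separate.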

    Even a good pre-processing phase does little to improve OCR results with an engine that relies on Tesseract only.

    For GdPicture.NET, we chose a hybrid approach that includes heuristics, mathematics, and ML capabilities.

    The GdPicture.NET layout analysis system is much stronger, and together with the other processes (pre/post-processing and segmentation engines), it gets better results than Tesseract, especially in difficult cases such as:

    • Skewed documents
    • Text on colored background
    • Documents with a lot of noise
    • Underlined text
    • Text in graphics and tables
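
    Skew, the first case in the list, is a good example of why layout information matters. One simple way to estimate it, shown here purely as an illustration and not as the method GdPicture.NET uses, is to fit a straight line through baseline points of the words in a text line and read off its angle. The function name and sample points are hypothetical.

```python
import math
from typing import List, Tuple


def estimate_skew_degrees(points: List[Tuple[float, float]]) -> float:
    """Least-squares slope of y over x, returned as an angle in degrees.

    In image coordinates, where y grows downward, a positive angle means
    the text line slopes down toward the right.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
    variance = sum((x - mean_x) ** 2 for x, _ in points)
    return math.degrees(math.atan2(covariance, variance))


# Baseline points of four words: y rises by 5 pixels every 100 pixels of x
baseline = [(0, 100.0), (100, 105.0), (200, 110.0), (300, 115.0)]
angle = estimate_skew_degrees(baseline)
```

    Rotating the page by the opposite angle before recognition would straighten the line.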

    In a future blog article, we will show examples highlighting the differences between Tesseract and GdPicture.NET. Stay tuned!

    Cheers,

    Jonathan D. Rhyne

    Co-Founder and CEO

    Jonathan joined PSPDFKit in 2014. As Co-founder and CEO, Jonathan defines the company’s vision and strategic goals, bolsters the team culture, and steers product direction. When he’s not working, he enjoys being a dad, photography, and soccer.
