October 18, 2021 | Deep learning, OCR

Deep Learning for Text Detection (Part 1)

In a previous blog article, we outlined several deep learning techniques for OCR. We mentioned how these techniques are used for text detection and text recognition, which are the two primary building blocks of an OCR system.
In this article, we go deeper into how deep learning is used for text detection, the first of these two blocks in an OCR pipeline.

We plan to publish further articles exploring other deep learning techniques for text detection and text recognition.

Object detection for text detection

Object detection is a field of deep learning that is applied to several computer vision tasks, including text detection. There are two main types of object detection models: one-stage detectors and two-stage detectors.

1. One-stage object detection

Two well-known examples of one-stage detectors are SSD (Single Shot MultiBox Detector) and YOLO (You Only Look Once).

These object detection models are very similar in the way they work. First, they take an image as input; then this image is passed through a set of convolutional layers. These layers are usually part of a larger network that was pre-trained on a large dataset of images. 

Each convolutional layer produces a set of feature maps, which are then used to identify objects’ regions and classes. How this identification is done differs from one network to another. In SSD, feature maps taken from several layers at different scales are each used to predict objects’ regions and classes. In YOLO, by contrast, only the last layers of the network produce the predictions, although every layer contributes by learning relevant features from the dataset.

SSD (top) and YOLO (bottom) network architectures [2]
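
To make the multi-layer prediction idea in SSD more concrete, here is a small arithmetic sketch. It assumes the SSD300 configuration reported in the SSD paper [2] (six feature maps with 4 or 6 default boxes per cell); the function names are ours and purely illustrative. It only counts how many candidate boxes each feature map contributes and does not run a real network.

```python
# Feature-map name, spatial size, and default boxes per cell.
# Assumption: these follow the SSD300 setup described in [2].
SSD300_LAYERS = [
    ("conv4_3", 38, 4),
    ("fc7", 19, 6),
    ("conv8_2", 10, 6),
    ("conv9_2", 5, 6),
    ("conv10_2", 3, 4),
    ("conv11_2", 1, 4),
]


def boxes_per_layer(layers):
    """Return the number of candidate boxes predicted from each feature map."""
    return {name: size * size * per_cell for name, size, per_cell in layers}


def total_boxes(layers):
    """Return the total number of candidate boxes over all feature maps."""
    return sum(boxes_per_layer(layers).values())


if __name__ == "__main__":
    for name, count in boxes_per_layer(SSD300_LAYERS).items():
        print(f"{name}: {count} candidate boxes")
    print("total:", total_boxes(SSD300_LAYERS))  # total: 8732
```

The early, high-resolution feature maps contribute most of the candidates and tend to cover small objects, while the last maps contribute only a handful of large boxes.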

2. Two-stage object detection

A well-known example of a two-stage object detector is Faster R-CNN. As the name suggests, the difference from one-stage networks is that two-stage networks perform detection in two stages.

The first stage is responsible for extracting features from the input image. Then, these features are passed through a region-proposal subsystem (which can be a neural network in itself). This subsystem proposes regions where there is a possibility of finding an object; in our case, the object corresponds to a region containing text. The proposed regions are then further processed by later layers to get the best predictions of text areas in the image. If there are different types of text regions (text header, text body, …), then these later layers will also categorize the text area into the right type.
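
The control flow of a two-stage detector can be sketched in a few lines. This is a toy illustration, not Faster R-CNN: the candidate boxes, the `objectness` scores, the 0.5 threshold, and the height rule used to pick a label are all hypothetical stand-ins for what the trained network would learn.

```python
from typing import List, NamedTuple, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


class Proposal(NamedTuple):
    box: Box
    objectness: float  # hypothetical score: how likely the region holds text


class Detection(NamedTuple):
    box: Box
    label: str


def propose_regions(candidates: List[Proposal], min_objectness: float = 0.5) -> List[Proposal]:
    """Stage 1 stand-in: keep only regions likely to contain an object."""
    return [c for c in candidates if c.objectness >= min_objectness]


def classify_region(proposal: Proposal) -> Detection:
    """Stage 2 stand-in: a made-up height rule replaces the learned layers."""
    _, y1, _, y2 = proposal.box
    label = "text header" if (y2 - y1) >= 40 else "text body"
    return Detection(proposal.box, label)


def detect(candidates: List[Proposal]) -> List[Detection]:
    """Run stage 1 (proposals), then stage 2 (per-region classification)."""
    return [classify_region(p) for p in propose_regions(candidates)]


# Hypothetical candidates, as if produced from the image features.
candidates = [
    Proposal((10, 10, 300, 60), 0.9),     # tall region, confident
    Proposal((10, 80, 300, 100), 0.8),    # short region, confident
    Proposal((200, 200, 220, 210), 0.1),  # unlikely to contain text
]
detections = detect(candidates)
for d in detections:
    print(d.label, d.box)
```

The point of the sketch is the split itself: the second stage only spends computation on the regions that survived the first stage.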

Both one-stage and two-stage networks can learn to identify regions of text and even differentiate between types of regions (text header, text body, …). 

Specialized object detection for text detection

As mentioned in the previous section, one-stage and two-stage object detection models are general-purpose models that can be used for a variety of tasks, not just text detection. To create more specialized deep learning models, some researchers used object detection models as a basis for further work focusing on text detection.

One of these specialized models is called TextBoxes and it’s inspired by the SSD network architecture. It uses the same approach as SSD, where different convolutional layers yield different detections. The network architecture looks like this: 

TextBoxes network architecture [1]
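
One of the ways TextBoxes adapts SSD to text, as described in [1], is by using default boxes with elongated aspect ratios, since words and text lines are usually much wider than they are tall. The sketch below generates such box shapes using the SSD-style convention of width = scale × √ratio and height = scale / √ratio. The aspect ratios are the ones we recall from the paper; the scale value and the function name are illustrative assumptions.

```python
import math

# Aspect ratios (width / height) for default boxes, as reported in [1].
TEXTBOXES_ASPECT_RATIOS = (1, 2, 3, 5, 7, 10)


def default_box_shapes(scale, aspect_ratios=TEXTBOXES_ASPECT_RATIOS):
    """Return (width, height) pairs that all have area scale**2."""
    return [(scale * math.sqrt(r), scale / math.sqrt(r)) for r in aspect_ratios]


# Hypothetical scale, expressed as a fraction of the image size.
shapes = default_box_shapes(scale=0.2)
for ratio, (width, height) in zip(TEXTBOXES_ASPECT_RATIOS, shapes):
    print(f"ratio {ratio:>2}: width={width:.3f}, height={height:.3f}")
```

As the ratio grows, the boxes become wide and flat, which matches the typical shape of a word or a line of text better than the near-square boxes used for general objects.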


In this article, we looked at some of the deep learning models used for text detection. We saw that general-purpose object detection models such as YOLO and Faster R-CNN can be used to detect text. We also saw an example of a more specialized neural network designed specifically for text detection, TextBoxes. In upcoming articles, we will look at other types of deep learning models used for text detection.


Nour Islam Mokhtari

Machine Learning Engineer


[1] Liao et al., “TextBoxes: A Fast Text Detector with a Single Deep Neural Network.”

[2] Liu et al., “SSD: Single Shot MultiBox Detector.”