OcrPages(String,Int32,String,String,String,Single) Method

Runs optical character recognition (OCR) on the specified page range of the loaded PDF document using the given number of threads. The remaining parameters let you tune the recognition to your needs. The recognized text is added as invisible text on each processed page, and the orientation of each page is detected automatically.

This method rasterizes each processed page before the OCR process starts. Any existing visible text on those pages therefore becomes part of the page image, and any existing invisible text is removed rather than kept.

This method runs asynchronously, so you must wait for the OCR process to finish before manipulating the document further. The OCR-related events BeforePageOcr, OcrPagesProgress and OcrPagesDone let you follow the process and react when it ends.
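Because the call returns before recognition ends, code that needs the result in a straight line has to block until a completion callback fires. The sketch below shows one generic way to do that with a TaskCompletionSource. CompletionWaiter and TryWait are hypothetical names that are not part of GdPicture, and the sketch assumes the callback is raised on a thread other than the one that is waiting; if it were raised on the waiting thread, the wait would never be released.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper, not part of GdPicture: turns
// "start now, get a callback later" into a blocking wait with a timeout.
public static class CompletionWaiter
{
    public static bool TryWait<T>(Action<Action<T>> start, TimeSpan timeout, out T result)
    {
        var completion = new TaskCompletionSource<T>();

        // Give the caller a callback to invoke when the work is done.
        // TrySetResult ignores repeated calls, so a late or duplicate
        // callback cannot throw.
        start(value => completion.TrySetResult(value));

        if (completion.Task.Wait(timeout))
        {
            result = completion.Task.Result;
            return true;
        }

        // Timed out: the work may still be running in the background.
        result = default(T);
        return false;
    }
}
```

With this method, the start delegate would presumably subscribe the callback to OcrPagesDone and then call OcrPages, and the document would be saved or disposed only after TryWait returns true. That wiring is an untested assumption based on the handler signature used in the example on this page.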

Syntax

Visual Basic
C#
Delphi
JScript
Managed Extensions for C++
C++/CLI

'Declaration
Public Overloads Function OcrPages( _
   ByVal PageRange As String, _
   ByVal ThreadCount As Integer, _
   ByVal Dictionary As String, _
   ByVal DictionaryPath As String, _
   ByVal CharWhiteList As String, _
   ByVal DPI As Single _
) As GdPictureStatus

public GdPictureStatus OcrPages(
   string PageRange,
   int ThreadCount,
   string Dictionary,
   string DictionaryPath,
   string CharWhiteList,
   float DPI
)

public function OcrPages(
    PageRange: String;
    ThreadCount: Integer;
    Dictionary: String;
    DictionaryPath: String;
    CharWhiteList: String;
    DPI: Single
): GdPictureStatus;

public function OcrPages(
   PageRange : String,
   ThreadCount : int,
   Dictionary : String,
   DictionaryPath : String,
   CharWhiteList : String,
   DPI : float
) : GdPictureStatus;

public: GdPictureStatus OcrPages(
   string* PageRange,
   int ThreadCount,
   string* Dictionary,
   string* DictionaryPath,
   string* CharWhiteList,
   float DPI
)

public:
GdPictureStatus OcrPages(
   String^ PageRange,
   int ThreadCount,
   String^ Dictionary,
   String^ DictionaryPath,
   String^ CharWhiteList,
   float DPI
)

Parameters

PageRange

The page range to be processed, for example, "1;4;5" to process pages 1, 4 and 5 or "1-5;10" to process pages from 1 to 5 and page 10. Set this parameter to "*" to process all pages of the current document.
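To make the range syntax concrete, here is a minimal sketch that expands a range string into page numbers. PageRangeSketch and Expand are hypothetical names; this is not the library's own parser, and the decision to reject pages outside 1..pageCount is an assumption made for the illustration.

```csharp
using System;
using System.Collections.Generic;

// Illustrative only: mirrors the range syntax described above,
// not the parser used by the library.
public static class PageRangeSketch
{
    public static List<int> Expand(string pageRange, int pageCount)
    {
        var pages = new List<int>();

        // "*" selects every page of the document.
        if (pageRange.Trim() == "*")
        {
            for (int p = 1; p <= pageCount; p++) pages.Add(p);
            return pages;
        }

        // Parts are separated by ';' and are either "N" or "N-M".
        foreach (string part in pageRange.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries))
        {
            string[] bounds = part.Split('-');
            if (bounds.Length > 2)
                throw new FormatException("Invalid page range part: " + part);

            int first = int.Parse(bounds[0].Trim());
            int last = bounds.Length == 2 ? int.Parse(bounds[1].Trim()) : first;

            // Assumption: out-of-range pages are treated as an error here.
            if (first < 1 || last < first || last > pageCount)
                throw new ArgumentOutOfRangeException("pageRange", "Invalid page range part: " + part);

            for (int p = first; p <= last; p++) pages.Add(p);
        }
        return pages;
    }
}
```

For a 12-page document, Expand("1-5;10", 12) yields pages 1, 2, 3, 4, 5 and 10.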

ThreadCount

The number of threads to use for the asynchronous OCR process. Set this parameter to 0 to let the engine maximize performance automatically.

Dictionary

The prefix of the dictionary file to use, for example, "spa" for Spanish, "eng" for English, "fra" for French, etc.

The name of each dictionary file has the predefined format [LANGUAGE].traineddata, where [LANGUAGE] identifies the language. In a standard installation these files are usually located in the directory @\GdPicture.Net 14\Redist\OCR. Additional language dictionary files are available for download.

You can also combine multiple dictionaries with the "+" separator, for instance English with French is "eng+fra".

DictionaryPath

The path to the directory containing the installed dictionary files that the OCR engine will use. In a standard installation this is usually @\GdPicture.Net 14\Redist\OCR, but you can specify your own path as well.
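Since Dictionary and DictionaryPath have to agree, it can help to verify up front that every requested language has a matching file. The following is a minimal sketch of such a check; DictionaryCheck and MissingLanguages are hypothetical names, not part of GdPicture, and the check relies only on the [LANGUAGE].traineddata naming and the "+" separator described above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical pre-flight check, not part of GdPicture.
public static class DictionaryCheck
{
    // Returns the language prefixes from a string such as "eng+fra" that
    // have no matching [LANGUAGE].traineddata file in dictionaryPath.
    public static List<string> MissingLanguages(string dictionary, string dictionaryPath)
    {
        var missing = new List<string>();
        foreach (string language in dictionary.Split(new[] { '+' }, StringSplitOptions.RemoveEmptyEntries))
        {
            string prefix = language.Trim();
            string file = Path.Combine(dictionaryPath, prefix + ".traineddata");
            if (!File.Exists(file)) missing.Add(prefix);
        }
        return missing;
    }
}
```

An empty result means every requested dictionary file was found, so a failure of the OCR call would have to come from something else.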

CharWhiteList

The white list of characters, that is, the set of characters to which recognition is restricted. The engine returns only the specified characters when processing. For example, to recognize only numeric characters, set this parameter to "0123456789"; to recognize only uppercase letters, set it to "ABCDEFGHIJKLMNOPQRSTUVWXYZ". Set this parameter to an empty string to recognize all characters.
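White lists like the ones above can be built from character ranges instead of being typed out by hand, which avoids missing or duplicated characters. This is a small sketch; WhiteLists and its members are hypothetical names, not part of GdPicture.

```csharp
using System;
using System.Linq;

// Hypothetical helper for composing CharWhiteList values.
public static class WhiteLists
{
    // Builds a string with every character from first to last, inclusive.
    public static string Range(char first, char last)
    {
        if (last < first)
            throw new ArgumentException("last must not precede first");

        return new string(
            Enumerable.Range(first, last - first + 1)
                      .Select(code => (char)code)
                      .ToArray());
    }

    public static readonly string Digits = Range('0', '9');
    public static readonly string UppercaseLetters = Range('A', 'Z');
}
```

Lists can be concatenated, so WhiteLists.Digits + WhiteLists.UppercaseLetters restricts recognition to digits and uppercase letters.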

DPI

The resolution, in DPI, that the OCR engine will use. The recommended default is 300.

A value between 200 and 300 should give optimal results on A4-sized documents. Generally values over 300 will cause excessive memory usage.
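The memory warning follows from simple arithmetic: the pixel count of a rasterized page grows with the square of the resolution. The sketch below estimates the size of an uncompressed page bitmap. RasterEstimate is a hypothetical name, and the bytes-per-pixel value is an assumption; the engine's actual internal representation is not documented here.

```csharp
using System;

// Back-of-the-envelope estimate, not a measurement of the OCR engine.
public static class RasterEstimate
{
    const double MillimetersPerInch = 25.4;

    // Converts a physical length to a pixel count at the given resolution.
    public static int Pixels(double millimeters, float dpi)
    {
        return (int)Math.Round(millimeters / MillimetersPerInch * dpi);
    }

    // Assumes an uncompressed bitmap with the given number of bytes per pixel.
    public static long Bytes(double widthMm, double heightMm, float dpi, int bytesPerPixel)
    {
        return (long)Pixels(widthMm, dpi) * Pixels(heightMm, dpi) * bytesPerPixel;
    }
}
```

For an A4 page (210 x 297 mm) at 300 DPI this gives 2480 x 3508 pixels, or about 26 MB at 3 bytes per pixel. Doubling the resolution to 600 DPI roughly quadruples that figure, and several pages may be held in memory at once when multiple threads are used.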

Return Value

A member of the GdPictureStatus enumeration. If the method succeeds, the return value is GdPictureStatus.OK.

We strongly recommend always checking this status first.

Remarks

This method can only be used with non-encrypted documents. Also be aware that this method runs asynchronously.

This method uses the GdPicture OCR engine.

This method requires the OCR component to run.

Example

How to convert a TIFF image file (one page or multipage) to a searchable PDF document using multithreading.

VB.NET
C#

Dim gdpicturePDF As New GdPicturePDF()
'Adding the OcrPagesDone event.
AddHandler gdpicturePDF.OcrPagesDone, AddressOf OcrPagesDone

'The handler is attached with AddHandler above, so no Handles clause is used here.
Sub OcrPagesDone(status As GdPictureStatus)
    'Saving the resulting document when the OCR process is finished.
    If gdpicturePDF.SaveToFile("output.pdf") = GdPictureStatus.OK Then
        MessageBox.Show("The resulting document is saved.", "OcrPages")
    Else
        MessageBox.Show("The resulting document can't be saved. Status: " + gdpicturePDF.GetStat().ToString(), "OcrPages")
    End If
End Sub

Dim caption As String = "OcrPages"
Using oGdPictureImaging As New GdPictureImaging()
    Dim imageId As Integer = oGdPictureImaging.CreateGdPictureImageFromFile("image.tif")
    If oGdPictureImaging.GetStat() = GdPictureStatus.OK Then
        If gdpicturePDF.NewPDF() = GdPictureStatus.OK Then
            If oGdPictureImaging.TiffIsMultiPage(imageId) = False Then
                gdpicturePDF.AddImageFromGdPictureImage(imageId, False, True)
            Else
                Dim NumberOfPages As Integer = oGdPictureImaging.TiffGetPageCount(imageId)
                For i As Integer = 1 To NumberOfPages
                    If oGdPictureImaging.TiffSelectPage(imageId, i) = GdPictureStatus.OK Then
                        gdpicturePDF.AddImageFromGdPictureImage(imageId, False, True)
                        If gdpicturePDF.GetStat() <> GdPictureStatus.OK Then
                            Exit For
                        End If
                    Else
                        Exit For
                    End If
                Next
            End If
            If gdpicturePDF.GetStat() = GdPictureStatus.OK Then
                If gdpicturePDF.OcrPages("*", 0, "eng", "C:\GdPicture.NET 14\Redist\OCR", "", 300) = GdPictureStatus.OK Then
                    MessageBox.Show("OcrPages - Done!", caption)
                Else
                    MessageBox.Show("The OCR process has failed. Status: " + gdpicturePDF.GetStat().ToString(), caption)
                End If
            Else
                MessageBox.Show("The process of adding images has failed. Status: " + gdpicturePDF.GetStat().ToString(), caption)
            End If
        Else
            MessageBox.Show("The new document can't be created. Status: " + gdpicturePDF.GetStat().ToString(), caption)
        End If
        oGdPictureImaging.ReleaseGdPictureImage(imageId)
    Else
        MessageBox.Show("The image file can't be loaded. Status: " + oGdPictureImaging.GetStat().ToString(), caption)
    End If
End Using
'Release resources only if all processes are finished.
gdpicturePDF.Dispose()

GdPicturePDF gdpicturePDF = new GdPicturePDF();
//Adding the OcrPagesDone event.
gdpicturePDF.OcrPagesDone += OcrPagesDone;

void OcrPagesDone(GdPictureStatus status)
{
    //Saving the resulting document when the OCR process is finished.
    if (gdpicturePDF.SaveToFile("output.pdf") == GdPictureStatus.OK)
        MessageBox.Show("The resulting document is saved.", "OcrPages");
    else
        MessageBox.Show("The resulting document can't be saved. Status: " + gdpicturePDF.GetStat().ToString(), "OcrPages");
}

string caption = "OcrPages";
using (GdPictureImaging oGdPictureImaging = new GdPictureImaging())
{
    int imageId = oGdPictureImaging.CreateGdPictureImageFromFile("image.tif");
    if (oGdPictureImaging.GetStat() == GdPictureStatus.OK)
    {
        if (gdpicturePDF.NewPDF() == GdPictureStatus.OK)
        {
            if (oGdPictureImaging.TiffIsMultiPage(imageId) == false)
            {
                gdpicturePDF.AddImageFromGdPictureImage(imageId, false, true);
            }
            else
            {
                int NumberOfPages = oGdPictureImaging.TiffGetPageCount(imageId);
                for (int i = 1; i <= NumberOfPages; i++)
                {
                    if (oGdPictureImaging.TiffSelectPage(imageId, i) == GdPictureStatus.OK)
                    {
                        gdpicturePDF.AddImageFromGdPictureImage(imageId, false, true);
                        if (gdpicturePDF.GetStat() != GdPictureStatus.OK)
                            break;
                    }
                    else
                        break;
                }
            }
            if (gdpicturePDF.GetStat() == GdPictureStatus.OK)
            {
                if (gdpicturePDF.OcrPages("*", 0, "eng", "C:\\GdPicture.NET 14\\Redist\\OCR", "", 300) == GdPictureStatus.OK)
                {
                    MessageBox.Show("OcrPages - Done!", caption);
                }
                else
                    MessageBox.Show("The OCR process has failed. Status: " + gdpicturePDF.GetStat().ToString(), caption);
            }
            else
                MessageBox.Show("The process of adding images has failed. Status: " + gdpicturePDF.GetStat().ToString(), caption);
        }
        else
        {
            MessageBox.Show("The new document can't be created. Status: " + gdpicturePDF.GetStat().ToString(), caption);
        }
        oGdPictureImaging.ReleaseGdPictureImage(imageId);
    }
    else
        MessageBox.Show("The image file can't be loaded. Status: " + oGdPictureImaging.GetStat().ToString(), caption);
}
//Release resources only if all processes are finished.
gdpicturePDF.Dispose();

Reference

GdPicturePDF Class
GdPicturePDF Members
Overload List
OcrPage Method
BeforePageOcr Event
OcrPagesProgress Event
OcrPagesDone Event