OCR Zone of a PDF page

Post by jloizagah » Tue Mar 17, 2009 7:52 pm

For GdPicture.NET 8, see: viewtopic.php?t=3217

Hi.

I want to know how to perform zonal OCR on an image-only PDF page (that is, a JPEG page inside a PDF). Zonal OCR works fine on the JPEG itself, but not once I convert the JPEG to PDF. I'm using the .NET SDK.

Thanks.

Post by Loïc » Wed Mar 18, 2009 12:18 pm

Hi,

You have to use both the GdPictureImaging and GdViewer classes.

Here is a sample that opens a PDF, converts page 1 to an image, and runs OCR on that image with a GdPictureImaging object:

Code:

Dim oGdViewer As New GdViewer
Dim oGdPictureImaging As New GdPictureImaging

oGdViewer.SetLicenseNumber("XXX")
oGdPictureImaging.SetLicenseNumber("XXX")

oGdViewer.SilentMode = True
oGdViewer.DisplayFromFile("c:\test.pdf")
Dim ImageID As Integer = oGdViewer.PdfRenderPageToGdPictureImage(200, 1) ' render page 1 at 200 DPI

oGdPictureImaging.SetROI(0, 0, 500, 500) ' here we make a region of interest from (0, 0) to (500, 500)
MsgBox(oGdPictureImaging.OCRTesseractDoOCR(ImageID, TesseractDictionary.TesseractDictionaryEnglish, "C:\Program Files\GdPicture.NET\Redist\OCR\", ""))

' Release resources
oGdViewer.ReleaseGdPictureImage(ImageID)
oGdViewer.CloseDocument()
oGdPictureImaging.OCRTesseractClear()


Best regards,

Loïc

Post by jloizagah » Wed Mar 18, 2009 3:24 pm

Thanks, but I think I need a little more help.

I have a FileBrowser and a GdViewer. I list JPG and PDF images in the FileBrowser; when you select one of them, it is displayed in the GdViewer. I want to select a zone in the GdViewer and run OCR on that zone, showing the result in a MessageBox, for example. This works fine with color JPEG images, but if I convert that JPG to a PDF, display it in the GdViewer, select a zone, and run OCR using the code above, nothing is returned. If I select a wider zone, some text is read, but it seems that the zone selected in the GdViewer is not the same zone used by the code above. My code looks like this:

Code:

        Dim sOCR As String
        Dim oGDTemp As New GdPictureImaging
        Dim oGDViewer As New GdViewer
        Dim ImageID As Integer

        oGDViewer.SetLicenseNumber(My.Settings.GDPictureLicense)
        oGDTemp.SetLicenseNumber(My.Settings.GDPictureLicense)
        oGDTemp.SetLicenseNumberOCRTesseract(My.Settings.GDPictureTesseratPluging)

        oGDViewer.SilentMode = True
        Dim LeftArea As Integer, TopArea As Integer, WidthArea As Integer, HeightArea As Integer
        oGDViewer.DisplayFromFile(_Fichero)
        ImageID = oGDViewer.PdfRenderPageToGdPictureImage(200, GdViewer1.CurrentPage)
        
        'Gdviewer1 is the viewer where I display the PDF and select the Zone

        oGDTemp = oGdPictureImaging

        'oGdPictureImaging is a GdPictureImaging object I'm using with GdViewer1

        If GdViewer1.IsRect Then
            Call GdViewer1.GetRectCoordinatesOnDocument(LeftArea, TopArea, WidthArea, HeightArea)
            Call oGDTemp.SetROI(LeftArea, TopArea, WidthArea, HeightArea)
        Else
            oGDTemp.ResetROI()
            LeftArea = 0
            TopArea = 0
            WidthArea = 0
            HeightArea = 0
        End If

        Dim Textbox1 As String = "C:\Archivos de Programa\GdPicture.NET\Redist\OCR\"


        sOCR = oGDTemp.OCRTesseractDoOCR(ImageID, Dictionary, Textbox1.ToString, Patron)
        If oGDTemp.GetStat = GdPictureStatus.OCRDictionaryNotFound Then
            MsgBox("Needed dictionary is not into the specified path! ")
            Return ""
        Else
       
        End If

Thanks a lot.

Best regards

Post by Loïc » Thu Mar 19, 2009 10:59 am

Hi,

It's just a coordinate mix-up.

This code should work better:

Code:

        Dim sOCR As String
        Dim oGDTemp As New GdPictureImaging
        Dim oGDViewer As New GdViewer
        Dim ImageID As Integer

        oGDViewer.SetLicenseNumber(My.Settings.GDPictureLicense)
        oGDTemp.SetLicenseNumber(My.Settings.GDPictureLicense)
        oGDTemp.SetLicenseNumberOCRTesseract(My.Settings.GDPictureTesseratPluging)

        oGDViewer.SilentMode = True
        Dim LeftArea, TopArea, WidthArea, HeightArea As Single
        oGDViewer.DisplayFromFile(_Fichero)
        ImageID = oGDViewer.PdfRenderPageToGdPictureImage(200, GdViewer1.CurrentPage)

        'Gdviewer1 is the viewer where I display the PDF and select the Zone

        oGDTemp = oGdPictureImaging

        'oGdPictureImaging is a GdPictureImaging object I'm using with GdViewer1

        If GdViewer1.IsRect Then
            ' Get the selection in inches, then convert it to pixels at the 200 DPI used for rendering
            Call GdViewer1.GetRectCoordinatesOnDocumentInches(LeftArea, TopArea, WidthArea, HeightArea)
            Call oGDTemp.SetROI(CInt(LeftArea * 200), CInt(TopArea * 200), CInt(WidthArea * 200), CInt(HeightArea * 200))
        Else
            oGDTemp.ResetROI()
            LeftArea = 0
            TopArea = 0
            WidthArea = 0
            HeightArea = 0
        End If

        Dim Textbox1 As String = "C:\Archivos de Programa\GdPicture.NET\Redist\OCR\"


        sOCR = oGDTemp.OCRTesseractDoOCR(ImageID, Dictionary, Textbox1.ToString, Patron)
        If oGDTemp.GetStat = GdPictureStatus.OCRDictionaryNotFound Then
            MsgBox("Needed dictionary is not into the specified path! ")
            Return ""
        Else

        End If

Best regards,

Loïc

Post by jloizagah » Mon Mar 23, 2009 1:55 pm

OK, it works fine now.

I don't really understand why I have to use

GdViewer1.GetRectCoordinatesOnDocument(LeftArea, TopArea, WidthArea, HeightArea)

when the document is an image format (JPEG, TIF)

but I have to use

GdViewer1.GetRectCoordinatesOnDocumentInches(LeftArea, TopArea, WidthArea, HeightArea)

when it is an image (JPEG, TIF) inside a PDF. Still, it works really well.

How can I tell, for a PDF displayed in a GdViewer, whether it is a text PDF (that is, one I can extract text from) or just an image that needs OCR to extract the text?

Thanks a lot... I must admit these are the most complete imaging developer tools I've ever seen, and I've seen lots of them (Imaging, Pegasus, LeadTools, etc.).

Best regards

Post by Loïc » Mon Mar 23, 2009 4:33 pm

Hi,

Here are some clarifications.

For all types of documents (image, metafile, PDF) you can use:

Code:

 GdViewer1.GetRectCoordinatesOnDocumentInches(LeftArea, TopArea, WidthArea, HeightArea)

A PDF is not a bitmap, so the coordinate space must be independent of the notion of a pixel. For images, pixel coordinates can be used.

Getting coordinates in inches is a good compromise: it gives the same coordinate space for all kinds of documents.

jloizagah wrote:
How can I tell, for a PDF displayed in a GdViewer, whether it is a text PDF (that is, one I can extract text from) or just an image that needs OCR to extract the text?

One idea: extract all the text of the page. If the result is an empty string, the PDF probably contains only images.
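
The check suggested above could be sketched roughly as follows. This is only an illustration in plain VB.NET: PageNeedsOcr is a hypothetical helper, and the extracted text is passed in as an ordinary string because the GdPicture call that extracts the page text is not shown in this thread. Treating whitespace-only text as empty is an extra assumption on top of the "empty string" rule.

```vbnet
Imports System

Module PdfTextCheck

    ' Hypothetical helper (not a GdPicture API): returns True when the text
    ' extracted from a page is missing, empty, or whitespace only, which
    ' suggests the page probably contains only images and needs OCR.
    Function PageNeedsOcr(ByVal extractedText As String) As Boolean
        Return extractedText Is Nothing OrElse extractedText.Trim().Length = 0
    End Function

    Sub Main()
        ' Stand-in strings for whatever your text-extraction call returns.
        Console.WriteLine(PageNeedsOcr(""))           ' image-only page
        Console.WriteLine(PageNeedsOcr("some text"))  ' page with a text layer
    End Sub

End Module
```

Note the word "probably": a scanned page may still carry a few stray text objects, so an empty result is a hint, not a guarantee.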

Kind regards,

Loïc

Post by Bubba » Mon Feb 22, 2010 10:51 pm

I am confused as to why you added the * 200 to the SetROI method here:

Code:

Call oGDTemp.SetROI(CInt(LeftArea * 200), CInt(TopArea * 200), CInt(WidthArea * 200), CInt(HeightArea * 200))

I'm writing in C# and evaluating the library for an internal tool.

We need to let the user designate multiple regions of interest on an image and then click a button to "test" that each region returns the appropriate data. In some cases the image has to be rotated before the user selects an area.

I am testing with a single image that the user must rotate counter-clockwise once. I appear to be getting the appropriate Top/Left/Width/Height from the viewer. However, when the "test" is kicked off, the OCR returns an empty string and no errors.

When rotating, I already convert the PDF page to a GdPicture image:

Code:

int imgID = gdViewer1.PdfRenderPageToGdPictureImage(200, 1);
oGdPictureImaging.Rotate(imgID, GdPicture.RotateFlipType.Rotate90FlipNone);
gdViewer1.DisplayFromGdPictureImage(imgID);

So perhaps later, when I attempt the OCR, I need to use some other method for getting the imgID from the viewer?

Code:

int imgID = gdViewer1.PdfRenderPageToGdPictureImage(200, 1);
float myTopArea = float.Parse(this.txtbxTop.Text);
float myLeftArea = float.Parse(this.txtBxLeft.Text);
float myWidthArea = float.Parse(this.txtbxWidth.Text);
float myHeightArea = float.Parse(this.txtbxHeight.Text);
string ocrPath = "C:\\Program Files\\GdPicture.NET\\Redist\\OCR\\";
string myResults = "";

// SetROI expects pixels: scale all four inch values by the 200 DPI render resolution.
oGdPictureImaging.SetROI(Convert.ToInt32(myLeftArea * 200), Convert.ToInt32(myTopArea * 200), Convert.ToInt32(myWidthArea * 200), Convert.ToInt32(myHeightArea * 200));

myResults = oGdPictureImaging.OCRTesseractDoOCR(imgID, TesseractDictionary.TesseractDictionaryEnglish, ocrPath, "");
if (oGdPictureImaging.GetStat() == GdPictureStatus.OCRDictionaryNotFound)
{
    MessageBox.Show("Needed dictionary is not into the specified path! ");
}

// Rotate by 90 degrees and retry, up to three times, while nothing is found.
for (int attempt = 0; attempt < 3 && myResults == ""; attempt++)
{
    MessageBox.Show("Didn't find anything, spinning it!");
    oGdPictureImaging.Rotate(imgID, GdPicture.RotateFlipType.Rotate90FlipNone);
    myResults = oGdPictureImaging.OCRTesseractDoOCR(imgID, TesseractDictionary.TesseractDictionaryEnglish, ocrPath, "");
}

if (myResults == "")
{
    MessageBox.Show("Didn't find anything, I am done!");
}
else
{
    MessageBox.Show("FOUND: " + myResults);
}

oGdPictureImaging.OCRTesseractClear();

However, I haven't found a method for returning the imgID from the viewer, nor a method for creating an imgID from a gdPictureImage within the viewer. The end result is that I can't figure out how to actually get my region OCR'd.

Post by Loïc » Tue Feb 23, 2010 5:09 pm

Hi,

Bubba wrote:
I am confused as to why you added the * 200 to the SetROI method here:

Call oGDTemp.SetROI(CInt(LeftArea * 200), CInt(TopArea * 200), CInt(WidthArea * 200), CInt(HeightArea * 200))

Just before that call, we retrieve the coordinates on the document in inches:

Code:

Call GdViewer1.GetRectCoordinatesOnDocumentInches(LeftArea, TopArea, WidthArea, HeightArea)

The SetROI method of the GdPictureImaging class expects coordinates in pixels:

Pixel coordinates = inch coordinates * resolution

The resolution of the PDF page rendering was set to 200 DPI:

Code:

ImageID = oGDViewer.PdfRenderPageToGdPictureImage(200, GdViewer1.CurrentPage)
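
The relation above (pixel coordinates = inch coordinates * resolution) could be wrapped in a small helper so that the resolution is written in one place only. This is a minimal sketch in plain VB.NET with no GdPicture calls; InchesToPixels and RenderDpi are hypothetical names, and the sample rectangle is made up for illustration.

```vbnet
Imports System

Module RoiConversion

    ' Assumption: this is the same value that was passed as the first
    ' argument to PdfRenderPageToGdPictureImage when the page was rendered.
    Const RenderDpi As Integer = 200

    ' Hypothetical helper: converts a length in inches to pixels at the given DPI.
    Function InchesToPixels(ByVal inches As Single, ByVal dpi As Integer) As Integer
        Return CInt(inches * dpi)
    End Function

    Sub Main()
        ' Made-up selection rectangle, in inches.
        Dim leftIn As Single = 1.5F
        Dim topIn As Single = 2.0F
        Dim widthIn As Single = 3.25F
        Dim heightIn As Single = 0.5F

        ' Every one of the four values must be converted, height included.
        Dim leftPx As Integer = InchesToPixels(leftIn, RenderDpi)
        Dim topPx As Integer = InchesToPixels(topIn, RenderDpi)
        Dim widthPx As Integer = InchesToPixels(widthIn, RenderDpi)
        Dim heightPx As Integer = InchesToPixels(heightIn, RenderDpi)

        Console.WriteLine("ROI in pixels: {0}, {1}, {2}, {3}", leftPx, topPx, widthPx, heightPx)
    End Sub

End Module
```

With these sample values the program prints "ROI in pixels: 300, 400, 650, 100". If the page is rendered at another resolution, only RenderDpi has to change.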

------------------------------------------------------
Bubba wrote:
However, I haven't found a method for returning the imgID from the viewer

The only way to get a GdPicture ImageID from the viewer (in PDF viewing mode) is to use the PdfRenderPageToGdPictureImage method. The returned value is a GdPicture Image ID.

Bubba wrote:
nor a method for creating an imgID from a gdPictureImage within the viewer.

From the reference guide:

5. Displaying a GdPicture Image handled by a GdPictureImaging object to a GdViewer object.

The GdPictureImaging class can create GdPicture Image from file, memory RAW, scanner...

If you want to display in real time a GdPicture Image handled by a GdPictureImaging object you can use the DisplayFromGdPictureImage method of the GdViewer class.

VB.NET example:

This example assumes you already have a GdPictureImaging object named oGdPictureImaging and a GdViewer object drawn on your form named GdViewer1.

Dim ImageID As Integer = oGdPictureImaging.CreateImageFromFile("c:\test.jpg")
GdViewer1.DisplayFromGdPictureImage(ImageID)

If later in your code, you want to draw something on the image (like text) and display the modification without reloading the image you can do:

oGdPictureImaging.DrawText(ImageID, "hello World!", 50, 50, 10, FontStyle.FontStyleBold, Color.Red, "Arial", True)
GdViewer1.Refresh()

Hope this helps. If you need more assistance on a specific issue, please create a new thread (one thread per issue is more useful and easier to follow).

With best regards,

Loïc

Post by Bubba » Tue Feb 23, 2010 6:04 pm

Thanks. The INCHES & Pixels explanation was enlightening.
