pdf bad ocr

mirkop
Posts: 41
Joined: Wed Jun 24, 2009 5:38 pm

pdf bad ocr

Post by mirkop » Mon Sep 07, 2009 10:32 am

Hello,

I'm creating a searchable (OCR) PDF from a TIF using Tesseract, but with some files (such as faxes) the OCR result is completely wrong.
I have attached a TIF file as a sample.
Can you help me?

Mirko
Attachments
20090901_113606_UNKNOWN.tif

Loïc
Site Admin
Posts: 5881
Joined: Tue Oct 17, 2006 10:48 pm
Location: France

Re: pdf bad ocr

Post by Loïc » Mon Sep 07, 2009 11:42 am

Hi,

This file has different horizontal (204 dpi) and vertical (98 dpi) resolutions.

What I suggest is resizing the image so that both resolutions match:

Code:

        Dim ResFactor As Single
        Dim Hres, Vres As Single

        ' Read both resolutions (this fax is 204 dpi horizontal, 98 dpi vertical)
        Hres = oGdPictureImaging.GetHorizontalResolution(m_ImageID)
        Vres = oGdPictureImaging.GetVerticalResolution(m_ImageID)

        ResFactor = Hres / Vres

        ' Stretch the height so that both axes end up at the same resolution
        If ResFactor <> 1 Then
            Call oGdPictureImaging.Resize(m_ImageID, oGdPictureImaging.GetWidth(m_ImageID), CInt(oGdPictureImaging.GetHeight(m_ImageID) * ResFactor), Drawing2D.InterpolationMode.HighQualityBicubic)
        End If
You should see a good OCR improvement with this method.
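
To see what the resize does with the numbers from this file, here is a minimal sketch in C# of the arithmetic only. It does not call GdPicture; `FaxResolution` and `SquarePixelSize` are hypothetical names, and the 1728 x 1000 pixel size used in the note below is an assumed example, not the size of the attached TIF.

```csharp
using System;

static class FaxResolution
{
    // Returns the pixel size that gives the image the same resolution on
    // both axes. Only the coarser axis is stretched; the finer one is kept.
    public static (int Width, int Height) SquarePixelSize(
        int width, int height, float hRes, float vRes)
    {
        if (hRes <= 0 || vRes <= 0)
            throw new ArgumentException("Resolutions must be positive.");

        if (hRes > vRes)   // typical fax case: stretch the height
            return (width, (int)Math.Round(height * (double)hRes / vRes));

        if (vRes > hRes)   // opposite case: stretch the width
            return ((int)Math.Round(width * (double)vRes / hRes), height);

        return (width, height);   // already square, nothing to do
    }
}
```

With 204 dpi horizontal and 98 dpi vertical, `FaxResolution.SquarePixelSize(1728, 1000, 204f, 98f)` returns (1728, 2082): the height is multiplied by 204 / 98 (about 2.08) and the width is unchanged.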

Best regards,

Loïc

mirkop
Posts: 41
Joined: Wed Jun 24, 2009 5:38 pm

Re: pdf bad ocr

Post by mirkop » Tue Sep 08, 2009 10:44 am

Hi,

It works fine for a single-page TIF, but not for a multipage TIF: it creates a one-page PDF.
This is my code:

Code:

Single ResFactor;
Single Hres;
Single Vres;
Vres = oGdPictureImaging.GetVerticalResolution(ImageID);
Hres = oGdPictureImaging.GetHorizontalResolution(ImageID);

ResFactor = Hres / Vres;

if (ResFactor != 1)
    oGdPictureImaging.Resize(ImageID,
        oGdPictureImaging.GetWidth(ImageID),
        Convert.ToInt32(oGdPictureImaging.GetHeight(ImageID) * ResFactor),
        System.Drawing.Drawing2D.InterpolationMode.HighQualityBicubic);

if (oGdPictureImaging.TiffIsMultiPage(ImageID))
{
    if (makeSearchable == false)
        oGdPictureImaging.PdfCreateFromMultipageTIFF(ImageID, filePDF, pdfA, "", "", "", "", "");
    else
        oGdPictureImaging.PdfOCRCreateFromMultipageTIFF(ImageID,
            GetTesseractDictionary(dizionario),
            _dirOCR, "", filePDF, pdfA, "", "", "", "", "");
}
else
{
    if (makeSearchable == false)
        oGdPictureImaging.SaveAsPDF(ImageID, filePDF, pdfA, "", "", "", "", "");
    else
        oGdPictureImaging.SaveAsPDFOCR(ImageID, filePDF,
            GetTesseractDictionary(dizionario),
            _dirOCR, "", pdfA, "", "", "", "", "");
}
oGdPictureImaging.ReleaseGdPictureImage(ImageID);
Mirko

Loïc
Site Admin
Posts: 5881
Joined: Tue Oct 17, 2006 10:48 pm
Location: France

Re: pdf bad ocr

Post by Loïc » Tue Sep 08, 2009 10:49 am

Hi,

You need to open the multipage TIFF for both reading and writing.
To do that, call

Code:

TiffOpenMultiPageForWrite(True)
before opening the file.

Then you will have to resize every page in a loop:

Code:

For i = 1 To oGdPictureImaging.TiffGetPageCount(imageid)
        oGdPictureImaging.TiffSelectPage(imageid, i)
        oGdPictureImaging.Resize(imageid...)
Next i
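
For C# readers, the per-page loop can be sketched as follows. This is only an illustration under stated assumptions: `ITiffImaging` is a hypothetical interface that mirrors the method names used in this thread so the sketch compiles without the GdPicture assembly, its signatures are assumed, and `Resize` here omits the interpolation-mode argument that the real call in the posts above takes. `MultipageFix` and `ResizeAllPages` are hypothetical names as well.

```csharp
using System;

// Hypothetical interface: it only mirrors the calls named in this thread.
// The signatures are assumptions, not the library's documented API.
interface ITiffImaging
{
    int TiffGetPageCount(int imageId);
    void TiffSelectPage(int imageId, int page);
    float GetHorizontalResolution(int imageId);
    float GetVerticalResolution(int imageId);
    int GetWidth(int imageId);
    int GetHeight(int imageId);
    void Resize(int imageId, int width, int height);
}

static class MultipageFix
{
    // Selects each page in turn and rescales its height by Hres / Vres when
    // the two resolutions differ. Returns the number of pages resized.
    public static int ResizeAllPages(ITiffImaging imaging, int imageId)
    {
        int resized = 0;
        int pageCount = imaging.TiffGetPageCount(imageId);

        for (int page = 1; page <= pageCount; page++)   // 1-based, as in the loop above
        {
            imaging.TiffSelectPage(imageId, page);

            float hRes = imaging.GetHorizontalResolution(imageId);
            float vRes = imaging.GetVerticalResolution(imageId);
            if (hRes <= 0 || vRes <= 0 || hRes == vRes)
                continue;   // nothing to correct on this page

            int newHeight = (int)Math.Round(imaging.GetHeight(imageId) * (double)hRes / vRes);
            imaging.Resize(imageId, imaging.GetWidth(imageId), newHeight);
            resized++;
        }
        return resized;
    }
}
```

The resolutions are read again after each `TiffSelectPage` call because pages of the same file may not all share the same resolution; pages that are already square are skipped.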
Let me know if you run into any other problems with this issue.

Kind regards,

Loïc

mirkop
Posts: 41
Joined: Wed Jun 24, 2009 5:38 pm

Re: pdf bad ocr

Post by mirkop » Tue Sep 08, 2009 11:41 am

Thank you, it works fine.

Mirko
