OCR PDF performance

Discussions about machine vision support in GdPicture.
mirkop
Posts: 41
Joined: Wed Jun 24, 2009 5:38 pm

OCR PDF performance

Post by mirkop » Mon Nov 07, 2011 6:43 pm

Hi,
I have installed the latest version of GdPicture (8.4).
I am using OCR to read the text from PDF files, but it is too slow.
I have a set of files (85 MB in total), and reading the text of all the PDFs takes 2 hours.
For each file I use this code:

Code:

GdPicturePDF oPDF = new GdPicturePDF();
if (oPDF.LoadFromFile(path, false) == GdPictureStatus.OK)
{
    int dimCount = oPDF.GetPageCount();

    for (int i = 1; i <= dimCount; i++)
    {
        // Select the page to process (skipped for the first page).
        if (i > 1)
        {
            Debug("SELECTING PAGE " + i);
            oPDF.SelectPage(i);
        }

        // Render the current page to an image at 200 DPI.
        Debug("RenderPageToGdPictureImage");
        m_ImageID = oPDF.RenderPageToGdPictureImage(200, true);

        Debug("OCRTesseractReinit");
        oGdPictureImaging.OCRTesseractReinit();

        // Run OCR on the rendered image using the Italian dictionary and append the text.
        Debug("OCRTesseractDoOCR");
        s += oGdPictureImaging.OCRTesseractDoOCR(m_ImageID, "ita", _dirOCR, "");
        if (oGdPictureImaging.GetStat() != GdPictureStatus.OK)
            Debug("[" + path + "] Error on page " + i + ": " + oGdPictureImaging.GetStat().ToString());

        Debug("OCRTesseractClear");
        oGdPictureImaging.OCRTesseractClear();
    }
    oPDF.CloseDocument();
}
How can I improve the performance?

Thank you

Mirko

Loïc
Site Admin
Posts: 5881
Joined: Tue Oct 17, 2006 10:48 pm
Location: France

Re: OCR PDF performance

Post by Loïc » Tue Nov 08, 2011 5:36 pm

Hi Mirkop,

You can decrease the PDF rendering resolution (DPI), or OCR your document in a multi-threaded application.

Steps are:

- Split the document (one new document per page)
- In X threads, create one OCRed PDF per page
- At the end, compose a new PDF by merging all the documents produced
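
The steps above can be sketched roughly as follows. This is only an illustration of the split / process in threads / merge pattern, not the demo application itself. `PageFanOut`, `ProcessPages`, `processPage` and `maxThreads` are hypothetical names, and the per-page work is passed in as a callback so that no GdPicture call is assumed here. The callback is assumed to create and release its own PDF and OCR objects, since sharing one instance between threads is not known to be safe.

```csharp
using System;
using System.Threading.Tasks;

public static class PageFanOut
{
    // Runs processPage for pages 1..pageCount on up to maxThreads threads
    // and returns the results in page order, ready to be merged.
    // processPage is a hypothetical callback: it receives a 1-based page
    // number and returns whatever one page produces (a file path, text, ...).
    public static T[] ProcessPages<T>(int pageCount, Func<int, T> processPage, int maxThreads)
    {
        var results = new T[pageCount];
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxThreads };

        // The upper bound of Parallel.For is exclusive, hence pageCount + 1.
        Parallel.For(1, pageCount + 1, options, page =>
        {
            // Each page writes to its own slot, so no locking is needed.
            results[page - 1] = processPage(page);
        });

        return results;
    }
}
```

If only the text is needed, the callback can return the OCR text of one page, and the results can be combined at the end with `string.Join(Environment.NewLine, results)`.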

We should deliver such a demo application with the next release.

Kind regards,

Loïc

Re: OCR PDF performance

Post by mirkop » Tue Nov 08, 2011 5:57 pm

Hi Loic,

Thank you for your reply.
I don't need to create a new file, only to extract the OCR text.

Is there a way to improve the performance?

Mirko

Re: OCR PDF performance

Post by mirkop » Thu Nov 10, 2011 4:33 pm

Any suggestions?

Mirko

Re: OCR PDF performance

Post by Loïc » Thu Nov 10, 2011 5:11 pm

Mirko, I already provided two suggestions:

1- Decrease your PDF rendering resolution.

2- Use multiple threads, splitting your input document.

I don't see any other possible way. I will try to provide a demo for (2) as soon as I can.
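
To give a rough idea of what suggestion 1 can save, here is a small sketch that counts the pixels in a rendered page. `RenderCost` and `PixelCount` are hypothetical helper names, and the page size is an assumed US Letter page (8.5 x 11 inches). OCR time will not necessarily scale exactly with pixel count, and too low a resolution may hurt recognition accuracy, so treat this as an estimate only.

```csharp
using System;

public static class RenderCost
{
    // Number of pixels in a page of the given size (in inches)
    // when rendered at the given resolution (dots per inch).
    public static long PixelCount(double widthInches, double heightInches, int dpi)
    {
        long widthPx = (long)Math.Round(widthInches * dpi);
        long heightPx = (long)Math.Round(heightInches * dpi);
        return widthPx * heightPx;
    }
}
```

For the assumed Letter page, `RenderCost.PixelCount(8.5, 11, 200)` gives 3,740,000 pixels and `RenderCost.PixelCount(8.5, 11, 150)` gives 2,103,750, so rendering at 150 DPI instead of 200 leaves about 56% of the pixels for the OCR engine to process.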
