Merge PDF and ocr text

Discussions about PDF management.
ygquantum
Posts: 3
Joined: Fri Nov 08, 2013 5:03 pm

Merge PDF and ocr text

Post by ygquantum » Fri Nov 08, 2013 5:10 pm

Hi All,

I would like to use GDPicture.NET 10 to OCR PDF documents. Then I would extract the text with coordinates (I found the GetPageTextWithCoords() method).
I would like to store this text (with coordinates) separately from the non-OCR PDF. When the user clicks to view the PDF, I would like to merge the non-OCR PDF with the text (with coordinates) on the fly, so that the user gets a PDF with OCR text.
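To make the storage idea concrete, here is a minimal sketch in Python (not GdPicture code) of keeping the OCR text layer as a compact blob next to the original PDF. The `WordBox` record and its fields are hypothetical: the actual output format of GetPageTextWithCoords() is not described in this thread and would have to be parsed into such records first.

```python
import json
import zlib
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class WordBox:
    """Hypothetical record for one OCR'd word and its bounding box."""
    page: int
    text: str
    left: float
    top: float
    width: float
    height: float


def pack_text_layer(words: List[WordBox]) -> bytes:
    """Serialize the word boxes to JSON and compress them for blob storage."""
    payload = json.dumps([asdict(w) for w in words], ensure_ascii=False)
    return zlib.compress(payload.encode("utf-8"))


def unpack_text_layer(blob: bytes) -> List[WordBox]:
    """Restore the word boxes from a blob produced by pack_text_layer."""
    rows = json.loads(zlib.decompress(blob).decode("utf-8"))
    return [WordBox(**row) for row in rows]
```

The packed bytes could go into a second blob column beside the original PDF, to be unpacked only when a user opens the document.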

Is this possible with GDPicture.NET 10?

Thanks in advance,
Gabor

Cedric
Posts: 269
Joined: Sun Sep 02, 2012 7:30 pm

Re: Merge PDF and ocr text

Post by Cedric » Thu Dec 05, 2013 12:28 pm

I really don't understand the point of producing a PDF-OCR file, extracting its text, and then putting that text into the original PDF. Doing that will transform the original PDF into a PDF-OCR, which is the whole point of the method in the first place.
It is achievable, but it is neither practical nor straightforward, and in the end it is exactly the same thing as directly producing a PDF-OCR.

ygquantum
Posts: 3
Joined: Fri Nov 08, 2013 5:03 pm

Re: Merge PDF and ocr text

Post by ygquantum » Thu Dec 05, 2013 12:36 pm

Dear Cedric,

The point is this: the customer has a database and stores PDFs in a blob field of a table. We have to OCR these PDFs, but we have to keep the original files as well. The database is now 400 GB; if we put a PDF-OCR next to each original PDF, we get a database of almost 800 GB. We thought it would be simpler to store only the text layer and, when the user wants to view the document, merge the text and the PDF on the fly.
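The storage arithmetic behind this can be sketched as below. Only the 400 GB figure and the "almost 800 GB" estimate come from this post; the `extra_ratio` parameter is a hypothetical knob for how large the additional data is relative to the originals, and the real ratio for a text-only layer would have to be measured on actual documents.

```python
def projected_size_gb(original_gb: float, extra_ratio: float) -> float:
    """Total size when extra data worth extra_ratio times the originals is stored.

    extra_ratio is an assumption supplied by the caller, not a measured value:
    1.0 models a full PDF-OCR copy that is about as large as each original.
    """
    if original_gb < 0 or extra_ratio < 0:
        raise ValueError("sizes and ratios must be non-negative")
    return original_gb + original_gb * extra_ratio


# A full PDF-OCR copy next to every original roughly doubles the database.
full_copy_total = projected_size_gb(400.0, 1.0)
```

Passing a smaller `extra_ratio` models storing only the text layer; how much smaller depends on the documents.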

Cedric
Posts: 269
Joined: Sun Sep 02, 2012 7:30 pm

Re: Merge PDF and ocr text

Post by Cedric » Thu Dec 05, 2013 12:47 pm

This is still possible, but you will have to create a new PDF file from scratch on the fly, and depending on the PDF size and content, that may not be a good idea.
The method you are looking for is the DrawText method documented here: https://www.gdpicture.com/guides/gdpicture/GdP ... tring.html

Basically what you will have to do is:
- Load the original PDF document
- Create a new blank PDF document
- Create each page in the new PDF document with the size based on the corresponding page from the original one
- Insert the text layer on each page
- Create a raster image of each original PDF page and insert it on the corresponding new PDF page (otherwise the inserted text will sit on top of the picture, which is not what you want)
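The steps above can be sketched as a per-page drawing plan. This is plain Python rather than GdPicture code: the operation names (`new_page`, `draw_text`, `draw_image`) and the `OcrWord` record are hypothetical placeholders, and the sketch assumes the OCR boxes are in pixels with a top-left origin while the target page is measured in points with a bottom-left origin. Check both assumptions against the library you actually use.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class OcrWord:
    """Hypothetical OCR word box, in pixels, origin at the top-left corner."""
    text: str
    left_px: float
    top_px: float
    width_px: float
    height_px: float


def px_to_pt(value_px: float, dpi: float) -> float:
    """Convert pixels at the given resolution to PDF points (72 per inch)."""
    return value_px * 72.0 / dpi


def plan_page(page_w_pt: float, page_h_pt: float,
              words: List[OcrWord], dpi: float) -> List[Tuple]:
    """Return the ordered operations needed to rebuild one page."""
    # New page with the same size as the corresponding original page.
    ops: List[Tuple] = [("new_page", page_w_pt, page_h_pt)]
    for w in words:
        x = px_to_pt(w.left_px, dpi)
        # Flip the vertical axis and use the bottom edge of the box
        # as an approximate text baseline.
        y = page_h_pt - px_to_pt(w.top_px + w.height_px, dpi)
        font_size = px_to_pt(w.height_px, dpi)
        ops.append(("draw_text", w.text, x, y, font_size))
    # The raster image of the original page goes last, covering the text.
    ops.append(("draw_image", 0.0, 0.0, page_w_pt, page_h_pt))
    return ops
```

Emitting the image operation last mirrors the order of the list above: the text is placed first and the page picture ends up on top of it, so the text stays searchable without visually overlapping the scan.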

ygquantum
Posts: 3
Joined: Fri Nov 08, 2013 5:03 pm

Re: Merge PDF and ocr text

Post by ygquantum » Thu Dec 05, 2013 12:55 pm

Thank you for your help! We will consider your advice.

Best regards,
Gabor
