some magic in .GetPageText?

fierikit
Posts: 20
Joined: Mon Nov 05, 2007 7:03 pm
Location: Italy

Post by fierikit » Wed Jan 06, 2021 3:59 pm

Hi,
We have to extract text from PDFs, and we use .GetPageText from GdPicturePDF.
But something funny happens:
what I see is different from what I get.
I see: COC TELEMATICO
but I get: cac TELEMATiCa
I see: Q1J7N1LVL
but I get: Q1J7N1 LVL <-- there is a space between 1 and L
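
To make the two mismatches above easier to compare, here is a minimal Python sketch that works only on the strings themselves (it does not call GdPicture). The helper names `normalize` and `similarity` are made up for this illustration. It suggests the two cases are different in nature: the second differs only by spacing, while the first contains genuinely different characters.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Drop all whitespace and fold case, so only the characters are compared."""
    return "".join(text.split()).casefold()


def similarity(seen: str, extracted: str) -> float:
    """Ratio in [0, 1] between the normalized strings (1.0 means identical)."""
    return SequenceMatcher(None, normalize(seen), normalize(extracted)).ratio()


# (what is visible on the page, what the text extraction returned)
pairs = [
    ("COC TELEMATICO", "cac TELEMATiCa"),
    ("Q1J7N1LVL", "Q1J7N1 LVL"),
]

for seen, extracted in pairs:
    same = normalize(seen) == normalize(extracted)
    ratio = similarity(seen, extracted)
    print(f"{seen!r} vs {extracted!r}: same after normalization={same}, ratio={ratio:.2f}")
```

A comparison like this only helps when the expected value is already known (for example a code that can be checked against a database); it cannot detect errors on its own.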

I have attached a PDF with this problem, but the problem happens with many PDFs of the same kind.
I should also say that 'select and paste' from Acrobat produces the same problem as using .GetPageText.
Maybe it is not a GdPicture problem, but can someone help me solve it?
Alberto
Attachments
VF1RJB00566551626.pdf

Hugo
Posts: 227
Joined: Tue Dec 18, 2018 10:09 am

Re: some magic in .GetPageText?

Post by Hugo » Tue Jan 12, 2021 4:30 pm

Hi Fierikit,

I have taken a look at your PDF and can say that the fault lies with the document, not with GdPicture. The text was previously OCR'ed inaccurately, and Adobe shows the same result.

Have you tried using our OCR engine on your documents to get more accurate text results?
This is the result you should be able to get from our OCR (attached file):
Screenshot_404.png
Let me know if you need anything else.

fierikit

Re: some magic in .GetPageText?

Post by fierikit » Tue Jan 12, 2021 5:13 pm

Thanks Hugo,
I also said it is a fault of the PDF.
The problem is that .GetPageText runs in an automatic batch job, because the number of PDFs to read can be very high.
The batch cannot tell whether the text from .GetPageText is different from what a human eye can see.

It is very rare to find a PDF like the one I sent, but when it happens it can be a problem for later activities.

Usually, if I use .GetPageText on a PDF produced by scanning (so an image PDF), the result is an empty string,
but with a PDF of that kind I get the whole text (with some errors caused by what you described).

It takes too long to use .GetPageText AND do OCR for each PDF, even though I only need the last 3 rows.
So is there a way, or a property, to tell that some parts of the PDF were generated by a previous OCR?
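
Not an answer to the property question, but one possible workaround for the batch can be sketched in Python on the extracted string alone. Everything here is an assumption for illustration: the function names are hypothetical, nothing in it is a GdPicture API, and the rule (flag tokens with odd upper/lower-case mixing such as "TELEMATiCa") is only a heuristic. The idea is to run the slow OCR only on PDFs whose last rows look suspicious, or whose text is empty.

```python
import re


def is_suspicious_token(token: str) -> bool:
    """Heuristic: letters mixing upper and lower case in an unusual way hint at OCR noise."""
    letters = [c for c in token if c.isalpha()]
    if len(letters) < 3:
        return False
    has_upper = any(c.isupper() for c in letters)
    has_lower = any(c.islower() for c in letters)
    if not (has_upper and has_lower):
        return False  # all upper or all lower: nothing odd about the casing
    # An ordinary capitalised word ("Italy") is fine: first upper, rest lower.
    if letters[0].isupper() and all(c.islower() for c in letters[1:]):
        return False
    return True


def last_rows(page_text: str, count: int = 3) -> list:
    """Return the last `count` non-empty rows of the extracted text."""
    rows = [line.strip() for line in page_text.splitlines() if line.strip()]
    return rows[-count:]


def needs_ocr_fallback(page_text: str, count: int = 3) -> bool:
    """True if the text is empty (image-only PDF) or its last rows look noisy."""
    rows = last_rows(page_text, count)
    if not rows:
        return True
    tokens = re.findall(r"\S+", " ".join(rows))
    return any(is_suspicious_token(t) for t in tokens)


# page_text would be the string returned by the text extraction step.
print(needs_ocr_fallback("header\ncac TELEMATiCa\nQ1J7N1 LVL"))
print(needs_ocr_fallback("header\nCOC TELEMATICO\nQ1J7N1LVL"))
```

The heuristic has obvious limits: it would miss an error like "cac" read for "COC" (all lower case), it cannot see the extra space in "Q1J7N1 LVL", and it would wrongly flag legitimate mixed-case names such as "GdPicture". It only reduces the number of PDFs that need the OCR pass.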

I hope I have explained myself.
Thanks for your support.
Alberto
