
How to know a PDF page has text (invisible / we can search) (from OCR or it was created by Word, Excel or any other tool

Posted: Thu Jun 07, 2018 11:38 pm
by reisrf
I will receive PDFs in which some pages have invisible, searchable text (for example, PDFs created by OCR tools, Office tools, or others), while other pages are scanned images without OCR content, so they can't be searched. For the pages without OCR content I need to apply OCR and create the hidden text at the specific locations (this I know how to do). My question is: how can I detect whether a page has invisible text?

Thanks in advance

Robson Reis

Re: How to know a PDF page has text (invisible / we can search) (from OCR or it was created by Word, Excel or any other

Posted: Sun Jun 10, 2018 9:29 pm
by Loïc

Re: How to know a PDF page has text (invisible / we can search) (from OCR or it was created by Word, Excel or any other

Posted: Mon Jun 11, 2018 5:00 pm
by reisrf
Thank you!

Re: How to know a PDF page has text (invisible / we can search) (from OCR or it was created by Word, Excel or any other

Posted: Tue Jun 19, 2018 8:52 pm
by reisrf
The PageHasText method returns True even when the page contains only special characters such as \r, \n, \l, etc. I have created my own PageHasText, using GetPageText:

// Strip everything except ASCII letters and digits; whatever remains is "real" text.
string pageText = Regex.Replace(_gdPDF.GetPageText(), "[^0-9a-zA-Z]+", string.Empty);
return pageText.Length > 0;

The snippet above returns true if the page text contains at least one digit or unaccented Latin letter (lowercase or uppercase), and false if it contains only whitespace or special characters.
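One caveat with an ASCII-only pattern is that pages whose text consists solely of accented or non-Latin letters would be reported as having no text. A minimal sketch of a Unicode-aware variant follows, assuming the page text has already been extracted into a string (for example with GetPageText, as above). The class and method names here are hypothetical and are not part of the GdPicture API.

```csharp
using System;
using System.Linq;

// Hypothetical helper, not part of any library.
public static class PageTextCheck
{
    // Returns true when the extracted text contains at least one letter or
    // digit in any script; whitespace, control characters and punctuation
    // alone do not count. Null or empty input returns false.
    public static bool HasMeaningfulText(string pageText)
    {
        return !string.IsNullOrEmpty(pageText)
            && pageText.Any(char.IsLetterOrDigit);
    }
}
```

It could then be called as `PageTextCheck.HasMeaningfulText(_gdPDF.GetPageText())`. Whether accented or non-Latin characters should count as text depends on the documents being processed, so the ASCII-only regex may still be the better fit in some cases.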

Re: How to know a PDF page has text (invisible / we can search) (from OCR or it was created by Word, Excel or any other

Posted: Mon Jan 21, 2019 4:48 pm
by Gabriela
Hi,

The PageHasText() method returns true if the page contains any text at all. Special characters are considered text, so the method is working correctly. Your workaround is nice, and it clearly works well for you. It always depends on the requirements of your application: methods intended for general use need to do the proper job for all users. You can open a ticket on our support platform if you need a "custom" method, so we can investigate it further and offer you a solution.

Re: How to know a PDF page has text (invisible / we can search) (from OCR or it was created by Word, Excel or any other

Posted: Mon Jan 21, 2019 6:13 pm
by reisrf
No worries. My custom code is in place and working as expected. The case can be closed. Many thanks.

Re: How to know a PDF page has text (invisible / we can search) (from OCR or it was created by Word, Excel or any other

Posted: Mon Jan 21, 2019 9:02 pm
by Gabriela
Hi,

Thank you for your feedback. Please do not hesitate to contact us if you need a custom solution or further technical assistance.