How to know if a PDF page has text (invisible / searchable), from OCR or created by Word, Excel or any other tool
I will receive PDFs in which some pages have (invisible) text that can be searched, for example PDFs created by OCR tools, Office tools or others. Other pages will be scanned images without OCR content, so they can't be searched. For the pages without OCR content I need to apply OCR and create the hidden text in the specific locations (this I know how to do). My question is: how do I detect whether or not a page has invisible text?
Thanks in advance
Robson Reis
Re: How to know a PDF page has text (invisible / we can search) (from OCR or it was created by Word, Excel or any other
The PageHasText method returns True even if the page contains only special characters like \r, \n, \l, .... I have created my own PageHasText, using GetPageText:
string pageText = Regex.Replace(_gdPDF.GetPageText(), "[^0-9a-zA-Z]+", string.Empty);
return pageText.Length > 0;
The snippet above returns True if the page text contains at least one ASCII digit or letter (lower or uppercase) and False if there are only spaces or special characters.
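One thing to keep in mind with the check above: the pattern only keeps ASCII letters and digits, so a page containing only accented or non-Latin text (for example "ção" or Cyrillic) would be reported as having no text. Below is a minimal sketch of a Unicode-aware variant. It works on plain strings, so it makes no assumptions about the PDF library; the class name PageTextCheck, the method names and the pageTexts list are hypothetical and only for illustration. The strings would come from whatever your code already gets from GetPageText.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PageTextCheck
{
    // True when the extracted text contains at least one letter or digit.
    // char.IsLetterOrDigit is Unicode-aware, so accented and non-Latin
    // letters count as text; whitespace and control characters do not.
    public static bool HasMeaningfulText(string pageText)
    {
        return !string.IsNullOrEmpty(pageText) && pageText.Any(char.IsLetterOrDigit);
    }

    // Returns the 1-based numbers of the pages that fail the check,
    // i.e. the candidates for OCR. pageTexts[i] is assumed to hold the
    // text already extracted from page i + 1 (hypothetical input).
    public static List<int> PagesNeedingOcr(IReadOnlyList<string> pageTexts)
    {
        var pages = new List<int>();
        for (int i = 0; i < pageTexts.Count; i++)
        {
            if (!HasMeaningfulText(pageTexts[i]))
            {
                pages.Add(i + 1);
            }
        }
        return pages;
    }
}
```

If your documents are guaranteed to be ASCII only, the regex version is equivalent for practical purposes, so this is a matter of requirements rather than a correction.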
Re: How to know a PDF page has text (invisible / we can search) (from OCR or it was created by Word, Excel or any other
Hi,
The PageHasText() method returns true if any text at all is present on the page. Special characters are considered text; hence the method is working correctly. Your workaround is nice, and it works well for your case. It always depends on the requirements you have for your application. Methods intended for general use need to do the proper job for all users. You can open a ticket on our support platform if you need a "custom" method, so we can investigate it further and offer you a solution.
Re: How to know a PDF page has text (invisible / we can search) (from OCR or it was created by Word, Excel or any other
No worries. My custom code is in place and it is working as expected. The case can be closed. Many thanks.
Re: How to know a PDF page has text (invisible / we can search) (from OCR or it was created by Word, Excel or any other
Hi,
Thank you for your feedback. Please do not hesitate to contact us if you need any custom solution or further technical assistance.