Creating hOCR

Posted: Wed Feb 13, 2013 12:33 pm
by Jarwo
Hello,

Is there a method I can use to create a file in the hOCR standard?

The method "PdfAddGdPictureImageToPdfOCR(...)" returns only the recognized text.
The method "GetPageTextWithCoords(..)" from the "GdPicturePDF" class returns the text with coordinates, but not as an hOCR file.

I need to extract an hOCR file from a searchable PDF. Is there a method I can use, or should I write my own on top of the methods mentioned above?

Thanks, Jarwo

Re: Creating hOCR

Posted: Wed Feb 13, 2013 1:55 pm
by Cedric
Hello,

We do not support the hOCR format at the moment, so I'm afraid you will have to handle it on your own.

Cheers,
Cedric
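
For readers who have to build the file themselves, here is a minimal sketch of how words with coordinates could be written out as hOCR. It is written in Python and does not call GdPicture at all. The function name lines_to_hocr, the input layout (a list of lines, each a list of (text, x0, y0, x1, y1) tuples) and the "hypothetical-exporter" system name are assumptions made for illustration. It also assumes pixel coordinates with the origin at the top-left corner, which is what hOCR bounding boxes use.

```python
from html import escape


def lines_to_hocr(lines, page_width, page_height):
    """Build a minimal single-page hOCR document.

    `lines` is a list of text lines; each line is a non-empty list of
    (text, x0, y0, x1, y1) word tuples in pixels, origin at top-left.
    """
    out = [
        "<!DOCTYPE html>",
        "<html>",
        "<head>",
        '<meta charset="utf-8"/>',
        # The system name is a placeholder, not a real product string.
        '<meta name="ocr-system" content="hypothetical-exporter"/>',
        '<meta name="ocr-capabilities" content="ocr_page ocr_line ocrx_word"/>',
        "</head>",
        "<body>",
        f'<div class="ocr_page" id="page_1" '
        f'title="bbox 0 0 {page_width} {page_height}">',
    ]
    for line_no, words in enumerate(lines, start=1):
        # The line box is the union of its word boxes.
        lx0 = min(w[1] for w in words)
        ly0 = min(w[2] for w in words)
        lx1 = max(w[3] for w in words)
        ly1 = max(w[4] for w in words)
        out.append(
            f'<span class="ocr_line" id="line_1_{line_no}" '
            f'title="bbox {lx0} {ly0} {lx1} {ly1}">'
        )
        for word_no, (text, x0, y0, x1, y1) in enumerate(words, start=1):
            # Escape the text so characters like & and < stay valid HTML.
            out.append(
                f'<span class="ocrx_word" id="word_1_{line_no}_{word_no}" '
                f'title="bbox {x0} {y0} {x1} {y1}">{escape(text)}</span>'
            )
        out.append("</span>")
    out += ["</div>", "</body>", "</html>"]
    return "\n".join(out)


if __name__ == "__main__":
    sample = [[("Hello", 10, 20, 60, 40), ("world", 70, 22, 120, 40)]]
    print(lines_to_hocr(sample, 200, 100))
```

Coordinates read from a PDF may be in points and measured from the bottom-left corner, so they would probably need to be converted to top-left pixel coordinates before being passed to a function like this.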

HOCR

Posted: Fri Mar 04, 2016 10:07 pm
by SarrasiM
Good day,

Is there any plan to allow HOCR file creation with Tesseract? I've tried to use gdp.OCRTesseractSetVariable("tessedit_create_hocr", "1"); but it doesn't seem to save the file anywhere.

Maybe this functionality would not be too hard to implement, considering Tesseract already does it.

Thank you and have a nice weekend!

Re: HOCR

Posted: Mon Mar 07, 2016 3:51 pm
by David
Hi,

The feature is on the wish list, but we have not set a high priority on it, so I cannot give a release date at the moment.

Would it be possible for you to describe what you wish to do with the HOCR output?

Thank you

David

Re: HOCR

Posted: Mon Mar 07, 2016 4:02 pm
by SarrasiM
Hello David,

We'd like to explore text based document classification using HOCR.

Thanks!

Re: HOCR

Posted: Tue Mar 08, 2016 9:57 am
by David
Hi,

Please note you can access the full text by means of GetPageText:
https://www.gdpicture.com/guides/gdpicture/web ... eText.html

GetPageText will not retrieve all of the details the HOCR may contain, but maybe the raw text will be sufficient for your needs. Please note that GetPageText works with searchable PDFs and also with text PDFs.

Regards,

David

Re: HOCR

Posted: Thu Mar 10, 2016 8:04 pm
by SarrasiM
Hello David,

Thanks for the information. Unfortunately, we also need positional information for the text blocks :)

Re: HOCR

Posted: Wed Apr 06, 2016 5:24 pm
by SarrasiM
It would also be a good addition for scenarios like this one:

- Extract the text layer in HOCR format.
- Make manipulations on the results
- Create a searchable PDF from the HOCR file. You wouldn't need to run OCR again at this point, so it's a performance gain.

Also, none of the providers that I know of really support this, or support it well. That's a sweet spot for GdPicture to exploit :)

I'm confident you could get that done with a minimum of effort as Tesseract already supports it and you offer a wrapper around their libraries.

Thank you!
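
As an illustration of the "manipulate the results" step in the scenario above, here is a minimal sketch of reading words and their bounding boxes back out of an hOCR file, using only the Python standard library. The names parse_bbox, HocrWordParser and extract_words are made up for this example. It assumes each word is a span with class ocrx_word whose title attribute carries a "bbox x0 y0 x1 y1" property, and that word spans do not contain nested span elements.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    # The title holds semicolon-separated properties, e.g.
    # "bbox 10 20 60 40; x_wconf 95".
    for prop in title.split(";"):
        fields = prop.split()
        if len(fields) == 5 and fields[0] == "bbox":
            return tuple(int(v) for v in fields[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs for every ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            self._in_word = True
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._in_word:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes no nested <span> inside a word span.
        if tag == "span" and self._in_word:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._in_word = False


def extract_words(hocr_text):
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words
```

The resulting list of (text, bbox) pairs could then be filtered or corrected before being used to rebuild a text layer, without running OCR a second time.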

Re: HOCR

Posted: Tue Jan 29, 2019 2:13 pm
by Gabriela
Hello,

GdPicture now offers a completely new class for OCR: https://www.gdpicture.com/guides/gdpicture/we ... reOCR.html
Unfortunately, we do not provide the hOCR option and we do not have any plans to support it in the short or medium term.