HOCR

Feature Requests for GdPicture.NET.
Jarwo
Posts: 1
Joined: Wed Feb 13, 2013 12:24 pm

Creating hOCR

Post by Jarwo » Wed Feb 13, 2013 12:33 pm

Hello,

Is there any method I can use to create a file in the hOCR standard?

The method "PdfAddGdPictureImageToPdfOCR(...)" returns only the recognized text.
The method "GetPageTextWithCoords(..)" from the "GdPicturePDF" class returns just the text with coordinates, but not as an hOCR file.

I need the option to extract an hOCR file from a searchable PDF. Is there any method I can use? Or should I write my own method on top of the ones mentioned above?

Thanks, Jarwo

Cedric
Posts: 269
Joined: Sun Sep 02, 2012 7:30 pm

Re: Creating hOCR

Post by Cedric » Wed Feb 13, 2013 1:55 pm

Hello,

We do not support the hOCR format at the moment, so I'm afraid you will have to handle it on your own.

Cheers,
Cedric
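
For readers who do have to handle it on their own, here is a minimal sketch of turning word-level text and coordinates into an hOCR page. It is written in Python for brevity and is not GdPicture code: the function name `words_to_hocr` and the input shape (a list of `(text, left, top, right, bottom)` tuples in pixels, origin at the top-left) are assumptions made for illustration, so the output of a method such as GetPageTextWithCoords would first have to be converted into that shape. Only the page and word levels of hOCR are emitted; lines and paragraphs are left out.

```python
from html import escape


def words_to_hocr(words, page_width, page_height):
    """Build a minimal hOCR document from word boxes.

    words: iterable of (text, left, top, right, bottom) tuples, in pixels,
    with the origin at the top-left corner (hypothetical input format).
    """
    lines = [
        "<!DOCTYPE html>",
        "<html>",
        "<head>",
        '<meta charset="utf-8"/>',
        '<meta name="ocr-system" content="custom-export"/>',
        '<meta name="ocr-capabilities" content="ocr_page ocrx_word"/>',
        "</head>",
        "<body>",
        # The page element carries the bounding box of the whole page.
        f'<div class="ocr_page" id="page_1" '
        f'title="bbox 0 0 {page_width} {page_height}">',
    ]
    for i, (text, x0, y0, x1, y1) in enumerate(words, start=1):
        # Each word becomes a span whose title holds "bbox x0 y0 x1 y1".
        lines.append(
            f'<span class="ocrx_word" id="word_1_{i}" '
            f'title="bbox {x0} {y0} {x1} {y1}">{escape(text)}</span>'
        )
    lines += ["</div>", "</body>", "</html>"]
    return "\n".join(lines)
```

Note that PDF coordinates are often measured from the bottom-left corner, whereas hOCR boxes are measured from the top-left, so a vertical flip may be needed before calling a function like this.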

SarrasiM
Posts: 22
Joined: Thu Dec 17, 2015 6:20 pm

HOCR

Post by SarrasiM » Fri Mar 04, 2016 10:07 pm

Good day,

Is there any plan to allow HOCR file creation with Tesseract? I've tried to use gdp.OCRTesseractSetVariable("tessedit_create_hocr", "1"); but it doesn't seem to save the file anywhere.

Maybe this functionality would not be too hard to implement, considering Tesseract already does it.

Thank you and have a nice weekend!
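
For comparison, the standalone Tesseract command-line tool does write an hOCR file when the same variable is set. The sketch below, in Python, only wraps that command line and does not use GdPicture; the helper names `hocr_command` and `run_hocr` and the file names are hypothetical. Recent Tesseract releases write `<output_base>.hocr`, while some older 3.x releases used a `.html` extension, so the returned path is an assumption.

```python
import shutil
import subprocess


def hocr_command(image_path, output_base):
    """Return the tesseract argv that enables hOCR output for one image."""
    return ["tesseract", image_path, output_base, "-c", "tessedit_create_hocr=1"]


def run_hocr(image_path, output_base):
    """Run tesseract and return the expected hOCR path (assumed extension)."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(hocr_command(image_path, output_base), check=True)
    return output_base + ".hocr"
```

Calling `run_hocr("scan.png", "scan")` would then be expected to leave the result in `scan.hocr`, provided Tesseract is installed.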

David
Posts: 66
Joined: Mon Feb 08, 2016 3:12 pm

Re: HOCR

Post by David » Mon Mar 07, 2016 3:51 pm

Hi,

The feature is on the wish list, but we have not given it a high priority. I therefore cannot communicate a release date at the moment.

Could you describe what you wish to do with the HOCR output?

Thank you

David

SarrasiM
Posts: 22
Joined: Thu Dec 17, 2015 6:20 pm

Re: HOCR

Post by SarrasiM » Mon Mar 07, 2016 4:02 pm

Hello David,

We'd like to explore text-based document classification using HOCR.

Thanks!

David
Posts: 66
Joined: Mon Feb 08, 2016 3:12 pm

Re: HOCR

Post by David » Tue Mar 08, 2016 9:57 am

Hi,

Please note you can access the full text by means of GetPageText:
https://www.gdpicture.com/guides/gdpicture/web ... eText.html

GetPageText will not retrieve all of the details that HOCR may contain, but maybe the raw text will be sufficient for your needs. Please note that GetPageText works with searchable PDFs and also with text PDFs.

Regards,

David

SarrasiM
Posts: 22
Joined: Thu Dec 17, 2015 6:20 pm

Re: HOCR

Post by SarrasiM » Thu Mar 10, 2016 8:04 pm

Hello David,

Thanks for the information. Unfortunately we also need the positional information of each text block :)

SarrasiM
Posts: 22
Joined: Thu Dec 17, 2015 6:20 pm

Re: HOCR

Post by SarrasiM » Wed Apr 06, 2016 5:24 pm

It would also be a good addition for scenarios like this one:

- Extract the text layer in HOCR format.
- Make manipulations on the results.
- Create a searchable PDF from the HOCR file. You wouldn't need to run OCR again at this point, so it's a performance gain.

Also, none of the providers that I know of really support this, or support it well. That's a sweet spot for GdPicture to exploit :)

I'm confident you could get that done with a minimum of effort as Tesseract already supports it and you offer a wrapper around their libraries.

Thank you!
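
To illustrate the "manipulate the results" step of the scenario in the post above, here is a minimal Python sketch that reads the words and their bounding boxes back out of an hOCR file using only the standard library. The names `parse_bbox`, `HocrWordParser` and `extract_words` are made up for this example, and it assumes that each word is a `span` with class `ocrx_word` whose `title` contains `bbox x0 y0 x1 y1`, with no further `span` elements nested inside a word.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def extract_words(hocr_text):
    """Parse an hOCR string and return a list of (text, bbox) tuples."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words
```

The resulting list can be filtered, corrected or re-ordered before the text layer is written back, which is where the performance gain of skipping a second OCR pass would come from.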

Gabriela
Posts: 436
Joined: Wed Nov 22, 2017 9:52 am

Re: HOCR

Post by Gabriela » Tue Jan 29, 2019 2:13 pm

Hello,

GdPicture now offers a completely new class for OCR: https://www.gdpicture.com/guides/gdpicture/we ... reOCR.html
Unfortunately, we do not provide the hOCR option and we do not have any plans to support it in the short or medium term.
