Extract Text from PDF in unicode

Post by **rassekst** » Mon Aug 17, 2015 1:13 pm

Hi,

I want to extract text from a PDF with the GetPageText method. The text in the PDF is in Unicode. How can I extract it as Unicode text?

Thanks for the help.

Steffen

Post by **Loïc** » Mon Aug 17, 2015 1:23 pm

Hi,

Text extraction is already done using Unicode encoding.

If the extracted text contains incorrectly mapped characters, compare it with the extraction produced by Adobe. If GdPicture gives a different result, please open an incident through our helpdesk, which you can reach here: https://www.gdpicture.com/support/getting-support-from-our-team

With best regards,

Loïc

Post by **rassekst** » Mon Aug 17, 2015 5:00 pm

Hi Loic,
I use the following code:

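Code:

    // Extract the text of the current page as a .NET string.
    cOCRText = oPDF.GetPageText();
    // Write the text as UTF-8; the using block closes the stream even if Write throws.
    using (System.IO.Stream fs = new System.IO.FileStream("Test.OCR", System.IO.FileMode.Create))
    {
        byte[] data = System.Text.Encoding.UTF8.GetBytes(cOCRText);
        fs.Write(data, 0, data.Length);
    }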

The extracted text is:
@@
@@
@@
@@
@@
@@
@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@Æ™¤@Ó…ƒˆ¦™k@é‰””…™@ñò
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ã…“…†–•z@ðøòñ@a@óòô`ôøôó
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ã…“…†§z@ðøòñ@a@óòô`ôøöò
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@òóKññKòððñ
@@@@@@ÁÁÃÈÅÕÅÙ@ÇÙäÕÄåÅÙÔÖÅÇÅÕ
@@@@@@ÒÁ×ÉãÁÓÁÕÓÁÇÅ@ÇÔÂÈ
@@
@@@@@@æÖÅÙãÈâãÙK@óò
@@
@@@@@@õðööø@ÒÖÅÓÕ
@@
@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@õKôùððKðððùóùKõ@@ððððñ
@@@@@@@@Â…¢ƒˆ…‰„@ý‚…™@â£™ÿ…•™…‰•‰‡¤•‡¢‡…‚ýˆ™…•
@@
@@@@@@@@Ö‚‘…’£z
@@@@@@@@ÓÅÖÕÈÁÙÄâÂÅÙÇ@ñ
@@
@@@@@@@@Ç…‚ýˆ™…•¢ƒˆ¤“„•…™a‰•z
@@@@@@@@ÁÁÃÈÅÕÅÙ@ÇÙäÕÄåÅÙÔÖÅÇÅÕ
@@@@@@@@ÒÁ×ÉãÁÓÁÕÓÁÇÅ@ÇÔÂÈ
@@
@@@@@@ñKÇ…‚ýˆ™…•†…¢£¢…£©¤•‡z
================================================================================
When extracted with Acrobat, the text is also Unicode.
The PDF was created with PCLGhost from a PCL file.

With best regards

Steffen

Post by **Loïc** » Mon Aug 17, 2015 5:55 pm

Hello Steffen,

Please refer to my previous reply. We will be able to investigate if you open a ticket that includes the document.

Kind regards,

Loïc
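
A note on the output quoted above: the long runs of `@` and the accented capitals are what EBCDIC bytes look like when each byte is displayed as a Windows-1252 character (0x40 is the EBCDIC space, 0xC1 to 0xE9 cover the capital letters, 0xF0 to 0xF9 the digits). The thread never confirms the cause, so the C# below is only a diagnostic sketch under that assumption. `ReinterpretAsEbcdic` is a hypothetical helper, not a GdPicture API, and code page 37 (IBM037) is a guess; a German EBCDIC code page such as IBM273 may fit better where umlauts are involved.

```csharp
using System;
using System.Text;

public static class EbcdicCheck
{
    // Hypothetical helper: maps each character back to its Windows-1252 byte,
    // then decodes those bytes as EBCDIC (code page 37 is an assumption).
    public static string ReinterpretAsEbcdic(string garbled)
    {
        // Needed on .NET Core / .NET 5+ so that code pages 1252 and 37 are available.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        byte[] raw = Encoding.GetEncoding(1252).GetBytes(garbled);
        return Encoding.GetEncoding(37).GetString(raw);
    }

    public static void Main()
    {
        // First word of one line of the extracted text quoted in the thread.
        Console.WriteLine(ReinterpretAsEbcdic("ÁÁÃÈÅÕÅÙ")); // prints "AACHENER"
    }
}
```

If the reinterpreted text turns out to be readable, that would suggest the character codes stored in the PDF are not mapped to Unicode, rather than a problem in the file-writing code. That observation may be worth including in the support ticket along with the document.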
