Retrieving national characters from PDF file

Post by Halla602 » Fri Feb 07, 2014 11:07 am

Hi, I recently had a customer who had trouble retrieving text from his PDF file. He sent us a sample PDF for which GetPageText() returns the page text with some characters correct and others corrupted. For example:

This is how the sentence looks when exported using GdPicture: Záznam o ovìøení elektronického podání doruèeného
This is how the sentence should look with all national characters: Záznam o ověření elektronického podání doručeného

The problem seems to affect the characters ě š č ř ž ď ť ň ľ.
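
One observation that may help narrow this down: the corrupted letters in the sample (ì, ø, è in place of ě, ř, č) are exactly what you get when bytes in the Central European code page Windows-1250 are decoded as Western European Windows-1252. The sketch below only illustrates that assumption; it does not describe what GdPicture does internally, and `MojibakeDemo` and `Reinterpret` are hypothetical names made up for this example.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    static MojibakeDemo()
    {
        // Needed on .NET Core / .NET 5+ so that code pages 1250 and 1252 are available.
        // On .NET Framework these code pages are built in and this line can be removed.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Encodes the text with one single-byte code page and decodes the same bytes
    // with another, which reproduces (or undoes) a code page mix-up.
    public static string Reinterpret(string text, int fromCodePage, int toCodePage)
    {
        byte[] bytes = Encoding.GetEncoding(fromCodePage).GetBytes(text);
        return Encoding.GetEncoding(toCodePage).GetString(bytes);
    }

    public static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;

        string expected = "Záznam o ověření elektronického podání doručeného";

        // Simulate the suspected fault: Windows-1250 bytes read as Windows-1252.
        string garbled = Reinterpret(expected, 1250, 1252);
        Console.WriteLine(garbled);

        // Reverse the mix-up as a possible post-processing workaround.
        string repaired = Reinterpret(garbled, 1252, 1250);
        Console.WriteLine(repaired == expected);
    }
}
```

If the assumption holds, calling `Reinterpret(pageText, 1252, 1250)` on the extracted text might serve as a stopgap, but only for documents where every character passes through this same mix-up. Note also that š and ž have the same byte value in both code pages, so under this assumption they would come through unchanged.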

Sample code we are using for the text extraction:

Code:

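        GdPicture9.LicenseManager LicenceManager = new GdPicture9.LicenseManager();
        LicenceManager.RegisterKEY("xxxxxxxxxxxx");
        // Create the PDF object; calling LoadFromStream on a null reference would throw.
        GdPicturePDF Pdf = new GdPicturePDF();
        status = Pdf.LoadFromStream(Input_stream);
        int Page_Count = Pdf.GetPageCount();
        StringBuilder output = new StringBuilder();
        try 
        {
          // Page numbers are 1-based; GetPageText() reads the currently selected page.
          for (int i = 1; i <= Page_Count; i++)
          {
            Pdf.SelectPage(i);
            temp_page_text = Pdf.GetPageText();
            output.Append(temp_page_text);
          }
        }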
...
        return output.ToString();
The attached .zip contains the input PDF and the output of the GetPageText() function.
I am using GdPicture 9.4.0.9.
Attachments
GDPicture.zip
(48.69 KiB)

Re: Retrieving national characters from PDF file

Post by Gabriela » Wed Mar 20, 2019 11:08 am

Hello,

With the current version of the toolkit (https://www.gdpicture.com/download-gdpicture/), you should be able to extract the text correctly.
Please let us know whether it works for you.
