Retrieving national characters from PDF file

Post by **Halla602** » Fri Feb 07, 2014 11:07 am

Hi, I recently encountered a customer who had trouble retrieving text from his PDF file. He sent us a sample PDF for which calling the function "GetPageText()" returns the page text with some characters corrupted and others correct. For example:

This is how the sentence looks when exported using GdPicture: Záznam o ovìøení elektronického podání doruèeného
This is how the sentence should look with all national characters: Záznam o ověření elektronického podání doručeného

It seems the problem is with the characters ě š č ř ž ď ť ň ľ.
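
The two sample sentences above are consistent with text encoded as Windows-1250 (Central European) being decoded as Windows-1252 (Western European): ě, ř and č sit at the same byte values in code page 1250 as ì, ø and è do in code page 1252, while á, í and é share a byte value in both, which would explain why only some characters are corrupted. This is an inference from the sample, not something confirmed about the toolkit. Under that assumption, a minimal sketch of a workaround looks like this; `MojibakeDemo` and `Repair` are hypothetical names, not part of GdPicture, and the round trip may not recover every character.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Hypothetical helper: re-interprets text that was decoded with the wrong
    // code page. Assumes the original bytes were Windows-1250 and were read
    // as Windows-1252; this is a guess based on the sample sentence only.
    public static string Repair(string garbled)
    {
        // Required on .NET Core / .NET 5+ to make code pages 1250 and 1252
        // available. On .NET Framework they are built in, so this line can
        // be removed there.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding wrong = Encoding.GetEncoding(1252); // how the text was decoded
        Encoding right = Encoding.GetEncoding(1250); // how it should have been decoded

        // Recover the original bytes, then decode them with the intended code page.
        byte[] raw = wrong.GetBytes(garbled);
        return right.GetString(raw);
    }
}
```

Calling `MojibakeDemo.Repair("Záznam o ovìøení elektronického podání doruèeného")` returns "Záznam o ověření elektronického podání doručeného". It only masks the symptom after extraction; it does not change how the text is read from the PDF.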

Sample code we are using for page extraction:


        GdPicture9.LicenseManager LicenceManager = new GdPicture9.LicenseManager();
        LicenceManager.RegisterKEY("xxxxxxxxxxxx");
        // Instantiate the PDF object before use; calling a method on null throws.
        GdPicturePDF Pdf = new GdPicturePDF();
        status = Pdf.LoadFromStream(Input_stream);
        int Page_Count = Pdf.GetPageCount();
        StringBuilder output = new StringBuilder();
        try 
        {
          // Pages are 1-based; select each page before reading its text.
          for (int i = 1; i <= Page_Count; i++)
          {
            Pdf.SelectPage(i);
            temp_page_text = Pdf.GetPageText();
            output.Append(temp_page_text);
          }
        }
...
        return output.ToString();

The attached .zip contains the input PDF and the output of the GetPageText() function.
I am using GdPicture 9.4.0.9.

Post by **Gabriela** » Wed Mar 20, 2019 11:08 am

Hello,

With the current version of the toolkit (https://www.gdpicture.com/download-gdpicture/), you should be able to extract the text correctly.
Please kindly let us know if it works for you.
