Extract text from PDF

Discussions about PDF management.
Nemo
Posts: 1
Joined: Mon Feb 15, 2016 3:48 pm


Post by Nemo » Mon Feb 15, 2016 4:01 pm

Hi all!
How can I extract text from an OCRed PDF into a string using C#?

I tried this code but got empty results:


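// Try the page's existing text layer first; run OCR only if it is empty.
ocrText = oGdPicturePDF.GetPageText();
if (String.IsNullOrEmpty(ocrText))
{
    // "i" is the current page number from the enclosing loop (not shown here).
    if (oGdPicturePDF.OcrPage("eng", Environment.CurrentDirectory + @"\OCR\", "", 200f) != GdPictureStatus.OK)
    {
        Console.WriteLine("OCR problem on page " + i.ToString() + ". Error: " + oGdPicturePDF.GetStat().ToString());
    }
    // Read the text again now that OCR has been attempted.
    ocrText = oGdPicturePDF.GetPageText();
}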
using (StreamWriter sw = File.AppendText("D:\\output.txt"))
{
    sw.WriteLine(ocrText);
    sw.WriteLine("=====================================================");
}

David
Posts: 66
Joined: Mon Feb 08, 2016 3:12 pm

Re: Extract text from PDF

Post by David » Tue Feb 16, 2016 10:09 am

Hi,

GdPicture.NET is delivered with a fully functional sample illustrating this feature. It is available in the folder "GdPicture.NET 11\Samples\WinForm\C#\OCR".

In a nutshell, the sample performs the following operations:
- loads the PDF
- sets up the OCR engine
- runs the OCR, returning the result in a string:
[...]
sOCR = oGdPictureImaging.OCRTesseractDoOCR(m_ImageID, txtLang.Text, TextBox1.Text, "");
[...]
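
For reference, the per-page flow discussed in this thread (read the text layer, fall back to OCR when it is empty, then read again) can be sketched roughly as below. This is only an illustration of the control flow: IPdfPages, FakePdf and TextExtractor are hypothetical names made up for this sketch and are not part of GdPicture.NET, so the calls would need to be mapped onto the real API by hand.

```csharp
using System;
using System.Text;

// Hypothetical abstraction over a PDF library; NOT the GdPicture.NET API.
interface IPdfPages
{
    int PageCount { get; }
    string GetPageText(int page);   // text layer of a 1-based page, may be empty
    bool OcrPage(int page);         // adds a text layer via OCR; false on failure
}

// In-memory stand-in so the sketch runs without any PDF library installed.
class FakePdf : IPdfPages
{
    private readonly string[] _textLayer;
    private readonly string[] _ocrResult;

    public FakePdf(string[] textLayer, string[] ocrResult)
    {
        _textLayer = textLayer;
        _ocrResult = ocrResult;
    }

    public int PageCount => _textLayer.Length;

    public string GetPageText(int page) => _textLayer[page - 1];

    public bool OcrPage(int page)
    {
        if (_ocrResult[page - 1] == null) return false; // simulate an OCR failure
        _textLayer[page - 1] = _ocrResult[page - 1];
        return true;
    }
}

static class TextExtractor
{
    // Collects the text of every page, running OCR only on pages without text.
    public static string ExtractAll(IPdfPages pdf)
    {
        var sb = new StringBuilder();
        for (int page = 1; page <= pdf.PageCount; page++)
        {
            string text = pdf.GetPageText(page);
            if (string.IsNullOrEmpty(text))
            {
                if (pdf.OcrPage(page))
                    text = pdf.GetPageText(page);   // re-read after successful OCR
                else
                    Console.Error.WriteLine("OCR problem on page " + page);
            }
            sb.AppendLine(text);
        }
        return sb.ToString();
    }
}

static class Program
{
    static void Main()
    {
        // Page 1 already has text; page 2 is image-only and needs OCR.
        var pdf = new FakePdf(
            new string[] { "already searchable", "" },
            new string[] { null, "recognized by OCR" });
        Console.Write(TextExtractor.ExtractAll(pdf));
    }
}
```

Keeping the "is there text yet?" check separate from the OCR call makes it easier to see whether an empty result comes from a missing text layer or from a failed OCR run.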

If you are still having trouble, could you share the PDF you are trying to read?

Thank you,

David
