February 22, 2025 | blog, Tutorial

Read Text from PDFs Using C# with GdPicture.NET OCR


Sometimes, text in a scanned PDF isn’t selectable or searchable. This is where Optical Character Recognition (OCR) comes in.

By using OCR, you can extract text from PDF files and save it in a file for editing or further processing. In this guide, you’ll learn how to use GdPicture.NET’s OCR engine to recognize text from PDFs using C#.

What Is OCR?

OCR (Optical Character Recognition) is a technology that converts different types of documents, such as scanned paper documents, PDFs, or images captured by a digital camera, into editable and searchable data.

GdPicture.NET provides an OCR engine that extracts text from images and saves it as plain text (TXT) or as a searchable PDF.

Prerequisites

Before you start, ensure you have the following:

  1. Visual Studio installed with the necessary workloads for .NET development.
  2. GdPicture.NET SDK installed on your system.
  3. .NET 6.0 or newer (recommended) or .NET Framework 4.6.2/.NET Core 3.1.

Installing GdPicture.NET SDK

  1. Download the latest version from the official GdPicture.NET website.
  2. Run the installation wizard and install it at a location such as C:\GdPicture.NET 14\ (the OCR resource folder path used later in this guide assumes this location).

Steps to Read Text from PDFs Using OCR in C#

1. Add GdPicture.API to Your Project

Using NuGet Package Manager

  1. Open Visual Studio and create or open a C# .NET project.
  2. Go to Project > Manage NuGet Packages.
  3. In the Package source dropdown, select nuget.org.
  4. Search for GdPicture.API and install it (for .NET 6.0+ projects).
  5. For .NET Framework 4.6.2 or .NET Core 3.1 projects, install GdPicture instead.

2. Import GdPicture.NET to Your Code

At the beginning of your C# files, add:

using GdPicture14;

3. Load the PDF File

To begin, create a GdPicturePDF object and load the PDF file that contains the text you want to recognize.

using GdPicturePDF gdpicturePDF = new GdPicturePDF();
gdpicturePDF.LoadFromFile(@"C:\temp\source.pdf");
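GdPicture.NET methods such as LoadFromFile report their outcome as a GdPictureStatus value, so a missing or unreadable file can go unnoticed unless that value is checked. A minimal sketch of the check, assuming the same file path as above and that stopping the program is an acceptable response to a failed load:

```csharp
using System;
using GdPicture14;

using GdPicturePDF gdpicturePDF = new GdPicturePDF();

// Any status other than OK means the document was not opened.
GdPictureStatus status = gdpicturePDF.LoadFromFile(@"C:\temp\source.pdf");
if (status != GdPictureStatus.OK)
{
    Console.Error.WriteLine($"Could not load the PDF: {status}");
    return;
}
```

How to react to a failure (log, retry, or skip the file) depends on your application; the early return here is only one option.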

4. Configure OCR Settings

  • Create a GdPictureOCR object:
using GdPictureOCR gdpictureOCR = new GdPictureOCR();
  • Set the resource folder: This folder contains the OCR language models.
gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
  • Choose a language: Specify the language(s) for text recognition.
gdpictureOCR.AddLanguage(OCRLanguage.English);

5. Extract Text Using OCR

  • Determine the number of pages in the document:
int pageCount = gdpicturePDF.GetPageCount();
  • Loop through the pages and process each page:
    • Render the page to an image.
    • Run the OCR process.
    • Retrieve the extracted text.
    • Release used resources.
string outputText = "";

for (int page = 1; page <= pageCount; page++)
{
    gdpicturePDF.SelectPage(page);
    int imageId = gdpicturePDF.RenderPageToGdPictureImageEx(300, true);
    gdpictureOCR.SetImage(imageId);
    string resultId = gdpictureOCR.RunOCR();
    outputText += gdpictureOCR.GetOCRResultText(resultId);
    GdPictureDocumentUtilities.DisposeImage(imageId);
    gdpictureOCR.ReleaseOCRResult(resultId);
}
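The loop above builds the result with string concatenation, which copies the whole string on every page. For long documents, a StringBuilder is a common alternative. The sketch below uses only the .NET base library. JoinPages and the page marker format are illustrative names, not part of GdPicture.NET, and the sample strings stand in for the values returned by GetOCRResultText.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Joins per-page text into one string, writing a marker line before each page.
// `pageTexts` stands in for the per-page OCR results (hypothetical input).
string JoinPages(IReadOnlyList<string> pageTexts)
{
    var builder = new StringBuilder();
    for (int i = 0; i < pageTexts.Count; i++)
    {
        builder.Append("--- Page ").Append(i + 1).AppendLine(" ---");
        builder.AppendLine(pageTexts[i].TrimEnd());
    }
    return builder.ToString();
}

string combined = JoinPages(new[] { "First page text", "Second page text" });
Console.Write(combined);
```

Inside the OCR loop, the same idea amounts to calling Append on a StringBuilder instead of using += on a string, then calling ToString once after the last page.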

6. Save Extracted Text to a File

  • Write the extracted text to a .txt file:
System.IO.StreamWriter outputFile = new System.IO.StreamWriter(@"C:\temp\output.txt");
outputFile.WriteLine(outputText);
outputFile.Close();
  • Release resources that are no longer needed:
gdpicturePDF.CloseDocument();
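If you prefer not to manage the StreamWriter yourself, File.WriteAllText from the .NET base library creates (or overwrites) the file, writes the text, and closes the file in one call. A minimal sketch, assuming UTF-8 output is acceptable; the placeholder string and the temp-folder path are illustrative, not part of the original example:

```csharp
using System;
using System.IO;
using System.Text;

// Placeholder for the text collected by the OCR loop.
string outputText = "Text recognized by OCR";
// Hypothetical output location in the current user's temp folder.
string outputPath = Path.Combine(Path.GetTempPath(), "output.txt");

// Creates or overwrites the file and closes it, even if the write throws.
File.WriteAllText(outputPath, outputText, Encoding.UTF8);

Console.WriteLine($"Wrote {new FileInfo(outputPath).Length} bytes to {outputPath}");
```

Specifying the encoding explicitly helps when the recognized text contains accented or non-Latin characters.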

Full C# Code Example

 
using GdPicture14;

LicenseManager licenseManager = new LicenseManager();
licenseManager.RegisterKEY("");

using GdPicturePDF gdpicturePDF = new GdPicturePDF();
using GdPictureOCR gdpictureOCR = new GdPictureOCR();

// Select the source document.
gdpicturePDF.LoadFromFile(@"C:\temp\source.pdf");
// Configure the OCR process.
gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
gdpictureOCR.AddLanguage(OCRLanguage.English);
// Create an empty string where you'll save the output.
string outputText = "";
// Determine the number of pages and loop through them.
int pageCount = gdpicturePDF.GetPageCount();
for (int page = 1; page <= pageCount; page++)
{
    gdpicturePDF.SelectPage(page);
    // Render the page to a 300 DPI image.
    int imageId = gdpicturePDF.RenderPageToGdPictureImageEx(300, true);
    // Pass the image to the `GdPictureOCR` object.
    gdpictureOCR.SetImage(imageId);
    // Run the OCR process.
    string resultId = gdpictureOCR.RunOCR();
    // Get the result of the OCR process as text.
    outputText += gdpictureOCR.GetOCRResultText(resultId);
    // Release the image and the OCR result.
    GdPictureDocumentUtilities.DisposeImage(imageId);
    gdpictureOCR.ReleaseOCRResult(resultId);
}
// Save the output in a new text file.
System.IO.StreamWriter outputFile = new System.IO.StreamWriter(@"C:\temp\output.txt");
outputFile.WriteLine(outputText);
outputFile.Close();
// Release unnecessary resources.
gdpicturePDF.CloseDocument();

Conclusion

By following this guide, you can successfully extract text from PDF files using OCR in C# and save it as a text file. This process makes scanned documents and images editable and searchable, improving document accessibility and workflow automation.

Looking to integrate advanced OCR solutions into your application? Contact our sales team to explore enterprise-grade features and customized solutions!

