How to create searchable PDF

Loïc
Site Admin
Posts: 5881
Joined: Tue Oct 17, 2006 10:48 pm
Location: France

Post by Loïc » Sun Oct 12, 2008 5:44 pm

Several VB6 samples showing how to create searchable PDF files with the GdPicture Pro Imaging SDK:

Note: The optional GdPicture OCR Tesseract Plugin is needed: https://www.gdpicture.com/products/plugi ... engine.php


- Sample 1: Creating multipage searchable PDF from the content of the document feeder of a scanner:

Code:

Dim nImageID As Long
Dim nCpt As Long
    
If Imaging1.TwainOpenDefaultSource() Then   
   Imaging1.TwainSetAutoFeed (True) 'Enable the automatic document feeder
   Imaging1.TwainSetAutoScan (True) 'To achieve the maximum scanning rate
   Imaging1.TwainSetCurrentResolution (300) 'Scan at 300 DPI
   Imaging1.TwainSetCurrentPixelType (TWPT_BW) 'Black & white scanning
   Imaging1.TwainSetCurrentBitDepth (1) '1 bpp scanning
      
   Imaging1.TwainPdfOCRStart ("output.pdf")
   While Imaging1.CreateImageFromTwain(Me.hWnd) <> 0
         nImageID = Imaging1.GetNativeImage
         'The AppData folder should contain the needed dictionary files
         Call Imaging1.TwainAddGdPictureImageToPdfOCR(nImageID, TesseractDictionaryEnglish, App.Path & "\AppData")
         Imaging1.CloseImage (nImageID)
   Wend
   Imaging1.TwainPdfOCRStop
      
   Call Imaging1.TwainCloseSource
Else
   MsgBox "Can't open default source. TWAIN state is: " & Trim(Str(Imaging1.TwainGetState))
End If

- Sample 2: Creating multipage searchable PDF from a multipage TIFF image:

Code:

Dim nImageID As Long

Imaging1.TiffOpenMultiPageAsReadOnly (True)
nImageID = Imaging1.CreateImageFromFile("multipage.tif")
'The AppData folder should contain the needed dictionary files
Call Imaging1.PdfOCRCreateFromMultipageTIFF(nImageID, "output.pdf", TesseractDictionaryEnglish, App.Path & "\AppData")
Call Imaging1.CloseImage(nImageID)

- Sample 3: Creating single page searchable PDF from image:

Code:

Imaging1.CreateImageFromFile ("image.tif")
Call Imaging1.SaveAsPDFOCR("output.pdf", TesseractDictionaryEnglish, App.Path & "\AppData")  'AppData includes dictionary files
Imaging1.CloseNativeImage

- Sample 4: Creating multipage searchable PDF from existing multipage PDF:

Code:

Dim nPage As Long
Dim oImaging As Object, oGdViewer As Object
Dim RasterizedPage As Long

Set oImaging = CreateObject("gdpicturepro5.Imaging")
Set oGdViewer = CreateObject("gdpicturepro5.GdViewer")

oGdViewer.SetLicenseNumber ("XXX")
oImaging.SetLicenseNumber ("XXX")
oGdViewer.LockControl = True
oGdViewer.PdfDpiRendering = 200
oGdViewer.DisplayFromPdfFile ("c:\test.pdf")

For nPage = 1 To oGdViewer.PageCount
    oGdViewer.DisplayFrame (nPage)

    RasterizedPage = oGdViewer.GetNativeImage

    If nPage = 1 Then
       oImaging.TwainPdfOCRStartEx ("c:\testocr.pdf")
    End If
    Call oImaging.TwainAddGdPictureImageToPdfOCR(RasterizedPage, TesseractDictionaryEnglish, App.Path & "\AppData") 'AppData includes dictionary files
Next nPage
oImaging.TwainPdfOCRStop

oGdViewer.CloseImage

Dantevios
Posts: 4
Joined: Sun Jan 18, 2009 7:59 pm

Re: How to create searchable PDF

Post by Dantevios » Sun Jan 18, 2009 8:01 pm

What data type is Imaging1, and how did you instantiate it?
- Myself

Never mind, I figured it out. Imaging1 is an instance of GdPicturePro5.cImaging. In C# you instantiate it like this:

Code:

GdPicturePro5.cImaging cImage = new GdPicturePro5.cImaging();

I have also figured out how to make a single-page searchable PDF from a TIFF in C#. Here is the code for anyone looking for it:

Code:

GdPicturePro5.cImaging cImage = new GdPicturePro5.cImaging();
cImage.SetLicenseNumber("XXXXX"); //Replace XXXXX with your license #
cImage.SetLicenseNumberOCRTesseract("XXXXX"); //Replace XXXXX with your license #
cImage.CreateImageFromFile("C:\\input.tif");
cImage.SaveAsPDFOCR("C:\\output.pdf", GdPicturePro5.TesseractDictionary.TesseractDictionaryEnglish, "PATH TO YOUR UNZIPPED DICTIONARY FILES", "", "Mr. Smith", "Mr. Smith", "Mr. Smith", "Mr. Smith");
cImage.CloseNativeImage();

I know there are two topics on this forum from people wanting examples of how to create multipage searchable PDFs in C#, so I am also providing this example, which makes a multipage PDF out of a multipage TIFF:

Code:

int nDimage = 0;

GdPicturePro5.cImaging cImage = new GdPicturePro5.cImaging();

cImage.SetLicenseNumber("XXXXX"); //Replace XXXXX with your license #
cImage.SetLicenseNumberOCRTesseract("XXXXX"); //Replace XXXXX with your license #
cImage.TiffOpenMultiPageAsReadOnly(true);
nDimage = cImage.CreateImageFromFile("C:\\input.tif");
//The dictionary path must be a quoted string; replace the placeholder with your own folder
cImage.PdfOCRCreateFromMultipageTIFF(nDimage, "C:\\output.pdf", GdPicturePro5.TesseractDictionary.TesseractDictionaryEnglish, "PATH TO YOUR DICTIONARY FILES", "", "Mr. Smith", "Mr. Smith", "Mr. Smith", "Mr. Smith");
cImage.CloseImage(nDimage);
Last edited by Dantevios on Sun Jan 18, 2009 8:50 pm, edited 2 times in total.

Re: How to create searchable PDF

Post by Loïc » Sun Jan 18, 2009 8:05 pm

Thank you, Dante. This should be a great help for many users ;)

Loïc

dchillman
Posts: 2
Joined: Wed Jan 21, 2009 9:03 pm

Re: How to create searchable PDF

Post by dchillman » Wed Jan 21, 2009 9:11 pm

I am evaluating your tool for a slightly different purpose. I have a bunch of PDF files in a SharePoint list that may or may not be text-searchable. My requirement is to create a feature that loops through the list, opens each PDF, updates it to make it text-searchable, and then saves it back to the list. I can open each PDF file as a memory stream. Will it then be possible to pass the stream to your object, have it processed to make it text-searchable, and get back the updated stream, which I can then pass back to the SharePoint list? If so, can you post a code example of how to handle a memory stream with your objects? Thanks.

Re: How to create searchable PDF

Post by Loïc » Fri Jan 23, 2009 12:06 pm

Hi,

You can't open a PDF from a stream object in GdPicture ActiveX. This feature is only available in GdPicture.NET.

To create a searchable PDF from an existing PDF document see
Sample 4: Creating multipage searchable PDF from existing multipage PDF
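
Since the ActiveX component works on files rather than streams, one possible workaround is to round-trip the stream through temporary files. The following is only a minimal C# sketch under that assumption, not an official GdPicture sample: StreamRoundTrip, ProcessViaTempFiles and the convertFile callback are hypothetical names introduced here for illustration, and the callback stands in for whatever file-based routine you use to produce the searchable PDF.

```csharp
using System;
using System.IO;

// Hypothetical helper (not part of GdPicture): spools a stream to a temp file,
// runs a file-based conversion on it, and returns the result as a new stream.
public static class StreamRoundTrip
{
    // convertFile(inputPath, outputPath) is an assumed callback that must
    // write the converted document to outputPath.
    public static MemoryStream ProcessViaTempFiles(Stream input, Action<string, string> convertFile)
    {
        string tempDir = Path.GetTempPath();
        string inPath = Path.Combine(tempDir, Guid.NewGuid().ToString("N") + ".pdf");
        string outPath = Path.Combine(tempDir, Guid.NewGuid().ToString("N") + ".pdf");
        try
        {
            // Copies from the stream's current position to the temp file.
            using (FileStream fs = File.Create(inPath))
            {
                input.CopyTo(fs);
            }

            // Run the file-based conversion supplied by the caller.
            convertFile(inPath, outPath);

            // Load the result fully into memory so the temp files can be deleted.
            return new MemoryStream(File.ReadAllBytes(outPath));
        }
        finally
        {
            if (File.Exists(inPath)) File.Delete(inPath);
            if (File.Exists(outPath)) File.Delete(outPath);
        }
    }
}
```

The callback could wrap a routine built along the lines of Sample 4, taking the input and output paths as parameters. The returned stream is positioned at the start, so it can be handed back to the caller directly.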

Best regards,

Loïc

alexandres
Posts: 1
Joined: Fri May 08, 2009 4:38 pm

Re: How to create searchable PDF

Post by alexandres » Fri May 08, 2009 4:48 pm

Hi,

I'm evaluating your software for imaging and OCR.
My question is whether there is a way to use an OCR engine other than Tesseract to create searchable PDFs.

ryancole11
Posts: 21
Joined: Fri May 21, 2010 7:19 pm

Re: How to create searchable PDF

Post by ryancole11 » Thu Jun 03, 2010 11:35 pm

How would you perform OCR after doing Sample 4? I am still getting "13: Unknown image format" when I try to perform OCR on the PDF produced by Sample 4's process.

Thanks.

Re: How to create searchable PDF

Post by Loïc » Tue Oct 12, 2010 10:08 am

Hi,

I am not sure I understand the question. Sample 4 already performs OCR, from PDF to PDF + text. What do you expect from the resulting document?

Kind regards,

Loïc

Re: How to create searchable PDF

Post by ryancole11 » Wed Mar 02, 2011 10:25 pm

Loïc wrote: I am not sure I understand the question. Sample 4 already performs OCR, from PDF to PDF + text. What do you expect from the resulting document?

Loïc, the example #4 code does not work anymore. It's more than 3 years old, and functions used in the example don't exist in the newer versions of the library. I still cannot get example #4 working. I'm just trying to OCR an existing multi-page PDF.

Re: How to create searchable PDF

Post by Loïc » Thu Mar 03, 2011 10:17 am

Hi,

All examples are working. I suppose you are using GdPicture.NET, so please move to the GdPicture.NET forum (this one is for GdPicture ActiveX).

Kind regards,

Loïc

luke92
Posts: 6
Joined: Tue Apr 05, 2011 2:45 pm

Re: How to create searchable PDF

Post by luke92 » Tue Apr 12, 2011 1:28 pm

I have a question about Sample 4.
Is it possible to create a multipage searchable PDF from an existing multipage PDF without using the GdViewer?
I ask because the viewer isn't included in the GdPicture Light Imaging Toolkit.
Last edited by luke92 on Wed Apr 13, 2011 12:16 pm, edited 1 time in total.

Re: How to create searchable PDF

Post by Loïc » Tue Apr 12, 2011 4:23 pm

Hi,

No, you have to use the GdViewer object to convert a PDF page to an image with GdPicture ActiveX.