How to create searchable PDF

Posted: Sun Oct 12, 2008 5:44 pm
by Loïc
Here are several VB6 samples that create searchable PDF files using the GdPicture Pro Imaging SDK:

Note: The optional GdPicture OCR Tesseract Plugin is needed: https://www.gdpicture.com/products/plugi ... engine.php


- Sample 1: Creating multipage searchable PDF from the content of the document feeder of a scanner:

Code:

Dim nImageID As Long
Dim nCpt As Long
    
If Imaging1.TwainOpenDefaultSource() Then   
   Imaging1.TwainSetAutoFeed (True) 'Set AutoFeed Enabled
   Imaging1.TwainSetAutoScan (True) 'To  achieve the maximum scanning rate
   Imaging1.TwainSetCurrentResolution (300) 'We scan in 300 DPI
   Imaging1.TwainSetCurrentPixelType (TWPT_BW) 'Black & White scanning
   Imaging1.TwainSetCurrentBitDepth (1) ' 1 bpp scanning
      
   Imaging1.TwainPdfOCRStart ("output.pdf")
   While Imaging1.CreateImageFromTwain(Me.hWnd) <> 0
         nImageID = Imaging1.GetNativeImage
          'The AppData folder must contain the needed dictionary files
         Call Imaging1.TwainAddGdPictureImageToPdfOCR(nImageID, TesseractDictionaryEnglish,  App.Path & "\AppData") 'AppData includes dictionary files
         Imaging1.CloseImage (nImageID)
   Wend
   Imaging1.TwainPdfOCRStop
      
   Call Imaging1.TwainCloseSource
Else
   MsgBox "can't open default source, twain state is: " & Trim(Str(Imaging1.TwainGetState))
End If
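
The samples in this post assume that the dictionary files are in an AppData folder next to the application. A minimal sketch of a check that could run before the OCR starts is shown below. FolderExists and DictionaryFolderReady are hypothetical helper names, not part of GdPicture; they use only VB6 built-ins (GetAttr, vbDirectory, MsgBox) and only test that the folder exists, not that it contains the right files.

```vb
' Hypothetical helper (not part of GdPicture): returns True when sPath
' is an existing folder. Uses only VB6 built-in functions.
Private Function FolderExists(ByVal sPath As String) As Boolean
    On Error Resume Next   ' GetAttr raises an error when the path does not exist
    FolderExists = (GetAttr(sPath) And vbDirectory) = vbDirectory
    On Error GoTo 0
End Function

' Hypothetical usage sketch: checks the folder the samples pass to the OCR calls.
Private Function DictionaryFolderReady() As Boolean
    Dim sDictPath As String
    Dim bReady As Boolean

    sDictPath = App.Path & "\AppData"   ' Assumed location, same as in the samples
    bReady = FolderExists(sDictPath)
    If Not bReady Then
        MsgBox "Dictionary folder not found: " & sDictPath
    End If
    DictionaryFolderReady = bReady
End Function
```

Calling DictionaryFolderReady before TwainPdfOCRStart, and skipping the scan when it returns False, would surface a missing folder before any page is scanned.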

- Sample 2: Creating multipage searchable PDF from a multipage TIFF image:

Code:

Dim nImageID As Long

Imaging1.TiffOpenMultiPageAsReadOnly (True)
nImageID = Imaging1.CreateImageFromFile("multipage.tif")
'The AppData folder must contain the needed dictionary files
Call Imaging1.PdfOCRCreateFromMultipageTIFF(nImageID, "output.pdf", TesseractDictionaryEnglish, App.Path & "\AppData")  'AppData includes dictionary files
Call Imaging1.CloseImage(nImageID)

- Sample 3: Creating single page searchable PDF from image:

Code:

Imaging1.CreateImageFromFile ("image.tif")
Call Imaging1.SaveAsPDFOCR("output.pdf", TesseractDictionaryEnglish, App.Path & "\AppData")  'AppData includes dictionary files
Imaging1.CloseNativeImage

- Sample 4: Creating multipage searchable PDF from existing multipage PDF:

Code:

Dim nPage As Long
Dim oImaging As Object, oGdViewer As Object
Dim RasterizedPage As Long

Set oImaging = CreateObject("gdpicturepro5.Imaging")
Set oGdViewer = CreateObject("gdpicturepro5.GdViewer")

oGdViewer.SetLicenseNumber ("XXX")
oImaging.SetLicenseNumber ("XXX")
oGdViewer.LockControl = True
oGdViewer.PdfDpiRendering = 200
oGdViewer.DisplayFromPdfFile ("c:\test.pdf")

For nPage = 1 To oGdViewer.PageCount
    oGdViewer.DisplayFrame (nPage)

    RasterizedPage = oGdViewer.GetNativeImage

    If nPage = 1 Then
       oImaging.TwainPdfOCRStartEx ("c:\testocr.pdf")
    End If
    Call oImaging.TwainAddGdPictureImageToPdfOCR(RasterizedPage, TesseractDictionaryEnglish, App.Path & "\AppData") 'AppData includes dictionary files
Next nPage
oImaging.TwainPdfOCRStop

oGdViewer.CloseImage

Re: How to create searchable PDF

Posted: Sun Jan 18, 2009 8:01 pm
by Dantevios
What data type is Imaging1, and how did you instantiate it?
- Myself

Never mind, I figured it out. Imaging1 is an instance of GdPicturePro5.cImaging. In C# you instantiate it like this:

Code:

GdPicturePro5.cImaging cImage = new GdPicturePro5.cImaging();

I have figured out how to make a single-page searchable PDF out of a TIFF in C#. Here is the code for anyone looking for it:

GdPicturePro5.cImaging cImage = new GdPicturePro5.cImaging();
cImage.SetLicenseNumber("XXXXX"); // Replace XXXXX with your license number
cImage.SetLicenseNumberOCRTesseract("XXXXX"); // Replace XXXXX with your license number
cImage.CreateImageFromFile("C:\\input.tif");
cImage.SaveAsPDFOCR("C:\\output.pdf", GdPicturePro5.TesseractDictionary.TesseractDictionaryEnglish, "PATH TO YOUR UNZIPPED DICTIONARY FILES", "", "Mr. Smith", "Mr. Smith", "Mr. Smith", "Mr. Smith");
cImage.CloseNativeImage();

I know there are two topics on this forum from people who want examples of how to create multipage searchable PDFs in C#, so I am also providing this example, which makes a multipage PDF out of a multipage TIFF:

Code:

int nDimage = 0;

GdPicturePro5.cImaging cImage = new GdPicturePro5.cImaging();

cImage.SetLicenseNumber("XXXXX"); // Replace XXXXX with your license number
cImage.SetLicenseNumberOCRTesseract("XXXXX"); // Replace XXXXX with your license number
cImage.TiffOpenMultiPageAsReadOnly(true);
nDimage = cImage.CreateImageFromFile("C:\\input.tif");
// The dictionary path below is a placeholder string; replace it with the path to your dictionary files
cImage.PdfOCRCreateFromMultipageTIFF(nDimage, "C:\\output.pdf", GdPicturePro5.TesseractDictionary.TesseractDictionaryEnglish, "PATH TO YOUR DICTIONARY FILES", "", "Mr. Smith", "Mr. Smith", "Mr. Smith", "Mr. Smith");
cImage.CloseImage(nDimage);

Re: How to create searchable PDF

Posted: Sun Jan 18, 2009 8:05 pm
by Loïc
Thank you, Dante. This should be a great help to many users ;)

Loïc

Re: How to create searchable PDF

Posted: Wed Jan 21, 2009 9:11 pm
by dchillman
I am evaluating your tool for a slightly different purpose. I have a number of PDF files in a SharePoint list that may or may not be text-searchable. My requirement is to create a feature that loops through the list, opens each PDF, updates it to make it text-searchable, then saves it back to the list. I can open each PDF file as a memory stream. Will it then be possible to pass the stream to your object, have it processed to make it text-searchable, and get back the updated stream, which I can then pass back to the SharePoint list? If so, can you post a code example of how to handle a memory stream with your objects? Thanks.

Re: How to create searchable PDF

Posted: Fri Jan 23, 2009 12:06 pm
by Loïc
Hi,

You can't open a PDF from a stream object in GdPicture ActiveX. This feature is only available in GdPicture.NET.

To create a searchable PDF from an existing PDF document, see:
Sample 4: Creating multipage searchable PDF from existing multipage PDF

Best regards,

Loïc

Re: How to create searchable PDF

Posted: Fri May 08, 2009 4:48 pm
by alexandres
Hi,

I'm evaluating your software for imaging and OCR.
My question is whether there is a way to use an OCR engine other than Tesseract to create searchable PDFs.

Re: How to create searchable PDF

Posted: Thu Jun 03, 2010 11:35 pm
by ryancole11
How would you perform OCR after doing sample 4? I am still getting "13: Unknown image format" when I try to perform OCR on the PDF produced by sample 4's process.

Thanks.

Re: How to create searchable PDF

Posted: Tue Oct 12, 2010 10:08 am
by Loïc
Hi,

I am not sure I understand the question. Sample 4 already performs OCR, from PDF to PDF + text. What do you expect from the resulting document?

Kind regards,

Loïc

Re: How to create searchable PDF

Posted: Wed Mar 02, 2011 10:25 pm
by ryancole11
Loïc wrote:
Hi,

I am not sure I understand the question. Sample 4 already performs OCR, from PDF to PDF + text. What do you expect from the resulting document?

Kind regards,

Loïc

Loïc, the example #4 code does not even work anymore. It's more than 3 years old. Functions used in the example don't exist in the newer versions of the library. I still cannot get example #4 working. I'm just trying to OCR an existing multipage PDF.

Re: How to create searchable PDF

Posted: Thu Mar 03, 2011 10:17 am
by Loïc
Hi,

All examples are working. I suppose you are using GdPicture.NET, so please move to the GdPicture.NET forum (this forum is for GdPicture ActiveX).

Kind regards,

Loïc

Re: How to create searchable PDF

Posted: Tue Apr 12, 2011 1:28 pm
by luke92
I have a question about sample 4.
Is it possible to create a multipage searchable PDF from an existing multipage PDF without using the GdViewer?
I ask because the viewer isn't included in the GdPicture Light Imaging Toolkit.

Re: How to create searchable PDF

Posted: Tue Apr 12, 2011 4:23 pm
by Loïc
Hi,

No, you have to use the GdViewer object to convert a PDF page to an image with GdPicture ActiveX.