Multipage PDF Tesseract OCR Capture Via Threading in ASP.net

csinkinson
Posts: 14
Joined: Wed Sep 30, 2009 2:31 am

Post by csinkinson » Mon Mar 29, 2010 6:59 pm

Hi Loïc,

Hope you are well! :)

I am having some difficulty running the Tesseract OCR on a multipage PDF using threading in ASP.NET. I want to run the OCR process in the background so that the user doesn't need to wait for it to complete. Instead, they can continue doing other tasks while the OCR runs.

When I run my code on a single-page PDF it works perfectly! But when I try a multipage PDF I get the following error:

Code:

System.ArgumentNullException was unhandled
  Message="Value cannot be null. Parameter name: ptr"
  ParamName="ptr"
  Source="mscorlib"
  StackTrace:
       at System.Runtime.InteropServices.Marshal.GetDelegateForFunctionPointer(IntPtr ptr, Type t)
       at Ꮑ.ᢤ.ᢲ(Int32 ᢳ, Int32 ᢴ, Int32 ᢵ, Int32 ᢶ, Int32 ᢷ, TesseractDictionary ᢸ, String ᢹ, String ᢺ, IntPtr& ᢻ, Int32& ᢼ, Int32 ᢽ)
       at GdPicture.GdPictureImaging.PdfAddGdPictureImageToPdfOCR(Int32 PdfID, Int32 ImageID, TesseractDictionary Dictionary, String DictionaryPath, String CharWhiteList)
       at _Default.DoOCR_Multi() in C:\Projects\TesseractTest\Default.aspx.vb:line 84
       at _Default._Lambda$__2() in C:\Projects\TesseractTest\Default.aspx.vb:line 20
       at System.Threading.ThreadHelper.ThreadStart_Context(Object state)
       at System.Threading.ExecutionContext.runTryCode(Object userData)
       at System.Runtime.CompilerServices.RuntimeHelpers.ExecuteCodeWithGuaranteedCleanup(TryCode code, CleanupCode backoutCode, Object userData)
       at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
       at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
       at System.Threading.ThreadHelper.ThreadStart()
  InnerException: 

Below is the sample code that I am using to test the multipage PDF.

Code:

    ' Requires "Imports System.Threading" and "Imports GdPicture" at the top of the file.
    Dim licensenumber As String = "LICENCENUMBER"

    Protected Sub Button2_Click(ByVal sender As Object, ByVal e As System.EventArgs) Handles Button2.Click
        ' Start the OCR on a low-priority thread so the request can return immediately.
        Dim NewThread As Thread = New Thread(AddressOf DoOCR_Multi)
        NewThread.Priority = ThreadPriority.Lowest
        NewThread.Start()

        Button2.Text = "Started Multi OCR... Wait for a few moments then check multipage_out.pdf"
    End Sub

    Public Function DoOCR_Multi() As Boolean
        Dim sourcedoc As String = Server.MapPath("./") & "multipage.pdf"
        Dim outdoc As String = Server.MapPath("./") & "multipage_out.pdf"

        ' randomnum and thedate are not used further down in this sample.
        Dim randomgen As New Random()
        Dim randomnum As Integer = randomgen.Next()
        ' MM = month, mm = minutes, HH = 24-hour clock (the original format string had these swapped).
        Dim thedate As String = DateTime.Now.ToString("yyyyMMddHHmmss")
        Dim tempfilename As String = sourcedoc
        Dim tempfilename2 As String = outdoc

        Dim ImageID As Integer
        Dim oGdViewer As New GdPicture.GdViewer
        Dim oGdPictureImaging As New GdPicture.GdPictureImaging
        Dim PdfID As Integer

        oGdViewer.SetLicenseNumber(licensenumber)
        oGdPictureImaging.SetLicenseNumber(licensenumber)
        oGdPictureImaging.SetLicenseNumberOCRTesseract(licensenumber)

        ' Open the source PDF and start the searchable output PDF.
        oGdViewer.DisplayFromFile(tempfilename)
        PdfID = oGdPictureImaging.PdfOCRStart(tempfilename2, True, "", "", "", "", "")
        For i As Integer = 1 To oGdViewer.PageCount
            ' Render each page at 300 DPI, convert it to 1 bpp, OCR it into the output PDF, then release the image.
            ImageID = oGdViewer.PdfRenderPageToGdPictureImage(300, i)
            oGdPictureImaging.ConvertTo1Bpp(ImageID)
            oGdPictureImaging.PdfAddGdPictureImageToPdfOCR(PdfID, ImageID, TesseractDictionary.TesseractDictionaryEnglish, Server.MapPath("./") & "App_Data\Dictionary", "")
            oGdViewer.ReleaseGdPictureImage(ImageID)
        Next
        oGdPictureImaging.PdfOCRStop(PdfID)
        oGdViewer.CloseDocument()

        ' The function is declared As Boolean, so return a value explicitly.
        Return True
    End Function
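
A side note on the threading pattern above, offered as a sketch rather than a diagnosis of the ArgumentNullException: DoOCR_Multi calls Server.MapPath from the worker thread, and Server belongs to the page's request context, which may no longer be usable once the response has been sent. One way to avoid depending on it is to resolve every path on the request thread and hand the results to the worker. The OcrJob class, the RunJob method, and the hard-coded paths below are hypothetical names made up for illustration, and the GdPicture calls are left as a comment.

```vb
Imports System
Imports System.Threading

' Hypothetical container for everything the worker needs; not part of GdPicture.
Public Class OcrJob
    Public SourcePath As String
    Public OutputPath As String
    Public DictionaryPath As String
End Class

Module BackgroundOcrSketch
    ' Runs on the background thread. It only reads the job object,
    ' so it never touches Server, Request, or any other page member.
    Sub RunJob(ByVal state As Object)
        Dim job As OcrJob = DirectCast(state, OcrJob)
        Try
            ' The PdfOCRStart / PdfAddGdPictureImageToPdfOCR / PdfOCRStop loop
            ' from the post would go here, using the paths carried by the job.
            Console.WriteLine("OCR " & job.SourcePath & " -> " & job.OutputPath)
        Catch ex As Exception
            ' Catch and log here instead of letting the exception escape the thread.
            Console.Error.WriteLine(ex.ToString())
        End Try
    End Sub

    Sub Main()
        ' In the page these values would come from Server.MapPath,
        ' evaluated inside Button2_Click before the thread starts.
        ' The paths below are placeholders.
        Dim job As New OcrJob()
        job.SourcePath = "C:\site\multipage.pdf"
        job.OutputPath = "C:\site\multipage_out.pdf"
        job.DictionaryPath = "C:\site\App_Data\Dictionary"

        Dim worker As New Thread(New ParameterizedThreadStart(AddressOf RunJob))
        worker.Priority = ThreadPriority.Lowest
        worker.Start(job)

        ' Join is only here so the console sketch waits for the worker;
        ' a page handler would return without waiting.
        worker.Join()
    End Sub
End Module
```

The Try/Catch in RunJob matters because an unhandled exception on a thread you start yourself terminates the whole process, which in ASP.NET means the worker process serving the site.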

I've also attached a complete sample project that demonstrates the working code (with a single-page PDF) and the code that causes the error (with a multipage PDF).

Would you mind trying the sample project that I've created? You just need to add the dictionary files to the /App_Data/Dictionary/ folder and the GdPicture.NET DLL files to the /Bin/ folder. I must be doing something incorrectly, but I just can't locate the problem. I'm hoping that you'll be able to point me in the right direction!

Thank you,
Chris
Attachments
MultiPageOCRThreading.zip
Sample
(47.06 KiB)


Re: Multipage PDF Tesseract OCR Capture Via Threading in ASP.net

Post by csinkinson » Tue Mar 30, 2010 8:15 pm

Good news! I updated to 6.6 and everything is working properly! :)

Thanks
Chris

Loïc
Site Admin
Posts: 5881
Joined: Tue Oct 17, 2006 10:48 pm
Location: France

Re: Multipage PDF Tesseract OCR Capture Via Threading in ASP.net

Post by Loïc » Wed Mar 31, 2010 9:27 am

OK Chris, thank you for the feedback.

With best regards,

Loïc
