[VFP 9] OCR PDF pages

Discussions about GdPicture.NET usage in non managed applications built in vb6, Delphi, vfp, MFC c++ etc...
ronk
Posts: 10
Joined: Wed Aug 29, 2012 7:53 am

[VFP 9] OCR PDF pages

Post by ronk » Wed Nov 21, 2012 5:28 am

I was previously advised to post here, as I am trying to show my client that the latest version will improve their program. The program is written in Visual FoxPro 9.2.

With some previous help I was able to register as follows, using the 30-day trial version:

Code: Select all

oGdPictureImaging = CREATEOBJECT("GdPicture9.GdPictureImaging")
oGdPictureViewer = CREATEOBJECT("GdPicture9.GdViewer")
thisform.GdViewer1.SetLicenseNumber("xxxxxxxxxxxxxx")

So far so good, and I was able to go on and successfully use this line to extract text from a PDF:

Code: Select all

vtextbit=thisform.GdViewer1.PdfGetPageText
However, we need to use Tesseract and a lot has changed here since the old version and I am obviously missing something.

What I have done is (in addition to the registering above):

Code: Select all

LicMgr = CreateObject("GdPicture9.LicenseManager")
LicMgr.RegisterKEY(MY_PLUGIN_KEY)

thisform.GdViewer1.DisplayFromFile(vimage)	
					
mpages = Thisform.GdViewer1.PageCount 
				

* vimage = valid name of the file, e.g. test.pdf, test.tiff, test.jpg etc. (can be multi-page)

For i=1 To mpages &&fr.BatchDocument.Pages.Count

    thisform.GdViewer1.DisplayPage(i)

    ImageID = oGdPictureImaging.CreateGdPictureImageFromFile(vimage)

    ImageID = INT(ImageID)
    thisform.GdViewer1.DisplayFromGdPictureImage(ImageID)
    vtextbit = oGdPictureImaging.OCRTesseractDoOCR(ImageID, "eng", lcOcrDir, "")

    * etc....
endfor
My problem is that nothing is found and the problem seems to be that ImageID is evaluating to 0. I have tested many times and the files exist so I am clearly missing something.

Secondly, if a document has an incorrect orientation, what command would I use above to orient it automatically? We can process many thousands of documents in a batch, so manual processing is unworkable.

If we can solve this problem I am sure my client will upgrade and we can both make some money!

Any help would be most appreciated.

Loïc
Site Admin
Posts: 5881
Joined: Tue Oct 17, 2006 10:48 pm
Location: France

Re: Upgrade to latest version from GdPicturePro5

Post by Loïc » Wed Nov 21, 2012 4:36 pm

Hello,

Could you explain what you are trying to do?
Your code does not make sense to me, and I cannot tell what this procedure is meant to achieve.

kind regards,

Loïc

Re: Upgrade to latest version from GdPicturePro5

Post by ronk » Wed Nov 21, 2012 5:15 pm

I am sorry that I am not making sense and thank you again for your rapid response.

My client has an existing Visual FoxPro program that, amongst other things, scans the contents of documents of various types from a hard drive into the memo field of a FoxPro table. It currently does so by using GdPicturePro 5 and OCRTesseract to read the contents of these documents. Part of the existing code reads:

Thisform.AddObject("olecontrol1", "olecontrol", "GdPicturePro5.GdViewer")
Thisform.AddObject("olecontrol2", "olecontrol", "GdPicturePro5.Imaging")

thisform.olecontrol2.SetLicenseNumber("1xxxxxxxxxxxxxxxx")

thisform.olecontrol2.SetLicenseNumberOCRTesseract("xxxxxxxxxxxxxxxxxxxxx")

thisform.olecontrol1.SetLicenseNumber("xxxxxxxxxxxxxxxxxxxxxxxxxxx")
Thisform.olecontrol1.MouseMode = 1 && MouseModeAreaSelection
Thisform.olecontrol1.zoomMode = 2 && ZoomFitToControl

if thisform.check1.value = .t.
thisform.olecontrol1.visible = .T.
ENDIF


thisform.olecontrol1.top = thisform.container1.Top
thisform.olecontrol1.left = thisform.container1.Left
thisform.olecontrol1.height = thisform.container1.Height
thisform.olecontrol1.width = thisform.container1.Width
thisform.olecontrol1.DisplayFromFile(vimage) && vimage being the file to be scanned

pages = Thisform.olecontrol1.PageCount
vText=""

For i=1 To Thisform.olecontrol1.PageCount
thisform.md.caption = "Getting text for page " + alltrim(str(i))
thisform.Cls
thisform.olecontrol1.DisplayFrame(i)

thisform.olecontrol2.SetNativeImage(thisform.olecontrol1.GetNativeImage)
thisform.olecontrol2.ResetROI
vtextbit = thisform.olecontrol2.OCRTesseractDoOCR(2, lcOcrDir)

********the contents of the variable vtextbit are then placed into the memo field and other functions are carried out.
endfor

My investigations tell me that some or all of these commands are no longer relevant, so I have attempted to replicate this procedure with the latest trial version as per my post.

So, keeping in mind my previous post, how do I OCR the contents of documents in version 9? Your help would be most appreciated.

Re: Upgrade to latest version from GdPicturePro5

Post by ronk » Fri Nov 23, 2012 6:26 am

Having worked on this some more, I can see that OCRTesseractDoOCR does not work as it did previously. My comment about ImageID evaluating to 0 arose because I was applying it to PDF documents; when I apply it to JPG documents, for example, everything works fine.

So please ignore what has gone before. My main problem is that many PDF documents are processed just fine using PdfGetPageText (i.e. I am able to extract the text), but some documents return no text even though they display correctly in GdViewer. It appears that these are PDF files created from an image. We used to be able to use OCRTesseractDoOCR to extract text from these, but now I don't know what to do with them. Given there may be many thousands of these, can you suggest how I can extract text from them? Also, can you advise how to automatically correct the orientation of PDF files that were scanned at the wrong angle?

Solving these two issues should satisfy my client that they need to upgrade.

Thank you for your patience, and I apologise for wasting your time. It is just that I find the latest version so different from the version 5 my client uses.

Re: Upgrade to latest version from GdPicturePro5

Post by Loïc » Wed Nov 28, 2012 7:14 pm

Hello,

Here is a VFP snippet to OCR a PDF:

Code: Select all

gdpictureImaging = CREATEOBJECT("GdPicture9.GdPictureImaging")
gdpicturePDF = CREATEOBJECT("GdPicture9.GdPicturePDF")
licenseManager = CREATEOBJECT("GdPicture9.LicenseManager")

pdfText = ""

WITH licenseManager as "GdPicture9.LicenseManager"
  .RegisterKEY("YOUR_LICENSE_KEY_HERE")
ENDWITH

WITH gdpicturePDF as "GdPicture9.GdPicturePDF"
  IF .LoadFromFile("c:\test.pdf", .F.) = 0 THEN && GdPictureStatus_OK
    pageCount = .GetPageCount()
    FOR i = 1 TO pageCount
      .SelectPage(i)
      * Render the current page to an image at 200 DPI
      imageID = .RenderPageToGdPictureImage(200, .F.)
      IF imageID <> 0 THEN
        WITH gdpictureImaging as "GdPicture9.GdPictureImaging"
          IF i > 1 THEN
            * Separate pages with a Windows line break (CR + LF)
            pdfText = pdfText + CHR(13) + CHR(10)
          ENDIF
          pdfText = pdfText + .OCRTesseractDoOCR(imageID, "eng", "C:\Program Files (x86)\GdPicture.NET 9\Redist\OCR", "")
          * Free the rendered image once its text has been read
          .ReleaseGdPictureImage(imageID)
        ENDWITH
      ENDIF
    ENDFOR
    .CloseDocument()
  ENDIF
ENDWITH

?pdfText
Hope this helps!

Kind regards,

Loïc
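
For batches that mix text-based and image-only PDFs, one possible way to decide per page whether the OCR pass is needed at all is to check whether the text extracted from that page is effectively empty. The following is only a minimal sketch in plain VFP, assuming the page text has already been obtained (for example with PdfGetPageText, as in the first post). The function name NeedsOcr is hypothetical and is not part of GdPicture.

```foxpro
* Hypothetical helper, not part of GdPicture: returns .T. when the text
* extracted from a page holds nothing but whitespace, which suggests an
* image-only page that should be rendered and sent through OCR instead.
FUNCTION NeedsOcr(tcPageText)
    LOCAL lcStripped
    * Anything that is not character data (e.g. .NULL.) counts as "no text"
    IF VARTYPE(tcPageText) <> "C"
        RETURN .T.
    ENDIF
    * Remove spaces, tabs, LF and CR before testing for emptiness
    lcStripped = CHRTRAN(tcPageText, " " + CHR(9) + CHR(10) + CHR(13), "")
    RETURN LEN(lcStripped) = 0
ENDFUNC
```

A caller could then skip rendering and OCR for any page where NeedsOcr(vtextbit) returns .F., which may save time on large batches. Whether an empty result reliably identifies a scanned page is an assumption that should be checked against real documents.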

Re: [VFP 9] OCR PDF pages

Post by ronk » Thu Nov 29, 2012 7:12 am

Many, many thanks for your help on this. The code works just fine.

I wonder if you had any thoughts on my other question: can you advise how to automatically correct the orientation of PDF files that were scanned at the wrong angle?

At the moment the text is read at the angle of the document, and the OCR produces rubbish. If the page could be oriented automatically prior to OCR, that would be perfect.

Sorry for taking up your time.

SamiKharma
Posts: 352
Joined: Tue Sep 27, 2011 11:47 am

Re: [VFP 9] OCR PDF pages

Post by SamiKharma » Thu Nov 29, 2012 10:28 am

Hi,

If the pages are scanned at a slightly rotated angle, meaning 0-15 degrees, you can render the pages to images, then use the AutoDeskew function to properly align them. If they are rotated completely (90, 180, or 270 degrees), the OCR engine itself can try to detect the orientation with OCRTesseractGetOrientation.

Hope this helps.
Best,
Sami
