OCR, ROI's and Whitelists

Post by **Nigels** » Wed May 06, 2009 2:24 pm

Hi

We have used the GDPicturePro control for some time, and I downloaded the new version with OCR yesterday. I have been doing some basic evaluation of the OCR methods using the sample program supplied and have come up with a couple of questions.

1) Different results when OCR’ing the entire document and when using an ROI
I am testing with a document that has our name and address on it. When I OCR the entire document, our company name “MARDAK” is read as “IVIARDAK”, with a confidence level of 33.33 for each character. However, if I create an ROI covering the top half of the page, it is read correctly as “MARDAK”, with a confidence level of 30.58 for each character.

How does defining an ROI change the ability to OCR characters, and is it more accurate to chop the page up into ROIs rather than reading the entire page?

2) Use of Whitelists and the effect on OCR
Using the same document as above: towards the bottom of the document are 3 numbers (20.00, 3.50 and 23.50). If you set an ROI round these numbers, they are read correctly except the last one, which comes out as “zs.s0” with a confidence of 49.41. However, if I set a whitelist of “0123456789.” and repeat the OCR, all the numbers are read correctly, with a confidence of 25.30.

Does the whitelist also direct the OCR process towards the type of characters that are expected, and should one therefore be specified if numbers are expected in a particular ROI?

These tests were carried out using the GDPicturePro OCX and the VB6 OCR example program supplied. This program was tweaked to show the character/OCR confidence level when looping through placing the red boxes round each character.

I have attached a copy of the document I was using to test this to the log.

Many thanks

Nigel
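One way to picture the whitelist behaviour described in question 2 is sketched below in Python. It assumes that the engine scores several candidate characters per glyph and that a whitelist simply removes the disallowed candidates before the best one is picked. The thread does not confirm this, and the scores are invented purely for illustration; they are not output from GdPicture or Tesseract. `CANDIDATES` and `decode` are hypothetical names.

```python
# Hypothetical per-glyph candidate scores for the printed total "23.50".
# These numbers are invented for illustration, not real engine output.
CANDIDATES = [
    {"z": 0.49, "2": 0.25},
    {"s": 0.49, "3": 0.25},
    {".": 0.60},
    {"s": 0.49, "5": 0.25},
    {"0": 0.40, "o": 0.35},
]


def decode(candidates, whitelist=None):
    """Pick the best-scoring character per glyph, optionally limited to a whitelist."""
    chars, scores = [], []
    for glyph in candidates:
        # Keep only the candidates the whitelist permits (all of them if none is set).
        allowed = {c: s for c, s in glyph.items()
                   if whitelist is None or c in whitelist}
        if not allowed:
            continue  # no permitted candidate for this glyph, so skip it
        best = max(allowed, key=allowed.get)
        chars.append(best)
        scores.append(allowed[best])
    return "".join(chars), scores


unrestricted, unrestricted_scores = decode(CANDIDATES)
restricted, restricted_scores = decode(CANDIDATES, whitelist="0123456789.")
print(unrestricted, unrestricted_scores)
print(restricted, restricted_scores)
```

Under this assumption the whitelisted result can be correct while carrying a lower score, because the engine falls back to its second choice for some glyphs. That would be consistent with the lower confidence reported above, but it is only one possible explanation.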

Post by **Nigels** » Mon May 11, 2009 12:02 pm

Hi

Item 2 above is also manifesting itself in a different way.

If you OCR the entire supplied page (using the demo OCR program, but with the line that draws the red rectangle around each read character removed), the first time you get the total "23.50" read as "zs.so" (confidence 49.41177 for each character). However, if you then OCR it again (without reloading the program, just hitting the "Perform OCR!" button again), it is read correctly as "23.50" (confidence 25.09804).

No optimisation of the image appears to take place within the code, so why does performing OCR the second time improve the results?

Thanks

Nigel

Post by **Loïc** » Tue May 12, 2009 11:07 am

Hi Nigel,

Sorry for the delay. The Tesseract engine is based on a natural learning algorithm. This means that the OCR process at time T-1 has an influence on the OCR at time T.

Therefore:

- If you run OCR on a specific area, the result can differ from the OCR of the entire page.
- If you run two OCR passes, the second pass can bring some improvements.

I hope I was clear enough.

Kind regards,

Loïc
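The two-pass idea above can be sketched as follows. This is a minimal, library-agnostic illustration in Python: `ocr_passes` is a hypothetical helper, and `FakeAdaptiveEngine` is a stand-in that merely replays the two readings reported earlier in this thread. It is not Tesseract or GdPicture, and it does not model how the real engine adapts.

```python
from typing import Callable, List


def ocr_passes(run_ocr: Callable[[], str], passes: int = 2) -> List[str]:
    """Call the same OCR routine several times in one session and keep every result."""
    return [run_ocr() for _ in range(passes)]


class FakeAdaptiveEngine:
    """Stand-in for an engine that keeps state between calls (an assumption)."""

    def __init__(self) -> None:
        self.calls = 0

    def recognize(self) -> str:
        self.calls += 1
        # Readings taken from the thread: first pass "zs.so", second pass "23.50".
        return "zs.so" if self.calls == 1 else "23.50"


engine = FakeAdaptiveEngine()
results = ocr_passes(engine.recognize)
changed = results[0] != results[-1]
print(results, changed)
```

Keeping every pass rather than only the last one makes it easy to see whether a repeat run changed anything for a given document, since a second pass is not guaranteed to help.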

Post by **Slava** » Tue May 12, 2009 11:31 am

Loïc wrote: The Tesseract engine is based on a natural learning algorithm. This means that the OCR process at time T-1 has an influence on the OCR at time T.

Hi Loïc,

I have seen this post and I would like to know more about this "learning algorithm" for improving OCR quality. Can you tell me more about how it works: does Tesseract clear these "learnings" on OCRTesseractClear, or are they stored somewhere? And can they maybe be loaded or manipulated (like ADR templates)?

I haven't seen second-time-OCR improvements. Is it because I call OCRTesseractClear? What is the proper use of this function?

Thanks in advance,
Slava

Post by **Loïc** » Tue May 12, 2009 3:12 pm

Hi,

Slava wrote: Does Tesseract clear these "learnings" on OCRTesseractClear?

No. In the .NET edition we have an option to reinitialize the OCR engine, but we have not found a way to do that in the ActiveX edition. OCRTesseractClear is only useful for releasing the recognized information (character positions, confidences...) from memory.

Slava wrote: I haven't seen second-time-OCR improvements. Is it because I call OCRTesseractClear? What is the proper use of this function?

Sometimes the second pass brings improvements, sometimes not. It depends on the document content.

Kind regards,

Loïc

Post by **Nigels** » Tue May 12, 2009 5:43 pm

Thanks Loic - that explains it!

Post by **kketterman** » Tue Jun 09, 2009 3:18 pm

Either I'm not understanding the whitelist, or it's not working properly on my test.

I'm defining my whitelist as

Code:

charWhitelist = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ-/\()"

I only want alphanumeric results with a limited set of symbols.

I am getting quite a few unwanted characters.

I get my image from a screenshot

Code:

screenID = pImaging.CreateGdPictureImageFromHwnd(pImaging.GetDesktopHwnd)

I then scale the image and convert to a bitonal image

Code:

pImaging.Scale(screenID, 200, InterpolationMode.InterpolationModeHighQualityBicubic)
pImaging.ConvertTo1Bpp(screenID, 165)
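To narrow down whether a whitelist is being honoured, it may help to list exactly which returned characters fall outside it. The following is a small Python sketch that works on the OCR result string only and makes no GdPicture calls. `unexpected_chars` is a hypothetical helper, the sample string is made up, and whitespace is ignored on the assumption that the engine emits spaces and line breaks regardless of the whitelist.

```python
# Whitelist copied from the post above (the backslash is escaped for Python).
WHITELIST = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ-/\\()"


def unexpected_chars(text, whitelist=WHITELIST, ignore=" \r\n\t"):
    """Return the sorted set of characters in text that are not whitelisted."""
    allowed = set(whitelist) | set(ignore)
    return sorted({c for c in text if c not in allowed})


# Made-up OCR output, used only to exercise the helper.
sample = "Order #12/A (ok); total=5"
offenders = unexpected_chars(sample)
print(offenders)
```

If the list of offenders is empty for real output, the unwanted results are whitelisted characters that were misread; if it is not empty, the whitelist is probably not being applied to that OCR call.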
