Any known OCR issues when running in Japanese environment?

Post by **mjnaleva** » Fri Aug 07, 2009 3:34 pm

Hi,

I am getting feedback that the OCR functionality within GdPicture does not perform well (if at all) when running within a Japanese Windows XP environment. We are in the process of upgrading to the latest GdPicture/OCR plugin to see if that helps, but I wanted to see if there were any known problems out there. The file being used contains English text and works fine when processed in an English Windows XP environment. The same language files should be used regardless of environment, so it is unclear why this behavior occurs.

I am not sure yet if the underlying problem originates within Tesseract or GDPicture.

Thanks,
Mark

Post by **Loïc** » Sat Aug 08, 2009 10:12 pm

Hi Mark,

This kind of issue is not known. However, if you are not using the latest version of the engine, I highly recommend upgrading to it, as it fixes several bugs.

Kind regards,

Loïc

Post by **mjnaleva** » Fri Aug 21, 2009 6:07 am

Hi Loïc,

After further investigation, I am coming to the conclusion that Tesseract OCR is limited within GdPicture to ASCII characters. Is this correct?

And even within the single-byte range (0x00 to 0xFF), the upper 128 values (0x80 to 0xFF, which are strictly outside ASCII and depend on the active code page) might not map correctly when run in different language environments on Windows, which is the case when running in Japanese. For example, we have problems with the +/- character (0xB1 in the Western code page). In a Japanese Windows environment, this byte value maps to a Japanese-specific character, but GdPicture's DoOCR() method returns a '<' symbol. I have also seen cases where the returned character is '/'.
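The code page dependence of byte 0xB1 can be shown with a minimal Python sketch. This is not GdPicture or Tesseract code; it only assumes that the English system uses Windows code page 1252 and the Japanese system uses code page 932, and the helper name `decode_single_byte` is made up for illustration.

```python
import unicodedata

# The same single byte, interpreted under two Windows ANSI code pages.
RAW = bytes([0xB1])


def decode_single_byte(raw: bytes, codepage: str) -> str:
    """Decode raw bytes under the named code page (Python codec name)."""
    return raw.decode(codepage)


# Assumption: cp1252 stands in for an English Windows XP system,
# cp932 for a Japanese one.
western = decode_single_byte(RAW, "cp1252")
japanese = decode_single_byte(RAW, "cp932")

for label, ch in (("cp1252", western), ("cp932", japanese)):
    # Print code point and Unicode name so the output stays ASCII-only.
    print(f"{label}: U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Under these assumptions the byte decodes to U+00B1 (PLUS-MINUS SIGN) on the Western code page and to U+FF71 (HALFWIDTH KATAKANA LETTER A) on the Japanese one, so any step that passes the character through the system's ANSI code page can change its meaning.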

So, I tried doing OCR of the same text using Tesseract's standalone utility and it worked beautifully in both English and Japanese environments. The +/- character was consistently represented in the output.

I suspect the difference in behavior arises because Tesseract is set up to process characters using Unicode values, whereas GdPicture attempts some sort of conversion back to ASCII, which obviously can't work all the time.

Are there plans to extend character support in GdPicture outside of the ASCII code range? What do I need to do to formally request such an enhancement?

Thanks,
Mark

Post by **Loïc** » Wed Aug 26, 2009 11:07 am

Hi Mark,

Do you have an image that reproduces this behavior?

"Are there plans to extend character support in GdPicture outside of the ASCII code range?"

We are thinking about that.

Kind regards,

Loïc

Post by **mjnaleva** » Wed Sep 02, 2009 4:09 am

I'm using a custom dictionary because I needed the OCR engine to pick up the +/- character more reliably, so I don't have a quick and easy way of providing an example. But I suspect that if you tried OCR captures on an image containing characters in the upper 128 values of the single-byte range (characters like +/- or the degree symbol), you would see similar behavior with the Windows environment set up for Japanese or some other Far East language. I would just create an image from a Word document that has some of those characters.

On a related note, I noticed that any time I set up my custom dictionary to map text to a character outside of the ASCII code range, the string returned from GdPicture would contain a ? wherever the out-of-range character should have been.
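A ? in place of each out-of-range character is what a lossy Unicode-to-ASCII conversion with a replacement character would produce. The sketch below only illustrates that general effect in Python; whether GdPicture performs such a conversion internally is an assumption, and `to_ascii_lossy` is a hypothetical helper, not part of any library.

```python
def to_ascii_lossy(text: str) -> str:
    """Convert text to ASCII, substituting '?' for every character
    that has no ASCII representation."""
    return text.encode("ascii", errors="replace").decode("ascii")


# Sample text with a plus-minus sign (U+00B1) and a degree sign (U+00B0).
sample = "10 \u00b1 2 \u00b0C"

print(to_ascii_lossy(sample))  # 10 ? 2 ?C
```

Plain ASCII text passes through unchanged, which would be consistent with English-only documents appearing to work while symbols such as +/- and the degree sign are lost.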

Mark
