Any known OCR issues when running in Japanese environment?
Hi,
I am getting feedback that the OCR functionality within GdPicture does not perform well (if at all) when running in a Japanese Windows XP environment. We are in the process of upgrading to the latest GdPicture/OCR plugin to see if that helps, but I wanted to see if there were any known problems out there. The file being used contains English text and works fine when processed in an English Windows XP environment. The same language files should be used regardless of environment, so it's unclear why this behavior is being seen.
I am not sure yet if the underlying problem originates within Tesseract or GDPicture.
Thanks,
Mark
Re: Any known OCR issues when running in Japanese environment?
Hi Mark,
This kind of issue is not known. However, if you are not using the latest version of the engine, I highly recommend upgrading to it, since it fixes several bugs.
Kind regards,
Loïc
Re: Any known OCR issues when running in Japanese environment?
Hi Loïc,
After further investigation, I am coming to the conclusion that Tesseract OCR is limited within GdPicture to ASCII characters. Is this correct?
And even within the single-byte code range (0x00 to 0xFF), the upper 128 characters (0x80 to 0xFF, which lie outside 7-bit ASCII and depend on the active Windows code page) might not map correctly when run in different language environments on Windows, which is the case when running in Japanese. For example, we have problems with the +/- character (±, 0xB1 in the Western Windows code page). On Windows in a Japanese environment, this byte maps to a Japanese-specific character, but the DoOCR() method in GdPicture returns a '<' symbol. I have also seen scenarios where the returned character is '/'.
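To illustrate the code-page dependence described above, here is a minimal Python sketch (not GdPicture code). It assumes the English environment uses code page 1252 and the Japanese one uses code page 932, the usual Windows defaults; the variable names are only for illustration.

```python
# The single byte 0xB1 has no fixed meaning: it depends on the code page.
raw = bytes([0xB1])

# Assumption: English Windows uses code page 1252, Japanese Windows uses 932.
western = raw.decode("cp1252")   # PLUS-MINUS SIGN, U+00B1
japanese = raw.decode("cp932")   # HALFWIDTH KATAKANA LETTER A, U+FF71

print(f"cp1252: U+{ord(western):04X}")
print(f"cp932:  U+{ord(japanese):04X}")
```

The same byte therefore reads as +/- on one machine and as a katakana letter on the other, which is why any step that passes text around as single bytes is sensitive to the system language.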
So, I tried doing OCR of the same text using Tesseract's standalone utility and it worked beautifully in both English and Japanese environments. The +/- character was consistently represented in the output.
I suspect the difference in behavior arises because Tesseract is set up to process characters using Unicode values, but GdPicture attempts some sort of conversion back to ASCII, which obviously can't work all the time.
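This suspicion can be sketched in Python. It is only a model of the hypothesis, not GdPicture's actual implementation: it assumes the engine emits UTF-8 bytes and that some later step reads them through the system code page (1252 for English, 932 for Japanese). The function name `decode_engine_output` is hypothetical.

```python
def decode_engine_output(data: bytes, encoding: str) -> str:
    """Decode raw OCR output bytes using the given encoding."""
    return data.decode(encoding, errors="replace")

# UTF-8 encoding of the plus-minus sign (U+00B1) is the two bytes C2 B1.
utf8_bytes = "\u00b1".encode("utf-8")

as_unicode = decode_engine_output(utf8_bytes, "utf-8")    # the intended character
as_english = decode_engine_output(utf8_bytes, "cp1252")   # stray character, then +/-
as_japanese = decode_engine_output(utf8_bytes, "cp932")   # two halfwidth katakana

for label, text in [("utf-8", as_unicode),
                    ("cp1252", as_english),
                    ("cp932", as_japanese)]:
    print(label, [f"U+{ord(ch):04X}" for ch in text])
```

This does not reproduce the exact '<' or '/' results I am seeing, so the real conversion path is probably different, but it shows how the same output bytes can come out differently depending on the Windows language setting.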
Are there plans to extend character support in GdPicture outside of the ASCII code range? What do I need to do to formally request such an enhancement?
Thanks,
Mark
Re: Any known OCR issues when running in Japanese environment?
Hi Mark,
Do you have an image generating this behavior?
Regarding "Are there plans to extend character support in GdPicture outside of the ASCII code range?": we are thinking about that.
Kind regards,
Loïc
Re: Any known OCR issues when running in Japanese environment?
I'm using a custom dictionary because I needed the OCR engine to recognize the +/- character more reliably, so I don't have a quick and easy way of providing an example. But I suspect that if you tried OCR captures on an image containing characters in the upper 128 code range (0x80 to 0xFF, characters like +/- or the degree symbol), you would see similar behavior with the Windows environment set up for Japanese or some other Far East language. I would just create an image from a Word document that has some of those characters.
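One way to get the test characters for such a document is to list them programmatically. A minimal Python sketch, assuming the Western code page 1252 defines the upper range of interest; `upper_half_characters` is a name made up for this example.

```python
def upper_half_characters(encoding: str = "cp1252") -> str:
    """Return the characters that bytes 0x80-0xFF decode to in `encoding`.

    Bytes with no assigned character in that code page are skipped.
    """
    chars = []
    for value in range(0x80, 0x100):
        try:
            chars.append(bytes([value]).decode(encoding))
        except UnicodeDecodeError:
            continue  # unassigned byte in this code page
    return "".join(chars)

sample = upper_half_characters()
print(len(sample), "characters available for the test document")
```

The resulting string can be saved as UTF-8 text, pasted into the document, and rendered to an image for the OCR test.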
On a related note, I noticed that any time I set up my custom dictionary to map text to a character outside of the ASCII code range, the string returned from GdPicture would contain a ? wherever the out-of-range character should have been.
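That '?' is what a lossy conversion to a narrower character set typically produces. A minimal Python sketch of the effect, offered as an analogy rather than a description of GdPicture internals; `to_ascii_lossy` is a hypothetical helper.

```python
def to_ascii_lossy(text: str) -> str:
    """Convert text to ASCII, replacing unrepresentable characters with '?'."""
    return text.encode("ascii", errors="replace").decode("ascii")

# Sample text containing the plus-minus sign and the degree sign.
recognized = "5 \u00b1 0.1 mm at 20 \u00b0C"
converted = to_ascii_lossy(recognized)
print(converted)
```

Every character above 0x7F is replaced by a single '?', which matches what I see in the returned string.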
Mark