Clearing the "natural learning algorithm"
Hi
I have been having problems performing OCR on multiple documents. The results differ depending on the order in which the documents are processed.
I believe this is caused by the “natural learning algorithm” employed by the Tesseract engine, as mentioned in other posts.
I am using the ActiveX version, which does not have an option to “clear” the “learnt” information. As a result, I guess, it learns from previous documents that can have different sizes, fonts, quality, etc. The results then vary, apparently at random and quite significantly, according to what has been read before.
This is far from ideal – you really want to receive the same OCR results each time the same document is read!
Has this been fixed in a later version or does a workaround exist to get over this problem?
Thanks
Nigel
Re: Clearing the "natural learning algorithm"
Hi Nigel,
This bug is known to the Tesseract OCR development team.
We found a workaround in GdPicture.NET, but we were unable to include it in the ActiveX editions of GdPicture.
Unfortunately, I can't do anything for now. I just hope this bug will be fixed as soon as possible.
Kind regards,
Loïc
Re: Clearing the "natural learning algorithm"
Thanks Loic
I hope so too!
Cheers
Nigel
Re: Clearing the "natural learning algorithm"
Hi Loic
I have come up with a possible workaround for this. It is not ideal because it increases the amount of processing, but it does appear to clear what the algorithm has learnt.
We are typically processing a number of documents (e.g. a batch of invoices). The solution I have come up with is to late bind the cimaging control and then destroy and recreate it between each page/document. This means that the object has to be created for each page, and for a multipage document the document has to be reloaded for each page, which is a bit of an overhead!
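To make the idea concrete, here is a minimal sketch in Python rather than ActiveX code. It does not use the GdPicture or Tesseract API: `OcrEngine`, `FakeAdaptiveEngine`, `ocr_with_shared_engine` and `ocr_with_fresh_engine` are all hypothetical names made up for illustration, and the fake engine only imitates state that carries over from one page to the next. The point is just the pattern of creating the engine, reading one page, and destroying it again.

```python
from typing import Callable, Iterable, List, Protocol


class OcrEngine(Protocol):
    """Hypothetical interface; not the real GdPicture or Tesseract API."""

    def recognize(self, page: str) -> str: ...

    def close(self) -> None: ...


class FakeAdaptiveEngine:
    """Stand-in engine whose output depends on how many pages it has seen.

    This only imitates an engine that keeps learnt state between pages.
    """

    def __init__(self) -> None:
        self._pages_seen = 0

    def recognize(self, page: str) -> str:
        result = f"{page}|adapted={self._pages_seen}"
        self._pages_seen += 1  # state leaks into the next call
        return result

    def close(self) -> None:
        self._pages_seen = 0


def ocr_with_shared_engine(
    pages: Iterable[str], make_engine: Callable[[], OcrEngine]
) -> List[str]:
    """One engine for the whole batch: results depend on page order."""
    engine = make_engine()
    try:
        return [engine.recognize(page) for page in pages]
    finally:
        engine.close()


def ocr_with_fresh_engine(
    pages: Iterable[str], make_engine: Callable[[], OcrEngine]
) -> List[str]:
    """New engine per page: each page is read with no carried-over state."""
    results = []
    for page in pages:
        engine = make_engine()  # create
        try:
            results.append(engine.recognize(page))
        finally:
            engine.close()  # destroy before the next page
    return results
```

With the shared engine the second page is read by an engine that has already adapted to the first. With a fresh engine per page, every page gets the same starting state regardless of processing order, at the cost of creating the object again each time.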
I ran a test on a 24 page document and received significantly different results using this technique compared to just OCR'ing each page in turn.
Can you confirm whether this method clears the "natural learning algorithm", and whether that is why I am seeing different results?
Also, any idea when this problem will be fixed? (If it is soon, I will not change all my code!)
Cheers
Nigel
Re: Clearing the "natural learning algorithm"
Hi Nigel,
Your solution is good, and it is the one we implemented in GdPicture.NET.
As for the Tesseract bug, I think the team fixed it in a current beta release. However, we have not tried it because we are waiting for a stable release.
Kind regards,
Loïc
Re: Clearing the "natural learning algorithm"
I wish to turn off this learning feature.
You mentioned a fix in a Tesseract beta release. Do you know where I can get that version, and how I can turn off the learning feature?
Thanks a lot in advance,
Jawa