Hi Loïc,
thank you for your reply.
Sure, you're right: on the one hand it is surely easy, and I understand that the Tesseract engine comes from a Google project and is not one of your major programming tasks. But you are the ones providing an SDK with it. So if everyone has to do this "simple programming", why not offer it as a simpler method/class in the next update, like ".OCRTesseractGetWordCount"?
Here is my little bit of programming:
Before GDPicture, I used to store the OCR results in an array of this simple structure
Code:
Public Structure OCRDataStruct
    Public Coord As RectangleF
    Public Text As String
    Public Confidence As Double
End Structure
since I, like everybody else, want to know the coordinates of the bounding box of a certain word.
And there is the first problem: how to get the bounding box?
So, many months ago, I wrote this little helper routine that builds the words from the chars by splitting on white space, as you described in your post, and also derives the bounding box from the existing data:
Code:
' Build WordList with Coordinates
Dim wordList As New List(Of OCRDataStruct), word As String, newWord As OCRDataStruct
Dim maxBottom As Long, maxRight As Long
For i = 1 To tOCRGdPictureImaging.OCRTesseractGetCharCount
    If i = 1 Then
        ' First char: start the first word at its top-left corner
        newWord.Text = ""
        newWord.Coord = New RectangleF(tOCRGdPictureImaging.OCRTesseractGetCharLeft(i), tOCRGdPictureImaging.OCRTesseractGetCharTop(i), 0, 0)
    Else
        If tOCRGdPictureImaging.OCRTesseractGetCharSpaces(i) Then
            ' White space before this char: close the current word and store it
            newWord.Text = word
            newWord.Coord = New RectangleF(newWord.Coord.Left, newWord.Coord.Top, maxRight - newWord.Coord.Left, maxBottom - newWord.Coord.Top)
            wordList.Add(newWord)
            ' Start the next word at the top-left corner of this char
            newWord.Text = ""
            newWord.Coord = New RectangleF(tOCRGdPictureImaging.OCRTesseractGetCharLeft(i), tOCRGdPictureImaging.OCRTesseractGetCharTop(i), 0, 0)
            word = ""
            maxBottom = 0
            maxRight = 0
        End If
    End If
    ' Append the char and extend the right/bottom edge of the current word
    word += ChrW(tOCRGdPictureImaging.OCRTesseractGetCharCode(i))
    maxBottom = Math.Max(maxBottom, tOCRGdPictureImaging.OCRTesseractGetCharBottom(i))
    maxRight = Math.Max(maxRight, tOCRGdPictureImaging.OCRTesseractGetCharRight(i))
Next
' Store the last word (the top edge is always taken from the word's first char)
newWord.Text = word
newWord.Coord = New RectangleF(newWord.Coord.Left, newWord.Coord.Top, maxRight - newWord.Coord.Left, maxBottom - newWord.Coord.Top)
wordList.Add(newWord)
This works quite well, at least for me, and produces output similar to other OCR engines we tried and used in the past (e.g. Pegasus, FineReader). I know a little more programming is necessary if you want to provide the "For Each" feature, but most of this code should meet the requirements.
BUT:
I see the disadvantage in the missing confidence for the words. Since I have only skimmed the Tesseract article, I cannot tell where the Tesseract engine gets its confidence for a certain char.
For example: the simple word "Look". The uppercase "L" and the lowercase "k" are chars that will be recognized quite easily.
But the double lowercase "o" can also be interpreted as two small zeros "0". So on a single-character basis there is a real chance of deciding that these are zeros instead of the letter "o", with a confidence of e.g. 60:40. The confidence for "L00k" instead of "Look" would be much lower if a dictionary were used on a word basis instead of single-character recognition.
I thought there is a dictionary that is used for the OCR recognition on a word basis. And if that is so, why not deliver these results, too?
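To make the "Look" versus "L00k" point concrete, here is a minimal sketch of a word-level rating. It does not call the SDK at all: the function name WordConfidence, the 0..100 confidence scale, the "weakest char" rule and the cap of 50 for unknown words are all my own assumptions for illustration, not how Tesseract or the SDK actually computes anything.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq

Module WordConfidenceSketch
    ' Hypothetical helper: rates a word by its weakest character and caps the
    ' rating when the word is not found in the dictionary.
    ' Assumes confidences on a 0..100 scale and at least one character.
    Function WordConfidence(text As String,
                            charConfidences As IEnumerable(Of Double),
                            dictionary As HashSet(Of String),
                            Optional unknownWordCap As Double = 50.0) As Double
        Dim weakest As Double = charConfidences.Min()
        If dictionary.Contains(text) Then Return weakest
        ' Unknown word: never report more than the cap
        Return Math.Min(weakest, unknownWordCap)
    End Function

    Sub Main()
        Dim dict As New HashSet(Of String)(StringComparer.OrdinalIgnoreCase) From {"look"}
        Dim conf As Double() = {95.0, 60.0, 60.0, 90.0}
        Console.WriteLine(WordConfidence("Look", conf, dict)) ' prints 60
        Console.WriteLine(WordConfidence("L00k", conf, dict)) ' prints 50
    End Sub
End Module
```

With the same per-character values, the dictionary word keeps its rating while the non-word is pushed down, which is the kind of result I would like to get from the engine directly.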
By the way, if a word is split because it is too long for the rest of the line, e.g.
"........ swinging his long-
sword over his.......", splitting on white space will not do the job. The only solution is a dictionary for these cases.
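A minimal sketch of such a dictionary check for the line-break case follows. Again it is independent of the SDK; the name JoinAcrossLineBreak and the fallback rule (keep the hyphen when the merged word is unknown) are assumptions of mine, and the caller is assumed to already know which word ends one line and which word starts the next.

```vbnet
Imports System
Imports System.Collections.Generic

Module HyphenJoinSketch
    ' Hypothetical helper: merges a line-ending fragment such as "long-" with
    ' the first word of the next line. Returns Nothing if there is no trailing
    ' hyphen, i.e. nothing to join.
    Function JoinAcrossLineBreak(lineEnd As String,
                                 nextLineStart As String,
                                 dictionary As HashSet(Of String)) As String
        If Not lineEnd.EndsWith("-", StringComparison.Ordinal) Then Return Nothing
        Dim stem As String = lineEnd.Substring(0, lineEnd.Length - 1)
        Dim merged As String = stem & nextLineStart
        ' Known word without the hyphen: the hyphen was only a line-break mark
        If dictionary.Contains(merged) Then Return merged
        ' Otherwise assume the hyphen belongs to the word
        Return lineEnd & nextLineStart
    End Function

    Sub Main()
        Dim dict As New HashSet(Of String)(StringComparer.OrdinalIgnoreCase) From {"longsword"}
        Console.WriteLine(JoinAcrossLineBreak("long-", "sword", dict))  ' prints longsword
        Console.WriteLine(JoinAcrossLineBreak("well-", "known", dict))  ' prints well-known
    End Sub
End Module
```

Note that a word joined this way covers two lines, so a single bounding box is no longer enough; one rectangle per fragment would have to be kept.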
Every professional programmer who is not only trying to make searchable PDF files will need this functionality, because decisions are made on a word basis, e.g. whether keywords are found at defined positions or not.
Thank you very much for your patience and for the update for the PDF and MRC-jpgs.
EF