OCR on TIF converted from PDF
Hi,
We need to extract text from a scanned PDF document using English-language OCR. We have been able to do that with GdPicture, but a lot of the extracted text is not correct.
We thought we might get better results if we converted the PDF to TIF first and then ran OCR on the TIF. The results were a little better than before, but the text still contained many inaccuracies.
Then we tried converting the PDF to TIF using a separate product called 2Tiff. When we ran GdPicture OCR on that TIF, the results were much more accurate.
I have attached the original TIF files and their results.
Could you please tell us what 2Tiff does during the conversion that GdPicture does not, given that the same GdPicture OCR engine gives worse results on the TIF produced by GdPicture? Is there a way to improve GdPicture's PDF-to-TIF conversion?
Example files
https://drive.google.com/file/d/1mNfOCZ ... sp=sharing
Thanks
Ajit
Re: OCR on TIF converted from PDF
Hello,
Could you please provide the exact code snippet you are using for OCR so that we can replicate your issue? We do not know what 2Tiff is doing. To provide support for the GdPicture.NET toolkit, we need to reproduce the issue using the current release. Then we can investigate it further.
Thank you for your understanding. We look forward to receiving the code and the exact steps to replicate the issue.
Re: OCR on TIF converted from PDF
Hi, below is the function that runs OCR on a TIF file and writes the extracted text to a text file.
Code:
Private Function ConvertTifToOCR(TifFilename As String, textFilename As String) As Boolean
    Dim success As Boolean = False
    Dim inputTifObj As GdPictureImaging = New GdPictureImaging()
    ' Default to 1 so that a single-page TIF is still processed.
    Dim pageCount As Integer = 1
    Dim imageID As Integer = inputTifObj.CreateGdPictureImageFromFile(TifFilename)
    If inputTifObj.GetStat() = GdPictureStatus.OK Then
        Dim isMultiPage As Boolean = inputTifObj.TiffIsMultiPage(imageID)
        If isMultiPage Then
            pageCount = inputTifObj.TiffGetPageCount(imageID)
        End If
        Dim ocrObj As GdPictureOCR = New GdPictureOCR()
        ocrObj.ResourceFolder = "C:\GdPicture.NET 14\Redist\OCR"
        ocrObj.CharacterSet = ""
        ocrObj.AddLanguage(OCRLanguage.English)
        Dim resID As String = "page"
        Dim content As String = Nothing
        Dim stream As System.IO.StreamWriter = New System.IO.StreamWriter(textFilename)
        For i As Integer = 1 To pageCount
            ' Page selection only applies to multi-page TIFs.
            If isMultiPage Then
                inputTifObj.TiffSelectPage(imageID, i)
            End If
            If ocrObj.SetImage(imageID) = GdPictureStatus.OK Then
                ocrObj.OCRMode = OCRMode.FavorAccuracy
                ocrObj.RunOCR(resID)
                If ocrObj.GetStat() = GdPictureStatus.OK Then
                    content = ocrObj.GetOCRResultText(resID)
                    If ocrObj.GetStat() = GdPictureStatus.OK Then
                        ' Separate the text of each page with a form feed.
                        stream.WriteLine(content & vbFormFeed & vbCrLf)
                    End If
                Else
                    MessageBox.Show("The OCR didn't process. Error: " & ocrObj.GetStat().ToString())
                End If
            Else
                MessageBox.Show("The image can't be set. Error: " & ocrObj.GetStat().ToString())
            End If
            ocrObj.ReleaseOCRResult(resID)
        Next
        stream.Close()
        inputTifObj.ReleaseGdPictureImage(imageID)
        ocrObj.Dispose()
        MessageBox.Show("Tif file processed through OCR")
        success = True
    Else
        MessageBox.Show("The Tif file can't be opened. Error: " & inputTifObj.GetStat().ToString())
    End If
    ' Dispose the imaging object on both the success and the failure path.
    inputTifObj.Dispose()
    Return success
End Function
Re: OCR on TIF converted from PDF
Hello,
I would like to explain some more details about OCR. From what I see, you saved the scanned pages in a PDF document. The GdPictureOCR class works on the scanned image, so I would recommend scanning directly to TIFF. Next, you need to scan at an appropriate DPI so that the scanned page is readable. You can also improve the accuracy of the OCR text by using a different set of language data files; for further details, read here:
https://github.com/tesseract-ocr/tesser ... Data-Files
There are different language files for fast OCR and for accurate OCR. Finally, the OCR text will be more accurate when you run OCR on regions than on whole pages. I hope this helps.
Re: OCR on TIF converted from PDF
We get PDFs from third-party sources that need to be OCR'd, so scanning directly to TIF is out of the question.
Running GdPicture OCR on PDFs produced the worst results in terms of text accuracy.
Running GdPicture OCR on TIF converted from PDF using GdPicture produced better results in terms of accuracy.
Running GDPicture OCR on TIF converted from PDF using 2Tiff produced best results in terms of text accuracy.
We are definitely using the accurate OCR trained files.
Re: OCR on TIF converted from PDF
Hello,
Here is an interesting source that can be useful:
https://github.com/tesseract-ocr/tesser ... oveQuality
Thank you also for creating a support ticket.
Re: OCR on TIF converted from PDF
Hi,
We have finally figured out that the source PDF has internal page rotation. After correcting this with the NormalizePage() method, the OCR results are excellent and there is no need to convert to TIFF.
Maybe this will help others too.
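For anyone who wants to try the same fix, here is a minimal sketch of normalizing every page of a PDF before running OCR on it. It is an illustration only, not the exact code we use. The helper name NormalizePdfPages and the output file are hypothetical, the GdPicture14 namespace is assumed from the "GdPicture.NET 14" install path shown earlier in the thread, and the signatures of LoadFromFile, GetPageCount, SelectPage, NormalizePage and SaveToFile are assumed from the GdPicturePDF class, so please check them against the reference for your version.

```vbnet
Imports GdPicture14 ' Assumption: namespace of GdPicture.NET 14

Module PdfNormalization
    ' Hypothetical helper: normalizes each page of inputPdf (removing the
    ' internal page rotation described above) and saves the result to outputPdf.
    Function NormalizePdfPages(inputPdf As String, outputPdf As String) As Boolean
        Using pdf As New GdPicturePDF()
            ' Assumed signature: LoadFromFile(FilePath, LoadInMemory)
            If pdf.LoadFromFile(inputPdf, False) <> GdPictureStatus.OK Then
                Return False
            End If

            Dim pageCount As Integer = pdf.GetPageCount()
            For i As Integer = 1 To pageCount
                ' Pages are 1-based; stop at the first failure.
                If pdf.SelectPage(i) <> GdPictureStatus.OK Then
                    Return False
                End If
                If pdf.NormalizePage() <> GdPictureStatus.OK Then
                    Return False
                End If
            Next

            ' Save to a new file so that the third-party original is untouched.
            Dim saveStatus As GdPictureStatus = pdf.SaveToFile(outputPdf)
            pdf.CloseDocument()
            Return saveStatus = GdPictureStatus.OK
        End Using
    End Function
End Module
```

The normalized copy can then go through the same OCR step as before. Writing to a separate output file is a design choice that keeps the source PDF available for comparison.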