My application runs OCR on scanned-image PDFs to create searchable PDF/A output. However, I have noticed that the output of the code below takes up 10x or more the disk space of the original. Can anyone explain why? Am I missing a step somewhere?
Attached is a sample PDF that goes from 12 KB before processing to 700 KB after.
Thanks,
Leo
Code:
Dict = "eng"
'start the OCR output PDF
PdfID = oGdPictureImaging.PdfOCRStart(OutputFilePath, True, "", "", "", "", "DocDigester")
oGdPictureImaging.OCRTesseractSetPassCount(2)
If InputPDF.LoadFromFile(pdfPath, False) = GdPicture.GdPictureStatus.OK Then
    For i As Integer = 1 To InputPDF.GetPageCount()
        InputPDF.SelectPage(i)
        'render the whole page to a raster image at 200 DPI
        ImageID = InputPDF.RenderPageToGdPictureImage(200, True)
        'read the bit depth of the image embedded in the source page
        curPageImage = InputPDF.ExtractPageImage(i)
        inPgPD = myPage.GetBitDepth(curPageImage)
        'convert the rendered image to the same bit depth as the source image
        Select Case inPgPD
            Case 1
                oGdPictureImaging.ConvertTo1Bpp(ImageID) 'B/W
            Case 8
                oGdPictureImaging.ConvertTo8BppGrayScale(ImageID) 'grayscale
            Case 24
                'do nothing default is 3x8bit color
            Case Else
                oGdPictureImaging.ConvertTo1Bpp(ImageID) 'B/W
        End Select
        'OCR the rendered image and append it to the output PDF
        Dim pgText As String = oGdPictureImaging.PdfAddGdPictureImageToPdfOCR(PdfID, ImageID, Dict, sciroot & "docdigester\bin\win", "")
        oGdPictureImaging.ReleaseGdPictureImage(ImageID)
        oGdPictureImaging.ReleaseGdPictureImage(curPageImage)
    Next i
Else
    'report out reason for problem.
    Dim errCode As Integer = InputPDF.GetStat()
End If
InputPDF.CloseDocument()
oGdPictureImaging.PdfOCRStop(PdfID)
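
In case it helps with diagnosis, here is a rough sketch of how much uncompressed pixel data a page carries once it is rendered at 200 DPI, as the RenderPageToGdPictureImage(200, True) call above does. The 8.5 x 11 inch page size is an assumption (the sample's real page size may differ), the module and function names are made up for illustration, and the sketch does not model whatever compression is applied when the image is written to the PDF.

```vbnet
Imports System

Module RasterSizeEstimate
    ' Uncompressed size in bytes of a page raster at the given DPI and bit depth.
    Function RawBytes(widthIn As Double, heightIn As Double, dpi As Integer, bitsPerPixel As Integer) As Long
        Dim widthPx As Long = CLng(Math.Round(widthIn * dpi))
        Dim heightPx As Long = CLng(Math.Round(heightIn * dpi))
        ' Each row is padded up to a whole number of bytes.
        Dim bytesPerRow As Long = (widthPx * bitsPerPixel + 7) \ 8
        Return bytesPerRow * heightPx
    End Function

    Sub Main()
        ' Assumed page size (US Letter); not taken from the sample file.
        Const PageWidthIn As Double = 8.5
        Const PageHeightIn As Double = 11.0
        Const Dpi As Integer = 200   ' same DPI as the render call in the post

        ' The three bit depths handled by the Select Case in the post.
        For Each depth As Integer In New Integer() {1, 8, 24}
            Console.WriteLine("{0,2} bpp: {1:N0} bytes uncompressed", depth, RawBytes(PageWidthIn, PageHeightIn, Dpi, depth))
        Next
    End Sub
End Module
```

Comparing these figures with the pixel dimensions and bit depth of the image embedded in the original PDF would show whether the rendered page simply contains more pixel data than the source scan, or whether the growth comes from how the output is compressed.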