Document sizes after PdfAddGdPictureImageToPdfOCR

Post by **lbleicher** » Thu Dec 22, 2011 7:13 pm

Hi-

My application executes OCR on scanned image PDFs to create searchable PDF/A output. However, I have noticed that the output of the code below takes as much as 10x the disk space of the original. Can anyone explain why? Am I missing a step somewhere?

Attached is a sample PDF that goes from 12k before the process to 700k after.

Thanks,
Leo

Code:

        Dict = "eng"

        PdfID = oGdPictureImaging.PdfOCRStart(OutputFilePath, True, "", "", "", "", "DocDigester")
        oGdPictureImaging.OCRTesseractSetPassCount(2)

        If InputPDF.LoadFromFile(pdfPath, False) = GdPicture.GdPictureStatus.OK Then
            For i As Integer = 1 To InputPDF.GetPageCount()
                InputPDF.SelectPage(i)
                ImageID = InputPDF.RenderPageToGdPictureImage(200, True)

                curPageImage = InputPDF.ExtractPageImage(i)
                inPgPD = myPage.GetBitDepth(curPageImage)
                Select Case inPgPD
                    Case 1
                        oGdPictureImaging.ConvertTo1Bpp(ImageID) 'B/W 
                    Case 8
                        oGdPictureImaging.ConvertTo8BppGrayScale(ImageID) 'grayscale
                    Case 24
                        'do nothing default is 3x8bit color
                    Case Else
                        oGdPictureImaging.ConvertTo1Bpp(ImageID) 'B/W 
                End Select

                Dim pgText As String = oGdPictureImaging.PdfAddGdPictureImageToPdfOCR(PdfID, ImageID, Dict, sciroot & "docdigester\bin\win", "")
                oGdPictureImaging.ReleaseGdPictureImage(ImageID)
                oGdPictureImaging.ReleaseGdPictureImage(curPageImage)
            Next i
        Else
            'report out reason for problem.
            Dim errCode As Integer = InputPDF.GetStat()
        End If
        InputPDF.CloseDocument()
        oGdPictureImaging.PdfOCRStop(PdfID)

Post by **Loïc** » Thu Dec 22, 2011 7:22 pm

Hello Leo,

If your input PDF is image-based, you should consider replacing:

Code:

ImageID = InputPDF.RenderPageToGdPictureImage(200, True)

with:

Code:

ImageID = InputPDF.RenderPageToGdPictureImageEx(200, True)

Let me know if this is better.

Kind regards,

Loïc

Post by **lbleicher** » Fri Jan 13, 2012 7:58 pm

Hi Loic-

Thanks for the suggestion, but that does not help. I already had a Select Case statement to convert back to the original bit depth (though the RenderPageToGdPictureImageEx method is a better way).

I still have this 11k input pdf coming out as 1148k!!!

Is it possible that the JPEG compression is not being applied? Could this be a result of generating the output as a PDF/A?

How could I make sure compression is being applied to the PDF created by the PdfOCRStart statement?

Thanks,
Leo
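
For a sense of scale on the sizes discussed above, here is a rough sketch of how large an uncompressed page raster is when rendered at 200 DPI (the resolution used in the code above) at the three bit depths handled by the Select Case block. The US Letter page size (8.5 x 11 in) and the `RawRasterBytes` helper are assumptions made for illustration; neither comes from the thread, and the sketch says nothing about which compression GdPicture actually applies to the output.

```vb
Imports System

Module RasterSizeSketch
    ' Hypothetical helper: uncompressed size in bytes of a page raster.
    ' Each pixel row is rounded up to a whole number of bytes.
    Function RawRasterBytes(widthInches As Double, heightInches As Double,
                            dpi As Integer, bitsPerPixel As Integer) As Long
        Dim widthPx As Long = CLng(Math.Round(widthInches * dpi))
        Dim heightPx As Long = CLng(Math.Round(heightInches * dpi))
        Dim bytesPerRow As Long = (widthPx * bitsPerPixel + 7) \ 8
        Return bytesPerRow * heightPx
    End Function

    Sub Main()
        ' Assumed page size: US Letter, rendered at 200 DPI as in the thread.
        For Each depth As Integer In New Integer() {1, 8, 24}
            Console.WriteLine("{0} bpp: {1:N0} bytes", depth,
                              RawRasterBytes(8.5, 11.0, 200, depth))
        Next
    End Sub
End Module
```

Under these assumptions, even a 1 bpp raster is several hundred kilobytes before compression, so the final file size depends heavily on how the re-rendered page image is compressed when it is written to the new PDF.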
