PDF OCR + Compression, PDF Encryption + PDFA

Post by **dreynolds** » Thu Oct 29, 2009 4:21 pm

We’ve purchased gdPicture.NET Ultimate, and I’ve been developing a .NET app that processes input files and creates output in various formats.

When I create PDFs I noticed there are various methods that have to be used.

If you want searchable PDFs, you have to use the PDF OCR methods. These methods seem to ignore the color/bitonal compression and JPEG quality settings, and appear to always use JPEG compression at about 75% quality. Will this be addressed in future updates?

Also there seems to be no way to specify PDF/A when using the PDF encryption methods to start a PDF (with or without OCR). Are the PDFs generated with encryption always PDF/A or are they never PDF/A? Will this be addressed in future updates?

~ Don ~

Post by **Loïc** » Thu Oct 29, 2009 5:32 pm

Hi Don,

> These methods seem to ignore the color/bitonal compression and JPEG quality settings, and appear to always use JPEG compression at about 75% quality. Will this be addressed in future updates?

OK. In the next release, the compression settings will also apply to OCR PDFs.

> Also there seems to be no way to specify PDF/A when using the PDF encryption methods to start a PDF (with or without OCR). Are the PDFs generated with encryption always PDF/A or are they never PDF/A? Will this be addressed in future updates?

No. A PDF/A file must not be encrypted, so a PDF created with the encryption methods is never PDF/A.

Loïc

Post by **dreynolds** » Thu Oct 29, 2009 5:34 pm

Thanks for the quick reply!

Post by **Loïc** » Thu Oct 29, 2009 5:35 pm

You are welcome.

Post by **dreynolds** » Tue Dec 01, 2009 7:26 am

Good news for color compression with GdPicture 6.5.0!

If I use the color compression options when creating PDF OCR, it now works the same as when using non-OCR methods.

However, if I convert to 1bpp the filesize is always the same as using no compression for PDF or PDF OCR, regardless of the compression type used.

Post by **Loïc** » Tue Dec 01, 2009 11:11 am

Hi,

> However, if I convert to 1bpp the filesize is always the same as using no compression for PDF or PDF OCR, regardless of the compression type used.

Keep in mind that for 1bpp images only Flate and CCITT4 compression are supported. I suspect you tried JPEG.

Kind regards,

Loïc

Post by **dreynolds** » Tue Dec 01, 2009 9:09 pm

I actually tried all of the different PDFOCR compression types after converting a source image to 1bpp.
The filesize was always the same as the original 24bpp TIF with no compression.
Using the exact same files and code with color compression works perfectly.

Post by **Loïc** » Tue Dec 01, 2009 9:29 pm

Hi,

This is strange. What code are you using to create your PDFs?

Kind regards,

Loïc

Post by **dreynolds** » Tue Dec 01, 2009 11:44 pm

I had typed up two huge code blocks with lots of detail but I was logged out and lost all of it.

Here is a summary:

I use a large 24bpp two-page TIF for input. It doesn't matter whether you use OCR or not.
The output PDF opens and looks fine (black and white), but it has the size of the original uncompressed 24bpp TIF no matter which bitonal compression is used.
The exact same code on the same input TIF works as expected with color compression.

Using a single-page JPG input appears to work OK with 1bpp compression.
It may just be a problem with converting individual pages to 1bpp and adding them to a PDF with bitonal compression.

I create the PDF output, set the bitonal compression.
I open the multipage TIF input.
I get first page, convert to 1bpp with 128 thresh, add to PDF.
I then switch to second page, convert to 1bpp with 128 thresh, add to PDF.
(If not using OCR, for each page you have to get width/height, add page, draw image.)
Then I close the PDF and cleanup all the ImageIDs

Post by **dreynolds** » Thu Dec 10, 2009 4:44 pm

Partially good news. I tested this issue in GdPicture.Net 6.5.1 and the behavior is better, but not quite what I'd expect.

Before, if I took a 24bpp color two-page TIF, converted each page to 1bpp, and added it to a PDF (OCR or not didn't matter), the output would always be the same size as the original uncompressed 24bpp TIF (huge), no matter what type of bitonal compression I used for the PDF output.

Now the 1bpp PDF output is MUCH smaller (10% of the original 24bpp TIF), but it is always the same size regardless of which bitonal compression I use. In other words the CCITT4 compressed file is the same size as uncompressed and Flate compressed. Here is a screen shot:

This is acceptable and a big improvement from 6.5.0 and previous versions. I would expect to see sizes closer to the sizes of TIF output using the CCITT4 compression, and I would expect no compression PDF output to be larger than CCITT4 or Flate compression PDF output.

Post by **Loïc** » Thu Dec 10, 2009 5:21 pm

Hi,

I think you have an error in your code.

Please give me a code snippet that reproduces the behavior so I can investigate.

Kind regards,

Loïc

Post by **dreynolds** » Thu Dec 10, 2009 6:16 pm

My application is very large and complex (10,000 lines of code).
The basics of the relevant code are:

create the PDF output
set the bitonal compression (none, ccitt4, or flate)
open the 24bpp color 2-page multipage TIF input file
get first page, convert to 1bpp with 128 thresh, add to PDF
switch to second page, convert to 1bpp with 128 thresh, add to PDF with OCR method
finally close the PDF 
release all the ImageIDs

note: the same code with color compression works fine with the same TIF input file
Also the behavior changed from 6.5.0 and earlier to 6.5.1.
Before the bitonal pdf would have filesize of 100% of the input.
Now it has a filesize of 10% of the input file... without any code changes.

Do you have sample code that works? I can try that with my TIF.

Post by **Loïc** » Fri Dec 11, 2009 9:40 am

Hi Don,

I really need you to isolate your problem in a tiny code snippet so I can see what is wrong with your method. Otherwise we will lose a lot of time.

Write me a small function that reproduces your problem and I will give you a correction for it.

If you want a code snippet, please post in the code sample request section.

Thank you for your understanding,

Loïc

Post by **dreynolds** » Fri Dec 11, 2009 7:17 pm

Ok. I spent all day trying to isolate the problem. I finally figured it out. I have a large UI for setting all the various GdPicture parameters. Part of the UI allows you to resize, rotate, etc before creating output.

The bitonal compression will fail if ROTATE is called (even 0 degrees) after the input is converted to 1bpp. If rotate is called before the 1bpp conversion, the bitonal compression will work fine. Otherwise the 1bpp output will always have the same filesize as if bitonal compression was none.

Below is VB.NET code stripped from my large app to demonstrate. It is set up to output CCITT4, but it will fail because the rotate is called after the 1bpp conversion. Whichever of the three bitonal compression types you uncomment, the output file sizes will all be the same.

If you comment the rotate after the 1bpp conversion and uncomment the rotate before the 1bpp conversion, each bitonal compression type will cause different filesizes as expected.

Note: this only seemed to be an issue with multipage TIF input. It didn't seem to be an issue with JPG, BMP, etc. I didn't try multiframe GIF or multipage PDF input. (I doubt PDF input would be a problem because you have to use a gdviewer to get a gdimage for each page, so it would effectively be handled the same as a single page file type.)

```vbnet
'=======================
'INIT

Dim gdPicImg As New GdPictureImaging
Dim intGdPicImgInputID As Integer
Dim intGdPicImgOutputID As Integer

'-------------
'get TIF input file (24bpp, no compression, 2 pages)
gdPicImg.TiffOpenMultiPageForWrite(True)
intGdPicImgInputID = gdPicImg.TiffCreateMultiPageFromFile("c:\inbox\24bpp_none.TIF")

'-------------
'use PDF ocr start method (there is one line for each bitonal compression type - uncomment one)
intGdPicImgOutputID = gdPicImg.PdfOCRStart("c:\inbox\1bpp_ccitt4.pdf", True, "Title", "Author", "Subject", "Keywords", "Creator")
'intGdPicImgOutputID = gdPicImg.PdfOCRStart("c:\inbox\1bpp_flate.pdf", True, "Title", "Author", "Subject", "Keywords", "Creator")
'intGdPicImgOutputID = gdPicImg.PdfOCRStart("c:\inbox\1bpp_none.pdf", True, "Title", "Author", "Subject", "Keywords", "Creator")

'set bitonal compression (there is one line for each bitonal compression type - uncomment the same as above)
gdPicImg.PdfSetCompressionForBitonalImage(intGdPicImgOutputID, PdfCompression.PdfCompressionCCITT4)
'gdPicImg.PdfSetCompressionForBitonalImage(intGdPicImgOutputID, PdfCompression.PdfCompressionFlate)
'gdPicImg.PdfSetCompressionForBitonalImage(intGdPicImgOutputID, PdfCompression.PdfCompressionNone)

'set color compression (not used in this example, all pages are converted to 1bpp for output)
gdPicImg.PdfSetCompressionForColorImage(intGdPicImgOutputID, PdfCompression.PdfCompressionJPEG)
gdPicImg.PdfSetJpegQuality(intGdPicImgOutputID, 75)

'=======================
'TIF PAGE 1

'get gdpInputPage
gdPicImg.TiffSelectPage(intGdPicImgInputID, 1)

'rotate (called before bpp conversion will work fine)
'gdPicImg.RotateAngle(intGdPicImgInputID, 0)

'convert page to 1bpp with 128 thresh
gdPicImg.ConvertTo1Bpp(intGdPicImgInputID, CByte(128))

'rotate (called after bpp conversion will cause pdf bitonal compression to fail)
gdPicImg.RotateAngle(intGdPicImgInputID, 0)

'add 1bpp TIF page1 to PDF
gdPicImg.PdfAddGdPictureImageToPdfOCR(intGdPicImgOutputID, intGdPicImgInputID, TesseractDictionary.TesseractDictionaryEnglish, System.Windows.Forms.Application.StartupPath, "")

'=======================
'TIF PAGE 2

'get gdpInputPage
gdPicImg.TiffSelectPage(intGdPicImgInputID, 2)

'rotate (called before bpp conversion will work fine)
'gdPicImg.RotateAngle(intGdPicImgInputID, 0)

'convert page to 1bpp with 128 thresh
gdPicImg.ConvertTo1Bpp(intGdPicImgInputID, CByte(128))

'rotate (called after bpp conversion will cause pdf bitonal compression to fail)
gdPicImg.RotateAngle(intGdPicImgInputID, 0)

'add 1bpp TIF page2 to PDF
gdPicImg.PdfAddGdPictureImageToPdfOCR(intGdPicImgOutputID, intGdPicImgInputID, TesseractDictionary.TesseractDictionaryEnglish, System.Windows.Forms.Application.StartupPath, "")

'=======================
'CLEANUP

'use PDF ocr stop method
gdPicImg.PdfOCRStop(intGdPicImgOutputID)

'-------------
'release output ID if it isn't the same as the input ID
'(most of the time it will be the same value for both IDs)
If intGdPicImgOutputID <> intGdPicImgInputID Then
    gdPicImg.ReleaseGdPictureImage(intGdPicImgOutputID)
End If

'release input ID
gdPicImg.ReleaseGdPictureImage(intGdPicImgInputID)

'-------------
'finished
Call MsgBox("Done.", MsgBoxStyle.OkOnly + MsgBoxStyle.Information, "Done")
```

I can work around this now that I know exactly what was causing the issue (by calling rotate before 1bpp conversion), but it may be nice to fix it so that it works as expected either way.

I would think rotating a large 24bpp input image would be more costly than rotating it after it has been converted to a smaller 1bpp image. That is why I originally did the resize/dpi/bpp conversions before rotate/deskew/etc.

Post by **Loïc** » Sat Dec 12, 2009 11:05 am

OK Don, you got it!

Let me give you a small clarification: you are calling the RotateAngle() method on an indexed bitmap (in your case 1bpp). To apply this kind of non-standard rotation, GdPicture uses a transformation matrix, which outputs a "true color" bitmap in RGB space (8 + 8 + 8 = 24 bpp). Therefore, the result is a 24bpp image. So, if you want to keep the initial bit depth (1bpp), you have to convert again after the rotation using the ConvertTo1Bpp() method. Otherwise, if you are using a standard rotation (90, 180, or 270 degrees), you should use the RotateImage function.

I hope this clears things up.

Cheers,

Loïc
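A minimal sketch of the fix described above, reusing only calls that already appear in Don's sample (`TiffSelectPage`, `RotateAngle`, `ConvertTo1Bpp`, `PdfAddGdPictureImageToPdfOCR`). The `Imports GdPicture` line, the helper name `AddBitonalPage`, and its parameters are assumptions made for illustration. The `RotateImage` function recommended for 90/180/270 degree rotations is not shown because its signature does not appear in this thread.

```vbnet
Imports GdPicture 'assumption: namespace exposed by the GdPicture.NET assembly

Module BitonalPdfSketch

    'Hypothetical helper: adds one page of an open multipage TIF
    'to an OCR PDF while keeping the page at 1bpp.
    Sub AddBitonalPage(ByVal gdPicImg As GdPictureImaging, _
                       ByVal pdfID As Integer, _
                       ByVal tiffID As Integer, _
                       ByVal pageNo As Integer, _
                       ByVal angle As Integer)

        gdPicImg.TiffSelectPage(tiffID, pageNo)

        'Rotate first. Per Loïc, RotateAngle on an indexed (1bpp) image
        'produces a 24bpp result, so it must not be the last step.
        gdPicImg.RotateAngle(tiffID, angle)

        'Convert to 1bpp AFTER the rotation so the bitonal
        'compression setting applies to the image that is added.
        gdPicImg.ConvertTo1Bpp(tiffID, CByte(128))

        'Same call and arguments as in Don's sample.
        gdPicImg.PdfAddGdPictureImageToPdfOCR(pdfID, tiffID, _
            TesseractDictionary.TesseractDictionaryEnglish, _
            System.Windows.Forms.Application.StartupPath, "")
    End Sub

End Module
```

Calling this helper for pages 1 and 2 in place of the two page sections of Don's sample should, if the behavior is as Loïc describes, let the CCITT4, Flate, and uncompressed outputs differ in size again. If a rotation has to happen after an earlier 1bpp conversion, the same idea applies: call `ConvertTo1Bpp` once more before adding the page.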
