Introduction to the Optimization of Existing PDF files: Methods
This paper was originally a presentation in French delivered at the PDF Day – France by Loïc Carrère, CEO of ORPALIS, in April 2019.
Organized by the PDF Association, the PDF Days are the meeting place of the PDF industry, where experts conduct educational (non-commercial) presentations, panels, and discussion-based sessions about the format.
The richness of the PDF format offers many opportunities to reduce the size of existing documents.
Organizations need to meet more and more legal requirements for archiving and data retention and often adopt a strategy to reduce the amount of storage used by their existing documents.
The PDF Optimization in-depth series
This series of articles will address the issues and constraints of such an approach, as well as various optimization methods that can be applied.
We will try to describe as many optimization techniques as possible, lossless or lossy, which can be adapted to one’s expectations. We will discuss them through case studies dealing with documents of different kinds (documents with vector content and documents containing only images).
We will therefore focus on one question: is data loss acceptable during compression and, if so, to what extent?
Choosing lossless or lossy compression
We will introduce several compression methods, some lossless, others lossy. For this second category, it will be necessary to decide in advance whether data loss is tolerable and, if so, to what extent.
When reducing the file size of PDF documents, images are the first logical candidates for compression. The reason is simple: an image can be compressed into an approximate representation of the original data without losing its meaning.
Did you know that applying 50% compression to a single image can decrease the file size of that image by as much as 90%?
Lossy compression often drastically reduces the file size, but at the expense of an irreversible loss of information. Some of the removed data is redundant. Some is not, but its absence is mostly not noticeable to users in the result. There is no way back once you have used lossy compression.
When you want to perform further processing on the image, do not select lossy compression.
The clear advantage of lossy image compression, however, is that it gives the best compression ratios with good enough approximations.
Lossy compression will mostly be related to image reprocessing.
For each image, it will be necessary to decide whether one can:
- Change its color depth, e.g., from 24-bit to 8-bit or 1-bit per pixel.
- Perform a downscaling.
- Alter its pixels (noise suppression, trimming, MRC processing …).
- Re-encode it with a lossy compression algorithm (JPEG 2000, JBIG2).
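The first two operations in the list above can be sketched in a few lines. This is only an illustration under stated assumptions: pixels are plain Python tuples and lists rather than objects from an imaging library, the helper names are hypothetical, the gray conversion uses the common BT.601 luma weights, and the 1-bit conversion is a fixed threshold (real tools usually pick the threshold adaptively).

```python
# Illustrative sketch only: pixels are plain Python values, not a real image API.

def rgb_to_gray(pixel):
    """Map one 24-bit (R, G, B) pixel to an 8-bit gray level (BT.601 weights)."""
    r, g, b = pixel
    return (299 * r + 587 * g + 114 * b) // 1000


def gray_to_bitonal(gray, threshold=128):
    """Map an 8-bit gray level to 1 bit: 1 = white, 0 = black."""
    return 1 if gray >= threshold else 0


def downscale_by_two(rows):
    """Halve width and height by averaging each 2x2 block of gray values."""
    out = []
    for y in range(0, len(rows) - 1, 2):
        out_row = []
        for x in range(0, len(rows[y]) - 1, 2):
            total = (rows[y][x] + rows[y][x + 1]
                     + rows[y + 1][x] + rows[y + 1][x + 1])
            out_row.append(total // 4)
        out.append(out_row)
    return out


# Hypothetical 4x2 sample: light pixels on the left, dark pixels on the right.
rgb_image = [
    [(255, 255, 255), (250, 250, 250), (0, 0, 0), (10, 10, 10)],
    [(255, 255, 255), (245, 245, 245), (5, 5, 5), (0, 0, 0)],
]

gray_image = [[rgb_to_gray(p) for p in row] for row in rgb_image]        # 24-bit -> 8-bit
bitonal_image = [[gray_to_bitonal(g) for g in row] for row in gray_image]  # 8-bit -> 1-bit
small_image = downscale_by_two(gray_image)                                # 4x2 -> 2x1

print(gray_image)
print(bitonal_image)
print(small_image)
```

Each step discards information that cannot be recovered afterwards: the bitonal image no longer distinguishes 250 from 255, and the downscaled image keeps one value where there were four.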
In contrast, lossless compression maintains the image quality without any change while reducing its file size.
This is achieved by encoding redundant data more compactly and by removing unneeded metadata from source images, so the size reduction is usually more modest.
Lossless compression is recommended if the image is going to be processed further with its original quality.
In other words, it suits discrete data and any raster image whose exact pixel values must be retained through compression.
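The defining property of lossless compression, that the original values come back unchanged, can be checked directly. The sketch below uses Python's standard `zlib` module (the deflate algorithm); the sample data is a made-up 64 x 64 grayscale image, chosen to be highly repetitive so that it compresses well.

```python
import zlib

# Hypothetical sample: 64 rows of a 64-pixel-wide, 8-bit grayscale image,
# each row a white background followed by a short black stroke.
row = bytes([255] * 48 + [0] * 16)
raw_pixels = row * 64

compressed = zlib.compress(raw_pixels, 9)   # 9 = highest compression level
restored = zlib.decompress(compressed)

print(len(raw_pixels), "->", len(compressed), "bytes")
print("identical after round trip:", restored == raw_pixels)
```

How much is saved depends entirely on the redundancy in the data: a noisy photograph treated the same way would shrink far less, but it would still be restored byte for byte.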
The PDF specification allows seven compression schemes for images, which are:
- LZW (LZWDecode) – An adaptive, lossless compression method, mainly used in the GIF and TIFF image formats.
- RLE (RunLengthDecode) – A lossless compression method that encodes runs of identical bytes, also used in formats such as BMP and PCX.
- CCITT (CCITTFaxDecode) – A lossless compression method based on the Group 3 and Group 4 fax standards, for bitonal (black and white) images only.
- JPEG (DCTDecode) – A compression method typically used with loss, for 8-bit grayscale or 24-bit color images.
- zlib / deflate (FlateDecode) – A lossless compression method that couples the LZ77 algorithm and Huffman coding.
- JBIG2 (JBIG2Decode) – A compression method for bitonal images only, which can be lossy or lossless.
- JPEG 2000 (JPXDecode) – A compression method commonly used with loss, for 8-bit grayscale or color images using wavelet transforms.
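To give a feel for the simplest scheme in this list, here is a minimal sketch of run-length coding in the byte layout that, to our understanding, the PDF specification uses for RunLengthDecode: a length byte from 0 to 127 means "copy the next length + 1 bytes literally", a length byte from 129 to 255 means "repeat the next byte 257 - length times", and 128 marks the end of the data. The function names are ours, and the encoder is a straightforward greedy one, not necessarily the one any given PDF tool uses.

```python
def run_length_encode(data: bytes) -> bytes:
    """Greedy run-length encoder producing the layout described above."""
    out = bytearray()
    i = 0
    while i < len(data):
        # Measure the run of identical bytes starting at i (capped at 128).
        run = 1
        while i + run < len(data) and run < 128 and data[i + run] == data[i]:
            run += 1
        if run >= 2:
            out += bytes([257 - run, data[i]])       # repeat run
            i += run
        else:
            # Collect literal bytes until a repeat begins (capped at 128).
            start = i
            i += 1
            while (i < len(data) and i - start < 128
                   and not (i + 1 < len(data) and data[i] == data[i + 1])):
                i += 1
            out += bytes([i - start - 1]) + data[start:i]  # literal run
    out.append(128)                                   # end-of-data marker
    return bytes(out)


def run_length_decode(data: bytes) -> bytes:
    """Inverse of run_length_encode."""
    out = bytearray()
    i = 0
    while i < len(data):
        length = data[i]
        i += 1
        if length == 128:                             # end of data
            break
        if length < 128:                              # literal run
            out += data[i:i + length + 1]
            i += length + 1
        else:                                         # repeat run
            out += bytes([data[i]]) * (257 - length)
            i += 1
    return bytes(out)


# Hypothetical sample: ten zero bytes, three distinct bytes, three 0xFF bytes.
sample = b"\x00" * 10 + b"\x01\x02\x03" + b"\xff" * 3
encoded = run_length_encode(sample)

print(len(sample), "->", len(encoded), "bytes")
print("identical after round trip:", run_length_decode(encoded) == sample)
```

Run-length coding only pays off when the data contains long runs of identical bytes, which is why it is rarely the best choice for photographic images.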
The same distinction applies to common image file formats: RAW, BMP, GIF, and PNG are generally lossless.
JPEG is a lossy compression type commonly used for digital images.
An alternative to JPEG is the TIFF format with LZW compression, which is lossless.
JBIG2, finally, supports both lossless and lossy compression.
To summarize, the dilemma of lossy vs. lossless is not about what is good or bad. It is about what best suits your purpose.
The next article will be about lossless methods: deleting unnecessary and unused content and objects.