Reading/Writing Same PDF Increases Size

risotoh985
Posts: 10
Joined: Mon Nov 11, 2019 8:57 pm

Post by risotoh985 » Sun Dec 01, 2019 12:28 am

Hi,

I've noticed an interesting and very strange behavior:
When I load a PDF file that was previously saved by GdPicturePdf, remove all hidden text, run a new OCR pass, and save it again, the file gets bigger every time I repeat this cycle.

Here is a short sample code that reproduces this problem:

Code:

    for (var i = 0; i <= 10; i++)
    {
        // Load the output of the previous round (sample0.pdf is the original scan).
        var gdPicturePdf = new GdPicturePDF();
        gdPicturePdf.LoadFromFile($"sample{i}.pdf");

        // Drop the existing OCR text layer before recognizing again.
        gdPicturePdf.RemoveHiddenText();

        gdPicturePdf.OcrPages("*", 0, "eng+deu", @"C:\GdPicture.NET 14\Redist\OCR", string.Empty, 300, OCRMode.FavorAccuracy, int.MaxValue, true);

        // Save as the input for the next round.
        gdPicturePdf.SaveToFile($"sample{i + 1}.pdf", true, false);
    }

I've also attached a full sample project so you can reproduce this easily.

As you will see when running the sample project, "sample1.pdf" (after the first save with GdPicturePdf) is 231 KB, and after the 10th iteration the file "sample11.pdf" has grown to 289 KB!

What's the reason for this?
As the hidden text is always cleared before the next OCR round, I would expect the file size to stay the same.
Why does it increase more and more every time?

Thanks
Riso
Attachments
sample0.pdf (223.59 KiB)
GdPictureTest3.zip (118.65 KiB)

risotoh985

Re: Reading/Writing Same PDF Increases Size

Post by risotoh985 » Mon Dec 02, 2019 6:21 pm

Some additional information on this:
In the meantime I've found out that the size increases because every round embeds one additional font into the PDF file.

In the resulting "sample1.pdf" (after the first round) there is only a single font in the PDF (checked with the free tool PDF-Analyzer):
[Screenshot: GdPictureFontsEmbedded1.PNG]
But in "sample11.pdf" (after the 10th round) there are 10 fonts embedded in the PDF:
[Screenshot: GdPictureFontsEmbedded2.PNG]
I hope this helps to better understand this issue!

Maybe the RemoveHiddenText() function could also remove the embedded font that belongs to the removed hidden text?
Or is there another function that removes all embedded fonts from the document?

Thanks
Riso

Hugo
Posts: 227
Joined: Tue Dec 18, 2018 10:09 am

Re: Reading/Writing Same PDF Increases Size

Post by Hugo » Thu Dec 12, 2019 4:55 pm

Thanks for bringing up this issue.

To delete unnecessary and duplicate embedded fonts, we already have this method: https://www.gdpicture.com/guides/gdpicture/GdP ... Fonts.html

This should solve the size problem.
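
As a rough illustration of where such a call could go, here is a minimal sketch. It assumes the method is a parameterless PackFonts() on GdPicturePDF, that the classes live in the GdPicture14 namespace, and that licensing is already set up elsewhere; none of this is confirmed here, so please check the linked reference page. The output file name is hypothetical.

```csharp
using System;
using System.IO;
using GdPicture14; // assumption: namespace of the GdPicture.NET 14 assembly

class PackFontsSketch
{
    static void Main()
    {
        const string input = "sample11.pdf";         // last output of the loop in the first post
        const string output = "sample11-packed.pdf"; // hypothetical output name

        var gdPicturePdf = new GdPicturePDF();
        gdPicturePdf.LoadFromFile(input);

        // Assumption: PackFonts() takes no arguments (see the reference page).
        gdPicturePdf.PackFonts();

        // Same SaveToFile arguments as in the first post.
        gdPicturePdf.SaveToFile(output, true, false);

        // Compare the sizes on disk to see whether the call had any effect.
        Console.WriteLine($"{input}: {new FileInfo(input).Length} bytes");
        Console.WriteLine($"{output}: {new FileInfo(output).Length} bytes");
    }
}
```

Comparing the two file sizes is only a coarse check; an external PDF inspection tool gives a more direct view of which fonts remain embedded.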

Regards,
Hugo

risotoh985

Re: Reading/Writing Same PDF Increases Size

Post by risotoh985 » Thu Dec 12, 2019 6:21 pm

Hi Hugo,

thanks a lot for your reply!

This method does not seem to remove all unused fonts. Please try it with the attached "sample-test.pdf":
  • This PDF is just a scan (one image, no text, and no hidden text either).
  • If you load this PDF into GdPicturePDF using "LoadFromFile()" and call "GetFontCount()", the result is "0".
  • But if you check it with any external PDF tool (e.g. the PDF-Analyzer mentioned above), you will see that this PDF actually contains 11 embedded fonts.
  • After calling your "PackFonts()" method and saving the PDF again, these 11 fonts are still embedded, although they are definitely not used inside the PDF (it has no text, just a single image), so they should be safe to remove.
Why are these embedded fonts not detected by your "GetFontCount()" method?
And how can all embedded fonts actually be removed from a PDF?

Thanks again for your support
Riso
Attachments
sample-test.pdf (240.99 KiB)

Hugo

Re: Reading/Writing Same PDF Increases Size

Post by Hugo » Mon Dec 16, 2019 3:54 pm

Hey Risotoh985,

For future replies, let's continue in the ticket you have created for this issue, so that I can keep track of it.

The GetFontCount method does not return all of these fonts because they are not being used on your pages. They are embedded in your PDF file, but no page currently uses them. This is mentioned in the method's description.
We are currently looking into solutions to make the RemoveHiddenText method cover your issue.

I'll keep you up to date on this task's progress in your ticket!

Thanks!
