Approach question going from tiffs to PDF/A

KurtInCali
Posts: 13
Joined: Fri Mar 27, 2020 6:21 pm

Post by KurtInCali » Sun May 31, 2020 12:39 am

Hello!

I have a need for a Blazor Server app and several WPF clients to pull images from a repository of single page 300 dpi bitonal tiffs, apply redactions, combine them into a searchable PDF/A document that looks and prints exactly like the original tiffs and potentially add a digital cert.

So I created a .NET Standard (2.1) library and started working from the sample code that goes from a multipage tiff to a searchable PDF. I was most of the way through writing the one (simple to call) function I intend to share among the clients when I noticed that the pages being added to the PDF appear to be built from the OCR text results. I haven't gotten to the point of actually testing it, but it doesn't look like the original tiff image is being folded into the new PDF pages...

So since it has to look exactly like the original, should I try a completely different approach? Maybe:

1. use GdPictureDocumentConverter to convert the single page tiffs to a PDF1_5 document.
2. loop through each page and use the AddRedactionRegion method then the OCRPage method (or vice versa?)
3. Then convert it to PDF_A_1b document
4. Then apply a digital cert if necessary

Anyhow, I'm not sure if I'm over- or under-thinking this.
The samples created from the online PDF/A conversion engine were pretty much perfect, and I would like to emulate that if possible.

Thanks in advance for any direction!

Kurt

KurtInCali
Posts: 13
Joined: Fri Mar 27, 2020 6:21 pm

Re: Approach question going from tiffs to PDF/A

Post by KurtInCali » Mon Jun 01, 2020 5:36 pm

Never mind, I think I'm in good shape. I have a single-threaded solution working pretty well, but 30 seconds to OCR 6 pages without showing progress is probably not gonna fly. I'm now going to try the OCRPages method for multithreading. I expect it will go really well.
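
For readers following along, the "OCR pages concurrently and report progress" idea can be sketched in a library-agnostic way. This is only an illustration in Python, not the toolkit's OCRPages method: `ocr_pages_in_parallel`, `fake_ocr` and the `on_progress` callback are hypothetical names, and `fake_ocr` stands in for whatever per-page OCR call the real app makes.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List


def ocr_pages_in_parallel(
    pages: List[str],
    ocr_one_page: Callable[[str], str],
    on_progress: Callable[[int, int], None],
    max_workers: int = 4,
) -> List[str]:
    """Run ocr_one_page over every page concurrently; return text in page order."""
    results: Dict[int, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which page index each future belongs to.
        futures = {pool.submit(ocr_one_page, page): i for i, page in enumerate(pages)}
        for done_count, future in enumerate(as_completed(futures), start=1):
            results[futures[future]] = future.result()  # re-raises worker errors
            on_progress(done_count, len(pages))  # e.g. update a progress bar
    return [results[i] for i in range(len(pages))]


def fake_ocr(page_path: str) -> str:
    # Hypothetical stand-in for the real per-page OCR call.
    return f"text of {page_path}"
```

Pages may finish in any order, so the results are keyed by page index and reassembled at the end; the progress callback fires once per completed page.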
A couple observations:
1. Ok so, first off I'm totally enjoying the toolkit.
2. I was not able to use some methods under .NET Standard (2.1) because it uses a slightly older System.Runtime, but everything works fine in .NET Core. I assume the next version of .NET Standard will fix that, so I moved on.
3. It looks like I cannot apply redactions to a PDF/A document; I assume that is part of the point of the format. I would like to do long-term storage in PDF/A and apply redactions based on the audience context. I guess I will try converting the PDF/A to a different flavor, redacting, and converting it back. Will cross that bridge later.
4. OCR quality for modern scanned documents is mac daddy out of the box with the OCRPage method; for older content, not so much. I presume I can key off the tiff file size to tell the OCR engine to do a despeckle, erode and dilate first for the purposes of OCR, but leave the image intact for display and storage. Will cross that bridge later too.
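
The file-size heuristic in point 4 could look something like the sketch below. It is an illustration in Python rather than toolkit code; the 150 KB cutoff, the function names and the filter labels are all hypothetical, and the premise that a larger bitonal tiff means a noisier scan is the assumption from the post, to be tuned against real files.

```python
import os
from typing import List

# Hypothetical cutoff; calibrate against a sample of clean and noisy scans.
NOISY_SIZE_THRESHOLD_BYTES = 150 * 1024


def filters_for_size(size_bytes: int,
                     threshold: int = NOISY_SIZE_THRESHOLD_BYTES) -> List[str]:
    """Return the OCR-only cleanup steps to run for a page of this file size."""
    if size_bytes > threshold:
        # Labels only; map them to the real filter calls in the imaging library.
        return ["despeckle", "erode", "dilate"]
    return []


def filters_for_file(tiff_path: str) -> List[str]:
    """Convenience wrapper that looks up the size on disk."""
    return filters_for_size(os.path.getsize(tiff_path))
```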

To circle back once again: I'm really enjoying the toolkit. As I get more familiar with it, the methodologies are proving consistent and becoming predictable. Speed, ability, quality and memory management seem to be spot on, especially compared to some imaging toolkits I've used over the last 25 years. Thank you to the developers, testers and support folks. You have made my work a lot easier than I hoped...

Hugo
Posts: 227
Joined: Tue Dec 18, 2018 10:09 am

Re: Approach question going from tiffs to PDF/A

Post by Hugo » Mon Jun 15, 2020 4:45 pm

Hi KurtInCali,

Thank you for your kind words. We appreciate them and are glad you are enjoying our software.

2. Let me know if you are still experiencing this issue.
3. There should not be any such limitations. Would you be able to provide the material you are using with a code sample so we may reproduce this issue? Feel free to contact us here to send the material: https://orpalis.zendesk.com/hc/en-us/requests/new
4. What you can do is clone the tiff image, apply the filters to the clone, run OCR on it and get the text from the page. You can then delete the cloned GdPictureImage and work with the original, but now you also have the OCR results in a string.
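
The clone, filter, OCR, discard pattern from point 4 can be sketched as follows. This is a self-contained Python illustration, not GdPicture code: the bitmap is a plain list of rows, `remove_isolated_pixels` is a toy despeckle written for the example, and `run_ocr` is a hypothetical callable standing in for the real OCR step.

```python
import copy
from typing import Callable, List

Bitmap = List[List[int]]  # 1 = black pixel, 0 = white pixel


def remove_isolated_pixels(image: Bitmap) -> Bitmap:
    """Toy despeckle: clear black pixels that have no black neighbour."""
    height, width = len(image), len(image[0])
    cleaned = copy.deepcopy(image)  # the input bitmap is never modified
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1:
                continue
            has_neighbour = any(
                image[ny][nx] == 1
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_neighbour:
                cleaned[y][x] = 0
    return cleaned


def ocr_on_cleaned_copy(original: Bitmap, run_ocr: Callable[[Bitmap], str]) -> str:
    """Filter a copy, OCR the copy, and leave the original untouched."""
    working_copy = remove_isolated_pixels(original)
    text = run_ocr(working_copy)
    del working_copy  # the copy is not needed once the text is extracted
    return text
```

The original bitmap is what would go on to display and storage; only the extracted text survives from the filtered copy.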

Regards,
