General noise clean up

General document noise is random noise that, about 90% of the time, results from the scanning process. Whether it comes from a low-quality scanner with a poor A/D converter or from a document that has degraded through repeated printing and scanning, this random noise is the single greatest obstacle to intelligent document recognition.

GdPicture.NET comes with a number of clean-up algorithms that should be able to handle even the most difficult of these scenarios.

  1. Bitonal Despeckle

    Before / After

    FxBitonalDespeckle() and FxBitonalDespeckleMore() are two functions specifically designed to deal with random salt-and-pepper noise in bitonal (black-and-white) documents. Both functions take two parameters:

    • The image ID
    • A Boolean parameter named FixText:

      If FixText is set to true, the function prioritizes preserving the quality of the text over noise removal. If it is set to false, more noise is cleared, but in low-DPI images the edges of the text may be slightly degraded.
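The trade-off described for FixText can be illustrated with a minimal Python sketch. This is not the GdPicture.NET algorithm, which is not documented here; the `despeckle` function, its `fix_text` flag, and the neighbor thresholds are hypothetical stand-ins chosen only to show why a more aggressive pass can nibble at text edges.

```python
# Illustrative sketch only: NOT the GdPicture.NET implementation.
# A bitonal image is a list of rows; 1 = black (ink), 0 = white (paper).

def count_black_neighbors(img, y, x):
    """Count black pixels among the 8 neighbors of (y, x)."""
    h, w = len(img), len(img[0])
    total = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                total += img[ny][nx]
    return total


def despeckle(img, fix_text=True):
    """Return a cleaned copy of img.

    fix_text=True  : conservative, clears only black pixels with no black neighbor.
    fix_text=False : aggressive, also clears black pixels with one black neighbor
                     (a hypothetical threshold, chosen for illustration).
    """
    max_neighbors = 0 if fix_text else 1
    out = [row[:] for row in img]
    for y, row in enumerate(img):
        for x, pixel in enumerate(row):
            if pixel == 1 and count_black_neighbors(img, y, x) <= max_neighbors:
                out[y][x] = 0
    return out


# A tiny "T" shape plus two stray specks at (0, 0) and (3, 1).
noisy = [
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0],
]
gentle = despeckle(noisy, fix_text=True)   # specks gone, the "T" untouched
harsh = despeckle(noisy, fix_text=False)   # specks gone, tip of the stem lost
```

In this toy example both modes remove the two specks, but the aggressive mode also clears the last pixel of the stem, which is the kind of edge damage the text warns about for low-DPI images.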

  2. Remove Isolated Dots

    Before / After

    Isolated dots come in different sizes. GdPicture has three functions to clean them up, depending on their size.
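As a rough illustration of size-based dot removal, the Python sketch below clears any cluster of black pixels that fits inside a square window and is completely surrounded by white. It is an assumption-laden sketch, not the GdPicture implementation: the function name `remove_isolated_dots` and its `size` parameter are hypothetical.

```python
# Illustrative sketch only: NOT the GdPicture.NET implementation.
# 1 = black (ink), 0 = white (paper).

def remove_isolated_dots(img, size):
    """Clear black clusters that fit in a size x size window ringed by white."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]

    def pixel(y, x):
        # Anything outside the image is treated as white.
        return img[y][x] if 0 <= y < h and 0 <= x < w else 0

    for top in range(h - size + 1):
        for left in range(w - size + 1):
            # The one-pixel ring around the window must be entirely white.
            ring_is_white = all(
                pixel(y, x) == 0
                for y in range(top - 1, top + size + 1)
                for x in range(left - 1, left + size + 1)
                if y in (top - 1, top + size) or x in (left - 1, left + size)
            )
            if ring_is_white:
                for y in range(top, top + size):
                    for x in range(left, left + size):
                        out[y][x] = 0
    return out


page = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 1, 0],   # a 1 x 1 dot and the top of a 2 x 2 dot
    [0, 0, 0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0, 0],   # a 3 x 3 dot
    [0, 0, 1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
after_1 = remove_isolated_dots(page, 1)   # only the single pixel disappears
after_2 = remove_isolated_dots(page, 2)   # the 2 x 2 dot disappears as well
after_3 = remove_isolated_dots(page, 3)   # all three dots disappear
```

A larger window removes larger dots, but it also raises the risk of deleting small legitimate marks such as punctuation, which is one reason to choose the window according to the size of the noise.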

  3. Remove Parasite Noise and Speckles Without Affecting Content

    Before / After

    RemoveBlob is a very powerful and fast function. It can remove any type of content from an image, provided the content falls within a given size range and a given fill percentage.
    Since noise is usually small compared to text, specifying low values for the size of the blobs to remove clears the noise and leaves the text intact.
    The difficulty arises when the DPI of the image or the size of its content varies. A higher DPI usually means a larger image in pixels, and therefore larger content, including larger noise.

    Here is an example of the parameters to use on a generic 200 DPI image. You should experiment with these parameters on your own images in the Document Clean Up Sample, under Remove Parasites:

    • Set the MinFillPercent and MaxFillPercent to 1 and 100 respectively
    • Set MinBlobWidth and MinBlobHeight to 1
    • Set MaxBlobWidth and MaxBlobHeight to 12

    If you notice that not all of the noise is removed, increase the maximum blob width and height. If, on the other hand, some of the content (text) is removed, reduce those values.
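To make the role of these parameters concrete, here is a minimal Python sketch of size-and-fill filtering over connected blobs, using the 200 DPI values listed above as defaults. It is not the RemoveBlob implementation: the function names, the use of 8-connectivity, the bounding-box definition of width and height, and the definition of fill percentage as ink pixels divided by bounding-box area are all assumptions made for illustration.

```python
# Illustrative sketch only: NOT the GdPicture.NET RemoveBlob implementation.
# 1 = black (ink), 0 = white (paper).
from collections import deque


def find_blobs(img):
    """Yield each 8-connected group of black pixels as a list of (y, x)."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    for sy in range(h):
        for sx in range(w):
            if img[sy][sx] != 1 or seen[sy][sx]:
                continue
            seen[sy][sx] = True
            queue, blob = deque([(sy, sx)]), []
            while queue:
                y, x = queue.popleft()
                blob.append((y, x))
                for ny in range(max(0, y - 1), min(h, y + 2)):
                    for nx in range(max(0, x - 1), min(w, x + 2)):
                        if img[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            yield blob


def remove_blobs(img, min_w=1, min_h=1, max_w=12, max_h=12,
                 min_fill=1, max_fill=100):
    """Clear blobs whose bounding box and fill percentage fall in range."""
    out = [row[:] for row in img]
    for blob in find_blobs(img):
        ys = [y for y, _ in blob]
        xs = [x for _, x in blob]
        width = max(xs) - min(xs) + 1
        height = max(ys) - min(ys) + 1
        # Assumed definition: ink pixels as a percentage of the bounding box.
        fill = 100.0 * len(blob) / (width * height)
        if (min_w <= width <= max_w and min_h <= height <= max_h
                and min_fill <= fill <= max_fill):
            for y, x in blob:
                out[y][x] = 0
    return out


page = [[0] * 40 for _ in range(20)]
for y in range(2, 17):            # a 3 x 15 stroke standing in for a character
    for x in range(5, 8):
        page[y][x] = 1
page[10][30] = page[10][31] = page[11][30] = 1   # a small speck (75% fill)

cleaned = remove_blobs(page)      # speck removed, tall stroke kept
```

Here the stroke survives because its height of 15 exceeds the maximum of 12, while the speck fits every range and is cleared. Raising `max_h` to 15 or more would delete the stroke too, which mirrors the advice above about reducing the maximum values when text starts to disappear.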

  4. Hopeless Cases, Where the Amount of Noise Almost Exceeds the Amount of Data

    Before / After

    As a document expert, you are bound to come across legacy documents in which the amount of noise is extreme and the noise itself is sporadic, following no pattern and no particular size.
    In those cases we recommend using FxBitonalVigorousDespeckle. This function performs a statistical analysis of all image content and tries to build a model of what the content size and the noise size should be.
    It cleans so vigorously that some data might be affected. This should not matter because, as mentioned, this is a hopeless case in which no other function would yield any recognition at all. The function has one parameter, which specifies whether to look for the dots of the letter "i" and try to retain them. Setting it to true results in slightly slower processing and retains a little more noise around text content, but it also retains most of the dots of the letters "i" and "j".
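The idea of modelling content size statistically can be sketched in Python as follows. The actual model used by FxBitonalVigorousDespeckle is not described in this topic, so everything here is an assumption: the ink-weighted median height as the "typical content size", the `ratio` threshold, the rule for recognizing a dot above a stem, and all function names are hypothetical.

```python
# Illustrative sketch only: NOT the FxBitonalVigorousDespeckle algorithm.
# 1 = black (ink), 0 = white (paper).

def blobs(img):
    """Return every 8-connected group of black pixels as a list of (y, x)."""
    h, w = len(img), len(img[0])
    seen, found = set(), []
    for sy in range(h):
        for sx in range(w):
            if img[sy][sx] != 1 or (sy, sx) in seen:
                continue
            seen.add((sy, sx))
            stack, pixels = [(sy, sx)], []
            while stack:
                y, x = stack.pop()
                pixels.append((y, x))
                for ny in range(max(0, y - 1), min(h, y + 2)):
                    for nx in range(max(0, x - 1), min(w, x + 2)):
                        if img[ny][nx] == 1 and (ny, nx) not in seen:
                            seen.add((ny, nx))
                            stack.append((ny, nx))
            found.append(pixels)
    return found


def box(pixels):
    """Bounding box as (top, left, bottom, right)."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    return min(ys), min(xs), max(ys), max(xs)


def typical_height(all_blobs):
    """Ink-weighted median blob height (assumed model of content size)."""
    weighted = sorted((box(p)[2] - box(p)[0] + 1, len(p)) for p in all_blobs)
    half = sum(area for _, area in weighted) / 2
    running = 0
    for height, area in weighted:
        running += area
        if running >= half:
            return height
    return 0


def vigorous_despeckle(img, keep_i_dots=True, ratio=0.5):
    """Clear blobs much smaller than the typical content size."""
    all_blobs = blobs(img)
    limit = ratio * typical_height(all_blobs)
    small, big = [], []
    for p in all_blobs:
        top, left, bottom, right = box(p)
        tiny = (bottom - top + 1) < limit and (right - left + 1) < limit
        (small if tiny else big).append(p)
    out = [row[:] for row in img]
    for p in small:
        top, left, bottom, right = box(p)
        # A small blob sitting just above a content-sized blob is kept
        # as a probable dot of an "i" or "j".
        if keep_i_dots and any(
            b_left <= right and left <= b_right and 0 < b_top - bottom <= limit
            for b_top, b_left, _, b_right in map(box, big)
        ):
            continue
        for y, x in p:
            out[y][x] = 0
    return out


page = [[0] * 30 for _ in range(12)]
for y in range(3, 11):
    page[y][3] = 1                      # stem of an "i"
    page[y][10] = page[y][11] = 1       # two thicker strokes
    page[y][18] = page[y][19] = 1
page[1][3] = 1                          # the dot of the "i"
for y, x in [(0, 14), (6, 25), (11, 28), (8, 6)]:
    page[y][x] = 1                      # scattered noise

kept = vigorous_despeckle(page, keep_i_dots=True)
stripped = vigorous_despeckle(page, keep_i_dots=False)
```

Weighting by ink rather than counting blobs is a deliberate choice in this sketch: when noise specks outnumber characters, a plain median of blob sizes would describe the noise instead of the text. The dot-retention rule also shows why that option keeps slightly more noise, since any speck that happens to sit just above a stroke is spared.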

Sometimes you have to call all three of the isolated-dot removal methods on your document in order to clean all types of dots.

The advantage of removing isolated dots over bitonal despeckling is that it does not affect the text at all in low-DPI images. On the other hand, it does not clean as much noise.

Calling RemoveBlob removes most noise with no effect on the text, but you have to know how to relate the DPI of the image and the size of the content to the size of the noise, which makes it trickier to use.
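One simple way to relate DPI to blob size is to scale a limit that works at a known resolution in proportion to the resolution of the new image. The Python helper below is a hypothetical sketch that assumes noise size grows linearly with DPI, which will not hold for every scanner or document, so the result is only a starting point for tuning. It uses the 200 DPI maximum of 12 pixels given earlier as the reference.

```python
# Illustrative sketch: linear scaling with DPI is an assumption, not a rule.

def scale_blob_limit(limit_px, reference_dpi, image_dpi):
    """Scale a pixel limit tuned at reference_dpi to an image at image_dpi."""
    if reference_dpi <= 0 or image_dpi <= 0:
        raise ValueError("DPI values must be positive")
    return max(1, round(limit_px * image_dpi / reference_dpi))


# Starting points for the maximum blob width and height at common resolutions,
# derived from 12 pixels at 200 DPI.
for dpi in (100, 200, 300, 600):
    print(dpi, "DPI ->", scale_blob_limit(12, 200, dpi), "px")
```

After scaling, the same adjustment advice applies: increase the limit if noise remains, and reduce it if text starts to disappear.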

Finally, all of these functions are available in the Document Clean Up Sample demo, where you can test their results and experiment with their parameters. Please try it out for a better understanding.