
Compress an existing PDF in C# using free libraries


I have been searching Google for a way to reduce the file size of an existing PDF. My constraints are:

  1. I can't use a standalone application, because it needs to be done by a C# program.

  2. I can't use any paid library, as my clients don't want to go over budget. So a paid library is certainly a no.

I did my homework for the last 2 days and came upon solutions using iTextSharp and BitMiracle, but to no avail: the former shrinks the file by just 1%, and the latter is paid.

I also came across PDFcompressNET and pdftk, but I wasn't able to find their .dll.

The PDF is an insurance policy with 2-3 images (black and white) and around 70 pages, amounting to about 5 MB.

I need the output as a PDF only (it can't be in any other format).


Solution

  • Here's an approach to do this (and this should work without regard to the toolkit you use):

    If you have a 24-bit RGB or 32-bit CMYK image, do the following:

    • Determine whether the image is really what its format says it is. If it's CMYK, convert to RGB. If it's RGB and really gray, convert to gray. If it's gray or paletted and only has 2 real colors, convert to 1-bit. If it's gray and there is relatively little gray variation, consider converting to 1-bit with a suitable binarization technique.
    • Measure the image dimensions in relation to how it is placed on the page. If it's 300 dpi or greater, consider resampling the image to a smaller size depending on its bit depth. For example, you can probably go from 300 dpi gray or RGB to 200 dpi and not lose too much detail.
    • If you have an RGB image that is really color, consider palettizing it.
    • Examine the contents of the image to see if you can make it more compressible. For example, if you run through a color/gray image and find a lot of colors that cluster, consider smoothing them. If it's gray or black and white and contains a number of specks, consider despeckling.
    • Choose your final compression wisely. JPEG2000 can do better than JPEG. JBIG2 does much better than G4. Flate is probably the best non-destructive compression for gray. Most implementations of JPEG2000 and JBIG2 are not free.
    • If you're a rock star, you want to try to segment the image and break it into areas that are really black and white and really color.
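
    As a rough illustration of the first step in the list above (checking whether an RGB image is really gray or really 2-color), here is a minimal sketch that works on a raw pixel buffer. The names PixelContent, ImageClassifier and Classify are hypothetical, and the packed R,G,B byte layout is an assumption; getting such a buffer out of a PDF is toolkit-specific and not shown.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical result type: what the pixel data actually contains.
public enum PixelContent { Bilevel, Gray, Color }

public static class ImageClassifier
{
    // rgb: packed 8-bit samples in R,G,B order, 3 bytes per pixel (assumed layout).
    // tolerance: largest channel spread still treated as gray, to absorb
    // scanner or JPEG noise. 0 means the channels must match exactly.
    public static PixelContent Classify(byte[] rgb, int tolerance = 0)
    {
        if (rgb == null) throw new ArgumentNullException(nameof(rgb));
        if (rgb.Length % 3 != 0)
            throw new ArgumentException("Expected 3 bytes per pixel.", nameof(rgb));

        var grayLevels = new HashSet<byte>();
        for (int i = 0; i < rgb.Length; i += 3)
        {
            byte r = rgb[i], g = rgb[i + 1], b = rgb[i + 2];
            int max = Math.Max(r, Math.Max(g, b));
            int min = Math.Min(r, Math.Min(g, b));

            // A single pixel with a real color cast makes the image color.
            if (max - min > tolerance)
                return PixelContent.Color;

            // Record the gray level so distinct levels can be counted.
            grayLevels.Add((byte)((r + g + b) / 3));
        }

        // Two or fewer distinct levels: a candidate for 1-bit storage.
        return grayLevels.Count <= 2 ? PixelContent.Bilevel : PixelContent.Gray;
    }
}
```

    A Bilevel result suggests converting to 1-bit, and a Gray result suggests dropping to 8-bit gray. With a non-zero tolerance, a noisy scan will usually report Gray rather than Bilevel, which is where a binarization step would come in.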

    That said, if you can do all of this well in an unsupervised manner, you have a commercial product in its own right.

    I will say that you can do most of this with Atalasoft dotImage (disclaimers: it's not free; I work there; I've written nearly all the PDF tools; I used to work on Acrobat).

    One particular way to do that with dotImage is to pull out all the pages that are image-only, recompress them, and save them out to a new PDF. Then build the final PDF by taking all the pages from the original document, replacing the image-only pages with the recompressed ones, and saving again. It's not that hard.

    // Needs System.Collections.Generic and System.IO, plus the Atalasoft dotImage PDF namespaces.
    // sourceStream, password, tempOutStream and finalOutputStream are supplied by the caller.
    List<int> pagesToReplace = new List<int>();
    PdfImageCollection pagesToEncode = new PdfImageCollection();
    
    using (Document doc = new Document(sourceStream, password)) {
    
        for (int i=0; i < doc.Pages.Count; i++) {
            Page page = doc.Pages[i];
            if (page.SingleImageOnly) {
                pagesToReplace.Add(i);
                // a PdfImage encapsulates an image and its compression parameters
                PdfImage image = ProcessImage(sourceStream, doc, page, i);
                pagesToEncode.Add(image); // add the re-encoded image, not the page index
            }
        }
    
        PdfEncoder encoder = new PdfEncoder();
        encoder.Save(tempOutStream, pagesToEncode, null); // re-encoded pages
        tempOutStream.Seek(0, SeekOrigin.Begin);
    
        sourceStream.Seek(0, SeekOrigin.Begin);
        PdfDocument finalDoc = new PdfDocument(sourceStream, password);
        PdfDocument replacementPages = new PdfDocument(tempOutStream);
    
        // swap each image-only page for its recompressed counterpart
        for (int i=0; i < pagesToReplace.Count; i++) {
             finalDoc.Pages[pagesToReplace[i]] = replacementPages.Pages[i];
        }
    
        finalDoc.Save(finalOutputStream);
    }

    What's missing here is ProcessImage(). ProcessImage will either rasterize the page (in which case you don't need to know how the image was scaled when it was placed on the page) or extract the image (and track the transformation matrix applied to it), and then go through the steps listed above. This is non-trivial, but it's doable.
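
    If you go the extraction route, the transformation matrix is what tells you the effective resolution. A PDF image is painted into a unit square that the current matrix [a b c d e f] maps onto the page, so the placed width and height in points follow from the matrix. The sketch below assumes the default user space of 72 units per inch. ImageResolution, EffectiveDpi and DownsampledSize are hypothetical names, not part of dotImage or any other toolkit.

```csharp
using System;

public static class ImageResolution
{
    // Default PDF user space: 72 units per inch (assumes no UserUnit override).
    private const double PointsPerInch = 72.0;

    // a, b, c, d: first four entries of the matrix [a b c d e f] in effect
    // when the image is painted. The image fills the unit square, so the
    // matrix columns give its placed width and height in points.
    public static (double X, double Y) EffectiveDpi(
        int pixelWidth, int pixelHeight, double a, double b, double c, double d)
    {
        double widthPoints = Math.Sqrt(a * a + b * b);
        double heightPoints = Math.Sqrt(c * c + d * d);
        if (widthPoints <= 0 || heightPoints <= 0)
            throw new ArgumentException("Degenerate placement matrix.");

        return (pixelWidth * PointsPerInch / widthPoints,
                pixelHeight * PointsPerInch / heightPoints);
    }

    // Pixel size to resample to in order to reach targetDpi.
    // Never upsamples: images already at or below the target are left alone.
    public static (int Width, int Height) DownsampledSize(
        int pixelWidth, int pixelHeight, double currentDpi, double targetDpi)
    {
        if (currentDpi <= targetDpi)
            return (pixelWidth, pixelHeight);

        double scale = targetDpi / currentDpi;
        return (Math.Max(1, (int)Math.Round(pixelWidth * scale)),
                Math.Max(1, (int)Math.Round(pixelHeight * scale)));
    }
}
```

    For example, a 2550 x 3300 pixel scan placed with the matrix [612 0 0 792 0 0] fills a letter-size page at 300 dpi, and DownsampledSize(2550, 3300, 300, 200) gives 1700 x 2200 pixels for a 200 dpi target.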