
How to extract pages from a PDF using iText 7?


I am trying to use the iText 7 library to extract some pages from a PDF file and create a new PDF from them.

    static void Splitter()
    {
        string file = @"C:\Users\Standard\Downloads\Merged\CK 2002989 $29,514.42 02.12.20.pdf";
        string range = "1, 4, 8";
        var pdfDocumentInvoiceNumber = new PdfDocument(new PdfReader(file));
        var split = new PdfSplitter(pdfDocumentInvoiceNumber);
        var result = split.ExtractPageRange(new PageRange(range));
        var numberOfPagesPdfDocumentInvoiceNumber = result.GetNumberOfPages();
        String toFile = @"C:\Users\Standard\Downloads\Result\Extracted.pdf";
        var pdfWriter = new PdfWriter(toFile);
        var pdfDocumentInvoiceMergeResult = new PdfDocument(pdfWriter);
        for (var i = 1; i <= numberOfPagesPdfDocumentInvoiceNumber; i++)
        {
            var pdfPage = result.GetPage(i).CopyTo(pdfDocumentInvoiceMergeResult);
            pdfDocumentInvoiceMergeResult.AddPage(pdfPage);
        }
    }

But when I call the CopyTo method, I get this error:

iText.Kernel.PdfException: 'Cannot copy indirect object from the document that is being written.'

Solution

  • The problem here is that the documents returned by the PdfSplitter methods, in particular by ExtractPageRange, are iText 7 documents that are being written to, i.e. these PdfDocument instances have been instantiated with a PdfWriter.

    Such documents are subject to certain restrictions, in particular that pages cannot be copied from them. For details on this read the answers here and here.

    For these result documents (and the whole PdfSplitter class with them) to be of any value, therefore, you need a way to define where the PdfWriter objects of these documents write to. There is a way, albeit not a very intuitive one: you have to override the GetNextPdfWriter method of the PdfSplitter, which originally looks like this:

    /// <summary>This method is called when another split document is to be created.</summary>
    /// <remarks>
    /// This method is called when another split document is to be created.
    /// You can override this method and return your own
    /// <see cref="iText.Kernel.Pdf.PdfWriter"/>
    /// depending on your needs.
    /// </remarks>
    /// <param name="documentPageRange">the page range of the original document to be included in the document being created now.
    ///     </param>
    /// <returns>the PdfWriter instance for the document which is being created.</returns>
    protected internal virtual PdfWriter GetNextPdfWriter(PageRange documentPageRange) {
        return new PdfWriter(new ByteArrayOutputStream());
    }
    

    In a use case like yours, in which you expect only a single result document that you eventually want to write to a file, you can do so like this:

    class MySplitter : PdfSplitter
    {
        public MySplitter(PdfDocument pdfDocument) : base(pdfDocument)
        {
        }
    
        protected override PdfWriter GetNextPdfWriter(PageRange documentPageRange)
        {
            String toFile = @"C:\Users\Standard\Downloads\Result\Extracted.pdf";
            return new PdfWriter(toFile);
        }
    }
    

    With the PdfWriter instantiation moved into that custom splitter, your main code is reduced to:

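    string file = @"C:\Users\Standard\Downloads\Merged\CK 2002989 $29,514.42 02.12.20.pdf";
    string range = "1, 4, 8";
    var pdfDocumentInvoiceNumber = new PdfDocument(new PdfReader(file));
    var split = new MySplitter(pdfDocumentInvoiceNumber);
    var result = split.ExtractPageRange(new PageRange(range));
    // Closing the result document writes it to the file chosen in GetNextPdfWriter.
    result.Close();
    // Close the source document as well to release the input file.
    pdfDocumentInvoiceNumber.Close();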
    

    In a use case like yours this admittedly looks weird, having to derive a custom class from the PdfSplitter merely to extract a few pages from a source PDF to a result PDF. Wouldn't an additional PdfWriter parameter to the ExtractPageRange have made it much easier?

    Please be aware, though, that the main objective of the PdfSplitter class is to split documents into many parts using the ExtractPageRanges and SplitBy... methods, and in that situation you'd need to supply a larger, probably not exactly known number of PdfWriters... not easier at all!
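
    To illustrate that multi-part situation, here is a minimal sketch of a splitter that numbers its output files consecutively. NumberedSplitter, SplitDemo, the file name pattern, and the paths are hypothetical names chosen for this example and are not part of iText; the sketch assumes that GetNextPdfWriter is called once for each split document.

    ```csharp
    using System.Collections.Generic;
    using iText.Kernel.Pdf;
    using iText.Kernel.Utils;

    // Hypothetical helper, not part of iText: writes each part to a numbered file.
    class NumberedSplitter : PdfSplitter
    {
        private readonly string outputPattern; // e.g. @"C:\path\to\part_{0}.pdf"
        private int partNumber = 0;

        public NumberedSplitter(PdfDocument pdfDocument, string outputPattern) : base(pdfDocument)
        {
            this.outputPattern = outputPattern;
        }

        // Assumed to be called once per split document the splitter creates.
        protected override PdfWriter GetNextPdfWriter(PageRange documentPageRange)
        {
            partNumber++;
            return new PdfWriter(string.Format(outputPattern, partNumber));
        }
    }

    static class SplitDemo
    {
        static void Main()
        {
            // Placeholder paths; replace them with real locations.
            var source = new PdfDocument(new PdfReader(@"C:\path\to\source.pdf"));
            var splitter = new NumberedSplitter(source, @"C:\path\to\part_{0}.pdf");
            var ranges = new List<PageRange> { new PageRange("1-3"), new PageRange("4, 8") };

            // Each returned document is still open for writing; closing it writes the file.
            foreach (PdfDocument part in splitter.ExtractPageRanges(ranges))
            {
                part.Close();
            }
            source.Close();
        }
    }
    ```

    The same class should also work with the SplitBy... methods, since they obtain their writers through the same GetNextPdfWriter hook.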

    Of course, a better solution probably would have been injecting some lambda expression or some other callback mechanism. For example:

    class ImprovedSplitter : PdfSplitter
    {
        private Func<PageRange, PdfWriter> nextWriter;
        public ImprovedSplitter(PdfDocument pdfDocument, Func<PageRange, PdfWriter> nextWriter) : base(pdfDocument)
        {
            this.nextWriter = nextWriter;
        }
    
        protected override PdfWriter GetNextPdfWriter(PageRange documentPageRange)
        {
            return nextWriter.Invoke(documentPageRange);
        }
    }
    

    You can use this class like this:

    string file = @"C:\Users\Standard\Downloads\Merged\CK 2002989 $29,514.42 02.12.20.pdf";
    string range = "1, 4, 8";
    var pdfDocumentInvoiceNumber = new PdfDocument(new PdfReader(file));
    var split = new ImprovedSplitter(pdfDocumentInvoiceNumber, pageRange => new PdfWriter(@"C:\Users\Standard\Downloads\Result\Extracted.pdf"));
    var result = split.ExtractPageRange(new PageRange(range));
    // Closing the result document writes it to the file chosen by the callback.
    result.Close();
    // Close the source document as well to release the input file.
    pdfDocumentInvoiceNumber.Close();
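
    As a side note, if all you need is a single result file, a possible alternative is to skip PdfSplitter altogether and copy the pages straight from the source document, which is opened with a PdfReader only and therefore is not subject to the restriction described above. The following is only a sketch under that assumption; PageExtractor and ExtractPages are hypothetical names and the paths in the usage line are placeholders.

    ```csharp
    using System.Collections.Generic;
    using iText.Kernel.Pdf;

    // Hypothetical helper, not part of iText.
    static class PageExtractor
    {
        // Copies the given 1-based page numbers from sourcePath into a new PDF at targetPath.
        public static void ExtractPages(string sourcePath, string targetPath, IList<int> pageNumbers)
        {
            // The source is read-only (no PdfWriter), so pages may be copied from it.
            var source = new PdfDocument(new PdfReader(sourcePath));
            try
            {
                var target = new PdfDocument(new PdfWriter(targetPath));
                try
                {
                    source.CopyPagesTo(pageNumbers, target);
                }
                finally
                {
                    // Closing the target writes the result file.
                    target.Close();
                }
            }
            finally
            {
                source.Close();
            }
        }
    }
    ```

    A call for the pages from the question could then look like PageExtractor.ExtractPages(@"C:\path\to\source.pdf", @"C:\path\to\Extracted.pdf", new List<int> { 1, 4, 8 }).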