c# pdf gembox-pdf

Split PDF by chapters from Table Of Contents


I'm using GemBox.Pdf and I need to extract the individual chapters of a PDF file as separate PDF files.

The first page (and maybe the second page as well) contains the TOC (Table Of Contents), and I need to split the rest of the PDF pages based on it:

[Screenshot: PDF file with chapters and Table Of Contents]

Also, the resulting PDF documents should be named after the chapters they contain.
I can split the PDF based on the number of pages for each document (I figured that out using this example):

using (var source = PdfDocument.Load("Chapters.pdf"))
{
    int pagesPerSplit = 3;
    int count = source.Pages.Count;

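    // Start at index 1 to skip the first (TOC) page.
    for (int index = 1; index < count; index += pagesPerSplit)
    {
        using (var destination = new PdfDocument())
        {
            // Stop at the last page so the final split doesn't read past the end.
            for (int splitIndex = 0; splitIndex < pagesPerSplit && index + splitIndex < count; splitIndex++)
                destination.Pages.AddClone(source.Pages[index + splitIndex]);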

            destination.Save("Chapter " + index + ".pdf");
        }
    }
}

But I can't figure out how to read and process that TOC and split the chapters based on its items.


Solution

  • EDIT:

    On the same page that you linked, there is now a "Split PDF file by bookmarks (outlines)" example.

    ORIGINAL:

    You should iterate through the document's bookmarks (outlines) and split it based on the bookmark destination pages.

    For instance, try this:

    using System.Collections.Generic;
    using System.Linq;
    using GemBox.Pdf;

    using (var source = PdfDocument.Load("Chapters.pdf"))
    {
        PdfOutlineCollection outlines = source.Outlines;
    
        // Map each page to its zero-based index so bookmark destinations can be looked up.
        PdfPages pages = source.Pages;
        Dictionary<PdfPage, int> pageIndexes = pages
            .Select((page, index) => new { page, index })
            .ToDictionary(item => item.page, item => item.index);
    
        for (int index = 0, count = outlines.Count; index < count; ++index)
        {
            PdfOutline outline = outlines[index];
            PdfOutline nextOutline = index + 1 < count ? outlines[index + 1] : null;
    
            // A chapter runs from its own bookmark's page up to (but not including)
            // the next bookmark's page; the last chapter runs to the end of the document.
            int pageStartIndex = pageIndexes[outline.Destination.Page];
            int pageEndIndex = nextOutline != null ?
                pageIndexes[nextOutline.Destination.Page] :
                pages.Count;
    
            using (var destination = new PdfDocument())
            {
                while (pageStartIndex < pageEndIndex)
                {
                    destination.Pages.AddClone(pages[pageStartIndex]);
                    ++pageStartIndex;
                }
    
                destination.Save($"{outline.Title}.pdf");
            }
        }
    }
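
To make the range logic in the loop easier to check on its own, here is a minimal sketch that leaves GemBox.Pdf out entirely. `ChapterRanges` and `Compute` are hypothetical names, not part of any library, and the sketch assumes the bookmark start indexes are already sorted in page order:

```csharp
using System;
using System.Collections.Generic;

public static class ChapterRanges
{
    // Hypothetical helper: turns sorted chapter start indexes into
    // half-open [Start, End) page ranges.
    public static List<(int Start, int End)> Compute(IReadOnlyList<int> startIndexes, int pageCount)
    {
        var ranges = new List<(int Start, int End)>();

        for (int i = 0; i < startIndexes.Count; i++)
        {
            int start = startIndexes[i];

            // A chapter ends where the next one begins;
            // the last chapter runs to the end of the document.
            int end = i + 1 < startIndexes.Count ? startIndexes[i + 1] : pageCount;

            ranges.Add((start, end));
        }

        return ranges;
    }
}
```

For example, `ChapterRanges.Compute(new[] { 2, 5, 9 }, 12)` gives the ranges (2, 5), (5, 9) and (9, 12). Pages before the first bookmark, such as the TOC pages, fall outside every range and are therefore not copied.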
    

    Note that, judging from the screenshot, your chapter bookmarks seem to include the chapter's ordinal number (in Roman numerals). If needed, you can easily remove it with something like this:

    destination.Save($"{outline.Title.Substring(outline.Title.IndexOf(' ') + 1)}.pdf");
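
Bookmark titles can also contain characters that are not allowed in file names, such as `/`. As a rough sketch, the numeral stripping above could be combined with a replacement of those characters. `ChapterFileNames` and `FromTitle` are hypothetical names, and the sketch assumes every title starts with a numeral followed by a space, so a multi-word title without a numeral would lose its first word:

```csharp
using System;
using System.IO;
using System.Linq;

public static class ChapterFileNames
{
    // Hypothetical helper: builds a file name from a bookmark title
    // such as "IV Results".
    public static string FromTitle(string title)
    {
        // Drop everything up to the first space (the numeral prefix), if there is one.
        int space = title.IndexOf(' ');
        string name = space >= 0 ? title.Substring(space + 1) : title;

        // Replace characters that the current platform rejects in file names.
        char[] invalid = Path.GetInvalidFileNameChars();
        name = new string(name.Select(c => invalid.Contains(c) ? '_' : c).ToArray());

        return name.Trim() + ".pdf";
    }
}
```

The save call would then become `destination.Save(ChapterFileNames.FromTitle(outline.Title));`. The set returned by `Path.GetInvalidFileNameChars()` differs between operating systems, so the same title may be sanitized differently on Windows and Linux.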