python pdf pypdf tableofcontents

PyPDF2: extract table of contents/outlines and their page numbers


I am trying to extract the TOC/outlines from PDFs, along with their page numbers, using Python (PyPDF2). I am aware of reader.outlines, but it does not return the correct page number.

Example PDF: https://www.annualreports.com/HostedData/AnnualReportArchive/l/NASDAQ_LOGM_2018.pdf

The output of reader.outlines is:

[{'/Title': '2018 Highlights', '/Page': IndirectObject(5, 0), '/Type': '/Fit'},
{'/Title': 'Letter to Stockholders', '/Page': IndirectObject(6, 0), '/Type': '/Fit'}, 
...
{'/Title': 'Part I', '/Page': IndirectObject(10, 0), '/Type': '/Fit'}, 
[{'/Title': 'Item 1. Business', '/Page': IndirectObject(10, 0), '/Type': '/Fit'}, 
{'/Title': 'Item 1A. Risk Factors', '/Page': IndirectObject(19, 0), '/Type': '/Fit'}
...

For instance, Part I is not expected to begin on page 10. Am I missing something? Does anyone have an alternative?

I've tried PyMuPDF, Tabula, and the getDestinationPageNumber method, with no luck.

Thank you in advance.


Solution

  • Martin Thoma's answer is exactly what I needed (PyMuPDF). Diblo Dk's answer is an interesting workaround as well (PyPDF2).

    Here is Martin Thoma's code:

    from typing import Dict
    
    import fitz  # pip install pymupdf
    
    
    def get_bookmarks(filepath: str) -> Dict[int, str]:
        # WARNING: one page can have multiple bookmarks; only the last title per page is kept.
        bookmarks = {}
        with fitz.open(filepath) as doc:
            toc = doc.get_toc()  # [[lvl, title, page], …]; named getToC() in older PyMuPDF releases
            for level, title, page in toc:
                bookmarks[page] = title
        return bookmarks
    
    
    print(get_bookmarks("my.pdf"))
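
The warning in the code above matters for the example PDF: in the outline shown in the question, 'Part I' and 'Item 1. Business' point at the same page object, so a dict keyed by page keeps only one of the two titles. Below is a minimal sketch that keeps every bookmark, assuming the TOC rows are [level, title, page] lists as described in the comment above. The names group_bookmarks_by_page and sample_toc, and the page numbers in the sample, are hypothetical and only for illustration; they are not part of PyMuPDF or taken from the PDF.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def group_bookmarks_by_page(
    toc: Sequence[Sequence],
) -> Dict[int, List[Tuple[int, str]]]:
    """Map each page number to every (level, title) bookmark pointing at it."""
    grouped: Dict[int, List[Tuple[int, str]]] = defaultdict(list)
    # Each row starts with level, title, page; extra trailing fields are ignored.
    for level, title, page, *_ in toc:
        grouped[page].append((level, title))
    return dict(grouped)


# Hypothetical rows shaped like [level, title, page]; page numbers are made up.
sample_toc = [
    [1, "Part I", 12],
    [2, "Item 1. Business", 12],
    [2, "Item 1A. Risk Factors", 21],
]

for page, entries in group_bookmarks_by_page(sample_toc).items():
    for level, title in entries:
        # Indent nested entries by their outline level.
        print(f"p.{page}: {'  ' * (level - 1)}{title}")
```

With a real file, the list returned by the TOC call inside get_bookmarks could be passed to this helper in place of sample_toc. Keeping the level alongside the title also preserves the nesting that the flat page-to-title dict discards.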