I am trying to extract the TOC (outlines) of a PDF, together with the page number of each entry, using Python (PyPDF2). I am aware of reader.outlines,
but it does not return the correct page numbers.
Example PDF: https://www.annualreports.com/HostedData/AnnualReportArchive/l/NASDAQ_LOGM_2018.pdf
The output of reader.outlines is:
[{'/Title': '2018 Highlights', '/Page': IndirectObject(5, 0), '/Type': '/Fit'},
{'/Title': 'Letter to Stockholders', '/Page': IndirectObject(6, 0), '/Type': '/Fit'},
...
{'/Title': 'Part I', '/Page': IndirectObject(10, 0), '/Type': '/Fit'},
[{'/Title': 'Item 1. Business', '/Page': IndirectObject(10, 0), '/Type': '/Fit'},
{'/Title': 'Item 1A. Risk Factors', '/Page': IndirectObject(19, 0), '/Type': '/Fit'}
...
For instance, Part I is not expected to begin at page 10. Am I missing something? Does anyone have an alternative?
I have tried PyMuPDF, Tabula, and the getDestinationPageNumber method, with no luck.
Thank you in advance.
Martin Thoma's answer is exactly what I needed (PyMuPDF). Diblo Dk's answer is an interesting workaround as well (PyPDF2).
I am quoting Martin Thoma's code exactly:
from typing import Dict

import fitz # pip install pymupdf


def get_bookmarks(filepath: str) -> Dict[int, str]:
    # WARNING! One page can have multiple bookmarks!
    bookmarks = {}
    with fitz.open(filepath) as doc:
        toc = doc.getToC() # [[lvl, title, page, …], …]
        for level, title, page in toc:
            bookmarks[page] = title
    return bookmarks


print(get_bookmarks("my.pdf"))
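
As the warning in the code above says, one page can carry several bookmarks, and the dict keeps only the last title for each page. Below is a minimal sketch of a variant that keeps all of them. The names group_titles_by_page and get_bookmarks_by_page are my own, not part of Martin Thoma's answer or of PyMuPDF, and the page numbers in the sample are made up for illustration, not taken from the PDF above. It also assumes a recent PyMuPDF, where the method is named get_toc(); older versions call it getToC(), as in the quoted code.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence


def group_titles_by_page(toc: Iterable[Sequence]) -> Dict[int, List[str]]:
    """Group bookmark titles by page, keeping every title found on a page."""
    grouped: Dict[int, List[str]] = defaultdict(list)
    for entry in toc:
        # Each entry starts with level, title, page; extra items are ignored.
        _level, title, page = entry[:3]
        grouped[page].append(title)
    return dict(grouped)


def get_bookmarks_by_page(filepath: str) -> Dict[int, List[str]]:
    """Hypothetical wrapper: read the TOC with PyMuPDF and group it by page."""
    import fitz  # pip install pymupdf (imported here so the helper above needs no dependency)

    with fitz.open(filepath) as doc:
        # Assumption: recent PyMuPDF, where the method is get_toc().
        return group_titles_by_page(doc.get_toc())


# Illustrative TOC entries; the page numbers are invented for this example.
sample_toc = [
    [1, "Part I", 3],
    [2, "Item 1. Business", 3],
    [2, "Item 1A. Risk Factors", 12],
]
print(group_titles_by_page(sample_toc))
# {3: ['Part I', 'Item 1. Business'], 12: ['Item 1A. Risk Factors']}
```

Keeping the grouping in a separate function means it can be tried on a plain list without opening a PDF. For a real file, get_bookmarks_by_page("my.pdf") would be the call.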