python, pypdf, text-extraction

pypdf text extraction throws IndexError on some PDFs


I'm using Python (v 3.10.11) and pypdf (v 3.17.0) to extract the text from several PDFs.

Recently I ran into a particular kind of file from which I cannot extract text because the library throws an exception.

    File "...\pypdf\_cmap.py", line 481, in type1_alternative
        if words[3] != b"put":
IndexError: list index out of range

(Full code and traceback below.)

The files are not encrypted, and I can read their metadata (which says they were created with TCPDF) and perform other operations. The issue arises whenever I call extract_text on one of their pages.

A sample of non-text-extractable PDFs can be found here.

I have searched for other people or topics reporting this exact exception but haven't found any. However, I think I might be facing a situation like the ones described in "Python text extraction does not work on some pdfs" or "PyPDF2 Font Read Issue".

Looking for other options, I found that PyPDF2 (which, as far as I know, is already deprecated, its maintainers having moved their efforts to pypdf) can extract the text.

Code sample:

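# Swap the two imports to compare the libraries.
from pypdf import PdfReader     # raises the IndexError shown below
# from PyPDF2 import PdfReader  # extracts the text correctly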

reader = PdfReader("example.pdf")
number_of_pages = len(reader.pages)
page = reader.pages[0]
text = page.extract_text()
print(text)

If I run it using PyPDF2, I will get the proper text:

CÁMARA DE COMERCIO...
...
...

If I try to use pypdf I'll get:

Traceback (most recent call last):
    File "...\prueba_pdf\test.py", line 8, in <module>
        text = page.extract_text()
    File "...\prueba_pdf\venv\lib\site-packages\pypdf\_page.py", line 2284, in extract_text
        return self._extract_text(
    File "...\prueba_pdf\venv\lib\site-packages\pypdf\_page.py", line 1903, in _extract_text
        cmaps[f] = build_char_map(f, space_width, obj)
    File "...\prueba_pdf\venv\lib\site-packages\pypdf\_cmap.py", line 29, in build_char_map
        font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
    File "...\prueba_pdf\venv\lib\site-packages\pypdf\_cmap.py", line 54, in build_char_map_from_dict
        map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
    File "...\prueba_pdf\venv\lib\site-packages\pypdf\_cmap.py", line 224, in parse_to_unicode
        return type1_alternative(ft, map_dict, space_code, int_entry)
    File "...\prueba_pdf\venv\lib\site-packages\pypdf\_cmap.py", line 481, in type1_alternative
        if words[3] != b"put":
IndexError: list index out of range

Is there a way to make pypdf work in this scenario? Am I missing something?

P.S. I would prefer to keep using pypdf instead of having more dependencies in my project.
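In the meantime, one way to keep processing a batch of files with pypdf alone is to catch the exception page by page and record which pages failed. This is only a minimal sketch: the helper name extract_text_per_page is hypothetical, and it assumes the failure always surfaces as the IndexError shown in the traceback above.

```python
def extract_text_per_page(pages):
    """Extract text page by page, recording failures instead of stopping.

    `pages` is any iterable of objects with an extract_text() method,
    such as PdfReader("example.pdf").pages. Returns (texts, failures),
    where failures maps a zero-based page index to the exception raised.
    """
    texts = []
    failures = {}
    for index, page in enumerate(pages):
        try:
            texts.append(page.extract_text())
        except IndexError as exc:
            # Assumption: the bug shows up as the IndexError from the
            # traceback. Keep an empty placeholder so that texts[i]
            # still corresponds to page i.
            texts.append("")
            failures[index] = exc
    return texts, failures


# Usage with pypdf (example.pdf is the file from the code sample):
#   from pypdf import PdfReader
#   texts, failures = extract_text_per_page(PdfReader("example.pdf").pages)
#   print(sorted(failures))  # indices of the pages that could not be read
```

This does not recover the text of the affected pages; it only keeps the rest of the run alive and tells me which pages to revisit.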


Solution

  • I submitted a bug report as suggested by @Martin Thoma.

    It was indeed a bug, and it was fixed in pypdf 3.17.2, so text extraction now works on this kind of PDF file.
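To check whether the installed pypdf already includes the fix, the installed version can be compared against 3.17.2. The following is a minimal sketch using only the standard library. The helper names parse_version and has_fix are hypothetical, and the parser is a simplification that handles plain dotted versions only (a pre-release suffix such as "rc1" is ignored).

```python
import re
from importlib.metadata import PackageNotFoundError, version

FIXED_IN = (3, 17, 2)  # the release named in the answer above


def parse_version(text):
    """Turn a dotted version such as '3.17.2' into a tuple of three ints."""
    numbers = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)  # leading digits of each component
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 3:  # pad short versions such as '4.0'
        numbers.append(0)
    return tuple(numbers)


def has_fix(installed):
    """Return True if the given version string is 3.17.2 or newer."""
    return parse_version(installed) >= FIXED_IN


if __name__ == "__main__":
    try:
        installed = version("pypdf")
    except PackageNotFoundError:
        print("pypdf is not installed")
    else:
        status = "includes" if has_fix(installed) else "predates"
        print(f"pypdf {installed} {status} the fix")
```

If the installed version predates the fix, upgrading with `pip install --upgrade pypdf` should bring in a release that contains it.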