pypdf text extraction throws IndexError on some PDFs

I'm using Python (v 3.10.11) and pypdf (v 3.17.0) to extract the text from several PDFs.

Recently I ran into a particular kind of file from which I cannot extract text because the library throws an exception.

    File "...\pypdf\_cmap.py", line 481, in type1_alternative
        if words[3] != b"put":
IndexError: list index out of range

(Full traceback below.)

The files are not encrypted, and I can read their metadata (which says they were created with TCPDF) and perform other operations on them. The issue arises only when I call extract_text on one of their pages.

A sample of non-text-extractable PDFs can be found here.

I have searched for others hitting this exact exception but have not found any. However, I think I might be facing a situation like the ones described in "Python text extraction does not work on some pdfs" or "PyPDF2 Font Read Issue".

Looking for other options, I found that PyPDF2 (which, as far as I know, is already deprecated, its maintainers having moved their efforts to pypdf) can extract the text.

Code sample:

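# Keep exactly one of these two imports active to compare the libraries.
from pypdf import PdfReader      # raises the IndexError shown below
# from PyPDF2 import PdfReader   # extracts the text correctly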

reader = PdfReader("example.pdf")
number_of_pages = len(reader.pages)
page = reader.pages[0]
text = page.extract_text()
print(text)

If I run it using PyPDF2, I will get the proper text:

CÁMARA DE COMERCIO...
...
...

If I try to use pypdf I'll get:

Traceback (most recent call last):
    File "...\prueba_pdf\test.py", line 8, in <module>
        text = page.extract_text()
    File "...\prueba_pdf\venv\lib\site-packages\pypdf\_page.py", line 2284, in extract_text
        return self._extract_text(
    File "...\prueba_pdf\venv\lib\site-packages\pypdf\_page.py", line 1903, in _extract_text
        cmaps[f] = build_char_map(f, space_width, obj)
    File "...\prueba_pdf\venv\lib\site-packages\pypdf\_cmap.py", line 29, in build_char_map
        font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
    File "...\prueba_pdf\venv\lib\site-packages\pypdf\_cmap.py", line 54, in build_char_map_from_dict
        map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
    File "...\prueba_pdf\venv\lib\site-packages\pypdf\_cmap.py", line 224, in parse_to_unicode
        return type1_alternative(ft, map_dict, space_code, int_entry)
    File "...\prueba_pdf\venv\lib\site-packages\pypdf\_cmap.py", line 481, in type1_alternative
        if words[3] != b"put":
IndexError: list index out of range

Is there a way to make pypdf work in this scenario? Am I missing something?

P.S. I would prefer to keep using pypdf instead of having more dependencies in my project.
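
In the meantime, one stopgap would be to wrap the extraction of each page so that a single failing page does not abort the whole run. This is only a sketch: extract_pages_safely is a hypothetical helper of my own, not part of pypdf, and it assumes the failure always surfaces as the IndexError from the traceback above.

```python
from typing import Iterable, List, Optional


def extract_pages_safely(pages: Iterable) -> List[Optional[str]]:
    """Return the text of each page, or None where extraction raised IndexError.

    Hypothetical helper (not a pypdf API). `pages` is any iterable of objects
    that expose an extract_text() method, such as the pages of a PdfReader.
    """
    results: List[Optional[str]] = []
    for page in pages:
        try:
            results.append(page.extract_text())
        except IndexError:
            # Same exception type as in the traceback; record the failure
            # and continue with the remaining pages.
            results.append(None)
    return results
```

With pypdf it would be called as extract_pages_safely(PdfReader("example.pdf").pages). It does not recover the text of the failing pages; it only makes it possible to see which pages are affected.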

Solution

I submitted a bug report as suggested by @Martin Thoma.

It was indeed a bug, and it was fixed in pypdf 3.17.2, so text extraction now works on this kind of PDF file.
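
For anyone who wants to confirm that their environment already has the fix, here is a minimal sketch that compares the installed pypdf version against 3.17.2. The helper names (parse_version, has_fix, installed_pypdf_has_fix) are my own and hypothetical; only importlib.metadata from the standard library is used, and the parsing assumes a plain dotted version string.

```python
from importlib.metadata import PackageNotFoundError, version
from itertools import takewhile

FIXED_IN = (3, 17, 2)  # the release mentioned above as containing the fix


def parse_version(text: str) -> tuple:
    """Turn '3.17.2' into (3, 17, 2), padding missing parts with zeros."""
    parts = []
    for piece in text.split(".")[:3]:
        # Keep only the leading digits, so a piece like '2rc1' becomes 2.
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def has_fix(installed: str) -> bool:
    """True if the given version string is at least 3.17.2."""
    return parse_version(installed) >= FIXED_IN


def installed_pypdf_has_fix() -> bool:
    """Check the pypdf found in the current environment, if there is one."""
    try:
        return has_fix(version("pypdf"))
    except PackageNotFoundError:
        return False
```

Comparing tuples of integers rather than raw strings matters here, because as strings "3.9.1" would sort after "3.17.2". If the check returns False, upgrading with pip install --upgrade pypdf should be enough.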