I'm using Python (v 3.10.11) and pypdf (v 3.17.0) to extract the text from several PDFs.
Recently I ran into a particular kind of file from which I cannot extract text because the library throws an exception.
  File "...\pypdf\_cmap.py", line 481, in type1_alternative
    if words[3] != b"put":
IndexError: list index out of range
(The full traceback and code are further down.)
The files are not encrypted, and I can read their metadata (according to which the files were created with TCPDF) and perform other operations. The issue arises only when I call the extract_text function on one of their pages.
A sample of non-text-extractable PDFs can be found here.
I have searched for other people facing this exact exception but haven't found any. However, I think I might be facing a situation like the ones described in "Python text extraction does not work on some pdfs" or "PyPDF2 Font Read Issue".
Looking for other options, I found that PyPDF2 (which, as far as I know, is already deprecated, its maintainers having moved their efforts to pypdf) can extract the text.
Code sample:
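# Use exactly one of the two imports below.
from pypdf import PdfReader  # raises the IndexError on these files
# from PyPDF2 import PdfReader  # extracts the text correctly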
reader = PdfReader("example.pdf")
number_of_pages = len(reader.pages)
page = reader.pages[0]
text = page.extract_text()
print(text)
If I run it with PyPDF2, I get the proper text:
CÁMARA DE COMERCIO... ... ...
If I try to use pypdf I'll get:
Traceback (most recent call last):
  File "...\prueba_pdf\test.py", line 8, in <module>
    text = page.extract_text()
  File "...\prueba_pdf\venv\lib\site-packages\pypdf\_page.py", line 2284, in extract_text
    return self._extract_text(
  File "...\prueba_pdf\venv\lib\site-packages\pypdf\_page.py", line 1903, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File "...\prueba_pdf\venv\lib\site-packages\pypdf\_cmap.py", line 29, in build_char_map
    font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
  File "...\prueba_pdf\venv\lib\site-packages\pypdf\_cmap.py", line 54, in build_char_map_from_dict
    map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
  File "...\prueba_pdf\venv\lib\site-packages\pypdf\_cmap.py", line 224, in parse_to_unicode
    return type1_alternative(ft, map_dict, space_code, int_entry)
  File "...\prueba_pdf\venv\lib\site-packages\pypdf\_cmap.py", line 481, in type1_alternative
    if words[3] != b"put":
IndexError: list index out of range
Is there a way to make pypdf work in this scenario? Am I missing something?
P.S. I would prefer to keep using pypdf instead of having more dependencies in my project.
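As a stopgap while processing several PDFs in a batch, one option is to catch the exception per page so that a single problematic file does not abort the whole run. This is only a minimal sketch: extract_pages_safely is a hypothetical helper name, not part of pypdf, and it does not recover the missing text; it just records None for pages where extract_text raised the IndexError shown above.

```python
from typing import Iterable, List, Optional


def extract_pages_safely(pages: Iterable) -> List[Optional[str]]:
    """Hypothetical helper: one entry per page, None where extraction failed.

    `pages` is anything iterable whose items expose extract_text(),
    for example reader.pages from a pypdf PdfReader.
    """
    results: List[Optional[str]] = []
    for index, page in enumerate(pages):
        try:
            results.append(page.extract_text())
        except IndexError as exc:
            # The failure reported in the question surfaces as IndexError.
            print(f"page {index}: text extraction failed ({exc})")
            results.append(None)
    return results


# Possible usage, assuming pypdf is installed and example.pdf exists:
#   from pypdf import PdfReader
#   texts = extract_pages_safely(PdfReader("example.pdf").pages)
```

Pages that come back as None can then be logged or retried later without adding another dependency.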
I submitted a bug report as suggested by @Martin Thoma.
Indeed, it was a bug, and it was fixed in version 3.17.2, so text extraction now works on this kind of PDF file.
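For anyone who wants to confirm that their environment already includes the fix, a rough check of the installed version could look like the sketch below. The helper names (parse_version, has_fix, installed_pypdf_has_fix) are made up for illustration, and the parsing is deliberately simplistic: it only looks at the leading digits of the first three dot-separated components.

```python
import re
from importlib import metadata

FIXED_IN = (3, 17, 2)  # release mentioned above as containing the fix


def parse_version(version: str) -> tuple:
    """Turn '3.17.2' into (3, 17, 2); missing or non-numeric parts become 0."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def has_fix(version: str) -> bool:
    """True if the given version string is at least FIXED_IN."""
    return parse_version(version) >= FIXED_IN


def installed_pypdf_has_fix() -> bool:
    """Check the pypdf distribution installed in the current environment."""
    try:
        return has_fix(metadata.version("pypdf"))
    except metadata.PackageNotFoundError:
        return False
```

If the check returns False, upgrading with pip install --upgrade pypdf should bring in a release that includes the fix.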