When trying to extract text from a PDF using pdfminer, I get the following error:
ValueError: unichr() arg not in range(0x110000) (wide Python build)
It appears that a single unrecognized character raises the error before the rest of the text can be extracted: its code point is outside the valid Unicode range (it is not below 0x110000). Most errors of this kind are caused by a narrow Python build, but not in this case.
The error appears to be in the name2unicode function in pdfminer:
<ipython-input-87-ebcd473faf08> in name2unicode(name)
13 if not m:
14 raise KeyError(name)
---> 15 return unichr(int(m.group(0)))
I've found the offending character. Its Unicode code point is far outside the valid range, and I haven't found a corresponding symbol.
pdfminer is set up to skip key errors (the calling function wraps the call in a try/except that passes on KeyError), but it does not catch the ValueError raised when the value is out of range. You can fix this by replacing the original function with one that raises KeyError in that case, as follows:
    import re
    import pdfminer.encodingdb
    from pdfminer.psparser import PSLiteral
    from pdfminer.glyphlist import glyphname2unicode
    from pdfminer.latin_enc import ENCODING

    STRIP_NAME = re.compile(r'[0-9]+')

    def edit_name2unicode(name):
        """Converts Adobe glyph names to Unicode numbers."""
        if name in glyphname2unicode:
            return glyphname2unicode[name]
        m = STRIP_NAME.search(name)
        # Treat a missing number or an out-of-range code point as an unknown glyph,
        # so the caller's "except KeyError" skips it.
        if not m or int(m.group(0)) >= 0x110000:
            raise KeyError(name)
        return unichr(int(m.group(0)))

    # Replace the library function with the patched version.
    pdfminer.encodingdb.name2unicode = edit_name2unicode
Note that the last line assigns the new function over the old one, and it has to run after pdfminer has been imported. This is a runtime workaround; for a process you have to run more than once, I'd change the pdfminer source code instead, especially since pdfminer doesn't have a class structure that you can easily inherit from and override.
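If you'd rather not copy the body of the function, a wrapper that turns the out-of-range error into a KeyError might work as well. This is only a sketch: `original_name2unicode` is a hypothetical stand-in for pdfminer's function (the glyph list lookup is left out), `keyerror_on_bad_codepoint` and `safe_name2unicode` are names made up for illustration, and the `unichr` shim is there just so it also runs on Python 3.

```python
import re

try:
    unichr  # Python 2
except NameError:
    unichr = chr  # Python 3 has no unichr; chr accepts the same range

STRIP_NAME = re.compile(r'[0-9]+')

def original_name2unicode(name):
    """Hypothetical stand-in for pdfminer's name2unicode (glyph list lookup omitted)."""
    m = STRIP_NAME.search(name)
    if not m:
        raise KeyError(name)
    return unichr(int(m.group(0)))

def keyerror_on_bad_codepoint(func):
    """Wrap func so that an out-of-range code point surfaces as KeyError."""
    def wrapper(name):
        try:
            return func(name)
        except (ValueError, OverflowError):
            # unichr/chr rejects values outside the Unicode range
            raise KeyError(name)
    return wrapper

safe_name2unicode = keyerror_on_bad_codepoint(original_name2unicode)
```

With pdfminer imported, the same wrapper could presumably be applied as `pdfminer.encodingdb.name2unicode = keyerror_on_bad_codepoint(pdfminer.encodingdb.name2unicode)`, under the same assumption as above that callers look the function up on that module.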
If, however, there are key errors for characters you want to keep, you can add them to the pdfminer glyph list or add another character set encoding.
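Adding a glyph amounts to putting one more entry in the lookup table before the text is extracted. A rough sketch, assuming the table maps glyph names to Unicode strings; the dict below is a stand-in for `pdfminer.glyphlist.glyphname2unicode`, and `mybullet` is a made-up glyph name:

```python
# Stand-in for pdfminer.glyphlist.glyphname2unicode
# (assumed shape: glyph name -> Unicode string).
glyphname2unicode = {'A': u'\u0041', 'bullet': u'\u2022'}

# Hypothetical custom glyph name used by some PDF font.
glyphname2unicode['mybullet'] = u'\u2022'

def lookup(name):
    """Mimics the first step of name2unicode: consult the glyph table."""
    if name in glyphname2unicode:
        return glyphname2unicode[name]
    raise KeyError(name)
```

Once the name is in the table, the lookup succeeds and never reaches the numeric fallback that raised the error.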