
Python Unicode Error Reading an Arabic PDF into txt


Goal

To convert a PDF file that contains some Arabic text into a UTF-8 text file in Python using pyPdf.

Code

What I have tried:

import pyPdf
import codecs
input_filepath = "hans_wehr_searchable_pdf.pdf"  # pdf file path
output_filepath = "output.txt"  # output text file path
output_file = open(output_filepath, "wb")  # open output file
pdf = pyPdf.PdfFileReader(open(input_filepath, "rb"))  # read PDF
for page in pdf.pages:  # loop through pages
    page_text = page.extractText()  # get text from page
    page_text = page_text.decode(encoding='utf-8')  # decode
    output_file.write(page_text)  # write to file
output_file.close()  # close

Error

However, I receive this error:

Traceback (most recent call last):
  File "pdf2txt.py", line 9, in <module>
    page_text = page_text.decode(encoding='windows-1256')#decode 
  File "/usr/lib/python2.7/encodings/cp1256.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 98: ordinal not in range(128)

Solution

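  • Instead of opening the output file with the built-in open, you could open it with codecs.open and specify the encoding there (it looks like you have already imported codecs). Note that extractText() already returns a unicode object, so calling .decode() on it makes Python 2 first encode it implicitly with the ascii codec, which is what raises the UnicodeEncodeError. Drop that call and let codecs encode the text on write. The PDF itself is binary data, so keep opening it with the plain open in "rb" mode. Your code would change to:

    import pyPdf
    import codecs
    input_filepath = "hans_wehr_searchable_pdf.pdf"  # pdf file path
    output_filepath = "output.txt"  # output text file path
    # codecs.open encodes unicode text to UTF-8 on write
    output_file = codecs.open(output_filepath, "w", encoding='utf-8')
    # a PDF is binary, so read it with the built-in open in "rb" mode
    pdf = pyPdf.PdfFileReader(open(input_filepath, "rb"))
    for page in pdf.pages:  # loop through pages
        page_text = page.extractText()  # already a unicode object, no decode needed
        output_file.write(page_text)  # written to the file as UTF-8
    output_file.close()  # close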
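
  • To check the encoding step on its own, without pyPdf, here is a minimal sketch. It assumes the extracted text is already a unicode string, and it uses io.open, which behaves the same on Python 2.7 and Python 3. The names write_utf8 and sample_text are hypothetical, and sample_text is only a stand-in for whatever page.extractText() returns:

```python
import io
import os
import tempfile


def write_utf8(path, text):
    """Write a unicode string to `path` as UTF-8 (hypothetical helper)."""
    # io.open encodes the text on write, so no manual
    # .encode() or .decode() call is needed.
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


# Stand-in for page.extractText(): a short Arabic word followed by
# U+2122, the character named in the traceback.
sample_text = u"\u0643\u062a\u0627\u0628 \u2122"

output_path = os.path.join(tempfile.mkdtemp(), "output.txt")
write_utf8(output_path, sample_text)

# Read the file back as raw bytes to confirm it holds valid UTF-8.
with io.open(output_path, "rb") as handle:
    raw_bytes = handle.read()
```

    If raw_bytes.decode("utf-8") gives back sample_text, the write path is fine and any remaining garbling is more likely to come from the PDF text extraction itself than from the file encoding.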