Goal
To convert a PDF file that contains some Arabic text into a UTF-8 text file in Python using pyPdf.
Code
What I have tried:
import pyPdf
import codecs

input_filepath = "hans_wehr_searchable_pdf.pdf"  # pdf file path
output_filepath = "output.txt"  # output text file path
output_file = open(output_filepath, "wb")  # open output file
pdf = pyPdf.PdfFileReader(open(input_filepath, "rb"))  # read PDF
for page in pdf.pages:  # loop through pages
    page_text = page.extractText()  # get text from page
    page_text = page_text.decode(encoding='utf-8')  # decode
    output_file.write(page_text)  # write to file
output_file.close()  # close
Error
However, I receive this error:
Traceback (most recent call last):
  File "pdf2txt.py", line 9, in <module>
    page_text = page_text.decode(encoding='windows-1256')#decode
  File "/usr/lib/python2.7/encodings/cp1256.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 98: ordinal not in range(128)
Instead of opening the output file with Python's built-in open, you could try opening it with codecs.open and specifying the encoding, since it looks like you have already imported codecs. page.extractText() already returns a unicode string, so calling .decode() on it makes Python 2 first encode the text with the ascii codec, which is what raises the UnicodeEncodeError; that call can be dropped. The PDF itself is binary data and should still be opened in "rb" mode. Your code would change to:
import pyPdf
import codecs

input_filepath = "hans_wehr_searchable_pdf.pdf"  # pdf file path
output_filepath = "output.txt"  # output text file path
# open the output through codecs so unicode text is encoded to UTF-8 on write
output_file = codecs.open(output_filepath, "w", encoding="utf-8")
pdf = pyPdf.PdfFileReader(open(input_filepath, "rb"))  # read PDF as binary
for page in pdf.pages:  # loop through pages
    page_text = page.extractText()  # get text from page (a unicode string)
    output_file.write(page_text)  # write to file, encoded as UTF-8
output_file.close()  # close
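To see the encoding step in isolation, without needing a PDF or pyPdf at all, here is a minimal sketch. It assumes the page texts are already unicode strings, as extractText() returns them; the helper name write_pages_utf8 and the sample strings are made up for illustration. It uses io.open, which takes an encoding argument much like codecs.open does and works on both Python 2.7 and Python 3.

```python
import io
import os
import tempfile


def write_pages_utf8(page_texts, output_filepath):
    """Write already-decoded (unicode) page texts to a UTF-8 text file."""
    # The file object encodes text to UTF-8 on write, so no manual
    # .encode() or .decode() call is needed on the page text.
    with io.open(output_filepath, "w", encoding="utf-8", newline="\n") as output_file:
        for page_text in page_texts:
            output_file.write(page_text)
            output_file.write(u"\n")


# Hypothetical stand-ins for the unicode strings a PDF library would return:
# the Arabic word "kitab" (book) and a string containing u'\u2122',
# the character from the traceback.
sample_pages = [u"\u0643\u062a\u0627\u0628", u"trademark \u2122"]
output_path = os.path.join(tempfile.mkdtemp(), "output.txt")
write_pages_utf8(sample_pages, output_path)

# Read the file back as raw bytes to check what was stored on disk.
with io.open(output_path, "rb") as raw_file:
    raw_bytes = raw_file.read()
```

Decoding raw_bytes as UTF-8 gives back the original strings, which suggests that non-ASCII characters pass through cleanly once the file handle, rather than a manual .decode() call, is responsible for the encoding.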