Search code examples
pythonpdfutf-8character-encodingarabic

How to read Arabic text from PDF using Python script


I have a code written in Python that reads from PDF files and convert it to text file.

The problem occurred when I tried to read Arabic text from PDF files. I know that the error is in the coding and encoding process but I don't know how to fix it.

The system converts Arabic PDF files but the text file is empty. and display this error:

Traceback (most recent call last): File "C:\Users\test\Downloads\pdf-txt\text maker.py", line 68, in f.write(content) UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 50: ordinal not in range(128)

Code:

import os
from os import chdir, getcwd, listdir, path
import codecs
import pyPdf
from time import strftime

def check_path(prompt):
    ''' (str) -> str
    Verifies if the provided absolute path does exist.
    '''
    abs_path = raw_input(prompt)
    while path.exists(abs_path) != True:
        print "\nThe specified path does not exist.\n"
        abs_path = raw_input(prompt)
    return abs_path    

print "\n"

folder = check_path("Provide absolute path for the folder: ")

list=[]
directory=folder
for root,dirs,files in os.walk(directory):
    for filename in files:
        if filename.endswith('.pdf'):
            t=os.path.join(directory,filename)

            list.append(t)

m=len(list)
print (m)
i=0
while i<=m-1:

    path=list[i]
    print(path)
    head,tail=os.path.split(path)
    var="\\"

    tail=tail.replace(".pdf",".txt")
    name=head+var+tail

    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
            # Iterate pages
    for j in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(j).extractText() + "\n"
    print strftime("%H:%M:%S"), " pdf  -> txt "
    f=open(name,'w')
    content.encode('utf-8')
    f.write(content)
    f.close
    i=i+1

Solution

  • You have a couple of problems:

    1. content.encode('utf-8') doesn't do anything. The return value is the encoded content, but you have to assign it to a variable. Better yet, open the file with an encoding, and write Unicode strings to that file. content appears to be Unicode data.

    Example (works for both Python 2 and 3):

     import io
     f = io.open(name,'w',encoding='utf8')
     f.write(content)
    
    1. If you don't close the file properly, you may see no content because the file is not flushed to disk. You have f.close not f.close(). It's better to use with, which ensures the file is closed when the block exits.

    Example:

    import io
    with io.open(name,'w',encoding='utf8') as f:
        f.write(content)
    

    In Python 3, you don't need to import and use io.open but it still works. open is equivalent. Python 2 needs the io.open form.