
UnicodeEncodeError when reading pdf with pyPdf


I posted an earlier question about the pyPdf Python tool. Please don't mark this as a duplicate, because this time I get the error shown below.

  import sys
  import pyPdf

  def convertPdf2String(path):
      content = ""
      # load PDF file
      pdf = pyPdf.PdfFileReader(file(path, "rb"))
      # iterate pages
      for i in range(0, pdf.getNumPages()):
          # extract the text from each page
          content += pdf.getPage(i).extractText() + " \n"
      # collapse whitespaces
      content = u" ".join(content.replace(u"\xa0", u" ").strip().split())
      return content

  # convert contents of a PDF file and store the result in a TXT file
  f = open('a.txt','w+')
  f.write(convertPdf2String(sys.argv[1]))
  f.close()

  # or print contents to the standard out stream
  print convertPdf2String("/home/tom/Desktop/Hindi_Book.pdf").encode("ascii", "xmlcharrefreplace")

For the first PDF file I get this error: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128). For this PDF, http://www.envis-icpe.com/pointcounterpointbook/Hindi_Book.pdf, I get the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 38: ordinal not in range(128)

How can I resolve this?


Solution

  • I tried it myself and got the same result. Ignore my earlier comment; I hadn't seen that you're also writing the output to a file. This is the problem:

    f.write(convertPdf2String(sys.argv[1]))
    

    convertPdf2String returns a Unicode string, but file.write can only write bytes, so the call to f.write implicitly tries to encode the Unicode string as ASCII. Because the PDF contains non-ASCII characters, that conversion fails. Encode the string explicitly before writing, for example:

    f.write(convertPdf2String(sys.argv[1]).encode("utf-8"))
    # or
    f.write(convertPdf2String(sys.argv[1]).encode("ascii", "xmlcharrefreplace"))
    

    EDIT:

    Here is the working source code, with only one line changed:

    # Execute with "Hindi_Book.pdf" in the same directory
    import sys
    import pyPdf
    
    def convertPdf2String(path):
        content = ""
        # load PDF file
        pdf = pyPdf.PdfFileReader(file(path, "rb"))
        # iterate pages
        for i in range(0, pdf.getNumPages()):
            # extract the text from each page
            content += pdf.getPage(i).extractText() + " \n"
        # collapse whitespaces
        content = u" ".join(content.replace(u"\xa0", u" ").strip().split())
        return content
    
    # convert contents of a PDF file and store the result in a TXT file
    f = open('a.txt','w+')
    f.write(convertPdf2String(sys.argv[1]).encode("ascii", "xmlcharrefreplace"))
    f.close()
    
    # or print contents to the standard out stream
    print convertPdf2String("Hindi_Book.pdf").encode("ascii", "xmlcharrefreplace")
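
The encoding behaviour described above can be checked without a PDF at all. The following is only a minimal sketch: the sample string and the temporary output path are made up for illustration and are not taken from the real file. It reproduces the ASCII failure, shows what the two explicit encodings produce, and, as an alternative not mentioned above, uses io.open with an encoding argument so the file object does the encoding itself (this should behave the same under Python 2.6+ and Python 3).

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical stand-in for the Unicode text extracted from the PDF.
text = u"fa\xe7ade \u0939"

# Encoding to ASCII with the default "strict" handler fails on non-ASCII
# characters; this is the conversion Python 2 attempts implicitly in f.write.
try:
    text.encode("ascii")
    ascii_strict_ok = True
except UnicodeEncodeError:
    ascii_strict_ok = False

# Fix 1: encode explicitly as UTF-8, which can represent every character.
utf8_bytes = text.encode("utf-8")

# Fix 2: stay ASCII-only and replace other characters with XML character references.
ascii_bytes = text.encode("ascii", "xmlcharrefreplace")

# Alternative: open the file in text mode with an explicit encoding,
# so Unicode strings can be written directly.
out_path = os.path.join(tempfile.mkdtemp(), "a.txt")
with io.open(out_path, "w", encoding="utf-8") as f:
    f.write(text)

# Read the file back to confirm nothing was lost.
with io.open(out_path, "r", encoding="utf-8") as f:
    round_trip = f.read()
```

With UTF-8 the text round-trips unchanged, while xmlcharrefreplace yields plain ASCII in which ç becomes &#231;. Which one is preferable depends on what will read a.txt afterwards.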