python string unicode encode

Python encoding error when searching a string


I get the following error when I try to search the string below for a word.

ERROR:

SyntaxError: Non-ASCII character '\xd8' in file Hadith_scraper.py on line 44, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

STRING:

دَّثَنَا عَبْدَانُ، قَالَ أَخْبَرَنَا عَبْ

CODE:

arabic_hadith = "دَّثَنَا عَبْدَانُ، قَالَ أَخْبَرَنَا عَبْ"
arabic_hadith.encode('utf8')
print arabic_hadith
if "الجمعة" in arabic_hadith:
    day = "5"
else:
    day = ""

Solution

  • You have a byte string, not a unicode value. Trying to encode a byte string in Python 2 means that Python will first try to decode it to unicode so that it can then encode it. That implicit decode uses the default ASCII codec, so it fails on any non-ASCII byte.
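
The hidden decode step can be made explicit. The following is only a sketch, written so that it also runs on Python 3; the variable names are illustrative and not part of the original script. It builds the UTF-8 bytes that a Python 2 byte-string literal would hold and then performs the ASCII decode by hand:

```python
# -*- coding: utf-8 -*-
# Sketch: what Python 2 does behind the scenes for byte_string.encode('utf8').
# The variable names here are illustrative, not taken from the original script.

# UTF-8 bytes, as a plain "..." literal would hold them in a Python 2 source file.
raw = u"الجمعة".encode("utf-8")

# The first byte is 0xD8, the same kind of byte the SyntaxError complains about.
first_byte = raw[:1]

try:
    raw.decode("ascii")            # the implicit first step of str.encode in Python 2
    implicit_decode_failed = False
except UnicodeDecodeError:
    implicit_decode_failed = True  # non-ASCII bytes cannot be decoded as ASCII

# Decoding explicitly with the codec the bytes were written in works fine.
text = raw.decode("utf-8")
```

Naming the codec explicitly, or using a `u"..."` literal in the first place, avoids relying on the ASCII default.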

    Use unicode values here instead, and make sure you set the codec at the top of the file first. See PEP 263 - Defining Python Source Code Encodings (which your error message pointed you to).

    Note that there is no need to encode to UTF-8 here; that would only complicate text comparisons:

    # encoding: utf8
    arabic_hadith = u"دَّثَنَا عَبْدَانُ، قَالَ أَخْبَرَنَا عَبْ"
    print arabic_hadith
    if u"الجمعة" in arabic_hadith:
        day = "5"
    else:
        day = ""
    

    Rule of thumb: decode bytes from incoming sources (files, network data) to Unicode, process only Unicode in your program, and only encode again for any outgoing data.
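
As a rough illustration of that rule of thumb, the sketch below decodes at the input boundary, works on Unicode in the middle, and encodes only on the way out. The function name `friday_marker` and the assumption that the incoming data is UTF-8 are hypothetical and chosen only for this example:

```python
# -*- coding: utf-8 -*-
# Sketch of the decode / process / encode pattern.
# friday_marker is a hypothetical helper; UTF-8 input is an assumption.

def friday_marker(raw_bytes, encoding="utf-8"):
    """Return b"5" if the text mentions Friday (الجمعة), else b""."""
    text = raw_bytes.decode(encoding)           # 1. decode incoming bytes
    day = u"5" if u"الجمعة" in text else u""    # 2. process Unicode only
    return day.encode(encoding)                 # 3. encode outgoing data


# Simulated incoming data, e.g. bytes read from a file or a socket.
incoming = u"خطبة الجمعة".encode("utf-8")
outgoing = friday_marker(incoming)
```

Keeping the comparison in the middle step purely Unicode means the substring test never depends on how the bytes happened to be encoded.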

    I urge you to read up on Unicode and how Python handles it before you continue.