python unicode arabic

Finding an Arabic word in a string gives error: 'ascii' codec can't decode


I wrote this function to check whether a Persian month name occurs in a unicode string and, if it does, replace it with the month number. I use this encoding declaration in the header:

`#!/usr/bin/python
# -*- coding: utf-8 -*-`

This is my function to convert the month:

def changeData(date):
                if date:
                    date.encode('utf-8')
                    if "فروردین".encode('utf-8') in date:
                        return str.replace(":فروردین", ":1")
                    elif "اردیبهشت".encode('utf-8') in date:
                        return str.replace(":اردیبهشت", ":2")
                    elif "خرداد".encode('utf-8') in date:
                        return str.replace(":خرداد", ":3")
                    elif "تیر".encode('utf-8') in date:
                        return str.replace(":تیر", ":41")
                    elif "مرداد".encode('utf-8') in date:
                        return str.replace(":مرداد", ":5")
                    elif "شهریور".encode('utf-8') in date:
                        return str.replace(":شهریور", ":6")
                    elif "مهر".encode('utf-8') in date:
                        return str.replace(":مهر", ":7")
                    elif "آبان".encode('utf-8') in date:
                        return str.replace(":آبان", ":8")
                    elif "آذر".encode('utf-8') in date:
                        return str.replace(":آذر", ":9")
                    elif "دی".encode('utf-8') in date:
                        return str.replace(":دی", ":10")
                    elif "بهمن".encode('utf-8') in date:
                        return str.replace(":بهمن", ":11")
                    elif "اسفند".encode('utf-8') in date:
                        return str.replace(":اسفند", ":12")

I pass the date to the function as a unicode string and then call encode('utf-8') on it, but I get this error:

if "فروردین".encode('utf-8') in date:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd9 in position 0: ordinal not in range(128)

How can I solve this problem?


Solution

  • I assume Python 2.7.


    So:

    "فروردین".encode('utf-8') # UnicodeDecodeError: 'ascii' codec can't decode byte 0xd9 in position 0: ordinal not in range(128)
    

    The problem is that in Python 2.7 a plain string literal is a byte string (str), not unicode:

    print(repr("فروردین")) # '\xd9\x81\xd8\xb1\xd9\x88\xd8\xb1\xd8\xaf\xdb\x8c\xd9\x86'
    

    With the following code:

    "فروردین".encode('utf-8')
    

    you're trying to encode bytes, which is logically incorrect because:

    ENCODING: unicode --> bytes 
    DECODING: bytes --> unicode 
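The two directions can be seen in a short round trip. This is only an illustrative sketch; the variable names `text`, `raw` and `restored` are mine, and the u-literal is used so that the snippet behaves the same on Python 2.7 and Python 3:

```python
# -*- coding: utf-8 -*-

text = u"فروردین"                # unicode text, 7 characters
raw = text.encode('utf-8')       # ENCODING: unicode --> bytes (14 bytes)
restored = raw.decode('utf-8')   # DECODING: bytes --> unicode

# The round trip gives back the original text.
same = (restored == text)
```

Each character of this word takes two bytes in UTF-8, which is why the byte string is twice as long as the text.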
    

    However, when you call encode on a byte string, Python 2 does not raise something like a TypeError.
    Instead, it first tries to decode the given bytes to unicode implicitly and only then applies the encoding you asked for.
    The problem is that this implicit decoding uses the default encoding, which is ASCII in Python 2. The bytes of a Persian word are outside the ASCII range, so the program terminates with the UnicodeDecodeError.

    The implicit decoding is equivalent to:

    unicode("فروردین") # UnicodeDecodeError: 'ascii' codec can't decode byte 0xd9 in position 0: ordinal not in range(128)
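To make the two hidden steps visible, here is a rough sketch that performs them by hand. The helper name `encode_like_python2` is hypothetical and only mimics what I described above; it is not how the interpreter is implemented. The month name is written as a bytes literal so the snippet gives the same result on Python 2.7 and Python 3:

```python
# -*- coding: utf-8 -*-

# UTF-8 bytes of the month name, as shown by repr() above.
raw = b'\xd9\x81\xd8\xb1\xd9\x88\xd8\xb1\xd8\xaf\xdb\x8c\xd9\x86'

def encode_like_python2(byte_string, target='utf-8'):
    """Mimic str.encode() on a Python 2 byte string (illustration only)."""
    # Step 1: implicit decoding with the default codec (ASCII in Python 2).
    text = byte_string.decode('ascii')
    # Step 2: the encoding the caller actually asked for.
    return text.encode(target)

try:
    encode_like_python2(raw)
    error_name = None
    message = ''
except UnicodeDecodeError as exc:
    # Step 1 fails on the very first byte, 0xd9, before step 2 ever runs.
    error_name = type(exc).__name__
    message = str(exc)
```

The failure comes from the first step, which is why an encode call ends up reporting a decode error.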
    

    So you shouldn't encode a byte string. You have to DECODE it in order to get unicode:

    u = "فروردین".decode('utf-8') 
    print(type(u)) # <type 'unicode'>
    

    Another way to get unicode is to use a u-literal together with an encoding declaration:

    # coding: utf-8
    
    u = u"فروردین"
    print(type(u)) # <type 'unicode'> 
    
    print(u == "فروردین".decode('utf-8')) # True
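Putting this together, the function from the question could be rewritten roughly as follows. This is a sketch under assumptions, not a drop-in fix: the names `MONTHS` and `change_date` are mine, I assume the month name is always preceded by a colon as in the original replace calls, and I return the input unchanged when no month is found. It keeps everything as unicode and decodes only if a byte string is passed in:

```python
# -*- coding: utf-8 -*-

# Persian month names in calendar order, all as unicode literals.
MONTHS = [
    u"فروردین", u"اردیبهشت", u"خرداد", u"تیر", u"مرداد", u"شهریور",
    u"مهر", u"آبان", u"آذر", u"دی", u"بهمن", u"اسفند",
]

def change_date(date):
    """Replace ':<month name>' with ':<month number>' in a date string."""
    if not date:
        return date
    # Decode byte strings once, up front (assumes they are UTF-8).
    if isinstance(date, bytes):
        date = date.decode('utf-8')
    for number, name in enumerate(MONTHS, start=1):
        marker = u":" + name
        if marker in date:
            # Call replace on the date itself, not on the str type.
            return date.replace(marker, u":%d" % number)
    return date  # assumption: leave the text untouched if nothing matches
```

Besides the encoding issue, this also changes two things in the original: replace is called on the date value instead of on str, and the month number comes from the position in the list rather than from hand-written constants.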