Tags: python, utf-8, ascii, decode, encode

Decoding a string in Python 3


How can I convert

str1 = 'Sabrau00AE Family Size Roasted Pine Nut Hummus - 17 oz' 

to

final_str = 'Sabra® Family Size Roasted Pine Nut Hummus - 17 oz'

in Python 3?

I have tried:

  1. str1.encode('utf-8') and html.unescape
  2. str1.encode('utf-8').decode('unicode_escape')
  3. str1.encode('utf-8').decode('ascii')

But no luck.

The output of isinstance(str1, str) is True. The output of str1.encode('utf-8') is the bytes string b'Sabrau00AE Family Size Roasted Pine Nut Hummus - 17 oz'.

I also imported charade, but calling decode and encoding on the string raised errors:

AttributeError: 'str' object has no attribute 'decode'  
AttributeError: 'str' object has no attribute 'encoding'
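
A minimal sketch of why these attempts leave the text unchanged, assuming str1 really contains the bare characters u00AE with no backslash in front: unicode_escape then has no escape sequence to decode, and in Python 3 only bytes objects have a decode method.

```python
str1 = 'Sabrau00AE Family Size Roasted Pine Nut Hummus - 17 oz'

# Attempt 2: with no backslash before 'u00AE' there is no escape sequence,
# so the encode/decode round trip returns the text unchanged.
round_trip = str1.encode('utf-8').decode('unicode_escape')
print(round_trip == str1)  # True

# In Python 3, str has encode() but no decode(); only bytes has decode().
print(hasattr(str1, 'decode'))                  # False
print(hasattr(str1.encode('utf-8'), 'decode'))  # True
```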

Solution

  • Thanks, @Mark Tolonen, for the help with the regex. With your version, the 'u' was left in the name along with the decoded symbol, so I fixed that edge case with the code below by:

    1. Finding each substring made of 'u' followed by four uppercase hex digits (0-9, A-F).
    2. Turning that substring into a Unicode escape sequence by replacing 'u' with '\u'.
    3. Decoding the result with unicode-escape.

    The code below works:

    import re

    def convert(s):
        # Earlier attempt (left the 'u' in the output):
        # return re.sub(r'[0-9A-F]{4}', lambda m: chr(int(m.group(), 16)), s)
        return re.sub(r'u[0-9A-F]{4}', lambda m: m.group().replace('u', '\\u'), s).encode('utf-8').decode('unicode-escape')
    

    Input:

    str1 = 'Sabrau00AE Family Size Roasted Pine Nut Hummus - 17 oz'
    

    Code:

    str2 = convert(str1)
    print(str2)
    print(type(str2))
    

    Output:

    Sabra® Family Size Roasted Pine Nut Hummus - 17 oz
    <class 'str'>
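
One caveat with the unicode-escape round trip: it decodes every non-escape byte as Latin-1, so any non-ASCII character already present in the input would come out garbled. A hedged alternative sketch converts each match directly with chr() instead. The name convert_direct is hypothetical, and the sketch makes the same assumption as convert: every 'u' followed by four uppercase hex digits is a stripped escape.

```python
import re

def convert_direct(s):
    # Capture the four hex digits after 'u' and replace the whole match
    # (including the 'u') with the corresponding character.
    return re.sub(r'u([0-9A-F]{4})', lambda m: chr(int(m.group(1), 16)), s)

print(convert_direct('Sabrau00AE Family Size Roasted Pine Nut Hummus - 17 oz'))
# Sabra® Family Size Roasted Pine Nut Hummus - 17 oz

# Characters that are already non-ASCII pass through untouched.
print(convert_direct('Café Sabrau00AE'))
# Café Sabra®
```

Like convert, this would also rewrite an ordinary word that happens to contain 'u' followed by four such characters (for example a name ending in 'u' followed by a four-digit year), so it is only safe when the input is known not to contain such text.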