Tags: python, json, html-parsing, decode, encode

Issue decoding strings in Python


I have a set of strings that need to be decoded. The string format varies with the products on the site, so it is pretty unpredictable. A few examples of the format are given below:

 1. longDescription":"\u003cul\u003e  \u003cli\u003eTender grill’d bites made " (unicode and symbol combination)

 2. longDescription":"Goodness You Can See™" (all decoded, to be picked as is)

 3. longDescription":"With a wide variety of headphones,  \u003cbr /\u003e \u003cb\u003e\u003cbr /\u003eBlackWeb Flat CAT6 Network Cable:\u003c/b\u003e \u003cbr /\u003e \u003cul\u003e  \u003cli\u003eFlat CAT6 Network Cable\u003c/li\u003e  \u003cli\u003eLength: 14'\u003c/li\u003e  \u003cli\u003eUltra-slim design\u003c/li\u003e  \u003cli\u003e1GBPS"  (all unicode)

Basically, I want to extract this longDescription key (in the backend data), which is rendered as the bulleted list on the front end, from product pages like https://www.walmart.com/ip/Friskies-Gravy-Wet-Cat-Food-Warm-d-Serv-d-Grill-d-Bites-With-Shrimp-3-5-oz-Pouch/842464118

I have tried the code below:

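import HTMLParser  # Python 2 standard-library module

if '\\u' in longdescription:
    try:
        # Resolve literal \uXXXX escapes, then re-encode to a byte string
        temp['Key_Features'] = longdescription.decode("unicode-escape").encode()
    except Exception as e:
        # Fall back to HTML-entity unescaping if the escape decoding fails
        temp['Key_Features'] = HTMLParser.HTMLParser().unescape(longdescription)
else:
    temp['Key_Features'] = longdescription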

I have tried each of these approaches separately; the code above is a combination of them. It works for most cases, but in cases like the first one it also decodes the ’ symbol (or any other non-ASCII symbol), and my output becomes:

Tender grillâd bites  (see the change in grill'd)

This code has a dependency on Python 2, so I am requesting a solution in Python 2. I am also fine with HTML tags appearing in the output; I just need code that works for all three cases. Thanks.


Solution

  • This is now fixed by moving to Python 3. I used the code below for the conversion:

    temp['Key_Features'] = longDescription.encode().decode('unicode-escape').encode('latin1').decode('utf8').replace('&amp;', '&').replace('&nbsp;', '').replace('&quot;', '"')

    This happened because the data mixed different encoding formats (literal \uXXXX escapes alongside already-decoded symbols) and couldn't be handled by a single encode/decode step. The above logic worked for all the cases.
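
To illustrate why a single decode step is not enough, here is a minimal Python 3 sketch of the encoding round trip used in the answer. The function names `naive_decode` and `decode_escapes` and the sample strings are illustrative assumptions, not part of the original code. The `unicode-escape` codec resolves literal `\uXXXX` sequences but interprets every other byte as Latin-1, so a UTF-8 encoded ’ comes out as mojibake; re-encoding as Latin-1 and decoding as UTF-8 reverses that.

```python
def naive_decode(raw):
    """Single-step decode: resolves \\uXXXX escapes but corrupts non-ASCII text."""
    # The UTF-8 bytes of a character such as ’ are each read as one
    # Latin-1 character, which produces mojibake like "grillâ..d".
    return raw.encode("utf-8").decode("unicode-escape")


def decode_escapes(raw):
    """Round trip: resolve \\uXXXX escapes and keep existing non-ASCII text intact."""
    # Step 1: same as the naive version (escapes resolved, the rest mangled).
    mangled = raw.encode("utf-8").decode("unicode-escape")
    # Step 2: map each character back to its original byte via Latin-1,
    # then decode those bytes as the UTF-8 they really were.
    return mangled.encode("latin-1").decode("utf-8")


if __name__ == "__main__":
    # Hypothetical samples modeled on the three cases in the question.
    samples = [
        r"\u003cul\u003e  \u003cli\u003eTender grill’d bites made ",
        "Goodness You Can See™",
        r"\u003cli\u003eFlat CAT6 Network Cable\u003c/li\u003e",
    ]
    for sample in samples:
        print(repr(naive_decode(sample)))
        print(repr(decode_escapes(sample)))
```

One caveat, as far as I can tell: if the input contains an escape for a code point above U+00FF (for example `\u2122`), the `encode('latin-1')` step raises `UnicodeEncodeError`, so this sketch only covers inputs like the three examples, where the escaped characters are ASCII such as `<` and `>`.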