I have a set of strings that need to be decoded. The string format varies between products on the site, so it is fairly unpredictable. A few examples of the format are given below:
1. longDescription":"\u003cul\u003e \u003cli\u003eTender grill’d bites made " (unicode and symbol combination)
2. longDescription":"Goodness You Can See™" (all decoded, to be picked as is)
3. longDescription":"With a wide variety of headphones, \u003cbr /\u003e \u003cb\u003e\u003cbr /\u003eBlackWeb Flat CAT6 Network Cable:\u003c/b\u003e \u003cbr /\u003e \u003cul\u003e \u003cli\u003eFlat CAT6 Network Cable\u003c/li\u003e \u003cli\u003eLength: 14'\u003c/li\u003e \u003cli\u003eUltra-slim design\u003c/li\u003e \u003cli\u003e1GBPS" (all unicode)
Basically, I want to extract this longDescription key from the backend data (it is rendered as the bulleted list on the front end) for products like https://www.walmart.com/ip/Friskies-Gravy-Wet-Cat-Food-Warm-d-Serv-d-Grill-d-Bites-With-Shrimp-3-5-oz-Pouch/842464118
I have tried the code below:
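import HTMLParser  # Python 2 standard library module

if '\\u' in longdescription:
    try:
        # temp['Key_Features'] = longdescription
        temp['Key_Features'] = longdescription.decode("unicode-escape").encode()
    except Exception as e:
        # Fall back to HTML entity unescaping if the escape decoding fails
        temp['Key_Features'] = HTMLParser.HTMLParser().unescape(longdescription)
else:
    # No \u escapes present, so keep the string as is
    temp['Key_Features'] = longdescription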
I have tried each of these approaches separately; the code above is a combination of them. It works for most cases, but in cases like the first one it also re-encodes the ’ symbol (or any other non-ASCII symbol), and my output becomes:
Tender grillâd bites (see the change in grill'd)
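The garbling can be reproduced in isolation. Below is a minimal sketch, in Python 3 syntax and with a hypothetical sample string modelled on the first example, of what is probably happening: the unicode-escape codec reads every byte that is not part of an escape as one Latin-1 character, so the three UTF-8 bytes of ’ become three separate characters, two of which are invisible control characters.

```python
# Hypothetical sample: literal backslash-u escapes plus a real U+2019 apostrophe.
sample = '\\u003cli\\u003eTender grill\u2019d bites'

raw = sample.encode('utf-8')            # the apostrophe becomes the bytes E2 80 99
garbled = raw.decode('unicode-escape')  # each of those bytes is read as one Latin-1 character

print(repr(garbled))
```

The escapes are resolved to < and >, but the apostrophe comes back as â followed by two control characters, which matches the output above.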
We have a dependency on Python 2 for this code, so I am requesting a solution in Python 2. Also, I am OK with HTML tags appearing in the output. I just need code that works for all three cases. Thanks.
This is fixed in Python 3 now. I used the code below to convert:
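One possible direction for the Python 2 constraint is to replace only the literal \uXXXX sequences and leave every other character alone. The sketch below is an assumption about what would fit, not a tested fix for the site: decode_u_escapes is a hypothetical helper name, and the unichr fallback is there only so the same code runs on both Python 2 and Python 3.

```python
import re

try:
    unichr  # exists on Python 2
except NameError:
    unichr = chr  # Python 3 equivalent

# Matches a literal backslash, the letter u, and exactly four hex digits.
_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')

def decode_u_escapes(text):
    # Hypothetical helper: byte strings are assumed to be UTF-8 encoded.
    if isinstance(text, bytes):
        text = text.decode('utf-8')
    # Only the matched escapes are converted; already-decoded symbols pass through.
    return _ESCAPE.sub(lambda m: unichr(int(m.group(1), 16)), text)

mixed = decode_u_escapes('\\u003cli\\u003eTender grill\u2019d bites')
plain = decode_u_escapes('Goodness You Can See\u2122')
print(mixed)
print(plain)
```

Because nothing is round-tripped through a byte codec, symbols such as ’ and ™ are never reinterpreted. One limitation of this sketch: escapes that come in surrogate pairs (characters outside the Basic Multilingual Plane) are not recombined.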
temp['Key_Features'] = longDescription.encode().decode('unicode-escape').encode('latin1').decode('utf8').replace('&amp;', '&').replace('&nbsp;', '').replace('&quot;', '"')
This happened because the data mixed different encoding formats (literal \u escapes alongside already-decoded characters), which could not be handled by a single encode/decode step. The above logic works for all of them.
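To make the chain easier to follow, here is the same round trip split into steps. This is a sketch with a hypothetical sample string; the variable names are illustrative only.

```python
# Hypothetical sample: literal backslash-u escapes plus a real U+2019 apostrophe.
sample = '\\u003cli\\u003eTender grill\u2019d bites'

step1 = sample.encode('utf-8')          # text -> UTF-8 bytes
step2 = step1.decode('unicode-escape')  # resolves \u003c etc., but reads E2 80 99 as Latin-1
step3 = step2.encode('latin1')          # maps those characters back to the original bytes
result = step3.decode('utf-8')          # UTF-8 bytes -> correct text

print(result)  # <li>Tender grill’d bites

# Edge case: an escape above \u00ff cannot pass through the latin1 step.
try:
    '\\u2122'.encode('utf-8').decode('unicode-escape').encode('latin1')
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)  # False
```

The Latin-1 step works because Latin-1 maps each character from U+0000 to U+00FF to exactly one byte, so the misread bytes are recovered unchanged. The second part shows one case worth checking against real data: if a description ever contains an escaped character above \u00ff (for example \u2122 instead of a literal ™), the latin1 step raises UnicodeEncodeError.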