I have a dataset with JSON records like the one below:
{"reviewerID": "A3SSQRWUP2A04Q", "asin": "B0000224UE", "reviewerName": "Wilson", "helpful": [0, 0], "reviewText": "I love this tool. Yes it is heavy. Yes you can wear it on your belt and no one will notice. Really. And yes, you get a great amount of tools. I already always carry a Swiss Army Knife on me, so I went with this model to get the serrated blade, and not the scissors.It is perfectly aligned, buttery smooth, nicer than my Leatherman Rebar, and yes, bigger and heavier. The tools come out on the outside, and lock with a satisfying "snick." The release is easier to use than the Leatherman Rebar. Which itself is a nice tool, I carry that in my daily work messenger bag.The pliers are strong and super easy. One nice touch -- the ruler, inches/centimeters, is far easier to read than the Leatherman. The result I think of the brightly polished steel.", "overall": 5.0, "summary": "Top of the line heavy multi tool", "unixReviewTime": 1396224000, "reviewTime": "03 31, 2014"}
When I search for the symbols " in Google, it automatically converts them into the double quote (") symbol. What form is this in, and how do I get them into a readable format for something like nltk?
Those are HTML character entities, i.e. the HTML-encoded forms of those characters. See here for a table: https://www.toptal.com/designers/htmlarrows/symbols/
In Python, you can convert these to normal characters with the built-in html module.
import html
normal_string = html.unescape(string_with_html_entities)
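For records like the one in the question, a minimal sketch might look like the following. It assumes each record is one JSON object per line and that only the reviewText and summary fields need decoding; load_review is a hypothetical helper name, and the sample line (including the &#34; form of the entity) is made up for illustration.

```python
import html
import json


def load_review(line: str) -> dict:
    """Parse one JSON record and decode HTML entities in its text fields."""
    record = json.loads(line)
    # Assumption: only these two fields hold free text that may contain entities.
    for key in ("reviewText", "summary"):
        if isinstance(record.get(key), str):
            record[key] = html.unescape(record[key])
    return record


# Hypothetical sample line; the real dataset may use a different entity form.
sample = '{"reviewText": "lock with a satisfying &#34;snick.&#34;", "summary": "Top of the line heavy multi tool"}'
review = load_review(sample)
print(review["reviewText"])  # lock with a satisfying "snick."
```

html.unescape handles both numeric references such as &#34; and named ones such as &quot;, so the decoded text can be passed straight to a tokenizer.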