I call open(file, "r") and read some lines in Python. This gives me:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 4: ordinal not in range(128)
If I add 'utf-8', I get:
'utf8' codec can't decode bytes in position 28-29: invalid continuation byte
If I add 'ISO-8859-1', I get no errors but a line is read like this:
2890 ready to try Argh� Fantasy Surfer Carnage� Dane, Marlon & Nat C all out! #fantasysurfer
As you can see, there are some extra characters, which probably come from emojis or something similar (these are tweets).
What is the best approach to clean these lines up?
I would like to remove all the extraneous elements, so that the strings contain only numbers, letters, and common symbols (?!>.;, etc.).
Note: I don't care about the html entities, since I replace those in another function. I am talking about the weird Argh� Carnage� elements.
In general, these are causing issues with the encoding.
First try decoding and then encoding:
'text'.decode('latin-1').encode('utf-8')  # 'text' must be a byte string (str in Python 2), not a unicode object
Or try opening the file with codecs:
import codecs
with codecs.open('file', encoding="your encoding") as f:
    lines = f.readlines()  # each line is decoded to unicode as it is read
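If the file contains a few bytes that are invalid in the chosen encoding, the open call can also be told how to handle them. Here is a minimal sketch, assuming Python 3 (io.open also exists in Python 2.6+). The helper name read_lines and the demo bytes are made up for illustration; errors="replace" swaps each undecodable byte for U+FFFD instead of raising.

```python
import io
import os
import tempfile


def read_lines(path, encoding="utf-8", errors="replace"):
    """Hypothetical helper: read a file as text, line by line.

    Bytes that are invalid in `encoding` become U+FFFD instead of
    raising UnicodeDecodeError.
    """
    with io.open(path, "r", encoding=encoding, errors=errors) as f:
        return f.read().splitlines()


# Demo with made-up data: 0xEF followed by a space is not valid UTF-8.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"Argh\xef Fantasy Surfer\n")

demo_lines = read_lines(demo_path)
os.remove(demo_path)
```

Passing errors="ignore" instead would drop the bad bytes silently, which may be enough if the goal is only to get rid of them.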
Your problem is either that you are opening the file with the wrong encoding, or that you have identified the character encoding incorrectly.
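When the real encoding is unknown, one rough way to narrow it down is to try a few candidates in order. This is only a sketch, assuming Python 3; the function name decode_with_fallback and the candidate list are my own choices, not something from the question. Keep in mind that a decode that succeeds is not proof the encoding is right: latin-1 accepts any byte sequence, which is why ISO-8859-1 gave no error but still produced odd characters.

```python
def decode_with_fallback(data, encodings=("utf-8", "cp1252", "latin-1")):
    """Hypothetical helper: return (text, encoding) for the first
    candidate encoding that decodes `data` without an error."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings could decode the data")


# Valid UTF-8 is accepted by the first candidate.
utf8_result = decode_with_fallback(b"caf\xc3\xa9")

# A lone 0xE9 byte is invalid UTF-8, so the next candidate is used.
fallback_result = decode_with_fallback(b"caf\xe9")
```

Putting the strictest encoding (UTF-8) first matters here, because the permissive single-byte encodings would otherwise accept the data and hide the problem.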
Also, if you get text in ASCII, decode it like this:
'abc'.decode('ascii')
or
unicode('abc', 'ascii')
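For the cleanup the question asks about (keeping only numbers, letters, and common symbols), one possible approach is to decode first and then strip everything outside printable ASCII. This is a minimal sketch, assuming Python 3 and assuming "common symbols" means the printable ASCII range; the name strip_non_ascii is made up for illustration. Note that it also removes accented letters, which may or may not be what you want.

```python
import re

# Matches runs of anything outside printable ASCII (space through tilde).
_NON_PRINTABLE_ASCII = re.compile(r"[^\x20-\x7e]+")


def strip_non_ascii(text):
    """Hypothetical helper: replace non-ASCII runs with a space,
    then collapse repeated spaces."""
    cleaned = _NON_PRINTABLE_ASCII.sub(" ", text)
    return re.sub(r" {2,}", " ", cleaned).strip()


sample = u"ready to try Argh\ufffd Fantasy Surfer Carnage\ufffd Dane"
cleaned_sample = strip_non_ascii(sample)
```

Replacing with a space rather than an empty string avoids gluing two words together when the stray character was the only thing separating them.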