How do I remove all characters that conflict between Latin-1 and UTF-8 in Python?

I call open(file, "r") and read some lines in Python. This gives me:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 4: ordinal not in range(128)

If I add 'utf-8', I get:

'utf8' codec can't decode bytes in position 28-29: invalid continuation byte

If I add 'ISO-8859-1', I get no errors but a line is read like this:

2890 ready to try Arghï¿½ Fantasy Surfer Carnageï¿½ Dane, Marlon &amp; Nat C all out!  #fantasysurfer

As you can see, there are some extra characters, which probably come from emojis or something similar (these are tweets).

What is the best approach to clean these lines up?

I would like to remove all the extraneous elements, so that the strings contain only numbers, letters, and common symbols such as ?!>.;,

Note: I don't care about the HTML entities, since I replace those in another function. I am talking about the weird Arghï¿½ and Carnageï¿½ elements.

In general, these are causing issues with the encoding.

Solution

First decode the byte string, then encode the result:

"text".decode('latin-1').encode('utf-8')

Or open the file with codecs, so that it is decoded as you read it:

import codecs
with codecs.open('file', encoding='utf-8') as f:  # use the file's actual encoding
    lines = f.readlines()

Your problem is that either the file is being opened with the wrong encoding, or you have identified its character encoding incorrectly.
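
For the line shown in the question, the sequence ï¿½ is what the UTF-8 bytes EF BF BD (the Unicode replacement character U+FFFD) look like when they are read as ISO-8859-1. That suggests the file is mostly UTF-8 with a few invalid byte sequences mixed in. A minimal sketch under that assumption, written so that it also runs on Python 3; the helper name decode_lenient and the sample bytes are made up for illustration:

```python
def decode_lenient(raw_bytes):
    """Decode bytes as UTF-8 and drop whatever cannot be decoded."""
    # errors="replace" turns each invalid byte sequence into U+FFFD
    # instead of raising UnicodeDecodeError.
    text = raw_bytes.decode("utf-8", errors="replace")
    # Remove replacement characters, whether they were created by the
    # line above or were already stored in the file.
    return text.replace(u"\ufffd", u"")


# Hypothetical sample: a stored U+FFFD (EF BF BD) plus a stray Latin-1 byte (E9).
sample = b"Argh\xef\xbf\xbd Fantasy Surfer Carnage\xe9 Dane"
print(decode_lenient(sample))
```

When reading through codecs.open, passing errors='replace' (or errors='ignore') alongside the encoding should give the same lenient behaviour.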

If the text you receive is plain ASCII, you can decode it with either of these:

'abc'.decode('ascii')

unicode('abc', 'ascii')
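
If the goal is to keep only numbers, letters, and common symbols, one option is to filter the decoded text down to printable ASCII. This is only a sketch of that idea, not the one right way to do it: keep_printable_ascii is a hypothetical name, and the choice of "space through tilde" as the allowed range is an assumption. It also drops accented letters, tabs, and newlines.

```python
import re

# Matches every character outside printable ASCII (space through tilde).
_NOT_PRINTABLE_ASCII = re.compile(u"[^\x20-\x7e]")


def keep_printable_ascii(text):
    """Return text with all non-printable-ASCII characters removed."""
    return _NOT_PRINTABLE_ASCII.sub(u"", text)


# Hypothetical sample line containing U+FFFD replacement characters.
line = u"2890 ready to try Argh\ufffd Fantasy Surfer Carnage\ufffd all out!"
print(keep_printable_ascii(line))
```

Apply it to text that has already been decoded (a unicode object in Python 2, a str in Python 3), not to the raw bytes.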