Tags: python, utf-8, character-encoding, ascii

How to remove all conflicting characters between latin1 and utf-8 using python?


I call open(file, "r") and read some lines in Python. This gives me:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 4: ordinal not in range(128)

If I add 'utf-8', I get:

'utf8' codec can't decode bytes in position 28-29: invalid continuation byte

If I add 'ISO-8859-1', I get no errors but a line is read like this:

2890 ready to try Argh� Fantasy Surfer Carnage� Dane, Marlon & Nat C all out!  #fantasysurfer

As you can see, there are some extra characters, which probably come from emojis or something similar (these are tweets).

What is the best approach to clean these lines up?

I would like to remove all the extraneous elements, so that the strings contain only numbers, letters, and common symbols such as ?!>.;, and so on.

Note: I don't care about the HTML entities, since I replace those in another function. I am talking about the weird Argh� Carnage� elements.

In general, these are causing issues with the encoding.


Solution

  • First try decoding the bytes and then encoding them again:

    # Start from a byte string: decode it as Latin-1, then re-encode as UTF-8
    b"text".decode('latin-1').encode('utf-8')
    

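    Or try opening the file with codecs:

    import codecs
    # Replace "your coding" with the file's actual encoding, e.g. "utf-8"
    with codecs.open('file', encoding="your coding") as f:
        lines = f.readlines()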
    

    Your problem is either that you are opening the file with the wrong encoding, or that you have identified its character encoding incorrectly.

    Also, if you receive the text as ASCII bytes, you can decode it like this (Python 2 syntax):

    'abc'.decode('ascii')
    

    or

    unicode('abc', 'ascii')
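
For the clean-up the question asks about (keeping only letters, numbers, and common symbols), here is a minimal Python 3 sketch. It is only one possible approach: the helper names `decode_bytes` and `clean_line`, the fallback order of encodings, and the exact allow-list of characters are all assumptions made for illustration, so adjust them to your data. The idea is to read the file as raw bytes, try UTF-8 first, fall back to a single-byte encoding that cannot fail, and then drop everything outside the allow-list.

```python
import re

# Allow-list (an assumption): ASCII letters, digits, whitespace, and common
# punctuation. Anything else, including emoji and accented letters, is dropped.
_DISALLOWED = re.compile(r"[^A-Za-z0-9\s.,;:!?'\"()&#@<>/_\-]")


def decode_bytes(raw, encodings=("utf-8", "cp1252")):
    """Try each encoding in turn; fall back to latin-1, which never fails."""
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    return raw.decode("latin-1")


def clean_line(text):
    """Remove characters outside the allow-list and collapse repeated spaces."""
    stripped = _DISALLOWED.sub("", text)
    return re.sub(r"[ \t]{2,}", " ", stripped).strip()


if __name__ == "__main__":
    # Hypothetical sample resembling the tweet in the question.
    sample = "Argh\u2026 Carnage\U0001F30A all out!  #fantasysurfer".encode("utf-8")
    print(clean_line(decode_bytes(sample)))
```

To process a file, open it in binary mode (`open(path, "rb")`) and pass each line through `decode_bytes` and then `clean_line`. Note that this allow-list also strips legitimate non-ASCII letters such as "é"; if those should survive, extend the character class accordingly.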