Tags: python, encoding, utf-8, non-unicode

Delete weird ANSI characters and convert accented ones using Python


I've downloaded a bunch of Spanish tweets using the Twitter API, but some of them have strange ANSI characters that I don't want there. I have around 18000 files and I want to remove those characters. I have all my files encoded as UTF-8. For example:

b'Me quedo con una frase de nuestra reuni\xc3\xb3n de hoy.'

If they are accented characters (we have plenty in Spanish), I want to replace each accented letter with its non-accented version. That's because afterwards I'm doing some text mining analysis and I want to unify the words, since some people don't use accents. I think the b means the string is in byte mode.

In the case above, if I run the following in Python:

print(u'Me quedo con una frase de nuestra reuni\xc3\xb3n de hoy con @Colegas')

I get this in the terminal:

Me quedo con una frase de nuestra reuniÃ³n de hoy con @Colegas

I don't want that, because it is not an accent used in Spanish; the character should be ó. I don't understand why it isn't coming out right. I would also like the b at the beginning of the files to disappear. To encode the files I used the following:

f.write(str(FILE.encode('utf-8','strict')))

That is where I create my files, from UTF-8 JSON that contains many keys for each tweet. Maybe I should change it, or maybe I'm doing something wrong there.

In some cases there's also a problem when trying to print the characters in the Python terminal. For instance:

print(u'\uD83D\uDC1F')

I think that's because Python can't represent those characters (� in the example above). Is that so? I would also like to remove them.

Sorry if there are any English mistakes, and feel free to ask if something is not clear.

Thanks in advance.

EDIT: I'm using Python 3.4


Solution

  • You are mixing apples and oranges. b'reuni\xc3\xb3n' is the UTF-8 encoding of u'reuni\u00f3n' which of course is reunión in human-readable format.

    >>> print(b'reuni\xc3\xb3n'.decode('utf-8'))
    reunión
    >>> ascii(b'reuni\xc3\xb3n'.decode('utf-8'))
    "'reuni\\xf3n'"
    

    There is no "ANSI" here (it's a misnomer anyway; commonly it is used to refer to Windows character encodings, but not necessarily the one you expect).
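
    To make the bytes/text distinction concrete, here is a minimal Python 3 sketch. It assumes the strange output in the question comes from pasting UTF-8 byte escapes into a text literal, and that the b prefix in the files comes from calling str() on a bytes object. The file name tweet.txt and the temporary directory are hypothetical, used only for illustration.

```python
import os
import tempfile

raw = b'reuni\xc3\xb3n'  # UTF-8 bytes, as in the question

# Decoding the bytes yields the intended text.
text = raw.decode('utf-8')

# Putting the same escapes in a str literal treats \xc3 and \xb3 as two
# separate code points (U+00C3 and U+00B3), which display as "Ã³".
mojibake = 'reuni\xc3\xb3n'

# Latin-1 maps code points U+0000..U+00FF back to single bytes, so a string
# damaged this way can be repaired by re-encoding and decoding as UTF-8.
repaired = mojibake.encode('latin-1').decode('utf-8')

# str() on a bytes object returns its repr, including the b'...' wrapper.
as_written = str(text.encode('utf-8', 'strict'))

# Writing the text itself and letting open() do the encoding avoids that.
path = os.path.join(tempfile.mkdtemp(), 'tweet.txt')  # hypothetical name
with open(path, 'w', encoding='utf-8') as f:
    f.write(text)
```

    The design point is to keep text as str inside the program and to encode or decode only at the file boundary, by passing encoding='utf-8' to open().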

    As for how to remove the accents from accented characters, the short version is to normalize to a decomposed Unicode form ("NFD", or "NFKD" as in the snippet below, which additionally folds compatibility characters), then discard any code points that are combining marks. This is covered e.g. in "What is the best way to remove accents in a Python unicode string?"; to make this answer self-contained, here is the gist of one of the answers to that question, but do read all of them, if only to decide which best suits your use case.

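    import unicodedata

    input_str = u"reuni\u00f3n"  # example input; any text string works
    # Decompose each character, then drop the combining marks (the accents).
    stripped = u"".join([c for c in unicodedata.normalize('NFKD', input_str)
        if not unicodedata.combining(c)])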
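
    The same idea can be wrapped in small helpers and applied to the tweet from the question. This is only a sketch for Python 3: the function names strip_accents and drop_non_bmp are made up for illustration, and drop_non_bmp is one possible way to discard characters such as the \uD83D\uDC1F pair mentioned in the question, assuming any surrogates in the input come in valid pairs.

```python
import unicodedata


def strip_accents(text):
    """Return text with combining marks removed (for example ó becomes o)."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))


def drop_non_bmp(text):
    """Drop characters outside the Basic Multilingual Plane, such as emoji."""
    # First join surrogate pairs like \ud83d\udc1f into a single code point.
    # This raises UnicodeDecodeError if a surrogate is unpaired.
    joined = text.encode('utf-16', 'surrogatepass').decode('utf-16')
    return ''.join(c for c in joined if ord(c) <= 0xFFFF)


tweet = b'Me quedo con una frase de nuestra reuni\xc3\xb3n de hoy.'.decode('utf-8')
cleaned = strip_accents(drop_non_bmp(tweet + ' \ud83d\udc1f'))
print(cleaned)
```

    Note that this also turns ñ into n, so "año" becomes "ano"; whether that is acceptable depends on the text mining task.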