python non-ascii-characters n-gram non-english

N-gram generation for words in non-English languages


I am generating bigrams for words in the Czech language using Python. The bigrams themselves are generated correctly; the problem is with the non-English characters in Czech.

Input:

republikán strategii proti znovuzvolení Obamy.

After generating the bigrams, the output is:

[['republik\xc3\xa1n', 'strategii'], ['strategii', 'proti'], ['proti', 'znovuzvolen\xc3\xad'], ['znovuzvolen\xc3\xad', 'Obamy']]

The special letters of the Czech language are shown as escape sequences such as \xc3\xad in the bigrams. What changes should I make to the code so that the special letters appear properly in the output?


Solution

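  • The data is correct. When a list is converted to a string, as print does, each item is rendered with repr rather than str, so in Python 2 the UTF-8 bytes of a non-ASCII character show up as escapes such as \xc3\xa1. Printing an individual string uses str and displays the character as expected. Compare: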

    >>> x = [['republikán']]
    >>> print(x)
    [['republik\xc3\xa1n']]
    >>> print(x[0])
    ['republik\xc3\xa1n']
    >>> print(x[0][0])
    republikán
    >>>
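
One way to get readable output for the whole list is to build the display string from the items themselves instead of printing the list directly. The following is only a sketch: the helper names bigrams and format_bigrams are hypothetical and are not taken from the question's code, and the tokenization (split on whitespace, drop a trailing period) is an assumption.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def bigrams(text):
    """Return consecutive word pairs from a whitespace-separated sentence.

    Hypothetical helper: splits on whitespace and drops a trailing period.
    """
    words = text.rstrip(".").split()
    return [[first, second] for first, second in zip(words, words[1:])]


def format_bigrams(pairs):
    """Join the strings directly, so str (not repr) is used for each word."""
    inner = ("[" + ", ".join(pair) + "]" for pair in pairs)
    return "[" + ", ".join(inner) + "]"


sentence = "republikán strategii proti znovuzvolení Obamy."
pairs = bigrams(sentence)

# Printing the joined string avoids the repr-based escapes of print(pairs).
print(format_bigrams(pairs))
```

With the input from the question this should print [[republikán, strategii], [strategii, proti], [proti, znovuzvolení], [znovuzvolení, Obamy]]. In Python 3, strings are Unicode and print(pairs) already shows the accented letters, so the extra formatting step mainly matters on Python 2.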