I am generating bigrams for Czech text. I can generate the bigrams in Python, but the non-English characters in the Czech words are not displayed correctly.
Input:
republikán strategii proti znovuzvolení Obamy.
After generating the bigrams, the output is:
[['republik\xc3\xa1n', 'strategii'], ['strategii', 'proti'], ['proti', 'znovuzvolen\xc3\xad'], ['znovuzvolen\xc3\xad', 'Obamy']]
The special Czech letters are shown as escape sequences such as \xc3\xad in the bigrams. What changes do I need to make to the code so that the special letters appear properly in the output?
The data is correct, but when you convert a list to a string, the output is built using repr for the list items, not str. In Python 2, repr of a byte string escapes every non-ASCII byte, which is why á shows up as \xc3\xa1. Compare:
>>> x = [['republikán']]
>>> print(x)
[['republik\xc3\xa1n']]
>>> print(x[0])
['republik\xc3\xa1n']
>>> print(x[0][0])
republikán
>>>
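To print the whole list in readable form, one option is to build the output string yourself from the individual words, so that repr is never applied to them. The following is only a minimal sketch: the helper name format_bigrams is hypothetical, and it assumes that every bigram is a list of two strings and that your terminal displays UTF-8.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def format_bigrams(bigrams):
    """Return a readable string for a list of word pairs.

    The words are joined directly, so repr() is never called on them
    and non-ASCII characters are not escaped.
    """
    pairs = ["[" + ", ".join(pair) + "]" for pair in bigrams]
    return "[" + ", ".join(pairs) + "]"


# Sample data taken from the question.
bigrams = [['republikán', 'strategii'], ['strategii', 'proti']]
print(format_bigrams(bigrams))
# prints: [[republikán, strategii], [strategii, proti]]
```

The quotes around the words are dropped here because they are part of the repr form; if you want them, add them explicitly when joining.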