python nltk arabic

Python bigram - foreign script


I am generating a list of bigrams in Python from text in non-Latin scripts: Arabic, Russian, and Farsi.

The results are displayed like this: ('\xd9\x85\xd9\x86\xd8\xa7\xd8\xb8\xd8\xb1\xd9\x87', '\xd9\x85\xd9\x88\xd8\xb3\xd9\x88\xdb\x8c')

What is this representation called, and how can I convert it back to the original Arabic/Russian/Farsi text?

I am running this in the terminal on macOS using NLTK.


Solution

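  • This is a bytestring containing UTF-8 encoded text. Decoding it gives a unicode string, and printing that string shows the original script (Python 2 session):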

    In [5]: '\xd9\x85\xd9\x86\xd8\xa7\xd8\xb8\xd8\xb1\xd9\x87'.decode('utf-8')
    Out[5]: u'\u0645\u0646\u0627\u0638\u0631\u0647'
    
    In [6]: print '\xd9\x85\xd9\x86\xd8\xa7\xd8\xb8\xd8\xb1\xd9\x87'.decode('utf-8')         
    مناظره
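
The escapes appear because printing a tuple shows the repr of each element rather than the text itself. As a minimal sketch of the same fix applied to a whole bigram, assuming Python 3 (where the undecoded values are `bytes` objects) and using a hypothetical helper name `decode_bigram` that is not part of NLTK:

```python
# The bigram from the question, written as Python 3 bytes literals.
raw_bigram = (
    b'\xd9\x85\xd9\x86\xd8\xa7\xd8\xb8\xd8\xb1\xd9\x87',
    b'\xd9\x85\xd9\x88\xd8\xb3\xd9\x88\xdb\x8c',
)


def decode_bigram(bigram, encoding='utf-8'):
    """Hypothetical helper: decode every bytes element of a bigram to str.

    Elements that are already str are passed through unchanged.
    """
    return tuple(
        word.decode(encoding) if isinstance(word, bytes) else word
        for word in bigram
    )


decoded = decode_bigram(raw_bigram)
```

Calling `print(' '.join(decoded))` afterwards prints the two words as readable text instead of escape sequences, provided the terminal itself is set to UTF-8. If the text is read from a file, opening it with `open(path, encoding='utf-8')` in Python 3 avoids the bytes stage altogether.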