
Process Unicode strings in Python


I am using a fastText pre-trained model trained on English Wikipedia. It works as expected:

https://github.com/shantanuo/pandas_examples/blob/master/nlp/fasttext_english.ipynb

But when I try the same code with another language (Marathi), I get an error, as shown in this notebook:

https://github.com/shantanuo/pandas_examples/blob/master/nlp/fasttext_marathi.ipynb

The error is related to unicode:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 15: invalid start byte

I tried opening the file in raw binary mode by changing the open call in the load_words_raw function in load.py:

with open(file_path, 'rb') as f:

And now I get a different error:

ValueError: could not convert string to float: b'\x00l\x02'

I have no idea how to handle this.


Solution

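  • Change the second line of the notebook so that it downloads the .vec file:

    #!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.mr.300.vec.gz

    instead of the .bin file:

    #!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.mr.300.bin.gz

    The .vec file is plain UTF-8 text with one word and its vector per line, which appears to be the format the loader parses. The .bin file is fastText's binary model format, so decoding it as UTF-8 fails with UnicodeDecodeError, and opening it with 'rb' only moves the failure to the float conversion.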
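
To illustrate why the .vec file loads cleanly, here is a minimal sketch of a reader for fastText's text vector format: a header line with the word count and vector dimension, followed by one line per word. The names parse_vec and load_vec are hypothetical and are not the notebook's load_words_raw. The sample data is made up and uses 3 dimensions instead of 300.

```python
import gzip
import io


def parse_vec(lines):
    """Parse fastText text-format vectors from an iterable of text lines."""
    lines = iter(lines)
    # First line is a header: "<number of words> <vector dimension>".
    _count, dim = (int(x) for x in next(lines).split())
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) <= dim:
            continue  # skip blank or malformed lines
        # The last `dim` fields are the vector; whatever precedes them is the word.
        word = " ".join(parts[:-dim])
        vectors[word] = [float(x) for x in parts[-dim:]]
    return dim, vectors


def load_vec(path):
    """Open a .vec or .vec.gz file as UTF-8 text and parse it (hypothetical helper)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        return parse_vec(f)


# Tiny in-memory stand-in for cc.mr.300.vec (made-up 3-dimensional values).
sample = io.StringIO("2 3\nनमस्कार 0.1 0.2 0.3\nजग 0.4 0.5 0.6\n")
dim, vectors = parse_vec(sample)
print(dim, vectors["जग"])
```

With the real download, load_vec("cc.mr.300.vec.gz") would read the file directly, assuming it follows the same layout. Pointing the same function at cc.mr.300.bin.gz would be expected to fail, since that file is not UTF-8 text.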