Process Unicode strings in Python

I am using a fastText pre-trained model based on English Wikipedia. It works as expected...

But when I try the same code with some other language, I get an error as shown on this page...

The error is related to Unicode:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 15: invalid start byte

I tried opening the file in raw binary mode by changing the load_words_raw function in load.py:

with open(file_path, 'rb') as f:

And now I get a different error:

ValueError: could not convert string to float: b'\x00l\x02'

I have no idea how to handle this.

Solution

Change the second line of the notebook so that it downloads the .vec file:

#!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.mr.300.vec.gz

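That is, point it at the text-format .vec file instead of the .bin file. The .bin file is fastText's binary model format, not UTF-8 text, so it can neither be decoded as UTF-8 nor parsed line by line into floats: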

#!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.mr.300.bin.gz
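
For reference, here is a minimal sketch of how a .vec file can be read as UTF-8 text. It assumes the usual fastText text layout: a header line with the vocabulary size and the dimension, then one word per line followed by its vector components. The names load_vec and load_vec_gz are hypothetical helpers for illustration; they are not the notebook's load_words_raw and not part of fastText.

```python
import gzip


def load_vec(stream, limit=None):
    """Parse text-format word vectors from an open text stream.

    Hypothetical helper: assumes a header line "<count> <dim>" followed by
    lines of the form "<word> <v1> ... <vdim>".
    """
    # Only the dimension is needed here; the count is read and ignored.
    _count, dim = map(int, stream.readline().split())
    vectors = {}
    for i, line in enumerate(stream):
        if limit is not None and i >= limit:
            break
        parts = line.rstrip().split(" ")
        if len(parts) <= dim:
            continue  # skip blank or malformed lines
        # Take the last `dim` fields as numbers so that a token which
        # itself contains a space is still kept whole.
        word = " ".join(parts[:-dim])
        vectors[word] = [float(x) for x in parts[-dim:]]
    return vectors


def load_vec_gz(path, limit=None):
    """Open a gzip-compressed .vec file in text mode and parse it."""
    # "rt" makes gzip decode the bytes as text; the .vec file is UTF-8.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return load_vec(f, limit=limit)
```

Once the .vec.gz file from the first wget line has been downloaded, something like load_vec_gz("cc.mr.300.vec.gz", limit=1000) should return the first 1000 words with their vectors. Applying the same function to the .bin file is expected to fail, since that file does not contain text.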