I am using a fastText pre-trained model based on English Wikipedia. It works as expected:
https://github.com/shantanuo/pandas_examples/blob/master/nlp/fasttext_english.ipynb
But when I try the same code with another language (Marathi), I get an error, as shown in this notebook:
https://github.com/shantanuo/pandas_examples/blob/master/nlp/fasttext_marathi.ipynb
The error is related to unicode:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 15: invalid start byte
I tried opening the file in raw binary mode by changing the load_words_raw function in load.py:
with open(file_path, 'rb') as f:
And now I get a different error:
ValueError: could not convert string to float: b'\x00l\x02'
I have no idea how to handle this.
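Both errors have the same cause: the notebook downloads the .bin file, which is fastText's binary model format, while the loading code appears to expect the plain-text .vec format. Reading the binary file as UTF-8 text fails, and opening it with 'rb' only postpones the failure until the bytes are parsed as floats. Change the second line of the notebook to: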
#!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.mr.300.vec.gz
That is, point to the .vec file instead of the .bin file:
#!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.mr.300.bin.gz
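To illustrate why the .vec file loads cleanly, here is a minimal sketch of a reader for the text format, assuming the usual layout: a header line with the word count and the dimension, then one word per line followed by its space-separated float values. The names open_text, parse_vec and load_vec are hypothetical helpers written for this example; they are not the functions from load.py in the notebook.

```python
import gzip
import io


def open_text(path):
    """Open a plain or gzip-compressed file as UTF-8 text (hypothetical helper)."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def parse_vec(stream, limit=None):
    """Parse .vec text from an open text stream; returns (dim, {word: [floats]})."""
    # Assumed header: "<word_count> <dimension>"
    _count, dim = map(int, stream.readline().split())
    vectors = {}
    for line in stream:
        if limit is not None and len(vectors) >= limit:
            break
        # Split from the right so the last `dim` fields are always the numbers.
        parts = line.rstrip().rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors


def load_vec(path, limit=None):
    """Load vectors from a .vec or .vec.gz file on disk."""
    with open_text(path) as f:
        return parse_vec(f, limit=limit)


# Tiny in-memory sample in the same layout (made-up values, not real vectors).
sample = io.StringIO("2 3\nनमस्कार 0.1 0.2 0.3\nhello 1.0 2.0 3.0\n")
demo_dim, demo_vectors = parse_vec(sample)
```

With the downloaded file, a call such as load_vec("cc.mr.300.vec.gz", limit=10000) would read only the first 10000 words, which keeps memory use manageable. If the subword information stored in the .bin file is needed, the usual route is the official fasttext Python package and its load_model function rather than a text parser.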