Tags: python, nlp, gensim, fasttext

Loading a pretrained fastText model with Gensim


I'm trying to load a pretrained German fastText model (source: https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.de.300.bin.gz) with Gensim. My intention is to fine-tune it on my own dataset. However, an error occurs when loading the model.

My code:

import gensim

print("gensim", gensim.__version__) # Out: gensim 4.3.1
bin_path = "cc.de.300.bin"
model = gensim.models.fasttext.load_facebook_model(bin_path)

The error:

AssertionError: expected (600000000,),  got (306814511,)

The whole traceback:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Input In [3], in <module>
----> 1 model = gensim.models.fasttext.load_facebook_model(bin_path)

File /opt/conda/lib/python3.9/site-packages/gensim/models/fasttext.py:728, in load_facebook_model(path, encoding)
    666 def load_facebook_model(path, encoding='utf-8'):
    667     """Load the model from Facebook's native fasttext `.bin` output file.
    668 
    669     Notes
   (...)
    726 
    727     """
--> 728     return _load_fasttext_format(path, encoding=encoding, full_model=True)

File /opt/conda/lib/python3.9/site-packages/gensim/models/fasttext.py:808, in _load_fasttext_format(model_file, encoding, full_model)
    789 """Load the input-hidden weight matrix from Facebook's native fasttext `.bin` output files.
    790 
    791 Parameters
   (...)
    805 
    806 """
    807 with utils.open(model_file, 'rb') as fin:
--> 808     m = gensim.models._fasttext_bin.load(fin, encoding=encoding, full_model=full_model)
    810 model = FastText(
    811     vector_size=m.dim,
    812     window=m.ws,
   (...)
    821     max_n=m.maxn,
    822 )
    823 model.corpus_total_words = m.ntokens

File /opt/conda/lib/python3.9/site-packages/gensim/models/_fasttext_bin.py:353, in load(fin, encoding, full_model)
    351     hidden_output = None
    352 else:
--> 353     hidden_output = _load_matrix(fin, new_format=new_format)
    354     assert fin.read() == b'', 'expected to reach EOF'
    356 model.update(vectors_ngrams=vectors_ngrams, hidden_output=hidden_output)

File /opt/conda/lib/python3.9/site-packages/gensim/models/_fasttext_bin.py:284, in _load_matrix(fin, new_format)
    281 else:
    282     matrix = np.fromfile(fin, _FLOAT_DTYPE, count)
--> 284 assert matrix.shape == (count,), 'expected (%r,),  got %r' % (count, matrix.shape)
    285 matrix = matrix.reshape((num_vectors, dim))
    286 return matrix

AssertionError: expected (600000000,),  got (306814511,)

What's the cause of this error, and how can I load the model properly?


Solution

  • That's the sort of error you'd get from a truncated file that doesn't contain everything the loader expects. Here, the loader needed 600,000,000 values for the final matrix, but the file ended after 306,814,511.

    Are you sure your cc.de.300.bin file is complete & undamaged? What's its size, and can you try re-downloading it to ensure you have a full copy?
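
One way to test the truncation hypothesis is sketched below. It reads the downloaded `.gz` archive to its end (a cut-off gzip stream fails before the end-of-stream marker) and estimates how many bytes the loader was missing from the two numbers in the error message. The helper names `gzip_is_complete` and `missing_bytes` are made up for this illustration, and the 4-byte item size is an assumption that the matrices are stored as float32.

```python
import gzip
import os
import zlib


def gzip_is_complete(path, chunk_size=1 << 20):
    """Return True if the gzip archive at `path` decompresses to its end without error."""
    try:
        with gzip.open(path, "rb") as f:
            # Read and discard the stream in chunks; a truncated or corrupt
            # archive raises before the loop finishes.
            while f.read(chunk_size):
                pass
        return True
    except (EOFError, OSError, zlib.error):
        return False


def missing_bytes(expected_items, got_items, itemsize=4):
    """Bytes the loader could not read, assuming `itemsize` bytes per value (float32 assumed)."""
    return (expected_items - got_items) * itemsize


if __name__ == "__main__":
    # Numbers taken from the AssertionError in the question.
    print("missing bytes:", missing_bytes(600_000_000, 306_814_511))

    # File names as used in the question; adjust to your local paths.
    for name in ("cc.de.300.bin.gz", "cc.de.300.bin"):
        if os.path.exists(name):
            print(name, "size in bytes:", os.path.getsize(name))
    if os.path.exists("cc.de.300.bin.gz"):
        print("archive complete:", gzip_is_complete("cc.de.300.bin.gz"))
```

Under the float32 assumption, the error in the question corresponds to roughly 1.17 GB missing from the end of the file, which fits an interrupted download or an interrupted decompression rather than a format problem.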

    Separately: there's no official support for 'fine-tuning' FastText vectors in Gensim. You can call usual training methods in atypical ways, including on an already-trained model, to attempt an effect like that – but there are no guides for ways to do that effectively in Gensim. Further, I've never seen any good writeup explaining how fine-tuning a FastText model could be attempted & verified.

    If you want confidence in the usual benefits of FastText, including its ability to synthesize useful vectors for out-of-vocabulary words, it's safest to use/train it in the usual way: via a single training session which includes representative training texts for all words of interest. If improvising some other approach for patching in other words, or differing word senses for existing words, you should pay special attention to monitoring in what ways the novel steps are helping or hurting the overall model.