I'm trying to load a pretrained German fastText model (source: https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.de.300.bin.gz) with Gensim. My intention is to fine-tune it using my own dataset. However, an error occurs when loading the model.
My code:
import gensim
print("gensim", gensim.__version__) # Out: gensim 4.3.1
bin_path = "cc.de.300.bin"
model = gensim.models.fasttext.load_facebook_model(bin_path)
The error:
AssertionError: expected (600000000,), got (306814511,)
The whole traceback:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
Input In [3], in <module>
----> 1 model = gensim.models.fasttext.load_facebook_model(bin_path)
File /opt/conda/lib/python3.9/site-packages/gensim/models/fasttext.py:728, in load_facebook_model(path, encoding)
666 def load_facebook_model(path, encoding='utf-8'):
667 """Load the model from Facebook's native fasttext `.bin` output file.
668
669 Notes
(...)
726
727 """
--> 728 return _load_fasttext_format(path, encoding=encoding, full_model=True)
File /opt/conda/lib/python3.9/site-packages/gensim/models/fasttext.py:808, in _load_fasttext_format(model_file, encoding, full_model)
789 """Load the input-hidden weight matrix from Facebook's native fasttext `.bin` output files.
790
791 Parameters
(...)
805
806 """
807 with utils.open(model_file, 'rb') as fin:
--> 808 m = gensim.models._fasttext_bin.load(fin, encoding=encoding, full_model=full_model)
810 model = FastText(
811 vector_size=m.dim,
812 window=m.ws,
(...)
821 max_n=m.maxn,
822 )
823 model.corpus_total_words = m.ntokens
File /opt/conda/lib/python3.9/site-packages/gensim/models/_fasttext_bin.py:353, in load(fin, encoding, full_model)
351 hidden_output = None
352 else:
--> 353 hidden_output = _load_matrix(fin, new_format=new_format)
354 assert fin.read() == b'', 'expected to reach EOF'
356 model.update(vectors_ngrams=vectors_ngrams, hidden_output=hidden_output)
File /opt/conda/lib/python3.9/site-packages/gensim/models/_fasttext_bin.py:284, in _load_matrix(fin, new_format)
281 else:
282 matrix = np.fromfile(fin, _FLOAT_DTYPE, count)
--> 284 assert matrix.shape == (count,), 'expected (%r,), got %r' % (count, matrix.shape)
285 matrix = matrix.reshape((num_vectors, dim))
286 return matrix
AssertionError: expected (600000000,), got (306814511,)
What's the cause of this error and how can I load the model properly?
That's the sort of error you might get from a file that's been truncated, so that it doesn't contain everything expected.
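The numbers in the assertion message fit that reading. As a rough check, assuming the matrix is stored as float32 (4 bytes per value) and that the vector size is the 300 implied by the file name:

```python
# Numbers copied from the AssertionError in the question.
expected_floats = 600_000_000
read_floats = 306_814_511

dim = 300            # assumption: vector size implied by "cc.de.300"
bytes_per_float = 4  # assumption: values are stored as float32

# Rows the loader expected in the final (hidden-output) matrix.
rows_expected = expected_floats // dim
# Values left over after the last complete row that was actually read.
leftover_in_last_row = read_floats % dim
# Approximate number of bytes missing from the end of the file.
missing_bytes = (expected_floats - read_floats) * bytes_per_float

print(rows_expected)         # 2000000
print(leftover_in_last_row)  # 11
print(missing_bytes)         # 1172741956
```

Under those assumptions the read stopped partway through a row, roughly 1.17 GB short of the end, which is what you'd expect if the file simply ends early.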
Are you sure your cc.de.300.bin file is complete & undamaged? What's its size, and can you try re-downloading it to ensure you have a full copy?
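If you still have the original .gz download, one way to check it is to decompress the whole stream and see whether it ends cleanly. This is only a sketch: the helper name `gzip_is_complete` and the local file names are assumptions for illustration, and it reads the entire archive, so it will take a while on a file this large.

```python
import gzip
import os
import zlib


def gzip_is_complete(gz_path, chunk_size=1 << 20):
    """Return True if the whole gzip stream decompresses without error."""
    try:
        with gzip.open(gz_path, "rb") as fin:
            # Read and discard the data chunk by chunk to keep memory use low.
            while fin.read(chunk_size):
                pass
        return True
    except (EOFError, OSError, zlib.error):
        # EOFError: the stream ended early (truncated download).
        # OSError / zlib.error: the data is not valid gzip or is corrupted.
        return False


# Hypothetical local paths; adjust to wherever your files actually live.
for path in ("cc.de.300.bin.gz", "cc.de.300.bin"):
    if os.path.exists(path):
        print(path, os.path.getsize(path), "bytes")

if os.path.exists("cc.de.300.bin.gz"):
    print("archive complete:", gzip_is_complete("cc.de.300.bin.gz"))
```

If the archive fails this check, the download itself was cut short. If the archive passes but the .bin still fails to load, the decompression step is the more likely culprit (for example, running out of disk space partway through).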
Separately: there's no official support for 'fine-tuning' FastText vectors in Gensim. You can call usual training methods in atypical ways, including on an already-trained model, to attempt an effect like that – but there are no guides for ways to do that effectively in Gensim. Further, I've never seen any good writeup explaining how fine-tuning a FastText model could be attempted & verified.
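For reference, "calling the usual training methods in atypical ways" would look roughly like the sketch below. It is an improvised pattern, not a supported or validated fine-tuning recipe: the helper name `continue_training` is made up for illustration, and nothing here guarantees the resulting vectors are better than the originals.

```python
def continue_training(model, new_sentences, epochs=None):
    """Continue training an already-trained full FastText model on new texts.

    `model` is assumed to be a gensim FastText model, e.g. the result of
    gensim.models.fasttext.load_facebook_model(...).
    `new_sentences` is assumed to be a list of token lists.
    """
    # Expand the existing vocabulary with any words seen in the new texts.
    model.build_vocab(corpus_iterable=new_sentences, update=True)
    # Run further training passes over the new texts only.
    model.train(
        corpus_iterable=new_sentences,
        total_examples=len(new_sentences),
        epochs=epochs if epochs is not None else model.epochs,
    )
    return model


# Usage sketch (needs gensim and a complete cc.de.300.bin):
#   from gensim.models.fasttext import load_facebook_model
#   model = load_facebook_model("cc.de.300.bin")
#   my_sentences = [["ein", "kurzer", "beispielsatz"]]  # your tokenized texts
#   continue_training(model, my_sentences)
```

Words that appear only in the original training data get no further updates this way, so they can drift out of alignment with the words that do appear in your new texts.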
If you want confidence in the usual benefits of FastText, including its ability to synthesize useful vectors for out-of-vocabulary words, it's safest to use/train it in the usual way: via a single training session which includes representative training texts for all words of interest. If you improvise some other approach to patch in new words, or different word senses for existing words, pay special attention to monitoring how the novel steps help or hurt the overall model.
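One simple form that monitoring could take is to compare similarities for a fixed set of probe word pairs before and after the extra training. This is a minimal sketch, not an established evaluation method: the helper name `similarity_drift` is hypothetical, and it only assumes that both arguments expose a `similarity(word_a, word_b)` method, as Gensim's keyed vectors (`model.wv`) do.

```python
def similarity_drift(wv_before, wv_after, probe_pairs):
    """Report how the similarity of each probe pair changed.

    Returns a list of (word_a, word_b, before, after, delta) tuples.
    """
    rows = []
    for word_a, word_b in probe_pairs:
        before = float(wv_before.similarity(word_a, word_b))
        after = float(wv_after.similarity(word_a, word_b))
        rows.append((word_a, word_b, before, after, after - before))
    return rows


# Usage sketch (needs gensim and a loaded model):
#   import copy
#   wv_before = copy.deepcopy(model.wv)   # snapshot before extra training
#   ... run the extra training on `model` ...
#   for row in similarity_drift(wv_before, model.wv, [("hund", "katze")]):
#       print(row)
```

Include pairs of general-vocabulary words that are absent from your new texts as well as pairs of domain words, so that you can see whether gains on the new material come at the cost of the rest of the model. Note that the snapshot copies all vectors, which roughly doubles memory use for a model of this size.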