I am hosting a pretrained fastText model on S3 (uncompressed) and I am trying to load it in an AWS Lambda function. I am using the gensim.models.fasttext module to load the model:
from gensim.models.fasttext import load_facebook_vectors

def load_model(obj):
    model = load_facebook_vectors(obj["path"])
with obj["path"] being the S3 path, but I keep getting the following error:
"errorMessage": "fileno"
"errorType": "UnsupportedOperation"
"stackTrace": [
...
" File \"/var/task/gensim/models/fasttext.py\", line 784, in load_facebook_vectors\n full_model = _load_fasttext_format(path, encoding=encoding, full_model=False)\n"
" File \"/var/task/gensim/models/fasttext.py\", line 808, in _load_fasttext_format\n m = gensim.models._fasttext_bin.load(fin, encoding=encoding, full_model=full_model)\n"
" File \"/var/task/gensim/models/_fasttext_bin.py\", line 348, in load\n vectors_ngrams = _load_matrix(fin, new_format=new_format)\n"
" File \"/var/task/gensim/models/_fasttext_bin.py\", line 282, in _load_matrix\n matrix = np.fromfile(fin, _FLOAT_DTYPE, count)\n"
]
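Unfortunately, the np.fromfile() method on which this load depends doesn't work on a streamed-from-S3 file. It needs a real file backed by an OS-level file descriptor, and the S3 stream object raises UnsupportedOperation when asked for its fileno(), which is the error you are seeing.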
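The failure can be reproduced without S3 at all. As a minimal sketch, using an in-memory io.BytesIO as a stand-in for the streamed S3 object (an assumption; the real stream class comes from the S3 client library), the snippet below shows that such a stream has no OS-level file descriptor, while a real local file does. The helper name has_file_descriptor is hypothetical.

```python
import io
import os
import tempfile


def has_file_descriptor(fileobj):
    """Return True if fileobj is backed by a real OS-level file descriptor."""
    try:
        fileobj.fileno()
    except (io.UnsupportedOperation, AttributeError):
        return False
    return True


# Stand-in for a stream read from S3: bytes in memory, no descriptor behind it.
stream = io.BytesIO(b"\x00" * 16)
stream_ok = has_file_descriptor(stream)

# A real file on local disk does have a descriptor.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00" * 16)
    local_path = tmp.name
with open(local_path, "rb") as fin:
    local_ok = has_file_descriptor(fin)
os.remove(local_path)

print(stream_ok, local_ok)  # prints: False True
```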
Some alternate options include:
- copying the file from S3 to local storage first, then using load_facebook_vectors() from there; or…
- loading the model once somewhere it does load, then using Python's pickle functionality to save it to a single file (now of Python's format), then put that file on S3, and in the future re-load it using Python's unpickling

The utility functions gensim.utils.pickle() and gensim.utils.unpickle() (which take a file path, including S3 URLs) may be helpful for the 2nd option, eg:
https://radimrehurek.com/gensim/utils.html#gensim.utils.unpickle
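For the first option, a rough sketch of the copy-then-load step follows. The helper name load_via_local_copy and its loader parameter are hypothetical, introduced only for illustration; in the Lambda the stream would come from the S3 client and the loader would be load_facebook_vectors. The demonstration at the bottom substitutes an in-memory stream and a trivial loader so that it runs anywhere.

```python
import io
import os
import shutil
import tempfile


def load_via_local_copy(stream, loader, suffix=".bin"):
    """Copy a binary stream to a real local file, then call loader(path).

    Hypothetical helper: `loader` is any function that takes a file path,
    e.g. gensim's load_facebook_vectors. Assumes the loader reads the
    whole file into memory, so the copy can be deleted afterwards.
    """
    # tempfile picks the platform's temp dir (/tmp on AWS Lambda).
    fd, local_path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as fout:
            # Copies in chunks, so the stream is never held in memory at once.
            shutil.copyfileobj(stream, fout)
        return loader(local_path)
    finally:
        # Remove the copy once loaded to free the limited local storage.
        os.remove(local_path)


def read_size(path):
    """Trivial stand-in loader: report how many bytes landed on disk."""
    return os.path.getsize(path)


# Stand-in for the S3 stream; the bytes are placeholder data, not a model.
fake_s3_stream = io.BytesIO(b"fasttext-model-bytes")
loaded = load_via_local_copy(fake_s3_stream, read_size)
print(loaded)
```

In the Lambda this might be called as load_via_local_copy(smart_open.open(obj["path"], "rb"), load_facebook_vectors), assuming the model fits within the function's configured /tmp storage and memory.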
Since your code only loads the vectors (via load_facebook_vectors()), not the whole model, you could pickle & upload just the vectors (the model.wv subcomponent of a full model, which is what load_facebook_vectors() already returns), rather than the whole model, to save some storage/bandwidth.
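As a sketch of the pickling option, the round trip below uses the standard-library pickle module on a small stand-in object and a local path, so it runs without Gensim or S3. With a real model, gensim.utils.pickle(obj, fname) and gensim.utils.unpickle(fname) would take the place of the two helpers (save_single_file and load_single_file are hypothetical names), and the vectors object would take the place of the dictionary.

```python
import os
import pickle
import tempfile


def save_single_file(obj, path):
    """Serialize obj to exactly one file at path."""
    with open(path, "wb") as fout:
        pickle.dump(obj, fout, protocol=pickle.HIGHEST_PROTOCOL)


def load_single_file(path):
    """Restore the object; only unpickle files you created yourself."""
    with open(path, "rb") as fin:
        return pickle.load(fin)


# Stand-in for the loaded vectors (a real run would pickle the object
# returned by load_facebook_vectors()).
vectors = {"hello": [0.1, 0.2], "world": [0.3, 0.4]}

path = os.path.join(tempfile.mkdtemp(), "vectors.pkl")
save_single_file(vectors, path)
restored = load_single_file(path)
print(restored == vectors)  # prints: True
```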
If perhaps in future Gensim versions, the FastText-model related classes change in shape/operation, an old pickled-model might not cleanly load. In such an eventuality, you could potentially either:
- go back to the original Facebook-format file, load it in the newer Gensim, and pickle it again; or
- use the older Gensim to unpickle the model and save it with Gensim's native .save() (which may split it over multiple local files), then in the newer Gensim use Gensim's native FastText.load() to load those older files (which will usually handle older formats), then re-pickle that loaded model, for future re-unpickles into the matching latest Gensim.
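To notice such a mismatch before it causes a confusing failure, one possibility (not part of Gensim, just a sketch) is to store the library version in the same file ahead of the pickled object and compare it at load time. Here save_tagged, load_tagged, and the version strings are hypothetical; in real use the version would come from gensim.__version__. The version is written as its own small pickle so it can be read even when the object behind it no longer unpickles.

```python
import os
import pickle
import tempfile


def save_tagged(obj, path, library_version):
    """Write the version string first, then the object, into one file."""
    with open(path, "wb") as fout:
        pickle.dump(library_version, fout)
        pickle.dump(obj, fout)


def load_tagged(path, current_version):
    """Load the object only if it was pickled under current_version."""
    with open(path, "rb") as fin:
        saved_version = pickle.load(fin)  # plain string, always loadable
        if saved_version != current_version:
            raise RuntimeError(
                f"pickled with {saved_version}, running {current_version}: "
                "convert the file with the older version first"
            )
        return pickle.load(fin)


path = os.path.join(tempfile.mkdtemp(), "vectors.pkl")
# "4.0.0" and "5.0.0" are placeholder version strings for the demonstration.
save_tagged({"hello": [0.1, 0.2]}, path, library_version="4.0.0")

same = load_tagged(path, current_version="4.0.0")

try:
    load_tagged(path, current_version="5.0.0")
    mismatch_detected = False
except RuntimeError:
    mismatch_detected = True

print(same, mismatch_detected)
```

The exact-match check is deliberately strict; a looser comparison (for example on the major version only) may be enough in practice.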