python amazon-s3 aws-lambda gensim fasttext

Loading fasttext binary model from s3 fails


I am hosting a pretrained fastText model on S3 (uncompressed) and trying to load it in an AWS Lambda function. I am using the gensim.models.fasttext module to load the model:

from gensim.models.fasttext import load_facebook_vectors

def load_model(obj):
    return load_facebook_vectors(obj["path"])

Here obj["path"] is the S3 path. I keep getting the following error:

"errorMessage": "fileno"
"errorType": "UnsupportedOperation"
"stackTrace": [
...
"  File \"/var/task/gensim/models/fasttext.py\", line 784, in load_facebook_vectors\n    full_model = _load_fasttext_format(path, encoding=encoding, full_model=False)\n"
"  File \"/var/task/gensim/models/fasttext.py\", line 808, in _load_fasttext_format\n    m = gensim.models._fasttext_bin.load(fin, encoding=encoding, full_model=full_model)\n"
"  File \"/var/task/gensim/models/_fasttext_bin.py\", line 348, in load\n    vectors_ngrams = _load_matrix(fin, new_format=new_format)\n"
"  File \"/var/task/gensim/models/_fasttext_bin.py\", line 282, in _load_matrix\n    matrix = np.fromfile(fin, _FLOAT_DTYPE, count)\n"
]

Solution

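  • Unfortunately, np.fromfile(), which this loader uses to read the weight matrices, needs a real file with an OS-level file descriptor. A file streamed from S3 has none, which is why the call fails with UnsupportedOperation: fileno.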

    Some alternative options:

    • download the S3 file to a local path first, then use load_facebook_vectors() from there; or…
    • load the FastText file on a machine where it is available locally, save the loaded object to a single file with Python's pickle, put that file on S3, and from then on reload it by unpickling
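
    The first option could look roughly like the sketch below. It assumes boto3 is available in the Lambda runtime, that the function's role may read the object, and that /tmp has enough ephemeral storage for the model file. The helper names split_s3_uri and load_vectors_via_tmp are made up for illustration.

```python
import os
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    key = parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"not a valid S3 URI: {uri!r}")
    return parsed.netloc, key


def load_vectors_via_tmp(s3_uri, local_dir="/tmp"):
    """Download the .bin file to local disk, then load it with Gensim."""
    # Imported lazily so only this function needs boto3 and gensim.
    import boto3
    from gensim.models.fasttext import load_facebook_vectors

    bucket, key = split_s3_uri(s3_uri)
    local_path = os.path.join(local_dir, os.path.basename(key))

    # Skip the download if a warm container already has the file.
    if not os.path.exists(local_path):
        boto3.client("s3").download_file(bucket, key, local_path)

    # np.fromfile() now reads from a real on-disk file.
    return load_facebook_vectors(local_path)
```

    Pretrained FastText files can be several gigabytes, so the function's ephemeral storage and memory settings may need raising before this works.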

    The gensim.utils functions pickle() and unpickle(), which take a file path (including S3 URLs), may be helpful for the second option, e.g.:

    https://radimrehurek.com/gensim/utils.html#gensim.utils.unpickle

    Since your code only uses the vectors (via load_facebook_vectors()), not the whole model, you could pickle & upload just the vectors (the model.wv subcomponent of a full model) rather than the whole model, to save some storage/bandwidth.
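
    A minimal sketch of the pickle route, assuming Gensim's S3 support (via smart_open and boto3) is installed and credentials are configured. The function names and the example paths are placeholders, not part of Gensim.

```python
def pickle_vectors_to(local_bin_path, pickle_path):
    """One-time conversion, run where the Facebook-format .bin file is local."""
    from gensim import utils
    from gensim.models.fasttext import load_facebook_vectors

    vectors = load_facebook_vectors(local_bin_path)
    # pickle_path may be a local path or an S3 URL such as
    # "s3://my-bucket/fasttext_vectors.pkl" (hypothetical).
    utils.pickle(vectors, pickle_path)


def load_pickled_vectors(pickle_path):
    """Called from the Lambda handler; reads the pickle straight from S3."""
    # gensim must be importable here so the pickled classes can be resolved.
    from gensim import utils

    return utils.unpickle(pickle_path)
```

    Unpickling reads the object as an ordinary byte stream and never goes through np.fromfile(), so the fileno limitation does not apply. The usual pickle caveat holds: only unpickle files you created yourself.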

    If the FastText-related classes change in shape or operation in a future Gensim version, an old pickled model might not load cleanly. In that case, you could either:

    • go back to the original Facebook-format model file, load it in the newer Gensim, and re-save it in the current format; or…
    • load your pickled model in the older Gensim where it still works and save it locally using Gensim's native .save() (which may split it over multiple local files). Then, in the newer Gensim, use the native FastText.load() to load those files (it will usually handle older formats) and re-pickle the loaded model, so that future unpickling matches the latest Gensim.
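
    The second recovery path is a two-step job, with each step run in a different environment. A rough sketch, assuming a full FastText model was pickled; the function names and paths are illustrative only:

```python
def export_native(old_pickle_path, native_path):
    """Step 1: run under the OLDER Gensim, where the pickle still loads."""
    from gensim import utils

    model = utils.unpickle(old_pickle_path)
    # Native save; large arrays may land in extra files next to native_path.
    model.save(native_path)


def repickle(native_path, new_pickle_path):
    """Step 2: run under the NEWER Gensim."""
    from gensim import utils
    from gensim.models.fasttext import FastText

    # The native loader usually knows how to upgrade older saved formats.
    model = FastText.load(native_path)
    utils.pickle(model, new_pickle_path)
```

    If only the vectors were pickled, the matching KeyedVectors class's load() would take the place of FastText.load() in step 2. Keep all the files written by .save() together when moving them between environments.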