python scikit-learn pymongo gridfs joblib

Cannot load joblib serialized model from GridFS


I can dump sklearn models to GridFS:

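import gridfs
import joblib

fs = gridfs.GridFS(db)  # db is an existing pymongo database handle
gridFS_file = fs.new_file()
joblib.dump(vectorizer, gridFS_file)
gridFS_file.close()  # finalize the file in GridFS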

This works and I can see the model stored in my Mongo.

But I can't read it back directly from GridFS:

from bson.objectid import ObjectId
new_file = fs.get(ObjectId("59df36ebe46a520014e0771d"))
vectorizer2 = joblib.load(new_file)

This hangs and never finishes. However, writing the data to a local file first works (and finishes quickly):

with open('vec.pkl', 'wb') as f:
    f.write(new_file.read())
# Load only after the with block has closed (and flushed) the file
vectorizer3 = joblib.load("vec.pkl")

What am I missing?


Solution

  • A better workaround is to read the whole file into memory first and then wrap the bytes in a stream:

    import io
    vectorizer2 = joblib.load(io.BytesIO(new_file.read()))
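
    The same idea can be wrapped in a small helper. This is only a sketch: the name load_via_buffer is hypothetical, and pickle.load is used as a stand-in default so the example runs without joblib or MongoDB. For the case in the question you would presumably pass loader=joblib.load and the object returned by fs.get(...).

```python
import io
import pickle


def load_via_buffer(file_like, loader=pickle.load):
    """Read file_like fully into memory, then deserialize from a BytesIO.

    file_like: any object with a read() method returning bytes
               (for example the object returned by fs.get(...)).
    loader:    callable taking a binary stream; pickle.load is only a
               stand-in default, pass joblib.load for joblib dumps.
    """
    data = file_like.read()          # a single read() call pulls all bytes
    return loader(io.BytesIO(data))  # in-memory, seekable stream


# Stand-in for a GridFS file so the sketch runs without MongoDB.
fake_gridfs_file = io.BytesIO(pickle.dumps({"vocabulary": ["a", "b"]}))
restored = load_via_buffer(fake_gridfs_file)
print(restored)
```

    With the names from the question, the call would look like load_via_buffer(new_file, loader=joblib.load). Note that this holds the whole serialized model in memory at once, which is fine for small models but may matter for very large ones.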