I can dump sklearn models to GridFS:
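import gridfs
import joblib

fs = gridfs.GridFS(db)  # db is an existing pymongo Database
gridFS_file = fs.new_file()
joblib.dump(vectorizer, gridFS_file)
gridFS_file.close()  # flush remaining data and finalize the file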
This works, and I can see the model stored in MongoDB.
But I can't read it back directly from GridFS:
from bson.objectid import ObjectId
new_file = fs.get(ObjectId("59df36ebe46a520014e0771d"))
vectorizer2 = joblib.load(new_file)
This takes forever and never finishes. However, the following works (and finishes quickly):
with open('vec.pkl', 'wb') as f:
    f.write(new_file.read())
vectorizer3 = joblib.load("vec.pkl")
What am I missing?
A better workaround is to read the whole file into memory first and then wrap the bytes in a stream, as follows:
import io
joblib.load(io.BytesIO(new_file.read()))
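
The same idea can be wrapped in a small helper. This is only a sketch: `load_via_buffer` is a hypothetical name, and `pickle` stands in for `joblib` so that the snippet runs without MongoDB or joblib installed. A plausible explanation for the speedup (not confirmed here) is that the loader performs many small reads on the GridFS file object, whereas an in-memory `BytesIO` serves them cheaply.

```python
import io
import pickle


def load_via_buffer(file_like, loader=pickle.load):
    """Read file_like fully into memory, then deserialize from a BytesIO.

    file_like: any object with a .read() method (e.g. what fs.get() returns).
    loader:    a callable taking a binary stream; pickle.load is a stand-in,
               pass joblib.load to load joblib dumps.
    """
    data = file_like.read()          # one bulk read of the whole file
    return loader(io.BytesIO(data))  # the loader sees an in-memory stream


# Stand-in for the object returned by fs.get(...): anything with .read() works.
fake_gridfs_file = io.BytesIO(pickle.dumps({"vocabulary": ["a", "b"]}))
restored = load_via_buffer(fake_gridfs_file)
```

With the objects from the question, the call would look like `load_via_buffer(new_file, loader=joblib.load)`. Note that this keeps the entire serialized model in memory at once, which is fine for typical vectorizers but may matter for very large models.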