
Algorithmia Model Persistence with Sklearn


I'm pretty new to Algorithmia but I've used scikit-learn a bit and I know how to persist my machine learning model after I've trained it with joblib:

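from sklearn.ensemble import RandomForestRegressor
from sklearn.externals import joblib  # in scikit-learn 0.23+, use "import joblib" instead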

model = RandomForestRegressor()
# Train the model, etc
joblib.dump(model, "prediction/model/model.pkl")

Now I want to host my ML model and call it as a service using Algorithmia, but I can't figure out how to read the model back. I've created a collection in Algorithmia called "testcollection" with a file called "model.pkl" that is the result of the joblib.dump call. According to the docs, this means my file should be located at

data://(username)/testcollection/model.pkl

I want to read in that model from the file using joblib.load. Here's my current algorithm in Algorithmia:

import Algorithmia
from sklearn.externals import joblib

def apply(input):
    client = Algorithmia.client()
    f = client.file("data://(username)/testcollection/model.pkl")
    print(f.path)
    print(f.url)
    print(f.getName())
    model = joblib.load(f.url) # Or f.path, both don't work
    return "empty"

Here's the output:

(username)/testcollection/model.pkl
/v1/data/(username)/testcollection/model.pkl
model.pkl

It fails at the joblib.load line with "No such file or directory", whatever path I pass in.

I've tried several paths and URLs in the joblib.load call, including f.path and f.url above, and none of them work.

How do I load a model in from a file using joblib? Am I going about this the wrong way?


Solution

  • There are a few ways to access data through the Data API.

    Here are four different methods for accessing files via the Python client:

    import Algorithmia
    
    client = Algorithmia.client("<YOUR_API_KEY>")
    
    dataFile = client.file("data://<USER_NAME>/<COLLECTION_NAME>/<FILE_NAME>").getFile()
    
    dataText = client.file("data://<USER_NAME>/<COLLECTION_NAME>/<FILE_NAME>").getString()
    
    dataJSON = client.file("data://<USER_NAME>/<COLLECTION_NAME>/<FILE_NAME>").getJson()
    
    dataBytes = client.file("data://<USER_NAME>/<COLLECTION_NAME>/<FILE_NAME>").getBytes()
    

    Since joblib.load expects a path to the model file, the easiest way to get one is through a file object (dataFile above).

    According to the official Python 2.7 documentation, when a file object is created with open(), its name attribute holds the path of the file.

    In this case, you would need to write something like this:

    import Algorithmia
    from sklearn.externals import joblib
    
    def apply(input):
    
        # You don't need to write your API key if you're editing in the web editor
        client = Algorithmia.client()
    
        modelFile = client.file("data://(username)/testcollection/model.pkl").getFile()
    
        modelFilePath = modelFile.name
    
        model = joblib.load(modelFilePath)
    
        return "empty"
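
To illustrate why a local path works where the data:// URI did not, here is a minimal sketch using only the standard library. It assumes that getFile() hands back a file stored on local disk; tempfile.NamedTemporaryFile is a hypothetical stand-in for that object, and the contents are placeholder bytes rather than a real model.

```python
import os
import tempfile

# Hypothetical stand-in for the object returned by getFile():
# a temporary file on local disk holding placeholder bytes.
with tempfile.NamedTemporaryFile(suffix=".pkl", delete=False) as tmp:
    tmp.write(b"placeholder model bytes")

local_path = tmp.name  # a real filesystem path, the kind joblib.load can open
remote_uri = "data://(username)/testcollection/model.pkl"  # Data API URI, not a local path

local_exists = os.path.exists(local_path)
remote_exists = os.path.exists(remote_uri)
print("local path exists:", local_exists)
print("data:// URI exists locally:", remote_exists)

os.remove(local_path)  # clean up the temporary file
```

The data:// URI is only meaningful to the Algorithmia client, so anything that opens files directly from disk needs the downloaded copy's path instead.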
    

    According to the official scikit-learn model persistence documentation, you can also pass a file-like object instead of a file name.

    So we can skip getting the file name and pass the modelFile object directly:

    import Algorithmia
    from sklearn.externals import joblib
    
    def apply(input):
    
        # You don't need to write your API key if you're editing in the web editor
        client = Algorithmia.client()
    
        modelFile = client.file("data://(username)/testcollection/model.pkl").getFile()
    
        model = joblib.load(modelFile)
    
        return "empty"
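
Because a loader that accepts file-like objects can also read from memory, the getBytes() variant listed earlier could in principle be used without touching the disk. The sketch below shows the idea under stated assumptions: load_from_bytes is a hypothetical helper, the standard library's pickle stands in for joblib so the example runs anywhere, and the dictionary is a placeholder for a trained model.

```python
import io
import pickle


def load_from_bytes(raw_bytes, loader=pickle.load):
    """Wrap raw bytes in a binary file-like object and pass it to a loader.

    `loader` is any callable that reads from a file-like object; pickle.load
    is used here as a stand-in, and joblib.load could be passed instead.
    """
    return loader(io.BytesIO(raw_bytes))


# Placeholder for a trained model; in the hosted algorithm the bytes would
# come from client.file("data://...").getBytes() rather than pickle.dumps().
fake_model = {"kind": "placeholder", "weights": [0.1, 0.2, 0.3]}
raw = pickle.dumps(fake_model)

restored = load_from_bytes(raw)
print(restored == fake_model)
```

In the hosted algorithm this would look roughly like load_from_bytes(client.file("data://(username)/testcollection/model.pkl").getBytes(), loader=joblib.load), assuming the file was written by joblib.dump.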
    

    Edit: There is also an article in the official Algorithmia Developer Center about model persistence in scikit-learn.

    Full disclosure: I work as an Algorithm Engineer at Algorithmia.