python machine-learning module libraries python-packaging

Packaging Libraries with ML models in Python

I have a saved model for sentiment analysis, along with its code and data. I am trying to create a library that exposes the functionality of this code and uses the trained model. I don't understand how to incorporate the model and the functionality that depends on it.

Can anyone guide me on how to do that specifically?

Edit: Using pickle is the method I went with (answered below)

Solution

You need to know about three things if you want to maintain such a library properly:

how to build a package
how to version a package
how to distribute a package

There are a few ways to do that; the most user-friendly at the moment is probably poetry, so I'll use it as an example. It needs to be installed if you want to follow this post as a tutorial.

In order to have some very basic project skeleton to work with, I'll just assume that you have something similar to this:

modelpersister
├───modelpersister
│   ├───model.pkl
│   ├───__init__.py
│   ├───model_definition.py
│   ├───train.py
│   └───analyze.py
└───pyproject.toml

model.pkl: the model artifact that you're going to ship with your package
__init__.py: empty, needs to be there to make this folder a Python package
model_definition.py: contains the class definition and features that define your model
train.py: accepts data to train your model and overwrites the current model.pkl file with the result, something roughly like this:

import pickle
from pathlib import Path

from modelpersister.model_definition import SentimentAnalyzer

# overwrite the current model given some new data
def train(data):
    model = SentimentAnalyzer.train(data)

    # pickle writes bytes, so the file has to be opened in binary write mode
    with open(Path(__file__).parent / "model.pkl", "wb") as model_file:
        pickle.dump(model, model_file)
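
For reference, here is a minimal sketch of what model_definition.py could look like so that the snippets in this answer have something to call. Only the names SentimentAnalyzer, train and estimate come from those snippets. Everything else is an assumption made for illustration: the question doesn't show the real model, so the word-count scoring and the (text, label) input format are hypothetical stand-ins.

```python
from collections import Counter


class SentimentAnalyzer:
    """Toy word-count sentiment model (hypothetical stand-in for the real one)."""

    def __init__(self, word_scores):
        # word -> (occurrences in positive texts) - (occurrences in negative texts)
        self.word_scores = word_scores

    @classmethod
    def train(cls, data):
        """Build a model from an iterable of (text, label) pairs, label "pos" or "neg"."""
        scores = Counter()
        for text, label in data:
            step = 1 if label == "pos" else -1
            for word in text.lower().split():
                scores[word] += step
        return cls(dict(scores))

    def estimate(self, data_point):
        """Return "pos" or "neg" for a single text; unknown words count as 0."""
        total = sum(self.word_scores.get(word, 0) for word in data_point.lower().split())
        return "pos" if total >= 0 else "neg"


# Example usage with made-up training data
model = SentimentAnalyzer.train([("great movie", "pos"), ("awful movie", "neg")])
print(model.estimate("great plot"))  # prints "pos"
```

Whatever the real class looks like, pickle stores a reference to the class rather than its code, so model_definition.py has to ship in the same package as model.pkl, which is the case in the layout above.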

analyze.py: accepts data points to analyze them given the current model.pkl, something roughly like this:

import pickle
import importlib.resources

from modelpersister.model_definition import SentimentAnalyzer

# load the current model as a package resource (small but important detail);
# open_binary returns a binary file object, which is what pickle.load expects
with importlib.resources.open_binary("modelpersister", "model.pkl") as model_file:
    model: SentimentAnalyzer = pickle.load(model_file)

# make meaningful analyses available in this file
def estimate(data_point):
    return model.estimate(data_point)
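
If you want to try the package-resource loading on its own before wiring it into a real project, the following is a small self-contained sketch. It assumes Python 3.9 or newer (stricter than the ^3.8 constraint in the pyproject.toml of this answer), because it uses importlib.resources.files. The package name demo_pkg and the dictionary standing in for a model are made up for the demo.

```python
import importlib
import importlib.resources
import pickle
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk (hypothetical name "demo_pkg").
root = Path(tempfile.mkdtemp())
package_dir = root / "demo_pkg"
package_dir.mkdir()
(package_dir / "__init__.py").write_text("")
with open(package_dir / "model.pkl", "wb") as model_file:
    # A plain dict stands in for the trained model here.
    pickle.dump({"great": 1, "awful": -1}, model_file)

# Make the package importable, as it would be after installation.
sys.path.insert(0, str(root))
importlib.invalidate_caches()

# Locate model.pkl relative to the package instead of the working directory.
resource = importlib.resources.files("demo_pkg").joinpath("model.pkl")
with resource.open("rb") as model_file:
    model = pickle.load(model_file)

print(model)
```

The point of going through importlib.resources is that the lookup is tied to the installed package rather than to wherever the calling script happens to run.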

pyproject.toml: a metadata file that poetry needs in order to package this code, something very similar to this:

[tool.poetry]
name = "modelpersister"
version = "0.1.0"
description = "Ship a sentiment analysis model."
authors = ["Mishaal <my@mail.com>"]
license = "MIT"  # a good default as far as licenses go

[tool.poetry.dependencies]
python = "^3.8"
scikit-learn = "^0.23"  # the PyPI name of sklearn; or whichever ML library you used for your model definition

[tool.poetry.dev-dependencies]

[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"

Once all of these files are filled with meaningful code (and the project hopefully has a better name than modelpersister), your workflow would look roughly like this:

update your features in model_definition.py, train your model with train.py on better data, or add new functions in analyze.py until you feel like your model is now noticeably better than before
run poetry version minor to update the package version
run poetry build to build your code and model into a source distribution and wheel file that you can, if you want, perform some final tests on
run poetry publish to distribute your package. By default it goes to the public Python Package Index (PyPI), but you can also set up a private package index and tell poetry about it, or upload the build manually somewhere else
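
One final test worth doing between poetry build and poetry publish is checking that model.pkl actually ended up in the wheel. A wheel is a zip archive, so the standard library is enough for that. The sketch below assumes the build output is in a dist folder and that the model lives at modelpersister/model.pkl as in the layout above. The function name wheels_missing_model is made up for this example.

```python
import zipfile
from pathlib import Path


def wheels_missing_model(dist_dir="dist", member="modelpersister/model.pkl"):
    """Return the names of the wheels in dist_dir that do not contain member."""
    missing = []
    for wheel_path in sorted(Path(dist_dir).glob("*.whl")):
        # A wheel is a zip archive, so zipfile can list its contents.
        with zipfile.ZipFile(wheel_path) as wheel:
            if member not in wheel.namelist():
                missing.append(wheel_path.name)
    return missing
```

Calling wheels_missing_model() from the project root should return an empty list when every wheel contains the model. Note that it also returns an empty list when the folder contains no wheels at all, so make sure the build ran first.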