Tags: machine-learning, logistic-regression

Is there a standard format for saving trained logistic regression models?


I was asked in an interview the other day what data structure I would use to save a trained logistic regression model, and I was at a loss for words. I asked for clarification, but only got the question restated. I said I would instantiate a model object with sklearn.linear_model.LogisticRegression, fit it with the .fit() method, and then save the fitted model with pickle or joblib. That was probably not the answer they were looking for, but it was the best I could think of in the moment.

I'm familiar with saving PyTorch models as a state_dict, which is basically just a Python dict, but as far as I'm aware, pickle and joblib just serialize the object to a binary format, and I'm not sure that even counts as a data structure. (Just a note: the question was not specific to sklearn or even to Python, but those are the tools I use most, so I defaulted to them.)

After some Googling and digging on SO, I have not been able to find anything that answers my question.

My questions are: (a) what data structure could you use to hold a trained logistic regression model, and (b) is there some widely accepted way of doing this that I'm unaware of? Or is there just a gap in my knowledge here?


Solution

  • There is an industry standard called the Predictive Model Markup Language (PMML).

    This standard gives you two data structures for representing linear models:

    1. RegressionModel - for simpler models
    2. GeneralRegressionModel - for models of arbitrary complexity

    Scikit-Learn models fall into the "simpler models" category. You can convert Scikit-Learn pipelines (that end with a final linear model step) using the sklearn2pmml package.

    The standardized representation of a linear model is considerably more involved than just capturing the "regression table" part. You also need to give a complete and unambiguous description of the model schema (what the model's inputs and outputs are), specify its applicability domain, etc.

    Over the years, various people and projects have dismissed PMML as outdated (mostly because of its XML background) and proceeded to reinvent their own approaches. That hasn't worked out all that well.
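
As a rough illustration of what "regression table plus schema" means, here is a minimal sketch in plain Python. It is not PMML and not sklearn's internal format. The field names (`inputs`, `output`, `intercept`, `coefficients`), the feature names, and the numeric values are all made up for the example. The point is only that a trained binary logistic regression reduces to a named list of inputs, one coefficient per input, and an intercept, which fits in a dict and serializes to JSON.

```python
import json
import math

# Hypothetical, minimal representation of a trained binary logistic regression.
# All field names and values are illustrative; none come from PMML or sklearn.
model = {
    "model_type": "logistic_regression",
    # Schema: which inputs the model expects, in which order.
    "inputs": [
        {"name": "age", "dtype": "double"},
        {"name": "income", "dtype": "double"},
    ],
    # Schema: what the model predicts.
    "output": {"name": "churn", "classes": ["no", "yes"]},
    # The "regression table": one intercept plus one coefficient per input.
    "intercept": -1.5,
    "coefficients": {"age": 0.03, "income": -0.00001},
}


def predict_proba(model, row):
    """Return P(second class) for one row given as a dict of feature values."""
    z = model["intercept"]
    for field in model["inputs"]:
        name = field["name"]
        z += model["coefficients"][name] * row[name]
    # Logistic (sigmoid) function applied to the linear predictor.
    return 1.0 / (1.0 + math.exp(-z))


# Round-trip through JSON: plain text that any language can read back.
serialized = json.dumps(model, indent=2)
restored = json.loads(serialized)
```

Calling `predict_proba(restored, {"age": 50.0, "income": 0.0})` then scores a row using only the restored dict, with no dependency on the library that trained the model. A real interchange format has to pin down everything this sketch leaves implicit, such as the applicability domain mentioned above.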