python pandas seldon

How to convert a Pandas DataFrame into a valid MLServer Predict V2-encoded payload?


I recently found the KServe and MLServer projects, which are open source tools for serving ML models. These are great. What's not so great is that both use a format for inference inputs that is new to me, documented here: https://kserve.github.io/website/modelserving/inference_api/

An input looks like

{
  "id" : "42",
  "inputs" : [
    {
      "name" : "input0",
      "shape" : [ 2, 2 ],
      "datatype" : "UINT32",
      "data" : [ 1, 2, 3, 4 ]
    },
    {
      "name" : "input1",
      "shape" : [ 1 ],
      "datatype" : "BOOL",
      "data" : [ true ]
    }
  ]
}

While I understand this format from the docs, I don't understand how I'm supposed to easily convert a Pandas DataFrame into this format. I've looked online for "Dataframe to MLserve V2 format converter" but I can't find anything.

Does anyone know how I would go about making this conversion? Surely I wouldn't have to write my own... right?


Solution

  • The V2 Inference Protocol can be thought of as a lower-level spec. It doesn't try to define how to encode higher-level data types (e.g. a Pandas DataFrame) and leaves this to the inference servers themselves.

    Based on this, MLServer introduces its own conventions which, if followed, ensure that the payload gets converted into a higher-level Python data type. These are covered in the Content Types section of the docs.

    In particular, for Pandas DataFrames, the simplest way is to use the "codecs", which were introduced in MLServer 1.1.0. These include a set of helpers which let you do something like:

    import pandas as pd
    
    from mlserver.codecs import PandasCodec
    
    foo = pd.DataFrame({
      "A": ["a1", "a2", "a3", "a4"],
      "B": ["b1", "b2", "b3", "b4"],
      "C": ["c1", "c2", "c3", "c4"]
    })
    
    # Encode the DataFrame as a V2 inference request, one input per column.
    v2_request = PandasCodec.encode_request(foo)
    

    Alternatively, you can craft your own payload following the rules outlined in the docs (e.g. each column goes into a separate input).
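If you do go the hand-written route, a minimal sketch could look like the following. It is not MLServer's own implementation: `infer_datatype` and `columns_to_v2_payload` are hypothetical helper names, the datatype mapping is deliberately simplified, and both the `[N, 1]` shape and the `"content_type": "pd"` parameter are assumptions about MLServer's Pandas convention that you should check against the Content Types docs for your version.

```python
import json
from typing import Any, Dict, List, Mapping, Sequence


def infer_datatype(values: Sequence[Any]) -> str:
    """Pick a V2 datatype name for a column of plain Python values.

    Simplified, hypothetical mapping: bool -> BOOL, int -> INT64,
    int/float mix -> FP64, anything else (e.g. str) -> BYTES.
    """
    # bool is checked first because bool is a subclass of int in Python.
    if all(isinstance(v, bool) for v in values):
        return "BOOL"
    if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return "INT64"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "FP64"
    return "BYTES"


def columns_to_v2_payload(columns: Mapping[str, Sequence[Any]]) -> Dict[str, Any]:
    """Build a V2-style request body with one input per column."""
    inputs: List[Dict[str, Any]] = []
    for name, values in columns.items():
        values = list(values)
        inputs.append({
            "name": str(name),
            # Assumption: each column is sent as an N x 1 tensor.
            "shape": [len(values), 1],
            "datatype": infer_datatype(values),
            "data": values,
        })
    # Assumption: "pd" is the content type that asks MLServer to decode
    # the inputs back into a Pandas DataFrame.
    return {"parameters": {"content_type": "pd"}, "inputs": inputs}


payload = columns_to_v2_payload({
    "A": ["a1", "a2"],
    "B": [1, 2],
    "C": [0.5, 1.5],
})
print(json.dumps(payload, indent=2))
```

With a real DataFrame such as `foo` above, the column mapping can be obtained with `foo.to_dict(orient="list")`. The helper takes a plain mapping rather than a DataFrame so that the sketch itself has no dependencies beyond the standard library.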